
Unify OpenSearch startup connection gate: phase-aware retry/fallback across both startup paths #36244

Description

@fabrizzio-dotCMS

dotCMS has inconsistent OpenSearch (OS) connection gating at startup, and the OS readiness gate does not respect the migration phase. Three concrete gaps were confirmed against the current code (see file:line below). Together they cause an opaque, late-failing crash when OS is unreachable/misconfigured during startup, instead of a fast, actionable failure with phase-correct behavior (fallback during migration, abort once OS is primary).

This follows up on a finding made on 2026-06-12 while smoke-testing Phase 3 against the os-migration limited-user stack. The real failure was a TLS scheme mismatch (http:// against an OS 3.x port that only accepts https). It surfaced as a ConnectionClosedException deep inside createContentIndex about 30s into startup (SystemExitManager - Startup failure), with no actionable message.

Current behavior (confirmed)

| Gate | Where | Today |
| --- | --- | --- |
| ES readiness | ESIndexAPI.waitUtilIndexReady() (ESIndexAPI.java:1150) | Retries ES_CONNECTION_ATTEMPTS (default 24, 5s sleep); on exhaustion → SystemExitManager.immediateExit(1). Always aborts. |
| OS readiness | OSIndexAPIImpl.waitUtilIndexReady() (OSIndexAPIImpl.java:512) | Mirrors ES retry (OS_CONNECTION_ATTEMPTS / ES_CONNECTION_ATTEMPTS, configurable sleep); on exhaustion → immediateExit(1). Always aborts; not phase-aware. |
| Semantic validator | IndexStartupValidator.validateIndexingConfig() via ContentletIndexAPIImpl.checkAndInitializeIndex() (ContentletIndexAPIImpl.java:524-541) | Single probe, no retry. Phase-aware: Phase 3 → DotRuntimeException; Phase 0–2 → haltMigration() (fallback). |

Startup ordering problem (empty-DB path)

By servlet load-on-startup order:

  1. MainServlet (=1) calls APILocator.getESIndexAPI().waitUtilIndexReady() (MainServlet.java:113) — waits on ES only, hardcoded, even in Phase 3 where OS is primary — then runs the startup tasks.
  2. Inside the tasks: Task00004LoadStarter → DotCMSInitDb.loadStarterSite → refreshAllContent → ContentletIndexAPIImpl.fullReindexStart → initIndex → bootstrapAndPoint → bootstrapAndPointOS → createContentIndex(IndexTag.OS). OS indices are created here, before any OS connection gate.
  3. InitServlet (=8) runs checkAndInitializeIndex() → the phase-aware validator — too late: indices already exist, so if (!indexReady()) is false and the gate is effectively skipped.

Scope note: the empty-DB bypass only bites when DB is empty AND migration is started (Phase ≥ 1), because initIndex() computes osNeeded = isMigrationStarted() && !indexReadyOS() (ContentletIndexAPIImpl.java:688-689). In Phase 0 no OS indices are created, so there is no gap. This is exactly the os-migration fresh-install scenario.
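
The scope condition can be illustrated with a small stand-alone sketch. StartupScope and its boolean parameters are hypothetical stand-ins for isMigrationStarted(), indexReadyOS() and indexReady(); this is not the dotCMS code, only the two boolean checks described above.

```java
// Hypothetical sketch of the two checks described in the scope note.
// None of these names are dotCMS classes.
final class StartupScope {

    /** Mirrors: osNeeded = isMigrationStarted() && !indexReadyOS(). */
    static boolean osNeeded(boolean migrationStarted, boolean osIndexReady) {
        return migrationStarted && !osIndexReady;
    }

    /** The late gate only runs when the index is not ready yet. */
    static boolean lateGateRuns(boolean indexReady) {
        return !indexReady;
    }

    public static void main(String[] args) {
        // Phase 0, empty DB: no OS indices are created, so there is no gap.
        System.out.println(osNeeded(false, false)); // false
        // Phase >= 1, empty DB: OS indices are created during the startup tasks.
        System.out.println(osNeeded(true, false));  // true
        // By the time the late gate is reached the indices exist, so it is skipped.
        System.out.println(lateGateRuns(true));     // false
    }
}
```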

Target behavior

| Phase | OS unreachable after N retries | Behavior |
| --- | --- | --- |
| Phase 1 / 2 (shadow) | yes | Fallback to ES (haltMigration() / phase reset) |
| Phase 3 (OS primary) | yes | Abort, same as ES does today; no fallback |

Consistent with [feedback] "OS write failures are fire-and-forget in Phase 1/2 but must propagate in Phase 3 (OS is primary)".
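
As a rough illustration, the target behavior reduces to a mapping from phase to outcome. The sketch below uses a local Phase enum as a stand-in for IndexConfigHelper.MigrationPhase. OsGateOutcome, Outcome and the NOT_APPLICABLE value for Phase 0 are assumptions made for the example, since this issue does not define an outcome for Phase 0.

```java
// Hypothetical sketch of the phase-to-outcome mapping once OS retries are exhausted.
final class OsGateOutcome {

    // Stand-in for IndexConfigHelper.MigrationPhase (PHASE_0 ... PHASE_3).
    enum Phase { PHASE_0, PHASE_1, PHASE_2, PHASE_3 }

    enum Outcome { NOT_APPLICABLE, FALLBACK_TO_ES, ABORT }

    static Outcome onRetriesExhausted(Phase phase) {
        switch (phase) {
            case PHASE_1:
            case PHASE_2:
                // Shadow phases: keep the server up and fall back to ES.
                return Outcome.FALLBACK_TO_ES;
            case PHASE_3:
                // OS is primary: abort, as the ES gate does today.
                return Outcome.ABORT;
            default:
                // Assumption: in Phase 0 the OS gate is not consulted at all.
                return Outcome.NOT_APPLICABLE;
        }
    }

    public static void main(String[] args) {
        for (Phase phase : Phase.values()) {
            System.out.println(phase + " -> " + onRetriesExhausted(phase));
        }
    }
}
```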

Acceptance Criteria

A — Startup readiness gate waits on the primary store, not ES-hardcoded

  • MainServlet startup readiness wait routes through the phase-aware router (getIndexAPI().waitUtilIndexReady(), IndexAPIImpl.java:193) instead of getESIndexAPI() directly, so the gate waits on the primary store for the current phase (ES in Phase 0–1, OS in Phase 3).
  • In Phase 0 / Phase 1, ES remains the store waited on (no behavior change for non-migrating installs).

B — OS waitUtilIndexReady() outcome is phase-aware

  • In Phase 3, exhausting OS_CONNECTION_ATTEMPTS retries aborts startup (current immediateExit behavior preserved) with an actionable fatal message.
  • In Phase 1 / 2, exhausting retries does not kill the server: it falls back to ES via haltMigration() (phase reset), logging a clear WARN/ERROR explaining the fallback.
  • Retry count and sleep remain configurable (OS_CONNECTION_ATTEMPTS w/ ES_CONNECTION_ATTEMPTS fallback; OS_CONNECTION_RETRY_SLEEP_SECONDS).
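
One possible shape for the retry part of criterion B, as a minimal sketch. RetryGate, the BooleanSupplier probe and the plain Map used as configuration are hypothetical and do not reflect the dotCMS config API. The default of 24 is the ES default quoted earlier; reusing it when neither key is set is an assumption.

```java
import java.util.Map;
import java.util.function.BooleanSupplier;

// Hypothetical sketch: bounded retry with a configurable attempt count and sleep.
final class RetryGate {

    static final int DEFAULT_ATTEMPTS = 24; // ES default; assumed here for OS too

    /** Reads OS_CONNECTION_ATTEMPTS, falling back to ES_CONNECTION_ATTEMPTS. */
    static int resolveAttempts(Map<String, String> config) {
        String value = config.get("OS_CONNECTION_ATTEMPTS");
        if (value == null) {
            value = config.get("ES_CONNECTION_ATTEMPTS");
        }
        return value == null ? DEFAULT_ATTEMPTS : Integer.parseInt(value.trim());
    }

    /** Probes up to maxAttempts times; returns true as soon as one probe succeeds. */
    static boolean waitUntilReady(BooleanSupplier probe, int maxAttempts, long sleepMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (probe.getAsBoolean()) {
                return true;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(sleepMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // keep the interrupt flag set
                    return false;
                }
            }
        }
        return false; // exhausted: the caller applies the phase-aware outcome
    }

    public static void main(String[] args) {
        int[] calls = {0};
        boolean ready = waitUntilReady(() -> ++calls[0] >= 3, 5, 10L);
        System.out.println(ready + " after " + calls[0] + " probes"); // true after 3 probes
    }
}
```

Returning a boolean rather than exiting inside the loop leaves the abort-or-fallback decision to the caller, which is where the phase is known.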

C — Connection gate runs before OS index creation on the empty-DB path (unification, minimal)

  • Before bootstrapAndPointOS creates OS indices (ContentletIndexAPIImpl.initIndex/bootstrapAndPoint, ~682-734), the primary-store connection gate (retry-N, phase-aware outcome from B) runs first.
  • On empty-DB + Phase ≥ 1, an unreachable/misconfigured OS fails fast with an actionable message (config + phase + endpoint), rather than throwing ConnectionClosedException deep inside createContentIndex.
  • Both startup paths (populated-DB via InitServlet and empty-DB via Task00004LoadStarter) pass through the same connection gate before creating/using OS indices — the gate no longer lives only in InitServlet (which runs too late).
  • Minimal scope: C reuses the retry + phase-aware outcome from B. It does not move the full semantic validateIndexingConfig() (OS version + endpoint-separation) into the bootstrap hot path — that stays in checkAndInitializeIndex().
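
The ordering that criterion C asks for can be sketched as follows. BootstrapOrder, its parameters and the step labels are hypothetical. The connectionGate supplier stands in for the retry gate from B, and "fail-fast" stands in for its phase-aware outcome.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

// Hypothetical sketch: the connection gate runs before any OS index is created.
final class BootstrapOrder {

    /** Returns the steps taken, in order, so the ordering can be inspected. */
    static List<String> initIndex(boolean osNeeded, BooleanSupplier connectionGate) {
        List<String> steps = new ArrayList<>();
        if (!osNeeded) {
            steps.add("skip-os"); // Phase 0 or OS indices already present
            return steps;
        }
        steps.add("gate");
        if (connectionGate.getAsBoolean()) {
            steps.add("create-os-indices");
        } else {
            steps.add("fail-fast"); // abort or fall back, depending on the phase
        }
        return steps;
    }

    public static void main(String[] args) {
        System.out.println(initIndex(true, () -> true));  // [gate, create-os-indices]
        System.out.println(initIndex(true, () -> false)); // [gate, fail-fast]
    }
}
```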

General

  • No regression to non-migrating (Phase 0) startup on either populated or empty DB.
  • New/changed phase-branching behavior covered by tests (unit/IT as appropriate).

Out of scope (separate follow-up)

  • Error classification hardening of the connection catch: distinguish TLS-scheme mismatch (http vs https) vs HTTP 403 (auth) vs connection-refused (unreachable), so the fatal/fallback message names the actual cause. This is the diagnostic improvement that motivated the finding but is orthogonal to the gating/fallback rework here — to be filed as a child issue.
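
For the child issue, a classification step might look roughly like the sketch below. ConnectionFailureClassifier and its Cause enum are hypothetical, and the mapping is an assumption. In particular, treating a ConnectionClosedException as a likely scheme mismatch rests only on the single failure observed in this issue, and matching it by simple class name is a shortcut to keep the example free of HTTP client dependencies.

```java
import java.net.ConnectException;
import javax.net.ssl.SSLException;

// Hypothetical sketch: name the likely cause of a failed OS connection attempt.
final class ConnectionFailureClassifier {

    enum Cause { TLS_OR_SCHEME_MISMATCH, AUTH_FORBIDDEN, UNREACHABLE, UNKNOWN }

    /** httpStatus is the response status if one was received, or -1 otherwise. */
    static Cause classify(Throwable error, int httpStatus) {
        if (httpStatus == 403) {
            return Cause.AUTH_FORBIDDEN;
        }
        // Walk the cause chain, with a depth limit as a guard against cycles.
        Throwable current = error;
        for (int depth = 0; current != null && depth < 16; depth++) {
            if (current instanceof ConnectException) {
                return Cause.UNREACHABLE;
            }
            if (current instanceof SSLException) {
                return Cause.TLS_OR_SCHEME_MISMATCH;
            }
            // Assumption: this is how http-against-https surfaced in the smoke test.
            if ("ConnectionClosedException".equals(current.getClass().getSimpleName())) {
                return Cause.TLS_OR_SCHEME_MISMATCH;
            }
            current = current.getCause();
        }
        return Cause.UNKNOWN;
    }

    public static void main(String[] args) {
        Throwable refused = new RuntimeException(new ConnectException("Connection refused"));
        System.out.println(classify(refused, -1)); // UNREACHABLE
    }
}
```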

Priority

Medium — reliability gap; bites in OS-migration startup scenarios (empty DB + Phase ≥ 1, or any OS misconfiguration in Phase 3). Not a default-install (Phase 0) regression.

Additional Context

  • Originated from the 2026-06-12 Phase 3 smoke against the os-migration limited-user stack.
  • Relevant phase semantics: IndexConfigHelper.MigrationPhase (PHASE_0…PHASE_3), isMigrationStarted(), isMigrationComplete(), haltMigration().
  • Files: MainServlet.java:113, IndexAPIImpl.java:193, OSIndexAPIImpl.java:512, ESIndexAPI.java:1150, ContentletIndexAPIImpl.java:522-558,682-734,874-891, IndexStartupValidator.java, IndexConfigHelper.java.

Status
In Progress
