Description
dotCMS has inconsistent OpenSearch (OS) connection gating at startup, and the OS readiness gate does not respect the migration phase. Three concrete gaps were confirmed against the current code (see file:line below). Together they cause an opaque, late-failing crash when OS is unreachable/misconfigured during startup, instead of a fast, actionable failure with phase-correct behavior (fallback during migration, abort once OS is primary).
This is the follow-up to a finding made on 2026-06-12 while smoke-testing Phase 3 against the os-migration limited-user stack: the real failure was a TLS scheme mismatch (http:// against an OS 3.x port that only accepts https), surfacing as ConnectionClosedException deep inside createContentIndex ~30s into startup (SystemExitManager - Startup failure), with no actionable message.
Current behavior (confirmed)
| Gate | Where | Today |
|---|---|---|
| ES readiness | ESIndexAPI.waitUtilIndexReady() (ESIndexAPI.java:1150) | Retries ES_CONNECTION_ATTEMPTS (default 24, 5s sleep); on exhaustion → SystemExitManager.immediateExit(1). Always aborts. |
| OS readiness | OSIndexAPIImpl.waitUtilIndexReady() (OSIndexAPIImpl.java:512) | Mirrors ES retry (OS_CONNECTION_ATTEMPTS / ES_CONNECTION_ATTEMPTS, configurable sleep); on exhaustion → immediateExit(1). Always aborts, not phase-aware. |
| Semantic validator | IndexStartupValidator.validateIndexingConfig() via ContentletIndexAPIImpl.checkAndInitializeIndex() (ContentletIndexAPIImpl.java:524-541) | Single probe, no retry. Phase-aware: Phase 3 → DotRuntimeException; Phase 0–2 → haltMigration() (fallback). |
Startup ordering problem (empty-DB path)
By servlet load-on-startup order:
- MainServlet (=1) calls APILocator.getESIndexAPI().waitUtilIndexReady() (MainServlet.java:113), which waits on ES only, hardcoded, even in Phase 3 where OS is primary. It then runs the startup tasks.
  - Inside the tasks: Task00004LoadStarter → DotCMSInitDb.loadStarterSite → refreshAllContent → ContentletIndexAPIImpl.fullReindexStart → initIndex → bootstrapAndPoint → bootstrapAndPointOS → createContentIndex(IndexTag.OS). OS indices are created here, before any OS connection gate.
- InitServlet (=8) runs checkAndInitializeIndex() → the phase-aware validator. This is too late: the indices already exist, so if (!indexReady()) is false and the gate is effectively skipped.
Scope note: the empty-DB bypass only bites when DB is empty AND migration is started (Phase ≥ 1), because initIndex() computes osNeeded = isMigrationStarted() && !indexReadyOS() (ContentletIndexAPIImpl.java:688-689). In Phase 0 no OS indices are created, so there is no gap. This is exactly the os-migration fresh-install scenario.
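As a rough illustration of the ordering that criterion C asks for, the sketch below models only the OS side of initIndex(). It is a hypothetical model, not the actual ContentletIndexAPIImpl code: the class name, the initIndexOs method, and the step labels are invented for illustration, and connectionGate stands in for whatever retrying readiness check ends up being used. Only the osNeeded condition is taken from the issue text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

// Hypothetical model of the bootstrap path; names are illustrative, not dotCMS APIs.
final class BootstrapOrderSketch {

    /**
     * Returns the ordered steps taken for the OS side of index initialization.
     * connectionGate is an assumed stand-in for the retrying readiness check.
     */
    static List<String> initIndexOs(boolean migrationStarted, boolean osIndexReady,
                                    BooleanSupplier connectionGate) {
        List<String> steps = new ArrayList<>();
        // Same condition the issue quotes from initIndex().
        boolean osNeeded = migrationStarted && !osIndexReady;
        if (!osNeeded) {
            return steps; // Phase 0, or OS indices already exist: nothing to create
        }
        steps.add("connection-gate"); // the gate runs before any index creation
        if (!connectionGate.getAsBoolean()) {
            steps.add("gate-failed"); // caller would apply the phase-specific outcome
            return steps;
        }
        steps.add("createContentIndex(OS)");
        return steps;
    }

    public static void main(String[] args) {
        System.out.println(initIndexOs(true, false, () -> true));   // [connection-gate, createContentIndex(OS)]
        System.out.println(initIndexOs(true, false, () -> false));  // [connection-gate, gate-failed]
        System.out.println(initIndexOs(false, false, () -> true));  // []
    }
}
```

The point of the model is only that index creation is unreachable unless the gate has passed, regardless of which servlet or startup task triggers the bootstrap.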
Target behavior
| Phase | OS unreachable after N retries | Behavior |
|---|---|---|
| Phase 1 / 2 (shadow) | yes | Fallback to ES (haltMigration() / phase reset) |
| Phase 3 (OS primary) | yes | Abort, same as ES does today; no fallback |
Consistent with [feedback] "OS write failures are fire-and-forget in Phase 1/2 but must propagate in Phase 3 (OS is primary)".
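The target behavior can be sketched as a retry loop whose exhaustion outcome depends on the migration phase. This is a minimal sketch under stated assumptions, not the dotCMS implementation: the class, the Phase and Outcome enums, and the waitUntilReady signature are hypothetical, and the probe is an assumed stand-in for the real OS connection check. In the real code the FALLBACK_TO_ES outcome would presumably map to haltMigration() and ABORT to SystemExitManager.immediateExit(1), as described in this issue.

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch; names below are illustrative and are not dotCMS APIs.
final class OsReadinessGateSketch {

    enum Phase { PHASE_0, PHASE_1, PHASE_2, PHASE_3 }

    enum Outcome { READY, FALLBACK_TO_ES, ABORT }

    /**
     * Probes the OS connection up to maxAttempts times, sleeping between attempts.
     * If every attempt fails, Phase 3 aborts and earlier phases fall back to ES.
     */
    static Outcome waitUntilReady(Phase phase, int maxAttempts, long sleepMillis,
                                  BooleanSupplier probe) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (probe.getAsBoolean()) {
                return Outcome.READY;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(sleepMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt flag
                    break;                              // treat as exhausted
                }
            }
        }
        // Exhausted: OS is primary in Phase 3, so there is nothing to fall back to.
        return phase == Phase.PHASE_3 ? Outcome.ABORT : Outcome.FALLBACK_TO_ES;
    }

    public static void main(String[] args) {
        BooleanSupplier unreachable = () -> false;
        System.out.println(waitUntilReady(Phase.PHASE_2, 3, 10, unreachable)); // FALLBACK_TO_ES
        System.out.println(waitUntilReady(Phase.PHASE_3, 3, 10, unreachable)); // ABORT
        System.out.println(waitUntilReady(Phase.PHASE_3, 3, 10, () -> true));  // READY
    }
}
```

Returning an outcome instead of exiting inside the loop keeps the retry logic testable; the caller decides how to log and act on each outcome. In Phase 0 the OS gate would not normally be invoked at all, so the fallback result for that phase is only a harmless default in this sketch.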
Acceptance Criteria
A: Startup readiness gate waits on the primary store, not ES-hardcoded
- MainServlet startup readiness wait routes through the phase-aware router (getIndexAPI().waitUtilIndexReady(), IndexAPIImpl.java:193) instead of getESIndexAPI() directly, so the gate waits on the primary store for the current phase (ES in Phase 0–1, OS in Phase 3).
B: OS waitUtilIndexReady() outcome is phase-aware
- In Phase 3, OS unreachable after OS_CONNECTION_ATTEMPTS retries aborts startup (current immediateExit behavior preserved) with an actionable fatal message.
- In Phase 1/2, OS unreachable after the retries falls back to ES via haltMigration() (phase reset), logging a clear WARN/ERROR explaining the fallback.
- Retry configuration: OS_CONNECTION_ATTEMPTS w/ ES_CONNECTION_ATTEMPTS fallback; OS_CONNECTION_RETRY_SLEEP_SECONDS.
C: Connection gate runs before OS index creation on the empty-DB path (unification, minimal)
- Before bootstrapAndPointOS creates OS indices (ContentletIndexAPIImpl.initIndex / bootstrapAndPoint, ~682-734), the primary-store connection gate (retry-N, phase-aware outcome from B) runs first.
- An unreachable or misconfigured OS fails at the gate with an actionable message, not as a ConnectionClosedException deep inside createContentIndex.
- Both startup paths (existing DB via InitServlet and empty-DB via Task00004LoadStarter) pass through the same connection gate before creating/using OS indices; the gate no longer lives only in InitServlet (which runs too late).
- Do not move validateIndexingConfig() (OS version + endpoint-separation) into the bootstrap hot path; that stays in checkAndInitializeIndex().
General
Out of scope (separate follow-up)
- Error classification hardening of the connection catch: distinguish TLS-scheme mismatch (http vs https) vs HTTP 403 (auth) vs connection-refused (unreachable), so the fatal/fallback message names the actual cause. This is the diagnostic improvement that motivated the finding, but it is orthogonal to the gating/fallback rework here and will be filed as a child issue.
Priority
Medium: reliability gap; bites in OS-migration startup scenarios (empty DB + Phase ≥ 1, or any OS misconfiguration in Phase 3). Not a default-install (Phase 0) regression.
Additional Context
- Originated from the 2026-06-12 Phase 3 smoke test against the os-migration limited-user stack.
- Relevant phase semantics: IndexConfigHelper.MigrationPhase (PHASE_0…PHASE_3), isMigrationStarted(), isMigrationComplete(), haltMigration().
- Files: MainServlet.java:113, IndexAPIImpl.java:193, OSIndexAPIImpl.java:512, ESIndexAPI.java:1150, ContentletIndexAPIImpl.java:522-558,682-734,874-891, IndexStartupValidator.java, IndexConfigHelper.java.