fix(opensearch-migration): unify phase-aware OS startup connection gate (#36244) #36248
fabrizzio-dotCMS wants to merge 3 commits into
…te (#36244)

dotCMS had inconsistent OpenSearch connection gating at startup. The OS readiness gate was not phase-aware and the empty-DB bootstrap path created OS indices with no connection gate at all, so an unreachable/misconfigured OS surfaced as an opaque ConnectionClosedException deep inside createContentIndex ~30s into startup instead of a fast, actionable failure.

Root cause found while implementing: OSIndexAPIImpl.getClusterStats() swallows all exceptions and returns a non-null empty result, so the retry loop in waitUtilIndexReady() could never observe a failure. The gate was dead code and always passed.

Changes (minimal scope):
- B: OSIndexAPIImpl.waitUtilIndexReady() now probes with client.info() (which propagates transport/TLS/auth failures) and its exhaustion outcome is phase-aware: Phase 3 aborts via SystemExitManager.immediateExit with an actionable message (phase + endpoints + cause); Phase 1/2 halts the migration (haltMigration → ES-only fallback) and returns false instead of killing the server. Retry count/sleep remain configurable.
- C: ContentletIndexAPIImpl.bootstrapAndPointOS() runs the phase-aware OS gate before creating OS indices. As the single chokepoint for OS index creation, both startup paths (populated-DB via InitServlet and empty-DB via Task00004LoadStarter) now pass through the same gate; a shadow-phase fallback skips OS creation.
- A: MainServlet startup readiness wait clarified: getESIndexAPI() already returns the phase-aware router, so it waits on the primary store for the phase; on a shadow-phase fallback to ES it waits again to gate the new primary.

Adds OSIndexAPIImplWaitReadyTest covering the Phase 1/2 ES-only fallback (no exit, phase reset to 0). Phase 3 abort is verified by IT/manual QA since SystemExitManager halts the JVM.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
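The root cause described in this commit message can be illustrated with a minimal sketch. It is not the dotCMS source: `ProbeGateSketch`, `Probe`, `swallowing`, and `waitUntilReady` are hypothetical names, and the retry loop omits the sleep and logging a real gate would have. It only shows why a probe that swallows exceptions and returns a non-null result makes the retry loop pass unconditionally, while a probe that propagates failures lets the loop observe them.

```java
// Hypothetical names throughout; an illustration, not the dotCMS implementation.
final class ProbeGateSketch {

    /** A connectivity probe that may throw when the cluster is unreachable. */
    interface Probe {
        Object check() throws Exception;
    }

    /** Mimics a swallowing probe: never throws and never returns null. */
    static Probe swallowing(Probe delegate) {
        return () -> {
            try {
                return delegate.check();
            } catch (Exception e) {
                return new Object(); // the failure is hidden behind a non-null "empty" result
            }
        };
    }

    /** Returns true on the first attempt that yields a non-null result without throwing. */
    static boolean waitUntilReady(Probe probe, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                if (probe.check() != null) {
                    return true;
                }
            } catch (Exception e) {
                // A real gate would log the cause and sleep before the next attempt.
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Probe unreachable = () -> { throw new java.io.IOException("connection refused"); };
        System.out.println(waitUntilReady(swallowing(unreachable), 3)); // true: the gate always passes
        System.out.println(waitUntilReady(unreachable, 3));             // false: the failure is observed
    }
}
```

Under these assumptions, swapping the probe is enough to make the existing retry loop meaningful, which matches the commit's stated reason for moving to a probe that propagates transport, TLS, and auth failures.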
Claude finished @fabrizzio-dotCMS's task in 1m 38s

Rollback Safety Analysis
Result: ✅ Safe to Roll Back

The label

Analysis

The PR touches 5 files, all pure Java logic and test changes: no database migrations, no Elasticsearch mapping changes, no data transformations. Each category was checked:
What Changed

All changes are behavioral only within the startup connection-gate logic:
None of these changes write to the database, modify ES/OS index mappings, or alter any stored data format. Rolling back to N-1 restores the previous startup behavior with no residual side effects.
…uite

Move OSIndexAPIImplWaitReadyTest (a misplaced unit test under dotCMS/src/test) into the integration module as OSIndexAPIImplWaitReadyIT and register it in OpenSearchUpgradeSuite so CI actually runs it under the opensearch-upgrade profile. Tests for the OS-touching paths belong in dotcms-integration with the IT suffix, not as plain unit tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
✅ Phase 0 ↔ Phase 3 boot congruence (safety note)

Verified that the Phase 3 fail-loud behavior this PR introduces is congruent with the long-standing Phase 0 behavior. It is not a new startup contract; it gives "OS as primary" the same treatment "ES as primary" has always had.

Same code path
Same outcome on retry exhaustion
// Phase 0 — ESIndexAPI#waitUtilIndexReady (pre-existing, untouched)
if (stats == null) {
Logger.fatal(this.getClass(), "No Elasticsearch, dying an ugly death");
SystemExitManager.immediateExit(1, "Elasticsearch connection failed after maximum attempts");
}
// Phase 3 — OSIndexAPIImpl#handleConnectionExhausted (this PR)
if (phase.isMigrationComplete()) {
Logger.fatal(this.getClass(), detail + " — OS is the primary store ... cannot fall back to ES. ...");
SystemExitManager.immediateExit(1, "OpenSearch connection failed in PHASE_3_OPENSEARCH_ONLY");
}

Why this matters for safety
The one (correct) asymmetry

OS additionally has the Phase 1/2 shadow fallback.

Verdict: Phase 3 inherits the proven Phase 0 boot contract rather than inventing a new one. The OS side additionally improves on it (actionable message + fixes the swallowing-probe bug).
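The asymmetry between the Phase 3 abort and the Phase 1/2 fallback can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: `ExhaustionSketch`, the simplified `Phase` enum, `ExitHook`, and `Gate.onExhausted` are hypothetical names, and `ExitHook` stands in for the process-exit call so the abort branch can be observed without halting the JVM.

```java
// Hypothetical names throughout; an illustration, not the dotCMS implementation.
final class ExhaustionSketch {

    /** Assumed phase model: 0 = ES only, 1 and 2 = shadow phases, 3 = OS only. */
    enum Phase {
        PHASE_0, PHASE_1, PHASE_2, PHASE_3;

        boolean isMigrationComplete() {
            return this == PHASE_3;
        }
    }

    /** Stand-in for the process-exit hook, injected so the sketch stays testable. */
    interface ExitHook {
        void exit(int code, String reason);
    }

    static final class Gate {
        private Phase phase;
        private final ExitHook exitHook;

        Gate(Phase phase, ExitHook exitHook) {
            this.phase = phase;
            this.exitHook = exitHook;
        }

        Phase phase() {
            return phase;
        }

        /** Called once retries are exhausted; always returns false (OS is not ready). */
        boolean onExhausted(String endpoints, Exception cause) {
            String detail = "OpenSearch unreachable in " + phase
                    + " at " + endpoints + ": " + cause.getMessage();
            if (phase.isMigrationComplete()) {
                // OS is the primary store, so there is nothing to fall back to.
                exitHook.exit(1, detail);
                return false;
            }
            // Shadow phase: keep the server alive and continue on ES only.
            System.err.println(detail + "; halting migration, falling back to ES only");
            phase = Phase.PHASE_0;
            return false;
        }
    }
}
```

Injecting the exit hook is a design choice of this sketch only; it is one way to exercise an abort branch in a test when the real exit call would terminate the JVM.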
Problem
dotCMS had inconsistent OpenSearch connection gating at startup (#36244). The OS readiness gate was not phase-aware, and the empty-DB bootstrap path created OS indices with no connection gate at all, so an unreachable/misconfigured OS surfaced as an opaque ConnectionClosedException deep inside createContentIndex ~30s into startup (SystemExitManager - Startup failure) instead of a fast, actionable failure.

Root cause found while implementing
OSIndexAPIImpl.getClusterStats() swallows every exception and returns a non-null empty result, so the retry loop in waitUtilIndexReady() could never observe a failure; the OS connection gate was effectively dead code and always passed.

Changes (minimal scope)
B: OSIndexAPIImpl.waitUtilIndexReady() is phase-aware

- Probes with client.info() (which propagates transport/TLS/auth failures), replacing the swallowing getClusterStats() probe.
- Phase 3: aborts via SystemExitManager.immediateExit(1) with an actionable FATAL message (phase + endpoints + cause). No fallback.
- Phase 1/2: haltMigration() (reset to Phase 0, ES-only fallback) + ERROR log; returns false instead of killing the server.
- Retry count/sleep remain configurable (OS_CONNECTION_ATTEMPTS w/ ES_CONNECTION_ATTEMPTS fallback, OS_CONNECTION_RETRY_SLEEP_SECONDS).

C: Connection gate runs before OS index creation on both startup paths
ContentletIndexAPIImpl.bootstrapAndPointOS() runs the phase-aware OS gate (operationsOS.indexAPI().waitUtilIndexReady()) before creating OS indices. As the single chokepoint for OS index creation, both startup paths, populated-DB (InitServlet) and empty-DB (Task00004LoadStarter), now pass through the same gate. A shadow-phase fallback skips OS creation.

A: MainServlet waits on the primary store, not ES-hardcoded

getESIndexAPI() already returns the phase-aware router (IndexAPIImpl), so the startup wait routes to the primary store for the phase (ES in 0–1, OS in 2–3). Comment clarified; on a shadow-phase fallback to ES the wait runs again so the new primary (ES) is gated too. No Phase 0 behavior change.

Acceptance criteria
validateIndexingConfig() into the hot path)

Testing
- ./mvnw compile -pl :dotcms-core ✅ (Java 25)
- OSIndexAPIImplWaitReadyTest (Phase 1/2 ES-only fallback) + PhaseRouterTest + ContentletIndexAPIImplPhaseTest.
- Phase 3 abort is verified by IT/manual QA: SystemExitManager halts the JVM, so the abort branch is not safely unit-testable.

Out of scope (separate follow-up)
catch (TLS-scheme mismatch vs HTTP 403 vs connection-refused), to be filed as a child issue.

🤖 Generated with Claude Code
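The "single chokepoint" idea from change C can be sketched in a few lines. The names here (`BootstrapGateSketch`, `ReadinessGate`, `IndexCreator`, `bootstrapAndPoint`) are hypothetical and the real method does considerably more; the sketch only assumes that every startup path funnels through one method that runs the gate first and skips OS index creation when the gate reports a fallback.

```java
// Hypothetical names throughout; an illustration, not the dotCMS implementation.
final class BootstrapGateSketch {

    /** Returns true when OS is reachable, false when the gate fell back to ES only. */
    interface ReadinessGate {
        boolean waitUntilReady();
    }

    /** Creates the OS indices; the sketch only needs the call to be observable. */
    interface IndexCreator {
        void createIndices();
    }

    /** Single chokepoint: every startup path calls this, so every path is gated. */
    static boolean bootstrapAndPoint(ReadinessGate gate, IndexCreator creator) {
        if (!gate.waitUntilReady()) {
            return false; // shadow-phase fallback: skip OS index creation
        }
        creator.createIndices();
        return true;
    }

    public static void main(String[] args) {
        int[] created = {0};
        // Both the populated-DB and the empty-DB paths would call the same method.
        bootstrapAndPoint(() -> false, () -> created[0]++);
        bootstrapAndPoint(() -> true, () -> created[0]++);
        System.out.println(created[0]); // 1: indices were created only when the gate passed
    }
}
```

Placing the gate inside the shared method, rather than in each caller, is what keeps a newly added startup path from bypassing it.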