
fix(opensearch-migration): unify phase-aware OS startup connection gate (#36244) #36248

Open
fabrizzio-dotCMS wants to merge 3 commits into main from issue-36244-unify-os-startup-gate

Conversation

@fabrizzio-dotCMS

Member

Problem

dotCMS had inconsistent OpenSearch connection gating at startup (#36244). The OS readiness gate was not phase-aware, and the empty-DB bootstrap path created OS indices with no connection gate at all — so an unreachable/misconfigured OS surfaced as an opaque ConnectionClosedException deep inside createContentIndex ~30s into startup (SystemExitManager - Startup failure) instead of a fast, actionable failure.

Root cause found while implementing

OSIndexAPIImpl.getClusterStats() swallows every exception and returns a non-null empty result, so the retry loop in waitUtilIndexReady() could never observe a failure — the OS connection gate was effectively dead code and always passed.
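
To make the dead-gate effect concrete, here is a minimal sketch of a retry loop fed by two kinds of probe. All class and method names are hypothetical and none of this is the dotCMS source; it only assumes that the gate treats "probe did not throw" as "connected", as described above.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical names throughout; this is an illustration, not the dotCMS code.
final class ProbeSketch {

    /** Retries the probe; returns true on the first attempt that does not throw. */
    static boolean waitUntilReady(Callable<Object> probe, int attempts) {
        for (int i = 0; i < attempts; i++) {
            try {
                probe.call();
                return true;            // probe did not throw: treated as "connected"
            } catch (Exception e) {
                // a real gate would log and sleep between attempts
            }
        }
        return false;                   // retry budget exhausted
    }

    /** Mimics a probe that swallows every error and returns an empty, non-null result. */
    static Object swallowingProbe() {
        try {
            throw new IOException("connection refused");
        } catch (Exception e) {
            return new Object();        // the failure never reaches the caller
        }
    }

    /** Mimics a probe that lets the transport failure reach the caller. */
    static Object propagatingProbe() throws IOException {
        throw new IOException("connection refused");
    }

    public static void main(String[] args) {
        System.out.println(waitUntilReady(ProbeSketch::swallowingProbe, 3));   // true
        System.out.println(waitUntilReady(ProbeSketch::propagatingProbe, 3));  // false
    }
}
```

With the swallowing probe the loop reports success even though every underlying call failed, which is why the gate could never trip.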

Changes (minimal scope)

B — OSIndexAPIImpl.waitUtilIndexReady() is phase-aware

  • Probes with client.info() (which propagates transport/TLS/auth failures), replacing the swallowing getClusterStats() probe.
  • On retry exhaustion:
    • Phase 3 (OS primary, ES decommissioned) → abort via SystemExitManager.immediateExit(1) with an actionable FATAL message (phase + endpoints + cause). No fallback.
    • Phase 1 / 2 (shadow) → haltMigration() (reset to Phase 0, ES-only fallback) + ERROR log; returns false instead of killing the server.
  • Retry count and sleep remain configurable (OS_CONNECTION_ATTEMPTS w/ ES_CONNECTION_ATTEMPTS fallback, OS_CONNECTION_RETRY_SLEEP_SECONDS).
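
The exhaustion branching and the config fallback described in B can be sketched roughly as follows. This is a sketch under assumptions: only PHASE_3_OPENSEARCH_ONLY, isMigrationComplete() and the two config keys appear in this PR; the shadow-phase enum names, the Outcome type, the Map-based config and the helper names are hypothetical, and the default of 24 is the retry budget quoted later in this conversation.

```java
import java.util.Map;

// Hypothetical sketch, not the dotCMS source.
final class ExhaustionSketch {

    enum Phase {
        PHASE_1_SHADOW, PHASE_2_SHADOW, PHASE_3_OPENSEARCH_ONLY;   // first two names are assumed

        boolean isMigrationComplete() {
            return this == PHASE_3_OPENSEARCH_ONLY;
        }
    }

    enum Outcome { ABORT_STARTUP, FALL_BACK_TO_ES }

    /** Decides what happens once every connection attempt has failed. */
    static Outcome onRetryExhausted(Phase phase) {
        // Phase 3 has no ES to fall back to, so it aborts; shadow phases halt the migration.
        return phase.isMigrationComplete() ? Outcome.ABORT_STARTUP : Outcome.FALL_BACK_TO_ES;
    }

    /** OS-specific key first, then the ES key, then the assumed default of 24. */
    static int resolveAttempts(Map<String, String> config) {
        String esValue = config.getOrDefault("ES_CONNECTION_ATTEMPTS", "24");
        return Integer.parseInt(config.getOrDefault("OS_CONNECTION_ATTEMPTS", esValue));
    }

    public static void main(String[] args) {
        System.out.println(onRetryExhausted(Phase.PHASE_3_OPENSEARCH_ONLY));          // ABORT_STARTUP
        System.out.println(onRetryExhausted(Phase.PHASE_1_SHADOW));                   // FALL_BACK_TO_ES
        System.out.println(resolveAttempts(Map.of("ES_CONNECTION_ATTEMPTS", "10")));  // 10
    }
}
```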

C — Connection gate runs before OS index creation on both startup paths

  • ContentletIndexAPIImpl.bootstrapAndPointOS() runs the phase-aware OS gate (operationsOS.indexAPI().waitUtilIndexReady()) before creating OS indices. As the single chokepoint for OS index creation, both startup paths — populated-DB (InitServlet) and empty-DB (Task00004LoadStarter) — now pass through the same gate. A shadow-phase fallback skips OS creation.
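
A minimal sketch of the chokepoint idea in C, assuming the gate can be modelled as a boolean supplier. The class, the field names and the in-memory list of "created" indices are hypothetical stand-ins; the real bootstrapAndPointOS() signature is not shown in this PR description.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

// Hypothetical sketch, not the dotCMS source.
final class BootstrapGateSketch {

    final List<String> created = new ArrayList<>();   // stands in for real OS index creation
    private final BooleanSupplier osGate;              // stands in for waitUtilIndexReady()

    BootstrapGateSketch(BooleanSupplier osGate) {
        this.osGate = osGate;
    }

    /** Single chokepoint: every caller passes the gate before any index is created. */
    boolean bootstrapAndPointOS(String indexName) {
        if (!osGate.getAsBoolean()) {
            return false;                              // shadow-phase fallback: skip OS creation
        }
        created.add(indexName);
        return true;
    }

    public static void main(String[] args) {
        BootstrapGateSketch unreachable = new BootstrapGateSketch(() -> false);
        System.out.println(unreachable.bootstrapAndPointOS("working_index"));  // false
        System.out.println(unreachable.created.isEmpty());                      // true
    }
}
```

Because both startup paths call the same method, neither can reach index creation without the gate having run first.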

A — MainServlet waits on the primary store, not ES-hardcoded

  • getESIndexAPI() already returns the phase-aware router (IndexAPIImpl), so the startup wait routes to the primary store for the phase (ES in 0–1, OS in 2–3). Comment clarified; on a shadow-phase fallback to ES the wait runs again so the new primary (ES) is gated too. No Phase 0 behavior change.
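
The routing rule in A (ES in Phase 0–1, OS in Phase 2–3) can be illustrated with a small sketch. The IndexApi interface, the integer phase and primaryFor() are hypothetical simplifications; the actual router is IndexAPIImpl and its internals are not shown here.

```java
// Hypothetical sketch, not the dotCMS source.
final class RouterSketch {

    interface IndexApi {
        boolean waitUtilIndexReady();
    }

    /** Picks the store whose readiness the startup wait should block on. */
    static IndexApi primaryFor(int phase, IndexApi es, IndexApi os) {
        if (phase < 0 || phase > 3) {
            throw new IllegalArgumentException("unknown phase " + phase);
        }
        return phase >= 2 ? os : es;    // ES is primary in 0-1, OS in 2-3
    }

    public static void main(String[] args) {
        IndexApi es = () -> true;
        IndexApi os = () -> true;
        for (int phase = 0; phase <= 3; phase++) {
            String target = primaryFor(phase, es, os) == es ? "ES" : "OS";
            System.out.println("phase " + phase + " waits on " + target);
        }
    }
}
```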

Acceptance criteria

  • A — startup readiness wait routes through the phase-aware router; ES unchanged in Phase 0/1
  • B — Phase 3 aborts with actionable message; Phase 1/2 falls back to ES (no exit); retry/sleep configurable
  • C — gate runs before OS index creation; both startup paths share the same gate; minimal (does not move validateIndexingConfig() into the hot path)
  • No regression to Phase 0 startup (populated or empty DB)
  • New phase-branching behavior covered by tests

Testing

  • ./mvnw compile -pl :dotcms-core ✅ (Java 25)
  • 28 unit tests green: new OSIndexAPIImplWaitReadyTest (Phase 1/2 ES-only fallback) + PhaseRouterTest + ContentletIndexAPIImplPhaseTest.
  • Phase 3 abort and the bootstrap gate (OS unreachable during bootstrap) are verified by IT/manual QA — SystemExitManager halts the JVM, so the abort branch is not safely unit-testable.

Out of scope (separate follow-up)

  • Error classification of the connection catch (TLS-scheme mismatch vs HTTP 403 vs connection-refused) — to be filed as a child issue.

🤖 Generated with Claude Code

…te (#36244)

dotCMS had inconsistent OpenSearch connection gating at startup. The OS
readiness gate was not phase-aware and the empty-DB bootstrap path created
OS indices with no connection gate at all, so an unreachable/misconfigured
OS surfaced as an opaque ConnectionClosedException deep inside
createContentIndex ~30s into startup instead of a fast, actionable failure.

Root cause found while implementing: OSIndexAPIImpl.getClusterStats() swallows
all exceptions and returns a non-null empty result, so the retry loop in
waitUtilIndexReady() could never observe a failure — the gate was dead code
and always passed.

Changes (minimal scope):
- B: OSIndexAPIImpl.waitUtilIndexReady() now probes with client.info() (which
  propagates transport/TLS/auth failures) and its exhaustion outcome is
  phase-aware — Phase 3 aborts via SystemExitManager.immediateExit with an
  actionable message (phase + endpoints + cause); Phase 1/2 halts the migration
  (haltMigration → ES-only fallback) and returns false instead of killing the
  server. Retry count/sleep remain configurable.
- C: ContentletIndexAPIImpl.bootstrapAndPointOS() runs the phase-aware OS gate
  before creating OS indices. As the single chokepoint for OS index creation,
  both startup paths (populated-DB via InitServlet and empty-DB via
  Task00004LoadStarter) now pass through the same gate; a shadow-phase fallback
  skips OS creation.
- A: MainServlet startup readiness wait clarified — getESIndexAPI() already
  returns the phase-aware router, so it waits on the primary store for the
  phase; on a shadow-phase fallback to ES it waits again to gate the new primary.

Adds OSIndexAPIImplWaitReadyTest covering the Phase 1/2 ES-only fallback
(no exit, phase reset to 0). Phase 3 abort is verified by IT/manual QA since
SystemExitManager halts the JVM.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@claude

claude Bot commented Jun 19, 2026

Contributor

Claude finished @fabrizzio-dotCMS's task in 1m 38s


Rollback Safety Analysis

  • Read rollback-unsafe categories reference
  • Get full PR diff
  • Analyze diff against each unsafe category
  • Post results and apply label

Result: ✅ Safe to Roll Back

The label AI: Safe To Rollback has been applied.

Analysis

The PR touches 5 files — all pure Java logic and test changes, no database migrations, no Elasticsearch mapping changes, no data transformations. Each category was checked:

| Category | Verdict | Reasoning |
|---|---|---|
| C-1 Structural Data Model Change | ✅ Safe | No runonce task, no DDL, no core table modification |
| C-2 ES Mapping Change | ✅ Safe | No changes to ESMappingAPIImpl, ESMappingUtilHelper, or putMapping() calls |
| C-3 contentlet_as_json model version | ✅ Safe | No CURRENT_MODEL_VERSION change, no ImmutableContentlet change |
| C-4 DROP TABLE / DROP COLUMN | ✅ Safe | No DDL of any kind |
| H-1 One-Way Data Migration | ✅ Safe | No data transformation or backfill |
| H-2 RENAME TABLE / RENAME COLUMN | ✅ Safe | No renames |
| H-3 PK Restructuring | ✅ Safe | No constraint changes |
| H-4 New ContentType Field Type | ✅ Safe | No new field types |
| H-5 Binary Storage Provider Change | ✅ Safe | No storage provider changes |
| H-6 DROP PROCEDURE / FUNCTION | ✅ Safe | No stored procedure changes |
| H-7 NOT NULL column without default | ✅ Safe | No schema changes |
| H-8 VTL Viewtool Contract Change | ✅ Safe | No viewtool classes modified |
| M-1 Column Type Change | ✅ Safe | No DDL |
| M-2 Push Publishing Bundle Format | ✅ Safe | No bundle XML changes |
| M-3 REST/GraphQL API Contract | ✅ Safe | No REST endpoint or response contract changes |
| M-4 OSGi Plugin API | ✅ Safe | No OSGi interface/signature changes |

What Changed

All changes are behavioral only within the startup connection-gate logic:

  • OSIndexAPIImpl.java: waitUtilIndexReady() now probes with client.info() instead of the silently-swallowing getClusterStats(), with phase-aware retry exhaustion handling. N-1 still has the old implementation; rollback simply restores it.
  • ContentletIndexAPIImpl.java: bootstrapAndPointOS() gains a pre-creation OS reachability check. N-1 has the old code path; rollback restores it.
  • MainServlet.java: minor comment clarification plus a second waitUtilIndexReady() call on shadow-phase fallback. Purely behavioral startup-gate logic; no persistent state written.
  • OpenSearchUpgradeSuite.java + OSIndexAPIImplWaitReadyIT.java: new integration test class added to the suite. Test-only; no production effect.

None of these changes write to the database, modify ES/OS index mappings, or alter any stored data format. Rolling back to N-1 restores the previous startup behavior with no residual side effects.

@github-actions

github-actions Bot commented Jun 19, 2026

Contributor

🤖 Bedrock Review — deepseek.v3.2

[🟠 High] dotCMS/src/main/java/com/dotcms/content/index/opensearch/OSIndexAPIImpl.java:552 — Potential NullPointerException when lastError is null in handleConnectionExhausted. The method uses lastError.getMessage() without a null check, though the parameter is annotated @Nullable. While the catch block ensures lastError is set, if attempts is 0 or the loop exits abnormally, lastError could be null, causing a crash.

[🟡 Medium] dotCMS/src/main/java/com/liferay/portal/servlet/MainServlet.java:118 — Inefficient double call to waitUtilIndexReady() when the first returns false. This creates a second full retry cycle even though the migration has already been halted and the primary is now ES. Consider storing the result and calling again only if needed, or redesign to avoid redundant waiting.

[🟡 Medium] dotCMS/src/main/java/com/dotcms/content/index/opensearch/OSIndexAPIImpl.java:625 — Direct use of Config.getStringArrayProperty bypasses the dotCMS convention of using IndexConfigHelper for index-related properties. Should use IndexConfigHelper.getStringArray or similar to maintain consistency with other property accesses in the class.

[🟡 Medium] dotcms-integration/src/test/java/com/dotcms/content/index/opensearch/OSIndexAPIImplWaitReadyIT.java:64 — Test modifies global static configuration (Config.setProperty) without proper isolation. If tests run in parallel or other tests depend on these properties, they could be affected. Use @Rule or @ClassRule with TestContext for proper cleanup, though the @After method mitigates this risk.

[🟡 Medium] dotcms-integration/src/test/java/com/dotcms/content/index/opensearch/OSIndexAPIImplWaitReadyIT.java:46 — Inner test class FailingClientProvider does not implement all methods of OSClientProvider. If OSClientProvider has other methods (e.g., close(), getClient(Cluster cluster)), this will cause AbstractMethodError. Verify the interface contract is fully implemented.
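
For the first finding above (a possibly null lastError), a null-safe way to build the cause text could look like the sketch below. The class and method names are hypothetical; whether handleConnectionExhausted actually needs this depends on code not shown in this conversation.

```java
// Hypothetical sketch, not the dotCMS source.
final class CauseMessageSketch {

    /** Builds the "cause" part of the log line without dereferencing a null error. */
    static String describe(Throwable lastError) {
        if (lastError == null) {
            return "no connection attempt was made";    // e.g. attempts configured as 0
        }
        String message = lastError.getMessage();        // may itself be null
        String type = lastError.getClass().getSimpleName();
        return message == null ? type : type + ": " + message;
    }

    public static void main(String[] args) {
        System.out.println(describe(null));
        System.out.println(describe(new IllegalStateException("TLS handshake failed")));
    }
}
```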


Run: #27981306280 · tokens: in: 5095 · out: 497 · total: 5592

…uite

Move OSIndexAPIImplWaitReadyTest (a misplaced unit test under
dotCMS/src/test) into the integration module as OSIndexAPIImplWaitReadyIT
and register it in OpenSearchUpgradeSuite so CI actually runs it under the
opensearch-upgrade profile. Tests for the OS-touching paths belong in
dotcms-integration with the IT suffix, not as plain unit tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@fabrizzio-dotCMS

Member Author

✅ Phase 0 ↔ Phase 3 boot congruence (safety note)

Verified that the Phase 3 fail-loud behavior this PR introduces is congruent with the long-standing Phase 0 behavior — it is not a new/novel startup contract, it just gives "OS as primary" the same treatment "ES as primary" has always had.

Same code path

MainServlet (line ~118) calls getESIndexAPI().waitUtilIndexReady(), which is the phase-aware router and routes the wait to the primary store of the current phase:

  • Phase 0/1 → ESIndexAPI.waitUtilIndexReady() (ES is primary)
  • Phase 2/3 → OSIndexAPIImpl.waitUtilIndexReady() (OS is primary)

Same outcome on retry exhaustion

| | Phase 0 (ES primary), pre-existing | Phase 3 (OS primary), this PR |
|---|---|---|
| Retries | ES_CONNECTION_ATTEMPTS (24) | OS_CONNECTION_ATTEMPTS (24, falls back to ES_CONNECTION_ATTEMPTS) |
| Sleep | 5s | OS_CONNECTION_RETRY_SLEEP_SECONDS (5s) |
| Probe | getClusterStats() (propagates) | client.info() (propagates) |
| On exhaustion | Logger.fatal + SystemExitManager.immediateExit(1) | Logger.fatal + SystemExitManager.immediateExit(1) |
| Result | Hard-exit | Hard-exit |

// Phase 0: ESIndexAPI#waitUtilIndexReady (pre-existing, untouched)
if (stats == null) {
    Logger.fatal(this.getClass(), "No Elasticsearch, dying an ugly death");
    SystemExitManager.immediateExit(1, "Elasticsearch connection failed after maximum attempts");
}

// Phase 3 — OSIndexAPIImpl#handleConnectionExhausted (this PR)
if (phase.isMigrationComplete()) {
    Logger.fatal(this.getClass(), detail + " — OS is the primary store ... cannot fall back to ES. ...");
    SystemExitManager.immediateExit(1, "OpenSearch connection failed in PHASE_3_OPENSEARCH_ONLY");
}

Why this matters for safety

  • The Phase 3 hard-exit is not a scary new behavior — it is identical in shape to the ES hard-exit that has always run in Phase 0 (same immediateExit(1), same default retry budget). The PR aligns OS-primary boot with ES-primary boot.
  • Before this PR, OS-primary boot either had no connection gate or a dead gate (the old probe used the swallowing OSIndexAPIImpl.getClusterStats() that always returned a non-null empty result, so the gate always passed) → an unreachable OS surfaced as an opaque late crash deep in createContentIndex. This PR makes the two phases congruent and the failure fast + actionable.

The one (correct) asymmetry

OS additionally has the Phase 1/2 shadow fallback (haltMigration() → reset to Phase 0, ES-only, no exit). ES has no equivalent because ES is never a shadow/secondary store. This is the shadow-store concept, which only applies to OS — not an inconsistency. The MainServlet double-call (if (!wait()) wait();) closes the loop: after a shadow fallback the second call runs in Phase 0 and gates the new primary (ES) with the same mechanism.
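
The double call can be traced with a small state sketch. Everything here is a hypothetical simplification: the integer phase, the osReachable flag, the esWaits counter and the assumption that ES is reachable all stand in for the real router, haltMigration() and the ES gate.

```java
// Hypothetical sketch, not the dotCMS source.
final class DoubleWaitSketch {

    int phase;                 // current migration phase, 0..3
    int esWaits = 0;           // how many times the ES gate ran
    final boolean osReachable;

    DoubleWaitSketch(int phase, boolean osReachable) {
        this.phase = phase;
        this.osReachable = osReachable;
    }

    /** Phase-aware wait: gates whichever store is primary right now. */
    boolean waitUtilIndexReady() {
        if (phase >= 2) {                  // OS is primary
            if (osReachable) {
                return true;
            }
            phase = 0;                     // stands in for haltMigration(): ES-only fallback
            return false;
        }
        esWaits++;                         // ES is primary; assumed reachable in this sketch
        return true;
    }

    /** Mirrors the if (!wait()) wait(); shape used at startup. */
    boolean startupWait() {
        boolean ready = waitUtilIndexReady();
        if (!ready) {
            ready = waitUtilIndexReady();  // second call now runs in Phase 0 and gates ES
        }
        return ready;
    }

    public static void main(String[] args) {
        DoubleWaitSketch boot = new DoubleWaitSketch(2, false);
        System.out.println(boot.startupWait());  // true
        System.out.println(boot.phase);          // 0
        System.out.println(boot.esWaits);        // 1
    }
}
```

In this model the second call is not a repeat of the first: the phase has changed in between, so it exercises a different store.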

Verdict: Phase 3 inherits the proven Phase 0 boot contract rather than inventing a new one. The OS side additionally improves on it (actionable message + fixes the swallowing-probe bug).


Labels

AI: Safe To Rollback
Area : Backend (PR changes Java/Maven backend code)

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

Unify OpenSearch startup connection gate: phase-aware retry/fallback across both startup paths

1 participant