test(search): de-flake nightly search-it & ui-it reindex tests#29188

Open
mohityadav766 wants to merge 1 commit into main from fix-flaky-tests-search

@mohityadav766 mohityadav766 commented Jun 18, 2026

Describe your changes:

Fixes #29187

The JavaUIIT Integration Tests (Nightly) workflow (search-it-nightly + ui-it-nightly) was failing on nearly every PR independent of content (it even failed on a no-op version-bump PR). These are timing/scope races in the #28637 reindex test suite — test-side only: no product behavior changed and no assertions weakened (two were made stricter).

Root causes & fixes

| Test | Root cause | Fix |
| --- | --- | --- |
| DbToEsCountReconciliationIT | cluster-wide DB↔ES counts read immediately after reindex Success; engine refresh lags → es < db | converge-poll until counts reconcile (keeps the cluster-wide umbrella; a real regression still never converges) |
| DistributedAutoTuneReindexUIIT | fuzzy q=<prefix> count also matched other parallel ui-it tests' entities (6900 vs 1500) | strict name.keyword prefix count, cluster-alias-aware index resolution, converge-poll, removed Thread.sleep |
| SimpleReindexTriggerUIIT, SelectiveFieldReindexUIIT, LongCompoundNameSearchUIIT, PipelineOwnerIndexUIIT | Awaitility .ignoreNoExceptions() aborted the poll on a transient Playwright/search error | .ignoreExceptions() so transient errors retry on the next tick |
| SelectiveFieldReindexUIIT | depended on the heavy global /data-quality page rendering under load | assert testCase/testSuite via their search index directly (the contract the DQ list is backed by) |
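
The converge-poll used for DbToEsCountReconciliationIT can be sketched in plain Java as below. This is a stdlib-only illustration, not the PR's code: the PR uses Awaitility, and the signature of collectCountMismatches, the awaitReconciled helper, and the count maps are assumptions made for the example.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

final class ConvergePollSketch {

  /** Returns one entry per entity type whose DB and ES counts differ (hypothetical signature). */
  static List<String> collectCountMismatches(Map<String, Long> db, Map<String, Long> es) {
    List<String> mismatches = new ArrayList<>();
    for (Map.Entry<String, Long> e : db.entrySet()) {
      long esCount = es.getOrDefault(e.getKey(), 0L);
      if (esCount != e.getValue()) {
        mismatches.add(e.getKey() + ": db=" + e.getValue() + " es=" + esCount);
      }
    }
    return mismatches;
  }

  /**
   * Re-reads the ES counts until they match the DB counts.
   * Refresh lag resolves after a few ticks; a real indexer regression
   * never converges and fails once the deadline has passed.
   */
  static void awaitReconciled(
      Map<String, Long> db,
      Supplier<Map<String, Long>> esCounts,
      Duration atMost,
      Duration pollInterval) {
    long deadline = System.nanoTime() + atMost.toNanos();
    List<String> mismatches = collectCountMismatches(db, esCounts.get());
    while (!mismatches.isEmpty()) {
      if (System.nanoTime() >= deadline) {
        throw new AssertionError("DB and ES counts never converged: " + mismatches);
      }
      try {
        Thread.sleep(pollInterval.toMillis());
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException("interrupted while polling", ie);
      }
      mismatches = collectCountMismatches(db, esCounts.get());
    }
  }
}
```

The assertion itself is unchanged by the polling: the counts must still match exactly, only the moment at which they are compared moves.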

SearchAvailableDuringReindexUIIT / SearchAvailableAllKindsDuringReindexUIIT are intentionally left unchanged — their .ignoreNoExceptions() is a deliberate fail-fast for the zero-downtime guarantee (a mid-flight blackout/duplicate must fail immediately; transient 503 shard-lag is already absorbed inside probeIndexToleratingShardLag).
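
The difference between the two exception policies can be modelled with a small stdlib-only poll loop. This is an illustrative sketch of the semantics described above, not Awaitility's implementation; pollUntilTrue and its retryOnException flag are hypothetical names.

```java
import java.util.function.BooleanSupplier;

final class PollPolicySketch {

  /**
   * Polls the probe up to maxTicks times and returns the 1-based tick on which it held.
   * retryOnException = true models the retrying policy (.ignoreExceptions()):
   * an exception on one tick is swallowed and the probe runs again.
   * retryOnException = false models the fail-fast policy (.ignoreNoExceptions()):
   * the first exception aborts the whole poll.
   */
  static int pollUntilTrue(BooleanSupplier probe, int maxTicks, boolean retryOnException) {
    RuntimeException last = null;
    for (int tick = 1; tick <= maxTicks; tick++) {
      try {
        if (probe.getAsBoolean()) {
          return tick;
        }
        last = null;
      } catch (RuntimeException e) {
        if (!retryOnException) {
          throw e; // fail-fast: surface the error immediately
        }
        last = e; // retrying: remember the error, try again on the next tick
      }
    }
    throw new IllegalStateException("condition not met within " + maxTicks + " ticks", last);
  }
}
```

With a probe that throws once and then succeeds, the retrying policy returns on the second tick, while the fail-fast policy rethrows the first error. That second behaviour is what the zero-downtime tests rely on.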

Type of change:

  • Bug fix (test reliability)

Tests:

Each modified test verified green on local testcontainers (Postgres + Elasticsearch, real Playwright browser):

  • DbToEsCountReconciliationIT — 2/2 runs
  • DistributedAutoTuneReindexUIIT, SimpleReindexTriggerUIIT, SelectiveFieldReindexUIIT, LongCompoundNameSearchUIIT, PipelineOwnerIndexUIIT — green

mvn test-compile and mvn spotless:check both pass.

🤖 Generated with Claude Code

Greptile Summary

This PR de-flakes the nightly search-it and ui-it reindex test suite by addressing root causes (ES refresh lag, fuzzy query scope bleed, and ignoreNoExceptions() aborting on transient errors) without weakening any assertions — two are in fact made stricter.

  • DB↔ES reconciliation (DbToEsCountReconciliationIT): converts the immediate post-reindex count comparison into an Awaitility convergence poll so ES refresh lag no longer causes spurious failures; a real regression still never converges and fails.
  • Distributed reindex count (DistributedAutoTuneReindexUIIT): replaces fuzzy q=<prefix> probe + Thread.sleep with a strict name.keyword prefix count via SearchAssertions and cluster-alias-aware IndexAliasInspector, removing the cross-test count bleed (6900 vs expected 1500) that was the primary flake source.
  • ignoreNoExceptions() → ignoreExceptions() (SimpleReindexTriggerUIIT, LongCompoundNameSearchUIIT, PipelineOwnerIndexUIIT, SelectiveFieldReindexUIIT): Awaitility's ignoreNoExceptions() aborts the poll on any exception including transient Playwright/search errors; ignoreExceptions() lets those retry.
  • DQ page dependency (SelectiveFieldReindexUIIT): replaces heavy /data-quality global page renders for testCase/testSuite with direct name.keyword index counts — the exact contract the DQ list is backed by.
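
A strict name.keyword prefix count might look roughly like the request built below: a prefix query sent to the search engine's _count endpoint, so only entities whose exact name starts with the test's prefix are counted. This is a sketch under stated assumptions; the helpers are hypothetical and do not reflect the SearchAssertions or IndexAliasInspector APIs, and the "_" separator between cluster alias and index name is an assumption.

```java
final class PrefixCountQuerySketch {

  /** Prepends the cluster alias when one is configured (the "_" separator is an assumption). */
  static String resolveIndex(String clusterAlias, String baseIndex) {
    return (clusterAlias == null || clusterAlias.isEmpty())
        ? baseIndex
        : clusterAlias + "_" + baseIndex;
  }

  /** Minimal JSON string escaping: backslash and double quote only. */
  static String jsonEscape(String s) {
    return s.replace("\\", "\\\\").replace("\"", "\\\"");
  }

  /** Request body for a count restricted to names starting with namePrefix. */
  static String countBody(String namePrefix) {
    return "{\"query\":{\"prefix\":{\"name.keyword\":{\"value\":\""
        + jsonEscape(namePrefix)
        + "\"}}}}";
  }

  /** Request path of the count endpoint for the resolved index. */
  static String countPath(String clusterAlias, String baseIndex) {
    return "/" + resolveIndex(clusterAlias, baseIndex) + "/_count";
  }
}
```

For example, countPath("ci", "table_search_index") yields /ci_table_search_index/_count (index and alias names here are illustrative). Because the prefix query matches the keyword field exactly rather than analyzed text, entities created by other parallel tests with different name prefixes are not included in the count.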

Confidence Score: 5/5

All changes are test-only; no product behaviour is altered and no assertion is weakened.

Every change targets a clearly identified root cause — refresh lag, cross-test query bleed, or an incorrect Awaitility exception policy. The convergence poll in DbToEsCountReconciliationIT still catches genuine indexer regressions (they never converge). The switch from fuzzy q= to strict name.keyword prefix in DistributedAutoTuneReindexUIIT is strictly more precise. No logic shared with production code is touched.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/search/DbToEsCountReconciliationIT.java | Wraps the DB↔ES count reconciliation assertion in an Awaitility convergence poll to absorb post-reindex ES refresh lag; extracts collectCountMismatches() helper. Logic is sound. |
| openmetadata-integration-tests/src/test/java/org/openmetadata/playwright/scenarios/search/reindex/DistributedAutoTuneReindexUIIT.java | Replaces fuzzy q= probe + Thread.sleep + hardcoded alias map with strict name.keyword prefix count via SearchAssertions + IndexAliasInspector and Awaitility poll. Root cause of the 6900 vs 1500 flake is addressed cleanly. |
| openmetadata-integration-tests/src/test/java/org/openmetadata/playwright/scenarios/search/reindex/SelectiveFieldReindexUIIT.java | Replaces DataQualityListPage-based UI assertions for testCase/testSuite with direct search-index count checks; removes flaky heavy-page dependency. ignoreNoExceptions() → ignoreExceptions() for the shared pollUiAssertion. |
| openmetadata-integration-tests/src/test/java/org/openmetadata/playwright/scenarios/search/issues/LongCompoundNameSearchUIIT.java | Adds awaitDiscoverableInExplore() polling loop with short per-tick Playwright timeout; changes ignoreNoExceptions() → ignoreExceptions() so transient errors retry. |
| openmetadata-integration-tests/src/test/java/org/openmetadata/playwright/scenarios/search/reindex/SimpleReindexTriggerUIIT.java | Two ignoreNoExceptions() → ignoreExceptions() changes in separate Awaitility blocks. No logic changes. |
| openmetadata-integration-tests/src/test/java/org/openmetadata/playwright/scenarios/search/issues/PipelineOwnerIndexUIIT.java | Single-line change: ignoreNoExceptions() → ignoreExceptions() so transient Playwright/search errors retry instead of aborting the poll. |

Reviews (1): Last reviewed commit: "test(search): de-flake nightly search-it..."

The JavaUIIT nightly workflow (search-it-nightly + ui-it-nightly) was failing on
nearly every PR independent of content — timing/scope races in the #28637 reindex
test suite, not product regressions. No production code changed; no assertions
weakened (two made stricter).

- DbToEsCountReconciliationIT: wrap the cluster-wide per-type DB<->ES count check
  in a converge-poll so it absorbs post-reindex engine-refresh lag; keeps the
  umbrella, a real indexer regression still never converges.
- DistributedAutoTuneReindexUIIT: replace the fuzzy /v1/search/query?q=prefix
  cohort count (matched other parallel tests' entities -> 6900 vs 1500) with a
  strict name.keyword prefix count, cluster-alias-aware index resolution, and a
  converge-poll; remove Thread.sleep.
- SimpleReindexTriggerUIIT, SelectiveFieldReindexUIIT, LongCompoundNameSearchUIIT,
  PipelineOwnerIndexUIIT: Awaitility .ignoreNoExceptions() -> .ignoreExceptions()
  so a transient Playwright/search error retries instead of aborting the poll.
- SelectiveFieldReindexUIIT: move the two heavy global /data-quality page checks to
  direct testCase/testSuite search-index presence (the contract the DQ list is
  backed by); harden LongCompoundName's Explore probe into a re-navigating poll.

SearchAvailable{,AllKinds}DuringReindexUIIT intentionally keep .ignoreNoExceptions()
(fail-fast on a zero-downtime violation).

Verified: each test green on local testcontainers Postgres + Elasticsearch.

Fixes #29187

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions Bot commented Jun 18, 2026

❌ PR checklist incomplete

This PR cannot be merged until the following are addressed on its linked issue:

The fields live on the linked issue in the Shipping project (open the issue → right sidebar → Projects). After you set them, re-run this check (or push a commit) — issue/project changes do not re-trigger it automatically.

Maintainers can bypass this check by adding the skip-pr-checks label.

github-actions Bot added the backend and safe to test labels Jun 18, 2026

gitar-bot Bot commented Jun 18, 2026

Code Review ✅ Approved

Replaces flaky polling and hardcoded sleeps with robust converge-polling and strict keyword matching in search integration tests. No issues found.



Development

Successfully merging this pull request may close these issues.

Flaky nightly search-it & ui-it reindex tests (timing/scope races, not product bugs)
