Skip to content

fix(it): stop flaky integration tests (#27649)#27651

Open
mohityadav766 wants to merge 7 commits intomainfrom
fix-flaky-tests
Open

fix(it): stop flaky integration tests (#27649)#27651
mohityadav766 wants to merge 7 commits intomainfrom
fix-flaky-tests

Conversation

@mohityadav766
Copy link
Copy Markdown
Member

@mohityadav766 mohityadav766 commented Apr 23, 2026

Summary

Fixes #27649 — two backend integration tests that time out under concurrent load.

GlossaryOntologyExportIT — add @Isolated

RdfUpdater is a JVM-wide singleton. This class's @BeforeAll flips it on (and @AfterAll flips it off). While it's on, every test class running concurrently starts doing synchronous Fuseki writes on every entity create — which saturates the Dropwizard thread pool and turns the 60s HTTP timeout into the observed request timed out at GlossaryOntologyExportIT.java:142.

PR #27172 already moved this class to @Execution(SAME_THREAD) and bumped the timeout to 60s, but SAME_THREAD only serialises within the class — the parallel failsafe execution still runs the rest of the *IT classes alongside it. RdfResourceIT already uses @Isolated for the same reason; this brings us in line.

WorkflowDefinitionResourceIT#triggerWorkflow_SDK — drop redundant wait + diagnostic aliases

The test at line 2086 was timing out with the generic "Condition org.openmetadata.it.tests.WorkflowDefinitionResourceIT$$Lambda/... was not fulfilled within 2 minutes" — no way to tell which of the two atMost(120s) polls died.

  • Removed waitForWorkflowDeployment(client, workflowName) from triggerWorkflow_SDKcreateDataCompletenessWorkflow_SDK already calls it during creation.
  • Added descriptive aliases to await(...) so the next failure, if any, names the workflow or the FQN that didn't appear in the search index.

Test plan

  • mvn verify -Dit.test=GlossaryOntologyExportIT#testExportGlossaryAsRdfXml — passed (3:37)
  • mvn verify -Dit.test=GlossaryOntologyExportIT#testExportGlossaryAsRdfXml+testExportGlossaryAsTurtle with @Isolated applied — 2/2 passed
  • mvn verify -Dit.test=WorkflowDefinitionResourceIT#test_DataCompletenessWorkflow_SDK — passed
  • mvn test-compile clean
  • Full CI run under concurrent load (cannot reproduce the flake locally without the full suite)

🤖 Generated with Claude Code


Summary by Gitar

  • Search indexing improvements:
    • Introduced activeStagedIndices map in SearchRepository to track indexes during reindexing operations.
    • Added resolveWriteIndex to dynamically route live entity writes to staged indexes, preventing data loss during alias swaps.
    • Implemented registerStagedIndex and unregisterStagedIndex to manage staged index lifecycle in DefaultRecreateHandler.

This will update automatically on new commits.

GlossaryOntologyExportIT: mark @isolated. @BeforeAll flips RdfUpdater (a
JVM-wide singleton) on, which makes every concurrent test class start
doing synchronous Fuseki writes on entity create, saturating the
Dropwizard thread pool and causing 60s request timeouts. @execution
(SAME_THREAD) alone only serialises within this class.

WorkflowDefinitionResourceIT#triggerWorkflow_SDK: drop the redundant
waitForWorkflowDeployment call — the create path already waits. Add
descriptive aliases to the two await() polls so the next flake tells
us which FQN or workflow name actually timed out instead of an
anonymous lambda.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 23, 2026 06:29
@github-actions github-actions Bot added backend safe to test Add this label to run secure Github workflows on PRs labels Apr 23, 2026
@mohityadav766 mohityadav766 changed the title fix(it): stop two flaky integration tests (#27649) fix(it): stop flaky integration tests (#27649) Apr 23, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes flaky backend integration tests by preventing cross-test interference from JVM-wide RDF toggling and improving Awaitility timeout diagnostics in workflow tests.

Changes:

  • Added JUnit @Isolated to GlossaryOntologyExportIT to prevent concurrent test classes from inheriting the global RdfUpdater configuration.
  • Removed a redundant workflow deployment wait in triggerWorkflow_SDK (deployment is already awaited during workflow creation).
  • Added descriptive aliases to Awaitility await(...) calls to make future timeouts actionable.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/WorkflowDefinitionResourceIT.java Removes redundant wait and adds Awaitility aliases for clearer timeout failures.
openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/GlossaryOntologyExportIT.java Isolates RDF export tests to avoid JVM-wide singleton leakage across parallel integration tests.

@mohityadav766 mohityadav766 self-assigned this Apr 23, 2026
pmbrull
pmbrull previously approved these changes Apr 23, 2026
Live search indexing was silently skipped whenever a reindex job was in
RUNNING/READY/STOPPING state. SearchRepository.createEntityIndex() and
six sibling methods consulted SearchIndexRetryQueue.isEntityTypeSuspended()
and returned early with nothing written, nothing enqueued — entities
vanished from search until a future reindex happened to cover them.

The retry worker doubled down: when the scope refresh observed an active
job, it purged the retry queue; and processRecord() deleted records
whose type was suspended. So even manually enqueued retries were wiped.

This is how the #27649 flake surfaced: AppsResourceIT triggers
SearchIndexingApplication runs and its best-effort 30s wait silently
swallows timeouts. If a run was still RUNNING when AppsResourceIT
finished, the next class in the sequential fork
(WorkflowDefinitionResourceIT) inherited the suspension and its
freshly-created tables were never indexed — waitForEntityIndexedInSearch
then timed out at 120s. Same mechanism bites real users mid-reindex in
production.

Remove the suspension mechanism entirely:

* SearchRepository — drop the 8 isEntityTypeSuspended() early-returns;
  the client-availability path already enqueues for retry on its own.
* SearchIndexRetryWorker — drop refreshReindexSuspensionScopeIfNeeded()
  and the suspension branches in processRecord(); remove the retry-queue
  purge on suspendAll.
* SearchIndexRetryQueue — delete the updateSuspension / clearSuspension
  / isEntityTypeSuspended / isStreamingSuspended / isSuspendAllStreaming
  / getSuspendedEntityTypes API and the static AtomicBoolean /
  AtomicReference they backed.
* Drop the two IT cases that asserted the removed behaviour.

Live writes now always reach the search client; reindex and live
writes both target the same indices as before. Version conflicts
between the two paths (stale reindex batch overwriting a newer live
write) remain possible as they did before suspension was introduced —
that is the race suspension was meant to dodge, but dropping writes
altogether was worse than the race.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 23, 2026 08:32
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Copilot AI review requested due to automatic review settings April 23, 2026 10:05
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated no new comments.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 23, 2026

🟡 Playwright Results — all passed (20 flaky)

✅ 3691 passed · ❌ 0 failed · 🟡 20 flaky · ⏭️ 89 skipped

Shard Passed Failed Flaky Skipped
✅ Shard 1 481 0 0 4
🟡 Shard 2 652 0 4 7
🟡 Shard 3 663 0 3 1
🟡 Shard 4 645 0 3 27
🟡 Shard 5 608 0 3 42
🟡 Shard 6 642 0 7 8
🟡 20 flaky test(s) (passed on retry)
  • Features/BulkEditEntity.spec.ts › Glossary (shard 2, 1 retry)
  • Features/ColumnBulkOperations.spec.ts › should discard changes when closing drawer without saving (shard 2, 1 retry)
  • Features/DataQuality/ColumnLevelTests.spec.ts › Column Values Missing Count To Be Equal (shard 2, 1 retry)
  • Features/DomainTierCertificationVoting.spec.ts › Domain - Certification assign, update, and remove (shard 2, 1 retry)
  • Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 1 retry)
  • Features/UserProfileOnlineStatus.spec.ts › Should not show online status for inactive users (shard 3, 1 retry)
  • Pages/Customproperties-part2.spec.ts › Entity Reference List (shard 3, 1 retry)
  • Pages/Customproperties-part2.spec.ts › entityReferenceList shows item count, scrollable list, no expand toggle (shard 4, 1 retry)
  • Pages/DataContractsSemanticRules.spec.ts › Validate Description Rule Is_Not_Set (shard 4, 1 retry)
  • Pages/DomainAdvanced.spec.ts › User with domain access can view subdomains (shard 4, 1 retry)
  • Pages/EntityDataConsumer.spec.ts › Tier Add, Update and Remove (shard 5, 1 retry)
  • Pages/EntityDataConsumer.spec.ts › Tier Add, Update and Remove (shard 5, 1 retry)
  • Pages/Glossary.spec.ts › Add and Remove Assets (shard 5, 2 retries)
  • Features/AutoPilot.spec.ts › Create Service and check the AutoPilot status (shard 6, 1 retry)
  • Pages/HyperlinkCustomProperty.spec.ts › should display URL when no display text is provided (shard 6, 1 retry)
  • Pages/Lineage/DataAssetLineage.spec.ts › Column lineage for topic -> container (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)
  • Pages/Lineage/LineageRightPanel.spec.ts › Verify custom properties tab IS visible for supported type: searchIndex (shard 6, 1 retry)
  • Pages/Users.spec.ts › Permissions for table details page for Data Consumer (shard 6, 1 retry)
  • Pages/Users.spec.ts › Check permissions for Data Steward (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

The distributed reindex has a TOCTOU: partitions read from a DB
snapshot at T0 and write to a staged index, then at T1 (seconds
later) the alias is atomically swapped from the old index to the
staged one and the old index is deleted. Any entity that live-writers
create between T0 and T1 goes via the alias → old index, and is
destroyed when that old index is deleted post-swap.

The CI log for #27649 shows this directly:

  10:13:35  staged table_search_index_rebuild_…_215646 built from snapshot
  10:13:40  POST /v1/tables table1_gold → written to alias target
            (old index _179670)
  10:13:40  table2_silver, table3_bronze, table4_brass all written to
            old index _179670
  10:13:42  Atomically swapped aliases from [_179670] to _215646
  10:13:42  Successfully deleted index _179670
  10:13:43+ waitForEntityIndexedInSearch polls, finds nothing, times
            out at 2 min

Removing the silent-skip suspension mechanism in the previous commit
exposed this race (it had been hidden by dropping the writes outright,
which was strictly worse).

Route live writes to the staged index during the reindex window:

* SearchRepository gains an activeStagedIndices map (entityType →
  stagedIndex) plus register/unregister/resolveWriteIndex. Writes
  resolve to the staged index when one is registered for the type,
  otherwise to the canonical alias — the existing behaviour.
* DefaultRecreateHandler.recreateIndexFromMapping registers the
  staged index as soon as it is created; finalizeReindex and
  promoteEntityIndex unregister it on every exit path (successful
  swap, swap failure, failed-reindex delete, exception).
* Every live-write path in SearchRepository — createEntityIndex,
  createEntitiesIndex, indexTableColumns, indexColumnsForTables,
  updateEntityIndex, createTimeSeriesEntity, updateTimeSeriesEntity,
  deleteEntityIndex, deleteEntityByFQNPrefix, deleteTimeSeriesEntityById
  — goes through resolveWriteIndex instead of reading the canonical
  alias directly.

During a reindex, live writes land in the index that the alias will
promote to; after the swap the alias points to that same index and
subsequent writes continue to reach the same place. Old-index deletion
no longer discards fresh data.

Note: searches through the alias during the brief reindex window (<
seconds in the CI log) can miss a write until the swap lands — an
acceptable trade compared to silently dropping the write or losing it
on deletion. The #27649 test tolerates this because its 120s poll
spans many swap cycles.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

The Java checkstyle failed.

Please run mvn spotless:apply in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Java code formatting.

You can install the pre-commit hooks with make install_test precommit_install.

@gitar-bot
Copy link
Copy Markdown

gitar-bot Bot commented Apr 23, 2026

Code Review ✅ Approved

Eliminates flaky integration tests by refining test synchronization, ensuring reliable execution across the suite. No issues found.

Options

Display: compact → Showing less information.

Comment with these commands to change:

Compact
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

@sonarqubecloud
Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend safe to test Add this label to run secure Github workflows on PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky Integration/Unit Tests backend

3 participants