HDDS-13661. Fix flaky TestKeyDeletingService#testPurgeKeysRequestBatching#10025

Open

arunsarin85 wants to merge 1 commit into apache:master from arunsarin85:HDDS-13661

Conversation

@arunsarin85
Contributor

What changes were proposed in this pull request?

Fixed two race conditions in TestKeyDeletingService$RequestBatching#testPurgeKeysRequestBatching that caused intermittent failures:

  1. Removed background thread interference
  2. Ensured deleted keys are visible in RocksDB before the service runs
  3. Removed the @Flaky("HDDS-13661") annotation

Please describe your PR in detail:
The test testPurgeKeysRequestBatching validates that KeyDeletingService correctly batches PurgeKeysRequests to stay within the Ratis byte limit (introduced by HDDS-13517). It creates and deletes 50 keys, then triggers the service manually via runPeriodicalTaskNow(), asserting that all 50 keys are purged across the captured batches.
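The batching behavior under test can be illustrated with a self-contained sketch. Note this is a toy model, not the actual Ozone code: `BYTE_LIMIT` and `splitIntoBatches` are illustrative names, and the real limit comes from the Ratis configuration introduced by HDDS-13517.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class BatchSketch {
  // Illustrative per-request byte limit (the real value is configurable).
  static final int BYTE_LIMIT = 64;

  // Split key names into batches whose combined UTF-8 size stays under
  // BYTE_LIMIT, mirroring how PurgeKeysRequests are kept within the
  // Ratis byte limit.
  static List<List<String>> splitIntoBatches(List<String> keys) {
    List<List<String>> batches = new ArrayList<>();
    List<String> current = new ArrayList<>();
    int size = 0;
    for (String key : keys) {
      int keySize = key.getBytes(StandardCharsets.UTF_8).length;
      if (!current.isEmpty() && size + keySize > BYTE_LIMIT) {
        batches.add(current);
        current = new ArrayList<>();
        size = 0;
      }
      current.add(key);
      size += keySize;
    }
    if (!current.isEmpty()) {
      batches.add(current);
    }
    return batches;
  }

  public static void main(String[] args) {
    List<String> keys = new ArrayList<>();
    for (int i = 0; i < 50; i++) {
      keys.add("key-" + i);  // 5-6 bytes each
    }
    List<List<String>> batches = splitIntoBatches(keys);
    int total = batches.stream().mapToInt(List::size).sum();
    // The test's core assertion is the analogue of: every key appears
    // in exactly one captured batch, so the total across batches is 50.
    System.out.println(batches.size() + " batches, " + total + " keys");
  }
}
```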

Root cause 1 — background thread race:
In the intermittent failure scenario, the background thread could pick up all 50 keys but only submit 7 of the 8 batches before getAllValues() is called, while the 8th batch (containing the 50th key) is still in-flight. The fix sets OZONE_BLOCK_DELETING_SERVICE_INTERVAL to 1 day so the background thread never fires during the test window.
Root cause 2 — DoubleBuffer visibility:
The fix inserts om.awaitDoubleBufferFlush() after createAndDeleteKeys() but before resume(), ensuring all 50 entries are committed to RocksDB before the service runs.
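The visibility problem reduces to a generic pattern: writes land in an in-memory buffer and only become visible to readers after an asynchronous flush, so the test must await the flush before asserting. A minimal sketch, assuming a toy `DoubleBuffer` class (not Ozone's actual double-buffer implementation):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;

public class FlushSketch {
  // Toy stand-in for a write buffer that is flushed asynchronously.
  static class DoubleBuffer {
    private final List<String> unflushed = new ArrayList<>();
    private final List<String> committed = new CopyOnWriteArrayList<>();

    synchronized void put(String key) {
      unflushed.add(key);
    }

    // The flush runs on a background thread; a caller that needs
    // read-your-writes semantics must wait on the returned future,
    // analogous to om.awaitDoubleBufferFlush() in the test.
    CompletableFuture<Void> flushAsync() {
      return CompletableFuture.runAsync(() -> {
        synchronized (this) {
          committed.addAll(unflushed);
          unflushed.clear();
        }
      });
    }

    int committedCount() {
      return committed.size();
    }
  }

  public static void main(String[] args) {
    DoubleBuffer buffer = new DoubleBuffer();
    for (int i = 0; i < 50; i++) {
      buffer.put("key-" + i);
    }
    // Without join() a reader could observe fewer than 50 keys;
    // awaiting the flush makes the subsequent assertion deterministic.
    buffer.flushAsync().join();
    System.out.println(buffer.committedCount());  // prints 50
  }
}
```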

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-13661

How was this patch tested?

  • The flaky test was reproduced locally (intermittent failure: expected: <50> but was: <49>) before the fix was applied.
  • After the fix, the test was stress-run 20 consecutive times using a local stress-run script (run-flaky-test.sh) with no failures:

@adoroszlai
Contributor

Thanks @arunsarin85 for working on this. Can you please trigger a run of flaky-test-check for TestKeyDeletingService with and without this change (i.e. branch HDDS-13661 and master)?

@arunsarin85
Contributor Author

arunsarin85 commented Apr 3, 2026

> Thanks @arunsarin85 for working on this. Can you please trigger a run of flaky-test-check for TestKeyDeletingService with and without this change (i.e. branch HDDS-13661 and master)?

Hi @adoroszlai ,

Ran testPurgeKeysRequestBatching 100 times locally on master (without the fix) - all passed. It looks like the race window is too narrow to trigger reliably on a local machine, but opens up under the load and timing conditions of CI environments.

full_log_TestKeyDeletingService$RequestBatching_testPurgeKeysRequestBatching_20260403_111721.txt

@adoroszlai
Contributor

> Ran testPurgeKeysRequestBatching 100 times locally on master (without the fix) - all passed. It looks like the race window is too narrow to trigger reliably on a local machine, but opens up under the load and timing conditions of CI environments.

Please use:
https://github.com/arunsarin85/ozone/actions/workflows/intermittent-test-check.yml

testRatisLimitBytes, StorageUnit.BYTES);
// Use a very large service interval so the background thread never fires
// during the test, preventing concurrent processing with runPeriodicalTaskNow().
conf.setTimeDuration(OZONE_BLOCK_DELETING_SERVICE_INTERVAL, 1, TimeUnit.DAYS);
Member

I think we should suspend the background worker instead.
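The alternative suggested here can be sketched generically: instead of stretching the interval, pause the worker with a suspend flag that it checks before each periodic run. This is a toy worker under assumed names (`Worker`, `periodicRun`), not Ozone's actual BackgroundService API:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

public class SuspendSketch {
  static class Worker {
    private final AtomicBoolean suspended = new AtomicBoolean(false);
    final AtomicInteger runs = new AtomicInteger();

    // Periodic entry point: a suspended worker skips its task, so a
    // test can trigger the task manually without interference from
    // background ticks.
    void periodicRun() {
      if (suspended.get()) {
        return;
      }
      runs.incrementAndGet();
    }

    void suspend() { suspended.set(true); }
    void resume()  { suspended.set(false); }
  }

  public static void main(String[] args) throws InterruptedException {
    Worker worker = new Worker();
    worker.suspend();
    ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleAtFixedRate(worker::periodicRun, 0, 10,
        TimeUnit.MILLISECONDS);
    Thread.sleep(100);  // background ticks fire but are skipped
    scheduler.shutdownNow();
    System.out.println(worker.runs.get());  // prints 0
  }
}
```

Compared with a one-day interval, an explicit suspend makes the test's intent visible and does not depend on the test finishing within the configured window.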
