HDDS-13661. Fix flaky TestKeyDeletingService#testPurgeKeysRequestBatching#10025

Open

arunsarin85 wants to merge 1 commit into apache:master from arunsarin85:HDDS-13661

Conversation

@arunsarin85
Contributor

What changes were proposed in this pull request?

Fixed two race conditions in TestKeyDeletingService$RequestBatching#testPurgeKeysRequestBatching that caused intermittent failures:

  1. Removed background thread interference
  2. Ensured deleted keys are visible in RocksDB before the service runs
  3. Removed the @Flaky("HDDS-13661") annotation

Please describe your PR in detail:
The test testPurgeKeysRequestBatching validates that KeyDeletingService correctly batches PurgeKeysRequests to stay within the Ratis byte limit (introduced by HDDS-13517). It creates and deletes 50 keys, then triggers the service manually via runPeriodicalTaskNow(), asserting that all 50 keys are purged across the captured batches.
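The batching behavior under test can be illustrated with a self-contained sketch. Note this is a toy model, not the actual Ozone code: `BYTE_LIMIT` and `splitIntoBatches` are illustrative names, and the real limit comes from the Ratis configuration introduced by HDDS-13517.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class BatchSketch {
  // Illustrative per-request byte limit (the real value is configurable).
  static final int BYTE_LIMIT = 64;

  // Split key names into batches whose combined UTF-8 size stays under
  // BYTE_LIMIT, mirroring how PurgeKeysRequests are kept within the
  // Ratis byte limit.
  static List<List<String>> splitIntoBatches(List<String> keys) {
    List<List<String>> batches = new ArrayList<>();
    List<String> current = new ArrayList<>();
    int size = 0;
    for (String key : keys) {
      int keySize = key.getBytes(StandardCharsets.UTF_8).length;
      if (!current.isEmpty() && size + keySize > BYTE_LIMIT) {
        batches.add(current);
        current = new ArrayList<>();
        size = 0;
      }
      current.add(key);
      size += keySize;
    }
    if (!current.isEmpty()) {
      batches.add(current);
    }
    return batches;
  }

  public static void main(String[] args) {
    List<String> keys = new ArrayList<>();
    for (int i = 0; i < 50; i++) {
      keys.add("key-" + i);  // 5-6 bytes each
    }
    List<List<String>> batches = splitIntoBatches(keys);
    int total = batches.stream().mapToInt(List::size).sum();
    // The test's core assertion is the analogue of: every key appears
    // in exactly one captured batch, so the total across batches is 50.
    System.out.println(batches.size() + " batches, " + total + " keys");
  }
}
```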

Root cause 1 — background thread race:
In the intermittent failure scenario, the background thread could pick up all 50 keys but only submit 7 of the 8 batches before getAllValues() is called, while the 8th batch (containing the 50th key) is still in-flight. The fix sets OZONE_BLOCK_DELETING_SERVICE_INTERVAL to 1 day so the background thread never fires during the test window.
Root cause 2 — DoubleBuffer visibility:
The fix inserts om.awaitDoubleBufferFlush() after createAndDeleteKeys() but before resume(), ensuring all 50 entries are committed to RocksDB before the service runs.
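The visibility problem reduces to a generic pattern: writes land in an in-memory buffer and only become visible to readers after an asynchronous flush, so the test must await the flush before asserting. A minimal sketch, assuming a toy `DoubleBuffer` class (not Ozone's actual double-buffer implementation):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;

public class FlushSketch {
  // Toy stand-in for a write buffer that is flushed asynchronously.
  static class DoubleBuffer {
    private final List<String> unflushed = new ArrayList<>();
    private final List<String> committed = new CopyOnWriteArrayList<>();

    synchronized void put(String key) {
      unflushed.add(key);
    }

    // The flush runs on a background thread; a caller that needs
    // read-your-writes semantics must wait on the returned future,
    // analogous to om.awaitDoubleBufferFlush() in the test.
    CompletableFuture<Void> flushAsync() {
      return CompletableFuture.runAsync(() -> {
        synchronized (this) {
          committed.addAll(unflushed);
          unflushed.clear();
        }
      });
    }

    int committedCount() {
      return committed.size();
    }
  }

  public static void main(String[] args) {
    DoubleBuffer buffer = new DoubleBuffer();
    for (int i = 0; i < 50; i++) {
      buffer.put("key-" + i);
    }
    // Without join() a reader could observe fewer than 50 keys;
    // awaiting the flush makes the subsequent assertion deterministic.
    buffer.flushAsync().join();
    System.out.println(buffer.committedCount());  // prints 50
  }
}
```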

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-13661

How was this patch tested?

  • The flaky test was reproduced locally (intermittent failure: expected: <50> but was: <49>) before the fix was applied.
  • After the fix, the test was stress-run 20 consecutive times using a local stress-run script (run-flaky-test.sh) with no failures:

@adoroszlai
Contributor

Thanks @arunsarin85 for working on this. Can you please trigger a run of flaky-test-check for TestKeyDeletingService with and without this change (i.e. branch HDDS-13661 and master)?

@arunsarin85
Contributor Author

arunsarin85 commented Apr 3, 2026

> Thanks @arunsarin85 for working on this. Can you please trigger a run of flaky-test-check for TestKeyDeletingService with and without this change (i.e. branch HDDS-13661 and master)?

Hi @adoroszlai ,

Ran testPurgeKeysRequestBatching 100 times locally on master (without the fix) - all passed. It looks like the race window is too narrow to trigger reliably on a local machine, but opens up under the load and timing conditions of CI environments.

full_log_TestKeyDeletingService$RequestBatching_testPurgeKeysRequestBatching_20260403_111721.txt

@adoroszlai
Contributor

> Ran testPurgeKeysRequestBatching 100 times locally on master (without the fix) - all passed. It looks like the race window is too narrow to trigger reliably on a local machine, but opens up under the load and timing conditions of CI environments.

Please use:
https://github.com/arunsarin85/ozone/actions/workflows/intermittent-test-check.yml

testRatisLimitBytes, StorageUnit.BYTES);
// Use a very large service interval so the background thread never fires
// during the test, preventing concurrent processing with runPeriodicalTaskNow().
conf.setTimeDuration(OZONE_BLOCK_DELETING_SERVICE_INTERVAL, 1, TimeUnit.DAYS);
Member

I think we should suspend the background worker instead.
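The alternative suggested here can be sketched generically: instead of stretching the interval, pause the worker with a suspend flag that it checks before each periodic run. This is a toy worker under assumed names (`Worker`, `periodicRun`), not Ozone's actual BackgroundService API:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

public class SuspendSketch {
  static class Worker {
    private final AtomicBoolean suspended = new AtomicBoolean(false);
    final AtomicInteger runs = new AtomicInteger();

    // Periodic entry point: a suspended worker skips its task, so a
    // test can trigger the task manually without interference from
    // background ticks.
    void periodicRun() {
      if (suspended.get()) {
        return;
      }
      runs.incrementAndGet();
    }

    void suspend() { suspended.set(true); }
    void resume()  { suspended.set(false); }
  }

  public static void main(String[] args) throws InterruptedException {
    Worker worker = new Worker();
    worker.suspend();
    ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleAtFixedRate(worker::periodicRun, 0, 10,
        TimeUnit.MILLISECONDS);
    Thread.sleep(100);  // background ticks fire but are skipped
    scheduler.shutdownNow();
    System.out.println(worker.runs.get());  // prints 0
  }
}
```

Compared with a one-day interval, an explicit suspend makes the test's intent visible and does not depend on the test finishing within the configured window.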
