HDDS-13661. Fix flaky TestKeyDeletingService#testPurgeKeysRequestBatching#10025
arunsarin85 wants to merge 1 commit into apache:master
Conversation
Thanks @arunsarin85 for working on this. Can you please trigger a run of
Hi @adoroszlai, I ran testPurgeKeysRequestBatching 100 times locally on master (without the fix) and all runs passed. It looks like the race window is too narrow to trigger reliably on a local machine, but it opens up under the load and timing conditions of CI environments. full_log_TestKeyDeletingService$RequestBatching_testPurgeKeysRequestBatching_20260403_111721.txt
Please use:
    testRatisLimitBytes, StorageUnit.BYTES);
// Use a very large service interval so the background thread never fires
// during the test, preventing concurrent processing with runPeriodicalTaskNow().
conf.setTimeDuration(OZONE_BLOCK_DELETING_SERVICE_INTERVAL, 1, TimeUnit.DAYS);
I think we should suspend the background worker instead.
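The suggestion above is to pause the worker explicitly rather than rely on a huge interval. As a minimal, self-contained sketch (names like suspend, resume, and runOneIteration are illustrative, not Ozone's actual API), a suspend flag lets the timer-driven iteration become a no-op so the test can invoke the task deterministically itself:

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative suspend/resume pattern for a periodic background worker.
class BackgroundWorker {
  private final AtomicBoolean suspended = new AtomicBoolean(false);
  private final AtomicInteger runs = new AtomicInteger();

  void suspend() { suspended.set(true); }
  void resume() { suspended.set(false); }

  // Called by the timer thread; does nothing while suspended, so a test
  // can trigger the task manually without racing the background thread.
  void runOneIteration() {
    if (suspended.get()) {
      return;
    }
    runs.incrementAndGet();
  }

  int runCount() { return runs.get(); }
}
```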
What changes were proposed in this pull request?
Fixed two race conditions in TestKeyDeletingService$RequestBatching#testPurgeKeysRequestBatching that caused intermittent failures.
Please describe your PR in detail:
The test testPurgeKeysRequestBatching validates that KeyDeletingService correctly batches PurgeKeysRequests to stay within the Ratis byte limit (introduced by HDDS-13517). It creates and deletes 50 keys, then triggers the service manually via runPeriodicalTaskNow(), asserting that all 50 keys are purged across the captured batches.
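To make the batching behavior concrete, here is an illustrative sketch (not the Ozone implementation) of splitting keys into requests that each stay under a byte limit: keys are appended to the current batch until adding the next one would exceed the limit.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical batcher modeling the Ratis byte-limit split; the class and
// method names are invented for illustration.
class PurgeBatcher {
  static List<List<String>> batch(List<String> keys, int limitBytes) {
    List<List<String>> batches = new ArrayList<>();
    List<String> current = new ArrayList<>();
    int currentBytes = 0;
    for (String key : keys) {
      int size = key.getBytes(StandardCharsets.UTF_8).length;
      // Start a new batch if this key would push the current one over the limit.
      if (!current.isEmpty() && currentBytes + size > limitBytes) {
        batches.add(current);
        current = new ArrayList<>();
        currentBytes = 0;
      }
      current.add(key);
      currentBytes += size;
    }
    if (!current.isEmpty()) {
      batches.add(current);
    }
    return batches;
  }
}
```

With 50 six-byte keys and a 42-byte limit, this yields seven keys per batch, i.e. eight batches in total, matching the "all 50 keys across the captured batches" assertion shape described above.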
Root cause 1 — background thread race:
In the intermittent failure scenario, the background thread could pick up all 50 keys but only submit 7 of the 8 batches before getAllValues() is called, while the 8th batch (containing the 50th key) is still in-flight. The fix sets OZONE_BLOCK_DELETING_SERVICE_INTERVAL to 1 day so the background thread never fires during the test window.
Root cause 2 — DoubleBuffer visibility:
The fix inserts om.awaitDoubleBufferFlush() after createAndDeleteKeys() but before resume(), ensuring all 50 entries are committed to RocksDB before the service runs.
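The visibility issue can be modeled with a minimal, self-contained sketch (this is not Ozone's OzoneManagerDoubleBuffer; names are invented for illustration): writes first land in an in-memory buffer and only become visible to readers after flush() commits them, so a scan that runs before the flush misses entries.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of a double-buffered store: puts are buffered, flush commits.
class DoubleBufferedStore {
  private final List<String> unflushed = new ArrayList<>();
  private final Set<String> committed = new HashSet<>();

  void put(String key) { unflushed.add(key); }

  // Commits all buffered entries, analogous to awaiting the double-buffer flush.
  void flush() {
    committed.addAll(unflushed);
    unflushed.clear();
  }

  // What a reader (e.g. the deleting service's scan) can observe.
  Set<String> visibleKeys() { return new HashSet<>(committed); }
}
```

Running the deleting pass between put() and flush() in this model sees zero keys, which mirrors why the flush barrier must come before the service is resumed.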
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-13661
How was this patch tested?