HDDS-8703. Integration test for SnapshotDeletingService during OM failover #10024

Open
arunsarin85 wants to merge 2 commits into apache:master from arunsarin85:HDDS-8703


What changes were proposed in this pull request?

Added two integration tests to TestOzoneManagerHASnapshot that verify SnapshotDeletingService (SDS) behaves correctly when an OM leader failover happens while snapshot cleanup is pending.

testSnapshotDeletingServiceDuringOMFailover: Simulates SDS being blocked on the old leader (via suspend()) while a snapshot is queued for deletion. Triggers a leader failover by shutting down the old leader. Verifies that the new leader's SDS independently picks up the pending SNAPSHOT_DELETED entry, purges it from the DB, and leaves the snapshot chain in a consistent state.

testSnapshotDeletingServiceWithMultipleSnapshotsDuringFailover: Extends the above scenario to 3 snapshots queued for deletion simultaneously before the failover. Verifies that the new leader's SDS correctly processes the full backlog and that chain integrity holds after all cleanups complete.

Please describe your PR in detail:
SnapshotDeletingService (SDS) is a background service on the OM leader responsible for cleaning up deleted snapshots. Two @Test methods are added:

  1. testSnapshotDeletingServiceDuringOMFailover
  • Creates 5 keys and one snapshot
  • Suspends SDS on the current leader (simulates SDS blocked mid-run)
  • Deletes the snapshot, waits for it to reach SNAPSHOT_DELETED state in DB
  • Shuts down the old leader → forces election of a genuinely different new leader (cluster has 3 OMs, quorum=2)
  • Waits for the new leader's SDS to purge the snapshot from snapshotInfoTable
  • Asserts snapshot chain is not corrupted
  • finally block restores the 3-node cluster by restarting the old OM
  2. testSnapshotDeletingServiceWithMultipleSnapshotsDuringFailover
  • Creates 3 snapshots with distinct keys captured in each
  • Suspends SDS on the current leader
  • Deletes all 3 snapshots, waits for all to reach SNAPSHOT_DELETED
  • Shuts down the old leader, waits for new leader election
  • Waits for the new leader's SDS to purge all 3 snapshots
  • Asserts snapshot chain integrity
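
The "waits for" steps above amount to polling a condition until it holds or a timeout expires. As a rough illustration only, here is a stand-alone sketch of such a helper. `PollingSketch` and `waitFor` are hypothetical names chosen for this example; this is not the utility the patch uses, and the condition in `main` is a simple counter standing in for a check such as "the snapshot is no longer in snapshotInfoTable".

```java
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

/** Hypothetical stand-alone polling helper; not part of Ozone or of this patch. */
public final class PollingSketch {

  private PollingSketch() {
  }

  /**
   * Re-checks the condition every intervalMs milliseconds until it holds,
   * or throws TimeoutException once timeoutMs milliseconds have elapsed.
   */
  public static void waitFor(BooleanSupplier condition, long intervalMs, long timeoutMs)
      throws TimeoutException, InterruptedException {
    long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
    while (!condition.getAsBoolean()) {
      if (System.nanoTime() - deadline >= 0) {
        throw new TimeoutException("Condition not met within " + timeoutMs + " ms");
      }
      Thread.sleep(intervalMs);
    }
  }

  public static void main(String[] args) throws Exception {
    // Stand-in condition: becomes true on the third poll.
    AtomicInteger polls = new AtomicInteger();
    waitFor(() -> polls.incrementAndGet() >= 3, 10, 1_000);
    System.out.println("condition met after " + polls.get() + " polls");
  }
}
```

Polling with a bounded timeout keeps a test from hanging if the new leader's SDS never purges the entry; the failure then surfaces as a timeout rather than a stuck build.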

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-8703

How was this patch tested?

The two new tests were run locally against a 3-node MiniOzoneHAClusterImpl:

mvn test -pl hadoop-ozone/integration-test -am \
  -Dtest="TestOzoneManagerHASnapshot#testSnapshotDeletingServiceDuringOMFailover+testSnapshotDeletingServiceWithMultipleSnapshotsDuringFailover" \
  -DfailIfNoTests=false
Result:

Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 33.88 s
BUILD SUCCESS

@adoroszlai changed the title from "HDDS-8703. [Snapshot] Integration test for SnapshotDeletingService during OM failover" to "HDDS-8703. Integration test for SnapshotDeletingService during OM failover" on Apr 2, 2026
@adoroszlai left a comment

Thanks @arunsarin85 for the patch.

* consistent. (HDDS-8703)
*/
@Test
public void testSnapshotDeletingServiceWithMultipleSnapshotsDuringFailover()

Please parameterize this test with numSnapshots 1 and 3. Then testSnapshotDeletingServiceDuringOMFailover can be removed, and this one renamed to testSnapshotDeletingServiceDuringOMFailover.

@ParameterizedTest
@ValueSource(ints = { 1, 3 })
void testSnapshotDeletingServiceDuringOMFailover(int numSnapshots)

@arunsarin85 (author) replied:

Thanks for the suggestion! Updated as recommended.

The two separate @Test methods have been merged into a single @ParameterizedTest:

@ParameterizedTest
@ValueSource(ints = {1, 3})
public void testSnapshotDeletingServiceDuringOMFailover(int numSnapshots)

numSnapshots=1 covers the single-snapshot failover scenario
numSnapshots=3 covers the multi-snapshot backlog scenario
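
To illustrate why one scenario body can cover both cases, here is a toy model in plain Java. It uses no Ozone or JUnit classes, and every name in it (`SdsFailoverSketch`, `SnapshotTable`, `Node`, `runScenario`) is hypothetical. The real test runs against a 3-node cluster; this sketch only mirrors the order of steps (suspend, delete, fail over, purge, check the chain) under the assumption that each snapshot keeps a back-pointer to its predecessor.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model only: none of these types are Ozone or JUnit classes. */
public class SdsFailoverSketch {

  enum Status { ACTIVE, DELETED }

  /** Stands in for the replicated snapshotInfoTable plus the snapshot chain. */
  static final class SnapshotTable {
    final Map<String, Status> entries = new LinkedHashMap<>();
    // snapshot name -> name of the previous snapshot (null for the oldest one)
    final Map<String, String> previous = new LinkedHashMap<>();
    String latest;

    void create(String name) {
      entries.put(name, Status.ACTIVE);
      previous.put(name, latest);
      latest = name;
    }

    void markDeleted(String name) {
      entries.put(name, Status.DELETED);
    }

    /** Removes one snapshot and relinks its successor to its predecessor. */
    void purge(String name) {
      String prev = previous.remove(name);
      entries.remove(name);
      for (Map.Entry<String, String> link : previous.entrySet()) {
        if (name.equals(link.getValue())) {
          link.setValue(prev);
        }
      }
      if (name.equals(latest)) {
        latest = prev;
      }
    }

    /** Consistent if every back-pointer is null or names an existing entry. */
    boolean chainIsConsistent() {
      for (String prev : previous.values()) {
        if (prev != null && !entries.containsKey(prev)) {
          return false;
        }
      }
      return latest == null || entries.containsKey(latest);
    }
  }

  /** One OM node with its own deleting-service instance. */
  static final class Node {
    boolean running = true;
    boolean sdsSuspended;

    /** One service pass: purges every DELETED entry unless stopped or suspended. */
    int runSdsOnce(SnapshotTable table) {
      if (!running || sdsSuspended) {
        return 0;
      }
      List<String> toPurge = new ArrayList<>();
      table.entries.forEach((name, status) -> {
        if (status == Status.DELETED) {
          toPurge.add(name);
        }
      });
      toPurge.forEach(table::purge);
      return toPurge.size();
    }
  }

  /** Runs the failover scenario; returns how many snapshots the new leader purged. */
  static int runScenario(int numSnapshots) {
    SnapshotTable table = new SnapshotTable();
    Node oldLeader = new Node();
    Node newLeader = new Node();
    for (int i = 0; i < numSnapshots; i++) {
      table.create("snap-" + i);
    }
    oldLeader.sdsSuspended = true;            // SDS blocked on the old leader
    for (int i = 0; i < numSnapshots; i++) {
      table.markDeleted("snap-" + i);
    }
    if (oldLeader.runSdsOnce(table) != 0) {
      throw new IllegalStateException("suspended SDS must not purge");
    }
    oldLeader.running = false;                // failover: old leader shut down
    int purged = newLeader.runSdsOnce(table); // new leader drains the backlog
    if (!table.entries.isEmpty() || !table.chainIsConsistent()) {
      throw new IllegalStateException("backlog not fully purged or chain broken");
    }
    return purged;
  }

  public static void main(String[] args) {
    for (int numSnapshots : new int[] {1, 3}) {
      System.out.println("numSnapshots=" + numSnapshots + " purged=" + runScenario(numSnapshots));
    }
  }
}
```

The only thing that varies between the two cases is the loop bound, which is why a single parameterized method can replace the two original tests.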

@adoroszlai added the test and snapshot (https://issues.apache.org/jira/browse/HDDS-6517) labels on Apr 2, 2026