RATIS-2501. Improve diagnostics for testInstallSnapshotDuringBootstrap timeout failures. #1427

Merged
szetszwo merged 1 commit into apache:RATIS-1931_grpc-zero-copy from slfan1989:RATIS-2501
Apr 12, 2026

Conversation

@slfan1989
Contributor

What changes were proposed in this pull request?

This PR improves the diagnostics and reliability of InstallSnapshotNotificationTests.testInstallSnapshotDuringBootstrap by adding bounded retry logic and comprehensive cluster state logging when configuration changes fail.

Key changes:

  1. Added `setConfigurationWithBoundedRetry()` method:
  • Limits setConfiguration retries to 30 attempts (vs unlimited retries previously)
  • Adds 1-second sleep between retry attempts
  • Logs each attempt with leader ID and target configuration
  • Dumps detailed cluster state on failure
  2. Added `waitAndCheckNewConfWithDiagnostics()` method:
  • Wraps RaftServerTestUtil.waitAndCheckNewConf() with exception handling
  • Triggers cluster state dump on any assertion or exception
  3. Added `dumpClusterState()` diagnostic method:
  • Logs comprehensive cluster information including:
    • Snapshot request/notification counts
    • Leader snapshot info
    • Per-division state (role, leader, term, indices, configuration)
    • Follower next/match indices
    • All server logs
  4. Updated `testInstallSnapshotDuringBootstrap()`:
  • Replaced direct cluster.setConfiguration() call with bounded retry version
  • Replaced configuration check with diagnostic version
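
For readers who have not opened the diff, the bounded-retry and dump-on-failure pattern described above could look roughly like the sketch below. This is not the code from the PR: the class `BoundedRetry`, its methods `retry` and `checkWithDiagnostics`, and the `dumpState` callback are hypothetical names chosen for illustration, and no Ratis APIs are used. Only the general shape (a fixed attempt limit, a sleep between attempts, and a state dump before failing) follows the description.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper; names are illustrative and not taken from the Ratis test code. */
final class BoundedRetry {
  private BoundedRetry() {}

  /**
   * Runs the action up to maxAttempts times, sleeping between failed attempts.
   * If every attempt fails, runs dumpState once and then throws.
   */
  static <T> T retry(int maxAttempts, long sleepMillis, Callable<T> action, Runnable dumpState)
      throws Exception {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1: " + maxAttempts);
    }
    Exception last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return action.call();
      } catch (InterruptedException e) {
        // Do not retry after an interrupt; restore the flag and propagate.
        Thread.currentThread().interrupt();
        throw e;
      } catch (Exception e) {
        last = e;
        System.err.println("Attempt " + attempt + "/" + maxAttempts + " failed: " + e);
      }
      if (attempt < maxAttempts) {
        Thread.sleep(sleepMillis);
      }
    }
    // All attempts failed: emit diagnostics before surfacing the last error.
    dumpState.run();
    throw new IllegalStateException("Gave up after " + maxAttempts + " attempts", last);
  }

  /** Runs a check; on any runtime exception or assertion error, dumps state and rethrows. */
  static void checkWithDiagnostics(Runnable check, Runnable dumpState) {
    try {
      check.run();
    } catch (RuntimeException | AssertionError e) {
      dumpState.run();
      throw e;
    }
  }
}
```

With the values from the description, a call would take the form `BoundedRetry.retry(30, 1000L, action, dumpState)`, where `action` and `dumpState` are placeholders for the test's own configuration change and diagnostic dump. Dumping state only after the final failure keeps passing runs quiet while still capturing context when a timeout occurs.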

What is the link to the Apache JIRA?

JIRA: RATIS-2501. Improve diagnostics for testInstallSnapshotDuringBootstrap timeout failures.

How was this patch tested?

  • Existing unit tests

@slfan1989
Contributor Author

@szetszwo I’ve recently been working on improvements related to gRPC zero-copy and have spent some time reviewing and understanding the relevant code. Based on the progress so far, I think it makes sense to continue moving this work forward, and I plan to keep refining this area over the coming period.

Previously, I submitted change #1417, which mainly added some metrics. However, during testing, some unit tests failed due to timeouts, while the same tests passed when run locally. Given this situation, I’m planning to submit a new PR to improve the error messages in the unit tests.

Would you mind taking a look at this change? Thank you very much!

Contributor

@szetszwo szetszwo left a comment

@slfan1989 , thanks a lot for fixing gRPC zero-copy!

+1 the change looks good.

szetszwo merged commit f76cb2e into apache:RATIS-1931_grpc-zero-copy on Apr 12, 2026
15 checks passed
@slfan1989
Contributor Author

> @slfan1989 , thanks a lot for fixing gRPC zero-copy!
>
> +1 the change looks good.

@szetszwo Thank you so much for reviewing the code!
