RATIS-2501. Improve diagnostics for testInstallSnapshotDuringBootstrap timeout failures. #1427
@szetszwo I’ve recently been working on improvements related to gRPC zero-copy and have spent some time reviewing and understanding the relevant code. Based on the progress so far, I think it makes sense to keep moving this work forward, and I plan to continue refining this area. Previously, I submitted change #1417, which mainly added some metrics. However, during testing, some unit tests failed due to timeouts, although the same tests passed when run locally. Given this, I’m submitting a new PR to improve the error messages in the unit tests. Would you mind taking a look at this change? Thank you very much!
@slfan1989, thanks a lot for fixing gRPC zero-copy!
+1 the change looks good.
@szetszwo Thank you so much for reviewing the code!
What changes were proposed in this pull request?
This PR improves the diagnostics and reliability of InstallSnapshotNotificationTests.testInstallSnapshotDuringBootstrap by adding bounded retry logic and comprehensive cluster state logging when configuration changes fail.
Key changes:
- setConfiguration retries to 30 attempts (vs unlimited retries previously)
- RaftServerTestUtil.waitAndCheckNewConf() with exception handling
- cluster.setConfiguration() call with bounded retry version
What is the link to the Apache JIRA
JIRA: RATIS-2501. Improve diagnostics for testInstallSnapshotDuringBootstrap timeout failures.
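As a rough illustration of the bounded-retry idea listed under the key changes, the sketch below caps the number of attempts and attaches a diagnostic snapshot to the final failure. It is not the code from this PR: the class BoundedRetry, the retry method, and the diagnostics supplier are all hypothetical names, and it uses only the Java standard library rather than any Ratis API.

```java
import java.util.concurrent.Callable;
import java.util.function.Supplier;

// Hypothetical helper, not part of Ratis: retries an operation a bounded number of times.
final class BoundedRetry {
  private BoundedRetry() {}

  /**
   * Runs op up to maxAttempts times, sleeping sleepMillis between failed attempts.
   * If every attempt fails, throws an exception whose message includes the
   * diagnostics string (for example, a dump of the cluster state).
   */
  static <T> T retry(Callable<T> op, int maxAttempts, long sleepMillis,
      Supplier<String> diagnostics) throws Exception {
    if (maxAttempts <= 0) {
      throw new IllegalArgumentException("maxAttempts must be positive: " + maxAttempts);
    }
    Exception last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return op.call();
      } catch (InterruptedException e) {
        // Do not retry on interruption; restore the flag and propagate.
        Thread.currentThread().interrupt();
        throw e;
      } catch (Exception e) {
        last = e;
        if (attempt < maxAttempts) {
          Thread.sleep(sleepMillis);
        }
      }
    }
    // The diagnostics are evaluated only once, on final failure.
    throw new IllegalStateException("Failed after " + maxAttempts
        + " attempts; state: " + diagnostics.get(), last);
  }
}
```

With a cap of 30 attempts, as in the description above, a test that would otherwise spin until the surrounding timeout instead fails with a message carrying whatever state the diagnostics supplier reports, along with the last underlying exception as the cause.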
How was this patch tested?