HDDS-14871. DataNode: tolerate per-volume health-check latch timeouts before marking volumes failed. #9954
devmadhuu wants to merge 5 commits into apache:master
ptlrs left a comment
Thanks for the PR @devmadhuu
    // Move the sliding window of IO test results forward 1 and check threshold.
    if (advanceIOWindow(diskChecksPassed)) {
      // If the failure threshold has been crossed, fail the volume without
      // further scans. Once the volume is failed, it will not be checked
      // anymore. The failure counts can be left as is.
Can we remove all changes not related to consecutiveTimeoutCount?
These changes conflict with PR #8843, which transitions the StorageVolume class to the new SlidingWindow implementation.
    public void resetTimeoutCount() {
      int prev = consecutiveTimeoutCount.getAndSet(0);
      if (prev > 0 && LOG.isDebugEnabled()) {
        LOG.debug("Volume {} completed a healthy check. Consecutive timeout"
            + " count reset from {} to 0.", this, prev);
      }
    }
We are using the AtomicInteger consecutiveTimeoutCount essentially to fail the volume when we see two consecutive timeouts.
This can also be modeled with a sliding window, similar to what we do for tracking volume check failures.
We can create a new sliding window that tracks the timeouts with a maximum tolerance of 1.
If we use the new SlidingWindow.java implementation, we also will not have to worry about resetting the count, as the time-based policy will take care of it automatically.
The time validity of the window can be 70 minutes, which is sufficient for two checkAllVolumes runs to complete.
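To make the suggestion above concrete, here is a minimal sketch of a time-based window with a tolerance of 1 and a 70-minute validity. The class name `TimeoutWindow`, its methods, and the injectable clock are hypothetical and chosen for illustration only; this is not the SlidingWindow.java API from PR #8843, whose actual signatures may differ.

```java
import java.time.Duration;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.LongSupplier;

/** Hypothetical time-based window; NOT the SlidingWindow.java API from PR #8843. */
final class TimeoutWindow {
  private final int tolerated;        // timeouts allowed before the window is exceeded
  private final long validityMillis;  // how long a recorded timeout keeps counting
  private final LongSupplier clock;   // injectable clock in milliseconds (assumption, eases testing)
  private final Deque<Long> events = new ArrayDeque<>();

  TimeoutWindow(int tolerated, Duration validity, LongSupplier clock) {
    this.tolerated = tolerated;
    this.validityMillis = validity.toMillis();
    this.clock = clock;
  }

  /** Records one latch timeout at the current clock time. */
  synchronized void add() {
    events.addLast(clock.getAsLong());
  }

  /** Drops expired timeouts, then reports whether more than the tolerated number remain. */
  synchronized boolean isExceeded() {
    long cutoff = clock.getAsLong() - validityMillis;
    while (!events.isEmpty() && events.peekFirst() < cutoff) {
      events.removeFirst();
    }
    return events.size() > tolerated;
  }
}
```

With `new TimeoutWindow(1, Duration.ofMinutes(70), System::currentTimeMillis)`, a single timeout is tolerated, a second timeout within 70 minutes exceeds the window, and an old timeout ages out on its own, so no explicit reset call is needed.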
priyeshkaratha left a comment
Thanks @devmadhuu for the patch. Changes LGTM
What changes were proposed in this pull request?
This PR addresses latch timeouts for pending volumes that have not yet reported a result.
StorageVolumeChecker.checkAllVolumes() waits on a single CountDownLatch for all volume health checks to complete. If the latch expires before any volume finishes, due to any transient stall, every pending volume is immediately marked FAILED with zero tolerance, producing false-positive volume failures. The existing per-volume IO-failure sliding window in StorageVolume.check() does not address this because it only applies when a check completes, not when the latch times out.
What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-14871
How was this patch tested?
This patch was tested by extending three unit tests in the existing test class:
TestStorageVolumeHealthChecks
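For illustration, the per-volume tolerance described in this pull request might behave roughly as follows. This is a minimal sketch under stated assumptions, not code from the patch: the class `TimeoutTolerance`, its method names, and the use of a string volume ID are hypothetical, and the tolerance of one timeout is taken from the review discussion rather than from the patch's configuration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper; names and structure are assumptions, not the patch's code. */
final class TimeoutTolerance {
  private final int maxTolerated;  // consecutive latch timeouts tolerated per volume
  private final Map<String, AtomicInteger> counts = new ConcurrentHashMap<>();

  TimeoutTolerance(int maxTolerated) {
    this.maxTolerated = maxTolerated;
  }

  /**
   * Called for a volume whose check was still pending when the latch expired.
   * Returns true if the volume has now exceeded its tolerance and should be failed.
   */
  boolean onTimeout(String volumeId) {
    int n = counts.computeIfAbsent(volumeId, v -> new AtomicInteger()).incrementAndGet();
    return n > maxTolerated;
  }

  /** Called when a volume's check completes in time; clears its consecutive count. */
  void onCompleted(String volumeId) {
    AtomicInteger c = counts.get(volumeId);
    if (c != null) {
      c.set(0);
    }
  }
}
```

With a tolerance of 1, the first timeout for a volume is absorbed, a completed check in between resets the count, and only a second timeout in a row causes the volume to be reported as failed.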