HDDS-14871. DataNode: tolerate per-volume health-check latch timeouts before marking volumes failed. by devmadhuu · Pull Request #9954 · apache/ozone

devmadhuu · 2026-03-20T13:04:41Z

What changes were proposed in this pull request?

This PR addresses false volume failures caused by the health-check latch timing out while some volumes have not yet reported a result.

StorageVolumeChecker.checkAllVolumes() waits on a single CountDownLatch for all volume health checks to complete. If the latch expires before any volume finishes — due to any transient stall — every pending volume is immediately marked FAILED with zero tolerance, producing false-positive volume failures.

The existing per-volume IO-failure sliding window in StorageVolume.check() does not address this because it only applies when a check completes, not when the latch times out.
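To illustrate the idea of tolerating latch timeouts per volume, here is a minimal sketch. It is not the code from this patch: the class name `TimeoutTolerantVolume`, the `recordTimeout()` method, and the `tolerated` parameter are hypothetical, and only the `consecutiveTimeoutCount` counter and the `resetTimeoutCount()` name mirror the diff under review. The assumption is that the checker would call `recordTimeout()` for each volume still pending when the latch expires, and `resetTimeoutCount()` when a check completes in time.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch only; not the StorageVolume implementation.
class TimeoutTolerantVolume {
  private final String name;
  // Number of consecutive latch timeouts tolerated before failing (assumed knob).
  private final int tolerated;
  private final AtomicInteger consecutiveTimeoutCount = new AtomicInteger();

  TimeoutTolerantVolume(String name, int tolerated) {
    this.name = name;
    this.tolerated = tolerated;
  }

  /**
   * Called when the latch expired while this volume's check was still pending.
   * Returns true once the tolerance is exceeded, i.e. the volume should be failed.
   */
  boolean recordTimeout() {
    return consecutiveTimeoutCount.incrementAndGet() > tolerated;
  }

  /** Called when a health check for this volume completes before the latch expires. */
  void resetTimeoutCount() {
    consecutiveTimeoutCount.set(0);
  }

  int getConsecutiveTimeoutCount() {
    return consecutiveTimeoutCount.get();
  }

  @Override
  public String toString() {
    return name;
  }
}
```

With `tolerated` set to 1, a single stalled check round is absorbed and only a second consecutive timeout would fail the volume; a completed check in between starts the count over.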

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-14871

How was this patch tested?

This patch was tested by extending 3 unit tests in the existing test class TestStorageVolumeHealthChecks.

...ner-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/StorageVolume.java

...vice/src/main/java/org/apache/hadoop/ozone/container/common/volume/StorageVolumeChecker.java

...c/main/java/org/apache/hadoop/ozone/container/common/statemachine/DatanodeConfiguration.java

ptlrs

Thanks for the PR @devmadhuu

ptlrs · 2026-03-29T00:01:17Z

...ner-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/StorageVolume.java

+    // Move the sliding window of IO test results forward 1 and check threshold.
+    if (advanceIOWindow(diskChecksPassed)) {
+      // If the failure threshold has been crossed, fail the volume without
+      // further scans. Once the volume is failed, it will not be checked
+      // anymore. The failure counts can be left as is.


Can we remove all changes not related to consecutiveTimeoutCount?

These changes conflict with PR #8843, which transitions the StorageVolume class to the new SlidingWindow implementation.

ptlrs · 2026-03-29T00:11:33Z

...ner-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/StorageVolume.java

+  public void resetTimeoutCount() {
+    int prev = consecutiveTimeoutCount.getAndSet(0);
+    if (prev > 0 && LOG.isDebugEnabled()) {
+      LOG.debug("Volume {} completed a healthy check. Consecutive timeout"
+          + " count reset from {} to 0.", this, prev);
+    }
+  }


We are using AtomicInteger consecutiveTimeoutCount to essentially fail if we see two consecutive failures.

This can also be modeled using a Sliding Window similar to what we do for tracking volume check failures.

We can create a new sliding window that keeps track of the timeouts, with a maximum tolerance of 1.

If we use the new SlidingWindow.java implementation, we will also not have to worry about resetting the count as the time based policy will automatically take care of it.

The time validity of the window can be 70 minutes, sufficient for two checkAllVolumes to complete.
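A rough sketch of what such a time-based window could look like follows. This is not the SlidingWindow.java from #8843, whose API is not shown in this conversation; the class `TimeoutWindowSketch`, its constructor parameters, and its methods are hypothetical. The clock is injected as a `LongSupplier` purely so the expiry behaviour can be exercised without waiting.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.LongSupplier;

// Hypothetical sketch of a time-based window; not the SlidingWindow from #8843.
class TimeoutWindowSketch {
  private final int tolerated;          // timeouts tolerated inside the window
  private final long validityMillis;    // how long a recorded timeout stays relevant
  private final LongSupplier clock;     // millisecond clock, injectable for tests
  private final Deque<Long> events = new ArrayDeque<>();

  TimeoutWindowSketch(int tolerated, long validityMillis, LongSupplier clock) {
    this.tolerated = tolerated;
    this.validityMillis = validityMillis;
    this.clock = clock;
  }

  /** Records one latch timeout at the current time. */
  synchronized void add() {
    long now = clock.getAsLong();
    expire(now);
    events.addLast(now);
  }

  /** True when more timeouts than tolerated are still inside the window. */
  synchronized boolean isExceeded() {
    expire(clock.getAsLong());
    return events.size() > tolerated;
  }

  // Drops timeouts older than the validity period, so no explicit reset is needed.
  private void expire(long now) {
    while (!events.isEmpty() && now - events.peekFirst() > validityMillis) {
      events.removeFirst();
    }
  }
}
```

With a tolerance of 1 and a validity of 70 minutes, two timeouts recorded within 70 minutes of each other exceed the window, while an older timeout ages out on its own. One difference from the counter approach in this sketch: a healthy check between two timeouts does not clear the first one; only the passage of time does.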

devmadhuu · 2026-03-30T04:59:10Z

@ptlrs Thanks for your review. Let's wait for your #8843 to be merged; I will then revisit this PR and make the changes accordingly.

priyeshkaratha

Thanks @devmadhuu for the patch. Changes LGTM

HDDS-14871. DataNode: tolerate per-volume health-check latch timeouts before marking volumes failed.

c702794

devmadhuu requested review from ChenSammi, adoroszlai and errose28 March 20, 2026 13:04

HDDS-14871. Fixed test case failure.

d91cf3d

devmadhuu marked this pull request as ready for review March 23, 2026 06:20

ChenSammi reviewed Mar 24, 2026

...ner-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/StorageVolume.java (outdated)

ChenSammi reviewed Mar 24, 2026

...vice/src/main/java/org/apache/hadoop/ozone/container/common/volume/StorageVolumeChecker.java

HDDS-14871. Fixed review comments.

de0a027

ChenSammi reviewed Mar 26, 2026

...c/main/java/org/apache/hadoop/ozone/container/common/statemachine/DatanodeConfiguration.java

ChenSammi reviewed Mar 26, 2026

...ner-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/StorageVolume.java (outdated)

ChenSammi reviewed Mar 26, 2026

...vice/src/main/java/org/apache/hadoop/ozone/container/common/volume/StorageVolumeChecker.java (outdated)

ChenSammi reviewed Mar 26, 2026

...vice/src/main/java/org/apache/hadoop/ozone/container/common/volume/StorageVolumeChecker.java (outdated)

HDDS-14871. Fixed review comments.

16bdb26

devmadhuu requested a review from ChenSammi March 27, 2026 08:16

HDDS-14871. Fixed findbugs.

452cd11

ptlrs reviewed Mar 29, 2026

priyeshkaratha approved these changes Mar 30, 2026

devmadhuu wants to merge 5 commits into apache:master from devmadhuu:HDDS-14871

4 participants
