Skip to content

HDDS-14871. DataNode: tolerate per-volume health-check latch timeouts before marking volumes failed.#9954

Open
devmadhuu wants to merge 5 commits intoapache:masterfrom
devmadhuu:HDDS-14871
Open

HDDS-14871. DataNode: tolerate per-volume health-check latch timeouts before marking volumes failed.#9954
devmadhuu wants to merge 5 commits intoapache:masterfrom
devmadhuu:HDDS-14871

Conversation

@devmadhuu
Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

This PR addresses the problem of latch timeout for pending volumes not reported any result.

StorageVolumeChecker.checkAllVolumes() waits on a single CountDownLatch for all volume health checks to complete. If the latch expires before any volume finishes — due to any transient stall — every pending volume is immediately marked FAILED with zero tolerance, producing false-positive volume failures.

The existing per-volume IO-failure sliding window in StorageVolume.check() does not address this because it only applies when a check completes, not when the latch times out.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-14871

How was this patch tested?

This patch has been tested by extending 3 unit tests in existing test class : TestStorageVolumeHealthChecks

@devmadhuu devmadhuu marked this pull request as ready for review March 23, 2026 06:20
@devmadhuu devmadhuu requested a review from ChenSammi March 27, 2026 08:16
Copy link
Copy Markdown
Contributor

@ptlrs ptlrs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @devmadhuu

Comment on lines +722 to +726
// Move the sliding window of IO test results forward 1 and check threshold.
if (advanceIOWindow(diskChecksPassed)) {
// If the failure threshold has been crossed, fail the volume without
// further scans. Once the volume is failed, it will not be checked
// anymore. The failure counts can be left as is.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove all changes not related to consecutiveTimeoutCount.

These changes conflict with the PR #8843 which transitions the StorageVolume class to use the new SlidingWindow implementation.

Comment on lines +791 to +797
public void resetTimeoutCount() {
int prev = consecutiveTimeoutCount.getAndSet(0);
if (prev > 0 && LOG.isDebugEnabled()) {
LOG.debug("Volume {} completed a healthy check. Consecutive timeout"
+ " count reset from {} to 0.", this, prev);
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are using AtomicInteger consecutiveTimeoutCount to essentially fail if we see two consecutive failures.

This can also be modeled using a Sliding Window similar to what we do for tracking volume check failures.

We can create a new sliding window which keeps track of the timeouts with a max toleration of 1.

If we use the new SlidingWindow.java implementation, we will also not have to worry about resetting the count as the time based policy will automatically take care of it.

The time validity of the window can be 70 minutes, sufficient for two checkAllVolumes to complete.

@devmadhuu
Copy link
Copy Markdown
Contributor Author

@ptlrs Thanks for your review. So lets wait for your #8843 to get merged, so that I can revisit this PR and do the changes accordingly.

Copy link
Copy Markdown
Contributor

@priyeshkaratha priyeshkaratha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @devmadhuu for the patch. Changes LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants