Recover disk-space metrics when a cached FileStore's directory is removed during region migration #17930
Closed
CRZbulabula wants to merge 1 commit into
…emoved A cached FileStore pins the exact path it was resolved from. When that path is deleted while IoTDB is running (e.g. an empty data region directory removed during region migration), every disk-space query against the stale FileStore throws NoSuchFileException, which was logged at ERROR on every heartbeat and flooded the DataNode log. Store the configured disk dirs and, when a space query fails, re-resolve the FileStores once via FileStoreUtils#getFileStore (which walks up to an existing ancestor on the same device) so the metric recovers on the next sampling. Remaining failures are logged at WARN instead of ERROR.
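
The "walk up to an existing ancestor" idea from the commit message can be illustrated with a minimal sketch. This is not the actual FileStoreUtils#getFileStore implementation; the class AncestorFileStoreSketch and its methods are hypothetical names used only for illustration, and the sketch does not verify that the ancestor it finds is on the same device as the original directory.

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not IoTDB code: resolves a FileStore for a directory
// that may have been deleted by falling back to its nearest existing ancestor.
final class AncestorFileStoreSketch {

  // Returns the path itself if it exists, otherwise the closest existing
  // parent; returns null if no ancestor exists at all.
  static Path nearestExisting(Path dir) {
    Path p = dir.toAbsolutePath().normalize();
    while (p != null && !Files.exists(p)) {
      p = p.getParent();
    }
    return p;
  }

  // Resolves the FileStore from the nearest existing ancestor, so the
  // returned object is pinned to a path that is present right now.
  static FileStore resolve(String dir) throws IOException {
    Path existing = nearestExisting(Paths.get(dir));
    if (existing == null) {
      throw new NoSuchFileException(dir);
    }
    return Files.getFileStore(existing);
  }

  public static void main(String[] args) throws IOException {
    Path tmp = Files.createTempDirectory("fs-sketch");
    Path removed = tmp.resolve("region").resolve("empty");
    // "removed" does not exist, yet a FileStore can still be resolved.
    System.out.println(resolve(removed.toString()).name());
  }
}
```

Resolving from an ancestor only helps if the store is looked up again after the deletion; a FileStore obtained earlier from the deleted path stays pinned to it.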
Superseded by #17931, which uses a branch on apache/iotdb directly instead of a fork.
Description
This is a follow-up to #17880 ("Fix empty snapshot loading and region cleanup"), addressing the second problem reported in the same scenario: a cluster that contains an empty DataRegion (auto-created by the ConfigNode after a scale-out, carrying 0 SeriesPartitionSlot) being migrated during scale operations.
While #17880 fixed the empty-snapshot loading (SnapshotLoader) and the region-cleanup timeout (TableDiskUsageIndex / DataRegion), the affected DataNode kept flooding its log with ERROR entries like:
Root cause
SystemMetrics#setDiskDirs resolves each configured disk directory into a java.nio.file.FileStore once at startup and caches the resulting objects. A FileStore pins the exact path it was resolved from; on Linux every getTotalSpace() / getUnallocatedSpace() / getUsableSpace() call re-runs statvfs on that pinned path.
When that directory is removed while IoTDB is running (e.g. an empty data region directory is deleted during region migration), the pinned path no longer exists and every space query throws NoSuchFileException. Because disk metrics are sampled on every DataNode heartbeat (and on every Prometheus scrape), the failure on the stale FileStore was logged at ERROR on every sampling; the metric never recovered and the log was flooded.
Fix
- SystemMetrics now also stores the configured disk dirs.
- When a space query against a cached FileStore fails, the FileStore set is re-resolved once via FileStoreUtils#getFileStore, which walks up to an existing ancestor directory on the same device. The metric then recovers on the next sampling instead of staying broken forever.
- Remaining failures are logged at WARN instead of ERROR, so they can no longer flood the log.
- fileStores / diskDirs are made volatile and the re-resolution is done copy-on-write, since the getters are invoked concurrently from the heartbeat and Prometheus-reporter threads.
Behavior
… 0 and spamming ERROR logs.
PingCode: V2-974
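
The items listed under Fix can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code in this PR: DiskSpaceSampler and its methods are hypothetical names, java.util.logging stands in for the project's logger, and the ancestor walk-up is inlined rather than delegated to FileStoreUtils#getFileStore.

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.logging.Logger;

// Hypothetical stand-in for the disk part of SystemMetrics.
final class DiskSpaceSampler {
  private static final Logger LOG = Logger.getLogger(DiskSpaceSampler.class.getName());

  // Both fields are volatile and only ever replaced as a whole (copy-on-write),
  // so concurrent readers see either the old or the new immutable snapshot.
  private volatile List<String> diskDirs = Collections.emptyList();
  private volatile Set<FileStore> fileStores = Collections.emptySet();

  void setDiskDirs(List<String> dirs) {
    List<String> copy = Collections.unmodifiableList(new ArrayList<>(dirs));
    diskDirs = copy;
    fileStores = resolve(copy);
  }

  long getTotalSpace() {
    long sum = 0;
    boolean failed = false;
    // The for-each evaluates the volatile field once and iterates that snapshot.
    for (FileStore store : fileStores) {
      try {
        sum += store.getTotalSpace();
      } catch (IOException e) {
        failed = true;
        LOG.warning("Failed to query total space of " + store + ": " + e);
      }
    }
    if (failed) {
      // Re-resolve once; the next sampling then reads from fresh FileStores.
      fileStores = resolve(diskDirs);
    }
    return sum;
  }

  // Builds a new immutable set, walking up to the nearest existing ancestor
  // for any directory that has been removed in the meantime.
  private static Set<FileStore> resolve(List<String> dirs) {
    Set<FileStore> fresh = new LinkedHashSet<>();
    for (String dir : dirs) {
      Path p = Paths.get(dir).toAbsolutePath().normalize();
      while (p != null && !Files.exists(p)) {
        p = p.getParent();
      }
      if (p == null) {
        continue;
      }
      try {
        fresh.add(Files.getFileStore(p));
      } catch (IOException e) {
        LOG.warning("Failed to resolve FileStore for " + dir + ": " + e);
      }
    }
    return Collections.unmodifiableSet(fresh);
  }
}
```

In this sketch the sampling that hits the failure still returns a partial sum; only the following sampling benefits from the re-resolved set, which matches the "recovers on the next sampling" wording above.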
This PR has:
Key changed/added classes (or packages if there are too many classes) in this PR
- org.apache.iotdb.metrics.metricsets.system.SystemMetrics
- org.apache.iotdb.metrics.metricsets.system.SystemMetricsTest (new)