Recover disk-space metrics when a cached FileStore's directory is removed during region migration by CRZbulabula · Pull Request #17930 · apache/iotdb

CRZbulabula · 2026-06-12T08:23:22Z

Description

This is a follow-up to #17880 ("Fix empty snapshot loading and region cleanup"). It addresses the second problem reported in the same scenario: a cluster contains an empty DataRegion (auto-created by the ConfigNode after a scale-out, carrying 0 SeriesPartitionSlot), and that region is migrated during scale operations.

While #17880 fixed the empty-snapshot loading (SnapshotLoader) and the region-cleanup timeout (TableDiskUsageIndex / DataRegion), the affected DataNode kept flooding its log with ERROR entries like:

ERROR o.a.i.m.m.s.SystemMetrics:366 - Failed to statistic the size of /data (/dev/sdb1), because
java.nio.file.NoSuchFileException: /data/.../data/datanode/system
	at ...
	at org.apache.iotdb.metrics.metricsets.system.SystemMetrics.getSystemDiskAvailableSpace(SystemMetrics.java:364)
	at org.apache.iotdb.metrics.core.type.IoTDBAutoGauge.getValue(IoTDBAutoGauge.java:43)
	at ...DataNodeInternalRPCServiceImpl.sampleDiskLoad(...)
	at ...getDataNodeHeartBeat(...)

Root cause

SystemMetrics#setDiskDirs resolves each configured disk directory into a java.nio.file.FileStore once at startup and caches the resulting objects. A FileStore pins the exact path it was resolved from; on Linux every getTotalSpace() / getUnallocatedSpace() / getUsableSpace() call re-runs statvfs on that pinned path.

When that directory is removed while IoTDB is running (e.g. an empty data region directory is deleted during region migration), the pinned path no longer exists and every space query throws NoSuchFileException. Because disk metrics are sampled on every DataNode heartbeat (and on every Prometheus scrape), the failure was logged at ERROR on every sample and flooded the log, and the stale FileStore never recovered.
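The failure mode described above can be reproduced outside IoTDB with a few lines of plain java.nio. The following is a minimal sketch, not code from this PR: the class name StaleFileStoreDemo and its helper are hypothetical. It resolves a FileStore from a temporary directory, deletes the directory, and queries the store again. The outcome is platform-dependent; on Linux the second query is expected to fail as described in the root cause, while other platforms may keep answering.

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class, not part of IoTDB.
final class StaleFileStoreDemo {

  /** Outcome of querying a FileStore before and after its source directory is deleted. */
  static final class Result {
    final long usableBeforeDelete;
    final IOException failureAfterDelete; // null if the query still succeeded

    Result(long usableBeforeDelete, IOException failureAfterDelete) {
      this.usableBeforeDelete = usableBeforeDelete;
      this.failureAfterDelete = failureAfterDelete;
    }
  }

  static Result queryAfterDelete() throws IOException {
    Path dir = Files.createTempDirectory("stale-filestore-demo");
    // Resolve once and keep the object, like a cache filled at startup.
    FileStore store = Files.getFileStore(dir);
    long before = store.getUsableSpace();

    // Simulate the region directory being removed while the process runs.
    Files.delete(dir);

    IOException failure = null;
    try {
      store.getUsableSpace();
    } catch (IOException e) {
      // On Linux this is where NoSuchFileException is expected to surface.
      failure = e;
    }
    return new Result(before, failure);
  }

  public static void main(String[] args) throws IOException {
    Result r = queryAfterDelete();
    System.out.println("usable before delete: " + r.usableBeforeDelete);
    System.out.println("failure after delete: " + r.failureAfterDelete);
  }
}
```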

Fix

SystemMetrics now also stores the configured disk dirs.
When a space query against a cached FileStore fails, the FileStore set is re-resolved once via FileStoreUtils#getFileStore, which walks up to an existing ancestor directory on the same device. The metric then recovers on the next sampling instead of staying broken forever.
A failure that persists even after re-resolving (practically impossible, since the lookup ultimately falls back to an existing directory) is logged at WARN instead of ERROR, so it can no longer flood the log.
fileStores / diskDirs are made volatile and the re-resolution is done copy-on-write, since the getters are invoked concurrently from the heartbeat and Prometheus-reporter threads.
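The fix steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in SystemMetrics: the class SelfHealingDiskMetrics and the helper resolveExistingAncestorStore are hypothetical stand-ins (the latter for FileStoreUtils#getFileStore, whose exact signature is not shown in this PR description), the WARN log is replaced by a System.err line, and retrying immediately after re-resolving is a simplification chosen for this sketch.

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not the IoTDB SystemMetrics class.
final class SelfHealingDiskMetrics {
  // Both fields are only ever replaced wholesale (copy-on-write), never mutated,
  // so concurrent readers always see a complete, immutable snapshot.
  private volatile List<String> diskDirs = Collections.emptyList();
  private volatile Set<FileStore> fileStores = Collections.emptySet();

  void setDiskDirs(List<String> dirs) {
    List<String> copy = Collections.unmodifiableList(new ArrayList<>(dirs));
    diskDirs = copy;
    fileStores = resolve(copy);
  }

  /** Hypothetical helper: resolve the FileStore of the nearest existing ancestor. */
  static FileStore resolveExistingAncestorStore(String dir) throws IOException {
    Path p = Paths.get(dir).toAbsolutePath().normalize();
    while (p != null && !Files.exists(p)) {
      p = p.getParent(); // walk up until something exists
    }
    if (p == null) {
      throw new NoSuchFileException(dir);
    }
    return Files.getFileStore(p);
  }

  private static Set<FileStore> resolve(List<String> dirs) {
    Set<FileStore> fresh = new LinkedHashSet<>();
    for (String dir : dirs) {
      try {
        fresh.add(resolveExistingAncestorStore(dir));
      } catch (IOException e) {
        // Skip directories that cannot be resolved at all.
      }
    }
    return Collections.unmodifiableSet(fresh);
  }

  private static long sumUsable(Set<FileStore> stores) throws IOException {
    long total = 0L;
    for (FileStore store : stores) {
      total += store.getUsableSpace();
    }
    return total;
  }

  long getUsableSpace() {
    try {
      return sumUsable(fileStores);
    } catch (IOException first) {
      // A cached FileStore went stale: rebuild the whole set once and publish it.
      Set<FileStore> fresh = resolve(diskDirs);
      fileStores = fresh;
      try {
        return sumUsable(fresh);
      } catch (IOException second) {
        // Stand-in for a WARN-level log entry.
        System.err.println("WARN: disk space query still failing: " + second);
        return 0L;
      }
    }
  }
}
```

Publishing a fresh immutable set through a volatile field keeps the getters lock-free; two threads that hit the failure at the same time may both re-resolve, which is redundant but harmless in this sketch.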

Behavior

No behavioral change on the happy path: when all directories exist, the reported total/free/available disk space is identical to before.
When a backing directory disappears, the metric self-heals on the next sample (re-binding to a still-existing ancestor on the same device) rather than returning 0 and spamming ERROR logs.

PingCode: V2-974

This PR has:

- been self-reviewed.
  - concurrent read
  - concurrent write
- added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
- added unit tests or modified existing tests to cover new code paths.

Key changed/added classes (or packages if there are too many classes) in this PR

org.apache.iotdb.metrics.metricsets.system.SystemMetrics
org.apache.iotdb.metrics.metricsets.system.SystemMetricsTest (new)

…emoved

A cached FileStore pins the exact path it was resolved from. When that path is deleted while IoTDB is running (e.g. an empty data region directory removed during region migration), every disk-space query against the stale FileStore throws NoSuchFileException, which was logged at ERROR on every heartbeat and flooded the DataNode log.

Store the configured disk dirs and, when a space query fails, re-resolve the FileStores once via FileStoreUtils#getFileStore (which walks up to an existing ancestor on the same device) so the metric recovers on the next sampling. Remaining failures are logged at WARN instead of ERROR.

CRZbulabula · 2026-06-12T08:34:36Z

Superseded by #17931, which uses a branch on apache/iotdb directly instead of a fork.

CRZbulabula closed this Jun 12, 2026

CRZbulabula deleted the fix_v2_974_system_metrics_filestore branch June 12, 2026 08:34

CRZbulabula wants to merge 1 commit into apache:master from CRZbulabula:fix_v2_974_system_metrics_filestore
