
Recover disk-space metrics when a cached FileStore's directory is removed during region migration #17930

Closed
CRZbulabula wants to merge 1 commit into apache:master from CRZbulabula:fix_v2_974_system_metrics_filestore


@CRZbulabula

Contributor

Description

This is a follow-up to #17880 ("Fix empty snapshot loading and region cleanup"). It addresses the second problem reported in the same scenario: a cluster contains an empty DataRegion (auto-created by the ConfigNode after a scale-out, carrying 0 SeriesPartitionSlots), and that region is migrated during a scaling operation.

While #17880 fixed the empty-snapshot loading (SnapshotLoader) and the region-cleanup timeout (TableDiskUsageIndex / DataRegion), the affected DataNode kept flooding its log with ERROR entries like:

ERROR o.a.i.m.m.s.SystemMetrics:366 - Failed to statistic the size of /data (/dev/sdb1), because
java.nio.file.NoSuchFileException: /data/.../data/datanode/system
	at ...
	at org.apache.iotdb.metrics.metricsets.system.SystemMetrics.getSystemDiskAvailableSpace(SystemMetrics.java:364)
	at org.apache.iotdb.metrics.core.type.IoTDBAutoGauge.getValue(IoTDBAutoGauge.java:43)
	at ...DataNodeInternalRPCServiceImpl.sampleDiskLoad(...)
	at ...getDataNodeHeartBeat(...)

Root cause

SystemMetrics#setDiskDirs resolves each configured disk directory into a java.nio.file.FileStore once at startup and caches the resulting objects. A FileStore pins the exact path it was resolved from; on Linux every getTotalSpace() / getUnallocatedSpace() / getUsableSpace() call re-runs statvfs on that pinned path.

When that directory is removed while IoTDB is running (e.g. an empty data region directory is deleted during region migration), the pinned path no longer exists and every space query throws NoSuchFileException. Because disk metrics are sampled on every DataNode heartbeat (and on every Prometheus scrape), the failure was logged at ERROR on every sample. The stale FileStore was never replaced, so the metric never recovered and the log was flooded.
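The stale-handle behavior described above can be reproduced outside IoTDB with a few lines of plain java.nio. The sketch below is illustrative only: the class and method names are hypothetical, and whether the cached FileStore actually fails after the delete depends on the platform (the description above covers Linux), so the code reports the outcome instead of assuming it.

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;

/** Illustration only, not IoTDB code. All names here are hypothetical. */
class StaleFileStoreDemo {

  /** Returns the usable space in bytes, or -1 if the query throws. */
  static long usableSpaceOrMinusOne(FileStore store) {
    try {
      return store.getUsableSpace();
    } catch (IOException e) {
      // Per the description above, on Linux this is expected to be a
      // NoSuchFileException once the path the store was resolved from is gone.
      return -1L;
    }
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("stale-filestore-demo");
    FileStore cached = Files.getFileStore(dir); // resolved once, then kept
    System.out.println("before delete:              " + usableSpaceOrMinusOne(cached));

    Files.delete(dir); // simulate the region directory being removed

    // Platform-dependent: the cached handle may now fail (reported as -1).
    System.out.println("after delete (cached):      " + usableSpaceOrMinusOne(cached));

    // Resolving again from a directory that still exists avoids the stale path.
    FileStore fresh = Files.getFileStore(dir.getParent());
    System.out.println("after delete (re-resolved): " + usableSpaceOrMinusOne(fresh));
  }
}
```

Running `main` on a Linux machine should show whether the second query fails while the third succeeds. No output values are given here because they depend on the host.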

Fix

  • SystemMetrics now also stores the configured disk dirs.
  • When a space query against a cached FileStore fails, the FileStore set is re-resolved once via FileStoreUtils#getFileStore, which walks up to an existing ancestor directory on the same device. The metric then recovers on the next sampling instead of staying broken forever.
  • A failure that persists even after re-resolving (practically impossible, since the lookup ultimately falls back to an existing directory) is logged at WARN instead of ERROR, so it can no longer flood the log.
  • fileStores / diskDirs are made volatile and the re-resolution is done copy-on-write, since the getters are invoked concurrently from the heartbeat and Prometheus-reporter threads.
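The four points above can be sketched roughly as follows. This is not the code in this PR: `SelfHealingDiskGauge` and `nearestExistingAncestor` are hypothetical names, the ancestor walk is only a stand-in for what FileStoreUtils#getFileStore is described as doing, java.util.logging replaces the project's logger, and retrying immediately after the re-resolve is an assumption of this sketch.

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical sketch of the approach; not the actual SystemMetrics class. */
class SelfHealingDiskGauge {
  private static final Logger LOGGER =
      Logger.getLogger(SelfHealingDiskGauge.class.getName());

  // Both fields are only ever replaced wholesale (copy-on-write), never
  // mutated in place, so concurrent readers always see a consistent snapshot.
  private volatile List<Path> diskDirs = Collections.emptyList();
  private volatile Set<FileStore> fileStores = Collections.emptySet();

  void setDiskDirs(List<Path> dirs) {
    diskDirs = Collections.unmodifiableList(new ArrayList<>(dirs));
    fileStores = resolve(diskDirs);
  }

  /** Stand-in for the ancestor walk: climb until a path exists (or null). */
  static Path nearestExistingAncestor(Path dir) {
    Path p = dir.toAbsolutePath();
    while (p != null && !Files.exists(p)) {
      p = p.getParent();
    }
    return p;
  }

  private static Set<FileStore> resolve(List<Path> dirs) {
    Set<FileStore> fresh = new LinkedHashSet<>();
    for (Path dir : dirs) {
      Path existing = nearestExistingAncestor(dir);
      if (existing == null) {
        continue;
      }
      try {
        fresh.add(Files.getFileStore(existing));
      } catch (IOException e) {
        LOGGER.log(Level.WARNING, "Cannot resolve FileStore for " + dir, e);
      }
    }
    return Collections.unmodifiableSet(fresh);
  }

  private static long sumUsable(Set<FileStore> stores) throws IOException {
    long total = 0L;
    for (FileStore store : stores) {
      total += store.getUsableSpace();
    }
    return total;
  }

  long getAvailableSpace() {
    try {
      return sumUsable(fileStores);
    } catch (IOException first) {
      // A cached FileStore is stale: rebuild the whole set once and swap it in.
      Set<FileStore> fresh = resolve(diskDirs);
      fileStores = fresh;
      try {
        return sumUsable(fresh);
      } catch (IOException second) {
        // Still failing after re-resolving: log at WARN, not ERROR.
        LOGGER.log(Level.WARNING, "Disk space query failed after re-resolve", second);
        return 0L;
      }
    }
  }
}
```

Publishing a fresh immutable set through a volatile field keeps the getters lock-free. Two threads may occasionally both re-resolve, which is harmless because the result is the same.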

Behavior

  • No behavioral change on the happy path: when all directories exist, the reported total/free/available disk space is identical to before.
  • When a backing directory disappears, the metric self-heals on the next sample (re-binding to a still-existing ancestor on the same device) rather than returning 0 and spamming ERROR logs.

PingCode: V2-974


This PR has:

  • been self-reviewed.
    • concurrent read
    • concurrent write
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths.

Key changed/added classes (or packages if there are too many classes) in this PR
  • org.apache.iotdb.metrics.metricsets.system.SystemMetrics
  • org.apache.iotdb.metrics.metricsets.system.SystemMetricsTest (new)

…emoved

A cached FileStore pins the exact path it was resolved from. When that
path is deleted while IoTDB is running (e.g. an empty data region
directory removed during region migration), every disk-space query
against the stale FileStore throws NoSuchFileException, which was logged
at ERROR on every heartbeat and flooded the DataNode log.

Store the configured disk dirs and, when a space query fails, re-resolve
the FileStores once via FileStoreUtils#getFileStore (which walks up to an
existing ancestor on the same device) so the metric recovers on the next
sampling. Remaining failures are logged at WARN instead of ERROR.
@CRZbulabula

Contributor Author

Superseded by #17931, which uses a branch on apache/iotdb directly instead of a fork.

@CRZbulabula CRZbulabula deleted the fix_v2_974_system_metrics_filestore branch June 12, 2026 08:34
