diff --git a/src/current/v24.1/cluster-setup-troubleshooting.md b/src/current/v24.1/cluster-setup-troubleshooting.md
index 8286bae7c86..d2ffad2d8d8 100644
--- a/src/current/v24.1/cluster-setup-troubleshooting.md
+++ b/src/current/v24.1/cluster-setup-troubleshooting.md
@@ -431,6 +431,7 @@ Symptoms of disk stalls include:
 - Bad cluster write performance, usually in the form of a substantial drop in QPS for a given workload.
 - [Node liveness issues](#node-liveness-issues).
 - Writes on one node come to a halt. This can happen because in rare cases, a node may be able to perform liveness checks (which involve writing to disk) even though it cannot write other data to disk due to one or more slow/stalled calls to `fsync`. Because the node is passing its liveness checks, it is able to hang onto its leases even though it cannot make progress on the ranges for which it is the leaseholder. This wedged node has a ripple effect on the rest of the cluster such that all processing of the ranges whose leaseholders are on that node basically grinds to a halt. As mentioned above, CockroachDB's disk stall detection will attempt to shut down the node when it detects this state.
+- Messages like the following start appearing in the [`STORAGE` logging channel]({% link {{ page.version.version }}/logging.md %}#storage): `disk slowness detected: write to file {path/to/store/*.sst} has been ongoing for {duration}s`
 
 Causes of disk stalls include:
 
@@ -448,6 +449,12 @@ CockroachDB's built-in disk stall detection works as follows:
 - During [node liveness heartbeats](#node-liveness-issues), the [storage engine]({% link {{ page.version.version }}/architecture/storage-layer.md %}) writes to disk as part of the node liveness heartbeat process.
 
+Messages like the following in the [`STORAGE` logging channel]({% link {{ page.version.version }}/logging.md %}#storage) are an early sign of severe I/O slowness and usually mean that a fatal stall is imminent:
+
+- `disk slowness detected: write to file {path/to/store/*.sst} has been ongoing for {duration}s`
+
+Repeated occurrences of this message usually mean the node is effectively degraded: it will struggle to hold [range leases]({% link {{ page.version.version }}/architecture/overview.md %}#architecture-leaseholder) and serve requests, and its slowness can degrade performance across the entire cluster. Do not raise the stall detection thresholds to mask hardware issues. Instead, [drain and decommission the node]({% link {{ page.version.version }}/node-shutdown.md %}) and replace the [underlying storage]({% link {{ page.version.version }}/cockroach-start.md %}#storage). If you are considering tuning, refer to [`storage.max_sync_duration`]({% link {{ page.version.version }}/cluster-settings.md %}#setting-storage-max-sync-duration) (or the corresponding environment variable `COCKROACH_ENGINE_MAX_SYNC_DURATION_DEFAULT`), but note that increasing these values generally prolongs unavailability rather than fixing the underlying problem.
+
 #### Disk utilization is different across nodes in the cluster
 
 This is expected behavior.
 
diff --git a/src/current/v24.3/cluster-setup-troubleshooting.md b/src/current/v24.3/cluster-setup-troubleshooting.md
index 8286bae7c86..eef6ebcd543 100644
--- a/src/current/v24.3/cluster-setup-troubleshooting.md
+++ b/src/current/v24.3/cluster-setup-troubleshooting.md
@@ -431,6 +431,7 @@ Symptoms of disk stalls include:
 - Bad cluster write performance, usually in the form of a substantial drop in QPS for a given workload.
 - [Node liveness issues](#node-liveness-issues).
 - Writes on one node come to a halt. This can happen because in rare cases, a node may be able to perform liveness checks (which involve writing to disk) even though it cannot write other data to disk due to one or more slow/stalled calls to `fsync`. Because the node is passing its liveness checks, it is able to hang onto its leases even though it cannot make progress on the ranges for which it is the leaseholder. This wedged node has a ripple effect on the rest of the cluster such that all processing of the ranges whose leaseholders are on that node basically grinds to a halt. As mentioned above, CockroachDB's disk stall detection will attempt to shut down the node when it detects this state.
+- Messages like the following start appearing in the [`STORAGE` logging channel]({% link {{ page.version.version }}/logging.md %}#storage): `disk slowness detected: write to file {path/to/store/*.sst} has been ongoing for {duration}s`
 
 Causes of disk stalls include:
 
@@ -448,6 +449,12 @@ CockroachDB's built-in disk stall detection works as follows:
 - During [node liveness heartbeats](#node-liveness-issues), the [storage engine]({% link {{ page.version.version }}/architecture/storage-layer.md %}) writes to disk as part of the node liveness heartbeat process.
 
+Messages like the following in the [`STORAGE` logging channel]({% link {{ page.version.version }}/logging.md %}#storage) are an early sign of severe I/O slowness and usually mean that a fatal stall is imminent:
+
+- `disk slowness detected: write to file {path/to/store/*.sst} has been ongoing for {duration}s`
+
+Repeated occurrences of this message usually mean the node is effectively degraded: it will struggle to hold [range leases]({% link {{ page.version.version }}/architecture/overview.md %}#architecture-leaseholder) and serve requests, and its slowness can degrade performance across the entire cluster. Do not raise the stall detection thresholds to mask hardware issues. Instead, [drain and decommission the node]({% link {{ page.version.version }}/node-shutdown.md %}) and replace the [underlying storage]({% link {{ page.version.version }}/cockroach-start.md %}#storage). If you are considering tuning, refer to [`storage.max_sync_duration`]({% link {{ page.version.version }}/cluster-settings.md %}#setting-storage-max-sync-duration) (or the corresponding environment variable `COCKROACH_ENGINE_MAX_SYNC_DURATION_DEFAULT`), but note that increasing these values generally prolongs unavailability rather than fixing the underlying problem.
+
 #### Disk utilization is different across nodes in the cluster
 
 This is expected behavior.
 
diff --git a/src/current/v25.2/cluster-setup-troubleshooting.md b/src/current/v25.2/cluster-setup-troubleshooting.md
index 9df807f685a..971298022c0 100644
--- a/src/current/v25.2/cluster-setup-troubleshooting.md
+++ b/src/current/v25.2/cluster-setup-troubleshooting.md
@@ -415,6 +415,7 @@ Symptoms of disk stalls include:
 
 - Bad cluster write performance, usually in the form of a substantial drop in QPS for a given workload.
 - [Node liveness issues](#node-liveness-issues).
+- Messages like the following start appearing in the [`STORAGE` logging channel]({% link {{ page.version.version }}/logging.md %}#storage): `disk slowness detected: write to file {path/to/store/*.sst} has been ongoing for {duration}s`
 
 Causes of disk stalls include:
 
@@ -432,6 +433,12 @@ CockroachDB's built-in disk stall detection works as follows:
 - During [store liveness]({% link {{ page.version.version }}/architecture/replication-layer.md %}#leader-leases) heartbeats, the [storage engine]({% link {{ page.version.version }}/architecture/storage-layer.md %}) writes to disk.
 
+Messages like the following in the [`STORAGE` logging channel]({% link {{ page.version.version }}/logging.md %}#storage) are an early sign of severe I/O slowness and usually mean that a fatal stall is imminent:
+
+- `disk slowness detected: write to file {path/to/store/*.sst} has been ongoing for {duration}s`
+
+Repeated occurrences of this message usually mean the node is effectively degraded: it will struggle to hold [range leases]({% link {{ page.version.version }}/architecture/overview.md %}#architecture-leaseholder) and serve requests, and its slowness can degrade performance across the entire cluster. Do not raise the stall detection thresholds to mask hardware issues. Instead, [drain and decommission the node]({% link {{ page.version.version }}/node-shutdown.md %}) and replace the [underlying storage]({% link {{ page.version.version }}/cockroach-start.md %}#storage). If you are considering tuning, refer to [`storage.max_sync_duration`]({% link {{ page.version.version }}/cluster-settings.md %}#setting-storage-max-sync-duration) (or the corresponding environment variable `COCKROACH_ENGINE_MAX_SYNC_DURATION_DEFAULT`), but note that increasing these values generally prolongs unavailability rather than fixing the underlying problem.
+
 {% include_cached new-in.html version="v25.2" %} {% include {{ page.version.version }}/leader-leases-node-heartbeat-use-cases.md %}
 
 #### Disk utilization is different across nodes in the cluster
 
diff --git a/src/current/v25.3/cluster-setup-troubleshooting.md b/src/current/v25.3/cluster-setup-troubleshooting.md
index c6f774750cd..9972c334d98 100644
--- a/src/current/v25.3/cluster-setup-troubleshooting.md
+++ b/src/current/v25.3/cluster-setup-troubleshooting.md
@@ -415,6 +415,7 @@ Symptoms of disk stalls include:
 
 - Bad cluster write performance, usually in the form of a substantial drop in QPS for a given workload.
 - [Node liveness issues](#node-liveness-issues).
+- Messages like the following start appearing in the [`STORAGE` logging channel]({% link {{ page.version.version }}/logging.md %}#storage): `disk slowness detected: write to file {path/to/store/*.sst} has been ongoing for {duration}s`
 
 Causes of disk stalls include:
 
@@ -432,6 +433,12 @@ CockroachDB's built-in disk stall detection works as follows:
 - During [store liveness]({% link {{ page.version.version }}/architecture/replication-layer.md %}#leader-leases) heartbeats, the [storage engine]({% link {{ page.version.version }}/architecture/storage-layer.md %}) writes to disk.
 
+Messages like the following in the [`STORAGE` logging channel]({% link {{ page.version.version }}/logging.md %}#storage) are an early sign of severe I/O slowness and usually mean that a fatal stall is imminent:
+
+- `disk slowness detected: write to file {path/to/store/*.sst} has been ongoing for {duration}s`
+
+Repeated occurrences of this message usually mean the node is effectively degraded: it will struggle to hold [range leases]({% link {{ page.version.version }}/architecture/overview.md %}#architecture-leaseholder) and serve requests, and its slowness can degrade performance across the entire cluster. Do not raise the stall detection thresholds to mask hardware issues. Instead, [drain and decommission the node]({% link {{ page.version.version }}/node-shutdown.md %}) and replace the [underlying storage]({% link {{ page.version.version }}/cockroach-start.md %}#storage). If you are considering tuning, refer to [`storage.max_sync_duration`]({% link {{ page.version.version }}/cluster-settings.md %}#setting-storage-max-sync-duration) (or the corresponding environment variable `COCKROACH_ENGINE_MAX_SYNC_DURATION_DEFAULT`), but note that increasing these values generally prolongs unavailability rather than fixing the underlying problem.
+
 {% include {{ page.version.version }}/leader-leases-node-heartbeat-use-cases.md %}
 
 #### Disk utilization is different across nodes in the cluster
 
diff --git a/src/current/v25.4/cluster-setup-troubleshooting.md b/src/current/v25.4/cluster-setup-troubleshooting.md
index c6f774750cd..9972c334d98 100644
--- a/src/current/v25.4/cluster-setup-troubleshooting.md
+++ b/src/current/v25.4/cluster-setup-troubleshooting.md
@@ -415,6 +415,7 @@ Symptoms of disk stalls include:
 
 - Bad cluster write performance, usually in the form of a substantial drop in QPS for a given workload.
 - [Node liveness issues](#node-liveness-issues).
+- Messages like the following start appearing in the [`STORAGE` logging channel]({% link {{ page.version.version }}/logging.md %}#storage): `disk slowness detected: write to file {path/to/store/*.sst} has been ongoing for {duration}s`
 
 Causes of disk stalls include:
 
@@ -432,6 +433,12 @@ CockroachDB's built-in disk stall detection works as follows:
 - During [store liveness]({% link {{ page.version.version }}/architecture/replication-layer.md %}#leader-leases) heartbeats, the [storage engine]({% link {{ page.version.version }}/architecture/storage-layer.md %}) writes to disk.
 
+Messages like the following in the [`STORAGE` logging channel]({% link {{ page.version.version }}/logging.md %}#storage) are an early sign of severe I/O slowness and usually mean that a fatal stall is imminent:
+
+- `disk slowness detected: write to file {path/to/store/*.sst} has been ongoing for {duration}s`
+
+Repeated occurrences of this message usually mean the node is effectively degraded: it will struggle to hold [range leases]({% link {{ page.version.version }}/architecture/overview.md %}#architecture-leaseholder) and serve requests, and its slowness can degrade performance across the entire cluster. Do not raise the stall detection thresholds to mask hardware issues. Instead, [drain and decommission the node]({% link {{ page.version.version }}/node-shutdown.md %}) and replace the [underlying storage]({% link {{ page.version.version }}/cockroach-start.md %}#storage). If you are considering tuning, refer to [`storage.max_sync_duration`]({% link {{ page.version.version }}/cluster-settings.md %}#setting-storage-max-sync-duration) (or the corresponding environment variable `COCKROACH_ENGINE_MAX_SYNC_DURATION_DEFAULT`), but note that increasing these values generally prolongs unavailability rather than fixing the underlying problem.
+
 {% include {{ page.version.version }}/leader-leases-node-heartbeat-use-cases.md %}
 
 #### Disk utilization is different across nodes in the cluster
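
As a quick first check when these symptoms appear, you can search a node's log directory for the disk-slowness message, confirm the current stall threshold, and then drain and decommission the affected node. The following is a minimal sketch only: the log directory path, certificate directory, host address, and node ID are placeholders for your own deployment, and the exact log file that carries the `STORAGE` channel output depends on your logging configuration.

~~~ shell
# Placeholder paths and flags; substitute values from your own deployment.

# Look for disk-slowness events emitted on the STORAGE logging channel.
grep -R "disk slowness detected" /path/to/cockroach-data/logs/

# Check the current stall threshold. Raising it usually prolongs
# unavailability rather than fixing the underlying storage problem.
cockroach sql --certs-dir=certs --host=<address of any live node> \
  --execute="SHOW CLUSTER SETTING storage.max_sync_duration;"

# If the disk is failing, drain the affected node, then decommission it
# and replace the underlying storage.
cockroach node drain <affected node ID> --certs-dir=certs --host=<address of any live node>
cockroach node decommission <affected node ID> --certs-dir=certs --host=<address of any live node>
~~~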