From 9e5e077a5adb80d4e32333a748ab5fa5b50fc4d4 Mon Sep 17 00:00:00 2001
From: Alan Conway
Date: Wed, 3 Dec 2025 13:02:35 -0500
Subject: [PATCH] doc: Article on high volume log loss.

This guide explains how to handle scenarios where high-volume logging can
cause log loss in OpenShift clusters, and how to configure your cluster to
minimize this risk.
---
 docs/administration/README.adoc               |   3 +-
 docs/administration/high-volume-log-loss.adoc | 342 ++++++++++++++++++
 2 files changed, 344 insertions(+), 1 deletion(-)
 create mode 100644 docs/administration/high-volume-log-loss.adoc

diff --git a/docs/administration/README.adoc b/docs/administration/README.adoc
index f41c5fdc24..d1390cec5e 100644
--- a/docs/administration/README.adoc
+++ b/docs/administration/README.adoc
@@ -4,4 +4,5 @@
 * link:clusterlogforwarder.adoc[Log Collection and Forwarding]
 * Enabling event collection by link:deploy-event-router.md[Deploying the Event Router]
 * link:logfilemetricexporter.adoc[Collecting Container Log Metrics]
-* Example of a link:lokistack.adoc[complete Logging Solution] using LokiStack and UIPlugin
\ No newline at end of file
+* Example of a link:lokistack.adoc[complete Logging Solution] using LokiStack and UIPlugin
+* Configuring to avoid link:high-volume-log-loss.adoc[high volume log loss]
diff --git a/docs/administration/high-volume-log-loss.adoc b/docs/administration/high-volume-log-loss.adoc
new file mode 100644
index 0000000000..3a2ab4df5d
--- /dev/null
+++ b/docs/administration/high-volume-log-loss.adoc
@@ -0,0 +1,342 @@
= High volume log loss
:doctype: article
:toc: left
:stem:

This guide explains how high log volumes in OpenShift clusters can cause log loss,
and how to configure your cluster to minimize this risk.

[WARNING]
====
#If your data requires guaranteed delivery, *_do not send it as logs_*.#

Logs were never intended to provide guaranteed delivery or long-term storage.
Rotating disk files without any form of flow control is inherently unreliable.
Guaranteed delivery requires modifying your application to use a reliable, end-to-end messaging
protocol, for example Kafka, AMQP, or MQTT.

It is theoretically impossible to prevent log loss under all conditions.
You can, however, configure log storage to avoid loss under expected average and peak loads.
====

== Overview

=== Log loss

Container logs are written to `/var/log/pods`.
The forwarder reads and forwards logs as quickly as possible.
There are always some _unread logs_: logs that have been written but not yet read by the forwarder.

_Kubelet_ rotates log files and deletes old files periodically to enforce per-container limits.
Kubelet and the forwarder act independently.
There is no coordination or flow control that can ensure logs are forwarded before they are deleted.

_Log loss_ occurs when _unread logs_ are deleted by Kubelet _before_ being read by the forwarder.
footnote:[It is also possible to lose logs _after_ forwarding; that is not discussed here.]
Lost logs are gone from the file system and have not been forwarded, so they usually cannot be recovered.

=== Log rotation

Kubelet rotation parameters are:
[horizontal]
containerLogMaxSize:: Maximum size of a single log file (default 10MiB)
containerLogMaxFiles:: Maximum number of log files per container (default 5)

A container writes to one active log file.
When the active file reaches `containerLogMaxSize` the log files are rotated:

. the old active file becomes the most recent archive
. a new active file is created
. if there are more than `containerLogMaxFiles` files, the oldest is deleted.
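For illustration, you can list the active and rotated files for one container directly on a node.
This is a sketch: the `/var/log/pods/<namespace>_<pod-name>_<pod-uid>/<container-name>/` layout and the
timestamped names of rotated files are typical of current Kubelet versions, and the bracketed values are placeholders.

[source,console]
----
# List the active (0.log) and rotated log files for one container on a node
oc debug -q node/<node-name> -- chroot /host \
  ls -lh /var/log/pods/<namespace>_<pod-name>_<pod-uid>/<container-name>/
----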
=== Modes of operation

[horizontal]
writeRate:: Long-term average rate at which a container writes logs to `/var/log` (logs per second per container)
sendRate:: Long-term average rate at which a container's logs are forwarded to the store (logs per second per container)

During _normal operation_ sendRate keeps up with writeRate (on average).
The number of unread logs is small, and does not grow over time.

Logging is _overloaded_ when writeRate exceeds sendRate (on average) for some period of time.
This can be due to faster log writing, slower sending, or both.
During overload, unread logs accumulate.
If the overload lasts long enough, log rotation may delete unread logs, causing log loss.

After an overload, logging needs time to _recover_ and process the backlog of unread logs.
Until the backlog clears, the system is more vulnerable to log loss if there is another overload.

== Metrics for logging

Relevant metrics include:
[horizontal]
vector_*:: The `vector` process deployed by the log forwarder generates metrics for log collection, buffering, and forwarding.
log_logged_bytes_total:: The `LogFileMetricExporter` measures disk writes _before_ logs are read by the forwarder.
 To measure end-to-end log loss you must count data that has _not_ yet been read by the forwarder.
kube_*:: Metrics from the Kubernetes cluster.

[CAUTION]
====
Metrics named `_bytes_` count bytes; metrics named `_events_` count log records.

The forwarder adds metadata to logs before sending, so you cannot assume that a log
record written to `/var/log` is the same size in bytes as the record sent to the store.

Use event and byte metrics carefully in calculations to get correct results.
====

=== Log File Metric Exporter

The metric `log_logged_bytes_total` is the number of bytes written to each file in `/var/log/pods` by a container.
This is independent of whether the forwarder reads or forwards the data.
To generate this metric, create a `LogFileMetricExporter`:

[,yaml]
----
apiVersion: logging.openshift.io/v1alpha1
kind: LogFileMetricExporter
metadata:
  name: instance
  namespace: openshift-logging
----

== Limitations

Write rate metrics only cover container logs in `/var/log/pods`.
The following are excluded from these metrics:

* Node-level logs (journal, systemd, audit)
* API audit logs

This may cause discrepancies when comparing write and send rates.
The principles still apply, but account for this additional volume in capacity planning.

=== Using metrics to measure log activity

The PromQL queries below are averaged over an hour of cluster operation;
you may want to take longer samples for more stable results.

.*TotalWriteRateBytes* (bytes/sec, all containers)
----
sum(rate(log_logged_bytes_total[1h]))
----

.*TotalSendRateEvents* (events/sec, all containers)
----
sum(rate(vector_component_sent_events_total{component_kind="sink",component_type!="prometheus_exporter"}[1h]))
----

.*LogSizeBytes* (bytes): average size of a log record on the /var/log disk
----
sum(increase(vector_component_received_bytes_total{component_type="kubernetes_logs"}[1h])) /
sum(increase(vector_component_received_events_total{component_type="kubernetes_logs"}[1h]))
----

.*MaxContainerWriteRateBytes* (bytes/sec per container): the highest per-container write rate, which determines which containers are most at risk of log loss
----
max(rate(log_logged_bytes_total[1h]))
----

NOTE: The queries above are for container logs only.
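The queries above can be combined into a single overload indicator.
The query below is a sketch: it divides the container write rate by the effective send rate in bytes
(events per second multiplied by the average record size); a result that stays above 1 means unread logs are accumulating.

.*OverloadRatio* (dimensionless): write rate divided by effective send rate
----
sum(rate(log_logged_bytes_total[1h]))
/
(
  sum(rate(vector_component_sent_events_total{component_kind="sink",component_type!="prometheus_exporter"}[1h]))
  * (
      sum(increase(vector_component_received_bytes_total{component_type="kubernetes_logs"}[1h]))
      /
      sum(increase(vector_component_received_events_total{component_type="kubernetes_logs"}[1h]))
    )
)
----

As noted above, the numerator covers container logs only, so treat the result as an approximation.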
NOTE: Node and audit logs may also be forwarded (depending on your `ClusterLogForwarder` configuration),
which can cause discrepancies when comparing write and send rates.

== Recommendations

=== Estimate long-term load

Estimate your expected steady-state load, spike patterns, and tolerable outage duration.
The long-term average send rate *must* exceed the write rate (including spikes) to allow recovery after overloads:

----
TotalWriteRateBytes < TotalSendRateEvents × LogSizeBytes
----

=== Configure Kubelet rotation

Configure rotation parameters based on the _noisiest_ containers you want to protect,
those with the highest write rates (`MaxContainerWriteRateBytes`).

For an outage of length `MaxOutageTime`:

.Maximum per-container log storage
----
MaxContainerSizeBytes = MaxOutageTime × MaxContainerWriteRateBytes
----

.Kubelet configuration
----
containerLogMaxFiles = N
containerLogMaxSize = MaxContainerSizeBytes / N
----

NOTE: N should be a relatively small number of files (the default is 5).
Make the files as large as needed so that `N × containerLogMaxSize > MaxContainerSizeBytes`.

=== Estimate total disk requirements

Most containers write far less than `MaxContainerSizeBytes`.
Total disk space is based on cluster-wide average write rates, not on the noisiest containers.

.Minimum total disk space required
----
DiskTotalSize = MaxOutageTime × TotalWriteRateBytes × SafetyFactor
----

.Recovery time to clear the backlog from a maximum outage
----
RecoveryTime = (MaxOutageTime × TotalWriteRateBytes) / (TotalSendRateEvents × LogSizeBytes)
----

This is a lower bound: if logs continue to be written at `TotalWriteRateBytes` during recovery,
divide by the difference between the send rate in bytes and `TotalWriteRateBytes` instead.

[TIP]
.To check the size of the /var/log partition on each node
====
[source,console]
----
for NODE in $(oc get nodes -o name); do
  echo "# $NODE"
  oc debug -q $NODE -- chroot /host df -h /var/log
done
----
====

==== Example

The default Kubelet settings allow 50MB per container log:
----
containerLogMaxFiles: 5   # Max 5 files per container log
containerLogMaxSize: 10MB # Max 10 MB per file
----

Suppose we observe log loss during a 3-minute outage, where the forwarder is unable to forward any logs.
This implies the noisiest containers are writing at least 50MB of logs _each_ during the 3-minute outage:

----
MaxContainerWriteRateBytes ≥ 50MB / 180s ≈ 278KB/s
----

Now suppose we want to handle an outage of up to 1 hour without loss,
rounding the maximum per-container write rate up to 300KB/s:

----
MaxContainerSizeBytes = 300KB/s × 3600s ≈ 1GB

containerLogMaxFiles: 10
containerLogMaxSize: 100MB
----

For total disk space, suppose the cluster writes 2MB/s across all containers:

----
MaxOutageTime = 3600s
TotalWriteRateBytes = 2MB/s
SafetyFactor = 1.5

DiskTotalSize = 3600s × 2MB/s × 1.5 = 10.8GB ≈ 11GB
----

NOTE: `MaxContainerSizeBytes=1GB` applies only to the noisiest containers.
`DiskTotalSize≈11GB` is based on the cluster-wide average write rate.

=== Configure Kubelet log limits

Here is an example `KubeletConfig` resource (OpenShift 4.6+).
It provides `100MB × 10 files = 1GB` per container, matching the example above.

[,yaml]
----
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: increase-log-limits
spec:
  machineConfigPoolSelector:
    matchLabels:
      machineconfiguration.openshift.io/role: worker
  kubeletConfig:
    containerLogMaxSize: 100Mi
    containerLogMaxFiles: 10
----

On older versions of OpenShift that do not support these `KubeletConfig` fields,
you can set the same limits by modifying `MachineConfig` resources.
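Before applying a new configuration, it can help to cross-check which containers currently use the most log space on disk.
The following loop is a sketch: it assumes the standard `/var/log/pods` layout, and the sizes it reports are already
capped by the current rotation limits, so treat it as a rough complement to the `MaxContainerWriteRateBytes` query above.

[,bash]
----
# Show the ten largest per-container log directories on each node
for NODE in $(oc get nodes -o name); do
  echo "# $NODE"
  oc debug -q $NODE -- chroot /host \
    sh -c 'du -sh /var/log/pods/*/* 2>/dev/null | sort -h | tail -10'
done
----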
=== Apply and verify configuration

*To apply the KubeletConfig:*
[,bash]
----
# Apply the configuration (the KubeletConfig above, saved as kubelet-log-limits.yaml)
oc apply -f kubelet-log-limits.yaml

# Monitor the roll-out (this will cause node reboots)
oc get kubeletconfig
oc get mcp -w
----

*To verify the configuration is active:*
[,bash]
----
# Check that all nodes are updated
oc get nodes

# Verify the kubelet configuration on a node
oc debug node/<node-name>
chroot /host
grep -E "(containerLogMaxSize|containerLogMaxFiles)" /etc/kubernetes/kubelet.conf

# Check the sizes of the active log files for running containers
find /var/log/pods -name "*.log" -exec ls -lah {} \; | head -20
----

The configuration rollout typically takes 10-20 minutes as nodes are updated in a rolling fashion.

== Alternative (non-)solutions

This section discusses approaches that look like solutions at first glance but have significant problems.

=== Large forwarder buffers

The idea: instead of modifying rotation parameters, make the forwarder's internal buffers very large.

==== Duplication of logs

Forwarder buffers are stored on the same disk partition as `/var/log`.
When the forwarder reads logs, they remain in `/var/log` until rotation deletes them.
This means the forwarder buffer mostly duplicates data from the `/var/log` files,
which requires up to double the disk space for logs waiting to be forwarded.

==== Buffer design mismatch

Forwarder buffers are optimized for transmitting data efficiently, based on characteristics of the remote store.

- *Intended purpose:* Hold records that are ready to send or in flight, awaiting acknowledgement.
- *Typical time frame:* Seconds to minutes of buffering, to cover round-trip request/response times.
- *Not designed for:* Hours or days of log accumulation during extended outages.

==== Supporting other logging tools

Expanding `/var/log` benefits _any_ logging tool, including:

- `oc logs` for local debugging or troubleshooting log collection
- Standard Unix tools when debugging via `oc rsh`

Expanding forwarder buffers only benefits the forwarder and costs more in disk space.

If you deploy multiple forwarders, each additional forwarder needs its own buffer space.
If you expand `/var/log`, all forwarders share the same storage.

=== Persistent volume buffers

Since large forwarder buffers compete for disk space with `/var/log`,
what about storing forwarder buffers on a separate persistent volume?

This would still double the storage requirements (using a separate disk), but
the real problem is that a PV is not a local disk; it is typically a network service.
Using PVs for buffer storage introduces new network dependencies, along with reliability and performance issues.
The underlying buffer management code is optimized for local disk response times.

== Summary

1. *Monitor log patterns:* Use Prometheus metrics to measure log generation and send rates.
2. *Calculate storage requirements:* Account for peak periods, recovery time, and spikes.
3. *Increase Kubelet log rotation limits:* Allow more storage for noisy containers.
4. *Plan for peak scenarios:* Size storage to handle expected patterns without loss.

TIP: The OpenShift console *Observe > Dashboards* section includes helpful log-related dashboards.
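As a starting point for the monitoring recommended above, the write and send rates can be compared in an alert rule.
The following `PrometheusRule` is a minimal sketch, assuming the `LogFileMetricExporter` is deployed and that your
monitoring stack evaluates rules in the `openshift-logging` namespace; the rule name, alert name, query window, and
`for:` duration are arbitrary choices to adapt to your cluster.

[,yaml]
----
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: logging-overload           # hypothetical name
  namespace: openshift-logging     # adjust to where your Prometheus discovers rules
spec:
  groups:
  - name: logging-overload.rules
    rules:
    - alert: ContainerLogWriteRateExceedsSendRate   # hypothetical alert name
      expr: |
        # Container write rate (bytes/sec) exceeds the effective send rate (bytes/sec)
        sum(rate(log_logged_bytes_total[15m]))
        >
        sum(rate(vector_component_sent_events_total{component_kind="sink",component_type!="prometheus_exporter"}[15m]))
        * (
            sum(increase(vector_component_received_bytes_total{component_type="kubernetes_logs"}[15m]))
            /
            sum(increase(vector_component_received_events_total{component_type="kubernetes_logs"}[15m]))
          )
      for: 30m
      labels:
        severity: warning
      annotations:
        summary: Container logs are being written faster than they are forwarded; unread logs are accumulating.
----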