From 9e5e077a5adb80d4e32333a748ab5fa5b50fc4d4 Mon Sep 17 00:00:00 2001
From: Alan Conway
Date: Wed, 3 Dec 2025 13:02:35 -0500
Subject: [PATCH] doc: Article on high volume log loss.

This guide explains how to handle scenarios where high-volume logging can
cause log loss in OpenShift clusters, and how to configure your cluster to
minimize this risk.
---
 docs/administration/README.adoc               |   3 +-
 docs/administration/high-volume-log-loss.adoc | 342 ++++++++++++++++++
 2 files changed, 344 insertions(+), 1 deletion(-)
 create mode 100644 docs/administration/high-volume-log-loss.adoc

diff --git a/docs/administration/README.adoc b/docs/administration/README.adoc
index f41c5fdc24..d1390cec5e 100644
--- a/docs/administration/README.adoc
+++ b/docs/administration/README.adoc
@@ -4,4 +4,5 @@
 * link:clusterlogforwarder.adoc[Log Collection and Forwarding]
 * Enabling event collection by link:deploy-event-router.md[Deploying the Event Router]
 * link:logfilemetricexporter.adoc[Collecting Container Log Metrics]
-* Example of a link:lokistack.adoc[complete Logging Solution] using LokiStack and UIPlugin
\ No newline at end of file
+* Example of a link:lokistack.adoc[complete Logging Solution] using LokiStack and UIPlugin
+* Configuring to avoid link:high-volume-log-loss.adoc[high volume log loss]
diff --git a/docs/administration/high-volume-log-loss.adoc b/docs/administration/high-volume-log-loss.adoc
new file mode 100644
index 0000000000..3a2ab4df5d
--- /dev/null
+++ b/docs/administration/high-volume-log-loss.adoc
@@ -0,0 +1,342 @@
= High volume log loss
:doctype: article
:toc: left
:stem:

This guide explains how high log volumes in OpenShift clusters can cause log loss,
and how to configure your cluster to minimize this risk.

[WARNING]
====
#If your data requires guaranteed delivery, *_do not send it as logs_*.#

Logs were never intended to provide guaranteed delivery or long-term storage.
Rotating disk files without any form of flow control is inherently unreliable.
Guaranteed delivery requires modifying your application to use a reliable, end-to-end messaging
protocol, for example Kafka, AMQP, or MQTT.

It is theoretically impossible to prevent log loss under all conditions.
You can, however, configure log storage to avoid loss under expected average and peak loads.
====

== Overview

=== Log loss

Container logs are written to `/var/log/pods`.
The forwarder reads and forwards logs as quickly as possible.
There are always some _unread logs_: logs that have been written but not yet read by the forwarder.

_Kubelet_ rotates log files and deletes old files periodically to enforce per-container limits.
Kubelet and the forwarder act independently.
There is no coordination or flow control that can ensure logs are forwarded before they are deleted.

_Log loss_ occurs when _unread logs_ are deleted by Kubelet _before_ being read by the forwarder.
footnote:[It is also possible to lose logs _after_ forwarding; that is not discussed here.]
Lost logs are gone from the file system and have not been forwarded, so they usually cannot be recovered.

=== Log rotation

Kubelet rotation parameters are:
[horizontal]
containerLogMaxSize:: Maximum size of a single log file (default 10MiB)
containerLogMaxFiles:: Maximum number of log files per container (default 5)

A container writes to one active log file.
When the active file reaches `containerLogMaxSize` the log files are rotated:

. the old active file becomes the most recent archive
. a new active file is created
. if there are more than `containerLogMaxFiles` files, the oldest is deleted.
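For illustration, you can list the active and rotated files for one container directly on a node.
This is a sketch: the `/var/log/pods/<namespace>_<pod-name>_<pod-uid>/<container-name>/` layout and the
timestamped names of rotated files are typical of current Kubelet versions, and the bracketed values are placeholders.

[source,console]
----
# List the active (0.log) and rotated log files for one container on a node
oc debug -q node/<node-name> -- chroot /host \
  ls -lh /var/log/pods/<namespace>_<pod-name>_<pod-uid>/<container-name>/
----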
=== Modes of operation

[horizontal]
writeRate:: Long-term average rate at which a container writes logs to `/var/log` (logs per second per container)
sendRate:: Long-term average rate at which a container's logs are forwarded to the store (logs per second per container)

During _normal operation_ sendRate keeps up with writeRate (on average).
The number of unread logs is small, and does not grow over time.

Logging is _overloaded_ when writeRate exceeds sendRate (on average) for some period of time.
This can be due to faster log writing, slower sending, or both.
During overload, unread logs accumulate.
If the overload lasts long enough, log rotation may delete unread logs, causing log loss.

After an overload, logging needs time to _recover_ and process the backlog of unread logs.
Until the backlog clears, the system is more vulnerable to log loss if there is another overload.

== Metrics for logging

Relevant metrics include:
[horizontal]
vector_*:: The `vector` process deployed by the log forwarder generates metrics for log collection, buffering, and forwarding.
log_logged_bytes_total:: The `LogFileMetricExporter` measures disk writes _before_ logs are read by the forwarder.
 To measure end-to-end log loss you must count data that has _not_ yet been read by the forwarder.
kube_*:: Metrics from the Kubernetes cluster.

[CAUTION]
====
Metrics named `_bytes_` count bytes; metrics named `_events_` count log records.

The forwarder adds metadata to logs before sending, so you cannot assume that a log
record written to `/var/log` is the same size in bytes as the record sent to the store.

Use event and byte metrics carefully in calculations to get correct results.
====

=== Log File Metric Exporter

The metric `log_logged_bytes_total` is the number of bytes written to each file in `/var/log/pods` by a container.
This is independent of whether the forwarder reads or forwards the data.
To generate this metric, create a `LogFileMetricExporter`:

[,yaml]
----
apiVersion: logging.openshift.io/v1alpha1
kind: LogFileMetricExporter
metadata:
  name: instance
  namespace: openshift-logging
----

== Limitations

Write rate metrics only cover container logs in `/var/log/pods`.
The following are excluded from these metrics:

* Node-level logs (journal, systemd, audit)
* API audit logs

This may cause discrepancies when comparing write and send rates.
The principles still apply, but account for this additional volume in capacity planning.

=== Using metrics to measure log activity

The PromQL queries below are averaged over an hour of cluster operation;
you may want to take longer samples for more stable results.

.*TotalWriteRateBytes* (bytes/sec, all containers)
----
sum(rate(log_logged_bytes_total[1h]))
----

.*TotalSendRateEvents* (events/sec, all containers)
----
sum(rate(vector_component_sent_events_total{component_kind="sink",component_type!="prometheus_exporter"}[1h]))
----

.*LogSizeBytes* (bytes): average size of a log record on the /var/log disk
----
sum(increase(vector_component_received_bytes_total{component_type="kubernetes_logs"}[1h])) /
sum(increase(vector_component_received_events_total{component_type="kubernetes_logs"}[1h]))
----

.*MaxContainerWriteRateBytes* (bytes/sec per container): the highest per-container write rate, which determines which containers are most at risk of log loss
----
max(rate(log_logged_bytes_total[1h]))
----

NOTE: The queries above are for container logs only.
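The queries above can be combined into a single overload indicator.
The query below is a sketch: it divides the container write rate by the effective send rate in bytes
(events per second multiplied by the average record size); a result that stays above 1 means unread logs are accumulating.

.*OverloadRatio* (dimensionless): write rate divided by effective send rate
----
sum(rate(log_logged_bytes_total[1h]))
/
(
  sum(rate(vector_component_sent_events_total{component_kind="sink",component_type!="prometheus_exporter"}[1h]))
  * (
      sum(increase(vector_component_received_bytes_total{component_type="kubernetes_logs"}[1h]))
      /
      sum(increase(vector_component_received_events_total{component_type="kubernetes_logs"}[1h]))
    )
)
----

As noted above, the numerator covers container logs only, so treat the result as an approximation.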
NOTE: Node and audit logs may also be forwarded (depending on your `ClusterLogForwarder` configuration),
which can cause discrepancies when comparing write and send rates.

== Recommendations

=== Estimate long-term load

Estimate your expected steady-state load, spike patterns, and tolerable outage duration.
The long-term average send rate *must* exceed the write rate (including spikes) to allow recovery after overloads:

----
TotalWriteRateBytes < TotalSendRateEvents × LogSizeBytes
----

=== Configure Kubelet rotation

Configure rotation parameters based on the _noisiest_ containers you want to protect,
those with the highest write rates (`MaxContainerWriteRateBytes`).

For an outage of length `MaxOutageTime`:

.Maximum per-container log storage
----
MaxContainerSizeBytes = MaxOutageTime × MaxContainerWriteRateBytes
----

.Kubelet configuration
----
containerLogMaxFiles = N
containerLogMaxSize = MaxContainerSizeBytes / N
----

NOTE: N should be a relatively small number of files (the default is 5).
Make the files as large as needed so that `N × containerLogMaxSize > MaxContainerSizeBytes`.

=== Estimate total disk requirements

Most containers write far less than `MaxContainerSizeBytes`.
Total disk space is based on cluster-wide average write rates, not on the noisiest containers.

.Minimum total disk space required
----
DiskTotalSize = MaxOutageTime × TotalWriteRateBytes × SafetyFactor
----

.Recovery time to clear the backlog from a maximum outage
----
RecoveryTime = (MaxOutageTime × TotalWriteRateBytes) / (TotalSendRateEvents × LogSizeBytes)
----

This is a lower bound: if logs continue to be written at `TotalWriteRateBytes` during recovery,
divide by the difference between the send rate in bytes and `TotalWriteRateBytes` instead.

[TIP]
.To check the size of the /var/log partition on each node
====
[source,console]
----
for NODE in $(oc get nodes -o name); do
  echo "# $NODE"
  oc debug -q $NODE -- chroot /host df -h /var/log
done
----
====

==== Example

The default Kubelet settings allow 50MB per container log:
----
containerLogMaxFiles: 5   # Max 5 files per container log
containerLogMaxSize: 10MB # Max 10 MB per file
----

Suppose we observe log loss during a 3-minute outage, where the forwarder is unable to forward any logs.
This implies the noisiest containers are writing at least 50MB of logs _each_ during the 3-minute outage:

----
MaxContainerWriteRateBytes ≥ 50MB / 180s ≈ 278KB/s
----

Now suppose we want to handle an outage of up to 1 hour without loss,
rounding the maximum per-container write rate up to 300KB/s:

----
MaxContainerSizeBytes = 300KB/s × 3600s ≈ 1GB

containerLogMaxFiles: 10
containerLogMaxSize: 100MB
----

For total disk space, suppose the cluster writes 2MB/s across all containers:

----
MaxOutageTime = 3600s
TotalWriteRateBytes = 2MB/s
SafetyFactor = 1.5

DiskTotalSize = 3600s × 2MB/s × 1.5 = 10.8GB ≈ 11GB
----

NOTE: `MaxContainerSizeBytes=1GB` applies only to the noisiest containers.
`DiskTotalSize≈11GB` is based on the cluster-wide average write rate.

=== Configure Kubelet log limits

Here is an example `KubeletConfig` resource (OpenShift 4.6+).
It provides `100MB × 10 files = 1GB` per container, matching the example above.

[,yaml]
----
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: increase-log-limits
spec:
  machineConfigPoolSelector:
    matchLabels:
      machineconfiguration.openshift.io/role: worker
  kubeletConfig:
    containerLogMaxSize: 100Mi
    containerLogMaxFiles: 10
----

On older versions of OpenShift that do not support these `KubeletConfig` fields,
you can set the same limits by modifying `MachineConfig` resources.
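Before applying a new configuration, it can help to cross-check which containers currently use the most log space on disk.
The following loop is a sketch: it assumes the standard `/var/log/pods` layout, and the sizes it reports are already
capped by the current rotation limits, so treat it as a rough complement to the `MaxContainerWriteRateBytes` query above.

[,bash]
----
# Show the ten largest per-container log directories on each node
for NODE in $(oc get nodes -o name); do
  echo "# $NODE"
  oc debug -q $NODE -- chroot /host \
    sh -c 'du -sh /var/log/pods/*/* 2>/dev/null | sort -h | tail -10'
done
----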
=== Apply and verify configuration

*To apply the KubeletConfig:*
[,bash]
----
# Apply the configuration (the KubeletConfig above, saved as kubelet-log-limits.yaml)
oc apply -f kubelet-log-limits.yaml

# Monitor the roll-out (this will cause node reboots)
oc get kubeletconfig
oc get mcp -w
----

*To verify the configuration is active:*
[,bash]
----
# Check that all nodes are updated
oc get nodes

# Verify the kubelet configuration on a node
oc debug node/<node-name>
chroot /host
grep -E "(containerLogMaxSize|containerLogMaxFiles)" /etc/kubernetes/kubelet.conf

# Check the sizes of the active log files for running containers
find /var/log/pods -name "*.log" -exec ls -lah {} \; | head -20
----

The configuration rollout typically takes 10-20 minutes as nodes are updated in a rolling fashion.

== Alternative (non-)solutions

This section discusses approaches that look like solutions at first glance but have significant problems.

=== Large forwarder buffers

The idea: instead of modifying rotation parameters, make the forwarder's internal buffers very large.

==== Duplication of logs

Forwarder buffers are stored on the same disk partition as `/var/log`.
When the forwarder reads logs, they remain in `/var/log` until rotation deletes them.
This means the forwarder buffer mostly duplicates data from the `/var/log` files,
which requires up to double the disk space for logs waiting to be forwarded.

==== Buffer design mismatch

Forwarder buffers are optimized for transmitting data efficiently, based on characteristics of the remote store.

- *Intended purpose:* Hold records that are ready to send or in flight, awaiting acknowledgement.
- *Typical time frame:* Seconds to minutes of buffering, to cover round-trip request/response times.
- *Not designed for:* Hours or days of log accumulation during extended outages.

==== Supporting other logging tools

Expanding `/var/log` benefits _any_ logging tool, including:

- `oc logs` for local debugging or troubleshooting log collection
- Standard Unix tools when debugging via `oc rsh`

Expanding forwarder buffers only benefits the forwarder and costs more in disk space.

If you deploy multiple forwarders, each additional forwarder needs its own buffer space.
If you expand `/var/log`, all forwarders share the same storage.

=== Persistent volume buffers

Since large forwarder buffers compete for disk space with `/var/log`,
what about storing forwarder buffers on a separate persistent volume?

This would still double the storage requirements (using a separate disk), but
the real problem is that a PV is not a local disk; it is typically a network service.
Using PVs for buffer storage introduces new network dependencies, along with reliability and performance issues.
The underlying buffer management code is optimized for local disk response times.

== Summary

1. *Monitor log patterns:* Use Prometheus metrics to measure log generation and send rates.
2. *Calculate storage requirements:* Account for peak periods, recovery time, and spikes.
3. *Increase Kubelet log rotation limits:* Allow more storage for noisy containers.
4. *Plan for peak scenarios:* Size storage to handle expected patterns without loss.

TIP: The OpenShift console *Observe > Dashboards* section includes helpful log-related dashboards.
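As a starting point for the monitoring recommended above, the write and send rates can be compared in an alert rule.
The following `PrometheusRule` is a minimal sketch, assuming the `LogFileMetricExporter` is deployed and that your
monitoring stack evaluates rules in the `openshift-logging` namespace; the rule name, alert name, query window, and
`for:` duration are arbitrary choices to adapt to your cluster.

[,yaml]
----
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: logging-overload           # hypothetical name
  namespace: openshift-logging     # adjust to where your Prometheus discovers rules
spec:
  groups:
  - name: logging-overload.rules
    rules:
    - alert: ContainerLogWriteRateExceedsSendRate   # hypothetical alert name
      expr: |
        # Container write rate (bytes/sec) exceeds the effective send rate (bytes/sec)
        sum(rate(log_logged_bytes_total[15m]))
        >
        sum(rate(vector_component_sent_events_total{component_kind="sink",component_type!="prometheus_exporter"}[15m]))
        * (
            sum(increase(vector_component_received_bytes_total{component_type="kubernetes_logs"}[15m]))
            /
            sum(increase(vector_component_received_events_total{component_type="kubernetes_logs"}[15m]))
          )
      for: 30m
      labels:
        severity: warning
      annotations:
        summary: Container logs are being written faster than they are forwarded; unread logs are accumulating.
----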