fix(operator): enable etcd auto-compaction + raise backend quota #151
Open
kamir wants to merge 2 commits into
The operator-managed etcd takes one revision per offset update (one per produce). Without compaction the MVCC store reaches the default 2 GiB quota under sustained load and the broker then fails produce with 'mvcc: database space exceeded'. Add periodic auto-compaction (5m retention) and raise the backend quota to 4 GiB so bursts do not trip the quota between compactions.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… mode
Build on the periodic auto-compaction + raised backend quota change. Compaction reclaims revisions logically; it does not shrink the physical bbolt file, so this reduces NOSPACE frequency rather than eliminating it. Physical reclaim (defrag) stays a tracked follow-up.
- Expose mode, retention, and quota as env-configurable knobs following the existing KAFSCALE_OPERATOR_ETCD_* pattern, with periodic / 5m / 4 GiB as safe defaults. Each parser validates and falls back to the default on empty or garbage input, so an override can never disable the quota or pick an invalid mode.
- Memory-mode guard: when the data dir is a tmpfs emptyDir, set a memory request (= quota) and limit (= quota + 512 MiB headroom) on the etcd container so a large quota cannot drive unbounded node RAM and OOM the node. Disk-backed (PVC) mode is unchanged.
- Table-driven test for etcdArgs() asserting the three flags emit the defaults and honour env overrides, plus tests for the memory-mode limit and the absence of container resources in disk mode.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Problem
The operator-managed etcd writes roughly one MVCC revision per offset update (one per produce). With no compaction, the store grows under sustained load until it reaches etcd's default 2 GiB backend quota, and the broker then fails produce with `mvcc: database space exceeded`.
What this PR does (mitigation, not a full fix)
Enable periodic auto-compaction and raise the backend quota so NOSPACE happens far less often:
- `--auto-compaction-mode=periodic`
- `--auto-compaction-retention=5m`
- `--quota-backend-bytes=4294967296` (4 GiB)

Be explicit about the limit: compaction is not defrag. Auto-compaction frees old revisions logically, so bbolt pages become reusable, but it does not shrink the physical db file, and `--quota-backend-bytes` is enforced against the physical size. Only `etcdctl defrag` reclaims physical space, and there is no defrag in the repo yet. So this change reduces NOSPACE frequency and buys headroom; it does not close the loop. Physical reclaim (defrag) and NOSPACE-alarm detect/disarm are a tracked follow-up (see below).
Configurable, with safe defaults
The three values follow the existing `KAFSCALE_OPERATOR_ETCD_*` operator-env pattern so high-write clusters can tune without a code change. Defaults are the values above, so the behaviour is opt-out:

| Env var | Default | Notes |
| --- | --- | --- |
| `KAFSCALE_OPERATOR_ETCD_AUTO_COMPACTION_MODE` | `periodic` | `periodic` or `revision`; anything else falls back to the default |
| `KAFSCALE_OPERATOR_ETCD_AUTO_COMPACTION_RETENTION` | `5m` | … `periodic`, count for `revision`) |
| `KAFSCALE_OPERATOR_ETCD_QUOTA_BACKEND_BYTES` | `4294967296` (4 GiB) | |

Each parser validates and falls back to the default on empty or garbage input, so an override can never accidentally disable the quota or select an invalid mode.
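The validate-and-fall-back behaviour described above could look roughly like the following. This is a minimal sketch, not the operator's actual code: `resolveCompaction`, `parseQuota`, and `compactionArgs` are hypothetical names, and the choice to fall back to the whole `periodic` / `5m` pair when the retention does not fit the selected mode is an assumption of this sketch that the PR text does not spell out.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// Env names and defaults are taken from the PR description.
const (
	envMode      = "KAFSCALE_OPERATOR_ETCD_AUTO_COMPACTION_MODE"
	envRetention = "KAFSCALE_OPERATOR_ETCD_AUTO_COMPACTION_RETENTION"
	envQuota     = "KAFSCALE_OPERATOR_ETCD_QUOTA_BACKEND_BYTES"

	defaultMode      = "periodic"
	defaultRetention = "5m"
	defaultQuota     = int64(4294967296) // 4 GiB
)

// resolveCompaction (hypothetical helper) returns a mode/retention pair that
// is always mutually valid. Assumption: if either value is unusable, the
// whole pair falls back to the defaults.
func resolveCompaction(rawMode, rawRetention string) (string, string) {
	mode := strings.TrimSpace(rawMode)
	retention := strings.TrimSpace(rawRetention)
	switch mode {
	case "periodic":
		// Periodic retention is a duration such as "5m".
		if d, err := time.ParseDuration(retention); err == nil && d > 0 {
			return mode, retention
		}
	case "revision":
		// Revision retention is a positive revision count.
		if n, err := strconv.ParseInt(retention, 10, 64); err == nil && n > 0 {
			return mode, retention
		}
	}
	return defaultMode, defaultRetention
}

// parseQuota (hypothetical helper) accepts only a positive byte count, so an
// empty, zero, negative, or non-numeric override cannot weaken the quota.
func parseQuota(raw string) int64 {
	n, err := strconv.ParseInt(strings.TrimSpace(raw), 10, 64)
	if err != nil || n <= 0 {
		return defaultQuota
	}
	return n
}

// compactionArgs builds the three etcd flags from an env lookup function.
func compactionArgs(getenv func(string) string) []string {
	rawMode, rawRetention := getenv(envMode), getenv(envRetention)
	if strings.TrimSpace(rawMode) == "" {
		rawMode = defaultMode
	}
	if strings.TrimSpace(rawRetention) == "" && rawMode == defaultMode {
		rawRetention = defaultRetention
	}
	mode, retention := resolveCompaction(rawMode, rawRetention)
	return []string{
		"--auto-compaction-mode=" + mode,
		"--auto-compaction-retention=" + retention,
		"--quota-backend-bytes=" + strconv.FormatInt(parseQuota(getenv(envQuota)), 10),
	}
}

func main() {
	// A bogus mode falls back to the defaults; a valid quota override is kept.
	env := map[string]string{envMode: "bogus", envQuota: "8589934592"}
	fmt.Println(compactionArgs(func(k string) string { return env[k] }))
}
```

Passing the lookup as a function rather than calling `os.Getenv` directly keeps the sketch easy to drive from a table-driven test.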
Memory-mode guard
In
KAFSCALE_OPERATOR_ETCD_STORAGE_MEMORY=truemode the data dir is a tmpfsemptyDir{medium:Memory}with no size cap of its own, so a 4 GiB quota could drive about 4 GiB of node RAM and risk an OOM that takes the node down. The tmpfs allocation counts against the etcd container's memory cgroup, so this PR sets, only in memory mode, a memory request equal to the quota and a memory limit of quota + 512 MiB headroom (etcd heap and bbolt mmap pages). The kernel then reclaims or kills inside the container before the node is starved, and the scheduler reserves real capacity. Disk-backed (PVC) mode is unchanged and sets no container resources.Why 5m and 4 GiB (reasoning, not measured)
The lab is currently offline, so these are back-of-envelope figures, not a live capture. etcd writes about one revision per offset commit. Counting the offset key, its value, and bbolt page plus index overhead, a conservative figure is on the order of 1 KiB of backend growth per revision. At a sustained 5,000 produce/sec that is about 5 MiB/s of pre-compaction growth.
These numbers scale linearly with write rate; that is exactly why the three values are env-configurable. A live before/after capture is pending the lab coming back online.
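To make the linear scaling concrete, the back-of-envelope arithmetic can be written out as below. This is an illustrative sketch only: the 1 KiB-per-revision figure and the 5,000 produce/sec rate come from the text above, the other two rates are arbitrary examples, and `growthPerWindow` is a hypothetical helper. It estimates bytes written during one 5m retention window, not the physical db size a live cluster would show.

```go
package main

import "fmt"

const (
	kib = 1024.0
	mib = 1024.0 * kib
	gib = 1024.0 * mib
)

// growthPerWindow estimates the backend bytes written during one retention
// window: one revision per produce, bytesPerRevision bytes per revision.
func growthPerWindow(producePerSec, bytesPerRevision, windowSeconds float64) float64 {
	return producePerSec * bytesPerRevision * windowSeconds
}

func main() {
	const bytesPerRevision = 1 * kib // back-of-envelope figure, not measured
	const windowSeconds = 300.0      // 5m retention

	// 5000 is the rate quoted in the text; 1000 and 10000 are arbitrary examples.
	for _, rate := range []float64{1000, 5000, 10000} {
		perSec := rate * bytesPerRevision
		perWindow := growthPerWindow(rate, bytesPerRevision, windowSeconds)
		fmt.Printf("%6.0f produce/s: %.2f MiB/s, %.2f GiB per 5m window\n",
			rate, perSec/mib, perWindow/gib)
	}
}
```

Under these assumptions, 5,000 produce/sec works out to roughly 1.4 GiB of writes per 5m window, which is a large share of the 2 GiB default and leaves room under 4 GiB. Doubling the write rate doubles that figure, which is the case the env overrides are meant for.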
Reproducible load recipe
To reproduce and validate against a real write rate:
1. Bring up a managed-etcd KafScale cluster (operator default path, no external etcd endpoints).
2. Drive sustained produce, for example with `kafka-producer-perf-test`.
3. Watch the backend size and revision counters on each etcd member.
The metric to graph is `etcd_db_total_size_in_bytes` (physical backend size). Without this change it climbs monotonically to the 2 GiB wall; with it, physical size still grows between compactions (no defrag), but logical space is reclaimed every 5m and the 4 GiB quota gives headroom. Also watch `etcd_debug_mvcc_db_total_size_in_use_in_bytes` (logical in-use) to see compaction working: it should sawtooth, not climb.
Follow-up (separate PR/issue)
The durability work that closes the loop is intentionally out of scope here; splitting the claim keeps this change focused and reviewable:
- … `NOSPACE` alarm and the keyspace stays read-only even after compaction; an operator that detects the alarm, runs defrag, then disarms it is the missing piece.
Test
`go build ./...` and `go test ./pkg/operator/...` are green; `gofmt` and `go vet` are clean. A new table-driven test asserts that `etcdArgs()` emits the three flags with the default values and that the env overrides change them, plus tests covering the memory-mode container memory limit and the absence of container resources in disk mode.
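As an illustration of what the memory-mode assertions check, the sizing rule from the Memory-mode guard section (request = quota, limit = quota + 512 MiB, nothing set in disk mode) can be sketched as a small table-driven check. `memoryResources` is a hypothetical stand-in that returns plain byte counts; the real operator code presumably sets Kubernetes resource quantities on the etcd container, which this sketch does not model.

```go
package main

import "fmt"

// headroomBytes is the 512 MiB allowance for etcd heap and bbolt mmap pages.
const headroomBytes = int64(512) << 20

// memoryResources (hypothetical helper) returns the memory request and limit
// in bytes. set is false in disk-backed mode, meaning no resources are applied.
func memoryResources(memoryMode bool, quotaBytes int64) (request, limit int64, set bool) {
	if !memoryMode {
		return 0, 0, false
	}
	return quotaBytes, quotaBytes + headroomBytes, true
}

func main() {
	cases := []struct {
		name       string
		memoryMode bool
		quota      int64
		wantReq    int64
		wantLim    int64
		wantSet    bool
	}{
		{"memory mode, default 4 GiB quota", true, 4294967296, 4294967296, 4831838208, true},
		{"disk mode sets nothing", false, 4294967296, 0, 0, false},
	}
	for _, c := range cases {
		req, lim, set := memoryResources(c.memoryMode, c.quota)
		ok := req == c.wantReq && lim == c.wantLim && set == c.wantSet
		fmt.Printf("%s: request=%d limit=%d set=%v pass=%v\n", c.name, req, lim, set, ok)
	}
}
```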