fix(operator): enable etcd auto-compaction + raise backend quota by kamir · Pull Request #151 · KafScale/platform

kamir · 2026-06-04T17:36:08Z

Problem

The operator-managed etcd writes roughly one MVCC revision per offset update (one per produce). With no compaction the store grows until it reaches etcd's default 2 GiB backend quota under sustained load, and the broker then fails produce with mvcc: database space exceeded.

What this PR does (mitigation, not a full fix)

Enable periodic auto-compaction and raise the backend quota so NOSPACE happens far less often:

--auto-compaction-mode=periodic
--auto-compaction-retention=5m
--quota-backend-bytes=4294967296 (4 GiB)

Be explicit about the limit: compaction is not defrag. Auto-compaction frees old revisions logically, so bbolt pages become reusable, but it does not shrink the physical db file, and --quota-backend-bytes is enforced against the physical size. Only etcdctl defrag reclaims physical space, and there is no defrag in the repo yet. So this change reduces NOSPACE frequency and buys headroom; it does not close the loop. Physical reclaim (defrag) and NOSPACE-alarm detect/disarm are a tracked follow-up (see below).

Configurable, with safe defaults

The three values follow the existing KAFSCALE_OPERATOR_ETCD_* operator-env pattern so high-write clusters can tune without a code change. Defaults are the values above, so the behaviour is opt-out:

| Env var | Default | Meaning |
| --- | --- | --- |
| `KAFSCALE_OPERATOR_ETCD_AUTO_COMPACTION_MODE` | `periodic` | `periodic` or `revision`; anything else falls back to the default |
| `KAFSCALE_OPERATOR_ETCD_AUTO_COMPACTION_RETENTION` | `5m` | retention window (duration for `periodic`, count for `revision`) |
| `KAFSCALE_OPERATOR_ETCD_QUOTA_BACKEND_BYTES` | `4294967296` (4 GiB) | backend quota in bytes |

Each parser validates and falls back to the default on empty or garbage input, so an override can never accidentally disable the quota or select an invalid mode.
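A minimal sketch of the validate-and-fall-back behaviour, not the code in this PR. The function names (`parseCompactionMode`, `resolveCompaction`, `parseQuota`, `compactionArgs`) and the injected `getenv` parameter are hypothetical. One assumption worth flagging: when `revision` is paired with an invalid retention, the sketch reverts to the `periodic` / `5m` pair, because `5m` is not a valid revision count.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

const (
	defaultCompactionMode      = "periodic"
	defaultCompactionRetention = "5m"
	defaultQuotaBackendBytes   = int64(4) << 30 // 4 GiB
)

// parseCompactionMode accepts only "periodic" or "revision".
func parseCompactionMode(raw string) string {
	switch raw {
	case "periodic", "revision":
		return raw
	default:
		return defaultCompactionMode
	}
}

// resolveCompaction validates the retention against the selected mode:
// a positive duration for periodic, a positive integer count for revision.
// Assumption of this sketch: an invalid pair reverts to periodic/5m.
func resolveCompaction(rawMode, rawRetention string) (string, string) {
	mode := parseCompactionMode(rawMode)
	switch mode {
	case "periodic":
		if d, err := time.ParseDuration(rawRetention); err == nil && d > 0 {
			return mode, rawRetention
		}
	case "revision":
		if n, err := strconv.ParseInt(rawRetention, 10, 64); err == nil && n > 0 {
			return mode, rawRetention
		}
	}
	return defaultCompactionMode, defaultCompactionRetention
}

// parseQuota rejects empty, non-numeric, zero, and negative input so an
// override cannot disable the quota.
func parseQuota(raw string) int64 {
	n, err := strconv.ParseInt(raw, 10, 64)
	if err != nil || n <= 0 {
		return defaultQuotaBackendBytes
	}
	return n
}

// compactionArgs builds the three etcd flags from the environment.
func compactionArgs(getenv func(string) string) []string {
	mode, retention := resolveCompaction(
		getenv("KAFSCALE_OPERATOR_ETCD_AUTO_COMPACTION_MODE"),
		getenv("KAFSCALE_OPERATOR_ETCD_AUTO_COMPACTION_RETENTION"),
	)
	quota := parseQuota(getenv("KAFSCALE_OPERATOR_ETCD_QUOTA_BACKEND_BYTES"))
	return []string{
		"--auto-compaction-mode=" + mode,
		"--auto-compaction-retention=" + retention,
		"--quota-backend-bytes=" + strconv.FormatInt(quota, 10),
	}
}

func main() {
	for _, arg := range compactionArgs(os.Getenv) {
		fmt.Println(arg)
	}
}
```

Injecting `getenv` rather than calling `os.Getenv` directly keeps the sketch easy to exercise from a table-driven test.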

Memory-mode guard

In KAFSCALE_OPERATOR_ETCD_STORAGE_MEMORY=true mode the data dir is a tmpfs emptyDir{medium:Memory} with no size cap of its own, so a 4 GiB quota could drive about 4 GiB of node RAM and risk an OOM that takes the node down. The tmpfs allocation counts against the etcd container's memory cgroup, so this PR sets, only in memory mode, a memory request equal to the quota and a memory limit of quota + 512 MiB headroom (etcd heap and bbolt mmap pages). The kernel then reclaims or kills inside the container before the node is starved, and the scheduler reserves real capacity. Disk-backed (PVC) mode is unchanged and sets no container resources.
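The sizing rule above can be sketched as follows. This is illustrative only: `memoryResources` is a hypothetical stand-in for the container's Kubernetes resource requirements, and `etcdMemoryResources` is not a function name taken from the repo.

```go
package main

import "fmt"

// 512 MiB of headroom on top of the quota (etcd heap and bbolt mmap pages).
const memoryHeadroomBytes = int64(512) << 20

// memoryResources is a hypothetical stand-in for the container's
// memory request and limit.
type memoryResources struct {
	RequestBytes int64
	LimitBytes   int64
}

// etcdMemoryResources returns nil in disk-backed (PVC) mode, where no
// container resources are set. In memory mode the request equals the
// quota and the limit is quota plus headroom.
func etcdMemoryResources(memoryMode bool, quotaBytes int64) *memoryResources {
	if !memoryMode {
		return nil
	}
	return &memoryResources{
		RequestBytes: quotaBytes,
		LimitBytes:   quotaBytes + memoryHeadroomBytes,
	}
}

func main() {
	r := etcdMemoryResources(true, 4<<30)
	fmt.Printf("request=%d MiB limit=%d MiB\n", r.RequestBytes>>20, r.LimitBytes>>20)
	fmt.Println("disk mode sets resources:", etcdMemoryResources(false, 4<<30) != nil)
}
```

With the 4 GiB default this gives a 4096 MiB request and a 4608 MiB limit.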

Why 5m and 4 GiB (reasoning, not measured)

The lab is currently offline, so these are back-of-envelope figures, not a live capture. etcd writes about one revision per offset commit. Counting the offset key, its value, and bbolt page plus index overhead, a conservative figure is on the order of 1 KiB of backend growth per revision. At a sustained 5,000 produce/sec that is about 5 MiB/s of pre-compaction growth.

Time to hit the 2 GiB default quota at 5 MiB/s: 2 GiB / 5 MiB/s is about 410 s, roughly 7 minutes. That matches the observed NOSPACE wall under sustained load.
With periodic compaction at 5m retention, the live revision set per window is about 5 MiB/s x 300 s, roughly 1.5 GiB. That sits under the 4 GiB quota with headroom, so the broker stops hitting NOSPACE between compaction cycles. The 4 GiB quota absorbs bursts and the un-defragged free pages that compaction leaves behind.

These numbers scale linearly with write rate; that is exactly why the three values are env-configurable. A live before/after capture is pending the lab coming back online.
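For re-running the estimate at a different write rate, the back-of-envelope arithmetic can be sketched as below. The function names are illustrative, and the 1 KiB-per-revision input is the same unmeasured assumption as above. The exact figures land slightly off the rounded ones in the text because 5,000 x 1 KiB is a little under 5 MiB/s.

```go
package main

import "fmt"

const (
	kib = 1 << 10
	mib = 1 << 20
	gib = 1 << 30
)

// growthBytesPerSec is pre-compaction backend growth, assuming one
// revision per produce and a fixed cost per revision.
func growthBytesPerSec(producePerSec, bytesPerRevision float64) float64 {
	return producePerSec * bytesPerRevision
}

// secondsToQuota is how long uncompacted growth takes to fill the quota.
func secondsToQuota(quotaBytes, growth float64) float64 {
	return quotaBytes / growth
}

// liveSetBytes is the revision data retained inside one retention window.
func liveSetBytes(growth, retentionSeconds float64) float64 {
	return growth * retentionSeconds
}

func main() {
	growth := growthBytesPerSec(5000, kib)
	fmt.Printf("growth: %.2f MiB/s\n", growth/mib)
	fmt.Printf("time to 2 GiB quota: %.0f s\n", secondsToQuota(2*gib, growth))
	fmt.Printf("live set at 5m retention: %.2f GiB\n", liveSetBytes(growth, 300)/gib)
}
```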

Reproducible load recipe

To reproduce and validate against a real write rate:

Bring up a managed-etcd KafScale cluster (operator default path, no external etcd endpoints).

Drive sustained produce, for example with kafka-producer-perf-test:

kafka-producer-perf-test \
  --topic etcd-load \
  --num-records 50000000 \
  --record-size 256 \
  --throughput 5000 \
  --producer-props bootstrap.servers=<kafscale-proxy>:9092

Watch the backend size and revision counters on each etcd member:
```
etcdctl endpoint status --write-out=table
```
The metric to graph is etcd_mvcc_db_total_size_in_bytes (physical backend size). Without this change it climbs monotonically to the 2 GiB wall; with it, physical size still grows between compactions (no defrag), but logical space is reclaimed every 5m and the 4 GiB quota gives headroom. Also watch etcd_mvcc_db_total_size_in_use_in_bytes (logical in-use) to see compaction working: it should sawtooth, not climb.
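When reading the two sizes side by side, the gap between physical size and logical in-use size is roughly what a defrag would hand back, and physical size over quota is how close the member is to NOSPACE. A small sketch of that bookkeeping, with a hypothetical `backendSample` type holding the two readings:

```go
package main

import "fmt"

// backendSample holds one reading of the two sizes for an etcd member.
type backendSample struct {
	PhysicalBytes int64 // physical backend size
	InUseBytes    int64 // logical in-use size
}

// ReclaimableBytes is the free-page space compaction left behind, which
// only a defrag would return to the filesystem.
func (s backendSample) ReclaimableBytes() int64 {
	if s.InUseBytes > s.PhysicalBytes {
		return 0
	}
	return s.PhysicalBytes - s.InUseBytes
}

// QuotaUsed is the fraction of the backend quota consumed by the
// physical size, which is what the quota is checked against.
func (s backendSample) QuotaUsed(quotaBytes int64) float64 {
	return float64(s.PhysicalBytes) / float64(quotaBytes)
}

func main() {
	s := backendSample{PhysicalBytes: 3 << 30, InUseBytes: 1 << 30}
	fmt.Printf("reclaimable: %d MiB\n", s.ReclaimableBytes()>>20)
	fmt.Printf("quota used: %.0f%%\n", 100*s.QuotaUsed(4<<30))
}
```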

Follow-up (separate PR/issue)

The durability work that closes the loop is intentionally out of scope here; splitting it out keeps this change focused and reviewable:

Physical reclaim via defrag, either a CronJob or operator-driven, running one member at a time so quorum is never lost during the defrag stop-the-world on that member.
NOSPACE alarm detect and disarm: once the backend trips the quota etcd sets a NOSPACE alarm and the keyspace stays read-only even after compaction; an operator that detects the alarm, runs defrag, then disarms it is the missing piece.
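A rough sketch of the shape that follow-up could take, under stated assumptions. The `maintainer` interface and its method names are hypothetical and only abstract the three operations named above (detect the alarm, defrag a member, disarm); they are not the etcd client API, and nothing like this exists in the repo yet.

```go
package main

import (
	"errors"
	"fmt"
)

// maintainer is a hypothetical abstraction over the etcd maintenance
// operations the follow-up would need.
type maintainer interface {
	HasNoSpaceAlarm() (bool, error)
	Defragment(endpoint string) error
	DisarmNoSpaceAlarm() error
}

// recoverFromNoSpace defrags members strictly one at a time and disarms
// the alarm only after every member has been defragged. It stops at the
// first failure so the remaining members are left untouched.
func recoverFromNoSpace(m maintainer, endpoints []string) error {
	alarmed, err := m.HasNoSpaceAlarm()
	if err != nil {
		return fmt.Errorf("list alarms: %w", err)
	}
	if !alarmed {
		return nil
	}
	for _, ep := range endpoints {
		if err := m.Defragment(ep); err != nil {
			return fmt.Errorf("defrag %s: %w", ep, err)
		}
	}
	return m.DisarmNoSpaceAlarm()
}

// fakeCluster is an in-memory stand-in used only to exercise the flow.
type fakeCluster struct {
	alarmed bool
	failOn  string
	calls   []string
}

func (f *fakeCluster) HasNoSpaceAlarm() (bool, error) { return f.alarmed, nil }

func (f *fakeCluster) Defragment(endpoint string) error {
	f.calls = append(f.calls, "defrag "+endpoint)
	if endpoint == f.failOn {
		return errors.New("injected failure")
	}
	return nil
}

func (f *fakeCluster) DisarmNoSpaceAlarm() error {
	f.calls = append(f.calls, "disarm")
	f.alarmed = false
	return nil
}

func main() {
	f := &fakeCluster{alarmed: true}
	err := recoverFromNoSpace(f, []string{"etcd-0", "etcd-1", "etcd-2"})
	fmt.Println(f.calls, err)
}
```

The sequential loop is the point: each member is unavailable while it defrags, so overlapping two defrags in a three-member cluster would cost quorum.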

Test

go build ./... and go test ./pkg/operator/... are green; gofmt and go vet clean. A new table-driven test asserts etcdArgs() emits the three flags with the default values and that the env overrides change them, plus tests covering the memory-mode container memory limit and the absence of container resources in disk mode.

The operator-managed etcd takes one revision per offset update (one per produce). Without compaction the MVCC store reaches the default 2 GiB quota under sustained load and the broker then fails produce with 'mvcc: database space exceeded'.

Add periodic auto-compaction (5m retention) and raise the backend quota to 4 GiB so bursts do not trip the quota between compactions.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… mode

Build on the periodic auto-compaction + raised backend quota change. Compaction reclaims revisions logically; it does not shrink the physical bbolt file, so this reduces NOSPACE frequency rather than eliminating it. Physical reclaim (defrag) stays a tracked follow-up.

- Expose mode, retention, and quota as env-configurable knobs following the existing KAFSCALE_OPERATOR_ETCD_* pattern, with periodic / 5m / 4 GiB as safe defaults. Each parser validates and falls back to the default on empty or garbage input, so an override can never disable the quota or pick an invalid mode.
- Memory-mode guard: when the data dir is a tmpfs emptyDir, set a memory request (= quota) and limit (= quota + 512 MiB headroom) on the etcd container so a large quota cannot drive unbounded node RAM and OOM the node. Disk-backed (PVC) mode is unchanged.
- Table-driven test for etcdArgs() asserting the three flags emit the defaults and honour env overrides, plus tests for the memory-mode limit and the absence of container resources in disk mode.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Scalytics and others added 2 commits June 4, 2026 19:36

kamir wants to merge 2 commits into KafScale:main from kamir:pr/etcd-auto-compaction
