
fix(operator): enable etcd auto-compaction + raise backend quota #151

Open: kamir wants to merge 2 commits into KafScale:main from kamir:pr/etcd-auto-compaction


kamir (Collaborator) commented Jun 4, 2026

Problem

The operator-managed etcd writes roughly one MVCC revision per offset update (one per produce). With no compaction the store grows until it reaches etcd's default 2 GiB backend quota under sustained load, and the broker then fails produce with mvcc: database space exceeded.

What this PR does (mitigation, not a full fix)

Enable periodic auto-compaction and raise the backend quota so NOSPACE happens far less often:

  • --auto-compaction-mode=periodic
  • --auto-compaction-retention=5m
  • --quota-backend-bytes=4294967296 (4 GiB)

Be explicit about the limit: compaction is not defrag. Auto-compaction frees old revisions logically, so bbolt pages become reusable, but it does not shrink the physical db file, and --quota-backend-bytes is enforced against the physical size. Only etcdctl defrag reclaims physical space, and there is no defrag in the repo yet. So this change reduces NOSPACE frequency and buys headroom; it does not close the loop. Physical reclaim (defrag) and NOSPACE-alarm detect/disarm are a tracked follow-up (see below).

Configurable, with safe defaults

The three values follow the existing KAFSCALE_OPERATOR_ETCD_* operator-env pattern so high-write clusters can tune without a code change. Defaults are the values above, so the behaviour is opt-out:

| Env var | Default | Meaning |
| --- | --- | --- |
| KAFSCALE_OPERATOR_ETCD_AUTO_COMPACTION_MODE | periodic | periodic or revision; anything else falls back to the default |
| KAFSCALE_OPERATOR_ETCD_AUTO_COMPACTION_RETENTION | 5m | retention window (duration for periodic, count for revision) |
| KAFSCALE_OPERATOR_ETCD_QUOTA_BACKEND_BYTES | 4294967296 (4 GiB) | backend quota in bytes |

Each parser validates and falls back to the default on empty or garbage input, so an override can never accidentally disable the quota or select an invalid mode.
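A minimal Go sketch of the validate-and-fall-back behaviour described above. It is an illustration, not the code in this PR: `parseMode`, `resolveCompaction`, `parseQuota`, and `compactionArgs` are hypothetical names, and two choices are assumptions made for the sketch: a retention that is invalid for the selected mode reverts the pair to periodic / 5m, and periodic retention must parse as a Go duration.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

// Defaults as stated in the PR description.
const (
	defaultMode      = "periodic"
	defaultRetention = "5m"
	defaultQuota     = int64(4) << 30 // 4294967296 bytes (4 GiB)
)

// parseMode accepts only "periodic" or "revision"; anything else yields the default.
func parseMode(raw string) string {
	switch v := strings.TrimSpace(raw); v {
	case "periodic", "revision":
		return v
	default:
		return defaultMode
	}
}

// resolveCompaction validates the retention against the mode: a positive
// duration for periodic, a positive integer count for revision.
// Assumption: an invalid retention reverts both values to periodic / 5m.
func resolveCompaction(rawMode, rawRetention string) (string, string) {
	mode := parseMode(rawMode)
	retention := strings.TrimSpace(rawRetention)
	switch mode {
	case "periodic":
		if d, err := time.ParseDuration(retention); err == nil && d > 0 {
			return mode, retention
		}
	case "revision":
		if n, err := strconv.ParseInt(retention, 10, 64); err == nil && n > 0 {
			return mode, retention
		}
	}
	return defaultMode, defaultRetention
}

// parseQuota rejects empty, non-numeric, and non-positive input so an
// override cannot switch the quota off.
func parseQuota(raw string) int64 {
	n, err := strconv.ParseInt(strings.TrimSpace(raw), 10, 64)
	if err != nil || n <= 0 {
		return defaultQuota
	}
	return n
}

// compactionArgs builds the three etcd flags from an injected env lookup.
func compactionArgs(getenv func(string) string) []string {
	mode, retention := resolveCompaction(
		getenv("KAFSCALE_OPERATOR_ETCD_AUTO_COMPACTION_MODE"),
		getenv("KAFSCALE_OPERATOR_ETCD_AUTO_COMPACTION_RETENTION"),
	)
	quota := parseQuota(getenv("KAFSCALE_OPERATOR_ETCD_QUOTA_BACKEND_BYTES"))
	return []string{
		"--auto-compaction-mode=" + mode,
		"--auto-compaction-retention=" + retention,
		fmt.Sprintf("--quota-backend-bytes=%d", quota),
	}
}

func main() {
	fmt.Println(strings.Join(compactionArgs(os.Getenv), " "))
}
```

Injecting the lookup function instead of calling `os.Getenv` directly keeps the flag builder easy to exercise from a table-driven test.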

Memory-mode guard

In KAFSCALE_OPERATOR_ETCD_STORAGE_MEMORY=true mode the data dir is a tmpfs emptyDir{medium:Memory} with no size cap of its own, so a 4 GiB quota could drive about 4 GiB of node RAM and risk an OOM that takes the node down. The tmpfs allocation counts against the etcd container's memory cgroup, so this PR sets, only in memory mode, a memory request equal to the quota and a memory limit of quota + 512 MiB headroom (etcd heap and bbolt mmap pages). The kernel then reclaims or kills inside the container before the node is starved, and the scheduler reserves real capacity. Disk-backed (PVC) mode is unchanged and sets no container resources.
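The sizing rule can be sketched in a few lines of Go. This is an assumption-laden illustration: `memoryResources` and `etcdMemoryResources` are hypothetical stand-ins (the operator presumably populates Kubernetes resource requirements instead), and only the arithmetic is taken from the description above.

```go
package main

import "fmt"

const mib = int64(1) << 20

// memoryHeadroom is the 512 MiB allowance for etcd heap and bbolt mmap pages.
const memoryHeadroom = 512 * mib

// memoryResources is a hypothetical stand-in for the container's
// memory request and limit, expressed in bytes.
type memoryResources struct {
	RequestBytes int64
	LimitBytes   int64
}

// etcdMemoryResources returns nil in disk-backed (PVC) mode, where no
// container resources are set. In memory mode the request equals the
// backend quota and the limit adds the fixed headroom.
func etcdMemoryResources(memoryMode bool, quotaBytes int64) *memoryResources {
	if !memoryMode {
		return nil
	}
	return &memoryResources{
		RequestBytes: quotaBytes,
		LimitBytes:   quotaBytes + memoryHeadroom,
	}
}

func main() {
	r := etcdMemoryResources(true, 4<<30)
	// prints "request=4096 MiB limit=4608 MiB"
	fmt.Printf("request=%d MiB limit=%d MiB\n", r.RequestBytes/mib, r.LimitBytes/mib)
	// prints "true": disk mode sets no resources
	fmt.Println(etcdMemoryResources(false, 4<<30) == nil)
}
```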

Why 5m and 4 GiB (reasoning, not measured)

The lab is currently offline, so these are back-of-envelope figures, not a live capture. etcd writes about one revision per offset commit. Counting the offset key, its value, and bbolt page plus index overhead, a conservative figure is on the order of 1 KiB of backend growth per revision. At a sustained 5,000 produce/sec that is about 5 MiB/s of pre-compaction growth.

  • Time to hit the 2 GiB default quota at 5 MiB/s: 2 GiB / 5 MiB/s is about 410 s, roughly 7 minutes. That matches the observed NOSPACE wall under sustained load.
  • With periodic compaction at 5m retention, the live revision set per window is about 5 MiB/s x 300 s, roughly 1.5 GiB. That sits under the 4 GiB quota with headroom, so the broker stops hitting NOSPACE between compaction cycles. The 4 GiB quota absorbs bursts and the un-defragged free pages that compaction leaves behind.
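The back-of-envelope figures can be re-derived for any write rate with a short Go sketch. The function names are hypothetical and the inputs are the same assumptions as above (about 1 KiB of backend growth per revision, 5,000 produce/sec). Computed without rounding the growth rate up to 5 MiB/s, the results come out slightly different from the rounded ones in the bullets: about 419 s to reach 2 GiB and about 1.43 GiB of revisions per 5m window.

```go
package main

import "fmt"

const (
	kib = 1024.0
	mib = 1024.0 * kib
	gib = 1024.0 * mib
)

// growthBytesPerSec is the pre-compaction backend growth rate.
func growthBytesPerSec(writesPerSec, bytesPerRevision float64) float64 {
	return writesPerSec * bytesPerRevision
}

// secondsToQuota is how long uncompacted growth takes to reach the quota.
func secondsToQuota(quotaBytes, growth float64) float64 {
	return quotaBytes / growth
}

// liveWindowBytes is the volume of revisions written during one retention window.
func liveWindowBytes(growth, retentionSec float64) float64 {
	return growth * retentionSec
}

func main() {
	growth := growthBytesPerSec(5000, 1*kib)

	fmt.Printf("growth: %.2f MiB/s\n", growth/mib)                          // growth: 4.88 MiB/s
	fmt.Printf("time to 2 GiB: %.0f s\n", secondsToQuota(2*gib, growth))    // time to 2 GiB: 419 s
	fmt.Printf("5m window: %.2f GiB\n", liveWindowBytes(growth, 300)/gib)   // 5m window: 1.43 GiB
}
```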

These numbers scale linearly with write rate; that is exactly why the three values are env-configurable. A live before/after capture is pending the lab coming back online.

Reproducible load recipe

To reproduce and validate against a real write rate:

  1. Bring up a managed-etcd KafScale cluster (operator default path, no external etcd endpoints).

  2. Drive sustained produce, for example with kafka-producer-perf-test:

    kafka-producer-perf-test \
      --topic etcd-load \
      --num-records 50000000 \
      --record-size 256 \
      --throughput 5000 \
      --producer-props bootstrap.servers=<kafscale-proxy>:9092
    
  3. Watch the backend size and revision counters on each etcd member:

    etcdctl endpoint status --write-out=table
    

    The metric to graph is etcd_mvcc_db_total_size_in_bytes (physical backend size). Without this change it climbs monotonically to the 2 GiB wall; with it, physical size still grows between compactions (no defrag), but logical space is reclaimed every 5m and the 4 GiB quota gives headroom. Also watch etcd_mvcc_db_total_size_in_use_in_bytes (logical in-use) to see compaction working: it should sawtooth, not climb.

Follow-up (separate PR/issue)

The durability work that closes the loop is intentionally out of scope here; splitting it out keeps this change focused and reviewable:

  • Physical reclaim via defrag, either a CronJob or operator-driven, running one member at a time so quorum is never lost during the defrag stop-the-world on that member.
  • NOSPACE alarm detect and disarm: once the backend trips the quota, etcd raises a NOSPACE alarm and the keyspace stays read-only even after compaction. An operator that detects the alarm, runs defrag, then disarms it is the missing piece.

Test

go build ./... and go test ./pkg/operator/... are green; gofmt and go vet clean. A new table-driven test asserts etcdArgs() emits the three flags with the default values and that the env overrides change them, plus tests covering the memory-mode container memory limit and the absence of container resources in disk mode.

Scalytics and others added 2 commits June 4, 2026 19:36
The operator-managed etcd takes one revision per offset update (one per
produce). Without compaction the MVCC store reaches the default 2 GiB quota
under sustained load and the broker then fails produce with 'mvcc: database
space exceeded'. Add periodic auto-compaction (5m retention) and raise the
backend quota to 4 GiB so bursts do not trip the quota between compactions.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… mode

Build on the periodic auto-compaction + raised backend quota change.
Compaction reclaims revisions logically; it does not shrink the physical
bbolt file, so this reduces NOSPACE frequency rather than eliminating it.
Physical reclaim (defrag) stays a tracked follow-up.

- Expose mode, retention, and quota as env-configurable knobs following
  the existing KAFSCALE_OPERATOR_ETCD_* pattern, with periodic / 5m /
  4 GiB as safe defaults. Each parser validates and falls back to the
  default on empty or garbage input, so an override can never disable the
  quota or pick an invalid mode.
- Memory-mode guard: when the data dir is a tmpfs emptyDir, set a memory
  request (= quota) and limit (= quota + 512 MiB headroom) on the etcd
  container so a large quota cannot drive unbounded node RAM and OOM the
  node. Disk-backed (PVC) mode is unchanged.
- Table-driven test for etcdArgs() asserting the three flags emit the
  defaults and honour env overrides, plus tests for the memory-mode limit
  and the absence of container resources in disk mode.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>