
docs(design): propose workload-class isolation after 2026-04-24 XREAD starvation#619

Merged
bootjp merged 5 commits into main from
ops/workload-isolation-design
Apr 24, 2026

Conversation

@bootjp
Owner

@bootjp bootjp commented Apr 24, 2026

Summary

Design doc (no code in this PR) for a four-layer workload-isolation model, prompted by the afternoon phase of the 2026-04-24 incident.

Problem: During the incident, one client host with 37 connections running a tight XREAD loop consumed 14 CPU cores on the leader via loadStreamAt → unmarshalStreamValue → proto.Unmarshal (81% of leader CPU per pprof). Raft goroutines could not get CPU time → step_queue_full = 75,692 on the leader (vs 0-119 on followers) → Raft commit p99 jumped to 6-10s and Lua p99 was stuck at 6-8s. Follower replication stayed healthy (applied index within 34 of the leader); the damage was entirely CPU scheduling on the leader.

Gap: elastickv has no explicit workload-class isolation. Go's scheduler treats every goroutine equally, so a single heavy command path can starve unrelated paths (Raft, lease, Lua, GET/SET).

Four-layer defense model

  • Layer 1 — heavy-command worker pool: gate XREAD / KEYS / SCAN / Lua onto a bounded pool (~2 × GOMAXPROCS); reply -BUSY when full. Cheap commands keep their own fast path.
  • Layer 2 — locked OS threads for raft: runtime.LockOSThread() on the Ready loop + dispatcher lanes so the Go scheduler can't starve them. Not v1 — only if measurement after Layer 1 + 4 still shows step_queue_full > 0.
  • Layer 3 — per-client admission control: per-peer-IP connection cap (default 8). Extends, doesn't replace, roadmap item 6's global in-flight semaphore.
  • Layer 4 — XREAD O(N) → O(new): entry-per-key layout (!redis|stream|<key>|entry|<id>) with range-scan, dual-read migration fallback, legacy-removal gated on elastickv_stream_legacy_format_reads_total == 0. Hashes/sets/zsets share the same one-blob pattern and are called out as follow-up.
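Layer 1's gate can be pictured as a slot semaphore built from a buffered channel. A minimal sketch, assuming a `HeavyPool`/`Submit` shape and the 2 × GOMAXPROCS sizing from the doc — illustrative names, not the eventual API:

```go
package main

import (
	"errors"
	"runtime"
)

// ErrBusy maps to the RESP "-BUSY" reply sent when the pool is full.
var ErrBusy = errors.New("BUSY heavy-command pool is full")

// HeavyPool bounds concurrent heavy commands; a buffered channel
// acts as the slot semaphore, sized 2 x GOMAXPROCS per the design.
type HeavyPool struct {
	slots chan struct{}
}

func NewHeavyPool() *HeavyPool {
	return &HeavyPool{slots: make(chan struct{}, 2*runtime.GOMAXPROCS(0))}
}

// Submit runs fn if a slot is free, otherwise rejects immediately so
// the caller can reply -BUSY instead of queueing unbounded work.
// Cheap commands never call Submit and keep their fast path.
func (p *HeavyPool) Submit(command string, fn func() error) error {
	select {
	case p.slots <- struct{}{}: // acquired a slot
		defer func() { <-p.slots }()
		return fn()
	default:
		return ErrBusy
	}
}
```

Rejecting on a full pool (rather than blocking) is what keeps the gate from turning into another queue that heavy clients can fill.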

Recommended sequencing

Layer 4 (correctness bug, concentrated change) → Layer 1 (generic defense for next unknown hotspot) → Layer 3 (reconcile with roadmap item 6) → Layer 2 (only if forced by measurement).

Relationship to other in-flight work

Open questions called out in the doc

  • Static vs dynamic command classification (Layer 1)
  • -BUSY backoff semantics — how do we avoid client retry spinning becoming the new hot loop?
  • Number of locked OS threads on variable-core hosts (Layer 2)
  • Stream migration soak window before removing legacy-format fallback (Layer 4, currently 30 days, arbitrary)

Deliverable

docs/design/2026_04_24_proposed_workload_isolation.md — 446 lines, dated-prefix / **Status: Proposed** convention matching the rest of docs/design/. No code.

Test plan

  • File paths and function references in the doc spot-checked against origin/main
  • Cross-references to 2026_04_24_proposed_resilience_roadmap.md reconciled (complements, doesn't duplicate)
  • Design review — decide on the open questions before implementing Layer 4 (which blocks Layer 1 on XREAD specifically)

@coderabbitai

coderabbitai Bot commented Apr 24, 2026

Warning

Rate limit exceeded

@bootjp has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 53 minutes and 24 seconds before requesting another review.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bd714ac9-e7e3-4b7d-b86e-a2699644bd74

📥 Commits

Reviewing files that changed from the base of the PR and between 538f4f6 and 2ae4056.

📒 Files selected for processing (1)
  • docs/design/2026_04_24_proposed_workload_isolation.md

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This design document outlines a multi-layered approach to workload isolation in response to a CPU starvation incident caused by expensive XREAD operations. The proposal includes a bounded worker pool for heavy commands, per-client admission control, and a transition to per-entry storage for streams to improve performance. Feedback focuses on mitigating risks during stream migration, ensuring container-aware resource allocation, refining command classification for the worker pool, and improving client-side observability during load shedding.

Comment on lines +314 to +315
3. On the next write (`XADD`/`XDEL`/...), rewrite to per-entry and
delete the legacy blob in the same commit.
Contributor


high

Performing a full conversion of a large legacy stream during a synchronous write operation (XADD/XDEL) could be extremely expensive and might re-introduce CPU starvation or cause OOM, as it requires unmarshaling the entire blob and creating many new keys in a single Raft commit. This migration plan must detail strategies for zero-downtime cutovers to avoid service interruption. Consider an incremental migration strategy or a size-limited synchronous conversion to mitigate latency spikes and Raft heartbeat drops for very large streams.

Suggested change
3. On the next write (`XADD`/`XDEL`/...), rewrite to per-entry and
delete the legacy blob in the same commit.
3. On the next write (`XADD`/`XDEL`/...), perform an incremental or size-limited rewrite
to per-entry and delete the legacy blob in the same commit.
References
  1. Production-grade system migrations must detail strategies for live or zero-downtime cutovers to avoid service interruption.

Static v1 classification:

- **Pool-gated:** `XREAD`, `XRANGE`, `XREVRANGE`, `KEYS`, `SCAN`,
`HGETALL`, `SMEMBERS`, full-range `ZRANGE`/`ZRANGEBYSCORE`,
Contributor


medium

The 'full-range' qualifier for ZRANGE/ZRANGEBYSCORE suggests that the dispatcher must inspect command arguments (e.g., checking for 0 -1) before offloading. This contradicts the stated goal of 'static classification' by command name. If the gating is intended to be strictly static, all ZRANGE calls should be included, or the criteria for 'full-range' should be explicitly defined to avoid ambiguity in the implementation.

Suggested change
`HGETALL`, `SMEMBERS`, full-range `ZRANGE`/`ZRANGEBYSCORE`,
`HGETALL`, `SMEMBERS`, `ZRANGE`, `ZRANGEBYSCORE`,

### Recommended v1 shape

Package-level pool in `adapter/` with a `Submit(command, fn)` entry
point, sized `2 × runtime.GOMAXPROCS(0)` (env-overridable). Gated
Contributor


medium

Relying on runtime.GOMAXPROCS(0) for pool sizing can be unreliable in containerized environments (like Kubernetes) where it may return the host's CPU count instead of the container's quota. This could result in an oversized worker pool that fails to provide the intended isolation. It is recommended to ensure the implementation is container-aware, for example by using a library like uber-go/automaxprocs or explicitly respecting cgroup limits.

Suggested change
point, sized `2 × runtime.GOMAXPROCS(0)` (env-overridable). Gated
point, sized `2 × runtime.GOMAXPROCS(0)` (env-overridable, container-aware). Gated
References
  1. Design documents identifying significant operational risks must detail mitigation strategies.


### v1 vs later

- **v1:** static list, single global pool, reject on full.
Contributor


medium

Sharing a single global pool for all gated commands means that a burst of management or debug commands (like KEYS * or SCAN) can still starve critical data-path commands (like XREAD or Lua scripts) by exhausting pool slots and triggering -BUSY responses. To achieve true workload-class isolation, consider using separate sub-pools or a slot reservation system for critical command classes even in the v1 implementation.

Suggested change
- **v1:** static list, single global pool, reject on full.
- **v1:** static list, per-class sub-pools or reservations, reject on full.

Comment on lines +251 to +252
enforced at accept.** On reject, close the TCP connection immediately
(no RESP — clients retry the connect). Per-client in-flight
Contributor


medium

Abruptly closing the TCP connection without a protocol-level error (RESP) can lead to poor observability for clients, as they will see 'connection reset' or 'EOF' instead of a clear 'server overloaded' message. This makes it harder to distinguish between network issues and intentional load shedding. Consider accepting the connection and sending a -BUSY error before closing, or ensuring that client libraries are specifically tuned to interpret abrupt closures as a backpressure signal.

Suggested change
enforced at accept.** On reject, close the TCP connection immediately
(no RESP — clients retry the connect). Per-client in-flight
enforced at accept. On reject, consider accepting and replying with -BUSY before closing
(no RESP — clients retry the connect). Per-client in-flight

- Layer 1 (line 84): replace "full-range ZRANGE/ZRANGEBYSCORE" with
  unconditional gating of the whole ZRANGE family. Arg-inspection
  contradicted the stated "static, byte-level classification"; a
  bounded ZRANGE 0 10 costs at most one unmarshal and gating it is
  cheaper than the dispatcher branch that would distinguish the
  bounded vs unbounded case.
- Layer 1 (line 122, container awareness): document that
  runtime.GOMAXPROCS(0) returns the host CPU count on Linux without
  cgroup awareness and call out two mitigations (operator-set
  GOMAXPROCS env at rolling-update level, or wire
  uber-go/automaxprocs). v1 prefers the operator-set path for
  auditability; automaxprocs acceptable as follow-up.
- Layer 1 (line 139, single-pool starvation): acknowledge the risk
  that KEYS/SCAN bursts can exhaust pool slots and starve XREAD/Lua.
  v1 still ships a single pool but requires a per-command submit
  metric so a tier split is measurable from observability rather
  than guessed; sub-pools/slot reservation are the named follow-up.
- Layer 3 (line 252, reject semantics): change the v1 shape from
  "close TCP without RESP" to "accept, write -ERR, then close." The
  protocol-level error is the signal that distinguishes "server
  overload" from "network blip" on the client side.
- Layer 4 (line 315, synchronous migration cost): add a chunked
  migration section. A single XADD on a 100k-entry legacy blob would
  rewrite every entry in one Raft commit, reproducing the CPU/commit
  spike the design is supposed to prevent. Document
  STREAM_MIGRATION_CHUNK (default 1024) and the rolling drain model;
  explicitly scope chunked migration as a stacked follow-up to PR
  #620 (which ships the simple one-txn version).
@bootjp
Owner Author

bootjp commented Apr 24, 2026

Addressed Gemini design review in 9d44b6c:

  • HIGH (line 315, sync migration): added chunked migration section (STREAM_MIGRATION_CHUNK=1024 default). Scoped explicitly as a stacked follow-up to PR #620 (perf(redis): stream entry-per-key layout for O(new) XREAD), which ships the simple one-txn version. Doc now acknowledges that a 100k-entry legacy stream cannot be rewritten in a single XADD txn without reproducing the CPU spike.
  • MEDIUM (line 84, ZRANGE ambiguity): widened gating to the whole ZRANGE family. No arg-inspection → preserves static byte-level classification. A bounded ZRANGE 0 10 pays one pool submit for one unmarshal; still cheaper than the dispatcher branch that would distinguish bounded vs unbounded.
  • MEDIUM (line 122, container-awareness): called out the GOMAXPROCS vs cgroup-quota mismatch. v1 recommends operator-set GOMAXPROCS at the rolling-update layer (auditable); uber-go/automaxprocs is an accepted follow-up.
  • MEDIUM (line 139, single-pool starvation): kept single pool for v1 but required elastickv_heavy_command_pool_submit_total labelled by command so a sub-pool split becomes a measurable decision rather than a guess. Sub-pools / slot reservation documented as the named follow-up.
  • MEDIUM (line 252, RESP reject): changed v1 from "close TCP without RESP" to "accept, write -ERR max connections per client exceeded, close". Client sees a protocol-level signal instead of an opaque connection reset.
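The revised Layer 3 reject shape (accept, write a RESP error, close) can be sketched as a per-IP accept gate. A minimal illustration — `ipGate`, `admit`, and `handle` are assumed names, not the actual adapter code:

```go
package main

import (
	"fmt"
	"net"
	"sync"
)

// ipGate tracks live connections per peer IP; the design's default
// cap is 8. Names and structure here are illustrative only.
type ipGate struct {
	mu    sync.Mutex
	count map[string]int
	limit int
}

func newIPGate(limit int) *ipGate {
	return &ipGate{count: map[string]int{}, limit: limit}
}

// admit reports whether the peer is under its cap and, if so,
// reserves a slot that release returns later.
func (g *ipGate) admit(ip string) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.count[ip] >= g.limit {
		return false
	}
	g.count[ip]++
	return true
}

func (g *ipGate) release(ip string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.count[ip]--
}

func handle(g *ipGate, conn net.Conn) {
	ip, _, _ := net.SplitHostPort(conn.RemoteAddr().String())
	if !g.admit(ip) {
		// Protocol-level signal, then close: the client can tell
		// "server overload" apart from a bare connection reset.
		fmt.Fprint(conn, "-ERR max connections per client exceeded\r\n")
		conn.Close()
		return
	}
	defer g.release(ip)
	defer conn.Close()
	// ... serve RESP on conn ...
}
```

The one-line RESP error before close is the whole point of the change from the earlier "close TCP without RESP" shape.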

/gemini review
@codex review

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request proposes a four-layer workload isolation strategy to prevent resource starvation, covering worker pools for heavy commands, Raft thread pinning, per-client admission control, and stream storage optimization. The review identifies several significant design issues, including a potential deadlock when using XREAD BLOCK within a fixed-size pool, a correctness bug in the dual-format read logic during partial migrations, and the need for context propagation to handle Lua recursion safely. Feedback also highlights a contradiction regarding the classification of optimized stream commands and warns of potential connection storms under the proposed admission control mechanism.

Comment on lines +84 to +87
- **Pool-gated:** `XREAD`, `XRANGE`, `XREVRANGE`, `KEYS`, `SCAN`,
`HGETALL`, `HVALS`, `HKEYS`, `SMEMBERS`, `SUNION`, `SINTER`,
`ZRANGE`, `ZRANGEBYSCORE`, `ZRANGEBYLEX`, `EVAL`/`EVALSHA`,
`FCALL`/`FCALL_RO`, the `*SCAN` family.
Contributor


high

Gating XREAD to a small, fixed-size worker pool (2 × GOMAXPROCS) introduces a significant availability risk because XREAD is frequently used with the BLOCK option. A few long-polling clients could exhaust all worker slots, effectively performing a Denial of Service on all other "heavy" commands (such as KEYS, SCAN, or EVAL) that share the same pool. The design should specify that blocking operations are handled outside this bounded pool, or that only the CPU-intensive data processing (unmarshaling/scanning) is offloaded to the pool after the block condition is satisfied.

References
  1. When a design document identifies a significant operational risk, it must also detail potential mitigation strategies.

Comment on lines +343 to +345
1. On XREAD/XRANGE/XLEN/XREVRANGE, try the new per-entry layout first.
2. If empty AND the legacy single-blob key exists, fall back to the
legacy path.
Contributor


high

The dual-format read logic described here contains a correctness bug when combined with the "Chunked migration" strategy (lines 353-364). If a stream is partially migrated, the new per-entry layout will not be empty, which prevents the fallback to the legacy-suffix key in step 2. Consequently, readers will only see the migrated chunk and miss the remaining data still stored in the legacy format. The read path must be designed to check for and merge results from the legacy-suffix key whenever it exists, regardless of whether the new layout contains entries.

References
  1. When designing a production-grade system migration, the plan must consider and detail strategies for live or zero-downtime cutovers to avoid service interruption.

Comment on lines +115 to +117
- **(A) Outer holds the slot; inner `redis.call` runs inline and
ignores the pool.** No deadlock, inner cost shares outer accounting.
**Recommended v1.**
Contributor


medium

To implement Recommendation (A) for Lua recursion safely in Go, the context.Context passed through the command dispatcher must carry a flag indicating if the current goroutine is already executing within a heavy-command worker. Without this context propagation, the Submit entry point cannot reliably distinguish an "inner" call from a new client request, which would lead to the deadlock described in the "Risk to flag" section.

Comment on lines +280 to +283
enforced at accept.** On reject, accept the TCP connection, write a
`-ERR max connections per client exceeded` RESP error, then close —
so the client sees a protocol-level message instead of a bare
`connection reset` or `EOF` that's indistinguishable from a real
Contributor


medium

The strategy of accepting, erroring, and closing connections that exceed the per-IP cap may trigger a "connection storm" if the client uses an aggressive connection pool. Similar to the -BUSY retry storms discussed in the Open Questions (line 453), the design should consider how to mitigate the CPU overhead of repeatedly accepting and closing these connections. Documenting recommended client-side backoff for connection failures or implementing a small server-side delay before closing could help mitigate this.

References
  1. When a design document identifies a significant operational risk, it must also detail potential mitigation strategies.

Comment on lines +430 to +431
with Layer 4 — streams become a cheap command once Layer 4 ships,
but the pool still catches Lua, KEYS, and HGETALL.
Contributor


medium

There is a contradiction between the command classification in Layer 1 and the sequencing goals. While line 84 lists XREAD as "Pool-gated", line 430 states that "streams become a cheap command once Layer 4 ships". If Layer 4 successfully optimizes XREAD to O(new), it should be moved to the "Ungated" list to avoid the unnecessary overhead of the worker pool hop (enqueue/pickup). The design should clarify the final classification of optimized stream commands.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9d44b6c113


Comment on lines +343 to +345
1. On XREAD/XRANGE/XLEN/XREVRANGE, try the new per-entry layout first.
2. If empty AND the legacy single-blob key exists, fall back to the
legacy path.


P1 Badge Merge both stream formats during chunked migration

The migration algorithm becomes incorrect once chunking is introduced: after the first chunk rewrite, part of a stream exists in the new per-entry layout and part remains in the legacy suffix, but this read rule only falls back to legacy when the new layout is empty. In that mixed state, XREAD/XRANGE can return incomplete results (e.g., IDs before the chunk boundary only see the migrated prefix), so the design needs a combined read path (or a migration order/marker that guarantees one source is authoritative per ID range).
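The always-merge read rule both reviewers converge on can be sketched as a merge-by-ID with dedup. A minimal illustration — the `entry` type and `mergeStreamReads` are hypothetical, standing in for the real decoded stream entries:

```go
package main

import "sort"

// entry is a simplified stream entry; uint64 ordering stands in for
// real stream-ID comparison. Illustrative, not the elastickv types.
type entry struct {
	ID   uint64
	Data string
}

// mergeStreamReads implements the "always merge both layouts" rule
// for the chunked-migration window: combine per-entry results with
// entries decoded from the legacy-suffix blob, order by ID, and
// deduplicate. The migrator never writes the same ID to both layouts
// in one commit, but dedup keeps readers safe regardless.
func mergeStreamReads(perEntry, legacy []entry) []entry {
	out := append(append([]entry{}, perEntry...), legacy...)
	sort.Slice(out, func(i, j int) bool { return out[i].ID < out[j].ID })
	dedup := out[:0]
	var last uint64
	for i, e := range out {
		if i > 0 && e.ID == last {
			continue // same ID seen in both layouts; keep one copy
		}
		dedup = append(dedup, e)
		last = e.ID
	}
	return dedup
}
```

Note this is exactly what the fall-through rule ("legacy only if new layout empty") fails to do in the mixed state: it drops the legacy tail once any per-entry key exists.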


Comment on lines +132 to +134
**Container-aware sizing.** `runtime.GOMAXPROCS(0)` on Linux returns
the host CPU count, not the container's cgroup quota (unless
`GOMAXPROCS` is set explicitly or something has configured it).


P2 Badge Update GOMAXPROCS guidance for Go 1.25+ behavior

This guidance is factually stale for this repo’s toolchain (go 1.25.0): modern Go already derives default GOMAXPROCS from cgroup CPU quota on Linux unless explicitly disabled, so documenting that it always uses host CPU count will push operators toward unnecessary manual overrides. That override can also disable runtime auto-updates when quotas change, leading to avoidable mis-sizing.


- Layer 1 classification (line 87, Gemini HIGH): carve out blocking
  XREAD/BLPOP/BRPOP/BZPOP from pool gating. Blocking variants hold a
  slot while idle, trivially exhaust the pool; handle them on their
  own goroutine and re-gate on wake-up.
- Layer 1 Lua recursion (line 117, Gemini Medium): document the
  context-propagation mechanism that makes option A implementable.
  A package-private sentinel value on ctx distinguishes inner call
  from new request; external callers cannot fake it.
- Layer 1 GOMAXPROCS (line 134, Codex P2): correct stale guidance.
  Go 1.25+ already derives GOMAXPROCS from cgroup v2 quota, so the
  automaxprocs recommendation no longer applies to this repo; keep
  the override env knob for operators who want explicit control.
- Layer 3 reject semantics (line 283, Gemini Medium): add a
  reject-storm mitigation. Rate-limit the reject itself — after R
  rejects/s to one peer, switch from accept+write+close to RST;
  recommend client-side backoff in the ops runbook.
- Layer 4 migration (line 345, Gemini HIGH + Codex P1): split the
  migration doc into Mode A (simple, PR #620 ships this) and Mode B
  (chunked, stacked follow-up). The read rules differ: Mode A has
  no mixed state so fall-through is correct; Mode B has a legal
  mixed state and MUST always merge both layouts. Making the
  distinction explicit prevents the dual-read correctness bug the
  reviewers flagged — Mode B with Mode A's fall-through rule
  returns incomplete results during chunking.
- Sequencing (line 431, Gemini Medium): resolve the XREAD
  gated-vs-cheap contradiction. Layer 4 makes XREAD steady-state
  O(new) but we keep it gated in Layer 1 v1 for three concrete
  reasons (large XRANGE bounds, legacy-fallback window, data-driven
  promotion); promotion to ungated is gated on the pool-submit
  metric added in Layer 1.
@bootjp
Owner Author

bootjp commented Apr 24, 2026

Addressed round-2 in 731475e:

  • Gemini HIGH line 87 (XREAD BLOCK): carved blocking variants out of pool gating — they bypass the pool and re-gate on wake-up.
  • Gemini HIGH line 345 + Codex P1: split the migration section into Mode A (simple, ships in PR #620: perf(redis): stream entry-per-key layout for O(new) XREAD) and Mode B (chunked, follow-up). Mode B MUST use an always-merge read rule to avoid the mixed-state data loss reviewers flagged.
  • Gemini Medium line 117: documented the ctx-sentinel mechanism that makes Lua recursion option (A) implementable.
  • Gemini Medium line 283: added reject-storm mitigation (rate-limit the reject with RST fallback + client-side backoff in runbook).
  • Gemini Medium line 431: resolved the XREAD gated-vs-cheap contradiction — kept gated in v1 for three concrete reasons, promotion to ungated is metric-driven.
  • Codex P2 line 134: corrected GOMAXPROCS guidance — Go 1.25+ already derives from cgroup v2 quota, so the previous automaxprocs recommendation is stale for this repo.

/gemini review
@codex review


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 731475ebf9


Comment on lines +100 to +103
consumers are active. v1 resolution: **blocking variants (`XREAD
BLOCK`, `BLPOP`, `BRPOP`, `BZPOPMIN`/`MAX`) bypass the heavy-command
pool and are handled on their own goroutine**. They are I/O-bound
waiting, not CPU-bound; their CPU cost lands on wake-up, when the


P1 Badge Keep XREAD BLOCK under pool throttling

This proposal says blocking variants (including XREAD BLOCK) should bypass the heavy-command pool because they are "I/O-bound waiting," but the current implementation in adapter/redis_compat_commands.go (xread) busy-polls in a loop and repeatedly calls xreadOnce while waiting. That means blocked readers still execute stream reads/unmarshal work during the wait window, so bypassing the pool would let a client evade the new CPU guardrail simply by using BLOCK, recreating the starvation path this design is trying to prevent.


Comment on lines +420 to +423
`elastickv_stream_legacy_format_reads_total` counts reads that
touched a legacy-format key in either mode. Remove the legacy
fallback only after it has sat at zero across all nodes for a soak
window.


P1 Badge Require legacy-key depletion before dropping dual-read fallback

Using elastickv_stream_legacy_format_reads_total == 0 as the removal criterion can drop access to valid data: in Mode A, migration happens on the next write, so cold streams that are neither written nor read during the soak period can remain in legacy format indefinitely while this metric stays at zero. If fallback is removed on that signal alone, later reads of those dormant streams will miss existing data; the exit condition needs to prove no legacy-format keys remain, not just no recent legacy reads.


Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request proposes a four-layer architectural design for workload-class isolation to prevent expensive command paths from starving the Raft control plane. The strategy includes implementing a bounded worker pool for heavy commands, per-IP connection limits, and migrating Redis streams to a per-entry storage layout to reduce unmarshaling overhead. Feedback identifies a performance regression risk during chunked migration where legacy blobs still require full unmarshaling, points out that current blocking command implementations use CPU-intensive busy-polling rather than being purely I/O-bound, and highlights a context propagation gap that would prevent the proposed Lua recursion deadlock mitigation from functioning correctly.

Comment on lines +396 to +412
**Mode B — chunked migration (follow-up):** each write drains at
most `STREAM_MIGRATION_CHUNK` (default 1 024) entries from the legacy
blob into per-entry keys, and leaves the rest in a *legacy-suffix*
key until a subsequent write drains more. During this window the
stream exists in BOTH layouts simultaneously: the oldest N entries
are per-entry, the newer M entries are still in the suffix blob.

Read rule for Mode B — **always merge both layouts**, do not
fall-through on "new layout empty":

1. Read `meta` if present; read all per-entry keys that match the
requested ID range.
2. Read the legacy-suffix blob if present; decode only entries
falling in the ID range.
3. Merge by ID order, deduplicate (the migrator is responsible for
never writing the same ID in both layouts in a single commit), and
return.
Contributor


high

The read rule for Mode B (chunked migration) introduces a significant performance regression that may undermine the isolation goals. Since the legacy format is a single protobuf blob, decoding requires unmarshaling the entire blob, leading to O(N_suffix) CPU/memory usage. This poses an operational risk during the migration window. As this is a production-grade system migration, the plan must detail mitigation strategies (e.g., a temporary 'bridge' or 'dual-write proxy') to ensure a zero-downtime cutover and avoid service interruption.

References
  1. When designing a production-grade system migration, the plan must consider and detail strategies for live or zero-downtime cutovers (e.g., blue/green, dual-write proxy, logical replication) to avoid service interruption.

Comment on lines +100 to +107
consumers are active. v1 resolution: **blocking variants (`XREAD
BLOCK`, `BLPOP`, `BRPOP`, `BZPOPMIN`/`MAX`) bypass the heavy-command
pool and are handled on their own goroutine**. They are I/O-bound
waiting, not CPU-bound; their CPU cost lands on wake-up, when the
dispatcher can re-evaluate whether to gate the follow-up work. The
gating decision is made in `dispatchCommand` when it sees the command
name plus the `BLOCK`/`B*` prefix — the simplest arg inspection we
allow, limited to "is this command blocking?".
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The assumption that blocking variants like BZPOPMIN are purely I/O-bound and not CPU-bound is technically incorrect for the current implementation. As seen in adapter/redis_compat_commands.go:3432, bzpopmin uses a busy-poll loop with time.Sleep(redisBusyPollBackoff). While time.Sleep yields the P, each wake-up triggers a KV lookup and Raft leadership verification, which are CPU-intensive operations. If many clients use blocking commands, the aggregate CPU load from polling could still starve the runtime. The design should either propose moving blocking commands to a truly event-driven model (e.g., using the internal pubsub/relay system) or acknowledge this as a remaining gap in the isolation model.

Comment on lines +136 to +145
**Implementation note for (A): context propagation.** `Submit`
identifies "inside a pool slot" by attaching a sentinel value to
`context.Context` (`ctxKeyInPoolSlot`). The Lua adapter threads that
`ctx` into every `redis.call` it makes; the dispatcher's pool-gate
check returns immediately when `ctx.Value(ctxKeyInPoolSlot) != nil`
instead of attempting another `Submit`. This is the only mechanism
that reliably distinguishes "new client request" from "inner call"
without tagging every goroutine or holding a pool-wide set of
goroutine IDs. The sentinel must be package-private so external
callers cannot fake it.
Contributor

medium

The proposed context propagation for the Lua recursion trap has a gap in the current architecture. RedisServer.runLuaScript (and the eval handler) currently creates a new context using context.WithTimeout(context.Background(), ...) (see adapter/redis_lua.go:118). If eval is executed within a worker pool slot, this new context will lose the ctxKeyInPoolSlot sentinel. To make this mechanism work, runLuaScript and all specialized command handlers in RedisServer must be refactored to accept and propagate the context provided by the dispatcher/pool, rather than starting from Background().

- Layer 1 blocking variants (Codex P1 + Gemini Medium): retract the
  previous "I/O-bound, bypass the pool" decision. The current
  adapter busy-polls XREAD BLOCK / B*POP with time.Sleep +
  KV+leader re-check on each wake, so bypassing would hand CPU
  unbounded to them. v1 keeps blocking variants gated; a
  condvar-based wake-up is listed as the prereq for an honest
  bypass in a stacked follow-up.
- Layer 1 Lua ctx caveat (Gemini Medium): runLuaScript currently
  rebuilds the script context from context.Background(), which
  drops the ctxKeyInPoolSlot sentinel Submit installed when it
  dispatched the outer EVAL. Called out explicitly that Layer 1
  MUST NOT ship before runLuaScript derives its context from the
  caller-supplied ctx; otherwise Lua scripts inside the pool
  self-deadlock on every redis.call.
- Layer 4 Mode B cost (Gemini HIGH): Mode B preserves the
  O(N_suffix) unmarshal cost during the migration window, not
  constant O(new). Added a read-driven drain mitigation, top-N
  legacy-suffix-size metric for operator visibility, and an
  explicit note that Mode A is the conservative choice for
  operators who can't tolerate transient cost.
- Layer 4 removal criterion (Codex P1): just "no legacy reads"
  could keep the fallback alive on cold streams forever. Require
  a paired elastickv_stream_legacy_format_keys_total derived from
  a periodic prefix-scan; the fallback is safe to remove only when
  both counters are zero.
@bootjp
Owner Author

bootjp commented Apr 24, 2026

Round 3 in 104237d:

  • Codex P1 + Gemini Medium: retracted the "blocking variants bypass pool" decision — current adapter busy-polls XREAD BLOCK / B*POP, so bypass would hand them unbounded CPU. v1 keeps blocking gated; condvar wake-up is listed as the prereq for an honest bypass.
  • Gemini Medium: documented that runLuaScript currently rebuilds its ctx from context.Background(), clobbering the ctxKeyInPoolSlot sentinel. Called out explicitly that Layer 1 MUST NOT ship until this is fixed or Lua inside the pool self-deadlocks.
  • Gemini HIGH: Mode B cost model now states the O(N_suffix) unmarshal cost during the migration window; added read-driven drain mitigation and top-N legacy-suffix-size metric.
  • Codex P1: removal criterion now requires BOTH elastickv_stream_legacy_format_reads_total AND a new elastickv_stream_legacy_format_keys_total (from a periodic prefix scan) to be zero. Catches cold streams that a read-only metric would miss.

/gemini review
@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 104237dd85

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +595 to +597
compatibility code go away? Proposed:
`elastickv_stream_legacy_format_reads_total` = 0 for 30 days
across all nodes → remove in a follow-up PR. 30 days is arbitrary;

P1: Use both migration counters for fallback removal

This open-question proposal conflicts with the Layer 4 removal rule above by allowing legacy fallback removal based only on elastickv_stream_legacy_format_reads_total. In the cold-stream scenario the doc already calls out, reads can stay at zero while legacy-format keys still exist, so following this criterion can remove compatibility code too early and make those streams unreadable after rollout.


Comment on lines +473 to +474
periodic prefix scan (`!redis|stream|<...>` with no matching
`!stream|meta|<...>`). The fallback is safe to remove only when

P2: Match legacy-scan key pattern to the defined stream layout

The scan rule compares !redis|stream|<...> keys to a missing !stream|meta|<...> key, but the layout defined earlier in this doc uses !redis|stream|<key>|meta. If implemented literally, migrated streams can be misclassified as legacy because the meta lookup pattern is wrong, which breaks the new elastickv_stream_legacy_format_keys_total signal used to decide when fallback can be removed.


Contributor

@gemini-code-assist gemini-code-assist Bot left a comment
Code Review

This pull request proposes a multi-layered design for workload isolation to prevent CPU starvation, including a heavy-command worker pool, Raft thread pinning, per-IP connection caps, and a granular stream storage layout. Feedback focuses on operational risks during migration, specifically the starvation potential of ungated XADD commands and the inefficiency of the proposed legacy key scanning. Additionally, technical corrections were provided regarding the overhead of Go-level TCP resets and the all-or-nothing nature of Protobuf decoding which impacts the performance claims of the proposed migration modes.

Comment on lines +88 to +89
- **Ungated:** `GET`, `SET`, `DEL`, `EXISTS`, `INCR`, `EXPIRE`, `TTL`,
`HGET`, `HSET`, `LPUSH`/`RPUSH`, `XADD`, single-key fast paths.
Contributor
high

The classification of XADD as Ungated (line 89) poses a starvation risk during the transition to the new stream layout. As noted in the Layer 4 problem description (line 384), the current loadStreamAt implementation is O(N). Even after Layer 4 is implemented, the Mode A migration (line 414) requires a full rewrite of the legacy blob on the first write. If XADD remains ungated, a burst of writes to large legacy streams will trigger multiple O(N) unmarshal/marshal cycles on the dispatcher goroutines, potentially reproducing the CPU starvation incident. As this identifies a significant operational risk, the document must detail potential mitigation strategies, such as implementing a temporary 'bridge' or 'proxy' mode to handle the transition safely. Additionally, there is a discrepancy between this document and the codebase: lines 484 and 507 state that Hashes, Sets, and ZSets are currently 'one-blob', but adapter/redis_compat_helpers.go (lines 373, 419, 473) already implements wide-column loading and migration for these types.

References
  1. When a design document identifies a significant operational risk, such as the inability to perform rolling upgrades, it must also detail potential mitigation strategies, like implementing a temporary "bridge" or "proxy" mode.

Comment on lines +471 to +475
the fallback. Add a paired counter
`elastickv_stream_legacy_format_keys_total` populated from a
periodic prefix scan (`!redis|stream|<...>` with no matching
`!stream|meta|<...>`). The fallback is safe to remove only when
**both** counters are zero across every node. The periodic scan
Contributor
high

The proposed periodic prefix scan of !redis|stream| (line 471) to count legacy keys is extremely inefficient. Since new stream entries are stored under the same prefix (!redis|stream|<key>|entry|<id>, line 400), this scan will iterate over every entry of every stream in the database. In a large-scale deployment, an O(TotalEntries) background task is unsustainable and could cause significant I/O contention. Recommendation: Use a more efficient tracking mechanism, such as a dedicated prefix for legacy blobs that allows O(LegacyStreams) scanning, or a persistent counter that is decremented during migration.

Comment on lines +348 to +350
has been rejected `R` times in the last second, the next rejects
are answered with `RST` (cheap kernel-level reset) instead of an
accept + write + close; (b) document operator-side client
Contributor
medium

The proposal to use RST (TCP Reset) for 'cheap' rejection (line 348) may not provide the intended efficiency gains if implemented within the Go application. In the standard net package, sending an RST requires first calling Accept() to obtain a connection object, then setting SetLinger(0) before closing. This still incurs the overhead of the TCP handshake and file descriptor allocation. True 'cheap' rejection that avoids application-level overhead is typically handled at the OS kernel or firewall level (e.g., iptables REJECT with tcp-reset). If the mitigation must be in-process, a simple Close() after Accept() is nearly as efficient and more portable.

Comment on lines +437 to +438
2. Read the legacy-suffix blob if present; decode only entries
falling in the ID range.
Contributor
medium

The Mode B read rule (line 432) suggests decoding 'only entries falling in the ID range' from the legacy-suffix blob. However, standard Protobuf unmarshaling is an all-or-nothing operation for a message; you cannot partially decode a repeated field without custom, low-level wire-format parsing. Consequently, as long as a legacy suffix exists, reads will remain O(SuffixSize) in CPU and memory. This limitation should be explicitly acknowledged. Furthermore, the migration plan must detail strategies for live or zero-downtime cutovers (e.g., blue/green, dual-write proxy, or logical replication) to ensure this performance profile does not cause service interruption during the cutover.

References
  1. When designing a production-grade system migration, the plan must consider and detail strategies for live or zero-downtime cutovers (e.g., blue/green, dual-write proxy, logical replication) to avoid service interruption.

- Gemini HIGH (line 89 XADD starvation): explicitly flag that Mode A
  first-XADD migration rewrite is O(N) and therefore the ungated
  classification is a latent incident if Layer 4 Mode A lands before
  Layer 1. Add two ordering options: ship together, or dynamically
  gate XADD while the legacy-keys counter is non-zero.
- Gemini Medium (line 438 protobuf partial decode): correct the Mode B
  read rule. Protobuf cannot decode a repeated field partially, so the
  suffix blob is fully unmarshaled and filtered in memory. This matches
  (not contradicts) the Mode B cost model.
- Gemini HIGH (line 475 scan cost) + Codex P2 (line 474 key pattern):
  the prefix-scan approach to counting legacy keys would iterate every
  entry of every migrated stream because entries share the
  !redis|stream| prefix; also the prior wording compared !redis|stream|
  keys to a phantom !stream|meta| key. Replace both with either a
  sidecar legacy-index or a per-stream layout walk (O(num_streams), not
  O(num_entries)). Retract the wrong wording.
- Gemini Medium (line 350 RST overhead): pure-Go net cannot emit a
  true kernel RST without Accept(); clarify that (a) cheap path means
  skip the RESP write + SetLinger(0) close, not no accept. Real
  RST-before-accept is v2 via raw listener/eBPF.
- Codex P1 (line 597 removal criterion contradiction): open-question 5
  referenced only reads_total == 0, contradicting the Layer 4 rule that
  requires BOTH reads_total and keys_total. Tie them together.
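The "dynamically gate XADD while the legacy-keys counter is non-zero" option above reduces to a cheap atomic check in the dispatcher. A minimal sketch, with all names hypothetical (the real counter would be whatever backs `elastickv_stream_legacy_format_keys_total`):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// legacyKeys models the live value behind
// elastickv_stream_legacy_format_keys_total. While it is non-zero, XADD is
// routed through the heavy-command pool, because a first write to a legacy
// stream triggers the O(N) Mode A rewrite.
var legacyKeys atomic.Int64

// shouldGateXADD is the dispatcher-side check: once every legacy blob has
// been rewritten, XADD returns to the ungated fast path.
func shouldGateXADD() bool {
	return legacyKeys.Load() > 0
}

func main() {
	legacyKeys.Store(3) // migration still in progress
	fmt.Println(shouldGateXADD()) // true: XADD goes through the pool
	legacyKeys.Store(0) // all legacy blobs rewritten
	fmt.Println(shouldGateXADD()) // false: XADD back on the fast path
}
```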
@chatgpt-codex-connector

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

@bootjp
Owner Author

bootjp commented Apr 24, 2026

Round 4 in 2ae4056:

  • Gemini HIGH (line 89 XADD starvation): explicitly flag that Mode A's first-XADD migration rewrite is O(N) and therefore the "ungated" classification is a latent incident if Layer 4 Mode A lands before Layer 1. Added two ordering options — ship together (preferred, simpler) or dynamically gate XADD while the legacy-keys counter is non-zero.
  • Gemini Medium (line 438 protobuf partial decode): corrected the Mode B read rule. Protobuf cannot decode a repeated field partially, so the suffix blob is fully unmarshaled and then filtered to the requested ID range in memory. This matches (not contradicts) the Mode B cost model.
  • Gemini HIGH (line 475 scan cost) + Codex P2 (line 474 key pattern): the prefix-scan approach to counting legacy keys would iterate every entry of every migrated stream because entries share the !redis|stream| prefix; the prior wording also compared !redis|stream| keys to a phantom !stream|meta| key that isn't in the layout. Replaced both with either a sidecar legacy-index (!redis|stream_legacy_index|<key> tombstone) or a per-stream layout walk over !redis|stream|<key>|meta keys — O(num_streams), not O(num_entries). Retracted the wrong prefix and noted the authoritative layout.
  • Gemini Medium (line 350 RST overhead): pure-Go net cannot emit a true kernel RST without Accept(); clarified that (a) "cheap" means "skip the RESP write + SetLinger(0) close" (fewer syscalls per reject), not "no accept." Real RST-before-accept via raw listener or eBPF is deferred to v2.
  • Codex P1 (line 597 removal criterion contradiction): open-question 5 referenced only reads_total == 0, contradicting the Layer 4 rule that requires BOTH reads_total and keys_total. Tied them together.

Doc-only PR; no build needed.

/gemini review
@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2ae40564c1


Comment on lines +526 to +528
2. **Layout-walk.** Iterate `!redis|stream|<key>|meta` keys and, for
each stream, probe `!redis|stream|<key>` (the legacy blob key has
no suffix). Scan cost is O(num_streams), not O(num_entries).

P1: Remove meta-only walk from legacy-key counter options

The Layout-walk option counts legacy keys by iterating !redis|stream|<key>|meta, but untouched legacy streams do not have a |meta key (the doc’s Mode A read rule explicitly expects meta to be absent and falls back to the legacy blob). In that common migration state, this walk undercounts to zero even while legacy blobs still exist, so the legacy_format_keys_total removal gate can be satisfied prematurely and the fallback can be removed while cold legacy streams remain unreadable.


Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request proposes a multi-layered workload isolation strategy for the elastickv system to address resource starvation issues. The design outlines four layers of defense: a bounded worker pool for heavy commands, pinning Raft goroutines to OS threads, per-client admission control via connection caps, and a storage layout optimization for Redis streams that moves from single-blob storage to per-entry keys. I have no feedback to provide.

@bootjp bootjp merged commit bbf31eb into main Apr 24, 2026
8 checks passed
@bootjp bootjp deleted the ops/workload-isolation-design branch April 24, 2026 16:29
@bootjp
Owner Author

bootjp commented Apr 24, 2026

Codex P1 (line 528, legacy-key walk via |meta): the design doc at the HEAD of this branch (commit 2ae4056, round 4) already retracts the incorrect "walk |meta" wording. The current text explicitly states:

A "walk |meta keys" alternative was considered and rejected: untouched legacy streams never acquire a |meta key (Mode A's read rule explicitly expects |meta to be absent and falls back to the legacy blob). Iterating |meta would miss exactly the cold legacy streams that this counter exists to surface, reintroducing the same blind spot the paired counter is closing.

The authoritative option is a !redis|stream_legacy_index|<logical> sidecar marker written when a legacy blob is created and deleted in the same commit as the Mode A migration write. The counter becomes SCAN !redis|stream_legacy_index|, bounded by legacy-blob count rather than total entries.

Codex appears to be reviewing a pre-round-4 snapshot. No code or doc change needed.
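The sidecar-index counter quoted above has a small, bounded scan shape. A minimal sketch, using an in-memory map as a stand-in for the KV store and the prefixes from the doc's layout (`countLegacyStreams` and the map contents are illustrative):

```go
package main

import (
	"fmt"
	"strings"
)

// legacyIndexPrefix follows the doc's sidecar-marker layout: a marker exists
// iff a legacy blob exists, so the scan is bounded by legacy-blob count
// rather than total entries.
const legacyIndexPrefix = "!redis|stream_legacy_index|"

// countLegacyStreams models the periodic scan feeding
// elastickv_stream_legacy_format_keys_total.
func countLegacyStreams(kv map[string][]byte) int {
	n := 0
	for k := range kv {
		if strings.HasPrefix(k, legacyIndexPrefix) { // marker scan only
			n++
		}
	}
	return n
}

func main() {
	kv := map[string][]byte{
		"!redis|stream|orders|entry|1-0": []byte("payload"), // migrated entries are never touched
		"!redis|stream|orders|meta":      []byte("meta"),
		legacyIndexPrefix + "orders":     {}, // marker: a legacy blob still exists for "orders"
		"!redis|stream|logs|entry|5-0":   []byte("payload"),
	}
	fmt.Println(countLegacyStreams(kv)) // counts markers, not entries: 1
}
```

Because migrated-stream entries share the `!redis|stream|` prefix but never the `stream_legacy_index` one, the walk stays O(legacy blobs) as the retraction requires.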

/gemini review
@codex review

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bbf31eb84c


Comment on lines +526 to +529
2. **Layout-walk.** Iterate `!redis|stream|<key>|meta` keys and, for
each stream, probe `!redis|stream|<key>` (the legacy blob key has
no suffix). Scan cost is O(num_streams), not O(num_entries).
Equivalent answer; avoids the sidecar index.

P1: Remove |meta walk as equivalent legacy-key counter option

Option 2 is not equivalent to the sidecar index for Mode A: untouched legacy streams have no !redis|stream|<key>|meta key (the read rule above explicitly handles meta missing + legacy blob present), so iterating |meta keys will miss exactly the cold legacy streams this counter is meant to detect. If fallback removal is gated by that undercounted metric, compatibility code can be removed while legacy blobs still exist, making those streams unreadable until they are rewritten.

