docs(design): propose workload-class isolation after 2026-04-24 XREAD starvation (#619)
Code Review
This design document outlines a multi-layered approach to workload isolation in response to a CPU starvation incident caused by expensive XREAD operations. The proposal includes a bounded worker pool for heavy commands, per-client admission control, and a transition to per-entry storage for streams to improve performance. Feedback focuses on mitigating risks during stream migration, ensuring container-aware resource allocation, refining command classification for the worker pool, and improving client-side observability during load shedding.
> 3. On the next write (`XADD`/`XDEL`/...), rewrite to per-entry and
>    delete the legacy blob in the same commit.
Performing a full conversion of a large legacy stream during a synchronous write operation (XADD/XDEL) could be extremely expensive and might re-introduce CPU starvation or cause OOM, as it requires unmarshaling the entire blob and creating many new keys in a single Raft commit. This migration plan must detail strategies for zero-downtime cutovers to avoid service interruption. Consider an incremental migration strategy or a size-limited synchronous conversion to mitigate latency spikes and Raft heartbeat drops for very large streams.
Suggested change:

```diff
- 3. On the next write (`XADD`/`XDEL`/...), rewrite to per-entry and
-    delete the legacy blob in the same commit.
+ 3. On the next write (`XADD`/`XDEL`/...), perform an incremental or size-limited rewrite
+    to per-entry and delete the legacy blob in the same commit.
```
References
- Production-grade system migrations must detail strategies for live or zero-downtime cutovers to avoid service interruption.
> Static v1 classification:
>
> - **Pool-gated:** `XREAD`, `XRANGE`, `XREVRANGE`, `KEYS`, `SCAN`,
>   `HGETALL`, `SMEMBERS`, full-range `ZRANGE`/`ZRANGEBYSCORE`,
The 'full-range' qualifier for ZRANGE/ZRANGEBYSCORE suggests that the dispatcher must inspect command arguments (e.g., checking for 0 -1) before offloading. This contradicts the stated goal of 'static classification' by command name. If the gating is intended to be strictly static, all ZRANGE calls should be included, or the criteria for 'full-range' should be explicitly defined to avoid ambiguity in the implementation.
Suggested change:

```diff
- `HGETALL`, `SMEMBERS`, full-range `ZRANGE`/`ZRANGEBYSCORE`,
+ `HGETALL`, `SMEMBERS`, `ZRANGE`, `ZRANGEBYSCORE`,
```
> ### Recommended v1 shape
>
> Package-level pool in `adapter/` with a `Submit(command, fn)` entry
> point, sized `2 × runtime.GOMAXPROCS(0)` (env-overridable). Gated
Relying on runtime.GOMAXPROCS(0) for pool sizing can be unreliable in containerized environments (like Kubernetes) where it may return the host's CPU count instead of the container's quota. This could result in an oversized worker pool that fails to provide the intended isolation. It is recommended to ensure the implementation is container-aware, for example by using a library like uber-go/automaxprocs or explicitly respecting cgroup limits.
Suggested change:

```diff
- point, sized `2 × runtime.GOMAXPROCS(0)` (env-overridable). Gated
+ point, sized `2 × runtime.GOMAXPROCS(0)` (env-overridable, container-aware). Gated
```
References
- Design documents identifying significant operational risks must detail mitigation strategies.
> ### v1 vs later
>
> - **v1:** static list, single global pool, reject on full.
Sharing a single global pool for all gated commands means that a burst of management or debug commands (like KEYS * or SCAN) can still starve critical data-path commands (like XREAD or Lua scripts) by exhausting pool slots and triggering -BUSY responses. To achieve true workload-class isolation, consider using separate sub-pools or a slot reservation system for critical command classes even in the v1 implementation.
Suggested change:

```diff
- - **v1:** static list, single global pool, reject on full.
+ - **v1:** static list, per-class sub-pools or reservations, reject on full.
```
> enforced at accept.** On reject, close the TCP connection immediately
> (no RESP — clients retry the connect). Per-client in-flight
Abruptly closing the TCP connection without a protocol-level error (RESP) can lead to poor observability for clients, as they will see 'connection reset' or 'EOF' instead of a clear 'server overloaded' message. This makes it harder to distinguish between network issues and intentional load shedding. Consider accepting the connection and sending a -BUSY error before closing, or ensuring that client libraries are specifically tuned to interpret abrupt closures as a backpressure signal.
Suggested change:

```diff
- enforced at accept.** On reject, close the TCP connection immediately
- (no RESP — clients retry the connect). Per-client in-flight
+ enforced at accept. On reject, consider accepting and replying with -BUSY before closing
+ (no RESP — clients retry the connect). Per-client in-flight
```
- Layer 1 (line 84): replace "full-range ZRANGE/ZRANGEBYSCORE" with unconditional gating of the whole ZRANGE family. Arg-inspection contradicted the stated "static, byte-level classification"; a bounded `ZRANGE 0 10` costs at most one unmarshal, and gating it is cheaper than the dispatcher branch that would distinguish the bounded vs unbounded case.
- Layer 1 (line 122, container awareness): document that `runtime.GOMAXPROCS(0)` returns the host CPU count on Linux without cgroup awareness, and call out two mitigations (operator-set `GOMAXPROCS` env at rolling-update level, or wire uber-go/automaxprocs). v1 prefers the operator-set path for auditability; automaxprocs is acceptable as a follow-up.
- Layer 1 (line 139, single-pool starvation): acknowledge the risk that KEYS/SCAN bursts can exhaust pool slots and starve XREAD/Lua. v1 still ships a single pool but requires a per-command submit metric so a tier split is measurable from observability rather than guessed; sub-pools/slot reservation are the named follow-up.
- Layer 3 (line 252, reject semantics): change the v1 shape from "close TCP without RESP" to "accept, write `-ERR`, then close." The protocol-level error is the signal that distinguishes "server overload" from "network blip" on the client side.
- Layer 4 (line 315, synchronous migration cost): add a chunked migration section. A single XADD on a 100k-entry legacy blob would rewrite every entry in one Raft commit, reproducing the CPU/commit spike the design is supposed to prevent. Document `STREAM_MIGRATION_CHUNK` (default 1024) and the rolling drain model; explicitly scope chunked migration as a stacked follow-up to PR #620 (which ships the simple one-txn version).
Addressed Gemini design review in 9d44b6c:

/gemini review
Code Review
This pull request proposes a four-layer workload isolation strategy to prevent resource starvation, covering worker pools for heavy commands, Raft thread pinning, per-client admission control, and stream storage optimization. The review identifies several significant design issues, including a potential deadlock when using XREAD BLOCK within a fixed-size pool, a correctness bug in the dual-format read logic during partial migrations, and the need for context propagation to handle Lua recursion safely. Feedback also highlights a contradiction regarding the classification of optimized stream commands and warns of potential connection storms under the proposed admission control mechanism.
> - **Pool-gated:** `XREAD`, `XRANGE`, `XREVRANGE`, `KEYS`, `SCAN`,
>   `HGETALL`, `HVALS`, `HKEYS`, `SMEMBERS`, `SUNION`, `SINTER`,
>   `ZRANGE`, `ZRANGEBYSCORE`, `ZRANGEBYLEX`, `EVAL`/`EVALSHA`,
>   `FCALL`/`FCALL_RO`, the `*SCAN` family.
Gating XREAD to a small, fixed-size worker pool (2 × GOMAXPROCS) introduces a significant availability risk because XREAD is frequently used with the BLOCK option. A few long-polling clients could exhaust all worker slots, effectively performing a Denial of Service on all other "heavy" commands (such as KEYS, SCAN, or EVAL) that share the same pool. The design should specify that blocking operations are handled outside this bounded pool, or that only the CPU-intensive data processing (unmarshaling/scanning) is offloaded to the pool after the block condition is satisfied.
References
- When a design document identifies a significant operational risk, it must also detail potential mitigation strategies.
> 1. On XREAD/XRANGE/XLEN/XREVRANGE, try the new per-entry layout first.
> 2. If empty AND the legacy single-blob key exists, fall back to the
>    legacy path.
The dual-format read logic described here contains a correctness bug when combined with the "Chunked migration" strategy (lines 353-364). If a stream is partially migrated, the new per-entry layout will not be empty, which prevents the fallback to the legacy-suffix key in step 2. Consequently, readers will only see the migrated chunk and miss the remaining data still stored in the legacy format. The read path must be designed to check for and merge results from the legacy-suffix key whenever it exists, regardless of whether the new layout contains entries.
References
- When designing a production-grade system migration, the plan must consider and detail strategies for live or zero-downtime cutovers to avoid service interruption.
> - **(A) Outer holds the slot; inner `redis.call` runs inline and
>   ignores the pool.** No deadlock, inner cost shares outer accounting.
>   **Recommended v1.**
To implement Recommendation (A) for Lua recursion safely in Go, the context.Context passed through the command dispatcher must carry a flag indicating if the current goroutine is already executing within a heavy-command worker. Without this context propagation, the Submit entry point cannot reliably distinguish an "inner" call from a new client request, which would lead to the deadlock described in the "Risk to flag" section.
> enforced at accept.** On reject, accept the TCP connection, write a
> `-ERR max connections per client exceeded` RESP error, then close —
> so the client sees a protocol-level message instead of a bare
> `connection reset` or `EOF` that's indistinguishable from a real
The strategy of accepting, erroring, and closing connections that exceed the per-IP cap may trigger a "connection storm" if the client uses an aggressive connection pool. Similar to the -BUSY retry storms discussed in the Open Questions (line 453), the design should consider how to mitigate the CPU overhead of repeatedly accepting and closing these connections. Documenting recommended client-side backoff for connection failures or implementing a small server-side delay before closing could help mitigate this.
References
- When a design document identifies a significant operational risk, it must also detail potential mitigation strategies.
> with Layer 4 — streams become a cheap command once Layer 4 ships,
> but the pool still catches Lua, KEYS, and HGETALL.
There is a contradiction between the command classification in Layer 1 and the sequencing goals. While line 84 lists XREAD as "Pool-gated", line 430 states that "streams become a cheap command once Layer 4 ships". If Layer 4 successfully optimizes XREAD to O(new), it should be moved to the "Ungated" list to avoid the unnecessary overhead of the worker pool hop (enqueue/pickup). The design should clarify the final classification of optimized stream commands.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9d44b6c113
> 1. On XREAD/XRANGE/XLEN/XREVRANGE, try the new per-entry layout first.
> 2. If empty AND the legacy single-blob key exists, fall back to the
>    legacy path.
Merge both stream formats during chunked migration
The migration algorithm becomes incorrect once chunking is introduced: after the first chunk rewrite, part of a stream exists in the new per-entry layout and part remains in the legacy suffix, but this read rule only falls back to legacy when the new layout is empty. In that mixed state, XREAD/XRANGE can return incomplete results (e.g., IDs before the chunk boundary only see the migrated prefix), so the design needs a combined read path (or a migration order/marker that guarantees one source is authoritative per ID range).
> **Container-aware sizing.** `runtime.GOMAXPROCS(0)` on Linux returns
> the host CPU count, not the container's cgroup quota (unless
> `GOMAXPROCS` is set explicitly or something has configured it).
Update GOMAXPROCS guidance for Go 1.25+ behavior
This guidance is factually stale for this repo’s toolchain (go 1.25.0): modern Go already derives default GOMAXPROCS from cgroup CPU quota on Linux unless explicitly disabled, so documenting that it always uses host CPU count will push operators toward unnecessary manual overrides. That override can also disable runtime auto-updates when quotas change, leading to avoidable mis-sizing.
- Layer 1 classification (line 87, Gemini HIGH): carve out blocking XREAD/BLPOP/BRPOP/BZPOP from pool gating. Blocking variants hold a slot while idle and trivially exhaust the pool; handle them on their own goroutine and re-gate on wake-up.
- Layer 1 Lua recursion (line 117, Gemini Medium): document the context-propagation mechanism that makes option A implementable. A package-private sentinel value on ctx distinguishes an inner call from a new request; external callers cannot fake it.
- Layer 1 GOMAXPROCS (line 134, Codex P2): correct stale guidance. Go 1.25+ already derives GOMAXPROCS from the cgroup v2 quota, so the automaxprocs recommendation no longer applies to this repo; keep the override env knob for operators who want explicit control.
- Layer 3 reject semantics (line 283, Gemini Medium): add a reject-storm mitigation. Rate-limit the reject itself — after R rejects/s to one peer, switch from accept+write+close to RST; recommend client-side backoff in the ops runbook.
- Layer 4 migration (line 345, Gemini HIGH + Codex P1): split the migration doc into Mode A (simple, PR #620 ships this) and Mode B (chunked, stacked follow-up). The read rules differ: Mode A has no mixed state, so fall-through is correct; Mode B has a legal mixed state and MUST always merge both layouts. Making the distinction explicit prevents the dual-read correctness bug the reviewers flagged — Mode B with Mode A's fall-through rule returns incomplete results during chunking.
- Sequencing (line 431, Gemini Medium): resolve the XREAD gated-vs-cheap contradiction. Layer 4 makes XREAD steady-state O(new), but we keep it gated in Layer 1 v1 for three concrete reasons (large XRANGE bounds, legacy-fallback window, data-driven promotion); promotion to ungated is gated on the pool-submit metric added in Layer 1.
Addressed round-2 in 731475e:

/gemini review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 731475ebf9
> consumers are active. v1 resolution: **blocking variants (`XREAD
> BLOCK`, `BLPOP`, `BRPOP`, `BZPOPMIN`/`MAX`) bypass the heavy-command
> pool and are handled on their own goroutine**. They are I/O-bound
> waiting, not CPU-bound; their CPU cost lands on wake-up, when the
Keep XREAD BLOCK under pool throttling
This proposal says blocking variants (including XREAD BLOCK) should bypass the heavy-command pool because they are "I/O-bound waiting," but the current implementation in adapter/redis_compat_commands.go (xread) busy-polls in a loop and repeatedly calls xreadOnce while waiting. That means blocked readers still execute stream reads/unmarshal work during the wait window, so bypassing the pool would let a client evade the new CPU guardrail simply by using BLOCK, recreating the starvation path this design is trying to prevent.
> `elastickv_stream_legacy_format_reads_total` counts reads that
> touched a legacy-format key in either mode. Remove the legacy
> fallback only after it has sat at zero across all nodes for a soak
> window.
Require legacy-key depletion before dropping dual-read fallback
Using elastickv_stream_legacy_format_reads_total == 0 as the removal criterion can drop access to valid data: in Mode A, migration happens on the next write, so cold streams that are neither written nor read during the soak period can remain in legacy format indefinitely while this metric stays at zero. If fallback is removed on that signal alone, later reads of those dormant streams will miss existing data; the exit condition needs to prove no legacy-format keys remain, not just no recent legacy reads.
Code Review
This pull request proposes a four-layer architectural design for workload-class isolation to prevent expensive command paths from starving the Raft control plane. The strategy includes implementing a bounded worker pool for heavy commands, per-IP connection limits, and migrating Redis streams to a per-entry storage layout to reduce unmarshaling overhead. Feedback identifies a performance regression risk during chunked migration where legacy blobs still require full unmarshaling, points out that current blocking command implementations use CPU-intensive busy-polling rather than being purely I/O-bound, and highlights a context propagation gap that would prevent the proposed Lua recursion deadlock mitigation from functioning correctly.
> **Mode B — chunked migration (follow-up):** each write drains at
> most `STREAM_MIGRATION_CHUNK` (default 1024) entries from the legacy
> blob into per-entry keys, and leaves the rest in a *legacy-suffix*
> key until a subsequent write drains more. During this window the
> stream exists in BOTH layouts simultaneously: the oldest N entries
> are per-entry, the newer M entries are still in the suffix blob.
>
> Read rule for Mode B — **always merge both layouts**, do not
> fall-through on "new layout empty":
>
> 1. Read `meta` if present; read all per-entry keys that match the
>    requested ID range.
> 2. Read the legacy-suffix blob if present; decode only entries
>    falling in the ID range.
> 3. Merge by ID order, deduplicate (the migrator is responsible for
>    never writing the same ID in both layouts in a single commit), and
>    return.
The read rule for Mode B (chunked migration) introduces a significant performance regression that may undermine the isolation goals. Since the legacy format is a single protobuf blob, decoding requires unmarshaling the entire blob, leading to O(N_suffix) CPU/memory usage. This poses an operational risk during the migration window. As this is a production-grade system migration, the plan must detail mitigation strategies (e.g., a temporary 'bridge' or 'dual-write proxy') to ensure a zero-downtime cutover and avoid service interruption.
References
- When designing a production-grade system migration, the plan must consider and detail strategies for live or zero-downtime cutovers (e.g., blue/green, dual-write proxy, logical replication) to avoid service interruption.
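The always-merge read rule (step 3 of the quoted Mode B rule) can be sketched as a two-pointer merge over both layouts, assuming each side is already sorted by ID; string comparison stands in here for real ms-seq stream-ID ordering:

```go
package main

// entry is a simplified stream entry for illustration; real stream
// IDs compare numerically on ms and seq, not lexicographically.
type entry struct {
	ID    string
	Value string
}

// mergeLayouts combines per-entry results with entries decoded from
// the legacy-suffix blob, both sorted by ID, deduplicating on ID
// (per-entry wins, since the migrator owns the authoritative copy).
func mergeLayouts(perEntry, legacy []entry) []entry {
	out := make([]entry, 0, len(perEntry)+len(legacy))
	i, j := 0, 0
	for i < len(perEntry) && j < len(legacy) {
		switch {
		case perEntry[i].ID < legacy[j].ID:
			out = append(out, perEntry[i])
			i++
		case perEntry[i].ID > legacy[j].ID:
			out = append(out, legacy[j])
			j++
		default: // same ID in both layouts: per-entry is authoritative
			out = append(out, perEntry[i])
			i++
			j++
		}
	}
	out = append(out, perEntry[i:]...)
	out = append(out, legacy[j:]...)
	return out
}
```

Note the cost caveat from the review still applies: producing the `legacy` slice requires unmarshaling the whole suffix blob before filtering.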
> consumers are active. v1 resolution: **blocking variants (`XREAD
> BLOCK`, `BLPOP`, `BRPOP`, `BZPOPMIN`/`MAX`) bypass the heavy-command
> pool and are handled on their own goroutine**. They are I/O-bound
> waiting, not CPU-bound; their CPU cost lands on wake-up, when the
> dispatcher can re-evaluate whether to gate the follow-up work. The
> gating decision is made in `dispatchCommand` when it sees the command
> name plus the `BLOCK`/`B*` prefix — the simplest arg inspection we
> allow, limited to "is this command blocking?".
The assumption that blocking variants like BZPOPMIN are purely I/O-bound and not CPU-bound is technically incorrect for the current implementation. As seen in adapter/redis_compat_commands.go:3432, bzpopmin uses a busy-poll loop with time.Sleep(redisBusyPollBackoff). While time.Sleep yields the P, each wake-up triggers a KV lookup and Raft leadership verification, which are CPU-intensive operations. If many clients use blocking commands, the aggregate CPU load from polling could still starve the runtime. The design should either propose moving blocking commands to a truly event-driven model (e.g., using the internal pubsub/relay system) or acknowledge this as a remaining gap in the isolation model.
> **Implementation note for (A): context propagation.** `Submit`
> identifies "inside a pool slot" by attaching a sentinel value to
> `context.Context` (`ctxKeyInPoolSlot`). The Lua adapter threads that
> `ctx` into every `redis.call` it makes; the dispatcher's pool-gate
> check returns immediately when `ctx.Value(ctxKeyInPoolSlot) != nil`
> instead of attempting another `Submit`. This is the only mechanism
> that reliably distinguishes "new client request" from "inner call"
> without tagging every goroutine or holding a pool-wide set of
> goroutine IDs. The sentinel must be package-private so external
> callers cannot fake it.
The proposed context propagation for the Lua recursion trap has a gap in the current architecture. RedisServer.runLuaScript (and the eval handler) currently creates a new context using context.WithTimeout(context.Background(), ...) (see adapter/redis_lua.go:118). If eval is executed within a worker pool slot, this new context will lose the ctxKeyInPoolSlot sentinel. To make this mechanism work, runLuaScript and all specialized command handlers in RedisServer must be refactored to accept and propagate the context provided by the dispatcher/pool, rather than starting from Background().
- Layer 1 blocking variants (Codex P1 + Gemini Medium): retract the previous "I/O-bound, bypass the pool" decision. The current adapter busy-polls `XREAD BLOCK` / `B*POP` with `time.Sleep` plus a KV+leader re-check on each wake, so bypassing would hand unbounded CPU to them. v1 keeps blocking variants gated; a condvar-based wake-up is listed as the prerequisite for an honest bypass in a stacked follow-up.
- Layer 1 Lua ctx caveat (Gemini Medium): `runLuaScript` currently rebuilds the script context from `context.Background()`, which drops the `ctxKeyInPoolSlot` sentinel `Submit` installed when it dispatched the outer EVAL. Called out explicitly that Layer 1 MUST NOT ship before `runLuaScript` derives its context from the caller-supplied ctx; otherwise Lua scripts inside the pool self-deadlock on every `redis.call`.
- Layer 4 Mode B cost (Gemini HIGH): Mode B preserves the O(N_suffix) unmarshal cost during the migration window, not constant O(new). Added a read-driven drain mitigation, a top-N legacy-suffix-size metric for operator visibility, and an explicit note that Mode A is the conservative choice for operators who can't tolerate transient cost.
- Layer 4 removal criterion (Codex P1): just "no legacy reads" could keep the fallback alive on cold streams forever. Require a paired `elastickv_stream_legacy_format_keys_total` derived from a periodic prefix-scan; the fallback is safe to remove only when both counters are zero.
Round 3 in 104237d:

/gemini review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 104237dd85
> compatibility code go away? Proposed:
> `elastickv_stream_legacy_format_reads_total` = 0 for 30 days
> across all nodes → remove in a follow-up PR. 30 days is arbitrary;
Use both migration counters for fallback removal
This open-question proposal conflicts with the Layer 4 removal rule above by allowing legacy fallback removal based only on elastickv_stream_legacy_format_reads_total. In the cold-stream scenario the doc already calls out, reads can stay at zero while legacy-format keys still exist, so following this criterion can remove compatibility code too early and make those streams unreadable after rollout.
> periodic prefix scan (`!redis|stream|<...>` with no matching
> `!stream|meta|<...>`). The fallback is safe to remove only when
Match legacy-scan key pattern to the defined stream layout
The scan rule compares !redis|stream|<...> keys to a missing !stream|meta|<...> key, but the layout defined earlier in this doc uses !redis|stream|<key>|meta. If implemented literally, migrated streams can be misclassified as legacy because the meta lookup pattern is wrong, which breaks the new elastickv_stream_legacy_format_keys_total signal used to decide when fallback can be removed.
Code Review
This pull request proposes a multi-layered design for workload isolation to prevent CPU starvation, including a heavy-command worker pool, Raft thread pinning, per-IP connection caps, and a granular stream storage layout. Feedback focuses on operational risks during migration, specifically the starvation potential of ungated XADD commands and the inefficiency of the proposed legacy key scanning. Additionally, technical corrections were provided regarding the overhead of Go-level TCP resets and the all-or-nothing nature of Protobuf decoding which impacts the performance claims of the proposed migration modes.
> - **Ungated:** `GET`, `SET`, `DEL`, `EXISTS`, `INCR`, `EXPIRE`, `TTL`,
>   `HGET`, `HSET`, `LPUSH`/`RPUSH`, `XADD`, single-key fast paths.
The classification of XADD as Ungated (line 89) poses a starvation risk during the transition to the new stream layout. As noted in the Layer 4 problem description (line 384), the current loadStreamAt implementation is O(N). Even after Layer 4 is implemented, the Mode A migration (line 414) requires a full rewrite of the legacy blob on the first write. If XADD remains ungated, a burst of writes to large legacy streams will trigger multiple O(N) unmarshal/marshal cycles on the dispatcher goroutines, potentially reproducing the CPU starvation incident. As this identifies a significant operational risk, the document must detail potential mitigation strategies, such as implementing a temporary 'bridge' or 'proxy' mode to handle the transition safely. Additionally, there is a discrepancy between this document and the codebase: lines 484 and 507 state that Hashes, Sets, and ZSets are currently 'one-blob', but adapter/redis_compat_helpers.go (lines 373, 419, 473) already implements wide-column loading and migration for these types.
References
- When a design document identifies a significant operational risk, such as the inability to perform rolling upgrades, it must also detail potential mitigation strategies, like implementing a temporary "bridge" or "proxy" mode.
> the fallback. Add a paired counter
> `elastickv_stream_legacy_format_keys_total` populated from a
> periodic prefix scan (`!redis|stream|<...>` with no matching
> `!stream|meta|<...>`). The fallback is safe to remove only when
> **both** counters are zero across every node. The periodic scan
The proposed periodic prefix scan of !redis|stream| (line 471) to count legacy keys is extremely inefficient. Since new stream entries are stored under the same prefix (!redis|stream|<key>|entry|<id>, line 400), this scan will iterate over every entry of every stream in the database. In a large-scale deployment, an O(TotalEntries) background task is unsustainable and could cause significant I/O contention. Recommendation: Use a more efficient tracking mechanism, such as a dedicated prefix for legacy blobs that allows O(LegacyStreams) scanning, or a persistent counter that is decremented during migration.
> has been rejected `R` times in the last second, the next rejects
> are answered with `RST` (cheap kernel-level reset) instead of an
> accept + write + close; (b) document operator-side client
The proposal to use RST (TCP Reset) for 'cheap' rejection (line 348) may not provide the intended efficiency gains if implemented within the Go application. In the standard net package, sending an RST requires first calling Accept() to obtain a connection object, then setting SetLinger(0) before closing. This still incurs the overhead of the TCP handshake and file descriptor allocation. True 'cheap' rejection that avoids application-level overhead is typically handled at the OS kernel or firewall level (e.g., iptables REJECT with tcp-reset). If the mitigation must be in-process, a simple Close() after Accept() is nearly as efficient and more portable.
> 2. Read the legacy-suffix blob if present; decode only entries
>    falling in the ID range.
The Mode B read rule (line 432) suggests decoding 'only entries falling in the ID range' from the legacy-suffix blob. However, standard Protobuf unmarshaling is an all-or-nothing operation for a message; you cannot partially decode a repeated field without custom, low-level wire-format parsing. Consequently, as long as a legacy suffix exists, reads will remain O(SuffixSize) in CPU and memory. This limitation should be explicitly acknowledged. Furthermore, the migration plan must detail strategies for live or zero-downtime cutovers (e.g., blue/green, dual-write proxy, or logical replication) to ensure this performance profile does not cause service interruption during the cutover.
References
- When designing a production-grade system migration, the plan must consider and detail strategies for live or zero-downtime cutovers (e.g., blue/green, dual-write proxy, logical replication) to avoid service interruption.
- Gemini HIGH (line 89 XADD starvation): explicitly flag that the Mode A first-`XADD` migration rewrite is O(N) and therefore the ungated classification is a latent incident if Layer 4 Mode A lands before Layer 1. Add two ordering options: ship together, or dynamically gate `XADD` while the legacy-keys counter is non-zero.
- Gemini Medium (line 438 protobuf partial decode): correct the Mode B read rule. Protobuf cannot decode a repeated field partially, so the suffix blob is fully unmarshaled and filtered in memory. This matches (not contradicts) the Mode B cost model.
- Gemini HIGH (line 475 scan cost) + Codex P2 (line 474 key pattern): the prefix-scan approach to counting legacy keys would iterate every entry of every migrated stream because entries share the `!redis|stream|` prefix; also the prior wording compared `!redis|stream|` keys to a phantom `!stream|meta|` key. Replace both with either a sidecar legacy-index or a per-stream layout walk (O(num_streams), not O(num_entries)). Retract the wrong wording.
- Gemini Medium (line 350 RST overhead): pure-Go `net` cannot emit a true kernel RST without `Accept()`; clarify that the cheap path means skipping the RESP write plus a `SetLinger(0)` close, not skipping the accept. Real RST-before-accept is v2 via raw listener/eBPF.
- Codex P1 (line 597 removal criterion contradiction): open-question 5 referenced only `reads_total == 0`, contradicting the Layer 4 rule that requires BOTH `reads_total` and `keys_total`. Tie them together.
Round 4 in 2ae4056:
Doc-only PR; no build needed. /gemini review |
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2ae40564c1
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
> 2. **Layout-walk.** Iterate `!redis|stream|<key>|meta` keys and, for
> each stream, probe `!redis|stream|<key>` (the legacy blob key has
> no suffix). Scan cost is O(num_streams), not O(num_entries).
**Remove meta-only walk from legacy-key counter options**
The Layout-walk option counts legacy keys by iterating !redis|stream|<key>|meta, but untouched legacy streams do not have a |meta key (the doc’s Mode A read rule explicitly expects meta to be absent and falls back to the legacy blob). In that common migration state, this walk undercounts to zero even while legacy blobs still exist, so the legacy_format_keys_total removal gate can be satisfied prematurely and the fallback can be removed while cold legacy streams remain unreadable.
Code Review
This pull request proposes a multi-layered workload isolation strategy for the elastickv system to address resource starvation issues. The design outlines four layers of defense: a bounded worker pool for heavy commands, pinning Raft goroutines to OS threads, per-client admission control via connection caps, and a storage layout optimization for Redis streams that moves from single-blob storage to per-entry keys. I have no feedback to provide.
Codex P1 (line 528, legacy-key walk via `|meta`): the authoritative option is the sidecar legacy-index; Codex appears to be reviewing a pre-round-4 snapshot. No code or doc change needed. /gemini review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bbf31eb84c
> 2. **Layout-walk.** Iterate `!redis|stream|<key>|meta` keys and, for
> each stream, probe `!redis|stream|<key>` (the legacy blob key has
> no suffix). Scan cost is O(num_streams), not O(num_entries).
> Equivalent answer; avoids the sidecar index.
**Remove `|meta` walk as equivalent legacy-key counter option**
Option 2 is not equivalent to the sidecar index for Mode A: untouched legacy streams have no !redis|stream|<key>|meta key (the read rule above explicitly handles meta missing + legacy blob present), so iterating |meta keys will miss exactly the cold legacy streams this counter is meant to detect. If fallback removal is gated by that undercounted metric, compatibility code can be removed while legacy blobs still exist, making those streams unreadable until they are rewritten.
Summary
Design doc (only — no code in this PR) for a four-layer workload-isolation model, prompted by the 2026-04-24 incident's afternoon phase.
Problem: one client host with 37 connections running a tight `XREAD` loop consumed 14 CPU cores on the leader via `loadStreamAt → unmarshalStreamValue → proto.Unmarshal` (81% of CPU per pprof). Raft goroutines couldn't get CPU → `step_queue_full = 75,692` on the leader (vs 0-119 on followers) → Raft commit p99 jumped to 6-10s, Lua p99 stuck at 6-8s. Follower replication was healthy (applied-index within 34 of the leader); the damage was entirely CPU scheduling on the leader.

Gap: elastickv has no explicit workload-class isolation. Go's scheduler treats every goroutine equally; a single heavy command path can starve unrelated paths (raft, lease, Lua, GET/SET).
Four-layer defense model
1. **Layer 1: bounded worker pool for heavy commands** (sized `2 × GOMAXPROCS`); reply `-BUSY` when full. Cheap commands keep their own fast path.
2. **Layer 2: pin Raft goroutines to OS threads.** `runtime.LockOSThread()` on the Ready loop + dispatcher lanes so the Go scheduler can't starve them. Not v1: only if measurement after Layers 1 + 4 still shows `step_queue_full > 0`.
3. **Layer 3: per-client admission control** via connection caps.
4. **Layer 4: per-entry stream storage** (`!redis|stream|<key>|entry|<id>`) with range-scan, dual-read migration fallback, legacy removal gated on `elastickv_stream_legacy_format_reads_total == 0`. Hashes/sets/zsets share the same one-blob pattern and are called out as follow-up.

Recommended sequencing
Layer 4 (correctness bug, concentrated change) → Layer 1 (generic defense for next unknown hotspot) → Layer 3 (reconcile with roadmap item 6) → Layer 2 (only if forced by measurement).
Relationship to other in-flight work
`docs/design/2026_04_24_proposed_resilience_roadmap.md` item 6 (admission control). This doc's Layer 3 focuses on per-client fairness; the roadmap's item 6 is global in-flight capping. Both are needed.

Open questions called out in the doc
- `-BUSY` backoff semantics: how do we avoid client retry spinning becoming the new hot loop?

Deliverable
`docs/design/2026_04_24_proposed_workload_isolation.md`: 446 lines, dated-prefix / `**Status: Proposed**` convention matching the rest of `docs/design/`. No code.

Test plan
- `origin/main`
- `2026_04_24_proposed_resilience_roadmap.md` reconciled (complements, doesn't duplicate)