
docs(design): propose workload-class isolation after 2026-04-24 XREAD starvation#619

Merged
bootjp merged 5 commits into main from
ops/workload-isolation-design
Apr 24, 2026

Conversation

@bootjp
Owner

@bootjp bootjp commented Apr 24, 2026

Summary

Design doc (no code in this PR) for a four-layer workload-isolation model, prompted by the afternoon phase of the 2026-04-24 incident.

Problem: During the incident, one client host with 37 connections running a tight XREAD loop consumed 14 CPU cores on the leader via loadStreamAt → unmarshalStreamValue → proto.Unmarshal (81% of leader CPU per pprof). Raft goroutines could not get CPU time → step_queue_full = 75,692 on the leader (vs 0-119 on followers) → Raft commit p99 jumped to 6-10s and Lua p99 was stuck at 6-8s. Follower replication stayed healthy (applied index within 34 of the leader); the damage was entirely CPU scheduling on the leader.

Gap: elastickv has no explicit workload-class isolation. Go's scheduler treats every goroutine equally, so a single heavy command path can starve unrelated paths (Raft, lease, Lua, GET/SET).

Four-layer defense model

  • Layer 1 — heavy-command worker pool: gate XREAD / KEYS / SCAN / Lua onto a bounded pool (~2 × GOMAXPROCS); reply -BUSY when full. Cheap commands keep their own fast path.
  • Layer 2 — locked OS threads for raft: runtime.LockOSThread() on the Ready loop + dispatcher lanes so the Go scheduler can't starve them. Not v1 — only if measurement after Layer 1 + 4 still shows step_queue_full > 0.
  • Layer 3 — per-client admission control: per-peer-IP connection cap (default 8). Extends, doesn't replace, roadmap item 6's global in-flight semaphore.
  • Layer 4 — XREAD O(N) → O(new): entry-per-key layout (!redis|stream|<key>|entry|<id>) with range-scan, dual-read migration fallback, legacy-removal gated on elastickv_stream_legacy_format_reads_total == 0. Hashes/sets/zsets share the same one-blob pattern and are called out as follow-up.
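Layer 1's gate can be pictured as a slot semaphore built from a buffered channel. A minimal sketch, assuming a `HeavyPool`/`Submit` shape and the 2 × GOMAXPROCS sizing from the doc — illustrative names, not the eventual API:

```go
package main

import (
	"errors"
	"runtime"
)

// ErrBusy maps to the RESP "-BUSY" reply sent when the pool is full.
var ErrBusy = errors.New("BUSY heavy-command pool is full")

// HeavyPool bounds concurrent heavy commands; a buffered channel
// acts as the slot semaphore, sized 2 x GOMAXPROCS per the design.
type HeavyPool struct {
	slots chan struct{}
}

func NewHeavyPool() *HeavyPool {
	return &HeavyPool{slots: make(chan struct{}, 2*runtime.GOMAXPROCS(0))}
}

// Submit runs fn if a slot is free, otherwise rejects immediately so
// the caller can reply -BUSY instead of queueing unbounded work.
// Cheap commands never call Submit and keep their fast path.
func (p *HeavyPool) Submit(command string, fn func() error) error {
	select {
	case p.slots <- struct{}{}: // acquired a slot
		defer func() { <-p.slots }()
		return fn()
	default:
		return ErrBusy
	}
}
```

Rejecting on a full pool (rather than blocking) is what keeps the gate from turning into another queue that heavy clients can fill.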

Recommended sequencing

Layer 4 (correctness bug, concentrated change) → Layer 1 (generic defense for next unknown hotspot) → Layer 3 (reconcile with roadmap item 6) → Layer 2 (only if forced by measurement).

Relationship to other in-flight work

Open questions called out in the doc

  • Static vs dynamic command classification (Layer 1)
  • -BUSY backoff semantics — how do we avoid client retry spinning becoming the new hot loop?
  • Number of locked OS threads on variable-core hosts (Layer 2)
  • Stream migration soak window before removing legacy-format fallback (Layer 4, currently 30 days, arbitrary)

Deliverable

docs/design/2026_04_24_proposed_workload_isolation.md — 446 lines, dated-prefix / **Status: Proposed** convention matching the rest of docs/design/. No code.

Test plan

  • File paths and function references in the doc spot-checked against origin/main
  • Cross-references to 2026_04_24_proposed_resilience_roadmap.md reconciled (complements, doesn't duplicate)
  • Design review — decide on the open questions before implementing Layer 4 (which blocks Layer 1 on XREAD specifically)

@coderabbitai

coderabbitai Bot commented Apr 24, 2026

Warning

Rate limit exceeded

@bootjp has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 53 minutes and 24 seconds before requesting another review.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bd714ac9-e7e3-4b7d-b86e-a2699644bd74

📥 Commits

Reviewing files that changed from the base of the PR and between 538f4f6 and 2ae4056.

📒 Files selected for processing (1)
  • docs/design/2026_04_24_proposed_workload_isolation.md

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This design document outlines a multi-layered approach to workload isolation in response to a CPU starvation incident caused by expensive XREAD operations. The proposal includes a bounded worker pool for heavy commands, per-client admission control, and a transition to per-entry storage for streams to improve performance. Feedback focuses on mitigating risks during stream migration, ensuring container-aware resource allocation, refining command classification for the worker pool, and improving client-side observability during load shedding.

Comment on lines +314 to +315
3. On the next write (`XADD`/`XDEL`/...), rewrite to per-entry and
delete the legacy blob in the same commit.
Contributor


high

Performing a full conversion of a large legacy stream during a synchronous write operation (XADD/XDEL) could be extremely expensive and might re-introduce CPU starvation or cause OOM, as it requires unmarshaling the entire blob and creating many new keys in a single Raft commit. This migration plan must detail strategies for zero-downtime cutovers to avoid service interruption. Consider an incremental migration strategy or a size-limited synchronous conversion to mitigate latency spikes and Raft heartbeat drops for very large streams.

Suggested change
3. On the next write (`XADD`/`XDEL`/...), rewrite to per-entry and
delete the legacy blob in the same commit.
3. On the next write (`XADD`/`XDEL`/...), perform an incremental or size-limited rewrite
to per-entry and delete the legacy blob in the same commit.
References
  1. Production-grade system migrations must detail strategies for live or zero-downtime cutovers to avoid service interruption.

Static v1 classification:

- **Pool-gated:** `XREAD`, `XRANGE`, `XREVRANGE`, `KEYS`, `SCAN`,
`HGETALL`, `SMEMBERS`, full-range `ZRANGE`/`ZRANGEBYSCORE`,
Contributor


medium

The 'full-range' qualifier for ZRANGE/ZRANGEBYSCORE suggests that the dispatcher must inspect command arguments (e.g., checking for 0 -1) before offloading. This contradicts the stated goal of 'static classification' by command name. If the gating is intended to be strictly static, all ZRANGE calls should be included, or the criteria for 'full-range' should be explicitly defined to avoid ambiguity in the implementation.

Suggested change
`HGETALL`, `SMEMBERS`, full-range `ZRANGE`/`ZRANGEBYSCORE`,
`HGETALL`, `SMEMBERS`, `ZRANGE`, `ZRANGEBYSCORE`,

### Recommended v1 shape

Package-level pool in `adapter/` with a `Submit(command, fn)` entry
point, sized `2 × runtime.GOMAXPROCS(0)` (env-overridable). Gated
Contributor


medium

Relying on runtime.GOMAXPROCS(0) for pool sizing can be unreliable in containerized environments (like Kubernetes) where it may return the host's CPU count instead of the container's quota. This could result in an oversized worker pool that fails to provide the intended isolation. It is recommended to ensure the implementation is container-aware, for example by using a library like uber-go/automaxprocs or explicitly respecting cgroup limits.

Suggested change
point, sized `2 × runtime.GOMAXPROCS(0)` (env-overridable). Gated
point, sized `2 × runtime.GOMAXPROCS(0)` (env-overridable, container-aware). Gated
References
  1. Design documents identifying significant operational risks must detail mitigation strategies.


### v1 vs later

- **v1:** static list, single global pool, reject on full.
Contributor


medium

Sharing a single global pool for all gated commands means that a burst of management or debug commands (like KEYS * or SCAN) can still starve critical data-path commands (like XREAD or Lua scripts) by exhausting pool slots and triggering -BUSY responses. To achieve true workload-class isolation, consider using separate sub-pools or a slot reservation system for critical command classes even in the v1 implementation.

Suggested change
- **v1:** static list, single global pool, reject on full.
- **v1:** static list, per-class sub-pools or reservations, reject on full.

Comment on lines +251 to +252
enforced at accept.** On reject, close the TCP connection immediately
(no RESP — clients retry the connect). Per-client in-flight
Contributor


medium

Abruptly closing the TCP connection without a protocol-level error (RESP) can lead to poor observability for clients, as they will see 'connection reset' or 'EOF' instead of a clear 'server overloaded' message. This makes it harder to distinguish between network issues and intentional load shedding. Consider accepting the connection and sending a -BUSY error before closing, or ensuring that client libraries are specifically tuned to interpret abrupt closures as a backpressure signal.

Suggested change
enforced at accept.** On reject, close the TCP connection immediately
(no RESP — clients retry the connect). Per-client in-flight
enforced at accept. On reject, consider accepting and replying with -BUSY before closing
(no RESP — clients retry the connect). Per-client in-flight

- Layer 1 (line 84): replace "full-range ZRANGE/ZRANGEBYSCORE" with
  unconditional gating of the whole ZRANGE family. Arg-inspection
  contradicted the stated "static, byte-level classification"; a
  bounded ZRANGE 0 10 costs at most one unmarshal and gating it is
  cheaper than the dispatcher branch that would distinguish the
  bounded vs unbounded case.
- Layer 1 (line 122, container awareness): document that
  runtime.GOMAXPROCS(0) returns the host CPU count on Linux without
  cgroup awareness and call out two mitigations (operator-set
  GOMAXPROCS env at rolling-update level, or wire
  uber-go/automaxprocs). v1 prefers the operator-set path for
  auditability; automaxprocs acceptable as follow-up.
- Layer 1 (line 139, single-pool starvation): acknowledge the risk
  that KEYS/SCAN bursts can exhaust pool slots and starve XREAD/Lua.
  v1 still ships a single pool but requires a per-command submit
  metric so a tier split is measurable from observability rather
  than guessed; sub-pools/slot reservation are the named follow-up.
- Layer 3 (line 252, reject semantics): change the v1 shape from
  "close TCP without RESP" to "accept, write -ERR, then close." The
  protocol-level error is the signal that distinguishes "server
  overload" from "network blip" on the client side.
- Layer 4 (line 315, synchronous migration cost): add a chunked
  migration section. A single XADD on a 100k-entry legacy blob would
  rewrite every entry in one Raft commit, reproducing the CPU/commit
  spike the design is supposed to prevent. Document
  STREAM_MIGRATION_CHUNK (default 1024) and the rolling drain model;
  explicitly scope chunked migration as a stacked follow-up to PR
  #620 (which ships the simple one-txn version).
@bootjp
Owner Author

bootjp commented Apr 24, 2026

Addressed Gemini design review in 9d44b6c:

  • HIGH (line 315, sync migration): added chunked migration section (STREAM_MIGRATION_CHUNK=1024 default). Scoped explicitly as a stacked follow-up to PR #620 (perf(redis): stream entry-per-key layout for O(new) XREAD), which ships the simple one-txn version. Doc now acknowledges that a 100k-entry legacy stream cannot be rewritten in a single XADD txn without reproducing the CPU spike.
  • MEDIUM (line 84, ZRANGE ambiguity): widened gating to the whole ZRANGE family. No arg-inspection → preserves static byte-level classification. A bounded ZRANGE 0 10 pays one pool submit for one unmarshal; still cheaper than the dispatcher branch that would distinguish bounded vs unbounded.
  • MEDIUM (line 122, container-awareness): called out the GOMAXPROCS vs cgroup-quota mismatch. v1 recommends operator-set GOMAXPROCS at the rolling-update layer (auditable); uber-go/automaxprocs is an accepted follow-up.
  • MEDIUM (line 139, single-pool starvation): kept single pool for v1 but required elastickv_heavy_command_pool_submit_total labelled by command so a sub-pool split becomes a measurable decision rather than a guess. Sub-pools / slot reservation documented as the named follow-up.
  • MEDIUM (line 252, RESP reject): changed v1 from "close TCP without RESP" to "accept, write -ERR max connections per client exceeded, close". Client sees a protocol-level signal instead of an opaque connection reset.
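The revised Layer 3 reject shape (accept, write a RESP error, close) can be sketched as a per-IP accept gate. A minimal illustration — `ipGate`, `admit`, and `handle` are assumed names, not the actual adapter code:

```go
package main

import (
	"fmt"
	"net"
	"sync"
)

// ipGate tracks live connections per peer IP; the design's default
// cap is 8. Names and structure here are illustrative only.
type ipGate struct {
	mu    sync.Mutex
	count map[string]int
	limit int
}

func newIPGate(limit int) *ipGate {
	return &ipGate{count: map[string]int{}, limit: limit}
}

// admit reports whether the peer is under its cap and, if so,
// reserves a slot that release returns later.
func (g *ipGate) admit(ip string) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.count[ip] >= g.limit {
		return false
	}
	g.count[ip]++
	return true
}

func (g *ipGate) release(ip string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.count[ip]--
}

func handle(g *ipGate, conn net.Conn) {
	ip, _, _ := net.SplitHostPort(conn.RemoteAddr().String())
	if !g.admit(ip) {
		// Protocol-level signal, then close: the client can tell
		// "server overload" apart from a bare connection reset.
		fmt.Fprint(conn, "-ERR max connections per client exceeded\r\n")
		conn.Close()
		return
	}
	defer g.release(ip)
	defer conn.Close()
	// ... serve RESP on conn ...
}
```

The one-line RESP error before close is the whole point of the change from the earlier "close TCP without RESP" shape.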

/gemini review
@codex review

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request proposes a four-layer workload isolation strategy to prevent resource starvation, covering worker pools for heavy commands, Raft thread pinning, per-client admission control, and stream storage optimization. The review identifies several significant design issues, including a potential deadlock when using XREAD BLOCK within a fixed-size pool, a correctness bug in the dual-format read logic during partial migrations, and the need for context propagation to handle Lua recursion safely. Feedback also highlights a contradiction regarding the classification of optimized stream commands and warns of potential connection storms under the proposed admission control mechanism.

Comment on lines +84 to +87
- **Pool-gated:** `XREAD`, `XRANGE`, `XREVRANGE`, `KEYS`, `SCAN`,
`HGETALL`, `HVALS`, `HKEYS`, `SMEMBERS`, `SUNION`, `SINTER`,
`ZRANGE`, `ZRANGEBYSCORE`, `ZRANGEBYLEX`, `EVAL`/`EVALSHA`,
`FCALL`/`FCALL_RO`, the `*SCAN` family.
Contributor


high

Gating XREAD to a small, fixed-size worker pool (2 × GOMAXPROCS) introduces a significant availability risk because XREAD is frequently used with the BLOCK option. A few long-polling clients could exhaust all worker slots, effectively performing a Denial of Service on all other "heavy" commands (such as KEYS, SCAN, or EVAL) that share the same pool. The design should specify that blocking operations are handled outside this bounded pool, or that only the CPU-intensive data processing (unmarshaling/scanning) is offloaded to the pool after the block condition is satisfied.

References
  1. When a design document identifies a significant operational risk, it must also detail potential mitigation strategies.

Comment on lines +343 to +345
1. On XREAD/XRANGE/XLEN/XREVRANGE, try the new per-entry layout first.
2. If empty AND the legacy single-blob key exists, fall back to the
legacy path.
Contributor


high

The dual-format read logic described here contains a correctness bug when combined with the "Chunked migration" strategy (lines 353-364). If a stream is partially migrated, the new per-entry layout will not be empty, which prevents the fallback to the legacy-suffix key in step 2. Consequently, readers will only see the migrated chunk and miss the remaining data still stored in the legacy format. The read path must be designed to check for and merge results from the legacy-suffix key whenever it exists, regardless of whether the new layout contains entries.

References
  1. When designing a production-grade system migration, the plan must consider and detail strategies for live or zero-downtime cutovers to avoid service interruption.

Comment on lines +115 to +117
- **(A) Outer holds the slot; inner `redis.call` runs inline and
ignores the pool.** No deadlock, inner cost shares outer accounting.
**Recommended v1.**
Contributor


medium

To implement Recommendation (A) for Lua recursion safely in Go, the context.Context passed through the command dispatcher must carry a flag indicating if the current goroutine is already executing within a heavy-command worker. Without this context propagation, the Submit entry point cannot reliably distinguish an "inner" call from a new client request, which would lead to the deadlock described in the "Risk to flag" section.

Comment on lines +280 to +283
enforced at accept.** On reject, accept the TCP connection, write a
`-ERR max connections per client exceeded` RESP error, then close —
so the client sees a protocol-level message instead of a bare
`connection reset` or `EOF` that's indistinguishable from a real
Contributor


medium

The strategy of accepting, erroring, and closing connections that exceed the per-IP cap may trigger a "connection storm" if the client uses an aggressive connection pool. Similar to the -BUSY retry storms discussed in the Open Questions (line 453), the design should consider how to mitigate the CPU overhead of repeatedly accepting and closing these connections. Documenting recommended client-side backoff for connection failures or implementing a small server-side delay before closing could help mitigate this.

References
  1. When a design document identifies a significant operational risk, it must also detail potential mitigation strategies.

Comment on lines +430 to +431
with Layer 4 — streams become a cheap command once Layer 4 ships,
but the pool still catches Lua, KEYS, and HGETALL.
Contributor


medium

There is a contradiction between the command classification in Layer 1 and the sequencing goals. While line 84 lists XREAD as "Pool-gated", line 430 states that "streams become a cheap command once Layer 4 ships". If Layer 4 successfully optimizes XREAD to O(new), it should be moved to the "Ungated" list to avoid the unnecessary overhead of the worker pool hop (enqueue/pickup). The design should clarify the final classification of optimized stream commands.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9d44b6c113


Comment on lines +343 to +345
1. On XREAD/XRANGE/XLEN/XREVRANGE, try the new per-entry layout first.
2. If empty AND the legacy single-blob key exists, fall back to the
legacy path.


P1 Badge Merge both stream formats during chunked migration

The migration algorithm becomes incorrect once chunking is introduced: after the first chunk rewrite, part of a stream exists in the new per-entry layout and part remains in the legacy suffix, but this read rule only falls back to legacy when the new layout is empty. In that mixed state, XREAD/XRANGE can return incomplete results (e.g., IDs before the chunk boundary only see the migrated prefix), so the design needs a combined read path (or a migration order/marker that guarantees one source is authoritative per ID range).
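The always-merge read rule both reviewers converge on can be sketched as a merge-by-ID with dedup. A minimal illustration — the `entry` type and `mergeStreamReads` are hypothetical, standing in for the real decoded stream entries:

```go
package main

import "sort"

// entry is a simplified stream entry; uint64 ordering stands in for
// real stream-ID comparison. Illustrative, not the elastickv types.
type entry struct {
	ID   uint64
	Data string
}

// mergeStreamReads implements the "always merge both layouts" rule
// for the chunked-migration window: combine per-entry results with
// entries decoded from the legacy-suffix blob, order by ID, and
// deduplicate. The migrator never writes the same ID to both layouts
// in one commit, but dedup keeps readers safe regardless.
func mergeStreamReads(perEntry, legacy []entry) []entry {
	out := append(append([]entry{}, perEntry...), legacy...)
	sort.Slice(out, func(i, j int) bool { return out[i].ID < out[j].ID })
	dedup := out[:0]
	var last uint64
	for i, e := range out {
		if i > 0 && e.ID == last {
			continue // same ID seen in both layouts; keep one copy
		}
		dedup = append(dedup, e)
		last = e.ID
	}
	return dedup
}
```

Note this is exactly what the fall-through rule ("legacy only if new layout empty") fails to do in the mixed state: it drops the legacy tail once any per-entry key exists.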


Comment on lines +132 to +134
**Container-aware sizing.** `runtime.GOMAXPROCS(0)` on Linux returns
the host CPU count, not the container's cgroup quota (unless
`GOMAXPROCS` is set explicitly or something has configured it).


P2 Badge Update GOMAXPROCS guidance for Go 1.25+ behavior

This guidance is factually stale for this repo’s toolchain (go 1.25.0): modern Go already derives default GOMAXPROCS from cgroup CPU quota on Linux unless explicitly disabled, so documenting that it always uses host CPU count will push operators toward unnecessary manual overrides. That override can also disable runtime auto-updates when quotas change, leading to avoidable mis-sizing.


- Layer 1 classification (line 87, Gemini HIGH): carve out blocking
  XREAD/BLPOP/BRPOP/BZPOP from pool gating. Blocking variants hold a
  slot while idle, trivially exhaust the pool; handle them on their
  own goroutine and re-gate on wake-up.
- Layer 1 Lua recursion (line 117, Gemini Medium): document the
  context-propagation mechanism that makes option A implementable.
  A package-private sentinel value on ctx distinguishes inner call
  from new request; external callers cannot fake it.
- Layer 1 GOMAXPROCS (line 134, Codex P2): correct stale guidance.
  Go 1.25+ already derives GOMAXPROCS from cgroup v2 quota, so the
  automaxprocs recommendation no longer applies to this repo; keep
  the override env knob for operators who want explicit control.
- Layer 3 reject semantics (line 283, Gemini Medium): add a
  reject-storm mitigation. Rate-limit the reject itself — after R
  rejects/s to one peer, switch from accept+write+close to RST;
  recommend client-side backoff in the ops runbook.
- Layer 4 migration (line 345, Gemini HIGH + Codex P1): split the
  migration doc into Mode A (simple, PR #620 ships this) and Mode B
  (chunked, stacked follow-up). The read rules differ: Mode A has
  no mixed state so fall-through is correct; Mode B has a legal
  mixed state and MUST always merge both layouts. Making the
  distinction explicit prevents the dual-read correctness bug the
  reviewers flagged — Mode B with Mode A's fall-through rule
  returns incomplete results during chunking.
- Sequencing (line 431, Gemini Medium): resolve the XREAD
  gated-vs-cheap contradiction. Layer 4 makes XREAD steady-state
  O(new) but we keep it gated in Layer 1 v1 for three concrete
  reasons (large XRANGE bounds, legacy-fallback window, data-driven
  promotion); promotion to ungated is gated on the pool-submit
  metric added in Layer 1.
@bootjp
Owner Author

bootjp commented Apr 24, 2026

Addressed round-2 in 731475e:

  • Gemini HIGH line 87 (XREAD BLOCK): carved blocking variants out of pool gating — they bypass the pool and re-gate on wake-up.
  • Gemini HIGH line 345 + Codex P1: split the migration section into Mode A (simple, ships in PR #620: perf(redis): stream entry-per-key layout for O(new) XREAD) and Mode B (chunked, follow-up). Mode B MUST use an always-merge read rule to avoid the mixed-state data loss reviewers flagged.
  • Gemini Medium line 117: documented the ctx-sentinel mechanism that makes Lua recursion option (A) implementable.
  • Gemini Medium line 283: added reject-storm mitigation (rate-limit the reject with RST fallback + client-side backoff in runbook).
  • Gemini Medium line 431: resolved the XREAD gated-vs-cheap contradiction — kept gated in v1 for three concrete reasons, promotion to ungated is metric-driven.
  • Codex P2 line 134: corrected GOMAXPROCS guidance — Go 1.25+ already derives from cgroup v2 quota, so the previous automaxprocs recommendation is stale for this repo.

/gemini review
@codex review


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 731475ebf9


Comment on lines +100 to +103
consumers are active. v1 resolution: **blocking variants (`XREAD
BLOCK`, `BLPOP`, `BRPOP`, `BZPOPMIN`/`MAX`) bypass the heavy-command
pool and are handled on their own goroutine**. They are I/O-bound
waiting, not CPU-bound; their CPU cost lands on wake-up, when the


P1 Badge Keep XREAD BLOCK under pool throttling

This proposal says blocking variants (including XREAD BLOCK) should bypass the heavy-command pool because they are "I/O-bound waiting," but the current implementation in adapter/redis_compat_commands.go (xread) busy-polls in a loop and repeatedly calls xreadOnce while waiting. That means blocked readers still execute stream reads/unmarshal work during the wait window, so bypassing the pool would let a client evade the new CPU guardrail simply by using BLOCK, recreating the starvation path this design is trying to prevent.


Comment on lines +420 to +423
`elastickv_stream_legacy_format_reads_total` counts reads that
touched a legacy-format key in either mode. Remove the legacy
fallback only after it has sat at zero across all nodes for a soak
window.


P1 Badge Require legacy-key depletion before dropping dual-read fallback

Using elastickv_stream_legacy_format_reads_total == 0 as the removal criterion can drop access to valid data: in Mode A, migration happens on the next write, so cold streams that are neither written nor read during the soak period can remain in legacy format indefinitely while this metric stays at zero. If fallback is removed on that signal alone, later reads of those dormant streams will miss existing data; the exit condition needs to prove no legacy-format keys remain, not just no recent legacy reads.


Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request proposes a four-layer architectural design for workload-class isolation to prevent expensive command paths from starving the Raft control plane. The strategy includes implementing a bounded worker pool for heavy commands, per-IP connection limits, and migrating Redis streams to a per-entry storage layout to reduce unmarshaling overhead. Feedback identifies a performance regression risk during chunked migration where legacy blobs still require full unmarshaling, points out that current blocking command implementations use CPU-intensive busy-polling rather than being purely I/O-bound, and highlights a context propagation gap that would prevent the proposed Lua recursion deadlock mitigation from functioning correctly.

Comment on lines +396 to +412
**Mode B — chunked migration (follow-up):** each write drains at
most `STREAM_MIGRATION_CHUNK` (default 1 024) entries from the legacy
blob into per-entry keys, and leaves the rest in a *legacy-suffix*
key until a subsequent write drains more. During this window the
stream exists in BOTH layouts simultaneously: the oldest N entries
are per-entry, the newer M entries are still in the suffix blob.

Read rule for Mode B — **always merge both layouts**, do not
fall-through on "new layout empty":

1. Read `meta` if present; read all per-entry keys that match the
requested ID range.
2. Read the legacy-suffix blob if present; decode only entries
falling in the ID range.
3. Merge by ID order, deduplicate (the migrator is responsible for
never writing the same ID in both layouts in a single commit), and
return.
Contributor


high

The read rule for Mode B (chunked migration) introduces a significant performance regression that may undermine the isolation goals. Since the legacy format is a single protobuf blob, decoding requires unmarshaling the entire blob, leading to O(N_suffix) CPU/memory usage. This poses an operational risk during the migration window. As this is a production-grade system migration, the plan must detail mitigation strategies (e.g., a temporary 'bridge' or 'dual-write proxy') to ensure a zero-downtime cutover and avoid service interruption.

References
  1. When designing a production-grade system migration, the plan must consider and detail strategies for live or zero-downtime cutovers (e.g., blue/green, dual-write proxy, logical replication) to avoid service interruption.

Comment on lines +100 to +107
consumers are active. v1 resolution: **blocking variants (`XREAD
BLOCK`, `BLPOP`, `BRPOP`, `BZPOPMIN`/`MAX`) bypass the heavy-command
pool and are handled on their own goroutine**. They are I/O-bound
waiting, not CPU-bound; their CPU cost lands on wake-up, when the
dispatcher can re-evaluate whether to gate the follow-up work. The
gating decision is made in `dispatchCommand` when it sees the command
name plus the `BLOCK`/`B*` prefix — the simplest arg inspection we
allow, limited to "is this command blocking?".
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The assumption that blocking variants like BZPOPMIN are purely I/O-bound and not CPU-bound is technically incorrect for the current implementation. As seen in adapter/redis_compat_commands.go:3432, bzpopmin uses a busy-poll loop with time.Sleep(redisBusyPollBackoff). While time.Sleep yields the P, each wake-up triggers a KV lookup and Raft leadership verification, which are CPU-intensive operations. If many clients use blocking commands, the aggregate CPU load from polling could still starve the runtime. The design should either propose moving blocking commands to a truly event-driven model (e.g., using the internal pubsub/relay system) or acknowledge this as a remaining gap in the isolation model.

Comment on lines +136 to +145
**Implementation note for (A): context propagation.** `Submit`
identifies "inside a pool slot" by attaching a sentinel value to
`context.Context` (`ctxKeyInPoolSlot`). The Lua adapter threads that
`ctx` into every `redis.call` it makes; the dispatcher's pool-gate
check returns immediately when `ctx.Value(ctxKeyInPoolSlot) != nil`
instead of attempting another `Submit`. This is the only mechanism
that reliably distinguishes "new client request" from "inner call"
without tagging every goroutine or holding a pool-wide set of
goroutine IDs. The sentinel must be package-private so external
callers cannot fake it.
Contributor

medium

The proposed context propagation for the Lua recursion trap has a gap in the current architecture. RedisServer.runLuaScript (and the eval handler) currently creates a new context using context.WithTimeout(context.Background(), ...) (see adapter/redis_lua.go:118). If eval is executed within a worker pool slot, this new context will lose the ctxKeyInPoolSlot sentinel. To make this mechanism work, runLuaScript and all specialized command handlers in RedisServer must be refactored to accept and propagate the context provided by the dispatcher/pool, rather than starting from Background().

- Layer 1 blocking variants (Codex P1 + Gemini Medium): retract the
  previous "I/O-bound, bypass the pool" decision. The current
  adapter busy-polls XREAD BLOCK / B*POP with time.Sleep +
  KV+leader re-check on each wake, so bypassing would hand CPU
  unbounded to them. v1 keeps blocking variants gated; a
  condvar-based wake-up is listed as the prereq for an honest
  bypass in a stacked follow-up.
- Layer 1 Lua ctx caveat (Gemini Medium): runLuaScript currently
  rebuilds the script context from context.Background(), which
  drops the ctxKeyInPoolSlot sentinel Submit installed when it
  dispatched the outer EVAL. Called out explicitly that Layer 1
  MUST NOT ship before runLuaScript derives its context from the
  caller-supplied ctx; otherwise Lua scripts inside the pool
  self-deadlock on every redis.call.
- Layer 4 Mode B cost (Gemini HIGH): Mode B preserves the
  O(N_suffix) unmarshal cost during the migration window, not
  constant O(new). Added a read-driven drain mitigation, top-N
  legacy-suffix-size metric for operator visibility, and an
  explicit note that Mode A is the conservative choice for
  operators who can't tolerate transient cost.
- Layer 4 removal criterion (Codex P1): just "no legacy reads"
  could keep the fallback alive on cold streams forever. Require
  a paired elastickv_stream_legacy_format_keys_total derived from
  a periodic prefix-scan; the fallback is safe to remove only when
  both counters are zero.
@bootjp
Owner Author

bootjp commented Apr 24, 2026

Round 3 in 104237d:

  • Codex P1 + Gemini Medium: retracted the "blocking variants bypass pool" decision — current adapter busy-polls XREAD BLOCK / B*POP, so bypass would hand them unbounded CPU. v1 keeps blocking gated; condvar wake-up is listed as the prereq for an honest bypass.
  • Gemini Medium: documented that runLuaScript currently rebuilds its ctx from context.Background(), clobbering the ctxKeyInPoolSlot sentinel. Called out explicitly that Layer 1 MUST NOT ship until this is fixed or Lua inside the pool self-deadlocks.
  • Gemini HIGH: Mode B cost model now states the O(N_suffix) unmarshal cost during the migration window; added read-driven drain mitigation and top-N legacy-suffix-size metric.
  • Codex P1: removal criterion now requires BOTH elastickv_stream_legacy_format_reads_total AND a new elastickv_stream_legacy_format_keys_total (from a periodic prefix scan) to be zero. Catches cold streams that a read-only metric would miss.

/gemini review
@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 104237dd85

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +595 to +597
compatibility code go away? Proposed:
`elastickv_stream_legacy_format_reads_total` = 0 for 30 days
across all nodes → remove in a follow-up PR. 30 days is arbitrary;

P1: Use both migration counters for fallback removal

This open-question proposal conflicts with the Layer 4 removal rule above by allowing legacy fallback removal based only on elastickv_stream_legacy_format_reads_total. In the cold-stream scenario the doc already calls out, reads can stay at zero while legacy-format keys still exist, so following this criterion can remove compatibility code too early and make those streams unreadable after rollout.


Comment on lines +473 to +474
periodic prefix scan (`!redis|stream|<...>` with no matching
`!stream|meta|<...>`). The fallback is safe to remove only when

P2: Match legacy-scan key pattern to the defined stream layout

The scan rule compares !redis|stream|<...> keys to a missing !stream|meta|<...> key, but the layout defined earlier in this doc uses !redis|stream|<key>|meta. If implemented literally, migrated streams can be misclassified as legacy because the meta lookup pattern is wrong, which breaks the new elastickv_stream_legacy_format_keys_total signal used to decide when fallback can be removed.


Contributor

@gemini-code-assist gemini-code-assist Bot left a comment
Code Review

This pull request proposes a multi-layered design for workload isolation to prevent CPU starvation, including a heavy-command worker pool, Raft thread pinning, per-IP connection caps, and a granular stream storage layout. Feedback focuses on operational risks during migration, specifically the starvation potential of ungated XADD commands and the inefficiency of the proposed legacy key scanning. Additionally, technical corrections were provided regarding the overhead of Go-level TCP resets and the all-or-nothing nature of Protobuf decoding which impacts the performance claims of the proposed migration modes.

Comment on lines +88 to +89
- **Ungated:** `GET`, `SET`, `DEL`, `EXISTS`, `INCR`, `EXPIRE`, `TTL`,
`HGET`, `HSET`, `LPUSH`/`RPUSH`, `XADD`, single-key fast paths.
Contributor
high

The classification of XADD as Ungated (line 89) poses a starvation risk during the transition to the new stream layout. As noted in the Layer 4 problem description (line 384), the current loadStreamAt implementation is O(N). Even after Layer 4 is implemented, the Mode A migration (line 414) requires a full rewrite of the legacy blob on the first write. If XADD remains ungated, a burst of writes to large legacy streams will trigger multiple O(N) unmarshal/marshal cycles on the dispatcher goroutines, potentially reproducing the CPU starvation incident. As this identifies a significant operational risk, the document must detail potential mitigation strategies, such as implementing a temporary 'bridge' or 'proxy' mode to handle the transition safely. Additionally, there is a discrepancy between this document and the codebase: lines 484 and 507 state that Hashes, Sets, and ZSets are currently 'one-blob', but adapter/redis_compat_helpers.go (lines 373, 419, 473) already implements wide-column loading and migration for these types.

References
  1. When a design document identifies a significant operational risk, such as the inability to perform rolling upgrades, it must also detail potential mitigation strategies, like implementing a temporary "bridge" or "proxy" mode.

Comment on lines +471 to +475
the fallback. Add a paired counter
`elastickv_stream_legacy_format_keys_total` populated from a
periodic prefix scan (`!redis|stream|<...>` with no matching
`!stream|meta|<...>`). The fallback is safe to remove only when
**both** counters are zero across every node. The periodic scan
Contributor
high

The proposed periodic prefix scan of !redis|stream| (line 471) to count legacy keys is extremely inefficient. Since new stream entries are stored under the same prefix (!redis|stream|<key>|entry|<id>, line 400), this scan will iterate over every entry of every stream in the database. In a large-scale deployment, an O(TotalEntries) background task is unsustainable and could cause significant I/O contention. Recommendation: Use a more efficient tracking mechanism, such as a dedicated prefix for legacy blobs that allows O(LegacyStreams) scanning, or a persistent counter that is decremented during migration.

Comment on lines +348 to +350
has been rejected `R` times in the last second, the next rejects
are answered with `RST` (cheap kernel-level reset) instead of an
accept + write + close; (b) document operator-side client
Contributor
medium

The proposal to use RST (TCP Reset) for 'cheap' rejection (line 348) may not provide the intended efficiency gains if implemented within the Go application. In the standard net package, sending an RST requires first calling Accept() to obtain a connection object, then setting SetLinger(0) before closing. This still incurs the overhead of the TCP handshake and file descriptor allocation. True 'cheap' rejection that avoids application-level overhead is typically handled at the OS kernel or firewall level (e.g., iptables REJECT with tcp-reset). If the mitigation must be in-process, a simple Close() after Accept() is nearly as efficient and more portable.

Comment on lines +437 to +438
2. Read the legacy-suffix blob if present; decode only entries
falling in the ID range.
Contributor
medium

The Mode B read rule (line 432) suggests decoding 'only entries falling in the ID range' from the legacy-suffix blob. However, standard Protobuf unmarshaling is an all-or-nothing operation for a message; you cannot partially decode a repeated field without custom, low-level wire-format parsing. Consequently, as long as a legacy suffix exists, reads will remain O(SuffixSize) in CPU and memory. This limitation should be explicitly acknowledged. Furthermore, the migration plan must detail strategies for live or zero-downtime cutovers (e.g., blue/green, dual-write proxy, or logical replication) to ensure this performance profile does not cause service interruption during the cutover.

References
  1. When designing a production-grade system migration, the plan must consider and detail strategies for live or zero-downtime cutovers (e.g., blue/green, dual-write proxy, logical replication) to avoid service interruption.

- Gemini HIGH (line 89 XADD starvation): explicitly flag that Mode A
  first-XADD migration rewrite is O(N) and therefore the ungated
  classification is a latent incident if Layer 4 Mode A lands before
  Layer 1. Add two ordering options: ship together, or dynamically
  gate XADD while the legacy-keys counter is non-zero.
- Gemini Medium (line 438 protobuf partial decode): correct the Mode B
  read rule. Protobuf cannot decode a repeated field partially, so the
  suffix blob is fully unmarshaled and filtered in memory. This matches
  (not contradicts) the Mode B cost model.
- Gemini HIGH (line 475 scan cost) + Codex P2 (line 474 key pattern):
  the prefix-scan approach to counting legacy keys would iterate every
  entry of every migrated stream because entries share the
  !redis|stream| prefix; also the prior wording compared !redis|stream|
  keys to a phantom !stream|meta| key. Replace both with either a
  sidecar legacy-index or a per-stream layout walk (O(num_streams), not
  O(num_entries)). Retract the wrong wording.
- Gemini Medium (line 350 RST overhead): pure-Go net cannot emit a
  true kernel RST without Accept(); clarify that (a) cheap path means
  skip the RESP write + SetLinger(0) close, not no accept. Real
  RST-before-accept is v2 via raw listener/eBPF.
- Codex P1 (line 597 removal criterion contradiction): open-question 5
  referenced only reads_total == 0, contradicting the Layer 4 rule that
  requires BOTH reads_total and keys_total. Tie them together.
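The "dynamically gate XADD while the legacy-keys counter is non-zero" option above reduces to a cheap atomic check in the dispatcher. A minimal sketch, with all names hypothetical (the real counter would be whatever backs `elastickv_stream_legacy_format_keys_total`):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// legacyKeys models the live value behind
// elastickv_stream_legacy_format_keys_total. While it is non-zero, XADD is
// routed through the heavy-command pool, because a first write to a legacy
// stream triggers the O(N) Mode A rewrite.
var legacyKeys atomic.Int64

// shouldGateXADD is the dispatcher-side check: once every legacy blob has
// been rewritten, XADD returns to the ungated fast path.
func shouldGateXADD() bool {
	return legacyKeys.Load() > 0
}

func main() {
	legacyKeys.Store(3) // migration still in progress
	fmt.Println(shouldGateXADD()) // true: XADD goes through the pool
	legacyKeys.Store(0) // all legacy blobs rewritten
	fmt.Println(shouldGateXADD()) // false: XADD back on the fast path
}
```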
@chatgpt-codex-connector

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

@bootjp
Owner Author

bootjp commented Apr 24, 2026

Round 4 in 2ae4056:

  • Gemini HIGH (line 89 XADD starvation): explicitly flag that Mode A's first-XADD migration rewrite is O(N) and therefore the "ungated" classification is a latent incident if Layer 4 Mode A lands before Layer 1. Added two ordering options — ship together (preferred, simpler) or dynamically gate XADD while the legacy-keys counter is non-zero.
  • Gemini Medium (line 438 protobuf partial decode): corrected the Mode B read rule. Protobuf cannot decode a repeated field partially, so the suffix blob is fully unmarshaled and then filtered to the requested ID range in memory. This matches (not contradicts) the Mode B cost model.
  • Gemini HIGH (line 475 scan cost) + Codex P2 (line 474 key pattern): the prefix-scan approach to counting legacy keys would iterate every entry of every migrated stream because entries share the !redis|stream| prefix; the prior wording also compared !redis|stream| keys to a phantom !stream|meta| key that isn't in the layout. Replaced both with either a sidecar legacy-index (!redis|stream_legacy_index|<key> tombstone) or a per-stream layout walk over !redis|stream|<key>|meta keys — O(num_streams), not O(num_entries). Retracted the wrong prefix and noted the authoritative layout.
  • Gemini Medium (line 350 RST overhead): pure-Go net cannot emit a true kernel RST without Accept(); clarified that (a) "cheap" means "skip the RESP write + SetLinger(0) close" (fewer syscalls per reject), not "no accept." Real RST-before-accept via raw listener or eBPF is deferred to v2.
  • Codex P1 (line 597 removal criterion contradiction): open-question 5 referenced only reads_total == 0, contradicting the Layer 4 rule that requires BOTH reads_total and keys_total. Tied them together.

Doc-only PR; no build needed.

/gemini review
@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2ae40564c1


Comment on lines +526 to +528
2. **Layout-walk.** Iterate `!redis|stream|<key>|meta` keys and, for
each stream, probe `!redis|stream|<key>` (the legacy blob key has
no suffix). Scan cost is O(num_streams), not O(num_entries).

P1: Remove meta-only walk from legacy-key counter options

The Layout-walk option counts legacy keys by iterating !redis|stream|<key>|meta, but untouched legacy streams do not have a |meta key (the doc’s Mode A read rule explicitly expects meta to be absent and falls back to the legacy blob). In that common migration state, this walk undercounts to zero even while legacy blobs still exist, so the legacy_format_keys_total removal gate can be satisfied prematurely and the fallback can be removed while cold legacy streams remain unreadable.


Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request proposes a multi-layered workload isolation strategy for the elastickv system to address resource starvation issues. The design outlines four layers of defense: a bounded worker pool for heavy commands, pinning Raft goroutines to OS threads, per-client admission control via connection caps, and a storage layout optimization for Redis streams that moves from single-blob storage to per-entry keys. I have no feedback to provide.

@bootjp bootjp merged commit bbf31eb into main Apr 24, 2026
8 checks passed
@bootjp bootjp deleted the ops/workload-isolation-design branch April 24, 2026 16:29
@bootjp
Owner Author

bootjp commented Apr 24, 2026

Codex P1 (line 528, legacy-key walk via |meta): the design doc at the HEAD of this branch (commit 2ae4056, round 4) already retracts the incorrect "walk |meta" wording. The current text explicitly states:

A "walk |meta keys" alternative was considered and rejected: untouched legacy streams never acquire a |meta key (Mode A's read rule explicitly expects |meta to be absent and falls back to the legacy blob). Iterating |meta would miss exactly the cold legacy streams that this counter exists to surface, reintroducing the same blind spot the paired counter is closing.

The authoritative option is a !redis|stream_legacy_index|<logical> sidecar marker written when a legacy blob is created and deleted in the same commit as the Mode A migration write. The counter becomes SCAN !redis|stream_legacy_index|, bounded by legacy-blob count rather than total entries.

Codex appears to be reviewing a pre-round-4 snapshot. No code or doc change needed.
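The sidecar-index counter quoted above has a small, bounded scan shape. A minimal sketch, using an in-memory map as a stand-in for the KV store and the prefixes from the doc's layout (`countLegacyStreams` and the map contents are illustrative):

```go
package main

import (
	"fmt"
	"strings"
)

// legacyIndexPrefix follows the doc's sidecar-marker layout: a marker exists
// iff a legacy blob exists, so the scan is bounded by legacy-blob count
// rather than total entries.
const legacyIndexPrefix = "!redis|stream_legacy_index|"

// countLegacyStreams models the periodic scan feeding
// elastickv_stream_legacy_format_keys_total.
func countLegacyStreams(kv map[string][]byte) int {
	n := 0
	for k := range kv {
		if strings.HasPrefix(k, legacyIndexPrefix) { // marker scan only
			n++
		}
	}
	return n
}

func main() {
	kv := map[string][]byte{
		"!redis|stream|orders|entry|1-0": []byte("payload"), // migrated entries are never touched
		"!redis|stream|orders|meta":      []byte("meta"),
		legacyIndexPrefix + "orders":     {}, // marker: a legacy blob still exists for "orders"
		"!redis|stream|logs|entry|5-0":   []byte("payload"),
	}
	fmt.Println(countLegacyStreams(kv)) // counts markers, not entries: 1
}
```

Because migrated-stream entries share the `!redis|stream|` prefix but never the `stream_legacy_index` one, the walk stays O(legacy blobs) as the retraction requires.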

/gemini review
@codex review

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bbf31eb84c


Comment on lines +526 to +529
2. **Layout-walk.** Iterate `!redis|stream|<key>|meta` keys and, for
each stream, probe `!redis|stream|<key>` (the legacy blob key has
no suffix). Scan cost is O(num_streams), not O(num_entries).
Equivalent answer; avoids the sidecar index.

P1: Remove |meta walk as equivalent legacy-key counter option

Option 2 is not equivalent to the sidecar index for Mode A: untouched legacy streams have no !redis|stream|<key>|meta key (the read rule above explicitly handles meta missing + legacy blob present), so iterating |meta keys will miss exactly the cold legacy streams this counter is meant to detect. If fallback removal is gated by that undercounted metric, compatibility code can be removed while legacy blobs still exist, making those streams unreadable until they are rewritten.

