perf(metrics): batch tokenization with defer-to-flush drain #350

Merged: viraatc merged 1 commit into main from perf/tok-batch-clean on Jun 30, 2026
Conversation

@viraatc viraatc commented Jun 9, 2026

Collaborator

What

ISL/OSL/TPOT need a tokenizer pass per completion. main dispatches one
asyncio task per event into a 2-thread pool; at high completion rates the
backlog grows without bound and the end-of-run drain takes roughly an hour per
million samples. This PR batches instead: triggers enqueue in O(1); a small live
lane keeps live metrics current; and the end-of-run drain tokenizes everything
left through a process-sharded pool that uses the whole machine.

How

  • BatchTokenizer — the drain runs encode_batch_fast (Rust, rayon)
    across auto-sized worker processes, one pinned per 8-core block of the
    allowed CPU universe (probed via expand_to_all_online_cpus(), then the
    aggregator's inherited mask is restored — the service stays wherever
    the parent placed it). No silent fallbacks: a tokenizer without a fast
    backend, or a failed/over-budget warmup, is a clean startup error. macOS
    shards unpinned (rayon capped per worker) at full speed.
  • Live lane — in-process threads (--metrics-tokenizer-workers, schema
    default 2, the pre-existing knob and footprint; 0 = defer everything to
    the drain), rayon-capped, slice-capped per flush. Owned by the queue
    (start_live); the publisher knows nothing about tokenization.
  • TokenBatchQueue — buffers (text, on_count) per event; live
    failures/cancellations re-queue items (no sample loss), drain failures are
    terminal and stay counted in n_pending_tasks (incomplete-drain contract:
    state == complete && n_pending_tasks > 0). Drain budget --drain-timeout
    (default 60 s, 0 = unlimited); finalize always runs.
  • MetricsTable is fully synchronous; CORES_PER_WORKER is a module
    constant. Defaults are single-sourced in config/schema.py
    (metrics_drain_timeout_s 60 s, metrics_tokenizer_workers 2); the
    service args are required and always forwarded by the benchmark.

Validation

  • Unit suite green (176 metrics-aggregator: queue contract, shard sizing,
    drain timeout/failure, live requeue, RAYON caps, wiring seams);
    pre-commit clean. Offline-burst e2e: state=complete, all series
    populated, drain to n_pending_tasks=0.
  • Sharding is default-on through the real launch path (verified on a
    48-core x86 host and a 144-core GB200): the drain shards span the machine
    while the aggregator keeps its inherited mask.

Tokenizer micro-benchmark (GB200, real DeepSeek-R1 tokenizer)

144-core Grace, corpus = MLPerf DS-R1 prompts tiled to the dataset-mean OSL
of 3877 tokens; identical token counts both sides.

| impl | parallelism | texts/s | tokens/s | speedup |
|---|---|---|---|---|
| main | 2 threads, per-text encode | 313 | 1.21 M | |
| this PR | 18 shards, batched encode | 11,951 | 46.3 M | 38× |
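
The tokens/s and speedup columns follow from the texts/s figures and the 3877-token mean OSL quoted above. A quick arithmetic check, using only numbers from this table (the variable names are illustrative):

```python
MEAN_OSL_TOKENS = 3877  # dataset-mean OSL quoted in the benchmark setup

main_texts_per_s = 313
pr_texts_per_s = 11_951

# tokens/s = texts/s * tokens per text
main_tokens_per_s = main_texts_per_s * MEAN_OSL_TOKENS
pr_tokens_per_s = pr_texts_per_s * MEAN_OSL_TOKENS

# Both sides tokenize identical texts, so the speedup is the texts/s ratio.
speedup = pr_texts_per_s / main_texts_per_s

print(f"main:    {main_tokens_per_s / 1e6:.2f} M tokens/s")
print(f"this PR: {pr_tokens_per_s / 1e6:.1f} M tokens/s")
print(f"speedup: {speedup:.1f}x")
```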

1M-sample end-to-end A/B vs main

Offline 1M samples, streaming, DS-R1 tokenizer, server-paced at 8k QPS with
~1k-token outputs. Both sides: 1,000,000/1,000,000, state=complete,
n_pending_tasks=0, identical token series.

| host | impl | backlog at ENDED | drain | total | speedup |
|---|---|---|---|---|---|
| GB200 144c | main | 2,970,972 | 3,362 s | 58.1 min | |
| GB200 144c | this PR | 2,782,912 | 42.9 s | 3.2 min | 18.1× |
| B200 192c | main | 2,994,925 | 3,286 s | 56.9 min | |
| B200 192c | this PR | 2,788,032 | 61 s | 3.4 min | 16.5× |

Measured on the final design (in-process live lane, --tokenizer-workers 2,
300 s drain default). The live lane keeps ~7% of tokenizations current; the
rest (~2.78M) defer to the end-of-run drain, which the sharded pool clears in
43-61 s. A 1M-sample run needs the 300 s budget; a 60 s budget drops the
backlog. The main rows (run with an unlimited drain budget, because otherwise
they never finish) and the micro-benchmark are unaffected.
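
The drain figures in the A/B table can be turned into drain throughput and a live-lane share. This is only a back-of-the-envelope sketch: the total of roughly 3 tokenizations per sample is an assumption inferred from the ~3M backlog on main and is not stated in the PR.

```python
# (backlog at ENDED, drain seconds) copied from the A/B table
runs = {
    ("GB200 144c", "main"): (2_970_972, 3362.0),
    ("GB200 144c", "this PR"): (2_782_912, 42.9),
    ("B200 192c", "main"): (2_994_925, 3286.0),
    ("B200 192c", "this PR"): (2_788_032, 61.0),
}

# Drain throughput in tokenizations per second for each row.
drain_rate = {key: backlog / secs for key, (backlog, secs) in runs.items()}

# Ratio of drain wall time, main vs this PR, per host.
drain_time_ratio = {
    host: runs[(host, "main")][1] / runs[(host, "this PR")][1]
    for host in ("GB200 144c", "B200 192c")
}

# ASSUMPTION: about 3 tokenizations per sample over 1,000,000 samples.
ASSUMED_TOTAL = 3 * 1_000_000
live_share = 1 - runs[("GB200 144c", "this PR")][0] / ASSUMED_TOTAL

for key, rate in drain_rate.items():
    print(key, f"{rate:,.0f} tokenizations/s")
print(drain_time_ratio, f"live share ~ {live_share:.1%}")
```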

🤖 Generated with Claude Code

@viraatc viraatc requested a review from a team June 9, 2026 20:33
github-actions Bot commented Jun 9, 2026


MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@github-actions github-actions Bot requested review from arekay-nv and nvzhihanj June 9, 2026 20:33

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request replaces the thread-based TokenizePool with a process-sharded BatchTokenizer and a TokenBatchQueue to buffer and batch tokenization work (ISL/OSL/TPOT) during metrics aggregation, preventing the system from falling behind on high-throughput runs. The review feedback highlights critical reliability improvements in token_metrics.py. Specifically, it is recommended to wrap the queue's flush logic in a try...finally block to prevent self._inflight from leaking on exceptions or cancellations. Additionally, count_texts and count_texts_async should explicitly check if the tokenizer is closed, and close() should wait for process pools to shut down to avoid resource leaks.
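
The try...finally recommendation can be illustrated with a small sketch. The class and attribute names here (`InflightGuardedQueue`, `_pending`, `_inflight`) are hypothetical and only mirror the `self._inflight` counter mentioned in the review; the actual flush logic in token_metrics.py is not shown in this conversation.

```python
import asyncio


class InflightGuardedQueue:
    """Hypothetical queue showing an in-flight counter that cannot leak."""

    def __init__(self, count_texts_async):
        self._count = count_texts_async
        self._pending = []
        self._inflight = 0

    def enqueue(self, text):
        self._pending.append(text)

    async def flush(self):
        # Detach the batch first so new enqueues go to a fresh list.
        batch, self._pending = self._pending, []
        self._inflight += len(batch)
        try:
            return await self._count(batch)
        except BaseException:
            # Covers tokenizer errors and asyncio.CancelledError alike:
            # put the batch back ahead of anything enqueued meanwhile.
            self._pending[:0] = batch
            raise
        finally:
            # Runs on success, failure, and cancellation.
            self._inflight -= len(batch)
```

Without the `finally`, an exception raised by the awaited call would skip the decrement and the counter would stay elevated for the rest of the run.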


Comment thread src/inference_endpoint/async_utils/services/metrics_aggregator/token_metrics.py Outdated
Comment thread src/inference_endpoint/async_utils/services/metrics_aggregator/token_metrics.py Outdated
Comment thread src/inference_endpoint/async_utils/services/metrics_aggregator/token_metrics.py Outdated
@viraatc viraatc force-pushed the perf/tok-batch-clean branch 2 times, most recently from 39b4a9b to c1d2cb7 Compare June 10, 2026 01:12
@viraatc viraatc force-pushed the perf/tok-batch-clean branch from b1395ab to 1e502c5 Compare June 10, 2026 21:57

@nvzhihanj nvzhihanj left a comment

Collaborator

Review Council — re-audit after rebase (HEAD 4633699d)

Reviewed by: Claude + Codex (low reasoning, correctness pass) · Depth: thorough. Focus (as requested): is the metrics-tokenization change modular / clean / non-intrusive (existing benchmark behavior preserved), and are redundant/meaningless tests added.

Verdict: the rebase reduced the intrusiveness but did not resolve it. Replacing the pre_publish hook (full tokenizer pool fired from the publish tick) with a bounded single-shard live lane is a real improvement. But mid-run tokenization is still on by default (--live-tokenizers defaults to 1 → 0.25s live flush), the live shard runs on the highest core block which overlaps the HTTP workers (compute_affinity_plan Phase-3 spillover), and the PR removed the only opt-out (metrics_tokenizer_workers) without replacement — so the observability component can perturb the SUT during measurement and the operator can't turn it off via config. Headline recommendation: default --live-tokenizers 0 for measurement-grade runs (defer all tokenization to the post-run drain), or confine the live shard to cores disjoint from worker_cpu_sets; restore a benchmark-reachable knob. (A1, A2, A3.)

Otherwise clean / non-intrusive. The change stays in the aggregator subprocess (only cross-module touch: importing endpoint_client.cpu_affinity). The consumer contract is verified intact: SessionState, the MetricsSnapshot schema, publisher cadence, and the state==COMPLETE && n_pending_tasks>0 incomplete-drain signal are unchanged; flush_remaining is bounded by the drain budget and never raises; the live-loop's failure cannot skip publish_final. The "shard or exit cleanly" fallback and the unpinned-without-affinity (macOS) path are correct and tested.

Tests: no redundant or meaningless tests. The new branches are mostly well covered with behavior-grounded assertions (the _setup_shards decision matrix, no-fast-backend-exit, unpinned-without-affinity, warmup-failure-exit, flush_remaining timeout/failure, live-loop start/stop/survives-failure, expand_to_all_online_cpus). Removing the old metrics_tokenizer_workers tests was correct (dead). The problems are coverage gaps, not redundancy: the aggregator-side start_live wiring is untested (A5) and TestAggregatorArgs no longer pins the forwarded-args contract (A6). Two _FakeProc-injection tests are borderline-coupled to internals but still verify fan-out/reassembly; TestEvenChunks is trivial-but-cheap. No mock-only or duplicate tests found.

Codex findings — not posted: (1) a multi-turn-ISL precompute regression at execute.py:351 — that's PR #349's change, out of scope here; (2) a shutdown(wait=False) worker-terminate race — _terminate_procs already defensively handles _processes is None and CPython doesn't synchronously null it, so the specific mechanism couldn't be verified → dropped. Existing gemini/github-code-quality token_metrics.py comments (flush-exception inflight; closed-tokenizer guards; close() shutdown leak; Protocol ...pass) are unaddressed but deduped here, not re-posted.

Comment thread src/inference_endpoint/commands/benchmark/execute.py
Comment thread src/inference_endpoint/async_utils/services/metrics_aggregator/token_metrics.py Outdated
Comment thread docs/async_utils/services/metrics_aggregator/DESIGN.md Outdated
Comment thread tests/unit/commands/test_benchmark.py
Comment thread src/inference_endpoint/async_utils/services/metrics_aggregator/__main__.py Outdated
Comment thread src/inference_endpoint/async_utils/services/metrics_aggregator/aggregator.py Outdated
@viraatc viraatc force-pushed the perf/tok-batch-clean branch from 8f547af to f1ac948 Compare June 16, 2026 04:42

@arekay-nv arekay-nv left a comment

Collaborator

LGTM. Thanks!

Comment thread docs/async_utils/services/metrics_aggregator/DESIGN.md
Comment thread src/inference_endpoint/async_utils/services/metrics_aggregator/__main__.py Outdated
Comment thread src/inference_endpoint/async_utils/services/metrics_aggregator/__main__.py Outdated
Comment thread src/inference_endpoint/async_utils/services/metrics_aggregator/aggregator.py Outdated
Comment thread src/inference_endpoint/config/schema.py Outdated
@viraatc viraatc requested a review from nv-alicheng June 26, 2026 22:30

@nv-alicheng nv-alicheng left a comment

Collaborator

Review Council — Multi-AI Code Review

Reviewed by: Codex + Claude | Depth: thorough

Posted 9 inline findings. Well-engineered PR — no critical/high defects; the drain/flush hot path is largely correct. Findings are condition-gated edge cases + polish. See summary comment for the tiered breakdown + one untested-path note that couldn't be posted inline (handler code unchanged in the diff).

Comment thread src/inference_endpoint/async_utils/services/metrics_aggregator/token_metrics.py Outdated
Comment thread src/inference_endpoint/endpoint_client/cpu_affinity.py
Comment thread src/inference_endpoint/async_utils/services/metrics_aggregator/token_metrics.py Outdated
Comment thread src/inference_endpoint/async_utils/services/metrics_aggregator/token_metrics.py Outdated
Comment thread src/inference_endpoint/async_utils/services/metrics_aggregator/aggregator.py Outdated
Comment thread docs/async_utils/services/metrics_aggregator/DESIGN.md
@nv-alicheng

Collaborator

Review Council — Multi-AI Code Review

Reviewed by: Codex + Claude | Depth: thorough

9 inline findings across 4 files. No critical/high defects — the defer-to-flush hot path is largely correct: the batch is detached before tokenization, drain failures are terminal-and-pending-only (no double-count, no silent loss on the normal paths), publish_final/_finalize run on every terminal path, idempotency is guarded, and the schema is the single source of truth for defaults (drain-timeout 300s, tokenizer-workers 2 consistent across schema, all 3 templates, AGENTS.md, DESIGN.md). aiohttp 3.14.1 bump consistent in pyproject + uv.lock. Findings below are condition-gated edge cases and polish.

🟡 Should Fix (medium)

| # | File | Line | Category | Reviewer(s) | Summary |
|---|---|---|---|---|---|
| 1 | metrics_aggregator/token_metrics.py | 461 | concurrency | Codex | close() self._thread.shutdown(wait=True) on the live thread pool can block teardown past --drain-timeout while a live encode is in-flight; SIGTERM path also doesn't stop the live loop |
| 2 | endpoint_client/cpu_affinity.py | 334 | bug | Codex | sysfs online unreadable → no widening → shard pool pinned to narrow loadgen mask → drain timeout on large runs in sysfs-filtered containers |
| 3 | metrics_aggregator/token_metrics.py | 626 | data-integrity | Claude | drain-phase zip(..., strict=True) in the else bypasses the independent-phase failure handling: a wrong-length tokenizer result records a partial prefix then propagates, dropping the detached message-phase items |

🔵 Consider (low)

| # | File | Line | Category | Reviewer(s) | Summary |
|---|---|---|---|---|---|
| 4 | metrics_aggregator/token_metrics.py | 135 | security | Claude | trust_remote_code=True now replicated across N shard worker processes; broadens blast radius |
| 5 | metrics_aggregator/token_metrics.py | 666 | error-handling | Claude | flush_remaining 'never raises' contract broken by unguarded live-task teardown vs CancelledError |
| 6 | metrics_aggregator/token_metrics.py | 170 | concurrency | Claude | _terminate_procs silently no-ops if CPython-private _processes shape changes, reintroducing the exit-stall it prevents |
| 7 | metrics_aggregator/aggregator.py | 122 | api-contract | Claude | required kw drain_timeout_s declared after defaulted kw-only params; reads as optional |
| 8 | metrics_aggregator/token_metrics.py | 361 | design | Claude | concrete count_texts_async/token_count_message_async omit the Protocol's positional-only / |
| 9 | metrics_aggregator/DESIGN.md | 24 | documentation | Claude | lifecycle diagram draws INTERRUPTED only off DRAINING; SIGTERM can interrupt from any state |

Not posted inline (handler code unchanged in the diff, so no valid inline anchor):

  • __main__.py _on_sigterm/_signal_finalize, testing (Claude): the SIGTERM-during-active-drain race (handler fires while flush_remaining is mid-drain) is untested. Single-INTERRUPTED-snapshot / shutdown_event-set-once / n_pending_tasks correctness rests on reasoning, not a test. Recommend a test firing the handler against a slow in-flight flush_remaining.

⚠️ Commit hygiene: 20 commits including ~8 apparent fixups (fix(metrics): …, chore(metrics): drain-timeout default back to 60s, etc.). Consider squashing the iteration history before merge.

@nv-alicheng

Collaborator

Design note — scope over architecture

Independent of the line-level findings already posted, a framing for the rework as a whole. The architecture here is forced by the constraints and the PR gets it right: tokenize is heavy CPU work that must not block the loop, must not contend with the loadgen mid-run, must keep up at 50k+ QPS, and must yield exact final numbers. GIL ⇒ real parallelism needs processes; a single BPE rayon pool saturates ~8 cores; the work is only needed for metrics ⇒ it can defer past the run. Those facts make defer-to-flush batching + process-sharding pinned to disjoint core blocks essentially the only answer, and I'd converge on the same — along with no-silent-fallback and the exact-or-flagged n_pending_tasks contract.
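
As a rough illustration of the sharding-by-core-block idea, the sketch below splits a CPU list into 8-core blocks, one per worker process. `CORES_PER_WORKER` is named in the PR description; `core_blocks` and `pin_current_process` are hypothetical helpers, and keeping a partial trailing block is an assumption, since the PR's handling of a remainder is not described here.

```python
import os

CORES_PER_WORKER = 8  # module constant named in the PR description


def core_blocks(cpus, block=CORES_PER_WORKER):
    """Split the allowed CPUs into consecutive blocks, one per shard worker.

    ASSUMPTION: a trailing partial block is kept as its own (smaller) shard.
    """
    ordered = sorted(cpus)
    return [ordered[i:i + block] for i in range(0, len(ordered), block)]


def pin_current_process(block):
    """Pin the calling process to one block where the platform supports it."""
    if hasattr(os, "sched_setaffinity"):  # Linux; absent on macOS
        os.sched_setaffinity(0, block)


blocks_gb200 = core_blocks(range(144))
blocks_b200 = core_blocks(range(192))
print(len(blocks_gb200), len(blocks_b200))
```

On a 144-core host this yields 18 blocks, which matches the "18 shards" row in the micro-benchmark. In a process pool, `pin_current_process` would typically run in each worker's initializer so the parent's own mask is left alone.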

Several subtle calls are better than a first pass would make: the sharding-by-core-block thesis is measured (16k vs 1.5k texts/s), the warmup is a bounded startup error that races the launch budget, shards stay idle until ENDED so tokenization never perturbs the running benchmark, and the DESIGN.md actually documents the contracts. Good work.

The one theme worth weighing is scope, not structure: the live in-process tokenization lane roughly doubles the edge-case surface and accounts for most of the review findings (live-cancel, re-queue-on-failure, the live thread-pool blocking teardown, the bounded per-flush cap). The authoritative metrics are computed at the drain regardless; the live lane exists only to keep token metrics current in the TUI mid-run, and --tokenizer-workers=0 ("defer all") is already the simpler universe.

So if live token metrics are not a hard requirement, a KISS v1 would be drain-only, adding the live lane later if operators ask for live OSL. That removes the live flag threaded through flush/count_texts_async, the re-queue paths, the live-loop lifecycle, and the second executor. If live IS required, the lane as built is a legitimate, well-engineered answer — and the gap then narrows to two refactors:

  1. Make CPU-universe discovery a pure query. The probe currently does getaffinity → expand_to_all_online_cpus() (mutates affinity as a side effect) → restore. The save/probe/restore dance and its restore-failure branch only exist because the helper isn't pure. A pure "list allowed CPUs" query removes the dance and gives a clean cgroup-cpuset fallback when /sys/.../online is unreadable (the affinity finding, addressed at the design level rather than patched).

  2. Split flush() into flush_text + flush_messages. One method juggling a shared failure var, two re-queue paths, and a zip(strict=True) in the else is what produced the data-integrity finding. Two small methods with independent error handling remove that hazard and read better.
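
One way the "pure query" in refactor 1 could look is sketched below. Everything here is hypothetical: `parse_cpu_list` and `allowed_cpus` are illustrative names, and the caller supplies the candidate files (for example a sysfs or cgroup CPU-list file), so no specific path is assumed. The point is that nothing in it calls `sched_setaffinity`, so there is no mask to save or restore.

```python
from __future__ import annotations

import os
from pathlib import Path
from typing import Iterable


def parse_cpu_list(text: str) -> set[int]:
    """Parse a kernel-style CPU list such as '0-3,8,10-11'."""
    cpus: set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus


def allowed_cpus(candidate_files: Iterable[Path] = ()) -> set[int]:
    """Return the CPU universe without mutating this process's affinity.

    Tries each caller-supplied file in order, then falls back to the
    current affinity mask, then to the logical CPU count.
    """
    for path in candidate_files:
        try:
            cpus = parse_cpu_list(path.read_text())
        except (OSError, ValueError):
            continue  # unreadable or filtered: try the next source
        if cpus:
            return cpus
    if hasattr(os, "sched_getaffinity"):  # Linux only
        return set(os.sched_getaffinity(0))
    return set(range(os.cpu_count() or 1))
```

Because an unreadable file simply falls through to the next source, the restore-failure branch mentioned above has no equivalent in this shape.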

None of this is a rework — the bones are right and more thoroughly justified than most perf PRs. It's about trimming optional scope out of v1 and making the two trickiest spots (affinity, the dual-phase flush) simpler.

(Independent design review by Claude at a maintainer's request — not a re-run of the automated council above.)

viraatc added a commit that referenced this pull request Jun 30, 2026
…e hardening

Addresses open review threads on #350:

- drain-timeout default 300s -> 0 (unlimited) per maintainer review; an
  incomplete drain is already flagged via n_pending_tasks, never silent.
  Single-sourced in schema; --drain-timeout / --tokenizer-workers gain
  service-side defaults (0 / 2) so the service is hand-launchable, with the
  benchmark still forwarding schema values as the source of truth.
- cpu_affinity.expand_to_all_online_cpus: when sysfs `online` is
  unreadable/filtered (some containers), widen against all logical CPUs
  instead of silently leaking the narrow inherited mask (which starved the
  shard pool on large runs). Kernel clamps to the cgroup cpuset.
- drain text phase: isolate a wrong-length tokenizer result as a phase
  failure so the message phase still runs (was zip(strict=True) raising
  out of flush()).
- BatchTokenizer.close(): cancel_futures on the live thread pool so only an
  in-flight encode (bounded) is waited on at teardown.
- _terminate_procs: log loud if CPython's private _processes attr is missing
  rather than silent no-op.
- aggregator drain log: success line is now an else-branch of the
  incomplete-drain warning; group drain_timeout_s with defaulted kwargs.
- docs/comments: TPOT excludes TTFT; empty-output is not an anomaly;
  trust_remote_code rationale; flush_remaining CancelledError contract;
  DESIGN.md INTERRUPTED reachable from any state + worker SIGINT/interrupt
  path; align TokenCounter Protocol positional-only markers.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
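
The close() change in the commit above relies on `ThreadPoolExecutor.shutdown(cancel_futures=True)` (Python 3.9+), which cancels work that has not started while still waiting for the task already running. A small self-contained demonstration, where `in_flight_encode` is a stand-in for a live encode rather than code from the PR:

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

started = threading.Event()


def in_flight_encode():
    """Stand-in for a bounded live encode that is already running."""
    started.set()
    time.sleep(0.2)
    return "done"


pool = ThreadPoolExecutor(max_workers=1)
running = pool.submit(in_flight_encode)
# These sit in the pool's queue behind the running task.
queued = [pool.submit(time.sleep, 5) for _ in range(3)]

started.wait()  # make sure the first task is executing before teardown
t0 = time.monotonic()
# Waits only for the in-flight task; queued futures are cancelled.
pool.shutdown(wait=True, cancel_futures=True)
elapsed = time.monotonic() - t0

print(running.result(), [f.cancelled() for f in queued], f"{elapsed:.2f}s")
```

With a plain `shutdown(wait=True)` the same teardown would also run the three queued sleeps, which is the blocking behaviour finding 1 describes.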
viraatc added a commit that referenced this pull request Jun 30, 2026
ISL/OSL/TPOT need a tokenizer pass per completion. The old path dispatched
one asyncio task per event into a 2-thread pool; at high completion rates the
backlog grew unbounded and the end-of-run drain took ~an hour per million
samples. This batches: triggers enqueue O(1); a small in-process live lane
keeps live metrics current; the end-of-run drain tokenizes everything left
through a process-sharded pool (one worker pinned per 8-core block).

- BatchTokenizer: sharded drain via encode_batch_fast (Rust/rayon), auto-sized
  to the allowed CPU universe; the aggregator's inherited mask is restored. No
  silent fallbacks — a tokenizer without a fast backend, or a failed/over-budget
  warmup, is a clean startup error. macOS shards unpinned.
- TokenBatchQueue: buffers (text, on_count) per event; live failures requeue
  (no sample loss); drain failures stay counted in n_pending_tasks
  (state == complete && n_pending_tasks > 0 = incomplete drain).
- Drain budget --drain-timeout (schema default 0 = unlimited; never exits while
  samples are still pending). --tokenizer-workers (default 2; 0 = defer all).

Review hardening (PR #350): cpu_affinity widens against all logical CPUs when
sysfs `online` is unreadable; drain isolates wrong-length tokenizer results so
the message phase still runs; close() drops queued live encodes; _terminate_procs
snapshots worker handles before shutdown() nulls _processes; TPOT docstring notes
TTFT exclusion; Protocol positional-only markers aligned.

Rebased onto main: reconciled with #372 (use_legacy_loadgen_qps_metrics) —
the ENDED drain finalize and the SIGTERM handler both refresh the
legacy_loadgen_window_duration_ns counter alongside the token-queue drain.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@viraatc viraatc force-pushed the perf/tok-batch-clean branch from bd20315 to 89a065a Compare June 30, 2026 03:01
viraatc added a commit that referenced this pull request Jun 30, 2026
ISL/OSL/TPOT need a tokenizer pass per completion. The old path dispatched
one asyncio task per event into a 2-thread pool; at high completion rates the
backlog grew unbounded and the end-of-run drain took ~an hour per million
samples. This batches: triggers enqueue O(1); a small in-process live lane
keeps live metrics current; the end-of-run drain tokenizes everything left
through a process-sharded pool (one worker pinned per 8-core block).

- BatchTokenizer: sharded drain via encode_batch_fast (Rust/rayon), auto-sized
  to the allowed CPU universe; the aggregator's inherited mask is restored. No
  silent fallbacks — a tokenizer without a fast backend, or a failed/over-budget
  warmup, is a clean startup error. macOS shards unpinned.
- TokenBatchQueue: buffers (text, on_count) per event; live failures requeue
  (no sample loss); drain failures stay counted in n_pending_tasks
  (state == complete && n_pending_tasks > 0 = incomplete drain).
- Drain budget --drain-timeout (schema default 0 = unlimited; never exits while
  samples are still pending). --tokenizer-workers (default 2; 0 = defer all).

Review hardening (PR #350): cpu_affinity widens against all logical CPUs when
sysfs `online` is unreadable; drain isolates wrong-length tokenizer results so
the message phase still runs; close() drops queued live encodes; TPOT docstring
notes TTFT exclusion; Protocol positional-only markers aligned.

Simplification pass: drain flatten uses itertools.chain.from_iterable (1.44x);
worker termination uses public multiprocessing.active_children() instead of
ProcessPoolExecutor._processes (also catches init-hung workers); the shard-core
probe-and-restore moved to cpu_affinity.cgroup_clamped_cpus() so building a
tokenizer no longer mutates the aggregator's own mask; flush(live=bool) split
into intent-named flush_live_once()/drain_all() over a private _flush; dead
_live_workers field removed.

Rebased onto main: reconciled with #372 (use_legacy_loadgen_qps_metrics) — the
ENDED drain finalize and the SIGTERM handler both refresh the
legacy_loadgen_window_duration_ns counter alongside the token-queue drain.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
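
The fan-out and flatten step mentioned in the simplification pass can be sketched as follows. `even_chunks` and `count_sharded` are hypothetical names, and the shards are evaluated in-process here purely for illustration; in the PR each chunk goes to a separate worker process. Only `itertools.chain.from_iterable` is taken from the commit message.

```python
from itertools import chain


def even_chunks(items, n):
    """Split items into at most n contiguous chunks whose sizes differ by <= 1."""
    n = max(1, min(n, len(items)))
    size, extra = divmod(len(items), n)
    chunks, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def count_sharded(texts, n_shards, count_fn):
    """Fan texts out to shards, then reassemble counts in the original order."""
    per_shard = [count_fn(chunk) for chunk in even_chunks(texts, n_shards)]
    # Chunks are contiguous, so concatenating shard results preserves order.
    return list(chain.from_iterable(per_shard))


texts = ["a", "bb", "ccc", "dddd", "eeeee"]
counts = count_sharded(texts, 2, lambda chunk: [len(t) for t in chunk])
print(counts)
```

Keeping the chunks contiguous is what lets the flatten be a plain concatenation; an interleaved split would need an index to restore the order.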
@viraatc viraatc force-pushed the perf/tok-batch-clean branch from 89a065a to f35ee08 Compare June 30, 2026 05:19
@viraatc viraatc force-pushed the perf/tok-batch-clean branch from f35ee08 to d70afe3 Compare June 30, 2026 22:12
@viraatc viraatc merged commit 550ef85 into main Jun 30, 2026
8 checks passed
@github-actions github-actions Bot locked and limited conversation to collaborators Jun 30, 2026
@viraatc viraatc deleted the perf/tok-batch-clean branch June 30, 2026 23:13