experiments: add token-savings runner for learn/recall#261
Conversation
Standalone (non-pytest) script that runs utterance 1 to seed guidelines and utterance 2 with vs. without recall in the claude-sandbox, comparing token usage from --output-format json plus per-turn usage from saved transcripts. Supports --shared-seed to reuse one seeded workspace across N measure runs. Lives under experiments/ so ad-hoc measurement work is separated from the test suite.
Documents what experiments/ is for (ad-hoc measurement, not the test suite), how to run token_savings.py, what the results layout looks like, and the rough wall-clock budget.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughThis PR adds an experiments README and a new experiment script, experiments/token_savings.py, to run Dockerized sandbox prompts, parse transcript JSONL, compare token usage for runs with recalled guidelines versus fresh workspaces, and produce timestamped markdown reports and raw JSON results. ChangesToken Savings Measurement Experiment
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
experiments/token_savings.py (1)
92-92: ⚡ Quick winA timeout (or other exception) in one run aborts the whole experiment and discards all prior results.
subprocess.run(..., timeout=SESSION_TIMEOUT_SECONDS)raisessubprocess.TimeoutExpiredon timeout, which propagates uncaught up through the run loop inmain. Sinceraw.json/report.mdare written only after all runs complete, a single hungclaudeinvocation loses every completed run — costly given the documented multi-minute-per-run wall-clock budget. Consider catching it here and surfacing it as a per-run error so the rest of the experiment proceeds.♻️ Capture timeout as a normal failure instead of crashing
- proc = subprocess.run(cmd, capture_output=True, text=True, timeout=SESSION_TIMEOUT_SECONDS) + try: + proc = subprocess.run(cmd, capture_output=True, text=True, timeout=SESSION_TIMEOUT_SECONDS) + except subprocess.TimeoutExpired as exc: + # Synthesize a failed CompletedProcess so callers record an error + # for this run rather than aborting the entire experiment. + return subprocess.CompletedProcess( + cmd, returncode=124, stdout=exc.stdout or "", stderr="timeout" + ), NoneNote: callers already branch on
proc.returncode != 0and recordmeasure_failed/seed_failed, so a non-zeroreturncodehere integrates cleanly.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@experiments/token_savings.py` at line 92, Wrap the subprocess.run call that assigns proc (subprocess.run(..., timeout=SESSION_TIMEOUT_SECONDS)) in a try/except that catches subprocess.TimeoutExpired (and optionally other exceptions) and converts the failure into a per-run failure instead of letting it propagate to main: on exception, synthesize a failure "proc" (or set variables used later) with a non-zero returncode and stderr/stdout containing the exception message so the existing returncode checks and measure_failed/seed_failed bookkeeping still record this run as failed; do this change in experiments/token_savings.py around the subprocess.run invocation so raw.json/report.md continue to be written after the loop completes.experiments/README.md (1)
17-18: ⚡ Quick winClarify Docker prerequisite to mention daemon.
The script checks both that Docker is installed and that the daemon is running (see
_check_prerequisites()), but the README only mentions "Docker." Consider clarifying to "Docker (installed and daemon running)" for precision.Based on learnings from the implementation contract in _check_prerequisites() which validates both
shutil.which("docker")anddocker inforeturncode.📝 Proposed clarification
-**Requires:** Docker, the `claude-sandbox` image (`just sandbox-build claude`), +**Requires:** Docker (installed and daemon running), the `claude-sandbox` image (`just sandbox-build claude`), and `ANTHROPIC_API_KEY` (or `ANTHROPIC_AUTH_TOKEN`) in the environment.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@experiments/README.md` around lines 17 - 18, Update the README entry that currently says "Docker" to clarify the daemon requirement—e.g., change the prerequisite text to "Docker (installed and daemon running)" or similar; this aligns the documentation with the runtime checks performed by the _check_prerequisites() function which verifies both shutil.which("docker") and successful `docker info` execution, ensuring readers know the Docker daemon must be running.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@experiments/README.md`:
- Around line 17-18: Update the README entry that currently says "Docker" to
clarify the daemon requirement—e.g., change the prerequisite text to "Docker
(installed and daemon running)" or similar; this aligns the documentation with
the runtime checks performed by the _check_prerequisites() function which
verifies both shutil.which("docker") and successful `docker info` execution,
ensuring readers know the Docker daemon must be running.
In `@experiments/token_savings.py`:
- Line 92: Wrap the subprocess.run call that assigns proc (subprocess.run(...,
timeout=SESSION_TIMEOUT_SECONDS)) in a try/except that catches
subprocess.TimeoutExpired (and optionally other exceptions) and converts the
failure into a per-run failure instead of letting it propagate to main: on
exception, synthesize a failure "proc" (or set variables used later) with a
non-zero returncode and stderr/stdout containing the exception message so the
existing returncode checks and measure_failed/seed_failed bookkeeping still
record this run as failed; do this change in experiments/token_savings.py around
the subprocess.run invocation so raw.json/report.md continue to be written after
the loop completes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 28b0b470-cfd8-47ca-adca-0cfe9c7ccfa5
📒 Files selected for processing (2)
experiments/README.mdexperiments/token_savings.py
Fixes failing CI check: check-formatting (3.12)
Summary
Adds an
experiments/directory with a standalone (non-pytest) script that measures the token / wall-clock / step gap on a measured utterance when guidelines from a prior utterance are recallable vs. not. Adapted fromtests/e2e/test_claude_sandbox_learn_recall.pybut runs as a script and prints a comparison table — no assertions.This is the runner only. Result artifacts from my own runs against the EXIF demo are kept out of this PR; headline numbers and the writeup are on issue #260.
What's in here
experiments/token_savings.py— runs utterance 1 to seed guidelines, then utterance 2 with vs. without recall, N times per condition. Captures token usage fromclaude --output-format jsonplus per-turn usage from saved transcripts. Supports--shared-seedto reuse one seeded workspace across N measure runs.experiments/README.md— whatexperiments/is for, how to run, results layout.Why a new top-level dir
tests/is for the test suite (CI-runnable, asserts). This is ad-hoc measurement that produces numbers, not pass/fail. If something here graduates into a regression check, move it undertests/.Test plan
python3 experiments/token_savings.py --helplists--runs,--shared-seed,--keep-workspaces.python3 experiments/token_savings.py --runs 1 --shared-seedend-to-end smoke run (requires Docker, theclaude-sandboximage, and an Anthropic API key in env). Should produceexperiments/results/token_savings_<timestamp>/{report.md, raw.json}and a stdout table.Related
Summary by CodeRabbit
Documentation
New Features