experiments: add token-savings runner for learn/recall #261

Merged: visahak merged 3 commits into AgentToolkit:main from vinodmut:token-savings-experiment on May 29, 2026
@vinodmut (Contributor) commented May 29, 2026

Summary

Adds an experiments/ directory with a standalone (non-pytest) script that measures the token / wall-clock / step gap on a measured utterance when guidelines from a prior utterance are recallable vs. not. Adapted from tests/e2e/test_claude_sandbox_learn_recall.py but runs as a script and prints a comparison table — no assertions.

This is the runner only. Result artifacts from my own runs against the EXIF demo are kept out of this PR; headline numbers and the writeup are on issue #260.

What's in here

  • experiments/token_savings.py — runs utterance 1 to seed guidelines, then utterance 2 with vs. without recall, N times per condition. Captures token usage from claude --output-format json plus per-turn usage from saved transcripts. Supports --shared-seed to reuse one seeded workspace across N measure runs.
  • experiments/README.md — what experiments/ is for, how to run, results layout.
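
As a rough illustration of the headline-usage capture described in the list above, the sketch below pulls a `usage` block out of the CLI's JSON stdout. The helper names (`extract_headline_usage`, `total_tokens`) and the field names (`usage`, `input_tokens`, `output_tokens`, and the two cache counters) are assumptions made for illustration; they are not taken from `token_savings.py`. The line-by-line fallback is one plausible way to cope with stdout that is not a single strict JSON document.

```python
import json
from typing import Any, Optional


def extract_headline_usage(stdout: str) -> Optional[dict[str, Any]]:
    """Return the `usage` dict from CLI stdout, or None if none is found.

    Hypothetical helper: the field names are assumed, not confirmed by the PR.
    """
    # First try the whole output, then fall back to each line, newest first,
    # in case log lines were printed before the JSON result.
    candidates = [stdout.strip()]
    candidates.extend(line.strip() for line in reversed(stdout.splitlines()))
    for text in candidates:
        if not text.startswith("{"):
            continue
        try:
            payload = json.loads(text)
        except json.JSONDecodeError:
            continue
        usage = payload.get("usage") if isinstance(payload, dict) else None
        if isinstance(usage, dict):
            return usage
    return None


def total_tokens(usage: dict[str, Any]) -> int:
    """Sum the token counters that are present; missing or null ones count as 0."""
    keys = (
        "input_tokens",
        "output_tokens",
        "cache_creation_input_tokens",
        "cache_read_input_tokens",
    )
    return sum(int(usage.get(key) or 0) for key in keys)
```

Returning `None` rather than raising lets a caller record a per-run parse error and keep going, which matches the "numbers, not pass/fail" intent of the script.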

Why a new top-level dir

tests/ is for the test suite (CI-runnable, asserts). This is ad-hoc measurement that produces numbers, not pass/fail. If something here graduates into a regression check, move it under tests/.

Test plan

  • python3 experiments/token_savings.py --help lists --runs, --shared-seed, --keep-workspaces.
  • python3 experiments/token_savings.py --runs 1 --shared-seed end-to-end smoke run (requires Docker, the claude-sandbox image, and an Anthropic API key in env). Should produce experiments/results/token_savings_<timestamp>/{report.md, raw.json} and a stdout table.

Related

Summary by CodeRabbit

  • Documentation

    • Added README documenting ad-hoc measurement scripts, setup, run examples, outputs, and expected artifacts.
  • New Features

    • Added a token-savings experiment that runs repeated measurements, parses results, computes per-run and aggregate token metrics (means/ranges and savings), and writes timestamped reports and raw output.

vinodmut added 2 commits May 29, 2026 09:58

  • Standalone (non-pytest) script that runs utterance 1 to seed guidelines
    and utterance 2 with vs. without recall in the claude-sandbox, comparing
    token usage from --output-format json plus per-turn usage from saved
    transcripts. Supports --shared-seed to reuse one seeded workspace across
    N measure runs. Lives under experiments/ so ad-hoc measurement work is
    separated from the test suite.
  • Documents what experiments/ is for (ad-hoc measurement, not the test
    suite), how to run token_savings.py, what the results layout looks like,
    and the rough wall-clock budget.

coderabbitai Bot commented May 29, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7f68dae9-3e0a-4bca-ad1a-38cda7b69b74

📥 Commits

Reviewing files that changed from the base of the PR and between 7259fe8 and 5f71a05.

📒 Files selected for processing (1)
  • experiments/token_savings.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • experiments/token_savings.py

📝 Walkthrough

This PR adds an experiments README and a new experiment script, experiments/token_savings.py, to run Dockerized sandbox prompts, parse transcript JSONL, compare token usage for runs with recalled guidelines versus fresh workspaces, and produce timestamped markdown reports and raw JSON results.
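
To make the transcript handling in this walkthrough concrete, here is a minimal sketch of picking the newest transcript while excluding already-seen paths, then collecting per-turn usage from it. The function names and the assumed record shape (a `message` object carrying a `usage` dict on each assistant line) are illustrative guesses, not the script's actual code or a documented transcript schema.

```python
import json
from pathlib import Path
from typing import Iterable, Optional


def newest_transcript(directory: Path, exclude: Iterable[Path] = ()) -> Optional[Path]:
    """Pick the most recently modified *.jsonl file that is not in `exclude`."""
    seen = {path.resolve() for path in exclude}
    fresh = [p for p in directory.glob("*.jsonl") if p.resolve() not in seen]
    return max(fresh, key=lambda p: p.stat().st_mtime, default=None)


def per_turn_usage(transcript: Path) -> list[dict]:
    """Collect one usage dict per turn, skipping malformed or unrelated lines."""
    turns = []
    for line in transcript.read_text(encoding="utf-8").splitlines():
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate truncated or non-JSON lines
        if not isinstance(record, dict):
            continue
        message = record.get("message")
        usage = message.get("usage") if isinstance(message, dict) else None
        if isinstance(usage, dict):
            turns.append(usage)
    return turns
```

Snapshotting the set of existing transcripts before each run and passing it as `exclude` is what would let a shared workspace attribute the right transcript to each measurement.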

Changes

Token Savings Measurement Experiment

| Layer | File(s) | Summary |
|---|---|---|
| Documentation | `experiments/README.md` | README describing the experiments directory, token_savings.py usage, environment/Docker requirements, example commands, output artifacts, and per-run directory structure. |
| Experiment Foundation & Sandbox Integration | `experiments/token_savings.py` | Shared constants and prereq checks (Docker, sandbox image, Anthropic env vars); workspace cloning; Dockerized claude invocation with JSON output parsing and fallback recovery for non-strict stdout. |
| Transcript & Entity Parsing | `experiments/token_savings.py` | Utilities to parse per-turn token usage from transcript JSONL, select newest transcript excluding prior paths, list learned `.evolve/entities/*.md`, and extract headline usage blocks from JSON. |
| With-guidelines per-run flow | `experiments/token_savings.py` | Seed a fresh workspace with utterance 1, confirm learned entities, run measurement utterance 2 on same workspace, capture headline/per-turn usage and structured errors on failures. |
| Shared-seed mode | `experiments/token_savings.py` | Seed once, enumerate recallable entities, run measurement N times on same workspace while snapshotting prior transcripts to identify each run's newest transcript. |
| Without-guidelines per-run flow | `experiments/token_savings.py` | Fresh workspace creation and single measurement run with no seeding; record headline and per-turn usage and structured errors on sandbox failure. |
| Metric Aggregation & Table Rendering | `experiments/token_savings.py` | Compute mean/min/max/stdev across runs for token/duration metrics, format markdown table with savings deltas and percentages, and render per-turn rows for a representative run. |
| Report Generation & CLI Orchestration | `experiments/token_savings.py` | Generate report.md and raw.json in timestamped results dir; CLI parses --runs, --shared-seed, --keep-workspaces; optional workspace cleanup; exit code 1 if any run errored. |
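
The aggregation step summarized above (mean/min/max/stdev per metric, plus a savings delta and percentage) can be sketched as follows. This is a minimal illustration under assumed names (`summarize`, `savings`); the actual script may compute or round things differently.

```python
from statistics import mean, stdev


def summarize(values: list[float]) -> dict[str, float]:
    """Mean/min/max/stdev for one metric; stdev is 0.0 with fewer than 2 runs."""
    return {
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
        "stdev": stdev(values) if len(values) > 1 else 0.0,
    }


def savings(without: list[float], with_guidelines: list[float]) -> tuple[float, float]:
    """Return (absolute delta, percent saved) comparing the two condition means."""
    base = mean(without)
    delta = base - mean(with_guidelines)
    # Guard against a zero baseline so the percentage is always defined.
    percent = 100.0 * delta / base if base else 0.0
    return delta, percent
```

For example, with made-up inputs (not results from this PR), `savings([1000, 1200], [800, 700])` gives a delta of 350 tokens, roughly 31.8% of the without-guidelines mean.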

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A script hops through sandboxes, measuring tokens with care,
With guidelines recalled or a fresh workspace bare,
Counting every prompt byte, computing the gain,
Writing markdown reports so the numbers remain,
Hopping back to burrow with results in a chain! 📊✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 26.67%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'experiments: add token-savings runner for learn/recall' directly and accurately summarizes the main change: adding a token-savings measurement script for learn/recall functionality to the experiments directory. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (2)
experiments/token_savings.py (1)

92-92: ⚡ Quick win

A timeout (or other exception) in one run aborts the whole experiment and discards all prior results.

subprocess.run(..., timeout=SESSION_TIMEOUT_SECONDS) raises subprocess.TimeoutExpired on timeout, which propagates uncaught up through the run loop in main. Since raw.json/report.md are written only after all runs complete, a single hung claude invocation loses every completed run — costly given the documented multi-minute-per-run wall-clock budget. Consider catching it here and surfacing it as a per-run error so the rest of the experiment proceeds.

♻️ Capture timeout as a normal failure instead of crashing
```diff
-    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=SESSION_TIMEOUT_SECONDS)
+    try:
+        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=SESSION_TIMEOUT_SECONDS)
+    except subprocess.TimeoutExpired as exc:
+        # Synthesize a failed CompletedProcess so callers record an error
+        # for this run rather than aborting the entire experiment.
+        return subprocess.CompletedProcess(
+            cmd, returncode=124, stdout=exc.stdout or "", stderr="timeout"
+        ), None
```

Note: callers already branch on proc.returncode != 0 and record measure_failed/seed_failed, so a non-zero returncode here integrates cleanly.
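
A self-contained version of this idea, runnable outside the script, might look like the sketch below. The wrapper name `run_with_timeout` is hypothetical, and unlike the suggested diff it returns only the process object rather than a tuple. It also normalizes `exc.stdout`, which can be `None`, `str`, or `bytes` depending on platform and Python version.

```python
import subprocess
import sys


def run_with_timeout(cmd: list[str], timeout_seconds: float) -> subprocess.CompletedProcess:
    """Run cmd; on timeout, return a failed CompletedProcess instead of raising."""
    try:
        return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_seconds)
    except subprocess.TimeoutExpired as exc:
        # Partial output may be None, str, or bytes, so normalise it to str
        # before building the synthetic result.
        partial = exc.stdout or ""
        if isinstance(partial, bytes):
            partial = partial.decode("utf-8", errors="replace")
        return subprocess.CompletedProcess(
            cmd,
            returncode=124,  # conventional "timed out" exit status
            stdout=partial,
            stderr=f"timeout after {timeout_seconds}s",
        )


if __name__ == "__main__":
    # A child that sleeps longer than the timeout yields returncode 124.
    slow = run_with_timeout([sys.executable, "-c", "import time; time.sleep(5)"], 0.5)
    print(slow.returncode, slow.stderr)
```

Because the failure is expressed as a non-zero `returncode`, any caller that already branches on `proc.returncode != 0` needs no further changes.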

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@experiments/token_savings.py` at line 92, Wrap the subprocess.run call that
assigns proc (subprocess.run(..., timeout=SESSION_TIMEOUT_SECONDS)) in a
try/except that catches subprocess.TimeoutExpired (and optionally other
exceptions) and converts the failure into a per-run failure instead of letting
it propagate to main: on exception, synthesize a failure "proc" (or set
variables used later) with a non-zero returncode and stderr/stdout containing
the exception message so the existing returncode checks and
measure_failed/seed_failed bookkeeping still record this run as failed; do this
change in experiments/token_savings.py around the subprocess.run invocation so
raw.json/report.md continue to be written after the loop completes.
experiments/README.md (1)

17-18: ⚡ Quick win

Clarify Docker prerequisite to mention daemon.

The script checks both that Docker is installed and that the daemon is running (see _check_prerequisites()), but the README only mentions "Docker." Consider clarifying to "Docker (installed and daemon running)" for precision.

Based on learnings from the implementation contract in _check_prerequisites() which validates both shutil.which("docker") and docker info returncode.

📝 Proposed clarification
```diff
-**Requires:** Docker, the `claude-sandbox` image (`just sandbox-build claude`),
+**Requires:** Docker (installed and daemon running), the `claude-sandbox` image (`just sandbox-build claude`),
 and `ANTHROPIC_API_KEY` (or `ANTHROPIC_AUTH_TOKEN`) in the environment.
```
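
For context, the two-part Docker check referred to here (`shutil.which("docker")` plus the `docker info` return code) can be sketched as below. This is an assumed shape, not the body of `_check_prerequisites()`; the function name `docker_problem` and the injectable `which`/`run` parameters are hypothetical and exist only to make the sketch testable without Docker.

```python
import shutil
import subprocess
from typing import Optional


def docker_problem(which=shutil.which, run=subprocess.run) -> Optional[str]:
    """Return a human-readable problem string, or None if Docker looks usable."""
    if which("docker") is None:
        return "docker CLI not found on PATH"
    # `docker info` exits non-zero when the client cannot reach the daemon.
    proc = run(["docker", "info"], capture_output=True, text=True)
    if proc.returncode != 0:
        return "docker is installed but the daemon is not reachable"
    return None
```

The two distinct messages mirror the distinction the README wording is meant to capture: installed versus installed and running.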
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@experiments/README.md` around lines 17 - 18, Update the README entry that
currently says "Docker" to clarify the daemon requirement—e.g., change the
prerequisite text to "Docker (installed and daemon running)" or similar; this
aligns the documentation with the runtime checks performed by the
_check_prerequisites() function which verifies both shutil.which("docker") and
successful `docker info` execution, ensuring readers know the Docker daemon must
be running.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 28b0b470-cfd8-47ca-adca-0cfe9c7ccfa5

📥 Commits

Reviewing files that changed from the base of the PR and between 003bd53 and 7259fe8.

📒 Files selected for processing (2)
  • experiments/README.md
  • experiments/token_savings.py

Fixes failing CI check: check-formatting (3.12)
@visahak visahak merged commit c57148b into AgentToolkit:main May 29, 2026
17 checks passed
@vinodmut vinodmut deleted the token-savings-experiment branch May 29, 2026 18:56

2 participants