experiments: add token-savings runner for learn/recall by vinodmut · Pull Request #261 · AgentToolkit/altk-evolve

vinodmut · 2026-05-29T14:58:29Z

Summary

Adds an experiments/ directory with a standalone (non-pytest) script that measures the token / wall-clock / step gap on a measured utterance when guidelines from a prior utterance are recallable vs. not. Adapted from tests/e2e/test_claude_sandbox_learn_recall.py but runs as a script and prints a comparison table — no assertions.

This is the runner only. Result artifacts from my own runs against the EXIF demo are kept out of this PR; headline numbers and the writeup are on issue #260.

What's in here

- `experiments/token_savings.py`: runs utterance 1 to seed guidelines, then utterance 2 with vs. without recall, N times per condition. Captures token usage from `claude --output-format json` plus per-turn usage from saved transcripts. Supports `--shared-seed` to reuse one seeded workspace across N measure runs.
- `experiments/README.md`: what `experiments/` is for, how to run, results layout.
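As a rough illustration of the "captures token usage from `claude --output-format json`" step, the sketch below sums token counts from one JSON result. It is not the code in this PR: the `usage` key and the four field names are assumptions about the shape of the CLI output, and `headline_usage` is a hypothetical helper name.

```python
import json

# Assumed token fields inside the "usage" object; not confirmed by this PR.
TOKEN_FIELDS = (
    "input_tokens",
    "output_tokens",
    "cache_creation_input_tokens",
    "cache_read_input_tokens",
)


def headline_usage(stdout: str) -> dict:
    """Return per-field token counts plus a total from one JSON result string."""
    result = json.loads(stdout)
    usage = result.get("usage") or {}
    # Missing or null fields count as zero so partial results still aggregate.
    counts = {field: int(usage.get(field) or 0) for field in TOKEN_FIELDS}
    counts["total_tokens"] = sum(counts.values())
    return counts


# Illustrative payload only; these are not measured numbers.
sample = json.dumps(
    {"usage": {"input_tokens": 120, "output_tokens": 30, "cache_read_input_tokens": 900}}
)
print(headline_usage(sample)["total_tokens"])  # prints 1050
```

Treating absent fields as zero keeps one odd result from breaking the comparison, at the cost of hiding a schema change; a stricter variant could raise instead.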

Why a new top-level dir

tests/ is for the test suite (CI-runnable, asserts). This is ad-hoc measurement that produces numbers, not pass/fail. If something here graduates into a regression check, move it under tests/.

Test plan

- `python3 experiments/token_savings.py --help` lists `--runs`, `--shared-seed`, and `--keep-workspaces`.
- `python3 experiments/token_savings.py --runs 1 --shared-seed` is the end-to-end smoke run (requires Docker, the `claude-sandbox` image, and an Anthropic API key in the environment). It should produce `experiments/results/token_savings_<timestamp>/{report.md, raw.json}` and a stdout table.
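For readers who want to picture the CLI surface exercised by this test plan, here is a minimal `argparse` sketch with the three flags named above. The default for `--runs`, the help strings, and the `build_parser` name are assumptions for illustration, not taken from the script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the three flags named in the test plan."""
    parser = argparse.ArgumentParser(
        description="Measure token usage with vs. without recalled guidelines."
    )
    # The default of 3 is an assumption; the PR does not state the real default.
    parser.add_argument("--runs", type=int, default=3, help="measurement runs per condition")
    parser.add_argument(
        "--shared-seed",
        action="store_true",
        help="seed one workspace and reuse it for every measurement run",
    )
    parser.add_argument(
        "--keep-workspaces",
        action="store_true",
        help="skip workspace cleanup after the experiment",
    )
    return parser


# Mirrors the smoke-run invocation from the test plan.
args = build_parser().parse_args(["--runs", "1", "--shared-seed"])
print(args.runs, args.shared_seed, args.keep_workspaces)  # prints: 1 True False
```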

Summary by CodeRabbit

Documentation
- Added README documenting ad-hoc measurement scripts, setup, run examples, outputs, and expected artifacts.
New Features
- Added a token-savings experiment that runs repeated measurements, parses results, computes per-run and aggregate token metrics (means/ranges and savings), and writes timestamped reports and raw output.

Standalone (non-pytest) script that runs utterance 1 to seed guidelines and utterance 2 with vs. without recall in the claude-sandbox, comparing token usage from --output-format json plus per-turn usage from saved transcripts. Supports --shared-seed to reuse one seeded workspace across N measure runs. Lives under experiments/ so ad-hoc measurement work is separated from the test suite.

coderabbitai · 2026-05-29T14:58:44Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7f68dae9-3e0a-4bca-ad1a-38cda7b69b74

📥 Commits

Reviewing files that changed from the base of the PR and between 7259fe8 and 5f71a05.

📒 Files selected for processing (1)

experiments/token_savings.py

🚧 Files skipped from review as they are similar to previous changes (1)

experiments/token_savings.py

📝 Walkthrough

This PR adds an experiments README and a new experiment script, experiments/token_savings.py, to run Dockerized sandbox prompts, parse transcript JSONL, compare token usage for runs with recalled guidelines versus fresh workspaces, and produce timestamped markdown reports and raw JSON results.
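The transcript handling mentioned in the walkthrough (reading per-turn usage from JSONL and picking the newest transcript while excluding earlier ones) might look roughly like the sketch below. The record shape (`message.usage` on each line), the `*.jsonl` glob, and both function names are assumptions; the actual script may differ.

```python
from __future__ import annotations

import json
from pathlib import Path


def per_turn_usage(transcript: Path) -> list[dict]:
    """Collect one usage dict per transcript line that carries message.usage."""
    turns = []
    for line in transcript.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate a truncated or non-JSON line
        message = record.get("message") if isinstance(record, dict) else None
        usage = message.get("usage") if isinstance(message, dict) else None
        if isinstance(usage, dict):
            turns.append(
                {
                    "input_tokens": int(usage.get("input_tokens") or 0),
                    "output_tokens": int(usage.get("output_tokens") or 0),
                }
            )
    return turns


def newest_transcript(directory: Path, exclude: set[Path]) -> Path | None:
    """Return the most recently modified *.jsonl file not listed in exclude."""
    candidates = [p for p in directory.glob("*.jsonl") if p not in exclude]
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)
```

In shared-seed mode, a caller would snapshot the set of existing transcript paths before each measurement run and pass it as `exclude`, so the file returned belongs to the run that just finished.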

Changes

Token Savings Measurement Experiment

| Layer / File(s) | Summary |
| --- | --- |
| Documentation `experiments/README.md` | README describing the experiments directory, `token_savings.py` usage, environment/Docker requirements, example commands, output artifacts, and per-run directory structure. |
| Experiment Foundation & Sandbox Integration `experiments/token_savings.py` | Shared constants and prereq checks (Docker, sandbox image, Anthropic env vars); workspace cloning; Dockerized `claude` invocation with JSON output parsing and fallback recovery for non-strict stdout. |
| Transcript & Entity Parsing `experiments/token_savings.py` | Utilities to parse per-turn token usage from transcript JSONL, select newest transcript excluding prior paths, list learned `.evolve/entities/*.md`, and extract headline usage blocks from JSON. |
| With-guidelines per-run flow `experiments/token_savings.py` | Seed a fresh workspace with utterance 1, confirm learned entities, run measurement utterance 2 on same workspace, capture headline/per-turn usage and structured errors on failures. |
| Shared-seed mode `experiments/token_savings.py` | Seed once, enumerate recallable entities, run measurement N times on same workspace while snapshotting prior transcripts to identify each run's newest transcript. |
| Without-guidelines per-run flow `experiments/token_savings.py` | Fresh workspace creation and single measurement run with no seeding; record headline and per-turn usage and structured errors on sandbox failure. |
| Metric Aggregation & Table Rendering `experiments/token_savings.py` | Compute mean/min/max/stdev across runs for token/duration metrics, format markdown table with savings deltas and percentages, and render per-turn rows for a representative run. |
| Report Generation & CLI Orchestration `experiments/token_savings.py` | Generate `report.md` and `raw.json` in timestamped results dir; CLI parses `--runs`, `--shared-seed`, `--keep-workspaces`; optional workspace cleanup; exit code 1 if any run errored. |
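The "Metric Aggregation & Table Rendering" row can be illustrated with a small sketch that computes mean/min/max/stdev per condition and a savings delta with percentage. The function names and the sample values are hypothetical; the numbers are made up for the example and are not results from this PR or issue #260.

```python
import statistics


def summarize(values):
    """Mean/min/max/stdev for one metric across the runs of one condition."""
    return {
        "mean": statistics.fmean(values),
        "min": min(values),
        "max": max(values),
        # stdev needs at least two samples; report 0.0 for a single run.
        "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
    }


def savings(without_mean, with_mean):
    """Absolute and percentage reduction relative to the no-guidelines mean."""
    delta = without_mean - with_mean
    pct = 100.0 * delta / without_mean if without_mean else 0.0
    return delta, pct


# Illustrative token totals only, not measured data.
without_guidelines = summarize([1000, 1200, 1100])
with_guidelines = summarize([800, 900, 850])
delta, pct = savings(without_guidelines["mean"], with_guidelines["mean"])

print("| Metric | With | Without | Savings |")
print("| --- | --- | --- | --- |")
print(
    f"| total tokens | {with_guidelines['mean']:.0f} "
    f"| {without_guidelines['mean']:.0f} | {delta:.0f} ({pct:.1f}%) |"
)
```

With `--runs 1` there is only one sample per condition, which is why the sketch guards the `stdev` call rather than letting `statistics.stdev` raise.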

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A script hops through sandboxes, measuring tokens with care,
With guidelines recalled or a fresh workspace bare,
Counting every prompt byte, computing the gain,
Writing markdown reports so the numbers remain,
Hopping back to burrow with results in a chain! 📊✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 26.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'experiments: add token-savings runner for learn/recall' directly and accurately summarizes the main change: adding a token-savings measurement script for learn/recall functionality to the experiments directory. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

coderabbitai

🧹 Nitpick comments (2)

experiments/token_savings.py (1)
92-92: ⚡ Quick win

A timeout (or other exception) in one run aborts the whole experiment and discards all prior results.

subprocess.run(..., timeout=SESSION_TIMEOUT_SECONDS) raises subprocess.TimeoutExpired on timeout, which propagates uncaught up through the run loop in main. Since raw.json/report.md are written only after all runs complete, a single hung claude invocation loses every completed run — costly given the documented multi-minute-per-run wall-clock budget. Consider catching it here and surfacing it as a per-run error so the rest of the experiment proceeds.
♻️ Capture timeout as a normal failure instead of crashing
-    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=SESSION_TIMEOUT_SECONDS)
+    try:
+        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=SESSION_TIMEOUT_SECONDS)
+    except subprocess.TimeoutExpired as exc:
+        # Synthesize a failed CompletedProcess so callers record an error
+        # for this run rather than aborting the entire experiment.
+        return subprocess.CompletedProcess(
+            cmd, returncode=124, stdout=exc.stdout or "", stderr="timeout"
+        ), None
Note: callers already branch on proc.returncode != 0 and record measure_failed/seed_failed, so a non-zero returncode here integrates cleanly.
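A self-contained way to try the pattern suggested above is sketched here. It is an illustration rather than the script's code: `run_with_timeout` is a hypothetical name, it returns only the process object (the suggested diff returns a tuple), and it also decodes `exc.stdout`, which can arrive as bytes or `None` on timeout.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_seconds):
    """Run cmd; on timeout, return a failed CompletedProcess instead of raising."""
    try:
        return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_seconds)
    except subprocess.TimeoutExpired as exc:
        # Partial output may be bytes or None, depending on what was captured.
        stdout = exc.stdout or ""
        if isinstance(stdout, bytes):
            stdout = stdout.decode("utf-8", errors="replace")
        return subprocess.CompletedProcess(
            cmd,
            returncode=124,  # same value as the diff above; also what coreutils `timeout` exits with
            stdout=stdout,
            stderr=f"timeout after {timeout_seconds}s",
        )


# A child that sleeps longer than the limit is reported as a normal failure.
proc = run_with_timeout([sys.executable, "-c", "import time; time.sleep(5)"], 0.5)
print(proc.returncode)  # prints 124
```

Because the caller only sees a non-zero `returncode`, a run loop can record the failure and continue, so results from completed runs are still written at the end.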
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@experiments/token_savings.py` at line 92, Wrap the subprocess.run call that
assigns proc (subprocess.run(..., timeout=SESSION_TIMEOUT_SECONDS)) in a
try/except that catches subprocess.TimeoutExpired (and optionally other
exceptions) and converts the failure into a per-run failure instead of letting
it propagate to main: on exception, synthesize a failure "proc" (or set
variables used later) with a non-zero returncode and stderr/stdout containing
the exception message so the existing returncode checks and
measure_failed/seed_failed bookkeeping still record this run as failed; do this
change in experiments/token_savings.py around the subprocess.run invocation so
raw.json/report.md continue to be written after the loop completes.
experiments/README.md (1)
17-18: ⚡ Quick win

Clarify Docker prerequisite to mention daemon.

The script checks both that Docker is installed and that the daemon is running (see _check_prerequisites()), but the README only mentions "Docker." Consider clarifying to "Docker (installed and daemon running)" for precision.

Based on learnings from the implementation contract in _check_prerequisites() which validates both shutil.which("docker") and docker info returncode.
📝 Proposed clarification
-**Requires:** Docker, the `claude-sandbox` image (`just sandbox-build claude`),
+**Requires:** Docker (installed and daemon running), the `claude-sandbox` image (`just sandbox-build claude`),
 and `ANTHROPIC_API_KEY` (or `ANTHROPIC_AUTH_TOKEN`) in the environment.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@experiments/README.md` around lines 17 - 18, Update the README entry that
currently says "Docker" to clarify the daemon requirement—e.g., change the
prerequisite text to "Docker (installed and daemon running)" or similar; this
aligns the documentation with the runtime checks performed by the
_check_prerequisites() function which verifies both shutil.which("docker") and
successful `docker info` execution, ensuring readers know the Docker daemon must
be running.
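The two-part check described in this comment (binary on `PATH`, then a successful `docker info`) can be sketched as follows. This is not the body of `_check_prerequisites()`; the `docker_problem` name, the messages, the 30-second timeout, and the injectable `which`/`run` parameters are assumptions added so the logic can be exercised without Docker.

```python
import shutil
import subprocess


def docker_problem(which=shutil.which, run=subprocess.run):
    """Return a message describing what is missing, or None if Docker is usable."""
    if which("docker") is None:
        return "docker CLI not found on PATH"
    try:
        proc = run(["docker", "info"], capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"could not run `docker info`: {exc}"
    if proc.returncode != 0:
        # The CLI exists, but it could not talk to a running daemon.
        return "docker is installed but the daemon is not reachable"
    return None


# Demonstrate the "installed but daemon down" branch with stand-ins.
def fake_which(name):
    return "/usr/bin/docker"


def fake_run(cmd, **kwargs):
    return subprocess.CompletedProcess(cmd, returncode=1, stdout="", stderr="")


print(docker_problem(which=fake_which, run=fake_run))
```

Calling `docker_problem()` with no arguments performs the real check against the local machine.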

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 28b0b470-cfd8-47ca-adca-0cfe9c7ccfa5

📥 Commits

Reviewing files that changed from the base of the PR and between 003bd53 and 7259fe8.

📒 Files selected for processing (2)

experiments/README.md
experiments/token_savings.py

vinodmut added 2 commits May 29, 2026 09:58

docs(experiments): add README for the experiments dir

7259fe8

Documents what experiments/ is for (ad-hoc measurement, not the test suite), how to run token_savings.py, what the results layout looks like, and the rough wall-clock budget.

coderabbitai Bot reviewed May 29, 2026

View reviewed changes

fix(experiments): apply ruff format to token_savings.py

5f71a05

Fixes failing CI check: check-formatting (3.12)

vinodmut requested review from illeatmyhat, jayaramkr and visahak May 29, 2026 15:40

visahak approved these changes May 29, 2026

View reviewed changes

visahak merged commit c57148b into AgentToolkit:main May 29, 2026
17 checks passed

vinodmut deleted the token-savings-experiment branch May 29, 2026 18:56

visahak merged 3 commits into AgentToolkit:main from vinodmut:token-savings-experiment
