compress: e2e + recon scoring, OpenAI-std env vars #2
Find the shortest natural-language prompt that elicits a functionally-equivalent shprout from an LLM. Multi-stage:
- eval.py: 3-tier scoring (regex → trace via fake_api+sandbox → judge)
- fake_api.py: stateful local OpenAI-compatible server, scripted replies, 503 after T2 to catch termination bugs, logs full request/response trace
- eval_simple.py: lightweight gen + judge-vs-real-shprout (single Claude call asks "is this functionally equivalent to shprout source?")
- search.py: hand-rolled loop, Claude proposes shorter prompts, logs every attempt to candidates.jsonl
- visualize.py: HTML dashboard via the show skill, Pareto frontier highlighted, source-length comparison
Initial run: 312-byte prompt scores 1.00 (vs 856-byte source); 166-byte prompt scores 0.80. Models swapped to deepseek-pro (gen) + claude-large (judge) for the next iteration.
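The 3-tier cascade in eval.py (cheap regex gate, then a sandboxed trace replay, then an LLM judge) can be sketched roughly as below. The function name, the callable stand-ins, and the tier weights are assumptions for illustration, not the repo's actual API:

```python
import re

def score_candidate(script: str, run_trace, ask_judge) -> float:
    """Run the cheapest check first; escalate only while a tier passes.

    run_trace and ask_judge are hypothetical callables standing in for
    the fake_api/sandbox replay and the judge-model call.
    """
    # Tier 1: regex sanity check -- does the output even attempt an HTTP call?
    if not re.search(r"curl|urllib|http", script):
        return 0.0
    # Tier 2: replay against the scripted fake_api server in a sandbox.
    if not run_trace(script):
        return 0.4  # plausible-looking but failed the trace
    # Tier 3: judge model rates functional equivalence in [0, 1].
    return 0.6 + 0.4 * ask_judge(script)
```

The point of the ordering is cost: regex is free, the sandbox replay is cheap and deterministic, and the judge call is the only one that spends model tokens.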
Pollinations is OpenAI-compatible, so the litellm wrapper bought us nothing, and it caused mystery hangs (deepseek-pro stalled silently). Direct urllib calls are 20× faster on cache hits and have zero deps.
- Single chat() helper in eval_simple.py shared by all three modules
- Model names lose the "openai/" prefix (just "claude-large", etc.)
- pyproject.toml dependencies = [] (pure stdlib)
- pip install gepa was inherited from the abandoned framework run
A 117-byte natural-language prompt now elicits a functionally-equivalent shprout from claude-large, verified by all three judges (openai-large, claude-large, claude-opus-4.7) at exactly 8/10. That's 7.32× compression vs the 856-byte source.
What changed:
- search.py: leaderboard-aware proposer (top-K shown each iter), Pareto seeding via --seed-from-snapshots, --proposer/--log CLI flags, dedup, temperature 1.0 (Bedrock cap)
- eval_simple.py: chat() retries without temperature on the "deprecated" 400 (claude-opus-4.7 reasoning models reject it), GEN_MODEL/JUDGE_MODEL pickable via env var
- model_sweep.py: 17-generator benchmark on the champion prompt
- head_to_head.py: n×n generator/judge grid for two models
- visualize.py: load all candidates-*.jsonl snapshots, per-run breakdown
- visualize_crossjudge.py: cross-judge dashboard showing min/mean per candidate so single-judge bias is visible
- README: rewrite compression section with headline, approach, lessons
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t @ 60% e2e
shprout: rename env vars to OpenAI standard (OPENAI_API_KEY/MODEL/OPENAI_BASE_URL),
loop bound 20→10, drop "with bearer" from descriptions.
compress/:
- e2e_test.py: sandboxed runtime test against SHA256 task; runtime model = claude
- cycle.py: closed-loop search with reconstruction scoring + 90% prompt-length cap
- reconstruction.py: deterministic normalized-diff scorer (strip comments,
collapse whitespace, alpha-rename internal vars; preserve env-var names + API path)
- eval_simple.py: hardened JUDGE_SYS with cap-at-4 for missing critical behaviors,
seed param to bypass cache, generate() with seed_base
- search.py: rubric-aware proposer, iteration-seed prefix to break dup wall,
rejection-feedback retries with temperature ramp
- visualize_e2e.py: HTML dashboard renderer for e2e candidate results
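The reconstruction.py normalization pipeline described above (strip comments, collapse whitespace, alpha-rename internal vars, preserve env-var names) can be approximated as below; the protected-name list, the naive rename scheme, and the SequenceMatcher-based similarity are assumptions, not the repo's exact scorer:

```python
import difflib
import re

# Env-var names the scorer must leave verbatim (assumed list).
PROTECTED = {"OPENAI_API_KEY", "MODEL", "OPENAI_BASE_URL"}

def normalize(src: str) -> str:
    src = re.sub(r"(?m)^\s*#.*$", "", src)  # strip full-line comments
    # Collect assigned variable names in order of first appearance.
    names = dict.fromkeys(re.findall(r"\b([A-Za-z_][A-Za-z0-9_]*)=", src))
    # Naive alpha-rename: would misbehave if a script already uses v0, v1, ...
    for i, name in enumerate(n for n in names if n not in PROTECTED):
        src = re.sub(rf"\b{re.escape(name)}\b", f"v{i}", src)
    return re.sub(r"\s+", " ", src).strip()  # collapse whitespace

def recon_score(candidate: str, target: str) -> float:
    # Similarity of the normalized forms, in [0, 1]; deterministic by design.
    return difflib.SequenceMatcher(
        None, normalize(candidate), normalize(target)
    ).ratio()
```

Determinism is the point: unlike a judge model, two scripts that differ only in comments, spacing, and internal variable names score exactly 1.0 every run.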
Findings (n≥3 sandboxed runs):
- 254B prompt → claude-large gen → 2596-3124B bash → 60% e2e on claude runtime
- 288B prompt + brevity hint → claude-opus-4.7 gen → 392-481B bash → 67% e2e
- opus-4.7 actually golfs when asked; claude-large defaults to verbose enterprise bash
- Recon objective rejected: claude-large refuses to compress under 90% of target,
pastes shprout verbatim instead. The LLM's K-complexity floor for shprout
(under our prompts) is roughly the script itself.
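The rejection-feedback retry with temperature ramp listed for search.py could look like the sketch below; propose and accept are hypothetical stand-ins for the real proposer call and the dedup/rubric check:

```python
def propose_with_retries(propose, accept, max_tries=4, t0=0.7, step=0.1):
    """Re-ask the proposer with the rejection reason, ramping temperature
    each attempt so repeated rejections push toward more diverse output."""
    feedback = ""
    for i in range(max_tries):
        temp = min(1.0, t0 + i * step)  # Bedrock caps temperature at 1.0
        cand = propose(feedback, temperature=temp)
        ok, reason = accept(cand)
        if ok:
            return cand
        # Feed the rejection back so the next attempt sees why it failed.
        feedback = f"Rejected: {reason}. Propose something different."
    return None  # exhausted the budget without an accepted candidate
```

Feeding the rejection reason back is what breaks the "dup wall": without it, a deterministic proposer at low temperature keeps emitting the same candidate.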
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Snapshot of the generator-comparison runs that previously lived only in /tmp:
- sample-claude-large-2596B.sh (no brevity hint, 3/5 e2e)
- sample-opus-4.7-brevity-395B.sh (brevity hint, 2/3 e2e)
- sample-opus-4.7-think-392B.sh (think-and-golf, 1/3 e2e)
- show_compare.py (dashboard generator, paths fixed to repo)
- results.json (structured metrics)
- README.md (index)
Headline: opus-4.7 + brevity hint produces 395B passing bash, 2.30× smaller than the 910B reference.
Changes
- shprout: env vars → OpenAI standard (OPENAI_API_KEY/MODEL/OPENAI_BASE_URL), loop bound 20→10
- compress/cycle.py: closed-loop search with reconstruction scoring + 90% prompt-length cap
- compress/reconstruction.py: deterministic normalized-diff scorer
- compress/e2e_test.py: sandboxed runtime test (SHA256 task → f017597f)
- compress/visualize_e2e.py: HTML dashboard
- compress/{eval_simple,search}.py: hardened judge, rubric-aware proposer, rejection-feedback retries
- Findings (n≥3 sandboxed runs)
Test plan
- OPENAI_API_KEY=… MODEL=claude OPENAI_BASE_URL=… ./shprout "print sha256 of 'shprout'" → outputs f017597f
- cd compress && uv run python e2e_test.py → e2e harness runs

🤖 Generated with Claude Code