
compress: e2e + recon scoring, OpenAI-std env vars #2

Open
voodoohop wants to merge 5 commits into main from prompt-compression

Conversation

@voodoohop

Changes

  • shprout: env vars → OpenAI standard (OPENAI_API_KEY/MODEL/OPENAI_BASE_URL), loop bound 20→10
  • compress/cycle.py: closed-loop search with reconstruction scoring + 90% prompt-length cap (sketched after this list)
  • compress/reconstruction.py: deterministic normalized-diff scorer
  • compress/e2e_test.py: sandboxed runtime test (SHA256 task → f017597f)
  • compress/visualize_e2e.py: HTML dashboard
  • compress/{eval_simple,search}.py: hardened judge, rubric-aware proposer, rejection-feedback retries
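
The cycle.py loop, roughly. A minimal sketch: propose/reconstruct/score stand in for the real helpers, and the exact loop shape is an assumption, not the implementation:

    # Closed loop: propose a shorter prompt, reconstruct the script from it,
    # keep the candidate only if the reconstruction scores at least as well.
    def closed_loop(seed_prompt, source, propose, reconstruct, score, iters=10):
        best, best_score = seed_prompt, score(source, reconstruct(seed_prompt))
        for _ in range(iters):
            cand = propose(best)
            if len(cand) > 0.9 * len(best):   # 90% prompt-length cap:
                continue                      # a candidate must actually compress
            s = score(source, reconstruct(cand))
            if s >= best_score:
                best, best_score = cand, s
        return best, best_score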

Findings (n≥3 sandboxed runs)

  • 254B prompt → claude-large gen → ~2700B bash → 60% e2e
  • 288B prompt + brevity hint → claude-opus-4.7 gen → ~430B bash → 67% e2e
  • opus-4.7 actually golfs when asked; claude-large defaults to verbose bash regardless
  • Recon objective failed: claude-large pastes shprout verbatim rather than compressing it

Test plan

  • OPENAI_API_KEY=… MODEL=claude OPENAI_BASE_URL=… ./shprout "print sha256 of 'shprout'" → outputs f017597f
  • cd compress && uv run python e2e_test.py → e2e harness runs (pass-rate sketch below)
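
The e2e percentages above are pass rates over repeated runs; a minimal sketch of that style of check (run count, bash invocation, and the prefix test are assumptions; the real harness also sandboxes the subprocess):

    import subprocess

    def e2e_pass_rate(script_path, expected_prefix="f017597f", runs=5):
        """Run the generated script repeatedly; a run passes when stdout
        starts with the expected SHA256 prefix."""
        passes = 0
        for _ in range(runs):
            try:
                out = subprocess.run(["bash", script_path], capture_output=True,
                                     text=True, timeout=120).stdout
            except subprocess.TimeoutExpired:
                continue
            passes += out.strip().startswith(expected_prefix)
        return passes / runs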

🤖 Generated with Claude Code

voodoohop and others added 5 commits April 25, 2026 15:14
Find the shortest natural-language prompt that elicits a functionally-
equivalent shprout from an LLM. Multi-stage:

- eval.py: 3-tier scoring (regex → trace via fake_api+sandbox → judge)
- fake_api.py: stateful local OpenAI-compatible server, scripted replies,
  503 after turn 2 (T2) to catch termination bugs, logs full request/response
  trace; sketched after this list
- eval_simple.py: lightweight gen + judge-vs-real-shprout (single Claude
  call asks "is this functionally equivalent to shprout source?")
- search.py: hand-rolled loop, Claude proposes shorter prompts, logs
  every attempt to candidates.jsonl
- visualize.py: HTML dashboard via the show skill, Pareto frontier
  highlighted, source-length comparison
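
A stdlib-only sketch of the fake_api.py idea (the reply format follows the OpenAI chat schema; the scripted turns and everything else here are assumptions):

    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    SCRIPT = ['{"cmd": "echo hi"}', "done"]   # scripted assistant turns
    TRACE = []                                # full request log

    class FakeAPI(BaseHTTPRequestHandler):
        def do_POST(self):
            body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
            TRACE.append(body)
            if not SCRIPT:                    # out of scripted turns: answer 503,
                self.send_response(503)       # which exposes clients that
                self.end_headers()            # never terminate
                return
            reply = json.dumps({"choices": [{"message": {
                "role": "assistant", "content": SCRIPT.pop(0)}}]}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(reply)))
            self.end_headers()
            self.wfile.write(reply)

    # HTTPServer(("127.0.0.1", 8080), FakeAPI).serve_forever()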

Initial run: 312-byte prompt scores 1.00 (vs 856-byte source); 166-byte
prompt scores 0.80. Models swapped to deepseek-pro (gen) + claude-large
(judge) for the next iteration.
Pollinations is OpenAI-compatible, so the litellm wrapper bought us
nothing — and it caused mystery hangs (deepseek-pro stalled silently).
Direct urllib calls are 20× faster on cache hits and have zero deps.

- Single chat() helper in eval_simple.py shared by all three modules (sketch below)
- Model names lose the "openai/" prefix (just "claude-large", etc.)
- pyproject.toml dependencies = [] — pure stdlib
- pip install gepa dropped; it was inherited from the abandoned framework run
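
The whole helper fits in a dozen lines of stdlib. A sketch: the OPENAI_* env vars match this PR's convention, everything else is an assumption:

    import json, os, urllib.request

    def chat(messages, model=None, temperature=1.0):
        """One OpenAI-compatible chat completion via urllib, zero deps."""
        req = urllib.request.Request(
            os.environ["OPENAI_BASE_URL"].rstrip("/") + "/chat/completions",
            data=json.dumps({"model": model or os.environ["MODEL"],
                             "messages": messages,
                             "temperature": temperature}).encode(),
            headers={"Content-Type": "application/json",
                     "Authorization": "Bearer " + os.environ["OPENAI_API_KEY"]})
        with urllib.request.urlopen(req, timeout=120) as resp:
            return json.loads(resp.read())["choices"][0]["message"]["content"]
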
A 117-byte natural-language prompt now elicits a functionally-equivalent
shprout from claude-large, verified by all three judges (openai-large,
claude-large, claude-opus-4.7) at exactly 8/10. That's 7.32× compression
vs the 856-byte source.

What changed:
- search.py: leaderboard-aware proposer (top-K shown each iter), Pareto
  seeding via --seed-from-snapshots (frontier sketched below), --proposer/--log
  CLI flags, dedup, temperature 1.0 (Bedrock cap)
- eval_simple.py: chat() retries without temperature on the
  "deprecated" 400 (claude-opus-4.7 reasoning models reject it),
  GEN_MODEL/JUDGE_MODEL pickable via env var
- model_sweep.py: 17-generator benchmark on the champion prompt
- head_to_head.py: n×n generator/judge grid for two models
- visualize.py: load all candidates-*.jsonl snapshots, per-run breakdown
- visualize_crossjudge.py: cross-judge dashboard showing min/mean per
  candidate so single-judge bias is visible
- README: rewrite compression section with headline, approach, lessons

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
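
The Pareto frontier that search.py seeds from and the dashboards highlight is the standard two-objective frontier over (prompt length, score); a sketch, with the record fields assumed:

    # A candidate is on the frontier if no other candidate is both no longer
    # and no worse, with at least one strict improvement.
    def pareto_frontier(candidates):
        def dominates(a, b):
            return (len(a["prompt"]) <= len(b["prompt"]) and a["score"] >= b["score"]
                    and (len(a["prompt"]) < len(b["prompt"]) or a["score"] > b["score"]))
        return sorted((c for c in candidates
                       if not any(dominates(o, c) for o in candidates)),
                      key=lambda c: len(c["prompt"]))
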
…t @ 60% e2e

shprout: rename env vars to OpenAI standard (OPENAI_API_KEY/MODEL/OPENAI_BASE_URL),
loop bound 20→10, drop "with bearer" from descriptions.

compress/:
  - e2e_test.py: sandboxed runtime test against SHA256 task; runtime model = claude
  - cycle.py: closed-loop search with reconstruction scoring + 90% prompt-length cap
  - reconstruction.py: deterministic normalized-diff scorer (strip comments,
    collapse whitespace, alpha-rename internal vars; preserve env-var names +
    API path); sketched after this list
  - eval_simple.py: hardened JUDGE_SYS with cap-at-4 for missing critical behaviors,
    seed param to bypass cache, generate() with seed_base
  - search.py: rubric-aware proposer, iteration-seed prefix to break dup wall,
    rejection-feedback retries with temperature ramp
  - visualize_e2e.py: HTML dashboard renderer for e2e candidate results
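
A sketch of that scorer (the regexes and the difflib ratio are assumptions; the real normalizer is presumably more careful about shell keywords):

    import difflib, re

    # Names that must survive alpha-renaming: env vars + the API path tokens.
    KEEP = {"OPENAI_API_KEY", "MODEL", "OPENAI_BASE_URL", "chat", "completions"}

    def normalize(src):
        src = re.sub(r"(?m)#.*$", "", src)        # strip comments
        src = re.sub(r"\s+", " ", src).strip()    # collapse whitespace
        names, out = {}, []
        for tok in re.split(r"(\W+)", src):       # crude alpha-rename of every
            if re.fullmatch(r"[A-Za-z_]\w*", tok) and tok not in KEEP:
                tok = names.setdefault(tok, "v%d" % len(names))
            out.append(tok)
        return "".join(out)

    def recon_score(original, reconstruction):
        """Deterministic similarity in [0, 1] between normalized sources."""
        return difflib.SequenceMatcher(None, normalize(original),
                                       normalize(reconstruction)).ratio()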

Findings (n≥3 sandboxed runs):
  - 254B prompt → claude-large gen → 2596-3124B bash → 60% e2e on claude runtime
  - 288B prompt + brevity hint → claude-opus-4.7 gen → 392-481B bash → 67% e2e
  - opus-4.7 actually golfs when asked; claude-large defaults to verbose enterprise bash
  - Recon objective rejected: claude-large refuses to compress under 90% of target,
    pastes shprout verbatim instead. The LLM's K-complexity floor for shprout
    (under our prompts) is roughly the script itself.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Snapshot of the generator-comparison runs that previously lived only in /tmp:

- sample-claude-large-2596B.sh   (no brevity hint, 3/5 e2e)
- sample-opus-4.7-brevity-395B.sh (brevity hint, 2/3 e2e)
- sample-opus-4.7-think-392B.sh   (think-and-golf, 1/3 e2e)
- show_compare.py                 (dashboard generator, paths fixed to repo)
- results.json                    (structured metrics)
- README.md                       (index)

Headline: opus-4.7 + brevity hint produces 395B passing bash,
2.30× smaller than the 910B reference.
