
compress: e2e + recon scoring, OpenAI-std env vars #2

Open
voodoohop wants to merge 5 commits into main from prompt-compression

Conversation

@voodoohop

Changes

  • shprout: env vars → OpenAI standard (OPENAI_API_KEY/MODEL/OPENAI_BASE_URL), loop bound 20→10
  • compress/cycle.py: closed-loop search with reconstruction scoring + 90% prompt-length cap (sketched after this list)
  • compress/reconstruction.py: deterministic normalized-diff scorer
  • compress/e2e_test.py: sandboxed runtime test (SHA256 task → f017597f)
  • compress/visualize_e2e.py: HTML dashboard
  • compress/{eval_simple,search}.py: hardened judge, rubric-aware proposer, rejection-feedback retries
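
The cycle.py loop, roughly. A minimal sketch: propose/reconstruct/score stand in for the real helpers, and the exact loop shape is an assumption, not the implementation:

    # Closed loop: propose a shorter prompt, reconstruct the script from it,
    # keep the candidate only if the reconstruction scores at least as well.
    def closed_loop(seed_prompt, source, propose, reconstruct, score, iters=10):
        best, best_score = seed_prompt, score(source, reconstruct(seed_prompt))
        for _ in range(iters):
            cand = propose(best)
            if len(cand) > 0.9 * len(best):   # 90% prompt-length cap:
                continue                      # a candidate must actually compress
            s = score(source, reconstruct(cand))
            if s >= best_score:
                best, best_score = cand, s
        return best, best_score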

Findings (n≥3 sandboxed runs)

  • 254B prompt → claude-large gen → ~2700B bash → 60% e2e
  • 288B prompt + brevity hint → claude-opus-4.7 gen → ~430B bash → 67% e2e
  • opus-4.7 actually golfs when asked; claude-large defaults to verbose bash regardless
  • Recon objective failed: claude-large pastes shprout verbatim rather than compressing it

Test plan

  • OPENAI_API_KEY=… MODEL=claude OPENAI_BASE_URL=… ./shprout "print sha256 of 'shprout'" → outputs f017597f
  • cd compress && uv run python e2e_test.py → e2e harness runs (pass-rate sketch below)
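
The e2e percentages above are pass rates over repeated runs; a minimal sketch of that style of check (run count, bash invocation, and the prefix test are assumptions; the real harness also sandboxes the subprocess):

    import subprocess

    def e2e_pass_rate(script_path, expected_prefix="f017597f", runs=5):
        """Run the generated script repeatedly; a run passes when stdout
        starts with the expected SHA256 prefix."""
        passes = 0
        for _ in range(runs):
            try:
                out = subprocess.run(["bash", script_path], capture_output=True,
                                     text=True, timeout=120).stdout
            except subprocess.TimeoutExpired:
                continue
            passes += out.strip().startswith(expected_prefix)
        return passes / runs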

🤖 Generated with Claude Code

voodoohop and others added 5 commits April 25, 2026 15:14
Find the shortest natural-language prompt that elicits a functionally-
equivalent shprout from an LLM. Multi-stage:

- eval.py: 3-tier scoring (regex → trace via fake_api+sandbox → judge)
- fake_api.py: stateful local OpenAI-compatible server, scripted replies,
  503 after turn 2 (T2) to catch termination bugs, logs full request/response
  trace; sketched after this list
- eval_simple.py: lightweight gen + judge-vs-real-shprout (single Claude
  call asks "is this functionally equivalent to shprout source?")
- search.py: hand-rolled loop, Claude proposes shorter prompts, logs
  every attempt to candidates.jsonl
- visualize.py: HTML dashboard via the show skill, Pareto frontier
  highlighted, source-length comparison
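
A stdlib-only sketch of the fake_api.py idea (the reply format follows the OpenAI chat schema; the scripted turns and everything else here are assumptions):

    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    SCRIPT = ['{"cmd": "echo hi"}', "done"]   # scripted assistant turns
    TRACE = []                                # full request log

    class FakeAPI(BaseHTTPRequestHandler):
        def do_POST(self):
            body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
            TRACE.append(body)
            if not SCRIPT:                    # out of scripted turns: answer 503,
                self.send_response(503)       # which exposes clients that
                self.end_headers()            # never terminate
                return
            reply = json.dumps({"choices": [{"message": {
                "role": "assistant", "content": SCRIPT.pop(0)}}]}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(reply)))
            self.end_headers()
            self.wfile.write(reply)

    # HTTPServer(("127.0.0.1", 8080), FakeAPI).serve_forever()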

Initial run: 312-byte prompt scores 1.00 (vs 856-byte source); 166-byte
prompt scores 0.80. Models swapped to deepseek-pro (gen) + claude-large
(judge) for the next iteration.
Pollinations is OpenAI-compatible, so the litellm wrapper bought us
nothing — and it caused mystery hangs (deepseek-pro stalled silently).
Direct urllib calls are 20× faster on cache hits and have zero deps.

- Single chat() helper in eval_simple.py shared by all three modules (sketch below)
- Model names lose the "openai/" prefix (just "claude-large", etc.)
- pyproject.toml dependencies = [] — pure stdlib
- pip install gepa dropped; it was inherited from the abandoned framework run
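
The whole helper fits in a dozen lines of stdlib. A sketch: the OPENAI_* env vars match this PR's convention, everything else is an assumption:

    import json, os, urllib.request

    def chat(messages, model=None, temperature=1.0):
        """One OpenAI-compatible chat completion via urllib, zero deps."""
        req = urllib.request.Request(
            os.environ["OPENAI_BASE_URL"].rstrip("/") + "/chat/completions",
            data=json.dumps({"model": model or os.environ["MODEL"],
                             "messages": messages,
                             "temperature": temperature}).encode(),
            headers={"Content-Type": "application/json",
                     "Authorization": "Bearer " + os.environ["OPENAI_API_KEY"]})
        with urllib.request.urlopen(req, timeout=120) as resp:
            return json.loads(resp.read())["choices"][0]["message"]["content"]
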
A 117-byte natural-language prompt now elicits a functionally-equivalent
shprout from claude-large, verified by all three judges (openai-large,
claude-large, claude-opus-4.7) at exactly 8/10. That's 7.32× compression
vs the 856-byte source.

What changed:
- search.py: leaderboard-aware proposer (top-K shown each iter), Pareto
  seeding via --seed-from-snapshots (frontier sketched below), --proposer/--log
  CLI flags, dedup, temperature 1.0 (Bedrock cap)
- eval_simple.py: chat() retries without temperature on the
  "deprecated" 400 (claude-opus-4.7 reasoning models reject it),
  GEN_MODEL/JUDGE_MODEL pickable via env var
- model_sweep.py: 17-generator benchmark on the champion prompt
- head_to_head.py: n×n generator/judge grid for two models
- visualize.py: load all candidates-*.jsonl snapshots, per-run breakdown
- visualize_crossjudge.py: cross-judge dashboard showing min/mean per
  candidate so single-judge bias is visible
- README: rewrite compression section with headline, approach, lessons

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
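
The Pareto frontier that search.py seeds from and the dashboards highlight is the standard two-objective frontier over (prompt length, score); a sketch, with the record fields assumed:

    # A candidate is on the frontier if no other candidate is both no longer
    # and no worse, with at least one strict improvement.
    def pareto_frontier(candidates):
        def dominates(a, b):
            return (len(a["prompt"]) <= len(b["prompt"]) and a["score"] >= b["score"]
                    and (len(a["prompt"]) < len(b["prompt"]) or a["score"] > b["score"]))
        return sorted((c for c in candidates
                       if not any(dominates(o, c) for o in candidates)),
                      key=lambda c: len(c["prompt"]))
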
…t @ 60% e2e

shprout: rename env vars to OpenAI standard (OPENAI_API_KEY/MODEL/OPENAI_BASE_URL),
loop bound 20→10, drop "with bearer" from descriptions.

compress/:
  - e2e_test.py: sandboxed runtime test against SHA256 task; runtime model = claude
  - cycle.py: closed-loop search with reconstruction scoring + 90% prompt-length cap
  - reconstruction.py: deterministic normalized-diff scorer (strip comments,
    collapse whitespace, alpha-rename internal vars; preserve env-var names +
    API path); sketched after this list
  - eval_simple.py: hardened JUDGE_SYS with cap-at-4 for missing critical behaviors,
    seed param to bypass cache, generate() with seed_base
  - search.py: rubric-aware proposer, iteration-seed prefix to break dup wall,
    rejection-feedback retries with temperature ramp
  - visualize_e2e.py: HTML dashboard renderer for e2e candidate results
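
A sketch of that scorer (the regexes and the difflib ratio are assumptions; the real normalizer is presumably more careful about shell keywords):

    import difflib, re

    # Names that must survive alpha-renaming: env vars + the API path tokens.
    KEEP = {"OPENAI_API_KEY", "MODEL", "OPENAI_BASE_URL", "chat", "completions"}

    def normalize(src):
        src = re.sub(r"(?m)#.*$", "", src)        # strip comments
        src = re.sub(r"\s+", " ", src).strip()    # collapse whitespace
        names, out = {}, []
        for tok in re.split(r"(\W+)", src):       # crude alpha-rename of every
            if re.fullmatch(r"[A-Za-z_]\w*", tok) and tok not in KEEP:
                tok = names.setdefault(tok, "v%d" % len(names))
            out.append(tok)
        return "".join(out)

    def recon_score(original, reconstruction):
        """Deterministic similarity in [0, 1] between normalized sources."""
        return difflib.SequenceMatcher(None, normalize(original),
                                       normalize(reconstruction)).ratio()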

Findings (n≥3 sandboxed runs):
  - 254B prompt → claude-large gen → 2596-3124B bash → 60% e2e on claude runtime
  - 288B prompt + brevity hint → claude-opus-4.7 gen → 392-481B bash → 67% e2e
  - opus-4.7 actually golfs when asked; claude-large defaults to verbose enterprise bash
  - Recon objective rejected: claude-large refuses to compress under 90% of target,
    pastes shprout verbatim instead. The LLM's K-complexity floor for shprout
    (under our prompts) is roughly the script itself.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Snapshot of the generator-comparison runs that previously lived only in /tmp:

- sample-claude-large-2596B.sh   (no brevity hint, 3/5 e2e)
- sample-opus-4.7-brevity-395B.sh (brevity hint, 2/3 e2e)
- sample-opus-4.7-think-392B.sh   (think-and-golf, 1/3 e2e)
- show_compare.py                 (dashboard generator, paths fixed to repo)
- results.json                    (structured metrics)
- README.md                       (index)

Headline: opus-4.7 + brevity hint produces 395B passing bash,
2.30× smaller than the 910B reference.
