Pre-launch fixes: drop banner, refresh /paper, remove Enhanced CPS mentions by MaxGhenis · Pull Request #76 · PolicyEngine/policybench

MaxGhenis · 2026-06-15T21:16:44Z

⚠️ Merge at launch time only

Launch-readiness fixes for policybench.org. Merging deploys to production, so hold until launch goes live (first action in the runbook).

Changes

Drop the pre-release banner (Hero.tsx).
Refresh the /paper page (app/src/app/paper/page.tsx): replace the stale May date and per-country (UK) framing in the copy and share metadata with the June (2026-06-14) US populace snapshot, and bump the iframe cache-buster.
Remove Enhanced CPS mentions — the paper abstract called populace the successor to the Enhanced CPS, and a dormant UK methodology string named Enhanced CPS records. Both dropped; manuscript re-rendered (PDF + web) and hashes re-pinned.

Remaining enhanced_cps references in the repo are internal code identifiers and the real UK .h5 dataset filename (plumbing, not user-facing); part of the separate us-data/uk-data removal, not this PR.

Verification

bun run lint + bun run build clean; manuscript snapshot tests pass; "Enhanced CPS" absent from the rendered paper (HTML + PDF) and all app source.
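The "absent from all app source" check can be approximated with a small scan. This is only a sketch under stated assumptions: the function name and the in-memory `files` mapping are hypothetical, and the PR does not say how the check was actually run. The match is case-insensitive on the user-facing phrase, so internal identifiers such as `enhanced_cps` are not flagged.

```python
from typing import Dict, List


def find_mentions(files: Dict[str, str], phrase: str = "Enhanced CPS") -> List[str]:
    """Return the names of files whose text contains the phrase.

    `files` maps a (hypothetical) file name to its text content.
    The comparison ignores case; an identifier like `enhanced_cps`
    does not match because it uses an underscore, not a space.
    """
    needle = phrase.lower()
    return sorted(name for name, text in files.items() if needle in text.lower())
```

An empty result would correspond to the "absent" claim; rendered HTML or text extracted from the PDF could be passed through the same function.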

🤖 Generated with Claude Code

The provisional/"we plan to rerun" banner is removed so the site matches the launched June populace results. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

vercel · 2026-06-15T21:16:51Z


Project	Deployment	Actions	Updated (UTC)
policybench-site	Ready	Preview, Comment	Jun 16, 2026 8:41pm

The paper page wrapper still showed the May snapshot date, the May 13-20 response window, and a per-country (UK) framing in its copy and share metadata. Update the date label, description, body copy, and iframe cache-buster to the June 2026-06-14 US populace snapshot. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The paper abstract described populace as the successor to the Enhanced CPS, and a dormant UK methodology string named Enhanced CPS records. Drop both; re-render the manuscript and re-pin hashes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The "Scoring and weighting" card had an unconditional sentence naming UK enhanced FRS weights and the UK transfer scenarios, which rendered on the US-only site. Keep only the US populace weighting description. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The benchmark-scope footer ("/ UK fiscal year 2026-27") and the Household sensitivity-view description both named UK weighting/scope unconditionally, so they rendered on the US-only site. Keep US only. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Switch PolicyBench's headline/default metric from the within-1% hit rate to exact match, the deployability bar: a prediction counts only if it matches the PolicyEngine reference to the dollar for amounts, or the eligibility flag for booleans. within-1% becomes the near-miss-tolerant companion.

This is defensible because the public leaderboard is household-impact-weighted, which down-weights zero-reference outputs. The weighted exact rate is therefore not compressed near the ~84% unweighted zero share: GPT-5.5 leads at 80.3% weighted exact versus an always-zero baseline of 66.8% (a ~13-point margin, comparable to within-1%), and exact discriminates about as well as within-1% (spread 62-80% vs 63-83%).

Site (app/src):
- ModelLeaderboard: default scoringMode "within1pct" -> "exact"; rewrite the comment to explain the deployability bar and the weighting argument.
- Methodology: "ranks by within-1%" -> "ranks by exact match"; report exact as headline with within-1% as companion; sensitivity-check prose references the public exact-match leaderboard.
- modelPage.ts: model headline + both leaderboard sorts -> exact.
- model/[id] page: headline ScorePill, metadata, and per-program table lead with exact; the within-1% column stays alongside.

The exact/within-1%/continuous toggle and all three columns are intact.

Paper (paper/index.qmd + policybench/paper_results.py):
- Reframe the abstract, the headline section ("Headline metric: exact match" / "Near-miss companion: within 1%"), the related-work paragraph, the leaderboard tables (sorted by exact, within-1% kept as a column), and the bootstrap CIs to make exact the headline with the weighted-vs-unweighted nuance and the always-zero baseline.
- paper_results.py: headline fields exact-based; within-1% accessors kept as the companion; add always-zero weighted-exact baseline accessors.
- Extend the baselines table to report exact / within-1% / bounded.

Re-render the manuscript and re-pin only the rendered-paper hashes in the snapshot manifest; the frozen run data is unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
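To make the two metrics and the always-zero baseline concrete, here is a minimal sketch. It is not the code in policybench/paper_results.py: the `Cell` type, its field names, and the rounding rule for "to the dollar" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Cell:
    """One scored output (hypothetical structure)."""
    prediction: float
    reference: float
    weight: float  # household-impact weight, assumed non-negative


def is_exact(cell: Cell) -> bool:
    # Assumption: "to the dollar" means equal after rounding to whole dollars.
    # Boolean outputs can be encoded as 0.0 / 1.0 and compared the same way.
    return round(cell.prediction) == round(cell.reference)


def is_within_1pct(cell: Cell) -> bool:
    # Assumption: with a zero reference, only a zero prediction counts as a hit.
    return abs(cell.prediction - cell.reference) <= 0.01 * abs(cell.reference)


def weighted_rate(cells: Iterable[Cell], hit: Callable[[Cell], bool]) -> float:
    """Share of total weight carried by cells that satisfy `hit`."""
    cells = list(cells)
    total = sum(c.weight for c in cells)
    if total == 0:
        raise ValueError("total weight must be positive")
    return sum(c.weight for c in cells if hit(c)) / total


def always_zero(cells: Iterable[Cell]) -> List[Cell]:
    """Baseline that predicts 0 for every cell, keeping references and weights."""
    return [Cell(0.0, c.reference, c.weight) for c in cells]
```

Under this sketch, `weighted_rate(always_zero(cells), is_exact)` equals the weighted share of zero-reference cells, which is why down-weighting those cells lowers the always-zero baseline relative to the unweighted zero share.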

Per-cell audit annotations, failure-source tags, and case notes are absent from the published data, so the scenario modal showed "Not yet reviewed." on every incorrect cell and the coverage card reported "0 rows include developer audit notes" — both contradicting the paper's hand-audit of every wrong cell. Drop the empty modal fallback and the audit-notes sentences, keeping explanation coverage. The aggregate audit result stays in the paper and blog. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

This reverts commit 0d1c5d9.

The dashboard-data-20260614 artifact predated the audit annotations and had no reference-computation narratives, so the explorer modal showed "Not yet reviewed." and "narrative not yet generated." Generate the 1,984 reference narratives (switch the generator to claude-haiku-4-5 — the PolicyEngine computation trace is deterministic, so Haiku only prosifies it; strip stray markdown headers), rebuild the payload with both the 3,300 audit annotations and the narratives, and republish as dashboard-data-20260616. The modal now shows the per-cell audit review and the reference computation for every case. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
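The "strip stray markdown headers" step could look roughly like the sketch below. The function name is hypothetical, and the choice to drop whole header lines (rather than only the `#` markers) is an assumption; the commit message does not say which was done.

```python
import re

# ATX-style header: up to three leading spaces, one to six '#', then
# whitespace or end of line. A token like "#hashtag" does not match.
_HEADER = re.compile(r"^ {0,3}#{1,6}(\s|$)")


def strip_markdown_headers(text: str) -> str:
    """Drop markdown header lines from a generated narrative (assumed behavior)."""
    kept = [line for line in text.splitlines() if not _HEADER.match(line)]
    return "\n".join(kept).strip()
```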

The homepage used a compact top bar (small brand + action button at the top); /paper and /model used alwaysExpanded, which inlined a 36px brand and pushed the action button ~30px down, so the upper-right buttons jumped position between pages. Render the same compact top bar in alwaysExpanded mode and move the large brand into the hero block below, so the top bar is identical everywhere. Homepage is unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The dashboard republish baked the 3,300 audit reviews + 1,984 reference narratives into the published artifact (dashboard-data-20260616), but the reproducibility contract requires the published payload to byte-match the export of the frozen source run. Update the frozen source data.json with the same annotations and repoint its hash plus the published_dashboard_artifact pin (sha 497c6c34, 37 MB). Snapshot suite 16/16, full suite 385 passed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
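The byte-match requirement can be illustrated with a short check. This is a sketch only: the function names are hypothetical, and SHA-256 is an assumption, since the commit message gives a short sha (497c6c34) without naming the algorithm.

```python
import hashlib


def sha256_hex(payload: bytes) -> str:
    """Hex digest of the payload (SHA-256 is assumed, not confirmed by the PR)."""
    return hashlib.sha256(payload).hexdigest()


def verify_artifact(published: bytes, exported: bytes, pinned_sha: str) -> bool:
    """True when the published payload byte-matches the export of the frozen
    source run and its digest starts with the pinned (possibly short) hash."""
    if published != exported:
        return False
    return sha256_hex(published).startswith(pinned_sha.lower())
```

Comparing the raw bytes and the pinned digest separately separates two failure modes: the published artifact drifting from the frozen source, and the manifest pin going stale after a legitimate republish.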

Drop the pre-release banner for the public launch

8529238


vercel Bot deployed to Preview June 15, 2026 21:17 View deployment

MaxGhenis changed the title ~~Drop the pre-release banner for the public launch~~ Pre-launch site fixes: drop banner + refresh /paper page Jun 15, 2026

vercel Bot deployed to Preview June 15, 2026 21:23 View deployment

MaxGhenis changed the title ~~Pre-launch site fixes: drop banner + refresh /paper page~~ Pre-launch fixes: drop banner, refresh /paper, remove Enhanced CPS mentions Jun 15, 2026

vercel Bot deployed to Preview June 15, 2026 23:37 View deployment

vercel Bot deployed to Preview June 16, 2026 01:47 View deployment

vercel Bot deployed to Preview June 16, 2026 01:49 View deployment

MaxGhenis force-pushed the drop-prerelease-banner branch from a595076 to 5609cbc Compare June 16, 2026 01:52

vercel Bot deployed to Preview June 16, 2026 01:54 View deployment

Align docs with June US-only launch

4002407

vercel Bot deployed to Preview June 16, 2026 09:28 View deployment

vercel Bot deployed to Preview June 16, 2026 12:12 View deployment

vercel Bot deployed to Preview June 16, 2026 12:47 View deployment

MaxGhenis and others added 3 commits June 16, 2026 08:49

Revert "Remove empty audit-review UI from the explorer"

b178ac9


vercel Bot deployed to Preview June 16, 2026 14:36 View deployment

vercel Bot deployed to Preview June 16, 2026 20:41 View deployment

MaxGhenis merged commit f0ad009 into main Jun 16, 2026
6 checks passed

MaxGhenis merged 12 commits into main from drop-prerelease-banner
