Pre-launch fixes: drop banner, refresh /paper, remove Enhanced CPS mentions #76

Merged: MaxGhenis merged 12 commits into main from drop-prerelease-banner on Jun 16, 2026

MaxGhenis (Contributor) commented Jun 15, 2026

⚠️ Merge at launch time only

Launch-readiness fixes for policybench.org. Merging deploys to production, so hold until launch goes live (first action in the runbook).

Changes

  1. Drop the pre-release banner (Hero.tsx).
  2. Refresh the /paper page (app/src/app/paper/page.tsx): replace the stale May date and the per-country (UK) framing in copy and share metadata with the 2026-06-14 US populace snapshot; bump the iframe cache-buster.
  3. Remove Enhanced CPS mentions — the paper abstract called populace the successor to the Enhanced CPS, and a dormant UK methodology string named Enhanced CPS records. Both dropped; manuscript re-rendered (PDF + web) and hashes re-pinned.

Remaining enhanced_cps references in the repo are internal code identifiers and the real UK .h5 dataset filename (plumbing, not user-facing); part of the separate us-data/uk-data removal, not this PR.

Verification

  • bun run lint + bun run build clean; manuscript snapshot tests pass; "Enhanced CPS" absent from the rendered paper (HTML + PDF) and all app source.
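
A check like the last one can be scripted. The sketch below is an illustration only, not the command used in this PR: `find_phrase` and the suffix list are hypothetical names, and it reads text files only, so a PDF would need text extraction first. Because it searches for the phrase with a space, internal identifiers such as `enhanced_cps` do not match.

```python
from pathlib import Path
from typing import Iterable, List


def find_phrase(
    root: Path,
    phrase: str,
    suffixes: Iterable[str] = (".ts", ".tsx", ".html", ".qmd", ".py"),
) -> List[Path]:
    """Return text files under `root` that contain `phrase` (case-insensitive)."""
    wanted = set(suffixes)
    needle = phrase.lower()
    hits = []
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix in wanted:
            # errors="ignore" keeps the scan going on files with odd encodings.
            text = path.read_text(encoding="utf-8", errors="ignore")
            if needle in text.lower():
                hits.append(path)
    return hits
```

An empty result from `find_phrase(Path("app/src"), "Enhanced CPS")` would correspond to the "absent from all app source" claim, assuming that directory layout.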

🤖 Generated with Claude Code

The provisional/"we plan to rerun" banner is removed so the site matches
the launched June populace results.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

vercel (bot) commented Jun 15, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| policybench-site | Ready | Preview, Comment | Jun 16, 2026 8:41pm |

The paper page wrapper still showed the May snapshot date, the May 13-20
response window, and a per-country (UK) framing in its copy and share
metadata. Update the date label, description, body copy, and iframe
cache-buster to the June 2026-06-14 US populace snapshot.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

MaxGhenis changed the title from "Drop the pre-release banner for the public launch" to "Pre-launch site fixes: drop banner + refresh /paper page" on Jun 15, 2026

The paper abstract described populace as the successor to the Enhanced
CPS, and a dormant UK methodology string named Enhanced CPS records.
Drop both; re-render the manuscript and re-pin hashes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

MaxGhenis changed the title from "Pre-launch site fixes: drop banner + refresh /paper page" to "Pre-launch fixes: drop banner, refresh /paper, remove Enhanced CPS mentions" on Jun 15, 2026

The "Scoring and weighting" card had an unconditional sentence naming UK
enhanced FRS weights and the UK transfer scenarios, which rendered on the
US-only site. Keep only the US populace weighting description.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The benchmark-scope footer ("/ UK fiscal year 2026-27") and the
Household sensitivity-view description both named UK weighting/scope
unconditionally, so they rendered on the US-only site. Keep US only.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Switch PolicyBench's headline/default metric from the within-1% hit rate
to exact match — the deployability bar (a prediction counts only if it
matches the PolicyEngine reference to the dollar for amounts, or the
eligibility flag for booleans). within-1% becomes the near-miss-tolerant
companion.
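
The two per-cell criteria described above could be written roughly as follows. This is a minimal sketch, not PolicyBench's scoring code: the function names are hypothetical, "to the dollar" is assumed to mean equality after rounding to whole dollars, and a zero reference is assumed to admit no relative tolerance.

```python
from typing import Union

Value = Union[bool, int, float]


def is_exact(prediction: Value, reference: Value) -> bool:
    """Exact match: identical flag for booleans, same whole-dollar amount otherwise."""
    # bool is a subclass of int in Python, so handle flags before amounts.
    if isinstance(reference, bool) or isinstance(prediction, bool):
        return type(prediction) is type(reference) and prediction == reference
    # Assumption: "to the dollar" means equal after rounding to whole dollars.
    return round(prediction) == round(reference)


def is_within_1pct(prediction: Value, reference: Value) -> bool:
    """Near-miss companion: amounts within 1% of the reference also count."""
    if isinstance(reference, bool) or isinstance(prediction, bool):
        return is_exact(prediction, reference)
    if reference == 0:
        # Assumption: with a zero reference there is no relative band to apply.
        return is_exact(prediction, reference)
    return abs(prediction - reference) <= 0.01 * abs(reference)
```

Under these assumptions a prediction of 1195 against a reference of 1200 fails the exact bar but passes within-1%, which is the kind of near miss the companion metric is meant to surface.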

This is defensible because the public leaderboard is household-impact-
weighted, which down-weights zero-reference outputs. The weighted exact
rate is therefore not compressed near the ~84% unweighted zero share:
GPT-5.5 leads at 80.3% weighted exact versus an always-zero baseline of
66.8% (a ~13-point margin, comparable to within-1%), and exact
discriminates about as well as within-1% (spread 62-80% vs 63-83%).
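
The weighting argument can be illustrated with toy numbers. In the sketch below the weights and references are invented for illustration and `weighted_rate` is a hypothetical helper; the only point is that down-weighting zero-reference cells pulls an always-zero baseline well below the unweighted zero share.

```python
from typing import Sequence


def weighted_rate(hits: Sequence[bool], weights: Sequence[float]) -> float:
    """Share of total weight that falls on cells counted as hits."""
    if len(hits) != len(weights):
        raise ValueError("hits and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w for hit, w in zip(hits, weights) if hit) / total


# Toy data (hypothetical): four zero-reference cells with small weights,
# one non-zero cell carrying most of the household impact.
references = [0, 0, 0, 0, 1200]
weights = [0.5, 0.5, 0.5, 0.5, 4.0]

# An always-zero baseline is exactly right only where the reference is zero.
always_zero_hits = [ref == 0 for ref in references]

unweighted = weighted_rate(always_zero_hits, [1.0] * len(references))
weighted = weighted_rate(always_zero_hits, weights)
```

Here the unweighted baseline is 4/5 while the weighted one is 2/6, mirroring in direction (not magnitude) the gap between the ~84% unweighted zero share and the 66.8% weighted always-zero baseline reported above.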

Site (app/src):
- ModelLeaderboard: default scoringMode "within1pct" -> "exact"; rewrite
  the comment to explain the deployability bar and the weighting argument.
- Methodology: "ranks by within-1%" -> "ranks by exact match"; report
  exact as headline with within-1% as companion; sensitivity-check prose
  references the public exact-match leaderboard.
- modelPage.ts: model headline + both leaderboard sorts -> exact.
- model/[id] page: headline ScorePill, metadata, and per-program table
  lead with exact; the within-1% column stays alongside.
The exact/within-1%/continuous toggle and all three columns are intact.

Paper (paper/index.qmd + policybench/paper_results.py):
- Reframe the abstract, the headline section ("Headline metric: exact
  match" / "Near-miss companion: within 1%"), the related-work paragraph,
  the leaderboard tables (sorted by exact, within-1% kept as a column),
  and the bootstrap CIs to make exact the headline with the
  weighted-vs-unweighted nuance and the always-zero baseline.
- paper_results.py: headline fields exact-based; within-1% accessors kept
  as the companion; add always-zero weighted-exact baseline accessors.
- Extend the baselines table to report exact / within-1% / bounded.

Re-render the manuscript and re-pin only the rendered-paper hashes in the
snapshot manifest; the frozen run data is unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Per-cell audit annotations, failure-source tags, and case notes are
absent from the published data, so the scenario modal showed "Not yet
reviewed." on every incorrect cell and the coverage card reported
"0 rows include developer audit notes" — both contradicting the
paper's hand-audit of every wrong cell. Drop the empty modal fallback
and the audit-notes sentences, keeping explanation coverage. The
aggregate audit result stays in the paper and blog.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
MaxGhenis and others added 3 commits June 16, 2026 08:49
The dashboard-data-20260614 artifact predated the audit annotations and
had no reference-computation narratives, so the explorer modal showed
"Not yet reviewed." and "narrative not yet generated." Generate the
1,984 reference narratives (switch the generator to claude-haiku-4-5 —
the PolicyEngine computation trace is deterministic, so Haiku only
prosifies it; strip stray markdown headers), rebuild the payload with
both the 3,300 audit annotations and the narratives, and republish as
dashboard-data-20260616. The modal now shows the per-cell audit review
and the reference computation for every case.
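
The "strip stray markdown headers" step might look like the sketch below. It is an assumption-laden illustration, not the generator's code: `strip_markdown_headers` is a hypothetical name, and it assumes the intent is to drop whole ATX-style header lines rather than keep their text.

```python
import re

# Matches a full ATX header line: up to three leading spaces, one to six '#',
# at least one space or tab, then some text. A bare "#1" is not a header.
HEADER_LINE = re.compile(r"^[ ]{0,3}#{1,6}[ \t]+\S.*\n?", re.MULTILINE)


def strip_markdown_headers(text: str) -> str:
    """Remove markdown header lines and trim surrounding whitespace."""
    return HEADER_LINE.sub("", text).strip()
```

For example, `strip_markdown_headers("## Reference computation\nThe credit is $1,200.\n")` leaves only the sentence.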

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The homepage used a compact top bar (small brand + action button at the
top); /paper and /model used alwaysExpanded, which inlined a 36px brand
and pushed the action button ~30px down, so the upper-right buttons
jumped position between pages. Render the same compact top bar in
alwaysExpanded mode and move the large brand into the hero block below,
so the top bar is identical everywhere. Homepage is unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The dashboard republish baked the 3,300 audit reviews + 1,984 reference
narratives into the published artifact (dashboard-data-20260616), but the
reproducibility contract requires the published payload to byte-match the
export of the frozen source run. Update the frozen source data.json with
the same annotations and repoint its hash plus the
published_dashboard_artifact pin (sha 497c6c34, 37 MB).
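
A byte-match contract of this kind can be checked along the following lines. This is a sketch under stated assumptions: the helper names are hypothetical, SHA-256 is assumed as the hash (the commit only says "sha"), and the pin is treated as an abbreviated hex prefix like the one quoted above.

```python
import hashlib


def sha256_hex(payload: bytes) -> str:
    """Hex digest of the payload (assumption: the pin is a SHA-256 digest)."""
    return hashlib.sha256(payload).hexdigest()


def matches_pin(payload: bytes, pinned_prefix: str) -> bool:
    """True if the payload's digest starts with the pinned (possibly short) hash."""
    if not pinned_prefix:
        raise ValueError("pin must not be empty")
    return sha256_hex(payload).startswith(pinned_prefix.lower())


def check_contract(published: bytes, frozen_export: bytes, pinned_prefix: str) -> bool:
    """Published payload must byte-match the frozen export and agree with the pin."""
    return published == frozen_export and matches_pin(published, pinned_prefix)
```

Comparing bytes as well as hashes is redundant when the pin is a full digest, but with a short prefix the direct comparison is the stronger of the two checks.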

Snapshot suite 16/16, full suite 385 passed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

MaxGhenis merged commit f0ad009 into main on Jun 16, 2026
6 checks passed