Pre-launch fixes: drop banner, refresh /paper, remove Enhanced CPS mentions #76
Merged
The provisional/"we plan to rerun" banner is removed so the site matches the launched June populace results.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The paper page wrapper still showed the May snapshot date, the May 13-20 response window, and a per-country (UK) framing in its copy and share metadata. Update the date label, description, body copy, and iframe cache-buster to the June 2026-06-14 US populace snapshot.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The paper abstract described populace as the successor to the Enhanced CPS, and a dormant UK methodology string named Enhanced CPS records. Drop both; re-render the manuscript and re-pin hashes.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The "Scoring and weighting" card had an unconditional sentence naming UK enhanced FRS weights and the UK transfer scenarios, which rendered on the US-only site. Keep only the US populace weighting description.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The benchmark-scope footer ("/ UK fiscal year 2026-27") and the
Household sensitivity-view description both named UK weighting/scope
unconditionally, so they rendered on the US-only site. Keep US only.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
a595076 to 5609cbc
Switch PolicyBench's headline/default metric from the within-1% hit rate
to exact match — the deployability bar (a prediction counts only if it
matches the PolicyEngine reference to the dollar for amounts, or the
eligibility flag for booleans). within-1% becomes the near-miss-tolerant
companion.
This is defensible because the public leaderboard is household-impact-
weighted, which down-weights zero-reference outputs. The weighted exact
rate is therefore not compressed near the ~84% unweighted zero share:
GPT-5.5 leads at 80.3% weighted exact versus an always-zero baseline of
66.8% (a ~13-point margin, comparable to within-1%), and exact
discriminates about as well as within-1% (spread 62-80% vs 63-83%).
Site (app/src):
- ModelLeaderboard: default scoringMode "within1pct" -> "exact"; rewrite
the comment to explain the deployability bar and the weighting argument.
- Methodology: "ranks by within-1%" -> "ranks by exact match"; report
exact as headline with within-1% as companion; sensitivity-check prose
references the public exact-match leaderboard.
- modelPage.ts: model headline + both leaderboard sorts -> exact.
- model/[id] page: headline ScorePill, metadata, and per-program table
lead with exact; the within-1% column stays alongside.
The exact/within-1%/continuous toggle and all three columns are intact.
Paper (paper/index.qmd + policybench/paper_results.py):
- Reframe the abstract, the headline section ("Headline metric: exact
match" / "Near-miss companion: within 1%"), the related-work paragraph,
the leaderboard tables (sorted by exact, within-1% kept as a column),
and the bootstrap CIs to make exact the headline with the
weighted-vs-unweighted nuance and the always-zero baseline.
- paper_results.py: headline fields exact-based; within-1% accessors kept
as the companion; add always-zero weighted-exact baseline accessors.
- Extend the baselines table to report exact / within-1% / bounded.
Re-render the manuscript and re-pin only the rendered-paper hashes in the
snapshot manifest; the frozen run data is unchanged.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
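The weighting argument in the commit message above can be illustrated with a minimal sketch. It is not the code in policybench/paper_results.py: the `Cell` type, the function names, the whole-dollar rounding rule, and the treatment of zero references under within-1% are all assumptions made for illustration, and the weights are supplied directly rather than derived from household impact.

```python
from dataclasses import dataclass, replace
from typing import Callable, Sequence


@dataclass(frozen=True)
class Cell:
    """One scored output (hypothetical structure, not the project's schema)."""

    prediction: float
    reference: float
    weight: float  # assumed household-impact weight, supplied by the caller


def is_exact(cell: Cell) -> bool:
    # Assumption: "to the dollar" means equal after rounding to whole dollars.
    # Boolean eligibility flags can be encoded as 0.0 / 1.0.
    return round(cell.prediction) == round(cell.reference)


def is_within_1pct(cell: Cell) -> bool:
    # Assumption: a zero reference only counts as a hit when the
    # prediction is also zero, since 1% of zero is zero.
    return abs(cell.prediction - cell.reference) <= 0.01 * abs(cell.reference)


def weighted_rate(cells: Sequence[Cell], hit: Callable[[Cell], bool]) -> float:
    """Share of total weight carried by cells that satisfy `hit`."""
    total = sum(cell.weight for cell in cells)
    if total <= 0:
        raise ValueError("total weight must be positive")
    return sum(cell.weight for cell in cells if hit(cell)) / total


def always_zero(cells: Sequence[Cell]) -> list[Cell]:
    """Baseline that predicts zero for every cell."""
    return [replace(cell, prediction=0.0) for cell in cells]
```

With zero-reference cells carrying small weights, `weighted_rate(always_zero(cells), is_exact)` falls below the unweighted share of zero references, which is the effect the commit message relies on when it compares the 66.8% weighted baseline with the ~84% unweighted zero share.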
Per-cell audit annotations, failure-source tags, and case notes are absent from the published data, so the scenario modal showed "Not yet reviewed." on every incorrect cell and the coverage card reported "0 rows include developer audit notes" (both contradicting the paper's hand-audit of every wrong cell). Drop the empty modal fallback and the audit-notes sentences, keeping explanation coverage. The aggregate audit result stays in the paper and blog.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This reverts commit 0d1c5d9.
The dashboard-data-20260614 artifact predated the audit annotations and had no reference-computation narratives, so the explorer modal showed "Not yet reviewed." and "narrative not yet generated." Generate the 1,984 reference narratives, switching the generator to claude-haiku-4-5 (the PolicyEngine computation trace is deterministic, so Haiku only turns it into prose) and stripping stray markdown headers. Rebuild the payload with both the 3,300 audit annotations and the narratives, and republish it as dashboard-data-20260616. The modal now shows the per-cell audit review and the reference computation for every case.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
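The header-stripping step mentioned in this commit could look roughly like the sketch below. The function name is hypothetical, and it assumes that "strip" means dropping whole ATX-style header lines (one to six `#` characters followed by whitespace) rather than keeping the header text; the actual generator may do something different.

```python
import re

# Matches a full ATX header line: optional indent, 1-6 '#', whitespace,
# the header text, and the trailing newline if present.
_HEADER_LINE = re.compile(r"^[ \t]{0,3}#{1,6}[ \t]+.*\n?", re.MULTILINE)


def strip_markdown_headers(narrative: str) -> str:
    """Remove stray markdown header lines from a generated narrative."""
    return _HEADER_LINE.sub("", narrative).strip()
```

Lines such as `#hashtag` (no space after the marker) are left untouched, so ordinary prose that happens to start with `#` is not removed.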
The homepage used a compact top bar (small brand + action button at the top); /paper and /model used alwaysExpanded, which inlined a 36px brand and pushed the action button ~30px down, so the upper-right buttons jumped position between pages. Render the same compact top bar in alwaysExpanded mode and move the large brand into the hero block below, so the top bar is identical everywhere. Homepage is unchanged.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The dashboard republish baked the 3,300 audit reviews + 1,984 reference narratives into the published artifact (dashboard-data-20260616), but the reproducibility contract requires the published payload to byte-match the export of the frozen source run. Update the frozen source data.json with the same annotations and repoint its hash plus the published_dashboard_artifact pin (sha 497c6c34, 37 MB). Snapshot suite 16/16, full suite 385 passed.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
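A byte-match check of the kind this reproducibility contract implies can be sketched as follows. The function names are hypothetical, and the use of SHA-256 with a short hex-prefix pin is an assumption: the commit message only gives an abbreviated hash and does not name the algorithm or the manifest format.

```python
import hashlib
from pathlib import Path


def sha256_hex(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a large payload is never fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_pin(path: Path, pinned: str) -> bool:
    """True if the file's digest starts with the pinned hex prefix."""
    # Assumption: pins may be abbreviated; an empty pin never matches.
    return bool(pinned) and sha256_hex(path).startswith(pinned.lower())


def files_byte_match(published: Path, exported: Path) -> bool:
    """True if two files have identical contents (compared by digest)."""
    return sha256_hex(published) == sha256_hex(exported)
```

A snapshot test could call `files_byte_match` on the published payload and a fresh export of the frozen run, then `matches_pin` against the manifest entry, failing if either returns `False`.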
Launch-readiness fixes for policybench.org. Merging deploys to production, so hold until launch goes live (first action in the runbook).
Changes
- [...] Hero.tsx).
- /paper page (app/src/app/paper/page.tsx): stale May date + per-country (UK) framing in copy and share metadata → June 2026-06-14 US populace snapshot; iframe cache-buster bumped.
Remaining
enhanced_cps references in the repo are internal code identifiers and the real UK .h5 dataset filename (plumbing, not user-facing); part of the separate us-data/uk-data removal, not this PR.
Verification
bun run lint + bun run build clean; manuscript snapshot tests pass; "Enhanced CPS" absent from the rendered paper (HTML + PDF) and all app source.
🤖 Generated with Claude Code