Add cost and latency to the model leaderboard by MaxGhenis · Pull Request #81 · PolicyEngine/policybench

MaxGhenis · 2026-06-25T23:56:18Z

What

Adds Cost / household and Latency columns to the model leaderboard.

Pipeline

analysis.model_cost_latency(predictions, price_overrides) computes per model:
- costUsd (run total) and costPerHousehold (÷ distinct households), reusing usage_summary_by_model so totals match the existing usage CSV.
- latencySeconds — the median of each household's summed request-time. Median rather than mean/sum because the per-call timer wraps litellm's retry/backoff, so a few rate-limited calls inflate the mean; the median reflects a typical household.
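As a rough illustration of the two bullets above, here is a minimal sketch of the aggregation for a single model. It is not the code in analysis.model_cost_latency: the function name cost_and_latency, the (household_id, seconds) row shape, and the sample numbers are assumptions made up for this example.

```python
from collections import defaultdict
from statistics import median


def cost_and_latency(rows, total_cost_usd):
    """Summarise one model's run.

    rows: iterable of (household_id, request_seconds), one entry per call.
    total_cost_usd: run-total cost for the model (assumed precomputed).
    """
    # Sum request time per household first, then take the median across
    # households so a few retry-inflated calls do not dominate.
    per_household = defaultdict(float)
    for household_id, seconds in rows:
        per_household[household_id] += seconds

    n_households = len(per_household)
    if n_households == 0:
        return {"costUsd": total_cost_usd, "costPerHousehold": None, "latencySeconds": None}

    return {
        "costUsd": total_cost_usd,
        "costPerHousehold": total_cost_usd / n_households,
        "latencySeconds": median(per_household.values()),
    }


# Hypothetical data: hh3 includes one call that sat in retry/backoff for 600 s.
rows = [("hh1", 20.0), ("hh1", 25.0), ("hh2", 50.0), ("hh3", 40.0), ("hh3", 600.0)]
stats = cost_and_latency(rows, total_cost_usd=0.36)
mean_latency = sum(seconds for _, seconds in rows) / 3
```

In this made-up sample the per-household totals are 45 s, 50 s and 640 s, so the median stays at 50 s while the mean is pulled up to 245 s by the single slow call.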
config.PRICE_OVERRIDES_PER_1M fills cost for provider preview models litellm's price map doesn't cover yet — currently grok-build-0.1 at $1/$2 per 1M (https://x.ai/api), which otherwise showed blank cost.
Wired into build_dashboard_payload, so every modelStats entry carries them.
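The price override described above can be sketched roughly as follows. The dictionary layout, the reading of "$1/$2" as (input, output) USD per 1M tokens, and the helper name override_cost_usd are all assumptions for illustration; the real config.PRICE_OVERRIDES_PER_1M may be structured differently.

```python
# Assumed layout: model name -> (input USD per 1M tokens, output USD per 1M tokens).
PRICE_OVERRIDES_PER_1M = {"grok-build-0.1": (1.0, 2.0)}


def override_cost_usd(model, input_tokens, output_tokens, overrides=PRICE_OVERRIDES_PER_1M):
    """Return cost in USD from an override entry, or None if the model has none.

    Returning None lets a caller fall back to whatever the regular price
    lookup produced instead of silently reporting zero.
    """
    prices = overrides.get(model)
    if prices is None:
        return None
    input_price, output_price = prices
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical token counts, not taken from the June run.
example_cost = override_cost_usd("grok-build-0.1", 2_000_000, 500_000)
missing = override_cost_usd("some-other-model", 1_000, 1_000)
```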

UI

ModelStat gains costUsd / costPerHousehold / latencySeconds / totalTokens (all optional).
ModelLeaderboard adds the two columns (desktop 12-col grid + a mobile line) with tooltips. Cost is shown to 2 significant figures (range ~$0.002–$0.29); latency as 66s / 1.1m.

Verification

Regenerated the June run via export_country: zero change to any accuracy number (max drift 0.00000) — only the new fields are added. All 13 models populate; grok-build-0.1 ≈ $4.49.
5 new unit tests (tests/test_cost_latency.py); test_analysis.py still green (79 passed).
eslint --max-warnings=0 clean; rendered locally — GPT-5.5 $0.12 / 1.1m, Gemini 3.1 Pro $0.085 / 45s, Opus 4.7 $0.29 / 53s.
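One way to run the "zero change to any accuracy number" check is to diff the metrics shared by the old and new payloads. The sketch below assumes a simple model -> {metric: value} mapping; the function name max_abs_drift, the exactMatch key, and the values are hypothetical and do not come from the actual run.

```python
def max_abs_drift(before, after):
    """Largest absolute change across metrics present in both payloads.

    Fields that exist only in `after` (such as newly added cost or latency
    fields) are ignored, so purely additive changes report a drift of 0.0.
    """
    drifts = [
        abs(after[model][metric] - value)
        for model, metrics in before.items()
        for metric, value in metrics.items()
        if model in after and metric in after[model]
    ]
    return max(drifts, default=0.0)


# Hypothetical payloads: `after` adds costUsd but leaves accuracy untouched.
before = {"model-a": {"exactMatch": 0.61}, "model-b": {"exactMatch": 0.48}}
after = {
    "model-a": {"exactMatch": 0.61, "costUsd": 1.5},
    "model-b": {"exactMatch": 0.48, "costUsd": 0.2},
}
drift = max_abs_drift(before, after)
```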

Data

The columns render empty against the current published artifact (it predates these fields). They populate on the next export / publish-dashboard — the pipeline change makes future regenerations include them automatically.

🤖 Generated with Claude Code

analysis.model_cost_latency() joins per-model cost (USD total and per-household) and median per-household latency into the modelStats that build_dashboard_payload emits. Cost reuses usage_summary_by_model; latency is the median of each household's summed request-time, which is robust to the occasional rate-limit retry that inflates the mean.

config.PRICE_OVERRIDES_PER_1M fills cost for provider preview models litellm cannot yet price (grok-build-0.1 at $1/$2 per 1M, https://x.ai/api).

The leaderboard renders new "Cost / hh" and "Latency" columns (desktop grid + mobile line).

Verified by regenerating the June run: the new fields populate for all 13 models (grok-build-0.1 at ~$4.49) with zero change to any accuracy number.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

vercel · 2026-06-25T23:56:26Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
policybench-site	Ready	Preview, Comment	Jun 26, 2026 12:19am

- Leaderboard: per-household cost now formats to a consistent 3 decimals (was 2 significant figures, which varied 2-4 decimals and rounded the cheapest models to $0.00).
- Paper: new "Cost and latency" Results section with a per-model table (cost/household + median latency next to exact match) and a templated summary. Computed from the frozen run's predictions.csv.gz via the same model_cost_latency helper, so no snapshot regeneration or re-freeze.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
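The fixed-decimal rule for per-household cost can be expressed in a line or two. This is a Python sketch of the formatting rule only, not the leaderboard component's own code, and format_cost_per_household is a hypothetical name.

```python
def format_cost_per_household(cost_usd):
    """Format a per-household cost with exactly three decimals."""
    return f"${cost_usd:.3f}"


# Values taken from the range quoted in the PR description.
labels = [format_cost_per_household(c) for c in (0.002, 0.085, 0.12, 0.29)]
```

With three fixed decimals every value in the quoted range keeps the same width, and the cheapest models remain distinguishable from zero.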

vercel Bot deployed to Preview June 25, 2026 23:56 View deployment

vercel Bot deployed to Preview June 26, 2026 00:15 View deployment

Show latency in seconds throughout (leaderboard + paper)

e690ed9

Drop the switch to minutes above 60s; every model's median latency now reads in seconds (e.g. 135s, not 2.2 min) for a single consistent unit.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
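A seconds-only latency label might look like the sketch below. The name format_latency and the choice to round to the nearest whole second are assumptions for illustration; the commit does not state the rounding used.

```python
def format_latency(seconds):
    """Render latency in whole seconds, with no switch to minutes above 60 s."""
    return f"{round(seconds)}s"


latency_labels = [format_latency(s) for s in (45.2, 66.0, 135.4)]
```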

vercel Bot deployed to Preview June 26, 2026 00:19 View deployment

MaxGhenis merged commit fba3cfc into main Jun 26, 2026
6 checks passed

MaxGhenis deleted the feat/leaderboard-cost-latency branch June 26, 2026 00:26

MaxGhenis merged 3 commits into main from feat/leaderboard-cost-latency
