Reconcile paper with expanded experiment design; improve narrative by juaristi22 · Pull Request #9 · PolicyEngine/l0-paper

juaristi22 · 2026-06-22T20:31:50Z

Summary

Reconciles the manuscript with the expanded experiment design (PR #8) now that the methodology and experimental design are implemented and stable, and refocuses the narrative on what the evidence can actually support. Result numbers are intentionally left as placeholders pending the expanded-5seed rerun; the figures, tables, and section structure are final, so the rerun just fills slots.

Closes #2.
Closes #3.

Narrative, method comparison, and claiming (#2)

Three-way matched-budget benchmark. The comparison is now informed $L_0$ / Hard-Concrete vs. two baselines: dense Populace calibration then weighted sampling, and uniform random subset then reweighting. The survey-weight baseline is rewritten as the unbiased Hansen–Hurwitz PPS-with-replacement integerisation. Combinatorial optimisation is dropped from the active comparison; GREG, IPF, and combinatorial remain in the background as context and no longer read as active experimental baselines.
Loss reconciled with the solver. The methodology, appendix table, and algorithm now carry the soft l2_lambda concentration term the solver implements. Concentration is framed as two controls on the weight-inflation ratio $w_i/w_{0,i}$: soft l2_lambda (the swept operability knob, informed-$L_0$ only) and hard max_weight_ratio (per-record cap; headline runs uncapped, reported as an output diagnostic).
Leak-free validation design. Fixed family-level holdout (cms_medicaid + usda_snap + state_income_tax, plus cbo as validation-only, scored out-of-sample but never fit), with the rationale for whole-family rather than random splits. The family-rotation panel is reported per fold as a descriptive robustness check, not as iid replicates.
Metrics. Median-led accuracy with the mean flagged as tail-sensitive; ESS / design effect reported as a primary result; micro and family-macro averages; and an identifiability-floor degenerate-target detector with a named targeted-removal sensitivity (no winsorization).
Honest framing. Abstract, introduction, and discussion reframed around graceful degradation under aggressive compression plus a measured size/geography/concentration operability frontier, rather than a blanket accuracy win. The abstract's empirical headline is held until the final numbers are in.
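The ESS / design effect diagnostic named under Metrics can be illustrated with the standard Kish formulas. This is a minimal sketch, not the paper's code: the function names are hypothetical, and it assumes the manuscript uses the usual Kish definitions, which the PR flags as still needing a verified citation.

```python
def kish_ess(weights):
    """Kish effective sample size: (sum of w)^2 / (sum of w^2)."""
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return total * total / total_sq


def design_effect(weights):
    """Design effect from unequal weighting: record count divided by ESS."""
    return len(weights) / kish_ess(weights)


def max_inflation_ratio(weights, base_weights):
    """Largest w_i / w_{0,i}, the quantity a hard max_weight_ratio cap would bound."""
    return max(w / w0 for w, w0 in zip(weights, base_weights))


equal = [100.0, 100.0, 100.0, 100.0]
skewed = [370.0, 10.0, 10.0, 10.0]  # same total mass, concentrated on one record

print(kish_ess(equal), design_effect(equal))
print(kish_ess(skewed), design_effect(skewed))
print(max_inflation_ratio(skewed, equal))
```

Equal weights give an ESS equal to the record count. Concentrating the same total mass on one record lowers the ESS and raises the design effect, which is the behaviour the concentration controls are meant to limit.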

Populace-centered manuscript (#3)

The manuscript presents Populace as the active implementation surface throughout the sections this PR touches: the introduction and background describe the method as one calibration option inside the modular Populace pipeline, and the methodology defines the experiment against the Populace calibration solver. The Microplex→Populace rewrite of the pipeline section (data.tex) already landed on main; the remaining item for #3 is the pipeline overview diagram, which is tracked in PR #5 (fig:pipeline is still a placeholder here).

Style

Applied the humanizer and PolicyEngine writing skills: reduced em-dash overuse, removed AI-vocabulary, and preferred active voice and simple copulas.

Intentionally not in this PR (placeholders to fill)

All result numbers and the F1 / F6 / convergence figures → from the expanded-5seed rerun.
The pipeline overview diagram (fig:pipeline) → PR #5 (Add Populace pipeline overview brief and figure).
The supporting survey-statistics citations (Kish design effect/ESS, Potter 1990, Hainmueller 2012, Voas & Williamson 2001, Hyndman & Koehler 2006) → tracked follow-up; they need web verification before they go in.
L2_CHOSEN, the exact out-of-sample / fit / total target counts, and the bisection accuracy → \tbc until the rerun.

The paper compiles cleanly (tectonic): no undefined references or citations.

🤖 Generated with Claude Code

…ceholders

Aligns the methodology, experimental design, and narrative framing with the expanded round (PR #8) now that those are implemented and stable. Result numbers are left as placeholders pending the expanded-5seed rerun.

Methodology / appendix / algorithm:
- Restore the L2 soft-concentration term in the loss (the solver now implements l2_lambda): three terms = capped mean absolute relative calibration loss + lambda_L0 expected open gates + lambda_L2 mean((w_i/w0_i)^2).
- Frame two concentration controls on the inflation ratio w_i/w0_i: soft l2_lambda (the swept operability knob, informed-L0 only) and hard max_weight_ratio (per-record cap; headline runs uncapped).
- 3-way benchmark: drop the combinatorial baseline from the active comparison (reviewed in background, not benchmarked); rewrite the survey-weight baseline as the unbiased Hansen-Hurwitz PPS-with-replacement integerisation.
- Widened leak-free family holdout (cms_medicaid + usda_snap + state_income_tax + cbo), with the rationale for whole-family (not random) splits; 5 seeds; 5-fold family rotation reported per fold.
- Median-led metrics; ESS / design effect reported as a primary result; identifiability-floor degenerate detector + targeted-removal sensitivity (no winsorization). Fix appendix hyperparameters (temperature 0.25, add lambda_L2 and target-loss-cap rows).

Narrative (abstract / introduction / discussion / background):
- Reframe to graceful degradation under aggressive compression plus a measured size/geography/concentration operability frontier, not a blanket accuracy win.
- Defer the abstract's empirical headline until the final numbers are in.

Results + tables + frontier figure:
- Reset numbers to placeholders but upgrade the structure to the new design (ESS column, Hansen-Hurwitz row, median-led, F6 operability slot); drop the combinatorial row; holdout counts to be filled from the rerun.

Style pass (humanizer + PolicyEngine writing skill): reduce em-dash overuse, remove AI-vocabulary, prefer active voice and simple copulas.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
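The three-term loss named in this commit message can be written out as a small sketch. It is illustrative only and not the Populace solver: the function names are hypothetical, the error scale max(|t|, 1) and the cap of 10 are taken from later commit messages on this page, the Hard Concrete stretch limits (gamma = -0.1, zeta = 1.1) are the values commonly used in the literature and are assumed here, and summing the open-gate probabilities rather than averaging them is also an assumption.

```python
import math


def weighted_totals(weights, features):
    """Estimate each target as the weighted column sum over records."""
    n_targets = len(features[0])
    return [sum(w * row[j] for w, row in zip(weights, features))
            for j in range(n_targets)]


def capped_relative_loss(estimates, targets, cap=10.0):
    """Mean absolute relative error, with each target's error capped at `cap`."""
    errors = [min(abs(e - t) / max(abs(t), 1.0), cap)
              for e, t in zip(estimates, targets)]
    return sum(errors) / len(errors)


def expected_open_gates(log_alpha, temperature=0.25, gamma=-0.1, zeta=1.1):
    """Sum over records of P(gate != 0) for a Hard Concrete gate (assumed form)."""
    shift = temperature * math.log(-gamma / zeta)
    return sum(1.0 / (1.0 + math.exp(-(a - shift))) for a in log_alpha)


def concentration(weights, base_weights):
    """Mean squared inflation ratio, mean((w_i / w0_i)^2)."""
    ratios = [(w / w0) ** 2 for w, w0 in zip(weights, base_weights)]
    return sum(ratios) / len(ratios)


def total_loss(weights, base_weights, features, targets, log_alpha,
               lam_l0, lam_l2):
    """Calibration term + lambda_L0 * open gates + lambda_L2 * concentration."""
    calibration = capped_relative_loss(weighted_totals(weights, features), targets)
    return (calibration
            + lam_l0 * expected_open_gates(log_alpha)
            + lam_l2 * concentration(weights, base_weights))


# Toy data: two records, two targets that the base weights hit exactly.
features = [[1.0, 0.0], [0.0, 2.0]]
base = [10.0, 5.0]
targets = weighted_totals(base, features)
print(total_loss(base, base, features, targets, [0.0, 0.0],
                 lam_l0=1e-5, lam_l2=1e-4))
```

With the base weights the calibration term is zero and the concentration term is exactly 1, so the printed value is just the two penalties. Raising lam_l0 pushes gates closed (fewer records); raising lam_l2 penalises weights that drift far above their starting values.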

Generated by experiments/figures.py (expanded-3seed run): frontier, usability, generalization gap, by-held-out-family, cost-accuracy, and the lambda_L2 operability trade. PolicyEngine design system styling. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Brings the paper-results draft to a near-complete state on the expanded-3seed run. Empirical numbers, the discussion, the lambda_L2 headline value, and the sample-size presets are deliberately left as TBC.

Figures and layout:
- Wire the six expanded-3seed figures into the text at their references: F1 frontier and F3/F4 generalization in results, F2/F6 size-and-concentration controls in results, F5 cost-accuracy in a new appendix section; drop the convergence placeholder.
- Switch all floats from [ht] to [H] (float package) so tables, figures, and the algorithm render where referenced instead of drifting to the section end.
- Fix table margins: widen the appendix source-table family column and wrap the hyperparameter Role column so no table overflows the text width.

Methodology:
- Restructure as a shared calibrator (capped weighted MAPE, Adam on log-weights, mass conservation; attributed to Populace's calibrate machine) plus the L0 gate and L2 concentration terms that informed L0 adds; fix the loss scale to max(|t|, 1).
- Delete the combinatorial-optimisation paragraph; correct the seed count to 3.
- Replace the evaluation-metrics bullet list with a metrics table; add a held-out-family / in-fit-sibling table.

Framing, abstract, data:
- Make explicit throughout that the experiment evaluates the sampling (record reduction) step, not the calibration; name all three baselines in the abstract and define survey-weight sampling as the Hansen-Hurwitz scheme.
- Pin the populace-us 2024 artifact (snapshot/commit/SHA, 75,112 households); correct the data section to reflect one row per household (no multi-geography record multiplication); fill the imputation donor table and QRF method; delete the take-up paragraph.
- Fill the calibration-targets table and the appendix source table from PolicyEngine Ledger (arch-data); add the target-selection rationale (national+state surface, district targets deferred to the full build-big-then-prune).

Bibliography:
- Verify entries against sources; fix williamson1998 title and meyer2021 authors/year; replace the unverifiable rothbaum2021 with meyer2015 (Household Surveys in Crisis); add ledger2026; cite bryant2023a and pytorch2019; remove five uncited entries.

Compiles cleanly with basictex (38 pages, no undefined references or citations).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
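The Hansen-Hurwitz survey-weight baseline mentioned above can be illustrated with a generic PPS-with-replacement draw. This is a sketch under stated assumptions, not the experiment's implementation: the function name is hypothetical, and it assumes the selection probability of a record is proportional to its survey weight and that every draw carries an equal share of the total weight.

```python
import random
from collections import Counter


def hansen_hurwitz_sample(weights, n_draws, seed=0):
    """Draw n_draws records with replacement, with probability proportional
    to weight, and return {record index: new weight}.

    A record drawn k times gets weight k * (total / n_draws). The expected
    value of k is n_draws * w_i / total, so the expected new weight equals
    the original w_i, which is what makes the scheme unbiased.
    """
    rng = random.Random(seed)
    total = sum(weights)
    draws = rng.choices(range(len(weights)), weights=weights, k=n_draws)
    counts = Counter(draws)  # integer multiplicity of each selected record
    per_draw = total / n_draws
    return {i: k * per_draw for i, k in counts.items()}


survey_weights = [120.0, 80.0, 300.0, 50.0, 450.0]
subset = hansen_hurwitz_sample(survey_weights, n_draws=3, seed=42)
print(subset)
print(sum(subset.values()), sum(survey_weights))
```

The retained subset has at most n_draws distinct records and preserves the total weight by construction, while the multiplicities stay integer-valued.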

Replace the scaffolded placeholders with the interpretation of the weighted-loss-3seed run (3 seeds, budgets 2k-40k, l2 in {0, 1e-4}) and reframe the claims to what those results actually show.

Narrative changes:
- Reframe from a blanket accuracy claim to honest-competitive + operability. Informed L0's decisive, significant accuracy win is confined to aggressive compression (2k: OOS mean ARE 53.5% vs random+reweight 2,293%, p=0.017). The methods cross near 5k; at deployable budgets (>=10k) random+reweight is simpler and marginally but significantly more accurate (+7.9pp at 10k, p=0.02). Survey-weight (Hansen-Hurwitz) is the least accurate throughout.
- Position the distinctive contribution as operational control (size via lambda_L0, geographic emphasis, weight concentration in one optimization) plus robustness to a bad random draw, motivated for build-big-then-prune and subnational regimes as design intent rather than a measured result.
- Headline accuracy reported at lambda_L2=0; lambda_L2=1e-4 presented as the operability axis, with its budget-dependent accuracy cost stated (cheap at 40k: +52% ESS for ~1pp; costly at 10k: +10pp median).
- Report in-sample fit and the generalization gap (L0 generalizes with a smaller gap than random+reweight despite a looser in-sample fit), the per-fold rotation (median best for L0 in 3/5 folds; target-weighted mean ranking reverses on the irs_soi fold), and the named degenerate-target audit (741 fit targets, chiefly irs_soi; targeted-removal vs median).
- Methodology now documents the capped weighted MAPE (cap c=10, sum-omega normalization) and how the production_us_fiscal per-target weights are computed (count/amount value basis, sqrt-magnitude within basis, cross-basis equal-total balancing, mean-1). Correct Algorithm 1's calibration loss to the omega-weighted form so it matches the loss the runs used.

Result inclusion:
- Fill every \tbc placeholder across abstract, methodology, results, discussion, conclusion, introduction, appendix, and the presets table.
- Wire in the run's tables (sampling_comparison, paired_comparison, calibration_accuracy) and figures F1-F6 (PNGs in, stale PDFs removed); rebuild paper/main.pdf.

Compiles clean with no undefined references.

Also stage a .gitignore entry and experiments/merge_l2_runs.py (l2-run merge helper) that were prepared alongside this work.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
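The capped weighted MAPE and the per-target weighting described in this commit message could look roughly like the sketch below. It is one possible reading of the one-line description, not the production_us_fiscal code: the function names are hypothetical, and it assumes "sqrt-magnitude" means a raw weight of sqrt(|t|), "cross-basis equal-total balancing" means each basis (count or amount) ends up with the same total weight, and "mean-1" means a final rescale to mean 1. It also assumes no basis has all-zero targets.

```python
import math


def target_weights(targets, basis):
    """Per-target weights omega (assumed reading of the scheme)."""
    raw = [math.sqrt(abs(t)) for t in targets]        # sqrt-magnitude
    basis_total = {}
    for r, b in zip(raw, basis):
        basis_total[b] = basis_total.get(b, 0.0) + r
    # Scale so that every basis carries the same total weight.
    balanced = [r / basis_total[b] for r, b in zip(raw, basis)]
    mean = sum(balanced) / len(balanced)
    return [x / mean for x in balanced]               # rescale to mean 1


def capped_weighted_mape(estimates, targets, omega, cap=10.0):
    """Omega-weighted mean of capped relative errors, normalised by sum(omega)."""
    weighted = [w * min(abs(e - t) / max(abs(t), 1.0), cap)
                for e, t, w in zip(estimates, targets, omega)]
    return sum(weighted) / sum(omega)


targets = [100.0, 400.0, 1e6, 4e6]
basis = ["count", "count", "amount", "amount"]
omega = target_weights(targets, basis)
print(omega)

# Miss the first target completely, hit the other three exactly.
estimates = [0.0, 400.0, 1e6, 4e6]
print(capped_weighted_mape(estimates, targets, omega))
```

Under this reading, a dollar-amount target in the millions does not swamp a count target in the hundreds, because balancing happens within each basis before the two are combined.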

Background:
- Remove the standalone combinatorial-optimisation subsection and fold it into a one-sentence mention under "Informed selection", citing Williamson (1998) and Voas & Williamson (2000) as the closest gradient-free precedent rather than a benchmarked comparison method.
- Correct the Hard Concrete deterministic-gate description. The gate does not output only 0s and 1s: it returns an exact 0 below a threshold and an exact 1 above another, with intermediate values in between. The exact zeros give the sparse weight vector; a record left with an intermediate gate is retained with its weight scaled by the gate. (Confirmed against the gates.py eval path.)

Figures vs text:
- Captions and the first figure reference now say "average retained records" to match the plotted x-axis (achieved counts 1,962-39,586), with a one-time note that the achieved count lands within 3% of each requested budget.
- Reconcile the generalization section with F3: F3 plots the mean-ARE gap, so lead with the median gap (L0 smaller) and describe the mean gap accurately (L0 stable ~5-19pp; random+reweight volatile, about -160pp at the smallest budget where its in- and out-of-sample means are both inflated).
- Bridge the figure family labels (census_stc, cbo) to the paper's names in the by-family text and F4 caption.

Appendices:
- Replace the Compact/Detailed sample-size presets with the requested-count to lambda_L0 mapping: explain the eight-step bisection and report the lambda_L0 each requested count converged to (4.9e-5 at 2k down to 2.1e-6 at 40k) with the achieved record count. Align the hyperparameter table's lambda_L0 range.
- Reorder the appendices by first appearance in the text: A targets, B record budget -> sparsity penalty, C hyperparameters, D algorithm, E assembly, F cost. Reference every appendix in the body: Appendix B at the L0-penalty term in methodology, Appendix D for the optimization loop, and Appendix F in the discussion's computational-cost limitation.

Also includes the updated result figures (F1-F3, F6) and drops the empty congressional-district row from the geography table.

Compiles clean (42 pages, no undefined references).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
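The corrected deterministic-gate behaviour described in this commit message matches the usual Hard Concrete evaluation rule, which can be sketched as follows. This is not copied from gates.py: the function names are hypothetical, and the stretch limits gamma = -0.1 and zeta = 1.1 are the values commonly used in the literature, assumed here for illustration.

```python
import math


def deterministic_gate(log_alpha, gamma=-0.1, zeta=1.1):
    """Evaluation-time Hard Concrete gate (assumed standard form).

    The sigmoid is stretched from (0, 1) to (gamma, zeta) and then clamped
    to [0, 1], so sufficiently negative log_alpha gives an exact 0,
    sufficiently positive log_alpha gives an exact 1, and values in
    between stay fractional.
    """
    s = 1.0 / (1.0 + math.exp(-log_alpha))
    return min(1.0, max(0.0, s * (zeta - gamma) + gamma))


def effective_weight(weight, log_alpha):
    """A retained record keeps its weight scaled by the gate value."""
    return weight * deterministic_gate(log_alpha)


for la in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(la, deterministic_gate(la), effective_weight(200.0, la))
```

With these limits the gate is exactly 0 once log_alpha falls below about -2.4 and exactly 1 above about +2.4. Records in between are kept with a fractional gate, which is why the sparse weight vector comes from the exact zeros rather than from a binary mask.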

juaristi22 and others added 6 commits June 24, 2026 15:03

Add populace pipeline figure

fd3ca87

juaristi22 force-pushed the paper-results branch from d49a65a to 989f459 on June 24, 2026 14:24

juaristi22 wants to merge 6 commits into main from paper-results
