Reconcile paper with expanded experiment design; improve narrative #9
Open
juaristi22 wants to merge 6 commits into
…ceholders
Aligns the methodology, experimental design, and narrative framing with the expanded round (PR #8) now that those are implemented and stable. Result numbers are left as placeholders pending the expanded-5seed rerun.
Methodology / appendix / algorithm:
- Restore the L2 soft-concentration term in the loss (the solver now implements l2_lambda): three terms = capped mean absolute relative calibration loss + lambda_L0 expected open gates + lambda_L2 mean((w_i/w0_i)^2).
- Frame two concentration controls on the inflation ratio w_i/w0_i: soft l2_lambda (the swept operability knob, informed-L0 only) and hard max_weight_ratio (per-record cap; headline runs uncapped).
- 3-way benchmark: drop the combinatorial baseline from the active comparison (reviewed in background, not benchmarked); rewrite the survey-weight baseline as the unbiased Hansen-Hurwitz PPS-with-replacement integerisation.
- Widened leak-free family holdout (cms_medicaid + usda_snap + state_income_tax + cbo), with the rationale for whole-family (not random) splits; 5 seeds; 5-fold family rotation reported per fold.
- Median-led metrics; ESS / design effect reported as a primary result; identifiability-floor degenerate detector + targeted-removal sensitivity (no winsorization). Fix appendix hyperparameters (temperature 0.25, add lambda_L2 and target-loss-cap rows).
Narrative (abstract / introduction / discussion / background):
- Reframe to graceful degradation under aggressive compression plus a measured size/geography/concentration operability frontier, not a blanket accuracy win.
- Defer the abstract's empirical headline until the final numbers are in.
Results + tables + frontier figure:
- Reset numbers to placeholders but upgrade the structure to the new design (ESS column, Hansen-Hurwitz row, median-led, F6 operability slot); drop the combinatorial row; holdout counts to be filled from the rerun.
Style pass (humanizer + PolicyEngine writing skill): reduce em-dash overuse, remove AI-vocabulary, prefer active voice and simple copulas.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
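The survey-weight baseline and the ESS diagnostic named in this commit can be illustrated with a minimal sketch. It assumes the Hansen-Hurwitz scheme means drawing records with probability proportional to their survey weight, with replacement, and giving every draw an equal share of the total weight; it also assumes ESS is the Kish effective sample size. The function names `hansen_hurwitz_integerise` and `kish_ess` are hypothetical and are not taken from the repository.

```python
import random
from collections import Counter


def hansen_hurwitz_integerise(weights, n_draws, seed=0):
    """PPS-with-replacement draw (hypothetical helper, not the repo's API).

    Returns {record_index: new_weight} for the records hit at least once.
    """
    rng = random.Random(seed)
    total = sum(weights)
    # Draw n_draws record indices with probability proportional to weight.
    draws = rng.choices(range(len(weights)), weights=weights, k=n_draws)
    hits = Counter(draws)
    # Each draw stands for total / n_draws units, so a record hit k times
    # carries k * total / n_draws. The new weights always sum to `total`.
    per_draw = total / n_draws
    return {i: k * per_draw for i, k in hits.items()}


def kish_ess(weights):
    """Kish effective sample size: (sum w)^2 / sum w^2 (assumed definition)."""
    s = sum(weights)
    s2 = sum(w * w for w in weights)
    return s * s / s2


sample = hansen_hurwitz_integerise([1.0, 2.0, 3.0, 4.0], n_draws=100)
ess = kish_ess(list(sample.values()))
```

Because each draw carries the same weight, weighted totals computed from the sample are unbiased for the full-file totals, and the design effect under this assumed definition is the retained record count divided by `ess`.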
Generated by experiments/figures.py (expanded-3seed run): frontier, usability, generalization gap, by-held-out-family, cost-accuracy, and the lambda_L2 operability trade. PolicyEngine design system styling.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Brings the paper-results draft to a near-complete state on the expanded-3seed run. Empirical numbers, the discussion, the lambda_L2 headline value, and the sample-size presets are deliberately left as TBC.
Figures and layout:
- Wire the six expanded-3seed figures into the text at their references: F1 frontier and F3/F4 generalization in results, F2/F6 size-and-concentration controls in results, F5 cost-accuracy in a new appendix section; drop the convergence placeholder.
- Switch all floats from [ht] to [H] (float package) so tables, figures, and the algorithm render where referenced instead of drifting to the section end.
- Fix table margins: widen the appendix source-table family column and wrap the hyperparameter Role column so no table overflows the text width.
Methodology:
- Restructure as a shared calibrator (capped weighted MAPE, Adam on log-weights, mass conservation; attributed to Populace's calibrate machine) plus the L0 gate and L2 concentration terms that informed L0 adds; fix the loss scale to max(|t|, 1).
- Delete the combinatorial-optimisation paragraph; correct the seed count to 3.
- Replace the evaluation-metrics bullet list with a metrics table; add a held-out-family / in-fit-sibling table.
Framing, abstract, data:
- Make explicit throughout that the experiment evaluates the sampling (record reduction) step, not the calibration; name all three baselines in the abstract and define survey-weight sampling as the Hansen-Hurwitz scheme.
- Pin the populace-us 2024 artifact (snapshot/commit/SHA, 75,112 households); correct the data section to reflect one row per household (no multi-geography record multiplication); fill the imputation donor table and QRF method; delete the take-up paragraph.
- Fill the calibration-targets table and the appendix source table from PolicyEngine Ledger (arch-data); add the target-selection rationale (national+state surface, district targets deferred to the full build-big-then-prune).
Bibliography:
- Verify entries against sources; fix williamson1998 title and meyer2021 authors/year; replace the unverifiable rothbaum2021 with meyer2015 (Household Surveys in Crisis); add ledger2026; cite bryant2023a and pytorch2019; remove five uncited entries.
Compiles cleanly with basictex (38 pages, no undefined references or citations).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the scaffolded placeholders with the interpretation of the
weighted-loss-3seed run (3 seeds, budgets 2k-40k, l2 in {0, 1e-4}) and
reframe the claims to what those results actually show.
Narrative changes:
- Reframe from a blanket accuracy claim to honest-competitive + operability.
Informed L0's decisive, significant accuracy win is confined to aggressive
compression (2k: OOS mean ARE 53.5% vs random+reweight 2,293%, p=0.017). The
methods cross near 5k; at deployable budgets (>=10k) random+reweight is
simpler and marginally but significantly more accurate (+7.9pp at 10k,
p=0.02). Survey-weight (Hansen-Hurwitz) is the least accurate throughout.
- Position the distinctive contribution as operational control (size via
lambda_L0, geographic emphasis, weight concentration in one optimization)
plus robustness to a bad random draw, motivated for build-big-then-prune and
subnational regimes as design intent rather than a measured result.
- Headline accuracy reported at lambda_L2=0; lambda_L2=1e-4 presented as the
operability axis, with its budget-dependent accuracy cost stated (cheap at
40k: +52% ESS for ~1pp; costly at 10k: +10pp median).
- Report in-sample fit and the generalization gap (L0 generalizes with a
smaller gap than random+reweight despite a looser in-sample fit), the
per-fold rotation (median best for L0 in 3/5 folds; target-weighted mean
ranking reverses on the irs_soi fold), and the named degenerate-target audit
(741 fit targets, chiefly irs_soi; targeted-removal vs median).
- Methodology now documents the capped weighted MAPE (cap c=10, sum-omega
normalization) and how the production_us_fiscal per-target weights are
computed (count/amount value basis, sqrt-magnitude within basis, cross-basis
equal-total balancing, mean-1). Correct Algorithm 1's calibration loss to the
omega-weighted form so it matches the loss the runs used.
Result inclusion:
- Fill every \tbc placeholder across abstract, methodology, results,
discussion, conclusion, introduction, appendix, and the presets table.
- Wire in the run's tables (sampling_comparison, paired_comparison,
calibration_accuracy) and figures F1-F6 (PNGs in, stale PDFs removed); rebuild
paper/main.pdf. Compiles clean with no undefined references.
Also stage a .gitignore entry and experiments/merge_l2_runs.py (l2-run merge
helper) that were prepared alongside this work.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
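The loss described in this commit can be sketched as follows. This is a minimal illustration, not the solver's code: it assumes the per-target relative error uses the scale max(|t|, 1) from the earlier methodology commit, that the cap c=10 is applied per target before omega-weighting, and that the L2 term averages the squared inflation ratio over all records. The function names and the scalar `expected_open_gates` argument are hypothetical.

```python
def capped_weighted_mape(estimates, targets, omegas, cap=10.0):
    """Omega-weighted mean of per-target absolute relative errors, each capped."""
    weighted_sum = 0.0
    for est, t, omega in zip(estimates, targets, omegas):
        # Relative error with the scale floored at 1 so near-zero targets
        # do not blow up the ratio.
        are = abs(est - t) / max(abs(t), 1.0)
        # Cap each target's contribution, then apply its weight.
        weighted_sum += omega * min(are, cap)
    # Sum-omega normalization keeps the loss on the scale of one relative error.
    return weighted_sum / sum(omegas)


def l2_concentration(weights, base_weights):
    """Mean squared inflation ratio, mean((w_i / w0_i)^2)."""
    ratios = [w / w0 for w, w0 in zip(weights, base_weights)]
    return sum(r * r for r in ratios) / len(ratios)


def total_loss(estimates, targets, omegas, weights, base_weights,
               expected_open_gates, lambda_l0, lambda_l2, cap=10.0):
    """Three-term objective: calibration + L0 sparsity + L2 concentration."""
    return (
        capped_weighted_mape(estimates, targets, omegas, cap)
        + lambda_l0 * expected_open_gates
        + lambda_l2 * l2_concentration(weights, base_weights)
    )
```

Setting `lambda_l2=0` drops the concentration term, which corresponds to the configuration the commit uses for headline accuracy.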
Background:
- Remove the standalone combinatorial-optimisation subsection and fold it into a one-sentence mention under "Informed selection", citing Williamson (1998) and Voas & Williamson (2000) as the closest gradient-free precedent rather than a benchmarked comparison method.
- Correct the Hard Concrete deterministic-gate description. The gate does not output only 0s and 1s: it returns an exact 0 below a threshold and an exact 1 above another, with intermediate values in between. The exact zeros give the sparse weight vector; a record left with an intermediate gate is retained with its weight scaled by the gate. (Confirmed against the gates.py eval path.)
Figures vs text:
- Captions and the first figure reference now say "average retained records" to match the plotted x-axis (achieved counts 1,962-39,586), with a one-time note that the achieved count lands within 3% of each requested budget.
- Reconcile the generalization section with F3: F3 plots the mean-ARE gap, so lead with the median gap (L0 smaller) and describe the mean gap accurately (L0 stable ~5-19pp; random+reweight volatile, about -160pp at the smallest budget where its in- and out-of-sample means are both inflated).
- Bridge the figure family labels (census_stc, cbo) to the paper's names in the by-family text and F4 caption.
Appendices:
- Replace the Compact/Detailed sample-size presets with the requested-count to lambda_L0 mapping: explain the eight-step bisection and report the lambda_L0 each requested count converged to (4.9e-5 at 2k down to 2.1e-6 at 40k) with the achieved record count. Align the hyperparameter table's lambda_L0 range.
- Reorder the appendices by first appearance in the text: A targets, B record budget -> sparsity penalty, C hyperparameters, D algorithm, E assembly, F cost.
- Reference every appendix in the body: Appendix B at the L0-penalty term in methodology, Appendix D for the optimization loop, and Appendix F in the discussion's computational-cost limitation.
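The requested-count to lambda_L0 mapping mentioned under Appendices can be pictured with a short bisection sketch. It assumes the retained record count is non-increasing in lambda_L0 and that the search bisects in log space; neither detail is stated in the commit message. `bisect_lambda_l0`, `count_for`, and the toy count model are hypothetical stand-ins for a full solver run.

```python
import math


def bisect_lambda_l0(count_for, requested, lo, hi, steps=8):
    """Find a lambda_L0 whose retained count is close to `requested`.

    count_for(lam) -> retained record count, assumed non-increasing in lam.
    lo and hi are positive bounds that bracket the answer.
    """
    for _ in range(steps):
        mid = math.sqrt(lo * hi)  # geometric midpoint (log-space bisection)
        if count_for(mid) > requested:
            lo = mid  # too many records kept: a stronger penalty is needed
        else:
            hi = mid  # at or under budget: try a weaker penalty
    return math.sqrt(lo * hi)


def toy_count(lam):
    """Hypothetical stand-in for a solver run: count falls as lam grows."""
    return 0.1 / lam


lam = bisect_lambda_l0(toy_count, requested=10_000, lo=1e-7, hi=1e-3)
achieved = toy_count(lam)
```

Each step halves the bracket in log space, so eight steps shrink a four-decade bracket by a factor of 256; how close the achieved count lands then depends on how steeply the real count responds to the penalty.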
Also includes the updated result figures (F1-F3, F6) and drops the empty congressional-district row from the geography table. Compiles clean (42 pages, no undefined references).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
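The corrected Hard Concrete gate behaviour described in this commit (exact 0 below one threshold, exact 1 above another, intermediate values in between) can be illustrated with a minimal sketch. It uses the stretch limits gamma = -0.1 and zeta = 1.1 from Louizos et al. (2018) as an assumption; the values and code in the repository's gates.py may differ. The function names are hypothetical.

```python
import math

# Stretch interval from the original Hard Concrete formulation (assumed here).
GAMMA, ZETA = -0.1, 1.1


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def deterministic_gate(log_alpha):
    """Eval-time gate: stretch the sigmoid to (GAMMA, ZETA), then clip to [0, 1]."""
    stretched = sigmoid(log_alpha) * (ZETA - GAMMA) + GAMMA
    return min(1.0, max(0.0, stretched))


def gated_weights(weights, log_alphas):
    """Keep records with a non-zero gate; scale each kept weight by its gate."""
    kept = {}
    for i, (w, la) in enumerate(zip(weights, log_alphas)):
        g = deterministic_gate(la)
        if g > 0.0:
            kept[i] = w * g
    return kept


def prob_open(log_alpha, temperature=0.25):
    """Probability the stochastic gate is non-zero (summed for the L0 penalty)."""
    return sigmoid(log_alpha - temperature * math.log(-GAMMA / ZETA))


sparse = gated_weights([10.0, 10.0, 10.0], [-5.0, 0.0, 5.0])
```

With these limits the clip produces an exact zero once the sigmoid falls below 1/12 and an exact one once it exceeds 11/12, so only gates in between rescale a retained record's weight.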
d49a65a to 989f459
Summary
Reconciles the manuscript with the expanded experiment design (PR #8) now that the methodology and experimental design are implemented and stable, and refocuses the narrative on what the evidence can actually support. Result numbers are intentionally left as placeholders pending the `expanded-5seed` rerun; the figures, tables, and section structure are final, so the rerun just fills slots.

Closes #2.
Closes #3.
Narrative, method comparison, and claiming (#2)
- Restores the `l2_lambda` concentration term the solver implements. Concentration is framed as two controls on the weight-inflation ratio w_i/w0_i: soft `l2_lambda` (the swept operability knob, informed-$L_0$ only) and hard `max_weight_ratio` (per-record cap; headline runs uncapped, reported as an output diagnostic).
- Widened leak-free family holdout (`cms_medicaid` + `usda_snap` + `state_income_tax`, plus `cbo` as validation-only, scored out-of-sample but never fit), with the rationale for whole-family rather than random splits. The family-rotation panel is reported per fold as a descriptive robustness check, not as iid replicates.

Populace-centered manuscript (#3)
The manuscript presents Populace as the active implementation surface throughout the sections this PR touches: the introduction and background describe the method as one calibration option inside the modular Populace pipeline, and the methodology defines the experiment against the Populace calibration solver. The Microplex→Populace rewrite of the pipeline section (`data.tex`) already landed on main; the remaining item for #3 is the pipeline overview diagram, which is tracked in PR #5 (`fig:pipeline` is still a placeholder here).

Style
Applied the humanizer and PolicyEngine writing skills: reduced em-dash overuse, removed AI-vocabulary, and preferred active voice and simple copulas.
Intentionally not in this PR (placeholders to fill)
- Result numbers → `expanded-5seed` rerun.
- Pipeline overview diagram (`fig:pipeline`) → PR #5 (Add Populace pipeline overview brief and figure).
- `L2_CHOSEN`, the exact out-of-sample / fit / total target counts, and the bisection accuracy → `\tbc` until the rerun.

The paper compiles cleanly (tectonic): no undefined references or citations.
🤖 Generated with Claude Code