Reconcile paper with expanded experiment design; improve narrative #9

Open
juaristi22 wants to merge 6 commits into main from paper-results

Conversation

juaristi22 (Collaborator) commented Jun 22, 2026

Summary

Reconciles the manuscript with the expanded experiment design (PR #8) now that the methodology and experimental design are implemented and stable, and refocuses the narrative on what the evidence can actually support. Result numbers are intentionally left as placeholders pending the expanded-5seed rerun; the figures, tables, and section structure are final, so the rerun just fills slots.

Closes #2.
Closes #3.

Narrative, method comparison, and claiming (#2)

  • Three-way matched-budget benchmark. The comparison is now informed $L_0$ / Hard-Concrete vs. two baselines: dense Populace calibration then weighted sampling, and uniform random subset then reweighting. The survey-weight baseline is rewritten as the unbiased Hansen–Hurwitz PPS-with-replacement integerisation. Combinatorial optimisation is dropped from the active comparison; GREG, IPF, and combinatorial remain in the background as context and no longer read as active experimental baselines.
  • Loss reconciled with the solver. The methodology, appendix table, and algorithm now carry the soft l2_lambda concentration term the solver implements. Concentration is framed as two controls on the weight-inflation ratio $w_i/w_{0,i}$: soft l2_lambda (the swept operability knob, informed-$L_0$ only) and hard max_weight_ratio (per-record cap; headline runs uncapped, reported as an output diagnostic).
  • Leak-free validation design. Fixed family-level holdout (cms_medicaid + usda_snap + state_income_tax, plus cbo as validation-only, scored out-of-sample but never fit), with the rationale for whole-family rather than random splits. The family-rotation panel is reported per fold as a descriptive robustness check, not as iid replicates.
  • Metrics. Median-led accuracy with the mean flagged as tail-sensitive; ESS / design effect reported as a primary result; micro and family-macro averages; and an identifiability-floor degenerate-target detector with a named targeted-removal sensitivity (no winsorization).
  • Honest framing. Abstract, introduction, and discussion reframed around graceful degradation under aggressive compression plus a measured size/geography/concentration operability frontier, rather than a blanket accuracy win. The abstract's empirical headline is held until the final numbers are in.
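
For readers unfamiliar with the survey-weight baseline, the Hansen-Hurwitz PPS-with-replacement integerisation named above can be sketched as below. This is a minimal illustration, not the code the experiment runs: the function name `hansen_hurwitz_integerise` and its arguments are hypothetical, and the sketch assumes selection probabilities proportional to the calibrated weights.

```python
import numpy as np


def hansen_hurwitz_integerise(weights, n_draws, seed=0):
    """PPS-with-replacement draw (hypothetical helper, for illustration only).

    Returns the integer hit count per record and the implied new weights.
    """
    w = np.asarray(weights, dtype=float)
    total = w.sum()
    rng = np.random.default_rng(seed)
    # Each of the n_draws selections picks record i with probability w_i / total,
    # so the hit counts follow a multinomial distribution.
    counts = rng.multinomial(n_draws, w / total)
    # Every hit carries total / n_draws, which gives E[new_w_i] = w_i
    # and keeps the weighted population total unchanged.
    new_weights = counts * (total / n_draws)
    return counts, new_weights


counts, new_w = hansen_hurwitz_integerise([1.0, 2.0, 3.0, 4.0], n_draws=50)
print(counts.sum(), new_w.sum())
```

Records with zero hits drop out of the reduced sample, so the number of retained records is at most `n_draws`.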

Populace-centered manuscript (#3)

The manuscript presents Populace as the active implementation surface throughout the sections this PR touches: the introduction and background describe the method as one calibration option inside the modular Populace pipeline, and the methodology defines the experiment against the Populace calibration solver. The Microplex→Populace rewrite of the pipeline section (data.tex) already landed on main; the remaining item for #3 is the pipeline overview diagram, which is tracked in PR #5 (fig:pipeline is still a placeholder here).

Style

Applied the humanizer and PolicyEngine writing skills: reduced em-dash overuse, removed AI-vocabulary, and preferred active voice and simple copulas.

Intentionally not in this PR (placeholders to fill)

  • All result numbers and the F1 / F6 / convergence figures → from the expanded-5seed rerun.
  • The pipeline overview diagram (fig:pipeline) → PR #5 (Add Populace pipeline overview brief and figure).
  • The supporting survey-statistics citations (Kish design effect/ESS, Potter 1990, Hainmueller 2012, Voas & Williamson 2001, Hyndman & Koehler 2006) → tracked follow-up; they need web verification before they go in.
  • L2_CHOSEN, the exact out-of-sample / fit / total target counts, and the bisection accuracy → \tbc until the rerun.

The paper compiles cleanly (tectonic): no undefined references or citations.

🤖 Generated with Claude Code

juaristi22 and others added 6 commits June 24, 2026 15:03
…ceholders

Aligns the methodology, experimental design, and narrative framing with the
expanded round (PR #8) now that those are implemented and stable. Result numbers
are left as placeholders pending the expanded-5seed rerun.

Methodology / appendix / algorithm:
- Restore the L2 soft-concentration term in the loss (the solver now implements
  l2_lambda): three terms = capped mean absolute relative calibration loss
  + lambda_L0 expected open gates + lambda_L2 mean((w_i/w0_i)^2).
- Frame two concentration controls on the inflation ratio w_i/w0_i: soft
  l2_lambda (the swept operability knob, informed-L0 only) and hard
  max_weight_ratio (per-record cap; headline runs uncapped).
- 3-way benchmark: drop the combinatorial baseline from the active comparison
  (reviewed in background, not benchmarked); rewrite the survey-weight baseline
  as the unbiased Hansen-Hurwitz PPS-with-replacement integerisation.
- Widened leak-free family holdout (cms_medicaid + usda_snap + state_income_tax
  + cbo), with the rationale for whole-family (not random) splits; 5 seeds;
  5-fold family rotation reported per fold.
- Median-led metrics; ESS / design effect reported as a primary result;
  identifiability-floor degenerate detector + targeted-removal sensitivity
  (no winsorization). Fix appendix hyperparameters (temperature 0.25, add
  lambda_L2 and target-loss-cap rows).
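
Written out, the three-term loss described above could take the following form. This is a sketch under stated assumptions rather than the manuscript's equation: the symbols $\hat{t}_j(w)$ (weighted estimate of target $j$), $s_j$ (the per-target relative-error scale), $c$ (the target-loss cap), and $\pi_i$ (the probability that gate $i$ is open) are notation introduced here, and whether the gate term is a sum or a mean is not stated in this commit.

```latex
\mathcal{L}(w, \pi)
  = \underbrace{\frac{1}{m}\sum_{j=1}^{m}
      \min\!\left(\frac{\lvert \hat{t}_j(w) - t_j \rvert}{s_j},\; c\right)}_{\text{capped calibration loss}}
  + \lambda_{L_0}\,\underbrace{\sum_{i=1}^{n} \pi_i}_{\text{expected open gates}}
  + \lambda_{L_2}\,\underbrace{\frac{1}{n}\sum_{i=1}^{n}
      \left(\frac{w_i}{w_{0,i}}\right)^{2}}_{\text{concentration}}
```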

Narrative (abstract / introduction / discussion / background):
- Reframe to graceful degradation under aggressive compression plus a measured
  size/geography/concentration operability frontier, not a blanket accuracy win.
- Defer the abstract's empirical headline until the final numbers are in.

Results + tables + frontier figure:
- Reset numbers to placeholders but upgrade the structure to the new design
  (ESS column, Hansen-Hurwitz row, median-led, F6 operability slot); drop the
  combinatorial row; holdout counts to be filled from the rerun.
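
The ESS column follows the standard Kish definitions, which can be sketched as below. The helper names are hypothetical; the formulas are the textbook ones, $\mathrm{ESS} = (\sum_i w_i)^2 / \sum_i w_i^2$ and design effect $= n / \mathrm{ESS}$ over the $n$ retained records, and the paper's own implementation may differ in detail.

```python
import numpy as np


def kish_ess(weights):
    """Kish effective sample size over records with positive weight."""
    w = np.asarray(weights, dtype=float)
    w = w[w > 0]  # records pruned to zero weight do not count
    return w.sum() ** 2 / np.square(w).sum()


def kish_design_effect(weights):
    """Design effect from unequal weighting: retained records divided by ESS."""
    w = np.asarray(weights, dtype=float)
    n_retained = np.count_nonzero(w > 0)
    return n_retained / kish_ess(w)


print(kish_ess([1, 1, 1, 1]))      # equal weights: ESS equals the record count
print(kish_design_effect([3, 1]))  # unequal weights: design effect above 1
```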

Style pass (humanizer + PolicyEngine writing skill): reduce em-dash overuse,
remove AI-vocabulary, prefer active voice and simple copulas.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Generated by experiments/figures.py (expanded-3seed run): frontier, usability,
generalization gap, by-held-out-family, cost-accuracy, and the lambda_L2
operability trade. PolicyEngine design system styling.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Brings the paper-results draft to a near-complete state on the expanded-3seed
run. Empirical numbers, the discussion, the lambda_L2 headline value, and the
sample-size presets are deliberately left as TBC.

Figures and layout:
- Wire the six expanded-3seed figures into the text at their references: F1
  frontier and F3/F4 generalization in results, F2/F6 size-and-concentration
  controls in results, F5 cost-accuracy in a new appendix section; drop the
  convergence placeholder.
- Switch all floats from [ht] to [H] (float package) so tables, figures, and the
  algorithm render where referenced instead of drifting to the section end.
- Fix table margins: widen the appendix source-table family column and wrap the
  hyperparameter Role column so no table overflows the text width.

Methodology:
- Restructure as a shared calibrator (capped weighted MAPE, Adam on log-weights,
  mass conservation; attributed to Populace's calibrate machine) plus the L0 gate
  and L2 concentration terms that informed L0 adds; fix the loss scale to
  max(|t|, 1).
- Delete the combinatorial-optimisation paragraph; correct the seed count to 3.
- Replace the evaluation-metrics bullet list with a metrics table; add a
  held-out-family / in-fit-sibling table.
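
The whole-family holdout behind the held-out-family table can be sketched as below. The family names are the ones listed in the PR summary; the function name, the target names, and the mapping format are hypothetical and only illustrate splitting by family rather than by random target.

```python
# Families named in the PR summary as held out or validation-only.
HELD_OUT_FAMILIES = {"cms_medicaid", "usda_snap", "state_income_tax"}
VALIDATION_ONLY_FAMILIES = {"cbo"}


def split_targets_by_family(target_families):
    """Split targets into (fit, out_of_sample) by whole family.

    target_families maps a target name to its family (hypothetical format).
    No target from an excluded family ever reaches the fit set, so a held-out
    family cannot leak through a sibling target from the same source.
    """
    excluded = HELD_OUT_FAMILIES | VALIDATION_ONLY_FAMILIES
    fit = [name for name, fam in target_families.items() if fam not in excluded]
    out_of_sample = [name for name, fam in target_families.items() if fam in excluded]
    return fit, out_of_sample


# Toy mapping with made-up target names.
toy = {
    "soi_returns_total": "irs_soi",
    "snap_households": "usda_snap",
    "medicaid_enrollment": "cms_medicaid",
    "cbo_income_tax": "cbo",
}
print(split_targets_by_family(toy))
```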

Framing, abstract, data:
- Make explicit throughout that the experiment evaluates the sampling (record
  reduction) step, not the calibration; name all three baselines in the abstract
  and define survey-weight sampling as the Hansen-Hurwitz scheme.
- Pin the populace-us 2024 artifact (snapshot/commit/SHA, 75,112 households);
  correct the data section to reflect one row per household (no multi-geography
  record multiplication); fill the imputation donor table and QRF method; delete
  the take-up paragraph.
- Fill the calibration-targets table and the appendix source table from
  PolicyEngine Ledger (arch-data); add the target-selection rationale
  (national+state surface, district targets deferred to the full
  build-big-then-prune).

Bibliography:
- Verify entries against sources; fix williamson1998 title and meyer2021
  authors/year; replace the unverifiable rothbaum2021 with meyer2015 (Household
  Surveys in Crisis); add ledger2026; cite bryant2023a and pytorch2019; remove
  five uncited entries.

Compiles cleanly with basictex (38 pages, no undefined references or citations).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Replace the scaffolded placeholders with the interpretation of the
weighted-loss-3seed run (3 seeds, budgets 2k-40k, l2 in {0, 1e-4}) and
reframe the claims to what those results actually show.

Narrative changes:
- Reframe from a blanket accuracy claim to an honest claim of competitive accuracy plus operability.
  Informed L0's decisive, significant accuracy win is confined to aggressive
  compression (2k: OOS mean ARE 53.5% vs random+reweight 2,293%, p=0.017). The
  methods cross near 5k; at deployable budgets (>=10k) random+reweight is
  simpler and marginally but significantly more accurate (+7.9pp at 10k,
  p=0.02). Survey-weight (Hansen-Hurwitz) is the least accurate throughout.
- Position the distinctive contribution as operational control (size via
  lambda_L0, geographic emphasis, weight concentration in one optimization)
  plus robustness to a bad random draw, motivated for build-big-then-prune and
  subnational regimes as design intent rather than a measured result.
- Headline accuracy reported at lambda_L2=0; lambda_L2=1e-4 presented as the
  operability axis, with its budget-dependent accuracy cost stated (cheap at
  40k: +52% ESS for ~1pp; costly at 10k: +10pp median).
- Report in-sample fit and the generalization gap (L0 generalizes with a
  smaller gap than random+reweight despite a looser in-sample fit), the
  per-fold rotation (median best for L0 in 3/5 folds; target-weighted mean
  ranking reverses on the irs_soi fold), and the named degenerate-target audit
  (741 fit targets, chiefly irs_soi; targeted-removal vs median).
- Methodology now documents the capped weighted MAPE (cap c=10, sum-omega
  normalization) and how the production_us_fiscal per-target weights are
  computed (count/amount value basis, sqrt-magnitude within basis, cross-basis
  equal-total balancing, mean-1). Correct Algorithm 1's calibration loss to the
  omega-weighted form so it matches the loss the runs used.
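
One possible reading of the capped weighted MAPE and the per-target weights described above is sketched below. It is an interpretation, not the solver's code: both function names are hypothetical, the `max(|t|, 1)` scale is taken from the earlier commit message, and the floor of 1 inside the square root is an assumption added here so that zero-valued targets keep a positive weight.

```python
import numpy as np


def capped_weighted_mape(estimates, targets, omega, cap=10.0):
    """Omega-weighted mean of capped absolute relative errors (sum-omega normalised)."""
    est = np.asarray(estimates, dtype=float)
    t = np.asarray(targets, dtype=float)
    om = np.asarray(omega, dtype=float)
    rel = np.abs(est - t) / np.maximum(np.abs(t), 1.0)  # scale max(|t|, 1)
    return float(np.sum(om * np.minimum(rel, cap)) / np.sum(om))


def per_target_weights(targets, basis):
    """Sqrt-magnitude weights, balanced across value bases, normalised to mean 1.

    basis holds one label per target, e.g. "count" or "amount".
    """
    t = np.abs(np.asarray(targets, dtype=float))
    basis = np.asarray(basis)
    omega = np.sqrt(np.maximum(t, 1.0))  # assumption: floor at 1 to avoid zero weights
    for label in np.unique(basis):
        mask = basis == label
        omega[mask] /= omega[mask].sum()  # every basis ends with the same total
    return omega / omega.mean()  # overall mean weight of 1


omega = per_target_weights([100.0, 400.0, 9.0, 16.0], ["amount", "amount", "count", "count"])
print(capped_weighted_mape([110.0, 400.0, 9.0, 16.0], [100.0, 400.0, 9.0, 16.0], omega))
```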

Result inclusion:
- Fill every \tbc placeholder across abstract, methodology, results,
  discussion, conclusion, introduction, appendix, and the presets table.
- Wire in the run's tables (sampling_comparison, paired_comparison,
  calibration_accuracy) and figures F1-F6 (PNGs in, stale PDFs removed); rebuild
  paper/main.pdf. Compiles clean with no undefined references.

Also stage a .gitignore entry and experiments/merge_l2_runs.py (l2-run merge
helper) that were prepared alongside this work.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Background:
- Remove the standalone combinatorial-optimisation subsection and fold it into
  a one-sentence mention under "Informed selection", citing Williamson (1998)
  and Voas & Williamson (2000) as the closest gradient-free precedent rather
  than a benchmarked comparison method.
- Correct the Hard Concrete deterministic-gate description. The gate does not
  output only 0s and 1s: it returns an exact 0 below a threshold and an exact 1
  above another, with intermediate values in between. The exact zeros give the
  sparse weight vector; a record left with an intermediate gate is retained with
  its weight scaled by the gate. (Confirmed against the gates.py eval path.)
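
The corrected gate behaviour matches the standard deterministic Hard Concrete gate of Louizos et al. (2018), sketched below. The stretch limits `GAMMA = -0.1` and `ZETA = 1.1` are the defaults from that paper and are an assumption here; the values and exact form in gates.py may differ.

```python
import numpy as np

GAMMA, ZETA = -0.1, 1.1  # stretch limits from Louizos et al. (2018); assumed, not read from gates.py


def deterministic_gate(log_alpha, gamma=GAMMA, zeta=ZETA):
    """Evaluation-time Hard Concrete gate: stretched sigmoid clipped to [0, 1]."""
    s = 1.0 / (1.0 + np.exp(-np.asarray(log_alpha, dtype=float)))
    # Stretching to (gamma, zeta) and clipping yields exact 0s and exact 1s
    # at the extremes, with intermediate values in between.
    return np.clip(s * (zeta - gamma) + gamma, 0.0, 1.0)


log_alpha = np.array([-5.0, 0.0, 5.0])
gates = deterministic_gate(log_alpha)
weights = np.array([120.0, 80.0, 200.0])
# A record with an intermediate gate is kept, with its weight scaled by the gate.
print(gates, weights * gates)
```

With these limits the gate is exactly 0 for `log_alpha` below about -2.4 and exactly 1 above about 2.4, which is where the sparse weight vector comes from.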

Figures vs text:
- Captions and the first figure reference now say "average retained records" to
  match the plotted x-axis (achieved counts 1,962-39,586), with a one-time note
  that the achieved count lands within 3% of each requested budget.
- Reconcile the generalization section with F3: F3 plots the mean-ARE gap, so
  lead with the median gap (L0 smaller) and describe the mean gap accurately
  (L0 stable ~5-19pp; random+reweight volatile, about -160pp at the smallest
  budget where its in- and out-of-sample means are both inflated).
- Bridge the figure family labels (census_stc, cbo) to the paper's names in the
  by-family text and F4 caption.

Appendices:
- Replace the Compact/Detailed sample-size presets with the requested-count to
  lambda_L0 mapping: explain the eight-step bisection and report the lambda_L0
  each requested count converged to (4.9e-5 at 2k down to 2.1e-6 at 40k) with
  the achieved record count. Align the hyperparameter table's lambda_L0 range.
- Reorder the appendices by first appearance in the text: A targets, B record
  budget -> sparsity penalty, C hyperparameters, D algorithm, E assembly,
  F cost. Reference every appendix in the body: Appendix B at the L0-penalty
  term in methodology, Appendix D for the optimization loop, and Appendix F in
  the discussion's computational-cost limitation.
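
The eight-step bisection from a requested record count to lambda_L0 can be sketched as below. Everything in the block is illustrative: `bisect_lambda_l0` and `count_for_lambda` are hypothetical names, the search on a geometric (log-scale) midpoint is an assumption, and the toy count function stands in for a full calibration run.

```python
import math


def bisect_lambda_l0(count_for_lambda, requested, lo, hi, steps=8):
    """Search for the lambda_L0 whose retained record count matches the request.

    count_for_lambda maps lambda_L0 to a retained count and is assumed to be
    non-increasing: a larger sparsity penalty keeps fewer records.
    """
    for _ in range(steps):
        mid = math.sqrt(lo * hi)  # geometric midpoint, since lambda spans decades
        if count_for_lambda(mid) > requested:
            lo = mid  # too many records kept: penalise harder
        else:
            hi = mid  # at or below the request: relax the penalty
    best = math.sqrt(lo * hi)
    return best, count_for_lambda(best)


def toy_count(lam):
    """Toy stand-in for a calibration run; not the measured relationship."""
    return 0.1 / lam


lam, achieved = bisect_lambda_l0(toy_count, requested=10_000, lo=1e-7, hi=1e-2)
print(lam, achieved)
```

Each step costs one full calibration run in practice, which is why a small fixed number of steps is attractive.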

Also includes the updated result figures (F1-F3, F6) and drops the empty
congressional-district row from the geography table. Compiles clean (42 pages,
no undefined references).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

Rewrite pipeline section around Populace
Refocus literature review, method comparison, and paper narrative