Fix employment_income zeroed out in published H5 datasets by baogorek · Pull Request #574 · PolicyEngine/policyengine-us-data

baogorek · 2026-03-05T19:06:19Z

Summary

Rename CPS-stored aggregate income variables (employment_income, self_employment_income) to their input-variable equivalents (*_before_lsr) before _drop_formula_variables() removes them, so the formula engine can recompute the aggregates from preserved raw data
Fix indentation bug in create_sparse_ecps() where empty-dict cleanup was inside the inner time_period loop instead of the outer variable loop
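
The rename-before-drop idea in the first item can be sketched as follows. This is a minimal illustration on plain dictionaries, not the code from this PR: `rename_then_drop`, `rename_map`, and `formula_variables` are hypothetical names, and the real `_drop_formula_variables()` operates on the dataset and tax-benefit system rather than on a dict.

```python
# Hypothetical stand-in for the stored dataset: variable name -> {period: values}.
def rename_then_drop(data, rename_map, formula_variables):
    """Rename stored aggregates to their input equivalents, then drop formula variables."""
    result = {}
    for name, values in data.items():
        target = rename_map.get(name, name)
        # If the input variable is already stored, keep that copy and skip the aggregate.
        if target != name and target in data:
            continue
        # Anything still named like a formula variable is dropped.
        if target in formula_variables:
            continue
        result[target] = values
    return result


data = {
    "employment_income": {2024: [50_000.0, 0.0, 32_000.0]},
    "age": {2024: [40, 12, 67]},
}
rename_map = {"employment_income": "employment_income_before_lsr"}
formula_variables = {"employment_income"}

cleaned = rename_then_drop(data, rename_map, formula_variables)
```

Because the raw values survive under the input-variable name, the aggregate can be recomputed from them instead of defaulting to zero.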

Test plan

make format passes
Spot-check: _drop_formula_variables() renames employment_income → employment_income_before_lsr and drops the aggregate
Sanity tests pass: test_ecps_employment_income_positive, test_ecps_self_employment_income_positive, test_ecps_mean_employment_income_reasonable all green
Full make data rebuild to regenerate H5 and verify employment_income.sum() > 0 end-to-end (CI step)
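
The kind of sanity check listed above might look roughly like this. It is a sketch under stated assumptions: the function names, the sample values, the weights, and `PLACEHOLDER_BOUNDS` are all illustrative and are not taken from the repository's test suite.

```python
def summarize_income(values, weights):
    """Return the weighted total and weighted mean of an income variable."""
    total = sum(v * w for v, w in zip(values, weights))
    mean = total / sum(weights)
    return total, mean


def income_looks_sane(values, weights, mean_bounds):
    """True if the weighted total is positive and the weighted mean is within bounds."""
    total, mean = summarize_income(values, weights)
    low, high = mean_bounds
    return total > 0 and low <= mean <= high


# Illustrative numbers only; the bounds are placeholders, not the project's thresholds.
PLACEHOLDER_BOUNDS = (1_000.0, 200_000.0)
weights = [1_000.0, 1_500.0, 800.0]

healthy = income_looks_sane([50_000.0, 0.0, 32_000.0], weights, PLACEHOLDER_BOUNDS)
zeroed = income_looks_sane([0.0, 0.0, 0.0], weights, PLACEHOLDER_BOUNDS)
```

A dataset hit by the bug described in this PR would fail the first condition, since every value is zero.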

Closes #573. Relates to #571, #444.

🤖 Generated with Claude Code

Two bugs caused employment_income (and self_employment_income) to be zero in published enhanced CPS datasets:

1. _drop_formula_variables() drops variables with `adds`/`subtracts`, including employment_income. But CPS raw data stores income under employment_income directly; employment_income_before_lsr was never in the H5. The formula engine then can't recompute the aggregate. Fix: rename CPS aggregate variables to their input-variable equivalents before the drop loop.
2. create_sparse_ecps() had the empty-dict cleanup indented inside the inner time_period loop instead of the outer variable loop, causing empty variable groups to be written to the H5. Fix: dedent to match create_small_ecps().

Closes #573, relates to #571, #444.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
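
The second bug can be reproduced in miniature. The sketch below assumes a loop shape inferred from the description (an outer loop over variables, an inner loop over time periods); `collect_buggy`, `collect_fixed`, and `known_periods` are hypothetical names, not the actual body of create_sparse_ecps().

```python
def collect_buggy(known_periods):
    data = {}
    for variable, periods in known_periods.items():
        data[variable] = {}
        for period, values in periods.items():
            data[variable][period] = values
            # Misplaced cleanup: it never runs for a variable with no periods,
            # and the dict is never empty right after an assignment.
            if len(data[variable]) == 0:
                del data[variable]
    return data


def collect_fixed(known_periods):
    data = {}
    for variable, periods in known_periods.items():
        data[variable] = {}
        for period, values in periods.items():
            data[variable][period] = values
        # Cleanup at the outer-loop level removes variables that stayed empty.
        if len(data[variable]) == 0:
            del data[variable]
    return data


known_periods = {
    "employment_income_before_lsr": {2024: [1.0, 2.0]},
    "unused_variable": {},
}
buggy = collect_buggy(known_periods)
fixed = collect_fixed(known_periods)
```

In the buggy version `unused_variable` survives as an empty dict, which corresponds to an empty group in the written file; the dedented version removes it.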

…idation

Replace hard-coded _RENAME_BEFORE_DROP dict with dynamic discovery from the tax-benefit system, and update sparse eCPS validation to check for employment_income_before_lsr (the input variable) instead of the computed aggregate.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Instead of matching hard-coded suffixes like _before_lsr, detect input variables structurally: an adds component with no formula, no adds, and no subtracts is a pure input variable.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
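
The structural rule described in this commit can be sketched on a toy variable registry. The `Variable` dataclass and the registry contents are assumptions made for illustration (in particular `behavioral_adjustment` is a made-up component); they are not the policyengine-us class definitions.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Variable:
    """Toy stand-in for a tax-benefit system variable (hypothetical)."""
    name: str
    formula: Optional[Callable] = None
    adds: List[str] = field(default_factory=list)
    subtracts: List[str] = field(default_factory=list)


def is_pure_input(variable):
    """A variable with no formula, no adds, and no subtracts is a pure input."""
    return variable.formula is None and not variable.adds and not variable.subtracts


def pure_input_components(variables, aggregate_name):
    """List the adds components of an aggregate that are pure inputs."""
    aggregate = variables[aggregate_name]
    return [
        component
        for component in aggregate.adds
        if component in variables and is_pure_input(variables[component])
    ]


variables = {
    "employment_income": Variable(
        "employment_income",
        adds=["employment_income_before_lsr", "behavioral_adjustment"],
    ),
    "employment_income_before_lsr": Variable("employment_income_before_lsr"),
    "behavioral_adjustment": Variable("behavioral_adjustment", formula=lambda: 0.0),
}

inputs = pure_input_components(variables, "employment_income")
```

Here only `employment_income_before_lsr` qualifies, because the other component carries a formula.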

The structural approach (any pure-input adds component) matches ~90 variables and causes false positives. The _before_lsr/_before_response suffixes are a naming convention in policyengine-us for behavioral response variables and precisely target the right ones.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
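
A suffix-based rule of the kind this commit returns to could look like the sketch below. It assumes the input variable is named as the aggregate plus one of the two suffixes; `build_rename_map`, `aggregate_adds`, and the `total_benefits` entry are hypothetical and only serve to show a case the structural rule would have matched.

```python
SUFFIXES = ("_before_lsr", "_before_response")


def build_rename_map(aggregate_adds):
    """Map each aggregate to an adds component named '<aggregate><suffix>', if any."""
    rename_map = {}
    for aggregate, components in aggregate_adds.items():
        for component in components:
            if any(component == aggregate + suffix for suffix in SUFFIXES):
                rename_map[aggregate] = component
                break
    return rename_map


# Toy adds lists; 'total_benefits' stands in for an aggregate whose
# components are inputs but do not follow the suffix convention.
aggregate_adds = {
    "employment_income": ["employment_income_before_lsr"],
    "self_employment_income": ["self_employment_income_before_lsr"],
    "total_benefits": ["benefit_a", "benefit_b"],
}

rename_map = build_rename_map(aggregate_adds)
```

The aggregate without a suffixed component is left out of the map, which is the narrowing effect the commit message describes.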

PavelMakarchuk

LGTM if this affects all variables that have an adds formula which should not be set to 0

baogorek requested review from MaxGhenis and PavelMakarchuk March 5, 2026 20:12

baogorek and others added 3 commits March 5, 2026 15:37

PavelMakarchuk approved these changes Mar 5, 2026


baogorek merged commit d6ebf70 into main Mar 5, 2026
6 checks passed

baogorek deleted the fix-employment-income-zero branch March 5, 2026 21:56

baogorek mentioned this pull request Mar 6, 2026

CI checkpoint cache never invalidates on code changes #583

Closed

baogorek merged 4 commits into main from fix-employment-income-zero
