Fix employment_income zeroed out in published H5 datasets #574

Merged
baogorek merged 4 commits into main from fix-employment-income-zero
Mar 5, 2026

baogorek (Collaborator) commented Mar 5, 2026

Summary

  • Rename CPS-stored aggregate income variables (employment_income, self_employment_income) to their input-variable equivalents (*_before_lsr) before _drop_formula_variables() removes them, so the formula engine can recompute the aggregates from preserved raw data
  • Fix indentation bug in create_sparse_ecps() where empty-dict cleanup was inside the inner time_period loop instead of the outer variable loop
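
The ordering in the first bullet (rename, then drop) can be sketched on a plain dictionary. This is a minimal illustration, not the repository's code: the mapping, the set of formula variables, and the function body are assumptions made for the example, and the real `_drop_formula_variables()` may differ in signature and data layout.

```python
# Hypothetical stand-ins for illustration only; the real mapping and helper
# live in the repository and may differ in name, signature, and data layout.
RENAME_BEFORE_DROP = {
    "employment_income": "employment_income_before_lsr",
    "self_employment_income": "self_employment_income_before_lsr",
}
FORMULA_VARIABLES = {"employment_income", "self_employment_income"}


def drop_formula_variables(data):
    """Return a copy of `data` with aggregates renamed to inputs, then dropped."""
    result = dict(data)
    # Step 1: preserve the raw values under the input-variable name.
    for aggregate, input_name in RENAME_BEFORE_DROP.items():
        if aggregate in result and input_name not in result:
            result[input_name] = result.pop(aggregate)
    # Step 2: drop whatever the formula engine is expected to recompute.
    for name in FORMULA_VARIABLES:
        result.pop(name, None)
    return result


raw = {"employment_income": [30_000.0, 0.0, 52_500.0], "age": [41, 67, 29]}
cleaned = drop_formula_variables(raw)
```

If the drop ran first, the raw values would be lost and nothing would remain for the aggregate to be recomputed from; running the rename first keeps them under the input name.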

Test plan

  • make format passes
  • Spot-check: _drop_formula_variables() renames employment_income to employment_income_before_lsr and drops the aggregate
  • Sanity tests pass: test_ecps_employment_income_positive, test_ecps_self_employment_income_positive, test_ecps_mean_employment_income_reasonable all green
  • Full make data rebuild to regenerate H5 and verify employment_income.sum() > 0 end-to-end (CI step)

Closes #573. Relates to #571, #444.

🤖 Generated with Claude Code

Two bugs caused employment_income (and self_employment_income) to be
zero in published enhanced CPS datasets:

1. _drop_formula_variables() drops variables with `adds`/`subtracts`,
   including employment_income. But CPS raw data stores income under
   employment_income directly — employment_income_before_lsr was never
   in the H5. The formula engine then can't recompute the aggregate.
   Fix: rename CPS aggregate variables to their input-variable
   equivalents before the drop loop.

2. create_sparse_ecps() had the empty-dict cleanup indented inside the
   inner time_period loop instead of the outer variable loop, causing
   empty variable groups to be written to the H5.
   Fix: dedent to match create_small_ecps().

Closes #573, relates to #571, #444.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
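
The second bug in the commit message above is easy to reproduce in isolation. The sketch below assumes a nested `{variable: {time_period: values}}` layout and uses made-up function names; it is not the body of `create_sparse_ecps()`, only an illustration of why cleanup placed in the inner loop misses variables that have no periods at all.

```python
def collect_buggy(periods_by_variable):
    """Cleanup sits inside the inner loop (illustrative reproduction)."""
    data = {}
    for variable, periods in periods_by_variable.items():
        data[variable] = {}
        for time_period, values in periods.items():
            data[variable][time_period] = values
            # Never reached when `periods` is empty, so the empty dict stays.
            if len(data[variable]) == 0:
                del data[variable]
    return data


def collect_fixed(periods_by_variable):
    """Cleanup runs once per variable, after the inner loop."""
    data = {}
    for variable, periods in periods_by_variable.items():
        data[variable] = {}
        for time_period, values in periods.items():
            data[variable][time_period] = values
        if len(data[variable]) == 0:
            del data[variable]
    return data


# Toy input: one variable with data, one with no periods at all.
source = {
    "employment_income_before_lsr": {2024: [1.0, 2.0]},
    "ghost_variable": {},
}
buggy = collect_buggy(source)
fixed = collect_fixed(source)
```

In the buggy version `ghost_variable` survives as an empty group, which matches the symptom described above; the dedented version removes it.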
baogorek and others added 3 commits March 5, 2026 15:37
…idation

Replace hard-coded _RENAME_BEFORE_DROP dict with dynamic discovery from
the tax-benefit system, and update sparse eCPS validation to check for
employment_income_before_lsr (the input variable) instead of the computed
aggregate.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of matching hard-coded suffixes like _before_lsr, detect input
variables structurally: an adds component with no formula, no adds, and
no subtracts is a pure input variable.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
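
The structural rule in this commit can be sketched against a toy variable registry. Everything here is an assumption for illustration: `ToyVariable`, the registry contents, and the helper names are not the tax-benefit system's API, which exposes variable metadata differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyVariable:
    """Toy stand-in for a variable definition (not the real class)."""
    has_formula: bool = False
    adds: tuple = ()
    subtracts: tuple = ()


def is_pure_input(variable):
    # Pure input: no formula, no adds, no subtracts.
    return not (variable.has_formula or variable.adds or variable.subtracts)


def pure_input_components(aggregate, registry):
    """List the `adds` components of `aggregate` that are pure inputs."""
    return [
        name for name in registry[aggregate].adds
        if is_pure_input(registry[name])
    ]


# Illustrative registry; the component names are assumptions.
REGISTRY = {
    "employment_income": ToyVariable(
        adds=(
            "employment_income_before_lsr",
            "employment_income_behavioral_response",
        )
    ),
    "employment_income_before_lsr": ToyVariable(),
    "employment_income_behavioral_response": ToyVariable(has_formula=True),
}

inputs = pure_input_components("employment_income", REGISTRY)
```

The rule needs no naming convention, but it accepts every formula-free component of every aggregate, so its reach depends entirely on how the registry is populated.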
The structural approach (any pure-input adds component) matches ~90
variables and causes false positives. The _before_lsr/_before_response
suffixes are a naming convention in policyengine-us for behavioral
response variables and precisely target the right ones.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
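
A suffix-based selection like the one this commit returns to might look as follows. This is a sketch under assumptions: the `adds` table is toy data, `build_rename_map` is a hypothetical helper, and the rule of accepting an aggregate only when exactly one component matches is a choice made for the example rather than something the PR states.

```python
# Suffixes named in the commit message above.
INPUT_SUFFIXES = ("_before_lsr", "_before_response")


def build_rename_map(adds_by_aggregate):
    """Map each aggregate to its single suffix-matching component, if any."""
    rename_map = {}
    for aggregate, components in adds_by_aggregate.items():
        # str.endswith accepts a tuple of candidate suffixes.
        matches = [c for c in components if c.endswith(INPUT_SUFFIXES)]
        if len(matches) == 1:  # assumption: skip ambiguous aggregates
            rename_map[aggregate] = matches[0]
    return rename_map


# Toy data: the second aggregate has only formula-free components, so a
# purely structural rule would match it, but the suffix rule does not.
ADDS = {
    "employment_income": [
        "employment_income_before_lsr",
        "employment_income_behavioral_response",
    ],
    "toy_benefit_total": ["toy_benefit_a", "toy_benefit_b"],
}

rename_map = build_rename_map(ADDS)
```

Only `employment_income` ends up in the map, which is the narrower targeting the commit message describes.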
PavelMakarchuk (Collaborator) left a comment

LGTM if this affects all variables that have an adds formula which should not be set to 0

baogorek merged commit d6ebf70 into main on Mar 5, 2026 (6 checks passed)
baogorek deleted the fix-employment-income-zero branch on Mar 5, 2026 21:56

Development

Successfully merging this pull request may close these issues.

employment_income is zero in published enhanced CPS datasets
