Description
Problem
The published enhanced_cps_2024.h5 on HuggingFace produces employment_income.sum() == 0 because of two independent bugs.
Bug A: _drop_formula_variables drops CPS income without preserving raw data
PR #554 (commit 7025f6a) introduced _drop_formula_variables() in ExtendedCPS, which drops all variables with adds/subtracts. This correctly drops employment_income (which has adds: [employment_income_before_lsr, employment_income_behavioral_response]), but the CPS raw data stores income under employment_income directly — employment_income_before_lsr was never written to the H5.
When PolicyEngine tries to recompute employment_income via its adds formula, it needs employment_income_before_lsr, finds nothing, and defaults to 0. Same for self_employment_income.
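The failure mode can be reproduced with a toy model. The sketch below is not PolicyEngine's actual implementation: compute_adds, the ADDS table, and the sample values are hypothetical stand-ins, and it assumes only that a missing component of an adds variable is treated as zeros.

```python
# Toy model of an "adds" variable; hypothetical, not PolicyEngine's real code.
ADDS = {
    "employment_income": [
        "employment_income_before_lsr",
        "employment_income_behavioral_response",
    ],
}


def compute_adds(variable, stored, n_records):
    """Sum the components of an adds variable; a missing component counts as zeros."""
    total = [0.0] * n_records
    for component in ADDS[variable]:
        values = stored.get(component, [0.0] * n_records)
        total = [t + v for t, v in zip(total, values)]
    return total


# The raw CPS data holds income under the aggregate name only (made-up values).
raw_cps = {"employment_income": [52_000.0, 0.0, 31_500.0]}

# Dropping every adds/subtracts variable removes the only copy of the raw values.
after_drop = {name: values for name, values in raw_cps.items() if name not in ADDS}

# Neither component was ever stored, so the recomputed aggregate is all zeros.
recomputed = compute_adds("employment_income", after_drop, 3)
print(sum(recomputed))  # 0.0
```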
Bug B: create_sparse_ecps indentation error deletes populated variables
In create_sparse_ecps() (small_enhanced_cps.py), lines 128-129 have the del data[variable] cleanup check inside the inner for time_period loop instead of after it. For formula variables with no known periods, the inner loop never executes, so empty data[variable] = {} dicts survive and get written as empty H5 groups. The sibling function create_small_ecps() at lines 51-52 has the correct indentation.
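A minimal sketch of the indentation difference follows. The function names, the known_periods mapping, and the sample values are hypothetical and only mirror the loop shape described above; they are not copied from small_enhanced_cps.py.

```python
def collect_buggy(known_periods):
    """Cleanup check sits inside the inner loop (the create_sparse_ecps shape)."""
    data = {}
    for variable, periods in known_periods.items():
        data[variable] = {}
        for time_period, values in periods.items():
            data[variable][time_period] = values
            # Only reached when at least one period exists, so the dict
            # is never empty at this point and the check never fires.
            if len(data[variable]) == 0:
                del data[variable]
    return data


def collect_fixed(known_periods):
    """Cleanup check runs after the inner loop (the create_small_ecps shape)."""
    data = {}
    for variable, periods in known_periods.items():
        data[variable] = {}
        for time_period, values in periods.items():
            data[variable][time_period] = values
        # Runs once per variable, including those with no known periods.
        if len(data[variable]) == 0:
            del data[variable]
    return data


# "employment_income" stands in for a formula variable with no known periods.
known = {"age": {2024: [34, 51, 29]}, "employment_income": {}}
buggy = collect_buggy(known)
fixed = collect_fixed(known)
print(sorted(buggy))  # ['age', 'employment_income']
print(sorted(fixed))  # ['age']
```

In the buggy shape the empty employment_income entry survives and would be written out as an empty group; in the fixed shape it is removed before writing.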
Relationship to existing issues
- Issue #571 ("_drop_formula_variables drops employment_income, zeroing it in CI builds") documented the employment_income == 0 problem but was erroneously closed by PR #572 ("Fix double-weight application in sanity tests"), which only fixed a test double-weighting bug, not the root cause.
- Issue #444 ("Bug: H5 Dataset Creation Caches 2,719 Calculated Variables, Breaking Uprating and Policy Flexibility") provides broader context on formula variable handling.
Fix
- Bug A: Before dropping formula variables, rename CPS-stored aggregate variables to their input-variable equivalents (e.g. employment_income → employment_income_before_lsr). This preserves the raw data under the correct name so the adds formula can recompute the aggregate.
- Bug B: Dedent the empty-dict cleanup in create_sparse_ecps to match create_small_ecps.
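The Bug A fix can be sketched as a rename step that runs before the drop. The helper name rename_then_drop, the RENAME_BEFORE_DROP table, and the sample data are assumptions for illustration. Only the employment_income mapping comes from this issue; other aggregates such as self_employment_income would need their own entries.

```python
# Hypothetical mapping from CPS-stored aggregate names to the input variables
# that their adds formulas read. Only this entry is taken from the issue text.
RENAME_BEFORE_DROP = {
    "employment_income": "employment_income_before_lsr",
}


def rename_then_drop(data, formula_variables):
    """Move raw values to their input-variable names, then drop formula variables."""
    renamed = dict(data)
    for old_name, new_name in RENAME_BEFORE_DROP.items():
        # Do not overwrite an input variable that is already present.
        if old_name in renamed and new_name not in renamed:
            renamed[new_name] = renamed.pop(old_name)
    return {
        name: values
        for name, values in renamed.items()
        if name not in formula_variables
    }


data = {"employment_income": [52_000.0, 0.0, 31_500.0], "age": [34, 51, 29]}
cleaned = rename_then_drop(data, formula_variables={"employment_income"})
print(sorted(cleaned))  # ['age', 'employment_income_before_lsr']
```

With the raw values stored under employment_income_before_lsr, the adds formula has a non-zero component to sum when employment_income is recomputed.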