Fix employment_income zeroed out in published H5 datasets#574
Merged
Conversation
Two bugs caused employment_income (and self_employment_income) to be zero in published enhanced CPS datasets: 1. _drop_formula_variables() drops variables with `adds`/`subtracts`, including employment_income. But CPS raw data stores income under employment_income directly — employment_income_before_lsr was never in the H5. The formula engine then can't recompute the aggregate. Fix: rename CPS aggregate variables to their input-variable equivalents before the drop loop. 2. create_sparse_ecps() had the empty-dict cleanup indented inside the inner time_period loop instead of the outer variable loop, causing empty variable groups to be written to the H5. Fix: dedent to match create_small_ecps(). Closes #573, relates to #571, #444. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…idation Replace hard-coded _RENAME_BEFORE_DROP dict with dynamic discovery from the tax-benefit system, and update sparse eCPS validation to check for employment_income_before_lsr (the input variable) instead of the computed aggregate. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of matching hard-coded suffixes like _before_lsr, detect input variables structurally: an adds component with no formula, no adds, and no subtracts is a pure input variable. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The structural approach (any pure-input adds component) matches ~90 variables and causes false positives. The _before_lsr/_before_response suffixes are a naming convention in policyengine-us for behavioral response variables and precisely target the right ones. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PavelMakarchuk
approved these changes
Mar 5, 2026
Collaborator
PavelMakarchuk
left a comment
There was a problem hiding this comment.
LGTM if this affects all variables that have an adds formula which should not be set to 0
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
employment_income,self_employment_income) to their input-variable equivalents (*_before_lsr) before_drop_formula_variables()removes them, so the formula engine can recompute the aggregates from preserved raw datacreate_sparse_ecps()where empty-dict cleanup was inside the innertime_periodloop instead of the outervariableloopTest plan
make formatpasses_drop_formula_variables()renamesemployment_income→employment_income_before_lsrand drops the aggregatetest_ecps_employment_income_positive,test_ecps_self_employment_income_positive,test_ecps_mean_employment_income_reasonableall greenmake datarebuild to regenerate H5 and verifyemployment_income.sum() > 0end-to-end (CI step)Closes #573. Relates to #571, #444.
🤖 Generated with Claude Code