Problem
cps.py assigns is_pregnant as a stochastic draw using CDC/Census state-level pregnancy rates, because CPS does not collect pregnancy data. However, extended_cps.py has a _drop_formula_variables step that removes any variable with a formula, adds, or subtracts defined in policyengine-us.
is_pregnant has adds: ['current_pregnancies'] in the country package, so it gets silently dropped. The result is that is_pregnant is absent from every downstream dataset (extended, stratified, source-imputed), and calibration targets for pregnancy fail as "impossible."
Current workaround
_drop_formula_variables has a _KEEP_FORMULA_VARS set for exceptions. Adding is_pregnant to it fixes this specific case, but it is a hard-coded allowlist used as an escape hatch: any future stochastic input that happens to have a formula, adds, or subtracts in the country package will still be silently dropped unless someone remembers to add it there too.
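To make the failure mode concrete, here is a minimal sketch of the drop-with-exceptions behaviour described above. It is not the actual code in extended_cps.py: `VariableSpec`, `drop_formula_variables`, and the `keep` argument are hypothetical stand-ins, and the real step presumably reads variable definitions from policyengine-us rather than from a hand-built dict.

```python
from dataclasses import dataclass, field


@dataclass
class VariableSpec:
    """Hypothetical stand-in for a country-package variable definition."""
    name: str
    formula: bool = False
    adds: list = field(default_factory=list)
    subtracts: list = field(default_factory=list)

    @property
    def is_computed(self) -> bool:
        # A variable counts as "computed" if it has a formula, adds, or subtracts.
        return self.formula or bool(self.adds) or bool(self.subtracts)


def drop_formula_variables(data, specs, keep=frozenset()):
    """Return a copy of `data` without computed variables, except those in `keep`."""
    kept = {}
    for name, values in data.items():
        spec = specs.get(name)
        if spec is not None and spec.is_computed and name not in keep:
            continue  # dropped without any warning, as in the issue
        kept[name] = values
    return kept


# Assumed definitions: is_pregnant has adds, age is a plain input.
specs = {
    "is_pregnant": VariableSpec("is_pregnant", adds=["current_pregnancies"]),
    "age": VariableSpec("age"),
}
data = {"is_pregnant": [False, True, False], "age": [34, 29, 51]}

without_exception = drop_formula_variables(data, specs)
with_exception = drop_formula_variables(data, specs, keep={"is_pregnant"})
```

In this sketch, `without_exception` loses is_pregnant while `with_exception` keeps it, which mirrors why the exception set works for this one variable but does nothing for the next imputed input that happens to have adds.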
Underlying design tension
The data package (cps.py) deliberately sets is_pregnant as an input variable. The country package (policyengine-us) defines it with adds: ['current_pregnancies'], meaning the engine wants to compute it from components. Both are reasonable:
- Data package perspective: CPS lacks pregnancy data, so we impute it stochastically and store it as an input for calibration to fine-tune.
- Country package perspective: is_pregnant is derived from current_pregnancies via adds, so the engine should recompute it.
The question is: when the data package provides a value for a variable that the country package also computes, which should win?
Options
- Hard-code exceptions (_KEEP_FORMULA_VARS): current approach, fragile
- Don't drop variables that were explicitly set in cps.py: flip the logic so intentional data inputs are preserved
- Coordinate with policyengine-us: perhaps is_pregnant shouldn't have adds if it's meant to be a data input, or current_pregnancies should be the stored variable instead
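The second option could look roughly like the sketch below. It assumes the data package can record which variables it set on purpose; `explicit_inputs`, `computed_vars`, `drop_unless_explicit`, and the variable name some_derived_total are all hypothetical and do not exist in either package. The sketch also logs every drop, so a removed variable is at least visible.

```python
import logging

logger = logging.getLogger("extended_cps_sketch")


def drop_unless_explicit(data, computed_vars, explicit_inputs):
    """Drop computed variables unless the data package set them on purpose.

    data:            mapping of variable name -> values
    computed_vars:   names that have a formula/adds/subtracts in the country package
    explicit_inputs: names the data package deliberately wrote (hypothetical registry)
    """
    kept, dropped = {}, []
    for name, values in data.items():
        if name in computed_vars and name not in explicit_inputs:
            dropped.append(name)
            logger.warning("Dropping computed variable %s", name)
        else:
            kept[name] = values
    return kept, dropped


data = {
    "is_pregnant": [False, True, False],   # imputed stochastically by the data package
    "age": [34, 29, 51],                   # plain input
    "some_derived_total": [1.0, 2.0, 3.0], # illustrative computed variable
}
computed_vars = {"is_pregnant", "some_derived_total"}
explicit_inputs = {"is_pregnant"}

kept, dropped = drop_unless_explicit(data, computed_vars, explicit_inputs)
```

Here is_pregnant survives because the data package declared it, while some_derived_total is still dropped and reported. The trade-off is that the data package has to maintain the registry of intentional inputs, but that registry lives next to the code that creates them rather than in a separate exception set.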