19 changes: 14 additions & 5 deletions docs/data.md
@@ -68,16 +68,25 @@ missing from the CPS:
 ### Survey of Income and Program Participation (SIPP)

 The SIPP provides income and program participation data. We use SIPP primarily to impute tip income
-through a Quantile Regression Forest model trained on SIPP data, using employment income, age, and
-household composition as predictors.
+and policy-relevant asset inputs through Quantile Regression Forest models trained on SIPP data.
+The asset imputations currently cover bank accounts, stocks, bonds, household vehicle counts, and
+household vehicle values. Bank accounts, stocks, and bonds are then combined with comparable SCF
+predictions through a stable household-level 50/50 source-model draw. These fields are not a
+complete household balance sheet; they are exposed so policy models can select the resources that
+matter for a specific program.

 ### Survey of Consumer Finances (SCF)

 The SCF provides wealth and debt information that we use to impute several financial variables
 missing from the CPS. We match auto loan balances based on household demographics and income, then
-calculate interest on auto loans from these imputed balances. Additionally, we impute various net
-worth components and other wealth measures not available in CPS. The SCF imputation uses their
-reference person definition to ensure proper matching.
+calculate interest on auto loans from these imputed balances. We also impute the SCF balance-sheet
+components needed to express `net_worth` as a formula: certificates of deposit, savings bonds,
+retirement assets, cash-value life insurance, managed assets, other financial assets, home value,
+other real estate, business equity, other nonfinancial assets, mortgages, other residential debt,
+lines of credit, credit card debt, vehicle installment debt, student debt, other installment debt,
+and other debt. We compute `net_worth` from these components and the final SIPP/SCF-blended policy
+leaves rather than rescaling resource-tested policy leaves to force an independently imputed SCF
+aggregate. The SCF imputation uses their reference person definition to ensure proper matching.

 ### American Community Survey (ACS)

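Reviewer sketch (not part of this PR): the docs change describes a "stable household-level 50/50 source-model draw" between SIPP and SCF predictions. One way such a draw could be made stable is to hash the household ID, as below. This is not the PR's `combine_sipp_and_scf_financial_assets`; the function names, the salt string, and the use of SHA-256 are all assumptions chosen only to illustrate a draw that stays fixed per household and is shared across the financial-asset block.

```python
import hashlib

import numpy as np


def stable_source_draw(household_ids, time_period, salt="financial_assets"):
    """Return a boolean array where True means "use the SCF prediction".

    Hypothetical helper: the draw depends only on the household ID, the
    period, and a fixed salt, so reruns and reordering do not change it.
    """
    use_scf = np.empty(len(household_ids), dtype=bool)
    for i, household_id in enumerate(household_ids):
        key = f"{salt}:{time_period}:{int(household_id)}".encode()
        # The lowest bit of the first digest byte gives a roughly 50/50 split.
        use_scf[i] = hashlib.sha256(key).digest()[0] % 2 == 1
    return use_scf


def blend_household_values(sipp_values, scf_values, use_scf):
    """Pick the SCF value where use_scf is True, else the SIPP value."""
    return np.where(use_scf, scf_values, sipp_values)


household_ids = np.arange(1, 10_001)
use_scf = stable_source_draw(household_ids, time_period=2024)

# One draw is shared across the whole financial-asset block (toy values).
sipp = {"bank": np.full(10_000, 1.0), "stock": np.full(10_000, 2.0)}
scf = {"bank": np.full(10_000, 10.0), "stock": np.full(10_000, 20.0)}
blended = {
    name: blend_household_values(sipp[name], scf[name], use_scf)
    for name in sipp
}
```

Because the same `use_scf` array is applied to every asset in the block, a household takes all of its bank and stock values from one source, which matches the "single draw shared across the financial-asset block" wording in `docs/methodology.md`.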
25 changes: 19 additions & 6 deletions docs/methodology.md
@@ -237,12 +237,25 @@ as a predictor, which allows the imputed values to reflect geographic variation
 rates and rent levels.

 **SIPP (Survey of Income and Program Participation)**: Tip income, bank account assets, stock
-assets, bond assets. The SIPP lacks state identifiers, so these imputations are state-blind at the
-microdata level — geographic variation in tip income and assets enters only through calibration
-weights, not through the imputed values themselves.
-
-**SCF (Survey of Consumer Finances)**: Net worth, auto loan balances, auto loan interest. The SCF
-also lacks state identifiers, so these imputations are likewise state-blind.
+assets, bond assets, household vehicle counts, and household vehicle values. The SIPP lacks state
+identifiers, so these imputations are state-blind at the microdata level - geographic variation in
+tip income and assets enters only through calibration weights, not through the imputed values
+themselves.
+
+**SCF (Survey of Consumer Finances)**: Aggregate net worth, auto loan balances, auto loan interest,
+and balance-sheet components needed to express net worth as a formula. The SCF also lacks state
+identifiers, so these imputations are likewise state-blind.
+
+The asset fields are a mixed-source balance sheet. The SIPP liquid-asset and vehicle fields are
+policy-relevant inputs in their own right. For overlapping bank-account, stock, and bond asset
+variables, we use a stable household-level 50/50 source-model draw between the SIPP QRF prediction
+and the comparable SCF QRF prediction, with a single draw shared across the financial-asset block.
+We then impute the non-overlapping SCF balance-sheet components - home value, mortgage debt,
+retirement assets, business equity, other real estate, other financial assets, other debts, and
+related categories including vehicle, student, and other installment debt - and compute `net_worth`
+from those components and the final SIPP/SCF-blended policy leaves. This gives downstream code a
+direct component formula without an accounting residual or rescaling of resource-tested policy
+leaves.

 The output of this stage is the source-imputed stratified CPS
 (`source_imputed_stratified_extended_cps_2024.h5`), which serves as the input to the
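Reviewer sketch (not part of this PR): the methodology text says `net_worth` is computed directly from components, with no accounting residual. A minimal illustration of that idea is assets minus debts, household by household. The component lists below are a shortened, hypothetical subset (`home_value`, `mortgage_debt`, and `credit_card_debt` are assumed names), and `net_worth_from_components` is not the PR's `compute_net_worth_from_components`.

```python
import numpy as np

# Hypothetical, shortened component lists; the PR's real lists live in
# policyengine_us_data.utils.asset_imputation and are not shown in this diff.
ASSET_COMPONENTS = (
    "bank_account_assets",
    "stock_assets",
    "bond_assets",
    "household_vehicles_value",
    "home_value",
)
DEBT_COMPONENTS = ("mortgage_debt", "credit_card_debt")


def net_worth_from_components(components):
    """Sum asset arrays and subtract debt arrays for each household."""
    unknown = set(components) - set(ASSET_COMPONENTS) - set(DEBT_COMPONENTS)
    if unknown:
        raise ValueError(f"Unclassified components: {sorted(unknown)}")
    lengths = {len(values) for values in components.values()}
    if len(lengths) != 1:
        raise ValueError("Components must be non-empty and equally long")
    total = np.zeros(lengths.pop(), dtype=np.float64)
    for name, values in components.items():
        sign = 1.0 if name in ASSET_COMPONENTS else -1.0
        total += sign * np.asarray(values, dtype=np.float64)
    return total


# Two toy households: one homeowner, one with only credit card debt.
net_worth = net_worth_from_components(
    {
        "bank_account_assets": [100.0, 0.0],
        "home_value": [200_000.0, 0.0],
        "mortgage_debt": [150_000.0, 0.0],
        "credit_card_debt": [500.0, 1_000.0],
    }
)
```

Rejecting unclassified names is a design choice made for this sketch: it keeps a new component from silently dropping out of the formula.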
97 changes: 86 additions & 11 deletions policyengine_us_data/calibration/source_impute.py
@@ -12,7 +12,9 @@
            household_vehicles_value (no state predictor)
     ORG -> hourly_wage, is_paid_hourly,
            is_union_member_or_covered
-    SCF -> net_worth, auto_loan_balance, auto_loan_interest
+    SCF -> net_worth, auto_loan_balance, auto_loan_interest,
+           SCF-only balance-sheet components, and
+           50/50 source-model averaging for overlapping financial assets
            (no state predictor)

 Usage in unified calibration pipeline:
@@ -45,7 +47,16 @@
     predict_org_features,
 )
 from policyengine_us_data.utils.asset_imputation import (
+    SCF_FINANCIAL_ASSET_POLICY_VARIABLES,
+    SCF_NET_WORTH_COMPONENT_VARIABLES,
+    add_scf_financial_asset_targets,
+    add_scf_net_worth_component_targets,
+    aggregate_person_values_to_reference_households,
+    align_household_values_to_reference_households,
     build_household_vehicle_receiver,
+    combine_sipp_and_scf_financial_assets,
+    compute_net_worth_from_components,
+    require_scf_net_worth_formula_targets,
 )

 logger = logging.getLogger(__name__)
@@ -64,12 +75,17 @@
     "household_vehicles_value",
 ]

-SCF_IMPUTED_VARIABLES = [
-    "net_worth",
+SCF_CORE_IMPUTED_VARIABLES = [
     "auto_loan_balance",
     "auto_loan_interest",
 ]

+SCF_IMPUTED_VARIABLES = [
+    "net_worth",
+    *SCF_CORE_IMPUTED_VARIABLES,
+    *SCF_NET_WORTH_COMPONENT_VARIABLES,
+]
+
 ALL_SOURCE_VARIABLES = (
     ACS_IMPUTED_VARIABLES
     + SIPP_IMPUTED_VARIABLES
@@ -763,17 +779,28 @@ def _impute_scf(
         logger.warning("SCF missing predictors: %s", missing_preds)
         scf_predictors = available_preds

-    if "networth" in scf_df.columns and "net_worth" not in scf_df.columns:
-        scf_df["net_worth"] = scf_df["networth"]
+    scf_financial_asset_targets = add_scf_financial_asset_targets(scf_df)
+    scf_component_targets = add_scf_net_worth_component_targets(scf_df)
+    require_scf_net_worth_formula_targets(
+        scf_financial_asset_targets=scf_financial_asset_targets,
+        scf_component_targets=scf_component_targets,
+    )

-    available_vars = [v for v in SCF_IMPUTED_VARIABLES if v in scf_df.columns]
-    if not available_vars:
+    available_vars = [
+        v for v in SCF_CORE_IMPUTED_VARIABLES if v in scf_df.columns
+    ]
+    qrf_vars = available_vars + [
+        v for v in scf_financial_asset_targets if v in scf_df.columns
+    ] + [
+        v for v in scf_component_targets if v in scf_df.columns
+    ]
+    if not qrf_vars:
         logger.warning("No SCF imputed variables available. Skipping.")
         return data

     weights = scf_df.get("wgt")

-    donor = scf_df[scf_predictors + available_vars].copy()
+    donor = scf_df[scf_predictors + qrf_vars].copy()
     if weights is not None:
         donor["wgt"] = weights
     donor = donor.dropna(subset=scf_predictors)
@@ -834,12 +861,12 @@ def _impute_scf(
         "SCF QRF: %d train, %d test, vars=%s",
         len(donor),
         len(cps_df),
-        available_vars,
+        qrf_vars,
     )
     fitted = qrf.fit(
         X_train=donor,
         predictors=scf_predictors,
-        imputed_variables=available_vars,
+        imputed_variables=qrf_vars,
         weight_col="wgt" if weights is not None else None,
         tune_hyperparameters=False,
     )
@@ -870,10 +897,58 @@ def _impute_scf(
         else:
             data[var] = {time_period: person_vals}

+    person_hh_ids = data.get("person_household_id", {}).get(time_period)
+    if person_hh_ids is not None:
+        first_person_mask = ~pd.Series(person_hh_ids).duplicated().values
+        reference_household_ids = person_hh_ids[first_person_mask]
+        for var in SCF_NET_WORTH_COMPONENT_VARIABLES:
+            if var in preds:
+                data[var] = {
+                    time_period: preds.loc[first_person_mask, var].values.astype(
+                        np.float32
+                    )
+                }
+        for scf_var, policy_var in SCF_FINANCIAL_ASSET_POLICY_VARIABLES.items():
+            if scf_var not in preds or policy_var not in data:
+                continue
+            data[policy_var] = {
+                time_period: combine_sipp_and_scf_financial_assets(
+                    sipp_values=data[policy_var][time_period],
+                    scf_household_values=preds.loc[first_person_mask, scf_var].values,
+                    person_household_ids=person_hh_ids,
+                    reference_person_mask=first_person_mask,
+                    time_period=time_period,
+                )
+            }
+        net_worth_components = {}
+        for var in ("bank_account_assets", "stock_assets", "bond_assets"):
+            if var in data:
+                net_worth_components[var] = aggregate_person_values_to_reference_households(
+                    data[var][time_period],
+                    person_hh_ids,
+                    first_person_mask,
+                )
+        if "household_vehicles_value" in data:
+            net_worth_components["household_vehicles_value"] = (
+                align_household_values_to_reference_households(
+                    data["household_vehicles_value"][time_period],
+                    hh_ids,
+                    reference_household_ids,
+                )
+            )
+        for var in SCF_NET_WORTH_COMPONENT_VARIABLES:
+            if var in data:
+                net_worth_components[var] = data[var][time_period]
+        data["net_worth"] = {
+            time_period: compute_net_worth_from_components(
+                components=net_worth_components,
+            )
+        }
+
     del fitted, preds
     gc.collect()

-    logger.info("SCF imputation complete: %s", available_vars)
+    logger.info("SCF imputation complete: %s", SCF_IMPUTED_VARIABLES)
     return data


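Reviewer sketch (not part of this PR): the last `_impute_scf` hunk treats the first person listed in each household as the reference row (`~pd.Series(person_hh_ids).duplicated()`) and then rolls person-level asset values up to those households. The snippet below shows that pattern on toy data. `sum_person_values_by_household` is a hypothetical stand-in; the real `aggregate_person_values_to_reference_households` is not shown in this diff and may behave differently.

```python
import numpy as np
import pandas as pd


def first_person_mask(person_household_ids):
    """True for the first person row of each household, False otherwise."""
    return ~pd.Series(person_household_ids).duplicated().to_numpy()


def sum_person_values_by_household(person_values, person_household_ids, mask):
    """Sum person-level values per household, ordered like the masked rows."""
    ids = np.asarray(person_household_ids)
    totals = pd.Series(np.asarray(person_values, dtype=float)).groupby(ids).sum()
    # Reindex so the output lines up with the reference-person rows.
    return totals.reindex(ids[mask]).to_numpy()


person_household_ids = np.array([3, 3, 1, 2, 1])
bank_account_assets = np.array([10.0, 5.0, 7.0, 1.0, 2.0])

mask = first_person_mask(person_household_ids)
household_totals = sum_person_values_by_household(
    bank_account_assets, person_household_ids, mask
)
```

Here the reference rows are households 3, 1, and 2 in that order, so the totals come back in the same order rather than sorted by ID. That ordering matters when the result is added to other arrays already aligned to reference persons.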
70 changes: 65 additions & 5 deletions policyengine_us_data/datasets/cps/cps.py
@@ -48,7 +48,16 @@
     reported_subsidized_marketplace_by_tax_unit,
 )
 from policyengine_us_data.utils.asset_imputation import (
+    SCF_FINANCIAL_ASSET_POLICY_VARIABLES,
+    SCF_NET_WORTH_COMPONENT_VARIABLES,
+    add_scf_financial_asset_targets,
+    add_scf_net_worth_component_targets,
+    aggregate_person_values_to_reference_households,
+    align_household_values_to_reference_households,
     build_household_vehicle_receiver,
+    combine_sipp_and_scf_financial_assets,
+    compute_net_worth_from_components,
+    require_scf_net_worth_formula_targets,
 )
 from policyengine_us_data.utils.policyengine import (
     supports_medicare_enrollment_input,
@@ -2156,7 +2165,9 @@ def add_tips(self, cps: h5py.File):
         mean_quantile=0.5,
     ).tip_income.values

-    # Impute liquid assets from SIPP (bank accounts, stocks, bonds)
+    # Impute SIPP liquid assets used directly by resource-tested policy rules.
+    # The SCF step below applies a stable 50/50 source-model draw for the
+    # overlapping bank, stock, and bond leaves.

     from policyengine_us_data.datasets.sipp import get_asset_model

@@ -2473,6 +2484,7 @@ def determine_reference_person(group):

     mask = create_scf_reference_person_mask(cps_data, person_data)
     mask_len = mask.shape[0]
+    original_person_household_ids = np.asarray(cps_data["person_household_id"])

     cps_data = {
         var: data[mask] if data.shape[0] == mask_len else data
@@ -2543,7 +2555,9 @@ def determine_reference_person(group):
     reference_persons = person_data[mask]
     receiver_data["is_married"] = reference_persons.A_MARITL.isin([1, 2]).values

-    # Impute auto loan balance from the SCF
+    # Impute selected auto-loan fields, SCF equivalents for overlapping
+    # financial asset leaves, and SCF-only balance-sheet leaves. We compute
+    # net_worth from those components rather than storing an SCF aggregate.
     from policyengine_us_data.datasets.scf.scf import SCF_2022

     scf_dataset = SCF_2022()
@@ -2560,7 +2574,16 @@ def determine_reference_person(group):
         "interest_dividend_income",
         "social_security_pension_income",
     ]
-    IMPUTED_VARIABLES = ["networth", "auto_loan_balance", "auto_loan_interest"]
+    scf_financial_asset_targets = add_scf_financial_asset_targets(scf_data)
+    scf_component_targets = add_scf_net_worth_component_targets(scf_data)
+    require_scf_net_worth_formula_targets(
+        scf_financial_asset_targets=scf_financial_asset_targets,
+        scf_component_targets=scf_component_targets,
+    )
+    IMPUTED_VARIABLES = [
+        "auto_loan_balance",
+        "auto_loan_interest",
+    ] + list(scf_financial_asset_targets) + list(scf_component_targets)
     weights = ["wgt"]

     donor_data = scf_data[PREDICTORS + IMPUTED_VARIABLES + weights].copy()
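Reviewer sketch (not part of this PR): the hunk above calls `require_scf_net_worth_formula_targets` before building `IMPUTED_VARIABLES`, but the diff does not show what that check does. One plausible reading is a fail-fast guard that raises when an expected SCF target column could not be constructed. The helper below is hypothetical in name, signature, and behavior, and the column names in the example are made up.

```python
def require_targets(available, required, label):
    """Raise if any required target name is absent from the available ones.

    Hypothetical guard: failing early gives a clearer error than a KeyError
    raised later while selecting donor columns.
    """
    available_set = set(available)
    missing = [name for name in required if name not in available_set]
    if missing:
        raise ValueError(f"{label}: missing SCF target columns {missing}")


# Passes silently when every required name is present (toy names).
require_targets(
    available=["scf_bank_account_assets", "scf_stock_assets"],
    required=["scf_bank_account_assets"],
    label="financial assets",
)

# Captures the error message when a required name is absent.
try:
    require_targets(available=[], required=["home_value"], label="components")
    error_message = None
except ValueError as error:
    error_message = str(error)
```

If the real check instead tolerates missing columns, the downstream `net_worth` formula would quietly omit those terms, which may be worth confirming in review.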
@@ -2589,8 +2612,45 @@ def determine_reference_person(group):
     for var in IMPUTED_VARIABLES:
         cps[var] = imputations[var]

-    cps["net_worth"] = cps["networth"]
-    del cps["networth"]
+    for scf_var, policy_var in SCF_FINANCIAL_ASSET_POLICY_VARIABLES.items():
+        if scf_var not in imputations:
+            continue
+        if policy_var in cps:
+            cps[policy_var] = combine_sipp_and_scf_financial_assets(
+                sipp_values=cps[policy_var],
+                scf_household_values=imputations[scf_var].values,
+                person_household_ids=original_person_household_ids,
+                reference_person_mask=mask,
+                time_period=self.time_period,
+            )
+        if scf_var in cps:
+            del cps[scf_var]
+
+    reference_household_ids = original_person_household_ids[mask]
+    net_worth_components = {}
+    for variable in ("bank_account_assets", "stock_assets", "bond_assets"):
+        if variable in cps:
+            net_worth_components[variable] = (
+                aggregate_person_values_to_reference_households(
+                    cps[variable],
+                    original_person_household_ids,
+                    mask,
+                )
+            )
+    if "household_vehicles_value" in cps_data and "household_id" in cps_data:
+        net_worth_components["household_vehicles_value"] = (
+            align_household_values_to_reference_households(
+                cps_data["household_vehicles_value"],
+                cps_data["household_id"],
+                reference_household_ids,
+            )
+        )
+    for variable in SCF_NET_WORTH_COMPONENT_VARIABLES:
+        if variable in cps:
+            net_worth_components[variable] = cps[variable]
+    cps["net_worth"] = compute_net_worth_from_components(
+        components=net_worth_components
+    )

     self.save_dataset(cps)
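Reviewer sketch (not part of this PR): `household_vehicles_value` is stored per household, while the other net worth components follow reference-person order, so the hunk above realigns it with `align_household_values_to_reference_households`. That helper's body is not in this diff. The hypothetical function below shows one way to do such an alignment with a pandas lookup. It assumes unique household IDs and no NaN inputs.

```python
import numpy as np
import pandas as pd


def align_to_reference_households(values, household_ids, reference_household_ids):
    """Reorder household-level values to match reference_household_ids."""
    lookup = pd.Series(
        np.asarray(values, dtype=float), index=np.asarray(household_ids)
    )
    if lookup.index.has_duplicates:
        raise ValueError("household_ids must be unique")
    aligned = lookup.reindex(np.asarray(reference_household_ids))
    # reindex fills unknown IDs with NaN; treat that as an error here.
    missing = aligned.index[aligned.isna()]
    if len(missing) > 0:
        raise KeyError(f"No household value for IDs: {list(missing)}")
    return aligned.to_numpy()


vehicle_values = align_to_reference_households(
    values=[1000.0, 0.0, 2500.0],
    household_ids=[10, 20, 30],
    reference_household_ids=[30, 10],
)
```

Raising on unknown IDs is a choice made for this sketch. Filling with zero would also run, but it would understate `net_worth` for any household that failed to match.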

Expand Down
Loading
Loading