4 changes: 4 additions & 0 deletions AGENTS.md
@@ -0,0 +1,4 @@
# Repository Agent Instructions

- Open PRs from branches in `PolicyEngine/policyengine-us-data`, not forks. This repository's PR CI rejects fork PRs in `check-fork`, so push agent-created branches to the canonical `PolicyEngine/policyengine-us-data` remote before creating a PR.
- Do not put `[codex]` in PR titles.
1 change: 1 addition & 0 deletions changelog.d/835.changed
@@ -0,0 +1 @@
Compute CPS net worth from SCF-anchored balance-sheet components, including SCF/SIPP-blended vehicle values, instead of an SCF aggregate residual.
20 changes: 15 additions & 5 deletions docs/data.md
@@ -68,16 +68,26 @@ missing from the CPS:
### Survey of Income and Program Participation (SIPP)

The SIPP provides income and program participation data. We use SIPP primarily to impute tip income
through a Quantile Regression Forest model trained on SIPP data, using employment income, age, and
household composition as predictors.
and policy-relevant asset inputs through Quantile Regression Forest models trained on SIPP data.
The asset imputations currently cover bank accounts, stocks, bonds, household vehicle counts, and
household vehicle values. Bank accounts, stocks, bonds, and household vehicle values are then
combined with comparable SCF predictions through a stable household-level 50/50 source-model draw.
These fields are not a complete household balance sheet; they are exposed so policy models can
select the resources that matter for a specific program.
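As a rough illustration of what a stable household-level 50/50 source-model draw could look like, the sketch below hashes each household ID to pick one source for the whole asset block. The helper names `stable_source_draw` and `blend_assets`, the `salt` parameter, and the SHA-256 hashing scheme are all assumptions made for this example; the repository's actual logic lives in `policyengine_us_data.utils.asset_imputation` and may differ.

```python
import hashlib


def stable_source_draw(household_id, salt="asset_block"):
    """Return True (use SCF) or False (use SIPP) for a household.

    Hypothetical helper: the draw depends only on the household ID and
    the salt, so repeated runs give the same answer.
    """
    digest = hashlib.sha256(f"{salt}:{household_id}".encode("utf-8")).digest()
    # The first byte is roughly uniform on 0..255, so this is about a 50/50 split.
    return digest[0] < 128


def blend_assets(sipp, scf, household_ids, salt="asset_block"):
    """Pick SIPP or SCF predictions per household, one draw for all variables.

    `sipp` and `scf` map variable name -> list of household-level values.
    """
    use_scf = [stable_source_draw(h, salt) for h in household_ids]
    return {
        var: [
            scf[var][i] if use_scf[i] else sipp[var][i]
            for i in range(len(household_ids))
        ]
        for var in sipp
    }
```

Sharing one draw across the asset block means a household's bank, stock, bond, and vehicle values all come from the same source model, which keeps their within-household relationships from being mixed across surveys.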

### Survey of Consumer Finances (SCF)

The SCF provides wealth and debt information that we use to impute several financial variables
missing from the CPS. We match auto loan balances based on household demographics and income, then
calculate interest on auto loans from these imputed balances. Additionally, we impute various net
worth components and other wealth measures not available in CPS. The SCF imputation uses their
reference person definition to ensure proper matching.
calculate interest on auto loans from these imputed balances. We also impute the SCF balance-sheet
components needed to express `net_worth` as a formula: certificates of deposit, savings bonds,
retirement assets, cash-value life insurance, managed assets, other financial assets, home value,
other real estate, business equity, other nonfinancial assets, mortgages, other residential debt,
lines of credit, credit card debt, vehicle installment debt, student debt, other installment debt,
and other debt. We also impute a direct SCF net-worth anchor, then proportionally rebalance the
SCF-only leaves so `net_worth` remains a component formula while preserving the final
SIPP/SCF-blended policy leaves. The SCF imputation uses the SCF's reference person definition to
ensure proper matching.
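To make "`net_worth` as a formula" concrete, here is a minimal sketch that sums asset leaves and subtracts debt leaves for one household. Apart from `bank_account_assets`, `stock_assets`, `bond_assets`, and `household_vehicles_value`, the leaf names below are hypothetical snake_case renderings of the categories listed above, and `net_worth_from_components` is an illustrative helper, not the repository's `compute_net_worth_from_components`.

```python
# Asset leaves: policy-relevant SIPP/SCF-blended fields plus SCF-only assets.
# Names other than the first four are assumptions for this example.
ASSET_LEAVES = [
    "bank_account_assets",
    "stock_assets",
    "bond_assets",
    "household_vehicles_value",
    "certificates_of_deposit",
    "savings_bonds",
    "retirement_assets",
    "cash_value_life_insurance",
    "managed_assets",
    "other_financial_assets",
    "home_value",
    "other_real_estate",
    "business_equity",
    "other_nonfinancial_assets",
]

# Debt leaves (all names assumed for illustration).
DEBT_LEAVES = [
    "mortgages",
    "other_residential_debt",
    "lines_of_credit",
    "credit_card_debt",
    "vehicle_installment_debt",
    "student_debt",
    "other_installment_debt",
    "other_debt",
]


def net_worth_from_components(components):
    """Net worth = total assets - total debts; missing leaves count as zero."""
    assets = sum(components.get(name, 0.0) for name in ASSET_LEAVES)
    debts = sum(components.get(name, 0.0) for name in DEBT_LEAVES)
    return assets - debts
```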

### American Community Survey (ACS)

26 changes: 20 additions & 6 deletions docs/methodology.md
@@ -237,12 +237,26 @@ as a predictor, which allows the imputed values to reflect geographic variation
rates and rent levels.

**SIPP (Survey of Income and Program Participation)**: Tip income, bank account assets, stock
assets, bond assets. The SIPP lacks state identifiers, so these imputations are state-blind at the
microdata level — geographic variation in tip income and assets enters only through calibration
weights, not through the imputed values themselves.

**SCF (Survey of Consumer Finances)**: Net worth, auto loan balances, auto loan interest. The SCF
also lacks state identifiers, so these imputations are likewise state-blind.
assets, bond assets, household vehicle counts, and household vehicle values. The SIPP lacks state
identifiers, so these imputations are state-blind at the microdata level - geographic variation in
tip income and assets enters only through calibration weights, not through the imputed values
themselves.

**SCF (Survey of Consumer Finances)**: Aggregate net worth, auto loan balances, auto loan interest,
and balance-sheet components needed to express net worth as a formula. The SCF also lacks state
identifiers, so these imputations are likewise state-blind.

The asset fields are a mixed-source balance sheet. The SIPP liquid-asset and vehicle fields are
policy-relevant inputs in their own right. For overlapping bank-account, stock, bond, and vehicle
value variables, we use a stable household-level 50/50 source-model draw between the SIPP QRF
prediction and the comparable SCF QRF prediction, with a single draw shared across the asset block.
We then impute the non-overlapping SCF balance-sheet components - home value, mortgage debt,
retirement assets, business equity, other real estate, other financial assets, other debts, and
related categories including vehicle, student, and other installment debt. Because independently
imputed leaves do not preserve the SCF balance-sheet covariance exactly, we impute a direct SCF net
worth anchor and proportionally rebalance the SCF-only leaves to that anchor. This gives downstream
code a direct component formula without an accounting residual while preserving resource-tested
policy leaves.
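One way to picture the proportional rebalance is the per-household sketch below: the policy leaves stay fixed, and every SCF-only asset and debt leaf is multiplied by a common factor so the component formula reproduces the anchor. The function name `rebalance_to_anchor`, its arguments, and the fallback of leaving leaves unscaled when the factor would be undefined or negative are assumptions for illustration; the actual `rebalance_scf_net_worth_components` may handle these cases differently.

```python
def rebalance_to_anchor(policy_assets, scf_assets, scf_debts, target_net_worth):
    """Scale SCF-only leaves so components sum to the anchor (hypothetical sketch).

    policy_assets: SIPP/SCF-blended leaves that must not change.
    scf_assets, scf_debts: SCF-only leaves that may be rescaled.
    """
    fixed = sum(policy_assets.values())
    scf_net = sum(scf_assets.values()) - sum(scf_debts.values())
    gap = target_net_worth - fixed  # what the SCF-only leaves must add up to

    # Assumed fallback: if no non-negative scale can hit the anchor,
    # keep the leaves as imputed rather than flipping their signs.
    if scf_net == 0 or gap / scf_net < 0:
        scale = 1.0
    else:
        scale = gap / scf_net

    new_assets = {name: value * scale for name, value in scf_assets.items()}
    new_debts = {name: value * scale for name, value in scf_debts.items()}
    net_worth = fixed + sum(new_assets.values()) - sum(new_debts.values())
    return new_assets, new_debts, net_worth
```

For example, with 15,000 in fixed policy leaves, SCF-only assets of 250,000, SCF-only debts of 150,000, and an anchor of 65,000, the common factor is 0.5 and the rebalanced components sum to the anchor with no residual term.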

The output of this stage is the source-imputed stratified CPS
(`source_imputed_stratified_extended_cps_2024.h5`), which serves as the input to the
131 changes: 120 additions & 11 deletions policyengine_us_data/calibration/source_impute.py
@@ -12,7 +12,9 @@
household_vehicles_value (no state predictor)
ORG -> hourly_wage, is_paid_hourly,
is_union_member_or_covered
SCF -> net_worth, auto_loan_balance, auto_loan_interest
SCF -> net_worth, auto_loan_balance, auto_loan_interest,
SCF-only balance-sheet components, and
50/50 source-model draw for overlapping financial assets
(no state predictor)

Usage in unified calibration pipeline:
Expand Down Expand Up @@ -45,7 +47,22 @@
predict_org_features,
)
from policyengine_us_data.utils.asset_imputation import (
SCF_NET_WORTH_TARGET,
SCF_FINANCIAL_ASSET_POLICY_VARIABLES,
SCF_HOUSEHOLD_ASSET_POLICY_VARIABLES,
SCF_NET_WORTH_COMPONENT_VARIABLES,
add_scf_financial_asset_targets,
add_scf_household_asset_targets,
add_scf_net_worth_target,
add_scf_net_worth_component_targets,
aggregate_person_values_to_reference_households,
align_household_values_to_reference_households,
build_household_vehicle_receiver,
combine_sipp_and_scf_financial_assets,
combine_sipp_and_scf_household_assets,
compute_net_worth_from_components,
rebalance_scf_net_worth_components,
require_scf_net_worth_formula_targets,
)

logger = logging.getLogger(__name__)
@@ -64,12 +81,17 @@
"household_vehicles_value",
]

SCF_IMPUTED_VARIABLES = [
"net_worth",
SCF_CORE_IMPUTED_VARIABLES = [
"auto_loan_balance",
"auto_loan_interest",
]

SCF_IMPUTED_VARIABLES = [
"net_worth",
*SCF_CORE_IMPUTED_VARIABLES,
*SCF_NET_WORTH_COMPONENT_VARIABLES,
]

ALL_SOURCE_VARIABLES = (
ACS_IMPUTED_VARIABLES
+ SIPP_IMPUTED_VARIABLES
@@ -763,17 +785,32 @@ def _impute_scf(
logger.warning("SCF missing predictors: %s", missing_preds)
scf_predictors = available_preds

if "networth" in scf_df.columns and "net_worth" not in scf_df.columns:
scf_df["net_worth"] = scf_df["networth"]
scf_net_worth_targets = add_scf_net_worth_target(scf_df)
scf_financial_asset_targets = add_scf_financial_asset_targets(scf_df)
scf_household_asset_targets = add_scf_household_asset_targets(scf_df)
scf_component_targets = add_scf_net_worth_component_targets(scf_df)
require_scf_net_worth_formula_targets(
scf_financial_asset_targets=scf_financial_asset_targets,
scf_household_asset_targets=scf_household_asset_targets,
scf_component_targets=scf_component_targets,
scf_net_worth_targets=scf_net_worth_targets,
)

available_vars = [v for v in SCF_IMPUTED_VARIABLES if v in scf_df.columns]
if not available_vars:
available_vars = [v for v in SCF_CORE_IMPUTED_VARIABLES if v in scf_df.columns]
qrf_vars = (
[v for v in scf_net_worth_targets if v in scf_df.columns]
+ available_vars
+ [v for v in scf_financial_asset_targets if v in scf_df.columns]
+ [v for v in scf_household_asset_targets if v in scf_df.columns]
+ [v for v in scf_component_targets if v in scf_df.columns]
)
if not qrf_vars:
logger.warning("No SCF imputed variables available. Skipping.")
return data

weights = scf_df.get("wgt")

donor = scf_df[scf_predictors + available_vars].copy()
donor = scf_df[scf_predictors + qrf_vars].copy()
if weights is not None:
donor["wgt"] = weights
donor = donor.dropna(subset=scf_predictors)
@@ -834,12 +871,12 @@ def _impute_scf(
"SCF QRF: %d train, %d test, vars=%s",
len(donor),
len(cps_df),
available_vars,
qrf_vars,
)
fitted = qrf.fit(
X_train=donor,
predictors=scf_predictors,
imputed_variables=available_vars,
imputed_variables=qrf_vars,
weight_col="wgt" if weights is not None else None,
tune_hyperparameters=False,
)
@@ -870,10 +907,82 @@ def _impute_scf(
else:
data[var] = {time_period: person_vals}

person_hh_ids = data.get("person_household_id", {}).get(time_period)
if person_hh_ids is not None:
first_person_mask = ~pd.Series(person_hh_ids).duplicated().values
reference_household_ids = person_hh_ids[first_person_mask]
for var in SCF_NET_WORTH_COMPONENT_VARIABLES:
if var in preds:
data[var] = {
time_period: preds.loc[first_person_mask, var].values.astype(
np.float32
)
}
for scf_var, policy_var in SCF_FINANCIAL_ASSET_POLICY_VARIABLES.items():
if scf_var not in preds or policy_var not in data:
continue
data[policy_var] = {
time_period: combine_sipp_and_scf_financial_assets(
sipp_values=data[policy_var][time_period],
scf_household_values=preds.loc[first_person_mask, scf_var].values,
person_household_ids=person_hh_ids,
reference_person_mask=first_person_mask,
time_period=time_period,
)
}
for scf_var, policy_var in SCF_HOUSEHOLD_ASSET_POLICY_VARIABLES.items():
if scf_var not in preds or policy_var not in data:
continue
data[policy_var] = {
time_period: combine_sipp_and_scf_household_assets(
sipp_household_values=data[policy_var][time_period],
scf_household_values=preds.loc[first_person_mask, scf_var].values,
household_ids=hh_ids,
reference_household_ids=reference_household_ids,
time_period=time_period,
)
}
net_worth_components = {}
for var in ("bank_account_assets", "stock_assets", "bond_assets"):
if var in data:
net_worth_components[var] = (
aggregate_person_values_to_reference_households(
data[var][time_period],
person_hh_ids,
first_person_mask,
)
)
if "household_vehicles_value" in data:
net_worth_components["household_vehicles_value"] = (
align_household_values_to_reference_households(
data["household_vehicles_value"][time_period],
hh_ids,
reference_household_ids,
)
)
for var in SCF_NET_WORTH_COMPONENT_VARIABLES:
if var in data:
net_worth_components[var] = data[var][time_period]
if SCF_NET_WORTH_TARGET in preds:
net_worth_components = rebalance_scf_net_worth_components(
components=net_worth_components,
target_net_worth=preds.loc[
first_person_mask, SCF_NET_WORTH_TARGET
].values.astype(np.float32),
)
for var in SCF_NET_WORTH_COMPONENT_VARIABLES:
if var in net_worth_components:
data[var] = {time_period: net_worth_components[var]}
data["net_worth"] = {
time_period: compute_net_worth_from_components(
components=net_worth_components,
)
}

del fitted, preds
gc.collect()

logger.info("SCF imputation complete: %s", available_vars)
logger.info("SCF imputation complete: %s", SCF_IMPUTED_VARIABLES)
return data

