Cache QRF models to avoid retraining when only weights change #595

@MaxGhenis

Description

Summary

The extended CPS build retrains all QRF models from scratch on every run, even when only the calibration weights change. Since the QRF imputation depends only on the source CPS + PUF data, not on the weights, the fitted models could be cached and reused.

Current cost

  • Sequential QRF over 85+ variables on ~20K PUF records: ~30-60 min
  • Additional QRF calls for weeks_unemployed, retirement contributions, and SS sub-components
  • This runs on every `make data` or CI build

Proposed approach

  • Serialize fitted QRF models (e.g. pickle/joblib) keyed by a hash of the training data
  • On rebuild, check if source data hash matches cached model — if so, skip training and just predict
  • microimpute could potentially support this natively (save/load fitted models)
  • Could also cache the full extended_cps_2024.h5 and only rebuild when CPS/PUF inputs change
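The hash-keyed caching in the first two bullets could look roughly like the sketch below. This is a minimal illustration, not the microimpute API: `fit_or_load`, `CACHE_DIR`, and the `fit_fn` callback are hypothetical names, and the training data is assumed to already be serialized to bytes (e.g. the CPS + PUF inputs) so it can be hashed stably.

```python
import hashlib
import pickle
from pathlib import Path

CACHE_DIR = Path(".qrf_cache")  # hypothetical cache location


def data_hash(train_bytes: bytes) -> str:
    """Stable content hash of the serialized training data."""
    return hashlib.sha256(train_bytes).hexdigest()[:16]


def fit_or_load(train_bytes: bytes, fit_fn):
    """Return a fitted model, reusing a cached one when the
    training data is byte-identical to a previous build."""
    CACHE_DIR.mkdir(exist_ok=True)
    path = CACHE_DIR / f"qrf_{data_hash(train_bytes)}.pkl"
    if path.exists():
        # Cache hit: same source data, skip training entirely.
        return pickle.loads(path.read_bytes())
    # Cache miss: train from scratch, then persist for next time.
    model = fit_fn(train_bytes)
    path.write_bytes(pickle.dumps(model))
    return model
```

In the real build, `fit_fn` would wrap the sequential `fit()` call, and the hash input would be the CPS + PUF training frame (weights excluded), so weight-only recalibrations always hit the cache.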

Context

Related to the sequential QRF migration in #594 — now that all 85 variables are in a single fit() call, caching the one fitted model would skip the entire training phase.
