Add entity-level HDFStore output format alongside h5py #568

Open
anth-volk wants to merge 4 commits into main from add-hdfstore-output

anth-volk (Collaborator) commented Mar 4, 2026

Fixes #567

Related to PolicyEngine/policyengine-us#7700

Summary

  • stacked_dataset_builder.py now produces a Pandas HDFStore file (.hdfstore.h5) alongside the existing h5py file, with one table per entity and an embedded uprating manifest
  • Upload pipeline uploads HDFStore files to dedicated subdirectories (states_hdfstore/, districts_hdfstore/, cities_hdfstore/)
  • Comparison test validates both formats contain identical data for all ~183 variables

Test plan

  • Run stacked_dataset_builder on a single CD/state and confirm both .h5 and .hdfstore.h5 files are created
  • Run pytest test_format_comparison.py --h5py-path STATE.h5 --hdfstore-path STATE.hdfstore.h5 and confirm all variables match
  • Verify HDFStore contains _variable_metadata manifest with correct entity and uprating columns
  • Verify all 6 entity tables are present with correct row counts
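The last two checks in the test plan could be automated along these lines. This is a minimal sketch, not the PR's test code: `check_store_layout` is a hypothetical helper, and the manifest column names `entity` and `uprating` are assumed from the wording above rather than confirmed against the implementation.

```python
# Hypothetical helper: checks the key layout of an HDFStore file.
ENTITY_TABLES = (
    "person", "household", "tax_unit", "spm_unit", "family", "marital_unit",
)
MANIFEST_KEY = "_variable_metadata"
# Assumed column names; the real manifest may name them differently.
REQUIRED_MANIFEST_COLUMNS = ("entity", "uprating")


def check_store_layout(keys, manifest_columns):
    """Return a list of problems; an empty list means the layout looks right.

    keys: table keys as reported by the store (a leading "/" is tolerated).
    manifest_columns: column names of the manifest table, if it exists.
    """
    present = {key.lstrip("/") for key in keys}
    problems = [
        f"missing entity table: {name}"
        for name in ENTITY_TABLES
        if name not in present
    ]
    if MANIFEST_KEY not in present:
        problems.append(f"missing manifest: {MANIFEST_KEY}")
    else:
        problems += [
            f"manifest missing column: {column}"
            for column in REQUIRED_MANIFEST_COLUMNS
            if column not in manifest_columns
        ]
    return problems
```

With pandas and PyTables installed, `keys` could come from `pd.HDFStore(path, mode="r").keys()` and `manifest_columns` from the columns of the manifest table read out of the same store. Row counts would need a separate check against the source h5py file.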

🤖 Generated with Claude Code

The stacked_dataset_builder now produces a Pandas HDFStore file
(.hdfstore.h5) in addition to the existing h5py file. The HDFStore
contains one table per entity (person, household, tax_unit, spm_unit,
family, marital_unit) plus an embedded _variable_metadata manifest
recording each variable's entity and uprating parameter path.
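The layout described above (one table per entity plus a manifest) can be illustrated with plain dictionaries. This is a sketch under stated assumptions: `split_by_entity` and its arguments are hypothetical names, and the manifest field names are guesses, not the builder's actual schema.

```python
def split_by_entity(values, variable_entity, variable_uprating):
    """Group flat per-variable columns into per-entity tables plus a manifest.

    values: {variable_name: sequence of values}
    variable_entity: {variable_name: entity_name}
    variable_uprating: {variable_name: uprating parameter path}; variables
        without an uprating path get an empty string in the manifest.
    """
    tables = {}
    manifest = []
    for name in sorted(values):
        entity = variable_entity[name]
        # Each entity table is a mapping of column name to column values.
        tables.setdefault(entity, {})[name] = list(values[name])
        manifest.append(
            {
                "variable": name,
                "entity": entity,
                "uprating": variable_uprating.get(name, ""),
            }
        )
    return tables, manifest
```

In a pandas-based writer, each entry of `tables` could become a `DataFrame` stored under the entity name via `HDFStore.put`, with the manifest rows stored under `_variable_metadata`. Whether the builder does it this way is not stated in the commit message.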

The upload pipeline uploads HDFStore files to dedicated subdirectories
(states_hdfstore/, districts_hdfstore/, cities_hdfstore/).
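As an illustration of the naming and routing described here, the following sketch derives the `.hdfstore.h5` filename and picks the upload subdirectory. The function names and the `kind` labels (`state`, `district`, `city`) are hypothetical; only the suffix and the three subdirectory names come from this PR.

```python
from pathlib import PurePosixPath

# Subdirectory names are from the PR; the dictionary keys are assumed labels.
HDFSTORE_SUBDIRS = {
    "state": "states_hdfstore",
    "district": "districts_hdfstore",
    "city": "cities_hdfstore",
}


def hdfstore_name(h5_name):
    """Map 'STATE.h5' to 'STATE.hdfstore.h5'; leave HDFStore names unchanged."""
    if h5_name.endswith(".hdfstore.h5"):
        return h5_name
    if not h5_name.endswith(".h5"):
        raise ValueError(f"expected a .h5 filename, got {h5_name!r}")
    return h5_name[: -len(".h5")] + ".hdfstore.h5"


def hdfstore_upload_path(kind, h5_name):
    """Build the relative upload path for the HDFStore twin of an h5py file."""
    return str(PurePosixPath(HDFSTORE_SUBDIRS[kind]) / hdfstore_name(h5_name))
```

For example, `hdfstore_upload_path("state", "STATE.h5")` returns `states_hdfstore/STATE.hdfstore.h5`.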

A comparison test (test_format_comparison.py) validates that both
formats contain identical data for all variables.
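The core of such a comparison might look like the sketch below, which works on values already loaded from each format into plain Python sequences. It is not the code in test_format_comparison.py: `compare_formats` is a hypothetical name, and treating NaN as equal to NaN and using a relative tolerance for floats are assumptions about what "identical" should mean.

```python
import math


def values_match(left, right, rel_tol=1e-9):
    """Compare two scalars, treating NaN == NaN and allowing float tolerance."""
    if isinstance(left, float) and isinstance(right, float):
        if math.isnan(left) and math.isnan(right):
            return True
        return math.isclose(left, right, rel_tol=rel_tol)
    return left == right


def compare_formats(h5py_values, hdfstore_values):
    """Return (variable, reason) pairs for every variable that differs.

    Both arguments map variable names to sequences of values.
    """
    mismatches = []
    for name in sorted(set(h5py_values) | set(hdfstore_values)):
        if name not in h5py_values or name not in hdfstore_values:
            mismatches.append((name, "missing in one format"))
            continue
        left, right = h5py_values[name], hdfstore_values[name]
        if len(left) != len(right):
            mismatches.append((name, "length differs"))
        elif not all(values_match(x, y) for x, y in zip(left, right)):
            mismatches.append((name, "values differ"))
    return mismatches
```

Returning the full list of mismatches, rather than failing on the first one, makes it easier to see whether a problem affects a single variable or a whole entity table.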

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
anth-volk and others added 3 commits March 5, 2026 17:47
Replaces the two-file-input test with a self-contained roundtrip
script that takes only an h5py file path, generates an HDFStore
using inlined splitting logic, then compares both formats. Handles
entity-level h5py files and yearly/ETERNITY/monthly period keys.
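Handling the three kinds of period key could be sketched as follows. The key formats assumed here (`"2024"` for a year, `"2024-01"` for a month, the literal `"ETERNITY"`) are a guess at what the h5py files contain, and `classify_period_key` is a hypothetical helper rather than a function from this PR.

```python
import re

# Assumed key formats: four-digit year, or year-month with a two-digit month.
_YEAR = re.compile(r"\d{4}")
_MONTH = re.compile(r"\d{4}-(0[1-9]|1[0-2])")


def classify_period_key(key):
    """Classify a period key as 'eternity', 'year', or 'month'."""
    if key == "ETERNITY":
        return "eternity"
    if _YEAR.fullmatch(key):
        return "year"
    if _MONTH.fullmatch(key):
        return "month"
    raise ValueError(f"unrecognized period key: {key!r}")
```

Raising on unrecognized keys, instead of skipping them, keeps a roundtrip script from silently dropping data it does not understand.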

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@anth-volk anth-volk requested a review from juaristi22 March 5, 2026 18:39
@anth-volk anth-volk marked this pull request as ready for review March 5, 2026 18:39
Development

Successfully merging this pull request may close these issues.

Add entity-level HDFStore output format alongside h5py