Add entity-level HDFStore output format alongside h5py#568

Open
anth-volk wants to merge 13 commits into main from add-hdfstore-output

Conversation

@anth-volk
Collaborator

@anth-volk anth-volk commented Mar 4, 2026

Fixes #567

Related to PolicyEngine/policyengine-us#7700

Summary

  • stacked_dataset_builder.py now produces a Pandas HDFStore file (.hdfstore.h5) alongside the existing h5py file, with one table per entity and an embedded uprating manifest
  • Upload pipeline uploads HDFStore files to dedicated subdirectories (states_hdfstore/, districts_hdfstore/, cities_hdfstore/)
  • Comparison test validates both formats contain identical data for all ~183 variables

Test plan

  • Run stacked_dataset_builder on a single CD/state and confirm both .h5 and .hdfstore.h5 files are created
  • Run pytest test_format_comparison.py --h5py-path STATE.h5 --hdfstore-path STATE.hdfstore.h5 and confirm all variables match
  • Verify HDFStore contains _variable_metadata manifest with correct entity and uprating columns
  • Verify all 6 entity tables are present with correct row counts

🤖 Generated with Claude Code

The stacked_dataset_builder now produces a Pandas HDFStore file
(.hdfstore.h5) in addition to the existing h5py file. The HDFStore
contains one table per entity (person, household, tax_unit, spm_unit,
family, marital_unit) plus an embedded _variable_metadata manifest
recording each variable's entity and uprating parameter path.

The upload pipeline uploads HDFStore files to dedicated subdirectories
(states_hdfstore/, districts_hdfstore/, cities_hdfstore/).

A comparison test (test_format_comparison.py) validates that both
formats contain identical data for all variables.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
anth-volk and others added 3 commits March 5, 2026 17:47
Replaces the two-file-input test with a self-contained roundtrip
script that takes only an h5py file path, generates an HDFStore
using inlined splitting logic, then compares both formats. Handles
entity-level h5py files and yearly/ETERNITY/monthly period keys.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@anth-volk anth-volk requested a review from juaristi22 March 5, 2026 18:39
@anth-volk anth-volk marked this pull request as ready for review March 5, 2026 18:39
Collaborator

@juaristi22 juaristi22 left a comment


Review: HDFStore output format

Overall the feature makes sense — producing an entity-level HDFStore alongside the variable-centric h5py file is a clean step toward the API v2 alpha format. A few structural and testing concerns below.


1. Extract HDFStore logic into its own module

The three new functions (_split_into_entity_dfs, _build_uprating_manifest, _save_hdfstore) are a general-purpose h5py → HDFStore format transform. They don't depend on calibration, CD stacking, or anything specific to stacked_dataset_builder.py.

Embedding them in stacked_dataset_builder.py creates two problems:

  • Merge conflict with #538. That PR reorganizes calibration modules from datasets/cps/local_area_calibration/ to calibration/. This PR adds ~150 lines of new functions to the same file. The merge will be painful.
  • Coupling to a build pipeline. The HDFStore format is consumed by API v2 and extend_single_year_dataset(). Those consumers shouldn't need to trace into a calibration build module for the serialization logic.

Suggestion: Move these to a standalone utility, e.g. policyengine_us_data/utils/hdfstore.py:

# policyengine_us_data/utils/hdfstore.py
def split_into_entity_dfs(arrays, system, vars_to_save): ...
def build_uprating_manifest(vars_to_save, system): ...
def save_hdfstore(entity_dfs, manifest_df, path, time_period): ...

Then stacked_dataset_builder.py and any future consumer just imports from it.


2. The test reimplements the logic instead of testing it

test_format_comparison.py contains its own copy of _split_into_entity_dfs, _build_uprating_manifest, and _save_hdfstore. This means the test validates its own implementation, not the production code. If the production _split_into_entity_dfs has a bug (e.g., incorrect __period suffix stripping, wrong person_X_id → X_id rename), this test cannot catch it.

Note also that the two implementations already diverge:

  • Production works on a combined_df with variable__period columns and handles the person_X_id → X_id rename.
  • Test works on raw h5py arrays with no period suffix and doesn't handle that rename.
  • Production's _save_hdfstore does not deduplicate columns; the test version does (df.loc[:, ~df.columns.duplicated()]).

Suggestion: If the conversion logic lives in a shared module (point 1), the test can import it directly instead of reimplementing it. The test would then validate the actual production code path.


3. Nice-to-have test improvements

The current tests check value equality and structural completeness, which covers the basics well. A few additional checks that could be worth adding down the road if you want to harden things further:

  • Weight conservation — verifying that household_weight, person_weight, etc. sum to the same totals across formats. This is probably the highest-value addition since it's the property that matters most for microsimulation correctness.
  • Referential integrity — checking that every person_household_id in the person table references a valid household_id in the household table (and likewise for other entities). Silent FK violations won't fail value comparisons but would break simulations.
  • Row count consistency — e.g. n_unique(person_household_id) == len(household_df).
  • Group entity dedup correctness — the current comparison for group entities falls back to np.unique set comparison when array lengths differ, which can mask cases where dedup picks the wrong representative row. Not urgent but worth being aware of.
  • Bidirectional column check — currently checks h5py → HDFStore but not the reverse (no extra/spurious columns in HDFStore).
  • _time_period roundtrip — the _time_period metadata table is written but never read back or verified.

None of these are blockers — just ideas for making the validation more robust over time.
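For orientation, the first two checks might look something like this — a sketch assuming pandas DataFrames per entity and illustrative column names (person_household_id, household_weight), not the project's actual test code:

```python
import numpy as np
import pandas as pd

def check_weight_conservation(h5py_df: pd.DataFrame, store_df: pd.DataFrame,
                              weight_col: str = "household_weight") -> None:
    # Weight totals must agree across formats; this is the property that
    # matters most for microsimulation correctness.
    assert np.isclose(h5py_df[weight_col].sum(), store_df[weight_col].sum())

def check_referential_integrity(person_df: pd.DataFrame,
                                household_df: pd.DataFrame) -> None:
    # Every person must reference an existing household; FK violations
    # won't fail value comparisons but would break simulations.
    valid = set(household_df["household_id"])
    orphans = set(person_df["person_household_id"]) - valid
    assert not orphans, f"orphaned household references: {sorted(orphans)[:5]}"
```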


4. Minor: manifest entity mismatch

_build_uprating_manifest classifies unknown variables (not in system.variables) as entity "unknown", but _split_into_entity_dfs classifies them as "household". The manifest entity won't match where the variable actually ends up. Should be consistent — probably both "household".


5. Upload pattern in publish_local_area.py

The HDFStore upload uses os.path.exists(hdfstore_path) checks repeated across states/districts/cities. This is fine for now but could be a helper like _upload_hdfstore_if_exists(output_path, category) to reduce the copy-paste.

Main moved publish_local_area.py and stacked_dataset_builder.py from
datasets/cps/local_area_calibration/ to calibration/. Resolved by
removing old-location files and porting HDFStore upload logic to the
new publish_local_area.py location.

Note: The HDFStore helper functions (_split_into_entity_dfs,
_build_uprating_manifest, _save_hdfstore) from stacked_dataset_builder
need re-implementation — the original create_sparse_cd_stacked_dataset
was refactored into build_h5() which uses a different data structure
(dict of arrays vs combined DataFrame).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@baogorek
Collaborator

baogorek commented Mar 12, 2026

Hi @anth-volk, since PR #538 was merged and was such a large PR, I'm taking initiative to get the remaining PRs up to date with main.

Merged main into this branch

PR #538 (calibration pipeline improvements) moved the files this PR modified:

  • datasets/cps/local_area_calibration/publish_local_area.py → calibration/publish_local_area.py
  • datasets/cps/local_area_calibration/stacked_dataset_builder.py → calibration/stacked_dataset_builder.py (completely rewritten as a thin CLI wrapper)

What was ported

  • HDFStore upload logic in publish_local_area.py: The .hdfstore.h5 upload blocks for states, districts, and cities were applied to the new file location. These add states_hdfstore, districts_hdfstore, cities_hdfstore subdirectories for GCS/HF uploads.
  • Old-location files were removed (they no longer exist in main).
  • HDFStore helper functions (_split_into_entity_dfs, _build_uprating_manifest, _save_hdfstore) were adapted to work with build_h5()'s data dict structure (instead of the old combined_df DataFrame) and wired up at the end of build_h5().

Please review the merge commits — especially the entity splitting logic in _split_data_into_entity_dfs — and let me know if anything looks off.

Adapts _split_into_entity_dfs, _build_uprating_manifest, and
_save_hdfstore to work with the new data dict (var -> {period -> array})
instead of the old combined_df DataFrame. Adds the integration call at
the end of build_h5() so HDFStore files are generated alongside h5py.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@baogorek
Collaborator

baogorek commented Mar 12, 2026

Details on the helper function port:

The original helpers worked with a combined_df pandas DataFrame that no longer exists — build_h5() uses a data dict of {var_name: {period: numpy_array}}. The adapted versions:

  • _split_data_into_entity_dfs: Groups variables by entity using system.variables, builds a DataFrame per entity from the data dict arrays, deduplicates group entities by ID
  • _build_uprating_manifest: Same logic, reads variable names from data.keys() instead of combined_df.columns
  • _save_hdfstore: Unchanged from original

The integration call is at the end of build_h5(), right after the h5py verification block — so every local area build produces both .h5 and .hdfstore.h5 files.
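The shape of that adaptation can be sketched as follows — the variable-to-entity lookup and fallback behavior are assumptions based on the description above, not the actual ported code:

```python
import numpy as np
import pandas as pd

def split_data_into_entity_dfs(data, variable_entities, time_period):
    """Group a {var: {period: array}} dict into one DataFrame per entity.

    `variable_entities` stands in for the system.variables lookup:
    a plain {variable_name: entity_key} dict. Unknown variables fall
    back to "household", matching the manifest consistency fix above.
    """
    entity_columns = {}
    for var, periods in data.items():
        # Try both int and string forms of the target period.
        arr = periods.get(time_period, periods.get(str(time_period)))
        if arr is None:
            continue  # production code would resolve or log other period keys
        entity = variable_entities.get(var, "household")
        entity_columns.setdefault(entity, {})[var] = np.asarray(arr)
    return {entity: pd.DataFrame(cols) for entity, cols in entity_columns.items()}
```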

anth-volk and others added 4 commits March 16, 2026 20:00
Moves split_data_into_entity_dfs, build_uprating_manifest, and
save_hdfstore out of publish_local_area.py into a standalone utility
module. Updates test_format_comparison.py to import the production
functions instead of reimplementing them, bridging the h5py flat-array
format into the nested {var: {period: array}} structure expected by
the shared module.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@anth-volk anth-volk requested a review from juaristi22 March 16, 2026 19:59
Collaborator

@juaristi22 juaristi22 left a comment


Review

Fragile .h5 path replacement (change requested)

save_hdfstore derives the output path via str(output_path).replace(".h5", ".hdfstore.h5") (utils/hdfstore.py:138). This has two problems:

  • If the path contains .h5 elsewhere (e.g. /data/h5_files/output.h5), the wrong occurrence gets replaced.
  • More concretely: running test_format_comparison.py against a state file triggers a double replacement. The test's h5py_to_hdfstore computes hdfstore_path = args.h5py_path.replace(".h5", ".hdfstore.h5") and passes that to save_hdfstore, which then applies .replace(".h5", ".hdfstore.h5") again internally — producing MA.hdfstore.hdfstore.h5. The file gets written there, but compare_formats looks for MA.hdfstore.h5, which doesn't exist, and the script crashes with FileNotFoundError.

The fix should make save_hdfstore either accept the final output path directly (no internal derivation), or use suffix-only replacement like Path(output_path).with_suffix(".hdfstore.h5"). The test script's h5py_to_hdfstore would then need to pass the base .h5 path instead of the derived one, or vice versa — just needs to be consistent.
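The suffix-only option behaves like this (plain pathlib, no project code involved):

```python
from pathlib import Path

def derive_hdfstore_path(output_path) -> Path:
    # with_suffix replaces only the final ".h5" suffix, so ".h5"
    # appearing elsewhere in the path is never touched.
    return Path(output_path).with_suffix(".hdfstore.h5")
```

Note this is still not idempotent (applying it to MA.hdfstore.h5 yields MA.hdfstore.hdfstore.h5), so the caller and save_hdfstore must still agree on exactly one place where the derivation happens — which is the core of the fix.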

Minor suggestions (non-blocking)

  1. save_hdfstore mutates input DataFrames in-place (utils/hdfstore.py:149-152): The object-dtype columns are cast with df[col] = df[col].astype(str), which mutates the caller's DataFrames. If anything downstream of build_h5() uses entity_dfs after the save, the data will be silently altered. Consider df = df.copy() before the column loop.

  2. Monthly-period variables silently dropped (utils/hdfstore.py:57-59): Variables stored under monthly keys like "2024-01" won't match time_period (int) or str(time_period) and are silently skipped. The test script's _read_h5py_arrays handles this by falling back to subkeys[0], but the production code doesn't. Consider a similar fallback or at least logging when a variable is skipped.

Thanks for the other changes, everything else looks good!

anth-volk and others added 3 commits March 18, 2026 20:04
Refactor the monolithic build_h5() into a data-assembly function
(build_output_dataset) that delegates to two internal serializers
(_save_h5, _save_hdfstore). Each serializer appends its own file
extension to the output_base path, eliminating the fragile
.replace(".h5", ".hdfstore.h5") pattern. Also fixes in-place
DataFrame mutation in _save_hdfstore via df.copy().

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Fix NameError in h5py_to_hdfstore — summary now reads back from the
   saved HDFStore file instead of referencing out-of-scope locals.
2. Add _resolve_period_key helper so ETERNITY and Period-object keys
   are no longer silently dropped by _split_data_into_entity_dfs.
3. Remove unused `pandas` and `Dict` imports from publish_local_area.py.
4. Drop underscore prefix from save_h5/save_hdfstore — they are used
   cross-module and are part of the package API. Internal helpers
   (_split_data_into_entity_dfs, _build_uprating_manifest) keep theirs.
5. Deduplicate os.path.exists checks in upload blocks.
6. Replace fragile str.replace CLI path derivation with Path operations.
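A hypothetical reconstruction of the period-key resolver described in item 2 — names and fallback order are assumptions, not the committed implementation:

```python
def resolve_period_key(periods, time_period):
    """Return the key under which time_period's array is stored.

    Tries the int and string forms, then ETERNITY, then any key whose
    string form matches (covering Period-like objects). Returns None
    instead of silently dropping the variable at the call site.
    """
    for key in (time_period, str(time_period), "ETERNITY"):
        if key in periods:
            return key
    for key in periods:
        if str(key) == str(time_period):
            return key
    return None
```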

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tests _resolve_period_key (int, str, ETERNITY, empty), save_h5 and
save_hdfstore round-trips, input immutability, and ETERNITY-keyed
variable inclusion. All use synthetic data with no dataset dependency.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@anth-volk anth-volk requested a review from juaristi22 March 18, 2026 19:52
@anth-volk
Collaborator Author

Redid this pretty significantly because the code quality of the original was not good

Collaborator

@juaristi22 juaristi22 left a comment


Looks good! Clean refactor extracting serialization into utils/hdfstore.py and the entity-level HDFStore format is well-structured.

Minor note (non-blocking): in worker_script.py:72, if result: is now always truthy since DatasetResult is a dataclass. The old if path: could meaningfully be falsy, but now if build_output_dataset fails it raises rather than returning something falsy. Could remove the check or leave as-is — just flagging.
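The truthiness point can be seen with any default dataclass (illustrative only, not the project's actual DatasetResult definition):

```python
from dataclasses import dataclass

@dataclass
class DatasetResult:
    path: str = ""

# Without a custom __bool__ or __len__, every dataclass instance is
# truthy, so `if result:` no longer guards against a missing result.
assert bool(DatasetResult())
assert bool(DatasetResult(path=""))
```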

