Add entity-level HDFStore output format alongside h5py#568

Open
anth-volk wants to merge 13 commits into main from add-hdfstore-output

Conversation

@anth-volk
Collaborator

@anth-volk anth-volk commented Mar 4, 2026

Fixes #567

Related to PolicyEngine/policyengine-us#7700

Summary

  • stacked_dataset_builder.py now produces a Pandas HDFStore file (.hdfstore.h5) alongside the existing h5py file, with one table per entity and an embedded uprating manifest
  • Upload pipeline uploads HDFStore files to dedicated subdirectories (states_hdfstore/, districts_hdfstore/, cities_hdfstore/)
  • Comparison test validates both formats contain identical data for all ~183 variables

Test plan

  • Run stacked_dataset_builder on a single CD/state and confirm both .h5 and .hdfstore.h5 files are created
  • Run pytest test_format_comparison.py --h5py-path STATE.h5 --hdfstore-path STATE.hdfstore.h5 and confirm all variables match
  • Verify HDFStore contains _variable_metadata manifest with correct entity and uprating columns
  • Verify all 6 entity tables are present with correct row counts

🤖 Generated with Claude Code

The stacked_dataset_builder now produces a Pandas HDFStore file
(.hdfstore.h5) in addition to the existing h5py file. The HDFStore
contains one table per entity (person, household, tax_unit, spm_unit,
family, marital_unit) plus an embedded _variable_metadata manifest
recording each variable's entity and uprating parameter path.

The upload pipeline uploads HDFStore files to dedicated subdirectories
(states_hdfstore/, districts_hdfstore/, cities_hdfstore/).

A comparison test (test_format_comparison.py) validates that both
formats contain identical data for all variables.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
anth-volk and others added 3 commits March 5, 2026 17:47
Replaces the two-file-input test with a self-contained roundtrip
script that takes only an h5py file path, generates an HDFStore
using inlined splitting logic, then compares both formats. Handles
entity-level h5py files and yearly/ETERNITY/monthly period keys.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@anth-volk anth-volk requested a review from juaristi22 March 5, 2026 18:39
@anth-volk anth-volk marked this pull request as ready for review March 5, 2026 18:39
Collaborator

@juaristi22 juaristi22 left a comment


Review: HDFStore output format

Overall the feature makes sense — producing an entity-level HDFStore alongside the variable-centric h5py file is a clean step toward the API v2 alpha format. A few structural and testing concerns below.


1. Extract HDFStore logic into its own module

The three new functions (_split_into_entity_dfs, _build_uprating_manifest, _save_hdfstore) are a general-purpose h5py → HDFStore format transform. They don't depend on calibration, CD stacking, or anything specific to stacked_dataset_builder.py.

Embedding them in stacked_dataset_builder.py creates two problems:

  • Merge conflict with #538. That PR reorganizes calibration modules from datasets/cps/local_area_calibration/ to calibration/. This PR adds ~150 lines of new functions to the same file. The merge will be painful.
  • Coupling to a build pipeline. The HDFStore format is consumed by API v2 and extend_single_year_dataset(). Those consumers shouldn't need to trace into a calibration build module for the serialization logic.

Suggestion: Move these to a standalone utility, e.g. policyengine_us_data/utils/hdfstore.py:

# policyengine_us_data/utils/hdfstore.py
def split_into_entity_dfs(arrays, system, vars_to_save): ...
def build_uprating_manifest(vars_to_save, system): ...
def save_hdfstore(entity_dfs, manifest_df, path, time_period): ...

Then stacked_dataset_builder.py and any future consumer just imports from it.


2. The test reimplements the logic instead of testing it

test_format_comparison.py contains its own copy of _split_into_entity_dfs, _build_uprating_manifest, and _save_hdfstore. This means the test validates its own implementation, not the production code. If the production _split_into_entity_dfs has a bug (e.g., incorrect __period suffix stripping, wrong person_X_id → X_id rename), this test cannot catch it.

Note also that the two implementations already diverge:

  • Production works on a combined_df with variable__period columns and handles the person_X_id → X_id rename.
  • Test works on raw h5py arrays with no period suffix and doesn't handle that rename.
  • Production's _save_hdfstore does not deduplicate columns; the test version does (df.loc[:, ~df.columns.duplicated()]).

Suggestion: If the conversion logic lives in a shared module (point 1), the test can import it directly instead of reimplementing it. The test would then validate the actual production code path.


3. Nice-to-have test improvements

The current tests check value equality and structural completeness, which covers the basics well. A few additional checks that could be worth adding down the road if you want to harden things further:

  • Weight conservation — verifying that household_weight, person_weight, etc. sum to the same totals across formats. This is probably the highest-value addition since it's the property that matters most for microsimulation correctness.
  • Referential integrity — checking that every person_household_id in the person table references a valid household_id in the household table (and likewise for other entities). Silent FK violations won't fail value comparisons but would break simulations.
  • Row count consistency — e.g. n_unique(person_household_id) == len(household_df).
  • Group entity dedup correctness — the current comparison for group entities falls back to np.unique set comparison when array lengths differ, which can mask cases where dedup picks the wrong representative row. Not urgent but worth being aware of.
  • Bidirectional column check — currently checks h5py → HDFStore but not the reverse (no extra/spurious columns in HDFStore).
  • _time_period roundtrip — the _time_period metadata table is written but never read back or verified.

None of these are blockers — just ideas for making the validation more robust over time.
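For orientation, the first two checks might look something like this — a sketch assuming pandas DataFrames per entity and illustrative column names (person_household_id, household_weight), not the project's actual test code:

```python
import numpy as np
import pandas as pd

def check_weight_conservation(h5py_df: pd.DataFrame, store_df: pd.DataFrame,
                              weight_col: str = "household_weight") -> None:
    # Weight totals must agree across formats; this is the property that
    # matters most for microsimulation correctness.
    assert np.isclose(h5py_df[weight_col].sum(), store_df[weight_col].sum())

def check_referential_integrity(person_df: pd.DataFrame,
                                household_df: pd.DataFrame) -> None:
    # Every person must reference an existing household; FK violations
    # won't fail value comparisons but would break simulations.
    valid = set(household_df["household_id"])
    orphans = set(person_df["person_household_id"]) - valid
    assert not orphans, f"orphaned household references: {sorted(orphans)[:5]}"
```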


4. Minor: manifest entity mismatch

_build_uprating_manifest classifies unknown variables (not in system.variables) as entity "unknown", but _split_into_entity_dfs classifies them as "household". The manifest entity won't match where the variable actually ends up. Should be consistent — probably both "household".


5. Upload pattern in publish_local_area.py

The HDFStore upload uses os.path.exists(hdfstore_path) checks repeated across states/districts/cities. This is fine for now but could be a helper like _upload_hdfstore_if_exists(output_path, category) to reduce the copy-paste.

Main moved publish_local_area.py and stacked_dataset_builder.py from
datasets/cps/local_area_calibration/ to calibration/. Resolved by
removing old-location files and porting HDFStore upload logic to the
new publish_local_area.py location.

Note: The HDFStore helper functions (_split_into_entity_dfs,
_build_uprating_manifest, _save_hdfstore) from stacked_dataset_builder
need re-implementation — the original create_sparse_cd_stacked_dataset
was refactored into build_h5() which uses a different data structure
(dict of arrays vs combined DataFrame).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@baogorek
Collaborator

baogorek commented Mar 12, 2026

Hi @anth-volk, since PR #538 was merged and was such a large PR, I'm taking initiative to get the remaining PRs up to date with main.

Merged main into this branch

PR #538 (calibration pipeline improvements) moved the files this PR modified:

  • datasets/cps/local_area_calibration/publish_local_area.py → calibration/publish_local_area.py
  • datasets/cps/local_area_calibration/stacked_dataset_builder.py → calibration/stacked_dataset_builder.py (completely rewritten as a thin CLI wrapper)

What was ported

  • HDFStore upload logic in publish_local_area.py: The .hdfstore.h5 upload blocks for states, districts, and cities were applied to the new file location. These add states_hdfstore, districts_hdfstore, cities_hdfstore subdirectories for GCS/HF uploads.
  • Old-location files were removed (they no longer exist in main).
  • HDFStore helper functions (_split_into_entity_dfs, _build_uprating_manifest, _save_hdfstore) were adapted to work with build_h5()'s data dict structure (instead of the old combined_df DataFrame) and wired up at the end of build_h5().

Please review the merge commits — especially the entity splitting logic in _split_data_into_entity_dfs — and let me know if anything looks off.

Adapts _split_into_entity_dfs, _build_uprating_manifest, and
_save_hdfstore to work with the new data dict (var -> {period -> array})
instead of the old combined_df DataFrame. Adds the integration call at
the end of build_h5() so HDFStore files are generated alongside h5py.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@baogorek
Collaborator

baogorek commented Mar 12, 2026

Details on the helper function port:

The original helpers worked with a combined_df pandas DataFrame that no longer exists — build_h5() uses a data dict of {var_name: {period: numpy_array}}. The adapted versions:

  • _split_data_into_entity_dfs: Groups variables by entity using system.variables, builds a DataFrame per entity from the data dict arrays, deduplicates group entities by ID
  • _build_uprating_manifest: Same logic, reads variable names from data.keys() instead of combined_df.columns
  • _save_hdfstore: Unchanged from original

The integration call is at the end of build_h5(), right after the h5py verification block — so every local area build produces both .h5 and .hdfstore.h5 files.
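The shape of that adaptation can be sketched as follows — the variable-to-entity lookup and fallback behavior are assumptions based on the description above, not the actual ported code:

```python
import numpy as np
import pandas as pd

def split_data_into_entity_dfs(data, variable_entities, time_period):
    """Group a {var: {period: array}} dict into one DataFrame per entity.

    `variable_entities` stands in for the system.variables lookup:
    a plain {variable_name: entity_key} dict. Unknown variables fall
    back to "household", matching the manifest consistency fix above.
    """
    entity_columns = {}
    for var, periods in data.items():
        # Try both int and string forms of the target period.
        arr = periods.get(time_period, periods.get(str(time_period)))
        if arr is None:
            continue  # production code would resolve or log other period keys
        entity = variable_entities.get(var, "household")
        entity_columns.setdefault(entity, {})[var] = np.asarray(arr)
    return {entity: pd.DataFrame(cols) for entity, cols in entity_columns.items()}
```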

anth-volk and others added 4 commits March 16, 2026 20:00
Moves split_data_into_entity_dfs, build_uprating_manifest, and
save_hdfstore out of publish_local_area.py into a standalone utility
module. Updates test_format_comparison.py to import the production
functions instead of reimplementing them, bridging the h5py flat-array
format into the nested {var: {period: array}} structure expected by
the shared module.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@anth-volk anth-volk requested a review from juaristi22 March 16, 2026 19:59
Collaborator

@juaristi22 juaristi22 left a comment


Review

Fragile .h5 path replacement (change requested)

save_hdfstore derives the output path via str(output_path).replace(".h5", ".hdfstore.h5") (utils/hdfstore.py:138). This has two problems:

  • If the path contains .h5 elsewhere (e.g. /data/h5_files/output.h5), the wrong occurrence gets replaced.
  • More concretely: running test_format_comparison.py against a state file triggers a double replacement. The test's h5py_to_hdfstore computes hdfstore_path = args.h5py_path.replace(".h5", ".hdfstore.h5") and passes that to save_hdfstore, which then applies .replace(".h5", ".hdfstore.h5") again internally — producing MA.hdfstore.hdfstore.h5. The file gets written there, but compare_formats looks for MA.hdfstore.h5, which doesn't exist, and the script crashes with FileNotFoundError.

The fix should make save_hdfstore either accept the final output path directly (no internal derivation), or use suffix-only replacement like Path(output_path).with_suffix(".hdfstore.h5"). The test script's h5py_to_hdfstore would then need to pass the base .h5 path instead of the derived one, or vice versa — just needs to be consistent.
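The suffix-only option behaves like this (plain pathlib, no project code involved):

```python
from pathlib import Path

def derive_hdfstore_path(output_path) -> Path:
    # with_suffix replaces only the final ".h5" suffix, so ".h5"
    # appearing elsewhere in the path is never touched.
    return Path(output_path).with_suffix(".hdfstore.h5")
```

Note this is still not idempotent (applying it to MA.hdfstore.h5 yields MA.hdfstore.hdfstore.h5), so the caller and save_hdfstore must still agree on exactly one place where the derivation happens — which is the core of the fix.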

Minor suggestions (non-blocking)

  1. save_hdfstore mutates input DataFrames in-place (utils/hdfstore.py:149-152): The object-dtype columns are cast with df[col] = df[col].astype(str), which mutates the caller's DataFrames. If anything downstream of build_h5() uses entity_dfs after the save, the data will be silently altered. Consider df = df.copy() before the column loop.

  2. Monthly-period variables silently dropped (utils/hdfstore.py:57-59): Variables stored under monthly keys like "2024-01" won't match time_period (int) or str(time_period) and are silently skipped. The test script's _read_h5py_arrays handles this by falling back to subkeys[0], but the production code doesn't. Consider a similar fallback or at least logging when a variable is skipped.

Thanks for the other changes, everything else looks good!

anth-volk and others added 3 commits March 18, 2026 20:04
Refactor the monolithic build_h5() into a data-assembly function
(build_output_dataset) that delegates to two internal serializers
(_save_h5, _save_hdfstore). Each serializer appends its own file
extension to the output_base path, eliminating the fragile
.replace(".h5", ".hdfstore.h5") pattern. Also fixes in-place
DataFrame mutation in _save_hdfstore via df.copy().

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Fix NameError in h5py_to_hdfstore — summary now reads back from the
   saved HDFStore file instead of referencing out-of-scope locals.
2. Add _resolve_period_key helper so ETERNITY and Period-object keys
   are no longer silently dropped by _split_data_into_entity_dfs.
3. Remove unused `pandas` and `Dict` imports from publish_local_area.py.
4. Drop underscore prefix from save_h5/save_hdfstore — they are used
   cross-module and are part of the package API. Internal helpers
   (_split_data_into_entity_dfs, _build_uprating_manifest) keep theirs.
5. Deduplicate os.path.exists checks in upload blocks.
6. Replace fragile str.replace CLI path derivation with Path operations.
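A hypothetical reconstruction of the period-key resolver described in item 2 — names and fallback order are assumptions, not the committed implementation:

```python
def resolve_period_key(periods, time_period):
    """Return the key under which time_period's array is stored.

    Tries the int and string forms, then ETERNITY, then any key whose
    string form matches (covering Period-like objects). Returns None
    instead of silently dropping the variable at the call site.
    """
    for key in (time_period, str(time_period), "ETERNITY"):
        if key in periods:
            return key
    for key in periods:
        if str(key) == str(time_period):
            return key
    return None
```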

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tests _resolve_period_key (int, str, ETERNITY, empty), save_h5 and
save_hdfstore round-trips, input immutability, and ETERNITY-keyed
variable inclusion. All use synthetic data with no dataset dependency.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@anth-volk anth-volk requested a review from juaristi22 March 18, 2026 19:52
@anth-volk
Collaborator Author

Redid this pretty significantly because the code quality of the original was not good

Collaborator

@juaristi22 juaristi22 left a comment


Looks good! Clean refactor extracting serialization into utils/hdfstore.py and the entity-level HDFStore format is well-structured.

Minor note (non-blocking): in worker_script.py:72, if result: is now always truthy since DatasetResult is a dataclass. The old if path: could meaningfully be falsy, but now if build_output_dataset fails it raises rather than returning something falsy. Could remove the check or leave as-is — just flagging.
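The truthiness point can be seen with any default dataclass (illustrative only, not the project's actual DatasetResult definition):

```python
from dataclasses import dataclass

@dataclass
class DatasetResult:
    path: str = ""

# Without a custom __bool__ or __len__, every dataclass instance is
# truthy, so `if result:` no longer guards against a missing result.
assert bool(DatasetResult())
assert bool(DatasetResult(path=""))
```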

