Blend entity values on would_file draws; fix entity weights#611

Open
baogorek wants to merge 14 commits into main from fix-would-file-blend-and-entity-weights

Conversation


@baogorek baogorek commented Mar 16, 2026

Summary

Context

8 of 9 takeup variables are "gate" variables — they sit between eligibility and the benefit, so eligible_amount × draw works. The 9th (would_file_taxes_voluntarily) is a "state" variable — it changes upstream simulation state (is_filer) that other targets depend on. You can't post-multiply a state change; you have to pre-branch it.
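The gate-vs-state distinction can be sketched with plain NumPy; the arrays below are hypothetical toy data, not the actual matrix-builder code:

```python
import numpy as np

# Gate variable: the takeup draw sits between eligibility and the
# benefit, so post-multiplying the eligible amount works.
eligible_amount = np.array([100.0, 250.0, 0.0])
takeup_draw = np.array([1.0, 0.0, 1.0])      # 1 = takes up, 0 = does not
benefit = eligible_amount * takeup_draw       # [100., 0., 0.]

# State variable: would_file changes upstream state (is_filer), so we
# pre-branch instead: compute downstream values under both states,
# then blend per tax unit according to the draw.
value_if_filer = np.array([50.0, 80.0, 30.0])
value_if_nonfiler = np.array([0.0, 0.0, 0.0])
would_file = np.array([1.0, 1.0, 0.0])
blended = would_file * value_if_filer + (1 - would_file) * value_if_nonfiler
# blended -> [50., 80., 0.]
```

Post-multiplying a state variable would scale the all-filers value rather than switch between the two simulated states, which is why the matrix builder needs both passes.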

The entity weight bug caused sim.calculate("aca_ptc").sum() (weighted by tax_unit_weight) to differ from sim.calculate("aca_ptc", map_to="household").sum() (weighted by household_weight) in local area H5 files.

Verification

With a 2K household / 10 clone test dataset for South Carolina:

  • X@w for aca_ptc across 7 SC districts: 145.4M
  • sim.calculate("aca_ptc", map_to="household").sum() from SC.h5: 145.4M (exact match)
  • sim.calculate("aca_ptc").sum() (tax_unit level): 145.4M (now matches after weight fix)
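The invariant being checked above can be sketched with plain NumPy, under the assumption that each tax unit inherits its household's weight (the behavior restored by the weight fix); the data here is a toy example, not SC.h5:

```python
import numpy as np

# Hypothetical toy data: 4 tax units across 3 households.
household_weight = np.array([10.0, 20.0, 5.0])
tax_unit_to_household = np.array([0, 0, 1, 2])   # tax unit -> household
aca_ptc = np.array([100.0, 50.0, 200.0, 0.0])    # per tax unit

# Tax-unit weights derived from the household weight (the fix):
tax_unit_weight = household_weight[tax_unit_to_household]
tax_unit_sum = (aca_ptc * tax_unit_weight).sum()

# Mapping values to the household first, then weighting, must agree:
household_value = np.bincount(tax_unit_to_household, weights=aca_ptc)
household_sum = (household_value * household_weight).sum()
assert np.isclose(tax_unit_sum, household_sum)
```

With the old person-count splitting, `tax_unit_weight` would not equal the parent `household_weight`, and the two sums diverged.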

Test plan

  • pytest policyengine_us_data/tests/test_calibration/test_unified_calibration.py — 42 passed
  • X@w matches sim.calculate on SC test dataset
  • Tax-unit and household level weighted sums agree after weight fix

🤖 Generated with Claude Code

@baogorek baogorek force-pushed the fix-would-file-blend-and-entity-weights branch from 7578ba2 to 310bb73 Compare March 16, 2026 17:54
@baogorek
Collaborator Author

Removed stacked_dataset_builder.py and geography.npz artifact

geography.npz was a saved copy of the output of assign_random_geography(n_records, n_clones, seed), which is fully deterministic. publish_local_area already regenerates geography from the same seed — it never loaded the .npz.

The only consumer was stacked_dataset_builder.py, an alternative entry point that loaded geography from the file instead of regenerating it. This created a second code path that had to stay in sync with publish_local_area but could silently diverge (and had a known bug where n_clones metadata was wrong).

Removed both. load_geography/save_geography remain in clone_and_assign.py since modal_app/worker_script.py still references them.

baogorek and others added 11 commits March 17, 2026 21:57
Matrix builder: precompute entity values with would_file=False alongside
the all-True values, then blend per tax unit based on the would_file draw
before applying target takeup draws. This ensures X@w matches sim.calculate
for targets affected by non-target state variables.

Fixes #609

publish_local_area: remove explicit sub-entity weight overrides
(tax_unit_weight, spm_unit_weight, family_weight, marital_unit_weight,
person_weight) that used incorrect person-count splitting. These are
formula variables in policyengine-us that correctly derive from
household_weight at runtime.

Fixes #610

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace block-based RNG salting with (hh_id, clone_idx) salting.
Draws are now tied to the donor household identity and independent
across clones, eliminating the multi-clone-same-block collision
issue (#597). Geographic variation comes through the rate threshold,
not the draw.

Closes #597

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
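A minimal sketch of (hh_id, clone_idx) salting, using NumPy's `SeedSequence` (the function name and base seed are illustrative, not the actual implementation):

```python
import numpy as np

# Seed each draw from the donor household identity plus the clone
# index: the same (hh_id, clone_idx) pair always reproduces its draw,
# and different clones of the same household draw independently.
def takeup_draw(hh_id: int, clone_idx: int, base_seed: int = 42) -> float:
    ss = np.random.SeedSequence([base_seed, hh_id, clone_idx])
    return np.random.default_rng(ss).uniform()

d1 = takeup_draw(1001, 0)
d2 = takeup_draw(1001, 0)   # same salt -> identical draw
d3 = takeup_draw(1001, 1)   # different clone -> independent draw

# Geographic variation enters through the rate threshold, not the draw:
rate_sc, rate_ca = 0.80, 0.60
takes_up_sc = d1 < rate_sc
takes_up_ca = d1 < rate_ca
```

Because the salt no longer involves the block, two clones landing in the same block can no longer collide on identical draws (#597).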
County precomputation crashes on LA County (06037) because
aca_ptc → slcsp_rating_area_la_county → three_digit_zip_code
calls zip_code.astype(int) on 'UNKNOWN'. Set zip_code='90001'
for LA County in both precomputation and publish_local_area
so X @ w matches sim.calculate("aca_ptc").sum().

Fixes #612

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
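The failure mode and fix can be sketched as follows; the sentinel handling is illustrative of the crash, not a copy of the policyengine-us formula:

```python
import numpy as np

# astype(int) raises on any non-numeric string, so an 'UNKNOWN'
# zip code crashes formulas that need the three-digit prefix.
zip_code = np.array(["29201", "UNKNOWN", "90210"])
try:
    zip_code.astype(int)
except ValueError:
    pass  # invalid literal for int(): 'UNKNOWN'

# The fix: replace the sentinel with a valid zip before conversion.
zip_code = np.where(zip_code == "UNKNOWN", "90001", zip_code)
three_digit = zip_code.astype(int) // 100   # [292, 900, 902]
```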
The zip_code set for LA County (06037) was being wiped by
delete_arrays which only preserved "county". Also apply the
06037 zip_code fix to the in-process county precomputation
path (not just the parallel worker function).

Fixes #612

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The only county-dependent variable (aca_ptc) does not depend on
would_file_taxes_voluntarily, so the entity_wf_false pass was
computing identical values. Removing it eliminates ~2,977 extra
simulation passes during --county-level builds.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ptc/eitc/ctc targets

- Fix geography.npz n_clones: it was saving the unique CD count instead
  of the actual clone count (line 1292 of unified_calibration.py)
- Deduplicate county precomputation: inline workers=1 path now calls
  _compute_single_state_group_counties instead of copy-pasting it
- Enable aca_ptc, eitc, and refundable_ctc targets at all levels
  in target_config.yaml (remove outdated #7748 disable comments)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Geography is fully deterministic from (n_records, n_clones, seed)
via assign_random_geography, so the .npz file was redundant.
publish_local_area already regenerates from seed. Removing the
artifact and its only consumer (stacked_dataset_builder.py)
eliminates a divergent code path that had to stay in sync.

The modal_app/worker_script.py still uses load_geography, so
the functions remain in clone_and_assign.py for now.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…coped checkpoints

- Add create_source_imputed_cps.py to data_build.py Phase 5 (was skipped in CI)
- Remove geography.npz dependency from Modal pipeline; workers regenerate
  geography deterministically from (n_records, n_clones, seed)
- Add input-scoped checkpoints to publish_local_area.py: hash weights+dataset
  to auto-clear stale checkpoints when inputs change
- Remove stale artifacts from push-to-modal (stacked_blocks, stacked_takeup,
  geo_labels)
- Stop uploading source_imputed H5 as intermediate; promote-dataset uploads
  at promotion time instead
- Default skip_download=True in Modal local_area (reads from volume)
- Remove _upload_source_imputed from remote_calibration_runner
- Clean up huggingface.py: remove geography/blocks/geo_labels from
  download and upload functions
- ruff format

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
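The input-scoped checkpoint idea can be sketched as follows; the function names and marker-file layout are hypothetical, not the actual publish_local_area code:

```python
import hashlib
import json
from pathlib import Path

# Hash the inputs (weights + dataset) that feed a checkpoint directory.
def input_hash(*paths: Path) -> str:
    h = hashlib.sha256()
    for p in sorted(paths):
        h.update(p.read_bytes())
    return h.hexdigest()

def load_or_clear(checkpoint_dir: Path, current_hash: str) -> bool:
    """Return True if existing checkpoints are valid for these inputs."""
    marker = checkpoint_dir / "inputs.json"
    if marker.exists() and json.loads(marker.read_text())["hash"] == current_hash:
        return True
    # Inputs changed (or first run): clear stale checkpoints and
    # record the new input hash so the next run can resume.
    for f in checkpoint_dir.glob("*.h5"):
        f.unlink()
    checkpoint_dir.mkdir(exist_ok=True)
    marker.write_text(json.dumps({"hash": current_hash}))
    return False
```

Scoping checkpoints to an input hash means a changed weights file invalidates resume state automatically instead of silently reusing results built from the old inputs.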
Keep upload-dataset and skip_download=False defaults so the full
pipeline (data_build → calibrate → stage-h5s) works via HF transport.
skip_download is available as opt-in for local push-to-modal workflow.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The data_build.py upload step now pushes source_imputed to
calibration/source_imputed_stratified_extended_cps.h5 on HF so the
downstream calibration pipeline (build-matrices, calibrate) can
download it. This closes the gap in the all-Modal pipeline.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@baogorek baogorek force-pushed the fix-would-file-blend-and-entity-weights branch from 0ecf3d1 to 34f5fc0 Compare March 18, 2026 01:57
juaristi22 and others added 3 commits March 18, 2026 15:59
- Add --detach to all 7 modal run commands in Makefile so long-running
  jobs survive terminal disconnects
- Add --county-level to build-matrices (required for county precomputation)
- Add N_CLONES variable (default 430) and pass --n-clones to
  build-matrices, stage-h5s, and stage-national-h5
- Plumb n_clones through Modal scripts: build_package entrypoint,
  coordinate_publish, and coordinate_national_publish (replacing
  hardcoded 430)
- Change pipeline target to a reference card since --detach makes
  sequential chaining impossible

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>