Blend entity values on would_file draws; fix entity weights#611

Open
baogorek wants to merge 14 commits into main from fix-would-file-blend-and-entity-weights

Conversation


@baogorek baogorek commented Mar 16, 2026

Summary

Context

8 of 9 takeup variables are "gate" variables — they sit between eligibility and the benefit, so eligible_amount × draw works. The 9th (would_file_taxes_voluntarily) is a "state" variable — it changes upstream simulation state (is_filer) that other targets depend on. You can't post-multiply a state change; you have to pre-branch it.
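The gate-vs-state distinction can be sketched with plain NumPy; the arrays below are hypothetical toy data, not the actual matrix-builder code:

```python
import numpy as np

# Gate variable: the takeup draw sits between eligibility and the
# benefit, so post-multiplying the eligible amount works.
eligible_amount = np.array([100.0, 250.0, 0.0])
takeup_draw = np.array([1.0, 0.0, 1.0])      # 1 = takes up, 0 = does not
benefit = eligible_amount * takeup_draw       # [100., 0., 0.]

# State variable: would_file changes upstream state (is_filer), so we
# pre-branch instead: compute downstream values under both states,
# then blend per tax unit according to the draw.
value_if_filer = np.array([50.0, 80.0, 30.0])
value_if_nonfiler = np.array([0.0, 0.0, 0.0])
would_file = np.array([1.0, 1.0, 0.0])
blended = would_file * value_if_filer + (1 - would_file) * value_if_nonfiler
# blended -> [50., 80., 0.]
```

Post-multiplying a state variable would scale the all-filers value rather than switch between the two simulated states, which is why the matrix builder needs both passes.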

The entity weight bug caused sim.calculate("aca_ptc").sum() (weighted by tax_unit_weight) to differ from sim.calculate("aca_ptc", map_to="household").sum() (weighted by household_weight) in local area H5 files.

Verification

With a 2K household / 10 clone test dataset for South Carolina:

  • X@w for aca_ptc across 7 SC districts: 145.4M
  • sim.calculate("aca_ptc", map_to="household").sum() from SC.h5: 145.4M (exact match)
  • sim.calculate("aca_ptc").sum() (tax_unit level): 145.4M (now matches after weight fix)
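The invariant being checked above can be sketched with plain NumPy, under the assumption that each tax unit inherits its household's weight (the behavior restored by the weight fix); the data here is a toy example, not SC.h5:

```python
import numpy as np

# Hypothetical toy data: 4 tax units across 3 households.
household_weight = np.array([10.0, 20.0, 5.0])
tax_unit_to_household = np.array([0, 0, 1, 2])   # tax unit -> household
aca_ptc = np.array([100.0, 50.0, 200.0, 0.0])    # per tax unit

# Tax-unit weights derived from the household weight (the fix):
tax_unit_weight = household_weight[tax_unit_to_household]
tax_unit_sum = (aca_ptc * tax_unit_weight).sum()

# Mapping values to the household first, then weighting, must agree:
household_value = np.bincount(tax_unit_to_household, weights=aca_ptc)
household_sum = (household_value * household_weight).sum()
assert np.isclose(tax_unit_sum, household_sum)
```

With the old person-count splitting, `tax_unit_weight` would not equal the parent `household_weight`, and the two sums diverged.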

Test plan

  • pytest policyengine_us_data/tests/test_calibration/test_unified_calibration.py — 42 passed
  • X@w matches sim.calculate on SC test dataset
  • Tax-unit and household level weighted sums agree after weight fix

🤖 Generated with Claude Code

@baogorek baogorek force-pushed the fix-would-file-blend-and-entity-weights branch from 7578ba2 to 310bb73 Compare March 16, 2026 17:54
@baogorek
Collaborator Author

Removed stacked_dataset_builder.py and geography.npz artifact

geography.npz was a saved copy of the output of assign_random_geography(n_records, n_clones, seed), which is fully deterministic. publish_local_area already regenerates geography from the same seed — it never loaded the .npz.

The only consumer was stacked_dataset_builder.py, an alternative entry point that loaded geography from the file instead of regenerating it. This created a second code path that had to stay in sync with publish_local_area but could silently diverge (and had a known bug where n_clones metadata was wrong).

Removed both. load_geography/save_geography remain in clone_and_assign.py since modal_app/worker_script.py still references them.

baogorek and others added 11 commits March 17, 2026 21:57
Matrix builder: precompute entity values with would_file=False alongside
the all-True values, then blend per tax unit based on the would_file draw
before applying target takeup draws. This ensures X@w matches sim.calculate
for targets affected by non-target state variables.

Fixes #609

publish_local_area: remove explicit sub-entity weight overrides
(tax_unit_weight, spm_unit_weight, family_weight, marital_unit_weight,
person_weight) that used incorrect person-count splitting. These are
formula variables in policyengine-us that correctly derive from
household_weight at runtime.

Fixes #610

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace block-based RNG salting with (hh_id, clone_idx) salting.
Draws are now tied to the donor household identity and independent
across clones, eliminating the multi-clone-same-block collision
issue (#597). Geographic variation comes through the rate threshold,
not the draw.

Closes #597

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
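A minimal sketch of (hh_id, clone_idx) salting, using NumPy's `SeedSequence` (the function name and base seed are illustrative, not the actual implementation):

```python
import numpy as np

# Seed each draw from the donor household identity plus the clone
# index: the same (hh_id, clone_idx) pair always reproduces its draw,
# and different clones of the same household draw independently.
def takeup_draw(hh_id: int, clone_idx: int, base_seed: int = 42) -> float:
    ss = np.random.SeedSequence([base_seed, hh_id, clone_idx])
    return np.random.default_rng(ss).uniform()

d1 = takeup_draw(1001, 0)
d2 = takeup_draw(1001, 0)   # same salt -> identical draw
d3 = takeup_draw(1001, 1)   # different clone -> independent draw

# Geographic variation enters through the rate threshold, not the draw:
rate_sc, rate_ca = 0.80, 0.60
takes_up_sc = d1 < rate_sc
takes_up_ca = d1 < rate_ca
```

Because the salt no longer involves the block, two clones landing in the same block can no longer collide on identical draws (#597).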
County precomputation crashes on LA County (06037) because
aca_ptc → slcsp_rating_area_la_county → three_digit_zip_code
calls zip_code.astype(int) on 'UNKNOWN'. Set zip_code='90001'
for LA County in both precomputation and publish_local_area
so X @ w matches sim.calculate("aca_ptc").sum().

Fixes #612

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
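The failure mode and fix can be sketched as follows; the sentinel handling is illustrative of the crash, not a copy of the policyengine-us formula:

```python
import numpy as np

# astype(int) raises on any non-numeric string, so an 'UNKNOWN'
# zip code crashes formulas that need the three-digit prefix.
zip_code = np.array(["29201", "UNKNOWN", "90210"])
try:
    zip_code.astype(int)
except ValueError:
    pass  # invalid literal for int(): 'UNKNOWN'

# The fix: replace the sentinel with a valid zip before conversion.
zip_code = np.where(zip_code == "UNKNOWN", "90001", zip_code)
three_digit = zip_code.astype(int) // 100   # [292, 900, 902]
```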
The zip_code set for LA County (06037) was being wiped by
delete_arrays which only preserved "county". Also apply the
06037 zip_code fix to the in-process county precomputation
path (not just the parallel worker function).

Fixes #612

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The only county-dependent variable (aca_ptc) does not depend on
would_file_taxes_voluntarily, so the entity_wf_false pass was
computing identical values. Removing it eliminates ~2,977 extra
simulation passes during --county-level builds.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ptc/eitc/ctc targets

- Fix geography.npz n_clones: it was saving the unique CD count instead
  of the actual clone count (line 1292 of unified_calibration.py)
- Deduplicate county precomputation: inline workers=1 path now calls
  _compute_single_state_group_counties instead of copy-pasting it
- Enable aca_ptc, eitc, and refundable_ctc targets at all levels
  in target_config.yaml (remove outdated #7748 disable comments)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Geography is fully deterministic from (n_records, n_clones, seed)
via assign_random_geography, so the .npz file was redundant.
publish_local_area already regenerates from seed. Removing the
artifact and its only consumer (stacked_dataset_builder.py)
eliminates a divergent code path that had to stay in sync.

The modal_app/worker_script.py still uses load_geography, so
the functions remain in clone_and_assign.py for now.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…coped checkpoints

- Add create_source_imputed_cps.py to data_build.py Phase 5 (was skipped in CI)
- Remove geography.npz dependency from Modal pipeline; workers regenerate
  geography deterministically from (n_records, n_clones, seed)
- Add input-scoped checkpoints to publish_local_area.py: hash weights+dataset
  to auto-clear stale checkpoints when inputs change
- Remove stale artifacts from push-to-modal (stacked_blocks, stacked_takeup,
  geo_labels)
- Stop uploading source_imputed H5 as intermediate; promote-dataset uploads
  at promotion time instead
- Default skip_download=True in Modal local_area (reads from volume)
- Remove _upload_source_imputed from remote_calibration_runner
- Clean up huggingface.py: remove geography/blocks/geo_labels from
  download and upload functions
- ruff format

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
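The input-scoped checkpoint idea can be sketched as follows; the function names and marker-file layout are hypothetical, not the actual publish_local_area code:

```python
import hashlib
import json
from pathlib import Path

# Hash the inputs (weights + dataset) that feed a checkpoint directory.
def input_hash(*paths: Path) -> str:
    h = hashlib.sha256()
    for p in sorted(paths):
        h.update(p.read_bytes())
    return h.hexdigest()

def load_or_clear(checkpoint_dir: Path, current_hash: str) -> bool:
    """Return True if existing checkpoints are valid for these inputs."""
    marker = checkpoint_dir / "inputs.json"
    if marker.exists() and json.loads(marker.read_text())["hash"] == current_hash:
        return True
    # Inputs changed (or first run): clear stale checkpoints and
    # record the new input hash so the next run can resume.
    for f in checkpoint_dir.glob("*.h5"):
        f.unlink()
    checkpoint_dir.mkdir(exist_ok=True)
    marker.write_text(json.dumps({"hash": current_hash}))
    return False
```

Scoping checkpoints to an input hash means a changed weights file invalidates resume state automatically instead of silently reusing results built from the old inputs.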
Keep upload-dataset and skip_download=False defaults so the full
pipeline (data_build → calibrate → stage-h5s) works via HF transport.
skip_download is available as opt-in for local push-to-modal workflow.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The data_build.py upload step now pushes source_imputed to
calibration/source_imputed_stratified_extended_cps.h5 on HF so the
downstream calibration pipeline (build-matrices, calibrate) can
download it. This closes the gap in the all-Modal pipeline.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@baogorek baogorek force-pushed the fix-would-file-blend-and-entity-weights branch from 0ecf3d1 to 34f5fc0 Compare March 18, 2026 01:57
juaristi22 and others added 3 commits March 18, 2026 15:59
- Add --detach to all 7 modal run commands in Makefile so long-running
  jobs survive terminal disconnects
- Add --county-level to build-matrices (required for county precomputation)
- Add N_CLONES variable (default 430) and pass --n-clones to
  build-matrices, stage-h5s, and stage-national-h5
- Plumb n_clones through Modal scripts: build_package entrypoint,
  coordinate_publish, and coordinate_national_publish (replacing
  hardcoded 430)
- Change pipeline target to a reference card since --detach makes
  sequential chaining impossible

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>