Blend entity values on would_file draws; fix entity weights#611
Open
Force-pushed from 7578ba2 to 310bb73
Matrix builder: precompute entity values with would_file=False alongside the all-True values, then blend per tax unit based on the would_file draw before applying target takeup draws. This ensures X @ w matches sim.calculate for targets affected by non-target state variables. Fixes #609

publish_local_area: remove explicit sub-entity weight overrides (tax_unit_weight, spm_unit_weight, family_weight, marital_unit_weight, person_weight) that used incorrect person-count splitting. These are formula variables in policyengine-us that correctly derive from household_weight at runtime. Fixes #610

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
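The blend step can be sketched in NumPy; every array name and value below is hypothetical, and the real builder operates on per-tax-unit vectors inside the clone worker, but the ordering (would_file draw first, blend, then the target's own takeup draw) is the point:

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed

# Hypothetical per-tax-unit target values from the two precomputed passes:
values_wf_true = np.array([1200.0, 0.0, 800.0, 500.0])  # would_file=True everywhere
values_wf_false = np.array([0.0, 0.0, 300.0, 0.0])      # would_file=False everywhere
would_file_rate = 0.95

# One would_file draw per tax unit selects a branch; this happens *before*
# the target's own takeup draw is applied.
would_file = rng.random(len(values_wf_true)) < would_file_rate
blended = np.where(would_file, values_wf_true, values_wf_false)

# The target takeup draw then applies to the blended values.
takeup = rng.random(len(blended)) < 0.8
final = blended * takeup
```

Blending is needed because a would_file flip changes upstream state, so the non-filer value cannot be recovered by scaling the filer value; both branches must exist before the draw picks one.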
Replace block-based RNG salting with (hh_id, clone_idx) salting. Draws are now tied to the donor household identity and independent across clones, eliminating the multi-clone-same-block collision issue (#597). Geographic variation comes through the rate threshold, not the draw. Closes #597 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
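A minimal sketch of (hh_id, clone_idx) salting using NumPy's SeedSequence; the actual hashing scheme in the builder may differ, and the function name here is invented for illustration:

```python
import numpy as np

def clone_draw(hh_id: int, clone_idx: int, base_seed: int = 2024) -> float:
    """Uniform draw salted by (hh_id, clone_idx): reproducible for a given
    donor household and clone, independent across clones."""
    seq = np.random.SeedSequence([base_seed, hh_id, clone_idx])
    return np.random.default_rng(seq).random()

# Clones of the same donor household get independent draws, so two clones
# that happen to share a block no longer collide on the same draw.
d0 = clone_draw(hh_id=42, clone_idx=0)
d1 = clone_draw(hh_id=42, clone_idx=1)
```

Because the salt is the household identity rather than the block, geographic variation enters only through the rate threshold each clone's draw is compared against.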
County precomputation crashes on LA County (06037) because aca_ptc → slcsp_rating_area_la_county → three_digit_zip_code calls zip_code.astype(int) on 'UNKNOWN'. Set zip_code='90001' for LA County in both precomputation and publish_local_area so X @ w matches sim.calculate("aca_ptc").sum(). Fixes #612

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
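The failure mode and the fix can be sketched with pandas (ZIP values here are hypothetical; the real conversion happens inside policyengine-us's three_digit_zip_code formula):

```python
import pandas as pd

zip_code = pd.Series(["90802", "UNKNOWN", "91101"])

# astype(int) on a series containing "UNKNOWN" is the crash:
try:
    zip_code.astype(int)
except ValueError:
    crashed = True  # invalid literal for int(): 'UNKNOWN'

# The fix: pin a concrete LA County ZIP before the simulation runs.
zip_code = zip_code.replace("UNKNOWN", "90001")
three_digit = zip_code.str[:3].astype(int)
```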
The zip_code set for LA County (06037) was being wiped by delete_arrays which only preserved "county". Also apply the 06037 zip_code fix to the in-process county precomputation path (not just the parallel worker function). Fixes #612 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The only county-dependent variable (aca_ptc) does not depend on would_file_taxes_voluntarily, so the entity_wf_false pass was computing identical values. Removing it eliminates ~2,977 extra simulation passes during --county-level builds. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ptc/eitc/ctc targets

- Fix geography.npz n_clones: was saving the unique CD count instead of the actual clone count (line 1292 of unified_calibration.py)
- Deduplicate county precomputation: the inline workers=1 path now calls _compute_single_state_group_counties instead of copy-pasting it
- Enable aca_ptc, eitc, and refundable_ctc targets at all levels in target_config.yaml (remove outdated #7748 disable comments)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Geography is fully deterministic from (n_records, n_clones, seed) via assign_random_geography, so the .npz file was redundant. publish_local_area already regenerates from seed. Removing the artifact and its only consumer (stacked_dataset_builder.py) eliminates a divergent code path that had to stay in sync. The modal_app/worker_script.py still uses load_geography, so the functions remain in clone_and_assign.py for now. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
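The determinism argument can be illustrated with a stand-in for assign_random_geography (the signature and internals below are assumptions, not the actual function):

```python
import numpy as np

def assign_random_geography_sketch(n_records: int, n_clones: int,
                                   seed: int, n_areas: int = 436) -> np.ndarray:
    """Pure function of (n_records, n_clones, seed): any consumer calling it
    with the same inputs reconstructs identical geography assignments."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, n_areas, size=(n_records * n_clones,))

# Two independent "consumers" regenerate the same assignment,
# so persisting a geography.npz artifact adds nothing.
a = assign_random_geography_sketch(1000, 430, seed=7)
b = assign_random_geography_sketch(1000, 430, seed=7)
```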
…coped checkpoints

- Add create_source_imputed_cps.py to data_build.py Phase 5 (was skipped in CI)
- Remove the geography.npz dependency from the Modal pipeline; workers regenerate geography deterministically from (n_records, n_clones, seed)
- Add input-scoped checkpoints to publish_local_area.py: hash weights + dataset to auto-clear stale checkpoints when inputs change
- Remove stale artifacts from push-to-modal (stacked_blocks, stacked_takeup, geo_labels)
- Stop uploading the source_imputed H5 as an intermediate; promote-dataset uploads it at promotion time instead
- Default skip_download=True in Modal local_area (reads from the volume)
- Remove _upload_source_imputed from remote_calibration_runner
- Clean up huggingface.py: remove geography/blocks/geo_labels from the download and upload functions
- ruff format

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Keep upload-dataset and skip_download=False defaults so the full pipeline (data_build → calibrate → stage-h5s) works via HF transport. skip_download is available as opt-in for local push-to-modal workflow. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The data_build.py upload step now pushes source_imputed to calibration/source_imputed_stratified_extended_cps.h5 on HF so the downstream calibration pipeline (build-matrices, calibrate) can download it. This closes the gap in the all-Modal pipeline. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Force-pushed from 0ecf3d1 to 34f5fc0
- Add --detach to all 7 modal run commands in the Makefile so long-running jobs survive terminal disconnects
- Add --county-level to build-matrices (required for county precomputation)
- Add an N_CLONES variable (default 430) and pass --n-clones to build-matrices, stage-h5s, and stage-national-h5
- Plumb n_clones through the Modal scripts: build_package entrypoint, coordinate_publish, and coordinate_national_publish (replacing the hardcoded 430)
- Change the pipeline target to a reference card, since --detach makes sequential chaining impossible

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
- Precompute entity values with would_file_taxes_voluntarily=False for tax_unit targets. In the clone worker, compute would_file draws first, blend between the two branches, then apply the target's own takeup draw. This ensures X @ w matches sim.calculate for targets affected by non-target "state" variables. Fixes #609 (Matrix builder: blend entity values based on would_file draws)
- Remove sub-entity weight overrides (tax_unit_weight, spm_unit_weight, family_weight, marital_unit_weight, person_weight) that used a wrong person-count-splitting formula. These are formula variables in policyengine-us that correctly derive from household_weight at runtime. Fixes #610 (Local area H5: remove incorrect sub-entity weight overrides)
- Replace block-based RNG salting with (hh_id, clone_idx) salting so draws are tied to donor household identity and independent across clones. Fixes #597 (Stacked builder assigns wrong block to multi-clone households sharing the same block)
- Set zip_code='90001' for LA County (06037) in county precomputation to prevent an astype(int) crash on 'UNKNOWN'. Fixes #612 (County precomputation crashes on LA County due to UNKNOWN zip_code)

Context
8 of 9 takeup variables are "gate" variables: they sit between eligibility and the benefit, so eligible_amount × draw works. The 9th (would_file_taxes_voluntarily) is a "state" variable: it changes upstream simulation state (is_filer) that other targets depend on. You can't post-multiply a state change; you have to pre-branch it.

The entity weight bug caused sim.calculate("aca_ptc").sum() (weighted by tax_unit_weight) to differ from sim.calculate("aca_ptc", map_to="household").sum() (weighted by household_weight) in local area H5 files.

Verification
With a 2K household / 10 clone test dataset for South Carolina:
- sim.calculate("aca_ptc", map_to="household").sum() from SC.h5: 145.4M (exact match)
- sim.calculate("aca_ptc").sum() (tax_unit level): 145.4M (now matches after the weight fix)

Test plan
pytest policyengine_us_data/tests/test_calibration/test_unified_calibration.py — 42 passed

🤖 Generated with Claude Code