fix: preserve row last-updated metadata for rewrite-columns updates#6263
fix: preserve row last-updated metadata for rewrite-columns updates#6263myandpr wants to merge 1 commit intolance-format:mainfrom
Conversation
PR Review: fix: preserve row last-updated metadata for rewrite-columns updatesOverallThe approach of threading P1: Column index alignment assumption in hash join fallbackIn P1:
|
74d2a65 to
8a809ef
Compare
|
note: Only linux-build is failing now, in the distributed IVF/PQ test under rust/lance/src/index/vector/ivf/v2.rs( |
|
This CI failure is caused by an existing flaky test on , not by the change in this PR. I opened a fix here: #6268 The flaky test was introduced by #6220 and has already failed on as well:
The fix makes the test probe all IVF partitions before comparing Top-K row ids, which removes the probing-related nondeterminism while keeping the same strict equality assertion. Will merge #6268 first, then this PR should be able to rerun cleanly. |
## Summary This fixes a flaky regression test added in #6220 (`b80fbb3231cf58dd50e5670f9c56d309999bbd73`). The affected test is: - `test_distributed_vector_build_commits_multiple_segments_and_preserves_query_results` Recent failures showed up in both of these runs: - https://github.com/lance-format/lance/actions/runs/23450834811 `main` at `244c721504c6ef0b4c2f9700a342509976898d6e` - https://github.com/lance-format/lance/actions/runs/23460892697 #6263 In those failures, different platforms / index variants failed: - `linux-arm / case_1_ivf_flat` on `main` - `linux-build / case_2_ivf_pq` on #6263 That points to an existing flaky test. ## Root Cause The test compared the exact Top-K `_rowid` results between: - a single-segment index build, and - a distributed multi-segment index build However, the query path used by the test is ANN (`ANNIVFPartition`) under the default probing behavior. With partial probing, the candidate set can differ slightly between single-segment and multi-segment layouts, especially near the tail of Top-K. That makes exact `_rowid` equality too strict for this test and causes intermittent failures. ## Fix Make the test probe all IVF partitions before comparing Top-K row ids: - add `.minimum_nprobes(TWO_FRAG_NUM_PARTITIONS)` to the test query This keeps the existing strong assertion (`ids_single == ids_split`) but removes the probing-related source of nondeterminism. ## Testing Local verification: - `export PROTOC=/opt/homebrew/opt/protobuf@3/bin/protoc` - `cargo test -p lance test_distributed_vector_build_commits_multiple_segments_and_preserves_query_results -- --nocapture` Observed result: - `case_1_ivf_flat ... ok` - `case_2_ivf_pq ... ok` - `case_3_ivf_sq ... ok` I also verified during debugging that with full probing enabled, repeated runs of the previously flaky `ivf_flat` / `ivf_pq` cases became stable.
8a809ef to
645fd22
Compare
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
95f746b to
3abaf3a
Compare
3abaf3a to
695a86a
Compare
|
Note: The current CI failure looks unrelated to the changes in this PR. It is failing in a distributed IVF test on linux-arm due to the same flaky exact Top-K row-id assertion pattern we saw before, not in the row-version metadata paths touched here. I’ve opened a follow-up PR to fix it: #6281. |
|
Hey @wjones127 @jackye1995 could you take a look this PR when you have a chance? It fixes _row_last_updated_at_version handling for rewrite-columns updates on stable-row-id datasets, including merge-insert-related cases. Any feedback would be much appreciated! |
## Summary This fixes a flaky regression test added in #6220 (`b80fbb3231cf58dd50e5670f9c56d309999bbd73`). The affected test is: - `test_distributed_vector_build_commits_multiple_segments_and_preserves_query_results` Recent failures showed up in both of these runs: - https://github.com/lance-format/lance/actions/runs/23450834811 `main` at `244c721504c6ef0b4c2f9700a342509976898d6e` - https://github.com/lance-format/lance/actions/runs/23460892697 #6263 In those failures, different platforms / index variants failed: - `linux-arm / case_1_ivf_flat` on `main` - `linux-build / case_2_ivf_pq` on #6263 That points to an existing flaky test. ## Root Cause The test compared the exact Top-K `_rowid` results between: - a single-segment index build, and - a distributed multi-segment index build However, the query path used by the test is ANN (`ANNIVFPartition`) under the default probing behavior. With partial probing, the candidate set can differ slightly between single-segment and multi-segment layouts, especially near the tail of Top-K. That makes exact `_rowid` equality too strict for this test and causes intermittent failures. ## Fix Make the test probe all IVF partitions before comparing Top-K row ids: - add `.minimum_nprobes(TWO_FRAG_NUM_PARTITIONS)` to the test query This keeps the existing strong assertion (`ids_single == ids_split`) but removes the probing-related source of nondeterminism. ## Testing Local verification: - `export PROTOC=/opt/homebrew/opt/protobuf@3/bin/protoc` - `cargo test -p lance test_distributed_vector_build_commits_multiple_segments_and_preserves_query_results -- --nocapture` Observed result: - `case_1_ivf_flat ... ok` - `case_2_ivf_pq ... ok` - `case_3_ivf_sq ... ok` I also verified during debugging that with full probing enabled, repeated runs of the previously flaky `ivf_flat` / `ivf_pq` cases became stable.
Closes #6000
Summary
Fix
_row_last_updated_at_versionforRewriteColumnsupdates on stable-row-id datasets.Previously,
update_columnscould advance the dataset version while leaving_row_last_updated_at_versionstale for rewritten fragments.The original bug was that
FileFragment.update_columnsrewrote column data but did not propagate enough row-level information for commit to refreshlast_updated_at_version_metacorrectly.Approach
This change makes rewrite-columns version refresh source-aware:
update_columnsnow records matched physical row offsets and carries them through commit as transient fragment statelast_updated_at_version_meta(for example merge-insert rewrite-columns), commit preserves itRewriteColumnsfalls back to full-fragment refreshThis keeps the existing API unchanged and limits the change to commit-time metadata handling plus the transient fragment metadata needed to carry row offsets across Rust, Python, and Java round-trips.
Tests
Added coverage for:
fragment.update_columnspartial row refreshValidation