fix: preserve row last-updated metadata for rewrite-columns updates by myandpr · Pull Request #6263 · lance-format/lance

myandpr · 2026-03-23T21:04:47Z

Summary

Fix _row_last_updated_at_version for RewriteColumns updates on stable-row-id datasets.

Previously, update_columns could advance the dataset version while leaving _row_last_updated_at_version stale for rewritten fragments.

The original bug was that FileFragment.update_columns rewrote column data but did not propagate enough row-level information for commit to refresh last_updated_at_version_meta correctly.

Approach

This change makes rewrite-columns version refresh source-aware:

update_columns now records matched physical row offsets and carries them through commit as transient fragment state
commit refreshes row-version metadata only for actually updated fragments
if partial row offsets are present, it does a partial per-row refresh
if upstream already computed precise last_updated_at_version_meta (for example merge-insert rewrite-columns), commit preserves it
otherwise, RewriteColumns falls back to full-fragment refresh

This keeps the existing API unchanged and limits the change to commit-time metadata handling plus the transient fragment metadata needed to carry row offsets across Rust, Python, and Java round-trips.

Tests

Added coverage for:

fragment.update_columns partial row refresh
non-zero fragment partial refresh
rewrite-columns fallback without transient row offsets
untouched fragments in multi-fragment datasets staying unchanged
merge-insert rewrite-columns preserving precise per-row last-updated metadata
Python JSON round-trip preserving pending rewrite offsets
Java fragment round-trip preserving pending rewrite offsets

Validation

cargo test -p lance test_fragment_update -- --nocapture
python -m pytest python/python/tests/test_fragment.py -k "last_updated_version or rewrite_columns_without_pending_offsets or untouched_fragments or non_zero_fragment or preserves_last_updated_version_across_json"
python -m pytest python/python/tests/test_dataset.py -k merge_insert_partial_rewrite_columns_preserves_last_updated_versions

github-actions · 2026-03-23T21:08:25Z

PR Review: fix: preserve row last-updated metadata for rewrite-columns updates

Overall

The approach of threading pending_updated_row_offsets as transient commit-time state through the Fragment struct is reasonable. Tests cover the key scenarios well. A few items worth addressing:

P1: Column index alignment assumption in hash join fallback

In fragment.rs, ROW_ADDR is appended to read_columns, which means the left batch now has more columns than before. The collect_with_fallback_and_matches function uses positional column indices (left_batch.column(column_i)) from the right batch to index into the left batch for fallback values. This works because ROW_ADDR and left_on are appended after the write_schema columns, placing them beyond self.batches[0].num_columns(). However, this correctness depends on an implicit contract that the first N columns of the left batch align positionally with the right batch. A comment documenting this invariant would prevent future breakage.

P1: `left_on` dedup is a latent bug fix — call it out

The change to skip appending left_on if it's already in write_schema fixes a pre-existing bug where duplicating the join key column in the left batch could cause column index misalignment in the hash joiner's interleave. This is a correctness fix independent of the version metadata work and worth noting in the PR description.

Minor observations

refresh_row_latest_update_meta_for_partial_frag_rewrite_cols (in lance-table) has a todo!() panic for RowIdMeta::External. This isn't new, but this PR introduces a new call path that can reach it. Worth a tracking issue if one doesn't exist.
The had_precise_last_updated_meta naming could be clearer — it's actually checking "did upstream already modify the metadata" (i.e., metadata differs from the original fragment). Consider renaming to upstream_already_refreshed_meta or similar.

Tests

Good coverage: partial row refresh, fallback without pending offsets, untouched fragments in multi-fragment datasets, and merge-insert preservation. The Rust test correctly enables enable_stable_row_ids and validates per-row version expectations.

myandpr · 2026-03-23T21:57:21Z

note: Only linux-build is failing now, in the distributed IVF/PQ test under rust/lance/src/index/vector/ivf/v2.rs(test_distributed_vector_build_commits_multiple_segments_and_preserves_query_results::case_2_ivf_pq). This looks unrelated to the row-version metadata changes in this PR.

myandpr · 2026-03-24T09:12:41Z

This CI failure is caused by an existing flaky test on , not by the change in this PR.

I opened a fix here: #6268

The flaky test was introduced by #6220 and has already failed on as well:

The fix makes the test probe all IVF partitions before comparing Top-K row ids, which removes the probing-related nondeterminism while keeping the same strict equality assertion.

Will merge #6268 first, then this PR should be able to rerun cleanly.

## Summary This fixes a flaky regression test added in #6220 (`b80fbb3231cf58dd50e5670f9c56d309999bbd73`). The affected test is: - `test_distributed_vector_build_commits_multiple_segments_and_preserves_query_results` Recent failures showed up in both of these runs: - https://github.com/lance-format/lance/actions/runs/23450834811 `main` at `244c721504c6ef0b4c2f9700a342509976898d6e` - https://github.com/lance-format/lance/actions/runs/23460892697 #6263 In those failures, different platforms / index variants failed: - `linux-arm / case_1_ivf_flat` on `main` - `linux-build / case_2_ivf_pq` on #6263 That points to an existing flaky test. ## Root Cause The test compared the exact Top-K `_rowid` results between: - a single-segment index build, and - a distributed multi-segment index build However, the query path used by the test is ANN (`ANNIVFPartition`) under the default probing behavior. With partial probing, the candidate set can differ slightly between single-segment and multi-segment layouts, especially near the tail of Top-K. That makes exact `_rowid` equality too strict for this test and causes intermittent failures. ## Fix Make the test probe all IVF partitions before comparing Top-K row ids: - add `.minimum_nprobes(TWO_FRAG_NUM_PARTITIONS)` to the test query This keeps the existing strong assertion (`ids_single == ids_split`) but removes the probing-related source of nondeterminism. ## Testing Local verification: - `export PROTOC=/opt/homebrew/opt/protobuf@3/bin/protoc` - `cargo test -p lance test_distributed_vector_build_commits_multiple_segments_and_preserves_query_results -- --nocapture` Observed result: - `case_1_ivf_flat ... ok` - `case_2_ivf_pq ... ok` - `case_3_ivf_sq ... ok` I also verified during debugging that with full probing enabled, repeated runs of the previously flaky `ivf_flat` / `ivf_pq` cases became stable.

codecov · 2026-03-24T12:57:35Z

Codecov Report

❌ Patch coverage is 91.71598% with 14 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
rust/lance/src/dataset/transaction.rs	86.30%	9 Missing and 1 partial ⚠️
rust/lance/src/dataset/fragment.rs	95.71%	1 Missing and 2 partials ⚠️
rust/lance/src/dataset/hash_joiner.rs	92.30%	0 Missing and 1 partial ⚠️

📢 Thoughts on this report? Let us know!

myandpr · 2026-03-24T17:41:29Z

Note: The current CI failure looks unrelated to the changes in this PR. It is failing in a distributed IVF test on linux-arm due to the same flaky exact Top-K row-id assertion pattern we saw before, not in the row-version metadata paths touched here.

I’ve opened a follow-up PR to fix it: #6281.

myandpr · 2026-03-24T17:45:11Z

Hey @wjones127 @jackye1995 could you take a look this PR when you have a chance?

It fixes _row_last_updated_at_version handling for rewrite-columns updates on stable-row-id datasets, including merge-insert-related cases.

Any feedback would be much appreciated!

## Summary This fixes a flaky regression test added in #6220 (`b80fbb3231cf58dd50e5670f9c56d309999bbd73`). The affected test is: - `test_distributed_vector_build_commits_multiple_segments_and_preserves_query_results` Recent failures showed up in both of these runs: - https://github.com/lance-format/lance/actions/runs/23450834811 `main` at `244c721504c6ef0b4c2f9700a342509976898d6e` - https://github.com/lance-format/lance/actions/runs/23460892697 #6263 In those failures, different platforms / index variants failed: - `linux-arm / case_1_ivf_flat` on `main` - `linux-build / case_2_ivf_pq` on #6263 That points to an existing flaky test. ## Root Cause The test compared the exact Top-K `_rowid` results between: - a single-segment index build, and - a distributed multi-segment index build However, the query path used by the test is ANN (`ANNIVFPartition`) under the default probing behavior. With partial probing, the candidate set can differ slightly between single-segment and multi-segment layouts, especially near the tail of Top-K. That makes exact `_rowid` equality too strict for this test and causes intermittent failures. ## Fix Make the test probe all IVF partitions before comparing Top-K row ids: - add `.minimum_nprobes(TWO_FRAG_NUM_PARTITIONS)` to the test query This keeps the existing strong assertion (`ids_single == ids_split`) but removes the probing-related source of nondeterminism. ## Testing Local verification: - `export PROTOC=/opt/homebrew/opt/protobuf@3/bin/protoc` - `cargo test -p lance test_distributed_vector_build_commits_multiple_segments_and_preserves_query_results -- --nocapture` Observed result: - `case_1_ivf_flat ... ok` - `case_2_ivf_pq ... ok` - `case_3_ivf_sq ... ok` I also verified during debugging that with full probing enabled, repeated runs of the previously flaky `ivf_flat` / `ivf_pq` cases became stable.

github-actions bot added bug Something isn't working python labels Mar 23, 2026

github-actions bot added the java label Mar 23, 2026

myandpr force-pushed the fix-issue-6000-update-columns-version-meta branch from 74d2a65 to 8a809ef Compare March 23, 2026 21:24

myandpr mentioned this pull request Mar 24, 2026

test: fix flaky distributed vector build results test #6268

Merged

myandpr force-pushed the fix-issue-6000-update-columns-version-meta branch from 8a809ef to 645fd22 Compare March 24, 2026 12:26

myandpr force-pushed the fix-issue-6000-update-columns-version-meta branch 3 times, most recently from 95f746b to 3abaf3a Compare March 24, 2026 15:52

fix: preserve row last-updated metadata for rewrite-columns updates

695a86a

myandpr force-pushed the fix-issue-6000-update-columns-version-meta branch from 3abaf3a to 695a86a Compare March 24, 2026 16:06

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: preserve row last-updated metadata for rewrite-columns updates#6263

fix: preserve row last-updated metadata for rewrite-columns updates#6263
myandpr wants to merge 1 commit intolance-format:mainfrom
myandpr:fix-issue-6000-update-columns-version-meta

myandpr commented Mar 23, 2026 •

edited

Loading

Uh oh!

github-actions bot commented Mar 23, 2026

Uh oh!

myandpr commented Mar 23, 2026

Uh oh!

myandpr commented Mar 24, 2026

Uh oh!

codecov bot commented Mar 24, 2026 •

edited

Loading

Uh oh!

myandpr commented Mar 24, 2026

Uh oh!

myandpr commented Mar 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

myandpr commented Mar 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Approach

Tests

Validation

Uh oh!

github-actions bot commented Mar 23, 2026

PR Review: fix: preserve row last-updated metadata for rewrite-columns updates

Overall

P1: Column index alignment assumption in hash join fallback

P1: left_on dedup is a latent bug fix — call it out

Minor observations

Tests

Uh oh!

myandpr commented Mar 23, 2026

Uh oh!

myandpr commented Mar 24, 2026

Uh oh!

codecov bot commented Mar 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

myandpr commented Mar 24, 2026

Uh oh!

myandpr commented Mar 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

myandpr commented Mar 23, 2026 •

edited

Loading

P1: `left_on` dedup is a latent bug fix — call it out

codecov bot commented Mar 24, 2026 •

edited

Loading