Skip to content

fix: preserve row last-updated metadata for rewrite-columns updates#6263

Open
myandpr wants to merge 1 commit intolance-format:mainfrom
myandpr:fix-issue-6000-update-columns-version-meta
Open

fix: preserve row last-updated metadata for rewrite-columns updates#6263
myandpr wants to merge 1 commit intolance-format:mainfrom
myandpr:fix-issue-6000-update-columns-version-meta

Conversation

@myandpr
Copy link
Contributor

@myandpr myandpr commented Mar 23, 2026

Closes #6000

Summary

Fix _row_last_updated_at_version for RewriteColumns updates on stable-row-id datasets.

Previously, update_columns could advance the dataset version while leaving _row_last_updated_at_version stale for rewritten fragments.

The original bug was that FileFragment.update_columns rewrote column data but did not propagate enough row-level information for commit to refresh last_updated_at_version_meta correctly.

Approach

This change makes rewrite-columns version refresh source-aware:

  • update_columns now records matched physical row offsets and carries them through commit as transient fragment state
  • commit refreshes row-version metadata only for actually updated fragments
  • if partial row offsets are present, it does a partial per-row refresh
  • if upstream already computed precise last_updated_at_version_meta (for example merge-insert rewrite-columns), commit preserves it
  • otherwise, RewriteColumns falls back to full-fragment refresh

This keeps the existing API unchanged and limits the change to commit-time metadata handling plus the transient fragment metadata needed to carry row offsets across Rust, Python, and Java round-trips.

Tests

Added coverage for:

  • fragment.update_columns partial row refresh
  • non-zero fragment partial refresh
  • rewrite-columns fallback without transient row offsets
  • untouched fragments in multi-fragment datasets staying unchanged
  • merge-insert rewrite-columns preserving precise per-row last-updated metadata
  • Python JSON round-trip preserving pending rewrite offsets
  • Java fragment round-trip preserving pending rewrite offsets

Validation

cargo test -p lance test_fragment_update -- --nocapture
python -m pytest python/python/tests/test_fragment.py -k "last_updated_version or rewrite_columns_without_pending_offsets or untouched_fragments or non_zero_fragment or preserves_last_updated_version_across_json"
python -m pytest python/python/tests/test_dataset.py -k merge_insert_partial_rewrite_columns_preserves_last_updated_versions

@github-actions github-actions bot added bug Something isn't working python labels Mar 23, 2026
@github-actions
Copy link
Contributor

PR Review: fix: preserve row last-updated metadata for rewrite-columns updates

Overall

The approach of threading pending_updated_row_offsets as transient commit-time state through the Fragment struct is reasonable. Tests cover the key scenarios well. A few items worth addressing:

P1: Column index alignment assumption in hash join fallback

In fragment.rs, ROW_ADDR is appended to read_columns, which means the left batch now has more columns than before. The collect_with_fallback_and_matches function uses positional column indices (left_batch.column(column_i)) from the right batch to index into the left batch for fallback values. This works because ROW_ADDR and left_on are appended after the write_schema columns, placing them beyond self.batches[0].num_columns(). However, this correctness depends on an implicit contract that the first N columns of the left batch align positionally with the right batch. A comment documenting this invariant would prevent future breakage.

P1: left_on dedup is a latent bug fix — call it out

The change to skip appending left_on if it's already in write_schema fixes a pre-existing bug where duplicating the join key column in the left batch could cause column index misalignment in the hash joiner's interleave. This is a correctness fix independent of the version metadata work and worth noting in the PR description.

Minor observations

  • refresh_row_latest_update_meta_for_partial_frag_rewrite_cols (in lance-table) has a todo!() panic for RowIdMeta::External. This isn't new, but this PR introduces a new call path that can reach it. Worth a tracking issue if one doesn't exist.
  • The had_precise_last_updated_meta naming could be clearer — it's actually checking "did upstream already modify the metadata" (i.e., metadata differs from the original fragment). Consider renaming to upstream_already_refreshed_meta or similar.

Tests

Good coverage: partial row refresh, fallback without pending offsets, untouched fragments in multi-fragment datasets, and merge-insert preservation. The Rust test correctly enables enable_stable_row_ids and validates per-row version expectations.

@github-actions github-actions bot added the java label Mar 23, 2026
@myandpr myandpr force-pushed the fix-issue-6000-update-columns-version-meta branch from 74d2a65 to 8a809ef Compare March 23, 2026 21:24
@myandpr
Copy link
Contributor Author

myandpr commented Mar 23, 2026

note: Only linux-build is failing now, in the distributed IVF/PQ test under rust/lance/src/index/vector/ivf/v2.rs(test_distributed_vector_build_commits_multiple_segments_and_preserves_query_results::case_2_ivf_pq). This looks unrelated to the row-version metadata changes in this PR.

@myandpr
Copy link
Contributor Author

myandpr commented Mar 24, 2026

This CI failure is caused by an existing flaky test on , not by the change in this PR.

I opened a fix here: #6268

The flaky test was introduced by #6220 and has already failed on as well:

The fix makes the test probe all IVF partitions before comparing Top-K row ids, which removes the probing-related nondeterminism while keeping the same strict equality assertion.

Will merge #6268 first, then this PR should be able to rerun cleanly.

Xuanwo pushed a commit that referenced this pull request Mar 24, 2026
## Summary
This fixes a flaky regression test added in #6220
(`b80fbb3231cf58dd50e5670f9c56d309999bbd73`).

The affected test is:
-
`test_distributed_vector_build_commits_multiple_segments_and_preserves_query_results`

Recent failures showed up in both of these runs:
- https://github.com/lance-format/lance/actions/runs/23450834811
  `main` at `244c721504c6ef0b4c2f9700a342509976898d6e`
- https://github.com/lance-format/lance/actions/runs/23460892697
   #6263
  

In those failures, different platforms / index variants failed:
- `linux-arm / case_1_ivf_flat` on `main`
- `linux-build / case_2_ivf_pq` on #6263

That points to an existing flaky test.

## Root Cause
The test compared the exact Top-K `_rowid` results between:
- a single-segment index build, and
- a distributed multi-segment index build

However, the query path used by the test is ANN (`ANNIVFPartition`)
under the default probing behavior. With partial probing, the candidate
set can differ slightly between single-segment and multi-segment
layouts, especially near the tail of Top-K. That makes exact `_rowid`
equality too strict for this test and causes intermittent failures.

## Fix
Make the test probe all IVF partitions before comparing Top-K row ids:
- add `.minimum_nprobes(TWO_FRAG_NUM_PARTITIONS)` to the test query

This keeps the existing strong assertion (`ids_single == ids_split`) but
removes the probing-related source of nondeterminism.

## Testing
Local verification:
- `export PROTOC=/opt/homebrew/opt/protobuf@3/bin/protoc`
- `cargo test -p lance
test_distributed_vector_build_commits_multiple_segments_and_preserves_query_results
-- --nocapture`

Observed result:
- `case_1_ivf_flat ... ok`
- `case_2_ivf_pq ... ok`
- `case_3_ivf_sq ... ok`

I also verified during debugging that with full probing enabled,
repeated runs of the previously flaky `ivf_flat` / `ivf_pq` cases became
stable.
@myandpr myandpr force-pushed the fix-issue-6000-update-columns-version-meta branch from 8a809ef to 645fd22 Compare March 24, 2026 12:26
@codecov
Copy link

codecov bot commented Mar 24, 2026

Codecov Report

❌ Patch coverage is 91.71598% with 14 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
rust/lance/src/dataset/transaction.rs 86.30% 9 Missing and 1 partial ⚠️
rust/lance/src/dataset/fragment.rs 95.71% 1 Missing and 2 partials ⚠️
rust/lance/src/dataset/hash_joiner.rs 92.30% 0 Missing and 1 partial ⚠️

📢 Thoughts on this report? Let us know!

@myandpr myandpr force-pushed the fix-issue-6000-update-columns-version-meta branch 3 times, most recently from 95f746b to 3abaf3a Compare March 24, 2026 15:52
@myandpr myandpr force-pushed the fix-issue-6000-update-columns-version-meta branch from 3abaf3a to 695a86a Compare March 24, 2026 16:06
@myandpr
Copy link
Contributor Author

myandpr commented Mar 24, 2026

Note: The current CI failure looks unrelated to the changes in this PR. It is failing in a distributed IVF test on linux-arm due to the same flaky exact Top-K row-id assertion pattern we saw before, not in the row-version metadata paths touched here.

I’ve opened a follow-up PR to fix it: #6281.

@myandpr
Copy link
Contributor Author

myandpr commented Mar 24, 2026

Hey @wjones127 @jackye1995 could you take a look this PR when you have a chance?

It fixes _row_last_updated_at_version handling for rewrite-columns updates on stable-row-id datasets, including merge-insert-related cases.

Any feedback would be much appreciated!

westonpace pushed a commit that referenced this pull request Mar 24, 2026
## Summary
This fixes a flaky regression test added in #6220
(`b80fbb3231cf58dd50e5670f9c56d309999bbd73`).

The affected test is:
-
`test_distributed_vector_build_commits_multiple_segments_and_preserves_query_results`

Recent failures showed up in both of these runs:
- https://github.com/lance-format/lance/actions/runs/23450834811
  `main` at `244c721504c6ef0b4c2f9700a342509976898d6e`
- https://github.com/lance-format/lance/actions/runs/23460892697
   #6263
  

In those failures, different platforms / index variants failed:
- `linux-arm / case_1_ivf_flat` on `main`
- `linux-build / case_2_ivf_pq` on #6263

That points to an existing flaky test.

## Root Cause
The test compared the exact Top-K `_rowid` results between:
- a single-segment index build, and
- a distributed multi-segment index build

However, the query path used by the test is ANN (`ANNIVFPartition`)
under the default probing behavior. With partial probing, the candidate
set can differ slightly between single-segment and multi-segment
layouts, especially near the tail of Top-K. That makes exact `_rowid`
equality too strict for this test and causes intermittent failures.

## Fix
Make the test probe all IVF partitions before comparing Top-K row ids:
- add `.minimum_nprobes(TWO_FRAG_NUM_PARTITIONS)` to the test query

This keeps the existing strong assertion (`ids_single == ids_split`) but
removes the probing-related source of nondeterminism.

## Testing
Local verification:
- `export PROTOC=/opt/homebrew/opt/protobuf@3/bin/protoc`
- `cargo test -p lance
test_distributed_vector_build_commits_multiple_segments_and_preserves_query_results
-- --nocapture`

Observed result:
- `case_1_ivf_flat ... ok`
- `case_2_ivf_pq ... ok`
- `case_3_ivf_sq ... ok`

I also verified during debugging that with full probing enabled,
repeated runs of the previously flaky `ivf_flat` / `ivf_pq` cases became
stable.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working java python

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] FileFragment.update_columns does not update last_updated_at_version_meta

1 participant