feat: public APIs for trivial column access (raw byte ranges) by wkalt · Pull Request #6261 · lance-format/lance

wkalt · 2026-03-23T17:31:55Z

Add APIs that let downstream consumers read raw value bytes directly from the file without going through the decoder pipeline, when the column encoding is simple enough (non-nullable, fixed-width, flat, uncompressed).

New APIs:

- `ColumnInfo::is_trivially_decodable()`: true when on-disk bytes match the in-memory representation. Handles MiniBlock, FullZip, and legacy v2.0 flat encodings.
- `ColumnInfo::trivial_page_descriptors()`: returns per-page metadata (buffer offsets, `bits_per_value`, `has_large_chunk`) for locating raw data in the file.
- `miniblock_raw_data_ranges()`: parses miniblock chunk metadata words to produce `(offset, length, num_rows)` tuples pointing past headers to raw value data.
- `PageInfo::miniblock_layout()` / `full_zip_layout()`: extract the specific layout variant from the encoding protobuf.
- `CachedFileMetadata::column_index(name)`: maps a column name to its index in `column_infos`, handling FSL child resolution.

These enable a speedup for brute-force KNN when vector data is cached in memory, by allowing the caller to reinterpret cached pages as &[f32] without Arrow allocation or DataFusion plan execution.
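To make the "reinterpret cached pages as `&[f32]`" idea concrete, here is a minimal sketch using only the standard library. It is not code from this PR: `bytes_as_f32`, `l2_sq`, and `nearest` are hypothetical names, and it assumes the cached page bytes are 4-byte aligned and hold native-endian `f32` values.

```rust
use std::mem::{align_of, size_of};

/// Hypothetical helper: view a byte slice as `&[f32]` without copying.
/// Returns `None` when the slice is misaligned or its length is not a
/// multiple of 4, so the caller can fall back to the normal decoder.
/// Assumes the bytes hold native-endian f32 values.
fn bytes_as_f32(bytes: &[u8]) -> Option<&[f32]> {
    let aligned = (bytes.as_ptr() as usize) % align_of::<f32>() == 0;
    let whole = bytes.len() % size_of::<f32>() == 0;
    if !(aligned && whole) {
        return None;
    }
    // SAFETY: the pointer is aligned for f32, the length covers exactly
    // `bytes.len() / 4` elements inside `bytes`, and every bit pattern
    // is a valid f32.
    Some(unsafe {
        std::slice::from_raw_parts(bytes.as_ptr() as *const f32, bytes.len() / size_of::<f32>())
    })
}

/// Squared L2 distance between two equal-length vectors.
fn l2_sq(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Brute-force nearest neighbour over a flat buffer of `dim`-wide vectors.
/// Returns the row index and squared distance of the closest vector.
fn nearest(flat: &[f32], dim: usize, query: &[f32]) -> Option<(usize, f32)> {
    if dim == 0 || query.len() != dim {
        return None;
    }
    flat.chunks_exact(dim)
        .map(|v| l2_sq(v, query))
        .enumerate()
        .min_by(|a, b| a.1.total_cmp(&b.1))
}
```

Returning `None` on misalignment keeps the fast path optional: a caller can try the zero-copy view first and use the regular decode path otherwise.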

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions · 2026-03-23T17:36:02Z

PR Review: feat: public APIs for trivial column access (raw byte ranges)

Missing Tests (P0)

Per project policy: "All bugfixes and features must have corresponding tests. We do not merge code without tests."

This PR adds ~250 lines of new public API surface with zero tests. At minimum, the following should be tested:

- `is_trivially_decodable()`: true for simple flat/miniblock/fullzip pages; false for nullable, dictionary, compressed, sub-byte, and multi-buffer cases
- `trivial_page_descriptors()`: correct offsets, `bits_per_value`, `page_row_offset` accumulation, and `has_large_chunk` propagation
- `miniblock_raw_data_ranges()`: correct parsing of both u16 and u32 meta words, correct `data_offset`/`data_len`/`num_values` for multi-chunk pages, and correct last-chunk remainder handling
- `column_index()`: scalar column, FSL column, missing column
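As an illustration of what the `miniblock_raw_data_ranges()` cases would exercise, here is a self-contained sketch of chunk-word parsing. Everything in it is an assumption rather than the PR's code: the bit layout (low 4 bits hold log2 of the value count, the remaining bits hold the chunk size in 8-byte units minus one), the 8-byte header, and the names `RawRange` and `raw_data_ranges` are hypothetical stand-ins. u16 meta words are assumed to be widened to u32 by the caller.

```rust
const ALIGN: u64 = 8; // assumed chunk alignment in bytes
const HEADER_BYTES: u64 = 8; // assumed per-chunk header size for trivial pages

#[derive(Debug, PartialEq)]
struct RawRange {
    data_offset: u64, // offset of the first value byte within the page buffer
    data_len: u64,    // number of value bytes (excludes header and padding)
    num_values: u64,
}

/// Hypothetical parser. Returns `None` if the words are inconsistent with
/// `page_rows` or a chunk is too small to hold its values.
fn raw_data_ranges(meta_words: &[u32], page_rows: u64, bytes_per_value: u64) -> Option<Vec<RawRange>> {
    let mut out = Vec::with_capacity(meta_words.len());
    let mut chunk_start = 0u64;
    let mut rows_seen = 0u64;
    for (i, &word) in meta_words.iter().enumerate() {
        let log_num_values = word & 0xF;
        let chunk_bytes = (u64::from(word >> 4) + 1) * ALIGN;
        let is_last = i + 1 == meta_words.len();
        // Assumed convention: the last chunk holds whatever rows remain.
        let num_values = if is_last {
            page_rows.checked_sub(rows_seen)?
        } else {
            1u64 << log_num_values
        };
        let data_len = num_values.checked_mul(bytes_per_value)?;
        if data_len > chunk_bytes.checked_sub(HEADER_BYTES)? {
            return None;
        }
        out.push(RawRange { data_offset: chunk_start + HEADER_BYTES, data_len, num_values });
        chunk_start += chunk_bytes;
        rows_seen += num_values;
    }
    Some(out)
}
```

A test along these lines would feed two chunks (a full power-of-two chunk followed by a remainder chunk) and check each offset, length, and value count, plus the failure cases.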

Silent `continue` in `trivial_page_descriptors` (P1)

In trivial_page_descriptors(), the method first checks is_trivially_decodable() (which validates all pages), but then has continue branches in the per-page loop for pages that don't match any known layout. If is_trivially_decodable and the per-page logic ever drift apart, continue will silently produce fewer descriptors than there are pages, and the page_row_offset accumulation will be wrong. Consider making this an error/None return rather than silently skipping, or removing the redundant checks since is_trivially_decodable already guarantees the structure.
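The "return `None` instead of `continue`" suggestion could look roughly like the sketch below. The types here (`PageLayout`, `Page`, `PageDescriptor`) are simplified, hypothetical stand-ins for the real structures, chosen only to show the control flow.

```rust
/// Hypothetical stand-in for a page's encoding layout.
#[derive(Debug, Clone, Copy, PartialEq)]
enum PageLayout {
    MiniBlock { bits_per_value: u64 },
    FullZip { bits_per_value: u64 },
    Other,
}

/// Hypothetical stand-in for a page.
#[derive(Debug, Clone, Copy)]
struct Page {
    layout: PageLayout,
    num_rows: u64,
}

#[derive(Debug, PartialEq)]
struct PageDescriptor {
    page_row_offset: u64,
    num_rows: u64,
    bits_per_value: u64,
}

/// Either every page gets a descriptor, or the whole call returns `None`.
/// No page is ever skipped, so `page_row_offset` cannot drift.
fn trivial_page_descriptors(pages: &[Page]) -> Option<Vec<PageDescriptor>> {
    let mut row_offset = 0u64;
    let mut out = Vec::with_capacity(pages.len());
    for page in pages {
        let bits_per_value = match page.layout {
            PageLayout::MiniBlock { bits_per_value }
            | PageLayout::FullZip { bits_per_value } => bits_per_value,
            // Unknown layout: bail out for the whole column rather than skip.
            PageLayout::Other => return None,
        };
        out.push(PageDescriptor {
            page_row_offset: row_offset,
            num_rows: page.num_rows,
            bits_per_value,
        });
        row_offset += page.num_rows;
    }
    Some(out)
}
```

With this shape, a separate `is_trivially_decodable()` pre-check becomes an optimization rather than a correctness requirement, since the per-page loop enforces the same invariant on its own.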

`CHUNK_HEADER_BYTES` should be derived, not hardcoded (P1)

CHUNK_HEADER_BYTES: u64 = 8 happens to be correct today for trivial miniblocks (2B num_levels + 2B or 4B buffer_size, padded to 8-byte alignment), but this is a fragile implicit coupling with the miniblock encoding format. If the format ever changes (e.g., different alignment, additional header fields), this constant will silently produce wrong offsets. At minimum, add a comment explaining the derivation. Better yet, reuse the existing header-parsing logic or assert the invariant.
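One way to document and assert the derivation is sketched below. The field sizes and alignment come from the description above; the constant and function names other than `CHUNK_HEADER_BYTES` are hypothetical, and whether the 4-byte size field corresponds to `has_large_chunk` is an assumption.

```rust
const NUM_LEVELS_BYTES: u64 = 2; // 2B num_levels field
const BUFFER_SIZE_BYTES_NARROW: u64 = 2; // 2B buffer_size field
const BUFFER_SIZE_BYTES_WIDE: u64 = 4; // 4B buffer_size field (assumed: large chunks)
const CHUNK_ALIGNMENT: u64 = 8; // header is padded to this alignment

/// Round `n` up to the next multiple of `align`.
const fn pad_to(n: u64, align: u64) -> u64 {
    (n + align - 1) / align * align
}

/// Header size for a trivial miniblock chunk with a single value buffer.
const fn chunk_header_bytes(wide_buffer_size: bool) -> u64 {
    let size_field = if wide_buffer_size {
        BUFFER_SIZE_BYTES_WIDE
    } else {
        BUFFER_SIZE_BYTES_NARROW
    };
    pad_to(NUM_LEVELS_BYTES + size_field, CHUNK_ALIGNMENT)
}

const CHUNK_HEADER_BYTES: u64 = chunk_header_bytes(false);

// Compile-time checks: both header variants pad to the same size, and that
// size is the value the raw-range code relies on. If a field size or the
// alignment changes, the build fails instead of producing wrong offsets.
const _: () = assert!(chunk_header_bytes(false) == chunk_header_bytes(true));
const _: () = assert!(CHUNK_HEADER_BYTES == 8);
```

The const assertions cost nothing at runtime and turn the implicit coupling into a build error.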

`column_index` relies on `.last()` (minor)

column_index returns proj.column_indices.last() with the comment "handles FSL child resolution." This works because from_column_names returns parent + child indices for composite types. A brief doc comment explaining why .last() is the right choice (the last index is the leaf data column) would help future readers.
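A doc comment of the kind suggested might read as in the sketch below. `leaf_column_index` is a hypothetical free function standing in for the tail of `column_index`, and the parent-before-child ordering is taken from the description above rather than verified against the projection code.

```rust
/// Pick the data-bearing column from the indices a single-name projection
/// resolves to.
///
/// For a scalar column the projection yields one index. For a composite
/// type such as a fixed-size list it is assumed to yield the parent index
/// followed by the child index, so the last entry is the leaf column that
/// actually holds the values. An empty projection (missing column) yields
/// `None`.
fn leaf_column_index(column_indices: &[u32]) -> Option<u32> {
    column_indices.last().copied()
}
```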

wkalt · 2026-03-23T18:07:35Z

This is a proof of concept for a fast path for KNN scanning over in-memory Lance data pages. It is not expected to merge, or to be the approach we ultimately take.

github-actions bot added the enhancement (New feature or request) label on Mar 23, 2026

wkalt added the WIP (work in progress) label on Mar 23, 2026

wkalt wants to merge 1 commit into lance-format:main from wkalt:feat/trivial-column-access
