feat: public APIs for trivial column access (raw byte ranges) #6261

Open
wkalt wants to merge 1 commit into lance-format:main from wkalt:feat/trivial-column-access

wkalt (Contributor) commented Mar 23, 2026

Add APIs that let downstream consumers read raw value bytes directly from the file without going through the decoder pipeline, when the column encoding is simple enough (non-nullable, fixed-width, flat, uncompressed).

New APIs:

  • ColumnInfo::is_trivially_decodable() — true when on-disk bytes match the in-memory representation. Handles MiniBlock, FullZip, and legacy v2.0 flat encodings.
  • ColumnInfo::trivial_page_descriptors() — returns per-page metadata (buffer offsets, bits_per_value, has_large_chunk) for locating raw data in the file.
  • miniblock_raw_data_ranges() — parses miniblock chunk metadata words to produce (offset, length, num_rows) tuples pointing past headers to raw value data.
  • PageInfo::miniblock_layout() / full_zip_layout() — extract the specific layout variant from the encoding protobuf.
  • CachedFileMetadata::column_index(name) — maps a column name to its index in column_infos, handling FSL child resolution.

These enable a speedup for brute-force KNN when vector data is cached in memory, by allowing the caller to reinterpret cached pages as &[f32] without Arrow allocation or DataFusion plan execution.
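As a rough illustration of the fast path described above, the sketch below reinterprets an already-cached byte range as `&[f32]` and runs a brute-force nearest-neighbour scan over it. The names `bytes_as_f32`, `l2_sq`, and `nearest` are hypothetical and are not part of this PR. The sketch assumes the caller has already established, for example via `is_trivially_decodable()`, that the bytes are raw little-endian `f32` values with no headers in the range.

```rust
/// Reinterpret raw bytes as `&[f32]` without copying.
/// Returns `None` if the host is not little-endian, the buffer is misaligned,
/// or its length is not a multiple of 4. (Hypothetical helper, not the PR's API.)
fn bytes_as_f32(bytes: &[u8]) -> Option<&[f32]> {
    let size = std::mem::size_of::<f32>();
    let aligned = bytes.as_ptr() as usize % std::mem::align_of::<f32>() == 0;
    if !cfg!(target_endian = "little") || !aligned || bytes.len() % size != 0 {
        return None;
    }
    // SAFETY: the pointer comes from a valid slice, is aligned for f32, the
    // new length covers exactly the same bytes, and every bit pattern is a
    // valid f32. The returned borrow has the same lifetime as `bytes`.
    Some(unsafe { std::slice::from_raw_parts(bytes.as_ptr().cast::<f32>(), bytes.len() / size) })
}

/// Squared L2 distance between two equal-length vectors.
fn l2_sq(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Index of the row in `data` (row-major, `dim` floats per row) closest to `query`.
fn nearest(data: &[f32], dim: usize, query: &[f32]) -> Option<usize> {
    if dim == 0 || query.len() != dim {
        return None;
    }
    data.chunks_exact(dim)
        .enumerate()
        .map(|(i, row)| (i, l2_sq(row, query)))
        .min_by(|a, b| a.1.total_cmp(&b.1))
        .map(|(i, _)| i)
}
```

A caller would fall back to the regular decoder path whenever `bytes_as_f32` returns `None`, since a cached page is not guaranteed to sit at a 4-byte-aligned address.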

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
github-actions bot added the "enhancement" (New feature or request) label Mar 23, 2026
github-actions (Contributor) commented

PR Review: feat: public APIs for trivial column access (raw byte ranges)

Missing Tests (P0)

Per project policy: "All bugfixes and features must have corresponding tests. We do not merge code without tests."

This PR adds ~250 lines of new public API surface with zero tests. At minimum, the following should be tested:

  • is_trivially_decodable() — true for simple flat/miniblock/fullzip pages, false for nullable, dictionary, compressed, sub-byte, multi-buffer cases
  • trivial_page_descriptors() — correct offsets, bits_per_value, page_row_offset accumulation, has_large_chunk propagation
  • miniblock_raw_data_ranges() — correct parsing of both u16 and u32 meta words, correct data_offset/data_len/num_values for multi-chunk pages, and correct last-chunk remainder handling
  • column_index() — scalar column, FSL column, missing column

Silent continue in trivial_page_descriptors (P1)

In trivial_page_descriptors(), the method first checks is_trivially_decodable() (which validates all pages), but then has continue branches in the per-page loop for pages that don't match any known layout. If is_trivially_decodable and the per-page logic ever drift apart, continue will silently produce fewer descriptors than there are pages, and the page_row_offset accumulation will be wrong. Consider making this an error/None return rather than silently skipping, or removing the redundant checks since is_trivially_decodable already guarantees the structure.
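One way to read the suggestion above is sketched below, using simplified stand-in types. `Layout`, `Page`, and `PageDescriptor` here are hypothetical and do not mirror the PR's actual structs. The per-page loop returns `None` on an unrecognized layout instead of skipping the page, so the row-offset accumulation can never silently go out of step with the page list.

```rust
/// Hypothetical stand-ins for the PR's types; the real ones live in lance.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Layout {
    MiniBlock { bits_per_value: u64 },
    FullZip { bits_per_value: u64 },
    Other,
}

#[derive(Debug, Clone, Copy)]
struct Page {
    layout: Layout,
    num_rows: u64,
    buffer_offset: u64,
}

#[derive(Debug, PartialEq)]
struct PageDescriptor {
    buffer_offset: u64,
    bits_per_value: u64,
    page_row_offset: u64,
}

/// Returns one descriptor per page, or `None` if any page has a layout
/// this sketch does not recognize.
fn trivial_page_descriptors(pages: &[Page]) -> Option<Vec<PageDescriptor>> {
    let mut row_offset = 0u64;
    let mut out = Vec::with_capacity(pages.len());
    for page in pages {
        let bits_per_value = match page.layout {
            Layout::MiniBlock { bits_per_value } | Layout::FullZip { bits_per_value } => {
                bits_per_value
            }
            // Bail out instead of `continue`: skipping the page would leave
            // later `page_row_offset` values pointing at the wrong rows.
            Layout::Other => return None,
        };
        out.push(PageDescriptor {
            buffer_offset: page.buffer_offset,
            bits_per_value,
            page_row_offset: row_offset,
        });
        row_offset += page.num_rows;
    }
    Some(out)
}
```

With this shape, a successful result always has exactly `pages.len()` descriptors, which is also an easy property to assert in a test.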

CHUNK_HEADER_BYTES should be derived, not hardcoded (P1)

CHUNK_HEADER_BYTES: u64 = 8 happens to be correct today for trivial miniblocks (2B num_levels + 2B or 4B buffer_size, padded to 8-byte alignment), but this is a fragile implicit coupling with the miniblock encoding format. If the format ever changes (e.g., different alignment, additional header fields), this constant will silently produce wrong offsets. At minimum, add a comment explaining the derivation. Better yet, reuse the existing header-parsing logic or assert the invariant.
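A minimal sketch of "derive and assert" follows. The field sizes and the 8-byte alignment are taken from the description in this comment, not checked against the encoder, and all names here (`NUM_LEVELS_BYTES`, `CHUNK_ALIGNMENT`, `pad_to`, `chunk_header_bytes`) are hypothetical.

```rust
/// Size of the `num_levels` field, per the description above (assumed).
const NUM_LEVELS_BYTES: u64 = 2;
/// Alignment the chunk header is padded to (assumed).
const CHUNK_ALIGNMENT: u64 = 8;

/// Round `n` up to the next multiple of `align` (`align` must be non-zero).
const fn pad_to(n: u64, align: u64) -> u64 {
    (n + align - 1) / align * align
}

/// Header size for a trivial miniblock chunk: `num_levels` plus a 2-byte or
/// 4-byte `buffer_size`, padded to `CHUNK_ALIGNMENT`.
const fn chunk_header_bytes(has_large_chunk: bool) -> u64 {
    let buffer_size_bytes = if has_large_chunk { 4 } else { 2 };
    pad_to(NUM_LEVELS_BYTES + buffer_size_bytes, CHUNK_ALIGNMENT)
}

// Compile-time checks: if the assumed field sizes or alignment change so that
// the header is no longer 8 bytes, the build fails instead of producing
// wrong offsets at run time.
const _: () = assert!(chunk_header_bytes(false) == 8);
const _: () = assert!(chunk_header_bytes(true) == 8);
```

The derivation doubles as documentation, and the `const` assertions turn a format change into a build failure rather than a silent offset error.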

column_index relies on .last() (minor)

column_index returns proj.column_indices.last() with the comment "handles FSL child resolution." This works because from_column_names returns parent + child indices for composite types. A brief doc comment explaining why .last() is the right choice (the last index is the leaf data column) would help future readers.
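The kind of doc comment meant here might look like the sketch below. `Projection` is a hypothetical stand-in for whatever `from_column_names` returns, and the parent-before-child ordering is the assumption stated above rather than something verified in this review.

```rust
/// Hypothetical stand-in for the projection returned by `from_column_names`.
struct Projection {
    /// Column indices, assumed to be ordered parent first, children after.
    column_indices: Vec<u32>,
}

/// Returns the index of the leaf data column for a projected field.
///
/// For a scalar column the projection holds a single index, so `.last()` is
/// that index. For a fixed-size-list (FSL) column the projection is assumed
/// to hold `[parent, child]`; the values live in the child column, so the
/// last index is the one to read. Returns `None` when the projection is
/// empty, i.e. the name did not resolve to any column.
fn leaf_column_index(proj: &Projection) -> Option<u32> {
    proj.column_indices.last().copied()
}
```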

wkalt added the "WIP" (work in progress) label Mar 23, 2026

wkalt (Contributor, Author) commented Mar 23, 2026

This is a proof of concept for a fast path for KNN scanning over in-memory Lance data pages. It is not expected to merge, or to be the approach we ultimately take.
