feat: public APIs for trivial column access (raw byte ranges)#6261
feat: public APIs for trivial column access (raw byte ranges)#6261wkalt wants to merge 1 commit intolance-format:mainfrom
Conversation
Add APIs that let downstream consumers read raw value bytes directly from the file without going through the decoder pipeline, when the column encoding is simple enough (non-nullable, fixed-width, flat, uncompressed). New APIs: - ColumnInfo::is_trivially_decodable() — true when on-disk bytes match the in-memory representation. Handles MiniBlock, FullZip, and legacy v2.0 flat encodings. - ColumnInfo::trivial_page_descriptors() — returns per-page metadata (buffer offsets, bits_per_value, has_large_chunk) for locating raw data in the file. - miniblock_raw_data_ranges() — parses miniblock chunk metadata words to produce (offset, length, num_rows) tuples pointing past headers to raw value data. - PageInfo::miniblock_layout() / full_zip_layout() — extract the specific layout variant from the encoding protobuf. - CachedFileMetadata::column_index(name) — maps a column name to its index in column_infos, handling FSL child resolution. These enable a speedup for brute-force KNN when vector data is cached in memory, by allowing the caller to reinterpret cached pages as &[f32] without Arrow allocation or DataFusion plan execution. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PR Review: feat: public APIs for trivial column access (raw byte ranges)Missing Tests (P0)Per project policy: "All bugfixes and features must have corresponding tests. We do not merge code without tests." This PR adds ~250 lines of new public API surface with zero tests. At minimum, the following should be tested:
Silent
|
|
this is proof of concept for a fast path to KNN scanning in-memory lance data pages. Not expected to merge or be the approach we ultimately take. |
Add APIs that let downstream consumers read raw value bytes directly from the file without going through the decoder pipeline, when the column encoding is simple enough (non-nullable, fixed-width, flat, uncompressed).
New APIs:
These enable a speedup for brute-force KNN when vector data is cached in memory, by allowing the caller to reinterpret cached pages as &[f32] without Arrow allocation or DataFusion plan execution.