Conversation
Signed-off-by: Andrew Duffy <[email protected]>
Force-pushed from 5d03562 to b99febe (compare)
Pull request overview
Adds a new GHArchive-backed random access benchmark path, including nested field projection support for both Parquet (Arrow) and Vortex readers.
Changes:
- Introduces a GHArchive dataset module that downloads/converts GitHub Archive events into Parquet/Vortex for benchmarking.
- Adds projecting random-access traits/accessors and Parquet/Vortex implementations to benchmark nested field reads.
- Extends the random-access benchmark CLI to select between Taxi and GHArchive datasets and emits separate benchmark IDs.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.
Summary per file:
| File | Description |
|---|---|
| vortex-bench/src/random_access/take.rs | Adds projecting accessors and Parquet/Vortex “take with projection” implementations. |
| vortex-bench/src/random_access/mod.rs | Exposes new projecting accessor types and defines FieldPath + ProjectingRandomAccessor trait (see the sketch below the table). |
| vortex-bench/src/datasets/mod.rs | Registers the new gharchive dataset module. |
| vortex-bench/src/datasets/gharchive.rs | Implements GHArchive dataset download/conversion and exposes useful nested-field paths. |
| benchmarks/random-access-bench/src/main.rs | Adds dataset selection and a projected-random-access benchmark path for GHArchive. |
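For orientation, one plausible shape for the new projecting-accessor surface. This is a hypothetical sketch; the actual definitions live in vortex-bench/src/random_access/mod.rs and may differ:

```rust
// Hypothetical sketch only -- the real FieldPath and ProjectingRandomAccessor
// live in vortex-bench/src/random_access/mod.rs and may look different.
use arrow_array::RecordBatch;

/// A dotted path into a nested schema, e.g. "actor.login".
pub struct FieldPath(pub Vec<String>);

/// Random access that fetches rows by index while projecting a single
/// (possibly nested) field rather than the full schema.
#[async_trait::async_trait]
pub trait ProjectingRandomAccessor {
    async fn take_with_projection(
        &self,
        indices: &[u64],
        path: &FieldPath,
    ) -> anyhow::Result<RecordBatch>;
}
```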
```rust
let row_group_indices = row_groups
    .keys()
    .sorted()
    .map(|i| row_groups[i].clone())
    .collect_vec();

let reader = builder
    .with_projection(projection_mask)
    .with_row_groups(row_groups.keys().copied().collect_vec())
```
`row_group_indices` is built from sorted `row_groups.keys()`, but `with_row_groups(row_groups.keys().copied().collect_vec())` passes row groups in `HashMap` iteration order. The stream yields batches in the order requested, so indexing `row_group_indices[idx]` can apply the wrong per-row-group indices (or panic). Build a single `sorted_row_groups` vec once, and use it both to construct `row_group_indices` and to pass into `with_row_groups(...)`.
Suggested change:

```diff
-let row_group_indices = row_groups
-    .keys()
-    .sorted()
-    .map(|i| row_groups[i].clone())
-    .collect_vec();
-let reader = builder
-    .with_projection(projection_mask)
-    .with_row_groups(row_groups.keys().copied().collect_vec())
+// Build a single, consistently ordered list of row group indices.
+let sorted_row_groups = row_groups.keys().copied().sorted().collect_vec();
+let row_group_indices = sorted_row_groups
+    .iter()
+    .map(|i| row_groups[i].clone())
+    .collect_vec();
+let reader = builder
+    .with_projection(projection_mask)
+    .with_row_groups(sorted_row_groups)
```
```rust
for idx in indices {
    let row_group_idx = row_group_offsets
        .binary_search(&(idx as i64))
        .unwrap_or_else(|e| e - 1);
    row_groups
        .entry(row_group_idx)
        .or_insert_with(Vec::new)
        .push((idx as i64) - row_group_offsets[row_group_idx]);
```
`row_group_offsets` includes the final total row count, so an index equal to (or greater than) the total number of rows can produce `row_group_idx == row_group_count`, which is not a valid row group index. This later leads to out-of-bounds math and/or an invalid row group request. Consider validating up front that all indices are `< total_rows` (and returning a clear error) before computing row group selections.
Suggested change:

```diff
-for idx in indices {
-    let row_group_idx = row_group_offsets
-        .binary_search(&(idx as i64))
-        .unwrap_or_else(|e| e - 1);
-    row_groups
-        .entry(row_group_idx)
-        .or_insert_with(Vec::new)
-        .push((idx as i64) - row_group_offsets[row_group_idx]);
+// The last entry is the total number of rows across all row groups.
+let total_rows = *row_group_offsets.last().unwrap_or(&0);
+for idx in indices {
+    let idx_i64 = idx as i64;
+    // Skip indices that are out of bounds for the file to avoid invalid row group indices.
+    if idx_i64 < 0 || idx_i64 >= total_rows {
+        continue;
+    }
+    let row_group_idx = row_group_offsets
+        .binary_search(&idx_i64)
+        .unwrap_or_else(|e| e - 1);
+    row_groups
+        .entry(row_group_idx)
+        .or_insert_with(Vec::new)
+        .push(idx_i64 - row_group_offsets[row_group_idx]);
```
```rust
.enumerate()
.map(|(idx, batch)| {
    let batch = batch.unwrap();
    let indices = PrimitiveArray::<Int64Type>::from(row_group_indices[idx].clone());
    take_record_batch(&batch, &indices).unwrap()
})
```
This stream processing uses `unwrap()` for both the Parquet batch and `take_record_batch(...)`. Any IO/decoding error or invalid index will panic the entire benchmark run. Prefer propagating these as `anyhow::Result` (e.g., map to `Result<RecordBatch>` and `try_collect`) so failures surface as errors rather than panics.
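A minimal sketch of that error-propagating shape, assuming the enclosing function returns `anyhow::Result` and `stream` names the Parquet record batch stream (identifiers here are illustrative):

```rust
use anyhow::Context;
use futures::{StreamExt, TryStreamExt};

// Map each batch to a Result and fail the whole collect on the first error,
// instead of panicking inside the stream.
let batches: Vec<RecordBatch> = stream
    .enumerate()
    .map(|(idx, batch)| -> anyhow::Result<RecordBatch> {
        let batch = batch.context("failed to decode Parquet record batch")?;
        let indices = PrimitiveArray::<Int64Type>::from(row_group_indices[idx].clone());
        take_record_batch(&batch, &indices).context("take with out-of-range indices")
    })
    .try_collect()
    .await?;
```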
```rust
}
Dataset::GhArchive => {
    // Run gharchive benchmark with nested field projection.
    // The field path is payload.ref - a deeply nested string field.
```
The comment says the field path is `payload.ref`, but the code actually benchmarks `actor.login`. Please update the comment (or change the field path) so the benchmark output/intent is clear and doesn't drift from the implementation.
Suggested change:

```diff
-// The field path is payload.ref - a deeply nested string field.
+// The field path is actor.login - a deeply nested string field.
```
```rust
// For now, use the same path as OnDiskVortex (compact not yet implemented for gharchive)
let path = gharchive_vortex().await?;
Ok(Box::new(VortexProjectingAccessor::compact(path)))
```
`Format::VortexCompact` currently uses `gharchive_vortex()` (non-compact) but labels the accessor/target as compact via `VortexProjectingAccessor::compact(...)`. This will misreport results and can confuse comparisons. Either use `gharchive_vortex_compact()` here, or return `unimplemented!`/an error until a compact dataset path is available.
Suggested change:

```diff
-// For now, use the same path as OnDiskVortex (compact not yet implemented for gharchive)
-let path = gharchive_vortex().await?;
-Ok(Box::new(VortexProjectingAccessor::compact(path)))
+unimplemented!("Compact gharchive dataset path is not yet implemented");
```
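If a recoverable error is preferred over a panic here, a minimal alternative sketch (assuming the enclosing function returns `anyhow::Result`):

```rust
// Alternative sketch: fail with an error instead of panicking,
// so callers can handle the missing dataset gracefully.
anyhow::bail!("compact gharchive dataset is not yet implemented")
```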
Benchmark bot reports (summary and detailed results tables collapsed):
- PolarSignals Profiling
- TPC-H SF=1 on NVMe
- FineWeb NVMe
- Random Access
- TPC-H SF=1 on S3
- TPC-DS SF=1 on NVMe
- TPC-H SF=10 on NVMe
- Statistical and Population Genetics
- 🚨 SQL BENCHMARK FAILED 🚨
- Clickbench on NVMe
- Compression