-
Notifications
You must be signed in to change notification settings - Fork 132
VarBinBuilder: eagerly set IsSorted stat on offsets #6452
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
VarBinBuilder guarantees monotonically increasing offsets, but finish() wasn't setting the IsSorted stat. This caused an O(n) recomputation every time a deserialized VarBinArray was validated, which showed up as a hot path in production profiles. Signed-off-by: Dimitar Dimitrov <[email protected]>
|
Can also do this for runend varbinview if not |
Signed-off-by: Dimitar Dimitrov <[email protected]>
i couldn't find where in the VarBinViewArrayBuilder we create runend offsets for VarBin(View) arrays. Can you point me pls? |
|
varbinview:
also maybe sequence array: |
RunEndArray::validate didn't check that ends are strictly increasing, which is required for binary search to work correctly. Also caches IsSorted and IsStrictSorted on the ends after construction so downstream consumers don't recompute it. Signed-off-by: Dimitar Dimitrov <[email protected]>
Merging this PR will degrade performance by 10.64%
Performance Changes
Comparing Footnotes
|
SequenceArray (A[i] = base + i * multiplier) can determine sortedness from the sign of the multiplier at construction time. Set both stats eagerly so we skip the kernel entirely. Signed-off-by: Dimitar Dimitrov <[email protected]>
joseph-isaacs
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need to be careful on validation cost on the release path
| let stats_set = ArrayStats::default(); | ||
| stats_set.set(Stat::IsSorted, StatPrecision::Exact(is_sorted.into())); | ||
| stats_set.set( | ||
| Stat::IsStrictSorted, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
think we can get away with a statically alloced CoW here so we don't need to alloc a stats set here?
Not sure.
Maybe also include min and max and other stats?
| // and strictly sorted iff multiplier > 0. | ||
| let (is_sorted, is_strict_sorted) = match_each_native_ptype!(ptype, |P| { | ||
| let m = multiplier.cast::<P>(); | ||
| (m >= zero::<P>(), m > zero::<P>()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure we allow a zero multiplier. if no we should not allow?
| // Run ends are always strictly sorted (and therefore sorted) by invariant. | ||
| // Cache this so downstream consumers don't need to recompute it. | ||
| ends.statistics() | ||
| .set(Stat::IsStrictSorted, StatPrecision::Exact(true.into())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this run in release? it's very slow.
Can we set this on compression?
| } | ||
|
|
||
| // Run ends must be strictly sorted for binary search to work correctly. | ||
| if let Some(is_strict_sorted) = ends.statistics().compute_is_strict_sorted() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is too expensive
Polar Signals Profiling ResultsLatest Run
Powered by Polar Signals Cloud |
Benchmarks: PolarSignals ProfilingSummary
Detailed Results Table
|
Benchmarks: TPC-H SF=1 on NVMESummary
Detailed Results Table
|
Benchmarks: FineWeb NVMeSummary
Detailed Results Table
|
Benchmarks: Random AccessSummary
Detailed Results Table
|
Benchmarks: TPC-H SF=1 on S3Summary
Detailed Results Table
|
Benchmarks: TPC-DS SF=1 on NVMESummary
Detailed Results Table
|
Benchmarks: TPC-H SF=10 on NVMESummary
Detailed Results Table
|
Benchmarks: FineWeb S3Summary
Detailed Results Table
|
Benchmarks: Statistical and Population GeneticsSummary
Detailed Results Table
|
Benchmarks: TPC-H SF=10 on S3Summary
Detailed Results Table
|
Benchmarks: CompressionSummary
Detailed Results Table
|
Benchmarks: Clickbench on NVMESummary
Detailed Results Table
|
Does this PR closes an open issue or discussion?
no
What this PR does?
VarBinBuilder guarantees monotonically increasing offsets, but finish() wasn't setting the IsSorted stat. This caused an O(n) recomputation every time a deserialized VarBinArray was validated.
How is this change tested?
unit tests
Are there any user-facing changes?
VarBin arrays now persist stats for their offsets child array. Existing persisted arrays will still be verified as sorted upon reading from disk.