fix: handle list-level NULLs in NOT filters#6044
fix: handle list-level NULLs in NOT filters#6044fenfeng9 wants to merge 9 commits intolance-format:mainfrom
Conversation
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
|
@wjones127 @westonpace Could you please review this when you have a moment? Thanks! |
There was a problem hiding this comment.
Nice work. I think the one thing I would change is put the nulls in a global buffer. Here's an example of how we do something similar for IVF indexes:
lance/rust/lance/src/index/vector/ivf.rs
Lines 2024 to 2026 in ca8fb1b
I was going to say we should also add more compat tests, but I think we are already well covered there. You can see the warnings are working in our label list compat test:
https://github.com/lance-format/lance/actions/runs/22492208650/job/65160507370?pr=6044#step:6:117
That is defined here:
lance/python/python/tests/compat/test_scalar_indices.py
Lines 75 to 125 in ca8fb1b
|
|
||
| pub const BITMAP_LOOKUP_NAME: &str = "bitmap_page_lookup.lance"; | ||
| const LABEL_LIST_INDEX_VERSION: u32 = 0; | ||
| pub const LABEL_LIST_NULLS_NAME: &str = "label_list_nulls.lance"; |
There was a problem hiding this comment.
If we are just writing a single buffer, why don't we instead use a global buffer in one of the existing index files? That will cut down number of files needed.
There was a problem hiding this comment.
Thanks for the review and the suggestion! I’ll look into the global buffer idea and get back to you.
There was a problem hiding this comment.
I took a look at what it would take to do this on the scalar side.
The main difference from the IVF example is that the scalar index path currently goes through the IndexStore / IndexReader / IndexWriter abstraction:
lance/rust/lance-index/src/scalar.rs
Lines 170 to 179 in aac74b4
and that abstraction is centered around record-batch files. It does not currently expose global buffer read/write directly.
There is also a second constraint here: LabelList currently reuses the bitmap index train/update/remap paths, so moving the null map into the same index file would likely require some refactoring in that bitmap-backed flow as well.
lance/rust/lance-index/src/scalar/label_list.rs
Lines 443 to 452 in aac74b4
There was a problem hiding this comment.
I think we can move this to a global buffer, but on the scalar side it's a bit more than just moving the null map. We'd probably need to add global-buffer read/write support to the current scalar reader/writer path, and also adjust the bitmap-backed LabelList train/update/remap flow so the null map can be written into the same index file.
So before I go further with that, I wanted to check with you whether you'd like me to also convert LabelList to a global-buffer-based layout in this PR.
There was a problem hiding this comment.
Yeah I would think adding something like:
pub trait IndexWriter: Send {
async fn add_global_buffer(&mut self, data: Bytes) -> Result<()>;
}Would make sense. @westonpace do you have any objection?
There was a problem hiding this comment.
Yeah I would think adding something like:
pub trait IndexWriter: Send { async fn add_global_buffer(&mut self, data: Bytes) -> Result<()>; }
Thanks for the suggestion! I'll implement this and update the PR soon.
77dc90c to
8ebd55d
Compare
|
I updated the PR in that direction. I added global buffer support to the scalar |
| state: HashMap<ScalarValue, RowAddrTreeMap>, | ||
| store: &dyn IndexStore, | ||
| value_type: &DataType, | ||
| list_nulls: RowAddrTreeMap, |
There was a problem hiding this comment.
nitpick: I think you could pass this as a reference:
| list_nulls: RowAddrTreeMap, | |
| list_nulls: &RowAddrTreeMap, |
closes #5904
LabelListIndexdrops list-level NULL rows during unnest, soNOTfilters can return NULL lists incorrectly.Changed: