Skip to content

fix: handle list-level NULLs in NOT filters#6044

Open
fenfeng9 wants to merge 9 commits intolance-format:mainfrom
fenfeng9:fix/label-list-null-not
Open

fix: handle list-level NULLs in NOT filters#6044
fenfeng9 wants to merge 9 commits intolance-format:mainfrom
fenfeng9:fix/label-list-null-not

Conversation

@fenfeng9
Copy link
Contributor

@fenfeng9 fenfeng9 commented Feb 27, 2026

closes #5904

LabelListIndex drops list-level NULL rows during unnest, so NOT filters can return NULL lists incorrectly.

Changed:

  • persists list-level NULL rows in the LabelList index and merges them back into query null sets
  • keeps older bitmap-only LabelList index layouts readable by treating missing null-set metadata as empty
  • warns when opening an older LabelList index on a nullable column, since NOT-filter semantics may still be incomplete until the index is rebuilt

@github-actions github-actions bot added bug Something isn't working python labels Feb 27, 2026
@codecov
Copy link

codecov bot commented Feb 27, 2026

@fenfeng9
Copy link
Contributor Author

fenfeng9 commented Mar 2, 2026

@wjones127 @westonpace Could you please review this when you have a moment? Thanks!

Copy link
Contributor

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work. I think the one thing I would change is put the nulls in a global buffer. Here's an example of how we do something similar for IVF indexes:

// Add IVF protobuf as a global buffer and reference via IVF_METADATA_KEY.
let pos = v2_writer.add_global_buffer(ivf_bytes).await?;
v2_writer.add_schema_metadata(IVF_METADATA_KEY, pos.to_string());

I was going to say we should also add more compat tests, but I think we are already well covered there. You can see the warnings are working in our label list compat test:

https://github.com/lance-format/lance/actions/runs/22492208650/job/65160507370?pr=6044#step:6:117

That is defined here:

@compat_test(min_version="0.22.0")
class BitmapLabelListIndex(UpgradeDowngradeTest):
"""Test BITMAP and LABEL_LIST scalar index compatibility (introduced in 0.20.0).
Started fully working in 0.22.0 with fixes to LABEL_LIST index.
"""
def __init__(self, path: Path):
self.path = path
def create(self):
"""Create dataset with BITMAP and LABEL_LIST indices."""
shutil.rmtree(self.path, ignore_errors=True)
data = pa.table(
{
"idx": pa.array(range(1000)),
"bitmap": pa.array(range(1000)),
"label_list": pa.array([[f"label{i}"] for i in range(1000)]),
}
)
dataset = lance.write_dataset(data, self.path, max_rows_per_file=100)
dataset.create_scalar_index("bitmap", "BITMAP")
dataset.create_scalar_index("label_list", "LABEL_LIST")
def check_read(self):
"""Verify BITMAP and LABEL_LIST indices can be queried."""
ds = lance.dataset(self.path)
# Test BITMAP index
table = ds.to_table(filter="bitmap == 7")
assert table.num_rows == 1
assert table.column("idx").to_pylist() == [7]
# Test LABEL_LIST index
table = ds.to_table(filter="array_has_any(label_list, ['label7'])")
assert table.num_rows == 1
assert table.column("idx").to_pylist() == [7]
def check_write(self):
"""Verify can insert data and optimize indices."""
ds = lance.dataset(self.path)
data = pa.table(
{
"idx": pa.array([1000]),
"bitmap": pa.array([1000]),
"label_list": pa.array([["label1000"]]),
}
)
ds.insert(data)
ds.optimize.optimize_indices()
ds.optimize.compact_files()


pub const BITMAP_LOOKUP_NAME: &str = "bitmap_page_lookup.lance";
const LABEL_LIST_INDEX_VERSION: u32 = 0;
pub const LABEL_LIST_NULLS_NAME: &str = "label_list_nulls.lance";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are just writing a single buffer, why don't we instead use a global buffer in one of the existing index files? That will cut down number of files needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review and the suggestion! I’ll look into the global buffer idea and get back to you.

Copy link
Contributor Author

@fenfeng9 fenfeng9 Mar 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a look at what it would take to do this on the scalar side.

The main difference from the IVF example is that the scalar index path currently goes through the IndexStore / IndexReader / IndexWriter abstraction:

pub trait IndexWriter: Send {
/// Writes a record batch into the file, returning the 0-based index of the batch in the file
///
/// E.g. if this is the third time this is called this method will return 2
async fn write_record_batch(&mut self, batch: RecordBatch) -> Result<u64>;
/// Finishes writing the file and closes the file
async fn finish(&mut self) -> Result<()>;
/// Finishes writing the file and closes the file with additional metadata
async fn finish_with_metadata(&mut self, metadata: HashMap<String, String>) -> Result<()>;
}

and that abstraction is centered around record-batch files. It does not currently expose global buffer read/write directly.

There is also a second constraint here: LabelList currently reuses the bitmap index train/update/remap paths, so moving the null map into the same index file would likely require some refactoring in that bitmap-backed flow as well.

let data = unnest_chunks(data)?;
let bitmap_plugin = BitmapIndexPlugin;
bitmap_plugin
.train_index(data, index_store, request, fragment_ids, progress)
.await?;
Ok(CreatedIndex {
index_details: prost_types::Any::from_msg(&pbold::LabelListIndexDetails::default())
.unwrap(),
index_version: LABEL_LIST_INDEX_VERSION,
})

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can move this to a global buffer, but on the scalar side it's a bit more than just moving the null map. We'd probably need to add global-buffer read/write support to the current scalar reader/writer path, and also adjust the bitmap-backed LabelList train/update/remap flow so the null map can be written into the same index file.
So before I go further with that, I wanted to check with you whether you'd like me to also convert LabelList to a global-buffer-based layout in this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I would think adding something like:

pub trait IndexWriter: Send { 
     async fn add_global_buffer(&mut self, data: Bytes) -> Result<()>;
}

Would make sense. @westonpace do you have any objection?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I would think adding something like:

pub trait IndexWriter: Send { 
     async fn add_global_buffer(&mut self, data: Bytes) -> Result<()>;
}

Thanks for the suggestion! I'll implement this and update the PR soon.

@wjones127 wjones127 self-assigned this Mar 2, 2026
@fenfeng9 fenfeng9 requested a review from wjones127 March 13, 2026 13:24
@fenfeng9 fenfeng9 force-pushed the fix/label-list-null-not branch from 77dc90c to 8ebd55d Compare March 16, 2026 08:33
@fenfeng9
Copy link
Contributor Author

I updated the PR in that direction.

I added global buffer support to the scalar reader/writer path, and LabelList now stores list-level nulls in the bitmap file instead of a separate file. The buffer index is recorded in metadata and read back from there.

state: HashMap<ScalarValue, RowAddrTreeMap>,
store: &dyn IndexStore,
value_type: &DataType,
list_nulls: RowAddrTreeMap,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: I think you could pass this as a reference:

Suggested change
list_nulls: RowAddrTreeMap,
list_nulls: &RowAddrTreeMap,

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working python

Projects

None yet

Development

Successfully merging this pull request may close these issues.

LabelListIndex: NOT filters mis-handle NULL lists (list-level NULLs)

2 participants