
feat: support hamming clustering #6265

Open
jackye1995 wants to merge 5 commits into lance-format:main from jackye1995:hamming

@jackye1995
Contributor

Add support for SIMD-accelerated pairwise Hamming distance computation, and for clustering binary vectors that lie within a given Hamming distance threshold of each other. Such vectors are considered similar, or potential duplicates, in the original representation.

Also expose the feature in Python for easy consumption.
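
For readers unfamiliar with the technique, here is a minimal scalar sketch of threshold-based Hamming clustering. It is not the SIMD implementation in this PR: the names `hamming` and `cluster_by_threshold` are hypothetical, the greedy first-come assignment is an assumption about one possible strategy, and emitting only clusters that have at least one duplicate is also an assumption. The `(representative, duplicates)` pairs only loosely mirror the output schema described in the commits.

```rust
/// Hamming distance between two equal-length binary vectors packed into bytes.
/// Scalar version: XOR each byte pair and count the set bits.
fn hamming(a: &[u8], b: &[u8]) -> u32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same length");
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}

/// Greedy clustering sketch (hypothetical, O(n^2)).
/// Each unassigned vector becomes a representative and absorbs every later
/// unassigned vector within `threshold` bits of it.
/// Input: (row id, packed binary vector) pairs.
/// Output: (representative row id, duplicate row ids) pairs.
fn cluster_by_threshold(hashes: &[(u64, Vec<u8>)], threshold: u32) -> Vec<(u64, Vec<u64>)> {
    let mut assigned = vec![false; hashes.len()];
    let mut clusters = Vec::new();
    for i in 0..hashes.len() {
        if assigned[i] {
            continue;
        }
        assigned[i] = true;
        let mut duplicates = Vec::new();
        for j in (i + 1)..hashes.len() {
            if !assigned[j] && hamming(&hashes[i].1, &hashes[j].1) <= threshold {
                assigned[j] = true;
                duplicates.push(hashes[j].0);
            }
        }
        // Assumption: singletons (no duplicates found) are not reported.
        if !duplicates.is_empty() {
            clusters.push((hashes[i].0, duplicates));
        }
    }
    clusters
}
```

Note that greedy assignment depends on input order: a vector joins the first representative it is close to, even if a later one would be closer.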

@github-actions github-actions bot added the enhancement (New feature or request) and python labels Mar 24, 2026
@github-actions
Contributor

github-actions bot commented Mar 24, 2026

PR Review: feat: support hamming clustering

P1: total_rows silently drops fragments

let total_rows: usize = dataset
    .get_fragments()
    .iter()
    .filter_map(|f| f.metadata().physical_rows)  // silently skips None
    .sum();

If any fragment has physical_rows = None, the total is undercounted. This can lead to incorrect sampling behavior (sampling more than intended, or use_sampling being wrong). Consider using dataset.count_rows(None).await? instead.
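
The undercount can be reproduced in isolation. The sketch below models each fragment's `physical_rows` as a plain `Option<usize>`. The function names are hypothetical and this is not the Lance API; it only contrasts the `filter_map` pattern above with a strict sum that surfaces a missing count instead of hiding it.

```rust
/// Mirrors the pattern flagged above: fragments with an unknown row count
/// are skipped, so the total can be silently too small.
fn total_rows_lossy(counts: &[Option<usize>]) -> usize {
    counts.iter().filter_map(|c| *c).sum()
}

/// Strict alternative: summing into `Option<usize>` yields `None` as soon as
/// any fragment's count is unknown, so the caller must handle that case
/// (for example by falling back to an exact row count).
fn total_rows_strict(counts: &[Option<usize>]) -> Option<usize> {
    counts.iter().copied().sum()
}
```

With counts of `Some(100)`, `None`, `Some(50)`, the lossy version reports 150 rows while the strict version returns `None`, which makes the missing metadata visible at the call site.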

@codecov

codecov bot commented Mar 24, 2026

Codecov Report

❌ Patch coverage is 80.92158% with 236 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| rust/lance-linalg/src/distance/hamming.rs | 85.44% | 117 Missing and 6 partials ⚠️ |
| rust/lance/src/index/vector/hamming.rs | 71.17% | 94 Missing and 19 partials ⚠️ |


jackye1995 and others added 4 commits March 23, 2026 22:04
- Change return type from dict/struct to Box<dyn RecordBatchReader + Send>
- Output schema: representative (uint64), duplicates (list<uint64>)
- ClusteringResult::into_reader() yields batches of 10k clusters
- Rename hamming_cluster_hashes -> hamming_clustering_from_hashes
- Log timing info via tracing instead of returning in struct
- Python bindings return pa.RecordBatchReader

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
Use take_rows() which returns _rowid column, instead of using
positional indices from sample() as row IDs. This ensures the
cluster results contain actual row IDs that can be used for
downstream operations like deleting duplicates.

- Fix sampling path to request _rowid column explicitly in take_rows projection
- Add integration tests for IVF partition clustering and sampled clustering
- Remove .unwrap() in Python binding closures, use ? operator
- Change to_record_batch to into_record_batch to avoid cloning
