feat: support hamming clustering by jackye1995 · Pull Request #6265 · lance-format/lance

jackye1995 · 2026-03-24T04:31:08Z

Add support for SIMD-accelerated pairwise Hamming distance computation, and for grouping binary vectors that lie within a given Hamming distance threshold into clusters. Vectors in the same cluster are considered similar, or potential duplicates, in their original representation.

Also expose the feature in Python for easy consumption.
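For readers unfamiliar with the technique, the idea can be sketched in plain scalar Rust. This is only an illustration under stated assumptions, not the PR's implementation: the function names `hamming` and `cluster_by_threshold` are hypothetical, the clustering is a simple greedy O(n²) pass, and no explicit SIMD is used.

```rust
/// Hamming distance between two equal-length binary vectors packed as bytes.
/// (Hypothetical helper for illustration; scalar XOR + popcount.)
fn hamming(a: &[u8], b: &[u8]) -> u32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same length");
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}

/// Greedy threshold clustering (an assumed strategy, not necessarily the PR's):
/// each unassigned vector becomes a representative and absorbs every later
/// unassigned vector whose distance to it is at most `threshold`.
/// Returns (representative index, duplicate indices) pairs.
fn cluster_by_threshold(vectors: &[Vec<u8>], threshold: u32) -> Vec<(usize, Vec<usize>)> {
    let mut assigned = vec![false; vectors.len()];
    let mut clusters = Vec::new();
    for i in 0..vectors.len() {
        if assigned[i] {
            continue;
        }
        assigned[i] = true;
        let mut duplicates = Vec::new();
        for j in (i + 1)..vectors.len() {
            if !assigned[j] && hamming(&vectors[i], &vectors[j]) <= threshold {
                assigned[j] = true;
                duplicates.push(j);
            }
        }
        clusters.push((i, duplicates));
    }
    clusters
}
```

In this sketch a vector with no neighbors within the threshold still forms its own cluster with an empty duplicate list; whether the PR reports such singletons is not stated here.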

github-actions · 2026-03-24T04:32:39Z

PR Review: feat: support hamming clustering

P1: `total_rows` silently drops fragments

let total_rows: usize = dataset
    .get_fragments()
    .iter()
    .filter_map(|f| f.metadata().physical_rows)  // silently skips None
    .sum();

If any fragment has `physical_rows = None`, the total is undercounted. This can lead to incorrect sampling behavior (sampling more than intended, or `use_sampling` being wrong). Consider using `dataset.count_rows(None).await?` instead.
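The undercount can be reproduced without Lance at all. The sketch below models each fragment's row count as an `Option<usize>`; both function names are hypothetical and exist only to contrast the lossy sum with one that surfaces missing metadata so the caller can fall back to an exact count.

```rust
/// Mirrors the flagged pattern: fragments with an unknown row count are
/// silently skipped, so the total can be too small.
fn total_rows_lossy(physical_rows: &[Option<usize>]) -> usize {
    physical_rows.iter().filter_map(|r| *r).sum()
}

/// Returns None as soon as any fragment lacks a row count, letting the
/// caller fall back to an exact count instead of trusting a partial sum.
fn total_rows_checked(physical_rows: &[Option<usize>]) -> Option<usize> {
    physical_rows.iter().copied().sum()
}
```

With counts of `Some(100)`, `None`, and `Some(50)`, the lossy version reports 150 rows while the checked version returns `None`.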

codecov · 2026-03-24T05:02:27Z

Codecov Report

❌ Patch coverage is 80.92158% with 236 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| rust/lance-linalg/src/distance/hamming.rs | 85.44% | 117 Missing and 6 partials ⚠️ |
| rust/lance/src/index/vector/hamming.rs | 71.17% | 94 Missing and 19 partials ⚠️ |


- Change return type from dict/struct to Box<dyn RecordBatchReader + Send>
- Output schema: representative (uint64), duplicates (list<uint64>)
- ClusteringResult::into_reader() yields batches of 10k clusters
- Rename hamming_cluster_hashes -> hamming_clustering_from_hashes
- Log timing info via tracing instead of returning in struct
- Python bindings return pa.RecordBatchReader

Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering)
Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
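The batching behavior described in this commit can be illustrated without Arrow. The following is a minimal sketch using plain Rust types; the `Cluster` struct, `BATCH_SIZE` constant, and `into_batches` function are hypothetical stand-ins for the record-batch reader, chosen only to mirror the stated schema (one `representative` and a list of `duplicates` per row) and the 10k-cluster batch size.

```rust
/// One output row, mirroring the described schema:
/// representative (uint64), duplicates (list<uint64>).
#[derive(Debug, Clone, PartialEq)]
struct Cluster {
    representative: u64,
    duplicates: Vec<u64>,
}

/// Batch size taken from the commit message ("batches of 10k clusters").
const BATCH_SIZE: usize = 10_000;

/// Lazily yields the clusters in consecutive batches of at most `batch_size`,
/// consuming the input so no cluster is cloned.
fn into_batches(
    clusters: Vec<Cluster>,
    batch_size: usize,
) -> impl Iterator<Item = Vec<Cluster>> {
    let mut remaining = clusters.into_iter();
    std::iter::from_fn(move || {
        let batch: Vec<Cluster> = remaining.by_ref().take(batch_size).collect();
        if batch.is_empty() {
            None
        } else {
            Some(batch)
        }
    })
}
```

Yielding batches lazily keeps the consumer from having to hold a second fully materialized copy of the result, which is presumably part of the motivation for returning a reader rather than a single struct.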

Use take_rows(), which returns the _rowid column, instead of using positional indices from sample() as row IDs. This ensures the cluster results contain actual row IDs that can be used for downstream operations like deleting duplicates.

- Fix sampling path to request _rowid column explicitly in take_rows projection
- Add integration tests for IVF partition clustering and sampled clustering
- Remove .unwrap() in Python binding closures, use ? operator
- Change to_record_batch to into_record_batch to avoid cloning


feat: support hamming clustering

8c69ce8

github-actions bot added the enhancement and python labels on Mar 24, 2026

jackye1995 and others added 4 commits March 23, 2026 22:04

fix: escape angle brackets in rustdoc comments

872ad24


jackye1995 wants to merge 5 commits into lance-format:main from jackye1995:hamming
