
perf: batched WAND and new WAND structure, ~50% faster#6241

Merged
BubbleCal merged 7 commits into main from yang/batched-wand on Mar 23, 2026
Conversation

@BubbleCal
Contributor

@BubbleCal BubbleCal commented Mar 20, 2026

  • advances posting iterators in batch
  • splits posting iterators into head, lead and tail, reducing the cost of updating posting iterators
  • reuses query_weight
  • prunes documents more aggressively with block-max scores
  • uses a Lucene-style conjunction path for phrase queries, with conjunction intersection and block-max pruning separated from OR WAND
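
The head/lead/tail split above can be sketched roughly as follows. All names and fields here are illustrative stand-ins, not the actual `wand.rs` API: the idea is that terms whose summed upper bounds stay below the score threshold can be parked in a tail that is never advanced, so only head/lead iterators pay the update cost.

```rust
// Hypothetical sketch of the head/lead/tail partition; not the wand.rs API.

#[derive(Debug, Clone)]
struct Posting {
    doc_id: u64,      // doc the iterator is currently positioned on
    upper_bound: f32, // max possible score contribution of this term
}

struct Wand {
    lead: Vec<Posting>, // positioned on the current candidate doc
    head: Vec<Posting>, // positioned past the candidate, ordered by doc_id
    tail: Vec<Posting>, // never advanced; only their bounds matter
    tail_max_score: f32,
}

impl Wand {
    fn new(mut postings: Vec<Posting>, threshold: f32) -> Self {
        // Move low-impact terms to the tail: as long as the summed tail
        // bound stays below the threshold, those iterators never need to
        // be advanced to rule a candidate in or out.
        postings.sort_by(|a, b| a.upper_bound.total_cmp(&b.upper_bound));
        let (mut head, mut tail) = (Vec::new(), Vec::new());
        let mut tail_max_score = 0.0f32;
        for p in postings {
            if tail_max_score + p.upper_bound < threshold {
                tail_max_score += p.upper_bound;
                tail.push(p);
            } else {
                head.push(p);
            }
        }
        head.sort_by_key(|p| p.doc_id);
        Wand { lead: Vec::new(), head, tail, tail_max_score }
    }

    /// Upper bound for the current candidate: exact bounds of the lead
    /// terms plus whatever score budget still sits in the tail.
    fn candidate_upper_bound(&self) -> f32 {
        self.lead.iter().map(|p| p.upper_bound).sum::<f32>() + self.tail_max_score
    }
}

fn main() {
    let w = Wand::new(
        vec![
            Posting { doc_id: 3, upper_bound: 1.0 },
            Posting { doc_id: 1, upper_bound: 0.1 },
            Posting { doc_id: 7, upper_bound: 2.0 },
            Posting { doc_id: 2, upper_bound: 0.2 },
        ],
        0.5,
    );
    println!("head={} tail={} budget={}", w.head.len(), w.tail.len(), w.tail_max_score);
}
```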
Query Type | Version | Mode | QPS | Avg Latency | P90 | P99
-- | -- | -- | -- | -- | -- | --
match | current | single-thread | — | 2.93 ms | 5.62 ms | 7.50 ms
match | main (v2) | single-thread | — | 4.22 ms | 7.56 ms | 7.85 ms
match | current | 8-concurrency | 612.80 | 13.05 ms | 17.80 ms | 20.78 ms
match | main (v2) | 8-concurrency | 599.17 | 13.35 ms | 19.61 ms | 21.86 ms
phrase | current | single-thread | — | 2.02 ms | 2.59 ms | 2.62 ms
phrase | main (v2) | single-thread | — | 3.60 ms | 4.62 ms | 4.67 ms
phrase | current | 8-concurrency | 1597.37 | 5.01 ms | 6.57 ms | 8.60 ms
phrase | main (v2) | 8-concurrency | 1040.66 | 7.69 ms | 9.86 ms | 11.23 ms

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
@github-actions
Contributor

Review: perf: batched WAND and new WAND structure

Nice performance improvement — the benchmarks show meaningful gains across the board, especially for phrase queries (~40% latency reduction single-threaded). The head/lead/tail split is a well-known WAND optimization.

Issues to consider

P0: Potential infinite loop in next_and_candidate

next_and_candidate (wand.rs) has a loop with no exit condition other than early return via ? (which returns None if a posting is exhausted). However, if all postings are non-empty but never align to the same target, the loop will spin forever as target keeps increasing via target = target.max(doc.doc_id()). If postings are genuinely disjoint (no shared doc_id), target will keep jumping and eventually all postings exhaust — but only if next() eventually returns None when past the last doc. Please verify that posting.next(target) when target is beyond all docs causes posting.doc() to return None, breaking the loop. If there's a gap where next() clamps to the last doc instead of returning None, this is an infinite loop.
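
A toy model of the property to verify. Everything here is illustrative (a slice-backed cursor, not the real `PostingIterator`), but it shows the hinge: the loop terminates only because advancing past the last doc yields `None`; if `next(target)` ever clamped to the last doc instead, `target` would stop growing past it and the loop could spin.

```rust
// Toy conjunction loop; names are illustrative, not the wand.rs API.

struct Cursor<'a> {
    docs: &'a [u64], // sorted doc ids
    pos: usize,
}

impl<'a> Cursor<'a> {
    /// Advance to the first doc id >= target; None once exhausted.
    fn next(&mut self, target: u64) -> Option<u64> {
        while self.pos < self.docs.len() && self.docs[self.pos] < target {
            self.pos += 1;
        }
        self.docs.get(self.pos).copied() // None past the end -> loop exits
    }
}

/// Find the next doc present in *all* postings, starting at `target`.
fn next_and_candidate(postings: &mut [Cursor], mut target: u64) -> Option<u64> {
    loop {
        let mut aligned = true;
        for p in postings.iter_mut() {
            let doc = p.next(target)?; // exhaustion terminates the loop
            if doc > target {
                target = doc; // restart alignment at the larger doc
                aligned = false;
            }
        }
        if aligned {
            return Some(target);
        }
    }
}

fn main() {
    let a = [1u64, 4, 7];
    let b = [2u64, 4, 9];
    let mut cursors = [Cursor { docs: &a, pos: 0 }, Cursor { docs: &b, pos: 0 }];
    println!("{:?}", next_and_candidate(&mut cursors, 0)); // Some(4)
}
```

With fully disjoint lists (e.g. `[1,3,5]` vs `[2,4,6]`), `target` leapfrogs until one cursor runs out and the `?` returns `None`, which is exactly the behavior the review asks to confirm in the real iterator.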

P1: advance_lead_to_head unconditionally clears tail

advance_lead_to_head calls self.clear_tail() at the end, which discards all tail postings. In the search_flat path, this is called after collecting a match (advance_lead_to_head(doc_id + 1) at line ~769), but those tail postings may still be relevant for future candidates. Contrast with push_back_leads which carefully reinserts into tail/head. Is the tail always guaranteed to be empty at this point, or is this losing postings?
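
Whether this is actually a bug depends on the invariants in `wand.rs`; purely as an illustration, the lossless alternative (in the spirit of push_back_leads) moves every posting back into head or tail based on the remaining score budget, so no score mass can be silently dropped. All names below are hypothetical:

```rust
// Hypothetical sketch of reinserting leads instead of dropping postings;
// not the actual wand.rs code.

#[derive(Debug)]
struct Posting {
    doc_id: u64,
    upper_bound: f32,
}

/// Move leads back into head/tail based on the remaining score budget.
/// Every posting is preserved, so future candidates keep their score mass.
fn push_back(
    lead: Vec<Posting>,
    head: &mut Vec<Posting>,
    tail: &mut Vec<Posting>,
    threshold: f32,
) {
    let mut budget: f32 = tail.iter().map(|p| p.upper_bound).sum();
    for p in lead {
        if budget + p.upper_bound < threshold {
            budget += p.upper_bound;
            tail.push(p); // low-impact: parked in the tail
        } else {
            head.push(p); // still impactful: stays live in the head
        }
    }
}

fn main() {
    let mut head = Vec::new();
    let mut tail = vec![Posting { doc_id: 1, upper_bound: 0.1 }];
    let lead = vec![
        Posting { doc_id: 2, upper_bound: 0.2 },
        Posting { doc_id: 3, upper_bound: 2.0 },
    ];
    push_back(lead, &mut head, &mut tail, 0.5);
    println!("head={} tail={}", head.len(), tail.len()); // head=1 tail=2
}
```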

P1: Floating-point subtraction in insert_tail_with_overflow

self.tail_max_score = self.tail_max_score - evicted.upper_bound + upper_bound;

Repeated add/subtract of f32 upper bounds will accumulate rounding errors in tail_max_score. Over many iterations this drift could cause premature pruning (if tail_max_score becomes slightly too low). Consider periodically recomputing tail_max_score from scratch (sum of all tail entries), or accept this as a minor recall risk and document it.
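
A deterministic, if contrived, illustration: the magnitudes below are chosen so a single add is absorbed by f32 rounding, whereas real upper bounds are small and the drift would accumulate gradually. Note the effect depends on evaluation order and magnitudes; the point is only that incremental updates and a fresh sum can disagree, and `recompute` shows the suggested mitigation.

```rust
// Contrived f32 drift demo: at 2^24, adding 1.0 is absorbed by rounding,
// so an add-then-subtract pair leaves the accumulator off by 1.0.

fn recompute(tail_bounds: &[f32]) -> f32 {
    // The suggested mitigation: periodically re-sum from scratch.
    tail_bounds.iter().sum()
}

fn main() {
    let mut tail_max_score: f32 = 16_777_216.0; // 2^24
    tail_max_score += 1.0; // rounds back down: 2^24 + 1 is not representable
    tail_max_score -= 1.0;
    println!("drifted to {tail_max_score}"); // 16777215, off by 1.0

    let exact = recompute(&[1.5, 2.5, 0.25]);
    println!("recomputed {exact}"); // 4.25
}
```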

P1: update_max_scores rebuilds both heaps from scratch

update_max_scores calls std::mem::take on both self.head and self.tail, iterates over into_vec(), and rebuilds. This is O(n) per call. Since this is called in the hot loop (next()) whenever target > self.up_to, if up_to advances in small steps (short blocks), this could regress to O(n²) for large posting lists. Worth profiling with a pathological case (many terms, small blocks).
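
For reference, the rebuild pattern being described amounts to the following (`Entry` and `block_max` are stand-ins, not the actual types). `BinaryHeap::from(Vec)` heapifies in O(n), so each call is linear in the number of head entries, which is fine once per block but adds up if `up_to` advances in tiny steps:

```rust
use std::collections::BinaryHeap;

// Stand-in types: `Entry` orders by max_score (derived Ord compares
// fields in declaration order), `block_max` fakes a per-block lookup.

#[derive(PartialEq, Eq, PartialOrd, Ord, Debug)]
struct Entry {
    max_score: u32,
    term: u32,
}

fn block_max(term: u32, up_to: u64) -> u32 {
    term + (up_to % 3) as u32
}

/// Drain the heap, refresh every block-max bound, and heapify again.
/// O(n) per call; invoked from the hot loop whenever target > up_to.
fn update_max_scores(head: &mut BinaryHeap<Entry>, up_to: u64) {
    let mut entries = std::mem::take(head).into_vec();
    for e in entries.iter_mut() {
        e.max_score = block_max(e.term, up_to);
    }
    *head = BinaryHeap::from(entries); // O(n) heapify
}

fn main() {
    let mut head = BinaryHeap::new();
    head.push(Entry { max_score: 0, term: 10 });
    head.push(Entry { max_score: 0, term: 5 });
    update_max_scores(&mut head, 7);
    println!("{:?}", head.peek()); // Some(Entry { max_score: 11, term: 10 })
}
```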

Minor nits

  • collect_tail_matches pushes non-matching postings to remaining then re-inserts them into head. These were in tail (i.e., low-priority) and are now being promoted to head, which seems intentional but worth a brief comment.
  • The #[cfg(test)] on PostingIterator::new is fine, but if any downstream crate's tests relied on it, they'd break. Looks safe since it's pub(crate).

Overall a solid optimization. The main concern is verifying the next_and_candidate termination and the advance_lead_to_head tail clearing behavior.

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
@BubbleCal BubbleCal changed the title perf: batched WAND and new WAND structure perf: batched WAND and new WAND structure, ~50% faster Mar 20, 2026
@BubbleCal BubbleCal requested review from Xuanwo and westonpace March 20, 2026 16:30
@codecov

codecov bot commented Mar 20, 2026

Codecov Report

❌ Patch coverage is 80.78740% with 122 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
rust/lance-index/src/scalar/inverted/wand.rs 80.69% 104 Missing and 18 partials ⚠️


@BubbleCal BubbleCal merged commit 384fb55 into main Mar 23, 2026
29 checks passed
@BubbleCal BubbleCal deleted the yang/batched-wand branch March 23, 2026 16:47
westonpace pushed a commit that referenced this pull request Mar 24, 2026
- advances posting iterators in batch
- splits posting iterators into `head`, `lead` and `tail`, reducing the
cost of updating posting iterators
- reuses `query_weight`
- prunes documents more aggressively with block-max scores
- uses a Lucene-style conjunction path for phrase queries, with
conjunction intersection and block-max pruning separated from OR WAND

Query Type | Version | Mode | QPS | Avg Latency | P90 | P99
-- | -- | -- | -- | -- | -- | --
match | current | single-thread | — | 2.93 ms | 5.62 ms | 7.50 ms
match | main (v2) | single-thread | — | 4.22 ms | 7.56 ms | 7.85 ms
match | current | 8-concurrency | 612.80 | 13.05 ms | 17.80 ms | 20.78 ms
match | main (v2) | 8-concurrency | 599.17 | 13.35 ms | 19.61 ms | 21.86 ms
phrase | current | single-thread | — | 2.02 ms | 2.59 ms | 2.62 ms
phrase | main (v2) | single-thread | — | 3.60 ms | 4.62 ms | 4.67 ms
phrase | current | 8-concurrency | 1597.37 | 5.01 ms | 6.57 ms | 8.60 ms
phrase | main (v2) | 8-concurrency | 1040.66 | 7.69 ms | 9.86 ms | 11.23 ms

---------

Signed-off-by: BubbleCal <bubble-cal@outlook.com>