Add block-transposed multi-vector representation for SIMD-friendly layouts #805
suri-kumkaran wants to merge 4 commits into `main`
Conversation
Pull request overview
This PR introduces a new multi-vector matrix representation (BlockTransposed<T, GROUP, PACK>) intended to replace/extend the prior kmeans-specific BlockTranspose layout and enable SIMD-friendly access patterns (including optional packed column interleaving).
Changes:
- Added the `multi_vector::BlockTransposed` representation with row proxy types, block views, constructors, and extensive tests.
- Migrated kmeans (Lloyd's, kmeans++) and product-quantization training/pivot codepaths to use `Mat<BlockTransposed<...>>` instead of `BlockTranspose`.
- Removed the old `BlockTranspose` implementation from `algorithms::kmeans::common` and stopped re-exporting it from `algorithms::kmeans`.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| diskann-quantization/src/product/train.rs | Switches PQ training to build/use Mat<BlockTransposed<...>> for SIMD-friendly kmeans operations. |
| diskann-quantization/src/product/tables/transposed/pivots.rs | Updates pivot chunk storage to Mat<BlockTransposed<...>> and adjusts group-size/construction calls. |
| diskann-quantization/src/multi_vector/mod.rs | Adds the new block_transposed module and re-exports its public types. |
| diskann-quantization/src/multi_vector/matrix.rs | Makes some internals pub(crate) to support the new representation’s accessors/constructors. |
| diskann-quantization/src/multi_vector/block_transposed.rs | New core implementation of block-transposed (+ packed) representation with row proxies, block views, constructors, and tests. |
| diskann-quantization/src/algorithms/kmeans/plusplus.rs | Replaces BlockTranspose usage with Mat<BlockTransposed<...>> and updates microkernel wiring. |
| diskann-quantization/src/algorithms/kmeans/mod.rs | Removes pub use common::BlockTranspose; export. |
| diskann-quantization/src/algorithms/kmeans/lloyds.rs | Replaces BlockTranspose usage with Mat<BlockTransposed<...>>. |
| diskann-quantization/src/algorithms/kmeans/common.rs | Removes the old BlockTranspose implementation and its tests, leaving square_norm. |
```rust
for row in 0..nrows {
    for col in 0..ncols {
        let idx = linear_index::<GROUP, PACK>(row, col, ncols);
        // SAFETY: idx < storage_len by construction, base points to a valid allocation.
        unsafe { *base.add(idx) = v[(row, col)] };
    }
}
```
from_strided fills the backing store in logical row-major order, which writes with a stride of GROUP (or more with PACK) and can be significantly less cache-friendly than filling in physical block order. The previous BlockTranspose::from_strided implementation filled per-block/per-column (contiguous writes). Consider restructuring this loop to iterate blocks + column-groups + rows to write mostly sequentially into the backing allocation.
Suggested change — replace the logical row-major fill with a physical-block-order fill:

```rust
// Iterate in physical block order: blocks -> column-groups -> rows -> packed columns.
// This makes writes mostly sequential in the backing allocation, improving cache locality.
let nblocks = (nrows + GROUP - 1) / GROUP;
let ncol_groups = (ncols + PACK - 1) / PACK;
for block in 0..nblocks {
    let row_base = block * GROUP;
    for col_group in 0..ncol_groups {
        let col_base = col_group * PACK;
        for in_block_row in 0..GROUP {
            let row = row_base + in_block_row;
            if row >= nrows {
                break;
            }
            for pack_idx in 0..PACK {
                let col = col_base + pack_idx;
                if col >= ncols {
                    break;
                }
                let idx = linear_index::<GROUP, PACK>(row, col, ncols);
                // SAFETY: idx < storage_len by construction, base points to a valid allocation.
                unsafe {
                    *base.add(idx) = v[(row, col)];
                }
            }
        }
    }
}
```
Codecov Report — ❌ Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #805      +/-   ##
==========================================
- Coverage   89.45%   89.40%   -0.06%
==========================================
  Files         432      433       +1
  Lines       79452    80296     +844
==========================================
+ Hits        71075    71788     +713
- Misses       8377     8508     +131
```

Flags with carried forward coverage won't be shown.
**hildebrandmw** left a comment:
Thanks Suryansh! I've taken a look over and left a few stylistic comments, but there are a few larger ones that I will group together here:

1. We should be very careful with the unchecked arithmetic, particularly in `compute_capacity`. We honestly probably want two flavors of this function: one that verifies none of the intermediate steps overflow, which gets used in the constructor, and one that assumes everything has already been checked (this would be the current `compute_capacity` function). I would also recommend leaning on `usize::next_multiple_of` instead of doing the operation manually.

2. I'm getting more convinced that we do not want to use `Mat`/`MatRef`/`MatMut` in public APIs. This forces us into an awkward situation where we need to start adding a bunch of inherent methods to these types, which makes method discoverability a little harder and is not a pattern that can be replicated outside of `diskann-quantization`. Instead, I think it would be cleaner to have thin wrappers:

   ```rust
   struct BlockTransposed<T, const GROUP: usize, const PACK: usize> {
       data: Mat<Repr<T, GROUP, PACK>>,
   }

   struct BlockTransposedRef<'a, T, const GROUP: usize, const PACK: usize> {
       data: MatRef<'a, Repr<T, GROUP, PACK>>,
   }
   // etc.
   ```

   Inherent methods can be added to these at will, and the generated docs for these inherent methods will be clear (since they will not be grouped in with all the other inherent methods). This will also let you safely have `as_slice` as an inherent method to avoid the manual unsafe construction in tests.

   Also, I'll admit a small preference for using method delegation instead of stamping out a giant macro, something like:

   ```rust
   impl<T, const GROUP: usize, const PACK: usize> BlockTransposed<T, GROUP, PACK> {
       fn block(&self, block: usize) -> MatrixView<'_, T> {
           self.reborrow().block(block)
       }
   }
   ```

   But I think that is secondary.

3. In the tests, we should really have one that calls `collect` on a `Mat::row_iter_mut()` and writes to a bunch of rows in a `thread::scope` or something. Basically, something to make sure that working on multiple rows concurrently is fine (and when using Miri, that will add another level of safety).
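To illustrate point (1), here is a hedged sketch of the two suggested `compute_capacity` flavors. The capacity formula used here (padded rows times padded columns) is an assumption for illustration; the crate's real layout math may differ, but the checked/unchecked split and `usize::next_multiple_of` carry over either way.

```rust
/// Checked flavor for constructors: returns None if any step overflows.
fn checked_compute_capacity<const GROUP: usize, const PACK: usize>(
    nrows: usize,
    ncols: usize,
) -> Option<usize> {
    let padded_rows = nrows.checked_next_multiple_of(GROUP)?;
    let padded_cols = ncols.checked_next_multiple_of(PACK)?;
    padded_rows.checked_mul(padded_cols)
}

/// Unchecked flavor for hot paths, assuming the constructor already validated sizes.
fn compute_capacity<const GROUP: usize, const PACK: usize>(nrows: usize, ncols: usize) -> usize {
    nrows.next_multiple_of(GROUP) * ncols.next_multiple_of(PACK)
}

fn main() {
    // 10 rows padded to 16, 5 cols padded to 6 -> 96 elements.
    assert_eq!(checked_compute_capacity::<8, 2>(10, 5), Some(96));
    assert_eq!(compute_capacity::<8, 2>(10, 5), 96);
    // Overflow is reported instead of wrapping.
    assert_eq!(checked_compute_capacity::<8, 2>(usize::MAX, 2), None);
}
```

`next_multiple_of`/`checked_next_multiple_of` have been stable on the unsigned integer types since Rust 1.73.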
```rust
#[derive(Debug, Clone, Copy)]
pub struct BlockTransposedRow<'a, T, const GROUP: usize, const PACK: usize = 1> {
    /// Pointer to the element at `(row, col=0)` in the backing allocation.
    base: *const T,
```
It's possible we might be able to reuse bits::ptr::SlicePtr for representing a pointer with a lifetime. As an added benefit, SlicePtr is already NonNull, so can help Row use the niche optimization out-of-the-box. Finally, the SlicePtr type already has the correct variance behavior built-in, so we don't need to replicate the &'a T and &'a mut T phantom data fields.
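The niche-optimization benefit mentioned above can be demonstrated in miniature: a row proxy built on `NonNull` (as `SlicePtr` is) makes `Option` of the proxy cost no extra space. `Row` here is a simplified stand-in, not the PR's actual type.

```rust
use std::marker::PhantomData;
use std::ptr::NonNull;

struct Row<'a, T> {
    // NonNull can never be null, so Option<Row> reuses the null value as the
    // None discriminant instead of adding a separate tag.
    #[allow(dead_code)]
    base: NonNull<T>,
    _marker: PhantomData<&'a T>, // covariant in 'a and T, like &'a T
}

fn main() {
    assert_eq!(
        std::mem::size_of::<Option<Row<'static, f32>>>(),
        std::mem::size_of::<Row<'static, f32>>()
    );
}
```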
```rust
#[inline]
fn index(&self, col: usize) -> &Self::Output {
    assert!(
```
As a small stylistic suggestion - If get() returns a reference, then index can be expressed as a call to get with a panic on the None case - saving one extra unsafe call.
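A minimal sketch of that pattern, with `Row` as an illustrative stand-in for the PR's row proxy: all bounds logic lives in `get`, and `Index::index` is just `get` plus a panic, so only one method needs the unsafe access.

```rust
use std::ops::Index;

struct Row<'a, T> {
    data: &'a [T],
}

impl<'a, T> Row<'a, T> {
    // Single source of truth for bounds checking (and, in the real type,
    // for the unsafe pointer arithmetic).
    fn get(&self, col: usize) -> Option<&T> {
        self.data.get(col)
    }
}

impl<'a, T> Index<usize> for Row<'a, T> {
    type Output = T;

    fn index(&self, col: usize) -> &T {
        self.get(col)
            .unwrap_or_else(|| panic!("column index {col} out of bounds"))
    }
}

fn main() {
    let row = Row { data: &[1.0f32, 2.0, 3.0] };
    assert_eq!(row[1], 2.0);
    assert!(row.get(5).is_none());
}
```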
```rust
    // SAFETY: bounds checked.
    unsafe { &mut *self.base.add(col_offset::<GROUP, PACK>(col)) }
    }
}
```
Should we also implement std::ops::Index for RowMut?
Thanks, Mark, for the great suggestions. I have addressed (1) and (2) and will soon add tests to cover (3). Please help with a follow-up review.
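The concurrent-row test from point (3) could look roughly like the sketch below. `chunks_mut` over a flat `Vec` stands in for `Mat::row_iter_mut()` here, since the point is the shape of the test (collect the mutable row iterator, hand each row to a scoped thread) rather than the matrix API itself.

```rust
fn fill_rows_concurrently(nrows: usize, ncols: usize) -> Vec<u32> {
    let mut data = vec![0u32; nrows * ncols];
    // Stand-in for Mat::row_iter_mut().collect::<Vec<_>>().
    let rows: Vec<&mut [u32]> = data.chunks_mut(ncols).collect();
    std::thread::scope(|s| {
        for (i, row) in rows.into_iter().enumerate() {
            s.spawn(move || {
                // Each thread owns exactly one row; under Miri this checks
                // that concurrent per-row writes do not alias.
                for v in row.iter_mut() {
                    *v = i as u32;
                }
            });
        }
    });
    data
}

fn main() {
    let data = fill_rows_concurrently(4, 8);
    assert!(data[0..8].iter().all(|&v| v == 0));
    assert!(data[24..32].iter().all(|&v| v == 3));
}
```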
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.
```rust
/// To reclaim the memory later, reconstruct the `Box` via
/// `Box::from_raw(slice_from_raw_parts_mut(ptr, len))`.
```
The doc comment for box_into_nonnull describes reclaiming memory as Box::from_raw(slice_from_raw_parts_mut(ptr, len)), but ptr here is a NonNull<u8> (byte pointer) and len needs to be in elements of T. As written, this guidance is easy to follow incorrectly and could lead to reconstructing a Box with the wrong element type/length (UB). Consider updating the comment to show casting ptr back to *mut T (and clarifying that len is the T element count, not bytes).
Suggested change:

````rust
/// To reclaim the memory later, you must reconstruct the `Box<[T]>` using the
/// original element type `T` and the original element count `len` (in `T`
/// elements, **not** bytes). For example:
///
/// ```rust,ignore
/// // `ptr` is the `NonNull<u8>` returned by `box_into_nonnull::<T>`
/// // and `len` is the number of `T` elements in the original box.
/// unsafe {
///     let ptr_t: *mut T = ptr.cast().as_ptr();
///     let slice: &mut [T] = std::slice::from_raw_parts_mut(ptr_t, len);
///     let boxed: Box<[T]> = Box::from_raw(slice);
/// }
/// ```
````
```rust
///
/// To reclaim the memory later, reconstruct the `Box` via
/// `Box::from_raw(slice_from_raw_parts_mut(ptr, len))`.
pub(crate) fn box_into_nonnull<T>(b: Box<[T]>) -> NonNull<u8> {
```
I think this should return NonNull<T> and leave it to the caller to do the casting - similar to how the as_nonnull functions work. This also simplifies the safety documentation.
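A sketch of what that suggestion could look like, with a local stand-in for the crate's function (names and the round-trip helper are illustrative): returning `NonNull<T>` keeps the reclamation story type-correct, and callers who need bytes can `.cast::<u8>()` themselves.

```rust
use std::ptr::NonNull;

// Stand-in for the crate's box_into_nonnull, returning NonNull<T> as suggested.
fn box_into_nonnull<T>(b: Box<[T]>) -> NonNull<T> {
    // Box::into_raw yields *mut [T]; casting to *mut T drops the length metadata.
    NonNull::new(Box::into_raw(b) as *mut T).expect("Box pointers are never null")
}

// Round-trip: leak the box, then reclaim it with the original element count.
fn roundtrip<T>(v: Vec<T>) -> Box<[T]> {
    let b = v.into_boxed_slice();
    let len = b.len();
    let ptr = box_into_nonnull(b);
    // SAFETY: ptr and len came from the Box we just leaked; len is in T elements.
    unsafe { Box::from_raw(std::ptr::slice_from_raw_parts_mut(ptr.as_ptr(), len)) }
}

fn main() {
    assert_eq!(&*roundtrip(vec![1u32, 2, 3]), &[1, 2, 3]);
}
```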
Summary

Introduces `BlockTransposed<T, GROUP, PACK>`, a new `Repr` implementation that stores multi-vectors in a block-transposed layout optimized for SIMD processing. This enables faster multi-vector distance computations by ensuring data is arranged for efficient vectorized access patterns.

Key changes
- `BlockTransposed<T, GROUP, PACK>`: generic block-transposed matrix representation where groups of `GROUP` rows are stored transposed, with an optional `PACK` factor for column interleaving (e.g. `PACK = 2` for `vpmaddwd`-style instructions).
- Row proxies `BlockTransposedRow` / `BlockTransposedRowMut` with indexed access, iteration, and `Send`/`Sync` support.
- Constructors `new_block_transposed`, `from_strided`, `from_matrix_view`, generic over any `T: Copy + Default`.
- Block accessors `block()` / `block_mut()` / `remainder_block()` / `remainder_block_mut()` returning `MatrixView` / `MutMatrixView` with dimensions `(padded_ncols / PACK, GROUP * PACK)` for direct SIMD consumption.
- Invariants `GROUP > 0`, `PACK > 0`, `GROUP % PACK == 0`, enforced at monomorphization.
- Tests covering multiple element types (`f32`, `i32`, `u8`), bounds-checking panic tests, and Miri-compatible paths.
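As a mental model of such a layout, the indexing sketch below maps a logical `(row, col)` to a physical offset where each block of `GROUP` rows is stored column-group-major with `PACK`-wide interleaving. The exact formula is an assumption for illustration; the crate's `linear_index` may differ in details, but it shows why the same column of `GROUP` consecutive rows ends up contiguous (the transposition that makes SIMD loads cheap).

```rust
fn linear_index<const GROUP: usize, const PACK: usize>(
    row: usize,
    col: usize,
    ncols: usize,
) -> usize {
    let padded_ncols = ncols.next_multiple_of(PACK);
    let block = row / GROUP;          // which GROUP-row block
    let in_block_row = row % GROUP;   // row within that block
    let col_group = col / PACK;       // which PACK-column group
    let in_pack = col % PACK;         // column within the pack
    block * GROUP * padded_ncols      // skip earlier blocks
        + col_group * GROUP * PACK    // skip earlier column groups in this block
        + in_block_row * PACK         // position of this row within the group
        + in_pack
}

fn main() {
    // GROUP = 4, PACK = 1, ncols = 3: column 0 of rows 0..4 is contiguous.
    assert_eq!(linear_index::<4, 1>(0, 0, 3), 0);
    assert_eq!(linear_index::<4, 1>(1, 0, 3), 1);
    assert_eq!(linear_index::<4, 1>(0, 1, 3), 4);
    // PACK = 2 interleaves adjacent column pairs within a row group.
    assert_eq!(linear_index::<4, 2>(0, 1, 4), 1);
    assert_eq!(linear_index::<4, 2>(1, 0, 4), 2);
}
```

Under this model each block reads as a `(padded_ncols / PACK, GROUP * PACK)` matrix, matching the block-view dimensions listed above.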