@ppohlitze ppohlitze commented Jan 30, 2026

Benchmark

The benchmark uses the Java Microbenchmark Harness (JMH) framework to measure the performance of the rewritten kernels. The result, the average execution time in microseconds for each parameter set, is exported to a CSV file. Each benchmark run consists of 5 warmup iterations followed by 10 measurement iterations (1 second each), executed in a single forked JVM. A minimal sketch of this harness layout is shown after the list below.

  • Matrix operands are generated once per trial using TestUtils.generateTestMatrixBlock() with configurable dimensions and sparsity levels. The result matrix is reset before each iteration to eliminate interference between measurements.
  • The setup phase, which differs slightly between kernels, performs format validation to ensure the matrices are in the expected representation before benchmarking.
  • For benchmarking, the access modifiers of the kernel methods were temporarily relaxed from private to public to allow for direct method invocations.
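
The sketch below shows how such a harness is typically laid out. MatrixBlock, the argument order of TestUtils.generateTestMatrixBlock(), the parameter subset, and kernelUnderTest() are placeholders for the project's actual classes and methods (project-specific imports omitted); this is an illustration of the setup described above, not the benchmark code itself.

```java
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 10, time = 1)
@Fork(1)
@State(Scope.Benchmark)
public class KernelBenchmark {

    // Hypothetical parameter subset; the full grids are listed per kernel below.
    @Param({"1"})                    int cd;
    @Param({"1024", "2048", "4096"}) int m;
    @Param({"1024", "2048", "4096"}) int n;
    @Param({"0.5", "0.75", "1.0"})   double sparsityLeft;
    @Param({"0.001", "0.01", "0.2"}) double sparsityRight;

    MatrixBlock left, right, result;

    @Setup(Level.Trial)
    public void generateOperands() {
        // Operands are generated once per trial (helper argument order assumed).
        left  = TestUtils.generateTestMatrixBlock(m, cd, -1, 1, sparsityLeft, 7);
        right = TestUtils.generateTestMatrixBlock(cd, n, -1, 1, sparsityRight, 3);
        // Kernel-specific format validation (dense vs. sparse representation) goes here.
    }

    @Setup(Level.Iteration)
    public void resetResult() {
        // Fresh output block before each iteration to avoid cross-measurement interference.
        result = new MatrixBlock(m, n, false);
    }

    @Benchmark
    public MatrixBlock vectorizedKernel() {
        // Placeholder call; the kernel methods were made public for direct invocation.
        kernelUnderTest(left, right, result);
        return result;
    }

    private static void kernelUnderTest(MatrixBlock a, MatrixBlock b, MatrixBlock c) {
        // Stands in for the rewritten kernel under benchmark.
    }
}
```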

Hardware Specs

JDK: OpenJDK 17 Temurin (AArch64)

Hardware Environment: Mac

  • Model: MacBook Pro (2024), Apple M4 Chip
  • CPU: 10 Cores (4 Performance @ 4.4 GHz and 6 Efficiency @ 2.85 GHz)
  • Architecture: ARMv9.2-A (NEON support, no SVE)
  • Vector Capability: 128-bit
  • Memory: 16 GB LPDDR5 (120 GB/s Bandwidth)
  • Cache (P-cores): 192 KB L1i / 128 KB L1d per core; 16 MB L2 shared per P-core cluster
  • OS: macOS Tahoe 26.2

Hardware Environment: Windows PC

  • CPU Model: Intel Core i5 9600K (Coffee Lake)
  • CPU: 6 Cores / 6 Threads (Base: 3.7 GHz, Turbo: 4.6 GHz)
  • Architecture: x86-64
  • Vector Capability: 256-bit
  • Memory: 16 GB DDR4-2666 (41.6 GB/s Bandwidth)
  • Cache:
    • L1 Cache: 384 KB (32 KB instruction + 32 KB data per core)
    • L2 Cache: 1.5 MB (256 KB per core)
    • L3 Cache: 9 MB (Shared)
  • OS: Windows 10 Home 22H2

A Note on Hardware Vectorization: Although the Apple M4 architecture supports ARMv9 and reports FEAT_SME (Scalable Matrix Extension), macOS does not currently expose standard SVE registers. Consequently, the JDK 17 Vector API defaults to the 128-bit NEON instruction set on this platform. This limits the SIMD lane count to 2 for double-precision values, whereas the Windows environment utilizes AVX2 with a lane count of 4.
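
The effective species can be confirmed directly from the Vector API. The small check below (class name is made up) prints the preferred vector width and double lane count, which should report 128-bit / 2 lanes on the M4 and 256-bit / 4 lanes on the AVX2 machine.

```java
import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorSpecies;

// Run with the incubator module enabled:
//   java --add-modules jdk.incubator.vector LaneCountCheck.java
public class LaneCountCheck {
    public static void main(String[] args) {
        VectorSpecies<Double> species = DoubleVector.SPECIES_PREFERRED;
        // Expected: 128-bit vectors / 2 lanes on NEON-only AArch64, 256-bit / 4 lanes with AVX2.
        System.out.println(species.vectorBitSize() + "-bit vectors, "
                + species.length() + " double lanes");
    }
}
```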

Performance Analysis

Raw Result files: https://github.com/ppohlitze/dia-project-benchmark-results

DenseDenseSparse

Benchmark Result Summary

  • the vectorized implementation is more than twice as fast as the baseline
  • most significant gains occur with the highest density matrices
  • minor performance regressions occur on sparser matrices, where the overhead of vector preparation outweighs the benefits of SIMD (see the sketch after this list)
  • significantly better performance on the Intel CPU, which is likely due to the higher lane count and hardware support for AVX2
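
The regression noted above follows from the typical shape of a Vector API kernel: a SIMD main loop over whole vectors plus a scalar tail, with species setup and loop-bound computation as fixed per-call overhead. The sketch below shows a generic axpy-style inner step in this style; the method name and signature are illustrative, not the project's actual kernel.

```java
import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorSpecies;

// Generic shape of a vectorized dense row update: a SIMD main loop plus a scalar tail.
// The species setup and loop-bound computation are the fixed per-call cost that can
// outweigh the SIMD benefit when a row carries little work.
public final class DenseRowKernelSketch {
    private static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_PREFERRED;

    /** c[ci..ci+len) += aval * b[bi..bi+len)  (an axpy-style inner step). */
    static void vectMultiplyAdd(double aval, double[] b, double[] c, int bi, int ci, int len) {
        int upper = SPECIES.loopBound(len);
        DoubleVector av = DoubleVector.broadcast(SPECIES, aval);
        int i = 0;
        for (; i < upper; i += SPECIES.length()) {              // SIMD main loop
            DoubleVector bv = DoubleVector.fromArray(SPECIES, b, bi + i);
            DoubleVector cv = DoubleVector.fromArray(SPECIES, c, ci + i);
            av.fma(bv, cv).intoArray(c, ci + i);                // c += aval * b
        }
        for (; i < len; i++)                                    // scalar tail
            c[ci + i] += aval * b[bi + i];
    }
}
```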

Benchmark Parameters

  • m: 1024, 1050, 2048, 4073, 4096, 8192
  • cd: 1
  • n: 1024, 1050, 2048, 4073, 4096, 8192
  • Sparsity Left: 0.5, 0.75, 1.0
  • Sparsity Right: 0.001, 0.01, 0.1, 0.2
  • Total Configs: 192

Mac

Geometric Mean Speedup: 2.2943x

Top 5 Speedups

| Speedup | cd | m    | n    | sparsityLeft | sparsityRight |
|---------|----|------|------|--------------|---------------|
| 5.25x   | 1  | 4096 | 2048 | 1.0          | 0.2           |
| 4.97x   | 1  | 8192 | 2048 | 1.0          | 0.2           |
| 4.87x   | 1  | 4096 | 4096 | 1.0          | 0.2           |
| 4.81x   | 1  | 2048 | 2048 | 1.0          | 0.001         |
| 4.79x   | 1  | 4096 | 1024 | 1.0          | 0.001         |

Bottom 5 Speedups

| Speedup | cd | m    | n    | sparsityLeft | sparsityRight |
|---------|----|------|------|--------------|---------------|
| 0.83x   | 1  | 2048 | 8192 | 0.5          | 0.01          |
| 0.84x   | 1  | 4096 | 1024 | 0.5          | 0.01          |
| 0.87x   | 1  | 1024 | 1024 | 0.5          | 0.01          |
| 0.90x   | 1  | 2048 | 8192 | 0.75         | 0.001         |
| 0.90x   | 1  | 4096 | 2048 | 0.5          | 0.01          |

Windows

Geometric Mean Speedup: 2.9540x

Top 5 Speedups

| Speedup | cd | m    | n    | sparsityLeft | sparsityRight |
|---------|----|------|------|--------------|---------------|
| 7.07x   | 1  | 1024 | 1024 | 0.75         | 0.2           |
| 6.69x   | 1  | 4096 | 4096 | 1.0          | 0.2           |
| 6.56x   | 1  | 1024 | 2048 | 1.0          | 0.2           |
| 5.86x   | 1  | 8192 | 4096 | 0.75         | 0.2           |
| 5.73x   | 1  | 2048 | 1024 | 1.0          | 0.2           |

Bottom 5 Speedups

| Speedup | cd | m    | n    | sparsityLeft | sparsityRight |
|---------|----|------|------|--------------|---------------|
| 0.57x   | 1  | 8192 | 8192 | 0.5          | 0.01          |
| 1.11x   | 1  | 8192 | 8192 | 0.75         | 0.01          |
| 1.13x   | 1  | 8192 | 8192 | 0.5          | 0.001         |
| 1.14x   | 1  | 4096 | 8192 | 0.5          | 0.001         |
| 1.30x   | 1  | 2048 | 1024 | 0.5          | 0.001         |

DenseSparseDense

Benchmark Result Summary

  • the Vector API version is 5x to 25x slower than the scalar implementation
  • performance decreases as density increases, suggesting that the SIMD overhead scales with the number of non-zero elements
  • the largest relative speedups occur for the sparsest right-hand sides (lowest sparsityRight). In these cases we mostly execute the scalar tail, since rows contain fewer elements than the SIMD vector length. This indicates that the Vector API's gather and scatter operations (fromArray() and intoArray()) are the primary bottlenecks (see the sketch after this list)
  • again, better performance on the Intel CPU
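
The sketch below illustrates the access pattern referred to above, assuming a CSR-style row (parallel arrays of nnz column indices and values) and the indexed gather/scatter overloads of fromArray()/intoArray(); the helper name and signature are made up for illustration. When nnz is smaller than SPECIES.length(), loopBound(nnz) is 0 and only the scalar tail runs, so the measured time is dominated by the surrounding setup.

```java
import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorSpecies;

// Gather/scatter access pattern on a sparse row: contiguous non-zero values are
// combined with output positions selected through an index map.
public final class SparseRowGatherSketch {
    private static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_PREFERRED;

    /** c[ix[k]] += aval * vals[k] for k in [0, nnz). */
    static void scatterMultiplyAdd(double aval, double[] vals, int[] ix, double[] c, int nnz) {
        DoubleVector av = DoubleVector.broadcast(SPECIES, aval);
        int upper = SPECIES.loopBound(nnz);
        int k = 0;
        for (; k < upper; k += SPECIES.length()) {
            DoubleVector vv = DoubleVector.fromArray(SPECIES, vals, k);
            DoubleVector cv = DoubleVector.fromArray(SPECIES, c, 0, ix, k);   // gather
            av.fma(vv, cv).intoArray(c, 0, ix, k);                            // scatter
        }
        for (; k < nnz; k++)                                                  // scalar tail
            c[ix[k]] += aval * vals[k];
    }
}
```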

Benchmark Parameters

  • m: 1, 1024, 4096
  • cd: 1
  • n: 1024, 4096
  • Sparsity Left: 0.5, 0.75, 1.0
  • Sparsity Right: 0.001, 0.01, 0.2
  • Total Configs: 54 (I had to significantly reduce the number of configs because the kernel is prohibitively slow for larger matrices)

Mac

Geometric Mean Speedup: 0.1125x

Top 5 Speedups

| Speedup | m    | cd   | n    | sparsityLeft | sparsityRight |
|---------|------|------|------|--------------|---------------|
| 0.68x   | 4096 | 1024 | 1024 | 0.75         | 0.001         |
| 0.67x   | 1024 | 1024 | 1024 | 0.5          | 0.001         |
| 0.67x   | 1024 | 1024 | 1024 | 0.75         | 0.001         |
| 0.67x   | 4096 | 1024 | 1024 | 0.5          | 0.001         |
| 0.47x   | 4096 | 1024 | 1024 | 1.0          | 0.001         |

Bottom 5 Speedups

| Speedup | m    | cd   | n    | sparsityLeft | sparsityRight |
|---------|------|------|------|--------------|---------------|
| 0.04x   | 1024 | 1024 | 1024 | 1.0          | 0.2           |
| 0.04x   | 4096 | 1024 | 1024 | 1.0          | 0.2           |
| 0.04x   | 1024 | 4096 | 4096 | 1.0          | 0.2           |
| 0.04x   | 1    | 4096 | 4096 | 1.0          | 0.2           |
| 0.04x   | 1024 | 4096 | 4096 | 0.75         | 0.2           |

Windows

Geometric Mean Speedup: 0.3121x

Top 5 Speedups

| Speedup | m    | n    | sparsityLeft | sparsityRight |
|---------|------|------|--------------|---------------|
| 0.91x   | 4096 | 1024 | 0.5          | 0.001         |
| 0.87x   | 1024 | 1024 | 0.75         | 0.001         |
| 0.87x   | 4096 | 1024 | 1.0          | 0.001         |
| 0.86x   | 4096 | 1024 | 0.75         | 0.001         |
| 0.85x   | 1024 | 1024 | 1.0          | 0.001         |

Bottom 5 Speedups

| Speedup | m    | n    | sparsityLeft | sparsityRight |
|---------|------|------|--------------|---------------|
| 0.13x   | 1024 | 1024 | 1.0          | 0.2           |
| 0.13x   | 4096 | 1024 | 1.0          | 0.2           |
| 0.13x   | 4096 | 4096 | 1.0          | 0.2           |
| 0.14x   | 1    | 1024 | 0.75         | 0.2           |
| 0.14x   | 1024 | 4096 | 1.0          | 0.2           |

DenseSparseSparse

Benchmark Result Summary

  • the Vector API implementation is 12x – 100x slower at high sparsity but achieves a 1.5x – 3.3x speedup as density increases toward 20%
  • the cost of initializing and scanning the dense intermediate buffer for every row dominates execution time when non-zeros are rare (see the sketch after this list)
  • better performance on the Intel CPU
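
As a rough illustration of the per-row cost described above (the helper below is hypothetical, not the project's code): the dense scratch buffer is scanned and cleared over its full length n for every output row, regardless of how few non-zeros that row actually produces.

```java
import java.util.Arrays;

// Dense-intermediate pattern: each output row is accumulated into a dense scratch
// buffer and then compacted into the sparse result. The full scan and the reset are
// O(n) per row, which dominates when almost all entries are zero.
public final class DenseIntermediateSketch {

    /** Compacts one accumulated row of length n; returns the number of non-zeros kept. */
    static int compactRow(double[] scratch, int n, int[] outIx, double[] outVals) {
        int nnz = 0;
        for (int j = 0; j < n; j++) {            // full O(n) scan of the scratch buffer
            if (scratch[j] != 0) {
                outIx[nnz] = j;
                outVals[nnz++] = scratch[j];
            }
        }
        Arrays.fill(scratch, 0, n, 0);           // O(n) reset before the next row
        return nnz;
    }
}
```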

Benchmark Parameters

  • m: 1024, 1050, 2048, 4073, 4096, 8192
  • cd: 1
  • n: 1024, 1050, 2048, 4073, 4096, 8192
  • Sparsity Left: 0.5, 0.75, 1.0
  • Sparsity Right: 0.001, 0.01, 0.1, 0.2
  • Total Configs: 432

Mac

Geometric Mean Speedup: 0.1731x

Top 5 Speedups

| Speedup | cd | m    | n    | sparsityLeft | sparsityRight |
|---------|----|------|------|--------------|---------------|
| 3.33x   | 1  | 2048 | 4073 | 1.0          | 0.2           |
| 3.16x   | 1  | 4096 | 2048 | 1.0          | 0.2           |
| 3.01x   | 1  | 8192 | 2048 | 1.0          | 0.2           |
| 2.81x   | 1  | 1024 | 4096 | 1.0          | 0.2           |
| 2.76x   | 1  | 4096 | 1050 | 1.0          | 0.2           |

Bottom 5 Speedups

| Speedup | cd | m    | n    | sparsityLeft | sparsityRight |
|---------|----|------|------|--------------|---------------|
| 0.00x   | 1  | 8192 | 8192 | 1.0          | 0.001         |
| 0.00x   | 1  | 2048 | 8192 | 0.5          | 0.001         |
| 0.00x   | 1  | 4073 | 4096 | 1.0          | 0.001         |
| 0.00x   | 1  | 8192 | 4073 | 1.0          | 0.001         |
| 0.00x   | 1  | 4073 | 4073 | 0.75         | 0.001         |

Windows

Geometric Mean Speedup: 0.2560x

Top 5 Speedups

| Speedup | cd | m    | n    | sparsityLeft | sparsityRight |
|---------|----|------|------|--------------|---------------|
| 5.36x   | 1  | 4096 | 4096 | 1.0          | 0.2           |
| 5.31x   | 1  | 1050 | 4096 | 1.0          | 0.2           |
| 5.13x   | 1  | 4073 | 4096 | 1.0          | 0.2           |
| 5.00x   | 1  | 8192 | 8192 | 0.75         | 0.2           |
| 5.00x   | 1  | 4096 | 4073 | 1.0          | 0.2           |

Bottom 5 Speedups

| Speedup | cd | m    | n    | sparsityLeft | sparsityRight |
|---------|----|------|------|--------------|---------------|
| 0.00x   | 1  | 1050 | 8192 | 0.5          | 0.001         |
| 0.00x   | 1  | 2048 | 8192 | 0.5          | 0.001         |
| 0.01x   | 1  | 4073 | 8192 | 0.5          | 0.001         |
| 0.01x   | 1  | 8192 | 8192 | 0.5          | 0.001         |
| 0.01x   | 1  | 1024 | 8192 | 0.5          | 0.001         |

SparseDenseMVTallRHS

Benchmark Result Summary

  • Mac: the vectorized implementation is consistently 3.7x to 7.7x slower than the scalar baseline
    • the regression is most severe for high sparsity and smaller matrix dimensions
  • Intel CPU: the vectorized implementation is on average ~9% faster than the scalar baseline
    • the larger vector capacity and hardware support for AVX2 provide enough throughput to offset the vector setup costs

Benchmark Parameters

  • m: 2048, 4096, 8192
  • cd: 4096, 8192, 16384
  • n: 1
  • Sparsity Left: 0.05, 0.1, 0.2
  • Sparsity Right: 0.5, 0.75, 1.0
  • Total Configs: 81

Mac

Geometric Mean Speedup: 0.1938x

Top 5 Speedups

| Speedup | cd    | m    | n | sparsityLeft | sparsityRight |
|---------|-------|------|---|--------------|---------------|
| 0.27x   | 16384 | 8192 | 1 | 0.1          | 0.75          |
| 0.27x   | 16384 | 4096 | 1 | 0.2          | 1.0           |
| 0.27x   | 16384 | 4096 | 1 | 0.1          | 0.5           |
| 0.27x   | 16384 | 4096 | 1 | 0.2          | 0.5           |
| 0.27x   | 16384 | 8192 | 1 | 0.1          | 0.5           |

Bottom 5 Speedups

| Speedup | cd   | m    | n | sparsityLeft | sparsityRight |
|---------|------|------|---|--------------|---------------|
| 0.13x   | 4096 | 2048 | 1 | 0.1          | 1.0           |
| 0.14x   | 4096 | 2048 | 1 | 0.1          | 0.75          |
| 0.14x   | 8192 | 4096 | 1 | 0.05         | 1.0           |
| 0.14x   | 4096 | 2048 | 1 | 0.1          | 0.5           |
| 0.14x   | 8192 | 4096 | 1 | 0.05         | 0.75          |

Windows

Geometric Mean Speedup: 1.0880x

Top 5 Speedups

| Speedup | cd   | m    | n | sparsityLeft | sparsityRight |
|---------|------|------|---|--------------|---------------|
| 1.25x   | 4096 | 2048 | 1 | 0.2          | 0.5           |
| 1.21x   | 8192 | 8192 | 1 | 0.2          | 0.75          |
| 1.18x   | 8192 | 2048 | 1 | 0.2          | 0.75          |
| 1.18x   | 8192 | 2048 | 1 | 0.2          | 1.0           |
| 1.18x   | 8192 | 4096 | 1 | 0.2          | 0.75          |

Bottom 5 Speedups

| Speedup | cd    | m    | n | sparsityLeft | sparsityRight |
|---------|-------|------|---|--------------|---------------|
| 0.95x   | 16384 | 2048 | 1 | 0.05         | 0.75          |
| 0.97x   | 4096  | 4096 | 1 | 0.05         | 1.0           |
| 0.97x   | 4096  | 4096 | 1 | 0.05         | 0.5           |
| 0.98x   | 4096  | 4096 | 1 | 0.05         | 0.75          |
| 0.98x   | 4096  | 8192 | 1 | 0.05         | 0.75          |

Baseline vs. Vectorized Performance Plots

Plots included (per kernel and platform): dense_dense_sparse_mac_plot, dense_dense_sparse_windows_plot, dense_sparse_dense_mac_plot, dense_sparse_dense_windows_plot, dense_sparse_sparse_mac_plot, dense_sparse_sparse_windows_plot, sparse_dense_mvtallrhs_mac_plot, sparse_dense_mvtallrhs_windows_plot
