SYSTEMDS-3855 Reimplement Matrix Multiplication kernels with vector api #2423
+353
−2
Benchmark
The benchmark uses the Java Microbenchmark Harness (JMH) framework to measure the performance of the rewritten kernels. For each parameter set, the result is the average execution time in microseconds, which is exported to a CSV file. Each benchmark run consists of 5 warmup iterations followed by 10 measurement iterations (1 second each), executed in a single forked JVM.
Hardware Specs
JDK: OpenJDK 17 Temurin (AArch64)
Hardware Environment: Mac
Hardware Environment: Windows PC
Sources
A Note on Hardware Vectorization: Although the Apple M4 architecture supports ARMv9 and reports FEAT_SME (Scalable Matrix Extension), macOS does not currently expose standard SVE registers. Consequently, the JDK 17 Vector API defaults to the 128-bit NEON instruction set on this platform. This limits the SIMD lane count to 2, whereas the Windows environment uses AVX2 with a lane count of 4.
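The lane counts in the note above follow from dividing the vector register width by the element width. The sketch below illustrates that arithmetic, assuming the kernels operate on 64-bit `double` values (an assumption consistent with the stated lane counts of 2 and 4, but not spelled out in this description). The class and method names are hypothetical and not part of the PR.

```java
public final class LaneCount {

    /**
     * Number of SIMD lanes for a given register width and element width.
     * Both arguments are in bits.
     */
    static int lanes(int vectorBits, int elementBits) {
        if (elementBits <= 0 || vectorBits % elementBits != 0) {
            throw new IllegalArgumentException("vector width must be a multiple of element width");
        }
        return vectorBits / elementBits;
    }

    public static void main(String[] args) {
        // Assumption: elements are doubles, so Double.SIZE == 64 bits.
        int neonLanes = lanes(128, Double.SIZE); // 128-bit NEON registers
        int avx2Lanes = lanes(256, Double.SIZE); // 256-bit AVX2 registers
        System.out.println("NEON double lanes: " + neonLanes);
        System.out.println("AVX2 double lanes: " + avx2Lanes);
        // With the incubating Vector API enabled (--add-modules jdk.incubator.vector),
        // DoubleVector.SPECIES_PREFERRED.length() reports the lane count chosen at runtime.
    }
}
```

Under this assumption, the same kernel processes half as many elements per vector instruction on the Mac as on the Windows machine, which is worth keeping in mind when comparing the speedups of the two platforms.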
Performance Analysis
Raw Result files: https://github.com/ppohlitze/dia-project-benchmark-results
DenseDenseSparse
Benchmark Result Summary
Benchmark Parameters
Mac
Geometric Mean Speedup: 2.2943x
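For readers who want to reproduce the summary figure from the raw CSV files, a geometric mean speedup can be computed roughly as follows. This is a minimal sketch, assuming the speedup of a parameter set is defined as baseline time divided by vectorized time (so that values above 1.0 are gains, matching the tables here). The class name, method names, and the sample timings are hypothetical and do not come from the benchmark results.

```java
public final class GeoMeanSpeedup {

    /** Assumed definition: baseline time divided by vectorized time. */
    static double speedup(double baselineMicros, double vectorizedMicros) {
        return baselineMicros / vectorizedMicros;
    }

    /** Geometric mean, computed via the mean of logarithms to avoid overflow. */
    static double geometricMean(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("at least one value is required");
        }
        double logSum = 0.0;
        for (double v : values) {
            if (v <= 0.0) {
                throw new IllegalArgumentException("values must be positive");
            }
            logSum += Math.log(v);
        }
        return Math.exp(logSum / values.length);
    }

    public static void main(String[] args) {
        // Hypothetical average execution times in microseconds, one entry per parameter set.
        double[] baseline = {100.0, 200.0, 50.0};
        double[] vectorized = {50.0, 50.0, 100.0};

        double[] speedups = new double[baseline.length];
        for (int i = 0; i < baseline.length; i++) {
            speedups[i] = speedup(baseline[i], vectorized[i]);
        }
        System.out.printf("Geometric mean speedup: %.4fx%n", geometricMean(speedups));
    }
}
```

The geometric mean is a common choice for averaging ratios because a 2x gain and a 0.5x loss cancel out to 1.0x, which an arithmetic mean would not reflect.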
Top 5 Performance Gains (Speedup > 1.0)
Top 5 Performance Losses (Speedup < 1.0)
Windows
Geometric Mean Speedup: 2.9540x
Top 5 Performance Gains (Speedup > 1.0)
Top 5 Performance Losses (Speedup < 1.0)
DenseSparseDense
Benchmark Result Summary
Benchmark Parameters
Mac
Geometric Mean Speedup: 0.1125x
Top 5 Performance Gains (Speedup > 1.0)
Top 5 Performance Losses (Speedup < 1.0)
Windows
Geometric Mean Speedup: 0.3121x
Top 5 Performance Gains (Speedup > 1.0)
Top 5 Performance Losses (Speedup < 1.0)
DenseSparseSparse
Benchmark Result Summary
Benchmark Parameters
Mac
Geometric Mean Speedup: 0.1731x
Top 5 Performance Gains (Speedup > 1.0)
Top 5 Performance Losses (Speedup < 1.0)
Windows
Geometric Mean Speedup: 0.2560x
Top 5 Performance Gains (Speedup > 1.0)
Top 5 Performance Losses (Speedup < 1.0)
SparseDenseMVTallRHS
Benchmark Result Summary
Benchmark Parameters
Mac
Geometric Mean Speedup: 0.1938x
Top 5 Performance Gains (Speedup > 1.0)
Top 5 Performance Losses (Speedup < 1.0)
Windows
Geometric Mean Speedup: 1.0880x
Top 5 Performance Gains (Speedup > 1.0)
Top 5 Performance Losses (Speedup < 1.0)
Baseline vs Vectorized Performance plots