@ppohlitze ppohlitze commented Jan 30, 2026

Benchmark

The benchmark uses the Java Microbenchmark Harness (JMH) framework to measure the performance of the rewritten kernels. The result, the average execution time in microseconds for each parameter set, is exported to a CSV file. Each benchmark run consists of 5 warmup iterations followed by 10 measurement iterations (1 second each), executed in a single forked JVM. A minimal sketch of this harness layout is shown after the list below.

  • Matrix operands are generated once per trial using TestUtils.generateTestMatrixBlock() with configurable dimensions and sparsity levels. The result matrix is reset before each iteration to eliminate interference between measurements.
  • The setup phase, which differs slightly between kernels, performs format validation to ensure the matrices are in the expected representation before benchmarking.
  • For benchmarking, the access modifiers of the kernel methods were temporarily relaxed from private to public to allow for direct method invocations.
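
The sketch below shows how such a harness is typically laid out. MatrixBlock, the argument order of TestUtils.generateTestMatrixBlock(), the parameter subset, and kernelUnderTest() are placeholders for the project's actual classes and methods (project-specific imports omitted); this is an illustration of the setup described above, not the benchmark code itself.

```java
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 10, time = 1)
@Fork(1)
@State(Scope.Benchmark)
public class KernelBenchmark {

    // Hypothetical parameter subset; the full grids are listed per kernel below.
    @Param({"1"})                    int cd;
    @Param({"1024", "2048", "4096"}) int m;
    @Param({"1024", "2048", "4096"}) int n;
    @Param({"0.5", "0.75", "1.0"})   double sparsityLeft;
    @Param({"0.001", "0.01", "0.2"}) double sparsityRight;

    MatrixBlock left, right, result;

    @Setup(Level.Trial)
    public void generateOperands() {
        // Operands are generated once per trial (helper argument order assumed).
        left  = TestUtils.generateTestMatrixBlock(m, cd, -1, 1, sparsityLeft, 7);
        right = TestUtils.generateTestMatrixBlock(cd, n, -1, 1, sparsityRight, 3);
        // Kernel-specific format validation (dense vs. sparse representation) goes here.
    }

    @Setup(Level.Iteration)
    public void resetResult() {
        // Fresh output block before each iteration to avoid cross-measurement interference.
        result = new MatrixBlock(m, n, false);
    }

    @Benchmark
    public MatrixBlock vectorizedKernel() {
        // Placeholder call; the kernel methods were made public for direct invocation.
        kernelUnderTest(left, right, result);
        return result;
    }

    private static void kernelUnderTest(MatrixBlock a, MatrixBlock b, MatrixBlock c) {
        // Stands in for the rewritten kernel under benchmark.
    }
}
```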

Hardware Specs

JDK: OpenJDK 17 Temurin (AArch64)

Hardware Environment: Mac

  • Model: MacBook Pro (2024), Apple M4 Chip
  • CPU: 10 Cores (4 Performance @ 4.4 GHz and 6 Efficiency @ 2.85 GHz)
  • Architecture: ARMv9.2-A (NEON support, no SVE)
  • Vector Capability: 128-bit
  • Memory: 16 GB LPDDR5 (120 GB/s Bandwidth)
  • Cache (P-cores): 192 KB L1i / 128 KB L1d per core; 16 MB L2 shared per P-core cluster
  • OS: macOS Tahoe 26.2

Hardware Environment: Windows PC

  • CPU Model: Intel Core i5 9600K (Coffee Lake)
  • CPU: 6 Cores / 6 Threads (Base: 3.7 GHz, Turbo: 4.6 GHz)
  • Architecture: x86-64
  • Vector Capability: 256-bit
  • Memory: 16 GB DDR4-2666 (41.6 GB/s Bandwidth)
  • Cache:
    • L1 Cache: 384 KB (32 KB instruction + 32 KB data per core)
    • L2 Cache: 1.5 MB (256 KB per core)
    • L3 Cache: 9 MB (Shared)
  • OS: Windows 10 Home 22H2

A Note on Hardware Vectorization: Although the Apple M4 architecture supports ARMv9 and reports FEAT_SME (Scalable Matrix Extension), macOS does not currently expose standard SVE registers. Consequently, the JDK 17 Vector API defaults to the 128-bit NEON instruction set on this platform. This limits the SIMD lane count to 2 for double-precision values, whereas the Windows environment utilizes AVX2 with a lane count of 4.
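
The effective species can be confirmed directly from the Vector API. The small check below (class name is made up) prints the preferred vector width and double lane count, which should report 128-bit / 2 lanes on the M4 and 256-bit / 4 lanes on the AVX2 machine.

```java
import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorSpecies;

// Run with the incubator module enabled:
//   java --add-modules jdk.incubator.vector LaneCountCheck.java
public class LaneCountCheck {
    public static void main(String[] args) {
        VectorSpecies<Double> species = DoubleVector.SPECIES_PREFERRED;
        // Expected: 128-bit vectors / 2 lanes on NEON-only AArch64, 256-bit / 4 lanes with AVX2.
        System.out.println(species.vectorBitSize() + "-bit vectors, "
                + species.length() + " double lanes");
    }
}
```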

Performance Analysis

Raw Result files: https://github.com/ppohlitze/dia-project-benchmark-results

DenseDenseSparse

Benchmark Result Summary

  • the vectorized implementation is more than twice as fast as the baseline
  • most significant gains occur with the highest density matrices
  • minor performance regressions occur on sparser matrices, where the overhead of vector preparation outweighs the benefits of SIMD (see the sketch after this list)
  • significantly better performance on the Intel CPU, which is likely due to the higher lane count and hardware support for AVX2
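
The regression noted above follows from the typical shape of a Vector API kernel: a SIMD main loop over whole vectors plus a scalar tail, with species setup and loop-bound computation as fixed per-call overhead. The sketch below shows a generic axpy-style inner step in this style; the method name and signature are illustrative, not the project's actual kernel.

```java
import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorSpecies;

// Generic shape of a vectorized dense row update: a SIMD main loop plus a scalar tail.
// The species setup and loop-bound computation are the fixed per-call cost that can
// outweigh the SIMD benefit when a row carries little work.
public final class DenseRowKernelSketch {
    private static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_PREFERRED;

    /** c[ci..ci+len) += aval * b[bi..bi+len)  (an axpy-style inner step). */
    static void vectMultiplyAdd(double aval, double[] b, double[] c, int bi, int ci, int len) {
        int upper = SPECIES.loopBound(len);
        DoubleVector av = DoubleVector.broadcast(SPECIES, aval);
        int i = 0;
        for (; i < upper; i += SPECIES.length()) {              // SIMD main loop
            DoubleVector bv = DoubleVector.fromArray(SPECIES, b, bi + i);
            DoubleVector cv = DoubleVector.fromArray(SPECIES, c, ci + i);
            av.fma(bv, cv).intoArray(c, ci + i);                // c += aval * b
        }
        for (; i < len; i++)                                    // scalar tail
            c[ci + i] += aval * b[bi + i];
    }
}
```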

Benchmark Parameters

  • m: 1024, 1050, 2048, 4073, 4096, 8192
  • cd: 1
  • n: 1024, 1050, 2048, 4073, 4096, 8192
  • Sparsity Left: 0.5, 0.75, 1.0
  • Sparsity Right: 0.001, 0.01, 0.1, 0.2
  • Total Configs: 192

Mac

Geometric Mean Speedup: 2.2943x

Top 5 Speedups

| Speedup | cd | m    | n    | sparsityLeft | sparsityRight |
|---------|----|------|------|--------------|---------------|
| 5.25x   | 1  | 4096 | 2048 | 1.0          | 0.2           |
| 4.97x   | 1  | 8192 | 2048 | 1.0          | 0.2           |
| 4.87x   | 1  | 4096 | 4096 | 1.0          | 0.2           |
| 4.81x   | 1  | 2048 | 2048 | 1.0          | 0.001         |
| 4.79x   | 1  | 4096 | 1024 | 1.0          | 0.001         |

Bottom 5 Speedups

| Speedup | cd | m    | n    | sparsityLeft | sparsityRight |
|---------|----|------|------|--------------|---------------|
| 0.83x   | 1  | 2048 | 8192 | 0.5          | 0.01          |
| 0.84x   | 1  | 4096 | 1024 | 0.5          | 0.01          |
| 0.87x   | 1  | 1024 | 1024 | 0.5          | 0.01          |
| 0.90x   | 1  | 2048 | 8192 | 0.75         | 0.001         |
| 0.90x   | 1  | 4096 | 2048 | 0.5          | 0.01          |

Windows

Geometric Mean Speedup: 2.9540x

Top 5 Speedups

| Speedup | cd | m    | n    | sparsityLeft | sparsityRight |
|---------|----|------|------|--------------|---------------|
| 7.07x   | 1  | 1024 | 1024 | 0.75         | 0.2           |
| 6.69x   | 1  | 4096 | 4096 | 1.0          | 0.2           |
| 6.56x   | 1  | 1024 | 2048 | 1.0          | 0.2           |
| 5.86x   | 1  | 8192 | 4096 | 0.75         | 0.2           |
| 5.73x   | 1  | 2048 | 1024 | 1.0          | 0.2           |

Bottom 5 Speedups

| Speedup | cd | m    | n    | sparsityLeft | sparsityRight |
|---------|----|------|------|--------------|---------------|
| 0.57x   | 1  | 8192 | 8192 | 0.5          | 0.01          |
| 1.11x   | 1  | 8192 | 8192 | 0.75         | 0.01          |
| 1.13x   | 1  | 8192 | 8192 | 0.5          | 0.001         |
| 1.14x   | 1  | 4096 | 8192 | 0.5          | 0.001         |
| 1.30x   | 1  | 2048 | 1024 | 0.5          | 0.001         |

DenseSparseDense

Benchmark Result Summary

  • the Vector API version is 5x to 25x slower than the scalar implementation
  • performance decreases as density increases, suggesting that the SIMD overhead scales with the number of non-zero elements
  • the largest relative speedups occur for the sparsest right-hand sides (lowest sparsityRight). In these cases we mostly execute the scalar tail, since rows contain fewer elements than the SIMD vector length. This indicates that the Vector API's gather and scatter operations (fromArray() and intoArray()) are the primary bottlenecks (see the sketch after this list)
  • again, better performance on the Intel CPU
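
The sketch below illustrates the access pattern referred to above, assuming a CSR-style row (parallel arrays of nnz column indices and values) and the indexed gather/scatter overloads of fromArray()/intoArray(); the helper name and signature are made up for illustration. When nnz is smaller than SPECIES.length(), loopBound(nnz) is 0 and only the scalar tail runs, so the measured time is dominated by the surrounding setup.

```java
import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorSpecies;

// Gather/scatter access pattern on a sparse row: contiguous non-zero values are
// combined with output positions selected through an index map.
public final class SparseRowGatherSketch {
    private static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_PREFERRED;

    /** c[ix[k]] += aval * vals[k] for k in [0, nnz). */
    static void scatterMultiplyAdd(double aval, double[] vals, int[] ix, double[] c, int nnz) {
        DoubleVector av = DoubleVector.broadcast(SPECIES, aval);
        int upper = SPECIES.loopBound(nnz);
        int k = 0;
        for (; k < upper; k += SPECIES.length()) {
            DoubleVector vv = DoubleVector.fromArray(SPECIES, vals, k);
            DoubleVector cv = DoubleVector.fromArray(SPECIES, c, 0, ix, k);   // gather
            av.fma(vv, cv).intoArray(c, 0, ix, k);                            // scatter
        }
        for (; k < nnz; k++)                                                  // scalar tail
            c[ix[k]] += aval * vals[k];
    }
}
```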

Benchmark Parameters

  • m: 1, 1024, 4096
  • cd: 1
  • n: 1024, 4096
  • Sparsity Left: 0.5, 0.75, 1.0
  • Sparsity Right: 0.001, 0.01, 0.2
  • Total Configs: 54 (I had to significantly reduce the number of configs because the kernel is prohibitively slow for larger matrices)

Mac

Geometric Mean Speedup: 0.1125x

Top 5 Speedups

| Speedup | m    | cd   | n    | sparsityLeft | sparsityRight |
|---------|------|------|------|--------------|---------------|
| 0.68x   | 4096 | 1024 | 1024 | 0.75         | 0.001         |
| 0.67x   | 1024 | 1024 | 1024 | 0.5          | 0.001         |
| 0.67x   | 1024 | 1024 | 1024 | 0.75         | 0.001         |
| 0.67x   | 4096 | 1024 | 1024 | 0.5          | 0.001         |
| 0.47x   | 4096 | 1024 | 1024 | 1.0          | 0.001         |

Bottom 5 Speedups

| Speedup | m    | cd   | n    | sparsityLeft | sparsityRight |
|---------|------|------|------|--------------|---------------|
| 0.04x   | 1024 | 1024 | 1024 | 1.0          | 0.2           |
| 0.04x   | 4096 | 1024 | 1024 | 1.0          | 0.2           |
| 0.04x   | 1024 | 4096 | 4096 | 1.0          | 0.2           |
| 0.04x   | 1    | 4096 | 4096 | 1.0          | 0.2           |
| 0.04x   | 1024 | 4096 | 4096 | 0.75         | 0.2           |

Windows

Geometric Mean Speedup: 0.3121x

Top 5 Speedups

| Speedup | m    | n    | sparsityLeft | sparsityRight |
|---------|------|------|--------------|---------------|
| 0.91x   | 4096 | 1024 | 0.5          | 0.001         |
| 0.87x   | 1024 | 1024 | 0.75         | 0.001         |
| 0.87x   | 4096 | 1024 | 1.0          | 0.001         |
| 0.86x   | 4096 | 1024 | 0.75         | 0.001         |
| 0.85x   | 1024 | 1024 | 1.0          | 0.001         |

Bottom 5 Speedups

| Speedup | m    | n    | sparsityLeft | sparsityRight |
|---------|------|------|--------------|---------------|
| 0.13x   | 1024 | 1024 | 1.0          | 0.2           |
| 0.13x   | 4096 | 1024 | 1.0          | 0.2           |
| 0.13x   | 4096 | 4096 | 1.0          | 0.2           |
| 0.14x   | 1    | 1024 | 0.75         | 0.2           |
| 0.14x   | 1024 | 4096 | 1.0          | 0.2           |

DenseSparseSparse

Benchmark Result Summary

  • the Vector API implementation is 12x – 100x slower at high sparsity but achieves a 1.5x – 3.3x speedup as density increases toward 20%
  • the cost of initializing and scanning the dense intermediate buffer for every row dominates execution time when non-zeros are rare (see the sketch after this list)
  • better performance on the Intel CPU
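
As a rough illustration of the per-row cost described above (the helper below is hypothetical, not the project's code): the dense scratch buffer is scanned and cleared over its full length n for every output row, regardless of how few non-zeros that row actually produces.

```java
import java.util.Arrays;

// Dense-intermediate pattern: each output row is accumulated into a dense scratch
// buffer and then compacted into the sparse result. The full scan and the reset are
// O(n) per row, which dominates when almost all entries are zero.
public final class DenseIntermediateSketch {

    /** Compacts one accumulated row of length n; returns the number of non-zeros kept. */
    static int compactRow(double[] scratch, int n, int[] outIx, double[] outVals) {
        int nnz = 0;
        for (int j = 0; j < n; j++) {            // full O(n) scan of the scratch buffer
            if (scratch[j] != 0) {
                outIx[nnz] = j;
                outVals[nnz++] = scratch[j];
            }
        }
        Arrays.fill(scratch, 0, n, 0);           // O(n) reset before the next row
        return nnz;
    }
}
```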

Benchmark Parameters

  • m: 1024, 1050, 2048, 4073, 4096, 8192
  • cd: 1
  • n: 1024, 1050, 2048, 4073, 4096, 8192
  • Sparsity Left: 0.5, 0.75, 1.0
  • Sparsity Right: 0.001, 0.01, 0.1, 0.2
  • Total Configs: 432

Mac

Geometric Mean Speedup: 0.1731x

Top 5 Speedups

| Speedup | cd | m    | n    | sparsityLeft | sparsityRight |
|---------|----|------|------|--------------|---------------|
| 3.33x   | 1  | 2048 | 4073 | 1.0          | 0.2           |
| 3.16x   | 1  | 4096 | 2048 | 1.0          | 0.2           |
| 3.01x   | 1  | 8192 | 2048 | 1.0          | 0.2           |
| 2.81x   | 1  | 1024 | 4096 | 1.0          | 0.2           |
| 2.76x   | 1  | 4096 | 1050 | 1.0          | 0.2           |

Bottom 5 Speedups

| Speedup | cd | m    | n    | sparsityLeft | sparsityRight |
|---------|----|------|------|--------------|---------------|
| 0.00x   | 1  | 8192 | 8192 | 1.0          | 0.001         |
| 0.00x   | 1  | 2048 | 8192 | 0.5          | 0.001         |
| 0.00x   | 1  | 4073 | 4096 | 1.0          | 0.001         |
| 0.00x   | 1  | 8192 | 4073 | 1.0          | 0.001         |
| 0.00x   | 1  | 4073 | 4073 | 0.75         | 0.001         |

Windows

Geometric Mean Speedup: 0.2560x

Top 5 Speedups

| Speedup | cd | m    | n    | sparsityLeft | sparsityRight |
|---------|----|------|------|--------------|---------------|
| 5.36x   | 1  | 4096 | 4096 | 1.0          | 0.2           |
| 5.31x   | 1  | 1050 | 4096 | 1.0          | 0.2           |
| 5.13x   | 1  | 4073 | 4096 | 1.0          | 0.2           |
| 5.00x   | 1  | 8192 | 8192 | 0.75         | 0.2           |
| 5.00x   | 1  | 4096 | 4073 | 1.0          | 0.2           |

Bottom 5 Speedups

| Speedup | cd | m    | n    | sparsityLeft | sparsityRight |
|---------|----|------|------|--------------|---------------|
| 0.00x   | 1  | 1050 | 8192 | 0.5          | 0.001         |
| 0.00x   | 1  | 2048 | 8192 | 0.5          | 0.001         |
| 0.01x   | 1  | 4073 | 8192 | 0.5          | 0.001         |
| 0.01x   | 1  | 8192 | 8192 | 0.5          | 0.001         |
| 0.01x   | 1  | 1024 | 8192 | 0.5          | 0.001         |

SparseDenseMVTallRHS

Benchmark Result Summary

  • Mac: the vectorized implementation is consistently 3.7x to 7.7x slower than the scalar baseline
    • the regression is most severe for high sparsity and smaller matrix dimensions
  • Intel CPU: the vectorized implementation is on average ~9% faster than the scalar baseline
    • the larger vector capacity and hardware support for AVX2 provide enough throughput to offset the vector setup costs

Benchmark Parameters

  • m: 2048, 4096, 8192
  • cd: 4096, 8192, 16384
  • n: 1
  • Sparsity Left: 0.05, 0.1, 0.2
  • Sparsity Right: 0.5, 0.75, 1.0
  • Total Configs: 81

Mac

Geometric Mean Speedup: 0.1938x

Top 5 Speedups

| Speedup | cd    | m    | n | sparsityLeft | sparsityRight |
|---------|-------|------|---|--------------|---------------|
| 0.27x   | 16384 | 8192 | 1 | 0.1          | 0.75          |
| 0.27x   | 16384 | 4096 | 1 | 0.2          | 1.0           |
| 0.27x   | 16384 | 4096 | 1 | 0.1          | 0.5           |
| 0.27x   | 16384 | 4096 | 1 | 0.2          | 0.5           |
| 0.27x   | 16384 | 8192 | 1 | 0.1          | 0.5           |

Bottom 5 Speedups

| Speedup | cd   | m    | n | sparsityLeft | sparsityRight |
|---------|------|------|---|--------------|---------------|
| 0.13x   | 4096 | 2048 | 1 | 0.1          | 1.0           |
| 0.14x   | 4096 | 2048 | 1 | 0.1          | 0.75          |
| 0.14x   | 8192 | 4096 | 1 | 0.05         | 1.0           |
| 0.14x   | 4096 | 2048 | 1 | 0.1          | 0.5           |
| 0.14x   | 8192 | 4096 | 1 | 0.05         | 0.75          |

Windows

Geometric Mean Speedup: 1.0880x

Top 5 Speedups

| Speedup | cd   | m    | n | sparsityLeft | sparsityRight |
|---------|------|------|---|--------------|---------------|
| 1.25x   | 4096 | 2048 | 1 | 0.2          | 0.5           |
| 1.21x   | 8192 | 8192 | 1 | 0.2          | 0.75          |
| 1.18x   | 8192 | 2048 | 1 | 0.2          | 0.75          |
| 1.18x   | 8192 | 2048 | 1 | 0.2          | 1.0           |
| 1.18x   | 8192 | 4096 | 1 | 0.2          | 0.75          |

Bottom 5 Speedups

| Speedup | cd    | m    | n | sparsityLeft | sparsityRight |
|---------|-------|------|---|--------------|---------------|
| 0.95x   | 16384 | 2048 | 1 | 0.05         | 0.75          |
| 0.97x   | 4096  | 4096 | 1 | 0.05         | 1.0           |
| 0.97x   | 4096  | 4096 | 1 | 0.05         | 0.5           |
| 0.98x   | 4096  | 4096 | 1 | 0.05         | 0.75          |
| 0.98x   | 4096  | 8192 | 1 | 0.05         | 0.75          |

Baseline vs. Vectorized Performance Plots

Plots included (per kernel and platform): dense_dense_sparse_mac_plot, dense_dense_sparse_windows_plot, dense_sparse_dense_mac_plot, dense_sparse_dense_windows_plot, dense_sparse_sparse_mac_plot, dense_sparse_sparse_windows_plot, sparse_dense_mvtallrhs_mac_plot, sparse_dense_mvtallrhs_windows_plot
