ARM SME2: Accelerate MatMul FP16 #457

Merged
HenryNdubuaku merged 23 commits into cactus-compute:main from aarav18:arm-sme2
Feb 28, 2026

aarav18 commented Feb 26, 2026

Summary

  • Implements FP16 MatMul using ARM SME2 instructions

  • SME2 kernel is only selected if:

    • The compiler supports SME2
    • The CPU supports SME2 (runtime check)
    • M >= 4 (minimum SME2 tile size)
  • Adds a separate compilation target for SME2

    • SME2 requires architecture-specific flags (e.g. -march=armv9.2-a+sme2), different from the normal build
    • To avoid double-compilation and impacting global build flags, all SME2 code is isolated to kernel_sme2.cpp
    • For future kernels, an "SME2 version" of each kernel file can be created, and all SME2 files can be added to the compilation target
  • Introduces CMake option ENABLE_SME2

    • AUTO (default): build SME2 only if supported by the compiler, else build normally
    • OFF: Force-disable SME2, build normally
  • Existing AMX/Neon implementations remain unchanged and serve as the fallback paths
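
The selection rule in the summary can be sketched as a small pure function. This is only an illustration of the three conditions listed above, not the code in this PR: `MatmulPath`, `Sme2Support`, `kMinSme2Rows`, and `select_f16_matmul_path` are hypothetical names, and the compile-time and runtime checks are passed in as plain booleans rather than queried from the build system or the OS.

```cpp
#include <cstddef>

// Hypothetical names for illustration; the PR's actual identifiers may differ.
enum class MatmulPath { Fallback, Sme2 };  // Fallback = existing AMX/Neon path

struct Sme2Support {
    bool compiled_in;   // assumed: true when the SME2 target was built
    bool cpu_has_sme2;  // assumed: result of a runtime CPU feature check
};

// Minimum number of rows (M) for the SME2 kernel, per the PR summary.
constexpr std::size_t kMinSme2Rows = 4;

// Choose SME2 only when all three conditions hold; otherwise keep the fallback.
constexpr MatmulPath select_f16_matmul_path(std::size_t M, Sme2Support s) {
    return (s.compiled_in && s.cpu_has_sme2 && M >= kMinSme2Rows)
               ? MatmulPath::Sme2
               : MatmulPath::Fallback;
}
```

Under this sketch, an M = 1 call such as the 1x1024x1024 benchmark stays on the fallback path even on an SME2-capable CPU, which matches the "uses NEON" note in the benchmark tables.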

Contributes to #299


Benchmark (M4, run with cactus test --precision FP16)

NEON vs SME2

| Benchmark | Before (NEON) | After (SME2) | Speedup |
| --- | --- | --- | --- |
| MatMul 1024³ CPU | 12.158 ms / 177 GFLOPS | 1.204 ms / 1784 GFLOPS | 10.1x |
| MatMul F16 1x1024x1024 | 0.041 ms / 52 GFLOPS | 0.038 ms / 55 GFLOPS | ~1.0x (uses NEON) |
| MatMul F16 1024x1024x1024 | 11.299 ms / 190 GFLOPS | 1.053 ms / 2040 GFLOPS | 10.7x |

AMX vs SME2

| Benchmark | Before (AMX) | After (SME2) | Speedup |
| --- | --- | --- | --- |
| MatMul 1024³ CPU | 1.937 ms / 1109 GFLOPS | 1.204 ms / 1784 GFLOPS | 1.6x |
| MatMul F16 1x1024x1024 | 0.039 ms / 54 GFLOPS | 0.038 ms / 55 GFLOPS | ~1.0x (uses NEON) |
| MatMul F16 1024x1024x1024 | 1.744 ms / 1231 GFLOPS | 1.053 ms / 2040 GFLOPS | 1.7x |
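
For readers checking the tables, the GFLOPS and speedup columns can be reproduced from the timings. The sketch below assumes the benchmark counts 2·M·K·N floating-point operations per matmul (one multiply and one add per inner-product term). That convention is not stated in the PR, but it is consistent with the reported figures. The helper names `matmul_gflops` and `speedup` are hypothetical.

```cpp
#include <cstdint>

// Throughput in GFLOPS for an (M x K) * (K x N) matmul that took `ms` milliseconds.
// Assumes 2*M*K*N FLOPs per matmul; this counting convention is an assumption.
constexpr double matmul_gflops(std::uint64_t M, std::uint64_t K, std::uint64_t N,
                               double ms) {
    return (2.0 * static_cast<double>(M) * static_cast<double>(K) *
            static_cast<double>(N)) /
           (ms * 1e-3) / 1e9;
}

// Speedup of the "after" timing relative to the "before" timing.
constexpr double speedup(double before_ms, double after_ms) {
    return before_ms / after_ms;
}
```

For example, `matmul_gflops(1024, 1024, 1024, 1.204)` gives roughly 1784 GFLOPS and `speedup(12.158, 1.204)` gives roughly 10.1, in line with the first row of the NEON vs SME2 table.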

Further Work

  • This kernel most likely has little room left for further optimization
    • I can experiment with prefetching and preprocessing, but I expect the gains to be minimal
  • The next step would be to add SME2 versions of other kernels, such as INT8/INT4

aarav18 and others added 23 commits February 25, 2026 20:29
Signed-off-by: HenryNdubuaku <[email protected]>
@HenryNdubuaku HenryNdubuaku merged commit b1203de into cactus-compute:main Feb 28, 2026
1 check passed