ARM SME2: Accelerate MatMul FP16 by aarav18 · Pull Request #457 · cactus-compute/cactus

aarav18 · 2026-02-26T05:05:34Z

Summary

- Implements FP16 MatMul using ARM SME2 instructions
- The SME2 kernel is only selected if:
  - The compiler supports SME2
  - The CPU supports SME2 (runtime check)
  - M >= 4 (minimum SME2 tile size)
- Adds a separate compilation target for SME2
  - SME2 requires architecture-specific flags (e.g. `-march=armv9.2-a+sme2`) that differ from the normal build
  - To avoid double compilation and changes to the global build flags, all SME2 code is isolated in `kernel_sme2.cpp`
  - For future kernels, an "SME2 version" of each kernel file can be created and added to the same compilation target
- Introduces CMake option `ENABLE_SME2`
  - AUTO (default): build SME2 only if the compiler supports it, otherwise build normally
  - OFF: force-disable SME2 and build normally
- Existing AMX/NEON implementations remain unchanged and serve as the fallback paths
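
For reviewers who want the selection rule in one place, here is a minimal sketch of the three conditions above (compiler support, runtime CPU support, M >= 4). It is not the code in this PR. The macro `CACTUS_HAS_SME2_KERNEL` and the names `cpu_has_sme2`, `select_path`, and `select_path_for_rows` are hypothetical. The macOS sysctl key and the Linux `HWCAP2_SME2` bit are assumptions about how a runtime check could be done, and the function simply returns false on any platform where neither is available.

```cpp
#include <cstddef>

#if defined(__APPLE__) && defined(__aarch64__)
#include <sys/types.h>
#include <sys/sysctl.h>
#elif defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>
#include <asm/hwcap.h>
#endif

// Hypothetical macro: assumed to be defined by the build system only when
// the SME2 translation unit (kernel_sme2.cpp) was actually compiled.
#if defined(CACTUS_HAS_SME2_KERNEL)
constexpr bool kSme2Compiled = true;
#else
constexpr bool kSme2Compiled = false;
#endif

// "M >= 4 (minimum SME2 tile size)" from the summary.
constexpr std::size_t kMinSme2Rows = 4;

// Runtime CPU check. Returns false on platforms this sketch does not cover.
inline bool cpu_has_sme2() {
#if defined(__APPLE__) && defined(__aarch64__)
    int value = 0;
    std::size_t size = sizeof(value);
    // Assumption: this sysctl key reports SME2 support on recent macOS.
    return sysctlbyname("hw.optional.arm.FEAT_SME2", &value, &size, nullptr, 0) == 0
           && value != 0;
#elif defined(__linux__) && defined(__aarch64__) && defined(HWCAP2_SME2)
    // Assumption: kernel headers new enough to define HWCAP2_SME2.
    return (getauxval(AT_HWCAP2) & HWCAP2_SME2) != 0;
#else
    return false;
#endif
}

enum class MatmulPath { Sme2, Fallback };

// Pure selection logic, kept separate from detection so it is easy to test.
constexpr MatmulPath select_path(bool sme2_compiled, bool sme2_cpu, std::size_t m) {
    return (sme2_compiled && sme2_cpu && m >= kMinSme2Rows) ? MatmulPath::Sme2
                                                            : MatmulPath::Fallback;
}

// Convenience wrapper: queries the CPU once and reuses the answer.
inline MatmulPath select_path_for_rows(std::size_t m) {
    static const bool sme2_cpu = cpu_has_sme2();
    return select_path(kSme2Compiled, sme2_cpu, m);
}
```

Under this sketch, a 1x1024x1024 product always takes the fallback path because M is below the tile minimum, which is consistent with the "uses NEON" rows in the benchmark tables.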

Contributes to #299

Benchmark (M4, run with `cactus test --precision FP16`)

NEON vs SME2

| Benchmark | Before (NEON) | After (SME2) | Speedup |
| --- | --- | --- | --- |
| MatMul 1024³ CPU | 12.158 ms / 177 GFLOPS | 1.204 ms / 1784 GFLOPS | 10.1x |
| MatMul F16 1x1024x1024 | 0.041 ms / 52 GFLOPS | 0.038 ms / 55 GFLOPS | ~1.0x (uses NEON) |
| MatMul F16 1024x1024x1024 | 11.299 ms / 190 GFLOPS | 1.053 ms / 2040 GFLOPS | 10.7x |

AMX vs SME2

| Benchmark | Before (AMX) | After (SME2) | Speedup |
| --- | --- | --- | --- |
| MatMul 1024³ CPU | 1.937 ms / 1109 GFLOPS | 1.204 ms / 1784 GFLOPS | 1.6x |
| MatMul F16 1x1024x1024 | 0.039 ms / 54 GFLOPS | 0.038 ms / 55 GFLOPS | ~1.0x (uses NEON) |
| MatMul F16 1024x1024x1024 | 1.744 ms / 1231 GFLOPS | 1.053 ms / 2040 GFLOPS | 1.7x |
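
The GFLOPS and speedup columns above appear consistent with the conventional 2·M·N·K operation count for a matrix product, that is, one multiply and one add per inner-product term. The sketch below reproduces that arithmetic so the figures can be checked by hand. It assumes this is how the benchmark counts operations, which the PR does not state, and the helper names are illustrative only.

```cpp
#include <cstdint>

// FLOP count for an (M x K) by (K x N) product: one multiply and one add
// per inner-product term. Assumed convention, not confirmed by the PR.
constexpr double matmul_flops(std::uint64_t m, std::uint64_t n, std::uint64_t k) {
    return 2.0 * static_cast<double>(m) * static_cast<double>(n) * static_cast<double>(k);
}

// Throughput in GFLOPS for a run that took `millis` milliseconds.
constexpr double gflops(std::uint64_t m, std::uint64_t n, std::uint64_t k, double millis) {
    return matmul_flops(m, n, k) / (millis * 1e-3) / 1e9;
}

// Speedup as the ratio of wall times.
constexpr double speedup(double before_ms, double after_ms) {
    return before_ms / after_ms;
}
```

For example, `gflops(1024, 1024, 1024, 1.204)` evaluates to roughly 1784 and `speedup(12.158, 1.204)` to roughly 10.1, matching the first row of the NEON table.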

Further Work

- There is most likely not much room for further optimization of this kernel
  - I can experiment with prefetching and preprocessing, but gains would likely be minimal
- Next steps would be to add SME2 paths for other kernels, such as INT8/INT4

Signed-off-by: Aarav Shah <[email protected]>


aarav18 and others added 23 commits February 25, 2026 20:29

add SME2 runtime cpu check in kernel utils

d5ed3ee

Signed-off-by: Aarav Shah <[email protected]>

matmul fp16 sme2 kernel implemented

d00a26c

Signed-off-by: Aarav Shah <[email protected]>

sme2 compiles, correct, benchmarked

2f367e3

Signed-off-by: Aarav Shah <[email protected]>

tile size optimization for sme2 kernel

15e6f56

Signed-off-by: Aarav Shah <[email protected]>

sme2 kernel is now portable - changed compile time checks and added h…

1b3a886

…/w checks Signed-off-by: Aarav Shah <[email protected]>

use thread local storage, better block size for multithreading

3497c5f

Signed-off-by: Aarav Shah <[email protected]>

optimized sme2 kernel using multi-threaded preprocessing step

265c9b8

Signed-off-by: Aarav Shah <[email protected]>

sme2 kernel is now actually correct, passed all tests

8b0b8bf

Signed-off-by: Aarav Shah <[email protected]>

added parallelization, ready for PR

18562d7

Signed-off-by: Aarav Shah <[email protected]>

added M check to optimize low-dimensional matmul (no SME2)

53e2326

Signed-off-by: Aarav Shah <[email protected]>

optimized sme2 kernel to use multi-vector (2 x N tiling)

5700419

Signed-off-by: Aarav Shah <[email protected]>

optimized sme2 kernel to use multi-vector load/store (x2)

9b69224

Signed-off-by: Aarav Shah <[email protected]>

Revert "optimized sme2 kernel to use multi-vector load/store (x2)"

cc462d2

This reverts commit d8447f5. Signed-off-by: Aarav Shah <[email protected]>

putting SME2 path in front of AMX for testing

ec23158

Signed-off-by: Aarav Shah <[email protected]>

sme2 kernel fully working + some optimizations

8743498

Signed-off-by: Aarav Shah <[email protected]>

loop unrolling + full row tile optimizations

79c3ac8

Signed-off-by: Aarav Shah <[email protected]>

optimized with deeper k-unroll and tiles-per-thread tuning

05811d1

Signed-off-by: Aarav Shah <[email protected]>

improved A + B packing, ~10% gains

ae0bc0a

Signed-off-by: Aarav Shah <[email protected]>

guarded clang unrolling + small packing gains, ~1%

e499017

Signed-off-by: Aarav Shah <[email protected]>

retuned sme2 caller + even-k path fast path + cleanup (~15-20% gains)

8b6b640

Signed-off-by: Aarav Shah <[email protected]>

fused A-pack and compute, added dynamic row block scheduling, 8-10% g…

ff18f44

…ains Signed-off-by: Aarav Shah <[email protected]>

slight cleanup, should be PR-ready

0210e27

Signed-off-by: Aarav Shah <[email protected]>

Mobile bug fix

50ddcce

Signed-off-by: HenryNdubuaku <[email protected]>

HenryNdubuaku merged commit b1203de into cactus-compute:main Feb 28, 2026
1 check passed

yujonglee mentioned this pull request Mar 5, 2026

macOS: link clang_rt.osx to fix SME2 (__arm_tpidr2_*) link failures under rustc #498

Merged

HenryNdubuaku merged 23 commits into cactus-compute:main from aarav18:arm-sme2
