Accelerate FP16 matmul via cblas_sgemm for Apple AMX #340

HenryNdubuaku merged 1 commit into cactus-compute:main
Conversation
Pull request overview
This PR adds an Apple-specific fast path for large FP16 matrix multiplications by converting inputs to FP32, calling Accelerate's `cblas_sgemm` (leveraging AMX internally), then converting the FP32 result back to FP16, while keeping the existing NEON implementation for smaller shapes.

Changes:
- Add an `__APPLE__`-guarded Accelerate (`cblas_sgemm`) matmul path for large FP16 matmuls.
- Convert `__fp16` inputs to `float` buffers and cast the FP32 output back to `__fp16`.
- Leave the existing NEON tiled/parallel implementation as the fallback for smaller matrices.
```cpp
std::vector<float> A_f32(a_len);
std::vector<float> BT_f32(b_len);
std::vector<float> C_f32(c_len);
```
The Accelerate path allocates three large std::vector<float> buffers on every call (A, B^T, C). For large matrices this can add significant allocator overhead and memory pressure. Consider reusing scratch buffers (e.g., thread-local or a caller-provided workspace) and/or using Accelerate/vDSP conversion routines to reduce per-call overhead.
Suggested change:

```diff
-std::vector<float> A_f32(a_len);
-std::vector<float> BT_f32(b_len);
-std::vector<float> C_f32(c_len);
+static thread_local std::vector<float> A_f32;
+static thread_local std::vector<float> BT_f32;
+static thread_local std::vector<float> C_f32;
+A_f32.resize(a_len);
+BT_f32.resize(b_len);
+C_f32.resize(c_len);
```
cactus/kernel/kernel_matmul.cpp
Outdated
```cpp
if (K >= 256 && M >= 4) {
    const size_t a_len = M * K;
```
The thresholds K >= 256 and M >= 4 are hard-coded magic numbers. To make this easier to tune and keep consistent with other Accelerate thresholds (e.g., in kernel_conv.cpp), define them as named constexpr constants (and ideally document the rationale/benchmark behind them).
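A minimal sketch of what the suggested change could look like. The constant and helper names are hypothetical; the values are the ones hard-coded in this PR, and the reviewer's point is that the final numbers should be documented and benchmarked:

```cpp
#include <cstddef>

// Hypothetical names; values are the thresholds hard-coded in this PR.
// Keep in sync with the analogous Accelerate thresholds in kernel_conv.cpp,
// and document the benchmark that justifies them.
constexpr size_t kAccelerateMinK = 256;
constexpr size_t kAccelerateMinM = 4;

inline bool use_accelerate_path(size_t M, size_t K) {
    return K >= kAccelerateMinK && M >= kAccelerateMinM;
}
```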
```cpp
#ifdef __APPLE__
    if (K >= 256 && M >= 4) {
        const size_t a_len = M * K;
        const size_t b_len = N * K;
        const size_t c_len = M * N;

        std::vector<float> A_f32(a_len);
        std::vector<float> BT_f32(b_len);
        std::vector<float> C_f32(c_len);

        for (size_t i = 0; i < a_len; i++) A_f32[i] = (float)a[i];
        for (size_t i = 0; i < b_len; i++) BT_f32[i] = (float)b_transposed[i];

        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                    (int)M, (int)N, (int)K,
                    1.0f, A_f32.data(), (int)K,
                    BT_f32.data(), (int)K,
                    0.0f, C_f32.data(), (int)N);

        for (size_t i = 0; i < c_len; i++) c[i] = (__fp16)C_f32[i];
        return;
    }
```
This change introduces a new Apple-only execution path with different numerics (FP32 accumulate + FP16 cast) and a different backend (cblas_sgemm), but there are currently no correctness tests for FP16 matmul. Adding a guarded (#ifdef __APPLE__) test that exercises the Accelerate threshold region (e.g., M>=4, K>=256) and compares against a reference implementation would help prevent silent regressions.
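A sketch of what such a test's core check could look like. Since `__fp16`, Accelerate, and the cactus kernel entry point are platform-specific, this stand-in uses plain `float`, a naive FP32-accumulating GEMM in place of `cblas_sgemm`, and a double-precision reference; all names are hypothetical:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Reference: C = A * B^T with double accumulation.
static void ref_matmul(const std::vector<float>& A, const std::vector<float>& BT,
                       std::vector<float>& C, size_t M, size_t N, size_t K) {
    for (size_t i = 0; i < M; i++)
        for (size_t j = 0; j < N; j++) {
            double acc = 0.0;
            for (size_t k = 0; k < K; k++)
                acc += (double)A[i * K + k] * (double)BT[j * K + k];
            C[i * N + j] = (float)acc;
        }
}

// Stand-in for the Accelerate path: FP32 accumulation, as cblas_sgemm does.
static void sgemm_like(const std::vector<float>& A, const std::vector<float>& BT,
                       std::vector<float>& C, size_t M, size_t N, size_t K) {
    for (size_t i = 0; i < M; i++)
        for (size_t j = 0; j < N; j++) {
            float acc = 0.0f;
            for (size_t k = 0; k < K; k++)
                acc += A[i * K + k] * BT[j * K + k];
            C[i * N + j] = acc;
        }
}

// Exercise a shape inside the Accelerate threshold region and compare
// against the reference with a relative tolerance.
bool matmul_matches_reference(size_t M, size_t N, size_t K, float tol) {
    std::vector<float> A(M * K), BT(N * K), C(M * N), C_ref(M * N);
    for (size_t i = 0; i < A.size(); i++)  A[i]  = (float)((i % 7) - 3) * 0.25f;
    for (size_t i = 0; i < BT.size(); i++) BT[i] = (float)((i % 5) - 2) * 0.5f;
    sgemm_like(A, BT, C, M, N, K);
    ref_matmul(A, BT, C_ref, M, N, K);
    for (size_t i = 0; i < C.size(); i++)
        if (std::fabs(C[i] - C_ref[i]) > tol * (1.0f + std::fabs(C_ref[i])))
            return false;
    return true;
}
```

In a real test the `sgemm_like` call would be replaced by the FP16 matmul kernel under `#ifdef __APPLE__`, with the tolerance widened to account for the final FP16 cast.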
```cpp
cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
            (int)M, (int)N, (int)K,
            1.0f, A_f32.data(), (int)K,
            BT_f32.data(), (int)K,
            0.0f, C_f32.data(), (int)N);
```
cblas_sgemm takes int dimensions/leading dimensions, but this code casts from size_t without bounds checks. If M/N/K exceed INT_MAX (or even just K/N for lda/ldb/ldc), the cast can overflow and lead to incorrect results or memory errors. Add a guard (e.g., if (M > INT_MAX || N > INT_MAX || K > INT_MAX) fall back to the existing NEON path) before calling BLAS.
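A minimal sketch of the suggested guard; the helper name is hypothetical:

```cpp
#include <climits>
#include <cstddef>

// True when every GEMM dimension fits in the int parameters cblas_sgemm
// expects; callers would fall back to the NEON path otherwise.
inline bool fits_in_blas_int(size_t M, size_t N, size_t K) {
    const size_t kMax = static_cast<size_t>(INT_MAX);
    return M <= kMax && N <= kMax && K <= kMax;
}
```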
Routes large FP16 matmuls (K>=256, M>=4) through Apple's Accelerate framework, which uses AMX internally. Converts FP16→FP32, calls cblas_sgemm, converts back. Small matrices fall through to the existing NEON path.

Benchmarked 4x speedup on 1024³ (215→851 GFLOPS).

Signed-off-by: Kayaan Tharani <[email protected]>
4e790d6 to 1cf3511
Thanks for this @KayaanT, you've now learnt how we love code contributions; most tasks are designed to need no more than 1-3 file changes, so I'll merge this.
Accelerate FP16 attention via cblas_sgemm for Apple AMX (cactus-compute#346)

Batch query positions into real GEMMs so cblas_sgemm can use AMX, same approach as the matmul kernel (cactus-compute#340). Parallelizes over batch * num_q_heads instead of batch * num_q_heads * seq_len, with each work item processing all seq_len positions for one head.

Benchmark (1024x16x64): 8.7ms / 246 GFLOPS -> 4.4ms / 490 GFLOPS (~2x)

Falls back to existing NEON path for seq_len < 64.

Signed-off-by: Kayaan Tharani <[email protected]>

* Add ACCELERATE_NEW_LAPACK definition for improved LAPACK support

Signed-off-by: HenryNdubuaku <[email protected]>
Co-authored-by: HenryNdubuaku <[email protected]>
Summary
- Routes large FP16 matmuls through Accelerate (`cblas_sgemm`), which uses AMX internally
- Converts FP16 inputs to FP32, calls `cblas_sgemm`, converts result back to FP16
- Follows the existing Accelerate usage pattern (`kernel_conv.cpp`)

Contributes to #298
Benchmark (M4 Pro, 1024x1024x1024): 215 → 851 GFLOPS (~4x)