[webgpu] Implement SubGroupMatrix based MatMulNBits for Metal#23729
The ORT web pipeline compiles the WebGPU EP with Emscripten, which fails with:
Possibly the header file that comes with Emscripten doesn't know that feature name yet.
Maybe use
Done!
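One common way to handle a feature name that an older toolchain header does not declare is a preprocessor guard. The sketch below is only an illustration of that idea, not necessarily what was suggested or merged here: `kHasSubgroupMatrixFeature` and `CanUseSubgroupMatrix` are hypothetical names, and the assumption that Emscripten builds lack the feature is taken from the comment above. `__EMSCRIPTEN__` is the macro Emscripten defines when compiling.

```cpp
// Hypothetical compile-time switch: false when building with Emscripten,
// whose bundled WebGPU header is assumed not to declare the feature name.
#if defined(__EMSCRIPTEN__)
constexpr bool kHasSubgroupMatrixFeature = false;
#else
constexpr bool kHasSubgroupMatrixFeature = true;
#endif

// Returns true only if the build knows about the feature AND the device
// reports it at runtime. In a real build, the code that names the feature
// enum would itself sit inside the #if so it never reaches the old header.
bool CanUseSubgroupMatrix(bool device_reports_feature) {
  if (!kHasSubgroupMatrixFeature) {
    return false;
  }
  return device_reports_feature;
}
```

With a guard like this, the web build falls back to the existing path while native builds can still opt in at runtime.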
/azp run Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline,CoreML CI Pipeline,Linux DNNL CI Pipeline,Linux MIGraphX CI Pipeline,Linux ROCm CI Pipeline
Azure Pipelines successfully started running 7 pipeline(s).
### Description

Recent progress on the SubGroupMatrix prototype in Dawn (https://issues.chromium.org/issues/348702031) exposes SIMD-group matrix functions to WebGPU. This shader implements MatMulNBits using that primitive.

Observed perf gains in LLM inference speed: prefill for Phi 3.5 with a 1K-token prompt is roughly 2.7x faster (5.4s, down from 14.5s). Token generation speed is essentially unchanged.

With Changes

```
./model_benchmark -i ~/Phi-3.5-mini-instruct-onnx-web -l 1000
Batch size: 1, prompt tokens: 1001, tokens to generate: 128
Prompt processing (time to first token):
    avg (us):       5.42498e+06   <<< SubGroupMatrix 5.4s
    avg (tokens/s): 184.517
    p50 (us):       5.41982e+06
    stddev (us):    12023.8
    n:              5 * 1001 token(s)
Token generation:
    avg (us):       91138.5
    avg (tokens/s): 10.9723
    p50 (us):       89488.5
    stddev (us):    35136.2
    n:              635 * 1 token(s)
```

Baseline

```
./model_benchmark -i ~/Phi-3.5-mini-instruct-onnx-web -l 1000
Batch size: 1, prompt tokens: 1001, tokens to generate: 128
Prompt processing (time to first token):
    avg (us):       1.45507e+07   <<< Baseline 14.5s
    avg (tokens/s): 68.7938
    p50 (us):       1.45413e+07
    stddev (us):    22208.9
    n:              5 * 1001 token(s)
Token generation:
    avg (us):       94109.8
    avg (tokens/s): 10.6259
    p50 (us):       89660
    stddev (us):    61579
    n:              635 * 1 token(s)
```