[ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend #134124
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/134124
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit c25abbf with merge base 8136daf.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Attention! native_functions.yaml was changed
If you are adding a new function or defaulted argument to native_functions.yaml, you cannot use it from pre-existing Python frontend code until our FC window passes (two weeks). Split your PR into two PRs, one which adds the new C++ functionality, and one that makes use of it from Python, and land them two weeks apart. See https://github.com/pytorch/pytorch/wiki/PyTorch's-Python-Frontend-Backward-and-Forward-Compatibility-Policy#forwards-compatibility-fc for more info.
Caused by:
@pytorchbot label "ciflow/linux-aarch64" "module:arm"
Can't add following labels to PR: ciflow/linux-aarch64. Please ping one of the reviewers for help.
@pytorchbot label "module:arm"
Didn't find following labels among repository labels: module:arm
@pytorchbot label "module: arm"
@pytorchbot label "ciflow/linux-aarch64"
Can't add following labels to PR: ciflow/linux-aarch64. Please ping one of the reviewers for help.
snadampal left a comment
Thanks for the contribution. Please find my comments inline.
Force-pushed from 53a9e89 to fd6f73b.
@malfet can you please help in merging this PR?
I believe the pre-requisite for merging is a passing build and test for the specific target, and this PR clearly fails the aarch64 build right now, see https://github.com/pytorch/pytorch/actions/runs/10616020253/job/29425421107?pr=134124. For comparison, here is the result of the build/test for the recent trunk commit: https://github.com/pytorch/pytorch/actions/runs/10636148674/job/29487324878
Hello @malfet, please let me know your thoughts on this and whether it addresses all your concerns regarding the PR.
Force-pushed from fd6f73b to 9cf8231.
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here
Rebase failed due to Command
@huydhn has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Description:
1. Quantize Linear Layer Weights to 4-bits: Quantize the weights of the Linear layer to 4 bits, using symmetric quantization. Pack two 4-bit weights into one uint8 container. Choose a quantization scheme (channel-wise or group-wise), with the group size being a multiple of 32.
2. Prepare Quantized Weights, Scales, and Optional Bias: After quantizing, obtain the quantized_weights, scales, and groupsize. If the original Linear layer has a bias, prepare it as well.
3. Pack the Weights Efficiently: Use torch.ops.aten._dyn_quant_pack_4bit_weight to optimally pack the weights, scales, and optional bias: packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(weight, scales_and_zeros, bias, groupsize, in_features, out_features). Input parameters should include in_features and out_features (the same as the Linear layer's corresponding parameters).
4. Perform Dynamic Quantized Matrix Multiplication: Use torch.ops.aten._dyn_quant_matmul_4bit to perform matrix multiplication with quantized weights: output = torch.ops.aten._dyn_quant_matmul_4bit(input, packed_weights, scales_and_zeros, bias, groupsize, in_features, out_features). Inputs required include the input tensor, packed_weights, scales, bias, groupsize, and the in_features and out_features.
Model Perf:
7B Transformer model: Prefill: 340 t/s, Decode: 40 t/s
2B Transformer model: Prefill: 747 t/s, Decode: 80 t/s
Tests:
python test/test_linalg.py -k test__dyn_quant_pack_4bit_weight (Ran 1 test in 0.016s, OK)
python test/test_linalg.py -k test__dyn_quant_matmul_4bit (Ran 8 tests in 0.077s, OK)
python test/test_linalg.py -k test_compile_dyn_quant_matmul_4bit (Ran 8 tests in 11.454s)
Signed-off-by: Nikhil Gupta <[email protected]>
Change-Id: I0a9c864c56a9d1b4e6179dc3059cd37b11525c8d
Description: The scale and bias tensors are no longer needed in the dynamic quantized 4-bit matmul call.
Signed-off-by: Nikhil Gupta <[email protected]>
Change-Id: Ia34466eea5fefc5780d418a4321fa9b78c142799
…yn_quant_pack_4bit_weight
Tests:
python test/inductor/test_torchinductor.py -k test__dyn_quant_matmul_4bit (Ran 1 test in 0.326s, OK)
python test/inductor/test_torchinductor.py -k test__dyn_quant_pack_4bit_weight (Ran 1 test in 5.664s, OK)
Signed-off-by: Nikhil Gupta <[email protected]>
Change-Id: I5ed5d8d761769c1af611f75d05e9fc6e9fc64cb4
Signed-off-by: Nikhil Gupta <[email protected]>
Change-Id: I08dcdc5780831771a66325ac5e8b45e0805cf990
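The commit messages above describe the weight quantization scheme (symmetric 4-bit values, two packed per uint8, channel-wise or group-wise with a group size that is a multiple of 32) without showing it. Below is a minimal channel-wise sketch in plain PyTorch; the function name and the packed layout are illustrative assumptions and may not match the exact layout the aten op or the KleidiAI kernels expect.

```python
import torch

def quantize_4bit_symmetric_channelwise(weight: torch.Tensor):
    # Sketch only: symmetric per-output-channel 4-bit quantization of a Linear
    # weight of shape (out_features, in_features), packing two 4-bit values per
    # uint8. The layout expected by the packing/matmul ops may differ; this just
    # illustrates the scheme described in the commit message.
    out_features, in_features = weight.shape
    assert in_features % 2 == 0, "need an even in_features to pack two nibbles per byte"
    # Symmetric per-channel scale: map [-max|w|, +max|w|] onto the int4 range [-8, 7].
    scales = (weight.abs().amax(dim=1, keepdim=True) / 7.0).clamp_min(1e-8)
    qweight = torch.clamp(torch.round(weight / scales), -8, 7).to(torch.int8)
    # Shift to unsigned nibbles [0, 15] and pack pairs into one uint8 (low nibble first).
    nibbles = (qweight + 8).to(torch.uint8).view(out_features, in_features // 2, 2)
    packed = nibbles[..., 0] | (nibbles[..., 1] << 4)
    return packed, scales.squeeze(1)

# Example: quantize a 64x128 weight; packed is (64, 64) uint8, scales is (64,) float32.
w = torch.randn(64, 128)
packed, scales = quantize_4bit_symmetric_channelwise(w)
```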
Force-pushed from 6d178af to c25abbf.
@huydhn has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@pytorchbot merge (Initiating merge automatically since Phabricator Diff has merged)
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
Merge failed
Reason: 1 jobs have failed, first few of them are: Meta Internal-Only Changes Check
Details for Dev Infra team: Raised by workflow job
@pytorchbot merge -f 'Diff has been landed internally'
Merge started
Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use
Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
Description: Allow int8_dynamic_activation_intx_weight to work with the aten _dyn_quant_matmul_4bit op.
Needs: pytorch/pytorch#134124 or PyTorch > 2.6.0
Signed-off-by: Nikhil Gupta <[email protected]>
…I Backend (pytorch#134124)" This reverts commit 94737e8.
Mitigation for #145273. Reverting #134124 and #144074.
Pull Request resolved: #145392
Approved by: https://github.com/ZainRizvi, https://github.com/malfet, https://github.com/atalman, https://github.com/digantdesai
#134124 was reverted by #145392 due to a KleidiAI clone issue.
1. This reverts commit 0940eb6 (#145392) and fixes the KleidiAI mirror issue.
2. KleidiAI is now cloned from the GitHub mirror instead of Arm GitLab.
Change-Id: I7d6eee7214cd117d3057d615936fcc3ee6052fa2
Signed-off-by: Nikhil Gupta <[email protected]>
#145505) #134124 was reverted by #145392 due to a KleidiAI clone issue.
1. This reverts commit 0940eb6 (#145392) and fixes the KleidiAI mirror issue.
2. KleidiAI is now cloned from the GitHub mirror instead of Arm GitLab.
Change-Id: I7d6eee7214cd117d3057d615936fcc3ee6052fa2
Fixes #145273
Pull Request resolved: #145505
Approved by: https://github.com/malfet
Description:
1. Quantize Linear Layer Weights to 4-bits:
   - Quantize the weights of the Linear layer to 4 bits, using symmetric quantization.
   - Pack two 4-bit weights into one uint8 container.
   - Choose a quantization scheme (channel-wise or group-wise), with the group size being a multiple of 32.
2. Prepare Quantized Weights, Scales, and Optional Bias:
   - After quantizing, obtain the quantized_weights, scales, and groupsize.
   - If the original Linear layer has a bias, prepare it as well.
3. Pack the Weights Efficiently:
   - Use torch.ops.aten._dyn_quant_pack_4bit_weight to optimally pack the weights, scales, and optional bias.
   - Input parameters should include in_features and out_features (the same as the Linear layer’s corresponding parameters).
4. Perform Dynamic Quantized Matrix Multiplication:
   - Use torch.ops.aten._dyn_quant_matmul_4bit to perform matrix multiplication with the quantized weights.
   - Inputs required include the input tensor, packed_weights, groupsize, and the in_features and out_features (a usage sketch follows this list).
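Putting the four steps together, here is a minimal usage sketch based on the op signatures above. The shapes, the scales layout, and the random stand-in data are illustrative assumptions; see the API usage issue (#143289) for the exact tensor layouts the ops expect.

```python
import torch

# Illustrative sizes; a real Linear layer supplies in_features and out_features.
in_features, out_features, groupsize = 4096, 4096, 32

# Assume `quantized_weights` (two 4-bit values per uint8) and `scales_and_zeros`
# were produced by a step-1/step-2 quantizer; random data stands in for them
# here. `bias` is optional and may be None.
quantized_weights = torch.randint(0, 256, (out_features, in_features // 2), dtype=torch.uint8)
scales_and_zeros = torch.rand(out_features, in_features // groupsize, dtype=torch.float32)
bias = None

# Step 3: pack weights, scales, and the optional bias into the backend's layout.
packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(
    quantized_weights, scales_and_zeros, bias, groupsize, in_features, out_features
)

# Step 4: dynamically quantize the fp32 activation and run the 4-bit matmul.
x = torch.randn(1, in_features, dtype=torch.float32)
out = torch.ops.aten._dyn_quant_matmul_4bit(
    x, packed_weights, groupsize, in_features, out_features
)
print(out.shape)  # expected: (1, out_features)
```

The packing call is typically performed once when the model is loaded, while the matmul runs on every forward pass with dynamically quantized activations.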
API Usage: #143289
Model Perf:
7B Transformer model:
Prefill: 340 t/s
Decode: 40 t/s
2B Transformer model:
Prefill: 747 t/s
Decode: 80 t/s
Tests:
python test/test_linalg.py -k test__dyn_quant_pack_4bit_weight
Ran 1 test in 0.016s
OK
python test/test_linalg.py -k test__dyn_quant_matmul_4bit
Ran 8 tests in 0.077s
OK
python test/test_linalg.py -k test_compile_dyn_quant_matmul_4bit
Ran 8 tests in 11.454s
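test_compile_dyn_quant_matmul_4bit exercises the matmul op under torch.compile. A minimal sketch of such a wrapper follows; the module name and sizes are assumptions for illustration only.

```python
import torch

class DynQuantLinear(torch.nn.Module):
    # Hypothetical wrapper around packed 4-bit weights produced by
    # torch.ops.aten._dyn_quant_pack_4bit_weight.
    def __init__(self, packed_weights, groupsize, in_features, out_features):
        super().__init__()
        self.packed_weights = packed_weights
        self.groupsize = groupsize
        self.in_features = in_features
        self.out_features = out_features

    def forward(self, x):
        return torch.ops.aten._dyn_quant_matmul_4bit(
            x, self.packed_weights, self.groupsize, self.in_features, self.out_features
        )

# compiled = torch.compile(DynQuantLinear(packed_weights, 32, in_features, out_features))
# y = compiled(torch.randn(1, in_features))
```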
Change-Id: Ia1672bad5e6ec94e64d8bb1971395d60f4b3a452
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @malfet @snadampal @milpuz01 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov