
Conversation


@nikhil-arm nikhil-arm commented Aug 21, 2024

Description:

  1. Quantize Linear Layer Weights to 4 Bits:
    Quantize the weights of the Linear layer to 4 bits, using symmetric quantization.
    Pack two 4-bit weights into one uint8 container.
    Choose a quantization scheme (channel-wise or group-wise), with the group size being a multiple of 32.

  2. Prepare Quantized Weights, Scales, and Optional Bias:
    After quantizing, obtain the quantized_weights, scales, and groupsize.
    If the original Linear layer has a bias, prepare it as well.

  3. Pack the Weights Efficiently:
    Use torch.ops.aten._dyn_quant_pack_4bit_weight to optimally pack the weights, scales, and optional bias.

packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(weight, scales_and_zeros, bias, groupsize, in_features, out_features)

Input parameters include in_features and out_features (the same as the Linear layer's corresponding parameters); a quantize-and-pack sketch covering steps 1-3 follows below.
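
To make steps 1-3 concrete, here is a minimal sketch (it assumes a PyTorch build that ships these ops, i.e. 2.6+ on AArch64 with KleidiAI). The quantization helper, the nibble order, and the scales layout are illustrative assumptions rather than part of this PR; only the _dyn_quant_pack_4bit_weight call is the op added here.

import torch

def quantize_4bit_symmetric(weight, groupsize):
    # Illustrative group-wise symmetric quantization to signed 4-bit values in [-8, 7].
    out_features, in_features = weight.shape
    w = weight.reshape(out_features, in_features // groupsize, groupsize)
    scales = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-9) / 7.0
    q = torch.clamp(torch.round(w / scales), -8, 7).to(torch.int8)
    return q.reshape(out_features, in_features), scales.reshape(out_features, -1)

# Hypothetical layer dimensions; the group size must be a multiple of 32.
in_features, out_features, groupsize = 128, 64, 32
linear = torch.nn.Linear(in_features, out_features, bias=True)

q, scales = quantize_4bit_symmetric(linear.weight.detach(), groupsize)

# Pack two 4-bit values into one uint8 container (low nibble first here;
# the exact layout the op expects may differ).
u = (q + 8).to(torch.uint8)
packed_u8 = (u[:, 1::2] << 4) | u[:, 0::2]

# For symmetric quantization the zero points are unused, so only scales are
# passed as scales_and_zeros; the expected scales layout may also differ.
packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(
    packed_u8, scales.to(torch.float32), linear.bias.detach(),
    groupsize, in_features, out_features)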

  4. Perform Dynamic Quantized Matrix Multiplication:
    Use torch.ops.aten._dyn_quant_matmul_4bit to perform matrix multiplication with the quantized weights.

output = torch.ops.aten._dyn_quant_matmul_4bit(input, packed_weights, groupsize, in_features, out_features)

Inputs required include the input tensor, packed_weights, groupsize, and the in_features and out_features.
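
Continuing the sketch above, inference then needs only the float input and the packed weights; per the op's name, activation quantization happens dynamically inside the kernel:

x = torch.randn(4, in_features, dtype=torch.float32)
output = torch.ops.aten._dyn_quant_matmul_4bit(
    x, packed_weights, groupsize, in_features, out_features)
# output has shape (4, out_features)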

API Usage: #143289

Model Perf:
7B Transformer model:
Prefill: 340 t/s
Decode: 40 t/s
2B Transformer model:
Prefill: 747 t/s
Decode: 80 t/s

Tests:
python test/test_linalg.py -k test__dyn_quant_pack_4bit_weight
Ran 1 test in 0.016s

OK

python test/test_linalg.py -k test__dyn_quant_matmul_4bit
Ran 8 tests in 0.077s

OK

python test/test_linalg.py -k test_compile_dyn_quant_matmul_4bit
Ran 8 tests in 11.454s

Change-Id: Ia1672bad5e6ec94e64d8bb1971395d60f4b3a452


cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @malfet @snadampal @milpuz01 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov


pytorch-bot bot commented Aug 21, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/134124

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit c25abbf with merge base 8136daf:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the release notes: linalg_frontend (release notes category) label Aug 21, 2024

linux-foundation-easycla bot commented Aug 21, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@github-actions
Contributor

Attention! native_functions.yaml was changed

If you are adding a new function or defaulted argument to native_functions.yaml, you cannot use it from pre-existing Python frontend code until our FC window passes (two weeks). Split your PR into two PRs, one which adds the new C++ functionality, and one that makes use of it from Python, and land them two weeks apart. See https://github.com/pytorch/pytorch/wiki/PyTorch's-Python-Frontend-Backward-and-Forward-Compatibility-Policy#forwards-compatibility-fc for more info.



@nikhil-arm
Collaborator Author

@pytorchbot label "ciflow/linux-aarch64" "module:arm"


pytorch-bot bot commented Aug 21, 2024

Can't add following labels to PR: ciflow/linux-aarch64. Please ping one of the reviewers for help.

@nikhil-arm
Collaborator Author

@pytorchbot label "module:arm"


pytorch-bot bot commented Aug 21, 2024

Didn't find following labels among repository labels: module:arm

@nikhil-arm
Collaborator Author

@pytorchbot label "module: arm"

@pytorch-bot pytorch-bot bot added the module: arm (Related to ARM architectures builds of PyTorch. Includes Apple M1) label Aug 21, 2024
@nikhil-arm
Collaborator Author

@pytorchbot label "ciflow/linux-aarch64"


pytorch-bot bot commented Aug 21, 2024

Can't add following labels to PR: ciflow/linux-aarch64. Please ping one of the reviewers for help.

@malfet malfet added the ciflow/linux-aarch64 (linux aarch64 CI workflow) label Aug 22, 2024

@snadampal snadampal left a comment


Thanks for the contribution. Please find my comments inline.

@zou3519 zou3519 added the triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) label Aug 27, 2024
@nikhil-arm nikhil-arm force-pushed the kleidiai_integration_final branch 3 times, most recently from 53a9e89 to fd6f73b Compare August 29, 2024 13:26
@nikhil-arm
Collaborator Author

@malfet can you please help in merging this PR?


malfet commented Aug 29, 2024

> @malfet can you please help in merging this PR?

I believe the pre-requisite for merging is a passing build and test for the specific target, and this PR clearly fails the aarch64 build right now; see https://github.com/pytorch/pytorch/actions/runs/10616020253/job/29425421107?pr=134124
Perhaps it's just a matter of rebasing, but in general I would strongly advise against merging a relatively large change if one cannot get a clear signal from the platform it targets.

For comparison, here is the result of the build/test for a recent trunk commit: https://github.com/pytorch/pytorch/actions/runs/10636148674/job/29487324878


nikhil-arm commented Sep 2, 2024

Hello @malfet,
We are in the process of refactoring the PR based on your valuable inputs.
We are planning to:

  1. Keep all the files, kernel interfaces, and kernel implementations as they are in aten/src/ATen/native/kleidiai/*
  2. Plug the KleidiAI int4 matmul kernel into _weight_int4pack_mm_cpu() and modify the signature of _weight_int4pack_mm_cpu() to fit our requirements
  3. Plug the KleidiAI int4 weight pack kernel into _convert_weight_to_int4pack_cpu() and modify the signature of _convert_weight_to_int4pack_cpu() to fit our requirements
  4. Register a new op in torchao for the kai_pack_rhs_size() kernel and use it in torchao for the quantization and packing step. The implementation will still be in PyTorch, but the op will be registered in torchao
  5. Add/reuse Symmetric_quantization() in torchao. This will be used by the _weight_int4pack_mm_cpu() and _convert_weight_to_int4pack_cpu() ops
  6. Modify/add the existing tests for _weight_int4pack_mm_cpu() to accommodate the KleidiAI kernel and use the quantization scheme from torchao directly

Please let me know your thoughts on this and whether it addresses all your concerns regarding the PR.
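
For reference, a rough sketch of how the existing CPU int4 ops from points 2 and 3 are invoked today. The exact dtypes and layouts these ops expect have changed across PyTorch versions, so every tensor construction below is an assumption to be checked against native_functions.yaml:

import torch

# Illustrative dimensions only.
in_features, out_features, groupsize = 128, 64, 32
inner_k_tiles = 2  # packing parameter; assumed value

# _convert_weight_to_int4pack packs int4 weight values into the kernel's layout.
# The input dtype it expects (int32 values vs. pre-packed uint8) has varied by version.
weight_int32 = torch.randint(0, 16, (out_features, in_features), dtype=torch.int32)
packed = torch.ops.aten._convert_weight_to_int4pack(weight_int32, inner_k_tiles)

# _weight_int4pack_mm performs the int4 weight-only matmul with group-wise
# scales and zero points.
x = torch.randn(4, in_features, dtype=torch.bfloat16)
scales_and_zeros = torch.ones(in_features // groupsize, out_features, 2, dtype=torch.bfloat16)
y = torch.ops.aten._weight_int4pack_mm(x, packed, groupsize, scales_and_zeros)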

@nikhil-arm nikhil-arm force-pushed the kleidiai_integration_final branch from fd6f73b to 9cf8231 Compare September 10, 2024 15:39
@nikhil-arm nikhil-arm requested a review from eqy as a code owner September 10, 2024 15:39

huydhn commented Dec 19, 2024

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Rebase failed due to Command git -C /home/runner/work/pytorch/pytorch rebase refs/remotes/origin/viable/strict pull/134124/head returned non-zero exit code 1

Rebasing (1/4)
Auto-merging CMakeLists.txt
Auto-merging aten/src/ATen/CMakeLists.txt
Auto-merging aten/src/ATen/Context.cpp
Auto-merging aten/src/ATen/Context.h
Auto-merging aten/src/ATen/native/LinearAlgebra.cpp
CONFLICT (content): Merge conflict in aten/src/ATen/native/LinearAlgebra.cpp
Auto-merging aten/src/ATen/native/cpu/int4mm_kernel.cpp
CONFLICT (content): Merge conflict in aten/src/ATen/native/cpu/int4mm_kernel.cpp
Auto-merging aten/src/ATen/native/cpu/int_mm_kernel.h
CONFLICT (content): Merge conflict in aten/src/ATen/native/cpu/int_mm_kernel.h
Auto-merging aten/src/ATen/native/kleidiai/kai_kernels.cpp
CONFLICT (add/add): Merge conflict in aten/src/ATen/native/kleidiai/kai_kernels.cpp
Auto-merging aten/src/ATen/native/kleidiai/kai_pack.h
CONFLICT (add/add): Merge conflict in aten/src/ATen/native/kleidiai/kai_pack.h
Auto-merging aten/src/ATen/native/native_functions.yaml
CONFLICT (content): Merge conflict in aten/src/ATen/native/native_functions.yaml
Auto-merging cmake/Dependencies.cmake
Auto-merging setup.py
Auto-merging test/expect/HasDecompTest.test_has_decomposition.expect
Auto-merging test/test_linalg.py
CONFLICT (content): Merge conflict in test/test_linalg.py
Auto-merging torch/_C/__init__.pyi.in
CONFLICT (content): Merge conflict in torch/_C/__init__.pyi.in
Auto-merging torch/_dynamo/trace_rules.py
Auto-merging torch/_meta_registrations.py
CONFLICT (content): Merge conflict in torch/_meta_registrations.py
Auto-merging torch/backends/kleidiai/__init__.py
CONFLICT (add/add): Merge conflict in torch/backends/kleidiai/__init__.py
Auto-merging torch/csrc/Module.cpp
CONFLICT (content): Merge conflict in torch/csrc/Module.cpp
Auto-merging torch/csrc/inductor/aoti_torch/generated/c_shim_cpu.h
CONFLICT (content): Merge conflict in torch/csrc/inductor/aoti_torch/generated/c_shim_cpu.h
Auto-merging torch/overrides.py
Auto-merging torch/testing/_internal/common_quantization.py
error: could not apply 1c0ef38138d... [ARM][feat]: Add 4 bit dynamic quantization  matmuls & KleidiAI Backend
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Could not apply 1c0ef38138d... [ARM][feat]: Add 4 bit dynamic quantization  matmuls & KleidiAI Backend

Raised by https://github.com/pytorch/pytorch/actions/runs/12422668007

@facebook-github-bot
Contributor

@huydhn has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Description:
1. The scale and bias tensors are no longer needed in the dynamic quantized 4-bit matmul call; they are already folded in by the packing op.

Signed-off-by: Nikhil Gupta <[email protected]>
Change-Id: Ia34466eea5fefc5780d418a4321fa9b78c142799
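
The resulting signature change side by side (the draft form is the one this commit supersedes):

# Superseded draft: scales and bias were passed to the matmul explicitly.
#   torch.ops.aten._dyn_quant_matmul_4bit(input, packed_weights,
#       scales_and_zeros, bias, groupsize, in_features, out_features)
# After this change they are folded into packed_weights at packing time:
#   torch.ops.aten._dyn_quant_matmul_4bit(input, packed_weights,
#       groupsize, in_features, out_features)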
…yn_quant_pack_4bit_weight

Tests:

python test/inductor/test_torchinductor.py -k test__dyn_quant_matmul_4bit

Ran 1 test in 0.326s

OK

python test/inductor/test_torchinductor.py -k test__dyn_quant_pack_4bit_weight

Ran 1 test in 5.664s

OK

Signed-off-by: Nikhil Gupta <[email protected]>
Change-Id: I5ed5d8d761769c1af611f75d05e9fc6e9fc64cb4
Signed-off-by: Nikhil Gupta <[email protected]>
Change-Id: I08dcdc5780831771a66325ac5e8b45e0805cf990
@nikhil-arm nikhil-arm force-pushed the kleidiai_integration_final branch from 6d178af to c25abbf Compare December 20, 2024 00:12
@facebook-github-bot
Contributor

@huydhn has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: Meta Internal-Only Changes Check

Details for Dev Infra team: raised by workflow job


huydhn commented Dec 20, 2024

@pytorchbot merge -f 'Diff has been landed internally'

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

nikhil-arm added a commit to nikhil-arm/ao that referenced this pull request Jan 15, 2025
Description:

Allow int8_dynamic_activation_intx_weight to work with the aten _dyn_quant_matmul_4bit op

Needs: pytorch/pytorch#134124 or PyTorch > 2.6.0

Signed-off-by: Nikhil Gupta <[email protected]>
albanD added a commit to albanD/pytorch that referenced this pull request Jan 22, 2025
nikhil-arm added a commit that referenced this pull request Jan 23, 2025
#134124 was reverted by #145392 due to a KleidiAI clone issue.

1. This reverts commit 0940eb6 (#145392) and fixes the KleidiAI mirror issue.
2. KleidiAI is now cloned from the GitHub mirror instead of Arm GitLab.

Change-Id: I7d6eee7214cd117d3057d615936fcc3ee6052fa2
Signed-off-by: Nikhil Gupta <[email protected]>
pytorchmergebot pushed a commit that referenced this pull request Jan 23, 2025
#145505)

#134124 was reverted by #145392 due to a KleidiAI clone issue.

1. This reverts commit 0940eb6 (#145392) and fixes the KleidiAI mirror issue.
2. KleidiAI is now cloned from the GitHub mirror instead of Arm GitLab.

Change-Id: I7d6eee7214cd117d3057d615936fcc3ee6052fa2

Fixes #145273

Pull Request resolved: #145505
Approved by: https://github.com/malfet
nikhil-arm added a commit to nikhil-arm/ao that referenced this pull request Jan 30, 2025
Description:

Allow int8_dynamic_activation_intx_weight to work with the aten _dyn_quant_matmul_4bit op

Needs: pytorch/pytorch#134124 or PyTorch > 2.6.0

Signed-off-by: Nikhil Gupta <[email protected]>

Labels

ci-no-td (Do not run TD on this PR), ciflow/inductor, ciflow/linux-aarch64 (linux aarch64 CI workflow), ciflow/s390 (s390x-related CI jobs), ciflow/trunk (Trigger trunk jobs on your pull request), Merged, module: arm (Related to ARM architectures builds of PyTorch. Includes Apple M1), module: cpu (CPU specific problem (e.g., perf, algorithm)), module: dynamo, module: inductor, open source, release notes: linalg_frontend (release notes category), Reverted, triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
