[Quant][Onednn] add linear_dynamic_fp16 ops #140376
Dr. CI: no failures as of commit b401c7c with merge base 330c957. Artifacts and rendered test results: hud.pytorch.org/pr/140376
auto w_desc = ideep::matmul_forward::expected_weights_desc(
    wei.get_dims(), input_dims, dnnl::memory::data_type::f32, dnnl::memory::data_type::f32);
w_desc = w_desc.to_type(ideep::data_type::f16);
ideep::tensor expected_weight(w_desc);
Not sure if this would work well: you ask oneDNN for an fp32 memory descriptor and then change it to fp16 as the reorder target, which is not a common way of invoking the reorder API. Perhaps you can just use the plain layout, since you are not using the blocked fp16 tensor anyway.
Thanks. It works on the latest Xeon but fails on older platforms. I have modified this part, and we now return the plain layout directly.
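For readers unfamiliar with the plain versus blocked distinction raised in this thread, here is a minimal sketch of how element offsets differ between the two. The function names, the block size of 16, and the choice to block along columns are assumptions made for illustration; the blocked format a library actually selects can depend on the platform and data type, and this is not a description of what oneDNN chooses.

```python
def plain_offset(row, col, n_rows, n_cols):
    """Offset of element (row, col) in a plain row-major 2-D buffer."""
    return row * n_cols + col


def blocked_offset(row, col, n_rows, n_cols, block=16):
    """Offset of element (row, col) when columns are grouped into blocks.

    Hypothetical layout: the matrix is split into column blocks of width
    `block`; each block stores all rows, with `block` contiguous elements
    per row. Requires n_cols to be a multiple of `block`.
    """
    if n_cols % block != 0:
        raise ValueError("n_cols must be a multiple of block in this sketch")
    col_block, inner = divmod(col, block)
    return (col_block * n_rows + row) * block + inner


# The same logical element lands at different physical positions.
n_rows, n_cols = 4, 32
print(plain_offset(1, 0, n_rows, n_cols), blocked_offset(1, 0, n_rows, n_cols))
```

Both functions map every logical element to a unique slot in a buffer of the same size; only the ordering differs. Returning the plain layout, as the author's reply describes, avoids depending on whichever blocked ordering a descriptor query would select.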
@pytorchbot merge

Merge started: your change will be merged once all checks pass (ETA 0-4 hours).
**About this PR**

This PR adds the following ops for `linear_dynamic_fp16` in the onednn namespace. These ops are intended for PT2E quantization eager mode.

- `onednn::linear_prepack_fp16`: packs an fp32 weight into an fp16 MkldnnCPU tensor.
- `onednn::linear_dynamic_fp16`: takes an fp32 CPU tensor and an fp16 MkldnnCPU tensor and computes linear in fp32.
- `onednn::linear_relu_dynamic_fp16`: same as the former, with ReLU applied to the output.

**Test plan**

`python test/test_quantization.py -k test_linear_dynamic_fp16_onednn`

**Implementation**

These ops call the oneDNN library under the hood. Note that oneDNN does not support f32 * f16 -> f32 computation, so the fp16 weight has to be converted to fp32 before computation. The weight remains in plain format after packing.

**Correctness and performance**

Correctness is covered by the unit test in the test plan.

Performance of the new ops may be better than the FBGEMM implementation when the weight shape is small but worse when it is large, because the weight dtype conversion and the computation are not fused.

For example, I ran benchmarks on an Intel(R) Xeon(R) Platinum 8490H machine with different core counts and weight shapes. When using 1 core per instance, the new implementation is generally faster for weight shape < 1024 * 1024. When using more cores, the threshold will increase.

Pull Request resolved: pytorch#140376
Approved by: https://github.com/jerryzh168, https://github.com/jgong5
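The numerics described under Implementation can be sketched in plain Python: round the weight to fp16 once at pack time, then up-convert it and compute the linear layer at higher precision. This is only a reference model of the semantics, not the oneDNN code path. The function names `prepack_fp16` and `linear_dynamic_fp16` below are hypothetical stand-ins rather than the signatures of the real ops, and the weight is assumed to follow the usual `(out_features, in_features)` convention.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def prepack_fp16(weight):
    """Hypothetical stand-in for packing: keep the weight in plain
    row-major order, with every element rounded to fp16 precision."""
    return [[to_fp16(w) for w in row] for row in weight]


def linear_dynamic_fp16(x, packed_weight, bias=None, relu=False):
    """Compute y = x @ W^T + b using the fp16-rounded weight.

    Up-converting fp16 to a wider float is exact, so the only precision
    loss comes from the rounding in prepack_fp16. Python floats are
    doubles, so this models the weight rounding but not fp32 accumulation.
    """
    out = []
    for x_row in x:
        y_row = []
        for j, w_row in enumerate(packed_weight):
            acc = sum(a * w for a, w in zip(x_row, w_row))
            if bias is not None:
                acc += bias[j]
            if relu:
                acc = max(acc, 0.0)
            y_row.append(acc)
        out.append(y_row)
    return out


# Usage: a 1x2 input against a 2x2 weight.
x = [[1.0, 2.0]]
weight = [[0.1, 0.2], [-1.0, 0.25]]
packed = prepack_fp16(weight)
print(linear_dynamic_fp16(x, packed))
print(linear_dynamic_fp16(x, packed, relu=True))
```

Because the conversion and the matmul are separate steps in this model, every call pays for the up-conversion of the whole weight, which matches the PR's explanation of why larger weights favor a fused implementation.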
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10