[Quant][Onednn] add linear_dynamic_fp16 ops #140376

Xia-Weiwen · 2024-11-12T09:10:12Z

Stack from ghstack (oldest at bottom):

-> [Quant][Onednn] add linear_dynamic_fp16 ops #140376

About this PR
This PR adds the following ops for linear_dynamic_fp16 in the onednn namespace. They are intended for PT2E quantization in eager mode.

onednn::linear_prepack_fp16: packs an fp32 weight into an fp16 MkldnnCPU tensor.
onednn::linear_dynamic_fp16: takes an fp32 CPU input tensor and an fp16 MkldnnCPU weight tensor and computes linear in fp32.
onednn::linear_relu_dynamic_fp16: same as the former, but also applies ReLU to the output.

Test plan
python test/test_quantization.py -k test_linear_dynamic_fp16_onednn

Implementation
These ops call the oneDNN library under the hood. Note that oneDNN does not support f32 * f16 -> f32 computation, so the fp16 weight has to be converted to fp32 before computation. The weight also stays in plain format after packing.
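The numerics described above (round the weight to fp16 once at prepack time, upcast it again before each matmul, then apply an optional ReLU) can be sketched in plain Python. This is a minimal, hypothetical reference, not the oneDNN code path: `to_fp16`, `prepack_fp16` and `linear_dynamic_fp16_ref` are illustrative names, Python floats are double precision rather than fp32, and the weight is assumed to be an out_features x in_features row-major list of lists (a plain, non-blocked layout).

```python
import struct
from typing import List, Optional

Matrix = List[List[float]]


def to_fp16(x: float) -> float:
    """Round a float to the nearest IEEE-754 half-precision value.

    Uses struct format 'e' (16-bit float). Values outside the fp16 range
    raise OverflowError here, unlike a real tensor cast.
    """
    return struct.unpack("<e", struct.pack("<e", x))[0]


def prepack_fp16(weight: Matrix) -> Matrix:
    """Hypothetical prepack: store each weight element as an fp16 value,
    keeping the plain (out_features x in_features) row-major layout."""
    return [[to_fp16(w) for w in row] for row in weight]


def linear_dynamic_fp16_ref(x: Matrix, packed_w: Matrix,
                            bias: Optional[List[float]] = None,
                            relu: bool = False) -> Matrix:
    """Hypothetical reference: the fp16 weight values are used as wider floats
    (the upcast step), then y = x @ W^T + b, with optional ReLU."""
    out = []
    for row in x:
        y = []
        for j, w_row in enumerate(packed_w):
            acc = sum(a * w for a, w in zip(row, w_row))
            if bias is not None:
                acc += bias[j]
            y.append(max(acc, 0.0) if relu else acc)
        out.append(y)
    return out


packed = prepack_fp16([[1.0, -2.0], [0.5, 0.25]])
print(linear_dynamic_fp16_ref([[-1.0, 0.0]], packed, bias=[3.0, 0.0], relu=True))
# [[2.0, 0.0]]
```

The only precision loss in this sketch comes from the prepack rounding (for example, 0.1 becomes 0.0999755859375 in fp16); after that, the computation runs entirely in the wider type, which mirrors the convert-then-compute order described above.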

Correctness and performance
Correctness is verified by unit tests.
The new ops may outperform the FBGEMM implementation when the weight is small but underperform it when the weight is large, because the weight dtype conversion and the computation are not fused.
For example, I ran benchmarks on an Intel(R) Xeon(R) Platinum 8490H machine with various core counts and shapes. With 1 core per instance, the new implementation is generally faster when the weight has fewer than 1024 * 1024 elements. With more cores, this threshold increases.
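To put the 1024 * 1024 threshold in concrete terms: if each call materializes a temporary fp32 copy of the fp16 weight before the GEMM (an assumption based on the unfused convert-then-compute design described above), the size of that copy grows linearly with the weight. The hypothetical helper below only does that byte arithmetic; it is not a performance model and says nothing about cache sizes or the measured thresholds.

```python
from typing import Tuple


def weight_conversion_footprint(out_features: int, in_features: int) -> Tuple[int, int]:
    """Hypothetical helper: bytes of the stored fp16 weight and of the
    temporary fp32 copy assumed to be created on every call."""
    numel = out_features * in_features
    fp16_bytes = numel * 2  # packed weight, 2 bytes per element
    fp32_bytes = numel * 4  # converted copy fed to the fp32 GEMM
    return fp16_bytes, fp32_bytes


fp16_b, fp32_b = weight_conversion_footprint(1024, 1024)
print(f"fp16 weight: {fp16_b / 2**20:.0f} MiB, fp32 copy: {fp32_b / 2**20:.0f} MiB")
# fp16 weight: 2 MiB, fp32 copy: 4 MiB
```

Under this assumption, every call to a 1024 x 1024 layer reads 2 MiB of fp16 data and writes 4 MiB of fp32 data before the matmul even starts, work that a fused kernel converting on the fly would avoid.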

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10

[ghstack-poisoned]

pytorch-bot · 2024-11-12T09:10:17Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/140376

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There is 1 currently active SEV. If your PR is affected, please view it below:

GLIBC not found in Nova workflows

✅ No Failures

As of commit b401c7c with merge base 330c957:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

jgong5 · 2024-11-13T05:29:09Z

aten/src/ATen/native/quantized/cpu/qlinear_prepack.cpp

+  auto w_desc = ideep::matmul_forward::expected_weights_desc(
+      wei.get_dims(), input_dims, dnnl::memory::data_type::f32, dnnl::memory::data_type::f32);
+  w_desc = w_desc.to_type(ideep::data_type::f16);
+  ideep::tensor expected_weight(w_desc);


Not sure this will work well: you ask oneDNN for an fp32 memory descriptor and then change it to fp16 as the reorder destination, which is not a common way of invoking the reorder API. Perhaps you can just use the plain layout, since you are not using the blocked fp16 tensor anyway.

Xia-Weiwen · 2024-11-13

Thanks. It works on the latest Xeon but fails on older platforms. I have modified this part, and it now returns the plain layout directly.

Xia-Weiwen · 2024-11-14T02:43:07Z

@pytorchbot merge

pytorchmergebot · 2024-11-14T02:45:17Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here.

Pull Request resolved: pytorch#140376 Approved by: https://github.com/jerryzh168, https://github.com/jgong5

Update

2384e50

[ghstack-poisoned]

Xia-Weiwen requested review from digantdesai, jerryzh168, jianyuh, kimishpatel and salilsdesai as code owners November 12, 2024 09:10

pytorch-bot bot added the module: cpu and release notes: quantization labels Nov 12, 2024

Xia-Weiwen added a commit that referenced this pull request Nov 12, 2024

[Quant][Onednn] add linear_dynamic_fp16 ops

e100c4f

ghstack-source-id: 8d3be4e Pull Request resolved: #140376

Xia-Weiwen marked this pull request as draft November 12, 2024 09:10

pytorchbot added the open source label Nov 12, 2024

Xia-Weiwen requested review from jgong5 and leslie-fang-intel November 12, 2024 09:30

jerryzh168 approved these changes Nov 13, 2024

jgong5 requested changes Nov 13, 2024

Update

b401c7c

[ghstack-poisoned]

Xia-Weiwen added a commit that referenced this pull request Nov 13, 2024

[Quant][Onednn] add linear_dynamic_fp16 ops

bddcd08

ghstack-source-id: 9f7adf8 Pull Request resolved: #140376

Xia-Weiwen requested a review from jgong5 November 13, 2024 14:31

jgong5 approved these changes Nov 14, 2024

Xia-Weiwen marked this pull request as ready for review November 14, 2024 02:41

Xia-Weiwen added the intel label Nov 14, 2024

pytorch-bot bot added the ciflow/trunk label Nov 14, 2024

pytorchmergebot added the merging label Nov 14, 2024

pytorchmergebot closed this in 62eea62 Nov 14, 2024

pytorchmergebot added Merged and removed merging labels Nov 14, 2024

github-actions bot deleted the gh/Xia-Weiwen/19/head branch December 15, 2024 02:16