[Quant][fx] Create new FX-based LSTM reference module #96343
andrewor14 wants to merge 6 commits into gh/andrewor14/46/base
Conversation
Summary: The previous reference LSTM module implementation did not handle dtypes other than quint8 correctly. This was because the internal LSTM custom module quantization used eager mode, which did not insert the q-dq ops properly. For example, we want the following reference quantized graph:

```
[dq -> linear1_fp32 -> q_to_qint32] -> dq -> q_to_quint8 -> [dq -> linear2_fp32 -> q_to_quint8] -> dq -> ...
```

This requires two sets of `q - dq` pairs between two adjacent ops that have different dtypes (linear1 and linear2). These `q - dq` pairs were not inserted in the old flow, because eager mode did not do this automatically. This commit changes the internal LSTM custom module quantization to use FX graph mode quantization, which does automatically insert the `q - dq` ops that convert the dtypes between adjacent ops correctly. However, using FX graph mode quantization here comes with its own set of challenges that required some hacks to get the end-to-end flow to work. These challenges are detailed in the test comments.

Test Plan:

```
python test/test_quantization.py TestQuantizeFx.test_static_lstm_with_custom_fixed_qparams
```

This commit also updates the corresponding test to verify the dtypes as well as the qparams in the reference quantized graph. This test case should serve as an example for users to set up their own LSTM reference module flows.

Reviewers: vkuzo, supriyar, jcaip

Subscribers: vkuzo, supriyar, jcaip
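The dtype-conversion pattern described above can be illustrated with a small standalone sketch. This is plain Python, not the actual FX quantization pass; `insert_qdq` and the op/dtype tuples are toy placeholders. The key point it demonstrates: between two adjacent reference ops with different dtypes, two `q - dq` pairs are required, with the extra pair requantizing to the consumer's expected input dtype.

```python
# Toy sketch of q-dq insertion between reference ops (NOT the real FX pass).
# Each op is (name, output_dtype, input_dtype). Reference ops run in fp32,
# so every op is bracketed by a leading dq and a trailing q; when the
# producer's output dtype differs from the consumer's input dtype, an
# extra q-dq pair converts between them, mirroring:
#   [dq -> linear1_fp32 -> q_to_qint32] -> dq -> q_to_quint8 -> [dq -> linear2_fp32 -> ...]

def insert_qdq(ops):
    """Return a flat node list with q-dq conversions inserted between ops."""
    graph = []
    prev_out = None
    for name, out_dtype, in_dtype in ops:
        graph.append("dq")  # dequantize the incoming quantized tensor
        if prev_out is not None and in_dtype != prev_out:
            # dtype mismatch with the producer: need a second q-dq pair
            graph.append(f"q_to_{in_dtype}")
            graph.append("dq")
        graph.append(f"{name}_fp32")          # reference op runs in fp32
        graph.append(f"q_to_{out_dtype}")     # quantize its output
        prev_out = out_dtype
    return graph

nodes = insert_qdq([
    ("linear1", "qint32", "quint8"),
    ("linear2", "quint8", "quint8"),
])
print(nodes)
# -> ['dq', 'linear1_fp32', 'q_to_qint32', 'dq', 'q_to_quint8', 'dq',
#     'linear2_fp32', 'q_to_quint8']
```

Eager mode leaves these conversions to manually placed Quant/DeQuantStubs, which is why the extra `q_to_quint8 -> dq` between linear1 and linear2 was missing in the old flow; FX graph mode derives them from the observed dtypes.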
```python
# HACK: Manually replace the activation_post_process following these ops.
# This is needed for FloatFunctional ops because there is currently no way
# to configure these ops in FX graph mode quantization today. This is because
# the FloatFunctional modules simply disappear from the graph after tracing.
```
FloatFunctional gets swapped to FXFloatFunctional here: https://fburl.com/code/8bxfrqzn
I agree, a good fix would be to remove the use of FloatFunctional in the future
Yes, though FXFloatFunctional also disappears after tracing because it's just a simple wrapper around the torch ops, so we can't access these ops by their original names.
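The "disappears after tracing" point can be shown with a tiny standalone analogy. This is not `torch.fx`; `FloatFunctionalLike`, `Model`, and `toy_trace` are hypothetical stand-ins. The idea: tracing records the underlying callable that executes, not the named wrapper attribute it was reached through, so a name-based qconfig lookup for the wrapper has nothing to attach to.

```python
# Toy illustration (NOT torch.fx) of why a thin functional wrapper
# vanishes from a traced graph: only the underlying op is observable.
import operator

class FloatFunctionalLike:
    """Stand-in for FXFloatFunctional: a thin wrapper that just calls the op."""
    def add(self, x, y):
        return operator.add(x, y)

class Model:
    def __init__(self):
        self.my_add = FloatFunctionalLike()  # the name we'd like to configure by

    def forward(self, x, y):
        return self.my_add.add(x, y)

def toy_trace(model, *args):
    """Record leaf callables executed by forward (very loosely like tracing)."""
    graph = []
    saved = FloatFunctionalLike.add
    # Patch the wrapper so we can observe what actually runs underneath.
    def spy(self, a, b):
        graph.append("call_function: operator.add")  # raw op, no module name
        return operator.add(a, b)
    FloatFunctionalLike.add = spy
    try:
        model.forward(*args)
    finally:
        FloatFunctionalLike.add = saved
    return graph

graph = toy_trace(Model(), 2, 3)
print(graph)  # ['call_function: operator.add'] -- "my_add" never appears
```

Since the recorded graph only ever mentions the raw op, any configuration keyed on the attribute name `my_add` cannot find a target after tracing, which is the situation the HACK above works around.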
```python
fgate_cx_igate_cgate = self.fgate_cx_igate_cgate.add(fgate_cx, igate_cgate)
cy = fgate_cx_igate_cgate
# TODO: make this tanh a member of the module so its qparams can be configured
```
Do we need the tanh qparams to be configurable for the current use case?
Not sure. In this PR I'm trying to isolate the changes to within my util functions. If we need to configure this we can fix this separately in the future.
```python
# HACK: Manually remove input quantize nodes and output dequantize nodes,
# since custom modules expect quint8 inputs and outputs for now. Note that
```
What does the graph look like if we don't do this? For example, will we have two quantize nodes before the linear op?
Also, confirming that this means the dtype at input/output boundary to the LSTM module will always be fp32 since we are dealing with reference quantized ops.
Yes, that was something I spent a long time debugging, actually. The problem was that, because it's a custom module, the caller of this module would feed it quint8 tensors. However, within the module, FX graph mode quantization assumes the inputs are fp32, so it'll try to quantize them again, leading to the "Unable to run X with args from QuantizedCPU backend" error. Unfortunately there is no easy way to fix this right now without significant changes to how FX graph mode quantization works.
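The boundary hack being discussed can be sketched in isolation. This is a toy model, not the actual utility: `strip_boundary_qdq` and the node-name strings are illustrative. It shows why the module's internal graph must shed its leading quantize and trailing dequantize: the caller already passes quint8 tensors in and expects quint8 tensors out, so keeping those boundary nodes would double-quantize the input (the QuantizedCPU backend error above) and wrongly dequantize the output.

```python
# Toy sketch of the custom-module boundary hack (NOT the actual FX utility).
# FX graph mode quantization, seeing the module in isolation, assumes fp32
# inputs/outputs and inserts boundary conversions; the caller, however,
# already hands the custom module quint8 tensors, so those must go.

def strip_boundary_qdq(nodes):
    """nodes: ordered op names of the module's internal reference graph."""
    if nodes and nodes[0] == "quantize_per_tensor":
        nodes = nodes[1:]      # input is already quint8 from the caller
    if nodes and nodes[-1] == "dequantize":
        nodes = nodes[:-1]     # output must stay quint8 for the caller
    return nodes

internal = ["quantize_per_tensor", "dequantize", "linear_fp32",
            "quantize_per_tensor", "dequantize"]
print(strip_boundary_qdq(internal))
# -> ['dequantize', 'linear_fp32', 'quantize_per_tensor']
```

After stripping, the internal graph begins with a dequantize (accepting the caller's quint8 tensor) and ends with a quantize (handing quint8 back), matching the fixed quint8 boundary convention for custom modules.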
Ok, I'm merging this. Thanks for everyone's feedback!

@pytorchbot merge

@pytorchbot merge -g

@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours).