[Inductor] Fix AOT weight alignment issue on CPU #135205

leslie-fang-intel · 2024-09-05T10:15:36Z

Stack from ghstack (oldest at bottom):

-> [Inductor] Fix AOT weight alignment issue on CPU #135205

Summary
Fix issue: #135027. On CPU, the consts_size used to generate _binary_constants_bin_start is not padded to ALIGN_BYTES, while serialized_weights is, causing a failure in the 16K alignment check.
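A minimal sketch of the mismatch, in plain Python, with byte strings standing in for tensors. Several things here are illustrative assumptions and not the exact AOT Inductor layout: ALIGN_BYTES is taken to be 64, and the loader is modeled as finding the weights in the last `consts_size` bytes of a file whose preceding content is padded to 16 KiB. The helper names `align`, `pad`, and `weights_offset` are hypothetical.

```python
ALIGN_BYTES = 64               # assumed per-tensor padding unit
WEIGHTS_ALIGNMENT = 16 * 1024  # the 16K boundary from the runtime check


def align(nbytes: int, alignment: int = ALIGN_BYTES) -> int:
    """Round nbytes up to the next multiple of alignment."""
    return (nbytes + alignment - 1) // alignment * alignment


def pad(raw: bytes) -> bytes:
    """Zero-pad one tensor's raw bytes to a multiple of ALIGN_BYTES."""
    return raw.ljust(align(len(raw)), b"\x00")


# Hypothetical raw payloads; 100 and 30 are not multiples of ALIGN_BYTES.
tensors = [b"\x01" * 100, b"\x02" * 30, b"\x03" * 64]

serialized_weights = b"".join(pad(t) for t in tensors)   # padded: 256 bytes
consts_size_buggy = sum(len(t) for t in tensors)          # unpadded: 194
consts_size_fixed = sum(align(len(t)) for t in tensors)   # padded: 256

# Hypothetical file: a 16K-aligned prefix followed by the padded weights.
prefix_size = align(5000, WEIGHTS_ALIGNMENT)
file_size = prefix_size + len(serialized_weights)


def weights_offset(consts_size: int) -> int:
    """Loader-side view (assumed): weights occupy the last consts_size bytes."""
    return file_size - consts_size


print(weights_offset(consts_size_buggy) % WEIGHTS_ALIGNMENT)  # 62: misaligned
print(weights_offset(consts_size_fixed) % WEIGHTS_ALIGNMENT)  # 0: aligned
```

Padding `consts_size` with the same rule the serializer uses makes the two sizes agree, so the computed offset lands back on the 16K boundary.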

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @rec

Differential Revision: D62307347

[ghstack-poisoned]

pytorch-bot · 2024-09-05T10:15:39Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/135205

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit ebb7ba0 with merge base e000cf0:

NEW FAILURE - The following job has failed:

trunk / linux-focal-cuda12.4-py3.10-gcc9-experimental-split-build-test / test (default, 3, 5, linux.4xlarge.nvidia.gpu) (gh)
'test/inductor/test_cudacodecache.py::TestCUDACodeCache::test_cuda_load'

This comment was automatically generated by Dr. CI and updates every 15 minutes.

jgong5

CI failing?

leslie-fang-intel · 2024-09-05T11:58:12Z

CI failing?

DLRM was previously marked as an expected failure; I changed it to pass, since this PR should fix it. We can use this model as the unit test covering this issue.

desertfire · 2024-09-05T12:56:26Z

torch/_inductor/codecache.py


            output_o = os.path.splitext(input_path)[0] + ".o"
+
+            all_cuda = all(


I originally proposed skipping alignment for the all_cuda case, but that seems unnecessarily complex. It's OK to keep it for this PR; I will create a follow-up PR that drops the all_cuda check and verify that the result works on GPU.
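Whichever rule is chosen for the all_cuda case, the size calculation and the serializer must apply it identically. A minimal sketch with hypothetical helpers `nbytes_of` and `to_bytes`, loosely modeled on `get_nbytes_of_tensor` and `_to_bytes` but operating on plain bytes, with ALIGN_BYTES assumed to be 64:

```python
ALIGN_BYTES = 64  # assumed padding unit


def align(nbytes: int) -> int:
    """Round nbytes up to the next multiple of ALIGN_BYTES."""
    return (nbytes + ALIGN_BYTES - 1) // ALIGN_BYTES * ALIGN_BYTES


def nbytes_of(raw: bytes, all_cuda: bool) -> int:
    """Size counted into consts_size; padding is skipped when all_cuda."""
    return len(raw) if all_cuda else align(len(raw))


def to_bytes(raw: bytes, all_cuda: bool) -> bytes:
    """Bytes actually written; applies the same rule as nbytes_of."""
    return raw if all_cuda else raw.ljust(align(len(raw)), b"\x00")


tensors = [b"\x00" * 100, b"\x00" * 30]
for all_cuda in (True, False):
    consts_size = sum(nbytes_of(t, all_cuda) for t in tensors)
    blob = b"".join(to_bytes(t, all_cuda) for t in tensors)
    print(all_cuda, consts_size, len(blob))  # sizes agree for both settings
```

Dropping the all_cuda check, as the follow-up proposes, would amount to always taking the padded branch in both helpers, which keeps them consistent with less branching.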

leslie-fang-intel · 2024-09-06T00:12:30Z

The same failure of test/inductor/test_cudacodecache.py::TestCUDACodeCache::test_cuda_load also exists on main: https://github.com/pytorch/pytorch/actions/runs/10722567567/job/29736827985

leslie-fang-intel · 2024-09-06T03:02:12Z

@pytorchbot merge -i "un-related ci failure"

pytorch-bot · 2024-09-06T03:02:14Z

❌ 🤖 pytorchbot command failed:

@pytorchbot: error: unrecognized arguments: un-related ci failure

usage: @pytorchbot [-h] {merge,revert,rebase,label,drci,cherry-pick,close} ...

Try @pytorchbot --help for more info.

leslie-fang-intel · 2024-09-06T03:05:10Z

@pytorchbot merge -f "un-related ci failure"

pytorchmergebot · 2024-09-06T03:06:42Z

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

aorenste · 2024-09-06T17:27:35Z

@aorenste has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

**Summary** Fix issue: pytorch#135027. On CPU, the `consts_size` used to generate `_binary_constants_bin_start` is not padded to `ALIGN_BYTES`, while `serialized_weights` is, causing a failure in the 16K alignment check. Pull Request resolved: pytorch#135205 Approved by: https://github.com/jgong5, https://github.com/desertfire

…139054) Fixes the failure of INT8 DLRM when using AOTI. The previous code calculated `consts_size` directly from the `tensor` values in `graph.constants`:

```
consts_size = sum(
    get_nbytes_of_tensor(tensor, all_cuda)
    for (name, tensor) in graph.constants.items()
    if name not in graph.folded_constants
)
```

Meanwhile, the bytes actually serialized (`serialized_weights`) are taken from `graph.get_original_value_of_constant(name)`:

```
serialized_weights = b"".join(
    _to_bytes(graph.get_original_value_of_constant(name), all_cuda)
    for name in graph.constants.keys()
    if name not in graph.folded_constants
)
```

A `tensor` from `graph.constants` can differ from `graph.get_original_value_of_constant(name)`, which makes `consts_size` inconsistent with the actual byte size of `serialized_weights` and results in the runtime error `weights_offset must be aligned to 16K boundary`, similar to what happened in #135205. This PR gets `consts_size` directly from `len(serialized_weights)`, which fixes the inconsistency.

We also added a `reduce_range` argument to the `get_default_x86_inductor_quantization_config` function; the unit test needs it to avoid an accuracy issue on CI machines (older CPUs without VNNI).

Pull Request resolved: #139054
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/desertfire
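A minimal sketch of why deriving `consts_size` from `len(serialized_weights)` removes the inconsistency. Plain dicts of bytes stand in for `graph.constants` and `graph.get_original_value_of_constant(name)`; the size difference for `w1` is a hypothetical example of a graph-held value that no longer matches the original value, and ALIGN_BYTES is assumed to be 64.

```python
ALIGN_BYTES = 64  # assumed padding unit


def align(nbytes: int) -> int:
    """Round nbytes up to the next multiple of ALIGN_BYTES."""
    return (nbytes + ALIGN_BYTES - 1) // ALIGN_BYTES * ALIGN_BYTES


def to_bytes(raw: bytes) -> bytes:
    """Serialize one constant, zero-padded to ALIGN_BYTES (CPU path)."""
    return raw.ljust(align(len(raw)), b"\x00")


# Hypothetical stand-ins: the graph-held value of "w1" has a different size
# than the original value that is actually serialized.
graph_constants = {"w0": b"\x00" * 128, "w1": b"\x00" * 200}
original_values = {"w0": b"\x00" * 128, "w1": b"\x00" * 96}
folded_constants = set()  # names of constant-folded entries (none here)

# Old approach: size computed from the graph-held values.
consts_size_old = sum(
    align(len(t))
    for name, t in graph_constants.items()
    if name not in folded_constants
)

# The bytes that are actually written come from the original values.
serialized_weights = b"".join(
    to_bytes(original_values[name])
    for name in graph_constants
    if name not in folded_constants
)

# New approach: size read off the serialized bytes themselves.
consts_size_new = len(serialized_weights)

print(consts_size_old, consts_size_new)  # 384 256
```

Because the size is taken from the very bytes that get written, it cannot drift from the serialized payload, however the graph-held values were transformed.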

[Inductor] Enable UniformValueConstantFolder for general get_attr node

b205217

[ghstack-poisoned]

pytorch-bot bot added ciflow/inductor module: inductor labels Sep 5, 2024

leslie-fang-intel added the ciflow/trunk Trigger trunk jobs on your pull request label Sep 5, 2024

pytorchbot added the open source label Sep 5, 2024

leslie-fang-intel added topic: not user facing topic category and removed open source labels Sep 5, 2024

leslie-fang-intel changed the title ~~[Inductor] Enable UniformValueConstantFolder for general get_attr node~~ [Inductor] Fix AOT weight alignment issue on CPU Sep 5, 2024

leslie-fang-intel requested review from chunyuan-w, desertfire, eellison and jgong5 and removed request for eellison September 5, 2024 10:17

pytorchbot added the open source label Sep 5, 2024

jgong5 requested changes Sep 5, 2024

pytorch-bot bot added the module: dynamo label Sep 5, 2024

leslie-fang-intel requested a review from jgong5 September 5, 2024 11:59

desertfire approved these changes Sep 5, 2024

leslie-fang-intel added a commit that referenced this pull request Sep 5, 2024

[Inductor] Enable UniformValueConstantFolder for general get_attr node

a1d543e

ghstack-source-id: a2c703f Pull Request resolved: #135205

jgong5 approved these changes Sep 6, 2024

pytorchmergebot added the merging label Sep 6, 2024

pytorchmergebot added the Merged label Sep 6, 2024

pytorchmergebot closed this in 07689a3 Sep 6, 2024

pytorchmergebot removed the merging label Sep 6, 2024

leslie-fang-intel mentioned this pull request Sep 6, 2024

[inductor][cpu] AlbertForMaskedLM, DebertaV2ForMaskedLM and timm_vision_transformer_large AMP AOT inductor crash issue #135027

Closed

github-actions bot deleted the gh/leslie-fang-intel/145/head branch October 7, 2024 02:07

chunyuan-w mentioned this pull request Oct 28, 2024

[AOTI] Use len(serialized_weights) when calculating consts_size #139054

Closed

7 participants

