@leslie-fang-intel
Collaborator

@leslie-fang-intel leslie-fang-intel commented Sep 5, 2024

Stack from ghstack (oldest at bottom):

Summary
Fix issue: #135027. On CPU, the consts_size used to generate _binary_constants_bin_start is not padded to ALIGN_BYTES, while serialized_weights is, causing a failure in the 16K alignment check.
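
The mismatch described above can be sketched in a few lines. This is only an illustration: the `ALIGN_BYTES` value, the `pad_to` helper, and the tensor sizes are assumptions made for the example, not code from the PR.

```python
ALIGN_BYTES = 64  # assumed value, for illustration only


def pad_to(nbytes: int, align: int = ALIGN_BYTES) -> int:
    """Round nbytes up to the next multiple of align."""
    return (nbytes + align - 1) // align * align


# Hypothetical per-tensor sizes in bytes.
raw_sizes = [100, 4, 260]

# Size computed without padding (what an unpadded consts_size would report).
unpadded_total = sum(raw_sizes)

# Size of the data when every tensor is padded before being concatenated.
padded_total = sum(pad_to(n) for n in raw_sizes)

print(unpadded_total, padded_total)
```

Whenever any tensor size is not already a multiple of the alignment, the two totals disagree, so any offset derived from the unpadded total no longer matches the layout of the padded data.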

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @rec

Differential Revision: D62307347

@pytorch-bot

pytorch-bot bot commented Sep 5, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/135205

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit ebb7ba0 with merge base e000cf0 (image):

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@leslie-fang-intel leslie-fang-intel added the ciflow/trunk label (Trigger trunk jobs on your pull request) Sep 5, 2024
@leslie-fang-intel leslie-fang-intel changed the title from "[Inductor] Enable UniformValueConstantFolder for general get_attr node" to "[Inductor] Fix AOT weight alignment issue on CPU" Sep 5, 2024
@leslie-fang-intel leslie-fang-intel requested review from chunyuan-w, desertfire, eellison and jgong5 and removed request for eellison September 5, 2024 10:17
Collaborator

@jgong5 jgong5 left a comment

CI failing?

**Summary**
Fix issue: #135027. The `consts_size` used to calculate `_binary_constants_bin_start` is not padded to `ALIGN_BYTES`, whereas `serialized_weights` is, which fails the `16K` alignment check.



cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
@leslie-fang-intel
Collaborator Author

> CI failing?

DLRM was previously marked as an expected failure; I changed it to pass, since this PR should fix it. It seems we can use this model as the unit test that covers this issue.


output_o = os.path.splitext(input_path)[0] + ".o"

all_cuda = all(
Contributor

I originally proposed skipping alignment for the `all_cuda` case, but that seems unnecessarily complex. It is OK to stick with it for this PR, but I will create a follow-up PR to skip the `all_cuda` check and verify that it works for GPU.

leslie-fang-intel added a commit that referenced this pull request Sep 5, 2024
@leslie-fang-intel
Collaborator Author

The same failure of test/inductor/test_cudacodecache.py::TestCUDACodeCache::test_cuda_load also exists on main: https://github.com/pytorch/pytorch/actions/runs/10722567567/job/29736827985

@leslie-fang-intel
Collaborator Author

@pytorchbot merge -i "un-related ci failure"

@pytorch-bot

pytorch-bot bot commented Sep 6, 2024

❌ 🤖 pytorchbot command failed:

@pytorchbot: error: unrecognized arguments: un-related ci failure

usage: @pytorchbot [-h] {merge,revert,rebase,label,drci,cherry-pick,close} ...

Try @pytorchbot --help for more info.

@leslie-fang-intel
Collaborator Author

@pytorchbot merge -f "un-related ci failure"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@aorenste
Contributor

aorenste commented Sep 6, 2024

@aorenste has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

tolleybot pushed a commit to tolleybot/pytorch that referenced this pull request Sep 14, 2024
**Summary**
Fix issue: pytorch#135027. On CPU, the `consts_size` used to generate `_binary_constants_bin_start` is not padded to `ALIGN_BYTES`, while `serialized_weights` is, causing a failure in the 16K alignment check.

Pull Request resolved: pytorch#135205
Approved by: https://github.com/jgong5, https://github.com/desertfire
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Sep 20, 2024
@github-actions github-actions bot deleted the gh/leslie-fang-intel/145/head branch October 7, 2024 02:07
pytorchmergebot pushed a commit that referenced this pull request Oct 31, 2024
…139054)

Fixes the failure of INT8 DLRM using AOTI.
The previous code calculates `consts_size` directly using `tensor` from `graph.constants`:
```
  consts_size = sum(
      get_nbytes_of_tensor(tensor, all_cuda)
      for (name, tensor) in graph.constants.items()
      if name not in graph.folded_constants
  )
```
Meanwhile, the bytes actually serialized (`serialized_weights`) come from `graph.get_original_value_of_constant(name)`:
```
  serialized_weights = b"".join(
      _to_bytes(graph.get_original_value_of_constant(name), all_cuda)
      for name in graph.constants.keys()
      if name not in graph.folded_constants
  )
```

The `tensor` from `graph.constants` can differ from `graph.get_original_value_of_constant(name)`, which makes `consts_size` inconsistent with the actual byte size of `serialized_weights` and results in the runtime error `weights_offset must be aligned to 16K boundary`, similar to what happened in #135205.

This PR gets `consts_size` directly from `len(serialized_weights)`, which fixes the inconsistency.
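
A minimal sketch of why deriving the size from the serialized bytes avoids the mismatch. The `to_bytes` helper, the 64-byte alignment, and the two dictionaries are hypothetical stand-ins for `_to_bytes`, `graph.constants`, and `graph.get_original_value_of_constant(name)`; the real objects hold tensors, not raw bytes.

```python
def to_bytes(payload: bytes, align: int = 64) -> bytes:
    """Hypothetical serializer: zero-pad payload to a multiple of align."""
    padding = (-len(payload)) % align
    return payload + b"\x00" * padding


# Hypothetical data: the "current" constant may differ from its original value.
current = {"w": b"\x01" * 10, "b": b"\x02" * 70}
original = {"w": b"\x01" * 100, "b": b"\x02" * 70}

# Old approach: compute the size from one source...
size_from_current = sum(len(to_bytes(value)) for value in current.values())

# ...while the bytes that are actually written come from another source.
serialized_weights = b"".join(to_bytes(original[name]) for name in current)

# New approach: measure exactly what was serialized.
consts_size = len(serialized_weights)

print(size_from_current, consts_size)
```

Because `consts_size` is measured on the final byte string, it cannot drift from the data regardless of which value each constant was serialized from.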

We also added a `reduce_range` argument to the `get_default_x86_inductor_quantization_config` function, which the unit test needs to avoid accuracy issues on CI machines (earlier CPUs without VNNI).
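
The PR text does not explain why a reduced range helps on CPUs without VNNI. One common explanation, given here as an assumption and not as a description of this PR's code, is that older int8 kernels sum pairs of uint8 × int8 products into a saturating 16-bit accumulator. The activation ranges below (0..255 for the full range, 0..127 for the reduced range) are likewise assumed for illustration.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def worst_case_pair_sum(act_max: int, w_min: int = -128, w_max: int = 127):
    """Extreme values of a0*w0 + a1*w1 for activations in [0, act_max]
    and signed 8-bit weights in [w_min, w_max]."""
    return 2 * act_max * w_min, 2 * act_max * w_max


def fits_int16(act_max: int) -> bool:
    """True if the worst-case pair sum cannot overflow a 16-bit accumulator."""
    lowest, highest = worst_case_pair_sum(act_max)
    return INT16_MIN <= lowest and highest <= INT16_MAX


full_range_ok = fits_int16(255)     # assumed full uint8 activation range
reduced_range_ok = fits_int16(127)  # assumed reduced 7-bit activation range

print(full_range_ok, reduced_range_ok)
```

Under these assumptions the full range can saturate the intermediate sum while the reduced range cannot, which would be consistent with the accuracy differences mentioned for CI machines.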

Pull Request resolved: #139054
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/desertfire
rahulsingh-intel pushed a commit to rahulsingh-intel/pytorch that referenced this pull request Nov 5, 2024
…ytorch#139054)
