[AOTI] Remove explicit abi_compatible setting in tests #138016

desertfire · 2024-10-15T20:29:48Z

Stack from ghstack (oldest at bottom):

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @chauhang

Differential Revision: D64439674

[ghstack-poisoned]

pytorch-bot · 2024-10-15T20:29:51Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/138016

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit c146673 with merge base 966a1a9 ():
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.


desertfire · 2024-10-16T00:02:22Z

@desertfire has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.


desertfire · 2024-10-16T01:45:10Z

@desertfire has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

malfet

Deleted code is tested code :)

desertfire · 2024-10-16T02:06:52Z

@desertfire has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot · 2024-10-16T21:28:41Z

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

pytorchmergebot · 2024-10-16T21:30:21Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

Summary: The ABI-compatible mode has been turned on as default in #136534. Removing the non-ABI-compatible logic to greatly simplify the wrapper codegen logic.
Differential Revision: [D64439676](https://our.internmc.facebook.com/intern/diff/D64439676)
Pull Request resolved: #138009
Approved by: https://github.com/chenyang78
ghstack dependencies: #137982, #138016

Summary: Continue to clean up non-ABI-compatible mode related code.
Differential Revision: [D64444327](https://our.internmc.facebook.com/intern/diff/D64444327)
Pull Request resolved: #138047
Approved by: https://github.com/chenyang78
ghstack dependencies: #137982, #138016, #138009

ptrblck · 2024-10-18T18:13:11Z

I was trying to isolate an illegal memory access (IMA) triggered by this test, but I see the test was removed and can't find the reason why.
The stack trace from an older PyTorch version that still contains this test:

PYTORCH_NO_CUDA_MEMORY_CACHING=1 cuda-gdb --args  python inductor/test_aot_inductor.py -v -k test_torchvision_transforms_functional_tensor_resize_abi_compatible_cuda
...
r
...
CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x7ffcab79ac10  triton_poi_fused.to_copy.unsafe_index_add_arange_clamp_mul_sub_view_1  (cnzk43i2k6noh4w4ubolxat6sh2vn7i66ccpqdrc62y3yk7vo3so.py:76)

Thread 1 "python" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 0, grid 108, block (63510,0,0), thread (0,0,0), device 0, sm 0, warp 30, lane 0]
triton_poi_fused.to_copy.unsafe_index_add_arange_clamp_mul_sub_view_1<<<(187500,1,1),(128,1,1)>>> () at /tmp/tmpsz9gwx20/nz/cnzk43i2k6noh4w4ubolxat6sh2vn7i66ccpqdrc62y3yk7vo3so.py:76

Are these kernels now covered by another unit test, or are we ignoring these issues?
CC @malfet

desertfire · 2024-10-18T19:57:29Z

It's just a naming change in this case. test_torchvision_transforms_functional_tensor_resize_abi_compatible_cuda -> test_torchvision_transforms_functional_tensor_resize_cuda. I dropped the abi_compatible keyword because it is the default codegen behavior now.
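The renaming described above can be illustrated with a minimal sketch of suffix-based test generation. This is not PyTorch's actual helper: `ResizeTemplate`, `ResizeCpu`, and `copy_tests_with_suffix` are hypothetical names chosen for illustration. It only shows why a test's full name, and therefore any `-k` filter that matches it, changes when a keyword is dropped from the suffix.

```python
import unittest


class ResizeTemplate:
    """Hypothetical device-agnostic test template (not PyTorch's real class)."""

    device = None

    def test_resize(self):
        # Placeholder body; a real test would compile and run a model here.
        assert self.device in ("cpu", "cuda")


def copy_tests_with_suffix(template, target, suffix):
    """Copy every test_* method from template onto target as test_*_<suffix>."""
    for name, fn in vars(template).items():
        if name.startswith("test_") and callable(fn):
            setattr(target, f"{name}_{suffix}", fn)


class ResizeCpu(unittest.TestCase):
    device = "cpu"


# With the suffix "abi_compatible_cpu" the generated name would be
# test_resize_abi_compatible_cpu; with "cpu" it is test_resize_cpu.
copy_tests_with_suffix(ResizeTemplate, ResizeCpu, "cpu")

generated = sorted(n for n in dir(ResizeCpu) if n.startswith("test_"))
print(generated)
```

Under this scheme the test body is unchanged by a rename; only the name that a `-k` filter must match is different.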

ptrblck · 2024-10-21T22:04:25Z

Thanks for the info, @desertfire!

When running a nightly from today with source from main I see:

PYTORCH_NO_CUDA_MEMORY_CACHING=1 python inductor/test_aot_inductor.py -v -k test_torchvision_transforms_functional_tensor_resize_cuda -v
test_torchvision_transforms_functional_tensor_resize_cuda (__main__.AOTInductorTestABICompatibleCuda.test_torchvision_transforms_functional_tensor_resize_cuda) ... ERROR

======================================================================
ERROR: test_torchvision_transforms_functional_tensor_resize_cuda (__main__.AOTInductorTestABICompatibleCuda.test_torchvision_transforms_functional_tensor_resize_cuda)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/workspace/src/pytorch/test/inductor/test_aot_inductor.py", line 3686, in setUp
    torch.ops.load_library(str(lib_file_path))
  File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1357, in load_library
    ctypes.CDLL(path)
  File "/usr/lib/python3.12/ctypes/__init__.py", line 379, in __init__
    self._handle = _dlopen(self._name, mode)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: /usr/local/lib/python3.12/dist-packages/torch/build/lib/libaoti_custom_ops.so: cannot open shared object file: No such file or directory

----------------------------------------------------------------------
Ran 1 test in 0.001s

FAILED (errors=1)

So I guess I might need to run the non-CUDA test first to build the actual lib, but that fails with:

PYTORCH_NO_CUDA_MEMORY_CACHING=1 python inductor/test_aot_inductor.py -v -k test_torchvision_transforms_functional_tensor_resize -v
test_torchvision_transforms_functional_tensor_resize_cpu (__main__.AOTInductorTestABICompatibleCpu.test_torchvision_transforms_functional_tensor_resize_cpu) ... ERROR
test_torchvision_transforms_functional_tensor_resize_cpu_with_stack_allocation (__main__.AOTInductorTestABICompatibleCpuWithStackAllocation.test_torchvision_transforms_functional_tensor_resize_cpu_with_stack_allocation) ... ERROR
test_torchvision_transforms_functional_tensor_resize_cpu_with_stack_allocation_and_minimal_arrayref_interface (__main__.AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface.test_torchvision_transforms_functional_tensor_resize_cpu_with_stack_allocation_and_minimal_arrayref_interface) ... In file included from /usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/inductor/aoti_runtime/arrayref_tensor.h:3,
                 from /tmp/tmppo5aidep/cj3v2c67ufgabywfqrvifllhohdm3jwuizd3erl2vwvyo6acesld/ccohu4kz2drtkxu47qvmlcjb3zmgk2ve463tdwlfnbqcsy2vcnsn.cpp:2:
/tmp/tmppo5aidep/cj3v2c67ufgabywfqrvifllhohdm3jwuizd3erl2vwvyo6acesld/ccohu4kz2drtkxu47qvmlcjb3zmgk2ve463tdwlfnbqcsy2vcnsn.cpp: In member function ‘Outputs torch::aot_inductor::AOTInductorModel::run_impl_minimal_arrayref_interface(const Inputs&, torch::aot_inductor::DeviceStreamType, AOTIProxyExecutorHandle) [with Inputs = std::tuple<torch::aot_inductor::ArrayRefTensor<float>, torch::aot_inductor::ArrayRefTensor<long int> >; Outputs = std::tuple<torch::aot_inductor::ArrayRefTensor<float> >; torch::aot_inductor::DeviceStreamType = void*; AOTIProxyExecutorHandle = AOTIProxyExecutorOpaque*]’:
/tmp/tmppo5aidep/cj3v2c67ufgabywfqrvifllhohdm3jwuizd3erl2vwvyo6acesld/ccohu4kz2drtkxu47qvmlcjb3zmgk2ve463tdwlfnbqcsy2vcnsn.cpp:769:54: error: cannot convert ‘torch::aot_inductor::ArrayRefTensor<float>’ to ‘AtenTensorHandle’ {aka ‘AtenTensorOpaque*’}
  769 |     AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_get_sizes(arg0_1, &arg0_1_size));
      |                                                      ^~~~~~
      |                                                      |
      |                                                      torch::aot_inductor::ArrayRefTensor<float>

In any case, this is unrelated to this PR, and we should follow up in a separate issue to discuss how to run this test, as I might be missing something.

desertfire · 2024-11-12T23:25:16Z

/usr/local/lib/python3.12/dist-packages/torch/build/lib/libaoti_custom_ops.so: cannot open shared object file: No such file or directory

@angelayi, libaoti_custom_ops will only be built when BUILD_TEST=1. This prevents people from running AOTI tests against a nightly build. We should split your custom_ops test into a separate file.
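As an alternative to splitting the file, a test could skip itself when the optional library is missing. The following is only a sketch of that guard pattern, not the fix proposed above or the one adopted: `load_optional_lib`, `CustomOpsTest`, and `LIB_PATH` are hypothetical, and the `loader` argument stands in for a call such as `torch.ops.load_library` so that the example runs without PyTorch.

```python
import unittest
from pathlib import Path


def load_optional_lib(lib_path, loader):
    """Load lib_path via loader, or skip the current test if the file is absent.

    `loader` stands in for a call such as torch.ops.load_library; it is
    passed in so that this sketch runs without PyTorch installed.
    """
    path = Path(lib_path)
    if not path.exists():
        raise unittest.SkipTest(
            f"{path.name} not found; was it built with BUILD_TEST=1?"
        )
    loader(str(path))


class CustomOpsTest(unittest.TestCase):
    # Hypothetical location; the real path depends on the build layout.
    LIB_PATH = "build/lib/libaoti_custom_ops.so"

    def setUp(self):
        # Raising SkipTest in setUp reports the test as skipped, not as an error.
        load_optional_lib(self.LIB_PATH, loader=lambda p: None)

    def test_custom_op(self):
        pass  # Placeholder for a test that needs the custom-op library.
```

With a guard like this, a missing library would surface as a skipped test rather than an `OSError` raised from `setUp`, at the cost of silently reducing coverage on builds without the library.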

[AOTI] Remove explict abi_compatible setting in tests

f8f8d6f

[ghstack-poisoned]

desertfire requested a review from a team as a code owner October 15, 2024 20:29

desertfire mentioned this pull request Oct 15, 2024

[AOTI] Remove non-ABI-compatible tests #137982

Closed

desertfire mentioned this pull request Oct 15, 2024

[AOTI] Remove the non-ABI-compatible mode (part 1) #138009

Closed

pytorch-bot bot added module: inductor release notes: releng release notes category labels Oct 15, 2024

desertfire added topic: not user facing topic category ciflow/inductor labels Oct 15, 2024

desertfire changed the title ~~[AOTI] Remove explict abi_compatible setting in tests~~ [AOTI] Remove explicit abi_compatible setting in tests Oct 15, 2024

Update on "[AOTI] Remove explicit abi_compatible setting in tests"

63337da


malfet approved these changes Oct 16, 2024


pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 16, 2024

desertfire mentioned this pull request Oct 16, 2024

[AOTI] Remove the non-ABI-compatible mode (part 2) #138047

Closed

pytorchmergebot added the merging label Oct 16, 2024

pytorchmergebot added the Merged label Oct 16, 2024

pytorchmergebot closed this in 443472b Oct 16, 2024

pytorchmergebot removed the merging label Oct 16, 2024

eqy mentioned this pull request Oct 25, 2024

[Inductor][AOTInductor] test_constant_folding_abi_compatible_cpu surfaces CUDA error: invalid argument on H100 #138958

Closed

github-actions bot deleted the gh/desertfire/488/head branch December 14, 2024 02:11

6 participants
