pytorch-bot bot commented Oct 15, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/138016

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit c146673 with merge base 966a1a9:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@desertfire changed the title from "[AOTI] Remove explict abi_compatible setting in tests" to "[AOTI] Remove explicit abi_compatible setting in tests" on Oct 15, 2024
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames chauhang

[ghstack-poisoned]
@desertfire
Contributor Author

@desertfire has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Differential Revision: [D64439674](https://our.internmc.facebook.com/intern/diff/D64439674)

@malfet malfet left a comment

Deleted code is tested code :)

@pytorch-bot added the ciflow/trunk (Trigger trunk jobs on your pull request) label on Oct 16, 2024
@facebook-github-bot
Contributor

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

pytorchmergebot pushed a commit that referenced this pull request Oct 17, 2024
Summary: The ABI-compatible mode has been turned on as default in #136534. Removing the non-ABI-compatible logic to greatly simplify the wrapper codegen logic.

Differential Revision: [D64439676](https://our.internmc.facebook.com/intern/diff/D64439676)
Pull Request resolved: #138009
Approved by: https://github.com/chenyang78
ghstack dependencies: #137982, #138016
@ptrblck
Collaborator

ptrblck commented Oct 18, 2024

I was trying to isolate an IMA (illegal memory access) triggered by this test, but I see the test was removed and cannot find the reason why.
The stacktrace using an older PyTorch version containing this test:

PYTORCH_NO_CUDA_MEMORY_CACHING=1 cuda-gdb --args  python inductor/test_aot_inductor.py -v -k test_torchvision_transforms_functional_tensor_resize_abi_compatible_cuda
...
r
...
CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x7ffcab79ac10  triton_poi_fused.to_copy.unsafe_index_add_arange_clamp_mul_sub_view_1  (cnzk43i2k6noh4w4ubolxat6sh2vn7i66ccpqdrc62y3yk7vo3so.py:76)

Thread 1 "python" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 0, grid 108, block (63510,0,0), thread (0,0,0), device 0, sm 0, warp 30, lane 0]
triton_poi_fused.to_copy.unsafe_index_add_arange_clamp_mul_sub_view_1<<<(187500,1,1),(128,1,1)>>> () at /tmp/tmpsz9gwx20/nz/cnzk43i2k6noh4w4ubolxat6sh2vn7i66ccpqdrc62y3yk7vo3so.py:76

Are these kernels now tested in another unit test, or are we ignoring these issues?
CC @malfet

@desertfire
Contributor Author

It's just a naming change in this case. test_torchvision_transforms_functional_tensor_resize_abi_compatible_cuda -> test_torchvision_transforms_functional_tensor_resize_cuda. I dropped the abi_compatible keyword because it is the default codegen behavior now.
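
To make the rename concrete, here is a minimal sketch that maps an old-style test name to its new name by dropping the `_abi_compatible` segment. The helper `strip_abi_compatible` is hypothetical and not part of the PyTorch test suite; it only illustrates the naming change described above.

```python
def strip_abi_compatible(test_name: str) -> str:
    """Drop the '_abi_compatible' segment from an old-style AOTI test name.

    Hypothetical helper for illustration only; names without the segment
    are returned unchanged.
    """
    return test_name.replace("_abi_compatible", "")


old_name = "test_torchvision_transforms_functional_tensor_resize_abi_compatible_cuda"
new_name = strip_abi_compatible(old_name)
print(new_name)
```

The resulting name is the one to pass to `-k` when selecting the test on a build that includes this change.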

@ptrblck
Collaborator

ptrblck commented Oct 21, 2024

Thanks for the info, @desertfire!

When running a nightly from today with source from main I see:

PYTORCH_NO_CUDA_MEMORY_CACHING=1 python inductor/test_aot_inductor.py -v -k test_torchvision_transforms_functional_tensor_resize_cuda -v
test_torchvision_transforms_functional_tensor_resize_cuda (__main__.AOTInductorTestABICompatibleCuda.test_torchvision_transforms_functional_tensor_resize_cuda) ... ERROR

======================================================================
ERROR: test_torchvision_transforms_functional_tensor_resize_cuda (__main__.AOTInductorTestABICompatibleCuda.test_torchvision_transforms_functional_tensor_resize_cuda)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/workspace/src/pytorch/test/inductor/test_aot_inductor.py", line 3686, in setUp
    torch.ops.load_library(str(lib_file_path))
  File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1357, in load_library
    ctypes.CDLL(path)
  File "/usr/lib/python3.12/ctypes/__init__.py", line 379, in __init__
    self._handle = _dlopen(self._name, mode)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: /usr/local/lib/python3.12/dist-packages/torch/build/lib/libaoti_custom_ops.so: cannot open shared object file: No such file or directory

----------------------------------------------------------------------
Ran 1 test in 0.001s

FAILED (errors=1)

So I guess I need to run the non-CUDA test first to build the actual lib, but that fails with:

PYTORCH_NO_CUDA_MEMORY_CACHING=1 python inductor/test_aot_inductor.py -v -k test_torchvision_transforms_functional_tensor_resize -v
test_torchvision_transforms_functional_tensor_resize_cpu (__main__.AOTInductorTestABICompatibleCpu.test_torchvision_transforms_functional_tensor_resize_cpu) ... ERROR
test_torchvision_transforms_functional_tensor_resize_cpu_with_stack_allocation (__main__.AOTInductorTestABICompatibleCpuWithStackAllocation.test_torchvision_transforms_functional_tensor_resize_cpu_with_stack_allocation) ... ERROR
test_torchvision_transforms_functional_tensor_resize_cpu_with_stack_allocation_and_minimal_arrayref_interface (__main__.AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface.test_torchvision_transforms_functional_tensor_resize_cpu_with_stack_allocation_and_minimal_arrayref_interface) ... In file included from /usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/inductor/aoti_runtime/arrayref_tensor.h:3,
                 from /tmp/tmppo5aidep/cj3v2c67ufgabywfqrvifllhohdm3jwuizd3erl2vwvyo6acesld/ccohu4kz2drtkxu47qvmlcjb3zmgk2ve463tdwlfnbqcsy2vcnsn.cpp:2:
/tmp/tmppo5aidep/cj3v2c67ufgabywfqrvifllhohdm3jwuizd3erl2vwvyo6acesld/ccohu4kz2drtkxu47qvmlcjb3zmgk2ve463tdwlfnbqcsy2vcnsn.cpp: In member function ‘Outputs torch::aot_inductor::AOTInductorModel::run_impl_minimal_arrayref_interface(const Inputs&, torch::aot_inductor::DeviceStreamType, AOTIProxyExecutorHandle) [with Inputs = std::tuple<torch::aot_inductor::ArrayRefTensor<float>, torch::aot_inductor::ArrayRefTensor<long int> >; Outputs = std::tuple<torch::aot_inductor::ArrayRefTensor<float> >; torch::aot_inductor::DeviceStreamType = void*; AOTIProxyExecutorHandle = AOTIProxyExecutorOpaque*]’:
/tmp/tmppo5aidep/cj3v2c67ufgabywfqrvifllhohdm3jwuizd3erl2vwvyo6acesld/ccohu4kz2drtkxu47qvmlcjb3zmgk2ve463tdwlfnbqcsy2vcnsn.cpp:769:54: error: cannot convert ‘torch::aot_inductor::ArrayRefTensor<float>’ to ‘AtenTensorHandle’ {aka ‘AtenTensorOpaque*’}
  769 |     AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_get_sizes(arg0_1, &arg0_1_size));
      |                                                      ^~~~~~
      |                                                      |
      |                                                      torch::aot_inductor::ArrayRefTensor<float>

In any case, this is unrelated to this PR. We should follow up in a separate issue to discuss how to run this test, as I might be missing something.

@desertfire
Contributor Author

/usr/local/lib/python3.12/dist-packages/torch/build/lib/libaoti_custom_ops.so: cannot open shared object file: No such file or directory

@angelayi, libaoti_custom_ops is only built when BUILD_TEST=1. This prevents people from running AOTI tests against a nightly build. We should split your custom_ops test into a separate file.
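
Until the tests are split, one possible stopgap is to skip rather than error when the library is absent. The sketch below is an assumption, not the code in `test_aot_inductor.py`: the helper `find_optional_library`, the class `CustomOpsTestBase`, and the `LIB_DIR` attribute are all hypothetical names, and the actual `torch.ops.load_library` call is left as a comment so the example runs without PyTorch installed.

```python
import unittest
from pathlib import Path
from typing import Optional


def find_optional_library(
    lib_dir: Path, lib_name: str = "libaoti_custom_ops.so"
) -> Optional[Path]:
    """Return the library path if the file exists in lib_dir, else None."""
    candidate = Path(lib_dir) / lib_name
    return candidate if candidate.is_file() else None


class CustomOpsTestBase(unittest.TestCase):
    # Hypothetical location; a real test would derive this from the torch install.
    LIB_DIR = Path("/path/to/torch/build/lib")

    def setUp(self) -> None:
        lib_path = find_optional_library(self.LIB_DIR)
        if lib_path is None:
            # Skip instead of raising OSError from dlopen.
            self.skipTest(
                "libaoti_custom_ops.so not found; it is only built with BUILD_TEST=1"
            )
        # A real test would load the library here, for example:
        # torch.ops.load_library(str(lib_path))
        self.lib_path = lib_path
```

With this shape, tests that need the custom ops are reported as skipped on a nightly install instead of failing in `setUp`.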

@github-actions github-actions bot deleted the gh/desertfire/488/head branch December 14, 2024 02:11

Labels

ciflow/inductor, ciflow/trunk, Merged, module: inductor, release notes: releng, topic: not user facing

6 participants