@atalman atalman commented Feb 28, 2025

Fixes: #148120

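Tested with almalinux/9-base:latest. The import fails before the patch and succeeds after applying it to the installed torch/__init__.py: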

>>> import torch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib64/python3.9/site-packages/torch/__init__.py", line 401, in <module>
    from torch._C import *  # noqa: F403
ImportError: libcufile.so.0: cannot open shared object file: No such file or directory
>>> exit()
[root@18b37257e416 /]# vi /usr/local/lib64/python3.9/site-packages/torch/__init__.py
[root@18b37257e416 /]# python3
Python 3.9.19 (main, Sep 11 2024, 00:00:00) 
[GCC 11.5.0 20240719 (Red Hat 11.5.0-2)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
/usr/local/lib64/python3.9/site-packages/torch/_subclasses/functional_tensor.py:276: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:81.)
  cpu = _conversion_method_template(device=torch.device("cpu"))
>>> torch.__version__
'2.7.0.dev20250227+cu126'
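
The failure above is raised by the dynamic loader, so it can be checked without importing torch. A minimal sketch, assuming a Linux host; `can_load` is a hypothetical helper name and is not part of torch:

```python
import ctypes


def can_load(soname: str) -> bool:
    """Return True if the dynamic loader can open `soname` (hypothetical helper)."""
    try:
        # ctypes.CDLL asks the loader to open the library by name,
        # using the default library search path.
        ctypes.CDLL(soname)
    except OSError:
        # The loader could not find or open the library.
        return False
    return True
```

Calling `can_load("libcufile.so.0")` in the affected container would be expected to return `False` for as long as the library is not on the loader's search path.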

pytorch-bot bot commented Feb 28, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/148137

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 47 Pending

As of commit 9dc23f3 with merge base fc78192 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

malfet added the release notes: releng and topic: bug fixes labels Feb 28, 2025
malfet commented Feb 28, 2025

@pytorchbot merge -f "What can go wrong"

@pytorchmergebot

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.


"cufft": "libcufft.so.*[0-9]",
"curand": "libcurand.so.*[0-9]",
"nvjitlink": "libnvJitLink.so.*[0-9]",
"cufile": "libcufile.so.*[0-9]",
@mikaylagawarecki mikaylagawarecki Feb 28, 2025

Thank you for fixing this!!

I don't fully understand how this bit of code works 😅, but since cufile is only a dependency of the CUDA 12.6 and 12.8 binaries, do we need an if statement for that here?

Also, do we need to check that the platform is not Windows?

Otherwise, perhaps this would break the other binaries, similar to what happened for 2.5.0 :( #138324

@atalman atalman Feb 28, 2025

@mikaylagawarecki It works on CUDA 11.8. I applied this patch before executing:

>>> import torch
/usr/local/lib64/python3.9/site-packages/torch/_subclasses/functional_tensor.py:276: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:81.)
  cpu = _conversion_method_template(device=torch.device("cpu"))
>>> torch.__version__
'2.7.0.dev20250227+cu118'

I believe this patch fixes exactly this issue. For 2.5.1, cufile was not installed via PyPI. This time it is, so we just preload these libs from the correct path.
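
The preloading described above can be sketched as follows. This is an illustration and not the code in `torch/__init__.py`: the helper names `find_cuda_lib` and `preload_cuda_lib`, and the `nvidia/<folder>/lib` layout of the pip-installed wheels, are assumptions made for the example.

```python
import ctypes
import glob
import os
import sys
from typing import Iterable, Optional


def find_cuda_lib(lib_folder: str, pattern: str,
                  search_roots: Optional[Iterable[str]] = None) -> Optional[str]:
    """Return the first file matching `pattern` under <root>/nvidia/<lib_folder>/lib.

    The directory layout is an assumption about how the NVIDIA wheels are
    installed; adjust it for your environment.
    """
    roots = sys.path if search_roots is None else search_roots
    for root in roots:
        candidates = sorted(
            glob.glob(os.path.join(root, "nvidia", lib_folder, "lib", pattern))
        )
        if candidates:
            return candidates[0]
    return None


def preload_cuda_lib(lib_folder: str, pattern: str) -> bool:
    """Load the library into the process; return False if no file matched."""
    path = find_cuda_lib(lib_folder, pattern)
    if path is None:
        return False
    # Loading by absolute path sidesteps the loader's default search path.
    ctypes.CDLL(path)
    return True
```

Under these assumptions, `preload_cuda_lib("cufile", "libcufile.so.*[0-9]")` would load the copy shipped in the wheel. The glob pattern matches versioned names such as `libcufile.so.0` but not an unversioned `libcufile.so`.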

pytorchmergebot pushed a commit that referenced this pull request Mar 1, 2025
Follow up after #148137
Make sure we don't try to load cufile on CUDA 11.8

Test:
```
>>> import torch
/usr/local/lib64/python3.9/site-packages/torch/_subclasses/functional_tensor.py:276: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:81.)
  cpu = _conversion_method_template(device=torch.device("cpu"))
>>> torch.__version__
'2.7.0.dev20250227+cu118'
>>>
```

Pull Request resolved: #148184
Approved by: https://github.com/mikaylagawarecki
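
The follow-up commit above avoids loading cufile on CUDA 11.8. One way to express such a guard is to filter the preload table by CUDA version and platform before loading anything. The sketch below is hypothetical and is not the change made in #148184: the function name, the `(12, 6)` threshold (taken from the review comment that cufile is only a dependency of the CUDA 12.6 and 12.8 binaries), and the Windows exclusion are all assumptions.

```python
from typing import Dict, Optional

# Only the entries visible in the diff excerpt; the real table may be longer.
CUDA_LIBS: Dict[str, str] = {
    "cufft": "libcufft.so.*[0-9]",
    "curand": "libcurand.so.*[0-9]",
    "nvjitlink": "libnvJitLink.so.*[0-9]",
    "cufile": "libcufile.so.*[0-9]",
}


def libs_to_preload(cuda_version: Optional[str], system: str) -> Dict[str, str]:
    """Return the subset of CUDA_LIBS to preload (hypothetical helper)."""
    # CPU-only builds report no CUDA version; the .so patterns do not apply on Windows.
    if cuda_version is None or system == "Windows":
        return {}
    major, minor = (int(part) for part in cuda_version.split(".")[:2])
    libs = dict(CUDA_LIBS)
    # Assumed threshold: skip cufile for builds older than CUDA 12.6.
    if (major, minor) < (12, 6):
        libs.pop("cufile")
    return libs
```

A caller could pass `torch.version.cuda` and `platform.system()` as the two arguments. With `"11.8"` and `"Linux"`, the returned table omits `cufile`.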
pytorchmergebot pushed a commit that referenced this pull request Mar 6, 2025
seeing `  File "/usr/local/lib/python3.12/site-packages/torch/__init__.py", line 411, in <module>
    from torch._C import *  # noqa: F403
    ^^^^^^^^^^^^^^^^^^^^^^
ImportError: libcufile.so.0: cannot open shared object file: No such file or directory` with arm cu128 nightly.
Related to #148137. The dependency needs to be copied for the ARM build as well.

Pull Request resolved: #148465
Approved by: https://github.com/atalman, https://github.com/abhilash1910
@github-actions github-actions bot deleted the atalman-patch-9 branch March 30, 2025 02:19
Labels

Merged, release notes: releng, topic: bug fixes

Development

Successfully merging this pull request may close these issues.

Torch 2.7.0 nightly cuda 12.6 and cuda 12.8 builds are broken on Amazon linux 2023

5 participants