Torch 2.7.0 nightly CUDA 12.6 and CUDA 12.8 builds are broken on Amazon Linux 2023 #148120

@atalman

🐛 Describe the bug

Failure can be seen here:
https://github.com/pytorch/test-infra/actions/runs/13558218752/job/37934127476

Error:

2025-02-27T16:00:09.7113764Z + python3 .ci/pytorch/smoke_test/smoke_test.py --package torchonly
2025-02-27T16:00:09.7114301Z Traceback (most recent call last):
2025-02-27T16:00:09.7114936Z   File "/pytorch/pytorch/.ci/pytorch/smoke_test/smoke_test.py", line 11, in <module>
2025-02-27T16:00:09.7115416Z     import torch
2025-02-27T16:00:09.7115840Z   File "/usr/local/lib64/python3.9/site-packages/torch/__init__.py", line 401, in <module>
2025-02-27T16:00:09.7116346Z     from torch._C import *  # noqa: F403
2025-02-27T16:00:09.7116835Z ImportError: libcufile.so.0: cannot open shared object file: No such file or directory
2025-02-27T16:00:09.7117696Z   File "/home/ec2-user/actions-runner/_work/test-infra/test-infra/test-infra/.github/scripts/run_with_env_secrets.py", line 102, in <module>
2025-02-27T16:00:09.7118364Z     main()
2025-02-27T16:00:09.7118954Z   File "/home/ec2-user/actions-runner/_work/test-infra/test-infra/test-infra/.github/scripts/run_with_env_secrets.py", line 98, in main
2025-02-27T16:00:09.7119687Z     run_cmd_or_die(f"docker exec -t {container_name} /exec")
2025-02-27T16:00:09.7120450Z   File "/home/ec2-user/actions-runner/_work/test-infra/test-infra/test-infra/.github/scripts/run_with_env_secrets.py", line 39, in run_cmd_or_die
2025-02-27T16:00:09.7121360Z     raise RuntimeError(f"Command {cmd} failed with exit code {exit_code}")

Looks like this is a result of the cufile addition in #145748: `import torch` now needs `libcufile.so.0`, which the loader cannot find on this image.
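To check an image for this without importing torch, something like the sketch below may help. It is not part of the smoke test, and `can_load` is a hypothetical helper name. It only asks the dynamic loader to open the library by soname via `ctypes.CDLL`, so it uses the default search path. A torch wheel may resolve its CUDA libraries through its own RPATH or bundled directories, so a MISSING result here is a hint rather than proof.

```python
import ctypes


def can_load(soname: str) -> bool:
    """Return True if the dynamic loader can open the given shared library.

    Hypothetical helper for illustration; it only consults the default
    loader search path (LD_LIBRARY_PATH, ld.so cache, system directories).
    """
    try:
        ctypes.CDLL(soname)
    except OSError:
        # Raised when the library (or one of its dependencies) cannot be opened.
        return False
    return True


if __name__ == "__main__":
    # libcufile.so.0 is the library named in the ImportError above.
    for name in ("libcufile.so.0",):
        status = "found" if can_load(name) else "MISSING"
        print(f"{name}: {status}")
```

Running it inside the same container as the smoke test shows whether the library is visible to the loader before `import torch` is attempted.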

Versions

torch-2.7.0.dev20250227+cu126

cc @seemethere @malfet @mikaylagawarecki
