[ROCm][CI] upgrade CI and manywheel docker images to ROCm 6.2.4 #140851
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/140851
Note: Links to docs will display an error until the docs builds have been completed.
❌ 22 New Failures, 7 Cancelled Jobs, 1 Unrelated Failure
As of commit 19b2e5e with merge base 80870f6:
NEW FAILURES - The following jobs have failed:
CANCELLED JOBS - The following jobs were cancelled. Please retry:
FLAKY - The following job failed but was likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
|
@huydhn While this might fix the issue of timeouts for ROCm docker builds, I feel like we have a logical discrepancy in our workflows vis-a-vis docker builds:
Is there a way to make the PyTorch build job wait for the docker-build job to finish? |
What I usually do is ignore the build jobs at first and just let the docker build job finish. Once it is done, the new image is available on ECR. Then I rerun the build jobs. Because the new image is now available, they won't rebuild the image and should pull it from ECR instead. Let me know if you see different behavior here. |
Yes, I realized I could have done that here too, but I decided to take the opportunity to improve the ROCm docker build times anyway. It's just that it's a manual step, the reasoning for which might not be obvious to most devs. |
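The manual workaround discussed above comes down to a "pull if the tag already exists, otherwise build" decision. A minimal sketch of that decision follows, assuming a hypothetical `plan_docker_step` helper and an in-memory set standing in for the registry (ECR in this thread); it is not the logic of the actual workflow, only an illustration of why rerunning the build jobs after the docker-build job finishes avoids a rebuild.

```python
from typing import Callable


def plan_docker_step(image_tag: str, image_exists: Callable[[str], bool]) -> str:
    """Return "pull" when the registry already has the tag, else "build"."""
    return "pull" if image_exists(image_tag) else "build"


# Hypothetical in-memory stand-in for the image registry.
published = set()
tag = "example-rocm-image:abc123"  # hypothetical tag, not a real image name

# First attempt: the docker-build job has not published the image yet,
# so a regular build job would have to build the image itself.
first = plan_docker_step(tag, published.__contains__)

# The docker-build job finishes and publishes the image.
published.add(tag)

# Rerun of the build job: the image is found and only needs to be pulled.
second = plan_docker_step(tag, published.__contains__)

print(first, second)
```

Making the build job wait on the docker-build job would automate the same ordering that the rerun achieves by hand.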
|
Only shard 6 of 6 failed in the |
|
Updated the manylinux images to 6.2.4: https://github.com/pytorch/pytorch/actions/runs/11923309568/job/33289567551. @atalman @jeffdaily please review and merge |
|
@pytorchbot merge |
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Merge failed
Reason: 19 jobs have failed, first few of them are: Build almalinux docker images / build-docker (11.8), Build almalinux docker images / build-docker (12.1), Build almalinux docker images / build-docker (12.4), Build almalinux docker images / build-docker (12.6), Build almalinux docker images / build-docker (cpu)
Details for Dev Infra team: Raised by workflow job |
|
@pytorchbot revert -m "Need to upgrade libtorch images to ROCm 6.2.4 as well" |
|
❌ 🤖 pytorchbot command failed: Try |
|
@pytorchbot revert -m "Need to upgrade libtorch images to ROCm 6.2.4 as well" -c ghfirst |
|
@pytorchbot successfully started a revert job. Check the current status here. |
|
@jithunnair-amd your PR has been successfully reverted. |
This reverts commit 6c9bfd5. Reverted #140851 on behalf of https://github.com/jithunnair-amd due to Need to upgrade libtorch images to ROCm 6.2.4 as well ([comment](#140851 (comment)))
|
@huydhn Need ECR tag |
Libtorch ECR tag creation is having some issues. But I also see that the libtorch ECR images aren't really used anywhere currently: https://github.com/search?q=repo%3Apytorch%2Fpytorch%20libtorch-cxx11-builder-rocm&type=code And the libtorch nightly builds jobs use the docker images from dockerhub:
Hence, in my understanding, merging this PR without the ECR tag being ready shouldn't break anything except the libtorch docker build jobs in any PRs. |
|
@pytorchbot merge -f "CI failures unrelated to ROCm, except libtorch docker build job, for which explanation is provided in above comment" |
Merge started
Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Fixes the issue of long docker build times in PRs that trigger the docker build in regular PyTorch build jobs, e.g. https://github.com/pytorch/pytorch/actions/runs/11751388838/job/32828886198.

These docker builds take a long time for ROCm6.2 because:
1. They are run on less capable machines (`c5.2xlarge`) instead of the beefier ones on which [docker-build workflows](https://github.com/pytorch/pytorch/blob/924c1fe3f304aa599b823fb549c35b7809f61086/.github/workflows/docker-builds.yml#L50) run (`c5.12xlarge`)
2. ROCm6.2 docker builds enabled building of MIOpen from source, which runs into a [timeout of 90mins](https://github.com/pytorch/test-infra/blob/9abd4d95bb0b86d78d1929abcd6046d07e8a5864/.github/actions/calculate-docker-image/action.yml#L171): https://github.com/pytorch/pytorch/actions/runs/11751388838/job/32828886198#step:7:160

Pull Request resolved: pytorch#140851
Approved by: https://github.com/jeffdaily
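To illustrate how a hard time limit interacts with a slow build, here is a minimal sketch of running a command under a timeout. The `run_with_timeout` helper and the stand-in commands are assumptions made for illustration; this is not how the linked `calculate-docker-image` action is implemented, only a picture of the failure mode in which a build that needs more than the allowed time is cut off.

```python
import subprocess
import sys

# The 90-minute limit mentioned in the description, expressed in seconds.
DOCKER_BUILD_TIMEOUT_S = 90 * 60


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return True if it finished in time, False if it timed out."""
    try:
        subprocess.run(cmd, check=True, timeout=timeout_s)
        return True
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this.
        return False


# Stand-in commands (hypothetical): a "slow build" and a "fast build".
slow_build = [sys.executable, "-c", "import time; time.sleep(5)"]
fast_build = [sys.executable, "-c", "pass"]

# Short limits are used here so the example runs quickly; a real docker
# build would be given DOCKER_BUILD_TIMEOUT_S instead.
slow_ok = run_with_timeout(slow_build, timeout_s=0.5)
fast_ok = run_with_timeout(fast_build, timeout_s=30)

print(slow_ok, fast_ok)
```

A build that only barely fits the limit on a larger machine can be expected to hit it on a smaller one, which is the situation the two points above describe.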
Fixes the issue of long docker build times in PRs that trigger the docker build in regular PyTorch build jobs, e.g. https://github.com/pytorch/pytorch/actions/runs/11751388838/job/32828886198.

These docker builds take a long time for ROCm6.2 because:
1. They are run on less capable machines (`c5.2xlarge`) instead of the beefier ones on which docker-build workflows run (`c5.12xlarge`)
2. ROCm6.2 docker builds enabled building of MIOpen from source, which runs into a timeout of 90mins

cc @jeffdaily @sunway513 @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd