[ROCm][CI] upgrade CI and manywheel docker images to ROCm 6.2.4 #140851

jithunnair-amd · 2024-11-15T21:54:42Z

Fixes issue of long docker build times in PRs which trigger the docker build in regular PyTorch build jobs eg. https://github.com/pytorch/pytorch/actions/runs/11751388838/job/32828886198. These docker builds take a long time for ROCm6.2 because:

They are run on less capable machines (c5.2xlarge) instead of the beefier ones on which docker-build workflows run (c5.12xlarge)
ROCm6.2 docker builds enabled building of MIOpen from source, which runs into timeout of 90mins: https://github.com/pytorch/pytorch/actions/runs/11751388838/job/32828886198#step:7:160

cc @jeffdaily @sunway513 @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd

pytorch-bot · 2024-11-15T21:54:46Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/140851

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

❌ 22 New Failures, 7 Cancelled Jobs, 1 Unrelated Failure

As of commit 19b2e5e with merge base 80870f6 ():

NEW FAILURES - The following jobs have failed:

Build manywheel docker images for s390x / build-docker-cpu-s390x (gh)
no space left on device
linux-binary-manywheel / manywheel-py3_9-cuda12_6-test / test (gh)
RuntimeError: cuDNN version incompatibility: PyTorch was compiled against (9, 5, 1) but found runtime version (9, 1, 0). PyTorch already comes bundled with cuDNN. One option to resolving this error is to ensure PyTorch can find the bundled cuDNN. one possibility is that there is a conflicting cuDNN in LD_LIBRARY_PATH.
periodic / linux-focal-cuda11.8-py3.10-gcc9-debug / test (default, 2, 5, lf.linux.4xlarge.nvidia.gpu, oncall:debug-build) (gh)
test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_linalg_lu_factor_cuda_float64
pull / cuda12.1-py3.10-gcc9-sm75 / build (gh)
##[error]Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
pull / linux-focal-cuda11.8-py3.10-gcc9 / build (gh)
##[error]Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
pull / linux-focal-cuda12.1-py3.10-gcc9 / build (gh)
##[error]Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
pull / linux-focal-cuda12.1-py3.10-gcc9-sm86 / build (gh)
##[error]Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
pull / linux-focal-py3_9-clang9-xla / build (gh)
##[error]Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
pull / linux-focal-py3.11-clang10 / build (gh)
##[error]Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
pull / linux-focal-py3.12-clang10 / build (gh)
##[error]Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
pull / linux-focal-py3.9-clang10 / build (gh)
##[error]Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
pull / linux-focal-py3.9-clang10-onnx / build (gh)
##[error]Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
pull / linux-focal-rocm6.2-py3.10 / build (gh)
##[error]Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
pull / linux-jammy-cuda11.8-cudnn9-py3.9-clang12 / build (gh)
##[error]Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
pull / linux-jammy-py3-clang12-executorch / build (gh)
##[error]Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
pull / linux-jammy-py3-clang12-mobile-build / build (gh)
##[error]Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
pull / linux-jammy-py3.10-clang15-asan / build (gh)
##[error]Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
pull / linux-jammy-py3.9-gcc11 / build (gh)
##[error]Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
pull / linux-jammy-py3.9-gcc11-mobile-lightweight-dispatch-build / build (gh)
##[error]Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
pull / linux-jammy-py3.9-gcc11-no-ops / build (gh)
##[error]Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
pull / linux-jammy-py3.9-gcc11-pch / build (gh)
##[error]Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
pull / win-vs2019-cpu-py3 / build (gh)
sccache: error: couldn't connect to server

CANCELLED JOBS - The following jobs were cancelled. Please retry:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

rocm / linux-focal-rocm6.2-py3.10 / test (default, 1, 6, linux.rocm.gpu.2) (gh) (disabled by #141458)
test_linalg.py::TestLinalgCUDA::test_matmul_small_brute_force_tunableop_cuda_float16

This comment was automatically generated by Dr. CI and updates every 15 minutes.

jithunnair-amd · 2024-11-15T22:14:05Z

@huydhn While this might fix the issue of timeouts for ROCm docker builds, I feel like we have a logical discrepancy in our workflows vis-a-vis docker builds:

The docker-builds.yml specifies a timeout of 4hrs, and runs on beefier machines c5.12xlarge
If any PR tries to update anything that requires a new docker image tag (which doesn't already exist), it triggers docker-builds workflow as well as pull workflows.
However, the pull workflow jobs require the new docker image that would be generated and pushed to ECR docker registry by the docker-build workflow
Since they are kicked off at the same time, the pull workflow ends up deciding to build the docker image itself, since it's not yet available, and runs the docker image build on a weaker c5.2xlarge machine with a smaller timeout of 1.5hrs (90mins) as well
This obviously doesn't go well for docker build jobs that need more time

Is there a way to make the PyTorch build job depend on the docker-build job to finish?

huydhn · 2024-11-15T22:23:49Z

Is there a way to make the PyTorch build job depend on the docker-build job to finish?

The way I usually do is to ignore the build jobs at first and just let the docker build job to finish. Once done, it will make the new image available on ECR. Then, I will rerun the build jobs. Because the new image is now available, they won't re-build the image again and should pull from ECR instead. Let me know if you see a different behavior here.

jithunnair-amd · 2024-11-15T22:25:58Z

Is there a way to make the PyTorch build job depend on the docker-build job to finish?

The way I usually do is to ignore the build jobs at first and just let the docker build job to finish. Once done, it will make the new image available on ECR. Then, I will rerun the build jobs. Because the new image is now available, they won't re-build the image again and should pull from ECR instead. Let me know if you see a different behavior here.

Yes, I realized I could have done that here too, but I decided to take the opportunity to improve the ROCm docker build times anyway. It's just that it's a manual step, the reasoning for which might not be obvious to most devs.

jithunnair-amd · 2024-11-19T00:51:24Z

Only shard 6 of 6 failed in the rocm workflow for this test: test_linalg.py::TestLinalgCUDA::test_matmul_small_brute_force_tunableop_cuda_float16, which is seen in 6.2.0 runs as well, so it's not a 6.2.4-specific issue.

jithunnair-amd · 2024-11-19T21:50:25Z

@jeffdaily

https://hud.pytorch.org/pr/140851#docker-builds shows ROCm CI base docker images built in 53m
https://hud.pytorch.org/pr/140851#build%20manywheel%20docker%20images shows ROCm manywheel docker images built in 1.3h or less
rocm, inductor-rocm and periodic workflows ran successfully for ROCm, except that test_linalg.py::TestLinalgCUDA::test_matmul_small_brute_force_tunableop_cuda_float16 test which is flaky.

~~Can you please approve and force merge the PR?~~ Please hold on, I need to move the manylinux images to build with 6.2.4 as well.

jithunnair-amd · 2024-11-21T02:12:45Z

@jeffdaily

https://hud.pytorch.org/pr/140851#docker-builds shows ROCm CI base docker images built in 53m

https://hud.pytorch.org/pr/140851#build%20manywheel%20docker%20images shows ROCm manywheel docker images built in 1.3h or less

rocm, inductor-rocm and periodic workflows ran successfully for ROCm, except that test_linalg.py::TestLinalgCUDA::test_matmul_small_brute_force_tunableop_cuda_float16 test which is flaky.

~~Can you please approve and force merge the PR?~~ Please hold on, I need to move the manylinux images to build with 6.2.4 as well.

Updated the manylinux images to 6.2.4: https://github.com/pytorch/pytorch/actions/runs/11923309568/job/33289567551. @atalman @jeffdaily please review and merge

jeffdaily · 2024-11-21T17:39:27Z

@pytorchbot merge

pytorchmergebot · 2024-11-21T17:41:34Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status
here

pytorchmergebot · 2024-11-21T17:42:24Z

Merge failed

Reason: 19 jobs have failed, first few of them are: Build almalinux docker images / build-docker (11.8), Build almalinux docker images / build-docker (12.1), Build almalinux docker images / build-docker (12.4), Build almalinux docker images / build-docker (12.6), Build almalinux docker images / build-docker (cpu)

Details for Dev Infra team

Raised by workflow job

jithunnair-amd · 2024-11-22T00:42:24Z

@pytorchbot revert -m "Need to upgrade libtorch images to ROCm 6.2.4 as well"

pytorch-bot · 2024-11-22T00:42:26Z

❌ 🤖 pytorchbot command failed:

@pytorchbot revert: error: the following arguments are required: -c/--classification

usage: @pytorchbot revert -m MESSAGE -c
                          {nosignal,ignoredsignal,landrace,weird,ghfirst}

Try @pytorchbot --help for more info.

jithunnair-amd · 2024-11-22T00:43:01Z

@pytorchbot revert -m "Need to upgrade libtorch images to ROCm 6.2.4 as well" -c ghfirst

pytorchmergebot · 2024-11-22T00:44:28Z

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot · 2024-11-22T00:44:37Z

@jithunnair-amd your PR has been successfully reverted.

This reverts commit 6c9bfd5. Reverted #140851 on behalf of https://github.com/jithunnair-amd due to Need to upgrade libtorch images to ROCm 6.2.4 as well ([comment](#140851 (comment)))

jithunnair-amd · 2024-11-22T00:50:09Z

@huydhn Need ECR tag libtorch-cxx11-builder-rocm6.2.4 to be created please

jithunnair-amd · 2024-11-23T03:33:41Z

@huydhn Need ECR tag libtorch-cxx11-builder-rocm6.2.4 to be created please

Libtorch ECR tag creation is having some issues. But I also see that the libtorch ECR images aren't really used anywhere currently: https://github.com/search?q=repo%3Apytorch%2Fpytorch%20libtorch-cxx11-builder-rocm&type=code

And the libtorch nightly builds jobs use the docker images from dockerhub:

pytorch/.github/workflows/generated-linux-binary-libtorch-pre-cxx11-nightly.yml

Line 449 in a8ab6b0

DOCKER_IMAGE: pytorch/manylinux-builder:rocm6.2-main

Hence, in my understanding, merging this PR without the ECR tag being ready shouldn't break anything except the libtorch docker build jobs in any PRs.

jithunnair-amd · 2024-11-23T03:34:40Z

@pytorchbot merge -f "CI failures unrelated to ROCm, except libtorch docker build job, for which explanation is provided in above comment"

pytorchmergebot · 2024-11-23T03:36:14Z

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status
here

Fixes issue of long docker build times in PRs which trigger the docker build in regular PyTorch build jobs eg. https://github.com/pytorch/pytorch/actions/runs/11751388838/job/32828886198. These docker builds take a long time for ROCm6.2 because: 1. They are run on less capable machines (`c5.2xlarge`) instead of the beefier ones on which [docker-build workflows](https://github.com/pytorch/pytorch/blob/924c1fe3f304aa599b823fb549c35b7809f61086/.github/workflows/docker-builds.yml#L50) run (`c5.12xlarge`) 2. ROCm6.2 docker builds enabled building of MIOpen from source, which runs into [timeout of 90mins](https://github.com/pytorch/test-infra/blob/9abd4d95bb0b86d78d1929abcd6046d07e8a5864/.github/actions/calculate-docker-image/action.yml#L171): https://github.com/pytorch/pytorch/actions/runs/11751388838/job/32828886198#step:7:160 Pull Request resolved: pytorch#140851 Approved by: https://github.com/jeffdaily

This reverts commit 6c9bfd5. Reverted pytorch#140851 on behalf of https://github.com/jithunnair-amd due to Need to upgrade libtorch images to ROCm 6.2.4 as well ([comment](pytorch#140851 (comment)))

…rch#140851) Fixes issue of long docker build times in PRs which trigger the docker build in regular PyTorch build jobs eg. https://github.com/pytorch/pytorch/actions/runs/11751388838/job/32828886198. These docker builds take a long time for ROCm6.2 because: 1. They are run on less capable machines (`c5.2xlarge`) instead of the beefier ones on which [docker-build workflows](https://github.com/pytorch/pytorch/blob/924c1fe3f304aa599b823fb549c35b7809f61086/.github/workflows/docker-builds.yml#L50) run (`c5.12xlarge`) 2. ROCm6.2 docker builds enabled building of MIOpen from source, which runs into [timeout of 90mins](https://github.com/pytorch/test-infra/blob/9abd4d95bb0b86d78d1929abcd6046d07e8a5864/.github/actions/calculate-docker-image/action.yml#L171): https://github.com/pytorch/pytorch/actions/runs/11751388838/job/32828886198#step:7:160 Pull Request resolved: pytorch#140851 Approved by: https://github.com/jeffdaily

jithunnair-amd added 2 commits November 15, 2024 12:51

Bump N to 6.2.4 to include MIOpen fixes not present in ROCm6.2

305f115

Don't build MIOpen for ROCm6.2.4 since issues in ROCm6.2 already fixed

792d637

pytorch-bot bot added ciflow/rocm Trigger "default" config CI on ROCm module: rocm AMD GPU support for Pytorch topic: not user facing topic category labels Nov 15, 2024

pytorchbot added the open source label Nov 15, 2024

jithunnair-amd added ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR ciflow/inductor ciflow/inductor-rocm Trigger "inductor" config CI on ROCm and removed ciflow/inductor labels Nov 19, 2024

jithunnair-amd marked this pull request as ready for review November 19, 2024 21:39

jithunnair-amd requested a review from jeffdaily as a code owner November 19, 2024 21:39

jithunnair-amd marked this pull request as draft November 19, 2024 22:58

Move manywheel images to ROCm6.2.4

d01a11a

jataylo mentioned this pull request Nov 20, 2024

[triton] Update pin for PyTorch 2.6/Triton 3.2 #139206

Closed

jithunnair-amd marked this pull request as ready for review November 21, 2024 02:11

jithunnair-amd requested a review from a team as a code owner November 21, 2024 02:11

jeffdaily approved these changes Nov 21, 2024

View reviewed changes

pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 21, 2024

pytorchmergebot added the merging label Nov 21, 2024

pytorchmergebot removed the merging label Nov 21, 2024

jithunnair-amd deleted the jnair/upgrade_to_rocm_6.2.4 branch November 21, 2024 21:15

jithunnair-amd mentioned this pull request Nov 21, 2024

Upgrade ROCm wheels to manylinux2_28 - 1a of 2 (docker images) #140681

Closed

jithunnair-amd changed the title ~~[ROCm][CI] upgrade CI to ROCm 6.2.4~~ [ROCm][CI] upgrade CI and manywheel docker images to ROCm 6.2.4 Nov 21, 2024

pytorchmergebot added Reverted ci-no-td Do not run TD on this PR labels Nov 22, 2024

Update libtorch images to ROCm6.2.4 as well

19b2e5e

jithunnair-amd reopened this Nov 22, 2024

pytorchmergebot added the merging label Nov 23, 2024

pytorchmergebot closed this in 6cc2297 Nov 23, 2024

pytorchmergebot removed the merging label Nov 23, 2024

atalman mentioned this pull request Nov 27, 2024

Upgrade ROCm wheels to manylinux2_28 - 2 of 2 (binaries) #141423

Closed

[ROCm][CI] upgrade CI and manywheel docker images to ROCm 6.2.4 #140851

[ROCm][CI] upgrade CI and manywheel docker images to ROCm 6.2.4 #140851

Uh oh!

Conversation

jithunnair-amd commented Nov 15, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pytorch-bot bot commented Nov 15, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/140851

❌ 22 New Failures, 7 Cancelled Jobs, 1 Unrelated Failure

Uh oh!

jithunnair-amd commented Nov 15, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

huydhn commented Nov 15, 2024

Uh oh!

jithunnair-amd commented Nov 15, 2024

Uh oh!

jithunnair-amd commented Nov 19, 2024

Uh oh!

jithunnair-amd commented Nov 19, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jithunnair-amd commented Nov 21, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jeffdaily commented Nov 21, 2024

Uh oh!

pytorchmergebot commented Nov 21, 2024

Merge started

Uh oh!

pytorchmergebot commented Nov 21, 2024

Merge failed

Uh oh!

jithunnair-amd commented Nov 22, 2024

Uh oh!

pytorch-bot bot commented Nov 22, 2024

Uh oh!

jithunnair-amd commented Nov 22, 2024

Uh oh!

pytorchmergebot commented Nov 22, 2024

Uh oh!

pytorchmergebot commented Nov 22, 2024

Uh oh!

jithunnair-amd commented Nov 22, 2024

Uh oh!

jithunnair-amd commented Nov 23, 2024

Uh oh!

jithunnair-amd commented Nov 23, 2024

Uh oh!

pytorchmergebot commented Nov 23, 2024

Merge started

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

jithunnair-amd commented Nov 15, 2024 •

edited

Loading

pytorch-bot bot commented Nov 15, 2024 •

edited

Loading

jithunnair-amd commented Nov 15, 2024 •

edited

Loading

jithunnair-amd commented Nov 19, 2024 •

edited

Loading

jithunnair-amd commented Nov 21, 2024 •

edited

Loading