
Conversation

@jithunnair-amd (Collaborator) commented Nov 15, 2024

Fixes the issue of long docker build times in PRs that trigger the docker build inside regular PyTorch build jobs, e.g. https://github.com/pytorch/pytorch/actions/runs/11751388838/job/32828886198. These docker builds take a long time for ROCm 6.2 because:

  1. They run on less capable machines (c5.2xlarge) instead of the beefier ones the docker-build workflows run on (c5.12xlarge); see the sketch after this list.
  2. ROCm 6.2 docker builds enable building MIOpen from source, which runs into the 90-minute timeout: https://github.com/pytorch/pytorch/actions/runs/11751388838/job/32828886198#step:7:160
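For context, here is a rough, illustrative GitHub Actions sketch of the configuration difference described above. The job names, runner labels, and step contents are assumptions for illustration only, not the actual PyTorch workflow files:

    # Illustrative sketch only, not the real docker-builds.yml / pull.yml
    name: docker-build-vs-fallback-sketch
    on: workflow_dispatch
    jobs:
      docker-build:                  # dedicated image build (docker-builds.yml role)
        runs-on: linux.12xlarge      # beefier runner (c5.12xlarge class)
        timeout-minutes: 240         # 4-hour budget, enough for MIOpen-from-source builds
        steps:
          - uses: actions/checkout@v4
          - name: Build and push docker image to ECR
            run: echo "image build + push goes here"   # placeholder for the real build script
      pytorch-build:                 # regular PyTorch build job (fallback path)
        runs-on: linux.2xlarge       # weaker runner (c5.2xlarge class)
        timeout-minutes: 90          # the 90-minute limit the ROCm 6.2 image build hit
        steps:
          - uses: actions/checkout@v4
          - name: Build (rebuilds the docker image first if the tag is missing)
            run: echo "calculate-docker-image + PyTorch build go here"   # placeholder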

cc @jeffdaily @sunway513 @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd

@pytorch-bot bot commented Nov 15, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/140851

Note: Links to docs will display an error until the docs builds have been completed.

❌ 22 New Failures, 7 Cancelled Jobs, 1 Unrelated Failure

As of commit 19b2e5e with merge base 80870f6:

NEW FAILURES - The following jobs have failed:

  • Build manywheel docker images for s390x / build-docker-cpu-s390x (gh)
    no space left on device
  • linux-binary-manywheel / manywheel-py3_9-cuda12_6-test / test (gh)
    RuntimeError: cuDNN version incompatibility: PyTorch was compiled against (9, 5, 1) but found runtime version (9, 1, 0). PyTorch already comes bundled with cuDNN. One option to resolving this error is to ensure PyTorch can find the bundled cuDNN. one possibility is that there is a conflicting cuDNN in LD_LIBRARY_PATH.
  • periodic / linux-focal-cuda11.8-py3.10-gcc9-debug / test (default, 2, 5, lf.linux.4xlarge.nvidia.gpu, oncall:debug-build) (gh)
    test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_linalg_lu_factor_cuda_float64
  • pull / cuda12.1-py3.10-gcc9-sm75 / build (gh)
    ##[error]Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
  • pull / linux-focal-cuda11.8-py3.10-gcc9 / build (gh)
    ##[error]Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
  • pull / linux-focal-cuda12.1-py3.10-gcc9 / build (gh)
    ##[error]Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
  • pull / linux-focal-cuda12.1-py3.10-gcc9-sm86 / build (gh)
    ##[error]Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
  • pull / linux-focal-py3_9-clang9-xla / build (gh)
    ##[error]Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
  • pull / linux-focal-py3.11-clang10 / build (gh)
    ##[error]Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
  • pull / linux-focal-py3.12-clang10 / build (gh)
    ##[error]Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
  • pull / linux-focal-py3.9-clang10 / build (gh)
    ##[error]Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
  • pull / linux-focal-py3.9-clang10-onnx / build (gh)
    ##[error]Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
  • pull / linux-focal-rocm6.2-py3.10 / build (gh)
    ##[error]Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
  • pull / linux-jammy-cuda11.8-cudnn9-py3.9-clang12 / build (gh)
    ##[error]Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
  • pull / linux-jammy-py3-clang12-executorch / build (gh)
    ##[error]Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
  • pull / linux-jammy-py3-clang12-mobile-build / build (gh)
    ##[error]Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
  • pull / linux-jammy-py3.10-clang15-asan / build (gh)
    ##[error]Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
  • pull / linux-jammy-py3.9-gcc11 / build (gh)
    ##[error]Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
  • pull / linux-jammy-py3.9-gcc11-mobile-lightweight-dispatch-build / build (gh)
    ##[error]Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
  • pull / linux-jammy-py3.9-gcc11-no-ops / build (gh)
    ##[error]Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
  • pull / linux-jammy-py3.9-gcc11-pch / build (gh)
    ##[error]Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/ec2-user/actions-runner/_work/pytorch/pytorch/.github/actions/upload-sccache-stats'. Did you forget to run actions/checkout before running your local action?
  • pull / win-vs2019-cpu-py3 / build (gh)
    sccache: error: couldn't connect to server

CANCELLED JOBS - The following jobs were cancelled. Please retry:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot added the ciflow/rocm, module: rocm, and topic: not user facing labels Nov 15, 2024
@jithunnair-amd (Collaborator, Author) commented Nov 15, 2024

@huydhn While this might fix the issue of timeouts for ROCm docker builds, I feel like we have a logical discrepancy in our workflows vis-a-vis docker builds:

  • docker-builds.yml specifies a timeout of 4 hours and runs on the beefier c5.12xlarge machines.
  • If a PR updates anything that requires a new docker image tag (one that doesn't already exist), it triggers the docker-builds workflow as well as the pull workflows.
  • However, the pull workflow jobs require the new docker image that the docker-builds workflow would generate and push to the ECR docker registry.
  • Since both are kicked off at the same time, the pull workflow decides to build the docker image itself (because the image isn't available yet) and runs that build on a weaker c5.2xlarge machine with a shorter 90-minute timeout.
  • This obviously doesn't go well for docker build jobs that need more time.

Is there a way to make the PyTorch build job depend on the docker-build job to finish?
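For what it's worth, plain GitHub Actions needs: dependencies only order jobs within a single workflow file, so ordering across docker-builds.yml and pull.yml would need something like a reusable workflow or a workflow_run trigger instead. A minimal sketch of the single-workflow version, with illustrative job names and runner labels:

    # Sketch, assuming both jobs live in one workflow: needs: makes the build wait for the image.
    jobs:
      docker-image:
        runs-on: linux.12xlarge
        timeout-minutes: 240
        outputs:
          image: ${{ steps.calc.outputs.docker-image }}
        steps:
          - id: calc
            run: echo "docker-image=<ecr-uri>:<tag>" >> "$GITHUB_OUTPUT"   # publish the tag it pushed
      pytorch-build:
        needs: docker-image            # blocks until docker-image has finished successfully
        runs-on: linux.2xlarge
        steps:
          - run: docker pull "${{ needs.docker-image.outputs.image }}"     # image exists by now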

@huydhn (Contributor) commented Nov 15, 2024

Is there a way to make the PyTorch build job depend on the docker-build job to finish?

The way I usually do it is to ignore the build jobs at first and just let the docker build job finish. Once it's done, the new image is available on ECR. Then I rerun the build jobs. Because the new image is now available, they won't rebuild it and should pull from ECR instead. Let me know if you see a different behavior here.
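Roughly, the decision the fallback path makes is the one sketched below; the variable names and build command are placeholders, not the exact CI code:

    # Sketch of the "pull if the tag already exists in ECR, otherwise rebuild" decision
    - name: Pull or build CI docker image
      shell: bash
      run: |
        if docker manifest inspect "${ECR_DOCKER_IMAGE}" > /dev/null 2>&1; then
          docker pull "${ECR_DOCKER_IMAGE}"                 # docker-builds.yml already pushed it
        else
          # slow path: rebuild locally on the weaker runner, subject to the 90-minute timeout
          docker build -t "${ECR_DOCKER_IMAGE}" -f "${DOCKERFILE}" .
        fi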

@jithunnair-amd (Collaborator, Author) commented:

Is there a way to make the PyTorch build job depend on the docker-build job to finish?

The way I usually do it is to ignore the build jobs at first and just let the docker build job finish. Once it's done, the new image is available on ECR. Then I rerun the build jobs. Because the new image is now available, they won't rebuild it and should pull from ECR instead. Let me know if you see a different behavior here.

Yes, I realized I could have done that here too, but I decided to take the opportunity to improve the ROCm docker build times anyway. It's just that it's a manual step, the reasoning for which might not be obvious to most devs.

@jithunnair-amd (Collaborator, Author) commented:

Only shard 6 of 6 failed in the rocm workflow, and only for this test: test_linalg.py::TestLinalgCUDA::test_matmul_small_brute_force_tunableop_cuda_float16. The same failure is seen in 6.2.0 runs as well, so it's not a 6.2.4-specific issue.

@jithunnair-amd added the ciflow/periodic, ciflow/inductor, and ciflow/inductor-rocm labels and removed the ciflow/inductor label Nov 19, 2024
@jithunnair-amd marked this pull request as ready for review November 19, 2024 21:39
@jithunnair-amd (Collaborator, Author) commented Nov 19, 2024

@jeffdaily

Can you please approve and force merge the PR? Please hold on, I need to move the manylinux images to build with 6.2.4 as well.

@jithunnair-amd marked this pull request as draft November 19, 2024 22:58
@jithunnair-amd marked this pull request as ready for review November 21, 2024 02:11
@jithunnair-amd requested a review from a team as a code owner November 21, 2024 02:11
@jithunnair-amd (Collaborator, Author) commented Nov 21, 2024

@jeffdaily

Can you please approve and force merge the PR? Please hold on, I need to move the manylinux images to build with 6.2.4 as well.

Updated the manylinux images to 6.2.4: https://github.com/pytorch/pytorch/actions/runs/11923309568/job/33289567551. @atalman @jeffdaily please review and merge

@jeffdaily (Collaborator) commented:

@pytorchbot merge

@pytorch-bot added the ciflow/trunk label Nov 21, 2024
@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@jithunnair-amd deleted the jnair/upgrade_to_rocm_6.2.4 branch November 21, 2024 21:15
@jithunnair-amd changed the title from "[ROCm][CI] upgrade CI to ROCm 6.2.4" to "[ROCm][CI] upgrade CI and manywheel docker images to ROCm 6.2.4" Nov 21, 2024
@jithunnair-amd (Collaborator, Author) commented:

@pytorchbot revert -m "Need to upgrade libtorch images to ROCm 6.2.4 as well"

@pytorch-bot bot commented Nov 22, 2024

❌ 🤖 pytorchbot command failed:

@pytorchbot revert: error: the following arguments are required: -c/--classification

usage: @pytorchbot revert -m MESSAGE -c
                          {nosignal,ignoredsignal,landrace,weird,ghfirst}

Try @pytorchbot --help for more info.

@jithunnair-amd (Collaborator, Author) commented:

@pytorchbot revert -m "Need to upgrade libtorch images to ROCm 6.2.4 as well" -c ghfirst

@pytorchmergebot (Collaborator) commented:

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot (Collaborator) commented:

@jithunnair-amd your PR has been successfully reverted.

pytorchmergebot added a commit that referenced this pull request Nov 22, 2024
@pytorchmergebot added the Reverted and ci-no-td labels Nov 22, 2024
@jithunnair-amd (Collaborator, Author) commented:

@huydhn Need ECR tag libtorch-cxx11-builder-rocm6.2.4 to be created please

@jithunnair-amd (Collaborator, Author) commented:

@huydhn Need ECR tag libtorch-cxx11-builder-rocm6.2.4 to be created please

Libtorch ECR tag creation is having some issues. But I also see that the libtorch ECR images aren't really used anywhere currently: https://github.com/search?q=repo%3Apytorch%2Fpytorch%20libtorch-cxx11-builder-rocm&type=code

And the libtorch nightly build jobs use the docker image from Docker Hub:

DOCKER_IMAGE: pytorch/manylinux-builder:rocm6.2-main

Hence, in my understanding, merging this PR without the ECR tag being ready shouldn't break anything except the libtorch docker build jobs in PRs.

@jithunnair-amd (Collaborator, Author) commented:

@pytorchbot merge -f "CI failures unrelated to ROCm, except libtorch docker build job, for which explanation is provided in above comment"

@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024

Labels

ci-no-td, ciflow/inductor-rocm, ciflow/periodic, ciflow/rocm, ciflow/trunk, Merged, module: rocm, open source, Reverted, topic: not user facing


5 participants