Conversation

@jithunnair-amd
Collaborator

@jithunnair-amd jithunnair-amd commented Apr 15, 2025

@pytorch-bot

pytorch-bot bot commented Apr 15, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/151355

Note: Links to docs will display an error until the docs builds have been completed.

❌ 7 New Failures

As of commit 9452804 with merge base daf2ccf:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot added the ciflow/rocm (Trigger "default" config CI on ROCm), module: rocm (AMD GPU support for Pytorch), and topic: not user facing (topic category) labels Apr 15, 2025
@jithunnair-amd
Collaborator Author

jithunnair-amd commented Apr 16, 2025

@huydhn Is this something you can help with?
https://github.com/pytorch/pytorch/actions/runs/14479904535/job/40614420540

denied: User: arn:aws:sts::391835788720:assumed-role/ghci-lf-github-action-runners-runner-role/i-05b0cda8626afb94c is not authorized to perform: ecr:InitiateLayerUpload on resource: arn:aws:ecr:us-east-1:308535385114:repository/pytorch/manylinux2_28-builder-rocm6.4 because no resource-based policy allows the ecr:InitiateLayerUpload action

The only difference between the previous passing build and the current failing build seems to be that the former used non-LF CI runners while the latter uses LF CI runners (lf.ephemeral.linux.9xlarge.ephemeral).
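
One way to read the denial above is to compare the account IDs in the two ARNs it names. The snippet below is only a sketch: `parse_arn` is a hypothetical helper written for this illustration, not part of any AWS SDK, and it assumes the standard `arn:partition:service:region:account-id:resource` layout. If the caller and the repository sit in different accounts, a push would typically need a repository (resource-based) policy that allows it, which would be consistent with the error text.

```python
def parse_arn(arn: str) -> dict:
    """Split an ARN into its fields (hypothetical helper, for illustration only)."""
    # Assumed layout: arn:partition:service:region:account-id:resource
    prefix, partition, service, region, account, resource = arn.split(":", 5)
    if prefix != "arn":
        raise ValueError(f"not an ARN: {arn!r}")
    return {
        "partition": partition,
        "service": service,
        "region": region,      # empty for global services such as STS
        "account": account,
        "resource": resource,
    }


# Both ARNs are copied from the error message quoted above.
caller = parse_arn(
    "arn:aws:sts::391835788720:assumed-role/"
    "ghci-lf-github-action-runners-runner-role/i-05b0cda8626afb94c"
)
repo = parse_arn(
    "arn:aws:ecr:us-east-1:308535385114:"
    "repository/pytorch/manylinux2_28-builder-rocm6.4"
)

# True when the runner role and the ECR repository belong to different accounts.
cross_account = caller["account"] != repo["account"]
print("caller account:", caller["account"])
print("repo account:  ", repo["account"])
print("cross-account push:", cross_account)
```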

@jithunnair-amd
Collaborator Author

@pytorchbot rebase -b viable/strict

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased rocm64_nightly onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout rocm64_nightly && git pull --rebase)

@malfet
Contributor

malfet commented Apr 16, 2025

A bit unrelated: why are we using new ECRs here instead of tags? We stopped creating new ECRs and just use tags for different builds.

But the current failures are because LF runners do not have push permissions to new ECRs.

@malfet added the no-runner-experiments (Bypass Meta/LF runner determinator) label Apr 16, 2025
@jeffdaily jeffdaily marked this pull request as ready for review April 16, 2025 18:05
@jeffdaily jeffdaily requested a review from a team as a code owner April 16, 2025 18:05
@jeffdaily
Collaborator

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased rocm64_nightly onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout rocm64_nightly && git pull --rebase)

@jithunnair-amd added the ciflow/binaries (Trigger all binary build and upload jobs on the PR) label Apr 16, 2025
@jithunnair-amd
Collaborator Author

jithunnair-amd commented Apr 16, 2025

> A bit unrelated: why are we using new ECRs here instead of tags? We stopped creating new ECRs and just use tags for different builds

@malfet Yes, we'd like to move away from creating new ECRs and just use tags. But I think that requires changing the naming convention for the docker repo to be ROCm-version agnostic, with the ROCm version as part of the tag instead. The image name currently embeds the version:

docker-image-name: manylinux2_28-builder-rocm${{matrix.rocm_version}}

https://github.com/pytorch/pytorch/actions/runs/14502430577/job/40685004248#step:4:154
So we would need to have:
308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/manylinux2_28-builder-rocm:6.4-f8555c14c97c7831a7f9e6eb8220b15ecbc8cb40
OR
308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/manylinux2_28-builder:rocm6.4-f8555c14c97c7831a7f9e6eb8220b15ecbc8cb40
instead of
308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/manylinux2_28-builder-rocm6.4:f8555c14c97c7831a7f9e6eb8220b15ecbc8cb40
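
To make the three layouts concrete, here is a minimal sketch that builds each image reference from the same inputs. The function name `image_ref` and the scheme labels are made up for this illustration and do not exist in the PyTorch CI scripts; only the resulting strings come from the comment above.

```python
REGISTRY = "308535385114.dkr.ecr.us-east-1.amazonaws.com"


def image_ref(scheme: str, rocm_version: str, commit: str) -> str:
    """Build an image reference under one of three naming schemes.

    Scheme labels are hypothetical names used only in this sketch.
    """
    if scheme == "repo-per-version":
        # Current layout: ROCm version in the repository name (one ECR per version).
        return f"{REGISTRY}/pytorch/manylinux2_28-builder-rocm{rocm_version}:{commit}"
    if scheme == "version-in-tag":
        # Option 1: one ROCm repository, version moved into the tag.
        return f"{REGISTRY}/pytorch/manylinux2_28-builder-rocm:{rocm_version}-{commit}"
    if scheme == "rocm-in-tag":
        # Option 2: one generic builder repository, "rocm<version>" in the tag.
        return f"{REGISTRY}/pytorch/manylinux2_28-builder:rocm{rocm_version}-{commit}"
    raise ValueError(f"unknown scheme: {scheme!r}")


commit = "f8555c14c97c7831a7f9e6eb8220b15ecbc8cb40"
for scheme in ("repo-per-version", "version-in-tag", "rocm-in-tag"):
    print(f"{scheme:>16}: {image_ref(scheme, '6.4', commit)}")
```

With either tag-based option, a new ROCm release only adds a tag to an existing repository, so no new ECR (and no new push permission) is needed.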

@jeffdaily
Collaborator

@pytorchbot merge -f "only failures are due to rocm 6.4 builder images not refreshed w/ magma package and available for the 6.4 wheels; images built fine, should work out okay"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here
