@jithunnair-amd
Collaborator

@jithunnair-amd jithunnair-amd commented Nov 14, 2024

Fixes #140631

Highlights:

  • Use cpu_final base for ROCm in .ci/docker/manywheel/Dockerfile_2_28
  • Cleans up install_miopen.sh to remove old ROCm references
  • Install gcc-gfortran package to build magma for ROCm on almalinux

Needs builder PR pytorch/builder#2043 (merged) so that GCC_ABI expected value is updated.

cc @jeffdaily @sunway513 @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd
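
As a rough illustration of what the manylinux2_28 tag implies for the linked issue ("GLIBC version not found"): wheels carrying that tag expect glibc 2.28 or newer on the host. The sketch below is not part of this PR; the helper names `parse_version` and `satisfies_manylinux` are hypothetical and only show the version comparison.

```python
import platform


def parse_version(text):
    """Turn a dotted version string such as '2.28' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def satisfies_manylinux(glibc_version, required="2.28"):
    """Return True if glibc_version is at least the required version.

    The default of 2.28 corresponds to the manylinux2_28 tag.
    """
    return parse_version(glibc_version) >= parse_version(required)


if __name__ == "__main__":
    # platform.libc_ver() may return empty strings on non-glibc systems.
    lib, version = platform.libc_ver()
    if lib == "glibc" and version:
        print(version, satisfies_manylinux(version))
    else:
        print("glibc version could not be determined")
```

Running the script on a Linux host prints the detected glibc version and whether it meets the 2.28 floor.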

@pytorch-bot

pytorch-bot bot commented Nov 14, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/140681

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 6642fb5 with merge base cb8c956:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the labels ciflow/rocm (Trigger "default" config CI on ROCm), module: rocm (AMD GPU support for Pytorch), and release notes: releng (release notes category) Nov 14, 2024
@pruthvistony pruthvistony self-requested a review November 18, 2024 16:35
@jithunnair-amd
Collaborator Author

jithunnair-amd commented Nov 19, 2024

@huydhn @atalman Got this error for the rocm6.1 docker build job:

2024-11-19T01:39:33.9350341Z denied: User: arn:aws:sts::391835788720:assumed-role/ghci-lf-github-action-runners-runner-role/i-015cfa06d20e93248 is not authorized to perform: ecr:InitiateLayerUpload on resource: arn:aws:ecr:us-east-1:308535385114:repository/pytorch/manylinux2_28-builder-rocm6.1 because no resource-based policy allows the ecr:InitiateLayerUpload action

@huydhn
Contributor

huydhn commented Nov 19, 2024

I have created the missing ECR record arn:aws:ecr:us-east-1:308535385114:repository/pytorch/manylinux2_28-builder-rocm6.1, let's retry to see if it works now

@jithunnair-amd
Collaborator Author

I have created the missing ECR record arn:aws:ecr:us-east-1:308535385114:repository/pytorch/manylinux2_28-builder-rocm6.1, let's retry to see if it works now

Thanks, can you please do it for rocm6.2 as well? That will also need an ECR record.

@jithunnair-amd
Collaborator Author

I have created the missing ECR record arn:aws:ecr:us-east-1:308535385114:repository/pytorch/manylinux2_28-builder-rocm6.1, let's retry to see if it works now

Thanks, can you please do it for rocm6.2 as well? That will also need an ECR record.

@huydhn Also, I suppose this means we will need to request you to create a new ECR record for every ROCm upgrade?

@huydhn huydhn added the no-runner-experiments (Bypass Meta/LF runner determinator) label Nov 19, 2024
@huydhn
Contributor

huydhn commented Nov 19, 2024

@huydhn Also, I suppose this means we will need to request you to create a new ECR record for every ROCm upgrade?

That's the current process.

Also, due to #140958 earlier today, I haven't been able to deploy the ECR change yet. I will need to check with @jeanschmidt when we can resume infra deployment.

We could use no-runner-experiments to make this work for now I think.
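
Since a new ECR record is needed for each ROCm version, a small helper can list which repository names to request before an upgrade. This is only a sketch: the naming pattern is inferred from the repository names quoted in this thread, and `ecr_repository_name` and `missing_repositories` are hypothetical helpers, not part of any PyTorch tooling.

```python
def ecr_repository_name(rocm_version, prefix="pytorch/manylinux2_28-builder-rocm"):
    """Build the repository name for a ROCm version, e.g. '6.1'.

    The prefix is an assumption taken from the names seen in this thread.
    """
    return f"{prefix}{rocm_version}"


def missing_repositories(rocm_versions, existing):
    """Return the repository names that are not yet in `existing`."""
    wanted = [ecr_repository_name(version) for version in rocm_versions]
    return [name for name in wanted if name not in existing]


if __name__ == "__main__":
    # Assume only the rocm6.1 repository has been created so far.
    already_created = {"pytorch/manylinux2_28-builder-rocm6.1"}
    for name in missing_repositories(["6.1", "6.2"], already_created):
        print("needs ECR record:", name)
```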

@jithunnair-amd
Collaborator Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased jnair/rocm_manylinux2_28_upgrade onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout jnair/rocm_manylinux2_28_upgrade && git pull --rebase)

@pytorchmergebot pytorchmergebot force-pushed the jnair/rocm_manylinux2_28_upgrade branch from b0b44ca to 1a87ee6 Compare November 19, 2024 22:03
@jithunnair-amd jithunnair-amd added the ciflow/binaries_wheel (Trigger binary build and upload jobs for wheel on the PR) label Nov 20, 2024
@jithunnair-amd jithunnair-amd force-pushed the jnair/rocm_manylinux2_28_upgrade branch from 1a87ee6 to 3dd5cb0 Compare November 21, 2024 21:17
@jithunnair-amd
Collaborator Author

jithunnair-amd commented Nov 21, 2024

@huydhn @atalman Since we upgraded to ROCm6.2.4, please help create ECR tag manylinux2_28-builder-rocm6.2.4.

I just created it and am starting the infra deployment now, so give it about 15 minutes to finish.

@huydhn huydhn removed the no-runner-experiments (Bypass Meta/LF runner determinator) label Nov 21, 2024
@jithunnair-amd jithunnair-amd force-pushed the jnair/rocm_manylinux2_28_upgrade branch from 3dd5cb0 to 2a4560f Compare November 23, 2024 04:15
@jithunnair-amd jithunnair-amd changed the title Upgrade ROCm wheels to manylinux2_28 Upgrade ROCm wheels to manylinux2_28 - 1 of 2 (docker images) Nov 23, 2024
@jithunnair-amd
Collaborator Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot pytorchmergebot force-pushed the jnair/rocm_manylinux2_28_upgrade branch from 2a4560f to cb09ef8 Compare November 25, 2024 01:58
@jithunnair-amd jithunnair-amd removed the ciflow/binaries_wheel (Trigger binary build and upload jobs for wheel on the PR) label Nov 25, 2024
@jithunnair-amd jithunnair-amd force-pushed the jnair/rocm_manylinux2_28_upgrade branch from 434d09a to 6642fb5 Compare November 25, 2024 22:34
@pytorch-bot pytorch-bot bot added the topic: not user facing (topic category) label Nov 25, 2024
@jithunnair-amd jithunnair-amd marked this pull request as ready for review November 25, 2024 23:54
@jithunnair-amd jithunnair-amd requested a review from a team as a code owner November 25, 2024 23:54
@jithunnair-amd
Collaborator Author

@pytorchbot merge -f "Manylinux/libtorch/CI docker image builds for ROCm completed with build duration 40-45min; other CI failures unrelated"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort and instead consider -i/--ignore-current to continue the merge while ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@jithunnair-amd jithunnair-amd deleted the jnair/rocm_manylinux2_28_upgrade branch November 26, 2024 00:53
@jithunnair-amd jithunnair-amd changed the title Upgrade ROCm wheels to manylinux2_28 - 1 of 2 (docker images) Upgrade ROCm wheels to manylinux2_28 - 1a of 2 (docker images) Nov 28, 2024
pytorchmergebot pushed a commit that referenced this pull request Nov 28, 2024
…er images) (#141609)

Upgrade gcc version from 9 to 11 on ROCm manylinux images.

Needed for #141423 because the almalinux8-based manylinux2_28 images for ROCm (#140681) install gcc-toolset-9, which provides [gcc 9.2.1](https://pkgs.org/download/gcc-toolset-9-gcc-c++). However, the PyTorch CMakeLists.txt enforces a [minimum gcc version of 9.3](https://github.com/pytorch/pytorch/blob/5318bf8baf19fecda365c185cd81196e3cfb08e3/CMakeLists.txt#L61).

Pull Request resolved: #141609
Approved by: https://github.com/jeffdaily

Co-authored-by: Jithun Nair <[email protected]>
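
The version mismatch described in the commit message above can be sketched as a simple comparison: gcc 9.2.1 falls below the 9.3 minimum, while a gcc 11 release clears it. This is only an illustration, not how CMake performs its check; `parse_version` and `meets_minimum` are hypothetical helpers, and "11.2.1" is an example value for a gcc 11 release.

```python
def parse_version(text):
    """Turn a dotted version string such as '9.2.1' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def meets_minimum(found, minimum="9.3"):
    """Return True if `found` is at least `minimum`.

    Both versions are padded with zeros to the same length so that
    '9.3' and '9.3.0' compare as equal.
    """
    found_parts = parse_version(found)
    minimum_parts = parse_version(minimum)
    width = max(len(found_parts), len(minimum_parts))
    found_parts += (0,) * (width - len(found_parts))
    minimum_parts += (0,) * (width - len(minimum_parts))
    return found_parts >= minimum_parts


if __name__ == "__main__":
    for version in ("9.2.1", "11.2.1"):
        status = "ok" if meets_minimum(version) else "too old"
        print(f"gcc {version}: {status}")
```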
Ryo-not-rio pushed a commit to Ryo-not-rio/pytorch that referenced this pull request Dec 2, 2024
…h#140681)

Fixes pytorch#140631

Highlights:
* Use `cpu_final` base for ROCm in `.ci/docker/manywheel/Dockerfile_2_28`
* Cleans up install_miopen.sh to remove old ROCm references
* Install `gcc-gfortran` package to build magma for ROCm on almalinux

Needs builder PR pytorch/builder#2043 (merged) so that GCC_ABI expected value is updated.

Pull Request resolved: pytorch#140681
Approved by: https://github.com/jeffdaily
pytorchmergebot pushed a commit that referenced this pull request Dec 4, 2024
Depends on #140681 and #141609

Highlights:
* Upgrade binaries to ROCm6.2.4 to use latest docker images
* Remove pre-cxx11 builds for libtorch on ROCm
* Use manylinux2_28 docker images for ROCm
* Set `DESIRED_DEVTOOLSET=cxx-abi` (and hence `_GLIBCXX_USE_CXX11_ABI=1`) for ROCm manylinux2_28 wheels (ROCm RHEL8 packages also have GCC_ABI=1, so it keeps it consistent)

Pull Request resolved: #141423
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <[email protected]>
Co-authored-by: Pruthvi Madugundu <[email protected]>
pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024
…h#140681)

pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024
pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024
)

AmdSampsa pushed a commit to AmdSampsa/pytorch that referenced this pull request Dec 9, 2024
)


Labels

ciflow/rocm (Trigger "default" config CI on ROCm), Merged, module: rocm (AMD GPU support for Pytorch), open source, release notes: releng (release notes category), topic: not user facing (topic category)

Development

Successfully merging this pull request may close these issues.

[DomainsOnly] Jobs fail with GLIBC version not found

5 participants