Migrate conda, manywheel and libtorch docker builds to pytorch/pytorch #129022
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/129022
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure as of commit 608e533 with merge base 637ab85. The following job has failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Force-pushed 3557afe to b80d939
Force-pushed b80d939 to f202745
Such a PR could be landed only if it references a PR against the builder repo that deletes the same files this one is migrating. Correct me if I'm wrong, but in its current form that's impossible, as all the common files are used by both the conda and manywheel Docker builds. If those two workflows are so intertwined, let's migrate all of them in one go.
Nits:
- `.ci/docker` sounds like the wrong location for those files, as they are used for deployment rather than testing, so please consider a different location.
- I don't see how this change translates into the workflow that @atalman described, where one can test docker changes before merging the PR (I assume it would require some changes to the binary builds workflow, wouldn't it?)
Sounds good. Let me bring everything in.
Perhaps
Yes, totally, a second change will be required where we actually start using images from ECR rather than Docker Hub. This is only the first part, where we move the workflows and make the Docker builds upload to both places, ECR and Docker Hub.
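A minimal sketch of what that dual upload could look like; the registry host, image name, and tag below are placeholders for illustration, not the actual values used by the workflow:

```bash
#!/bin/bash
# Hypothetical sketch: push the same locally built image to both ECR (for PR testing)
# and Docker Hub (for nightly consumption). Registry host and image name are placeholders.
set -euo pipefail

ECR_REGISTRY="<aws-account-id>.dkr.ecr.us-east-1.amazonaws.com"  # placeholder
IMAGE_NAME="pytorch/conda-builder"
TAG="${GITHUB_SHA:-local}"   # tag with the commit hash when running in CI

# Tag the built image for both destinations
docker tag "${IMAGE_NAME}:${TAG}" "${ECR_REGISTRY}/${IMAGE_NAME}:${TAG}"
docker tag "${IMAGE_NAME}:${TAG}" "docker.io/${IMAGE_NAME}:${TAG}"

# Push to both registries
docker push "${ECR_REGISTRY}/${IMAGE_NAME}:${TAG}"
docker push "docker.io/${IMAGE_NAME}:${TAG}"
```

Presumably the Docker Hub push stays gated behind WITH_PUSH while the ECR push runs on PRs, per the PR description further below.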
@@ -1,26 +1,30 @@
#!/bin/bash
set -xe
Hi @chuanqi129, could you please review the changes in this file?
Thanks @atalman LGTM
@@ -0,0 +1,15 @@
#!/bin/bash
@atalman Any plans to sync this up with the MKL version being installed via install_conda.sh? There is a version difference (2021.4.0 for install_conda.sh vs 2024.2.0 for this file), a dynamic (install_conda.sh) vs static library (this file) difference, as well as an install location difference (/opt/conda/envs/py_$ANACONDA_PYTHON_VERSION for install_conda.sh vs /opt/intel for this file).
Hi @jithunnair-amd, yes, totally, we should try to move to 2024.2.0. A conda install will not work since we are not using conda-forge; however, we should install these packages from pip.
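For illustration, a pip-based install along those lines might look like the sketch below; the package names, version pin, and copy destinations are assumptions, not the final install_mkl.sh:

```bash
#!/bin/bash
# Hypothetical sketch of installing static MKL 2024.2.0 from PyPI instead of conda-forge.
# Package names and the /opt/intel layout are assumptions for illustration.
set -ex

MKL_VERSION=2024.2.0

mkdir -p /opt/intel/lib
# Intel publishes static libraries and headers as pip packages
pip install -q mkl-static=="${MKL_VERSION}" mkl-include=="${MKL_VERSION}"

# Mirror the installed libraries/headers into the MKLROOT layout expected downstream
PYTHON_PREFIX=$(python -c "import sys; print(sys.prefix)")
cp "${PYTHON_PREFIX}"/lib/libmkl*.a /opt/intel/lib/
cp -r "${PYTHON_PREFIX}"/include /opt/intel/
```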
.ci/docker/common/install_mkl.sh
MKL_VERSION=2024.2.0

mkdir -p /opt/intel/
MKLROOT=${MKLROOT:-/opt/intel}
@atalman IIUC, this script could end up using a different value of $MKLROOT (if env var already exists) as compared to .ci/docker/common/install_rocm_magma.sh, which will get it as /opt/intel from the ENV directive here. If that happens, the ROCm magma build would fail. Either we should hardcode MKLROOT to /opt/intel here, or set ARG MKLROOT /opt/intel as in my PR, since ARGs are visible in different stages of a multi-stage Dockerfile, but ENVs are limited to the specific stage where they are specified?
Sounds good, let me use '/opt/intel' here. I don't think you are calling this file from the ROCm builds, as far as I know.
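A short sketch of the hardcoded variant being agreed on here (assumed shape, not the committed script):

```bash
#!/bin/bash
# Hypothetical sketch: pin MKLROOT to /opt/intel instead of falling back on an
# inherited environment variable, so every consumer of this script sees the same path.
set -ex

MKLROOT=/opt/intel   # hardcoded; no ${MKLROOT:-...} fallback
mkdir -p "${MKLROOT}"
```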
* Add build-conda-images.yml in pytorch/pytorch
* Add build_docker.sh to run .github/workflows/build-conda-images.yml
* Create build-conda-images.yml
* Delete .github/build-conda-images.yml
* Add build-conda-images.yml in pytorch/pytorch (#128563) (#128700): small adjustment for docker build, build_docker, add docker file, add common
* Add build-conda-images.yml in pytorch/pytorch (#128563) (#128962): add jni.h path, fix path
* Update script path moving conda builds from builder to pytorch (#129167)
* Cleanup build docker images (#129273): cleanup docker move, add docker file
* Add calculate-docker-image when WITH_PUSH is true
* Numerous "test" and "Update build-conda-images.yml" / "Update build_docker_conda.sh" iteration commits
* keep_both_tag, test_conda_builds
* lint, common files, libtorch, sudo, rocm fix, use one file for install_rocm_magma, xpu, rocm libtorch, Manywheel builds, fix_manywheel
Force-pushed 10aa073 to 6264e93
runs-on: linux.4xlarge
steps:
  - name: Checkout PyTorch
    uses: pytorch/pytorch/.github/actions/checkout-pytorch@move_docker_conda
Temp code. Will update to main before merging.
Hmm, why do you need to specify branch here to begin with? Shouldn't one just be using scripts from the current branch?
malfet left a comment
LGTM, but a few comments:
- Mixing CI and CD scripts in the same `common` folder feels a bit confusing; at least worth mentioning in the review.
- Please do not use the `pytorch/pytorch/.github/actions/...@branch` pattern, but rather the relative `./.github/actions/...`.
- Hardcoding `/home/ec2-user/actions-runner/_work` feels wrong, and this line is not present in the builder repo, so why is this needed?
- Last but not least, according to https://github.com/pytorch/pytorch/blob/main/SECURITY.md#release-pipelines-security all release-related activities should be done on ephemeral runners (to prevent reverse SSH attacks), but `linux.12xlarge` is non-ephemeral, isn't it?
GPU_ARCH_VERSION: ${{ matrix.cuda_version }}
steps:
  - name: Checkout PyTorch
    uses: pytorch/pytorch/.github/actions/checkout-pytorch@move_docker_conda
Why specify branch if it could be done from the same repo?
Suggested change:
- uses: pytorch/pytorch/.github/actions/checkout-pytorch@move_docker_conda
+ uses: ./.github/actions/checkout-pytorch
Looks like this is what we use everywhere:
git grep "checkout-pytorch"
.github/actions/teardown-win/action.yml: # retry this step several time similar to how checkout-pytorch GHA does
.github/workflows/_android-build-test.yml: uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
.github/workflows/_android-build-test.yml: uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
.github/workflows/_android-full-build-test.yml: uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
.github/workflows/_android-full-build-test.yml: uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
.github/workflows/_bazel-build-test.yml: uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
.github/workflows/_bazel-build-test.yml: uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
.github/workflows/_binary-build-linux.yml: uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
.github/workflows/_binary-test-linux.yml: uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
.github/workflows/_binary-upload.yml: uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
.github/workflows/_buck-build-test.yml: uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
.github/workflows/_buck-build-test.yml: uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
.github/workflows/_docs.yml: uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
etc.....
@pytorchmergebot merge -f "docker builds and lint are green"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Summary: install_cuda.sh has been moved from pytorch/builder to pytorch/pytorch: pytorch/builder#1942, pytorch/pytorch#129022. We need to make sure that sm_90a is included:
```
xargs -I '{}' bash -c 'echo {} && /usr/local/cuda-12.4/bin/nvprune -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a /usr/local/cuda-12.4/lib64/{} -o /usr/local/cuda-12.4/lib64/{}'
```
Pull Request resolved: #2397
Test Plan: https://github.com/pytorch/benchmark/actions/runs/10163701096
Reviewed By: atalman
Differential Revision: D60454551
Pulled By: xuzhao9
fbshipit-source-id: 03322314b190e0fb95e7d7741152de65ab35ecd4
Migration of Docker conda builds to pytorch/pytorch from pytorch/builder: https://github.com/pytorch/builder/blob/main/.github/workflows/build-conda-images.yml
Related to: pytorch/builder#1849
Migrates scripts and workflows, and adds logic to execute on PR and upload to ECR with a GitHub hash tag, in order to test the Docker build and the nightly on the PR.
Test when executing on PR, uploading to ECR:
https://github.com/pytorch/pytorch/actions/runs/9799439218/job/27059691327
Test with WITH_PUSH, uploading to Docker Hub:
https://github.com/pytorch/pytorch/actions/runs/9799783407/job/27060633427
Will upload here: https://hub.docker.com/r/pytorch/conda-builder/
Test using the ECR image in the nightly workflow:
https://github.com/pytorch/pytorch/actions/runs/9798428933/job/27057835235#step:16:87
Note: This is the first part, which builds the Docker image and uploads it to either Docker Hub or ECR. After merging, a follow-up PR will need to change the conda nightly workflows to use either the ECR image or the Docker Hub image, depending on whether we are running on a PR or from the main/release branch.
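As a rough illustration of the follow-up selection described in the note above, something like the sketch below could choose between the ECR and Docker Hub images; the registry host and image name are placeholders, not the actual workflow values:

```bash
#!/bin/bash
# Hypothetical sketch: pick the builder image for the nightly workflow depending on
# whether we are on a pull request (ECR image tagged with the commit SHA) or on a
# main/release branch (published Docker Hub image). Registry host is a placeholder.
set -euo pipefail

ECR_REGISTRY="<aws-account-id>.dkr.ecr.us-east-1.amazonaws.com"  # placeholder
IMAGE_NAME="pytorch/conda-builder"

if [[ "${GITHUB_EVENT_NAME:-}" == "pull_request" ]]; then
  DOCKER_IMAGE="${ECR_REGISTRY}/${IMAGE_NAME}:${GITHUB_SHA}"
else
  DOCKER_IMAGE="docker.io/${IMAGE_NAME}:latest"
fi

echo "Using builder image: ${DOCKER_IMAGE}"
docker pull "${DOCKER_IMAGE}"
```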
Cleanup of workflows and scripts from builder repo: pytorch/builder#1923