Skip to content

Conversation

@juliagmt-google
Copy link
Collaborator

@juliagmt-google juliagmt-google commented Jun 19, 2024

Migration of Docker conda builds to pytorch/pytorch from pytorch/builder: https://github.com/pytorch/builder/blob/main/.github/workflows/build-conda-images.yml

Related to: pytorch/builder#1849

Migrate scripts and worklfows, adds logic to execute on PR and upload to ecr with github hash tag in order to test Docker build and nightly on PR.

Test when executing on PR, upload to ecr:
https://github.com/pytorch/pytorch/actions/runs/9799439218/job/27059691327

308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/conda-builder-cpu:789cf8fcd738088860056160f6e9ea7cd005972b

Test With-Push, upload to dockerhub:
https://github.com/pytorch/pytorch/actions/runs/9799783407/job/27060633427

docker.io/pytorch/conda-builder:cpu done

Will upload here: https://hub.docker.com/r/pytorch/conda-builder/

Test using ecr image in the nightly workflow:
https://github.com/pytorch/pytorch/actions/runs/9798428933/job/27057835235#step:16:87

Note: This is first part that will build docker and upload it to either dockerhub or ecr. After merging followup PR will need to change conda nightly workflows to either use ecr image or dockerhub image, depending if we are running it on PR or from main/release branch.

Cleanup of workflows and scripts from builder repo: pytorch/builder#1923

@juliagmt-google juliagmt-google requested review from a team and jeffdaily as code owners June 19, 2024 01:15
@pytorch-bot
Copy link

pytorch-bot bot commented Jun 19, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/129022

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 608e533 with merge base 637ab85 (image):

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the release notes: releng release notes category label Jun 19, 2024
@zou3519 zou3519 added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Jun 21, 2024
@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Jun 21, 2024
@atalman atalman force-pushed the move_docker_conda branch 3 times, most recently from 3557afe to b80d939 Compare July 4, 2024 21:13
@atalman atalman changed the title Update script path in .github/workflows/build-conda-images.yml Migrate conda docker builds to pytorch/pytorch Jul 4, 2024
@atalman atalman force-pushed the move_docker_conda branch from b80d939 to f202745 Compare July 4, 2024 21:50
Copy link
Contributor

@malfet malfet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Such PR could be landed only if it references PR against the builder that deletes the same files this one is migrating. Correct me if I'm wrong, but in its current form its impossible, as all the common files are used by both conda and manywheel Docker builds. If those two workflows are so intertwined, let's migrate all of them in one go.

Nits:

  • .ci/docker sounds like a wrong location for those files, as they are for use for deployment, rather that testing, so please consider different location.

  • I don't see how this change translates into the workflow that @atalman described, where one can test docker changes before merging the PR (I assume it would require some changes to the binary builds workflow, wouldn't it?)

@malfet malfet requested review from atalman and seemethere July 6, 2024 13:22
@atalman
Copy link
Contributor

atalman commented Jul 8, 2024

Such PR could be landed only if it references PR against the builder that deletes the same files this one is migrating. Correct me if I'm wrong, but in its current form its impossible, as all the common files are used by both conda and manywheel Docker builds. If those two workflows are so intertwined, let's migrate all of them in one go.

Sounds good. Let me bring everything in.

Nits:

  • .ci/docker sounds like a wrong location for those files, as they are for use for deployment, rather that testing, so please consider different location.

Perhaps .cd/docker would be better in this case ?

  • I don't see how this change translates into the workflow that @atalman described, where one can test docker changes before merging the PR (I assume it would require some changes to the binary builds workflow, wouldn't it?)

Yes totally second change will be required where we actually start using images from ecr rather then docker hub. This is only first part, where we move the workflows and make Docker builds upload to both places ecr and dockerhub.

@atalman atalman changed the title Migrate conda docker builds to pytorch/pytorch Migrate conda, manywheel and libtorch docker builds to pytorch/pytorch Jul 8, 2024
@@ -1,26 +1,30 @@
#!/bin/bash
set -xe

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi @chuanqi129 could you please review these changes in this file.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @atalman LGTM

@@ -0,0 +1,15 @@
#!/bin/bash
Copy link
Collaborator

@jithunnair-amd jithunnair-amd Jul 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@atalman Any plans to sync this up with the mkl version being installed via install_conda.sh? There is a version difference (2021.4.0 for install_conda.sh vs 2024.2.0 for this file), a dynamic (install_conda.sh) vs static library (this file) difference, as well as an install location difference (/opt/conda/envs/py_$ANACONDA_PYTHON_VERSION for install_conda.sh vs /opt/intel for this file) .

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi @jithunnair-amd yes totally we should try to move to 2024.2.0. conda install will not work since we are not using conda-forge. However we should install these packages from pip

MKL_VERSION=2024.2.0

mkdir -p /opt/intel/
MKLROOT=${MKLROOT:-/opt/intel}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@atalman IIUC, this script could end up using a different value of $MKLROOT (if env var already exists) as compared to .ci/docker/common/install_rocm_magma.sh, which will get it as /opt/intel from the ENV directive here. If that happens, the ROCm magma build would fail. Either we should hardcode MKLROOT to /opt/intel here, or set ARG MKLROOT /opt/intel as in my PR, since ARGs are visible in different stages of a multi-stage Dockerfile, but ENVs are limited to the specific stage where they are specified?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good let me use '/opt/intel' here . I don't think you are calling this file from rocm builds as far as I know.

juliagmt-google and others added 2 commits July 20, 2024 06:19
* Add build-conda-images.yml in pytorch/pytorch

* Add build_docker.sh to run .github/workflows/build-conda-images.yml

* Create build-conda-images.yml

* Delete .github/build-conda-images.yml

Add build-conda-images.yml in pytorch/pytorch (#128563) (#128700)

* Small adjustment for docker build

* build_docker

* Add docker file

* Add common

Add build-conda-images.yml in pytorch/pytorch (#128563) (#128962)

* Add jni.h path

* test

* test

* fixpath

Update script path

moving conda builds from builder to pytorch (#129167)

* Add jni.h path

* test

* test

* fixpath

* test

* test

* test

* test

* test

* test

* test

* test

* test

* test

* tets

* test

* test

Cleanup build docker images (#129273)

* Cleanup docker move

* add docker file

Add calculate-docker-image when WITH_PUSH is true

Update build-conda-images.yml

Update build-conda-images.yml

Update build-conda-images.yml

test

Update build-conda-images.yml

Update build-conda-images.yml

test

test

test

test

Update build-conda-images.yml

Update build-conda-images.yml

Update build-conda-images.yml

Update build-conda-images.yml

Update build-conda-images.yml

Update build-conda-images.yml

Update build-conda-images.yml

Update build-conda-images.yml

Update build-conda-images.yml

Update build-conda-images.yml

Update build-conda-images.yml

Update build_docker_conda.sh

test

test

test

test

test

test

test

test

test

test

test

test

test

test

test

test

test

keep_both_tag

test_conda_builds

test

test

test

test

test

test

test

test

test
lint

test

common files

test

test

test

test

test

test

test

test

test

test

test

test

test

libtorch

libtorch

sudo

test

rocm fix

test

lint

use one file for install_rocm_magma

xpu

rocm

libtorch

test

libtorch

test

Manywheel builds

fix_manywheel
@atalman atalman force-pushed the move_docker_conda branch from 10aa073 to 6264e93 Compare July 20, 2024 13:20
runs-on: linux.4xlarge
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@move_docker_conda
Copy link
Contributor

@atalman atalman Jul 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

temp code. Will update to main before merging

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, why do you need to specify branch here to begin with? Shouldn't one just be using scripts from the current branch?

Copy link
Contributor

@malfet malfet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but few comments:

GPU_ARCH_VERSION: ${{ matrix.cuda_version }}
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@move_docker_conda
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why specify branch if it could be done from the same repo?

Suggested change
uses: pytorch/pytorch/.github/actions/checkout-pytorch@move_docker_conda
uses: ./.github/actions/checkout-pytorch

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this is what we use everywhere:

git grep "checkout-pytorch"
github/actions/teardown-win/action.yml:    # retry this step several time similar to how checkout-pytorch GHA does
.github/workflows/_android-build-test.yml:        uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
.github/workflows/_android-build-test.yml:        uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
.github/workflows/_android-full-build-test.yml:        uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
.github/workflows/_android-full-build-test.yml:        uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
.github/workflows/_bazel-build-test.yml:        uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
.github/workflows/_bazel-build-test.yml:        uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
.github/workflows/_binary-build-linux.yml:        uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
.github/workflows/_binary-test-linux.yml:        uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
.github/workflows/_binary-upload.yml:        uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
.github/workflows/_buck-build-test.yml:        uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
.github/workflows/_buck-build-test.yml:        uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
.github/workflows/_docs.yml:        uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
etc.....

@atalman
Copy link
Contributor

atalman commented Jul 25, 2024

@pytorchmergebot merge -f "docker builds and lint are green"

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

facebook-github-bot pushed a commit to pytorch/benchmark that referenced this pull request Jul 30, 2024
Summary:
install_cuda.sh has been moved from pytorch/builder to pytorch/pytorch:

pytorch/builder#1942
pytorch/pytorch#129022

We need to make sure that sm_90a is included:
```
xargs -I '{}' bash -c 'echo {} && /usr/local/cuda-12.4/bin/nvprune -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a /usr/local/cuda-12.4/lib64/{} -o /usr/local/cuda-12.4/lib64/{}'
```

Pull Request resolved: #2397

Test Plan: https://github.com/pytorch/benchmark/actions/runs/10163701096

Reviewed By: atalman

Differential Revision: D60454551

Pulled By: xuzhao9

fbshipit-source-id: 03322314b190e0fb95e7d7741152de65ab35ecd4
@github-actions github-actions bot deleted the move_docker_conda branch August 25, 2024 02:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Merged open source release notes: releng release notes category topic: not user facing topic category triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants