
Conversation

@jeanschmidt
Contributor

Renames the arc references to container, and adds the changes required so that CI jobs that require a GPU can run on container runners.
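For context, a minimal sketch of the kind of container-runner detection step the workflow conditions in this PR rely on (the step id check_container_runner and its IN_CONTAINER_RUNNER output appear in the diff below; the detection mechanism itself, an IN_CONTAINER variable, is an assumption for illustration):

- name: Check if in a container runner
  id: check_container_runner
  shell: bash
  # Assumed detection: a runner-provided IN_CONTAINER variable; the PR's actual check may differ.
  run: echo "IN_CONTAINER_RUNNER=${IN_CONTAINER:-false}" >> "$GITHUB_OUTPUT"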

@jeanschmidt jeanschmidt requested a review from a team as a code owner October 2, 2024 10:15
@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Oct 2, 2024
@pytorch-bot
pytorch-bot bot commented Oct 2, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/137169

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 1 Unrelated Failure

As of commit 3273b62 with merge base 52d29a2:

NEW FAILURE - The following job has failed:

BROKEN TRUNK - The following job failed but was already failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@jeanschmidt jeanschmidt self-assigned this Oct 2, 2024
@jeanschmidt jeanschmidt added the ciflow/trunk, ciflow/nightly, ciflow/inductor, ciflow/rocm, ciflow/linux-aarch64, and ciflow/inductor-rocm labels Oct 2, 2024
@jeanschmidt jeanschmidt changed the title from "[BE] Support for CI GPU test and benchmark on containers" to "[CI] Support for CI GPU test and benchmark on containers" Oct 2, 2024
run: echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}"
if: ${{ contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu') && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'true' }}
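Assembled as a complete step, the fragment above might read roughly like this (the step name is assumed; the run and if lines mirror the diff context):

- name: Setup GPU_FLAG for docker run
  if: ${{ contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu') && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'true' }}
  # GPU_FLAG is later passed to docker run so the test container can see the host GPUs.
  run: echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}"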

- name: Setup SCCACHE_SERVER_PORT environment for docker run when on container
Contributor

Do you know if we really need to set SCCACHE_SERVER_PORT here? If it doesn't help with sccache, I wonder if we need to keep the step (there are 2 copies of this step here and in _linux_test.yml)

Contributor Author

This is an interesting question, maybe yes, maybe not. It is not very likely that this would be a problem when running in a containerized environment. If you believe there are potential risks in keeping it, we can remove it.

Contributor

I think the downside is that we might keep a redundant step that does nothing in an already complex workflow.
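For reference, a hypothetical sketch of the kind of step under discussion; SCCACHE_SERVER_PORT is the standard sccache environment variable for the local server port (default 4226), but the port derivation shown here is an assumption, not taken from the PR:

- name: Setup SCCACHE_SERVER_PORT environment for docker run when on container
  if: ${{ steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'true' }}
  shell: bash
  run: |
    # Placeholder derivation: pick a non-default port so sccache servers in sibling
    # containers on the same host do not collide on the default 4226.
    echo "SCCACHE_SERVER_PORT=$((4227 + RANDOM % 100))" >> "${GITHUB_ENV}"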

sudo nvidia-smi -ac 1215,1410
nvidia-smi
if: contains(matrix.runner, 'a100')
if: ${{ contains(matrix.runner, 'a100') && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false' }}
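Put together as a step, the A100 clock-locking fragment above would read roughly as follows (the step name is assumed; the commands and the updated condition come from the diff, which now skips the step on container runners):

- name: Lock GPU clocks on A100 runners
  if: ${{ contains(matrix.runner, 'a100') && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false' }}
  shell: bash
  run: |
    sudo nvidia-smi -ac 1215,1410  # pin memory,graphics application clocks for stable benchmark numbers
    nvidia-smi                     # log the resulting GPU state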
Contributor

I think there are still some references to gcp.a100 in the workflows; searching for the string gcp.a100 turns them up.

Contributor Author

We are not migrating from gcp.a100 to aws.a100 in this PR; we first need those changes merged on main, plus other PRs.

Contributor

Ah, I mean there are steps like Setup SSH that check whether the runner name contains gcp.a100. That can be changed to a100. I get that this is not the cut-over PR from gcp.a100 to aws.a100, hence the question about the rollout plan below :)
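To illustrate the suggestion, the kind of condition change being asked for in steps such as Setup SSH (only the condition is shown; the action reference and inputs are omitted):

- name: Setup SSH
  # was: contains(matrix.runner, 'gcp.a100'); a bare 'a100' match covers both the GCP and AWS A100 runners
  if: ${{ contains(matrix.runner, 'a100') }}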

Contributor

@huydhn huydhn left a comment

Let's chat later about how you plan to roll this out; we would probably want to run both the GCP and AWS A100 runners for a week or so and compare the benchmark numbers.

@jeanschmidt
Contributor Author

@pytorchbot merge -i

@pytorchmergebot
Collaborator

Merge started

Your change will be merged while ignoring the following 2 checks: pull / linux-focal-py3.12-clang10 / test (dynamo, 3, 3, lf.linux.2xlarge), inductor-periodic / cuda12.1-py3.10-gcc9-sm80 / test (inductor_torchbench_smoketest_perf, 1, 1, linux.gcp.a100)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@github-actions github-actions bot deleted the jeanschmidt/linux-test-container branch November 6, 2024 02:07