
Conversation

@jeanschmidt
Contributor

Renames the arc references to container, and adds the changes required so that CI jobs that require a GPU can run on container runners.
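For context, a minimal sketch of the kind of container-runner detection step the workflow conditions in this PR rely on (the step id check_container_runner and its IN_CONTAINER_RUNNER output appear in the diff below; the detection mechanism itself, an IN_CONTAINER variable, is an assumption for illustration):

- name: Check if in a container runner
  id: check_container_runner
  shell: bash
  # Assumed detection: a runner-provided IN_CONTAINER variable; the PR's actual check may differ.
  run: echo "IN_CONTAINER_RUNNER=${IN_CONTAINER:-false}" >> "$GITHUB_OUTPUT"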

@jeanschmidt jeanschmidt requested a review from a team as a code owner October 2, 2024 10:15
@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Oct 2, 2024
@pytorch-bot
pytorch-bot bot commented Oct 2, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/137169

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 1 Unrelated Failure

As of commit 3273b62 with merge base 52d29a2:

NEW FAILURE - The following job has failed:

BROKEN TRUNK - The following job failed but was already failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@jeanschmidt jeanschmidt self-assigned this Oct 2, 2024
@jeanschmidt jeanschmidt added the ciflow/trunk, ciflow/nightly, ciflow/inductor, ciflow/rocm, ciflow/linux-aarch64, and ciflow/inductor-rocm labels Oct 2, 2024
@jeanschmidt jeanschmidt changed the title from "[BE] Support for CI GPU test and benchmark on containers" to "[CI] Support for CI GPU test and benchmark on containers" Oct 2, 2024
run: echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}"
if: ${{ contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu') && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'true' }}
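Assembled as a complete step, the fragment above might read roughly like this (the step name is assumed; the run and if lines mirror the diff context):

- name: Setup GPU_FLAG for docker run
  if: ${{ contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu') && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'true' }}
  # GPU_FLAG is later passed to docker run so the test container can see the host GPUs.
  run: echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}"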

- name: Setup SCCACHE_SERVER_PORT environment for docker run when on container
Contributor

Do you know if we really need to set SCCACHE_SERVER_PORT here? If it doesn't help with sccache, I wonder if we need to keep the step (there are 2 copies of this step here and in _linux_test.yml)

Contributor Author

This is an interesting question, maybe yes, maybe not. It is not very likely that this would be a problem when running in a containerized environment. If you believe there are potential risks in keeping it, we can remove it.

Contributor

I think the downside is that we might keep a redundant step that does nothing in an already complex workflow.
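For reference, a hypothetical sketch of the kind of step under discussion; SCCACHE_SERVER_PORT is the standard sccache environment variable for the local server port (default 4226), but the port derivation shown here is an assumption, not taken from the PR:

- name: Setup SCCACHE_SERVER_PORT environment for docker run when on container
  if: ${{ steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'true' }}
  shell: bash
  run: |
    # Placeholder derivation: pick a non-default port so sccache servers in sibling
    # containers on the same host do not collide on the default 4226.
    echo "SCCACHE_SERVER_PORT=$((4227 + RANDOM % 100))" >> "${GITHUB_ENV}"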

sudo nvidia-smi -ac 1215,1410
nvidia-smi
if: contains(matrix.runner, 'a100')
if: ${{ contains(matrix.runner, 'a100') && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false' }}
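Put together as a step, the A100 clock-locking fragment above would read roughly as follows (the step name is assumed; the commands and the updated condition come from the diff, which now skips the step on container runners):

- name: Lock GPU clocks on A100 runners
  if: ${{ contains(matrix.runner, 'a100') && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false' }}
  shell: bash
  run: |
    sudo nvidia-smi -ac 1215,1410  # pin memory,graphics application clocks for stable benchmark numbers
    nvidia-smi                     # log the resulting GPU state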
Contributor

I think there are still some references to gcp.a100 in the workflows; searching for the string gcp.a100 turns them up.

Contributor Author

We are not migrating from gcp.a100 to aws.a100 in this PR; we first need those changes merged on main, plus other PRs.

Contributor

Ah, I mean there are steps like Setup SSH that check whether the runner name contains gcp.a100. That can be changed to a100. I get that this is not the cut-over PR from gcp.a100 to aws.a100, hence the question about the rollout plan below :)
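To illustrate the suggestion, the kind of condition change being asked for in steps such as Setup SSH (only the condition is shown; the action reference and inputs are omitted):

- name: Setup SSH
  # was: contains(matrix.runner, 'gcp.a100'); a bare 'a100' match covers both the GCP and AWS A100 runners
  if: ${{ contains(matrix.runner, 'a100') }}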

Contributor

@huydhn huydhn left a comment

Let's chat later about how you plan to roll this out; we would probably want to run both the GCP and AWS A100 runners for a week or so and compare the benchmark numbers.

@jeanschmidt
Contributor Author

@pytorchbot merge -i

@pytorchmergebot
Collaborator

Merge started

Your change will be merged while ignoring the following 2 checks: pull / linux-focal-py3.12-clang10 / test (dynamo, 3, 3, lf.linux.2xlarge), inductor-periodic / cuda12.1-py3.10-gcc9-sm80 / test (inductor_torchbench_smoketest_perf, 1, 1, linux.gcp.a100)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@github-actions github-actions bot deleted the jeanschmidt/linux-test-container branch November 6, 2024 02:07