Move cuda 12.4 jobs to periodic for both pull and inductor #127825

clee2000 · 2024-06-03T23:29:49Z

Moves 12.4 sm86/a10g jobs in pull to trunk
Moves 12.4 cuda non sm86 jobs to periodic
Moves 12.4 jobs in inductor to inductor-periodic, except inductor_timm which seems to give important signal

There has been a lot of queueing for CUDA runners since the CUDA 12.4 jobs were added, so this moves those jobs to workflows that run less often.
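
The routing described above can be sketched as a small lookup. This is only an illustration under stated assumptions: `target_workflow` and its parameters are hypothetical names, not part of the PyTorch CI configuration, which lives in the YAML files under .github/workflows. Jobs the description does not mention are assumed to stay where they are.

```python
def target_workflow(cuda_version: str, current_workflow: str,
                    is_sm86: bool = False, shard: str = "") -> str:
    """Return the workflow a job would run in after this change (illustrative only)."""
    # Only CUDA 12.4 jobs are moved; everything else is assumed to stay put.
    if cuda_version != "12.4":
        return current_workflow
    if current_workflow == "inductor":
        # inductor_timm stays in inductor because it seems to give important signal.
        return "inductor" if shard == "inductor_timm" else "inductor-periodic"
    if current_workflow == "pull":
        # sm86/a10g jobs go to trunk; the remaining CUDA 12.4 jobs go to periodic.
        return "trunk" if is_sm86 else "periodic"
    return current_workflow


if __name__ == "__main__":
    print(target_workflow("12.4", "pull", is_sm86=True))
    print(target_workflow("12.4", "pull"))
    print(target_workflow("12.4", "inductor", shard="inductor_timm"))
    print(target_workflow("12.4", "inductor", shard="inductor_huggingface"))
```

The shard name `inductor_huggingface` in the usage lines is a placeholder for "any inductor shard other than inductor_timm".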

pytorch-bot · 2024-06-03T23:29:52Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/127825

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 35dfa36 with merge base 6e54539:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ZainRizvi

Would it make W

ZainRizvi · 2024-06-04T15:21:22Z

.github/workflows/pull.yml

-  linux-focal-cuda12_4-py3_10-gcc9-build:
-    name: linux-focal-cuda12.4-py3.10-gcc9
-    uses: ./.github/workflows/_linux-build-label.yml
-    with:


What do you think about putting these in trunk instead of periodic, to serve as smoke tests? (while keeping the rest in periodic)

Judging from the current HUD, the CUDA 12.4 inductor_timm shard seems to be the most helpful.
https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=inductor_timm
Since the CUDA 12.1 inductor_timm shard is mostly unusable, perhaps just delete all the CUDA 12.4 shards except inductor_timm?
I also agree with @ZainRizvi that trunk job runs might still be necessary to detect regressions.

kept inductor_timm in inductor, kept all others in inductor-periodic
sm86 cu 124 -> trunk
cu124 -> periodic

nWEIdia

Sorry for the infra pressure, this looks great!

clee2000 · 2024-06-04T17:51:17Z

@pytorchbot merge

pytorchmergebot · 2024-06-04T17:53:10Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

nWEIdia · 2024-06-04T20:20:58Z

This seems low risk to force merge.

pytorchmergebot · 2024-06-04T21:36:40Z

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

pull / linux-docs / build-docs-python-false

Dig deeper by viewing the failures on hud

Details for Dev Infra team

Raised by workflow job

Failing merge rule: Core Maintainers

clee2000 · 2024-06-05T18:16:12Z

@pytorchbot merge

pytorchmergebot · 2024-06-05T18:17:56Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

atalman · 2024-06-05T18:53:36Z

@pytorchbot rebase -b main

pytorchmergebot · 2024-06-05T18:55:03Z

@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here

pytorchmergebot · 2024-06-05T18:55:04Z

Rebase failed due to Command git -C /home/runner/work/pytorch/pytorch rebase refs/remotes/origin/main pull/127825/head returned non-zero exit code 1

Rebasing (1/5)
Auto-merging .github/workflows/periodic.yml
CONFLICT (content): Merge conflict in .github/workflows/periodic.yml
error: could not apply fa8041d80b5... update
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Could not apply fa8041d80b5... update

Raised by https://github.com/pytorch/pytorch/actions/runs/9389672880

pytorchmergebot · 2024-06-05T19:05:29Z

Merge failed

Reason: New commits were pushed while merging. Please rerun the merge command.

Details for Dev Infra team

Raised by workflow job

ZainRizvi · 2024-06-05T20:58:58Z

@pytorchbot merge -f "Lint passed. Others are irrelevant"

atalman · 2024-06-05T20:59:21Z

@pytorchmergebot merge -f "lint is green"

pytorchmergebot · 2024-06-05T21:01:27Z

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

…red by ciflow/inductor (#128250)

#127825

The majority of the g5 runner usage comes from inductor (it's something like 2x everything else). In the past week, inductor ran roughly 1300 times on PRs and 300 times on main. Inductor-periodic ran 50 times on main, so the previous move from inductor -> inductor-periodic only results in 250 fewer runs.

I was under the impression that cu124 is currently experimental and that we'll eventually need to switch to it, so this will stay until we switch or inductor uses far fewer runners.

Are we expected to be able to handle two versions of CUDA in CI? Because currently we cannot, at least not comfortably.

Pull Request resolved: #128250
Approved by: https://github.com/huydhn
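
The arithmetic behind the "250 fewer runs" figure can be checked with a short sketch. It uses only the approximate weekly counts quoted above; the variable and function names are hypothetical. It also assumes, as one reading of the message, that the 1300 PR-triggered runs were unaffected by the earlier move, so only the runs on main changed.

```python
# Approximate weekly run counts quoted in the message above.
INDUCTOR_PR_RUNS = 1300
INDUCTOR_MAIN_RUNS = 300
INDUCTOR_PERIODIC_MAIN_RUNS = 50


def runs_saved(main_runs: int, periodic_runs: int) -> int:
    """Runs avoided per week when a job moves from the per-commit workflow
    on main to the periodic workflow on main."""
    return main_runs - periodic_runs


saved = runs_saved(INDUCTOR_MAIN_RUNS, INDUCTOR_PERIODIC_MAIN_RUNS)

# Assumption: PR-triggered runs stay the same, so the saving is compared
# against all inductor runs (PRs plus main) from before the move.
total_before = INDUCTOR_PR_RUNS + INDUCTOR_MAIN_RUNS
share_saved = saved / total_before

if __name__ == "__main__":
    print(f"runs saved per week: {saved}")
    print(f"share of previous inductor runs: {share_saved:.1%}")
```

Under these assumptions the saving is a small fraction of total inductor usage, which is consistent with the message's point that the earlier move alone did not relieve much runner pressure.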

pytorch-bot bot added the topic: not user facing topic category label Jun 3, 2024

clee2000 marked this pull request as ready for review June 3, 2024 23:50

clee2000 requested a review from a team as a code owner June 3, 2024 23:50

ZainRizvi approved these changes Jun 4, 2024

clee2000 force-pushed the csl/move_cuda124_periodic branch from a404a01 to 0fed434 June 4, 2024 16:46

nWEIdia approved these changes Jun 4, 2024

pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jun 4, 2024

pytorchmergebot added the merging label Jun 4, 2024

pytorchmergebot removed the merging label Jun 4, 2024

clee2000 added 4 commits June 5, 2024 11:11

update

fa8041d

tc

d35bfb9

cudnn

89117f4

update

ea3ab31

clee2000 force-pushed the csl/move_cuda124_periodic branch from 34186a3 to ea3ab31 June 5, 2024 18:12

fix

8f87ff0

pytorchmergebot added the merging label Jun 5, 2024

atalman approved these changes Jun 5, 2024

nWEIdia mentioned this pull request Jun 5, 2024

[BE]: Update cudnn to 9.1.0.70 #123475

Closed

pytorch deleted a comment from pytorch-bot bot Jun 5, 2024

Merge branch 'main' into csl/move_cuda124_periodic

ac29ee3

pytorchmergebot removed the merging label Jun 5, 2024

atalman reviewed Jun 5, 2024

.github/workflows/periodic.yml (outdated)

atalman reviewed Jun 5, 2024

.github/workflows/periodic.yml (outdated)

atalman added 2 commits June 5, 2024 16:31

Update .github/workflows/periodic.yml

a002bb3

Update .github/workflows/periodic.yml

35dfa36

malfet approved these changes Jun 5, 2024

pytorchmergebot added the merging label Jun 5, 2024

pytorchmergebot closed this in 01694ea Jun 5, 2024

pytorchmergebot added Merged and removed merging labels Jun 5, 2024

clee2000 mentioned this pull request Jun 7, 2024

Move inductor cuda 124 jobs to a separate workflow that is not triggered by ciflow/inductor #128250

Closed

github-actions bot deleted the csl/move_cuda124_periodic branch July 6, 2024 01:54


7 participants
