Move cuda 12.4 jobs to periodic for both pull and inductor #127825
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/127825
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 35dfa36 with merge base 6e54539.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
ZainRizvi left a comment:
Would it make W
linux-focal-cuda12_4-py3_10-gcc9-build:
  name: linux-focal-cuda12.4-py3.10-gcc9
  uses: ./.github/workflows/_linux-build-label.yml
  with:
What do you think about putting these in trunk instead of periodic, to serve as smoke tests? (while keeping the rest in periodic)
Judging from the current HUD readout, the CUDA 12.4 inductor_timm shard seems to be the most helpful:
https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=inductor_timm
Since the CUDA 12.1 inductor_timm shard is mostly unusable, perhaps just delete all the CUDA 12.4 shards except inductor_timm?
And I agree with @ZainRizvi that trunk job runs might still be necessary to detect regressions.
Kept inductor_timm in inductor; kept all others in inductor-periodic.
sm86 cu124 -> trunk
cu124 -> periodic
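For illustration only, here is a minimal sketch of what one of these relocated job entries could look like once it lives in a periodic-style workflow file. Only the job name, display name, and reusable-workflow path come from the snippet above; the with: values and the target file path are assumptions added for completeness, not copied from this PR.

  # Hypothetical placement in .github/workflows/periodic.yml (not copied from this PR)
  linux-focal-cuda12_4-py3_10-gcc9-build:
    name: linux-focal-cuda12.4-py3.10-gcc9
    uses: ./.github/workflows/_linux-build-label.yml
    with:
      build-environment: linux-focal-cuda12.4-py3.10-gcc9              # assumed value
      docker-image-name: pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9  # assumed value

The move only changes where the entry is declared; the job body itself stays the same, so the build still exercises the same CUDA 12.4 toolchain, just on a less frequent trigger.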
a404a01 to 0fed434 (Compare)
nWEIdia left a comment:
Sorry for the infra pressure, this looks great!
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
This seems low risk to force merge.
Merge failed. Reason: 1 mandatory check(s) failed. Dig deeper by viewing the failures on hud.
34186a3 to ea3ab31 (Compare)
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@pytorchbot rebase -b main
@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here.
Rebase failed due to Command
Merge failed. Reason: New commits were pushed while merging. Please rerun the merge command. Details for Dev Infra team: Raised by workflow job.
@pytorchbot merge -f "Lint passed. Others are irrelevant"
@pytorchmergebot merge -f "lint is green"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
…red by ciflow/inductor (#128250) #127825
The majority of the g5 runner usage comes from inductor (it's something like 2x everything else). In the past week, inductor ran roughly 1300 times on PRs and 300 times on main, while inductor-periodic ran about 50 times on main, so the previous move from inductor -> inductor-periodic only results in about 250 fewer runs (roughly 300 - 50). I was under the impression that cu124 is currently experimental and that we'll eventually need to switch to it, so this will stay until we switch or inductor uses far fewer runners.
Are we expected to be able to handle two versions of CUDA in CI? Because currently we cannot, at least not comfortably.
Pull Request resolved: #128250
Approved by: https://github.com/huydhn
Moves 12.4 sm86/a10g jobs in pull to trunk.
Moves non-sm86 CUDA 12.4 jobs to periodic.
Moves 12.4 jobs in inductor to inductor-periodic, except inductor_timm, which seems to give important signal.
There has been a lot of queueing for CUDA runners due to the addition of CUDA 12.4 jobs, so this moves those jobs to workflows that run less often.
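As a rough illustration of why the trunk/periodic workflows relieve runner pressure, here is a minimal sketch of the kind of triggers a periodic-style workflow typically declares instead of running on every pull request. The cron expression and tag pattern are illustrative assumptions, not values taken from this PR.

  # Hypothetical trigger block for a periodic-style workflow (values are assumptions)
  name: periodic
  on:
    schedule:
      - cron: "45 0,4,8,12,16,20 * * *"   # e.g. a few times a day instead of per-PR
    push:
      tags:
        - ciflow/periodic/*

Because pull-request pushes no longer fan out these CUDA 12.4 shards, the CUDA runner pool only has to absorb them on the schedule or when someone explicitly applies the matching ciflow tag.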