Skip to content

Conversation

@clee2000
Copy link
Contributor

@clee2000 clee2000 commented Jun 3, 2024

Moves 12.4 sm86/a10g jobs in pull to trunk
Moves 12.4 cuda non sm86 jobs to periodic
Moves 12.4 jobs in inductor to inductor-periodic, except inductor_timm which seems to give important signal

There has been a lot of queueing for cuda runners due to the addition of jobs for cuda 12.4, so move those jobs to other workflows that are run less often

@pytorch-bot
Copy link

pytorch-bot bot commented Jun 3, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/127825

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 35dfa36 with merge base 6e54539 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Jun 3, 2024
@clee2000 clee2000 marked this pull request as ready for review June 3, 2024 23:50
@clee2000 clee2000 requested a review from a team as a code owner June 3, 2024 23:50
Copy link
Contributor

@ZainRizvi ZainRizvi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make W

Comment on lines -288 to -291
linux-focal-cuda12_4-py3_10-gcc9-build:
name: linux-focal-cuda12.4-py3.10-gcc9
uses: ./.github/workflows/_linux-build-label.yml
with:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about putting these in trunk instead of periodic, to serve as smoke tests? (while keeping the rest in periodic)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Telling from current HUD read. CUDA 12.4 inductor_timm shard seems to be most helpful.
https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=inductor_timm
Since CUDA 12.1 inductor_timm shard is mostly un-usable. Perhaps, just delete all the shards for cuda 12.4 except inductor_timm?
And agree with @ZainRizvi trunk job runs might still be necessary to detect regressions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kept inductor_timm in inductor, kept all others in inductor-periodic
sm86 cu 124 -> trunk
cu124 -> periodic

@clee2000 clee2000 force-pushed the csl/move_cuda124_periodic branch from a404a01 to 0fed434 Compare June 4, 2024 16:46
Copy link
Collaborator

@nWEIdia nWEIdia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the infra pressure, this looks great!

@clee2000
Copy link
Contributor Author

clee2000 commented Jun 4, 2024

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jun 4, 2024
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@nWEIdia
Copy link
Collaborator

nWEIdia commented Jun 4, 2024

This seems low risk to force merge.

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers

@clee2000 clee2000 force-pushed the csl/move_cuda124_periodic branch from 34186a3 to ea3ab31 Compare June 5, 2024 18:12
@clee2000
Copy link
Contributor Author

clee2000 commented Jun 5, 2024

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@atalman
Copy link
Contributor

atalman commented Jun 5, 2024

@pytorchbot rebase -b main

@pytorch pytorch deleted a comment from pytorch-bot bot Jun 5, 2024
@pytorchmergebot
Copy link
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here

@pytorchmergebot
Copy link
Collaborator

Rebase failed due to Command git -C /home/runner/work/pytorch/pytorch rebase refs/remotes/origin/main pull/127825/head returned non-zero exit code 1

Rebasing (1/5)
Auto-merging .github/workflows/periodic.yml
CONFLICT (content): Merge conflict in .github/workflows/periodic.yml
error: could not apply fa8041d80b5... update
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Could not apply fa8041d80b5... update

Raised by https://github.com/pytorch/pytorch/actions/runs/9389672880

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: New commits were pushed while merging. Please rerun the merge command.

Details for Dev Infra team Raised by workflow job

@ZainRizvi
Copy link
Contributor

@pytorchbot merge -f "Lint passed. Others are irrelevant"

@atalman
Copy link
Contributor

atalman commented Jun 5, 2024

@pytorchmergebot merge -f "lint is green"

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

pytorchmergebot pushed a commit that referenced this pull request Jun 7, 2024
…red by ciflow/inductor (#128250)

#127825

The majority of the g5 runner usage comes from inductor (its something like 2x everything else)
in the past week, inductor ran 1300 ish times on PRs and 300 times on main.  Inductor-periodic ran 50 times on main, so the previous move from inductor -> inductor-periodic only results in 250 fewer runs.

I was under the impression that cu124 is experimental currently and eventually we'll need to switch to it, so this will stay until we switch or inductor uses much fewer runners

Are we expected to be able to handle two versions of cuda in CI?  Because currently we cannot, at least not comfortably

Pull Request resolved: #128250
Approved by: https://github.com/huydhn
TharinduRusira pushed a commit to TharinduRusira/pytorch that referenced this pull request Jun 14, 2024
…red by ciflow/inductor (pytorch#128250)

pytorch#127825

The majority of the g5 runner usage comes from inductor (its something like 2x everything else)
in the past week, inductor ran 1300 ish times on PRs and 300 times on main.  Inductor-periodic ran 50 times on main, so the previous move from inductor -> inductor-periodic only results in 250 fewer runs.

I was under the impression that cu124 is experimental currently and eventually we'll need to switch to it, so this will stay until we switch or inductor uses much fewer runners

Are we expected to be able to handle two versions of cuda in CI?  Because currently we cannot, at least not comfortably

Pull Request resolved: pytorch#128250
Approved by: https://github.com/huydhn
@github-actions github-actions bot deleted the csl/move_cuda124_periodic branch July 6, 2024 01:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/trunk Trigger trunk jobs on your pull request Merged topic: not user facing topic category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants