@jithunnair-amd
Collaborator

@jithunnair-amd jithunnair-amd commented Jul 24, 2024

The inductor and rocm workflows are currently the major contributors to the load on ROCm CI, resulting in huge backlogs: #131489 (comment)

  • Move rocm.yml to cron frequency
  • Move ROCm CI jobs from inductor.yml to inductor-rocm.yml
  • Introduce ciflow/inductor-rocm as PR label to manually invoke inductor jobs for ROCm (no automatic invoking to limit CI load)
  • After this PR, only the trunk workflow jobs for ROCm will run on every commit and PR merge. They take about 45min*3 on average, but I decided to leave them as-is since they will provide us some basic insulation against ROCm breakage.

cc @jeffdaily @sunway513 @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang

@pytorch-bot

pytorch-bot bot commented Jul 24, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/131637

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 6 Cancelled Jobs, 1 Unrelated Failure

As of commit ec595db with merge base e9db1b0 (image):

NEW FAILURE - The following job has failed:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

BROKEN TRUNK - The following job failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added ciflow/rocm Trigger "default" config CI on ROCm module: rocm AMD GPU support for Pytorch topic: not user facing topic category labels Jul 24, 2024
@jithunnair-amd jithunnair-amd removed the ciflow/rocm Trigger "default" config CI on ROCm label Jul 24, 2024
@pytorch-bot pytorch-bot bot added the ciflow/rocm Trigger "default" config CI on ROCm label Jul 24, 2024
@jithunnair-amd
Collaborator Author

@clee2000 One reason I'd prefer this PR over #131489 is that it moves the ROCm inductor jobs to their own workflow, which would allow us to invoke them on PRs where we do want to test inductor functionality on ROCm (manually, not via bot labeling). The other PR would make distributed tests run along with inductor under the ciflow/periodic label, which would increase the load on CI.

Contributor

@clee2000 clee2000 left a comment

Looks good but please fix lint before merging

schedule:
# We have several schedules so jobs can check github.event.schedule to activate only for a fraction of the runs.
# Also run less frequently on weekends.
- cron: 45 0,8,16 * * 1-5
Contributor

Do we need so many schedules here? They come from the current periodic workflow. In this case, I think we just want to run every 4 hours and keep the daily mem leak check and rerun disabled tests ones.

Contributor

@huydhn huydhn left a comment

LGTM! One small suggestion to simplify the list of schedules a bit

@huydhn
Contributor

huydhn commented Jul 24, 2024

cc @clee2000 @atalman I noticed this huge stack, #131564. We know that similar ones have caused long queues across the board in the past.

@jithunnair-amd
Collaborator Author

jithunnair-amd commented Jul 24, 2024

> LGTM! One small suggestion to simplify the list of schedules a bit

@huydhn I just copied the schedule from periodic.yml. Ideally we'd keep the same schedule as periodic so that the same commit on main triggers all the ROCm-related jobs; this helps with our OSS UT parity analysis. If you'd like to simplify the schedule in periodic.yml as well, I can include that change in my PR.

# We have several schedules so jobs can check github.event.schedule to activate only for a fraction of the runs.
# Also run less frequently on weekends.
- cron: 45 0,4,8,12,16,20 * * 1-5
- cron: 45 4,12 * * 0,6
Collaborator Author

@huydhn I consolidated the schedules, but would prefer to keep the weekends lighter to not strain the CI unnecessarily.
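
To make the weekday/weekend trade-off in the consolidated schedule concrete, here is a minimal Python sketch that counts how many times per week the two cron lines quoted above would fire. The helper names (`expand_field`, `runs_per_week`) are hypothetical and exist only for this illustration; the sketch handles just comma lists and ranges, which is all these two expressions use, and assumes the usual cron convention that day-of-week 0 is Sunday and 6 is Saturday.

```python
from typing import Dict, List


def expand_field(field: str) -> List[int]:
    """Expand a cron field made of comma-separated values and ranges.

    Deliberately minimal: '*' and step syntax ('*/4') are not supported.
    """
    values: List[int] = []
    for part in field.split(","):
        if "-" in part:
            low, high = part.split("-")
            values.extend(range(int(low), int(high) + 1))
        else:
            values.append(int(part))
    return values


def runs_per_week(cron: str) -> int:
    """Count weekly triggers, assuming day-of-month and month are both '*'."""
    minute, hour, _day_of_month, _month, day_of_week = cron.split()
    return (
        len(expand_field(minute))
        * len(expand_field(hour))
        * len(expand_field(day_of_week))
    )


# The two schedule lines from the consolidated version of the workflow.
schedules = ["45 0,4,8,12,16,20 * * 1-5", "45 4,12 * * 0,6"]
weekly: Dict[str, int] = {cron: runs_per_week(cron) for cron in schedules}
total = sum(weekly.values())

for cron, count in weekly.items():
    print(f"{cron!r}: {count} runs/week")
print(f"total: {total} runs/week")
```

Under these assumptions the weekday line fires 30 times a week (every 4 hours, Monday to Friday) and the weekend line 4 times, for 34 scheduled runs in total.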

@jithunnair-amd jithunnair-amd marked this pull request as ready for review July 24, 2024 18:39
@jithunnair-amd jithunnair-amd requested a review from a team as a code owner July 24, 2024 18:39
@jithunnair-amd
Collaborator Author

@pytorchbot merge -f "Need to reduce ROCm CI backlog"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pruthvistony
Collaborator

@pytorchbot --help

@pytorch-bot

pytorch-bot bot commented Sep 10, 2024

PyTorchBot Help

usage: @pytorchbot [-h] {merge,revert,rebase,label,drci,cherry-pick,close} ...

In order to invoke the bot on your PR, include a line that starts with
@pytorchbot anywhere in a comment. That line will form the command; no
multi-line commands are allowed. Some commands may be used on issues as specified below.

Example:
    Some extra context, blah blah, wow this PR looks awesome

    @pytorchbot merge

optional arguments:
  -h, --help            Show this help message and exit.

command:
  {merge,revert,rebase,label,drci,cherry-pick,close}
    merge               Merge a PR
    revert              Revert a PR
    rebase              Rebase a PR
    label               Add label to a PR
    drci                Update Dr. CI
    cherry-pick         Cherry pick a PR onto a release branch
    close               Close a PR

Merge

usage: @pytorchbot merge [-f MESSAGE | -i] [-ic] [-r [{viable/strict,main}]]

Merge an accepted PR, subject to the rules in .github/merge_rules.json.
By default, this will wait for all required checks (lint, pull) to succeed before merging.

optional arguments:
  -f MESSAGE, --force MESSAGE
                        Merge without checking anything. This requires a reason for auditing purpose, for example:
                        @pytorchbot merge -f 'Minor update to fix lint. Expecting all PR tests to pass'
                        
                        Please use `-f` as last resort, prefer `--ignore-current` to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.
  -i, --ignore-current  Merge while ignoring the currently failing jobs.  Behaves like -f if there are no pending jobs.
  -ic                   Old flag for --ignore-current. Deprecated in favor of -i.
  -r [{viable/strict,main}], --rebase [{viable/strict,main}]
                        Rebase the PR to re run checks before merging.  Accepts viable/strict or main as branch options and will default to viable/strict if not specified.

Revert

usage: @pytorchbot revert -m MESSAGE -c
                          {nosignal,ignoredsignal,landrace,weird,ghfirst}

Revert a merged PR. This requires that you are a Meta employee.

Example:
  @pytorchbot revert -m="This is breaking tests on trunk. hud.pytorch.org/" -c=nosignal

optional arguments:
  -m MESSAGE, --message MESSAGE
                        The reason you are reverting, will be put in the commit message. Must be longer than 3 words.
  -c {nosignal,ignoredsignal,landrace,weird,ghfirst}, --classification {nosignal,ignoredsignal,landrace,weird,ghfirst}
                        A machine-friendly classification of the revert reason.

Rebase

usage: @pytorchbot rebase [-s | -b BRANCH]

Rebase a PR. Rebasing defaults to the stable viable/strict branch of pytorch.
Repeat contributor may use this command to rebase their PR.

optional arguments:
  -s, --stable          [DEPRECATED] Rebase onto viable/strict
  -b BRANCH, --branch BRANCH
                        Branch you would like to rebase to

Label

usage: @pytorchbot label labels [labels ...]

Adds label to a PR or Issue [Can be used on Issues]

positional arguments:
  labels  Labels to add to given Pull Request or Issue [Can be used on Issues]

Dr CI

usage: @pytorchbot drci 

Update Dr. CI. Updates the Dr. CI comment on the PR in case it's gotten out of sync with actual CI results.

cherry-pick

usage: @pytorchbot cherry-pick --onto ONTO [--fixes FIXES] -c
                               {regression,critical,fixnewfeature,docs,release}

Cherry pick a pull request onto a release branch for inclusion in a release

optional arguments:
  --onto ONTO           Branch you would like to cherry pick onto (Example: release/2.1)
  --fixes FIXES         Link to the issue that your PR fixes (Example: https://github.com/pytorch/pytorch/issues/110666)
  -c {regression,critical,fixnewfeature,docs,release}, --classification {regression,critical,fixnewfeature,docs,release}
                        A machine-friendly classification of the cherry-pick reason.

Close

usage: @pytorchbot close

Close a PR [Can be used on issues]
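
The usage lines in the help text above follow the layout that Python's `argparse` produces. As an illustration only, the sketch below rebuilds the `merge` and `revert` subcommands from that help text and parses the two commands issued in this thread. It is a reconstruction from the printed usage, not the bot's actual source: `build_parser` and `parse_comment_line` are hypothetical names, the deprecated `-ic` flag is omitted, and the "longer than 3 words" rule for revert messages is not enforced.

```python
import argparse
import shlex

REVERT_CLASSES = ["nosignal", "ignoredsignal", "landrace", "weird", "ghfirst"]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the merge/revert usage shown in the help text."""
    parser = argparse.ArgumentParser(prog="@pytorchbot")
    commands = parser.add_subparsers(dest="command", required=True)

    merge = commands.add_parser("merge")
    # The usage line shows -f and -i as alternatives: [-f MESSAGE | -i].
    how = merge.add_mutually_exclusive_group()
    how.add_argument("-f", "--force", metavar="MESSAGE")
    how.add_argument("-i", "--ignore-current", action="store_true")
    # -r takes an optional branch and falls back to viable/strict.
    merge.add_argument(
        "-r",
        "--rebase",
        nargs="?",
        const="viable/strict",
        choices=["viable/strict", "main"],
    )

    revert = commands.add_parser("revert")
    revert.add_argument("-m", "--message", required=True)
    revert.add_argument(
        "-c", "--classification", required=True, choices=REVERT_CLASSES
    )
    return parser


def parse_comment_line(line: str) -> argparse.Namespace:
    """Parse a single comment line that starts with @pytorchbot."""
    prefix = "@pytorchbot"
    if not line.startswith(prefix):
        raise ValueError("not a bot command")
    return build_parser().parse_args(shlex.split(line[len(prefix):]))


merge_args = parse_comment_line(
    '@pytorchbot merge -f "Need to reduce ROCm CI backlog"'
)
revert_args = parse_comment_line(
    '@pytorchbot revert -m "Need to enable ROCm jobs to test more frequently" -c nosignal'
)
print(merge_args)
print(revert_args)
```

Because `-f` and `-i` sit in a mutually exclusive group, a line that passes both is rejected by the parser, which matches the `[-f MESSAGE | -i]` notation in the usage string.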

@pruthvistony
Collaborator

@pytorchbot revert -m "Need to enable ROCm jobs to test more frequently" -c nosignal

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot
Collaborator

Reverting PR 131637 failed

Reason: Command git -C /home/runner/work/pytorch/pytorch revert --no-edit 05064f2827298a651a40431334e2a841d3a05153 returned non-zero exit code 1

Auto-merging .github/pytorch-probot.yml
CONFLICT (modify/delete): .github/workflows/inductor-rocm.yml deleted in parent of 05064f2827 ([CI] Move all ROCm jobs to periodic frequency (#131637)) and modified in HEAD.  Version HEAD of .github/workflows/inductor-rocm.yml left in tree.
Auto-merging .github/workflows/inductor.yml
CONFLICT (content): Merge conflict in .github/workflows/inductor.yml
error: could not revert 05064f2827... [CI] Move all ROCm jobs to periodic frequency (#131637)
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git revert --continue".
hint: You can instead skip this commit with "git revert --skip".
hint: To abort and get back to the state before "git revert",
hint: run "git revert --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Details for Dev Infra team Raised by workflow job

Labels

ciflow/rocm Trigger "default" config CI on ROCm Merged module: rocm AMD GPU support for Pytorch open source topic: not user facing topic category

7 participants