-
Notifications
You must be signed in to change notification settings - Fork 26.3k
[CI] Move all ROCm jobs to periodic frequency #131637
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CI] Move all ROCm jobs to periodic frequency #131637
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/131637
Note: Links to docs will display an error until the docs builds have been completed. ❌ 1 New Failure, 6 Cancelled Jobs, 1 Unrelated FailureAs of commit ec595db with merge base e9db1b0 ( NEW FAILURE - The following job has failed:
CANCELLED JOBS - The following jobs were cancelled. Please retry:
BROKEN TRUNK - The following job failed but were present on the merge base:👉 Rebase onto the `viable/strict` branch to avoid these failures
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
|
@clee2000 one reason I'd prefer this PR over #131489 is because this moves the ROCm inductor jobs to their own workflow, which would allow us to invoke them on PRs where we do want to test inductor functionality on ROCm (but manually, not via bot labeling). The other PR would make distributed tests run along with inductor using the |
clee2000
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good but please fix lint before merging
.github/workflows/inductor-rocm.yml
Outdated
| schedule: | ||
| # We have several schedules so jobs can check github.event.schedule to activate only for a fraction of the runs. | ||
| # Also run less frequently on weekends. | ||
| - cron: 45 0,8,16 * * 1-5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need so many schedule here? They come from the current periodic workflow, in this case I think we just want to run every 4 hours and keep the daily mem leak check and rerun disabled tests one
huydhn
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! One small suggestion to simplify the list of schedules a bit
@huydhn I just copied the schedule from periodic.yml. It's ideal for us to have the same schedule as periodic so that the same commit on main will trigger all the ROCm-related jobs - this helps us in our OSS UT parity analysis. If you'd like to simplify the schedule on the periodic.yml as well, I can include that change in my PR. |
| # We have several schedules so jobs can check github.event.schedule to activate only for a fraction of the runs. | ||
| # Also run less frequently on weekends. | ||
| - cron: 45 0,4,8,12,16,20 * * 1-5 | ||
| - cron: 45 4,12 * * 0,6 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@huydhn I consolidated the schedules, but would prefer to keep the weekends lighter to not strain the CI unnecessarily.
|
@pytorchbot merge -f "Need to reduce ROCm CI backlog" |
Merge startedYour change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Forgot to add them in #128250 and #131637 Pull Request resolved: #133143 Approved by: https://github.com/huydhn
|
@pytorchbot --help |
PyTorchBot HelpMergeRevertRebaseLabelDr CIcherry-pickCloseusage: @pytorchbot close Close a PR [Can be used on issues] |
|
@pytorchbot revert -m "Need to enable ROCm jobs to test more frequently" -c nosignal |
|
@pytorchbot successfully started a revert job. Check the current status here. |
Reverting PR 131637 failedReason: Command Details for Dev Infra teamRaised by workflow job |
Dial the frequency back up from #131637 Pull Request resolved: #135644 Approved by: https://github.com/huydhn
Dial the frequency back up from #131637 Pull Request resolved: #135644 Approved by: https://github.com/huydhn
…35644) Dial the frequency back up from pytorch#131637 Pull Request resolved: pytorch#135644 Approved by: https://github.com/huydhn
inductorandrocmworkflows are the major contributors to the CI load on ROCm CI at the moment, resulting in huge backlogs: #131489 (comment)ciflow/inductor-rocmas PR label to manually invoke inductor jobs for ROCm (no automatic invoking to limit CI load)trunkworkflow jobs for ROCm will run on every commit and PR merge, but since they take 45min*3 time on average, I decided to leave them as-is since it will provide us some basic insulation against ROCm breakage.cc @jeffdaily @sunway513 @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang