@pragupta pragupta commented May 14, 2025

When timing is enabled, the ROCR runtime used to sleep briefly, which ensured that the application saw the correct state. That sleep was removed for performance reasons, so the state is no longer guaranteed to be "started" when the test reads it. I therefore updated the test's state check to accept either "started" or "scheduled".
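
As a rough illustration of the relaxed check described above, here is a minimal sketch. The names `ACCEPTED_STATES`, `state_is_acceptable`, and `RelaxedStateCheckExample` are hypothetical and are not taken from the PyTorch test suite. The only detail drawn from this PR is that the check now accepts either "started" or "scheduled".

```python
import unittest

# Hypothetical constant: the two states the relaxed check accepts,
# as named in the PR description. The real test's structure is not shown here.
ACCEPTED_STATES = ("started", "scheduled")


def state_is_acceptable(observed_state: str) -> bool:
    """Return True if the observed state is one the relaxed check tolerates."""
    return observed_state in ACCEPTED_STATES


class RelaxedStateCheckExample(unittest.TestCase):
    def test_started_or_scheduled_passes(self):
        # Before the change, only "started" would have been accepted.
        for observed in ("started", "scheduled"):
            self.assertTrue(state_is_acceptable(observed))

    def test_other_values_fail(self):
        # "unknown" is a placeholder value used only for this illustration.
        self.assertFalse(state_is_acceptable("unknown"))
```

Accepting a small set of states, rather than sleeping before the read, keeps the test independent of how quickly the runtime updates the state.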

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd

pytorch-bot bot commented May 14, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/153545

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit ab98a99 with merge base 0c6c778:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added module: rocm AMD GPU support for Pytorch oncall: distributed Add this issue/PR to distributed oncall triage queue topic: not user facing topic category labels May 14, 2025
@jeffdaily jeffdaily added ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR ciflow/periodic-rocm-mi300 Trigger "distributed" config CI on ROCm MI300 labels May 14, 2025
@pytorch-bot pytorch-bot bot removed ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR ciflow/periodic-rocm-mi300 Trigger "distributed" config CI on ROCm MI300 labels May 14, 2025
@jeffdaily jeffdaily added ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR ciflow/periodic-rocm-mi300 Trigger "distributed" config CI on ROCm MI300 labels May 14, 2025
jithunnair-amd pushed a commit to ROCm/pytorch that referenced this pull request May 16, 2025
…2110)

When timing is enabled, ROCR runtime used to sleep for a small amount
which ensured that the application saw the correct state. However, for
perf reasons this sleep was removed and now the state is not guaranteed
to be "started". That's why I updated the test state check to be either
"started" or "scheduled".

Fixes https://ontrack-internal.amd.com/browse/SWDEV-525883

Upstream PR: pytorch#153545
pragupta added a commit to pragupta/pytorch that referenced this pull request May 28, 2025
…OCm#2110)

When timing is enabled, ROCR runtime used to sleep for a small amount
which ensured that the application saw the correct state. However, for
perf reasons this sleep was removed and now the state is not guaranteed
to be "started". That's why I updated the test state check to be either
"started" or "scheduled".

Fixes https://ontrack-internal.amd.com/browse/SWDEV-525883

Upstream PR: pytorch#153545

(cherry picked from commit 8a1ad2c)
@pragupta pragupta marked this pull request as ready for review May 28, 2025 14:49
@jeffdaily jeffdaily left a comment

Approved with one nit to change.

@pytorch-bot pytorch-bot bot removed ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR ciflow/periodic-rocm-mi300 Trigger "distributed" config CI on ROCm MI300 labels May 28, 2025
@jeffdaily jeffdaily added the ciflow/rocm Trigger "default" config CI on ROCm label May 28, 2025
pytorch-bot bot commented May 28, 2025

To add the ciflow label ciflow/rocm please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@pytorch-bot pytorch-bot bot removed the ciflow/rocm Trigger "default" config CI on ROCm label May 28, 2025
@jeffdaily jeffdaily added the ciflow/rocm Trigger "default" config CI on ROCm label May 28, 2025
@pytorch-bot pytorch-bot bot removed the ciflow/rocm Trigger "default" config CI on ROCm label May 28, 2025
@jeffdaily jeffdaily added ciflow/rocm Trigger "default" config CI on ROCm ciflow/periodic-rocm-mi300 Trigger "distributed" config CI on ROCm MI300 labels May 28, 2025
@pruthvistony pruthvistony added ciflow/inductor-rocm Trigger "inductor" config CI on ROCm ciflow/rocm-mi300 Trigger "default" config CI on ROCm MI300 labels May 28, 2025
pruthvistony pushed a commit to ROCm/pytorch that referenced this pull request May 28, 2025
#2201)

…#2110)

When timing is enabled, ROCR runtime used to sleep for a small amount
which ensured that the application saw the correct state. However, for
perf reasons this sleep was removed and now the state is not guaranteed
to be "started". That's why I updated the test state check to be either
"started" or "scheduled".

Fixes https://ontrack-internal.amd.com/browse/SWDEV-525883

Upstream PR: pytorch#153545

(cherry picked from commit 8a1ad2c)

Fixes #ISSUE_NUMBER
pruthvistony pushed a commit to ROCm/pytorch that referenced this pull request May 28, 2025
#2202)

…#2110)

When timing is enabled, ROCR runtime used to sleep for a small amount
which ensured that the application saw the correct state. However, for
perf reasons this sleep was removed and now the state is not guaranteed
to be "started". That's why I updated the test state check to be either
"started" or "scheduled".

Fixes https://ontrack-internal.amd.com/browse/SWDEV-525883

Upstream PR: pytorch#153545

(cherry picked from commit 8a1ad2c)

Fixes #ISSUE_NUMBER
@jeffdaily
Collaborator

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

pragupta and others added 4 commits May 30, 2025 16:47
When timing is enabled, ROCR runtime used to sleep for a small amount
which ensured that the application saw the correct state. However, for
perf reasons this sleep was removed and now the state is not guaranteed
to be "started". That's why I updated the test state check to be
either "started" or "scheduled".
@pytorchmergebot
Collaborator

Successfully rebased pg-navi31-jira-upstream onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout pg-navi31-jira-upstream && git pull --rebase)

@pytorchmergebot pytorchmergebot force-pushed the pg-navi31-jira-upstream branch from 1b33d3f to ab98a99 Compare May 30, 2025 16:47
@pytorch-bot pytorch-bot bot removed ciflow/rocm Trigger "default" config CI on ROCm ciflow/inductor-rocm Trigger "inductor" config CI on ROCm ciflow/rocm-mi300 Trigger "default" config CI on ROCm MI300 ciflow/periodic-rocm-mi300 Trigger "distributed" config CI on ROCm MI300 labels May 30, 2025
@jeffdaily jeffdaily added ciflow/rocm Trigger "default" config CI on ROCm ciflow/inductor-rocm Trigger "inductor" config CI on ROCm ciflow/rocm-mi300 Trigger "default" config CI on ROCm MI300 ciflow/periodic-rocm-mi300 Trigger "distributed" config CI on ROCm MI300 labels May 30, 2025
@jeffdaily
Collaborator

@pytorchbot merge -f "unrelated rocm failures, rocm-only change to test logic"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

Labels

ciflow/inductor-rocm Trigger "inductor" config CI on ROCm ciflow/periodic-rocm-mi300 Trigger "distributed" config CI on ROCm MI300 ciflow/rocm Trigger "default" config CI on ROCm ciflow/rocm-mi300 Trigger "default" config CI on ROCm MI300 Merged module: rocm AMD GPU support for Pytorch oncall: distributed Add this issue/PR to distributed oncall triage queue open source topic: not user facing topic category

5 participants