Skip to content

Conversation

@amathewc
Copy link
Contributor

@amathewc amathewc commented Jan 17, 2025

MOTIVATION

To generalize distributed test cases for non-CUDA devices, we are leveraging the DistributedTestBase class introduced in PR #138216. This new class is derived from MultiProcessTestCase and abstracts the creation/deletion of process groups and other functionality for specific devices. In this PR, we extend the scope of these tests to support HPUs.

CHANGES

Replaced MultiProcessTestCase with the DistributedTestBase class.
Extended test functionality to include support for HPUs.
Utilized instantiate_device_type_tests with targeted attributes to generate device-specific test instances.
Applied the skipIfHPU decorator to skip tests that are not yet compatible with HPU devices.

cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @ankurneog

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue topic: not user facing topic category labels Jan 17, 2025
@pytorch-bot
Copy link

pytorch-bot bot commented Jan 17, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/145056

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 6a29ea1 with merge base 9c78fb9 (image):

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Copy link
Collaborator

@Skylion007 Skylion007 Jan 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also please properly type annotation this function with callable[P,R] etc? Decorators are important to type properly as they lose all typing info of things they decorate otherwise.

Suggested change
def with_comms(func=None):
def with_comms(func=None):

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
devices=["cpu","cuda","hpu"]
devices=("cpu","cuda","hpu")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious if the way DistributedTestBase passes device arg to test classes plays nicely with using @parametrize on test instances- have you tried this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have tried using both @parameterize and DistributedTestCase in the same file on different test cases (test/distributed/test_functional_api.py) but I haven't tried using them both on the same testcase so I am unsure about how they interact.

Thanks for pointing this out! I will take a look at it and see if any changes are required :)

@ankurneog
Copy link

@kwen2501 : Kindly help with the review. thanks

@mikaylagawarecki mikaylagawarecki added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Jan 22, 2025
@kwen2501
Copy link
Collaborator

kwen2501 commented Jan 23, 2025

Sorry just back to work this week. Will have a look soon.
@ankurneog You can add me or @fegin to the reviewer list next time for Distributed tests. (That helps me search for PRs that need a review.) Thanks!

@kwen2501 kwen2501 self-requested a review January 23, 2025 17:19
Copy link
Collaborator

@kwen2501 kwen2501 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. We are very close to land, once the comments are addressed.

@amathewc
Copy link
Contributor Author

@kwen2501 : I had addressed your review comments ? Could you take a look?

@amathewc
Copy link
Contributor Author

@ankurneog , @kwen2501 , @fegin : Could you help with the approval and merging ?

@amathewc
Copy link
Contributor Author

amathewc commented Feb 5, 2025

@ankurneog , @kwen2501 : Could you help with the merging ?

@amathewc
Copy link
Contributor Author

amathewc commented Feb 6, 2025

@pytorchbot merge -i

@pytorch-bot
Copy link

pytorch-bot bot commented Feb 6, 2025

-i flag is only allowed for users with write permissions

@amathewc
Copy link
Contributor Author

amathewc commented Feb 6, 2025

@guangyey : Could you help with retriggering the CI and merge ?

@ankurneog
Copy link

Sorry just back to work this week. Will have a look soon. @ankurneog You can add me or @fegin to the reviewer list next time for Distributed tests. (That helps me search for PRs that need a review.) Thanks!

@kwen2501 I don't seem to have the permission to directly assign reviewers. Do i need to do anything from my side to enable that ?

@guangyey
Copy link
Collaborator

guangyey commented Feb 8, 2025

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Feb 8, 2025
@guangyey
Copy link
Collaborator

guangyey commented Feb 8, 2025

@pytorchbot rebase

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Copy link
Collaborator

Successfully rebased distributed_changes onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout distributed_changes && git pull --rebase)

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: New commits were pushed while merging. Please rerun the merge command.

Details for Dev Infra team Raised by workflow job

@guangyey
Copy link
Collaborator

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers

@amathewc
Copy link
Contributor Author

@kwen2501 , @guangyey , @ankurneog : Could you help with the merging ? The failure seems to be unrelated.

@AnantGulati
Copy link
Contributor

"Try to land this since the failure is unrelated."
@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@amathewc amathewc deleted the distributed_changes branch February 18, 2025 09:30
Ryo-not-rio pushed a commit to Ryo-not-rio/pytorch that referenced this pull request Feb 24, 2025
…pytorch#145056)

# MOTIVATION
To generalize distributed test cases for non-CUDA devices, we are leveraging the DistributedTestBase class introduced in [PR pytorch#138216](pytorch#138216). This new class is derived from MultiProcessTestCase and abstracts the creation/deletion of process groups and other functionality for specific devices. In this PR, we extend the scope of these tests to support HPUs.

# CHANGES

Replaced MultiProcessTestCase with the DistributedTestBase class.
Extended test functionality to include support for HPUs.
Utilized instantiate_device_type_tests with targeted attributes to generate device-specific test instances.
Applied the skipIfHPU decorator to skip tests that are not yet compatible with HPU devices.

Pull Request resolved: pytorch#145056
Approved by: https://github.com/kwen2501, https://github.com/guangyey
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/trunk Trigger trunk jobs on your pull request Merged oncall: distributed Add this issue/PR to distributed oncall triage queue open source topic: not user facing topic category triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants