@shuqiangzhang
Contributor

@shuqiangzhang shuqiangzhang commented Oct 21, 2024

Stack from ghstack (oldest at bottom):

Summary:
Eager init and split should be two separate concepts. For example, we should be able to support eager init without using split mode.

In this PR, we give users an option to specify whether to use split (use_split), even when eager init is used.

Also, going forward, users who want split are encouraged to use split_group instead of new_group.
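
To make the separation concrete, here is a minimal pure-Python sketch of the decision described in the summary. It is not PyTorch code: `InitPlan` and `plan_subgroup_init` are hypothetical names introduced only for illustration, and the fallback to bootstrapping when there is no eager init is an assumption (a split needs an already-initialized parent communicator). The default of `use_split=False` follows the discussion in this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InitPlan:
    """How a sub-group's communicator would be created (illustrative only)."""
    when: str  # "eager" (at group creation) or "lazy" (at first collective)
    how: str   # "split" (derived from the parent) or "bootstrap" (built from scratch)


def plan_subgroup_init(eager_init: bool, use_split: bool = False) -> InitPlan:
    """Hypothetical model of the two independent choices; not the real API."""
    when = "eager" if eager_init else "lazy"
    # Assumption: splitting needs an initialized parent communicator, so
    # without eager init this model falls back to bootstrapping.
    how = "split" if (eager_init and use_split) else "bootstrap"
    return InitPlan(when, how)


# Eager init no longer implies split: the caller has to opt in.
print(plan_subgroup_init(eager_init=True))
print(plan_subgroup_init(eager_init=True, use_split=True))
print(plan_subgroup_init(eager_init=False, use_split=True))
```

The point of the sketch is only that "when the communicator is created" and "how it is created" are separate axes; the actual behavior for each combination is defined by the PR's code, not by this model.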

Test Plan:
Added a test: test_comm_eager_without_split
Reviewers:

Subscribers:

Tasks:

Tags:

cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

shuqiangzhang added a commit that referenced this pull request Oct 21, 2024
ghstack-source-id: 9e1001e
Pull Request resolved: #138518
@pytorch-bot

pytorch-bot bot commented Oct 21, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/138518

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 7246122 with merge base ce63193:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels Oct 21, 2024
@shuqiangzhang shuqiangzhang changed the title [c10d] explicity specify if split semantics should be used [c10d] user to explicitly specify if split semantics should be used in new_group Oct 21, 2024
…d be used in new_group"


Summary:
Eager init and split should be two separate concepts. For example, we should be able to support eager init without using split mode.

In this PR, users must explicitly specify use_split in order to use the split comm functions in NCCL.

Test Plan:
Added a test_comm_eager_without_split
Reviewers:

Subscribers:

Tasks:

Tags:

cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
shuqiangzhang added a commit that referenced this pull request Oct 22, 2024
ghstack-source-id: 2c65d99
Pull Request resolved: #138518
@shuqiangzhang shuqiangzhang changed the title [c10d] user to explicitly specify if split semantics should be used in new_group [c10d] user to explicitly specify whether split semantics should be used in new_group Oct 22, 2024
@kwen2501
Collaborator

It depends on whether we see split as implementation detail or not.

@shuqiangzhang
Contributor Author

shuqiangzhang commented Oct 22, 2024

> It depends on whether we see split as implementation detail or not.

Similar to the eager init/device id parameter, normal users who don't care much about init time or resources do not need to know about or set this. But if they do care, they have a choice that doesn't require changing their current code structure around new_group, and they are also encouraged to switch to split_group.

@wconstab
Contributor

my thought is that use_split defaults to false, which may lead to less usage of split. I think we need to balance between good design to separate split/eager, but also using defaults to encourage the path we want to encourage.

What are the tradeoffs of using new_group + use_split=True vs False? What about the tradeoffs of using new_group+use_split vs split_group?

Separately, if you want to discourage new_group and encourage split_group, how do we do that?

@shuqiangzhang
Contributor Author

shuqiangzhang commented Oct 22, 2024

> my thought is that use_split defaults to false, which may lead to less usage of split. I think we need to balance between good design to separate split/eager, but also using defaults to encourage the path we want to encourage.
>
> What are the tradeoffs of using new_group + use_split=True vs False? What about the tradeoffs of using new_group+use_split vs split_group?
>
> Separately, if you want to discourage new_group and encourage split_group, how do we do that?

Right. If you check my commits V2, V3, and V4, you can see me debating with myself: I changed the default of use_split from false to true and then back to false.

The intention of making the default false is exactly to encourage performance-sensitive users to use split_group instead of new_group. split_group is mostly stable now and is better than new_group + use_split=True for both of the strong reasons listed in #130407, which is why advanced users (including device mesh) are using split_group now.

Of course, we can only suggest this to users (via modified comments/docs), not force it, which is why we still need to keep new_group viable as the old option. For users who don't care about perf, not using split is fine. But once users feel the need for better performance, they can follow the suggestion.

@kwen2501
Collaborator

Shall we look at the new_group API solely?

There are a couple questions we'd need to answer imo:

  • For new_group, is split-based impl strictly better than bootstrap-based impl?
  • For new_group, do users need to know / express which impl is used?

@shuqiangzhang
Contributor Author

shuqiangzhang commented Oct 23, 2024

> Shall we look at the new_group API solely?
>
> There are a couple questions we'd need to answer imo:
>
>   • For new_group, is split-based impl strictly better than bootstrap-based impl?
>   • For new_group, do users need to know / express which impl is used?

From a perf point of view: new_group without split < new_group with split (< split_group). Most OSS users don't need to know the details, which is why it's an optional parameter. But advanced users (e.g., those who only want to eager init a sub-PG without using split, for various reasons) need this option.

shuqiangzhang added a commit that referenced this pull request Oct 25, 2024
ghstack-source-id: 80230d8
Pull Request resolved: #138518
@wconstab
Contributor

One more clarification: whether you use new_group with or without split, you still need all ranks to call new_group the same number of times in the same order, right? (That is, no visible semantic changes based on the split flag?)

@shuqiangzhang
Contributor Author

shuqiangzhang commented Oct 31, 2024

> One more clarification: whether you use new_group with or without split, you still need all ranks to call new_group the same number of times in the same order, right? (That is, no visible semantic changes based on the split flag?)

That's a good question that I hadn't thought about before. Indeed, if split is not used, I don't think every rank of the default PG needs to call new_group, so this requirement could be relaxed (let me verify this with some tests). Calling it on every rank does no harm and keeps things backward compatible.
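
For readers unfamiliar with the requirement being discussed, the sketch below models it in plain Python: every rank is expected to issue the same sequence of group-creation calls, in the same order. The helper `first_mismatch` and the `GroupSpec` alias are hypothetical names used only for illustration; nothing here calls into torch.distributed, and it says nothing about whether the requirement can be relaxed when split is not used, which is still to be verified.

```python
from typing import Dict, List, Tuple

GroupSpec = Tuple[int, ...]  # ranks that belong to one requested sub-group


def first_mismatch(calls_by_rank: Dict[int, List[GroupSpec]]) -> int:
    """Return the index of the first call where ranks disagree, or -1 if consistent.

    calls_by_rank maps each rank to the ordered list of sub-groups it asked
    to create. Ranks must agree on both the number and the order of calls.
    """
    sequences = list(calls_by_rank.values())
    longest = max((len(seq) for seq in sequences), default=0)
    for i in range(longest):
        # A rank that has run out of calls contributes None, which never
        # matches a real group spec, so a missing call counts as a mismatch.
        seen = {seq[i] if i < len(seq) else None for seq in sequences}
        if len(seen) > 1:
            return i
    return -1


# All four ranks create the same two groups in the same order.
consistent = {rank: [(0, 1), (2, 3)] for rank in range(4)}
# Rank 2 skips the group it does not belong to.
skipped = {0: [(0, 1), (2, 3)], 1: [(0, 1), (2, 3)], 2: [(2, 3)], 3: [(0, 1), (2, 3)]}

print(first_mismatch(consistent))
print(first_mismatch(skipped))
```

A check like this could be run on recorded call logs in a test; it is a debugging aid under the stated assumptions, not a description of how the library enforces ordering.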

@github-actions
Contributor

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Dec 30, 2024
@github-actions github-actions bot deleted the gh/shuqiangzhang/53/head branch February 9, 2025 02:09