-
Notifications
You must be signed in to change notification settings - Fork 26.3k
Pass in the appropiate sequence no. for "wait" op #134619
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/134619
Note: Links to docs will display an error until the docs builds have been completed. ❌ 2 New Failures, 2 Unrelated FailuresAs of commit 3974e03 with merge base 86e03a6 ( NEW FAILURES - The following jobs have failed:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
|
This pull request was exported from Phabricator. Differential Revision: D61882985 |
0b29c7e to
ac35fdf
Compare
|
This pull request was exported from Phabricator. Differential Revision: D61882985 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why ++ here?
|
This pull request was exported from Phabricator. Differential Revision: D61882985 |
ac35fdf to
2c98af6
Compare
…pytorch#134619) Summary: Pull Request resolved: pytorch#134619 In a ProcessGroupNCCL, there are two different sequence no. for collective and p2p op. "wait" op need to know what type of op it waits. In this DIFF, I changed RECORD_PARAM_COM interface to add a new entry "isP2P", this is a boolean varaible to flag if the op is a point to point op or not. Test Plan: buck2 run mode/opt kineto/libkineto/fb/integration_tests:pytorch_execution_trace_integration_test Differential Revision: D61882985
2c98af6 to
361be42
Compare
|
This pull request was exported from Phabricator. Differential Revision: D61882985 |
|
|
||
| #define RECORD_PARAM_COMMS( \ | ||
| seq, \ | ||
| isP2P, \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about making seq be a tuple of (seq_id, is_p2p), but not a stand alone integer? Then make RECORD_PARAM_COMM and RECORD_PARAM_COMM_DATA have the same trace fromat, exept the first tensor. Otherwise, replayer needs to branch positional parsing.
After that, we will use (seq_id, is_p2p, pg_id) as key to match comm op and wait.
| c10::intrusive_ptr<Work> ProcessGroupNCCL::barrier(const BarrierOptions& opts) { | ||
| RECORD_PARAM_COMMS( | ||
| static_cast<int>( | ||
| this->getSequenceNumberForGroup() + 1), // seq + 1 to match collective |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems we should remove the function getSequenceNumberForGroup(), but use seqCollective_ directly.
- func name can't represent it returns collective sequence id;
- keep conssistant with P2P, which use
seqP2Pdirectly;
Or we can reimplement getSequenceNumberForGroup() to:
uint64_t ProcessGroupNCCL::getSequenceNumberForGroup(OpType typ, bool batchP2P = false) {
return isP2P(typ, batchP2P) ? seqP2P_ : seqCollective_;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
getSequenceNumberForGroup is a virtual function defined in root class BackEnd. P2P seems only used in ProcessGroupNCCL. Add a new input parameter probably does not make sense to other sub-classes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it. Then how about using seqCollective_ directly?
| rank, | ||
| opType, | ||
| seqCollective_, | ||
| isP2POp(opType) ? seqP2P_ : seqCollective_, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
according to isP2POp, batched P2P is not P2P op. So we need to check coalesing here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
batched P2P is implemented through Coalescing. The id we want to pick is based on if it is either a single P2P op/coalesced P2P op or a single collective op/coalesced collective op, does not matter it is coalesced or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fduwjj
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
Let's wait for a conclusion of this issue? |
361be42 to
d9bf310
Compare
|
This pull request was exported from Phabricator. Differential Revision: D61882985 |
…pytorch#134619) Summary: Pull Request resolved: pytorch#134619 In a ProcessGroupNCCL, there are two different sequence no. for collective and p2p op. "wait" op need to know what type of op it waits. In this DIFF, I changed RECORD_PARAM_COM interface to add a new entry "isP2P", this is a boolean varaible to flag if the op is a point to point op or not. Test Plan: buck2 run mode/opt kineto/libkineto/fb/integration_tests:pytorch_execution_trace_integration_test Reviewed By: c-p-i-o Differential Revision: D61882985
d9bf310 to
3974e03
Compare
|
This pull request was exported from Phabricator. Differential Revision: D61882985 |
| RECORD_PARAM_COMMS_DATA( | ||
| static_cast<int>( | ||
| this->getSequenceNumberForGroup() + 1), // seq + 1 to match collective | ||
| this->seqP2P_ + 1), // seqP2P_ + 1 to match pointToPoint op |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems better to use tuple (seq, is_p2p) for RECORD_PARAM_COMMS_DATA. Otherewise, replayer needs to copy the logic of PyTorch to check collectiveName_. Keep consistant pattern would be better.
|
@shengfukevin Please refer to this PR (#134578) or use it directly. Thanks! |
|
Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as |
Summary:
In a ProcessGroupNCCL, there are two different sequence no. for collective and p2p op. "wait" op need to know what type of op it waits.
In this DIFF, I changed RECORD_PARAM_COM interface to add a new entry "isP2P", this is a boolean varaible to flag if the op is a point to point op or not.
Differential Revision: D61882985
cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o