Skip to content

Conversation

@shengfukevin
Copy link
Contributor

@shengfukevin shengfukevin commented Aug 27, 2024

Summary:
In a ProcessGroupNCCL, there are two different sequence no. for collective and p2p op. "wait" op need to know what type of op it waits.

In this DIFF, I changed RECORD_PARAM_COM interface to add a new entry "isP2P", this is a boolean varaible to flag if the op is a point to point op or not.

Differential Revision: D61882985

cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

@pytorch-bot
Copy link

pytorch-bot bot commented Aug 27, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/134619

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 2 Unrelated Failures

As of commit 3974e03 with merge base 86e03a6 (image):

NEW FAILURES - The following jobs have failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels Aug 27, 2024
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D61882985

@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D61882985

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why ++ here?

@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D61882985

shengfukevin added a commit to shengfukevin/pytorch that referenced this pull request Aug 30, 2024
…pytorch#134619)

Summary:
Pull Request resolved: pytorch#134619

In a ProcessGroupNCCL, there are two different sequence no. for collective and p2p op. "wait" op need to know what type of op it waits.

In this DIFF, I changed RECORD_PARAM_COM interface to add a new entry "isP2P", this is a boolean varaible to flag if the op is a point to point op or not.

Test Plan: buck2 run mode/opt kineto/libkineto/fb/integration_tests:pytorch_execution_trace_integration_test

Differential Revision: D61882985
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D61882985


#define RECORD_PARAM_COMMS( \
seq, \
isP2P, \
Copy link

@GSSBMW GSSBMW Aug 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about making seq be a tuple of (seq_id, is_p2p), but not a stand alone integer? Then make RECORD_PARAM_COMM and RECORD_PARAM_COMM_DATA have the same trace fromat, exept the first tensor. Otherwise, replayer needs to branch positional parsing.
After that, we will use (seq_id, is_p2p, pg_id) as key to match comm op and wait.

c10::intrusive_ptr<Work> ProcessGroupNCCL::barrier(const BarrierOptions& opts) {
RECORD_PARAM_COMMS(
static_cast<int>(
this->getSequenceNumberForGroup() + 1), // seq + 1 to match collective
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems we should remove the function getSequenceNumberForGroup(), but use seqCollective_ directly.

  1. func name can't represent it returns collective sequence id;
  2. keep conssistant with P2P, which use seqP2P directly;

Or we can reimplement getSequenceNumberForGroup() to:

uint64_t ProcessGroupNCCL::getSequenceNumberForGroup(OpType typ, bool batchP2P = false) {
  return isP2P(typ, batchP2P) ? seqP2P_ : seqCollective_;
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getSequenceNumberForGroup is a virtual function defined in root class BackEnd. P2P seems only used in ProcessGroupNCCL. Add a new input parameter probably does not make sense to other sub-classes.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it. Then how about using seqCollective_ directly?

rank,
opType,
seqCollective_,
isP2POp(opType) ? seqP2P_ : seqCollective_,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

according to isP2POp, batched P2P is not P2P op. So we need to check coalesing here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

batched P2P is implemented through Coalescing. The id we want to pick is based on if it is either a single P2P op/coalesced P2P op or a single collective op/coalesced collective op, does not matter it is coalesced or not.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. If it's coalesed p2p, seqCollective_ is bumped up in startCoalesing() and we pick seqCollective_ here.
Not sure whether this will be changed/rectified according to this comment from @fduwjj

Copy link
Contributor

@fduwjj fduwjj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Aug 30, 2024
@GSSBMW
Copy link

GSSBMW commented Aug 31, 2024

Let's wait for a conclusion of this issue?

@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D61882985

…pytorch#134619)

Summary:
Pull Request resolved: pytorch#134619

In a ProcessGroupNCCL, there are two different sequence no. for collective and p2p op. "wait" op need to know what type of op it waits.

In this DIFF, I changed RECORD_PARAM_COM interface to add a new entry "isP2P", this is a boolean varaible to flag if the op is a point to point op or not.

Test Plan: buck2 run mode/opt kineto/libkineto/fb/integration_tests:pytorch_execution_trace_integration_test

Reviewed By: c-p-i-o

Differential Revision: D61882985
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D61882985

RECORD_PARAM_COMMS_DATA(
static_cast<int>(
this->getSequenceNumberForGroup() + 1), // seq + 1 to match collective
this->seqP2P_ + 1), // seqP2P_ + 1 to match pointToPoint op
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems better to use tuple (seq, is_p2p) for RECORD_PARAM_COMMS_DATA. Otherewise, replayer needs to copy the logic of PyTorch to check collectiveName_. Keep consistant pattern would be better.

@GSSBMW
Copy link

GSSBMW commented Sep 16, 2024

@shengfukevin Please refer to this PR (#134578) or use it directly. Thanks!

@github-actions
Copy link
Contributor

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Nov 15, 2024
@github-actions github-actions bot closed this Dec 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/trunk Trigger trunk jobs on your pull request fb-exported oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category Stale

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants