coalescing implementation in ProcessGroupNCCL.hpp/cpp #134833

@GSSBMW

🐛 Describe the bug

Hi all. I'm developing replay support for Execution Trace. I'm confused about the two points below and not sure whether they are bugs. Could you help confirm or explain? Thanks!

  1. Consider a call to torch.distributed.batch_isend_irecv().
    Several send/recv ops are issued within the coalesced range. startCoalescing bumps seqCollective_, but not seqP2P_. Is this by design?

  2. Consider an allgather invoked with different sizes, which is converted into multiple _broadcast_oop() calls within a coalesced range, each implemented via collective.
    Unlike coalesced send/recv, each collective always bumps seqCollective_ and creates a work object. However, the work is returned but not used.
    Coalesced P2P, by contrast, does not bump seqP2P_ and does not create a work object. Why do the two patterns differ?
    [screenshot]
    [screenshot]

The final result is:
[screenshot]
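
To make the two patterns in points 1 and 2 concrete, here is a toy Python model of the counters as described above. It is only a sketch of the observed behavior, not the actual ProcessGroupNCCL code: the class `ToyProcessGroup` and all of its method and field names are hypothetical, and the non-coalesced P2P branch (bump seqP2P_ and create a work) is an assumption.

```python
class ToyProcessGroup:
    """Toy model of the sequence counters described in this issue.

    Hypothetical names throughout; this is not real PyTorch code.
    """

    def __init__(self):
        self.seq_collective = 0  # stands in for seqCollective_
        self.seq_p2p = 0         # stands in for seqP2P_
        self.works_created = 0   # work objects created by individual ops
        self.coalescing = False

    def start_coalescing(self):
        # Point 1: only the collective counter is bumped here.
        self.coalescing = True
        self.seq_collective += 1

    def end_coalescing(self):
        self.coalescing = False

    def collective(self):
        # Point 2: each collective bumps the counter and creates a work,
        # even inside a coalesced range.
        self.seq_collective += 1
        self.works_created += 1

    def point_to_point(self):
        if self.coalescing:
            # Point 2: coalesced P2P bumps nothing and creates no work.
            return
        # Assumption: outside coalescing, P2P bumps seqP2P_ and creates a work.
        self.seq_p2p += 1
        self.works_created += 1


# Coalesced send/recv, as in batch_isend_irecv().
pg_p2p = ToyProcessGroup()
pg_p2p.start_coalescing()
pg_p2p.point_to_point()
pg_p2p.point_to_point()
pg_p2p.end_coalescing()
p2p_state = (pg_p2p.seq_collective, pg_p2p.seq_p2p, pg_p2p.works_created)

# Coalesced collectives, as in allgather split into two _broadcast_oop() calls.
pg_coll = ToyProcessGroup()
pg_coll.start_coalescing()
pg_coll.collective()
pg_coll.collective()
pg_coll.end_coalescing()
coll_state = (pg_coll.seq_collective, pg_coll.seq_p2p, pg_coll.works_created)

print("coalesced p2p:        ", p2p_state)
print("coalesced collectives:", coll_state)
```

In this model, two coalesced P2P ops leave seqP2P_ untouched and advance seqCollective_ once, while two coalesced collectives advance seqCollective_ three times and create two unused work objects. That asymmetry is what a trace replayer would have to account for.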

Versions

Top of tree (ToT) main

cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

Labels: oncall: distributed, triaged
