Skip to content

Conversation

@pietern
Copy link
Contributor

@pietern pietern commented Sep 5, 2018

The existing tests had every rank run send to every other rank and only
then switch to recv mode. This only works if the send operations are
non-blocking and the passed tensors are immediately copied to some kind
of send buffer. Instead, every send must be matched with a recv on the
other side, because from the API perspective they may block.

E.g. imagine a 1GB tensor being sent to every other rank. It can only go
through if there is a recv on the other side, or it will deadlock.

This change reflects this in the send/recv unit tests.

@pietern
Copy link
Contributor Author

pietern commented Sep 5, 2018

FYI, I'm waiting for CI to confirm the tests don't break for MPI and/or THD. I have confirmed they work with the upcoming gloo send/recv support.

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pietern has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

The existing tests had every rank run send to every other rank and only
then switch to recv mode. This only works if the send operations are
non-blocking and the passed tensors are immediately copied to some kind
of send buffer. Instead, every send must be matched with a recv on the
other side, because from the API perspective they may block.

E.g. imagine a 1GB tensor being sent to every other rank. It can only go
through if there is a recv on the other side, or it will deadlock.

This change reflects this in the send/recv unit tests.
@pietern pietern force-pushed the distributed-send-recv-test branch from 37317d2 to 8b7bf53 Compare September 5, 2018 19:35
Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pietern has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Copy link
Contributor

@teng-li teng-li left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

PenghuiCheng pushed a commit to PenghuiCheng/pytorch that referenced this pull request Sep 11, 2018
Summary:
The existing tests had every rank run send to every other rank and only
then switch to recv mode. This only works if the send operations are
non-blocking and the passed tensors are immediately copied to some kind
of send buffer. Instead, every send must be matched with a recv on the
other side, because from the API perspective they may block.

E.g. imagine a 1GB tensor being sent to every other rank. It can only go
through if there is a recv on the other side, or it will deadlock.

This change reflects this in the send/recv unit tests.
Pull Request resolved: pytorch#11275

Differential Revision: D9658197

Pulled By: pietern

fbshipit-source-id: fb6a3fc03b42343a9dfeed0def30d94914e76974
@ezyang ezyang added the merged label Jun 26, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants