Skip to content

Conversation

@pietern
Copy link
Contributor

@pietern pietern commented Sep 7, 2018

This change removes the skips for the existing send/recv tests in the backwards compatibility layer.

@pietern pietern added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Sep 7, 2018
@pietern pietern requested a review from teng-li September 7, 2018 17:39
Copy link
Contributor

@teng-li teng-li left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to store srcRank in work, let's also do that in MPI too

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

@pietern
Copy link
Contributor Author

pietern commented Sep 7, 2018

@teng-li In the MPI backend this is not needed as we capture it in the lambda. For the gloo backend we need to capture it because we don't have a "run" lambda. The operation gets kicked off and completes on the gloo device thread, not on the c10d thread pool. Then the returned work object can call waitSend or waitRecv with the srcRank pointer to wait for the operation to complete.

Updated to use std::make_shared and it works just fine casting/constructing a base class copy.

@pietern pietern force-pushed the c10d-gloo-send-recv branch from 2a9777e to dee3c39 Compare September 7, 2018 18:30
@teng-li
Copy link
Contributor

teng-li commented Sep 7, 2018

build still failing

@pietern
Copy link
Contributor Author

pietern commented Sep 7, 2018

The gloo bump surfaces a build issue with GCC 7. It's missing the include.

Submitted pytorch/gloo#131 to fix.

@pietern pietern force-pushed the c10d-gloo-send-recv branch from dee3c39 to 1c1b782 Compare September 7, 2018 21:15
@pietern
Copy link
Contributor Author

pietern commented Sep 7, 2018

Updated gloo submodule to include the GCC 7 fix.

Copy link
Contributor

@teng-li teng-li left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Watch for the CI

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pietern has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pietern is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

PenghuiCheng pushed a commit to PenghuiCheng/pytorch that referenced this pull request Sep 11, 2018
Summary:
This change removes the skips for the existing send/recv tests in the backwards compatibility layer.
Pull Request resolved: pytorch#11387

Reviewed By: teng-li

Differential Revision: D9729330

Pulled By: pietern

fbshipit-source-id: f8899219a94d806386d03e9ef53bff622d8658a3
@ezyang ezyang added the merged label Jun 26, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

oncall: distributed Add this issue/PR to distributed oncall triage queue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants