Skip to content

Conversation

@mrshenli
Copy link
Contributor

@mrshenli mrshenli commented Jun 6, 2019

Fix #20421

ProcessGroupGloo only requires input/output tensors to be contiguous. Contiguous tensors might not start from the beginning of the underlying storage, e.g., chunk(..., dim=0)[1]. The current implementation passes tensor.storage().data() ptr to gloo buffer. This leads to wrong results if the tensor has a non-zero storage offset.

The proposed solution is to use tensor.data_ptr() instead. Let's see if this breaks any tests.

cc @qijianan777

@mrshenli mrshenli requested review from apaszke and pietern as code owners June 6, 2019 20:57
@pytorchbot pytorchbot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Jun 6, 2019
@mrshenli
Copy link
Contributor Author

mrshenli commented Jun 6, 2019

This might fix #21480 as well, let me add a test. No, different problem.

@mrshenli mrshenli changed the title Fix ProcessGroupGloo allgather for tensors with shared storage [WIP] Fix ProcessGroupGloo allgather for tensors with shared storage Jun 7, 2019
@pietern
Copy link
Contributor

pietern commented Jun 10, 2019

I suppose this is not specific to tensors with shared storage but to all views more generally, right?

@mrshenli
Copy link
Contributor Author

I suppose this is not specific to tensors with shared storage but to all views more generally, right?

I think it applies to all cases where tensor.data() != tensor.storage().data(). So, not necessarily all views, we could still have a view representing the first row of a 2D contiguous tensor not hitting the error.

@mrshenli mrshenli changed the title [WIP] Fix ProcessGroupGloo allgather for tensors with shared storage Fix ProcessGroupGloo allgather for tensors with shared storage Jun 11, 2019
Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Copy link
Contributor

@pietern pietern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove getDataPointer in this PR or is it still used somewhere?

@mrshenli
Copy link
Contributor Author

Can we remove getDataPointer in this PR or is it still used somewhere?

@pietern I found that getDataPointer and getDataPointers are only used in ProcessGroupGloo for now. So, I moved the change and comments to getDataPointer instead.

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@mrshenli merged this pull request in 39d4121.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Merged oncall: distributed Add this issue/PR to distributed oncall triage queue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

The result of gloo all_gather error

6 participants