Fix ProcessGroupGloo allgather for tensors with shared storage #21490

mrshenli · 2019-06-06T20:57:19Z

Fix #20421

ProcessGroupGloo only requires input/output tensors to be contiguous. A contiguous tensor does not necessarily start at the beginning of its underlying storage, e.g., chunk(..., dim=0)[1]. The current implementation passes the tensor.storage().data() pointer to the gloo buffer, which leads to wrong results whenever the tensor has a non-zero storage offset.

The proposed solution is to use tensor.data_ptr() instead. Let's see if this breaks any tests.
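To make the failure mode concrete, here is a minimal sketch in plain Python. It is a toy model, not PyTorch or gloo code: the `View` class and its two read methods are hypothetical names that only mimic the difference between reading from the storage base (as with tensor.storage().data()) and reading from the view's first element (as with tensor.data_ptr()).

```python
from array import array


class View:
    """Toy stand-in for a contiguous tensor that shares a storage (hypothetical)."""

    def __init__(self, storage, offset, numel):
        self.storage = storage  # shared underlying buffer
        self.offset = offset    # element offset of the view's first item
        self.numel = numel      # number of elements in the view

    def read_from_storage_start(self):
        # Mimics the buggy path: the offset is ignored.
        return list(self.storage[: self.numel])

    def read_from_data_ptr(self):
        # Mimics the fixed path: reading starts at the view's first element.
        return list(self.storage[self.offset : self.offset + self.numel])


storage = array("f", [0, 1, 2, 3, 4, 5])

# Similar in spirit to chunk(2, dim=0)[1] on a 6-element tensor:
# contiguous, but starting 3 elements into the storage.
second_chunk = View(storage, offset=3, numel=3)

buggy = second_chunk.read_from_storage_start()
fixed = second_chunk.read_from_data_ptr()
print(buggy, fixed)
```

In this model the buggy read returns the first chunk's values instead of the second chunk's, which is the kind of wrong result described above.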

cc @qijianan777


mrshenli · 2019-06-06T20:59:02Z

~~This might fix #21480 as well, let me add a test.~~ No, different problem.

pietern · 2019-06-10T08:07:11Z

I suppose this is not specific to tensors with shared storage but to all views more generally, right?

mrshenli · 2019-06-10T15:28:34Z

> I suppose this is not specific to tensors with shared storage but to all views more generally, right?

I think it applies to all cases where tensor.data() != tensor.storage().data(). So it is not necessarily all views: a view representing the first row of a 2D contiguous tensor starts at the beginning of the storage and would not hit the error.
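The condition above can be sketched with simple arithmetic, assuming a row-major (contiguous) 2D layout. The helper names `row_storage_offset` and `hits_bug` are illustrative only and do not exist in PyTorch.

```python
def row_storage_offset(row, num_cols):
    # In a row-major 2D layout, row `row` starts this many elements
    # into the underlying storage.
    return row * num_cols


def hits_bug(storage_offset):
    # The storage base and the view's first element coincide only
    # when the offset is zero.
    return storage_offset != 0


# Row views of a 3 x 4 contiguous tensor.
offsets = [row_storage_offset(r, num_cols=4) for r in range(3)]
affected = [hits_bug(o) for o in offsets]
print(offsets, affected)
```

Only the first row has offset 0, so it is the one row view for which the two pointers agree.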

facebook-github-bot

@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

pietern

Can we remove getDataPointer in this PR or is it still used somewhere?

mrshenli · 2019-06-12T14:09:39Z

> Can we remove getDataPointer in this PR or is it still used somewhere?

@pietern I found that getDataPointer and getDataPointers are only used in ProcessGroupGloo for now. So, I moved the change and comments to getDataPointer instead.
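As a rough illustration of why putting the change in one helper is convenient, here is a sketch that borrows the names `get_data_pointer` and `get_data_pointers` from the discussion. It is not the actual C++ implementation: it assumes a Python `array` as the storage and computes the address of a view's first element as base address plus offset times item size.

```python
import ctypes
from array import array


def get_data_pointer(storage, storage_offset):
    # Address of the view's first element, not of the storage base.
    base_address, _length = storage.buffer_info()
    return base_address + storage_offset * storage.itemsize


def get_data_pointers(storage, storage_offsets):
    # Plural form built on the singular helper, so both share one definition.
    return [get_data_pointer(storage, off) for off in storage_offsets]


storage = array("f", [0, 1, 2, 3, 4, 5])
pointers = get_data_pointers(storage, [0, 3])

# Read three floats back from each address to check where it points.
windows = [list((ctypes.c_float * 3).from_address(p)) for p in pointers]
print(windows)
```

Because every caller goes through the same helper, correcting the pointer computation there corrects it for all of them.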

facebook-github-bot

@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot · 2019-06-12T19:07:18Z

@mrshenli merged this pull request in 39d4121.

Fix ProcessGroupGloo for tensor with contiguous but not exclusive sto…

4898c64

…rage

mrshenli requested review from apaszke and pietern as code owners June 6, 2019 20:57

pytorchbot added the oncall: distributed label Jun 6, 2019

mrshenli changed the title ~~Fix ProcessGroupGloo allgather for tensors with shared storage~~ [WIP] Fix ProcessGroupGloo allgather for tensors with shared storage Jun 7, 2019

mrshenli added 2 commits June 11, 2019 12:21

Merge remote-tracking branch 'upstream/master' into allgather

6eef998

Move tests to test_c10d_spawn

04009b8

mrshenli changed the title ~~[WIP] Fix ProcessGroupGloo allgather for tensors with shared storage~~ Fix ProcessGroupGloo allgather for tensors with shared storage Jun 11, 2019

facebook-github-bot reviewed Jun 11, 2019


pietern approved these changes Jun 12, 2019


address comments

1ec9051

facebook-github-bot reviewed Jun 12, 2019


facebook-github-bot closed this in 39d4121 Jun 12, 2019

facebook-github-bot added the merged label Jun 12, 2019

mruberry added the Merged label Oct 28, 2020

gcramer23 added high priority and removed high priority labels Jun 22, 2021
