Conversation

@mrshenli
Contributor

Fixes #20651

Communication collectives in torch.distributed call CUDACachingAllocator::recordStream() on input and output tensors to prevent their memory blocks from being freed too early. CUDACachingAllocator uses the tensor's data pointer to track memory blocks and does not accept null pointers. However, an empty tensor's storage().data() might be null. In that case, since there is no memory block associated with the empty tensor, it is fine to make recordStream() a no-op.
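
For illustration only (not part of this diff), a minimal Python sketch of the situation on a CUDA-enabled build: an empty CUDA tensor owns no memory block, so there is nothing for the caching allocator to track.

```python
import torch

# Illustrative sketch, assuming a CUDA-enabled build. A tensor with zero
# elements gets no allocation from the CUDA caching allocator, so its data
# pointer is expected to be null (reported as 0 from Python). recordStream()
# therefore has no memory block to look up for such a tensor.
t = torch.empty(0, device="cuda")
print(t.numel())     # 0
print(t.data_ptr())  # expected to be 0: no backing memory block
```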

Tests only cover broadcasting empty tensors on the NCCL backend, because GLOO does not support empty inputs (pytorch/gloo#179). That gap can be addressed either in ProcessGroupGloo or in GLOO itself; more tests will be added once it is filled.
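
For reference, a rough standalone sketch of the scenario being tested (hypothetical; the actual tests live in the distributed test suite), assuming a NCCL build with at least two GPUs:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

# Hypothetical standalone sketch: broadcasting an empty CUDA tensor should be
# a no-op instead of tripping over a null data pointer in recordStream().
# Assumes a NCCL-enabled build with at least two GPUs.

def run(rank, world_size):
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    t = torch.empty(0, device="cuda")  # empty tensor, null data pointer
    dist.broadcast(t, src=0)           # should complete without error
    assert t.numel() == 0
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(run, args=(2,), nprocs=2)
```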

@mrshenli mrshenli requested a review from colesbury May 17, 2019 20:26
@mrshenli mrshenli requested review from apaszke and pietern as code owners May 17, 2019 20:26
@pytorchbot pytorchbot added the module: cuda, oncall: distributed, and module: internals labels on May 17, 2019
@mrshenli mrshenli added the triaged label on May 17, 2019
Contributor

@facebook-github-bot facebook-github-bot left a comment

@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@mrshenli merged this pull request in 8acaa28.

Development

Successfully merging this pull request may close these issues.

torch.distributed.reduce empty tensor bug
