Conversation

@ezyang ezyang commented Aug 3, 2018

    Correctly share CUDA Parameters, requires_grad and hooks.
    
    Previously, the following was true:
    
    - If you put a Parameter wrapping a CUDA tensor
      into a multiprocessing queue (or otherwise tried to transfer it),
      this failed, saying that we cannot pickle CUDA storage.
      This is issue #9996 (a repro sketch follows this list).
    
    - If you put a leaf Tensor with requires_grad=True through the
      multiprocessing queue, it would come out the other end as
      requires_grad=False (it should have come out as
      requires_grad=True).  Similarly, backward hooks were lost.
    
    - If you put a non-leaf Tensor with requires_grad=True through
      the multiprocessing queue, it would also come out the other end
      as requires_grad=False.
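
    A minimal repro sketch of the first two items (the process setup and
    variable names here are illustrative, not taken from this PR):

    ```python
    import torch
    import torch.multiprocessing as mp
    from torch.nn import Parameter

    def consumer(q):
        p = q.get()
        t = q.get()
        # Previously, `t` arrived with requires_grad=False and without hooks.
        print(type(p), p.requires_grad, t.requires_grad)

    if __name__ == "__main__":
        ctx = mp.get_context("spawn")
        q = ctx.SimpleQueue()
        proc = ctx.Process(target=consumer, args=(q,))
        proc.start()
        cuda_param = Parameter(torch.randn(2, 2, device="cuda"))
        leaf = torch.randn(2, 2, requires_grad=True)
        q.put(cuda_param)  # previously failed: cannot pickle CUDA storage
        q.put(leaf)        # previously arrived with requires_grad=False
        proc.join()        # keep the sender's tensors alive until the consumer exits
    ```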
    
    The root cause of the first issue was that the reduction for
    Parameter fell back to the superclass (Tensor) implementation in
    __reduce_ex__, which always picks up the non-ForkingPickler
    reduction and therefore doesn't work with CUDA tensors.  So, we
    registered a new ForkingPickler reduction specifically for
    Parameter, and adjusted the code to correctly rewrap a Tensor in
    a Parameter if it was originally a Parameter.
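
    Roughly, the mechanism looks like the sketch below (helper names are
    illustrative; the real reduction lives in torch/multiprocessing/reductions.py
    and differs in detail):

    ```python
    from multiprocessing.reduction import ForkingPickler
    from torch.nn import Parameter

    def rebuild_parameter(data, requires_grad, backward_hooks):
        # Rewrap the shared tensor in a Parameter and restore its metadata.
        param = Parameter(data, requires_grad=requires_grad)
        param._backward_hooks = backward_hooks
        return param

    def reduce_parameter(param):
        # param.data is a plain Tensor, so ForkingPickler recursively applies
        # the CUDA-aware tensor reduction to it; rebuild_parameter then
        # rewraps the result as a Parameter on the receiving side.
        return (rebuild_parameter,
                (param.data, param.requires_grad, param._backward_hooks))

    ForkingPickler.register(Parameter, reduce_parameter)
    ```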
        
    While working on this, we realized that requires_grad and backward
    hooks were not preserved by the ForkingPickler reduction
    implementation.  We fixed the reducer to save these properties.
    However, Adam Paszke pointed out that we shouldn't allow sending
    non-leaf Tensors with requires_grad=True over a multiprocessing
    queue, since we don't actually support autograd across process
    boundaries.  We now throw an error in this case; this may cause
    previously working code to fail, but it is easy enough to fix:
    just detach() the tensor before sending it, as the error message
    explains.
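
    For code that trips over the new error, the fix looks like this
    (variable names are illustrative):

    ```python
    # `non_leaf` has a grad_fn, so sending it directly now raises an error;
    # detach() first to send just the values.
    queue.put(non_leaf.detach())
    ```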
    
    Fixes #9996.

@facebook-github-bot facebook-github-bot left a comment

ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ailzhang ailzhang left a comment

Thanks @ezyang! This fixes the issue for me. The lint test probably just needs a retest.

Correctly share CUDA Parameters, requires_grad and hooks.

Signed-off-by: Edward Z. Yang <[email protected]>
@ezyang ezyang force-pushed the pr/fix-cuda-parameter-sharing branch from e5b61f2 to f247c31 on August 6, 2018 16:36

ezyang commented Aug 6, 2018

@ailzhang @fmassa @apaszke OK, the patch has been updated to fix another bug with multiprocessing requires_grad sharing, and the code is a bit clearer now.

@facebook-github-bot facebook-github-bot left a comment

ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

ezyang added 2 commits August 6, 2018 11:05
Signed-off-by: Edward Z. Yang <[email protected]>
Signed-off-by: Edward Z. Yang <[email protected]>
@facebook-github-bot facebook-github-bot left a comment

ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@colesbury colesbury left a comment

I think you can avoid the `with open(os.devnull, "w")` and just use `ctx.SimpleQueue` with normal try-except behavior.
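
A sketch of the suggested pattern (assuming the check added in this PR raises a RuntimeError; the actual test may differ):

```python
import torch
import torch.multiprocessing as mp

ctx = mp.get_context("spawn")
q = ctx.SimpleQueue()

leaf = torch.randn(5, 5, requires_grad=True)
non_leaf = leaf * 2  # has a grad_fn, i.e. not a leaf

try:
    q.put(non_leaf)
except RuntimeError:
    pass  # expected: non-leaf tensors that require grad can't be sent
```

Because SimpleQueue serializes in the calling process (unlike Queue, which pickles in a background feeder thread), the error can be caught with a plain try-except instead of being redirected to os.devnull.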

for device in devices:
    var0 = Variable(torch.arange(1., 26, device=device).view(5, 5), requires_grad=True)
    var = var0 * 2
    # We can't do the pickling indirectly, e.g., with a queue.put,


devices.append('cuda')
for device in devices:
    for requires_grad in [True, False]:
        var = Variable(torch.arange(1., 26, device=device).view(5, 5), requires_grad=requires_grad)


Signed-off-by: Edward Z. Yang <[email protected]>
@facebook-github-bot facebook-github-bot left a comment

ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

goodlux pushed a commit to goodlux/pytorch that referenced this pull request Aug 15, 2018
Summary:
```
    Correctly share CUDA Parameters, requires_grad and hooks.

    Fixes pytorch#9996.
```
Pull Request resolved: pytorch#10220

Differential Revision: D9160746

Pulled By: ezyang

fbshipit-source-id: a39c0dbc012ba5afc7a9e646da5c7f325b3cf05c
@ezyang ezyang added the merged label Jun 26, 2019