Skip to content

Conversation

@ssnl
Copy link
Collaborator

@ssnl ssnl commented Apr 29, 2019

The mp notes are not updated after #16854. (The torch.multiprocessing page is.)

@ssnl ssnl requested a review from VitalyFedyunin April 29, 2019 04:28
@pytorchbot pytorchbot added the module: docs Related to our documentation, both in docs/ and docblocks label Apr 29, 2019
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should still warn that even this refcounting can't save you if the child process exits!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure I will add a sentence. Do you mean that it can't save you in the sense that

  1. the refcount is not decremented when child process exits abnormally,
    or
  2. the child process will segfault if the sending process exits?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added both. Let me know if they aren't what you expected, or are incorrect.

@ssnl ssnl force-pushed the cuda-multi-doc branch from 62c28fe to 79dd56c Compare May 4, 2019 02:01
@ssnl
Copy link
Collaborator Author

ssnl commented May 4, 2019

@VitalyFedyunin Could you review this PR?

@ezyang
Copy link
Contributor

ezyang commented May 6, 2019

Yes, @VitalyFedyunin, it would be great if you could to take a look. @ssnl poke me about this in a few days if it is still not reviewed by then.

@VitalyFedyunin
Copy link
Contributor

on it

as long as the receiving process retains a copy of the tensor. It is implemented
under the hood but requires users to follow the best practices for the program
to run correctly. For example, the sending process must stay alive as long as
the consumer process has references to the tensor, and the refcounting can not
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, the sending process must stay alive as long as
the consumer process has references to the tensor.

Not sure if it obvious or worth mentioning, but sending process must stay alive as long as there are non zero elements in the 'outgoing' queue.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh interesting. What will happen when there are shared CUDA tensor in queues but the sending process exits? Are those not freed by CUDA?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, as it is consumer's code which is doing ref-dec. In theory we can add this corner case, but it will complicate things by a lot.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. But if the consumer either (1) eventually dies, or (2) gets the tensor from the queue and it gets GC'ed. The memory is still freed, right?

@ssnl
Copy link
Collaborator Author

ssnl commented May 14, 2019

@pytorchbot merge this please

@pytorchbot pytorchbot added the merge-this-please Was marked for merge with @pytorchbot merge this please label May 14, 2019
Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@soumith is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@soumith merged this pull request in 83fe928.

@ssnl ssnl deleted the cuda-multi-doc branch May 27, 2019 05:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

merge-this-please Was marked for merge with @pytorchbot merge this please module: docs Related to our documentation, both in docs/ and docblocks open source

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants