Fix cuda multiprocessing cached memory #14736
aten/src/THC/THCAllocator.cpp
Outdated
  int cur_device;
  THCudaCheck(cudaGetDevice(&cur_device));
- auto* context = new THCIpcDeleter(data, device);
+ auto* context = new THCIpcDeleter(basePtr, device);
Do a std::move(basePtr) here
aten/src/THC/THCAllocator.cpp
Outdated
- THCIpcDeleter::THCIpcDeleter(void* data, int device)
-   : data_(data), device_(device) {}
+ THCIpcDeleter::THCIpcDeleter(std::shared_ptr<void> basePtr, int device)
+   : basePtr_(basePtr), device_(device) {}
Also do a std::move here
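For reference, a minimal, self-contained sketch of the pattern the reviewer is asking for (the struct name and usage below are illustrative, not the actual THC code): take the shared_ptr parameter by value and std::move it into the member, so only one refcount bump is paid at the call site.

```cpp
#include <memory>
#include <utility>

// Illustrative stand-in for THCIpcDeleter: it keeps the base allocation
// alive by holding a shared_ptr to it.
struct IpcDeleterSketch {
  // Pass-by-value + std::move: the caller's shared_ptr is moved in rather
  // than copied, avoiding an extra atomic refcount increment.
  IpcDeleterSketch(std::shared_ptr<void> basePtr, int device)
      : basePtr_(std::move(basePtr)), device_(device) {}

  std::shared_ptr<void> basePtr_;
  int device_;
};

int main() {
  std::shared_ptr<void> base = std::make_shared<int>(0);
  // std::move at the call site as well, as suggested above.
  auto context = std::make_unique<IpcDeleterSketch>(std::move(base), /*device=*/0);
  (void)context;
}
```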
torch/multiprocessing/reductions.py
Outdated
tensor.size(),
tensor.stride(),
tensor_offset + storage_offset,
tensor_offset,
It would be nice to clarify what these variables mean now
  THWStoragePtr base(THWStorage_(newWithDataAndAllocator)(
    LIBRARY_STATE
-   THCIpcDeleter::makeDataPtr(devPtr, device),
+   THCIpcDeleter::makeDataPtr(basePtr, devPtr, device),
std::move here

I'm still working on two failing tests: test_empty_tensor_sharing_cuda and test_cuda_small_tensors (multiple devices).

The above two tests are fixed now.
facebook-github-bot
left a comment
@ailzhang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
aten/src/THC/THCAllocator.cpp
Outdated
at::DataPtr THCIpcDeleter::makeDataPtr(void* data, int device) {
  // Refer to NB [CUDA IPC and the caching allocator] for more details
  // basePtr - device ptr of allocated CUDA memory region. This memory region
nit: "memory region" here is a bit ambiguous. A more specific moniker is "a single cudaMalloc allocation; this may be a large block of memory which is managed by the caching allocator".
aten/src/THC/THCAllocator.cpp
Outdated
// construct the new storage.
// Every time a storage on the memory region go out of scope, the ref_count
// of basePtr will be decreased 1, until it's closed in its deleter (calling
// cudaIpoCloseMemHandle) when ref_count is 0.
nit: cudaIpcCloseMemHandle (Ipo)
aten/src/THC/THCAllocator.cpp
Outdated
// device - device of memory
// Here basePtr should be saved in the struct, while data should be used to
// construct the new storage.
// Every time a storage on the memory region go out of scope, the ref_count
nit: Well, it's whatever the shared pointer's internal refcount is; "ref_count" seems to imply that we're manually refcounting something under the name ref_count, but we're not.
grammar: Every time a storage referring to the IPC memory region goes out of scope, the reference count on the memory region is decreased by one; when it reaches zero, the IPC memory region is closed (by calling cudaIpcCloseMemHandle).
aten/src/THC/THCAllocator.cpp
Outdated
THCudaCheck(cudaGetDevice(&prev_device));
THCudaCheck(cudaSetDevice(device_));
THCudaCheck(cudaIpcCloseMemHandle(data_));
THCudaCheck(cudaSetDevice(prev_device));
You don't need the cudaSetDevice here anymore. In fact, you don't need this destructor at all.
aten/src/THC/THCCachingAllocator.cpp
Outdated
devPtr,
[curr_device](void *ptr) {
  THCudaCheck(cudaSetDevice(curr_device));
  THCudaCheck(cudaIpcCloseMemHandle(ptr));});
This deleter doesn't look like it's actually removing entries from the table. So it seems to me that we are leaking memory, because ipcMemHandle_to_devptr will just gradually get larger, without ever being GC'ed.
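To make the concern concrete, here is a hedged sketch (hypothetical names, not the PR's actual code; error handling via THCudaCheck is omitted) of a handle-to-pointer cache whose deleter restores the device, closes the IPC handle, and erases its own cache entry, so the table cannot grow without bound:

```cpp
#include <cuda_runtime.h>

#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical cache: raw IPC handle bytes -> weak_ptr to the mapped base
// pointer. weak_ptr entries do not keep the mapping alive by themselves.
static std::mutex g_ipc_mutex;
static std::unordered_map<std::string, std::weak_ptr<void>> g_handle_to_devptr;

std::shared_ptr<void> get_cached_ipc_devptr(const std::string& handle_bytes,
                                            int device) {
  std::lock_guard<std::mutex> lock(g_ipc_mutex);
  auto it = g_handle_to_devptr.find(handle_bytes);
  if (it != g_handle_to_devptr.end()) {
    if (auto alive = it->second.lock()) {
      return alive;  // the mapping is still open; reuse it
    }
  }
  // Open the mapping once for this process.
  auto handle = *reinterpret_cast<const cudaIpcMemHandle_t*>(handle_bytes.data());
  void* devPtr = nullptr;
  cudaSetDevice(device);
  cudaIpcOpenMemHandle(&devPtr, handle, cudaIpcMemLazyEnablePeerAccess);
  // The deleter closes the handle on the owning device and, crucially,
  // erases the cache entry so the table does not leak.
  std::shared_ptr<void> sp(devPtr, [handle_bytes, device](void* ptr) {
    int prev = 0;
    cudaGetDevice(&prev);
    cudaSetDevice(device);
    cudaIpcCloseMemHandle(ptr);
    cudaSetDevice(prev);
    std::lock_guard<std::mutex> guard(g_ipc_mutex);
    g_handle_to_devptr.erase(handle_bytes);
  });
  g_handle_to_devptr[handle_bytes] = sp;
  return sp;
}
```

The erase-in-the-deleter step is exactly what the comment above is pointing at: without it, expired weak_ptr entries accumulate forever.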
aten/src/THC/THCCachingAllocator.cpp
Outdated
// is called by receiver process to get access to the memory where the tensor
// was built on sender process.
//
// CUDA IPC only allows sharing a big memory block associated with a IpcMemHandle,
A general nit when writing documentation like this: if you write "IpcMemHandle" (capitalized specifically in this way), the implication is that this refers to an actual /name/ of some sort in the codebase. However, in this case, you're using this as an abbreviation for cudaIpcMemHandle_t. It's better to spell it out entirely, so that if someone greps for cudaIpcMemHandle_t they hit this code (they are probably not going to grep for IpcMemHandle).
aten/src/THC/THCAllocator.cpp
Outdated
}

at::DataPtr THCIpcDeleter::makeDataPtr(void* data, int device) {
  // Refer to NB [CUDA IPC and the caching allocator] for more details
nittiest of nits: the prevailing convention in the codebase is Note [Blah blah], rather than NB ;)
aten/src/THC/THCCachingAllocator.cpp
Outdated
//
// In CUDA IPC, sender sends a tensor to receiver, THCCaching_CUDAIpcDevptr
// is called by receiver process to get access to the memory where the tensor
nit: I'd probably say something more like, is called by the receiving process to map the CUDA memory from the sending process into its own address space.
aten/src/THC/THCCachingAllocator.cpp
Outdated
if (ipcMemHandle_to_devptr.find(handle) == ipcMemHandle_to_devptr.end()
    || ipcMemHandle_to_devptr[handle].expired()) {
  void *devPtr = nullptr;
  cudaIpcMemHandle_t ipc_handle = *(cudaIpcMemHandle_t*)handle.c_str();
nit: in C++ land, we use a reinterpret_cast here.
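A small illustration of the nit, with names made up for the example: replace the C-style cast with reinterpret_cast (or memcpy the bytes, which also sidesteps alignment questions).

```cpp
#include <cstring>
#include <string>

#include <cuda_runtime.h>

// handle_bytes carries the raw bytes of a cudaIpcMemHandle_t in a std::string.
cudaIpcMemHandle_t handle_from_bytes(const std::string& handle_bytes) {
  // reinterpret_cast form (the reviewer's suggestion):
  //   return *reinterpret_cast<const cudaIpcMemHandle_t*>(handle_bytes.data());
  // memcpy form, same intent without aliasing/alignment caveats:
  cudaIpcMemHandle_t handle;
  std::memcpy(&handle, handle_bytes.data(), sizeof(handle));
  return handle;
}
```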
torch/multiprocessing/reductions.py
Outdated
shared_cache[storage_handle] = StorageWeakRef(storage)
storage_cls, storage_device, storage_handle, storage_size, storage_offset,
requires_grad):
if storage_handle is None or tensor_size == 0 or storage_size == 0:
Wait, is tensor_size ever zero? Isn't it always some sort of tuple?
torch/multiprocessing/reductions.py
Outdated
# to just make a storage for the entire caching allocator block.
# a bigger region (0xA000) than the one I wanted (0xA100)".
#
# Note that this cudaMalloc allocation might not be a single type of storage.
I had a little bit of trouble following the line of reasoning in the edited text. Part of the problem is you immediately jump into a paragraph explaining why the cudaMalloc allocation cannot be a single type of storage. OK, so this is definitely something we talked about, but the reader of the comment isn't aware of the sordid history of how Edward was silly and tried to mash them all in one storage. You can maybe ease the transition with something like, "OK, so if you sent the cudaMalloc allocation, can you just wrap that up as one storage itself? No, because..."
torch/multiprocessing/reductions.py
Outdated
# we have
# On sender side, the following info are required to passed to receiver for
# storage recontruction.
# 1. MemHandle(which can be translated to a basePtr in receiver process). The
It's not really a translation: the act of opening an IPC memory handle maps the memory into your local address space. If I open and close a memory handle multiple times, CUDA is allowed to give it a different address; similarly, once I close the memory, I'm not allowed to access it, even if it really is still live on the original process.
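A hedged sketch of that lifecycle, using only the CUDA runtime API (error handling omitted): opening a handle maps the sender's allocation into the receiver's address space, possibly at a different address each time, and the pointer becomes invalid as soon as the handle is closed.

```cpp
#include <cuda_runtime.h>

// Receiver-side mapping of a sender's allocation. The returned pointer is a
// local mapping, not a "translated" sender pointer: reopening may yield a
// different address, and using it after cudaIpcCloseMemHandle is illegal even
// if the allocation is still live in the sending process.
void* open_ipc_mapping(cudaIpcMemHandle_t handle) {
  void* devPtr = nullptr;
  cudaIpcOpenMemHandle(&devPtr, handle, cudaIpcMemLazyEnablePeerAccess);
  return devPtr;
}

void close_ipc_mapping(void* devPtr) {
  cudaIpcCloseMemHandle(devPtr);  // devPtr must not be used after this call
}
```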
torch/multiprocessing/reductions.py
Outdated
#
# Tensor(size=0x100, offset=0x020, storage=Storage(data=0xA100, size=0x0100))
# To send a tensor
# Tensor(size=0x040, offset=0x020, storage=Storage(data=0xA100, size=0x0100))
I don't think this accurately describes what we do anymore. We don't send a "Storage", we send the CUDA allocation memory handle, and the offset of the storage into that allocation.
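In other words (a sketch with illustrative names, not the PR's exact fields): the sender ships the cudaIpcMemHandle_t of the whole cudaMalloc allocation plus the storage's byte offset and size within it; the receiver maps the allocation and rebuilds the storage pointer with simple pointer arithmetic.

```cpp
#include <cstddef>
#include <cstdint>

// What conceptually travels between processes: the IPC handle identifies the
// full caching-allocator block; the offsets locate the storage inside it.
struct SharedStorageRef {
  // cudaIpcMemHandle_t handle;      // identifies the full cudaMalloc allocation
  std::size_t storage_offset_bytes;  // where the storage starts in that block
  std::size_t storage_size_bytes;    // how large the storage is
};

// Receiver side: given the locally mapped base pointer of the allocation,
// recover the storage's data pointer.
void* storage_data_ptr(void* mapped_base, const SharedStorageRef& ref) {
  return static_cast<std::uint8_t*>(mapped_base) + ref.storage_offset_bytes;
}
```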
ezyang
left a comment
I had a lot of comments, but there are only two real show stoppers:
- Need to use device guard in the destructor for IPC handle
- We're leaking memory in the mapping table
Rest of it is docs, which we can improve after getting it in the release. Approving in light of this.
Summary: This PR fixes #11422. In the old world of CUDA IPC, when we wanted to share a tensor T from process A to process B, we had to share the whole CUDA memory allocation that T's storage sits in, and we cast it to the same storage type as T's. This causes problems when two storages of different types are allocated in the same CUDA memory block: when we try to reconstruct the second tensor, it complains about the wrong storage type. In this PR we reconstruct only the storage (not the entire memory block). However, because CUDA only allows a memory handle to be opened once per process, we have to save the device pointer in a global cache so that we can reconstruct tensors as they come. Thanks a ton to ezyang, who helped design the solution and debug the issue!
Pull Request resolved: pytorch/pytorch#14736
Differential Revision: D13335899
Pulled By: ailzhang
fbshipit-source-id: cad69db392ed6f8fdc2b93a9dc2899f6d378c371

@ailzhang After this PR, running the above script in pytorch 1.0 (or the current nightly builds) gives the error: Assertion `self->allocator() != nullptr' failed. I reinstalled the pre-1.0 version of PyTorch (dated 20181202, just two days before this PR) and could not reproduce the error. Because this PR is the most relevant one, could you suggest some possibilities for why that error happens? I know it would be better to raise an issue, but I can't construct an example that depends only on pytorch and is independent of pyro :(. Thanks! FYI, this only happens with CUDA tensors and does not happen with pytorch-nightly-1.0.0.dev20181202. I tried to debug and observed that while rebuilding a CUDA tensor, the storage (which is returned by this line) has a size different from storage_size_bytes (modulo the byte size of a float or double element). I guess the tensor's information is somehow lost during the process of rebuilding tensors.

@fehiepsi Can you remind me if Pyro has any C++ extension code? If it doesn't, it's almost certainly a problem on our end.

@ezyang Pyro does not have any C++ extension code AFAIK.

@fehiepsi I don't fully understand how Pyro works, but in the script above you are setting the

Yes, it is strange that when setting the It is not necessary that setting

@fehiepsi Hmmm no matter what
This PR fixes #11422
In the old world of CUDA IPC, when we wanted to share a tensor T from process A to process B, we had to share the whole CUDA memory allocation that T's storage sits in, and we cast it to the same storage type as T's.
This causes problems when two storages of different types are allocated in the same CUDA memory block: when we try to reconstruct the second tensor, it complains about the wrong storage type.
In this PR we reconstruct only the storage (not the entire memory block). However, because CUDA only allows a memory handle to be opened once per process, we have to save the device pointer in a global cache so that we can reconstruct tensors as they come.
Thanks a ton to @ezyang, who helped design the solution and debug the issue!