
Conversation

@ailzhang
Contributor

@ailzhang ailzhang commented Dec 4, 2018

This PR fixes #11422

In the old world of CUDA IPC, when we wanted to share a tensor T from process A to process B, we had to share the whole CUDA memory allocation that T's storage sits in, and we cast it to the same storage type as T's.

This causes problems when storages of two different types are allocated in the same CUDA memory block: when we try to reconstruct the second tensor, it complains about the wrong storage type.

In this PR we reconstruct only the storage (not the entire memory block). However, since CUDA only allows a given memory handle to be opened once per process, we have to save the base device pointer in a global cache so that we can reconstruct tensors as they arrive.
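
Below is a minimal sketch of the caching idea (not the actual PR code; names like ipc_handle_cache and open_ipc_handle are illustrative, and error checking is omitted): the raw bytes of the cudaIpcMemHandle_t key a per-process map of weak pointers to the base device pointer, so the handle is opened at most once and closed only when the last storage using it goes away.

#include <cstring>
#include <cuda_runtime.h>
#include <memory>
#include <string>
#include <unordered_map>

// Per-process cache: raw bytes of a cudaIpcMemHandle_t -> weak_ptr to the
// base device pointer obtained by opening that handle once.
static std::unordered_map<std::string, std::weak_ptr<void>> ipc_handle_cache;

std::shared_ptr<void> open_ipc_handle(const std::string& handle_bytes) {
  auto it = ipc_handle_cache.find(handle_bytes);
  if (it != ipc_handle_cache.end()) {
    if (auto base = it->second.lock()) {
      return base;  // already open in this process; reuse the existing mapping
    }
  }
  // First use (or the previous mapping was already closed): open it once.
  void* dev_ptr = nullptr;
  cudaIpcMemHandle_t handle;
  std::memcpy(&handle, handle_bytes.data(), sizeof(handle));
  cudaIpcOpenMemHandle(&dev_ptr, handle, cudaIpcMemLazyEnablePeerAccess);
  std::shared_ptr<void> base(dev_ptr, [](void* ptr) {
    cudaIpcCloseMemHandle(ptr);  // closed only when the last storage is gone
  });
  ipc_handle_cache[handle_bytes] = base;
  return base;
}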

Thanks a ton to @ezyang who helped design the solution and debugged the issue!

int cur_device;
THCudaCheck(cudaGetDevice(&cur_device));
auto* context = new THCIpcDeleter(data, device);
auto* context = new THCIpcDeleter(basePtr, device);
Contributor

Do a std::move(basePtr) here

THCIpcDeleter::THCIpcDeleter(void* data, int device)
: data_(data), device_(device) {}
THCIpcDeleter::THCIpcDeleter(std::shared_ptr<void> basePtr, int device)
: basePtr_(basePtr), device_(device) {}
Contributor

Also do a std::move here
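
For reference, a self-contained sketch of what the two suggested std::move changes look like (IpcDeleterSketch is a stand-in for THCIpcDeleter, not the real class):

#include <memory>
#include <utility>

// Moving the shared_ptr at the call site and in the member initializer list
// avoids two extra atomic refcount bumps.
struct IpcDeleterSketch {
  IpcDeleterSketch(std::shared_ptr<void> basePtr, int device)
      : basePtr_(std::move(basePtr)), device_(device) {}
  std::shared_ptr<void> basePtr_;
  int device_;
};

int main() {
  std::shared_ptr<void> basePtr(new int(0),
                                [](void* p) { delete static_cast<int*>(p); });
  auto* context = new IpcDeleterSketch(std::move(basePtr), /*device=*/0);
  delete context;
}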

tensor.size(),
tensor.stride(),
tensor_offset + storage_offset,
tensor_offset,
Contributor

It would be nice to clarify what these variables mean now

THWStoragePtr base(THWStorage_(newWithDataAndAllocator)(
LIBRARY_STATE
THCIpcDeleter::makeDataPtr(devPtr, device),
THCIpcDeleter::makeDataPtr(basePtr, devPtr, device),
Contributor

std::move here

@ailzhang
Contributor Author

ailzhang commented Dec 5, 2018

I'm still working on two failing tests: test_empty_tensor_sharing_cuda and test_cuda_small_tensors (multiple devices).

@ailzhang ailzhang changed the title [wip]Fix cuda multiprocessing cached memory Fix cuda multiprocessing cached memory Dec 5, 2018
@ailzhang ailzhang added the 1.0 label Dec 5, 2018
@ailzhang
Contributor Author

ailzhang commented Dec 5, 2018

The two tests above are fixed.

Contributor

@facebook-github-bot facebook-github-bot left a comment

@ailzhang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.


at::DataPtr THCIpcDeleter::makeDataPtr(void* data, int device) {
// Refer to NB [CUDA IPC and the caching allocator] for more details
// basePtr - device ptr of allocated CUDA memory region. This memory region
Contributor

nit: "memory region" here is a bit ambiguous. A more specific moniker is "a single cudaMalloc allocation; this may be a large block of memory which is managed by the caching allocator".

// construct the new storage.
// Every time a storage on the memory region go out of scope, the ref_count
// of basePtr will be decreased 1, until it's closed in its deleter (calling
// cudaIpoCloseMemHandle) when ref_count is 0.
Contributor

nit: cudaIpcCloseMemHandle (Ipo)

// device - device of memory
// Here basePtr should be saved in the struct, while data should be used to
// construct the new storage.
// Every time a storage on the memory region go out of scope, the ref_count
Contributor

nit: Well, it's whatever the shared pointers internal refcount is; ref_count seems to imply that we're manually refcounting something by the name of the ref_count identifier but we're not
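
A tiny standalone illustration of this point (nothing PyTorch-specific): the count being decremented is std::shared_ptr's own use count, and the custom deleter, which would call cudaIpcCloseMemHandle in the real code, runs only when that count reaches zero.

#include <cstdio>
#include <memory>

int main() {
  std::shared_ptr<void> base(new char[16], [](void* p) {
    std::puts("last reference gone: this is where the IPC handle would be closed");
    delete[] static_cast<char*>(p);
  });
  {
    auto another_storage = base;  // use count goes to 2
    std::printf("use_count = %ld\n", base.use_count());
  }  // back to 1
  std::printf("use_count = %ld\n", base.use_count());
}  // deleter runs here, when the count hits 0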

Contributor

grammar: Every time a storage referring to the IPC memory region goes out of scope, the reference count on the memory region will be decreased by one, until it's zero, at which point IPC memory region is closed (by calling cudaIpcCloseMemHandle).

THCudaCheck(cudaGetDevice(&prev_device));
THCudaCheck(cudaSetDevice(device_));
THCudaCheck(cudaIpcCloseMemHandle(data_));
THCudaCheck(cudaSetDevice(prev_device));
Contributor

You don't need the cudaSetDevice here anymore. In fact, you don't need this destructor at all.

devPtr,
[curr_device](void *ptr) {
THCudaCheck(cudaSetDevice(curr_device));
THCudaCheck(cudaIpcCloseMemHandle(ptr));});
Contributor

This deleter doesn't look like it actually it's removing entries from the table. So it seems to me that we are leaking memory, because ipcMemHandle_to_devptr will just gradually get larger, without ever being GC'ed.
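
One possible shape of a fix, sketched under the assumption that a cache like ipcMemHandle_to_devptr maps handle bytes to weak pointers (synchronization and THCudaCheck error handling omitted): have the deleter erase its own cache entry when it closes the handle.

#include <cuda_runtime.h>
#include <memory>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for ipcMemHandle_to_devptr.
static std::unordered_map<std::string, std::weak_ptr<void>> ipc_cache;

std::shared_ptr<void> cache_opened_ptr(const std::string& handle,
                                       void* devPtr,
                                       int curr_device) {
  std::shared_ptr<void> basePtr(devPtr, [handle, curr_device](void* ptr) {
    cudaSetDevice(curr_device);
    cudaIpcCloseMemHandle(ptr);
    ipc_cache.erase(handle);  // drop the expired entry so the map cannot grow forever
  });
  ipc_cache[handle] = basePtr;
  return basePtr;
}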

// is called by receiver process to get access to the memory where the tensor
// was built on sender process.
//
// CUDA IPC only allows sharing a big memory block associated with a IpcMemHandle,
Contributor

A general nit when writing documentation like this: if you write "IpcMemHandle" (capitalized specifically in this way), the implication is that this refers to an actual /name/ of some sort in the codebase. However, in this case, you're using this as an abbreviation for cudaIpcMemHandle_t. It's better to spell it out entirely, so that if someone greps for cudaIpcMemHandle_t they hit this code (they are probably not going to grep for IpcMemHandle).

}

at::DataPtr THCIpcDeleter::makeDataPtr(void* data, int device) {
// Refer to NB [CUDA IPC and the caching allocator] for more details
Contributor

nittiest of nits: the prevailing convention in the codebase is Note [Blah blah], rather than NB ;)


//
// In CUDA IPC, sender sends a tensor to receiver, THCCaching_CUDAIpcDevptr
// is called by receiver process to get access to the memory where the tensor
Contributor

nit: I'd probably say something more like, is called by the receiving process to map the CUDA memory from the sending process into its own address space.

if (ipcMemHandle_to_devptr.find(handle) == ipcMemHandle_to_devptr.end()
|| ipcMemHandle_to_devptr[handle].expired()) {
void *devPtr = nullptr;
cudaIpcMemHandle_t ipc_handle = *(cudaIpcMemHandle_t*)handle.c_str();
Contributor

nit: in C++ land, we use a reinterpret_cast here.
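
The same cast written the C++ way (to_ipc_handle is just an illustrative wrapper; handle is the std::string carrying the raw handle bytes):

#include <cuda_runtime.h>
#include <string>

cudaIpcMemHandle_t to_ipc_handle(const std::string& handle) {
  // reinterpret_cast instead of the C-style cast above
  return *reinterpret_cast<const cudaIpcMemHandle_t*>(handle.c_str());
}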

shared_cache[storage_handle] = StorageWeakRef(storage)
storage_cls, storage_device, storage_handle, storage_size, storage_offset,
requires_grad):
if storage_handle is None or tensor_size == 0 or storage_size == 0:
Contributor

Wait, is tensor_size ever zero? Isn't it always some sort of tuple?

# to just make a storage for the entire caching allocator block.
# a bigger region (0xA000) than the one I wanted (0xA100)".
#
# Note that this cudaMalloc allocation might not be a single type of storage.
Contributor

I had a little bit of trouble following the line of reasoning in the edited text. Part of the problem is you immediately jump into a paragraph explaining why the cudaMalloc allocation cannot be a single type of storage. OK, so this is definitely something we talked about, but the reader of the comment isn't aware of the sordid history of how Edward was silly and tried to mash them all in one storage. You can maybe ease the transition with something like, "OK, so if you sent the cudaMalloc allocation, can you just wrap that up as one storage itself? No, because..."

# we have
# On sender side, the following info are required to passed to receiver for
# storage recontruction.
# 1. MemHandle(which can be translated to a basePtr in receiver process). The
Contributor

It's not really a translation: the act of opening an IPC memory handle /maps/ the memory into your local address. If I open and close a memory handle multiple times, CUDA is allowed to give it a different address; similarly, once I close the memory, I'm not allowed to access it, even if it really is still live on the original process.
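
A sketch of the map/unmap semantics described here (illustrative helper names, no error checking): opening the handle maps the sender's allocation into this process's address space, the returned pointer is only valid until the handle is closed, and a later re-open may give a different local address.

#include <cuda_runtime.h>

void* map_remote_allocation(const cudaIpcMemHandle_t& handle) {
  void* devPtr = nullptr;
  cudaIpcOpenMemHandle(&devPtr, handle, cudaIpcMemLazyEnablePeerAccess);
  return devPtr;  // local mapping of the remote cudaMalloc allocation
}

void unmap_remote_allocation(void* devPtr) {
  cudaIpcCloseMemHandle(devPtr);  // after this, devPtr must not be dereferenced
}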

#
# Tensor(size=0x100, offset=0x020, storage=Storage(data=0xA100, size=0x0100))
# To send a tensor
# Tensor(size=0x040, offset=0x020, storage=Storage(data=0xA100, size=0x0100))
Contributor

I don't think this accurately describes what we do anymore. We don't send a "Storage", we send the CUDA allocation memory handle, and the offset of the storage into that allocation.
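
Purely as an illustration of that description (the field names are made up, not the actual metadata layout in torch.multiprocessing): the receiver needs the IPC handle of the whole cudaMalloc allocation plus where the storage lives inside it, rather than a serialized Storage object.

#include <cuda_runtime.h>
#include <cstddef>

struct SharedCudaStorageInfo {
  cudaIpcMemHandle_t allocation_handle;  // identifies the sender's cudaMalloc allocation
  size_t storage_offset_bytes;           // offset of the storage within that allocation
  size_t storage_size_bytes;             // size of the storage in bytes
  int device;                            // CUDA device the allocation lives on
};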

Contributor

@ezyang ezyang left a comment

I had a lot of comments, but there are only two real show stoppers:

  1. Need to use device guard in the destructor for IPC handle
  2. We're leaking memory in the mapping table

Rest of it is docs, which we can improve after getting it in the release. Approving in light of this.
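
For point (1), a hedged sketch of what a device-guarded close could look like (DeviceRestorer is a stand-in for the codebase's own device-guard class, e.g. at::cuda::CUDAGuard; error checking omitted):

#include <cuda_runtime.h>

struct DeviceRestorer {
  int prev = -1;
  explicit DeviceRestorer(int device) {
    cudaGetDevice(&prev);   // remember the caller's device
    cudaSetDevice(device);
  }
  ~DeviceRestorer() { cudaSetDevice(prev); }  // restore on scope exit
};

void close_ipc_handle_on(int device, void* ptr) {
  DeviceRestorer guard(device);
  cudaIpcCloseMemHandle(ptr);  // handle is closed on the right device
}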


@ailzhang ailzhang force-pushed the fix_multiprocessing branch from 555cdbd to 27afbfe Compare December 5, 2018 14:51

@ailzhang ailzhang force-pushed the fix_multiprocessing branch from 36701b5 to 00b6a83 Compare December 5, 2018 16:35

zdevito pushed a commit to zdevito/ATen that referenced this pull request Dec 5, 2018
Summary:
This PR fixes #11422

In the old world of CUDA IPC, when we wanted to share a tensor T from process A to process B, we had to share the whole CUDA memory allocation that T's storage sits in, and we cast it to the same storage type as T's.

This causes problems when storages of two different types are allocated in the same CUDA memory block: when we try to reconstruct the second tensor, it complains about the wrong storage type.

In this PR we reconstruct only the storage (not the entire memory block). However, since CUDA only allows a given memory handle to be opened once per process, we have to save the base device pointer in a global cache so that we can reconstruct tensors as they arrive.

Thanks a ton to ezyang who helped design the solution and debugged the issue!
Pull Request resolved: pytorch/pytorch#14736

Differential Revision: D13335899

Pulled By: ailzhang

fbshipit-source-id: cad69db392ed6f8fdc2b93a9dc2899f6d378c371
@fehiepsi
Contributor

fehiepsi commented Jan 17, 2019

@ailzhang After this PR, we get the error RuntimeError: Assertion self->allocator(), as in @neerajprad's comment. I can replicate the issue with the following script:

import torch
import torch.multiprocessing as mp

import pyro
import pyro.distributions as dist

torch.set_default_tensor_type(torch.cuda.FloatTensor)
n = 10

def model():
    loc = pyro.sample("loc", dist.Normal(torch.zeros(3), 1))
    pyro.sample("y", dist.Normal(loc, 1))  # comment out this line -> no error

def worker(q, e):
    for i in range(n):
        trace = {"normal": dist.Normal(0, 1)}  # comment out this line -> no error ???
        trace = pyro.poutine.trace(model).get_trace()  # this is just a dictionary which holds tensors
        q.put(trace.nodes["loc"])  # q.put(trace) also gives error
        e.wait()
        e.clear()

if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    q = ctx.Queue()
    e = ctx.Event()
    p = ctx.Process(target=worker, args=(q, e))
    p.start()
    for i in range(n):
        trace = q.get()
        e.set()
    p.join()

Running the above script on PyTorch 1.0 (or the current nightly builds) gives the error: Assertion `self->allocator() != nullptr' failed. I reinstalled the pre-1.0 version of PyTorch (dated 20181202, just two days before this PR) and could not observe that error. Because this PR is the most relevant one, could you please suggest some possibilities for why that error happens? I know it would be better to raise an issue, but I can't make an example that depends solely on PyTorch and is independent of Pyro :(. Thanks!

FYI, this only happens with CUDA tensors and does not happen with pytorch-nightly-1.0.0.dev20181202. While debugging, I observed that when rebuilding the CUDA tensor, the storage (which is returned by this line) has a size different from storage_size_bytes (up to the element size of a float or double tensor). I guess the tensor's information is somehow lost during the process of rebuilding tensors.

@ezyang
Contributor

ezyang commented Jan 17, 2019

@fehiepsi Can you remind me if Pyro has any C++ extension code? If it doesn't, it's almost certainly a problem on our end.

@fehiepsi
Contributor

@ezyang Pyro does not have any C++ extension code AFAIK.

@ailzhang
Contributor Author

@fehiepsi I don't fully understand how Pyro works, but in the script above you are setting the trace variable twice. In my local run, commenting out either of those two lines makes it work. Could you explain a bit how those lines could be correlated? They look like two independent statements setting the same variable to me.

@fehiepsi
Contributor

Yes, it is strange that setting the trace variable twice breaks things. I don't think there is any correlation here (FYI, the trace returned by pyro.poutine... is a networkx.DiGraph). Removing the first line gives an identical result to keeping it (on CPU).

Setting trace twice is not the only way to get the above error; we hit this error in our main code without setting trace twice. Maybe Pyro's trace structure is not a good candidate for PyTorch multiprocessing's queue?

@ailzhang
Contributor Author

@fehiepsi Hmmm, no matter what the trace structure is, you only put a dict in the queue. It's likely a bug on our side, although the repro is tricky and I can't fully understand it yet. Let me dig deeper; please let me know if you find a simpler repro in the meantime. Thanks!

@fehiepsi
Contributor

@ailzhang I filed issue #16141 with an example that does not rely on Pyro. Hope that helps. :)

@ezyang ezyang added this to the 1.0 milestone Apr 1, 2019
@ezyang ezyang added the merged label Jun 25, 2019

Development

Successfully merging this pull request may close these issues.

Error on cuda.LongTensor which is sent via multiprocessing.Queue
