
Conversation

@ailzhang
Contributor

@ailzhang ailzhang commented Dec 4, 2018

This PR fixes #11422

In the old world of CUDA IPC, when we wanted to share a tensor T from process A to process B, we had to share the whole CUDA memory allocation that T's storage sits in, and we cast it to the same storage type as T's.

This causes problems when storages of two different types are allocated in the same CUDA memory block: when we try to reconstruct the second tensor, it complains about the wrong storage type.

In this PR we reconstruct only the storage (not the entire memory block). However, since CUDA only allows a given memory handle to be opened once per process, we have to save the base device pointer in a global cache so that we can reconstruct tensors as they arrive.
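
Below is a minimal sketch of the caching idea (not the actual PR code; names like ipc_handle_cache and open_ipc_handle are illustrative, and error checking is omitted): the raw bytes of the cudaIpcMemHandle_t key a per-process map of weak pointers to the base device pointer, so the handle is opened at most once and closed only when the last storage using it goes away.

#include <cstring>
#include <cuda_runtime.h>
#include <memory>
#include <string>
#include <unordered_map>

// Per-process cache: raw bytes of a cudaIpcMemHandle_t -> weak_ptr to the
// base device pointer obtained by opening that handle once.
static std::unordered_map<std::string, std::weak_ptr<void>> ipc_handle_cache;

std::shared_ptr<void> open_ipc_handle(const std::string& handle_bytes) {
  auto it = ipc_handle_cache.find(handle_bytes);
  if (it != ipc_handle_cache.end()) {
    if (auto base = it->second.lock()) {
      return base;  // already open in this process; reuse the existing mapping
    }
  }
  // First use (or the previous mapping was already closed): open it once.
  void* dev_ptr = nullptr;
  cudaIpcMemHandle_t handle;
  std::memcpy(&handle, handle_bytes.data(), sizeof(handle));
  cudaIpcOpenMemHandle(&dev_ptr, handle, cudaIpcMemLazyEnablePeerAccess);
  std::shared_ptr<void> base(dev_ptr, [](void* ptr) {
    cudaIpcCloseMemHandle(ptr);  // closed only when the last storage is gone
  });
  ipc_handle_cache[handle_bytes] = base;
  return base;
}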

Thanks a ton to @ezyang who helped design the solution and debugged the issue!

int cur_device;
THCudaCheck(cudaGetDevice(&cur_device));
auto* context = new THCIpcDeleter(data, device);
auto* context = new THCIpcDeleter(basePtr, device);
Contributor

Do a std::move(basePtr) here

THCIpcDeleter::THCIpcDeleter(void* data, int device)
: data_(data), device_(device) {}
THCIpcDeleter::THCIpcDeleter(std::shared_ptr<void> basePtr, int device)
: basePtr_(basePtr), device_(device) {}
Contributor

Also do a std::move here
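
For reference, a self-contained sketch of what the two suggested std::move changes look like (IpcDeleterSketch is a stand-in for THCIpcDeleter, not the real class):

#include <memory>
#include <utility>

// Moving the shared_ptr at the call site and in the member initializer list
// avoids two extra atomic refcount bumps.
struct IpcDeleterSketch {
  IpcDeleterSketch(std::shared_ptr<void> basePtr, int device)
      : basePtr_(std::move(basePtr)), device_(device) {}
  std::shared_ptr<void> basePtr_;
  int device_;
};

int main() {
  std::shared_ptr<void> basePtr(new int(0),
                                [](void* p) { delete static_cast<int*>(p); });
  auto* context = new IpcDeleterSketch(std::move(basePtr), /*device=*/0);
  delete context;
}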

tensor.size(),
tensor.stride(),
tensor_offset + storage_offset,
tensor_offset,
Contributor

It would be nice to clarify what these variables mean now

THWStoragePtr base(THWStorage_(newWithDataAndAllocator)(
LIBRARY_STATE
THCIpcDeleter::makeDataPtr(devPtr, device),
THCIpcDeleter::makeDataPtr(basePtr, devPtr, device),
Contributor

std::move here

@ailzhang
Contributor Author

ailzhang commented Dec 5, 2018

I'm still working on two failing tests: test_empty_tensor_sharing_cuda and test_cuda_small_tensors (multiple devices).

@ailzhang ailzhang changed the title [wip]Fix cuda multiprocessing cached memory Fix cuda multiprocessing cached memory Dec 5, 2018
@ailzhang ailzhang added the 1.0 label Dec 5, 2018
@ailzhang
Contributor Author

ailzhang commented Dec 5, 2018

The two tests above are fixed.

Contributor

@facebook-github-bot facebook-github-bot left a comment

@ailzhang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.


at::DataPtr THCIpcDeleter::makeDataPtr(void* data, int device) {
// Refer to NB [CUDA IPC and the caching allocator] for more details
// basePtr - device ptr of allocated CUDA memory region. This memory region
Contributor

nit: "memory region" here is a bit ambiguous. A more specific moniker is "a single cudaMalloc allocation; this may be a large block of memory which is managed by the caching allocator".

// construct the new storage.
// Every time a storage on the memory region go out of scope, the ref_count
// of basePtr will be decreased 1, until it's closed in its deleter (calling
// cudaIpoCloseMemHandle) when ref_count is 0.
Contributor

nit: cudaIpcCloseMemHandle (Ipo)

// device - device of memory
// Here basePtr should be saved in the struct, while data should be used to
// construct the new storage.
// Every time a storage on the memory region go out of scope, the ref_count
Contributor

nit: Well, it's whatever the shared pointers internal refcount is; ref_count seems to imply that we're manually refcounting something by the name of the ref_count identifier but we're not
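
A tiny standalone illustration of this point (nothing PyTorch-specific): the count being decremented is std::shared_ptr's own use count, and the custom deleter, which would call cudaIpcCloseMemHandle in the real code, runs only when that count reaches zero.

#include <cstdio>
#include <memory>

int main() {
  std::shared_ptr<void> base(new char[16], [](void* p) {
    std::puts("last reference gone: this is where the IPC handle would be closed");
    delete[] static_cast<char*>(p);
  });
  {
    auto another_storage = base;  // use count goes to 2
    std::printf("use_count = %ld\n", base.use_count());
  }  // back to 1
  std::printf("use_count = %ld\n", base.use_count());
}  // deleter runs here, when the count hits 0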

Contributor

grammar: Every time a storage referring to the IPC memory region goes out of scope, the reference count on the memory region will be decreased by one, until it's zero, at which point IPC memory region is closed (by calling cudaIpcCloseMemHandle).

THCudaCheck(cudaGetDevice(&prev_device));
THCudaCheck(cudaSetDevice(device_));
THCudaCheck(cudaIpcCloseMemHandle(data_));
THCudaCheck(cudaSetDevice(prev_device));
Contributor

You don't need the cudaSetDevice here anymore. In fact, you don't need this destructor at all.

devPtr,
[curr_device](void *ptr) {
THCudaCheck(cudaSetDevice(curr_device));
THCudaCheck(cudaIpcCloseMemHandle(ptr));});
Contributor

This deleter doesn't look like it actually it's removing entries from the table. So it seems to me that we are leaking memory, because ipcMemHandle_to_devptr will just gradually get larger, without ever being GC'ed.
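
One possible shape of a fix, sketched under the assumption that a cache like ipcMemHandle_to_devptr maps handle bytes to weak pointers (synchronization and THCudaCheck error handling omitted): have the deleter erase its own cache entry when it closes the handle.

#include <cuda_runtime.h>
#include <memory>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for ipcMemHandle_to_devptr.
static std::unordered_map<std::string, std::weak_ptr<void>> ipc_cache;

std::shared_ptr<void> cache_opened_ptr(const std::string& handle,
                                       void* devPtr,
                                       int curr_device) {
  std::shared_ptr<void> basePtr(devPtr, [handle, curr_device](void* ptr) {
    cudaSetDevice(curr_device);
    cudaIpcCloseMemHandle(ptr);
    ipc_cache.erase(handle);  // drop the expired entry so the map cannot grow forever
  });
  ipc_cache[handle] = basePtr;
  return basePtr;
}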

// is called by receiver process to get access to the memory where the tensor
// was built on sender process.
//
// CUDA IPC only allows sharing a big memory block associated with a IpcMemHandle,
Contributor

A general nit when writing documentation like this: if you write "IpcMemHandle" (capitalized specifically in this way), the implication is that this refers to an actual /name/ of some sort in the codebase. However, in this case, you're using this as an abbreviation for cudaIpcMemHandle_t. It's better to spell it out entirely, so that if someone greps for cudaIpcMemHandle_t they hit this code (they are probably not going to grep for IpcMemHandle).

}

at::DataPtr THCIpcDeleter::makeDataPtr(void* data, int device) {
// Refer to NB [CUDA IPC and the caching allocator] for more details
Contributor

nittiest of nits: the prevailing convention in the codebase is Note [Blah blah], rather than NB ;)


//
// In CUDA IPC, sender sends a tensor to receiver, THCCaching_CUDAIpcDevptr
// is called by receiver process to get access to the memory where the tensor
Contributor

nit: I'd probably say something more like, is called by the receiving process to map the CUDA memory from the sending process into its own address space.

if (ipcMemHandle_to_devptr.find(handle) == ipcMemHandle_to_devptr.end()
|| ipcMemHandle_to_devptr[handle].expired()) {
void *devPtr = nullptr;
cudaIpcMemHandle_t ipc_handle = *(cudaIpcMemHandle_t*)handle.c_str();
Contributor

nit: in C++ land, we use a reinterpret_cast here.
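
The same cast written the C++ way (to_ipc_handle is just an illustrative wrapper; handle is the std::string carrying the raw handle bytes):

#include <cuda_runtime.h>
#include <string>

cudaIpcMemHandle_t to_ipc_handle(const std::string& handle) {
  // reinterpret_cast instead of the C-style cast above
  return *reinterpret_cast<const cudaIpcMemHandle_t*>(handle.c_str());
}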

shared_cache[storage_handle] = StorageWeakRef(storage)
storage_cls, storage_device, storage_handle, storage_size, storage_offset,
requires_grad):
if storage_handle is None or tensor_size == 0 or storage_size == 0:
Contributor

Wait, is tensor_size ever zero? Isn't it always some sort of tuple?

# to just make a storage for the entire caching allocator block.
# a bigger region (0xA000) than the one I wanted (0xA100)".
#
# Note that this cudaMalloc allocation might not be a single type of storage.
Contributor

I had a little bit of trouble following the line of reasoning in the edited text. Part of the problem is you immediately jump into a paragraph explaining why the cudaMalloc allocation cannot be a single type of storage. OK, so this is definitely something we talked about, but the reader of the comment isn't aware of the sordid history of how Edward was silly and tried to mash them all in one storage. You can maybe ease the transition with something like, "OK, so if you sent the cudaMalloc allocation, can you just wrap that up as one storage itself? No, because..."

# we have
# On sender side, the following info are required to passed to receiver for
# storage recontruction.
# 1. MemHandle(which can be translated to a basePtr in receiver process). The
Contributor

It's not really a translation: the act of opening an IPC memory handle /maps/ the memory into your local address. If I open and close a memory handle multiple times, CUDA is allowed to give it a different address; similarly, once I close the memory, I'm not allowed to access it, even if it really is still live on the original process.
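
A sketch of the map/unmap semantics described here (illustrative helper names, no error checking): opening the handle maps the sender's allocation into this process's address space, the returned pointer is only valid until the handle is closed, and a later re-open may give a different local address.

#include <cuda_runtime.h>

void* map_remote_allocation(const cudaIpcMemHandle_t& handle) {
  void* devPtr = nullptr;
  cudaIpcOpenMemHandle(&devPtr, handle, cudaIpcMemLazyEnablePeerAccess);
  return devPtr;  // local mapping of the remote cudaMalloc allocation
}

void unmap_remote_allocation(void* devPtr) {
  cudaIpcCloseMemHandle(devPtr);  // after this, devPtr must not be dereferenced
}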

#
# Tensor(size=0x100, offset=0x020, storage=Storage(data=0xA100, size=0x0100))
# To send a tensor
# Tensor(size=0x040, offset=0x020, storage=Storage(data=0xA100, size=0x0100))
Contributor

I don't think this accurately describes what we do anymore. We don't send a "Storage", we send the CUDA allocation memory handle, and the offset of the storage into that allocation.
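
Purely as an illustration of that description (the field names are made up, not the actual metadata layout in torch.multiprocessing): the receiver needs the IPC handle of the whole cudaMalloc allocation plus where the storage lives inside it, rather than a serialized Storage object.

#include <cuda_runtime.h>
#include <cstddef>

struct SharedCudaStorageInfo {
  cudaIpcMemHandle_t allocation_handle;  // identifies the sender's cudaMalloc allocation
  size_t storage_offset_bytes;           // offset of the storage within that allocation
  size_t storage_size_bytes;             // size of the storage in bytes
  int device;                            // CUDA device the allocation lives on
};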

Contributor

@ezyang ezyang left a comment

I had a lot of comments, but there are only two real show stoppers:

  1. Need to use device guard in the destructor for IPC handle
  2. We're leaking memory in the mapping table

Rest of it is docs, which we can improve after getting it in the release. Approving in light of this.
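
For point (1), a hedged sketch of what a device-guarded close could look like (DeviceRestorer is a stand-in for the codebase's own device-guard class, e.g. at::cuda::CUDAGuard; error checking omitted):

#include <cuda_runtime.h>

struct DeviceRestorer {
  int prev = -1;
  explicit DeviceRestorer(int device) {
    cudaGetDevice(&prev);   // remember the caller's device
    cudaSetDevice(device);
  }
  ~DeviceRestorer() { cudaSetDevice(prev); }  // restore on scope exit
};

void close_ipc_handle_on(int device, void* ptr) {
  DeviceRestorer guard(device);
  cudaIpcCloseMemHandle(ptr);  // handle is closed on the right device
}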


@ailzhang ailzhang force-pushed the fix_multiprocessing branch from 555cdbd to 27afbfe Compare December 5, 2018 14:51

@ailzhang ailzhang force-pushed the fix_multiprocessing branch from 36701b5 to 00b6a83 Compare December 5, 2018 16:35

zdevito pushed a commit to zdevito/ATen that referenced this pull request Dec 5, 2018
Summary:
This PR fixes #11422

In the old world of CUDA IPC, when we wanted to share a tensor T from process A to process B, we had to share the whole CUDA memory allocation that T's storage sits in, and we cast it to the same storage type as T's.

This causes problems when storages of two different types are allocated in the same CUDA memory block: when we try to reconstruct the second tensor, it complains about the wrong storage type.

In this PR we reconstruct only the storage (not the entire memory block). However, since CUDA only allows a given memory handle to be opened once per process, we have to save the base device pointer in a global cache so that we can reconstruct tensors as they arrive.

Thanks a ton to ezyang who helped design the solution and debugged the issue!
Pull Request resolved: pytorch/pytorch#14736

Differential Revision: D13335899

Pulled By: ailzhang

fbshipit-source-id: cad69db392ed6f8fdc2b93a9dc2899f6d378c371
@fehiepsi
Contributor

fehiepsi commented Jan 17, 2019

@ailzhang After this PR, we get the error RuntimeError: Assertion self->allocator(), as in @neerajprad's comment. I can replicate the issue with the following script:

import torch
import torch.multiprocessing as mp

import pyro
import pyro.distributions as dist

torch.set_default_tensor_type(torch.cuda.FloatTensor)
n = 10

def model():
    loc = pyro.sample("loc", dist.Normal(torch.zeros(3), 1))
    pyro.sample("y", dist.Normal(loc, 1))  # comment out this line -> no error

def worker(q, e):
    for i in range(n):
        trace = {"normal": dist.Normal(0, 1)}  # comment out this line -> no error ???
        trace = pyro.poutine.trace(model).get_trace()  # this is just a dictionary which holds tensors
        q.put(trace.nodes["loc"])  # q.put(trace) also gives error
        e.wait()
        e.clear()

if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    q = ctx.Queue()
    e = ctx.Event()
    p = ctx.Process(target=worker, args=(q, e))
    p.start()
    for i in range(n):
        trace = q.get()
        e.set()
    p.join()

Running the above script on PyTorch 1.0 (or the current nightly builds) gives the error: Assertion `self->allocator() != nullptr' failed. I reinstalled the pre-1.0 version of PyTorch (dated 20181202, just two days before this PR) and could not observe that error. Because this PR is the most relevant one, could you please suggest some possibilities for why that error happens? I know it would be better to raise an issue, but I can't make an example that depends solely on PyTorch and is independent of Pyro :(. Thanks!

FYI, this only happens with CUDA tensors and does not happen with pytorch-nightly-1.0.0.dev20181202. While debugging, I observed that when rebuilding the CUDA tensor, the storage (which is returned by this line) has a size different from storage_size_bytes (up to the element size of a float or double tensor). I guess the tensor's information is somehow lost during the process of rebuilding tensors.

@ezyang
Contributor

ezyang commented Jan 17, 2019

@fehiepsi Can you remind me if Pyro has any C++ extension code? If it doesn't, it's almost certainly a problem on our end.

@fehiepsi
Contributor

@ezyang Pyro does not have any C++ extension code AFAIK.

@ailzhang
Contributor Author

@fehiepsi I don't fully understand how Pyro works, but in the script above you are setting the trace variable twice. In my local run, commenting out either of those two lines makes it work. Could you explain a bit how those lines could be correlated? They look like two independent statements setting the same variable to me.

@fehiepsi
Contributor

Yes, it is strange that setting the trace variable twice breaks things. I don't think there is any correlation here (FYI, the trace returned by pyro.poutine... is a networkx.DiGraph). Removing the first line gives an identical result to keeping it (on CPU).

Setting trace twice is not the only way to get the above error; we hit this error in our main code without setting trace twice. Maybe Pyro's trace structure is not a good candidate for PyTorch multiprocessing's queue?

@ailzhang
Contributor Author

@fehiepsi Hmmm, no matter what the trace structure is, you only put a dict in the queue. It's likely a bug on our side, although the repro is tricky and I can't fully understand it yet. Let me dig deeper; please let me know if you find a simpler repro in the meantime. Thanks!

@fehiepsi
Contributor

@ailzhang I filed issue #16141 with an example that does not rely on Pyro. Hope that helps. :)

@ezyang ezyang added this to the 1.0 milestone Apr 1, 2019
@ezyang ezyang added the merged label Jun 25, 2019

Development

Successfully merging this pull request may close these issues.

Error on cuda.LongTensor which is sent via multiprocessing.Queue
