Uses context pointer for deleter to enable multiple CUDAPluggableAllocator usage #130472
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/130472
Note: Links to docs will display an error until the docs builds have been completed. ✅ No Failures as of commit 612b342 with merge base c101c45. This comment was automatically generated by Dr. CI and updates every 15 minutes.
eqy left a comment
Is this BC-breaking? It looks like there is an interface change. Would there need to be more handling on the user's side, e.g., ensuring that the free function is called on the correct stream?
It is technically BC-breaking. For instance, RMM does pass size, device_idx, and stream along to its deallocator: https://github.com/rapidsai/rmm/blob/b8b67f8ceb52b50bf1c16d4f3305b7885de5c3ea/python/rmm/rmm/_lib/_torch_allocator.cpp#L54-L60. The contract for a deleter function seems to be this: https://github.com/pytorch/pytorch/blob/main/c10/util/UniqueVoidPtr.h#L11. I've thought about making (Lines 7 to 12 in 46c5266)
Do you have any suggestions on what should be done here?
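For reference, the deleter contract linked above boils down to a function pointer that receives a single opaque `void*`. Below is a minimal, self-contained C++ sketch (not PyTorch source; everything except the shape of `DeleterFnPtr` is illustrative) of why a bare data pointer cannot carry the size, device, and stream that a deallocator like RMM's wants, while a small context object can.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstdlib>

// Mirrors the shape of the deleter contract: one opaque argument, nothing else.
using DeleterFnPtr = void (*)(void*);

// If the void* is the data pointer itself, the deleter knows nothing more.
void naive_deleter(void* data) {
  std::printf("freeing %p, but size/device/stream are unknown here\n", data);
  std::free(data);
}

// If the void* is instead a small context object, the deleter can recover
// everything a richer deallocator would want to receive.
struct DeleterContext {
  void* data;
  std::size_t size;
  int device;
};

void context_deleter(void* raw_ctx) {
  auto* ctx = static_cast<DeleterContext*>(raw_ctx);
  std::printf("freeing %p (size=%zu, device=%d)\n", ctx->data, ctx->size, ctx->device);
  std::free(ctx->data);
  delete ctx;  // the context is owned by the allocation and dies with it
}

int main() {
  DeleterFnPtr d1 = naive_deleter;
  d1(std::malloc(64));

  DeleterFnPtr d2 = context_deleter;
  d2(new DeleterContext{std::malloc(128), 128, 0});
  return 0;
}
```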
Force-pushed from adcfcff to 1b29f57
Ok, I have pushed some changes so that this is no longer BC-breaking. Reading the notes in c10/util/UniqueVoidPtr.h and looking at the COWDeleter implementation, I believe using a context pointer is the right solution here. I've created a CUDAPluggableAllocatorDeleterContext that records the data pointer, free_fn, size, device, and stream. When creating the DataPtr in CUDAPluggableAllocator, this context is now passed instead of just the data pointer. @eqy I've restored the free_fn_ signature to what it was before. Does that resolve your concern about BC breakage?
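To make the described change concrete, here is a hedged C++ sketch of a context-based deleter. The struct name, field names, and the `StreamHandle` stand-in are illustrative and may not match the actual CUDAPluggableAllocatorDeleterContext, but the shape is the same: each allocation carries its own free_fn, size, device, and stream, and a single-argument deleter unpacks them.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <functional>

using StreamHandle = void*;  // stand-in for cudaStream_t so the sketch builds without CUDA
using FreeFunc = std::function<void(void*, std::size_t, int, StreamHandle)>;

struct PluggableAllocatorDeleterContext {
  FreeFunc free_fn;      // free function of the allocator that made this allocation
  void* data;            // the allocation itself
  std::size_t size;
  int device;
  StreamHandle stream;
};

// Single-argument deleter matching the DataPtr deleter contract: unpack the
// context, call the recorded free function, then release the context object.
void pluggable_deleter(void* raw_ctx) {
  auto* ctx = static_cast<PluggableAllocatorDeleterContext*>(raw_ctx);
  ctx->free_fn(ctx->data, ctx->size, ctx->device, ctx->stream);
  delete ctx;
}

int main() {
  // A stand-in free function, playing the role of e.g. ncclMemFree or cudaFree.
  FreeFunc my_free = [](void* p, std::size_t sz, int dev, StreamHandle) {
    std::printf("freeing %p (size=%zu) on device %d\n", p, sz, dev);
    std::free(p);
  };
  void* data = std::malloc(256);
  auto* ctx = new PluggableAllocatorDeleterContext{my_free, data, 256, 0, nullptr};
  // In the real allocator this context would become the DataPtr's context;
  // here we just invoke the deleter directly to show the flow.
  pluggable_deleter(ctx);
  return 0;
}
```

Ownership is the key design point: the context is heap-allocated when the allocation is made and deleted exactly once, by the deleter, so the free function survives any later changes to which allocator is "current".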
ezyang left a comment
great thanks
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
@pytorchmergebot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
…cator usage (pytorch#130472) We should be able to create multiple CUDAPluggableAllocators in the same PyTorch program (see pytorch#124807, pytorch#125722 for context). When mixing CUDAPluggableAllocators in the same PyTorch program, we need to make sure that the deleter passed in through the CUDAPluggableAllocator gets "attached" to the data_ptr and persists until program exit (when it's called to free the memory). Currently, CUDAPluggableAllocator maintains a global `current_custom_allocator`. When creating the `DataPtr`, `raw_deleter` attaches `custom_raw_deleter` to the DataPtr, which calls `current_custom_allocator->raw_delete(...)`. This approach is fine when using only one allocator; however, for the multiple-allocator use case, the DataPtr would use the deleter of whatever is in `current_custom_allocator`. For example, if allocation 1 was done with `cudaMalloc` and allocation 2 was done with `ncclMemAlloc`, and `current_custom_allocator` is currently pointing to the CUDAPluggableAllocator with `ncclMemAlloc`, then when cleaning up allocation 1 we'd be using `ncclMemFree` instead of `cudaFree`. In this PR, we solve the above problem by remembering the `free_fn_` using a deleter context, so there is no need to go through an allocator object to find the deleter. CC: @zdevito @ptrblck @eqy Pull Request resolved: pytorch#130472 Approved by: https://github.com/eqy, https://github.com/ezyang
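To make the failure mode described in that commit message concrete, here is a minimal illustrative C++ sketch. It makes no CUDA or NCCL calls; `FakeAllocator`, the plain global `current_custom_allocator`, and `global_lookup_deleter` are stand-ins for the real code, not PyTorch source. It shows how a deleter that consults a global "current allocator" at free time ends up freeing allocation 1 through the wrong allocator once the global has been swapped.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <string>

struct FakeAllocator {
  std::string name;
  void* alloc(std::size_t n) {
    std::printf("[%s] alloc %zu bytes\n", name.c_str(), n);
    return std::malloc(n);
  }
  void raw_delete(void* p) {
    std::printf("[%s] free %p\n", name.c_str(), p);
    std::free(p);
  }
};

FakeAllocator* current_custom_allocator = nullptr;

// Pre-PR style deleter: consult whichever allocator happens to be current *now*.
void global_lookup_deleter(void* p) {
  current_custom_allocator->raw_delete(p);
}

int main() {
  FakeAllocator cuda_like{"cudaMalloc-backed"};
  FakeAllocator nccl_like{"ncclMemAlloc-backed"};

  current_custom_allocator = &cuda_like;
  void* a = cuda_like.alloc(64);          // allocation 1, made by the cuda-like allocator

  current_custom_allocator = &nccl_like;  // allocator is swapped before allocation 1 is freed
  global_lookup_deleter(a);               // wrong: the nccl-like allocator frees allocation 1

  return 0;
}
```

With a per-allocation deleter context like the one sketched earlier in the thread, the free function travels with the allocation, so swapping the current allocator no longer changes which free routine gets called.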
[CUDA] revert PR 130472 (#162950) This change may also resolve #161789, though verification is still needed. PR #130472 introduced the problem of freeing the same address without clean metadata; according to the discussion below, it was reverted. Pull Request resolved: #162950 Approved by: https://github.com/ngimel, https://github.com/eqy, https://github.com/syed-ahmed (cherry picked from commit 4a160da) Co-authored-by: thenumberouscode <[email protected]>