
Conversation

@syed-ahmed (Collaborator) commented Jul 10, 2024

We should be able to create multiple CUDAPluggableAllocators in the same PyTorch program (see #124807 and #125722 for context). When mixing CUDAPluggableAllocators in the same program, we need to make sure that the deleter passed in through the CUDAPluggableAllocator gets "attached" to the `data_ptr` and persists until program exit, when it is called to free the memory.

Currently, CUDAPluggableAllocator maintains a global `current_custom_allocator`. When creating the `DataPtr`, `raw_deleter` attaches `custom_raw_deleter` to the `DataPtr`, which calls `current_custom_allocator->raw_delete(...)`. This approach is fine with a single allocator, but with multiple allocators every `DataPtr` ends up using the deleter of whatever allocator `current_custom_allocator` happens to point to at free time. For example, if allocation 1 was done with `cudaMalloc` and allocation 2 with `ncclMemAlloc`, and `current_custom_allocator` currently points to the CUDAPluggableAllocator with `ncclMemAlloc`, then cleaning up allocation 1 would call `ncclMemFree` instead of `cudaFree`.
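
To make the failure mode concrete, here is a minimal sketch of that flow (hand-simplified, not the actual source; `Allocator` stands in for the real CUDAPluggableAllocator, and the stream type is reduced to `void*`):

```cpp
#include <cstddef>

// Paraphrased free-function signature; the real one uses cudaStream_t.
using FreeFuncType = void (*)(void* ptr, std::size_t size, int device, void* stream);

struct Allocator {
  FreeFuncType free_fn_;
  void raw_delete(void* ptr) {
    // Metadata lookup elided; just forwards to the registered free function.
    free_fn_(ptr, /*size=*/0, /*device=*/0, /*stream=*/nullptr);
  }
};

// One global, consulted at free time rather than captured at alloc time.
Allocator* current_custom_allocator = nullptr;

// The deleter attached to every DataPtr: memory allocated while allocator A
// was current gets freed through B's free_fn_ if B is current at destruction.
void custom_raw_deleter(void* ptr) {
  current_custom_allocator->raw_delete(ptr);
}
```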

In this PR, we solve the above problem by remembering the `free_fn_` in a deleter context, so there is no need to go through an allocator object to find the deleter.
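
Schematically, the fix looks like the sketch below (a hand-written approximation, not the diff itself; the struct name follows the PR, while the exact members and the self-deleting `free()` are simplifications):

```cpp
#include <cstddef>

using FreeFuncType = void (*)(void* ptr, std::size_t size, int device, void* stream);

struct CUDAPluggableAllocatorDeleterContext {
  FreeFuncType free_fn_;  // captured per allocation, at allocation time
  void* data_;
  std::size_t size_;
  int device_;
  void* stream_;

  // Free with the function recorded at alloc time, then destroy the context
  // itself; it lives exactly as long as its allocation.
  void free() {
    free_fn_(data_, size_, device_, stream_);
    delete this;
  }
};

// The DataPtr's deleter now receives the context, not the raw data pointer,
// so no global allocator lookup is needed at free time.
void custom_raw_deleter(void* ptr) {
  static_cast<CUDAPluggableAllocatorDeleterContext*>(ptr)->free();
}
```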

CC: @zdevito @ptrblck @eqy

@syed-ahmed syed-ahmed requested a review from eqy as a code owner July 10, 2024 18:38
pytorch-bot bot commented Jul 10, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/130472

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 612b342 with merge base c101c45:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@syed-ahmed (Collaborator, Author) commented:

@zdevito @albanD If you could review this, I'd really appreciate it! I think you might have the most context about this, since you reviewed #86786.

@eqy (Collaborator) left a comment

Is this BC-breaking? It looks like there is an interface change. Would there need to be more handling on the user's side, e.g., ensuring that the free function is called on the correct stream?

@syed-ahmed (Collaborator, Author) commented Jul 10, 2024

> Is this BC-breaking? It looks like there is an interface change. Would there need to be more handling on the user's side, e.g., ensuring that the free function is called on the correct stream?

It is technically BC-breaking. For instance, RMM does pass along size, device_idx, and stream to its deallocator: https://github.com/rapidsai/rmm/blob/b8b67f8ceb52b50bf1c16d4f3305b7885de5c3ea/python/rmm/rmm/_lib/_torch_allocator.cpp#L54-L60. However, PyTorch passes hard-coded/empty values to RMM, so those parameters are essentially unused: https://github.com/pytorch/pytorch/blob/main/torch/csrc/cuda/CUDAPluggableAllocator.cpp#L127-L129. `cudaFreeAsync` is the only deleter I can think of where the stream parameter is used, but there seems to be no way to exercise that parameter through the CUDAPluggableAllocator interface today.
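
Schematically, the situation is something like this (my shorthand for the linked call site, not the literal code; the placeholder values are assumptions):

```cpp
#include <cstddef>

using FreeFuncType = void (*)(void* ptr, std::size_t size, int device, void* stream);

// Effectively what happens today: the real size/device/stream are not
// threaded through to the deleter, so an RMM-style deallocator only ever
// sees placeholder values.
void invoke_free(FreeFuncType free_fn, void* ptr) {
  free_fn(ptr, /*size=*/0, /*device=*/0, /*stream=*/nullptr);
}
```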

The contract for a deleter function seems to be this: https://github.com/pytorch/pytorch/blob/main/c10/util/UniqueVoidPtr.h#L11. I've thought about making `DeleterFnPtr` more generic, for instance turning it into a class or a template so that users can provide their own deleter function signature: https://godbolt.org/z/aqh8x79Wx. But that solution seems very intrusive, and it gets complicated when handling equality of two deleters in the PyTorch code base:

c10::DeleterFnPtr deleter_expected = &c10::refcounted_deleter;
c10::DeleterFnPtr deleter0 = storage0.data_ptr().get_deleter();
c10::DeleterFnPtr deleter1 = storage1.data_ptr().get_deleter();
if ((deleter0 != deleter_expected) || (deleter1 != deleter_expected)) {
  return false;
}
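
For reference, the alias itself is tiny:

```cpp
// c10/util/UniqueVoidPtr.h: a deleter is a plain one-argument function pointer.
namespace c10 {
using DeleterFnPtr = void (*)(void*);
}
```

A class- or template-based deleter would carry state, so two logically equivalent deleters would no longer compare equal as raw function pointers, which is where the intrusiveness comes from.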

Do you have any suggestions on what should be done here?

@syed-ahmed syed-ahmed force-pushed the pluggable-allocator-test branch from adcfcff to 1b29f57 on July 12, 2024 01:37
@syed-ahmed syed-ahmed changed the title from "Uses DeleterFnPtr as the type for CUDAPluggableAllocator free function" to "Uses context pointer for deleter to enable multiple CUDAPluggableAllocator usage" on Jul 12, 2024
@syed-ahmed (Collaborator, Author) commented Jul 12, 2024

OK, I have pushed some changes, so this is no longer BC-breaking.

Reading the notes in c10/util/UniqueVoidPtr.h and looking at the COWDeleter implementation, I believe using a context pointer is the right solution here. I've created a CUDAPluggableAllocatorDeleterContext that records the data pointer, free_fn, size, device, and stream. When creating the DataPtr in CUDAPluggableAllocator, this context is now passed instead of just the data pointer.
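
Roughly, the allocation path now looks like this (a sketch, not the diff; it assumes the context type and `custom_raw_deleter` sketched in the PR description, and the DataPtr construction is paraphrased):

```cpp
#include <c10/core/Allocator.h>  // c10::DataPtr, c10::Device
#include <cstddef>

// Assumes FreeFuncType, CUDAPluggableAllocatorDeleterContext, and
// custom_raw_deleter as sketched above.
c10::DataPtr make_data_ptr(void* data, std::size_t size, int device,
                           void* stream, FreeFuncType free_fn) {
  auto* ctx = new CUDAPluggableAllocatorDeleterContext{
      free_fn, data, size, device, stream};
  // The context, not the raw data pointer, rides along as the DataPtr's
  // deleter context; `data` is still what callers dereference.
  return c10::DataPtr(data, ctx, &custom_raw_deleter,
                      c10::Device(c10::DeviceType::CUDA,
                                  static_cast<c10::DeviceIndex>(device)));
}
```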

@eqy I've restored free_fn_ signature to what it was before. Does that resolve your concern about BC breakage?
@ezyang If you could review my usage of CUDAPluggableAllocatorDeleterContext, I'd really appreciate it!

@syed-ahmed syed-ahmed requested a review from eqy July 12, 2024 01:59
@ezyang (Contributor) left a comment

great thanks

@ezyang (Contributor) commented Jul 18, 2024

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk label (Trigger trunk jobs on your pull request) on Jul 18, 2024
@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator) commented:

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see the pytorch-bot wiki.

@eqy (Collaborator) commented Jul 18, 2024

@pytorchmergebot merge

@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

DiweiSun pushed a commit to DiweiSun/pytorch that referenced this pull request Jul 22, 2024
Uses context pointer for deleter to enable multiple CUDAPluggableAllocator usage (pytorch#130472)

Pull Request resolved: pytorch#130472
Approved by: https://github.com/eqy, https://github.com/ezyang
xuhancn pushed a commit to xuhancn/pytorch that referenced this pull request Jul 25, 2024
Uses context pointer for deleter to enable multiple CUDAPluggableAllocator usage (pytorch#130472)
pytorchmergebot pushed a commit that referenced this pull request Sep 19, 2025
[CUDA] revert PR 130472 (#162950)

This change may also resolve #161789, though verification is still needed.

PR #130472 introduced the problem of freeing the same address without cleaning up its metadata; following the discussion on that PR, it has been reverted.
Pull Request resolved: #162950
Approved by: https://github.com/ngimel, https://github.com/eqy, https://github.com/syed-ahmed
pytorchbot pushed a commit that referenced this pull request Sep 19, 2025
[CUDA] revert PR 130472 (#162950)

(cherry picked from commit 4a160da)
mansiag05 pushed a commit to mansiag05/pytorch that referenced this pull request Sep 22, 2025
[CUDA] revert PR 130472 (pytorch#162950)
cleonard530 pushed a commit to cleonard530/pytorch that referenced this pull request Sep 22, 2025
[CUDA] revert PR 130472 (pytorch#162950)
dsashidh pushed a commit to dsashidh/pytorch that referenced this pull request Sep 26, 2025
[CUDA] revert PR 130472 (pytorch#162950)
atalman pushed a commit that referenced this pull request Sep 29, 2025
[CUDA] revert PR 130472 (#162950)

(cherry picked from commit 4a160da)

Co-authored-by: thenumberouscode <[email protected]>