
Conversation

@syed-ahmed (Collaborator) commented Jul 10, 2024

We should be able to create multiple CUDAPluggableAllocators in the same PyTorch program (see #124807 and #125722 for context). When mixing CUDAPluggableAllocators in the same program, we need to make sure that the deleter passed in through the CUDAPluggableAllocator gets "attached" to the `data_ptr` and persists until program exit, when it is called to free the memory.

Currently, CUDAPluggableAllocator maintains a global `current_custom_allocator`. When creating the `DataPtr`, `raw_deleter` attaches `custom_raw_deleter` to the `DataPtr`, which calls `current_custom_allocator->raw_delete(...)`. This approach is fine with a single allocator, but with multiple allocators every `DataPtr` ends up using the deleter of whatever allocator `current_custom_allocator` happens to point to at free time. For example, if allocation 1 was done with `cudaMalloc` and allocation 2 with `ncclMemAlloc`, and `current_custom_allocator` currently points to the CUDAPluggableAllocator with `ncclMemAlloc`, then cleaning up allocation 1 would call `ncclMemFree` instead of `cudaFree`.
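
To make the failure mode concrete, here is a minimal sketch of that flow (hand-simplified, not the actual source; `Allocator` stands in for the real CUDAPluggableAllocator, and the stream type is reduced to `void*`):

```cpp
#include <cstddef>

// Paraphrased free-function signature; the real one uses cudaStream_t.
using FreeFuncType = void (*)(void* ptr, std::size_t size, int device, void* stream);

struct Allocator {
  FreeFuncType free_fn_;
  void raw_delete(void* ptr) {
    // Metadata lookup elided; just forwards to the registered free function.
    free_fn_(ptr, /*size=*/0, /*device=*/0, /*stream=*/nullptr);
  }
};

// One global, consulted at free time rather than captured at alloc time.
Allocator* current_custom_allocator = nullptr;

// The deleter attached to every DataPtr: memory allocated while allocator A
// was current gets freed through B's free_fn_ if B is current at destruction.
void custom_raw_deleter(void* ptr) {
  current_custom_allocator->raw_delete(ptr);
}
```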

In this PR, we solve the above problem by remembering the `free_fn_` in a deleter context, so there is no need to go through an allocator object to find the deleter.
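
Schematically, the fix looks like the sketch below (a hand-written approximation, not the diff itself; the struct name follows the PR, while the exact members and the self-deleting `free()` are simplifications):

```cpp
#include <cstddef>

using FreeFuncType = void (*)(void* ptr, std::size_t size, int device, void* stream);

struct CUDAPluggableAllocatorDeleterContext {
  FreeFuncType free_fn_;  // captured per allocation, at allocation time
  void* data_;
  std::size_t size_;
  int device_;
  void* stream_;

  // Free with the function recorded at alloc time, then destroy the context
  // itself; it lives exactly as long as its allocation.
  void free() {
    free_fn_(data_, size_, device_, stream_);
    delete this;
  }
};

// The DataPtr's deleter now receives the context, not the raw data pointer,
// so no global allocator lookup is needed at free time.
void custom_raw_deleter(void* ptr) {
  static_cast<CUDAPluggableAllocatorDeleterContext*>(ptr)->free();
}
```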

CC: @zdevito @ptrblck @eqy

@syed-ahmed syed-ahmed requested a review from eqy as a code owner July 10, 2024 18:38
pytorch-bot bot commented Jul 10, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/130472

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 612b342 with merge base c101c45:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@syed-ahmed (Collaborator, Author) commented:

@zdevito @albanD If you could review this, I'd really appreciate it! I think you might have the most context about this, since you reviewed #86786.

@eqy (Collaborator) left a comment

Is this BC-breaking? It looks like there is an interface change. Would there need to be more handling on the user's side, e.g., ensuring that the free function is called on the correct stream?

@syed-ahmed (Collaborator, Author) commented Jul 10, 2024

> Is this BC-breaking? It looks like there is an interface change. Would there need to be more handling on the user's side, e.g., ensuring that the free function is called on the correct stream?

It is technically BC-breaking. For instance, RMM does pass along size, device_idx, and stream to its deallocator: https://github.com/rapidsai/rmm/blob/b8b67f8ceb52b50bf1c16d4f3305b7885de5c3ea/python/rmm/rmm/_lib/_torch_allocator.cpp#L54-L60. However, PyTorch passes hard-coded/empty values to RMM, so those parameters are essentially unused: https://github.com/pytorch/pytorch/blob/main/torch/csrc/cuda/CUDAPluggableAllocator.cpp#L127-L129. `cudaFreeAsync` is the only deleter I can think of where the stream parameter is used, but there seems to be no way to exercise that parameter through the CUDAPluggableAllocator interface today.
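
Schematically, the situation is something like this (my shorthand for the linked call site, not the literal code; the placeholder values are assumptions):

```cpp
#include <cstddef>

using FreeFuncType = void (*)(void* ptr, std::size_t size, int device, void* stream);

// Effectively what happens today: the real size/device/stream are not
// threaded through to the deleter, so an RMM-style deallocator only ever
// sees placeholder values.
void invoke_free(FreeFuncType free_fn, void* ptr) {
  free_fn(ptr, /*size=*/0, /*device=*/0, /*stream=*/nullptr);
}
```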

The contract for a deleter function seems to be this: https://github.com/pytorch/pytorch/blob/main/c10/util/UniqueVoidPtr.h#L11. I've thought about making `DeleterFnPtr` more generic, for instance turning it into a class or a template so that users can provide their own deleter function signature: https://godbolt.org/z/aqh8x79Wx. But that solution seems very intrusive, and it gets complicated when handling equality of two deleters in the PyTorch code base:

c10::DeleterFnPtr deleter_expected = &c10::refcounted_deleter;
c10::DeleterFnPtr deleter0 = storage0.data_ptr().get_deleter();
c10::DeleterFnPtr deleter1 = storage1.data_ptr().get_deleter();
if ((deleter0 != deleter_expected) || (deleter1 != deleter_expected)) {
  return false;
}
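
For reference, the alias itself is tiny:

```cpp
// c10/util/UniqueVoidPtr.h: a deleter is a plain one-argument function pointer.
namespace c10 {
using DeleterFnPtr = void (*)(void*);
}
```

A class- or template-based deleter would carry state, so two logically equivalent deleters would no longer compare equal as raw function pointers, which is where the intrusiveness comes from.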

Do you have any suggestions on what should be done here?

@syed-ahmed syed-ahmed force-pushed the pluggable-allocator-test branch from adcfcff to 1b29f57 on July 12, 2024 01:37
@syed-ahmed syed-ahmed changed the title from "Uses DeleterFnPtr as the type for CUDAPluggableAllocator free function" to "Uses context pointer for deleter to enable multiple CUDAPluggableAllocator usage" on Jul 12, 2024
@syed-ahmed (Collaborator, Author) commented Jul 12, 2024

OK, I have pushed some changes, so this is no longer BC-breaking.

Reading the notes in c10/util/UniqueVoidPtr.h and looking at the COWDeleter implementation, I believe using a context pointer is the right solution here. I've created a CUDAPluggableAllocatorDeleterContext that records the data pointer, free_fn, size, device, and stream. When creating the DataPtr in CUDAPluggableAllocator, this context is now passed instead of just the data pointer.
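
Roughly, the allocation path now looks like this (a sketch, not the diff; it assumes the context type and `custom_raw_deleter` sketched in the PR description, and the DataPtr construction is paraphrased):

```cpp
#include <c10/core/Allocator.h>  // c10::DataPtr, c10::Device
#include <cstddef>

// Assumes FreeFuncType, CUDAPluggableAllocatorDeleterContext, and
// custom_raw_deleter as sketched above.
c10::DataPtr make_data_ptr(void* data, std::size_t size, int device,
                           void* stream, FreeFuncType free_fn) {
  auto* ctx = new CUDAPluggableAllocatorDeleterContext{
      free_fn, data, size, device, stream};
  // The context, not the raw data pointer, rides along as the DataPtr's
  // deleter context; `data` is still what callers dereference.
  return c10::DataPtr(data, ctx, &custom_raw_deleter,
                      c10::Device(c10::DeviceType::CUDA,
                                  static_cast<c10::DeviceIndex>(device)));
}
```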

@eqy I've restored free_fn_ signature to what it was before. Does that resolve your concern about BC breakage?
@ezyang If you could review my usage of CUDAPluggableAllocatorDeleterContext, I'd really appreciate it!

@syed-ahmed syed-ahmed requested a review from eqy July 12, 2024 01:59
@ezyang (Contributor) left a comment

great thanks

@ezyang (Contributor) commented Jul 18, 2024

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk label (Trigger trunk jobs on your pull request) on Jul 18, 2024
@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator) commented:

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see the pytorch-bot wiki.

@eqy (Collaborator) commented Jul 18, 2024

@pytorchmergebot merge

@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

DiweiSun pushed a commit to DiweiSun/pytorch that referenced this pull request Jul 22, 2024
Uses context pointer for deleter to enable multiple CUDAPluggableAllocator usage (pytorch#130472)

Pull Request resolved: pytorch#130472
Approved by: https://github.com/eqy, https://github.com/ezyang
xuhancn pushed a commit to xuhancn/pytorch that referenced this pull request Jul 25, 2024
Uses context pointer for deleter to enable multiple CUDAPluggableAllocator usage (pytorch#130472)
pytorchmergebot pushed a commit that referenced this pull request Sep 19, 2025
[CUDA] revert PR 130472 (#162950)

This change may also resolve #161789, though verification is still needed.

PR #130472 introduced the problem of freeing the same address without cleaning up its metadata; following the discussion on that PR, it has been reverted.
Pull Request resolved: #162950
Approved by: https://github.com/ngimel, https://github.com/eqy, https://github.com/syed-ahmed
pytorchbot pushed a commit that referenced this pull request Sep 19, 2025
[CUDA] revert PR 130472 (#162950)

(cherry picked from commit 4a160da)
mansiag05 pushed a commit to mansiag05/pytorch that referenced this pull request Sep 22, 2025
[CUDA] revert PR 130472 (pytorch#162950)
cleonard530 pushed a commit to cleonard530/pytorch that referenced this pull request Sep 22, 2025
[CUDA] revert PR 130472 (pytorch#162950)
dsashidh pushed a commit to dsashidh/pytorch that referenced this pull request Sep 26, 2025
[CUDA] revert PR 130472 (pytorch#162950)
atalman pushed a commit that referenced this pull request Sep 29, 2025
[CUDA] revert PR 130472 (#162950)

(cherry picked from commit 4a160da)

Co-authored-by: thenumberouscode <[email protected]>