[CUDA] revert PR 130472 #162950
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/162950

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures as of commit 9a7bd01 with merge base 1c16c18. This comment was automatically generated by Dr. CI and updates every 15 minutes.
@ngimel Please take a look when you have a moment.

@pytorchbot label "release notes:cuda"

Didn't find following labels among repository labels: release notes:cuda

@pytorchbot label "release notes: cuda"
```cpp
{
  const std::lock_guard<std::mutex> lock(allocator_mutex_);
  // was: allocation_metadata_.emplace(r, _AllocationMetadata(size, device, stream));
  auto result = allocation_metadata_.try_emplace(r, size, device, stream);
```
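For context, a minimal standalone sketch (plain STL; `Metadata` is a stand-in for `_AllocationMetadata`, not the real type) of the difference this diff relies on: neither `emplace` nor `try_emplace` overwrites an existing key, but `try_emplace` reports whether the insertion actually happened, so a stale entry for a reused pointer can be detected and fixed up.

```cpp
#include <cstddef>
#include <cstdio>
#include <unordered_map>

// Stand-in for _AllocationMetadata; not the real PyTorch type.
struct Metadata {
  size_t size;
};

int main() {
  std::unordered_map<void*, Metadata> metadata;
  int dummy = 0;
  void* ptr = &dummy;

  metadata.emplace(ptr, Metadata{1535});  // first allocation at ptr
  metadata.emplace(ptr, Metadata{40});    // key exists: silently ignored, stale 1535 stays

  // try_emplace also refuses to overwrite, but it tells the caller whether
  // the insertion happened, so a stale entry can be detected and replaced.
  auto result = metadata.try_emplace(ptr, Metadata{40});
  if (!result.second) {
    result.first->second = Metadata{40};
  }
  std::printf("recorded size is now %zu\n", result.first->second.size);  // prints 40
}
```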
How can previous allocation with the same address not have been erased during free?
> How can previous allocation with the same address not have been erased during free?

Oh, because the same address was not erased before it was allocated again: it was deallocated via the context, and the context did not clear the key in allocation_metadata_.
pytorch/torch/csrc/cuda/CUDAPluggableAllocator.cpp, lines 22 to 25 in 9009c4d:

```cpp
void CUDAPluggableAllocatorDeleterContext::free() {
  free_fn_(data_, size_, device_, stream_);
  delete this;
}
```
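For what it's worth, here is an illustrative, self-contained sketch (plain STL; `my_malloc`, `context_free`, and `raw_delete` are made-up names for illustration, not the real allocator code) of the failure mode described above: the context path releases the memory without touching the metadata map, so when the backend later hands out the same address again, `emplace` keeps the stale size — the same "size=40 expected=1535" mismatch that shows up in the logs further down.

```cpp
#include <cstddef>
#include <cstdio>
#include <unordered_map>

// Stand-in for the allocator's private allocation_metadata_ map.
static std::unordered_map<void*, size_t> allocation_metadata;

static char backing[4096];  // pretend backend memory that gets reused

void* my_malloc(size_t size) {
  void* p = backing;                     // the backend hands out the same address again
  allocation_metadata.emplace(p, size);  // allocate() records metadata (emplace never overwrites)
  return p;
}

// raw_delete-style path: releases the memory *and* erases the metadata.
void raw_delete(void* p) {
  allocation_metadata.erase(p);
}

// Context-deleter-style path (as in #130472): only calls free_fn_;
// the metadata entry is left behind.
void context_free(void* /*p*/) {
  // free_fn_(data_, size_, device_, stream_);  <-- no metadata cleanup here
}

int main() {
  void* a = my_malloc(1535);  // metadata[a] = 1535
  context_free(a);            // memory released, but metadata[a] survives
  void* b = my_malloc(40);    // same address comes back; emplace keeps the stale 1535
  std::printf("recorded size for reused address: %zu (expected 40)\n",
              allocation_metadata.at(b));
  raw_delete(b);              // the raw path would have erased the entry
}
```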
So that's the original sin: raw_delete behavior is different from the context deleter. The context deleter has to also clean up metadata.
> So that's the original sin: raw_delete behavior is different from the context deleter. The context deleter has to also clean up metadata.
I think we should not clean up metadata in the context, because that is not the design goal of the context; see PR #130472, where the author says the context should free the memory without depending on an allocator, and the metadata is a private member of the allocator. So we should not record metadata again in the allocate function for a context that already records it; removing the metadata recording may be a better way.
Ok, then not recording the metadata sounds like a better way.
So essentially the context as implemented in #130472 is bc-breaking? Nice! If we stop recording metadata, that is a bc-breaking change. If we stop cleaning up metadata, metadata is useless. cc @syed-ahmed
…to prevent free failures when deallocating the same memory address with different sizes.
Force-pushed from 3a1093f to 2c11b85.
Actually, after talking to @syed-ahmed, we've decided that the proper fix is to just revert #130472; it's not doing anything useful. @thenumberouscode can you please try that?

I'm verifying locally whether reverting #130472 fixes the problem. Recording our conversation here for posterity: #130472 was written to allow usage of multiple CUDAPluggableAllocators in the same program, and at that time I thought we would need that functionality to enable MemPools. However, when MemPools were implemented, we didn't end up needing it.

Yes, I had already reverted PR 130472 in my local development environment, and local testing shows the issue has been resolved. I've updated this PR's code to revert PR 130472. cc @syed-ahmed
Force-pushed from 6ef5a02 to ee549fd.
```cpp
  cudaStream_t stream_{};
};

#if defined(USE_ROCM)
```
@ngimel @syed-ahmed this line should not be reverted to #if defined(TORCH_HIP_VERSION), because it was introduced by another bug-fix change, a19b667.
Thank you @thenumberouscode
LGTM. Thank you @thenumberouscode!
@syed-ahmed I had cleaned all the build caches and used the following test:

```cpp
#include <torch/csrc/cuda/CUDAPluggableAllocator.h>
#include <torch/torch.h>

#include <iostream>
#include <unordered_map>

std::unordered_map<void*, ssize_t> allocation_sizes;
#define EXPECT_EQ(actual, expected) \
do { \
auto actual_val = (actual); \
auto expected_val = (expected); \
if (actual_val != expected_val) { \
std::cerr << "FAIL: EXPECT_EQ(" << #actual << ", " << #expected << ")" << std::endl; \
std::cerr << " Expected: " << expected_val << std::endl; \
std::cerr << " Actual: " << actual_val << std::endl; \
std::cerr << " At: " << __FILE__ << ":" << __LINE__ << std::endl; \
return false; \
} \
} while (0)
void* logging_malloc(size_t size, int device, cudaStream_t stream) {
void* ptr;
cudaMalloc(&ptr, size);
std::cout << "alloc ptr=" << ptr << " size=" << size << " device=" << device
<< " stream=" << stream << std::endl;
allocation_sizes[ptr] = size;
return ptr;
}
void logging_free(void* ptr, size_t size, int device, cudaStream_t stream) {
std::cout << "free ptr=" << ptr << " size=" << size << " device=" << device
<< " stream=" << stream << std::endl;
// Print out any frees that don't match the allocation sizes
if (allocation_sizes.find(ptr) != allocation_sizes.end()) {
if (allocation_sizes[ptr] != size) {
std::cout << "*** ERROR: free mismatch: " << ptr << " size=" << size
<< " expected=" << allocation_sizes[ptr] << std::endl;
}
} else {
std::cout << "WARNING: free of unknown ptr=" << ptr << std::endl;
}
cudaFree(ptr);
allocation_sizes.erase(ptr);
}
// TEST(TestTorchUnique, UniqueComparisonTest) {
int main() {
auto custom_allocator =
torch::cuda::CUDAPluggableAllocator::createCustomAllocator(logging_malloc, logging_free);
torch::cuda::CUDAPluggableAllocator::changeCurrentAllocator(custom_allocator);
// Run the command 3 times; the first 2 will pass and the third invocation will have
// different sizes in alloc and free
for (int i = 0; i < 3; ++i) {
std::cout << "\n Starting test " << i << std::endl;
// Initialize simple sorted tensor with repeats
torch::Tensor sorted_tensor =
torch::tensor({0, 0, 0, 1, 1, 2, 3, 3, 3, 3, 5},
torch::TensorOptions().dtype(torch::kFloat32).device(at::kCUDA));
std::cout << "Starting unique_consecutive" << std::endl;
// This operation will call malloc/free with different sizes on the same pointer
auto unique_dim_result = torch::unique_consecutive(sorted_tensor, false, true, 0);
std::cout << "Finished unique_consecutive" << std::endl;
// Everything below is only there to validate correct results
auto unique_dim_values = std::get<0>(unique_dim_result);
auto unique_dim_counts = std::get<2>(unique_dim_result);
// Check tensor sizes
EXPECT_EQ(unique_dim_values.size(0), 5);
EXPECT_EQ(unique_dim_counts.size(0), 5);
// Copy to CPU before accessing elements
torch::Tensor cpu_values = unique_dim_values.cpu();
torch::Tensor cpu_counts = unique_dim_counts.cpu();
// Use accessors on the CPU tensors
auto values_accessor = cpu_values.accessor<float, 1>();
auto counts_accessor = cpu_counts.accessor<int64_t, 1>();
// Check individual values using accessors
EXPECT_EQ(values_accessor[0], 0.0f);
EXPECT_EQ(values_accessor[1], 1.0f);
EXPECT_EQ(values_accessor[2], 2.0f);
EXPECT_EQ(values_accessor[3], 3.0f);
EXPECT_EQ(values_accessor[4], 5.0f);
// Check count values using accessors
EXPECT_EQ(counts_accessor[0], 3);
EXPECT_EQ(counts_accessor[1], 2);
EXPECT_EQ(counts_accessor[2], 1);
EXPECT_EQ(counts_accessor[3], 4);
EXPECT_EQ(counts_accessor[4], 1);
std::cout << "Finished test " << i << " \n" << std::endl;
}
}
```

Output:

```
Starting test 0
alloc ptr=0x762f79600000 size=44 device=0 stream=0
Starting unique_consecutive
alloc ptr=0 size=0 device=0 stream=0
alloc ptr=0x762f79600200 size=88 device=0 stream=0
alloc ptr=0 size=0 device=0 stream=0
alloc ptr=0 size=0 device=0 stream=0
alloc ptr=0 size=0 device=0 stream=0
alloc ptr=0x762f79600400 size=96 device=0 stream=0
alloc ptr=0x762f79600600 size=1535 device=0 stream=0
free ptr=0x762f79600600 size=1535 device=0 stream=0
alloc ptr=0x762f79600600 size=40 device=0 stream=0
alloc ptr=0x762f79600800 size=255 device=0 stream=0
free ptr=0x762f79600800 size=255 device=0 stream=0
free ptr=0x762f79600400 size=96 device=0 stream=0
alloc ptr=0 size=0 device=0 stream=0
alloc ptr=0x762f79600400 size=20 device=0 stream=0
free ptr=0x762f79600200 size=88 device=0 stream=0
Finished unique_consecutive
Finished test 0
free ptr=0x762f79600400 size=20 device=0 stream=0
free ptr=0x762f79600600 size=40 device=0 stream=0
free ptr=0x762f79600000 size=44 device=0 stream=0
Starting test 1
alloc ptr=0x762f72800000 size=44 device=0 stream=0
Starting unique_consecutive
alloc ptr=0 size=0 device=0 stream=0
alloc ptr=0x762f72800200 size=88 device=0 stream=0
alloc ptr=0 size=0 device=0 stream=0
alloc ptr=0 size=0 device=0 stream=0
alloc ptr=0 size=0 device=0 stream=0
alloc ptr=0x762f72800400 size=96 device=0 stream=0
alloc ptr=0x762f72800600 size=1535 device=0 stream=0
free ptr=0x762f72800600 size=1535 device=0 stream=0
alloc ptr=0x762f72800600 size=40 device=0 stream=0
alloc ptr=0x762f72800800 size=255 device=0 stream=0
free ptr=0x762f72800800 size=255 device=0 stream=0
free ptr=0x762f72800400 size=96 device=0 stream=0
alloc ptr=0 size=0 device=0 stream=0
alloc ptr=0x762f72800400 size=20 device=0 stream=0
free ptr=0x762f72800200 size=88 device=0 stream=0
Finished unique_consecutive
Finished test 1
free ptr=0x762f72800400 size=20 device=0 stream=0
free ptr=0x762f72800600 size=40 device=0 stream=0
free ptr=0x762f72800000 size=44 device=0 stream=0
Starting test 2
alloc ptr=0x762f72800000 size=44 device=0 stream=0
Starting unique_consecutive
alloc ptr=0 size=0 device=0 stream=0
alloc ptr=0x762f72800200 size=88 device=0 stream=0
alloc ptr=0 size=0 device=0 stream=0
alloc ptr=0 size=0 device=0 stream=0
alloc ptr=0 size=0 device=0 stream=0
alloc ptr=0x762f72800400 size=96 device=0 stream=0
alloc ptr=0x762f72800600 size=1535 device=0 stream=0
free ptr=0x762f72800600 size=1535 device=0 stream=0
alloc ptr=0x762f72800600 size=40 device=0 stream=0
alloc ptr=0x762f72800800 size=255 device=0 stream=0
free ptr=0x762f72800800 size=255 device=0 stream=0
free ptr=0x762f72800400 size=96 device=0 stream=0
alloc ptr=0 size=0 device=0 stream=0
alloc ptr=0x762f72800400 size=20 device=0 stream=0
free ptr=0x762f72800200 size=88 device=0 stream=0
Finished unique_consecutive
Finished test 2
free ptr=0x762f72800400 size=20 device=0 stream=0
free ptr=0x762f72800600 size=40 device=0 stream=0
free ptr=0x762f72800000 size=44 device=0 stream=0
```
@syed-ahmed I've used the main branch to reproduce the bug, and it occurs again, so I guess the revert may work. Please let me know if I am wrong.

```
./consecutive
Starting test 0
alloc ptr=0x7d1f03600000 size=44 device=0 stream=0
Starting unique_consecutive
alloc ptr=0 size=0 device=0 stream=0
alloc ptr=0x7d1f03600200 size=88 device=0 stream=0
free ptr=0 size=0 device=0 stream=0
alloc ptr=0 size=0 device=0 stream=0
alloc ptr=0 size=0 device=0 stream=0
alloc ptr=0 size=0 device=0 stream=0
alloc ptr=0x7d1f03600400 size=96 device=0 stream=0
free ptr=0 size=0 device=0 stream=0
alloc ptr=0x7d1f03600600 size=1535 device=0 stream=0
free ptr=0x7d1f03600600 size=1535 device=0 stream=0
alloc ptr=0x7d1f03600600 size=40 device=0 stream=0
free ptr=0 size=0 device=0 stream=0
WARNING: free of unknown ptr=0
alloc ptr=0x7d1f03600800 size=255 device=0 stream=0
free ptr=0x7d1f03600800 size=255 device=0 stream=0
free ptr=0x7d1f03600400 size=96 device=0 stream=0
alloc ptr=0 size=0 device=0 stream=0
alloc ptr=0x7d1f03600400 size=20 device=0 stream=0
free ptr=0 size=0 device=0 stream=0
free ptr=0x7d1f03600200 size=88 device=0 stream=0
Finished unique_consecutive
Finished test 0
free ptr=0x7d1f03600400 size=20 device=0 stream=0
free ptr=0 size=0 device=0 stream=0
WARNING: free of unknown ptr=0
free ptr=0x7d1f03600600 size=40 device=0 stream=0
free ptr=0x7d1f03600000 size=44 device=0 stream=0
Starting test 1
alloc ptr=0x7d1efa800000 size=44 device=0 stream=0
Starting unique_consecutive
alloc ptr=0 size=0 device=0 stream=0
alloc ptr=0x7d1efa800200 size=88 device=0 stream=0
free ptr=0 size=0 device=0 stream=0
alloc ptr=0 size=0 device=0 stream=0
alloc ptr=0 size=0 device=0 stream=0
alloc ptr=0 size=0 device=0 stream=0
alloc ptr=0x7d1efa800400 size=96 device=0 stream=0
free ptr=0 size=0 device=0 stream=0
alloc ptr=0x7d1efa800600 size=1535 device=0 stream=0
free ptr=0x7d1efa800600 size=1535 device=0 stream=0
alloc ptr=0x7d1efa800600 size=40 device=0 stream=0
free ptr=0 size=0 device=0 stream=0
WARNING: free of unknown ptr=0
alloc ptr=0x7d1efa800800 size=255 device=0 stream=0
free ptr=0x7d1efa800800 size=255 device=0 stream=0
free ptr=0x7d1efa800400 size=96 device=0 stream=0
alloc ptr=0 size=0 device=0 stream=0
alloc ptr=0x7d1efa800400 size=20 device=0 stream=0
free ptr=0 size=0 device=0 stream=0
free ptr=0x7d1efa800200 size=88 device=0 stream=0
Finished unique_consecutive
Finished test 1
free ptr=0x7d1efa800400 size=20 device=0 stream=0
free ptr=0 size=0 device=0 stream=0
WARNING: free of unknown ptr=0
free ptr=0x7d1efa800600 size=40 device=0 stream=0
free ptr=0x7d1efa800000 size=44 device=0 stream=0
Starting test 2
alloc ptr=0x7d1efa800000 size=44 device=0 stream=0
Starting unique_consecutive
alloc ptr=0 size=0 device=0 stream=0
alloc ptr=0x7d1efa800200 size=88 device=0 stream=0
free ptr=0 size=0 device=0 stream=0
alloc ptr=0 size=0 device=0 stream=0
alloc ptr=0 size=0 device=0 stream=0
alloc ptr=0 size=0 device=0 stream=0
alloc ptr=0x7d1efa800400 size=96 device=0 stream=0
free ptr=0 size=0 device=0 stream=0
alloc ptr=0x7d1efa800600 size=1535 device=0 stream=0
free ptr=0x7d1efa800600 size=40 device=0 stream=0
*** ERROR: free mismatch: 0x7d1efa800600 size=40 expected=1535
alloc ptr=0x7d1efa800600 size=40 device=0 stream=0
free ptr=0 size=0 device=0 stream=0
WARNING: free of unknown ptr=0
alloc ptr=0x7d1efa800800 size=255 device=0 stream=0
free ptr=0x7d1efa800800 size=255 device=0 stream=0
free ptr=0x7d1efa800400 size=96 device=0 stream=0
alloc ptr=0 size=0 device=0 stream=0
alloc ptr=0x7d1efa800400 size=20 device=0 stream=0
free ptr=0 size=0 device=0 stream=0
free ptr=0x7d1efa800200 size=88 device=0 stream=0
Finished unique_consecutive
Finished test 2
free ptr=0x7d1efa800400 size=20 device=0 stream=0
free ptr=0 size=0 device=0 stream=0
WARNING: free of unknown ptr=0
free ptr=0x7d1efa800600 size=40 device=0 stream=0
free ptr=0x7d1efa800000 size=44 device=0 stream=0
```
Can we add the code from #161789 as a test (or, actually, a specifically constructed example that tries to allocate different-sized buffers at the same address)? Custom allocators can be constructed with load_inline.

@syed-ahmed are you ok with landing this PR as is? Even if it doesn't fix all the failures (I doubt it, tbh; @thenumberouscode's examples show that it actually fixes the previous failure), it's a strict improvement.
Yes, landing this as is sounds good to me. Let's keep the original bug open. @thenumberouscode thank you for the PR and the follow-ups!

@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@syed-ahmed @thenumberouscode can one of you take an AI (action item) to add the test as a follow-up and to verify whether the original bug(s) are fixed? Currently @syed-ahmed sees failures even on top of this, whereas @thenumberouscode doesn't.

@pytorchbot cherry-pick --onto release/2.9 -c regression
This change may also resolve #161789, though verification is still needed. PR #130472 introduced the problem of freeing the same address without cleaning up its metadata; per the discussion below, it has been reverted. Pull Request resolved: #162950. Approved by: https://github.com/ngimel, https://github.com/eqy, https://github.com/syed-ahmed (cherry picked from commit 4a160da)

Cherry picking #162950: the cherry-pick PR is at #163379, and it is recommended to link a regression cherry-pick PR with an issue. The following tracker issues are updated. Details for Dev Infra team: raised by workflow job.
I believe @thenumberouscode is also seeing failures, per their last comment. I'll take the AI and will post the test on Monday.

If I'm reading his comments correctly, the bug doesn't repro with the revert for him, and it repros on the main branch.
@ngimel Yes, I cannot reproduce the bug with the reverted code, but I can reproduce it on the main branch without the revert. cc @syed-ahmed |
@ngimel I'm willing to follow up and add the unit test as well. I can do this if @syed-ahmed |
@thenumberouscode yes, please add the test, greatly appreciated!

Added a test; need to make sure it's running in the CI and clean up logging if needed: #163575
[CUDA] revert PR 130472 (#162950). This change may also resolve #161789, though verification is still needed. PR #130472 introduced the problem of freeing the same address without cleaning up its metadata; per the discussion below, it has been reverted. Pull Request resolved: #162950. Approved by: https://github.com/ngimel, https://github.com/eqy, https://github.com/syed-ahmed (cherry picked from commit 4a160da) Co-authored-by: thenumberouscode <[email protected]>
This change may also resolve #161789, though verification is still needed.
PR #130472 introduced the problem of freeing the same address without cleaning up its metadata; per the discussion below, it has been reverted.