Fix bug in atomicAdd for int16_t #29231
Conversation
Fixes: #29153
What is the bug? Is the original calculation just flat-out wrong? This information would be nice to have in the commit message / PR body as well.
It's in the linked issue, but I've now updated the PR to include the repro here too. Yes, the original code was just wrong; it returned zeros.
zou3519
left a comment
(not a full review) I had some comments on the testing. I'm still trying to understand what the bug was and what the fix is at the moment.
test/test_torch.py (Outdated)

```
else:
    tensor = torch.randint(0, 10, size, dtype=dtype, device=device)

# index_add calls atomicAdd on cuda.
```
It seems like we're relying on index_add calling atomicAdd. It might be clearer to introduce a C++ API test that only calls atomicAdd so that we're still testing this behavior if someone changes the implementation of index_add
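For illustration, a minimal sketch of what such a direct test could look like, assuming a standalone `.cu` harness built with nvcc, that `THC/THCAtomics.cuh` is on the include path, and that its `int16_t` overload is spelled `atomicAdd` as in the code under review; the kernel name and pass/fail reporting are hypothetical, not an existing PyTorch test:

```
#include <cstdio>
#include <cstdint>
#include <THC/THCAtomics.cuh>  // assumed available on the include path

// Every thread adds 1 to the same int16_t element; with the buggy overload
// the element stayed at 0.
__global__ void add_ones(int16_t* out) {
  atomicAdd(out, static_cast<int16_t>(1));
}

int main() {
  int16_t* d_out = nullptr;
  cudaMalloc(&d_out, sizeof(int16_t));
  cudaMemset(d_out, 0, sizeof(int16_t));
  add_ones<<<1, 256>>>(d_out);
  int16_t host = 0;
  cudaMemcpy(&host, d_out, sizeof(int16_t), cudaMemcpyDeviceToHost);
  std::printf("got %d, expected 256\n", static_cast<int>(host));
  cudaFree(d_out);
  return host == 256 ? 0 : 1;
}
```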
I'll check with you tomorrow. Last I looked, it didn't seem like we had the infrastructure set up to build and run tests on .cu files. If you know an easy way to do that, or where such tests would go, maybe point me to it; otherwise I think setting that up could be a separate task.
Some of our cpp tests definitely run on CUDA: https://github.com/pytorch/pytorch/blob/master/test/cpp/jit/test_fuser.cpp
What I am looking for is an English description of what the bug is and how this PR fixes it.
Fixes: #29153

The bug is that atomicAdd doesn't correctly add values for some dtypes due to incorrect casting; it was returning zeros. Incorrect behavior before this PR:

```
In [23]: sparse = torch.sparse_coo_tensor(indices=torch.tensor([[0,0],[1,1]]),
    ...:                                  values=torch.tensor([5, 6], dtype=torch.int16),
    ...:                                  size=(2,2), device='cuda', dtype=torch.int16)

In [24]: sparse
Out[24]:
tensor(indices=tensor([[0, 0],
                       [1, 1]]),
       values=tensor([5, 6]),
       device='cuda:0', size=(2, 2), nnz=2, dtype=torch.int16,
       layout=torch.sparse_coo)

In [25]: sparse.coalesce()
Out[25]:
tensor(indices=tensor([[0],
                       [1]]),
       values=tensor([11]),
       device='cuda:0', size=(2, 2), nnz=1, dtype=torch.int16,
       layout=torch.sparse_coo)

In [26]: sparse.to_dense()
Out[26]:
tensor([[0, 0],
        [0, 0]], device='cuda:0', dtype=torch.int16)

In [27]: sparse.coalesce().to_dense()
Out[27]:
tensor([[ 0, 11],
        [ 0,  0]], device='cuda:0', dtype=torch.int16)

In [30]: torch.add(torch.zeros([2,2], dtype=torch.int16, device='cuda'), sparse)
Out[30]:
tensor([[0, 0],
        [0, 0]], device='cuda:0', dtype=torch.int16)
```
test/test_torch.py (Outdated)

```
if is_signed(dtype):
    tensor = torch.randint(-5, 15, size, dtype=dtype, device=device)
else:
    tensor = torch.randint(0, 10, size, dtype=dtype, device=device)
```
The atomicAdd bug looks like it is triggered in a very specific situation, when the sum being stored back into a memory address is signed. Would it be worth testing that case specifically?
It needs to handle sign properly, but it wasn't just that: the original code just always returned zero. In intermediate versions I thought I had fixed the issue, but I ran into problems that only occurred at specific alignments of values within the 32-bit words (see the sketch below). In practice I didn't get good test signal when I wrote narrowly defined tests for specific cases, and the randint for the signed case should always get some negatives.
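For context, here is a sketch of how a CAS-based 16-bit atomicAdd typically handles that alignment: hardware CAS operates on aligned 32-bit words, so a 16-bit element sits in either the low or the high half of a word and has to be spliced in without disturbing its neighbor. This is illustrative of the technique, not the exact PyTorch implementation:

```
#include <cstdint>

__device__ void atomic_add_i16(int16_t* address, int16_t val) {
  // Round the address down to its containing 4-byte word.
  uint32_t* word = (uint32_t*)((char*)address - ((size_t)address & 2));
  // Shift is 0 if the element is the low half of the word, 16 if the high half.
  unsigned shift = ((size_t)address & 2) ? 16 : 0;
  uint32_t old = *word;
  uint32_t assumed;
  do {
    assumed = old;
    // Extract our half-word and add in signed 16-bit arithmetic.
    int16_t cur = (int16_t)((assumed >> shift) & 0xffff);
    int16_t sum = (int16_t)(cur + val);
    // Mask the sum back to 16 bits before splicing, so a negative result
    // doesn't sign-extend into the neighboring element.
    uint32_t newval = (assumed & ~(0xffffu << shift))
                    | ((uint32_t)(uint16_t)sum << shift);
    old = atomicCAS(word, assumed, newval);
  } while (assumed != old);
}
```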
Could you give some sample inputs to index_add that trigger the bug? I want to know roughly the order of magnitude of the probability that the randomly generated input does not catch the bug; we want to make our tests robust to changes in seed.
In the example from the initial issue we actually got all zeros, so almost any random value would catch that. There are lots of other possible ways this code can be wrong, though, which were missed when I tried to validate narrowly for particular values; random tensors capture those more thoroughly.
Could you walk through an example input and how it gets to all zeros? There's a lot of bit shifting going on and I'd like to make sure I understand how the implementation works
I mean, nothing gets added at all. I'm not actually sure there's a case that doesn't trigger this bug, besides maybe adding a tensor of zeros. Here's an example in 1.3.0 for a single value, with the expected result for int64 and the zero result for int16; it's basically a no-op:

```
# expected 0 + 1 == 1
>>> dtype = torch.int64
>>> torch.zeros([1], device='cuda', dtype=dtype).index_add(0, torch.tensor([0], device='cuda'), torch.ones([1], dtype=dtype, device='cuda'))
tensor([1], device='cuda:0')

# unexpected 0 + 1 == 0
>>> dtype = torch.int16
>>> torch.zeros([1], device='cuda', dtype=dtype).index_add(0, torch.tensor([0], device='cuda'), torch.ones([1], dtype=dtype, device='cuda'))
tensor([0], device='cuda:0', dtype=torch.int16)
```
It would be easy to add the example that motivated the change, but it's so trivial that it doesn't add coverage. The test with a full multi-dimensional tensor and a variety of values can capture potential corner cases that this narrow test wouldn't.
I'm not complaining about the test case coverage; I'm still trying to understand why the previous code is wrong. Because the fix was to change from doing unsigned math to signed math, there should be a non-trivial set of positive values for which the math is correct, but from your explanation this doesn't seem to be the case.
To help bridge my gap in understanding, it would be really helpful if you could run through an end-to-end example of how almost all inputs give zeros.
I'm not sure it's worth trying to dig too deeply into all the ways the existing code was incorrect. It didn't work in any situation I know of and was never tested. There's more wrong with it than just the sign issue.
One thing that's wrong is the line `sum = val + (size_t)address & 2 ? T(old >> 16) : T(old & 0xffff);`. Because `+` and `&` both bind tighter than `?:`, the condition parses as `(val + (size_t)address) & 2`, so the half-word selection depends on `val` rather than on the address alone, and `val` is never actually added into `sum`; the whole expression just picks one half of `old`. Just fixing that, we still have the sign issue, but I'm not sure that completely covers the problems here.
There are so many ways untested code can be wrong that it's hard to be certain of all of them without a lot of investigation and additional testing.
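To make the precedence issue concrete, here is the expression in isolation, runnable on the host; the values are made up for illustration:

```
#include <cstdio>
#include <cstddef>
#include <cstdint>

int main() {
  uint32_t old = 0x00050000;  // a 5 stored in the high half-word
  int16_t val = 1;
  size_t address = 0x1000;    // bit 1 clear: the element is the low half-word

  // As written, '+' binds tighter than '&', which binds tighter than '?:',
  // so this parses as ((val + address) & 2) ? ... : ... and val is consumed
  // by the condition instead of being added to the result.
  uint32_t as_written = (val + address & 2) ? (old >> 16) : (old & 0xffff);

  // What was presumably intended: select the half-word by address alignment,
  // then add val to it.
  uint32_t intended = val + ((address & 2) ? (old >> 16) : (old & 0xffff));

  std::printf("as_written=%u intended=%u\n", as_written, intended);  // 0 vs 1
  return 0;
}
```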
For reference, it looks like the original code may have been adapted from: https://github.com/rapidsai/cudf/blob/8ddb744a716aa8fe736284272b065cf5b7085065/cpp/src/reductions/reduction_operators.cuh#L20-L43
zou3519
left a comment
Code looks correct. Had some comments about the tests
aten/src/THC/THCAtomics.cuh (Outdated)

```
old = atomicCAS(address_as_ui, assumed, old);
old_bytes = (old >> shift) & 0xff;
T tmp = THCNumerics<T>::add(val, old_bytes);
// maintain same size in initial cast to avoid padding negative values with 1s
```
As the code reader I'd like a little more information. Something like:
"If we did something like uint32_t newval = tmp (which is -1), then newval = 0xffffffff, but we really just wanted to store one byte (0x000000ff)."
Updated a bit; I also made the cast inline instead of using extra tmp values.
zou3519
left a comment
Code looks fine, had some nits with the testing
test/test_torch.py
Outdated
| # add coverage for issue with atomic add that appeared only for | ||
| # specific dtypes on cuda: | ||
| # https://github.com/pytorch/pytorch/issues/29153 | ||
| def test_dtypes_for_index_add(self): |
I wouldn't call this "test_dtypes_for_index_add". The other test doesn't test dtypes, but it really should (on the other hand, test_index_add shouldn't test CUDA because then it would run for too long).
What you're doing here is checking that an index_add between zeros and a tensor gives back the original tensor. How about naming it something like test_index_add_trivial?
I don't like "trivial" because it doesn't capture what's actually tested. The main purpose of this test is to capture if index_add doesn't work for some dtypes due to special dtype handling code. I could do another PR at some point to merge the two though.
Again, I feel really weird having both tests. In terms of design, it makes sense for all tests in test_torch, unless otherwise specified, to test all dtypes; something named "test_index_add" should really test everything, unless there are multiple tests sharing the "test_index_add" prefix with differentiating suffixes. I don't want to block this fix on that, but it would be great if you could send a follow-up PR to clean the tests up.
I'm fine to eventually merge the two. I agree all of these tests should eventually be running for all dtypes.
Summary:
Pull Request resolved: pytorch/pytorch#29231

Fixes: pytorch/pytorch#29153

The bug is that atomicAdd doesn't correctly add values for some dtypes due to incorrect casting; it was returning zeros. Incorrect behavior before this PR is shown in the repro above.

Test Plan: Imported from OSS

Differential Revision: D18575666

Pulled By: nairbv

fbshipit-source-id: 9b193b386bf4a9615014aa890d2e9f4f694940ac
Stack from ghstack:

Fixes: #29153

The bug is that atomicAdd doesn't correctly add values for some dtypes due to incorrect casting; it was returning zeros. Incorrect behavior before this PR is shown in the repro above.

Differential Revision: D18575666