Implementing cuda kernel for tril_indices and triu_indices #15203
Conversation
(force-pushed 62178ba to d9c6339)
(force-pushed d9c6339 to 6967d60)
There seems to be some non-deterministic error. The …

It seems to be a bug in existing code. UPDATE: …

Can you run some performance numbers comparing torch.tril_indices on the GPU and on the CPU with numpy.tril_indices? Quick sanity check that the GPU acceleration actually helps.

Good point! Yes, will do.
test/test_cuda.py (Outdated)

```python
    torch.tril_indices(row, col, offset, dtype=dtype, device='cuda'))
x = torch.ones(row, col, dtype=dtype, device='cpu') \
    .tril(offset).nonzero().transpose(0, 1).cuda()
torch.cuda.synchronize()
```
You don't need to synchronize here. self.assertEqual should have a sync point somewhere
It seems to be an irrelevant error: …
(force-pushed 6706c56 to 70d469b)
```python
self._compare_trilu_indices(3, 513, offset=1, dtype=torch.long)
self._compare_trilu_indices(513, 3, offset=1, dtype=torch.int)
self._compare_trilu_indices(513, 0, offset=1, dtype=torch.double)
for test_args in tri_tests_args:
```
nit: You can deduplicate _compare_trilu_indices and use it in both TestTorch and TestCuda by having it take a "device" argument.
Here is an example of how to call a function defined in TestTorch from TestCuda:
Line 937 in d71fac2:

```python
_TestTorchMixin._test_neg(self, lambda t: t.cuda())
```
test/test_cuda.py (Outdated)

```python
# explicitly convert 'cpu' tensor to 'cuda' to avoid a bug in tril
# and triu cuda kernel (see #15226)
x = torch.ones(row, col, dtype=dtype, device='cpu') \
    .tril(offset).nonzero().transpose(0, 1).cuda()
```
To help with the deduplication between this and the _compare_trilu_indices in test_torch.py, you can do

```python
x = torch.ones(row, col, dtype=dtype, device='cpu') \
    .tril(offset).nonzero().transpose(0, 1).to(device)
```

here, assuming we pass in a `device` for _compare_trilu_indices.
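A minimal sketch of that mixin pattern, without torch: the recording stub below is hypothetical — the real helper would build tensors with torch and compare them — but it shows how one `device`-parameterized body can serve both test classes:

```python
import unittest

class _TestTorchMixin(object):
    # Hypothetical shared helper: the 'device' argument decides where tensors
    # are created, so TestTorch and TestCuda can share one test body. Here we
    # only record the call instead of building real tensors.
    def _compare_trilu_indices(self, row, col, offset=0, device='cpu'):
        self.calls.append((row, col, offset, device))

class TestTorch(_TestTorchMixin, unittest.TestCase):
    def setUp(self):
        self.calls = []

    def test_trilu_indices(self):
        self._compare_trilu_indices(3, 513, offset=1)  # defaults to 'cpu'

class TestCuda(unittest.TestCase):
    def setUp(self):
        self.calls = []

    def test_trilu_indices(self):
        # mirrors the `_TestTorchMixin._test_neg(self, ...)` calling style
        _TestTorchMixin._compare_trilu_indices(self, 3, 513, offset=1,
                                               device='cuda')
```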
```cpp
f <<= 1;
auto b = f - 1;
auto c = - (x << 1);
row = (int64_t) ::floor((-b + ::sqrt(b * b - 4.0 * c)) / 2);
```
Does this accumulate in double or float? Instead of floor, would it be better to round the result of this to the nearest integer?
I'm a little worried about the numeric stability: can we run a test on some input with b * b much larger than 4 * c?
It accumulates in double. I will add some tests for large b*b and small 4 * c.
Regarding the instability, I do see cases where the result was incorrectly rounded up when I was using float. Should I add back the check (if the # of elems up to row was larger than x), or do you know any other more stable way to calculate the row idx? I thought about binary search, but it would be much slower compared to directly calling sqrt.
> would it be better to round the result of this to the nearest integer
Do you mean using the __double2ll_rd method?
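The "add back the check" idea mentioned above can be sketched in plain Python for the simplest case (offset 0, matrix wide enough that row r holds r + 1 elements; names are illustrative, not the kernel's):

```python
import math

def row_of_index(x):
    # Linear index x (counting tril elements row-major) lies in row r iff
    # r*(r+1)/2 <= x < (r+1)*(r+2)/2. Solving r^2 + r - 2x = 0 gives
    # r = (-1 + sqrt(1 + 8x)) / 2, which we floor...
    r = int(math.floor((-1.0 + math.sqrt(1.0 + 8.0 * x)) / 2))
    # ...and then guard against the root having been rounded up: if the
    # number of elements before row r already exceeds x, step back one row.
    if r * (r + 1) // 2 > x:
        r -= 1
    return r
```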
Double precision is good; double arithmetic on GPUs is slower than float arithmetic but it is better that this kernel is correct.
What exactly is unstable here? Is the problem taking the square root of a very big number?
> Do you mean using the __double2ll_rd method?
I'm not sure what _double2ll_rd is, but llround might work too: https://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH__DOUBLE.html#group__CUDA__MATH__DOUBLE_1g6e401c3a6f291b874fc95b8480bcad02
When using single precision, I encountered an error while calculating indices for a 5k x 5k matrix. The result should be something like 4999.999998, but it got rounded up to 5000 as float only has 7-8 digits of precision.
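That failure mode can be reproduced outside CUDA by rounding each intermediate to single precision with the struct module (a simulation of float arithmetic, not the exact kernel code): for the last element of row 4999 in a 5000 x 5000 lower triangle, double keeps the root just below 5000, while float rounds it up past it.

```python
import math
import struct

def f32(v):
    # round a Python double to the nearest IEEE-754 single-precision value
    return struct.unpack('f', struct.pack('f', v))[0]

def row_double(x):
    # quadratic root computed in double precision
    return int((-1.0 + math.sqrt(1.0 + 8.0 * x)) / 2)

def row_float(x):
    # same computation with every intermediate rounded to float32
    d = f32(f32(1.0) + f32(8.0 * x))
    return int(f32((f32(-1.0) + f32(math.sqrt(d))) / 2))

# last element of row 4999 in a 5000 x 5000 lower triangle:
x = 5000 * 5001 // 2 - 1   # = 12502499
```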
zou3519 left a comment:
This is great! I had two minor comments:
- please try to deduplicate the testing code more; I think we can keep one version of it in test/test_torch.py and call it directly from test/test_cuda.py
- I'm a little worried about the numeric stability -- can you try a test case where b^2 is much greater than 4c and check the correctness?
I added code to explicitly handle precision loss when converting …
hmmm, our Jenkins server does not have enough RAM to test large tensors. Let me reduce the test size a bit.
(force-pushed b3efac8 to 3bcb83e)
Encountered an irrelevant error: …

The following larger-tensor tests pass on my local server; they would hit the precision-loss problem if we did not handle it specifically:

```python
# row, col[, offset]
[
    (536870901, 1),
    (1, 536870901),
    (268435455, 2, 1),
    (2, 268435455, 1),
]
```
That's fine, I doubt users will pick something so large. A lot of the code in our codebase uses int instead of int64, so I think users would hit other problems before they run into this one :)
test/common_methods_invocations.py (Outdated)
How long does this take? It might be better if we don't run all of these in our CI if it takes too long / is too resource intensive (pick your favorite one!)
On the gpudevserver, it takes ~2 minutes in total.
Actually, I realize it is good to have these to test resolve_root_int's fallback binary search path. Let's leave them then unless they take too long to run
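The idea behind such a fallback can be sketched as follows (names hypothetical; the PR's resolve_root_int uses a binary search, while a linear correction stands in for it here): take the floating-point root as an estimate, then repair it with exact integer arithmetic.

```python
import math

def resolve_root(x):
    # Estimate the row from the quadratic root, computed in double...
    r = int((-1.0 + math.sqrt(1.0 + 8.0 * x)) / 2)
    # ...then repair any precision loss with exact integer comparisons, so
    # the invariant r*(r+1)/2 <= x < (r+1)*(r+2)/2 holds even when 8*x no
    # longer fits in a double's 53-bit mantissa.
    while r * (r + 1) // 2 > x:
        r -= 1
    while (r + 1) * (r + 2) // 2 <= x:
        r += 1
    return r
```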
Okay, 2 minutes is too long for a unit test to be running -- could you keep one of these test cases and note the rest of them in a comment in the code somewhere?
Sure, I will comment out all tests but one, and add descriptions in TensorFactories.cu to remind developers if they would like to make some changes.
zou3519 left a comment:
lgtm, thank you @mrshenli!
Left a minor comment about testing but the code looks good
facebook-github-bot left a comment:
@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
(force-pushed d9f277d to 5f08940)

(force-pushed 5f08940 to 36a8b5e)

(force-pushed 36a8b5e to b00e2e6)
Summary: Followup PR of #14904, and the stretch goal of #12653.

Directly calculate coordinates in the original tensor using the column index in the result tensor. Every GPU thread takes care of a column (two numbers) in the output tensor.

The implementation detects and handles precision loss when calculating the square root of an `int64_t` variable, and supports tensors with up to `row * column = 2 ^ 59` numbers. Algorithm details are described in [comments of TensorFactories.cu](https://github.com/pytorch/pytorch/blob/23ddb6f58a1c8a7a660a793f174cf014230176c6/aten/src/ATen/native/cuda/TensorFactories.cu#L109-L255).

Pull Request resolved: pytorch/pytorch#15203
Reviewed By: zou3519
Differential Revision: D13517695
Pulled By: mrshenli
fbshipit-source-id: 86b305d22cac08c8962a3b0cf8e9e620b7ec33ea
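A pure-Python sketch of the per-index computation, restricted to the simplest case (offset 0, col >= row; names illustrative, not the kernel's): each GPU thread maps its linear index straight to a coordinate pair, with the rounding guard discussed in review.

```python
import math

def tril_coord(i):
    # Coordinates of the i-th lower-triangular element (offset 0,
    # col >= row), as one thread would derive them from its linear index.
    r = int((-1.0 + math.sqrt(1.0 + 8.0 * i)) / 2)
    if r * (r + 1) // 2 > i:       # guard against an upward-rounded root
        r -= 1
    c = i - r * (r + 1) // 2       # column offset within row r
    return r, c

def tril_indices(n):
    # reference: gather all coordinates for an n x n lower triangle
    coords = [tril_coord(i) for i in range(n * (n + 1) // 2)]
    return [r for r, _ in coords], [c for _, c in coords]
```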