Skip manual backward for `cdist` with case `p=2` (#31167)
Conversation
cdist kernel grid parameter to avoid CUDA invalid configuration error
I think this will solve #27209 as well.
tkerola left a comment:
Just some small comments.
Force-pushed from 4ed44d5 to 2e1aa18.
💊 CircleCI build failures summary and remediations (as of commit 7ab0f56)
Detailed failure analysis: one may explore the probable reasons each build failed interactively on the Dr. CI website.
🕵️ 2 new failures recognized by patterns. The following build failures do not appear to be due to upstream breakage:
Force-pushed from 108d618 to bc353c4.
ngimel left a comment:
Please add a test for the case you are fixing. Also, this will still break for m greater than approximately 32 million (the 65K grid limit times the work handled per block), but that's better than before.
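For intuition on that bound: CUDA caps the second grid dimension at 65535 blocks, so one launch can cover at most that many blocks times the rows handled per block. A minimal sketch of the arithmetic follows; the 512 rows per block is an assumed figure for illustration, not taken from the actual kernel.
```python
# Sketch of the gridDim.y bound; ROWS_PER_BLOCK is an assumption, not the real kernel's value.
CUDA_MAX_GRID_Y = 65535          # hardware limit on the second grid dimension
ROWS_PER_BLOCK = 512             # hypothetical work assigned to each block

def fits_single_launch(m: int) -> bool:
    """True if m rows can be covered without exceeding gridDim.y."""
    blocks_needed = -(-m // ROWS_PER_BLOCK)   # ceiling division
    return blocks_needed <= CUDA_MAX_GRID_Y

print(fits_single_launch(800 * 1216))    # True: ~1M rows, the case reported here
print(fits_single_launch(40_000_000))    # False: beyond roughly 65535 * 512 rows
```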
Similar issue in `pdist`. @emcastillo let me know if you want to fix both methods in the same run or if I should take care of it.
I am trying to write the test, but the required matrix sizes for it to fail are quite big, resulting in the test failing with an out-of-memory error when running the functional checks. How should I proceed in this case? @ptrblck you can take care of pdist :)
@emcastillo it looks like you are hitting #24345, and it looks like it was never resolved. You can either go back to whatever cdist implementation there was before pytorch 1.2 (you'd need to add batching support, because it did not exist before pytorch 1.2) that supposedly did not use as much memory, or you can, at least for euclidean distance, let pytorch figure out that backward pass itself and call the necessary matrix multiplies (#31599); that would take care of most practical cases. Non-euclidean distances would still throw an error.
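As a rough illustration of that matrix-multiply route (a minimal sketch using the expansion ||x − y||² = ||x||² + ||y||² − 2⟨x, y⟩; the function name is made up here and this is not the PR's actual implementation): every op below is differentiable, so autograd derives the backward from matmuls and reductions instead of the custom kernel.
```python
import torch

def euclid_cdist_via_mm(x, y, eps=1e-12):
    # x: (..., n, d), y: (..., m, d) -> distances of shape (..., n, m)
    x_sq = x.pow(2).sum(dim=-1, keepdim=True)                 # (..., n, 1)
    y_sq = y.pow(2).sum(dim=-1, keepdim=True)                 # (..., m, 1)
    sq = x_sq + y_sq.transpose(-2, -1) - 2.0 * (x @ y.transpose(-2, -1))
    return sq.clamp_min(eps).sqrt()     # clamp guards against tiny negative values

x = torch.randn(1000, 10, requires_grad=True)
y = torch.randn(1000, 10, requires_grad=True)
d = euclid_cdist_via_mm(x, y)
d.sum().backward()                      # no hand-written backward kernel involved
```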
Thanks for the advice! Happy new year!
Happy New Year! A similar thing is done for adaptive_avg_pooling https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/AdaptiveAveragePooling.cpp#L324-L340. In some cases it can have a device-independent differentiable implementation, and in that case it does not need device-specific dispatch and a gradient formula; in the general case _adaptive_avg_pool2d has device-specific dispatch and device-specific backward formulas (also look in native_functions.yaml for adaptive_avg_pool2d and _adaptive_avg_pool2d).
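A Python-level sketch of the pattern being described (illustrative only; the real wiring lives in native_functions.yaml and the C++ kernels, and the function name here is made up): when the output size divides the input evenly, the op can be written as a composite of an already-differentiable op, so autograd supplies the gradient; otherwise the general, device-dispatched op with its own backward takes over.
```python
import torch
import torch.nn.functional as F

def adaptive_avg_pool2d_sketch(x, output_size):
    oh, ow = output_size
    h, w = x.shape[-2], x.shape[-1]
    if h % oh == 0 and w % ow == 0:
        # Device-independent, differentiable path: composed of avg_pool2d,
        # so autograd derives the backward automatically.
        return F.avg_pool2d(x, kernel_size=(h // oh, w // ow))
    # General case: in ATen this is where _adaptive_avg_pool2d (device-specific
    # dispatch and backward formulas) is used; the public functional op stands in here.
    return F.adaptive_avg_pool2d(x, output_size)

x = torch.randn(1, 3, 32, 32, requires_grad=True)
y = adaptive_avg_pool2d_sketch(x, (8, 8))   # divisible case -> avg_pool2d path
y.mean().backward()
```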
I met this problem in pytorch v1.3.1. For my case, I need to compute the similarity between a matrix (N, 2500, 256) and another matrix (N, 2500, 256); will this deal with it? Also, I am not sure whether it will be cherry-picked into v1.4, or whether there exists some workaround for this. Edit: N is a small number, e.g. 8.
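For reference, a minimal batched call with the shapes described above (sizes taken from the comment; whether a given release avoids the invalid-configuration error still has to be verified on that release):
```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
N = 8                                           # small batch, as noted in the edit
a = torch.randn(N, 2500, 256, device=device, requires_grad=True)
b = torch.randn(N, 2500, 256, device=device, requires_grad=True)
d = torch.cdist(a, b, p=2)                      # (N, 2500, 2500) pairwise distances
d.sum().backward()
```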
@ngimel, sorry for the delay, I can finally start working on this. What I understood from reading the code and the links you pointed me to is that I need to define a new generic function for the euclidean case, and that this function should be the main one that is called when p=2 is used. Please correct me if I am wrong (which I most likely am 😂)
Force-pushed from bc353c4 to f9154c3.
@emcastillo please address @ailzhang's comment and rebase. Thanks!
Hi, so there's a list of tests that we skip in pytorch/xla. In the PR linked above I've added these two tests to the skip list. So if you remove the if branch now, it should be passing ;)
@emcastillo commented on this pull request.
In test/test_torch.py:
```
@@ -9745,20 +9745,22 @@ def test_cdist_norm_batch(self, device):
         self.assertTrue(torch.allclose(expected, actual))

     def test_cdist_large(self, device):
-        for cm in ['use_mm_for_euclid_dist_if_necessary', 'use_mm_for_euclid_dist', 'donot_use_mm_for_euclid_dist']:
-            x = torch.randn(1000, 10, device=device)
-            y = torch.randn(1000, 10, device=device)
-            actual = torch.cdist(x, y, p=2, compute_mode=cm)
-            expected = self._brute_cdist(x, y, p=2)
-            self.assertTrue(torch.allclose(expected, actual))
+        if self.device_type in ('cpu', 'cuda'):
```
Hi!
The problem here is that in these two tests it says that cdist is not compatible with XLA, hence the failure. This is expected, as there is a device check in cdist that throws an exception if the device is neither cuda nor cpu.
Do you have any clue why these two tests are executed while other cdist tests seem to be OK?
Force-pushed from e12b25e to 7ab0f56.
Check removed and rebased!
@ngimel I think that the failures are not related to my changes. Can you confirm, please?
ailzhang left a comment:
Thanks!
facebook-github-bot left a comment:
@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Thanks, the xla failures are probably not related to your changes, but @ailzhang would know more.
@ngimel The currently failing tests are quantization tests ;)
Summary: Fixes an issue with `cdist` backward calculation for large inputs for the euclidean case. The grid size when launching the kernel exceeded the 2^16 limit for the second dimension, resulting in `RuntimeError: CUDA error: invalid configuration argument`. Code to reproduce:
```
import torch

h, w, d = 800, 1216, 12
n = 133
A = torch.randn(n, d).cuda()
B = torch.randn(h, w, d).cuda()
A.requires_grad = True
B.requires_grad = True
B = B.reshape(-1, d).contiguous()
dist = torch.cdist(A, B)
loss = dist.sum()
loss.backward()
```
Thanks to tkerola for the bug report, reproduction and suggesting a solution.
Pull Request resolved: #31167
Differential Revision: D20035605
Pulled By: ngimel
fbshipit-source-id: ae28ba4b549ee07a8bd937bb1de2438dc24eaa17
How can I update my version of torch to get this change?
You can get nightly packages following instructions on pytorch.org. |
I'm still getting an error with pytorch 1.7.1 and nightly. Hard to reproduce though (it definitely is because of