Port CUDA torch.geqrf to ATen #56251
Conversation
This PR ports `torch.geqrf` from TH to ATen for the CUDA path. Resolves #24569
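For context, `geqrf` computes a QR factorization via Householder reflections, returning R in the upper triangle together with the scalar factors `tau` of the elementary reflectors. A pure-Python sketch of the underlying algorithm (illustrative only, not the MAGMA/LAPACK implementation; it accumulates Q explicitly so the result can be checked, whereas `geqrf` stores the reflector vectors below the diagonal together with `tau`):

```python
import math

def householder_qr(A):
    # Pure-Python illustration of Householder QR, the algorithm behind
    # LAPACK's geqrf. A is an m x n list of rows with m >= n. Returns
    # Q (m x m) and R (m x n) such that A = Q @ R.
    m, n = len(A), len(A[0])
    R = [row[:] for row in A]
    Q = [[float(i == j) for j in range(m)] for i in range(m)]
    for k in range(n):
        # Build the Householder vector that zeroes column k below the diagonal.
        x = [R[i][k] for i in range(k, m)]
        norm_x = math.sqrt(sum(t * t for t in x))
        if norm_x == 0.0:
            continue
        v = x[:]
        v[0] += math.copysign(norm_x, x[0])
        norm_v = math.sqrt(sum(t * t for t in v))
        v = [t / norm_v for t in v]
        # Apply the reflector H = I - 2 v v^T to the trailing block of R.
        for j in range(k, n):
            dot = sum(v[i] * R[k + i][j] for i in range(m - k))
            for i in range(m - k):
                R[k + i][j] -= 2.0 * dot * v[i]
        # Accumulate Q = Q * H so the factorization can be verified.
        for r in range(m):
            dot = sum(Q[r][k + i] * v[i] for i in range(m - k))
            for i in range(m - k):
                Q[r][k + i] -= 2.0 * dot * v[i]
    return Q, R
```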
💊 CI failures summary: as of commit 2843955, there are no failures. (Automatically generated by Dr. CI.)
```diff
 dispatch:
-  CPU: geqrf
-  CUDA: legacy::cuda::_th_geqrf
+  CPU, CUDA: geqrf
```
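The diff above changes the operator's registration in `native_functions.yaml` so that both backends dispatch to the same ATen kernel. A sketch of what the resulting entry looks like (the exact function signature here is an assumption and may differ from upstream):

```yaml
# Hypothetical shape of the native_functions.yaml entry after this PR:
- func: geqrf(Tensor self) -> (Tensor a, Tensor tau)
  dispatch:
    CPU, CUDA: geqrf
```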
You don't have to do it in this PR, but consider making this structured! https://github.com/pytorch/rfcs/blob/rfc-0005/RFC-0005-structured-kernel-definitions.md
```cpp
// magmaGeqrf uses a hybrid CPU-GPU algorithm to compute the elementary reflectors.
// The driver routine geqrf2_gpu accepts a tensor on the CPU for elementary reflectors.
Tensor tau_cpu = at::empty(tau.sizes(), tau.options().device(at::kCPU));
```
Does it make sense to allocate this in pinned memory and use a non-blocking copy below? The effect is probably negligible, but I don't know off the top of my head.
I think it wouldn't hurt.
Is it simply `Tensor tau_cpu = at::empty(tau.sizes(), tau.options().device(at::kCPU).pinned_memory(true));`?
Yes, and `/*non_blocking=*/true` for `copy_` below, iirc.
Thank you! It gave a non-negligible speedup:
| size | before | after |
|------------------------------|--------|-------|
| torch.Size([2, 2]) | 3.0 | 2.3 |
| torch.Size([2, 2, 2]) | 5.8 | 4.4 |
| torch.Size([32, 2, 2]) | 88.3 | 70.6 |
| torch.Size([8, 8]) | 2.9 | 2.2 |
| torch.Size([2, 8, 8]) | 5.7 | 4.4 |
| torch.Size([32, 8, 8]) | 88.3 | 70.7 |
| torch.Size([16, 16]) | 3.0 | 2.2 |
| torch.Size([2, 16, 16]) | 5.7 | 4.5 |
| torch.Size([32, 16, 16]) | 89.0 | 72.4 |
| torch.Size([32, 32]) | 3.0 | 2.3 |
| torch.Size([2, 32, 32]) | 5.8 | 4.6 |
| torch.Size([32, 32, 32]) | 91.6 | 72.7 |
| torch.Size([64, 64]) | 3.1 | 2.4 |
| torch.Size([2, 64, 64]) | 6.0 | 4.7 |
| torch.Size([32, 64, 64]) | 93.1 | 74.0 |
| torch.Size([128, 128]) | 3.5 | 2.8 |
| torch.Size([2, 128, 128]) | 6.9 | 5.5 |
| torch.Size([32, 128, 128]) | 107.8 | 88.1 |
| torch.Size([256, 256]) | 4.7 | 3.7 |
| torch.Size([2, 256, 256]) | 9.2 | 7.2 |
| torch.Size([32, 256, 256]) | 144.1 | 112.1 |
| torch.Size([512, 512]) | 6.9 | 6.0 |
| torch.Size([2, 512, 512]) | 13.3 | 11.8 |
| torch.Size([32, 512, 512]) | 209.3 | 184.6 |
| torch.Size([1024, 1024]) | 15.4 | 14.6 |
| torch.Size([2, 1024, 1024]) | 30.5 | 28.9 |
| torch.Size([32, 1024, 1024]) | 486.5 | 462.4 |
Times are in milliseconds (ms).
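For reference, per-call timings like these are typically collected with a small benchmark harness; the actual tool used above isn't shown, so here is only a minimal stdlib sketch of the idea. A real harness for this PR would also call `torch.cuda.synchronize()` around the timed region so asynchronous CUDA work is fully counted:

```python
import time

def bench_ms(fn, iters=50):
    # Warm up once, then report mean wall-clock time per call in milliseconds.
    # For GPU work the device must be synchronized before and after timing;
    # this toy version only measures host-side time.
    fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters * 1e3

ms = bench_ms(lambda: sum(i * i for i in range(10_000)))
print(f"{ms:.3f} ms per call")
```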
I just hope MAGMA synchronizes when it writes tau, otherwise the non-blocking copy might produce a wrong result (it would copy tau to the GPU before it's properly populated).
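The hazard raised here can be modeled without a GPU. A purely illustrative Python toy, with a thread standing in for the MAGMA kernel and an event standing in for stream synchronization: without the wait, the "copy" could observe a half-filled buffer; with it, the copy always sees the final values.

```python
import threading
import time

def produce(buf, done):
    # Stand-in for the MAGMA kernel: fills the buffer, then signals completion.
    for i in range(len(buf)):
        buf[i] = i + 1
        time.sleep(0.001)
    done.set()

def async_copy(buf, out, sync_event=None):
    # Stand-in for a non-blocking copy: snapshots the buffer. Without
    # sync_event it may run while the producer is still writing.
    if sync_event is not None:
        sync_event.wait()
    out[:] = buf

buf = [0] * 8
copied = [0] * 8
done = threading.Event()
t = threading.Thread(target=produce, args=(buf, done))
t.start()
async_copy(buf, copied, sync_event=done)  # safe: waits for the producer
t.join()
assert copied == list(range(1, 9))
```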
Hey @IvanYashchuk, unfortunately the internal tools are unable to merge this; would you try rebasing the stack now that the first 2 PRs are in?
@mruberry, I rebased the stack.
Summary: Pull Request resolved: pytorch#56251

This PR ports `torch.geqrf` from TH to ATen for the CUDA path. Resolves pytorch#24569

Test Plan: Imported from OSS
Reviewed By: ngimel
Differential Revision: D27960155
Pulled By: mruberry
fbshipit-source-id: a8b010c41d703a5de4bf40b045c89e6b95b5a5ca
Stack from ghstack:
This PR ports `torch.geqrf` from TH to ATen for the CUDA path. Resolves #24569
Differential Revision: D27960155