Collaborator

@IvanYashchuk IvanYashchuk commented Apr 16, 2021

Using the cuSOLVER path, `pytest test/test_ops.py -k 'linalg_qr' --durations=5` runs roughly 50 seconds faster locally (81.28 s with MAGMA vs. 31.98 s with cuSOLVER). See #56256 (comment).

Performance comparison: #56256 (comment).

Differential Revision: D27960154

Contributor

facebook-github-bot commented Apr 16, 2021

💊 CI failures summary and remediations

As of commit 9cc5653 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


IvanYashchuk added a commit to IvanYashchuk/pytorch that referenced this pull request Apr 16, 2021
Using cuSOLVER path with `pytest test/test_ops.py -k 'linalg_qr'
--durations=5` cuts the runtime for these tests by 1 minute locally.
Ref. pytorch#51552

ghstack-source-id: 4f0cbb7
Pull Request resolved: pytorch#56256
@IvanYashchuk IvanYashchuk removed the request for review from ezyang April 16, 2021 10:03
@IvanYashchuk IvanYashchuk added the module: linear algebra Issues related to specialized linear algebra operations in PyTorch; includes matrix multiply matmul label Apr 16, 2021
@IvanYashchuk IvanYashchuk requested a review from mruberry April 16, 2021 10:04
@IvanYashchuk
Collaborator Author

Time spent running `pytest test/test_ops.py -k 'linalg_qr' --durations=5`:
cuSOLVER:

====================================================== slowest 5 durations =======================================================
8.03s call     test/test_ops.py::TestCommonCUDA::test_variant_consistency_jit_linalg_qr_cuda_complex64
2.67s call     test/test_ops.py::TestCommonCUDA::test_variant_consistency_jit_linalg_qr_cuda_float32
2.65s call     test/test_ops.py::TestGradientsCPU::test_fn_gradgrad_linalg_qr_cpu_complex128
1.73s call     test/test_ops.py::TestGradientsCUDA::test_fn_grad_linalg_qr_cuda_complex128
1.37s call     test/test_ops.py::TestOpInfoCUDA::test_duplicate_method_tests_linalg_qr_cuda_float32
================================= 49 passed, 41 skipped, 12294 deselected, 5 warnings in 31.98s ==================================

MAGMA:

====================================================== slowest 5 durations =======================================================
39.57s call     test/test_ops.py::TestGradientsCUDA::test_fn_grad_linalg_qr_cuda_complex128
11.12s call     test/test_ops.py::TestGradientsCUDA::test_fn_grad_linalg_qr_cuda_float64
5.31s call     test/test_ops.py::TestCommonCUDA::test_variant_consistency_jit_linalg_qr_cuda_float32
5.28s call     test/test_ops.py::TestCommonCUDA::test_variant_consistency_jit_linalg_qr_cuda_complex64
2.75s call     test/test_ops.py::TestGradientsCPU::test_fn_gradgrad_linalg_qr_cpu_complex128
============================ 49 passed, 41 skipped, 12294 deselected, 5 warnings in 81.28s (0:01:21) =============================
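For a quick sanity check of the overall difference, the two summary lines above can be compared directly. The following is a minimal sketch that only uses the totals copied from the output above; the helper name `total_seconds` is made up for illustration and is not part of pytest or PyTorch.

```python
import re

# Summary lines copied from the two pytest runs above.
CUSOLVER_SUMMARY = "49 passed, 41 skipped, 12294 deselected, 5 warnings in 31.98s"
MAGMA_SUMMARY = "49 passed, 41 skipped, 12294 deselected, 5 warnings in 81.28s (0:01:21)"


def total_seconds(summary: str) -> float:
    """Extract the wall-clock time in seconds from a pytest summary line."""
    match = re.search(r"in (\d+(?:\.\d+)?)s", summary)
    if match is None:
        raise ValueError(f"no duration found in: {summary!r}")
    return float(match.group(1))


cusolver_s = total_seconds(CUSOLVER_SUMMARY)
magma_s = total_seconds(MAGMA_SUMMARY)

# Absolute saving and relative speedup of the whole test selection.
saved_s = magma_s - cusolver_s
speedup = magma_s / cusolver_s
print(f"saved {saved_s:.2f}s, {speedup:.2f}x faster")
```

Most of the saving comes from the two `TestGradientsCUDA::test_fn_grad_linalg_qr` tests that dominate the MAGMA run.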

Collaborator Author

IvanYashchuk commented Apr 16, 2021

Here is a MAGMA vs. cuSOLVER comparison for non-batched square inputs with modes 'complete', 'reduced', and 'r':

|                          | cuSOLVER, 'complete' | MAGMA, 'complete' | cuSOLVER, 'reduced' | MAGMA, 'reduced' | cuSOLVER, 'r' | MAGMA, 'r' |
|--------------------------|----------------------|-------------------|---------------------|------------------|---------------|------------|
| torch.Size([2, 2])       | 0.084                | 8.0               | 0.0774              | 7.6              | 0.0504        | 3.3        |
| torch.Size([8, 8])       | 0.0877               | 7.6               | 0.0872              | 8.1              | 0.0474        | 3.2        |
| torch.Size([16, 16])     | 0.158                | 7.6               | 0.1569              | 8.3              | 0.1577        | 3.3        |
| torch.Size([32, 32])     | 0.4164               | 7.6               | 0.413               | 8.5              | 0.2835        | 3.3        |
| torch.Size([64, 64])     | 0.9334               | 8.0               | 0.9257              | 8.4              | 0.6559        | 3.3        |
| torch.Size([128, 128])   | 2.0622               | 9.3               | 2.045               | 9.8              | 1.554         | 3.9        |
| torch.Size([256, 256])   | 3.5756               | 12.4              | 3.548               | 12.9             | 2.342         | 5.1        |
| torch.Size([512, 512])   | 8.6611               | 17.4              | 8.593               | 18.7             | 5.797         | 8.3        |
| torch.Size([1024, 1024]) | 23.4609              | 36.9              | 23.342              | 37.4             | 15.196        | 15.6       |
| torch.Size([2048, 2048]) | 92.3197              | 118.7             | 92.247              | 120.1            | 54.483        | 43.9       |
| torch.Size([4096, 4096]) | 497.0645             | 694.1             | 494.418             | 695.7            | 277.952       | 243.5      |
| torch.Size([8192, 8192]) | 3267.1995            | 4603.7            | 3250.727            | 4617.3           | 1713.537      | 1536.7     |

Times are in milliseconds (ms).

MAGMA is faster than cuSOLVER only for large inputs (2048×2048 and above) with mode='r'. In all other cases cuSOLVER is faster.
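To make it easy to see where each backend wins, the table can be scanned programmatically. This is a small sketch over the numbers copied from the table above; the function name `magma_wins` and the row layout are illustrative choices, not anything from the PyTorch code base.

```python
# Each row: (n, cuSOLVER 'complete', MAGMA 'complete',
#            cuSOLVER 'reduced', MAGMA 'reduced',
#            cuSOLVER 'r', MAGMA 'r'), times in ms for an n x n input.
ROWS = [
    (2, 0.084, 8.0, 0.0774, 7.6, 0.0504, 3.3),
    (8, 0.0877, 7.6, 0.0872, 8.1, 0.0474, 3.2),
    (16, 0.158, 7.6, 0.1569, 8.3, 0.1577, 3.3),
    (32, 0.4164, 7.6, 0.413, 8.5, 0.2835, 3.3),
    (64, 0.9334, 8.0, 0.9257, 8.4, 0.6559, 3.3),
    (128, 2.0622, 9.3, 2.045, 9.8, 1.554, 3.9),
    (256, 3.5756, 12.4, 3.548, 12.9, 2.342, 5.1),
    (512, 8.6611, 17.4, 8.593, 18.7, 5.797, 8.3),
    (1024, 23.4609, 36.9, 23.342, 37.4, 15.196, 15.6),
    (2048, 92.3197, 118.7, 92.247, 120.1, 54.483, 43.9),
    (4096, 497.0645, 694.1, 494.418, 695.7, 277.952, 243.5),
    (8192, 3267.1995, 4603.7, 3250.727, 4617.3, 1713.537, 1536.7),
]
MODES = ("complete", "reduced", "r")


def magma_wins(rows):
    """Return, per mode, the matrix sizes for which MAGMA was faster."""
    wins = {mode: [] for mode in MODES}
    for n, *times in rows:
        for i, mode in enumerate(MODES):
            cusolver_ms, magma_ms = times[2 * i], times[2 * i + 1]
            if magma_ms < cusolver_ms:
                wins[mode].append(n)
    return wins


wins = magma_wins(ROWS)
print(wins)
```

For small inputs the gap is largest: at 2×2 with mode 'complete', MAGMA takes 8.0 ms against 0.084 ms for cuSOLVER.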

Using cuSOLVER path with `pytest test/test_ops.py -k 'linalg_qr'
--durations=5` cuts the runtime for these tests by 1 minute locally.
Ref. #51552, #47953

IvanYashchuk added a commit to IvanYashchuk/pytorch that referenced this pull request Apr 16, 2021
ghstack-source-id: 4f5361a
Pull Request resolved: pytorch#56256
}

-std::tuple<Tensor, Tensor> _linalg_qr_helper_cpu(const Tensor& input, std::string mode) {
+std::tuple<Tensor, Tensor> _linalg_qr_helper_default(const Tensor& input, std::string mode) {
Collaborator

Why "default" and not "cpu"?

Collaborator Author

We now have `linalg_qr_helper_magma`, which uses MAGMA for the QR decomposition. It can't be implemented using `geqrf_stub` + `orgqr_stub`, because `orgqr_stub` only supports cuSOLVER for CUDA inputs. In addition, MAGMA doesn't follow the LAPACK API for the `geqrf` and `orgqr` operations that together form the QR decomposition. That's why we need a separate function for MAGMA.

The other function is named `_linalg_qr_helper_default` rather than `_linalg_qr_helper_cpu` because it supports both CPU and CUDA inputs; for CUDA inputs, cuSOLVER and cuBLAS are used.
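The split between the two helpers can be pictured with a small routing sketch. This is only an illustration in Python, not the actual C++ dispatch code: the function `pick_qr_helper` and the `built_with_cusolver` flag are hypothetical, and the assumption that MAGMA is used only when cuSOLVER is unavailable is not stated in this thread.

```python
def pick_qr_helper(device: str, built_with_cusolver: bool) -> str:
    """Return the name of the helper a QR call would be routed to.

    Hypothetical routing: the real selection logic lives in C++ and
    may differ from this sketch.
    """
    if device == "cpu":
        # CPU inputs go through the generic geqrf + orgqr path.
        return "_linalg_qr_helper_default"
    if device == "cuda":
        if built_with_cusolver:
            # Same generic path; the stubs use cuSOLVER and cuBLAS on CUDA.
            return "_linalg_qr_helper_default"
        # Assumed fallback: MAGMA needs its own function because it
        # does not follow the LAPACK geqrf/orgqr API.
        return "linalg_qr_helper_magma"
    raise ValueError(f"unsupported device: {device!r}")


print(pick_qr_helper("cuda", built_with_cusolver=True))
```

The point of the naming is visible here: one helper serves two devices, so a `_cpu` suffix would be misleading.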

Using cuSOLVER path with `pytest test/test_ops.py -k 'linalg_qr'
--durations=5` cuts the runtime for these tests by 1 minute locally. See #56256 (comment).

Performance comparison: #56256 (comment).

Differential Revision: [D27960154](https://our.internmc.facebook.com/intern/diff/D27960154)

IvanYashchuk added a commit that referenced this pull request Apr 26, 2021
IvanYashchuk added a commit that referenced this pull request Apr 26, 2021
IvanYashchuk added a commit to IvanYashchuk/pytorch that referenced this pull request Apr 26, 2021
ghstack-source-id: 2f98cde
Pull Request resolved: pytorch#56256
@mruberry
Collaborator

Time to start landing the second part of this stack!

@xwang233 would you take a look at this PR in the stack?

Collaborator

@xwang233 xwang233 left a comment

The PR is very concise and LGTM.

IvanYashchuk added a commit to IvanYashchuk/pytorch that referenced this pull request Apr 29, 2021
ghstack-source-id: e94b357
Pull Request resolved: pytorch#56256
IvanYashchuk added a commit that referenced this pull request Apr 29, 2021
IvanYashchuk added a commit that referenced this pull request Apr 29, 2021
IvanYashchuk added a commit to IvanYashchuk/pytorch that referenced this pull request Apr 29, 2021
ghstack-source-id: 574f15d
Pull Request resolved: pytorch#56256
Collaborator

@mruberry mruberry left a comment

Stamped

@facebook-github-bot
Contributor

@mruberry merged this pull request in ff59039.

@facebook-github-bot facebook-github-bot deleted the gh/ivanyashchuk/17/head branch May 4, 2021 14:16
krshrimali pushed a commit to krshrimali/pytorch that referenced this pull request May 19, 2021
Summary:
Pull Request resolved: pytorch#56256

Using cuSOLVER path with `pytest test/test_ops.py -k 'linalg_qr'
--durations=5` cuts the runtime for these tests by 1 minute locally. See pytorch#56256 (comment).

Performance comparison: pytorch#56256 (comment).

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D27960154

Pulled By: mruberry

fbshipit-source-id: 5312330d82337dec2856ec5527156a3a547a0b50
Labels

cla signed, Merged, module: linear algebra (Issues related to specialized linear algebra operations in PyTorch; includes matrix multiply matmul), open source

6 participants