# Linear algebra GPU backend tracking issue [MAGMA/cuSOLVER/cuBLAS]
Currently, most GPU linear algebra operators use MAGMA as their backend, with only a few using cuSOLVER/cuBLAS instead. To improve performance, we would like to migrate poorly performing MAGMA linear algebra operators to cuSOLVER/cuBLAS backends where those perform better.

This issue tracks which linear algebra operators currently do not use MAGMA as their default GPU backend, and also maintains a list of known poorly performing MAGMA operators that could benefit from cuSOLVER/cuBLAS. Feel free to modify this list and link to this issue if you are aware of any such operators.

We welcome contributions that add cuSOLVER/cuBLAS backends for poorly performing MAGMA operators. Please make sure you add benchmarks to your PR, and add heuristics that dispatch the operator to different backends where necessary.

(This issue doesn't track CPU or other backends.)
## CUDA version requirement for cuSOLVER/cuBLAS
The cuSOLVER/cuBLAS backend is only enabled when the CUDA version is >= 10.1.243 (#45452). There is no restriction on GPU architecture.

If your CUDA version is lower than that, everything is dispatched to MAGMA. If MAGMA is not linked into your build, you will get a runtime error when calling these linear algebra operators on GPU.
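The version gate described above can be sketched in plain Python. This is a hypothetical illustration, not actual PyTorch source; the helper names (`parse_cuda_version`, `pick_backend`) are invented for this sketch:

```python
# Sketch of the dispatch gate: cuSOLVER/cuBLAS requires CUDA >= 10.1.243;
# otherwise everything falls back to MAGMA, which errors at runtime if it
# was not linked into the build. Hypothetical helpers, not PyTorch APIs.

MIN_CUSOLVER_CUDA = (10, 1, 243)

def parse_cuda_version(version: str) -> tuple:
    """Parse a dotted CUDA version string like '10.1.243' into an int tuple."""
    return tuple(int(part) for part in version.split("."))

def pick_backend(cuda_version: str, magma_linked: bool) -> str:
    """Choose the linear algebra backend under the rules stated above."""
    if parse_cuda_version(cuda_version) >= MIN_CUSOLVER_CUDA:
        return "cusolver/cublas"
    if magma_linked:
        return "magma"
    raise RuntimeError("no GPU linear algebra backend available")

print(pick_backend("11.3.0", magma_linked=False))   # cusolver/cublas
print(pick_backend("10.1.105", magma_linked=True))  # magma
```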
## Operators that currently use non-MAGMA backends
For simplicity, we use `b` for batch size and `m`, `n` for matrix size. A two-dimensional tensor is considered a matrix with batch size 1. Unless explicitly noted otherwise, `b == 1` cases include both 2d tensors and >=3d tensors whose batch dimension equals 1.

Also, by default most `torch.linalg.x` operators share the same backend as the corresponding `torch.x` linear algebra operator.
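The `b`/`m`/`n` convention can be made concrete with a small illustrative helper (invented for this sketch, not a PyTorch API):

```python
# Illustrative helper for the b/m/n convention: a 2-D tensor of shape
# (m, n) is a batch of one matrix; a >=3-D tensor of shape (..., m, n)
# has batch size equal to the product of its leading dimensions.
from math import prod

def batch_and_matrix_size(shape: tuple) -> tuple:
    """Return (b, m, n) for a tensor shape under the convention above."""
    if len(shape) < 2:
        raise ValueError("linear algebra operators need at least 2 dimensions")
    *batch_dims, m, n = shape
    b = prod(batch_dims)  # prod(()) == 1, so a 2-D tensor has b == 1
    return b, m, n

print(batch_and_matrix_size((5, 5)))        # (1, 5, 5)
print(batch_and_matrix_size((4, 3, 5, 5)))  # (12, 5, 5)
```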
| operator | cusolver? | magma? | others? | comment |
|---|---|---|---|---|
| `torch.inverse`, `torch.linalg.inv_ex` | `b <= 2` | otherwise | | |
| `torch.svd` | always | | | `if (m <= 32 && n <= 32 && b > 1 && ( !some \|\| m == n )) gesvdjBatched; else gesvdj` |
| `torch.cholesky`, `torch.linalg.cholesky_ex` | always | otherwise | | `b > 1` uses cusolver only when cuda >= 11.3 |
| `torch.cholesky_solve` | `b == 1` | otherwise | | |
| `torch.cholesky_inverse` | `b == 1` | otherwise | | It uses `cholesky_solve` as the backend. |
| `torch.orgqr` | always | | | |
| `torch.ormqr` | always | | | |
| `torch.geqrf` | always | | | `if (n <= 256 && b >= max(2, n / 16)) cublas_batched; else cusolver_looped` |
| `torch.linalg.qr` | always | | | It uses `geqrf` + `orgqr` as the backend. |
| `torch.linalg.eigh` | always | | | |
| `torch.lu_solve` | `(b == 1 && n > 512) \|\| (b > 2 && n <= 128)` | otherwise | | `b` and `n` are tensor sizes of `LU_data` or matrix "A". |
| `torch.lstsq` | always | | | It uses `geqrf`, `ormqr`, and `triangular_solve`. |
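Two of the size-based heuristics from the table can be sketched in plain Python. These are hypothetical illustrations of the dispatch conditions (the function names are invented, and the real dispatch lives in PyTorch's C++ sources):

```python
# Sketches of dispatch heuristics from the table above, using b, m, n
# as defined earlier. Not actual PyTorch source.

def inverse_backend(b: int) -> str:
    """torch.inverse / torch.linalg.inv_ex: cusolver for b <= 2, magma otherwise."""
    return "cusolver" if b <= 2 else "magma"

def lu_solve_backend(b: int, n: int) -> str:
    """torch.lu_solve: cusolver for (b == 1 && n > 512) || (b > 2 && n <= 128)."""
    return "cusolver" if (b == 1 and n > 512) or (b > 2 and n <= 128) else "magma"

print(inverse_backend(2))         # cusolver
print(lu_solve_backend(1, 1024))  # cusolver
print(lu_solve_backend(8, 512))   # magma
```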
*Last updated: fe4ded0, June 29th, 2021*
## PyTorch 1.9 linear algebra development plan
See #47953 (comment)
## Detailed MAGMA mechanism
See #47953 (comment)
## See also
- torch.linalg in PyTorch 1.10 tracker #42666
- Batched MAGMA calls illegally read CUDA memory #26996
- Add cusolver to build, rewrite MAGMA inverse with cusolver #42403
- Linear algebra GPU library function bug tracking issue [magma/cusolver/cublas] #53879
cc @ezyang @gchanan @zou3519 @bdhirsh @ngimel @vishwakftw @jianyuh @nikitaved @pearu @mruberry @heitorschueroff @walterddr @VitalyFedyunin @ptrblck @IvanYashchuk