
Linear algebra GPU backend tracking issue [magma/cusolver/cublas] #47953

@xwang233

Linear algebra GPU backend tracking issue [MAGMA/cuSOLVER/cuBLAS]

Currently, most GPU linear algebra operators use MAGMA as their backend, with only a few using cuSOLVER/cuBLAS instead. To improve performance, we would like to migrate poorly performing MAGMA operators to cuSOLVER/cuBLAS backends where those perform better.

This issue tracks which linear algebra operators currently do not use MAGMA as their default GPU backend, and also keeps a list of known poorly performing MAGMA operators that could benefit from cuSOLVER/cuBLAS. Feel free to modify this list and link to this issue if you are aware of any such operators.

We welcome contributions that add cuSOLVER/cuBLAS backends for poorly performing MAGMA operators. Please make sure you add benchmarks to your PR, and add heuristics that dispatch the operator to different backends where necessary (see the sketch after the note below).

(This issue doesn't track CPU or other backends.)
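For reference, here is a minimal benchmark sketch using `torch.utils.benchmark`; the operator, shapes, and device are placeholders, so substitute whatever your PR touches:

```python
import torch
import torch.utils.benchmark as benchmark

# Placeholder example: time torch.linalg.cholesky on a batch of SPD matrices.
b, n = 64, 128
a = torch.randn(b, n, n, device="cuda")
a = a @ a.transpose(-2, -1) + n * torch.eye(n, device="cuda")  # make SPD

t = benchmark.Timer(
    stmt="torch.linalg.cholesky(a)",
    globals={"torch": torch, "a": a},
)
print(t.blocked_autorange())  # the Timer handles CUDA synchronization
```

Sweeping `b`, `m`, and `n` and comparing timings before and after the backend change is usually what's needed to justify a dispatch heuristic.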

CUDA version requirement for cuSOLVER/cuBLAS

The cuSOLVER/cuBLAS backends are only enabled when the CUDA version is >= 10.1.243 [#45452]. There is no restriction on GPU architecture.
If your CUDA version is lower than that, everything is dispatched to MAGMA. If MAGMA is not linked into your build, you will get a runtime error when calling these linear algebra operators on the GPU.
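As a rough illustration of this constraint (not an official API), the check can be approximated from Python; note that `torch.version.cuda` reports only major.minor, so the 10.1.243 patch level cannot be distinguished:

```python
import torch

# Approximate sketch of the dispatch constraint above. torch.version.cuda
# reports only "major.minor" (e.g. "10.1"), so builds at exactly 10.1 are
# ambiguous with respect to the 10.1.243 threshold from #45452.
if torch.cuda.is_available() and torch.version.cuda is not None:
    major, minor = (int(x) for x in torch.version.cuda.split(".")[:2])
    if (major, minor) <= (10, 1):
        print("CUDA may be < 10.1.243: GPU linalg ops may dispatch to MAGMA only")
```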

Operators that currently use non-MAGMA backends

For simplicity, we use `b` for batch size and `m`, `n` for matrix dimensions. A two-dimensional tensor is treated as a matrix with batch size 1. Unless noted otherwise, `b == 1` covers both 2-D tensors and >=3-D tensors whose batch dimension equals 1.

Also, most `torch.linalg.x` operators share the same backend as the corresponding `torch.x` linear algebra operator by default.
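A small illustration of both conventions (shapes are arbitrary):

```python
import torch

# b == 1 covers both a plain 2-D matrix and a 3-D tensor with batch dim 1.
a2d = torch.randn(4, 4, device="cuda")     # b == 1, m == n == 4
a3d = a2d.unsqueeze(0)                     # shape (1, 4, 4): still b == 1
ab  = torch.randn(8, 4, 4, device="cuda")  # b == 8

# torch.inverse and torch.linalg.inv share the same GPU backend by default.
assert torch.allclose(torch.inverse(a2d), torch.linalg.inv(a2d))
```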

| operator | cuSOLVER? | MAGMA? | others? | comment |
| --- | --- | --- | --- | --- |
| `torch.inverse`, `torch.linalg.inv_ex` | `b <= 2` | otherwise | | |
| `torch.svd` | always | | | `if (m <= 32 && n <= 32 && b > 1 && (!some \|\| m == n)) gesvdjBatched; else gesvdj;` |
| `torch.cholesky`, `torch.linalg.cholesky_ex` | always | otherwise | | `b > 1` uses cuSOLVER only when CUDA >= 11.3 |
| `torch.cholesky_solve` | `b == 1` | otherwise | | |
| `torch.cholesky_inverse` | `b == 1` | otherwise | | It uses `cholesky_solve` as the backend. |
| `torch.orgqr` | always | | | |
| `torch.ormqr` | always | | | |
| `torch.geqrf` | always | | | `if (n <= 256 && b >= max(2, n / 16)) cublas_batched; else cusolver_looped` |
| `torch.linalg.qr` | always | | | It uses `geqrf` + `orgqr` as the backend. |
| `torch.linalg.eigh` | always | | | |
| `torch.lu_solve` | `(b == 1 && n > 512) \|\| (b > 2 && n <= 128)` | otherwise | | `b` and `n` are the tensor sizes of `LU_data`, i.e. matrix `A`. |
| `torch.lstsq` | always | | | It uses `geqrf`, `ormqr`, and `triangular_solve`. |
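To make the dispatch conditions above concrete, here is a Python transcription of two of the heuristics from the table; the function names are hypothetical, and the real logic lives in ATen's C++ sources:

```python
def geqrf_backend(b: int, n: int) -> str:
    # From the torch.geqrf row: batched cuBLAS for many small matrices,
    # looped cuSOLVER otherwise.
    if n <= 256 and b >= max(2, n // 16):
        return "cublas_batched"
    return "cusolver_looped"

def lu_solve_backend(b: int, n: int) -> str:
    # From the torch.lu_solve row: cuSOLVER for a single large matrix or
    # many small matrices; MAGMA otherwise.
    if (b == 1 and n > 512) or (b > 2 and n <= 128):
        return "cusolver"
    return "magma"

print(geqrf_backend(b=32, n=64))      # cublas_batched
print(lu_solve_backend(b=1, n=1024))  # cusolver
```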

Last updated: fe4ded0, June 29, 2021

PyTorch 1.9 linear algebra development plan

See #47953 (comment)

For details of the MAGMA mechanism

See #47953 (comment)

See also

cc @ezyang @gchanan @zou3519 @bdhirsh @ngimel @vishwakftw @jianyuh @nikitaved @pearu @mruberry @heitorschueroff @walterddr @VitalyFedyunin @ptrblck @IvanYashchuk
