
Conversation

@IvanYashchuk
Collaborator

@IvanYashchuk IvanYashchuk commented Apr 16, 2021

Stack from ghstack:

Differential Revision: D27960152

@facebook-github-bot
Contributor

facebook-github-bot commented Apr 16, 2021

💊 CI failures summary and remediations

As of commit 5011c45 (more details on the Dr. CI page):


  • 2/2 failures possibly* introduced in this PR
    • 1/2 non-scanned failure(s)

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_build (1/1)

Step: "Build"

Apr 26 11:29:21 ERROR 2021-04-26T10:42:11Z: sccache::server: Compilation failed: Output { status: ExitStatus(ExitStatus(256)), stdout: "", stderr: "conftest.c: In function \'main\':\nconftest.c:332:2: error: \'struct sockaddr\' has no member named \'sa_len\'\n x.sa_len = 0;\n  ^\n" }
Apr 26 11:29:21 
Apr 26 11:29:21 ERROR 2021-04-26T10:42:14Z: sccache::server: Compilation failed: Output { status: ExitStatus(ExitStatus(256)), stdout: "", stderr: "conftest.c: In function \'main\':\nconftest.c:366:10: error: \'RTLD_MEMBER\' undeclared (first use in this function); did you mean \'RTLD_NEXT\'?\n   (void) RTLD_MEMBER;\n          ^~~~~~~~~~~\n          RTLD_NEXT\nconftest.c:366:10: note: each undeclared identifier is reported only once for each function it appears in\n" }
Apr 26 11:29:21 
Apr 26 11:29:21 ERROR 2021-04-26T10:42:15Z: sccache::server: Compilation failed: Output { status: ExitStatus(ExitStatus(256)), stdout: "", stderr: "conftest.c:361:9: error: unknown type name \'not\'\n         not a universal capable compiler\n         ^~~\nconftest.c:361:15: error: expected \'=\', \',\', \';\', \'asm\' or \'__attribute__\' before \'universal\'\n         not a universal capable compiler\n               ^~~~~~~~~\nconftest.c:361:15: error: unknown type name \'universal\'\n" }
Apr 26 11:29:21 
Apr 26 11:29:21 ERROR 2021-04-26T10:42:15Z: sccache::server: Compilation failed: Output { status: ExitStatus(ExitStatus(256)), stdout: "", stderr: "conftest.c: In function \'main\':\nconftest.c:367:4: error: unknown type name \'not\'; did you mean \'ino_t\'?\n    not big endian\n    ^~~\n    ino_t\nconftest.c:367:12: error: expected \'=\', \',\', \';\', \'asm\' or \'__attribute__\' before \'endian\'\n    not big endian\n            ^~~~~~\n" }
Apr 26 11:29:21 
Apr 26 11:29:21 ERROR 2021-04-26T10:42:16Z: sccache::server: Compilation failed: Output { status: ExitStatus(ExitStatus(256)), stdout: "", stderr: "conftest.c: In function \'main\':\nconftest.c:378:4: error: \'struct stat\' has no member named \'st_mtimespec\'; did you mean \'st_mtim\'?\n st.st_mtimespec.tv_nsec = 1;\n    ^~~~~~~~~~~~\n    st_mtim\n" }
Apr 26 11:29:21 
Apr 26 11:29:21 ERROR 2021-04-26T10:42:18Z: sccache::server: Compilation failed: Output { status: ExitStatus(ExitStatus(256)), stdout: "", stderr: "conftest.c: In function \'main\':\nconftest.c:402:24: error: expected expression before \')\' token\n if (sizeof ((socklen_t)))\n                        ^\n" }
Apr 26 11:29:21 
Apr 26 11:29:21 ERROR 2021-04-26T11:29:14Z: sccache::server: Compilation failed: Output { status: ExitStatus(ExitStatus(256)), stdout: "", stderr: "/var/lib/jenkins/workspace/test/cpp/api/transformer.cpp: In function \'void transformer_decoder_test_helper(bool)\':\n/var/lib/jenkins/workspace/test/cpp/api/transformer.cpp:609:6: internal compiler error: in equal_mem_array_ref_p, at tree-ssa-scopedtables.c:429\n void transformer_decoder_test_helper(bool is_cuda) {\n      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nPlease submit a full bug report,\nwith preprocessed source if appropriate.\nSee <file:///usr/share/doc/gcc-7/README.Bugs> for instructions.\n" }
Apr 26 11:29:21 
Apr 26 11:29:21 =========== If your build fails, please take a look at the log above for possible reasons ===========
Apr 26 11:29:21 Compile requests                   10260
Apr 26 11:29:21 Compile requests executed           6000
Apr 26 11:29:21 Cache hits                          5632
Apr 26 11:29:21 Cache hits (C/C++)                  5153
Apr 26 11:29:21 Cache hits (CUDA)                    479
Apr 26 11:29:21 Cache misses                         299

ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI.

IvanYashchuk added a commit to IvanYashchuk/pytorch that referenced this pull request Apr 16, 2021
ghstack-source-id: 28c2115
Pull Request resolved: pytorch#56252
Collaborator

@xwang233 xwang233 left a comment


Thanks for the PR! This overall looks good. I have left some comments.

params,
m,
n,
CUDA_R_32F,
Collaborator

Would it be beneficial to get rid of the template specialization for the 64-bit API with something like this?

#ifdef USE_CUSOLVER_64_BIT
cusolverDnParams_t params;
cudaDataType datatype = at::cuda::solver::get_cusolver_datatype<scalar_t>();
TORCH_CUSOLVER_CHECK(cusolverDnCreateParams(&params));
for (int64_t i = 0; i < batch_size; i++) {
  at::cuda::solver::xpotrs(
      handle, params, uplo, n, nrhs, datatype,
      A_ptr + i * A_matrix_stride,
      lda, datatype,
      self_working_copy_ptr + i * self_matrix_stride,
      ldb,
      infos_ptr);
}
#endif

Collaborator Author

Currently it doesn't bring any value, and it's a matter of taste. I think we can continue using templates here, since xgeqrf supports only the same combinations of data types as the original geqrf.

IvanYashchuk added a commit to IvanYashchuk/pytorch that referenced this pull request Apr 19, 2021
ghstack-source-id: 6da75be
Pull Request resolved: pytorch#56252
IvanYashchuk added a commit to IvanYashchuk/pytorch that referenced this pull request Apr 19, 2021
ghstack-source-id: ebe22ac
Pull Request resolved: pytorch#56252
IvanYashchuk added a commit to IvanYashchuk/pytorch that referenced this pull request Apr 20, 2021
ghstack-source-id: 01ccae0
Pull Request resolved: pytorch#56252
Collaborator

@mruberry mruberry left a comment


Thanks for taking a look, @xwang233!

IvanYashchuk added a commit to IvanYashchuk/pytorch that referenced this pull request Apr 26, 2021
ghstack-source-id: 9d9edd8
Pull Request resolved: pytorch#56252
@facebook-github-bot
Contributor

@mruberry merged this pull request in 27a8ece.

Comment on lines +2048 to +2054
void geqrf_kernel(const Tensor& input, const Tensor& tau, int64_t m, int64_t n) {
#if defined(USE_CUSOLVER)
  return geqrf_cusolver(input, tau, m, n);
#else
  return geqrf_magma(input, tau, m, n);
#endif
}
Collaborator

Hi @IvanYashchuk, I forgot to ask for a benchmark table for cuSOLVER vs. MAGMA. I see that matrices of all shapes are dispatched to the cuSOLVER path in this heuristic. Are there any performance complaints about cusolverDnXgeqrf and cusolverDn<T>geqrf?

Collaborator Author

torch.linalg.qr with mode='r' basically just calls geqrf followed by triu_. Here are the results for that: #56256 (comment). They show that MAGMA is a bit faster for large sizes.

Here are the results comparing geqrf_cusolver and geqrf_magma. For large sizes the MAGMA variant is faster, but we still use cuSOLVER here unconditionally, since we aim to remove all uses of single-input MAGMA functions: they create and destroy CUDA streams internally.

|                               | cuSOLVER | MAGMA  |
|-------------------------------|----------|--------|
| torch.Size([2, 2])            | 0.049    | 5.3    |
| torch.Size([2, 2, 2])         | 0.034    | 10.2   |
| torch.Size([32, 2, 2])        | 0.417    | 189.5  |
| torch.Size([64, 2, 2])        | 0.840    | 321.8  |
| torch.Size([128, 2, 2])       | 1.6      | 632.9  |
| torch.Size([8, 8])            | 0.062    | 6.1    |
| torch.Size([2, 8, 8])         | 0.122    | 12.4   |
| torch.Size([32, 8, 8])        | 1.8      | 157.6  |
| torch.Size([64, 8, 8])        | 3.7      | 319.0  |
| torch.Size([128, 8, 8])       | 7.5      | 724.8  |
| torch.Size([16, 16])          | 0.125    | 6.7    |
| torch.Size([2, 16, 16])       | 0.247    | 12.7   |
| torch.Size([32, 16, 16])      | 3.9      | 152.8  |
| torch.Size([64, 16, 16])      | 7.8      | 312.1  |
| torch.Size([128, 16, 16])     | 15.6     | 661.9  |
| torch.Size([32, 32])          | 0.256    | 5.7    |
| torch.Size([2, 32, 32])       | 5.1      | 10.1   |
| torch.Size([32, 32, 32])      | 8.1      | 250.8  |
| torch.Size([64, 32, 32])      | 16.2     | 376.7  |
| torch.Size([128, 32, 32])     | 32.5     | 682.1  |
| torch.Size([64, 64])          | 0.658    | 5.7    |
| torch.Size([2, 64, 64])       | 1.3      | 9.3    |
| torch.Size([32, 64, 64])      | 20.9     | 211.8  |
| torch.Size([64, 64, 64])      | 41.8     | 312.9  |
| torch.Size([128, 64, 64])     | 83.7     | 556.3  |
| torch.Size([128, 128])        | 1.5      | 5.2    |
| torch.Size([2, 128, 128])     | 3.1      | 11.6   |
| torch.Size([32, 128, 128])    | 49.8     | 208.4  |
| torch.Size([64, 128, 128])    | 99.8     | 361.6  |
| torch.Size([128, 128, 128])   | 199.6    | 903.5  |
| torch.Size([256, 256])        | 2.3      | 9.7    |
| torch.Size([2, 256, 256])     | 4.6      | 14.7   |
| torch.Size([32, 256, 256])    | 75.9     | 228.9  |
| torch.Size([64, 256, 256])    | 152.0    | 419.8  |
| torch.Size([128, 256, 256])   | 303.9    | 846.4  |
| torch.Size([512, 512])        | 5.8      | 9.8    |
| torch.Size([2, 512, 512])     | 11.727   | 17.9   |
| torch.Size([32, 512, 512])    | 187.4    | 285.0  |
| torch.Size([64, 512, 512])    | 374.8    | 594.5  |
| torch.Size([128, 512, 512])   | 749.3    | 1263.3 |
| torch.Size([1024, 1024])      | 15.3     | 16.3   |
| torch.Size([2, 1024, 1024])   | 30.6     | 32.7   |
| torch.Size([32, 1024, 1024])  | 490.8    | 527.6  |
| torch.Size([64, 1024, 1024])  | 985.4    | 1022.6 |
| torch.Size([128, 1024, 1024]) | 1978.6   | 2026.9 |
|                               |          |        |
| torch.Size([512, 512])        | 8.0      | 11.9   |
| torch.Size([1024, 1024])      | 15.1     | 22.5   |
| torch.Size([2048, 2048])      | 54.9     | 54.9   |
| torch.Size([4096, 4096])      | 276.4    | 265.8  |
| torch.Size([8192, 8192])      | 1712.3   | 1555.8 |
Times are in milliseconds (ms).

@facebook-github-bot facebook-github-bot deleted the gh/ivanyashchuk/13/head branch April 30, 2021 14:16
crcrpar pushed a commit to crcrpar/pytorch that referenced this pull request May 7, 2021
Summary: Pull Request resolved: pytorch#56252

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D27960152

Pulled By: mruberry

fbshipit-source-id: 0510a302aab50623d7490efaba0133f740cd57c3
krshrimali pushed a commit to krshrimali/pytorch that referenced this pull request May 19, 2021
Summary: Pull Request resolved: pytorch#56252

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D27960152

Pulled By: mruberry

fbshipit-source-id: 0510a302aab50623d7490efaba0133f740cd57c3