
Conversation

@pearu pearu commented Oct 24, 2019

This PR implements support for generalized LU factorization that is required for various algorithms such as PCA (see issue #8049).

@pearu pearu self-assigned this Oct 24, 2019
@pearu pearu force-pushed the pearu/generalized-LU branch from 4b5bb4e to f5d59f6 Compare October 27, 2019 15:58
@pearu pearu changed the title Generalized LU factorization (WIP) Generalized LU factorization Oct 27, 2019
vishwakftw previously approved these changes Oct 27, 2019

@vishwakftw vishwakftw left a comment

I've been lurking around this PR, waiting for it to be completed, and it finally is! Thank you!!

I just have some comments before this lands.

@pearu pearu requested a review from vishwakftw October 28, 2019 08:50

ezyang commented Oct 28, 2019

fbgemm submodule update doesn't look intentional

@pearu pearu force-pushed the pearu/generalized-LU branch 4 times, most recently from 4703bf3 to b71a29d Compare October 28, 2019 15:48

pearu commented Oct 28, 2019

@ezyang, the fbgemm submodule update issue is now fixed.

@pearu pearu changed the title Generalized LU factorization Generalized LU factorization [WIP] Oct 29, 2019

ezyang commented Oct 29, 2019

What still needs to be done on this PR? It's titled WIP


pearu commented Oct 29, 2019

@ezyang, the PR is complete as it is; however, there is a MAGMA bug, see
#28608 (comment)
that appears only for small square singular matrices on CUDA devices. @vishwakftw had concerns that users might interpret the resulting NaNs as errors of their own and, as I understood, suggested holding back this PR. I believe the issue should be fixed in MAGMA and that, until then, the MAGMA bug should be documented in the lu documentation. If that would be satisfactory, I can update the PR and remove WIP asap.

@pearu pearu changed the title Generalized LU factorization [WIP] Generalized LU factorization Oct 29, 2019
@vishwakftw

@pearu, I think the PR can land as is, but it should raise an error when the matrices are singular, i.e., preserving the older behavior. Perhaps when the MAGMA bug is fixed, we can enable accepting singular matrices. Do you think this is reasonable?


pearu commented Oct 29, 2019

@vishwakftw, let me sum up the situation:
The old behavior is:

  1. LU factorization accepts only square matrices
  2. LU factorization raises an error when the input matrix is singular

This PR changes the old behavior as follows:
A. LU factorization accepts square and non-square matrices
B. LU factorization allows singular square matrices
C. LU factorization allows singular non-square matrices

The MAGMA bug is effective only when all of the following conditions are satisfied at the same time:
(i) A's data is stored on a CUDA device
(ii) len(A.shape) > 2, that is, the input A contains batches of matrices
(iii) m == n and m <= 32, that is, the batches contain small square matrices
(iv) pivot == True

To preserve backward-compatible behavior, avoid the MAGMA bug, and support non-square LU factorizations in full, it would be sufficient to enable the following features from this PR:

  • A, B, C if storage is in CPU RAM
  • A, B, C if storage is in CUDA memory and len(A.shape) == 2
  • A, C if storage is in CUDA memory and len(A.shape) > 2; B is disabled (that is, singular square matrices used in batches cause an exception) until a fix for the MAGMA bug becomes available.

In essence, the current NaN results will be replaced with an exception until a fix to the MAGMA bug makes the NaN results impossible.

Notice that I tend to insist that singular inputs should be allowed as much as possible, because for linear algebra algorithms such as truncated SVD or PCA, not allowing singular inputs to LU would be a big blocker. In my opinion, the MAGMA bug affects only a small subset of possible applications, and it would not be reasonable to hold back the development of new algorithms for which there is high demand.

Btw, there is also another approach: instead of using magmaLuBatched, use the same approach as in the case of CPU storage. That would reduce performance in certain (but possibly important) cases but would enable A, B, and C in full.
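
For concreteness, here is a minimal sketch of how the A/B/C behaviors above could be exercised from Python, assuming the Tensor.lu interface that this PR generalizes; the shapes and values are illustrative only, not taken from the PR's tests:

```python
import torch

# Behaviors A and C: batches of non-square (possibly singular) matrices.
A = torch.randn(2, 3, 5)
LU, pivots = A.lu(pivot=True)   # expected: LU is (2, 3, 5), pivots is (2, 3), i.e. min(m, n) pivots per matrix

# Behavior B: batches of singular square matrices; with get_infos=True a nonzero
# entry in infos flags the batch element where a zero pivot was encountered.
S = torch.zeros(2, 4, 4)
LU, pivots, infos = S.lu(pivot=True, get_infos=True)
print(infos)
```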


ezyang commented Oct 30, 2019

This approach SGTM!


pearu commented Oct 30, 2019

> This approach SGTM!

@ezyang Just to be sure, which approach are you referring to?

  1. Raise an exception if a singular square matrix is used in batches. (A, C, partial B)
  2. Discard using magmaLuBatched and use the CPU algorithm. (A, B, C, all in full)


vishwakftw commented Oct 30, 2019

@pearu maybe there is another way to resolve it, and this won't affect runtime for non-singular matrices.

I propose the following:

  • Perform batched LU using magmaLuBatched.
  • Check whether there are any singular matrices.
  • If yes, replace the NaNs with zero and fix the pivot tensor to be arange(1, n + 1).


pearu commented Oct 30, 2019

I like the proposal, @vishwakftw, as it would be a full solution.
However, I am not sure that replacing NaN with 0 will always give the correct LU factorization.
It might be safer to recompute the LU factorization using magmaLu for the singular matrices.

Re pivot == arange(1, n + 1): that might not be correct in general. In the case of recomputing the LU, it would also be unnecessary.
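
To make the concern about arange(1, n + 1) concrete, here is a small counterexample of my own (not from the PR's test suite): a singular matrix can still require a row interchange under partial pivoting.

```python
import torch

# Batch of one singular 2x2 matrix whose first column forces a row swap.
a = torch.tensor([[[0., 0.],
                   [1., 0.]]])
LU, pivots, infos = a.lu(pivot=True, get_infos=True)
# Partial pivoting must pick row 2 for the first column (its only nonzero entry),
# so the correct pivots start with 2 and cannot equal arange(1, n + 1) == [1, 2].
```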

@vishwakftw

@pearu we could also do that: identify singular matrices using infos and perform the LU decomposition for them separately. This would solve the pivot issue too.
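
A rough Python-level sketch of that idea (the actual change lives in the ATen/CUDA code; the helper name and the use of Tensor.lu here are mine, for illustration only):

```python
import torch

def lu_with_singular_fallback(a):
    """a: CUDA tensor of shape (batch, n, n); hypothetical helper, not the PR's API."""
    # Batched LU first (the magmaLuBatched path); infos reports a zero pivot per batch element.
    lu, pivots, infos = a.lu(pivot=True, get_infos=True)
    # Recompute only the matrices flagged as singular, one at a time (the non-batched path),
    # so non-singular inputs pay essentially no extra cost.
    for i in (infos > 0).nonzero().flatten().tolist():
        lu_i, piv_i = a[i:i + 1].lu(pivot=True)
        lu[i] = lu_i[0]
        pivots[i] = piv_i[0]
    return lu, pivots, infos
```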

@vishwakftw vishwakftw dismissed their stale review October 31, 2019 14:47

Need to fix the singular case on CUDA


pearu commented Oct 31, 2019

Yes, exactly, I am currently working on this approach.


kostmo commented Oct 31, 2019

CircleCI build failures summary

As of commit 4a72b8b:

  • 0/2 flaky
  • 2/2 failures introduced in this PR

This comment was automatically generated by Dr. CI.

@pearu pearu requested a review from vishwakftw October 31, 2019 20:39
@pearu pearu requested a review from vishwakftw November 1, 2019 16:36

@vishwakftw vishwakftw left a comment


I think the changes look good to me.

Would it be possible for you to check if there is a perf regression for non-singular matrices?


pearu commented Nov 1, 2019

Theoretically, there should not be, as the loop over the infos tensor in batchCheckErrors was replaced with another loop over the infos tensor that checks for singularity.
However, I noticed that batchCheckErrors copies infos to host RAM before running the loop, while the loop for the singularity check does not. This implies a performance penalty, as the infos tensor is accessed item-wise from CUDA memory...
I'll fix it and run some performance tests as well.
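
For illustration, the difference amounts to the following pattern (a Python-level sketch, not the actual ATen code):

```python
import torch

# Suppose infos lives on the GPU, as after a batched LU call.
infos = torch.zeros(10000, dtype=torch.int32, device='cuda')

# Item-wise access on the CUDA tensor: one synchronizing device-to-host copy per element.
slow = [i for i in range(infos.numel()) if int(infos[i]) > 0]

# Copy to the host once, then loop: a single device-to-host transfer.
infos_cpu = infos.cpu()
fast = [i for i, v in enumerate(infos_cpu.tolist()) if v > 0]
```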


pearu commented Nov 1, 2019

For the cases where MAGMA issue 13 is effective, there is some performance regression related to the overhead of scanning the infos tensor for positive values. For instance, consider an LU factorization of identity matrices in batches:

```python
import torch

N, n = 10000, 32   # number of batches and matrix size; varied per the table below
a = torch.zeros((N, n, n)); a[:] = torch.eye(n); a_cuda = a.cuda()
# timing measured with: %timeit a_cuda.lu(pivot=True)
```

We have the following timing results (best of three runs):

| N, n      | pytorch master   | this PR          |
|-----------|------------------|------------------|
| 10, 2     | 108 µs ± 3.7 µs  | 120 µs ± 425 ns  |
| 10000, 2  | 154 µs ± 200 ns  | 183 µs ± 4.17 µs |
| 10000, 32 | 567 µs ± 2.48 µs | 592 µs ± 445 ns  |
| 10000, 33 | 2.39 ms ± 1.8 µs | 2.39 ms ± 732 ns |

So, the additional overhead is about 12-30 µs per lu call iff MAGMA issue 13 is effective.
Otherwise (see the last row), there is no regression in performance.

Btw, the timings were identical when using a = torch.randn((N, n, n)).

@vishwakftw

I guess this is fine, thank you for providing the benchmarks, @pearu.

@vishwakftw

@pytorchbot rebase this please


pearu commented Nov 4, 2019

@vishwakftw @ezyang, is rebasing this PR stuck somewhere?

@vishwakftw

Semi-automatic rebase is not working; could you please try to rebase manually?

@pearu pearu force-pushed the pearu/generalized-LU branch from e40ccca to 7c7bd66 Compare November 4, 2019 20:32

pearu commented Nov 4, 2019

@pytorchbot rebase this please

@pearu pearu force-pushed the pearu/generalized-LU branch from 7c7bd66 to a277f54 Compare November 4, 2019 21:02

pearu commented Nov 5, 2019

@vishwakftw, rebase is done.

@vishwakftw

Failures look spurious. @pytorchbot merge this please.

@pytorchbot pytorchbot added the merge-this-please Was marked for merge with @pytorchbot merge this please label Nov 5, 2019

@facebook-github-bot facebook-github-bot left a comment


@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot

@ezyang merged this pull request in fd4f22e.

@pearu pearu deleted the pearu/generalized-LU branch November 5, 2019 21:43
zdevito pushed a commit to zdevito/ATen that referenced this pull request Nov 5, 2019
Summary:
This PR implements support for generalized LU factorization that is required for various algorithms such as PCA (see issue pytorch/pytorch#8049).
Pull Request resolved: pytorch/pytorch#28608

Differential Revision: D18326449

Pulled By: ezyang

fbshipit-source-id: d4011d75710e06e87ddbf5ad9afae42ba3330548