CPU-Strided-Complex Support for reduce ops and linpack ops #27653
Conversation
I have added complex number support for linear algebra ops and reduce ops. There are a lot of CI failures after pulling changes from master; I am mainly seeing JIT failures. I don't mind waiting on this PR if you need to fix some other problems. Dylan
@pytorchbot rebase this please

Windows failure is real:
auto m = self.size(-2);
auto n = self.size(-1);
auto k = std::min(m, n);
int64_t lrwork = jobz == 'N' ? 5*k : k*std::max(5*k+7,2*std::max(m,n)+2*k+1);
Where did this formula come from?
Sorry for the confusion. I meant to take these settings from the LAPACK docs, but I think I copied the wrong one while searching other sites for the reasoning behind it. The code has been updated with the setting from LAPACK 3.8.
The rwork setting in LAPACK is the minimum; increasing it by 2x, 10x, and 100x did not impact performance in any measurable way.
scalar_t wkopt;
lapackGeqrf<scalar_t>(m, n, self_data, m, tau_data, &wkopt, lwork, &info);
lwork = static_cast<int>(wkopt);
lwork = static_cast<int>(zabs<scalar_t, value_t>(wkopt));
I thought about it for a while and couldn't figure it out. Why is the zabs call here necessary?
wkopt cannot be cast directly to the lwork data type without converting to a real number. wkopt is a positive real number stored in a std::complex variable, so zabs() and real_impl() should both do the job, but real_impl() is more performant.
I have updated the template utility functions in zmath.h to support:
- complex return type: when performing tensor operations that maintain a fixed dtype.
- real return type: when performing low-level C++ operations.
All of these functions are no-ops for other data types; a rough sketch of the pattern follows.
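To make the two return-type flavours concrete, here is a minimal, hypothetical sketch of the pattern. It is not the actual zmath.h code; the names zabs/real_impl follow the discussion above, but the bodies and specializations are assumptions.

```cpp
// Illustrative sketch only -- not the real zmath.h contents.
#include <complex>

// "Complex return type": keeps the input dtype, so a complex tensor stays complex.
template <typename TYPE, typename VALUE_TYPE>
inline TYPE zabs(TYPE z) {
  return z;  // no-op for real (non-complex) scalar types
}
template <>
inline std::complex<double> zabs<std::complex<double>, double>(std::complex<double> z) {
  return std::complex<double>(std::abs(z), 0.0);
}

// "Real return type": used for low-level C++, e.g. casting LAPACK's
// work-size query result (wkopt) to an integer lwork.
template <typename TYPE, typename VALUE_TYPE>
inline VALUE_TYPE real_impl(TYPE z) {
  return z;  // no-op for real scalar types
}
template <>
inline double real_impl<std::complex<double>, double>(std::complex<double> z) {
  return z.real();
}

int main() {
  std::complex<double> wkopt(67.0, 0.0);  // pretend LAPACK returned this
  int lwork = static_cast<int>(real_impl<std::complex<double>, double>(wkopt));
  return lwork == 67 ? 0 : 1;
}
```

The real-returning flavour is what lets the `lwork = static_cast<int>(...)` line in the geqrf excerpt above compile for both real and complex scalar types.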
I have fixed the Windows CI failure by name mangling a couple of variable names called "I".
} \
}()

#define AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND2(SCALARTYPE1, SCALARTYPE2, TYPE, NAME, ...) \
@gchanan can you remind me what the rules are for adding dispatch macros here?
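For context, the AT_DISPATCH family of macros expands to a switch over the tensor's runtime dtype and binds scalar_t inside the lambda. Below is a rough usage sketch of the new macro; it assumes an ATen build, and my_complex_op plus the kernel body are hypothetical, not something from this PR.

```cpp
#include <ATen/ATen.h>
#include <ATen/Dispatch.h>

// Hypothetical op used only to illustrate the dispatch macro; the two extra
// ScalarTypes passed in (Half, Bool) are examples, not what the PR uses.
void my_complex_op(const at::Tensor& self) {
  AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND2(
      at::ScalarType::Half, at::ScalarType::Bool,
      self.scalar_type(), "my_complex_op", [&] {
        // scalar_t is the C++ type matching self's dtype here, including
        // the complex types covered by this macro.
        scalar_t zero = static_cast<scalar_t>(0);
        (void)zero;  // a real kernel would iterate over the tensor here
      });
}
```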
facebook-github-bot
left a comment
@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Are you able to merge this PR? Thanks.
auto mn = std::min(m, n);
auto mx = std::max(m, n);
int64_t lrwork; // These settings are based on LAPACK 3.8.
if (jobz == 'N'){
  lrwork = 5*mn;
}else if (mx > 10*mn){
  lrwork = 5*mn*mn + 5*mn;
} else {
  lrwork = std::max(5*mn*mn + 5*mn, 2*mx*mn + 2*mn*mn + mn);
}
Tensor rwork = at::empty({std::max(int64_t(1), lrwork)}, at::kInt);
auto rwork_data = rwork.data_ptr<int>();
Tensor iwork = at::empty({8*mn}, at::kInt);
Could you please include spaces before and after *?
Couple more comments:
- For LAPACK 3.6, `lrwork` needs to be at least `7 * mn`. Could we do that so that users with older LAPACK versions are not at a disadvantage?
- Also, in the case `mx > 10 * mn`: how did you obtain 10?
- I have made the changes and updated the comment to say it supports LAPACK 3.6.
- I had to make the assumption that `mx >> mn` implies `mx > 10 * mn`. I don't support `torch.rand()` with complex numbers, so I don't really have a good way to test this right now.

Once legacy::cpu::_th_random_ is ported from TH I should be able to test this better. Also, rwork only impacts complex number svd calls. A rough sketch of the adjusted workspace-size computation is below.
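As referenced above, here is a sketch of the adjusted workspace size written as a standalone helper. gesdd_lrwork is a hypothetical name, and the 7 * mn lower bound is the LAPACK 3.6 figure quoted by the reviewer, so treat the exact constants as assumptions rather than the PR's final code.

```cpp
#include <algorithm>
#include <cstdint>

// mn = min(m, n), mx = max(m, n), as in the excerpt above.
int64_t gesdd_lrwork(char jobz, int64_t mn, int64_t mx) {
  if (jobz == 'N') {
    return 7 * mn;  // covers both the 5*mn (LAPACK 3.8) and 7*mn (LAPACK 3.6) minimums
  }
  if (mx > 10 * mn) {  // the "mx >> mn" regime discussed above
    return 5 * mn * mn + 5 * mn;
  }
  return std::max(5 * mn * mn + 5 * mn, 2 * mx * mn + 2 * mn * mn + mn);
}
```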
Just one more concern: if rwork is required only for complex-valued matrices / tensors, why allocate it in the general case?
Can't we special-case it as follows?
Tensor rwork;
int* rwork_data = nullptr;
if (at::isComplex(self)) {
  rwork = at::empty(.., at::kInt);
  rwork_data = rwork.data_ptr<int>();
}
Good point! I will update the code.
Ok, I've made the changes. Have a good weekend.
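For illustration, here is a sketch of what the special-cased allocation could look like, in the spirit of the suggestion above; it is not the PR's exact code. Note that rwork for LAPACK's complex ?gesdd routines is a real-valued workspace, so kDouble is assumed here even though the earlier excerpts show kInt, and at::isComplexType plus the surrounding variables (self, lrwork) are taken from the excerpts above.

```cpp
// Sketch only: allocate rwork just for complex inputs, as suggested above.
at::Tensor rwork;
double* rwork_data = nullptr;
if (at::isComplexType(self.scalar_type())) {
  // Real-valued workspace for complex ?gesdd; sized per the lrwork discussion.
  rwork = at::empty({std::max(int64_t(1), lrwork)},
                    self.options().dtype(at::kDouble));
  rwork_data = rwork.data_ptr<double>();
}
```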
Thank you @dylanbespalko, you have a good one too.
Sorry, it got stuck in land limbo. Do you want to handle Vishwak's comments?
Vishwak's concerns are valid; however, these values only impact the complex number SVD functions, since rwork is only used for complex inputs. Dylan
@dylanbespalko thanks for your work; would it be possible for you to address the formatting changes? Also, this needs a rebase.
Yes, I can address these issues. I guess there are merge conflicts anyway. I'm busy with other stuff today, but I can have this fixed tomorrow.
facebook-github-bot
left a comment
@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Can you merge these changes?
facebook-github-bot
left a comment
@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
I pulled changes from master and re-ran the CI. I don't think these failures are from my code. The previous CI run was passing everything except pr/pytorch-win-ws2016-cuda9-cudnn7-py3; that machine is now timing out after 2+ hours.
Summary:
In-tree changes to pytorch to support complex numbers are being submitted here.
Out-of-tree support for complex numbers is here: [pytorch-cpu-strided-complex extension](https://gitlab.com/pytorch-complex/pytorch-cpu-strided-complex)

Changes so far:
- [x] Renamed references to variable "I" that may be confused with "I" defined in complex.h. I did this to avoid crazy CI failure messages, as complex.h is included by more source files.
  - aten/src/ATen/native/cpu/Loops.h (Renamed I to INDEX)
  - aten/src/ATen/native/cuda/Loops.cuh (Renamed I to INDEX)
  - aten/src/ATen/core/ivalue_inl.h (Renamed I to INDEX)
  - c10/util/Array.h (Renamed I to INDEX)
  - c10/util/C++17.h (Renamed I to INDEX)
  - c10/util/Metaprogramming.h (Renamed I to INDEX)
  - c10/util/SmallVector.h (custom renaming)
- [x] Added complex support for Linear Algebra Ops.
  - SVD needed to be modified to support mixed data types, e.g. U(std::complex<double>), S(double), V(std::complex<double>).
  - See before and after benchmark below (no observable change in performance).
- [x] Added complex support for Reduce Ops.
  - var/std computations could have been faster if it were possible to interpret a std::complex<double> Tensor as a double Tensor.
- [x] Added complex derivative support for autograd functionality.
  - Derivatives are the same as defined by the numpy autograd library for real(), imag(), conj(), angle(). These functions only affect complex numbers.
  - The derivative of abs() has not been modified, to avoid interfering with existing code. Autograd defines abs() for complex numbers and fabs() for real numbers. I will look into this further down the road.

PyTorch/Caffe2 Operator Micro-benchmarks Before Changes
Tag : short
Benchmarking PyTorch: svd
Mode: Eager
Name: svd_M512_N512
Input: M: 512, N: 512
Forward Execution Time (us) : 162339.425
Forward Execution Time (us) : 162517.479
Forward Execution Time (us) : 162847.775

PyTorch/Caffe2 Operator Micro-benchmarks After Changes
Tag : short
Benchmarking PyTorch: svd
Mode: Eager
Name: svd_M512_N512
Input: M: 512, N: 512
Forward Execution Time (us) : 162032.117
Forward Execution Time (us) : 161943.484
Forward Execution Time (us) : 162513.786

Pull Request resolved: pytorch/pytorch#27653
Differential Revision: D17907886
Pulled By: ezyang
fbshipit-source-id: a88b6d0427591ec1fba09e97c880f535c5d0e513
auto mn = std::min(m, n);
Tensor iwork = at::empty({8*mn}, at::kInt);
auto iwork_data = iwork.data_ptr<int>();
Tensor rwork;
@dylanbespalko why is this required? Did you have tests to verify the SVD behavior for complex?
I am working on adding SVD for complex on CUDA, in case we need to add something similar there as well, since MAGMA follows LAPACK.
In-tree changes to pytorch to support complex numbers are being submitted here.
Out-of-tree support for complex numbers is here: pytorch-cpu-strided-complex extension
Changes so far:
PyTorch/Caffe2 Operator Micro-benchmarks Before Changes
Tag : short
Benchmarking PyTorch: svd
Mode: Eager
Name: svd_M512_N512
Input: M: 512, N: 512
Forward Execution Time (us) : 162339.425
Forward Execution Time (us) : 162517.479
Forward Execution Time (us) : 162847.775
PyTorch/Caffe2 Operator Micro-benchmarks After Changes
Tag : short
Benchmarking PyTorch: svd
Mode: Eager
Name: svd_M512_N512
Input: M: 512, N: 512
Forward Execution Time (us) : 162032.117
Forward Execution Time (us) : 161943.484
Forward Execution Time (us) : 162513.786