Vectorized complex unary and binary op support. #26500
Conversation
Added complex support with AVX to unary ops and binary ops.
In-tree changes to pytorch to support complex numbers are being submitted here. Preliminary benchmarks are here.
Questions:
value_t (*zabs_)(T) = at::native::zabs;
for (int i = 0; i != Vec256<T>::size(); i++) {
-  c[i] = a[i] < min_vec[i] ? min_vec[i] : (a[i] > max_vec[i] ? max_vec[i] : a[i]);
+  c[i] = zabs_(a[i]) < zabs_(min_vec[i]) ? min_vec[i] : (zabs_(a[i]) > zabs_(max_vec[i]) ? max_vec[i] : a[i]);
Note for reviewers: zabs is a no-op on real-only numbers.

The semantics of zabs are pretty weird. In particular, zabs(-1) == -1 if -1 is a non-complex type, but == 1 if it is a complex type. That seems very fishy to me, semantically. I'd prefer not to implement clamp on complex unless we have a really good reason for hacking it up this way.
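For illustration, a standalone sketch of the semantics being described (hypothetical free functions, not the actual at::native::zabs):

```cpp
#include <complex>

// Hypothetical sketch of the described zabs semantics:
// identity on real types, magnitude on complex types.
template <typename T>
T zabs(T x) {
  return x;  // real: no-op, so zabs(-1) == -1
}

template <typename T>
std::complex<T> zabs(std::complex<T> z) {
  return std::complex<T>(std::abs(z), 0);  // complex: zabs(-1+0i) == 1+0i
}
```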
Hmm, I was under the impression that the min and max arguments could be tensors, but the torch docs indicate that they are just numeric constants.
The following magnitude-based clamp is a common voltage control mechanism in my work:
lo = np.where(np.abs(a) > np.abs(min), a, min)
result = np.where(np.abs(lo) < np.abs(max), lo, max)
Perhaps I should just use torch.where().
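For what it's worth, a sketch of that pattern with tensor min/max using at::where (clamp_by_magnitude is a hypothetical helper, not part of this PR):

```cpp
#include <ATen/ATen.h>

// Hypothetical magnitude-based clamp built from at::where,
// mirroring the numpy expression above with tensor min/max.
at::Tensor clamp_by_magnitude(const at::Tensor& a,
                              const at::Tensor& min,
                              const at::Tensor& max) {
  auto lo = at::where(a.abs() > min.abs(), a, min);  // raise small magnitudes to min
  return at::where(lo.abs() < max.abs(), lo, max);   // cap large magnitudes at max
}
```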
double real_value = std::real(val);
double imag_value = std::imag(val);
values = _mm256_setr_pd(real_value, imag_value,
                        real_value, imag_value);
I'm going to trust that you got the SIMD instructions correct here. If there is some sort of unit test you could write here, that would be spiffy.
Yes, the unit tests for complex numbers can be found in the pytorch-cpu-strided-complex extension. Once things are running smoothly I can move them into the pytorch unit tests.
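In the meantime, a minimal standalone check of the interleaved layout produced by the _mm256_setr_pd broadcast above could look something like this (sketch, assuming AVX is available):

```cpp
#include <immintrin.h>
#include <cassert>
#include <complex>

// Sketch: verify that broadcasting one complex scalar yields the
// interleaved riri layout {real, imag, real, imag}.
int main() {
  std::complex<double> val(3.0, -4.0);
  double real_value = std::real(val);
  double imag_value = std::imag(val);
  __m256d values = _mm256_setr_pd(real_value, imag_value,
                                  real_value, imag_value);
  alignas(32) double out[4];
  _mm256_store_pd(out, values);
  assert(out[0] == 3.0 && out[1] == -4.0);  // first complex lane
  assert(out[2] == 3.0 && out[3] == -4.0);  // second complex lane
  return 0;
}
```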
  return _mm256_permute_pd(imag_(), 0x05);  // b a
}
Vec256<std::complex<double>> acos() const {
  return map(std::acos);
For fallbacks like this, can't you just inherit the implementations from the base?
Yeah, I actually couldn't figure that out. I read that template classes don't support inheritance? Am I crazy?
Non-virtual template methods inherit just fine. https://godbolt.org/z/Gt1QfK
I looked into it. Vec256 is a template class. While inheritance is perfectly valid, e.g. class Vec256z : public Vec256<std::complex<double>> {}, it requires that I define a new class Vec256z. The rest of the code creates template objects using Vec256<scalar_t>. In your example you needed to create a new class named B, whereas I have to keep using the name Vec256.
You may notice that Vec256<int64_t> only defines several methods of Vec256<T>. I think this is because very few math kernels use integer-specific function calls.
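In other words, the code relies on full specialization rather than derivation, so the name Vec256 is preserved at every use site. A simplified sketch (assumption: heavily reduced from the real class):

```cpp
#include <complex>

// Primary template: generic scalar fallback used as Vec256<scalar_t>.
template <typename T>
struct Vec256 {
  static constexpr int size() { return 32 / sizeof(T); }
  // ... generic element-wise implementations ...
};

// Full specialization keeps the same name, so existing uses of
// Vec256<scalar_t> pick it up automatically. Nothing is inherited
// from the primary template; every needed method must be written out again.
template <>
struct Vec256<std::complex<double>> {
  static constexpr int size() { return 2; }
  // ... AVX implementations ...
};
```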
Oh that's right, we're specializing, not inheriting. OK, this seems fine!
+1. Victim of this Vec256 approach. We should open a help group.
using namespace vec256;

-template <typename traits, std::size_t... I>
+template <typename traits, std::size_t... INDEX>
No substantive changes here, I guess?
Yes, this was causing a bunch of CI failures. On some machines, the complex header #defines the letter I, so if you include complex.h you will get build errors on those machines.
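A sketch of the mechanism (the exact macro definition varies by platform, so treat this as an assumption):

```cpp
// On some platforms a transitively included <complex.h> effectively does:
//   #define I _Complex_I
// After that, a parameter pack named I is macro-expanded and the
// template no longer compiles:
//   template <typename traits, std::size_t... I>   // breaks
// Renaming the pack sidesteps the collision entirely:
template <typename traits, std::size_t... INDEX>
struct index_sequence_user {};
```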
Looks good. My comments are just minor. Did not check the vectorization carefully.
Condition was added in #8488. @cpuhrsch, do you remember why you Xed out Apple? (EDIT: Judging from the PR, it looks like it didn't build on OS X, and we didn't figure out why at the time.)
I don't know the answer to this offhand, but it would surprise me if MKL wasn't vectorizing.
Yes, I disabled it for OS X because there was a hard error for MKL VML abs. Instead of fixing the issue at that time, I commented it out due to prioritization.
That should be fine; I deploy on Linux anyway. I benchmarked two different memory layouts for complex number support. The secondary solution was better for norm-related calculations; however, I also know that MKL has complex support for
Have you had a chance to look at it? I'm currently looking into ways to speed up reciprocal and div. You can find my other AVX implementations here.
Hi! Sorry, I have 7 more PRs to review before this one. I promise to take a look by the end of this week.
VitalyFedyunin left a comment
I really want to see benchmarking numbers, especially for existing altered ops like clamp.
cpu_kernel_vec(
    iter,
-    [=](scalar_t a) -> scalar_t { return (1 / (1 + std::exp((-a)))); },
+    [=](scalar_t a) -> scalar_t { return (decltype(a)(1) / (decltype(a)(1) + std::exp((-a)))); },
Wait, why decltype(a) instead of scalar_t?
Yes, fixed.
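The corrected form spelled out (sketch; the literals are promoted through the element type so complex instantiations work):

```cpp
#include <cmath>
#include <complex>

// Sketch of the fixed scalar kernel body: use scalar_t for the
// literals instead of decltype(a).
template <typename scalar_t>
scalar_t sigmoid_scalar(scalar_t a) {
  return scalar_t(1) / (scalar_t(1) + std::exp(-a));
}
```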
}
__m256d real_() const {
  auto mask = _mm256_setr_pd(1.0, 0.0, 1.0, 0.0);
  return _mm256_mul_pd(values, mask);
A bitmask might be faster.
Excellent suggestion. Done.
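For reference, the bitmask variant could look like this (sketch, assuming the riri layout with reals in lanes 0 and 2):

```cpp
#include <immintrin.h>

// Sketch: zero the imaginary lanes with a bitwise AND instead of a
// multiply by {1.0, 0.0, 1.0, 0.0}.
__m256d real_lanes(__m256d values) {
  const __m256d mask = _mm256_castsi256_pd(
      _mm256_setr_epi64x(-1, 0, -1, 0));  // all-ones in lanes 0 and 2
  return _mm256_and_pd(values, mask);
}
```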
Here are the benchmarking numbers for dtype=float64.
Build: BUILD_TEST=0 USE_NATIVE_ARCH=ON python setup.py develop

PyTorch/Caffe2 Operator Micro-benchmarks, with zabs()
Tag: short
Benchmarking PyTorch: clamp

PyTorch/Caffe2 Operator Micro-benchmarks, without zabs()
Tag: short
Benchmarking PyTorch: clamp

While the zabs() no-op does not affect performance, there appears to be a problem with
Assuming all goes well with the CI tonight, I think I have made the requested changes. Note I have also modified:
Let me know if you see any problems.
The errors look real-ish.
  return ret;
}
Vec256<T> angle() const {
  return *this;
These default implementations look wrong lol
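For a real element type, the mathematical angle (argument) is 0 for non-negative values and π for negative ones; a scalar sketch of what a correct default could compute instead of returning *this:

```cpp
// Sketch: the argument of a real number is 0 if x >= 0 and pi if x < 0.
template <typename T>
T angle_scalar(T x) {
  return x < T(0) ? T(3.141592653589793) : T(0);  // pi for negative reals
}
```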
aten/src/ATen/native/UnaryOps.cpp (Outdated)

  } \
  Tensor& _##op##_out_##prefix(Tensor& result, const Tensor& self) { \
-    checkBackend(#op, result, Backend::device); \
+    checkDeviceType(#op, result, DeviceType::CPU); \
The error is probably caused by this change.
# needs to be imported after torch
import cpp_extension  # noqa
?
ezyang left a comment
Everything looks good, just need to get CI happy.
The CI looks happy now :). Have a good weekend.
The fbgemm submodule changes have been removed from the PR. I don't know how they got in there. I pulled from master and I'm running CI again.
facebook-github-bot left a comment
@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: Added complex support with AVX to unary ops and binary ops. I need to add nan propagation to minimum() and maximum() in the future. In-tree changes to pytorch to support complex numbers are being submitted here. Out-of-tree support for complex numbers is here: pytorch-cpu-strided-complex extension. Preliminary benchmarks are here. I tried rrii and riri and found that riri is better in most situations. Divide is very slow because you can't reduce 1/(x+y). Sqrt is also very slow. Reciprocal could be sped up after I add conj(). Everything else is typically within 20% of the real number performance. Questions: Why does macOS not support MKL? (#if AT_MKL_ENABLED() && !defined(__APPLE__) in vml.h.) MKL does support some complex operations like Abs, so I was curious about trying it. Is MKL just calling AVX?

Pull Request resolved: pytorch/pytorch#26500
Differential Revision: D17835431
Pulled By: ezyang
fbshipit-source-id: 6746209168fbeb567af340c22bf34af28286bd54
Added complex support with AVX to unary ops and binary ops.
I need to add nan propagation to minimum() and maximum() in the future.
In-tree changes to pytorch to support complex numbers are being submitted here.
Out-of-tree support for complex numbers is here: pytorch-cpu-strided-complex extension
Preliminary benchmarks are here.
I tried rrii and riri and found that riri is better in most situations.
Divide is very slow because you can't reduce 1/(x+y); see the sketch after the questions below.
Sqrt is also very slow.
Reciprocal could be sped up after I add conj().
Everything else is typically within 20% of the real number performance.
Questions:
Why does macOS not support MKL? (#if AT_MKL_ENABLED() && !defined(__APPLE__) in vml.h.) MKL does support some complex operations like Abs, so I was curious about trying it.
Is MKL just calling AVX?
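Regarding the divide cost noted above: the full complex quotient needs several lane-wise multiplies plus a division by the squared magnitude, which is why it cannot be reduced the way a real reciprocal can. A scalar sketch:

```cpp
#include <complex>

// (a + bi) / (c + di) = ((a*c + b*d) + (b*c - a*d)i) / (c*c + d*d)
std::complex<double> divide(std::complex<double> x, std::complex<double> y) {
  double a = x.real(), b = x.imag();
  double c = y.real(), d = y.imag();
  double denom = c * c + d * d;  // squared magnitude of the divisor
  return {(a * c + b * d) / denom, (b * c - a * d) / denom};
}
```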