Conversation

@dylanbespalko (Contributor) commented Sep 19, 2019

Added Complex support with AVX to unary ops and binary ops.

  • I need to add nan propagation to minimum() and maximum() in the future.

In-tree changes to pytorch to support complex numbers are being submitted here.
Out-of-tree support for complex numbers is here: pytorch-cpu-strided-complex extension

Preliminary Benchmarks are here.

  • I tried rrii and riri and found that riri is better in most situations.
  • Divide is very slow because you can't reduce 1/(x + iy).
  • Sqrt is also very slow.
  • Reciprocal could be sped up after I add conj().
  • Everything else is typically within 20% of the real number performance.

Questions:

  1. Why does macOS not support MKL? #if AT_MKL_ENABLED() && !defined(__APPLE__) in vml.h. MKL does support some complex operations like Abs, so I was curious about trying it.
  2. Is MKL just calling AVX?

@pytorchbot added the module: cpu and module: operators labels Sep 19, 2019
@pytorchbot added the caffe2 and module: internals labels Sep 19, 2019
@pytorchbot added the module: docs label Sep 19, 2019
@dylanbespalko changed the title from "WIP: Vectorized complex unary and binary op support." to "Vectorized complex unary and binary op support." Sep 23, 2019
@dylanbespalko (Contributor Author):

@ezyang, @VitalyFedyunin

Added Complex support with AVX to unary ops and binary ops.

  • I need to add nan propagation to minimum() and maximum() in the future.

In-tree changes to pytorch to support complex numbers are being submitted here.
Out-of-tree support for complex numbers is here: pytorch-cpu-strided-complex extension

Preliminary Benchmarks are here.

  • I tried rrii and riri and found that riri is better in most situations (see the layout sketch below).
  • Divide is very slow because you can't reduce 1/(x + iy).
  • Sqrt is also very slow.
  • Reciprocal could be sped up after I add conj()
  • Everything else is typically within 20% of the real number performance.

Questions:

  1. Why does macOS not support MKL? #if AT_MKL_ENABLED() && !defined(__APPLE__) in vml.h. MKL does support some complex operations like Abs, so I was curious about trying it.
  2. Is MKL just calling AVX?
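
(For reference on the two layouts: a minimal sketch of riri, assuming the standard std::complex<double> memory layout; load_riri is an illustrative name, not code from this PR.)

#include <immintrin.h>
#include <complex>

// riri interleaves each value's real and imaginary parts in a 256-bit
// register: [r0, i0, r1, i1]. rrii groups them instead: [r0, r1, i0, i1].
// std::complex<double> is stored as {real, imag} in memory, so a plain
// 4-double load of two consecutive values already yields riri, with no
// shuffle required.
__m256d load_riri(const std::complex<double>* src) {
  return _mm256_loadu_pd(reinterpret_cast<const double*>(src));
}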

value_t (*zabs_)(T) = at::native::zabs;
for (int i = 0; i != Vec256<T>::size(); i++) {
- c[i] = a[i] < min_vec[i] ? min_vec[i] : (a[i] > max_vec[i] ? max_vec[i] : a[i]);
+ c[i] = zabs_(a[i]) < zabs_(min_vec[i]) ? min_vec[i] : (zabs_(a[i]) > zabs_(max_vec[i]) ? max_vec[i] : a[i]);
Contributor:

Note for reviewers: zabs is a no-op on real-only numbers. (The semantics of zabs are pretty weird. In particular, zabs(-1) == -1 if -1 is a non-complex type, but == 1 if it is a complex type. That seems very fishy to me, semantically.) I'd prefer not to implement clamp on complex unless we have a really good reason for hacking it up this way.
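
(The behavior in question, sketched as plain overloads; the actual at::native::zabs is templated:)

#include <complex>

// zabs is the identity on real scalars but the magnitude on complex
// scalars: zabs(-1.0) == -1.0, while zabs(complex(-1.0, 0.0)) == 1.0.
inline double zabs(double v) { return v; }
inline double zabs(std::complex<double> v) { return std::abs(v); }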

Contributor Author:

Hmm, I was under the impression that the min, max arguments could be tensors, but the torch docs indicate that they are just numeric constants.

The following is a voltage-control mechanism commonly used in my work (a two-step magnitude clamp):

clipped = np.where(np.abs(a) > np.abs(min), a, min)
result = np.where(np.abs(clipped) < np.abs(max), clipped, max)

Perhaps I should just use torch.where().

double real_value = std::real(val);
double imag_value = std::imag(val);
values = _mm256_setr_pd(real_value, imag_value,
real_value, imag_value);
Contributor:

I'm going to trust that you got the SIMD instructions correct here. If there is some sort of unit test you could write here that would be spiffy.

Contributor Author:

Yes, the unit tests for complex numbers can be found at pytorch-cpu-strided-complex extension. Once things are running smoothly I can move them into the pytorch unit tests.

return _mm256_permute_pd(imag_(), 0x05); //b a
}
Vec256<std::complex<double>> acos() const {
return map(std::acos);
Contributor:

For fallbacks like this, can't you just inherit the implementations from the base?

Contributor Author:

Yeah, I actually couldn't figure that out. I read that template classes don't support inheritance? Am I crazy?

Contributor: (inline example code not captured in this transcript)

Contributor Author:

I looked into it. Vec256 is a template class. While inheritance is perfectly valid, e.g. class Vec256z : public Vec256<std::complex<double>> {}, it requires that I define a new class Vec256z. The rest of the code creates template objects as Vec256<scalar_t>. In your example you needed to create a new class named B, whereas I have to keep using the name Vec256.

You may notice that Vec256<int64_t> only defines several of the Vec256<T> methods. I think this is because very few math kernels use integer-specific function calls.
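
(A minimal sketch of the specialization-versus-inheritance point, with simplified bodies:)

#include <cstddef>
#include <complex>

// The primary template: call sites write Vec256<scalar_t> for every T.
template <typename T>
struct Vec256 {
  static constexpr std::size_t size() { return 32 / sizeof(T); }
  // generic element-wise fallbacks ...
};

// A full specialization replaces the primary template for this one T.
// Nothing is inherited, so every method it needs must be written again.
template <>
struct Vec256<std::complex<double>> {
  static constexpr std::size_t size() { return 2; }
  // AVX-backed methods ...
};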

Contributor:

Oh that's right, we're specializing, not inheriting. OK, this seems fine!

Contributor:

+1 Victim of this Vec256 approach. We should open a help group.

using namespace vec256;

- template <typename traits, std::size_t... I>
+ template <typename traits, std::size_t... INDEX>
Contributor:

No substantive changes here, I guess?

Contributor Author:

Yes, this was causing a bunch of CI failures. On some machines complex.h #defines the letter I, so any code that includes it and uses I as an identifier gets build errors.
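
(A minimal illustration of the clash; Dereference is a placeholder name, not the PR's code:)

// C's complex.h may inject the macro below on some platforms.
#define I _Complex_I

// template <typename traits, std::size_t... I>   // no longer parses: I is a macro
template <typename traits, std::size_t... INDEX>  // renamed parameter: unaffected
struct Dereference {};

#undef I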

@ezyang (Contributor) commented Sep 23, 2019

Looks good. My comments are just minor. Did not check the vectorization carefully.

Why does macOS not support MKL? #if AT_MKL_ENABLED() && !defined(__APPLE__) in vml.h. MKL does support some complex operations like Abs, so I was curious about trying it.

Condition was added in #8488. @cpuhrsch do you remember why you Xed out apple? (EDIT: Judging from the PR, it kind of looks like it didn't build on OS X, and we didn't figure out why at the time.)

Is MKL just calling AVX?

I don't know the answer offhand to this, but it would surprise me if MKL wasn't vectorizing.

@cpuhrsch (Contributor):

Yes, I disabled it for OSX because there was a hard error for MKL VML abs. Instead of fixing the issue at that time, I commented it out due to prioritization.

@dylanbespalko (Contributor Author), replying to @cpuhrsch:

Yes, I disabled it for OSX because there was a hard error for MKL VML abs. Instead of fixing the issue at that time, I commented it out due to prioritization.

That should be fine; I deploy on Linux anyway. I benchmarked two different memory layouts for complex number support. The secondary solution was better for norm-related calculations; however, I also know that MKL has complex support for Abs(). I was just curious to see if it could replace my primary implementation.

@dylanbespalko (Contributor Author):

@VitalyFedyunin,

Have you had a chance to look at aten/src/ATen/cpu/vec256/vec256_complex_double.h?

I'm currently looking into ways to speed up reciprocal and div. You can find my other AVX implementations here.
The AVX implementation used here is vec256_complex_double_riri.h.
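
(For context, the conj()-based reciprocal in scalar form; a sketch, not the PR's vectorized code:)

#include <complex>

// 1/z = conj(z) / |z|^2: trades a full complex divide for two real
// multiplies, one add, and two real divides.
std::complex<double> reciprocal(std::complex<double> z) {
  double denom = z.real() * z.real() + z.imag() * z.imag();  // |z|^2
  return std::conj(z) / denom;
}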

@VitalyFedyunin (Contributor):

Hi! Sorry, I have 7 more PRs to review before this one. I promise to take a look by the end of this week.

@yf225 added the triaged label Sep 25, 2019
@VitalyFedyunin (Contributor) left a review:

I really want to see benchmarking numbers, especially for existing altered ops like clamp

cpu_kernel_vec(
iter,
- [=](scalar_t a) -> scalar_t { return (1 / (1 + std::exp((-a)))); },
+ [=](scalar_t a) -> scalar_t { return (decltype(a)(1) / (decltype(a)(1) + std::exp((-a)))); },
Contributor:

Wait, why decltype(a) instead of scalar_t?

Contributor Author:

Yes, fixed.
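
(The agreed change, sketched; sigmoid here is a stand-in for the kernel's scalar lambda:)

#include <cmath>
#include <complex>

// Spell the constant as scalar_t(1) rather than decltype(a)(1); both
// name the same type in the lambda, scalar_t is simply clearer.
template <typename scalar_t>
scalar_t sigmoid(scalar_t a) {
  return scalar_t(1) / (scalar_t(1) + std::exp(-a));
}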


}
__m256d real_() const {
auto mask = _mm256_setr_pd(1.0, 0.0, 1.0, 0.0);
return _mm256_mul_pd(values, mask);
Contributor:

Bitmask might be faster

Contributor Author:

Excellent suggestion. Done.
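
(A sketch of the bitmask variant, assuming the riri lane order; not necessarily the merged code:)

#include <immintrin.h>

// Clear the imaginary lanes with a bitwise AND instead of multiplying
// by a 1.0/0.0 mask, avoiding a floating-point multiply.
__m256d real_masked(__m256d values) {
  const __m256d mask = _mm256_castsi256_pd(
      _mm256_setr_epi64x(-1, 0, -1, 0));  // all-ones in lanes 0 and 2 (real)
  return _mm256_and_pd(values, mask);
}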

@dylanbespalko (Contributor Author) commented Oct 4, 2019

I really want to see benchmarking numbers, especially for existing altered ops like clamp

Here are the benchmarking numbers for dtype=float64:

Build: BUILD_TEST=0 USE_NATIVE_ARCH=ON python setup.py develop
python -m pt.complex_clamp_test --omp_num_threads 1 --mkl_num_threads 1


PyTorch/Caffe2 Operator Micro-benchmarks with zabs()

Tag : short

Benchmarking PyTorch: clamp
Mode: Eager
Name: clamp_M512_N512
Input: M: 512, N: 512
Forward Execution Time (us) : 120.053
Forward Execution Time (us) : 119.689
Forward Execution Time (us) : 119.227


PyTorch/Caffe2 Operator Micro-benchmarks without zabs()

Tag : short

Benchmarking PyTorch: clamp
Mode: Eager
Name: clamp_M512_N512
Input: M: 512, N: 512
Forward Execution Time (us) : 119.571
Forward Execution Time (us) : 119.316
Forward Execution Time (us) : 119.235

While the zabs() no-op does not affect performance, there appears to be a problem with using value_t = float; in the Vec256 class. On some compilers I get build errors where that expression has been expanded to using value_t = using value_t = float;, suggesting there is some erroneous pre-processing. I will use enable_if<> instead, as seen in Vec256_base.h.

using value_t = float should always be valid in the KernelOps files because it occurs directly after using scalar_t = std::complex<float>.
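
(A sketch of the enable_if<> overload approach mentioned above; type and method names are illustrative:)

#include <complex>
#include <type_traits>

template <typename T> struct is_complex_t : std::false_type {};
template <typename U> struct is_complex_t<std::complex<U>> : std::true_type {};

template <typename T>
struct Vec256Sketch {
  // Selected when T is a real scalar type.
  template <typename U = T,
            typename std::enable_if<!is_complex_t<U>::value, int>::type = 0>
  void abs_() { /* plain element-wise |x| */ }

  // Selected when T is std::complex<...>.
  template <typename U = T,
            typename std::enable_if<is_complex_t<U>::value, int>::type = 0>
  void abs_() { /* element-wise sqrt(re*re + im*im) */ }
};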

@dylanbespalko (Contributor Author):

@ezyang,

Assuming all goes well with the CI tonight, I think I have made the requested changes. Note that I have also made the following changes:

  • Modified Vec256::angle() to provide a continuous 0-360 degree range (see the sketch after this comment).
  • Modified Vec256::clamp()/minimum()/maximum()/abs() to use enable_if method overloads that don't require ztype() or zabs().
  • Modified Vec256::reciprocal()/div() to improve performance.
  • Added Vec256::conj().
  • Modified Vec256 methods to use AND instead of MUL operations to improve performance.

Let me know if you see any problems.
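
(A scalar sketch of the angle() change, assuming a degree-valued result as described above; not the PR's vectorized code. M_PI is the usual POSIX constant.)

#include <cmath>
#include <complex>

// std::arg returns a phase in (-pi, pi]; shifting negative results by
// 360 gives a continuous [0, 360) degree range.
double angle_degrees(std::complex<double> z) {
  double deg = std::arg(z) * 180.0 / M_PI;  // (-180, 180]
  return deg < 0.0 ? deg + 360.0 : deg;     // [0, 360)
}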

@ezyang (Contributor) commented Oct 4, 2019

The errors in


Oct 04 10:05:45 ======================================================================
Oct 04 10:05:45 ERROR: test_DoubleTensor_lgamma (__main__.TestCuda)
Oct 04 10:05:45 ----------------------------------------------------------------------
Oct 04 10:05:45 Traceback (most recent call last):
Oct 04 10:05:45   File "/var/lib/jenkins/workspace/test/common_utils.py", line 543, in wrapper
Oct 04 10:05:45     method(*args, **kwargs)
Oct 04 10:05:45   File "/var/lib/jenkins/workspace/test/common_utils.py", line 543, in wrapper
Oct 04 10:05:45     method(*args, **kwargs)
Oct 04 10:05:45   File "test_cuda.py", line 637, in tmp
Oct 04 10:05:45     gpu_result = getattr(gpu_tensor, fn)(*gpu_args)
Oct 04 10:05:45 RuntimeError: Expected tensor to have cpu DeviceType, but got tensor with cuda DeviceType (while checking arguments for lgamma)

look real-ish.

return ret;
}
Vec256<T> angle() const {
return *this;
Contributor:

These default implementations look wrong lol
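
(For context: the phase of a real scalar is 0 for non-negative values and 180 degrees for negative ones, so returning *this is only coincidentally right; a sketch, keeping the degree convention above:)

// What a correct real-valued default would look like, in degrees.
double angle_real(double x) { return x < 0.0 ? 180.0 : 0.0; }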

} \
Tensor& _##op##_out_##prefix(Tensor& result, const Tensor& self) { \
- checkBackend(#op, result, Backend::device); \
+ checkDeviceType(#op, result, DeviceType::CPU); \
Contributor:

Error probably caused by this change

# needs to be imported after torch
import cpp_extension # noqa

import cpp_extension # noqa
Contributor:

?

@ezyang (Contributor) left a review:

Everything looks good, just need to get CI happy.

@dylanbespalko (Contributor Author):

@ezyang,

The CI looks happy now :). Have a good weekend.

@dylanbespalko (Contributor Author):

@ezyang,

The fbgemm submodule changes have been removed from the PR. I don't know how they got in there. I pulled from master and I'm running CI again.

@facebook-github-bot (Contributor) left a comment:

@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor):

@ezyang merged this pull request in 7c472ec.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Oct 9, 2019
Summary: (same as the PR description above)

Pull Request resolved: pytorch/pytorch#26500

Differential Revision: D17835431

Pulled By: ezyang

fbshipit-source-id: 6746209168fbeb567af340c22bf34af28286bd54
thiagocrepaldi pushed a commit to thiagocrepaldi/pytorch that referenced this pull request Feb 4, 2020
Summary: (same as the PR description above)

Pull Request resolved: pytorch#26500

Differential Revision: D17835431

Pulled By: ezyang

fbshipit-source-id: 6746209168fbeb567af340c22bf34af28286bd54