Conversation

@dylanbespalko (Contributor) commented Sep 19, 2019

Added Complex support with AVX to unary ops and binary ops.

  • I need to add nan propagation to minimum() and maximum() in the future.

In-tree changes to pytorch to support complex numbers are being submitted here.
Out-of-tree support for complex numbers is here: pytorch-cpu-strided-complex extension

Preliminary Benchmarks are here.

  • I tried rrii and riri and found that riri is better in most situations.
  • Divide is very slow because you can't reduce 1/(x + iy).
  • Sqrt is also very slow.
  • Reciprocal could be sped up after I add conj().
  • Everything else is typically within 20% of the real number performance.

Questions:

  1. Why does macOS not support MKL? #if AT_MKL_ENABLED() && !defined(__APPLE__) in vml.h. MKL does support some complex operations like Abs, so I was curious about trying it.
  2. Is MKL just calling AVX?

@pytorchbot added the module: cpu and module: operators labels Sep 19, 2019
@pytorchbot added the caffe2 and module: internals labels Sep 19, 2019
@pytorchbot added the module: docs label Sep 19, 2019
@dylanbespalko changed the title from "WIP: Vectorized complex unary and binary op support." to "Vectorized complex unary and binary op support." Sep 23, 2019
@dylanbespalko (Contributor Author):

@ezyang, @VitalyFedyunin

Added Complex support with AVX to unary ops and binary ops.

  • I need to add nan propagation to minimum() and maximum() in the future.

In-tree changes to pytorch to support complex numbers are being submitted here.
Out-of-tree support for complex numbers is here: pytorch-cpu-strided-complex extension

Preliminary Benchmarks are here.

  • I tried rrii and riri and found that riri is better in most situations (see the layout sketch below).
  • Divide is very slow because you can't reduce 1/(x + iy).
  • Sqrt is also very slow.
  • Reciprocal could be sped up after I add conj()
  • Everything else is typically within 20% of the real number performance.

Questions:

  1. Why does macOS not support MKL? #if AT_MKL_ENABLED() && !defined(__APPLE__) in vml.h. MKL does support some complex operations like Abs, so I was curious about trying it.
  2. Is MKL just calling AVX?
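
(For reference on the two layouts: a minimal sketch of riri, assuming the standard std::complex<double> memory layout; load_riri is an illustrative name, not code from this PR.)

#include <immintrin.h>
#include <complex>

// riri interleaves each value's real and imaginary parts in a 256-bit
// register: [r0, i0, r1, i1]. rrii groups them instead: [r0, r1, i0, i1].
// std::complex<double> is stored as {real, imag} in memory, so a plain
// 4-double load of two consecutive values already yields riri, with no
// shuffle required.
__m256d load_riri(const std::complex<double>* src) {
  return _mm256_loadu_pd(reinterpret_cast<const double*>(src));
}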

value_t (*zabs_)(T) = at::native::zabs;
for (int i = 0; i != Vec256<T>::size(); i++) {
- c[i] = a[i] < min_vec[i] ? min_vec[i] : (a[i] > max_vec[i] ? max_vec[i] : a[i]);
+ c[i] = zabs_(a[i]) < zabs_(min_vec[i]) ? min_vec[i] : (zabs_(a[i]) > zabs_(max_vec[i]) ? max_vec[i] : a[i]);
Contributor:

Note for reviewers: zabs is a no-op on real-only numbers. (The semantics of zabs are pretty weird. In particular, zabs(-1) == -1 if -1 is a non-complex type, but == 1 if it is a complex type. That seems very fishy to me, semantically.) I'd prefer not to implement clamp on complex unless we have a really good reason for hacking it up this way.
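
(The behavior in question, sketched as plain overloads; the actual at::native::zabs is templated:)

#include <complex>

// zabs is the identity on real scalars but the magnitude on complex
// scalars: zabs(-1.0) == -1.0, while zabs(complex(-1.0, 0.0)) == 1.0.
inline double zabs(double v) { return v; }
inline double zabs(std::complex<double> v) { return std::abs(v); }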

Contributor Author:

Hmm, I was under the impression that the min, max arguments could be tensors, but the torch docs indicate that they are just numeric constants.

The following is a voltage-control mechanism commonly used in my work (a two-step magnitude clamp):

clipped = np.where(np.abs(a) > np.abs(min), a, min)
result = np.where(np.abs(clipped) < np.abs(max), clipped, max)

Perhaps I should just use torch.where().

double real_value = std::real(val);
double imag_value = std::imag(val);
values = _mm256_setr_pd(real_value, imag_value,
real_value, imag_value);
Contributor:

I'm going to trust that you got the SIMD instructions correct here. If there is some sort of unit test you could write here that would be spiffy.

Contributor Author:

Yes, the unit tests for complex numbers can be found at pytorch-cpu-strided-complex extension. Once things are running smoothly I can move them into the pytorch unit tests.

return _mm256_permute_pd(imag_(), 0x05); //b a
}
Vec256<std::complex<double>> acos() const {
return map(std::acos);
Contributor:

For fallbacks like this, can't you just inherit the implementations from the base?

Contributor Author:

Yeah, I actually couldn't figure that out. I read that template classes don't support inheritance? Am I crazy?

Contributor: (inline example code not captured in this transcript)

Contributor Author:

I looked into it. Vec256 is a template class. While inheritance is perfectly valid, e.g. class Vec256z : public Vec256<std::complex<double>> {}, it requires that I define a new class Vec256z. The rest of the code creates template objects as Vec256<scalar_t>. In your example you needed to create a new class named B, whereas I have to keep using the name Vec256.

You may notice that Vec256<int64_t> only defines several of the Vec256<T> methods. I think this is because very few math kernels use integer-specific function calls.
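
(A minimal sketch of the specialization-versus-inheritance point, with simplified bodies:)

#include <cstddef>
#include <complex>

// The primary template: call sites write Vec256<scalar_t> for every T.
template <typename T>
struct Vec256 {
  static constexpr std::size_t size() { return 32 / sizeof(T); }
  // generic element-wise fallbacks ...
};

// A full specialization replaces the primary template for this one T.
// Nothing is inherited, so every method it needs must be written again.
template <>
struct Vec256<std::complex<double>> {
  static constexpr std::size_t size() { return 2; }
  // AVX-backed methods ...
};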

Contributor:

Oh that's right, we're specializing, not inheriting. OK, this seems fine!

Contributor:

+1 Victim of this Vec256 approach. We should open a help group.

using namespace vec256;

- template <typename traits, std::size_t... I>
+ template <typename traits, std::size_t... INDEX>
Contributor:

No substantive changes here, I guess?

Contributor Author:

Yes, this was causing a bunch of CI failures. On some machines complex.h #defines the letter I, so any code that includes it and uses I as an identifier gets build errors.
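
(A minimal illustration of the clash; Dereference is a placeholder name, not the PR's code:)

// C's complex.h may inject the macro below on some platforms.
#define I _Complex_I

// template <typename traits, std::size_t... I>   // no longer parses: I is a macro
template <typename traits, std::size_t... INDEX>  // renamed parameter: unaffected
struct Dereference {};

#undef I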

@ezyang (Contributor) commented Sep 23, 2019

Looks good. My comments are just minor. Did not check the vectorization carefully.

Why does macOS not support MKL? #if AT_MKL_ENABLED() && !defined(__APPLE__) in vml.h. MKL does support some complex operations like Abs, so I was curious about trying it.

Condition was added in #8488. @cpuhrsch do you remember why you Xed out apple? (EDIT: Judging from the PR, it kind of looks like it didn't build on OS X, and we didn't figure out why at the time.)

Is MKL just calling AVX?

I don't know the answer offhand to this, but it would surprise me if MKL wasn't vectorizing.

@cpuhrsch (Contributor):

Yes, I disabled it for OSX because there was a hard error for MKL VML abs. Instead of fixing the issue at that time, I commented it out due to prioritization.

@dylanbespalko (Contributor Author), replying to @cpuhrsch:

Yes, I disabled it for OSX because there was a hard error for MKL VML abs. Instead of fixing the issue at that time, I commented it out due to prioritization.

That should be fine; I deploy on Linux anyway. I benchmarked two different memory layouts for complex number support. The secondary solution was better for norm-related calculations; however, I also know that MKL has complex support for Abs(). I was just curious to see if it could replace my primary implementation.

@dylanbespalko (Contributor Author):

@VitalyFedyunin,

Have you had a chance to look at aten/src/ATen/cpu/vec256/vec256_complex_double.h?

I'm currently looking into ways to speed up reciprocal and div. You can find my other AVX implementations here.
The AVX implementation used here is vec256_complex_double_riri.h.
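
(For context, the conj()-based reciprocal in scalar form; a sketch, not the PR's vectorized code:)

#include <complex>

// 1/z = conj(z) / |z|^2: trades a full complex divide for two real
// multiplies, one add, and two real divides.
std::complex<double> reciprocal(std::complex<double> z) {
  double denom = z.real() * z.real() + z.imag() * z.imag();  // |z|^2
  return std::conj(z) / denom;
}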

@VitalyFedyunin (Contributor):

Hi! Sorry, I have 7 more PRs to review before this one. I promise to take a look by the end of this week.

@yf225 added the triaged label Sep 25, 2019
@VitalyFedyunin (Contributor) left a review:

I really want to see benchmarking numbers, especially for existing altered ops like clamp

cpu_kernel_vec(
iter,
- [=](scalar_t a) -> scalar_t { return (1 / (1 + std::exp((-a)))); },
+ [=](scalar_t a) -> scalar_t { return (decltype(a)(1) / (decltype(a)(1) + std::exp((-a)))); },
Contributor:

Wait, why decltype(a) instead of scalar_t?

Contributor Author:

Yes, fixed.
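
(The agreed change, sketched; sigmoid here is a stand-in for the kernel's scalar lambda:)

#include <cmath>
#include <complex>

// Spell the constant as scalar_t(1) rather than decltype(a)(1); both
// name the same type in the lambda, scalar_t is simply clearer.
template <typename scalar_t>
scalar_t sigmoid(scalar_t a) {
  return scalar_t(1) / (scalar_t(1) + std::exp(-a));
}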


}
__m256d real_() const {
auto mask = _mm256_setr_pd(1.0, 0.0, 1.0, 0.0);
return _mm256_mul_pd(values, mask);
Contributor:

Bitmask might be faster

Contributor Author:

Excellent suggestion. Done.
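
(A sketch of the bitmask variant, assuming the riri lane order; not necessarily the merged code:)

#include <immintrin.h>

// Clear the imaginary lanes with a bitwise AND instead of multiplying
// by a 1.0/0.0 mask, avoiding a floating-point multiply.
__m256d real_masked(__m256d values) {
  const __m256d mask = _mm256_castsi256_pd(
      _mm256_setr_epi64x(-1, 0, -1, 0));  // all-ones in lanes 0 and 2 (real)
  return _mm256_and_pd(values, mask);
}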

@dylanbespalko (Contributor Author) commented Oct 4, 2019

I really want to see benchmarking numbers, especially for existing altered ops like clamp

Here are the benchmarking numbers for dtype=float64:

Build: BUILD_TEST=0 USE_NATIVE_ARCH=ON python setup.py develop
python -m pt.complex_clamp_test --omp_num_threads 1 --mkl_num_threads 1


PyTorch/Caffe2 Operator Micro-benchmarks with zabs()

Tag : short

Benchmarking PyTorch: clamp
Mode: Eager
Name: clamp_M512_N512
Input: M: 512, N: 512
Forward Execution Time (us) : 120.053
Forward Execution Time (us) : 119.689
Forward Execution Time (us) : 119.227


PyTorch/Caffe2 Operator Micro-benchmarks without zabs()

Tag : short

Benchmarking PyTorch: clamp
Mode: Eager
Name: clamp_M512_N512
Input: M: 512, N: 512
Forward Execution Time (us) : 119.571
Forward Execution Time (us) : 119.316
Forward Execution Time (us) : 119.235

While the zabs() no-op does not affect performance, there appears to be a problem with using value_t = float; in the Vec256 class. On some compilers I get build errors where that expression has been expanded to using value_t = using value_t = float;, suggesting there is some erroneous pre-processing. I will use enable_if<> instead, as seen in Vec256_base.h.

using value_t = float should always be valid in the KernelOps files because it occurs directly after using scalar_t = std::complex<float>.
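
(A sketch of the enable_if<> overload approach mentioned above; type and method names are illustrative:)

#include <complex>
#include <type_traits>

template <typename T> struct is_complex_t : std::false_type {};
template <typename U> struct is_complex_t<std::complex<U>> : std::true_type {};

template <typename T>
struct Vec256Sketch {
  // Selected when T is a real scalar type.
  template <typename U = T,
            typename std::enable_if<!is_complex_t<U>::value, int>::type = 0>
  void abs_() { /* plain element-wise |x| */ }

  // Selected when T is std::complex<...>.
  template <typename U = T,
            typename std::enable_if<is_complex_t<U>::value, int>::type = 0>
  void abs_() { /* element-wise sqrt(re*re + im*im) */ }
};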

@dylanbespalko (Contributor Author):

@ezyang,

Assuming all goes well with the CI tonight, I think I have made the requested changes. Note that I have also made the following changes:

  • Modified Vec256::angle() to provide a continuous 0-360 degree range (see the sketch after this comment).
  • Modified Vec256::clamp()/minimum()/maximum()/abs() to use enable_if method overloads that don't require ztype() or zabs().
  • Modified Vec256::reciprocal()/div() to improve performance.
  • Added Vec256::conj().
  • Modified Vec256 methods to use AND instead of MUL operations to improve performance.

Let me know if you see any problems.
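
(A scalar sketch of the angle() change, assuming a degree-valued result as described above; not the PR's vectorized code. M_PI is the usual POSIX constant.)

#include <cmath>
#include <complex>

// std::arg returns a phase in (-pi, pi]; shifting negative results by
// 360 gives a continuous [0, 360) degree range.
double angle_degrees(std::complex<double> z) {
  double deg = std::arg(z) * 180.0 / M_PI;  // (-180, 180]
  return deg < 0.0 ? deg + 360.0 : deg;     // [0, 360)
}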

@ezyang (Contributor) commented Oct 4, 2019

The errors in


Oct 04 10:05:45 ======================================================================
Oct 04 10:05:45 ERROR: test_DoubleTensor_lgamma (__main__.TestCuda)
Oct 04 10:05:45 ----------------------------------------------------------------------
Oct 04 10:05:45 Traceback (most recent call last):
Oct 04 10:05:45   File "/var/lib/jenkins/workspace/test/common_utils.py", line 543, in wrapper
Oct 04 10:05:45     method(*args, **kwargs)
Oct 04 10:05:45   File "/var/lib/jenkins/workspace/test/common_utils.py", line 543, in wrapper
Oct 04 10:05:45     method(*args, **kwargs)
Oct 04 10:05:45   File "test_cuda.py", line 637, in tmp
Oct 04 10:05:45     gpu_result = getattr(gpu_tensor, fn)(*gpu_args)
Oct 04 10:05:45 RuntimeError: Expected tensor to have cpu DeviceType, but got tensor with cuda DeviceType (while checking arguments for lgamma)

look real-ish.

return ret;
}
Vec256<T> angle() const {
return *this;
Contributor:

These default implementations look wrong lol
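
(For context: the phase of a real scalar is 0 for non-negative values and 180 degrees for negative ones, so returning *this is only coincidentally right; a sketch, keeping the degree convention above:)

// What a correct real-valued default would look like, in degrees.
double angle_real(double x) { return x < 0.0 ? 180.0 : 0.0; }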

} \
Tensor& _##op##_out_##prefix(Tensor& result, const Tensor& self) { \
- checkBackend(#op, result, Backend::device); \
+ checkDeviceType(#op, result, DeviceType::CPU); \
Contributor:

Error probably caused by this change

# needs to be imported after torch
import cpp_extension # noqa

import cpp_extension # noqa
Contributor:

?

@ezyang (Contributor) left a review:

Everything looks good, just need to get CI happy.

@dylanbespalko (Contributor Author):

@ezyang,

The CI looks happy now :). Have a good weekend.

@dylanbespalko (Contributor Author):

@ezyang,

The fbgemm submodule changes have been removed from the PR. I don't know how they got in there. I pulled from master and I'm running CI again.

@facebook-github-bot (Contributor) left a comment:

@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor):

@ezyang merged this pull request in 7c472ec.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Oct 9, 2019
Summary: (same as the PR description above)

Pull Request resolved: pytorch/pytorch#26500

Differential Revision: D17835431

Pulled By: ezyang

fbshipit-source-id: 6746209168fbeb567af340c22bf34af28286bd54
thiagocrepaldi pushed a commit to thiagocrepaldi/pytorch that referenced this pull request Feb 4, 2020
Summary: (same as the PR description above)

Pull Request resolved: pytorch#26500

Differential Revision: D17835431

Pulled By: ezyang

fbshipit-source-id: 6746209168fbeb567af340c22bf34af28286bd54