## 🐛 Bug

The following code snippet fails on a source build on ARM (Neoverse N1) when the vectorized code path is triggered (tensor length >= 16) and the build is compiled with `-O1 -fstrict-aliasing` (or any higher optimization level):
```python
import torch

device = 'cpu'
size = 16
a = torch.full((size,), 6, dtype=torch.int32, device=device)
b = torch.full((size,), 3, dtype=torch.int32, device=device)
print(a & b)
```

Actual output:

```
tensor([    40960,         0,  46473216,     43691, 201326912,     43691,
                0,         0,     40960,         0,  46473216,     43691,
        201326912,     43691,         0,         0], dtype=torch.int32)
```

Expected output:

```
tensor([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2], dtype=torch.int32)
```
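As a quick sanity check of the expected result, independent of torch: 6 is `0b110` and 3 is `0b011`, so each element of `a & b` should be `6 & 3 == 2`.

```python
# 6 = 0b110, 3 = 0b011 -> 6 & 3 = 0b010 = 2,
# so every element of the expected tensor equals 2.
expected = [6 & 3 for _ in range(16)]
assert all(x == 2 for x in expected)
print(expected)
```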
## TL;DR

### Potential root cause

Strict aliasing rule violation in `pytorch/aten/src/ATen/cpu/vec/vec_base.h` (lines 742 to 743 at 2a5116e):

```cpp
const intmax_t* a_ptr = reinterpret_cast<const intmax_t*>((const T*)a);
const intmax_t* b_ptr = reinterpret_cast<const intmax_t*>((const T*)b);
```

### Possible fix
```cpp
template <class T, typename Op>
static inline Vectorized<T> bitwise_binary_op(const Vectorized<T>& a, const Vectorized<T>& b, Op op) {
  static constexpr uint32_t element_no = VECTOR_WIDTH / sizeof(intmax_t);
  __at_align__ intmax_t buffer[element_no];
  intmax_t a_ptr[element_no];                              // !
  intmax_t b_ptr[element_no];                              // !
  std::memcpy(a_ptr, &a, sizeof(intmax_t) * element_no);   // !
  std::memcpy(b_ptr, &b, sizeof(intmax_t) * element_no);   // !
  for (uint32_t i = 0U; i < element_no; ++i) {
    buffer[i] = op(a_ptr[i], b_ptr[i]);
  }
  return Vectorized<T>::loadu(buffer);
}
```
Reference: *What is the Strict Aliasing Rule and Why do we care?* by @shafik

We've verified this fix locally on tensor shapes in `arange(1, 1280)`, but please add other shapes as needed to check whether the proposed fix is valid.
## Details

### Vectorized code path

We've seen unit test failures on ARM nodes for `bitwise_and` operations and first narrowed them down to the vectorized code path (all sizes < 16 work as expected) and to optimized code (`-O0` did not reproduce the issue, and we needed to rebuild with `REL_WITH_DEB_INFO=1` to debug).
### Heisenbug

While trying to isolate the issue further, it turned out to be a Heisenbug, which "disappeared" once we looked too closely into the functions in question. E.g., after adding a `std::cout` in `pytorch/aten/src/ATen/native/cpu/BinaryOpsKernel.cpp` (line 258 at 94d6215), the method worked correctly again. Also, gdb refused to step into `pytorch/aten/src/ATen/native/cpu/Loops.h` (lines 216 to 217 at c5bee1e):

```cpp
auto out1 = c10::guts::apply(std::forward<vec_func_t>(vop), std::move(args1));
auto out2 = c10::guts::apply(std::forward<vec_func_t>(vop), std::move(args2));
```

or break in `bitwise_binary_op`, but we were unsure whether this is due to the release + debug symbols build or not. Adding a `raise(SIGTRAP)` in `bitwise_binary_op` also restored the functionality.
As the next step we used

```cpp
template <class T, typename Op>
static inline Vectorized<T> __attribute__((optimize("O0"))) bitwise_binary_op(const Vectorized<T>& a, const Vectorized<T>& b, Op op) {
```

to disable all optimizations in this method so we could step into it via gdb, but this of course also fixed the issue (it also turns out that `__attribute__((optimize("-fno-strict-aliasing")))` is a valid workaround). `valgrind --track-origins=yes` also confirmed the use of uninitialized memory in `at::native::bitwise_and_kernel` in the failing cases and reported no issues otherwise.
### Bisecting compiler optimization flags

This led us to the assumption that one of the optimizations might trigger the failure, and we bisected it to `-O1 -fstrict-aliasing`. After reading @shafik's great blog post (highly recommended if you are interested in why the linked type punning is invalid), we realized that the vectorized code path might indeed violate the strict aliasing rule. Replacing the (invalid) `reinterpret_cast` with the `memcpy` restores the correct behavior.
### Why no failures on x86/AVX?

It's a bit unclear why this issue was only visible on this particular ARM node, as `vec_base.h` seems to use the invalid `reinterpret_cast` in more places (in AVX code), so the issue should also have been visible on x86 (we might want to check and fix those as well).
### NIT

It seems that `-Wno-strict-aliasing` is used in `pytorch/CMakeLists.txt` (line 751 at 2a5116e):

```cmake
string(APPEND CMAKE_CXX_FLAGS " -Wno-strict-aliasing")
```

in combination with `-O2` (line 734 at 2a5116e):

```cmake
string(APPEND CMAKE_CXX_FLAGS " -O2 -fPIC")
```

which seems dangerous, as it silences the warnings while the compiler may still apply this optimization (`-O2` implies `-fstrict-aliasing`).
## Environment

```
PyTorch version: 1.10.0a0+0aef44c
Is debug build: False
CUDA used to build PyTorch: Could not collect
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.3 LTS (aarch64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.21.3
Libc version: glibc-2.31
Python version: 3.8.12 | packaged by conda-forge | (default, Sep 29 2021, 20:54:09) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.4.0-84-generic-aarch64-with-glibc2.17
```
CC @zasdfgbnm @mcarilli (thanks, both of you, for the great help!) and @malfet for visibility.
cc @ezyang @gchanan @zou3519 @bdhirsh @jbschlosser @malfet