
Potential strict aliasing rule violation in bitwise_binary_op (on ARM/NEON) #66119

@ptrblck

Description


🐛 Bug

The following code snippet fails on a source build on ARM (Neoverse N1) when the vectorized code path is triggered (tensor length >= 16) and the build uses -O1 -fstrict-aliasing (or any higher optimization level):

import torch
device = 'cpu'
size = 16
a = torch.full((size,), 6, dtype=torch.int32, device=device)
b = torch.full((size,), 3, dtype=torch.int32, device=device)
print(a & b)

Actual (wrong) output:

tensor([    40960,         0,  46473216,     43691, 201326912,     43691,
                0,         0,     40960,         0,  46473216,     43691,
        201326912,     43691,         0,         0], dtype=torch.int32)

Expected output:

tensor([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2], dtype=torch.int32)

TL;DR

Potential root cause

Strict aliasing rule violation in

const intmax_t *a_ptr = reinterpret_cast<const intmax_t*>((const T*) a);
const intmax_t *b_ptr = reinterpret_cast<const intmax_t*>((const T*) b);

A fix could be:

template<class T, typename Op>
static inline Vectorized<T> bitwise_binary_op(const Vectorized<T> &a, const Vectorized<T> &b, Op op) {
  static constexpr uint32_t element_no = VECTOR_WIDTH / sizeof(intmax_t);
  __at_align__ intmax_t buffer[element_no];
  intmax_t a_ptr[element_no]; // !
  intmax_t b_ptr[element_no]; // !
  std::memcpy(&a_ptr, &a, sizeof(intmax_t) * element_no); // !
  std::memcpy(&b_ptr, &b, sizeof(intmax_t) * element_no); // !
  for (uint32_t i = 0U; i < element_no; ++i) {
    buffer[i] = op(a_ptr[i], b_ptr[i]);
  }
  return Vectorized<T>::loadu(buffer);
}

Reference: What is the Strict Aliasing Rule and Why do we care? by @shafik

We've verified this fix locally on tensor shapes in arange(1, 1280), but please add other shapes as needed to check whether the proposed fix is valid.

Details

Vectorized code path

We've seen unit test failures on ARM nodes for bitwise_and operations and first narrowed the issue down to the vectorized code path (all sizes < 16 worked as expected) and to optimized code (-O0 did not reproduce the issue; we needed to rebuild with REL_WITH_DEB_INFO=1).

Heisenbug

While trying to isolate the issue further, it turned out to be a Heisenbug, which "disappeared" once we looked too closely at the functions in question.
E.g., after adding a std::cout here, the method worked correctly again. gdb also refused to step into
auto out1 = c10::guts::apply(std::forward<vec_func_t>(vop), std::move(args1));
auto out2 = c10::guts::apply(std::forward<vec_func_t>(vop), std::move(args2));

or to break in bitwise_binary_op, but we were unsure whether this was due to the release build with debug symbols.
Adding a raise(SIGTRAP) in bitwise_binary_op also restored the correct behavior.

As the next step we've used

template<class T, typename Op>
static inline Vectorized<T> __attribute__((optimize("O0"))) bitwise_binary_op(const Vectorized<T> &a, const Vectorized<T> &b, Op op) {

to disable all optimizations in this method so we could step into it via gdb, but this of course also fixed the issue (it also turns out that __attribute__((optimize("-fno-strict-aliasing"))) is a valid workaround).

valgrind --track-origins=yes also confirmed the use of uninitialized memory in at::native::bitwise_and_kernel in the failing cases and reported no issues otherwise.

Bisecting compiler optimization flags

This led us to the assumption that one of the optimizations might trigger the failure, and we bisected it to -O1 -fstrict-aliasing.
After reading @shafik's great blog post (highly recommended if you are interested in why the linked type punning is invalid), we realized that the vectorized code path might indeed violate the strict aliasing rule.

Replacing the (invalid) reinterpret_cast with a memcpy restores the correct behavior.

Why no failures on x86/AVX?

It's a bit unclear why this issue was only visible on this particular ARM node, as vec_base.h seems to use the invalid reinterpret_cast in more places (including AVX code), so the issue should also be visible on x86 (we might want to check and fix those as well).

NIT

It seems that -Wno-strict-aliasing is used

string(APPEND CMAKE_CXX_FLAGS " -Wno-strict-aliasing")

in combination with -O2
string(APPEND CMAKE_CXX_FLAGS " -O2 -fPIC")

which seems dangerous: the warnings are disabled while the compiler can still apply the strict-aliasing optimization.

Environment

PyTorch version: 1.10.0a0+0aef44c
Is debug build: False
CUDA used to build PyTorch: Could not collect
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.3 LTS (aarch64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.21.3
Libc version: glibc-2.31

Python version: 3.8.12 | packaged by conda-forge | (default, Sep 29 2021, 20:54:09)  [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.4.0-84-generic-aarch64-with-glibc2.17

CC @zasdfgbnm @mcarilli (thanks both of you for the great help!) @malfet for visibility.

cc @ezyang @gchanan @zou3519 @bdhirsh @jbschlosser @malfet

Labels: high priority, module: arm, module: correctness (silent), triaged