## 🐛 Bug

The following code snippet fails on a source build on ARM (Neoverse N1) when the vectorized code path is triggered (tensor length >= 16) and the build is compiled with `-O1 -fstrict-aliasing` (or any higher optimization level):
```python
import torch

device = 'cpu'
size = 16
a = torch.full((size,), 6, dtype=torch.int32, device=device)
b = torch.full((size,), 3, dtype=torch.int32, device=device)
print(a & b)
```

Actual output:

```
tensor([    40960,         0,  46473216,     43691, 201326912,     43691,
                0,         0,     40960,         0,  46473216,     43691,
        201326912,     43691,         0,         0], dtype=torch.int32)
```

Expected output:

```
tensor([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2], dtype=torch.int32)
```
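As a quick sanity check of the expected result, independent of torch: 6 is `0b110` and 3 is `0b011`, so each element of `a & b` should be `6 & 3 == 2`.

```python
# 6 = 0b110, 3 = 0b011 -> 6 & 3 = 0b010 = 2,
# so every element of the expected tensor equals 2.
expected = [6 & 3 for _ in range(16)]
assert all(x == 2 for x in expected)
print(expected)
```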
## TL;DR

### Potential root cause

Strict aliasing rule violation in `pytorch/aten/src/ATen/cpu/vec/vec_base.h` (lines 742 to 743 at 2a5116e):

```cpp
const intmax_t* a_ptr = reinterpret_cast<const intmax_t*>((const T*)a);
const intmax_t* b_ptr = reinterpret_cast<const intmax_t*>((const T*)b);
```

### Possible fix
```cpp
template <class T, typename Op>
static inline Vectorized<T> bitwise_binary_op(const Vectorized<T>& a, const Vectorized<T>& b, Op op) {
  static constexpr uint32_t element_no = VECTOR_WIDTH / sizeof(intmax_t);
  __at_align__ intmax_t buffer[element_no];
  intmax_t a_ptr[element_no];                              // !
  intmax_t b_ptr[element_no];                              // !
  std::memcpy(a_ptr, &a, sizeof(intmax_t) * element_no);   // !
  std::memcpy(b_ptr, &b, sizeof(intmax_t) * element_no);   // !
  for (uint32_t i = 0U; i < element_no; ++i) {
    buffer[i] = op(a_ptr[i], b_ptr[i]);
  }
  return Vectorized<T>::loadu(buffer);
}
```
Reference: *What is the Strict Aliasing Rule and Why do we care?* by @shafik

We've verified this fix locally on tensor shapes in `arange(1, 1280)`, but please add other shapes as needed to check whether the proposed fix is valid.
## Details

### Vectorized code path

We've seen unit test failures on ARM nodes for `bitwise_and` operations and first narrowed them down to the vectorized code path (all sizes < 16 work as expected) and to optimized code (`-O0` did not reproduce the issue, and we needed to rebuild with `REL_WITH_DEB_INFO=1` to debug).
### Heisenbug

While trying to isolate the issue further, it turned out to be a Heisenbug, which "disappeared" once we looked too closely into the functions in question. E.g., after adding a `std::cout` in `pytorch/aten/src/ATen/native/cpu/BinaryOpsKernel.cpp` (line 258 at 94d6215), the method worked correctly again. Also, gdb refused to step into `pytorch/aten/src/ATen/native/cpu/Loops.h` (lines 216 to 217 at c5bee1e):

```cpp
auto out1 = c10::guts::apply(std::forward<vec_func_t>(vop), std::move(args1));
auto out2 = c10::guts::apply(std::forward<vec_func_t>(vop), std::move(args2));
```

or break in `bitwise_binary_op`, but we were unsure whether this is due to the release + debug symbols build or not. Adding a `raise(SIGTRAP)` in `bitwise_binary_op` also restored the functionality.
As the next step we used

```cpp
template <class T, typename Op>
static inline Vectorized<T> __attribute__((optimize("O0"))) bitwise_binary_op(const Vectorized<T>& a, const Vectorized<T>& b, Op op) {
```

to disable all optimizations in this method so we could step into it via gdb, but this of course also fixed the issue (it also turns out that `__attribute__((optimize("-fno-strict-aliasing")))` is a valid workaround). `valgrind --track-origins=yes` also confirmed the use of uninitialized memory in `at::native::bitwise_and_kernel` in the failing cases and reported no issues otherwise.
### Bisecting compiler optimization flags

This led us to the assumption that one of the optimizations might trigger the failure, and we bisected it to `-O1 -fstrict-aliasing`. After reading @shafik's great blog post (highly recommended if you are interested in why the linked type punning is invalid), we realized that the vectorized code path might indeed violate the strict aliasing rule. Replacing the (invalid) `reinterpret_cast` with the `memcpy` restores the correct behavior.
### Why no failures on x86/AVX?

It's a bit unclear why this issue was only visible on this particular ARM node, as `vec_base.h` seems to use the invalid `reinterpret_cast` in more places (in AVX code), so the issue should also have been visible on x86 (we might want to check and fix those as well).
### NIT

It seems that `-Wno-strict-aliasing` is used in `pytorch/CMakeLists.txt` (line 751 at 2a5116e):

```cmake
string(APPEND CMAKE_CXX_FLAGS " -Wno-strict-aliasing")
```

in combination with `-O2` (line 734 at 2a5116e):

```cmake
string(APPEND CMAKE_CXX_FLAGS " -O2 -fPIC")
```

which seems dangerous, as it silences the warnings while the compiler may still apply this optimization (`-O2` implies `-fstrict-aliasing`).
## Environment

```
PyTorch version: 1.10.0a0+0aef44c
Is debug build: False
CUDA used to build PyTorch: Could not collect
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.3 LTS (aarch64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.21.3
Libc version: glibc-2.31
Python version: 3.8.12 | packaged by conda-forge | (default, Sep 29 2021, 20:54:09) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.4.0-84-generic-aarch64-with-glibc2.17
```
CC @zasdfgbnm @mcarilli (thanks, both of you, for the great help!) and @malfet for visibility.
cc @ezyang @gchanan @zou3519 @bdhirsh @jbschlosser @malfet