libunwind: getSavedRegister: invalid read from memory #60460
Description
@alexey-milovidov Are you OK with us re-opening this please?
We're seeing the same thing on all of our production clusters running v23.3, and in my testing it is also reproducible in clean builds from master.
What we see:
ClickHouse servers across our entire fleet are occasionally segfaulting, with only the below information in logs:
2024.02.24 13:21:53.198661 [ 7 ] {} <Fatal> Application: Child process was terminated by signal 11
No consistent pattern in the query log or server logs stands out to me before or after the segfaults.
Reviewing coredumps, here is an example stack trace (almost identical to a few on this ticket):
(gdb) bt
#0 0x0000000015da2fd1 in libunwind::DwarfInstructions<libunwind::LocalAddressSpace, libunwind::Registers_x86_64>::getSavedRegister (addressSpace=..., registers=..., cfa=cfa@entry=140650633216840, savedReg=...) at ./contrib/libunwind/src/DwarfInstructions.hpp:111
#1 0x0000000015da1f81 in libunwind::DwarfInstructions<libunwind::LocalAddressSpace, libunwind::Registers_x86_64>::stepWithDwarf (addressSpace=..., pc=<optimized out>, fdeStart=<optimized out>, registers=..., isSignalFrame=@0x7fedb2ae2609: true) at ./contrib/libunwind/src/DwarfInstructions.hpp:258
#2 0x0000000015da1aa2 in libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_x86_64>::stepWithDwarfFDE (this=0x7fedb2ae2508) at ./contrib/libunwind/src/UnwindCursor.hpp:1002
#3 libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_x86_64>::step (this=0x7fedb2ae2508) at ./contrib/libunwind/src/UnwindCursor.hpp:2818
#4 0x0000000015da11c8 in unw_backtrace (buffer=0x7fedb2ae2718, size=45) at ./contrib/libunwind/src/libunwind.cpp:350
#5 0x000000000ec18088 in StackTrace::tryCapture (this=0x7fedb2ae2708) at ./src/Common/StackTrace.cpp:287
#6 StackTrace::StackTrace (this=0x7fedb2ae2708, signal_context=...) at ./src/Common/StackTrace.cpp:258
#7 0x000000000edffead in signalHandler (sig=<optimized out>, info=0x7fedb2ae2bf0, context=0x7fedb2ae2ac0) at ./src/Daemon/BaseDaemon.cpp:148
#8 <signal handler called>
#9 0x0000000015da2fd1 in libunwind::DwarfInstructions<libunwind::LocalAddressSpace, libunwind::Registers_x86_64>::getSavedRegister (addressSpace=..., registers=..., cfa=cfa@entry=140650633216840, savedReg=...) at ./contrib/libunwind/src/DwarfInstructions.hpp:111
#10 0x0000000015da1f81 in libunwind::DwarfInstructions<libunwind::LocalAddressSpace, libunwind::Registers_x86_64>::stepWithDwarf (addressSpace=..., pc=<optimized out>, fdeStart=<optimized out>, registers=..., isSignalFrame=@0x7febc7077799: true) at ./contrib/libunwind/src/DwarfInstructions.hpp:258
#11 0x0000000015da1aa2 in libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_x86_64>::stepWithDwarfFDE (this=0x7febc7077698) at ./contrib/libunwind/src/UnwindCursor.hpp:1002
#12 libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_x86_64>::step (this=0x7febc7077698) at ./contrib/libunwind/src/UnwindCursor.hpp:2818
#13 0x0000000015da11c8 in unw_backtrace (buffer=0x7febc7077838, size=45) at ./contrib/libunwind/src/libunwind.cpp:350
#14 0x000000000ec18088 in StackTrace::tryCapture (this=0x7febc7077828) at ./src/Common/StackTrace.cpp:287
#15 StackTrace::StackTrace (this=0x7febc7077828, signal_context=...) at ./src/Common/StackTrace.cpp:258
#16 0x000000000ec2c5cb in DB::(anonymous namespace)::writeTraceInfo (trace_type=DB::TraceType::Real, info=<optimized out>, context=0x7febc7077d80) at ./src/Common/QueryProfiler.cpp:68
#17 <signal handler called>
#18 0x0000000015ce6629 in LZ4_compress_default (src=<optimized out>, dst=<optimized out>, srcSize=<optimized out>, maxOutputSize=<optimized out>) at ./contrib/lz4/lib/lz4.c:1387
#19 0x0000000012ea3a53 in DB::ICompressionCodec::compress (this=0x7fe0d527b620, source=0x7fd72fab4040 "\003", source_size=65536, dest=0x7fd72db49abb "\202e151024b\"\003") at ./src/Compression/ICompressionCodec.cpp:88
#20 0x0000000012e36d00 in DB::CompressedWriteBuffer::nextImpl (this=0x7fed6b6d4cd8) at ./src/Compression/CompressedWriteBuffer.cpp:42
#21 0x0000000014353684 in std::__1::__uninitialized_allocator_copy[abi:v15000]<std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::regex_token_iterator<std::__1::__wrap_iter<char const*>, char, std::__1::regex_traits<char> >, std::__1::regex_token_iterator<std::__1::__wrap_iter<char const*>, char, std::__1::regex_traits<char> >, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >*>(std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >&, std::__1::regex_token_iterator<std::__1::__wrap_iter<char const*>, char, std::__1::regex_traits<char> >, std::__1::regex_token_iterator<std::__1::__wrap_iter<char const*>, char, std::__1::regex_traits<char> >, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >*) (__alloc=..., __first1=..., __last1=..., __first2=0x606f100 <vtable for DB::HashingWriteBuffer+16>) at ./contrib/llvm-project/libcxx/include/__memory/uninitialized_algorithms.h:545
#22 0x0000000000000000 in ?? ()
Seeing the below frame:
#16 0x000000000ec2c5cb in DB::(anonymous namespace)::writeTraceInfo (trace_type=DB::TraceType::Real, info=<optimized out>, context=0x7febc7077d80) at ./src/Common/QueryProfiler.cpp:68
It seems like our case aligns with what @azat mentioned about this occurring as a result of query profiling, so as a first step we can try turning the query profilers off to confirm.
These frames are also pretty consistent from the coredumps I have reviewed so far:
#18 0x0000000015ce6629 in LZ4_compress_default (src=<optimized out>, dst=<optimized out>, srcSize=<optimized out>, maxOutputSize=<optimized out>) at ./contrib/lz4/lib/lz4.c:1387
#19 0x0000000012ea3a53 in DB::ICompressionCodec::compress (this=0x7fe0d527b620, source=0x7fd72fab4040 "\003", source_size=65536, dest=0x7fd72db49abb "\202e151024b\"\003") at ./src/Compression/ICompressionCodec.cpp:88
#20 0x0000000012e36d00 in DB::CompressedWriteBuffer::nextImpl (this=0x7fed6b6d4cd8) at ./src/Compression/CompressedWriteBuffer.cpp:42
Could this be an issue of trying to unwind a stack trace of an INSERT query that is behaving unexpectedly?
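To make the nesting in the trace above easier to see, here is a minimal Python sketch that splits a gdb backtrace at each `<signal handler called>` marker and reports the innermost and outermost function of each group. The helper names (`split_by_signal`, `function_name`, `summarize`) are hypothetical, and the sample is an abbreviated copy of the trace above with most frames and arguments trimmed. It is only a reading aid, not a diagnosis.

```python
import re

# One gdb frame: "#16 0x000... in func (args) at file:line".
# The address is optional because inlined frames are printed without one.
FRAME_RE = re.compile(r"^#(\d+)\s+(?:0x[0-9a-fA-F]+\s+in\s+)?(.*)$")
SIGNAL_MARKER = "<signal handler called>"

# Abbreviated from the backtrace in this report (frames and arguments trimmed).
SAMPLE = """\
(gdb) bt
#0 0x0000000015da2fd1 in libunwind::DwarfInstructions<libunwind::LocalAddressSpace, libunwind::Registers_x86_64>::getSavedRegister (addressSpace=..., registers=...) at ./contrib/libunwind/src/DwarfInstructions.hpp:111
#7 0x000000000edffead in signalHandler (sig=<optimized out>) at ./src/Daemon/BaseDaemon.cpp:148
#8 <signal handler called>
#9 0x0000000015da2fd1 in libunwind::DwarfInstructions<libunwind::LocalAddressSpace, libunwind::Registers_x86_64>::getSavedRegister (addressSpace=..., registers=...) at ./contrib/libunwind/src/DwarfInstructions.hpp:111
#16 0x000000000ec2c5cb in DB::(anonymous namespace)::writeTraceInfo (trace_type=DB::TraceType::Real) at ./src/Common/QueryProfiler.cpp:68
#17 <signal handler called>
#18 0x0000000015ce6629 in LZ4_compress_default (src=<optimized out>) at ./contrib/lz4/lib/lz4.c:1387
#19 0x0000000012ea3a53 in DB::ICompressionCodec::compress (this=0x7fe0d527b620) at ./src/Compression/ICompressionCodec.cpp:88
"""


def split_by_signal(backtrace):
    """Split a backtrace into frame groups separated by signal-handler markers.

    Groups are ordered innermost first, matching gdb's frame numbering.
    """
    groups = [[]]
    for line in backtrace.splitlines():
        match = FRAME_RE.match(line.strip())
        if not match:
            continue  # skip the gdb prompt and blank lines
        number, rest = int(match.group(1)), match.group(2)
        if rest.startswith(SIGNAL_MARKER):
            groups.append([])  # the frames that follow were interrupted by a signal
            continue
        groups[-1].append((number, rest))
    return groups


def function_name(frame_text):
    """Return the text before the argument list, i.e. the function name."""
    return frame_text.split(" (", 1)[0]


def summarize(groups):
    """Return (innermost function, outermost function) for each non-empty group."""
    return [(function_name(g[0][1]), function_name(g[-1][1])) for g in groups if g]


if __name__ == "__main__":
    for depth, (inner, outer) in enumerate(summarize(split_by_signal(SAMPLE))):
        print(f"group {depth}: faulted/interrupted in {inner}; entered via {outer}")
```

Read from the last group to the first, the sample suggests this sequence: `LZ4_compress_default` was interrupted by a signal handled in `writeTraceInfo`, the unwinder then faulted in `getSavedRegister`, and the fatal-signal handler (`signalHandler`) hit the same fault while capturing its own stack trace. That ordering is my interpretation of the trace, not something confirmed in this report.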
Another pattern that I noticed is that we see far more segfaults on ClickHouse replicas that receive fairly heavy INSERT workloads, but I will review some more coredumps and investigate a bit more to make sure this is definitely the case.
You're welcome to assign it to me if you want to and I will try to dig a bit deeper and find the cause if we don't already know what it is.
Originally posted by @seandhaynes in #33531 (comment)