Frequent segfaults in large processing job #60219
Description
Describe the unexpected behaviour
I am running large jobs that select data from ReplacingMergeTree tables and insert the results into other tables. The tables have between a million and 300 million rows. The queries are pretty simple and do not use polygonal dictionaries (which is the
Every few hours, one of the servers segfaults and exits with nothing suspicious in the logs, even at debug level with flush-on-crash enabled.
How to reproduce
- Happens both on 23.12.2.59 and on 24.1.5.6
- The setup is a cluster with 2 shards, on 2 different machines.
- The client uses the TCP interface.
Expected behavior
Not segfaulting :)
Error message and/or stacktrace
This is an example, visualized in Grafana (crash around 10:32):



I managed to get 3 core dumps, which are unfortunately very large (~100 GB).
They start with
Core was generated by `/usr/bin/clickhouse-server --config-file=/etc/clickhouse-server/config.xml'.
Program terminated with signal SIGSEGV, Segmentation fault.
[Current thread is 1 (LWP 1561)]
(gdb) bt
#0 0x00000000171a7c22 in libunwind::DwarfInstructions<libunwind::LocalAddressSpace, libunwind::Registers_x86_64>::getSavedRegister(libunwind::LocalAddressSpace&, libunwind::Registers_x86_64 const&, unsigned long, libunwind::CFI_Parser<libunwind::LocalAddressSpace>::RegisterLocation const&) ()
#1 0x00000000171a6abe in libunwind::DwarfInstructions<libunwind::LocalAddressSpace, libunwind::Registers_x86_64>::stepWithDwarf(libunwind::LocalAddressSpace&, unsigned long, unsigned long, libunwind::Registers_x86_64&, bool&) ()
#2 0x00000000171a5c94 in unw_step ()
#3 0x000000000c81e5f9 in StackTrace::StackTrace(ucontext_t const&) ()
#4 0x000000000cae52c1 in signalHandler(int, siginfo_t*, void*) ()
#5 <signal handler called>
#6 0x00000000171a7c22 in libunwind::DwarfInstructions<libunwind::LocalAddressSpace, libunwind::Registers_x86_64>::getSavedRegister(libunwind::LocalAddressSpace&, libunwind::Registers_x86_64 const&, unsigned long, libunwind::CFI_Parser<libunwind::LocalAddressSpace>::RegisterLocation const&) ()
Backtrace stopped: Cannot access memory at address 0x7fdc008b8658
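which is not very helpful: the crash happens inside libunwind itself, and the signal handler then hits the same fault while trying to collect a stack trace. (Any idea how I could see what happened before libunwind got called?)
I'm attaching the output of `thread apply all bt` for one of the dumps: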
bt.log
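Since the log covers well over a thousand threads, a small filter can help narrow it down. The following is only a rough sketch, not part of any ClickHouse or gdb tooling: it assumes the log uses the `Thread N (LWP M):` headers and `#k 0xADDR in function ()` frame lines shown in this report, and the helper names `parse_backtraces` and `threads_matching` are made up for illustration.

```python
import re

# Thread header as printed by gdb, e.g. "Thread 1056 (LWP 1010):".
# The ".*" also tolerates the longer "Thread 2 (Thread 0x7f.. (LWP 20)):" form.
THREAD_RE = re.compile(r"^Thread (\d+) \(.*LWP (\d+)\)")
# Frame line, e.g. "#0 0x0000000011b01e76 in DB::ColumnTuple::~ColumnTuple() ()".
# The address part is optional so "#5 <signal handler called>" is kept too.
FRAME_RE = re.compile(r"^#(\d+)\s+(?:0x[0-9a-fA-F]+ in )?(.*)$")


def parse_backtraces(text):
    """Group frame descriptions by gdb thread number."""
    threads = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = THREAD_RE.match(line)
        if header:
            current = int(header.group(1))
            threads[current] = []
            continue
        frame = FRAME_RE.match(line)
        if frame and current is not None:
            threads[current].append(frame.group(2))
    return threads


def threads_matching(threads, needles):
    """Keep only threads where some frame contains one of the substrings."""
    return {
        tid: frames
        for tid, frames in threads.items()
        if any(needle in frame for frame in frames for needle in needles)
    }


def load_and_filter(path, needles):
    """Read a saved `thread apply all bt` log and filter it."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        return threads_matching(parse_backtraces(fh.read()), needles)
```

For example, `load_and_filter("bt.log", ["libunwind", "~ColumnTuple", "ZSTD_decompress"])` would return only the threads with a frame mentioning one of those names. Frames that appear before the first thread header are ignored.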
Some of the threads that may be suspicious:
Thread 1056 (LWP 1010):
#0 0x0000000011b01e76 in DB::ColumnTuple::~ColumnTuple() ()
#1 0x000000000722ee51 in std::__1::vector<DB::ColumnWithTypeAndName, std::__1::allocator<DB::ColumnWithTypeAndName> >::~vector[abi:v15000]() ()
#2 0x000000001251f3b1 in DB::MergeTreeReadTask::~MergeTreeReadTask() ()
#3 0x000000001251d4fe in DB::MergeTreeSelectProcessor::read() ()
#0 0x0000000016f5a6ce in ZSTD_decompress ()
#1 0x000000001075089f in DB::CompressionCodecZSTD::doDecompressData(char const*, unsigned int, char*, unsigned int) const ()
#2 0x0000000010795871 in DB::ICompressionCodec::decompress(char const*, unsigned int, char*) const ()