
Frequent segfaults in large processing job #60219

@cpg314


Describe the unexpected behaviour
I am running large jobs that select data from ReplacingMergeTree tables and insert the results into others. The tables have between 1 million and 300 million rows. The queries are pretty simple and do not use polygonal dictionaries (which is the

Every few hours, one of the servers segfaults and exits. Nothing suspicious shows up in the logs at debug level, even with flush-on-crash enabled.

How to reproduce

  • Happens both on 23.12.2.59 and on 24.1.5.6
  • The setup is a cluster with 2 shards, on 2 different machines.
  • The client uses the TCP interface.

Expected behavior
Not segfaulting :)

Error message and/or stacktrace

This is an example, visualized in Grafana (crash around 10:32):
[images: three Grafana screenshots]

I managed to get 3 core dumps, which are unfortunately very large (~100 GB).
They start with

Core was generated by `/usr/bin/clickhouse-server --config-file=/etc/clickhouse-server/config.xml'.
Program terminated with signal SIGSEGV, Segmentation fault.
[Current thread is 1 (LWP 1561)]
(gdb) bt
#0  0x00000000171a7c22 in libunwind::DwarfInstructions<libunwind::LocalAddressSpace, libunwind::Registers_x86_64>::getSavedRegister(libunwind::LocalAddressSpace&, libunwind::Registers_x86_64 const&, unsigned long, libunwind::CFI_Parser<libunwind::LocalAddressSpace>::RegisterLocation const&)
    ()
#1  0x00000000171a6abe in libunwind::DwarfInstructions<libunwind::LocalAddressSpace, libunwind::Registers_x86_64>::stepWithDwarf(libunwind::LocalAddressSpace&, unsigned long, unsigned long, libunwind::Registers_x86_64&, bool&) ()
#2  0x00000000171a5c94 in unw_step ()
#3  0x000000000c81e5f9 in StackTrace::StackTrace(ucontext_t const&) ()
#4  0x000000000cae52c1 in signalHandler(int, siginfo_t*, void*) ()
#5  <signal handler called>
#6  0x00000000171a7c22 in libunwind::DwarfInstructions<libunwind::LocalAddressSpace, libunwind::Registers_x86_64>::getSavedRegister(libunwind::LocalAddressSpace&, libunwind::Registers_x86_64 const&, unsigned long, libunwind::CFI_Parser<libunwind::LocalAddressSpace>::RegisterLocation const&)
    ()
Backtrace stopped: Cannot access memory at address 0x7fdc008b8658

which is not very helpful (any idea how I could get what happened before libunwind got called?)
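
One thing that might be worth trying, since gdb gives up unwinding at frame #6, is to dump the registers and the raw stack words of that frame and look for values that resolve to code symbols; those are candidates for return addresses of the callers gdb could not reach. The sketch below only builds and runs a gdb batch command. The helper names and the core file path are hypothetical, and because gdb reports that it cannot access memory at 0x7fdc008b8658, part of the stack may simply be missing from the dump.

```python
import subprocess


def gdb_batch_command(binary, core, frame=6, words=64):
    """Build a gdb argv that dumps registers and raw stack words for one frame."""
    commands = [
        f"frame {frame}",      # select the frame where the fault was raised
        "info registers",      # register state as seen from that frame
        f"x/{words}ga $sp",    # stack words, printed as (symbolized) addresses
    ]
    argv = ["gdb", "--batch"]
    for command in commands:
        argv += ["-ex", command]
    return argv + [binary, core]


def run_gdb(binary, core):
    """Run the batch command and return gdb's standard output as text."""
    result = subprocess.run(
        gdb_batch_command(binary, core),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout


if __name__ == "__main__":
    # The core path is a placeholder, not the real location of the dumps.
    print(" ".join(gdb_batch_command("/usr/bin/clickhouse-server", "/path/to/core")))
```

Words in the output that gdb annotates with a function name are the ones to check against the disassembly; anything else is most likely plain data.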

I'm attaching the output of thread apply all bt on one of the dumps
bt.log

Some of the threads that may be suspicious:

Thread 1056 (LWP 1010):
#0  0x0000000011b01e76 in DB::ColumnTuple::~ColumnTuple() ()
#1  0x000000000722ee51 in std::__1::vector<DB::ColumnWithTypeAndName, std::__1::allocator<DB::ColumnWithTypeAndName> >::~vector[abi:v15000]() ()
#2  0x000000001251f3b1 in DB::MergeTreeReadTask::~MergeTreeReadTask() ()
#3  0x000000001251d4fe in DB::MergeTreeSelectProcessor::read() ()
#0  0x0000000016f5a6ce in ZSTD_decompress ()
#1  0x000000001075089f in DB::CompressionCodecZSTD::doDecompressData(char const*, unsigned int, char*, unsigned int) const ()
#2  0x0000000010795871 in DB::ICompressionCodec::decompress(char const*, unsigned int, char*) const ()
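
With this many threads, picking out suspicious ones by eye is slow. A rough way to narrow bt.log down could be to group threads by their innermost frame and list the rarest frames first, on the assumption (a heuristic only) that idle threads mostly share the same few wait frames. The sketch below parses `thread apply all bt` output; the function names are made up for illustration, and frames that gdb wraps across several lines are only matched on their first line.

```python
import re
from collections import Counter

# Thread headers such as "Thread 1056 (LWP 1010):"
THREAD_RE = re.compile(r"^Thread (\d+) \(.*\):?$")
# Frames such as "#0  0x0000000011b01e76 in DB::ColumnTuple::~ColumnTuple() ()"
FRAME_RE = re.compile(r"^#(\d+)\s+(?:0x[0-9a-fA-F]+ in )?(.+)$")


def innermost_frames(bt_text):
    """Map each thread number to the text of its first frame #0."""
    frames = {}
    current = None
    for raw_line in bt_text.splitlines():
        line = raw_line.strip()
        thread = THREAD_RE.match(line)
        if thread:
            current = int(thread.group(1))
            continue
        frame = FRAME_RE.match(line)
        if frame and current is not None and frame.group(1) == "0":
            # Keep only the first #0 seen for a thread.
            frames.setdefault(current, frame.group(2))
    return frames


def rarest_frames(bt_text, limit=10):
    """Return (frame, thread_count) pairs, least common innermost frames first."""
    counts = Counter(innermost_frames(bt_text).values())
    return sorted(counts.items(), key=lambda item: (item[1], item[0]))[:limit]


if __name__ == "__main__":
    with open("bt.log", encoding="utf-8", errors="replace") as handle:
        for frame, count in rarest_frames(handle.read()):
            print(f"{count:6d}  {frame}")
```

Frames that appear in only one or two threads are then the first candidates for a closer look in the full backtrace.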


Labels

  • st-need-info: We need extra data to continue (waiting for response). Either some details or a repro of the issue.
  • unexpected behaviour: Result is unexpected, but not entirely wrong at the same time.
