Frequent segfaults in large processing job #60219

**Describe the unexpected behaviour**
I am running large jobs that select data from ReplacingMergeTree tables and insert the results into other tables. The tables have between 1 million and 300 million rows. The queries are fairly simple and do not use polygonal dictionaries (which is the

Every few hours, one of the servers segfaults and exits. Nothing suspicious appears in the logs at debug level, even with flush-on-crash enabled.


**How to reproduce**
* Happens both on 23.12.2.59 and on 24.1.5.6
* The setup is a cluster with 2 shards, on 2 different machines.
* The client uses the TCP interface.

**Expected behavior**
Not segfaulting :)

**Error message and/or stacktrace**

This is an example, visualized in Grafana (crash around 10:32):
![image](https://github.com/ClickHouse/ClickHouse/assets/44120267/c72b645b-04f5-4272-9105-fd6ca25ce1bb)
![image](https://github.com/ClickHouse/ClickHouse/assets/44120267/6c1a1f28-9d0d-42d0-b046-febf073faa47)
![image](https://github.com/ClickHouse/ClickHouse/assets/44120267/74223787-6b61-4538-a7b3-6fbae9bad3a7)

I managed to get 3 core dumps, which are unfortunately very large (~100 GB).
Each of them starts with:
```
Core was generated by `/usr/bin/clickhouse-server --config-file=/etc/clickhouse-server/config.xml'.
Program terminated with signal SIGSEGV, Segmentation fault.
[Current thread is 1 (LWP 1561)]
(gdb) bt
#0  0x00000000171a7c22 in libunwind::DwarfInstructions<libunwind::LocalAddressSpace, libunwind::Registers_x86_64>::getSavedRegister(libunwind::LocalAddressSpace&, libunwind::Registers_x86_64 const&, unsigned long, libunwind::CFI_Parser<libunwind::LocalAddressSpace>::RegisterLocation const&)
    ()
#1  0x00000000171a6abe in libunwind::DwarfInstructions<libunwind::LocalAddressSpace, libunwind::Registers_x86_64>::stepWithDwarf(libunwind::LocalAddressSpace&, unsigned long, unsigned long, libunwind::Registers_x86_64&, bool&) ()
#2  0x00000000171a5c94 in unw_step ()
#3  0x000000000c81e5f9 in StackTrace::StackTrace(ucontext_t const&) ()
#4  0x000000000cae52c1 in signalHandler(int, siginfo_t*, void*) ()
#5  <signal handler called>
#6  0x00000000171a7c22 in libunwind::DwarfInstructions<libunwind::LocalAddressSpace, libunwind::Registers_x86_64>::getSavedRegister(libunwind::LocalAddressSpace&, libunwind::Registers_x86_64 const&, unsigned long, libunwind::CFI_Parser<libunwind::LocalAddressSpace>::RegisterLocation const&)
    ()
Backtrace stopped: Cannot access memory at address 0x7fdc008b8658
```
This is not very helpful. Any idea how I could find out what happened before libunwind was called?

I'm attaching the output of `thread apply all bt` on one of the dumps:
[bt.log](https://github.com/ClickHouse/ClickHouse/files/14358605/bt.log)

Some of the threads that may be suspicious:
```
Thread 1056 (LWP 1010):
#0  0x0000000011b01e76 in DB::ColumnTuple::~ColumnTuple() ()
#1  0x000000000722ee51 in std::__1::vector<DB::ColumnWithTypeAndName, std::__1::allocator<DB::ColumnWithTypeAndName> >::~vector[abi:v15000]() ()
#2  0x000000001251f3b1 in DB::MergeTreeReadTask::~MergeTreeReadTask() ()
#3  0x000000001251d4fe in DB::MergeTreeSelectProcessor::read() ()
```

```
#0  0x0000000016f5a6ce in ZSTD_decompress ()
#1  0x000000001075089f in DB::CompressionCodecZSTD::doDecompressData(char const*, unsigned int, char*, unsigned int) const ()
#2  0x0000000010795871 in DB::ICompressionCodec::decompress(char const*, unsigned int, char*) const ()
```
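
Since the full `thread apply all bt` output is long, here is a minimal sketch of how it could be filtered for threads whose innermost frames mention given symbols. It assumes the usual gdb text layout (`Thread N (...)` headers followed by `#k ...` frame lines). The function names and the watch list are hypothetical and only illustrate the idea; the watch list is taken from the frames quoted in this report.

```python
import re
import sys
from collections import defaultdict

# Header lines in gdb output look like "Thread 1056 (LWP 1010):".
THREAD_RE = re.compile(r"^Thread (\d+) \(")
# Frame lines look like "#0  0x... in Symbol(args) ()" or "#5  <signal handler called>".
FRAME_RE = re.compile(r"^#(\d+)\s+(?:0x[0-9a-f]+ in )?(.*)$")


def parse_threads(text):
    """Map gdb thread number -> list of frame descriptions, innermost first."""
    threads = defaultdict(list)
    current = None
    for line in text.splitlines():
        header = THREAD_RE.match(line)
        if header:
            current = int(header.group(1))
            threads[current]  # register the thread even if no frames follow
            continue
        frame = FRAME_RE.match(line)
        if frame and current is not None:
            threads[current].append(frame.group(2).strip())
        # Wrapped continuation lines (e.g. a lone "()") are ignored.
    return dict(threads)


def threads_matching(threads, patterns, depth=3):
    """Return sorted thread numbers whose innermost `depth` frames mention any pattern."""
    return sorted(
        tid
        for tid, frames in threads.items()
        if any(p in f for f in frames[:depth] for p in patterns)
    )


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as fh:
        parsed = parse_threads(fh.read())
    # Hypothetical watch list based on the frames shown in this issue.
    watch = ["~ColumnTuple", "ZSTD_decompress", "libunwind::"]
    for tid in threads_matching(parsed, watch):
        print(tid, parsed[tid][0])
```

Saved under any file name and run with the path to `bt.log` as its only argument, it prints each matching thread number with its innermost frame. Only plain substring matching is used, so the watch list may need adjusting.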
