
Possible deadlock using clusterAllReplicas against system tables #66351

@panthony

Description


Context

Using ClickHouse server (version 24.5.1.1763) with 2 nodes.

I tried to see if there was anything close to what I'm experiencing in newer versions changelog but could not find anything.

Describe what's wrong

Using clickhouse-driver to perform a query with the following pattern:

SELECT read_rows, read_bytes, written_rows, written_bytes, round(elapsed * 1000) as elapsed_ms
FROM clusterAllReplicas('cluster_name', system.processes) 
WHERE query_id = 'query_id'
LIMIT 1
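For context, the query is issued from Python via clickhouse-driver, roughly like this (a minimal sketch; the host, cluster name, and query_id below are placeholders, not values from the actual deployment):

```python
def build_progress_query(cluster: str, query_id: str) -> str:
    """Build the progress-polling query shown above."""
    return (
        "SELECT read_rows, read_bytes, written_rows, written_bytes, "
        "round(elapsed * 1000) AS elapsed_ms "
        f"FROM clusterAllReplicas('{cluster}', system.processes) "
        f"WHERE query_id = '{query_id}' "
        "LIMIT 1"
    )

# Actual execution (requires a running server, shown for context only):
# from clickhouse_driver import Client
# client = Client('ch-node-1')  # placeholder host
# rows = client.execute(build_progress_query('cluster_name', 'query_id'))
```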

The query (against system.processes - or system.query_log) occasionally hangs forever:

  • It never responds
  • It never times out

I have to kill the client when it occurs.
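As a client-side mitigation while the server-side hang is unexplained, the blocking driver call can be wrapped with a caller-enforced deadline. This is a hypothetical sketch using only the standard library; note that it does not kill the hung call, it only unblocks the caller (the worker thread is leaked):

```python
import concurrent.futures

def call_with_timeout(fn, timeout_s, *args, **kwargs):
    """Run fn(*args, **kwargs) in a worker thread, giving up after timeout_s.

    Raises concurrent.futures.TimeoutError if fn does not finish in time.
    The hung worker thread cannot be force-killed; this only frees the caller.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(fn, *args, **kwargs).result(timeout=timeout_s)
    finally:
        # wait=False so a hung fn does not block shutdown of the pool
        pool.shutdown(wait=False)

# Usage against the driver (hypothetical), instead of client.execute(query):
# rows = call_with_timeout(client.execute, 10.0, query)
```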

On the server side I can still see the query running in system.processes, where elapsed keeps increasing. The query is unkillable (the KILL QUERY itself joins the never-ending queries).

When it happens, the threads assigned to the query have the following stack traces:

TCPHandler

DB::(anonymous namespace)::signalHandler(int, siginfo_t*, void*)

std::__1::mutex::lock()
DB::RemoteQueryExecutor::cancel()
DB::ExecutingGraph::cancel(bool)
DB::PullingAsyncPipelineExecutor::cancel()
DB::PullingAsyncPipelineExecutor::~PullingAsyncPipelineExecutor()
DB::TCPHandler::runImpl()
DB::TCPHandler::run()
Poco::Net::TCPServerConnection::start()
Poco::Net::TCPServerDispatcher::run()
Poco::PooledThread::run()
Poco::ThreadImpl::runnableEntry(void*)

ThreadPool

DB::(anonymous namespace)::signalHandler(int, siginfo_t*, void*)

ThreadPoolImpl<ThreadFromGlobalPoolImpl<false, true>>::worker(std::__1::__list_iterator<ThreadFromGlobalPoolImpl<false, true>, void*>)
void std::__1::__function::__policy_invoker<void ()>::__call_impl<std::__1::__function::__default_alloc_func<ThreadFromGlobalPoolImpl<false, true>::ThreadFromGlobalPoolImpl<void ThreadPoolImpl<ThreadFromGlobalPoolImpl<false, true>>::scheduleImpl<void>(std::__1::function<void ()>, Priority, std::__1::optional<unsigned long>, bool)::'lambda0'()>(void&&)::'lambda'(), void ()>>(std::__1::__function::__policy_storage const*)
void* std::__1::__thread_proxy[abi:v15000]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, void ThreadPoolImpl<std::__1::thread>::scheduleImpl<void>(std::__1::function<void ()>, Priority, std::__1::optional<unsigned long>, bool)::'lambda0'()>>(void*)

QueryPullPipeEx

DB::(anonymous namespace)::signalHandler(int, siginfo_t*, void*)

ThreadPoolImpl<ThreadFromGlobalPoolImpl<false, true>>::wait()
DB::PipelineExecutor::execute(unsigned long, bool)
void std::__1::__function::__policy_invoker<void ()>::__call_impl<std::__1::__function::__default_alloc_func<ThreadFromGlobalPoolImpl<true, true>::ThreadFromGlobalPoolImpl<DB::PullingAsyncPipelineExecutor::pull(DB::Chunk&, unsigned long)::$_0>(DB::PullingAsyncPipelineExecutor::pull(DB::Chunk&, unsigned long)::$_0&&)::'lambda'(), void ()>>(std::__1::__function::__policy_storage const*)
void* std::__1::__thread_proxy[abi:v15000]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, void ThreadPoolImpl<std::__1::thread>::scheduleImpl<void>(std::__1::function<void ()>, Priority, std::__1::optional<unsigned long>, bool)::'lambda0'()>>(void*)

QueryFullPipeEx

DB::(anonymous namespace)::signalHandler(int, siginfo_t*, void*)

DB::TimerDescriptor::drain() const
DB::TimerDescriptor::reset() const
DB::HedgedConnections::disableChangingReplica(DB::HedgedConnections::ReplicaLocation const&)
DB::HedgedConnections::receivePacketFromReplica(DB::HedgedConnections::ReplicaLocation const&)
DB::HedgedConnections::receivePacketUnlocked(std::__1::function<void (int, Poco::Timespan, DB::AsyncEventTimeoutType, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, unsigned int)>)
DB::RemoteQueryExecutorReadContext::Task::run(std::__1::function<void (int, Poco::Timespan, DB::AsyncEventTimeoutType, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, unsigned int)>, std::__1::function<void ()>)
void boost::context::detail::fiber_entry<boost::context::detail::fiber_record<boost::context::fiber, FiberStack&, Fiber::RoutineImpl<DB::AsyncTaskExecutor::Routine>>>(boost::context::detail::transfer_t)

How to reproduce

  • Using ClickHouse server 24.5.1.1763
  • Using TCP interface (with clickhouse-driver)

Due to the nature of the issue, it's complicated to give exact steps to reproduce.

Expected behavior

No deadlock.
