
Possible deadlock using clusterAllReplicas against system tables #66351

@panthony

Description


Context

Using ClickHouse server (version 24.5.1.1763) with 2 nodes.

I tried to see if there was anything close to what I'm experiencing in newer versions changelog but could not find anything.

Describe what's wrong

Using clickhouse-driver to perform a query with the following pattern:

SELECT read_rows, read_bytes, written_rows, written_bytes, round(elapsed * 1000) as elapsed_ms
FROM clusterAllReplicas('cluster_name', system.processes) 
WHERE query_id = 'query_id'
LIMIT 1
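For context, the query is issued from Python via clickhouse-driver, roughly like this (a minimal sketch; the host, cluster name, and query_id below are placeholders, not values from the actual deployment):

```python
def build_progress_query(cluster: str, query_id: str) -> str:
    """Build the progress-polling query shown above."""
    return (
        "SELECT read_rows, read_bytes, written_rows, written_bytes, "
        "round(elapsed * 1000) AS elapsed_ms "
        f"FROM clusterAllReplicas('{cluster}', system.processes) "
        f"WHERE query_id = '{query_id}' "
        "LIMIT 1"
    )

# Actual execution (requires a running server, shown for context only):
# from clickhouse_driver import Client
# client = Client('ch-node-1')  # placeholder host
# rows = client.execute(build_progress_query('cluster_name', 'query_id'))
```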

The query (against system.processes - or system.query_log) occasionally hangs forever:

  • It never responds
  • It never times out

I have to kill the client when it occurs.
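As a client-side mitigation while the server-side hang is unexplained, the blocking driver call can be wrapped with a caller-enforced deadline. This is a hypothetical sketch using only the standard library; note that it does not kill the hung call, it only unblocks the caller (the worker thread is leaked):

```python
import concurrent.futures

def call_with_timeout(fn, timeout_s, *args, **kwargs):
    """Run fn(*args, **kwargs) in a worker thread, giving up after timeout_s.

    Raises concurrent.futures.TimeoutError if fn does not finish in time.
    The hung worker thread cannot be force-killed; this only frees the caller.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(fn, *args, **kwargs).result(timeout=timeout_s)
    finally:
        # wait=False so a hung fn does not block shutdown of the pool
        pool.shutdown(wait=False)

# Usage against the driver (hypothetical), instead of client.execute(query):
# rows = call_with_timeout(client.execute, 10.0, query)
```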

On the server side I can still see the query running in system.processes, where elapsed keeps increasing. The query is unkillable (the KILL QUERY itself joins the never-ending queries).

When it happens, the threads assigned to the query have the following stack traces:

TCPHandler

DB::(anonymous namespace)::signalHandler(int, siginfo_t*, void*)

std::__1::mutex::lock()
DB::RemoteQueryExecutor::cancel()
DB::ExecutingGraph::cancel(bool)
DB::PullingAsyncPipelineExecutor::cancel()
DB::PullingAsyncPipelineExecutor::~PullingAsyncPipelineExecutor()
DB::TCPHandler::runImpl()
DB::TCPHandler::run()
Poco::Net::TCPServerConnection::start()
Poco::Net::TCPServerDispatcher::run()
Poco::PooledThread::run()
Poco::ThreadImpl::runnableEntry(void*)

ThreadPool

DB::(anonymous namespace)::signalHandler(int, siginfo_t*, void*)

ThreadPoolImpl<ThreadFromGlobalPoolImpl<false, true>>::worker(std::__1::__list_iterator<ThreadFromGlobalPoolImpl<false, true>, void*>)
void std::__1::__function::__policy_invoker<void ()>::__call_impl<std::__1::__function::__default_alloc_func<ThreadFromGlobalPoolImpl<false, true>::ThreadFromGlobalPoolImpl<void ThreadPoolImpl<ThreadFromGlobalPoolImpl<false, true>>::scheduleImpl<void>(std::__1::function<void ()>, Priority, std::__1::optional<unsigned long>, bool)::'lambda0'()>(void&&)::'lambda'(), void ()>>(std::__1::__function::__policy_storage const*)
void* std::__1::__thread_proxy[abi:v15000]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, void ThreadPoolImpl<std::__1::thread>::scheduleImpl<void>(std::__1::function<void ()>, Priority, std::__1::optional<unsigned long>, bool)::'lambda0'()>>(void*)

QueryPullPipeEx

DB::(anonymous namespace)::signalHandler(int, siginfo_t*, void*)

ThreadPoolImpl<ThreadFromGlobalPoolImpl<false, true>>::wait()
DB::PipelineExecutor::execute(unsigned long, bool)
void std::__1::__function::__policy_invoker<void ()>::__call_impl<std::__1::__function::__default_alloc_func<ThreadFromGlobalPoolImpl<true, true>::ThreadFromGlobalPoolImpl<DB::PullingAsyncPipelineExecutor::pull(DB::Chunk&, unsigned long)::$_0>(DB::PullingAsyncPipelineExecutor::pull(DB::Chunk&, unsigned long)::$_0&&)::'lambda'(), void ()>>(std::__1::__function::__policy_storage const*)
void* std::__1::__thread_proxy[abi:v15000]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, void ThreadPoolImpl<std::__1::thread>::scheduleImpl<void>(std::__1::function<void ()>, Priority, std::__1::optional<unsigned long>, bool)::'lambda0'()>>(void*)

QueryFullPipeEx

DB::(anonymous namespace)::signalHandler(int, siginfo_t*, void*)

DB::TimerDescriptor::drain() const
DB::TimerDescriptor::reset() const
DB::HedgedConnections::disableChangingReplica(DB::HedgedConnections::ReplicaLocation const&)
DB::HedgedConnections::receivePacketFromReplica(DB::HedgedConnections::ReplicaLocation const&)
DB::HedgedConnections::receivePacketUnlocked(std::__1::function<void (int, Poco::Timespan, DB::AsyncEventTimeoutType, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, unsigned int)>)
DB::RemoteQueryExecutorReadContext::Task::run(std::__1::function<void (int, Poco::Timespan, DB::AsyncEventTimeoutType, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, unsigned int)>, std::__1::function<void ()>)
void boost::context::detail::fiber_entry<boost::context::detail::fiber_record<boost::context::fiber, FiberStack&, Fiber::RoutineImpl<DB::AsyncTaskExecutor::Routine>>>(boost::context::detail::transfer_t)

How to reproduce

  • Using ClickHouse server 24.5.1.1763
  • Using TCP interface (with clickhouse-driver)

Due to the nature of the issue, it's complicated to give exact steps to reproduce.

Expected behavior

No deadlock.
