Possible deadlock using clusterAllReplicas against system tables #66351
Description
Context
Using ClickHouse server (version 24.5.1.1763) with 2 nodes.
I checked the changelogs of newer versions for anything close to what I'm experiencing but could not find anything.
Describe what's wrong
Using clickhouse-driver to perform a query with the following pattern:
SELECT read_rows, read_bytes, written_rows, written_bytes, round(elapsed * 1000) as elapsed_ms
FROM clusterAllReplicas('cluster_name', system.processes)
WHERE query_id = 'query_id'
LIMIT 1
The query (against system.processes or system.query_log) occasionally hangs forever:
- It never responds
- It never times out
I have to kill the client when it occurs.
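Until the server-side cause is understood, a client-side deadline can at least keep the calling process from blocking forever. The following is a minimal sketch, not a fix: `execute_with_deadline` is a hypothetical helper name, and `execute` is assumed to be any callable with the shape of clickhouse-driver's `Client.execute(query, params)`. It does not cancel anything on the server; the stuck query remains visible in system.processes.

```python
import queue
import threading

# Same query pattern as above, using clickhouse-driver style parameter substitution.
QUERY = """
SELECT read_rows, read_bytes, written_rows, written_bytes,
       round(elapsed * 1000) AS elapsed_ms
FROM clusterAllReplicas('cluster_name', system.processes)
WHERE query_id = %(query_id)s
LIMIT 1
"""


def execute_with_deadline(execute, query, params, deadline_s):
    """Run execute(query, params) in a daemon thread.

    Returns the result, re-raises the worker's exception, or raises
    TimeoutError if no answer arrives within deadline_s seconds.
    """
    result = queue.Queue(maxsize=1)

    def worker():
        try:
            result.put(("ok", execute(query, params)))
        except Exception as exc:  # hand the error back to the caller
            result.put(("error", exc))

    # Daemon thread: a hung call cannot keep the interpreter alive at exit.
    threading.Thread(target=worker, daemon=True).start()

    try:
        status, value = result.get(timeout=deadline_s)
    except queue.Empty:
        raise TimeoutError(f"query did not return within {deadline_s}s") from None
    if status == "error":
        raise value
    return value
```

With clickhouse-driver this could be called as `execute_with_deadline(client.execute, QUERY, {"query_id": "..."}, 30)`. After a `TimeoutError` the connection held by the abandoned thread should be treated as unusable and a new client created.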
On the server side I can still see the query running in system.processes, with elapsed still increasing. The query cannot be killed: the KILL QUERY statement itself joins the list of never-ending queries.
When it happens, the threads assigned to the query have the following stack traces:
TCPHandler
DB::(anonymous namespace)::signalHandler(int, siginfo_t*, void*)
std::__1::mutex::lock()
DB::RemoteQueryExecutor::cancel()
DB::ExecutingGraph::cancel(bool)
DB::PullingAsyncPipelineExecutor::cancel()
DB::PullingAsyncPipelineExecutor::~PullingAsyncPipelineExecutor()
DB::TCPHandler::runImpl()
DB::TCPHandler::run()
Poco::Net::TCPServerConnection::start()
Poco::Net::TCPServerDispatcher::run()
Poco::PooledThread::run()
Poco::ThreadImpl::runnableEntry(void*)
ThreadPool
DB::(anonymous namespace)::signalHandler(int, siginfo_t*, void*)
ThreadPoolImpl<ThreadFromGlobalPoolImpl<false, true>>::worker(std::__1::__list_iterator<ThreadFromGlobalPoolImpl<false, true>, void*>)
void std::__1::__function::__policy_invoker<void ()>::__call_impl<std::__1::__function::__default_alloc_func<ThreadFromGlobalPoolImpl<false, true>::ThreadFromGlobalPoolImpl<void ThreadPoolImpl<ThreadFromGlobalPoolImpl<false, true>>::scheduleImpl<void>(std::__1::function<void ()>, Priority, std::__1::optional<unsigned long>, bool)::'lambda0'()>(void&&)::'lambda'(), void ()>>(std::__1::__function::__policy_storage const*)
void* std::__1::__thread_proxy[abi:v15000]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, void ThreadPoolImpl<std::__1::thread>::scheduleImpl<void>(std::__1::function<void ()>, Priority, std::__1::optional<unsigned long>, bool)::'lambda0'()>>(void*)
QueryPullPipeEx
DB::(anonymous namespace)::signalHandler(int, siginfo_t*, void*)
ThreadPoolImpl<ThreadFromGlobalPoolImpl<false, true>>::wait()
DB::PipelineExecutor::execute(unsigned long, bool)
void std::__1::__function::__policy_invoker<void ()>::__call_impl<std::__1::__function::__default_alloc_func<ThreadFromGlobalPoolImpl<true, true>::ThreadFromGlobalPoolImpl<DB::PullingAsyncPipelineExecutor::pull(DB::Chunk&, unsigned long)::$_0>(DB::PullingAsyncPipelineExecutor::pull(DB::Chunk&, unsigned long)::$_0&&)::'lambda'(), void ()>>(std::__1::__function::__policy_storage const*)
void* std::__1::__thread_proxy[abi:v15000]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, void ThreadPoolImpl<std::__1::thread>::scheduleImpl<void>(std::__1::function<void ()>, Priority, std::__1::optional<unsigned long>, bool)::'lambda0'()>>(void*)
QueryFullPipeEx
DB::(anonymous namespace)::signalHandler(int, siginfo_t*, void*)
DB::TimerDescriptor::drain() const
DB::TimerDescriptor::reset() const
DB::HedgedConnections::disableChangingReplica(DB::HedgedConnections::ReplicaLocation const&)
DB::HedgedConnections::receivePacketFromReplica(DB::HedgedConnections::ReplicaLocation const&)
DB::HedgedConnections::receivePacketUnlocked(std::__1::function<void (int, Poco::Timespan, DB::AsyncEventTimeoutType, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, unsigned int)>)
DB::RemoteQueryExecutorReadContext::Task::run(std::__1::function<void (int, Poco::Timespan, DB::AsyncEventTimeoutType, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, unsigned int)>, std::__1::function<void ()>)
void boost::context::detail::fiber_entry<boost::context::detail::fiber_record<boost::context::fiber, FiberStack&, Fiber::RoutineImpl<DB::AsyncTaskExecutor::Routine>>>(boost::context::detail::transfer_t)
How to reproduce
- Using ClickHouse server 24.5.1.1763
- Using TCP interface (with clickhouse-driver)
Due to the intermittent nature of the issue, it is difficult to give exact steps to reproduce.
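Since the hang is intermittent, one way to look for it is to issue the query repeatedly and record which attempts never return. The sketch below is an assumption about how such a loop could look, not the procedure used to observe the issue: `count_hangs` is a hypothetical helper, and `run_once` stands for any zero-argument callable that sends the query above over the TCP interface.

```python
import threading


def _run_and_signal(run_once, done):
    # Signal completion even if run_once raises, so errors are not counted as hangs.
    try:
        run_once()
    finally:
        done.set()


def count_hangs(run_once, attempts, deadline_s):
    """Call run_once() `attempts` times, one after another.

    Returns the zero-based indices of the attempts that had not finished
    after deadline_s seconds. Hung attempts are left running in daemon threads.
    """
    hung = []
    for i in range(attempts):
        done = threading.Event()  # one event per attempt
        threading.Thread(
            target=_run_and_signal, args=(run_once, done), daemon=True
        ).start()
        if not done.wait(deadline_s):
            hung.append(i)
    return hung
```

Each attempt should use its own connection, because a connection whose query is hung cannot be reused by later attempts.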
Expected behavior
No deadlock.