Unexpectedly stuck queries after upgrade from 24.1.2.5 to 24.8.4.13 #69904
Description
Describe the issue
After upgrading ClickHouse from 24.1.2.5 to 24.7.3.42, we started seeing periodic partial failures of cluster nodes. After some time, we decided to upgrade to 24.8.4.13, but this did not help. After that upgrade we also disabled the new query analyzer (allow_experimental_analyzer: 0), since it caused many problems when executing previously stable queries.
Most often, the failure does not occur immediately. It is preceded by growth in several metrics: RWLockActiveReaders, BackgroundMergesAndMutationsPoolTask (no mutations are being performed at that time), and the number of parallel queries. Essentially, the failure happens when one of the cluster nodes starts hitting the limit set by the max_concurrent_queries parameter, and new queries stop being executed.
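To make the pattern described above easier to catch before a node is fully blocked, here is a minimal sketch of a check over periodic metric snapshots. It assumes each snapshot is a dict built from `SELECT metric, value FROM system.metrics` on one node, and that the `Query` metric reflects the number of running queries. The names `check_node`, `is_growing`, `WATCHED`, and the 0.8 warning ratio are hypothetical and chosen only for illustration. This is not something we run in production.

```python
from typing import Dict, List

# Metrics that grew before each failure in our case (names as in system.metrics).
WATCHED = ("RWLockActiveReaders", "BackgroundMergesAndMutationsPoolTask", "Query")


def is_growing(values: List[int]) -> bool:
    """True if the series never decreases and ends higher than it started."""
    if len(values) < 2:
        return False
    never_drops = all(b >= a for a, b in zip(values, values[1:]))
    return never_drops and values[-1] > values[0]


def check_node(
    samples: List[Dict[str, int]],
    max_concurrent_queries: int,
    ratio: float = 0.8,  # hypothetical warning threshold, not a ClickHouse setting
) -> List[str]:
    """Return warnings for one node given snapshots ordered oldest to newest."""
    warnings: List[str] = []
    if not samples:
        return warnings

    # Flag every watched metric that has been climbing across the window.
    for name in WATCHED:
        series = [s.get(name, 0) for s in samples]
        if is_growing(series):
            warnings.append(f"{name} is growing: {series[0]} -> {series[-1]}")

    # Flag the node when running queries approach the configured limit.
    running = samples[-1].get("Query", 0)
    if running >= ratio * max_concurrent_queries:
        warnings.append(
            f"Query is at {running} of max_concurrent_queries={max_concurrent_queries}"
        )
    return warnings
```

A check like this only reports that a node is heading toward the limit. It does not explain why the queries are stuck.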
The queries that are stuck seem to have nothing in common - they could be queries to select data from system tables or user data tables, DDL queries, etc. The tables they use are different.
The only solution we could find was to restart the server process.
On the metric graphs it looks like this (yellow and orange lines):
RWLockActiveReaders:

BackgroundMergesAndMutationsPoolTask:

ClickHouse runs on CentOS 7 with the ELRepo kernel 5.4.224-1.el7.elrepo.x86_64.
How to reproduce
We can't reproduce it on demand; it happens unexpectedly.
How can we find out the root cause of this behavior?