Unexpectedly stuck queries after upgrade from 24.1.2.5 to 24.8.4.13 #69904

@ppavlov39

Description

Describe the issue
After upgrading the ClickHouse version from 24.1.2.5 to 24.7.3.42, we faced periodic partial failures of cluster nodes. After some time, we decided to upgrade to 24.8.4.13, which did not help. After that we also disabled the new query analyzer (allow_experimental_analyzer: 0), since it caused many problems when executing previously stable queries.
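For reference, the analyzer can be disabled for all users via a settings profile; this is a minimal sketch assuming the default profile in a users.d override file (file name and placement are illustrative, not from the original report):

```xml
<!-- e.g. /etc/clickhouse-server/users.d/disable_analyzer.xml (hypothetical path) -->
<clickhouse>
    <profiles>
        <default>
            <!-- Fall back to the old query interpretation path -->
            <allow_experimental_analyzer>0</allow_experimental_analyzer>
        </default>
    </profiles>
</clickhouse>
```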

Most often, the failure does not occur immediately. It is preceded by growth of several metrics: RWLockActiveReaders, BackgroundMergesAndMutationsPoolTask (no mutations are running at that time), and the number of concurrent queries. Essentially, the failure comes down to one of the cluster nodes hitting the limit set by the max_concurrent_queries parameter, after which new queries stop being executed.
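The growth of these counters can be watched directly on the affected node; a minimal diagnostic sketch against the system.metrics table (the Query metric is the current number of executing queries):

```sql
-- Current values of the metrics mentioned above
SELECT metric, value
FROM system.metrics
WHERE metric IN ('RWLockActiveReaders',
                 'BackgroundMergesAndMutationsPoolTask',
                 'Query');
```

Historical values around the time of a failure are also available in system.metric_log, which can help correlate the growth with specific queries.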

The stuck queries seem to have nothing in common: they can be queries selecting from system tables or from user data tables, DDL queries, etc., and the tables they touch differ.
The only solution we could find was to restart the server process.
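To see which queries are stuck and how long they have been running, something like the following can be used (the 300-second threshold and the query_id placeholder are illustrative assumptions):

```sql
-- Queries running longer than 5 minutes, oldest first
SELECT query_id, user, elapsed, query
FROM system.processes
WHERE elapsed > 300
ORDER BY elapsed DESC;

-- Attempt to terminate one of them; this often has no effect when the
-- query is blocked on an internal lock rather than doing work
KILL QUERY WHERE query_id = '<stuck-query-id>' ASYNC;
```

If KILL QUERY does not take effect, that is consistent with the query waiting on a table RWLock, which would also explain the RWLockActiveReaders growth.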

In the metrics it looks like this (yellow and orange lines):

RWLockActiveReaders:
[screenshot: RWLockActiveReaders graph]

BackgroundMergesAndMutationsPoolTask:
[screenshot: BackgroundMergesAndMutationsPoolTask graph]

ClickHouse runs on CentOS 7 with elrepo kernel 5.4.224-1.el7.elrepo.x86_64.

How to reproduce
We can't reproduce it; it happens unexpectedly.

How can we find out the root cause of this behavior?


Labels: st-need-info (We need extra data to continue (waiting for response). Either some details or a repro of the issue.)
