Balance the query load between replicas in a shard #10564

Description

@cw9

Hi, I have sometimes seen an imbalanced query load distribution between the replicas in a shard.

e.g.:

(screenshot: query-load graph showing the imbalance between the two replicas)

I'm using random as the load_balancing policy. The two replicas use similar hardware resources throughout, and the replication queue stays very low, so I don't think either of them is stale.

Sometimes when a flood of queries comes in, many more of them are sent to one node than to the other, as the graph indicates, so the node receiving most of the queries can hit max_concurrent_queries. Is there anything I can do to force an even distribution between them?
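Since load_balancing is a per-session/per-profile setting, it can be experimented with directly. A minimal sketch, assuming the setting names from the ClickHouse settings reference (the values available in this era should include random, nearest_hostname, in_order and first_or_random; worth verifying on 20.3):

```sql
-- Check the policy currently in effect:
SELECT name, value
FROM system.settings
WHERE name = 'load_balancing';

-- Override it for the current session (or set it in the users.xml profile):
SET load_balancing = 'in_order';
```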

I tried listing the replicas from the node receiving the distributed queries:

SELECT
    shard_num,
    shard_weight,
    replica_num,
    errors_count,
    estimated_recovery_time
FROM system.clusters
LIMIT 4

┌─shard_num─┬─shard_weight─┬─replica_num─┬─errors_count─┬─estimated_recovery_time─┐
│         1 │            1 │           1 │            1 │                       0 │
│         1 │            1 │           2 │            1 │                       0 │
│         2 │            1 │           1 │            1 │                       0 │
│         2 │            1 │           2 │            1 │                       0 │
│         3 │            1 │           1 │            1 │                       0 │
│         3 │            1 │           2 │            1 │                       0 │
│         4 │            1 │           1 │            1 │                       0 │
│         4 │            1 │           2 │            0 │                       0 │
└───────────┴──────────────┴─────────────┴──────────────┴─────────────────────────┘

4 rows in set. Elapsed: 36.449 sec.

A lot of the nodes have an errors_count of 1. Is there any way for me to check what that error is? Is it possible that a bunch of queries came in while one replica had 1 error and the other had 0, so they all got sent to the replica with 0 errors, like shard 4 above?
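From what I can tell from the settings reference, errors_count decays over time rather than being reset immediately, which would explain replicas sitting at 1 for a while, and the random policy prefers the replicas with the fewest accumulated errors, so the scenario above seems plausible. A hedged sketch for inspecting the relevant knobs (setting names assumed from the settings reference, worth verifying on 20.3):

```sql
-- Error accounting for distributed queries: errors decay with a half-life
-- and are capped (setting names assumed from the settings reference):
SELECT name, value
FROM system.settings
WHERE name IN ('distributed_replica_error_half_life',
               'distributed_replica_error_cap');
```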

Also, the above query took a really long time. Is that expected? There are 300+ nodes in the table.
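In case it helps narrow things down, a smaller probe (the cluster name here is hypothetical); my guess, not confirmed, is that resolving host names for the host_address/is_local columns across 300+ nodes contributes to the latency:

```sql
-- Only the replicas that have accumulated errors, for one cluster:
SELECT shard_num, replica_num, errors_count
FROM system.clusters
WHERE cluster = 'my_cluster'  -- hypothetical name
  AND errors_count > 0;
```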

  • ClickHouse server version: 20.3.5


    Labels

    comp-distributed: Distributed table engine & query routing across shards (sharding/load balancing).
    unexpected behaviour: Result is unexpected, but not entirely wrong at the same time.
