Retries in distributed queries if a server stopped responding during a query. #58380
Description
Use case
A cluster has a dynamically changing number of replicas, and some replicas disappear during a running query.
Describe the solution you'd like
If an internal query hasn't returned any blocks of data yet (for example, because the query contains ORDER BY or GROUP BY and only starts returning data near the end of its run time) but the connection was closed, reconnect to another replica and send the query again.
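A minimal sketch of the proposed rule, in Python rather than ClickHouse's own C++ and not based on any existing ClickHouse code. The names `run_with_failover`, `run_on_replica`, and `ReplicaConnectionLost` are hypothetical and exist only for illustration. The point is that a retry is only safe while zero blocks have been passed downstream:

```python
from typing import Callable, Iterable, Iterator, Optional, Sequence


class ReplicaConnectionLost(Exception):
    """Hypothetical error: the connection to a replica closed mid-query."""


def run_with_failover(
    replicas: Sequence[str],
    run_on_replica: Callable[[str], Iterable[object]],
) -> Iterator[object]:
    """Yield result blocks, retrying on the next replica only if
    the failed replica had not returned any block yet."""
    last_error: Optional[Exception] = None
    for replica in replicas:
        blocks_returned = 0
        try:
            for block in run_on_replica(replica):
                blocks_returned += 1
                yield block
            return  # query finished successfully on this replica
        except ReplicaConnectionLost as exc:
            if blocks_returned > 0:
                # Data was already passed downstream; resending the
                # query would duplicate it, so give up instead.
                raise
            last_error = exc  # nothing returned yet: safe to retry
    raise last_error or ReplicaConnectionLost("no replicas available")
```

In this sketch a replica that fails after returning partial data is not retried, which matches the condition described above.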
Caveats
The progress bar will be slightly wrong.
In some cases, the network connection hangs rather than being reset. Failover is more difficult in this case, but it is possible if we lower the socket read/write timeout and drop the connection when we don't receive progress packets for a certain time. Alternatively, we can send "ping" packets during the query run time. We can also lower the TCP keep-alive interval.
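The hang-detection ideas can be sketched as follows. This is an illustration in Python, not ClickHouse code: `ProgressWatchdog` and `configure_socket` are hypothetical names, and the timeout values are left to the caller. `TCP_KEEPIDLE` is not available on every platform, so it is applied only when the `socket` module exposes it:

```python
import socket
import time
from typing import Callable


class ProgressWatchdog:
    """Hypothetical helper: flags a connection as stalled when no
    packet (progress, ping reply, data) arrived for too long."""

    def __init__(self, idle_timeout_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.idle_timeout_s = idle_timeout_s
        self._clock = clock
        self._last_seen = clock()

    def on_packet(self) -> None:
        # Call whenever any packet is received from the replica.
        self._last_seen = self._clock()

    def is_stalled(self) -> bool:
        return self._clock() - self._last_seen > self.idle_timeout_s


def configure_socket(sock: socket.socket, read_timeout_s: float,
                     keepalive_idle_s: int) -> None:
    """Lower the read/write timeout and enable TCP keep-alive."""
    sock.settimeout(read_timeout_s)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    if hasattr(socket, "TCP_KEEPIDLE"):  # platform-dependent option
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE,
                        keepalive_idle_s)
```

A caller would check `is_stalled()` periodically and, if it returns true before any data block was received, close the connection and retry on another replica.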
Additional context
This is especially useful for parallel replicas.
We can also have this option in clickhouse-client for normal queries.