Description
To reproduce
Problem Statement
Consider a deployment where you have two instances, A and B.
Happy Scenario
A is read-write, B is read-only.
You connect a sender, send data, all is well.
A goes down, and B is promoted to read-write. The sender switches to B and continues writing data.
Sad scenario
A is read-write, B is read-only.
A is down already.
You try to connect a sender. The sender tries instance B and is rejected because B does not accept writes. The sender then blacklists instance B.
The user notices A is down, and makes B a read-write instance.
The sender does not recover for a long time, because it has blacklisted B. Only once all of the retries are exhausted does it give up and fall back, eventually getting the data in.
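The two scenarios can be compared with a toy model. This is a minimal sketch, not the QuestDB client code: `Instance`, `ToySender`, and `attempts_until_success` are hypothetical names, and the fallback is assumed to clear the blacklist and try once more after the retries run out.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    up: bool = True
    writable: bool = False

    def accepts_writes(self) -> bool:
        return self.up and self.writable


class ToySender:
    """Hypothetical sender that tries each instance in order on every attempt."""

    def __init__(self, instances, use_blacklist):
        self.instances = instances
        self.use_blacklist = use_blacklist
        self.blacklist = set()

    def try_send(self) -> bool:
        for inst in self.instances:
            if self.use_blacklist and inst.name in self.blacklist:
                continue  # blacklisted instances are skipped without a request
            if inst.accepts_writes():
                return True
            if self.use_blacklist and inst.up:
                # The instance answered but rejected the write (read-only).
                self.blacklist.add(inst.name)
        return False


def attempts_until_success(use_blacklist, promote_after, max_retries):
    """Return the attempt number on which a write first succeeds, or None."""
    a = Instance("A", up=False, writable=True)   # A is already down
    b = Instance("B", up=True, writable=False)   # B starts read-only
    sender = ToySender([a, b], use_blacklist)
    for attempt in range(1, max_retries + 1):
        if attempt > promote_after:
            b.writable = True  # the user promotes B to read-write
        if sender.try_send():
            return attempt
    # Assumed fallback: forget the blacklist and try one last time.
    sender.blacklist.clear()
    return max_retries + 1 if sender.try_send() else None
```

In this model, with B promoted after the first attempt and five retries, the sender without a blacklist succeeds on attempt 2, while the blacklisting sender only succeeds on the fallback attempt after all five retries have failed.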
RCA
The blacklisting code was added as an optimisation to avoid wasted retries. Unfortunately, it does not account for deployments where senders are created and destroyed ad hoc and have not yet reached a steady state.
This issue can be triggered any time a client is created during a failover, and it artificially delays ingestion on the new read-write instance.
Workaround
Explicitly set protocol_version to avoid this code path.
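As a rough illustration, a config string with the version pinned might be assembled as below. Only the `protocol_version` key comes from this issue. The `build_config` helper, the host names, the repeated `addr` form, and the value `2` are assumptions to be checked against the client documentation.

```python
def build_config(addrs, protocol_version):
    """Assemble an HTTP config string with an explicit protocol_version.

    Hypothetical helper: the key layout is an assumption, not a confirmed format.
    """
    parts = [f"addr={addr}" for addr in addrs]
    parts.append(f"protocol_version={protocol_version}")
    return "http::" + ";".join(parts) + ";"


# Placeholder hosts standing in for instances A and B.
conf = build_config(["a.example:9000", "b.example:9000"], 2)
```

The resulting string would then be passed to whichever client constructor accepts a config string.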
Solution
Remove blacklisting.
Minor note
Some parameters, such as maxBackoffMillis, are available in the sender builder but not in the config string. This should be corrected in due course.
QuestDB version:
Master (9.1.0+)
OS, in case of Docker specify Docker and the Host OS:
N/A
File System, in case of Docker specify Host File System:
N/A
Full Name:
Nick Woolmer
Affiliation:
QuestDB
Have you followed the Linux and macOS kernel configuration steps to increase the Maximum open files and Maximum virtual memory areas limits?
- Yes, I have
Additional context
No response