Describe the bug
A master Redis instance replicating to a replica instance will not successfully perform a graceful shutdown (such as one triggered by SIGTERM) in the presence of TCP congestion or CPU overload, losing a significant portion of the replication data between the master's and the replica's offsets. This appears to be due to incomplete socket handling.
To reproduce
To reproduce this I ran experiments with a low-powered computer as the replica, a more powerful computer running the master, and a hacked-up script. All machines were running Linux. The replica machine is slower than the master, ensuring it can't keep up under heavy load. I tested with Redis 6.2.6 and with b7afac6, which includes the recent replication improvements (#9166), with the same results in this area.
```python
#!/usr/bin/env python3
import time

import redis

r = redis.Redis(host='master', port=6379)
r2 = redis.Redis(host='slave', port=6379)
pipe = r.pipeline(transaction=False)
last_i = 0  # avoids an unbound name if the very first execute() fails
for i in range(1, 5000000):
    pipe.set('mykey', i)
    if (i % 100 == 0):
        try:
            pipe.execute()
            last_i = i
        except redis.exceptions.ConnectionError:
            print('i:', i, 'last i:', last_i)
            time.sleep(1)
            slave_i = int(r2.get('mykey'))
            print('s:', slave_i)
            print('nice!' if (slave_i in (i, last_i)) else 'ate my data :(')
            exit()
    if (i % 50000 == 0):
        print('m: {} s: {}'.format(int(r.get('mykey')), int(r2.get('mykey'))))
pipe.execute()
```
Example output where 76607 writes were lost after several seconds of high load followed by a graceful master shutdown:
```
m: 50000 s: 38899
m: 100000 s: 76823
m: 150000 s: 115703
m: 200000 s: 155760
m: 250000 s: 198549
m: 300000 s: 242704
m: 350000 s: 285037
m: 400000 s: 325550
m: 450000 s: 362876
m: 500000 s: 406120
```

(User manually issued a SIGINT to the server here.)

```
i: 523000 last i: 522900
s: 446293
ate my data :(
```
Expected behavior
Just as Redis makes a thorough effort to finish writing to disk (if enabled) at shutdown, it should make a thorough effort to finish writing to replicas before shutting down. I did not find a detailed description in the documentation of what should be expected, but I interpret this section of the replication docs to imply that writing to a replica should be equivalent to writing to disk, which I assume should only be slow under high load, rather than losing data:
> It is possible to use replication to avoid the cost of having the master writing the full dataset to disk: a typical technique involves configuring your master redis.conf to avoid persisting to disk at all, then connect a replica configured to save from time to time, or with AOF enabled. However this setup must be handled with care, since a restarting master will start with an empty dataset: if the replica tries to synchronize with it, the replica will be emptied as well.
Additional information
I've found two root causes:
- writeToClient has logic to "send as much as possible if the client is a slave", but if it hits an error it aborts sending. Those errors include EAGAIN/EWOULDBLOCK (TCP congestion). Graceful shutdown therefore relies on the entire remaining replication data fitting in the socket buffer; otherwise it gives up, causing data loss.
- After sending, the sockets are promptly closed and the process exits. However, the replica still sends periodic ACK messages, and the socket has not been shut down (nor this case handled in any other way). If an ACK is received before all data has been sent, it aborts draining the socket buffer (the portion that did fit), closing the connection with an RST.
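The first issue can be seen in isolation with a short sketch using plain Python sockets (illustrative only, not Redis code): once the kernel send buffer of a non-blocking socket fills, the next send fails with EAGAIN/EWOULDBLOCK, which is exactly the error that makes writeToClient abort the final flush.

```python
import socket

# Fill a non-blocking socket until the kernel send buffer is full;
# the next send() raises BlockingIOError (EAGAIN/EWOULDBLOCK).
a, b = socket.socketpair()
a.setblocking(False)

sent = 0
try:
    while True:
        sent += a.send(b'x' * 4096)
except BlockingIOError:
    # This is the condition under which the graceful shutdown gives up,
    # abandoning whatever replication data did not fit in the buffer.
    pass
print('bytes buffered before EAGAIN:', sent)
a.close()
b.close()
```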
I find that at low load (no congestion, and able to finish draining within one second) and with high luck (the last ACK happened to be sent just before the shutdown began, giving the master a full second to finish draining the socket), the graceful disconnection of replicas works and the master closes the TCP connection with a FIN.
At high load, presumably high enough to consistently fill the socket buffer, you get hit by both of these issues.
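The second issue can also be demonstrated in isolation. It is standard TCP behaviour rather than anything Redis-specific: closing a socket that still has unread data in its receive queue makes the kernel send an RST instead of a FIN. The sketch below (names are illustrative) mimics a peer whose message, like the replica's periodic ACK, arrives at a side that has stopped reading before it closes.

```python
import socket
import time

srv = socket.socket()
srv.bind(('127.0.0.1', 0))
srv.listen(1)
cli = socket.socket()
cli.connect(srv.getsockname())
conn, _ = srv.accept()

cli.send(b'ACK')     # traffic the closing side never reads
time.sleep(0.2)      # let it land in conn's receive queue
conn.close()         # unread data pending -> kernel emits RST, not FIN

try:
    cli.recv(1)
    outcome = 'orderly FIN'
except ConnectionResetError:
    outcome = 'reset (RST)'
print(outcome)
cli.close()
srv.close()
```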
I can confirm the first issue by setting the slave sockets to blocking mode in flushSlavesOutputBuffers before writeToClient. This completes writing the application buffer into the socket buffer, as the write blocks on congestion instead of returning EAGAIN and aborting.
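Conceptually, the blocking-mode hack behaves like this sketch (plain Python sockets, not the actual C change): a blocking send waits for buffer space instead of failing with EAGAIN, so the entire application buffer eventually reaches the kernel as long as the peer keeps draining.

```python
import socket
import threading

a, b = socket.socketpair()
payload = b'x' * (4 * 1024 * 1024)  # far larger than the socket buffer

received = bytearray()
def reader():
    # Drains the other end, the way a live replica keeps consuming.
    while True:
        chunk = b.recv(65536)
        if not chunk:
            break
        received.extend(chunk)

t = threading.Thread(target=reader)
t.start()
a.sendall(payload)  # blocks when the buffer is full, never EAGAIN
a.close()
t.join()
print('drained', len(received), 'bytes')
```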
Blocking works and is roughly equivalent to writing to a (slow) disk. However, unlike a single file, a master can be configured with multiple replicas, and they would be handled strictly one at a time. Depending on where the bottleneck is, this could increase the time to shut down far more than draining them concurrently would, so blocking is not ideal here.
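One way to drain several replicas concurrently without blocking on each in turn is an event loop over non-blocking sockets. A minimal sketch with Python's selectors module (illustrative only, not Redis's actual event loop):

```python
import selectors

def drain_all(bufs):
    """Write out every buffer in bufs (a dict of socket -> bytearray),
    making progress on whichever socket has space, until all are empty."""
    sel = selectors.DefaultSelector()
    for s in bufs:
        s.setblocking(False)
        sel.register(s, selectors.EVENT_WRITE)
    while bufs:
        for key, _ in sel.select():
            s = key.fileobj
            try:
                n = s.send(bufs[s])
            except BlockingIOError:  # EAGAIN: retry this socket later
                continue
            del bufs[s][:n]
            if not bufs[s]:          # this peer is fully drained
                sel.unregister(s)
                del bufs[s]
    sel.close()
```

A real implementation would of course also need an overall timeout so that a dead replica cannot stall shutdown forever.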
This improves the result considerably, but there is still a window at the end where the second issue may abort draining what is left in the socket buffer. To confirm this, I did a quick-and-dirty removal of the 1-second replication cron task that sends the ACK message. This kept the replica quiet on its end of the socket while the master shut down, which allowed the socket buffer to drain fully and the conversation to end with a FIN.
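The effect of keeping the replica quiet can be seen with the same plain-TCP sketch as before (again illustrative, not Redis code): when no unread data is pending at close time, here because the closing side drains its receive queue first, the connection ends with an orderly FIN and the peer sees a clean end of stream.

```python
import socket
import time

srv = socket.socket()
srv.bind(('127.0.0.1', 0))
srv.listen(1)
cli = socket.socket()
cli.connect(srv.getsockname())
conn, _ = srv.accept()

cli.send(b'ACK')
time.sleep(0.2)
conn.recv(1024)   # drain pending input before closing
conn.close()      # nothing left unread -> orderly FIN

eof = cli.recv(1) # b'' signals a clean end of stream, not a reset
print('clean FIN' if eof == b'' else 'unexpected data')
cli.close()
srv.close()
```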
With both hacks in place the example script works "all the time", meaning: the several times I've tried it, on Linux only, and with no other edge cases taken into account.
```
m: 50000 s: 39825
m: 100000 s: 79622
m: 150000 s: 120246
m: 200000 s: 158027
m: 250000 s: 198540
m: 300000 s: 236777
m: 350000 s: 276379
m: 400000 s: 315071
m: 450000 s: 354218
m: 500000 s: 392000
i: 530100 last i: 530000
s: 530000
nice!
```