-
Notifications
You must be signed in to change notification settings - Fork 24.5k
Description
The issue that is being addressed
During bgsave, the primary child process, sends an rdb snapshot to the replica. During this time any write command that the primary has to process is kept in the COB in order to be sent to the replica once bgsave is done.
If the save is taking for too long or write commands are coming in high frequency, the COB may reach its limits, causing full sync to fail.
By implementing a more efficient way to stream RDB data from primary to replica, the feature intends to not only reduce COB overrun, but also simplify the bgsave process.
Description
The primary will simultaneously send RDB and command stream during the full sync. The replica on the other hand will store the command stream in a local buffer to be streamed locally once new snapshot is loaded.
By doing so we will gain the following:
- Reduce memory load. Replica will be responsible to the online replication buffer, this will reduce the memory load on the primary during the save, which is already heavy due to COW memory usage.
- Faster sync. The primary main process is a bottle neck. The feature will take from the primary main process the responsibility of being the bridge between the background process and the replica, thus even when primary main process handles heavy commands, the full sync will continue.
- Reduce stale data. Currently in cases that the COB at the primary holds a lot of important changes, when replica finishes loading the DB sent from primary it starts answering clients’ reads. This leaves the clients with stale data which may take long to propagate (Transferring a COB of a few giga will take minutes) .
With this feature the commands streaming will be much faster at the replica side, making the time between the replica start responding to client to the time when the replica is completely up to data considerably shorter.

Alternatives we've considered
One connection for both data types
During fsync primary will use the connection to its replica to write both rdb and command stream. This will be done by adding a header as a prefix to each message indicating the message type.
$RDB
RDB data...
$COB
Command Stream data...
...
Pros:
- Simplicity. Single connection, easy to read and debug.
- Chit chat protocol line is easily developed on top of. The replica command’s header processing will be done in generic way, adding more message type in the future will be much easier. Discussed in [WIP] replication: handling PINGs as out-of-band data #8440.
Cons:
- Reduce parallelism. Comparing to the first option, all writes and key iteration must go through main process.
- Message header overhead. Each message on the primary replica connection will have few more bytes to pass the message type.
Additional information
Preliminary results
We checked the memory influence of the feature, with 3gb size DB, and high write on primary we where able to move 80% of the used memory by the COB to the replica side.
- In regular bgsave cob starts to grow once the back ground process is created, comparing to the POC where the cob grows only during the load from disk. In practice we will be able to reduce the size much more by frequent yielding during the snapshot load.
- Why steps? I used
client_recent_max_output_bufferto measure the cob size, it shows the recent max so it only updates once every ~1 second. - Maximum flat: this is also a result of the
recent_max, also there is a continues write burst to the primary during this time, so although replica started reading from the primary COB, there are clients filling it.
This is the impact on the replica buffer size. Same memory usage distributed between the primary and the replica (mostly the replica).

Design
- Flow
- High level design
Once we know full sync is needed:
Setup:
- Replica creates a second channel.
- From second channel replica sends new command
new-command <master_host> <master_port> --rdb <path_to_replica_disk> +send-end-offset
asking for snapshot with new--rdb sub command (similar to rdb command but diskless on primary side), so primary child process will send the end offset of the new RDB before the RDB itself.
Full Sync:
- On primary side: Bgprocess created. (Bgprocess uses second channel).
- The end offset of the yet uncreated RDB should be sent.
- Bgprocess starts sending RDB through second channel.
- On the replica side:
- Get the end-offset, and send back through main channel:
PSYNC <master_repl_id> <rdb_end_offset>+1 - Meanwhile receiving from second channel the RDB, saving it to disk or memory.
- Get the end-offset, and send back through main channel:
- Primary answer with PSYNC and from now on sends the online replication data to the new “connected“ replica.
- Replica reads the replication data and store it in local COB:
void readQueryFromClient(connection *conn) {
// ... Read and parse command
if((c->flags & CLIENT_MASTER) && server.repl_state == REPL_STATE_TRANSFER){
SaveCommandsToLocalCob(c->query_buff, offset);
server.amz.pending_master_stream_size += nread;
return;
}
// ... Process command
}
- When child process sent all data it gets rid of the second connection and terminates.
Recovery
- Replica finished loading the RDB.
- Replica streams the local COB to memory.
- Replica back online.
- Replica Streams the regular COB from primary to memory.


