Skip to content

Dual channel replication#60

Merged
PingXie merged 110 commits intovalkey-io:unstablefrom
naglera:rdb-channel
Jul 17, 2024
Merged

Dual channel replication#60
PingXie merged 110 commits intovalkey-io:unstablefrom
naglera:rdb-channel

Conversation

@naglera
Copy link
Contributor

@naglera naglera commented Mar 28, 2024

In this PR we introduce the main benefit of dual channel replication by continuously steaming the COB (client output buffers) in parallel to the RDB and thus keeping the primary's side COB small AND accelerating the overall sync process. By streaming the replication data to the replica during the full sync, we reduce

  1. Memory load from the primary's node.
  2. CPU load from the primary's main process. Latest performance tests

Motivation

  • Reduce primary memory load. We do that by moving the COB tracking to the replica side. This also decrease the chance for COB overruns. Note that primary's input buffer limits at the replica side are less restricted then primary's COB as the replica plays less critical part in the replication group. While increasing the primary’s COB may end up with primary reaching swap and clients suffering, at replica side we’re more at ease with it. Larger COB means better chance to sync successfully.
  • Reduce primary main process CPU load. By opening a new, dedicated connection for the RDB transfer, child processes can have direct access to the new connection. Due to TLS connection restrictions, this was not possible using one main connection. We eliminate the need for the child process to use the primary's child-proc -> main-proc pipeline, thus freeing up the main process to process clients queries.

Dual Channel Replication high level interface design

  • Dual channel replication begins when the replica sends a REPLCONF CAPA dual-channel to the primary during initial
    handshake. This is used to state that the replica is capable of dual channel sync and that this is the replica's main channel, which is not used for snapshot transfer.
  • When replica lacks sufficient data for PSYNC, the primary will send +DUALCHANNELSYNC response instead
    of RDB data. As a next step, the replica creates a new connection (rdb-channel) and configures it against
    the primary with the appropriate capabilities and requirements. The replica then requests a sync
    using the RDB channel.
  • Prior to forking, the primary sends the replica the snapshot's end repl-offset, and attaches the replica
    to the replication backlog to keep repl data until the replica requests psync. The replica uses the main
    channel to request a PSYNC starting at the snapshot end offset.
  • The primary main threads sends incremental changes via the main channel, while the bgsave process
    sends the RDB directly to the replica via the rdb-channel. As for the replica, the incremental
    changes are stored on a local buffer, while the RDB is loaded into memory.
  • Once the replica completes loading the rdb, it drops the rdb channel connection and streams the accumulated incremental
    changes into memory. Repl steady state continues normally.

New replica state machine

image

Data

image

image

image

Explanation

These graphs demonstrate performance improvements during full sync sessions using rdb-channel + streaming rdb directly from the background process to the replica.

First graph- with at most 50 clients and light weight commands, we saw 5%-7.5% improvement in write latency during sync session.
Two graphs below- full sync was tested during heavy read commands from the primary (such as sdiff, sunion on large sets). In that case, the child process writes to the replica without sharing CPU with the loaded main process. As a result, this not only improves client response time, but may also shorten sync time by about 50%. The shorter sync time results in less memory being used to store replication diffs (>60% in some of the tested cases).

Test setup

Both primary and replica in the performance tests ran on the same machine. RDB size in all tests is 3.7gb. I generated write load using valkey-benchmark ./valkey-benchmark -r 100000 -n 6000000 lpush my_list __rand_int__.

@naglera naglera force-pushed the rdb-channel branch 3 times, most recently from 8b1bfef to 3179baf Compare May 22, 2024 16:24
@madolson madolson added the major-decision-pending Major decision pending by TSC team label Jun 3, 2024
@naglera naglera force-pushed the rdb-channel branch 7 times, most recently from 8b4aba0 to 5411d92 Compare June 10, 2024 17:57
Copy link
Contributor

@zuiderkwast zuiderkwast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the HLD. I didn't review in dept yet.

I don't like that we use "This is the main connection" to implicitly mean "I'm capable of receiving the RDB dump on another connection".

Here's my suggestion:

  • The replica can use REPLCONF CAPA to tell the primary.
  • Check the reply +FULLRESYNC vs -FULLSYNCNEEDED to find out if the primary supports this or not.

@zuiderkwast
Copy link
Contributor

Great! I'll take a closer look another day. You'll need to fix some merge conflicts (replace all master/slave with primary/replica in source code and comments) and fix the DCO (the CI job log says use git rebase HEAD~16 --signoff, then force-push) and probably run clang-format.

@madolson madolson added major-decision-approved Major decision approved by TSC team and removed major-decision-pending Major decision pending by TSC team labels Jun 17, 2024
@naglera naglera force-pushed the rdb-channel branch 2 times, most recently from dd17e3e to 676a6bb Compare June 18, 2024 09:45
Copy link
Member

@PingXie PingXie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @naglera! I am directionally aligned with the idea and the high level implementation but I have only scratched the surface of this PR. The core team has marked this for Valkey 8.0. If you are committed as well, can you please 1) resolve the merge conflicts; 2) fix the primary/replica terminology; 3) clang-format the changes, when you get a chance? I will continue my review afterwards. Thanks again for the contribution!

Copy link
Member

@enjoy-binbin enjoy-binbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the PR. I roughly read the code locally, high Level LGTM.

The code mixes connection and channel at some points. I suggest we stick with just one name.
In addition, some comments are outdated. I didn't read them carefully and marked them.
I think there's a lot of new replconf / status interaction, is there any way to reduce it? (I haven't reviewed all of them, they may be necessary)

thanks again, please resolve the conflict (i know it's painful in a big PR...)

naglera added 7 commits June 26, 2024 12:58
In this PR we introduce the main benefit of rdb channel by continuously
steaming the COB in parallel to the RDB and thus keeping the primary
side COB small AND accelerating the overall sync process.
By streaming the replication data to the replica during the full sync,
we reduce
1. Memory load from the primary node.
2. CPU load from the primary main process (will be introduced in later
   PR).
This opens up possibilities to future improvements with better
TLS connection handling and removal of the the need to pipeline the RDB
from the child process to the main.

Squashed commit of the following:

commit 7d4756681655ae1c6e04681710740e422a6def4c
Merge: 97894e3 7253862
Author: naglera <[email protected]>
Date:   Wed May 22 08:15:26 2024 +0000

    Merge branch 'unstable' into rdb-channel

commit 97894e3
Author: Amit Nagler <[email protected]>
Date:   Sun May 19 15:21:58 2024 +0000

    fix psync stat error count

    rdb channel sync should also count psync failures.
    This commit fix tests "replication partial resync" to run with rdb
    channel enabled

commit a51b0a5
Author: Amit Nagler <[email protected]>
Date:   Sun May 19 11:06:04 2024 +0000

    fix test PSYNC2: Full resync after Master restart when too many key expired

commit 96c93f3
Author: Amit Nagler <[email protected]>
Date:   Sun May 19 09:17:45 2024 +0000

    Allow auth using rdb channel

commit 9916f80
Author: Amit Nagler <[email protected]>
Date:   Thu May 16 12:59:44 2024 +0000

    fix use after free error & adjust test's configuration

commit c343261
Author: Amit Nagler <[email protected]>
Date:   Thu May 16 12:19:33 2024 +0000

    spell check fix

commit e2decc4
Author: Amit Nagler <[email protected]>
Date:   Tue May 7 13:54:26 2024 +0000

    Fix replication tests to work with rdb channel

commit bc010e8
Author: Amit Nagler <[email protected]>
Date:   Thu May 9 10:40:47 2024 +0000

    fix initial offset issue at replica side

commit 1ecf5fb
Author: Amit Nagler <[email protected]>
Date:   Tue May 7 12:44:21 2024 +0000

    fix leak of repl_transfer_tmpfile, and use before allocation of end offset buffer

commit 47d482f
Author: Amit Nagler <[email protected]>
Date:   Tue May 7 10:05:13 2024 +0000

    fix crash after repl_transfer_s double free

commit b73dcc2
Author: Amit Nagler <[email protected]>
Date:   Sun Apr 21 15:43:09 2024 +0000

    replica connection free at fork process

commit eb28c91
Author: Amit Nagler <[email protected]>
Date:   Sun Apr 21 13:19:55 2024 +0000

    Allow abort sync connection from rdb con sync

commit 08885c5
Author: Amit Nagler <[email protected]>
Date:   Wed Apr 17 15:14:23 2024 +0000

    Revert "free repl_transfer_tmpfile after rdb sync abort"

    This reverts commit bfa469d.

commit b89fa41
Author: Amit Nagler <[email protected]>
Date:   Wed Apr 17 15:13:30 2024 +0000

    Fix cancel replicaiton handshake rdb-channel case

commit 2eaa109
Author: Amit Nagler <[email protected]>
Date:   Tue Apr 16 15:31:11 2024 +0000

    add new line at the end of rio.c

commit bfa469d
Author: Amit Nagler <[email protected]>
Date:   Tue Apr 16 15:28:26 2024 +0000

    free repl_transfer_tmpfile after rdb sync abort

commit e8462aa
Author: Amit Nagler <[email protected]>
Date:   Tue Apr 16 13:06:26 2024 +0000

    free conns in both direct an normal cases

commit 9ad4bd0
Author: Amit Nagler <[email protected]>
Date:   Wed Apr 10 11:03:43 2024 +0000

    stat_total_reads_processed->stat_net_repl_input_bytes

commit e56ca18
Author: Amit Nagler <[email protected]>
Date:   Tue Apr 9 16:17:30 2024 +0000

    add replconf sub command documentation

commit cefd9e2
Merge: 9454ec2 03650e9
Author: naglera <[email protected]>
Date:   Tue Apr 9 19:26:58 2024 +0300

    Merge branch 'unstable' into rdb-channel

commit 9454ec2
Author: naglera <[email protected]>
Date:   Tue Apr 9 15:09:00 2024 +0000

    fix comments

commit ad361c2
Author: Amit Nagler <[email protected]>
Date:   Tue Apr 9 09:15:34 2024 +0000

    Fix info format

commit 241b130
Author: Amit Nagler <[email protected]>
Date:   Tue Apr 9 08:56:47 2024 +0000

    Fix leak in rdbSaveToSlavesSockets

commit 74b0ca5
Author: Amit Nagler <[email protected]>
Date:   Tue Apr 9 08:31:56 2024 +0000

    Fix leak in slaveTryPartialResynchronization

commit 80b9c5f
Author: Amit Nagler <[email protected]>
Date:   Tue Apr 9 08:12:36 2024 +0000

    Fix fsync requirement bug

    Fix requirement bug causing Dumping an RDB - functions only: yes test failure

commit 841013d
Author: naglera <naglera>
Date:   Mon Apr 8 12:52:36 2024 +0000

    update top comment

commit 3722cb1
Author: naglera <[email protected]>
Date:   Mon Apr 8 11:50:32 2024 +0000

    Handle COB overrun during rdb-chan sync

commit 2adf3c2
Author: naglera <[email protected]>
Date:   Tue Mar 19 13:18:39 2024 +0000

    Avoid reading from replica's rdb client after it marked as closed asap

commit 8db90c7
Merge: 6b60dd2 6411629
Author: naglera <[email protected]>
Date:   Mon Apr 8 08:23:21 2024 +0000

    Merge branch 'unstable' into rdb-channel

commit 6b60dd2
Author: naglera <[email protected]>
Date:   Mon Mar 18 15:21:23 2024 +0000

    Remove lookupClientByIDGeneric

commit 7511f26
Author: Amit Nagler <[email protected]>
Date:   Thu Mar 14 13:56:54 2024 +0000

    debug command for wait_before_rdb_client_free

commit 3ad61d0
Author: naglera <[email protected]>
Date:   Thu Mar 14 11:45:59 2024 +0000

    Use radix tree for waiting replicas for psync

commit 8d9a057
Author: Amit Nagler <[email protected]>
Date:   Wed Mar 13 17:39:11 2024 +0000

    renaming and minor comments

commit fab702a
Author: Amit Nagler <[email protected]>
Date:   Wed Mar 13 17:38:34 2024 +0000

    use CLIENT_PROTECTED_RDB_CHANNEL flag instead of CLIENT_PROTECTED

commit 9b320d0
Author: naglera <[email protected]>
Date:   Wed Mar 13 16:07:10 2024 +0000

    Test edge cases of master connection peering

commit 0597fce
Author: Amit Nagler <[email protected]>
Date:   Wed Mar 13 09:39:25 2024 +0000

    Fix  CI workflow run comments

commit baa932c
Author: naglera <[email protected]>
Date:   Wed Mar 13 19:09:59 2024 +0200

    Update src/server.h

    Co-authored-by: debing.sun <[email protected]>

commit 073a86b
Author: naglera <[email protected]>
Date:   Tue Mar 12 16:10:14 2024 +0000

    Prevent freeing repl buffer blocks that are used by pre established psync connection

    This is necessery for the case in which the RDB is loaded before psync
    establshed. We do that by protecting the RDB client for short grace
    period (5sec) that will allow the replica main channel to finish
    handshake.

commit 75e68e1
Author: naglera <[email protected]>
Date:   Tue Mar 12 13:47:59 2024 +0000

    update replica state machine diagram

commit 5c4c824
Author: naglera <[email protected]>
Date:   Tue Mar 12 13:43:57 2024 +0000

    Use identity to hash dict integers and store keys as plain text

commit 616455e
Author: naglera <[email protected]>
Date:   Tue Mar 12 13:22:21 2024 +0000

    rename REPLCONF identify => set-rdb-conn-id

commit 83a471d
Author: naglera <[email protected]>
Date:   Tue Mar 12 12:58:34 2024 +0000

    remove getLongLongFromObjectOrReply and fix identations

commit da22d0d
Author: naglera <[email protected]>
Date:   Mon Mar 11 17:02:04 2024 +0000

    Rename peer & unpeer => add & remove

commit 39364a6
Author: naglera <[email protected]>
Date:   Mon Mar 11 16:23:27 2024 +0000

    Rename pending_slaves to slaves_waiting_psync

commit 7073371
Author: naglera <[email protected]>
Date:   Mon Mar 11 15:51:24 2024 +0000

    Use cliet id for pending replicas dict

commit 8ccfe0b
Author: naglera <[email protected]>
Date:   Thu Mar 7 12:30:22 2024 +0000

    fix info field: replstate name

commit 79eae09
Author: naglera <[email protected]>
Date:   Mon Feb 26 14:22:17 2024 +0000

    Fix comments and add high level design top comment and diagram

commit 5e20d8e
Author: naglera <[email protected]>
Date:   Mon Feb 26 13:26:17 2024 +0000

    move pending_slaves dict update in case of cob overrun to freeClient

commit 98f0b9c
Author: naglera <[email protected]>
Date:   Sun Feb 25 19:46:58 2024 +0200

    Update src/replication.c

    update slaves_waitlng_psync documentation

    Co-authored-by: Oran Agra <[email protected]>

commit f668d29
Author: naglera <[email protected]>
Date:   Sun Feb 25 19:43:22 2024 +0200

    Update src/replication.c

    Co-authored-by: Oran Agra <[email protected]>

commit 9edeec3
Author: naglera <[email protected]>
Date:   Wed Feb 21 15:19:13 2024 +0000

    spellcheck fix

commit e6bc18a
Author: naglera <[email protected]>
Date:   Wed Feb 21 15:15:31 2024 +0000

    add documentation

commit 93d0e47
Author: naglera <[email protected]>
Date:   Wed Feb 21 13:56:35 2024 +0000

    remove unnecessary if from rollback mechanism

commit b4868fb
Author: naglera <[email protected]>
Date:   Wed Feb 21 09:56:40 2024 +0000

    fix comments

commit 6de2dc3
Author: naglera <[email protected]>
Date:   Mon Feb 19 14:01:58 2024 +0000

    add documentation

commit 24a34dd
Author: naglera <[email protected]>
Date:   Sun Feb 18 10:39:41 2024 +0000

    Test peering connection

commit 6bcbc62
Author: naglera <[email protected]>
Date:   Fri Feb 16 13:33:23 2024 +0000

    Rollback after COB overrun during rdb-channel sync handshake

commit 669e22b
Author: naglera <[email protected]>
Date:   Thu Feb 15 15:53:28 2024 +0000

    Fix rollback mechanism

commit 4130f8f
Author: naglera <[email protected]>
Date:   Wed Feb 14 11:05:34 2024 +0000

    Peer replica with repl backlog block upon fork

commit ee527c3
Author: naglera <[email protected]>
Date:   Thu Feb 15 08:16:23 2024 +0000

    rename SLAVE_STATE_ACTIVE_RDB_CHAN to SLAVE_STATE_BG_RDB_LOAD

    tmp rename

commit 8832ed7
Author: naglera <[email protected]>
Date:   Sun Feb 11 10:47:04 2024 +0000

    fix comment- end offset format parsing

commit 2028315
Author: naglera <[email protected]>
Date:   Thu Feb 8 07:59:08 2024 +0000

    Test fixes

commit 19cddc4
Author: naglera <[email protected]>
Date:   Wed Feb 7 20:10:00 2024 +0000

    Split completeTaskRDBChannelSync into completeTaskRDBChannelSyncMainConn and completeTaskRDBChannelSyncRdbConn

commit d25c36e
Author: naglera <[email protected]>
Date:   Tue Feb 6 16:51:05 2024 +0000

    Simplfy replica state machine by spliting replica state

    server.repl_state will be used to represent the main conneciton state.
    server.repl_rdb_conn_state is introduced to represent the rdb-channel
    state.

    Main channel can reuse REPL_STATE_SEND_PSYNC,
    REPL_STATE_RECEIVE_PSYNC_REPLY, REPL_STATE_TRANSFER after rdb-channel
    initialization.

commit ed6d69a
Author: naglera <[email protected]>
Date:   Tue Feb 6 11:55:41 2024 +0000

    Merge rdbSaveToSlavesSocketsWithPipeline with rdbSaveToSlavesSocketsDirect

commit 1e1f74f
Author: naglera <[email protected]>
Date:   Tue Feb 6 09:40:30 2024 +0000

    Small comment fixes

commit 5fc0be9
Author: naglera <[email protected]>
Date:   Mon Feb 5 17:30:48 2024 +0000

    Hide repl-rdb-channel config, . Clarify fallback point. Better utilize master_supports_rdb_channel

commit e44bc69
Author: naglera <[email protected]>
Date:   Mon Feb 5 10:00:43 2024 +0000

    Fix typo

commit b4ead39
Author: naglera <[email protected]>
Date:   Mon Feb 5 10:51:07 2024 +0000

    Fix possible crash after sync failure

commit d50e456
Author: naglera <[email protected]>
Date:   Wed Jan 31 08:03:55 2024 +0000

    Space fixes and redis.conf update

commit acd12a8
Author: naglera <[email protected]>
Date:   Tue Jan 30 13:06:15 2024 +0000

    Test rdb-connection race condition

    Test the case in which the replica establish psync connection after the
    RDB was already loaded.
    Added debug command in order to froce main process to be slower then the
    bg child.

commit 1a6052f
Author: naglera <[email protected]>
Date:   Mon Jan 29 14:22:37 2024 +0000

    Disable child process pipeline

    Stream replication data directly from bg child process to replica

commit f71122c
Author: naglera <[email protected]>
Date:   Wed Jan 24 17:39:03 2024 +0000

    Refactor- Extract rdb-channel logic from slaveTryPartialReSync

    Extract setupMainConnForPsync from slaveTryPartialResynchronization to
    make slaveTryPartialResynchronization readable.

commit a67a5ff
Author: naglera <[email protected]>
Date:   Wed Jan 24 14:57:00 2024 +0000

    Add time threshold to replica buffer streaming progress callback

commit cedbf8a
Author: naglera <[email protected]>
Date:   Wed Jan 24 10:39:42 2024 +0000

    Stop reading from main connection after replica's local buffer limit reached

commit 03477b0
Author: naglera <[email protected]>
Date:   Wed Jan 24 08:54:24 2024 +0000

    switch primary -> master

commit 09b30b3
Author: naglera <[email protected]>
Date:   Wed Jan 24 08:47:52 2024 +0000

    unwrap processEndOffsetResponse

commit 4337734
Author: naglera <[email protected]>
Date:   Mon Jan 22 17:59:28 2024 +0000

    add new replication state REPL_RDB_CONN_SEND_CAPA

commit 7cd6252
Author: naglera <[email protected]>
Date:   Mon Jan 22 14:15:17 2024 +0000

    add documetation for process end offset response

commit 2e64230
Author: naglera <[email protected]>
Date:   Mon Jan 22 13:05:43 2024 +0000

    switch rdb-channel sync to normal sync only when we know that master does not support rdb-channel

commit d65381a
Author: naglera <[email protected]>
Date:   Mon Jan 22 10:53:28 2024 +0000

    Send Ack at the end of readSyncBulkPayload if it isn't rdb connection sync. AlSome method renamed

commit ac6d324
Author: naglera <[email protected]>
Date:   Thu Jan 18 09:51:00 2024 +0000

    Make repl_provisional_master contain only the necessary client's fields

commit bab299e
Author: naglera <[email protected]>
Date:   Tue Jan 16 11:07:02 2024 +0000

    Mostly renaming and comment fixes

commit c8de526
Author: naglera <[email protected]>
Date:   Mon Jan 15 07:56:35 2024 +0000

    Use -FULLSYNCNEEDED instead of empty bulk

commit 400664d
Author: naglera <[email protected]>
Date:   Mon Jan 8 17:06:06 2024 +0000

    Fixed comments, rename methods and states

commit be1f66f
Author: naglera <[email protected]>
Date:   Thu Jan 4 18:24:28 2024 +0000

    Move sendCurentOffsetToReplica to replication.c

commit 03d35b6
Author: naglera <[email protected]>
Date:   Thu Jan 4 14:08:51 2024 +0000

    Remove sendReplicationOffsetToReplicas(), since we dont want to permit rdb-channel sync along with master side disk sync

commit b7c1688
Author: naglera <[email protected]>
Date:   Wed Jan 3 16:05:15 2024 +0000

    rename primary_can_sync_using_rdb_channel-> master_supports_rdb_channel

commit 21c797e
Author: naglera <[email protected]>
Date:   Thu Jan 4 10:35:51 2024 +0000

    Decrement pending_repl_data.len during replica buffer streaming.

    Stop using stat_repl_processed_bytes to follow streaming progress,
    instread keep record of the buffer's peak.

commit 33bfd16
Author: naglera <[email protected]>
Date:   Wed Jan 3 16:01:16 2024 +0000

    replica buffer will use the same mechanism as other client reply buffers

commit 07f24f5
Author: naglera <[email protected]>
Date:   Tue Jan 2 14:34:06 2024 +0000

    Remove mentions of second channel

commit f0252d8
Author: naglera <[email protected]>
Date:   Tue May 23 09:51:05 2023 +0000

    Fix indentation

commit 3efab3a
Author: naglera <[email protected]>
Date:   Tue May 23 06:49:18 2023 +0000

    Fix comments

    Removed adlist method and primary block size.
    Use sync write for sending rdb end offset.
    Removed replconf connected sub command. The replica will send ack
    instead.
    Fixed git diff issue.
    Refactored replicationAbortSyncTransfer + replicationAbortSyncTransfer.

commit 6ef6736
Author: naglera <[email protected]>
Date:   Thu May 4 16:41:54 2023 +0000

    void instead of empty params

commit d976a81
Author: naglera <[email protected]>
Date:   Thu May 4 16:03:36 2023 +0000

    Rename repl_data_buf to pending_repl_data

    Fix memory leak
    Added void param to isOngoingRdbChannelSync

commit 9d464e2
Merge: 7297a35 e49c2a5
Author: naglera <[email protected]>
Date:   Thu May 4 14:16:49 2023 +0300

    Merge branch 'unstable' into rdb-channel

commit 7297a35
Author: naglera <[email protected]>
Date:   Thu May 4 11:00:31 2023 +0000

    Use linked list for replication data

    1. The replica side mmap replication buffer has been replaced with a linked list
    replication buffer.
    2. The replica buffer limit now depends on the client-output-buffer-limit type replica.
    3. Buffering policy changed to not cancel sync when replication buffer is full. We
    instead continue to sync without reading from the replication data socket, so when the
    replica side reaches replication buffer limits, the primary replication buffer will
    take part in the replication data buffering.

commit b9bddab
Author: naglera <[email protected]>
Date:   Sun Apr 30 07:52:41 2023 +0000

    fix CI workflow run comments

commit cc74971
Author: naglera <[email protected]>
Date:   Thu Apr 27 08:31:44 2023 +0000

    Rdb channel for full sync

    In this PR we introduce the main benefit of rdb channel by continuously
    steaming the COB in parallel to the RDB and thus keeping the primary
    side COB small AND accelerating the overall sync process.
    By streaming the replication data to the replica during the full sync,
    we reduce
    1. Memory load from the primary node.
    2. CPU load from the primary main process (will be introduced in later
       PR).
    This opens up possibilities to future improvements with better
    TLS connection handling and removal of the the need to pipeline the RDB
    from the child process to the main.

commit 4a93f4d
Author: naglera <[email protected]>
Date:   Thu Apr 27 08:31:44 2023 +0000

    Rdb channel for full sync

    In this PR we introduce the main benefit of rdb channel by continuously
    steaming the COB in parallel to the RDB and thus keeping the primary
    side COB small AND accelerating the overall sync process.
    By streaming the replication data to the replica during the full sync,
    we reduce
    1. Memory load from the primary node.
    2. CPU load from the primary main process (will be introduced in later
       PR).
    This opens up possibilities to future improvements with better
    TLS connection handling and removal of the the need to pipeline the RDB
    from the child process to the main.

Signed-off-by: naglera <[email protected]>
Also added test Sync should continue if not all slaves dropped rdb-channel

Signed-off-by: naglera <[email protected]>
enjoy-binbin added a commit to enjoy-binbin/valkey that referenced this pull request Dec 10, 2025
…d MEMORY STATS

After introducing dual channel replication in valkey-io#60, we added two new info
fields replicas_repl_buffer_size` and `replicas_repl_buffer_peak`to the
info replication section.

However, we did not display the actual memory usage of the replication
buffer in info memory section. Furthermore, since the replicas replication
buffer was not previously included in memory overhead section, this caused
the `used_memory_dataset` and `used_memory_overhead` fields in INFO MEMORY
to display incorrect values, and also caused the `overhead.total` field in
`MEMORY STATS` to display an incorrect value. Previously, on replica nodes,
the replicas replication buffer was included in dataset memory.

Added a new `mem_replicas_repl_buffer` field in INFO MEMORY section. During
the dual channel replication, it show total memory consumed by the replicas
replication buffer on replica side.

Added a `new replicas.repl.buffer` field in MEMORY STATS. It is the same
metric as `mem_replicas_repl_buffer`.

Signed-off-by: Binbin <[email protected]>
enjoy-binbin added a commit to enjoy-binbin/valkey that referenced this pull request Dec 18, 2025
In dual channel replication, when the replica abort the replication,
if the local buffer is huge, free it in async way to avoid the latency.
Dual channel replication was introduced in valkey-io#60.

Signed-off-by: Binbin <[email protected]>
aradz44 pushed a commit to aradz44/valkey that referenced this pull request Dec 23, 2025
…mited by COB (valkey-io#2824)

After introducing the dual channel replication in valkey-io#60, we decided in valkey-io#915
not to add a new configuration item to limit the replica's local replication
buffer, just use "client-output-buffer-limit replica hard" to limit it.

We need to document this behavior and mention that once the limit is reached,
all future data will accumulate in the primary side.

Signed-off-by: Binbin <[email protected]>
enjoy-binbin added a commit to enjoy-binbin/valkey that referenced this pull request Dec 26, 2025
There is a crash in freeReplicationBacklog:
```
Discarding previously cached primary state.
ASSERTION FAILED
'listLength(server.replicas) == 0' is not true
freeReplicationBacklog
```

The reason is that during dual channel operation, the RDB channel is protected.
In the chained replica case, `disconnectReplicas` is called to disconnect all
replica clients, but since the RDB channel is protected, `freeClient` does not
actually free the replica client. Later, we encounter an assertion failure in
`freeReplicationBacklog`.
```
void replicationAttachToNewPrimary(void) {
    /* Replica starts to apply data from new primary, we must discard the cached
     * primary structure. */
    serverAssert(server.primary == NULL);
    replicationDiscardCachedPrimary();

    /* Cancel any in progress imports (we will now use the primary's) */
    clusterCleanSlotImportsOnFullSync();

    disconnectReplicas();     /* Force our replicas to resync with us as well. */
    freeReplicationBacklog(); /* Don't allow our chained replicas to PSYNC. */
}
```

Dual channel replication was introduced in valkey-io#60.
Signed-off-by: Binbin <[email protected]>
enjoy-binbin added a commit that referenced this pull request Dec 29, 2025
There is a crash in freeReplicationBacklog:
```
Discarding previously cached primary state.
ASSERTION FAILED
'listLength(server.replicas) == 0' is not true
freeReplicationBacklog
```

The reason is that during dual channel operation, the RDB channel is protected.
In the chained replica case, `disconnectReplicas` is called to disconnect all
replica clients, but since the RDB channel is protected, `freeClient` does not
actually free the replica client. Later, we encounter an assertion failure in
`freeReplicationBacklog`.
```
void replicationAttachToNewPrimary(void) {
    /* Replica starts to apply data from new primary, we must discard the cached
     * primary structure. */
    serverAssert(server.primary == NULL);
    replicationDiscardCachedPrimary();

    /* Cancel any in progress imports (we will now use the primary's) */
    clusterCleanSlotImportsOnFullSync();

    disconnectReplicas();     /* Force our replicas to resync with us as well. */
    freeReplicationBacklog(); /* Don't allow our chained replicas to PSYNC. */
}
```

Dual channel replication was introduced in #60.

Signed-off-by: Binbin <[email protected]>
jjuleslasarte pushed a commit to jjuleslasarte/valkey that referenced this pull request Jan 6, 2026
…mited by COB (valkey-io#2824)

After introducing the dual channel replication in valkey-io#60, we decided in valkey-io#915
not to add a new configuration item to limit the replica's local replication
buffer, just use "client-output-buffer-limit replica hard" to limit it.

We need to document this behavior and mention that once the limit is reached,
all future data will accumulate in the primary side.

Signed-off-by: Binbin <[email protected]>
Signed-off-by: jjuleslasarte <[email protected]>
jjuleslasarte pushed a commit to jjuleslasarte/valkey that referenced this pull request Jan 6, 2026
…mited by COB (valkey-io#2824)

After introducing the dual channel replication in valkey-io#60, we decided in valkey-io#915
not to add a new configuration item to limit the replica's local replication
buffer, just use "client-output-buffer-limit replica hard" to limit it.

We need to document this behavior and mention that once the limit is reached,
all future data will accumulate in the primary side.

Signed-off-by: Binbin <[email protected]>
Signed-off-by: jjuleslasarte <[email protected]>
jdheyburn pushed a commit to jdheyburn/valkey that referenced this pull request Jan 8, 2026
…mited by COB (valkey-io#2824)

After introducing the dual channel replication in valkey-io#60, we decided in valkey-io#915
not to add a new configuration item to limit the replica's local replication
buffer, just use "client-output-buffer-limit replica hard" to limit it.

We need to document this behavior and mention that once the limit is reached,
all future data will accumulate in the primary side.

Signed-off-by: Binbin <[email protected]>
jdheyburn pushed a commit to jdheyburn/valkey that referenced this pull request Jan 8, 2026
…-io#2983)

There is a crash in freeReplicationBacklog:
```
Discarding previously cached primary state.
ASSERTION FAILED
'listLength(server.replicas) == 0' is not true
freeReplicationBacklog
```

The reason is that during dual channel operation, the RDB channel is protected.
In the chained replica case, `disconnectReplicas` is called to disconnect all
replica clients, but since the RDB channel is protected, `freeClient` does not
actually free the replica client. Later, we encounter an assertion failure in
`freeReplicationBacklog`.
```
void replicationAttachToNewPrimary(void) {
    /* Replica starts to apply data from new primary, we must discard the cached
     * primary structure. */
    serverAssert(server.primary == NULL);
    replicationDiscardCachedPrimary();

    /* Cancel any in progress imports (we will now use the primary's) */
    clusterCleanSlotImportsOnFullSync();

    disconnectReplicas();     /* Force our replicas to resync with us as well. */
    freeReplicationBacklog(); /* Don't allow our chained replicas to PSYNC. */
}
```

Dual channel replication was introduced in valkey-io#60.

Signed-off-by: Binbin <[email protected]>
zuiderkwast pushed a commit to zuiderkwast/placeholderkv that referenced this pull request Jan 29, 2026
…-io#2983)

There is a crash in freeReplicationBacklog:
```
Discarding previously cached primary state.
ASSERTION FAILED
'listLength(server.replicas) == 0' is not true
freeReplicationBacklog
```

The reason is that during dual channel operation, the RDB channel is protected.
In the chained replica case, `disconnectReplicas` is called to disconnect all
replica clients, but since the RDB channel is protected, `freeClient` does not
actually free the replica client. Later, we encounter an assertion failure in
`freeReplicationBacklog`.
```
void replicationAttachToNewPrimary(void) {
    /* Replica starts to apply data from new primary, we must discard the cached
     * primary structure. */
    serverAssert(server.primary == NULL);
    replicationDiscardCachedPrimary();

    /* Cancel any in progress imports (we will now use the primary's) */
    clusterCleanSlotImportsOnFullSync();

    disconnectReplicas();     /* Force our replicas to resync with us as well. */
    freeReplicationBacklog(); /* Don't allow our chained replicas to PSYNC. */
}
```

Dual channel replication was introduced in valkey-io#60.

Signed-off-by: Binbin <[email protected]>
roshkhatri pushed a commit to roshkhatri/valkey that referenced this pull request Jan 29, 2026
…-io#2983)

There is a crash in freeReplicationBacklog:
```
Discarding previously cached primary state.
ASSERTION FAILED
'listLength(server.replicas) == 0' is not true
freeReplicationBacklog
```

The reason is that during dual channel operation, the RDB channel is protected.
In the chained replica case, `disconnectReplicas` is called to disconnect all
replica clients, but since the RDB channel is protected, `freeClient` does not
actually free the replica client. Later, we encounter an assertion failure in
`freeReplicationBacklog`.
```
void replicationAttachToNewPrimary(void) {
    /* Replica starts to apply data from new primary, we must discard the cached
     * primary structure. */
    serverAssert(server.primary == NULL);
    replicationDiscardCachedPrimary();

    /* Cancel any in progress imports (we will now use the primary's) */
    clusterCleanSlotImportsOnFullSync();

    disconnectReplicas();     /* Force our replicas to resync with us as well. */
    freeReplicationBacklog(); /* Don't allow our chained replicas to PSYNC. */
}
```

Dual channel replication was introduced in valkey-io#60.

Signed-off-by: Binbin <[email protected]>
roshkhatri pushed a commit to roshkhatri/valkey that referenced this pull request Jan 29, 2026
…-io#2983)

There is a crash in freeReplicationBacklog:
```
Discarding previously cached primary state.
ASSERTION FAILED
'listLength(server.replicas) == 0' is not true
freeReplicationBacklog
```

The reason is that during dual channel operation, the RDB channel is protected.
In the chained replica case, `disconnectReplicas` is called to disconnect all
replica clients, but since the RDB channel is protected, `freeClient` does not
actually free the replica client. Later, we encounter an assertion failure in
`freeReplicationBacklog`.
```
void replicationAttachToNewPrimary(void) {
    /* Replica starts to apply data from new primary, we must discard the cached
     * primary structure. */
    serverAssert(server.primary == NULL);
    replicationDiscardCachedPrimary();

    /* Cancel any in progress imports (we will now use the primary's) */
    clusterCleanSlotImportsOnFullSync();

    disconnectReplicas();     /* Force our replicas to resync with us as well. */
    freeReplicationBacklog(); /* Don't allow our chained replicas to PSYNC. */
}
```

Dual channel replication was introduced in valkey-io#60.

Signed-off-by: Binbin <[email protected]>
Signed-off-by: Roshan Khatri <[email protected]>
roshkhatri pushed a commit to roshkhatri/valkey that referenced this pull request Jan 29, 2026
…-io#2983)

There is a crash in freeReplicationBacklog:
```
Discarding previously cached primary state.
ASSERTION FAILED
'listLength(server.replicas) == 0' is not true
freeReplicationBacklog
```

The reason is that during dual channel operation, the RDB channel is protected.
In the chained replica case, `disconnectReplicas` is called to disconnect all
replica clients, but since the RDB channel is protected, `freeClient` does not
actually free the replica client. Later, we encounter an assertion failure in
`freeReplicationBacklog`.
```
void replicationAttachToNewPrimary(void) {
    /* Replica starts to apply data from new primary, we must discard the cached
     * primary structure. */
    serverAssert(server.primary == NULL);
    replicationDiscardCachedPrimary();

    /* Cancel any in progress imports (we will now use the primary's) */
    clusterCleanSlotImportsOnFullSync();

    disconnectReplicas();     /* Force our replicas to resync with us as well. */
    freeReplicationBacklog(); /* Don't allow our chained replicas to PSYNC. */
}
```

Dual channel replication was introduced in valkey-io#60.

Signed-off-by: Binbin <[email protected]>
Signed-off-by: Roshan Khatri <[email protected]>
roshkhatri pushed a commit to roshkhatri/valkey that referenced this pull request Jan 29, 2026
…-io#2983)

There is a crash in freeReplicationBacklog:
```
Discarding previously cached primary state.
ASSERTION FAILED
'listLength(server.replicas) == 0' is not true
freeReplicationBacklog
```

The reason is that during dual channel operation, the RDB channel is protected.
In the chained replica case, `disconnectReplicas` is called to disconnect all
replica clients, but since the RDB channel is protected, `freeClient` does not
actually free the replica client. Later, we encounter an assertion failure in
`freeReplicationBacklog`.
```
void replicationAttachToNewPrimary(void) {
    /* Replica starts to apply data from new primary, we must discard the cached
     * primary structure. */
    serverAssert(server.primary == NULL);
    replicationDiscardCachedPrimary();

    /* Cancel any in progress imports (we will now use the primary's) */
    clusterCleanSlotImportsOnFullSync();

    disconnectReplicas();     /* Force our replicas to resync with us as well. */
    freeReplicationBacklog(); /* Don't allow our chained replicas to PSYNC. */
}
```

Dual channel replication was introduced in valkey-io#60.

Signed-off-by: Binbin <[email protected]>
roshkhatri pushed a commit to roshkhatri/valkey that referenced this pull request Jan 29, 2026
…-io#2983)

There is a crash in freeReplicationBacklog:
```
Discarding previously cached primary state.
ASSERTION FAILED
'listLength(server.replicas) == 0' is not true
freeReplicationBacklog
```

The reason is that during dual channel operation, the RDB channel is protected.
In the chained replica case, `disconnectReplicas` is called to disconnect all
replica clients, but since the RDB channel is protected, `freeClient` does not
actually free the replica client. Later, we encounter an assertion failure in
`freeReplicationBacklog`.
```
void replicationAttachToNewPrimary(void) {
    /* Replica starts to apply data from new primary, we must discard the cached
     * primary structure. */
    serverAssert(server.primary == NULL);
    replicationDiscardCachedPrimary();

    /* Cancel any in progress imports (we will now use the primary's) */
    clusterCleanSlotImportsOnFullSync();

    disconnectReplicas();     /* Force our replicas to resync with us as well. */
    freeReplicationBacklog(); /* Don't allow our chained replicas to PSYNC. */
}
```

Dual channel replication was introduced in valkey-io#60.

Signed-off-by: Binbin <[email protected]>
Signed-off-by: Roshan Khatri <[email protected]>
roshkhatri pushed a commit to roshkhatri/valkey that referenced this pull request Jan 30, 2026
…-io#2983)

There is a crash in freeReplicationBacklog:
```
Discarding previously cached primary state.
ASSERTION FAILED
'listLength(server.replicas) == 0' is not true
freeReplicationBacklog
```

The reason is that during dual channel operation, the RDB channel is protected.
In the chained replica case, `disconnectReplicas` is called to disconnect all
replica clients, but since the RDB channel is protected, `freeClient` does not
actually free the replica client. Later, we encounter an assertion failure in
`freeReplicationBacklog`.
```
void replicationAttachToNewPrimary(void) {
    /* Replica starts to apply data from new primary, we must discard the cached
     * primary structure. */
    serverAssert(server.primary == NULL);
    replicationDiscardCachedPrimary();

    /* Cancel any in progress imports (we will now use the primary's) */
    clusterCleanSlotImportsOnFullSync();

    disconnectReplicas();     /* Force our replicas to resync with us as well. */
    freeReplicationBacklog(); /* Don't allow our chained replicas to PSYNC. */
}
```

Dual channel replication was introduced in valkey-io#60.

Signed-off-by: Binbin <[email protected]>
Signed-off-by: Roshan Khatri <[email protected]>
zuiderkwast pushed a commit to zuiderkwast/placeholderkv that referenced this pull request Jan 30, 2026
…-io#2983)

There is a crash in freeReplicationBacklog:
```
Discarding previously cached primary state.
ASSERTION FAILED
'listLength(server.replicas) == 0' is not true
freeReplicationBacklog
```

The reason is that during dual channel operation, the RDB channel is protected.
In the chained replica case, `disconnectReplicas` is called to disconnect all
replica clients, but since the RDB channel is protected, `freeClient` does not
actually free the replica client. Later, we encounter an assertion failure in
`freeReplicationBacklog`.
```
void replicationAttachToNewPrimary(void) {
    /* Replica starts to apply data from new primary, we must discard the cached
     * primary structure. */
    serverAssert(server.primary == NULL);
    replicationDiscardCachedPrimary();

    /* Cancel any in progress imports (we will now use the primary's) */
    clusterCleanSlotImportsOnFullSync();

    disconnectReplicas();     /* Force our replicas to resync with us as well. */
    freeReplicationBacklog(); /* Don't allow our chained replicas to PSYNC. */
}
```

Dual channel replication was introduced in valkey-io#60.

Signed-off-by: Binbin <[email protected]>
enjoy-binbin added a commit that referenced this pull request Jan 30, 2026
…2948)

In dual channel replication, when the replica abort the replication,
if the local buffer is huge, free it in async way to avoid the latency.
Dual channel replication was introduced in #60.

Signed-off-by: Binbin <[email protected]>
immangat pushed a commit to immangat/valkey that referenced this pull request Feb 1, 2026
…alkey-io#2948)

In dual channel replication, when the replica abort the replication,
if the local buffer is huge, free it in async way to avoid the latency.
Dual channel replication was introduced in valkey-io#60.

Signed-off-by: Binbin <[email protected]>
zuiderkwast pushed a commit that referenced this pull request Feb 3, 2026
There is a crash in freeReplicationBacklog:
```
Discarding previously cached primary state.
ASSERTION FAILED
'listLength(server.replicas) == 0' is not true
freeReplicationBacklog
```

The reason is that during dual channel operation, the RDB channel is protected.
In the chained replica case, `disconnectReplicas` is called to disconnect all
replica clients, but since the RDB channel is protected, `freeClient` does not
actually free the replica client. Later, we encounter an assertion failure in
`freeReplicationBacklog`.
```
void replicationAttachToNewPrimary(void) {
    /* Replica starts to apply data from new primary, we must discard the cached
     * primary structure. */
    serverAssert(server.primary == NULL);
    replicationDiscardCachedPrimary();

    /* Cancel any in progress imports (we will now use the primary's) */
    clusterCleanSlotImportsOnFullSync();

    disconnectReplicas();     /* Force our replicas to resync with us as well. */
    freeReplicationBacklog(); /* Don't allow our chained replicas to PSYNC. */
}
```

Dual channel replication was introduced in #60.

Signed-off-by: Binbin <[email protected]>
roshkhatri pushed a commit to roshkhatri/valkey that referenced this pull request Feb 4, 2026
…-io#2983)

There is a crash in freeReplicationBacklog:
```
Discarding previously cached primary state.
ASSERTION FAILED
'listLength(server.replicas) == 0' is not true
freeReplicationBacklog
```

The reason is that during dual channel operation, the RDB channel is protected.
In the chained replica case, `disconnectReplicas` is called to disconnect all
replica clients, but since the RDB channel is protected, `freeClient` does not
actually free the replica client. Later, we encounter an assertion failure in
`freeReplicationBacklog`.
```
void replicationAttachToNewPrimary(void) {
    /* Replica starts to apply data from new primary, we must discard the cached
     * primary structure. */
    serverAssert(server.primary == NULL);
    replicationDiscardCachedPrimary();

    /* Cancel any in progress imports (we will now use the primary's) */
    clusterCleanSlotImportsOnFullSync();

    disconnectReplicas();     /* Force our replicas to resync with us as well. */
    freeReplicationBacklog(); /* Don't allow our chained replicas to PSYNC. */
}
```

Dual channel replication was introduced in valkey-io#60.

Signed-off-by: Binbin <[email protected]>
Signed-off-by: Roshan Khatri <[email protected]>
roshkhatri pushed a commit to roshkhatri/valkey that referenced this pull request Feb 18, 2026
…-io#2983)

There is a crash in freeReplicationBacklog:
```
Discarding previously cached primary state.
ASSERTION FAILED
'listLength(server.replicas) == 0' is not true
freeReplicationBacklog
```

The reason is that during dual channel operation, the RDB channel is protected.
In the chained replica case, `disconnectReplicas` is called to disconnect all
replica clients, but since the RDB channel is protected, `freeClient` does not
actually free the replica client. Later, we encounter an assertion failure in
`freeReplicationBacklog`.
```
void replicationAttachToNewPrimary(void) {
    /* Replica starts to apply data from new primary, we must discard the cached
     * primary structure. */
    serverAssert(server.primary == NULL);
    replicationDiscardCachedPrimary();

    /* Cancel any in progress imports (we will now use the primary's) */
    clusterCleanSlotImportsOnFullSync();

    disconnectReplicas();     /* Force our replicas to resync with us as well. */
    freeReplicationBacklog(); /* Don't allow our chained replicas to PSYNC. */
}
```

Dual channel replication was introduced in valkey-io#60.

Signed-off-by: Binbin <[email protected]>
Signed-off-by: Roshan Khatri <[email protected]>
harrylin98 pushed a commit to harrylin98/valkey_forked that referenced this pull request Feb 19, 2026
…alkey-io#2948)

In dual channel replication, when the replica abort the replication,
if the local buffer is huge, free it in async way to avoid the latency.
Dual channel replication was introduced in valkey-io#60.

Signed-off-by: Binbin <[email protected]>
roshkhatri pushed a commit to roshkhatri/valkey that referenced this pull request Feb 20, 2026
…-io#2983)

There is a crash in freeReplicationBacklog:
```
Discarding previously cached primary state.
ASSERTION FAILED
'listLength(server.replicas) == 0' is not true
freeReplicationBacklog
```

The reason is that during dual channel operation, the RDB channel is protected.
In the chained replica case, `disconnectReplicas` is called to disconnect all
replica clients, but since the RDB channel is protected, `freeClient` does not
actually free the replica client. Later, we encounter an assertion failure in
`freeReplicationBacklog`.
```
void replicationAttachToNewPrimary(void) {
    /* Replica starts to apply data from new primary, we must discard the cached
     * primary structure. */
    serverAssert(server.primary == NULL);
    replicationDiscardCachedPrimary();

    /* Cancel any in progress imports (we will now use the primary's) */
    clusterCleanSlotImportsOnFullSync();

    disconnectReplicas();     /* Force our replicas to resync with us as well. */
    freeReplicationBacklog(); /* Don't allow our chained replicas to PSYNC. */
}
```

Dual channel replication was introduced in valkey-io#60.

Signed-off-by: Binbin <[email protected]>
Signed-off-by: Roshan Khatri <[email protected]>
madolson pushed a commit that referenced this pull request Feb 24, 2026
There is a crash in freeReplicationBacklog:
```
Discarding previously cached primary state.
ASSERTION FAILED
'listLength(server.replicas) == 0' is not true
freeReplicationBacklog
```

The reason is that during dual channel operation, the RDB channel is protected.
In the chained replica case, `disconnectReplicas` is called to disconnect all
replica clients, but since the RDB channel is protected, `freeClient` does not
actually free the replica client. Later, we encounter an assertion failure in
`freeReplicationBacklog`.
```
void replicationAttachToNewPrimary(void) {
    /* Replica starts to apply data from new primary, we must discard the cached
     * primary structure. */
    serverAssert(server.primary == NULL);
    replicationDiscardCachedPrimary();

    /* Cancel any in progress imports (we will now use the primary's) */
    clusterCleanSlotImportsOnFullSync();

    disconnectReplicas();     /* Force our replicas to resync with us as well. */
    freeReplicationBacklog(); /* Don't allow our chained replicas to PSYNC. */
}
```

Dual channel replication was introduced in #60.

Signed-off-by: Binbin <[email protected]>
Signed-off-by: Roshan Khatri <[email protected]>
madolson pushed a commit that referenced this pull request Feb 24, 2026
There is a crash in freeReplicationBacklog:
```
Discarding previously cached primary state.
ASSERTION FAILED
'listLength(server.replicas) == 0' is not true
freeReplicationBacklog
```

The reason is that during dual channel operation, the RDB channel is protected.
In the chained replica case, `disconnectReplicas` is called to disconnect all
replica clients, but since the RDB channel is protected, `freeClient` does not
actually free the replica client. Later, we encounter an assertion failure in
`freeReplicationBacklog`.
```
void replicationAttachToNewPrimary(void) {
    /* Replica starts to apply data from new primary, we must discard the cached
     * primary structure. */
    serverAssert(server.primary == NULL);
    replicationDiscardCachedPrimary();

    /* Cancel any in progress imports (we will now use the primary's) */
    clusterCleanSlotImportsOnFullSync();

    disconnectReplicas();     /* Force our replicas to resync with us as well. */
    freeReplicationBacklog(); /* Don't allow our chained replicas to PSYNC. */
}
```

Dual channel replication was introduced in #60.

Signed-off-by: Binbin <[email protected]>
Signed-off-by: Roshan Khatri <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

major-decision-approved Major decision approved by TSC team

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

7 participants