Fix reconfiguring sub-replica causing data loss when myself change shard_id by enjoy-binbin · Pull Request #944 · valkey-io/valkey

enjoy-binbin · 2024-08-26T07:34:23Z

When reconfiguring sub-replica, there may a case that the sub-replica will
use the old offset and win the election and cause the data loss if the old
primary went down.

In this case, sender is myself's primary, when executing updateShardId,
not only the sender's shard_id is updated, but also the shard_id of
myself is updated, casuing the subsequent areInSameShard check, that is,
the full_sync_required check to fail.

As part of the recent fix of #885, the sub-replica needs to decide whether
a full sync is required or not when switching shards. This shard membership
check is supposed to be done against sub-replica's current shard_id, which
however was lost in this code path. This then leads to sub-replica joining
the other shard with a completely different and incorrect replication history.

This is the only place where replicaof state can be updated on this path
so the most natural fix would be to pull the chain replication reduction
logic into this code block and before the updateShardId call.

This one follow #885 and closes #942.

…ard_id In this case, sender is myself's primary, when executing updateShardId, not only the sender's shard_id is updated, but also the shard_id of myself is updated, casuing the subsequent areInSameShard check, that is, the full_sync_required check to fail. This one follow valkey-io#885 and closes valkey-io#942. Signed-off-by: Binbin <[email protected]>

Signed-off-by: Binbin <[email protected]>

enjoy-binbin · 2024-08-26T07:37:24Z

Is there a good way to write a test case to cover this? I don't know how to trigger this sub-replica case easier.

Full CI: https://github.com/valkey-io/valkey/actions/runs/10555652357

codecov · 2024-08-26T07:51:31Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.46%. Comparing base (4fe8320) to head (f5cd658).
Report is 9 commits behind head on unstable.

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable     #944      +/-   ##
============================================
+ Coverage     70.44%   70.46%   +0.02%     
============================================
  Files           113      114       +1     
  Lines         61744    61730      -14     
============================================
+ Hits          43493    43499       +6     
+ Misses        18251    18231      -20

Files with missing lines	Coverage Δ
src/cluster_legacy.c	`85.89% <100.00%> (+0.32%)`	⬆️

... and 21 files with indirect coverage changes

PingXie · 2024-08-27T04:42:07Z

In this case, sender is myself's primary, when executing updateShardId,
not only the sender's shard_id is updated, but also the shard_id of
myself is updated,

I take it that we ended up with a replica (myself) disagreeing with its primary on the shard-id? Do you have a log snippet handy that demonstrates this? Let me take a closer look at the shard id update logic, holistically. There have been a few tricky bugs in this area. I wonder if they are affected by common issues.

enjoy-binbin · 2024-08-27T04:49:56Z

The link in #942 has the full logs, but i happend has some

9e03820e6636099e9a10d26a1677ae00b3b452d3:

# The primary
33772:M 25 Aug 2024 01:14:36.928 * Migrating slot 0 to node 3081a0e079dd41c05fc5f8f7078f04c1816a8fe7 ()
33772:M 25 Aug 2024 01:14:37.027 * Slot 0 is migrated from node 9e03820e6636099e9a10d26a1677ae00b3b452d3 () in shard 1338c68632657b706e6233d3ad9ae051427d1b6b to node 3081a0e079dd41c05fc5f8f7078f04c1816a8fe7 () in shard 7b9ca1414bb71e553cce31e358efb7fef29e7e32.
33772:M 25 Aug 2024 01:14:37.029 * Slot 0 is no longer being migrated to node 3081a0e079dd41c05fc5f8f7078f04c1816a8fe7 () in shard 7b9ca1414bb71e553cce31e358efb7fef29e7e32.
33772:M 25 Aug 2024 01:14:37.043 * Configuration change detected. Reconfiguring myself as a replica of node 3081a0e079dd41c05fc5f8f7078f04c1816a8fe7 () in shard 7b9ca1414bb71e553cce31e358efb7fef29e7e32

# The offset is 0 as we expected
33772:S 25 Aug 2024 01:14:37.055 * Connecting to PRIMARY 127.0.0.1:21455
33772:S 25 Aug 2024 01:14:37.057 * PRIMARY <-> REPLICA sync started
33772:S 25 Aug 2024 01:14:37.083 * Non blocking connect for SYNC fired the event.
33772:S 25 Aug 2024 01:14:37.090 * Primary replied to PING, replication can continue...
33772:S 25 Aug 2024 01:14:37.098 * Partial resynchronization not possible (no cached primary)
33772:S 25 Aug 2024 01:14:37.103 * Full resync from primary: 42896f8f4927136a9514b738af65fbb571806be4:32118
33772:S 25 Aug 2024 01:14:37.109 - Client closed connection id=6 addr=127.0.0.1:34708 laddr=127.0.0.1:21452 fd=27 name= age=19 idle=0 flags=S db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=16384 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=1 omem=16448 tot-mem=34758 events=r cmd=replconf user=default redir=-1 resp=2 lib-name= lib-ver= tot-net-in=974 tot-net-out=635598 tot-cmds=26
33772:S 25 Aug 2024 01:14:37.110 * Connection with replica 127.0.0.1:21448 lost.
33772:S 25 Aug 2024 01:14:37.152 * Node 183d130727181bd5f28b3cf0675cbf0153f6b283 () is now a replica of node 3081a0e079dd41c05fc5f8f7078f04c1816a8fe7 () in shard 1338c68632657b706e6233d3ad9ae051427d1b6b
33772:S 25 Aug 2024 01:14:48.593 * Start of election delayed for 2885 milliseconds (rank #2, offset 0).

183d130727181bd5f28b3cf0675cbf0153f6b283:

# The replica
33547:S 25 Aug 2024 01:14:36.914 * Migrating slot 0 to node 3081a0e079dd41c05fc5f8f7078f04c1816a8fe7 ()
33547:S 25 Aug 2024 01:14:37.029 * Assigning slot 0 to node 3081a0e079dd41c05fc5f8f7078f04c1816a8fe7 () in shard 7b9ca1414bb71e553cce31e358efb7fef29e7e32
33547:S 25 Aug 2024 01:14:37.097 * Node 9e03820e6636099e9a10d26a1677ae00b3b452d3 () is no longer primary of shard 1338c68632657b706e6233d3ad9ae051427d1b6b; removed all 0 slot(s) it used to own
33547:S 25 Aug 2024 01:14:37.097 * Node 9e03820e6636099e9a10d26a1677ae00b3b452d3 () is now part of shard 7b9ca1414bb71e553cce31e358efb7fef29e7e32

# In here, the replica's primary send its PING, and the replica need to update the sender's shard_id, and since the sender is myself's primary,
# here we also update myself's shard_id. see `updateShardId(sender, sender_claimed_primary->shard_id)`
33547:S 25 Aug 2024 01:14:37.097 * Node 9e03820e6636099e9a10d26a1677ae00b3b452d3 () is now a replica of node 3081a0e079dd41c05fc5f8f7078f04c1816a8fe7 () in shard 7b9ca1414bb71e553cce31e358efb7fef29e7e32

# And then, doing the reconfiguring. Since myself's shard_id is updated in the above, here `!areInSameShard(myself->replicaof->replicaof, myself)`
# check failed, and we think we can do a psync and remain the old offset.
33547:S 25 Aug 2024 01:14:37.099 * I'm a sub-replica! Reconfiguring myself as a replica of 3081a0e079dd41c05fc5f8f7078f04c1816a8fe7 from 9e03820e6636099e9a10d26a1677ae00b3b452d3

# Doing the psync with the old offset
33547:M 25 Aug 2024 01:14:37.103 * Connection with primary lost.
33547:M 25 Aug 2024 01:14:37.104 * Caching the disconnected primary state.
33547:S 25 Aug 2024 01:14:37.110 * Connecting to PRIMARY 127.0.0.1:21455
33547:S 25 Aug 2024 01:14:37.110 * PRIMARY <-> REPLICA sync started
33547:S 25 Aug 2024 01:14:37.121 * Non blocking connect for SYNC fired the event.
33547:S 25 Aug 2024 01:14:37.123 * Primary replied to PING, replication can continue...
33547:S 25 Aug 2024 01:14:37.151 * Trying a partial resynchronization (request a23ef12063d15051cfb636553d48c95933a8429f:317789).
33547:S 25 Aug 2024 01:14:37.244 - Client closed connection id=8 addr=127.0.0.1:52912 laddr=127.0.0.1:21448 fd=28 name= age=1 idle=1 flags=N db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=16384 argv-mem=0 multi-mem=0 rbs=1024 rbp=987 obl=0 oll=0 omem=0 tot-mem=18310 events=r cmd=cluster|nodes user=default redir=-1 resp=2 lib-name= lib-ver= tot-net-in=56 tot-net-out=1974 tot-cmds=2
33547:S 25 Aug 2024 01:14:47.522 # Failed to read response from the server: Success
33547:S 25 Aug 2024 01:14:47.522 # Primary did not reply to PSYNC, will try later
33547:S 25 Aug 2024 01:14:47.604 - Connection with Node 3081a0e079dd41c05fc5f8f7078f04c1816a8fe7 at 127.0.0.1:31455 failed: Connection refused
33547:S 25 Aug 2024 01:14:48.008 * Connecting to PRIMARY 127.0.0.1:21455
33547:S 25 Aug 2024 01:14:48.008 * PRIMARY <-> REPLICA sync started
33547:S 25 Aug 2024 01:14:48.009 # Error condition on socket for SYNC: Connection refused
33547:S 25 Aug 2024 01:14:48.010 - Connection with Node 3081a0e079dd41c05fc5f8f7078f04c1816a8fe7 at 127.0.0.1:31455 failed: Connection refused
33547:S 25 Aug 2024 01:14:48.415 * NODE 3081a0e079dd41c05fc5f8f7078f04c1816a8fe7 () possibly failing.
33547:S 25 Aug 2024 01:14:48.418 - Connection with Node 3081a0e079dd41c05fc5f8f7078f04c1816a8fe7 at 127.0.0.1:31455 failed: Connection refused
33547:S 25 Aug 2024 01:14:48.512 * FAIL message received from 9e03820e6636099e9a10d26a1677ae00b3b452d3 () about 3081a0e079dd41c05fc5f8f7078f04c1816a8fe7 ()
33547:S 25 Aug 2024 01:14:48.513 # Cluster state changed: fail

# Using the old offset. It is wrong
33547:S 25 Aug 2024 01:14:48.521 * Start of election delayed for 729 milliseconds (rank #0, offset 317788).

src/cluster_legacy.c

PingXie · 2024-08-27T17:26:27Z

Capturing the minimum repro setup and event sequence based on offline discussion (thanks for the details, @enjoy-binbin!).

Minimum repro setup

two shards, shard 1 with primary 1 and replica 2, and shard 2 with primary 2
shard 1 has one slot, slot 0, while shard 2 has the remaining slots
cluster-allow-replica-migration is enabled on primary 1 (which defaults to true)

Event sequence

now migrate slot 0 from shard 1 to shard 2
primary 1 loses slot 0, which is its last slot, to primary 2
because cluster-allow-replica-migration (this name seems a misnomer btw) is enabled on primary 1, primary 1 joins shard 2, following primary 2.
primary 1 sends PING to replica 1 claiming its replicaof is primary 2
replica 1 processes the PING message in clusterProcessPacket

Code flow

replica 1 first notices that primary 1 claims a different replicaof, changing from NULL to primary 2 at

valkey/src/cluster_legacy.c

Line 3311 in 54c0f74

if (sender_claimed_primary && sender->replicaof != sender_claimed_primary) {
replica 1 then updates its view of primary 1's replicaof at

valkey/src/cluster_legacy.c

Line 3317 in 54c0f74

sender->replicaof = sender_claimed_primary;
and updates primary 1's shard_id at

valkey/src/cluster_legacy.c

Line 3321 in 54c0f74

updateShardId(sender, sender_claimed_primary->shard_id);
it then proceeds to the chain replication reduction logic at

valkey/src/cluster_legacy.c

Line 3388 in 54c0f74

if (myself->replicaof && myself->replicaof->replicaof && myself->replicaof->replicaof != myself) {
there it decides to replicate directly from primary 1's replicaof, which is primary 2, at

valkey/src/cluster_legacy.c

Line 3401 in 54c0f74

clusterSetPrimary(myself->replicaof->replicaof, 1, !areInSameShard(myself->replicaof->replicaof, myself));

Bug

As part of the recent fix of #885, replica 1 needs to decide whether a full sync is required or not when switching shards (from shard 1 to shard 2). This shard membership check is supposed to be done against replica 1's current shard_id, which however was lost in step 3 of "code flow". This then leads to replica 1 joining shard 2 with a completely different and incorrect replication history.

Does this match your understanding of the issue, @enjoy-binbin?

enjoy-binbin · 2024-08-28T01:57:09Z

yes. It match, thank you for the full details, it's very clear

Signed-off-by: Binbin <[email protected]>

PingXie

LGTM overall. I assume the test failed as we expect without the server fix?

src/cluster_legacy.c

enjoy-binbin · 2024-08-29T06:35:48Z

LGTM overall. I assume the test failed as we expect without the server fix?

Yes, the test i added will fail if the code without the server fix

Co-authored-by: Ping Xie <[email protected]> Signed-off-by: Binbin <[email protected]>

Signed-off-by: Binbin <[email protected]>

…ard_id (#944) When reconfiguring sub-replica, there may a case that the sub-replica will use the old offset and win the election and cause the data loss if the old primary went down. In this case, sender is myself's primary, when executing updateShardId, not only the sender's shard_id is updated, but also the shard_id of myself is updated, casuing the subsequent areInSameShard check, that is, the full_sync_required check to fail. As part of the recent fix of #885, the sub-replica needs to decide whether a full sync is required or not when switching shards. This shard membership check is supposed to be done against sub-replica's current shard_id, which however was lost in this code path. This then leads to sub-replica joining the other shard with a completely different and incorrect replication history. This is the only place where replicaof state can be updated on this path so the most natural fix would be to pull the chain replication reduction logic into this code block and before the updateShardId call. This one follow #885 and closes #942. Signed-off-by: Binbin <[email protected]> Co-authored-by: Ping Xie <[email protected]>

…ard_id (valkey-io#944) When reconfiguring sub-replica, there may a case that the sub-replica will use the old offset and win the election and cause the data loss if the old primary went down. In this case, sender is myself's primary, when executing updateShardId, not only the sender's shard_id is updated, but also the shard_id of myself is updated, casuing the subsequent areInSameShard check, that is, the full_sync_required check to fail. As part of the recent fix of valkey-io#885, the sub-replica needs to decide whether a full sync is required or not when switching shards. This shard membership check is supposed to be done against sub-replica's current shard_id, which however was lost in this code path. This then leads to sub-replica joining the other shard with a completely different and incorrect replication history. This is the only place where replicaof state can be updated on this path so the most natural fix would be to pull the chain replication reduction logic into this code block and before the updateShardId call. This one follow valkey-io#885 and closes valkey-io#942. Signed-off-by: Binbin <[email protected]> Co-authored-by: Ping Xie <[email protected]>

…ard_id (valkey-io#944) When reconfiguring sub-replica, there may a case that the sub-replica will use the old offset and win the election and cause the data loss if the old primary went down. In this case, sender is myself's primary, when executing updateShardId, not only the sender's shard_id is updated, but also the shard_id of myself is updated, casuing the subsequent areInSameShard check, that is, the full_sync_required check to fail. As part of the recent fix of valkey-io#885, the sub-replica needs to decide whether a full sync is required or not when switching shards. This shard membership check is supposed to be done against sub-replica's current shard_id, which however was lost in this code path. This then leads to sub-replica joining the other shard with a completely different and incorrect replication history. This is the only place where replicaof state can be updated on this path so the most natural fix would be to pull the chain replication reduction logic into this code block and before the updateShardId call. This one follow valkey-io#885 and closes valkey-io#942. Signed-off-by: Binbin <[email protected]> Co-authored-by: Ping Xie <[email protected]> Signed-off-by: Ping Xie <[email protected]>

enjoy-binbin requested a review from PingXie August 26, 2024 07:34

enjoy-binbin added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Aug 26, 2024

fix build warning

18dae3e

Signed-off-by: Binbin <[email protected]>

PingXie reviewed Aug 27, 2024

View reviewed changes

src/cluster_legacy.c Show resolved Hide resolved

enjoy-binbin added 4 commits August 28, 2024 11:14

Merge remote-tracking branch 'upstream/unstable' into fix_data_loss

a031685

Signed-off-by: Binbin <[email protected]>

pull the chain replication reduction logic before the update of shard_id

361f617

Signed-off-by: Binbin <[email protected]>

fix timing issue

ed589db

Signed-off-by: Binbin <[email protected]>

fix cluster down timing issue

187352f

Signed-off-by: Binbin <[email protected]>

PingXie reviewed Aug 29, 2024

View reviewed changes

src/cluster_legacy.c Outdated Show resolved Hide resolved

src/cluster_legacy.c Outdated Show resolved Hide resolved

src/cluster_legacy.c Outdated Show resolved Hide resolved

enjoy-binbin and others added 3 commits August 29, 2024 14:42

Update src/cluster_legacy.c

f7c4367

Co-authored-by: Ping Xie <[email protected]> Signed-off-by: Binbin <[email protected]>

Update src/cluster_legacy.c

87dcb22

Co-authored-by: Ping Xie <[email protected]> Signed-off-by: Binbin <[email protected]>

Update src/cluster_legacy.c

7788208

Co-authored-by: Ping Xie <[email protected]> Signed-off-by: Binbin <[email protected]>

PingXie approved these changes Aug 29, 2024

View reviewed changes

enjoy-binbin added 2 commits August 29, 2024 14:49

fix format

44c84f5

Signed-off-by: Binbin <[email protected]>

fix format

f5cd658

Signed-off-by: Binbin <[email protected]>

enjoy-binbin merged commit ecbfb6a into valkey-io:unstable Aug 29, 2024

enjoy-binbin deleted the fix_data_loss branch August 29, 2024 14:39

lilpoozie2005 approved these changes Sep 11, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix reconfiguring sub-replica causing data loss when myself change shard_id#944

Fix reconfiguring sub-replica causing data loss when myself change shard_id#944
enjoy-binbin merged 11 commits intovalkey-io:unstablefrom
enjoy-binbin:fix_data_loss

enjoy-binbin commented Aug 26, 2024 •

edited

Loading

Uh oh!

enjoy-binbin commented Aug 26, 2024 •

edited

Loading

Uh oh!

codecov bot commented Aug 26, 2024 •

edited

Loading

Uh oh!

PingXie commented Aug 27, 2024

Uh oh!

enjoy-binbin commented Aug 27, 2024 •

edited

Loading

Uh oh!

Uh oh!

PingXie commented Aug 27, 2024

Uh oh!

enjoy-binbin commented Aug 28, 2024

Uh oh!

PingXie left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

enjoy-binbin commented Aug 29, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

enjoy-binbin commented Aug 26, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

enjoy-binbin commented Aug 26, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov bot commented Aug 26, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

PingXie commented Aug 27, 2024

Uh oh!

enjoy-binbin commented Aug 27, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

PingXie commented Aug 27, 2024

Minimum repro setup

Event sequence

Code flow

Bug

Uh oh!

enjoy-binbin commented Aug 28, 2024

Uh oh!

PingXie left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

enjoy-binbin commented Aug 29, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

enjoy-binbin commented Aug 26, 2024 •

edited

Loading

enjoy-binbin commented Aug 26, 2024 •

edited

Loading

codecov bot commented Aug 26, 2024 •

edited

Loading

enjoy-binbin commented Aug 27, 2024 •

edited

Loading