Fix reconfiguring sub-replica causing data loss when myself change shard_id#944
Conversation
…ard_id In this case, sender is myself's primary, when executing updateShardId, not only the sender's shard_id is updated, but also the shard_id of myself is updated, casuing the subsequent areInSameShard check, that is, the full_sync_required check to fail. This one follow valkey-io#885 and closes valkey-io#942. Signed-off-by: Binbin <[email protected]>
Signed-off-by: Binbin <[email protected]>
|
Is there a good way to write a test case to cover this? I don't know how to trigger this sub-replica case easier. |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## unstable #944 +/- ##
============================================
+ Coverage 70.44% 70.46% +0.02%
============================================
Files 113 114 +1
Lines 61744 61730 -14
============================================
+ Hits 43493 43499 +6
+ Misses 18251 18231 -20
|
I take it that we ended up with a replica (myself) disagreeing with its primary on the shard-id? Do you have a log snippet handy that demonstrates this? Let me take a closer look at the shard id update logic, holistically. There have been a few tricky bugs in this area. I wonder if they are affected by common issues. |
|
The link in #942 has the full logs, but i happend has some 9e03820e6636099e9a10d26a1677ae00b3b452d3: 183d130727181bd5f28b3cf0675cbf0153f6b283: |
|
Capturing the minimum repro setup and event sequence based on offline discussion (thanks for the details, @enjoy-binbin!). Minimum repro setup
Event sequence
Code flow
BugAs part of the recent fix of #885, Does this match your understanding of the issue, @enjoy-binbin? |
|
yes. It match, thank you for the full details, it's very clear |
Signed-off-by: Binbin <[email protected]>
Signed-off-by: Binbin <[email protected]>
Signed-off-by: Binbin <[email protected]>
Signed-off-by: Binbin <[email protected]>
PingXie
left a comment
There was a problem hiding this comment.
LGTM overall. I assume the test failed as we expect without the server fix?
Yes, the test i added will fail if the code without the server fix |
Co-authored-by: Ping Xie <[email protected]> Signed-off-by: Binbin <[email protected]>
Co-authored-by: Ping Xie <[email protected]> Signed-off-by: Binbin <[email protected]>
Co-authored-by: Ping Xie <[email protected]> Signed-off-by: Binbin <[email protected]>
Signed-off-by: Binbin <[email protected]>
Signed-off-by: Binbin <[email protected]>
…ard_id (#944) When reconfiguring sub-replica, there may a case that the sub-replica will use the old offset and win the election and cause the data loss if the old primary went down. In this case, sender is myself's primary, when executing updateShardId, not only the sender's shard_id is updated, but also the shard_id of myself is updated, casuing the subsequent areInSameShard check, that is, the full_sync_required check to fail. As part of the recent fix of #885, the sub-replica needs to decide whether a full sync is required or not when switching shards. This shard membership check is supposed to be done against sub-replica's current shard_id, which however was lost in this code path. This then leads to sub-replica joining the other shard with a completely different and incorrect replication history. This is the only place where replicaof state can be updated on this path so the most natural fix would be to pull the chain replication reduction logic into this code block and before the updateShardId call. This one follow #885 and closes #942. Signed-off-by: Binbin <[email protected]> Co-authored-by: Ping Xie <[email protected]>
…ard_id (#944) When reconfiguring sub-replica, there may a case that the sub-replica will use the old offset and win the election and cause the data loss if the old primary went down. In this case, sender is myself's primary, when executing updateShardId, not only the sender's shard_id is updated, but also the shard_id of myself is updated, casuing the subsequent areInSameShard check, that is, the full_sync_required check to fail. As part of the recent fix of #885, the sub-replica needs to decide whether a full sync is required or not when switching shards. This shard membership check is supposed to be done against sub-replica's current shard_id, which however was lost in this code path. This then leads to sub-replica joining the other shard with a completely different and incorrect replication history. This is the only place where replicaof state can be updated on this path so the most natural fix would be to pull the chain replication reduction logic into this code block and before the updateShardId call. This one follow #885 and closes #942. Signed-off-by: Binbin <[email protected]> Co-authored-by: Ping Xie <[email protected]>
…ard_id (valkey-io#944) When reconfiguring sub-replica, there may a case that the sub-replica will use the old offset and win the election and cause the data loss if the old primary went down. In this case, sender is myself's primary, when executing updateShardId, not only the sender's shard_id is updated, but also the shard_id of myself is updated, casuing the subsequent areInSameShard check, that is, the full_sync_required check to fail. As part of the recent fix of valkey-io#885, the sub-replica needs to decide whether a full sync is required or not when switching shards. This shard membership check is supposed to be done against sub-replica's current shard_id, which however was lost in this code path. This then leads to sub-replica joining the other shard with a completely different and incorrect replication history. This is the only place where replicaof state can be updated on this path so the most natural fix would be to pull the chain replication reduction logic into this code block and before the updateShardId call. This one follow valkey-io#885 and closes valkey-io#942. Signed-off-by: Binbin <[email protected]> Co-authored-by: Ping Xie <[email protected]>
…ard_id (valkey-io#944) When reconfiguring sub-replica, there may a case that the sub-replica will use the old offset and win the election and cause the data loss if the old primary went down. In this case, sender is myself's primary, when executing updateShardId, not only the sender's shard_id is updated, but also the shard_id of myself is updated, casuing the subsequent areInSameShard check, that is, the full_sync_required check to fail. As part of the recent fix of valkey-io#885, the sub-replica needs to decide whether a full sync is required or not when switching shards. This shard membership check is supposed to be done against sub-replica's current shard_id, which however was lost in this code path. This then leads to sub-replica joining the other shard with a completely different and incorrect replication history. This is the only place where replicaof state can be updated on this path so the most natural fix would be to pull the chain replication reduction logic into this code block and before the updateShardId call. This one follow valkey-io#885 and closes valkey-io#942. Signed-off-by: Binbin <[email protected]> Co-authored-by: Ping Xie <[email protected]>
…ard_id (valkey-io#944) When reconfiguring sub-replica, there may a case that the sub-replica will use the old offset and win the election and cause the data loss if the old primary went down. In this case, sender is myself's primary, when executing updateShardId, not only the sender's shard_id is updated, but also the shard_id of myself is updated, casuing the subsequent areInSameShard check, that is, the full_sync_required check to fail. As part of the recent fix of valkey-io#885, the sub-replica needs to decide whether a full sync is required or not when switching shards. This shard membership check is supposed to be done against sub-replica's current shard_id, which however was lost in this code path. This then leads to sub-replica joining the other shard with a completely different and incorrect replication history. This is the only place where replicaof state can be updated on this path so the most natural fix would be to pull the chain replication reduction logic into this code block and before the updateShardId call. This one follow valkey-io#885 and closes valkey-io#942. Signed-off-by: Binbin <[email protected]> Co-authored-by: Ping Xie <[email protected]>
…ard_id (valkey-io#944) When reconfiguring sub-replica, there may a case that the sub-replica will use the old offset and win the election and cause the data loss if the old primary went down. In this case, sender is myself's primary, when executing updateShardId, not only the sender's shard_id is updated, but also the shard_id of myself is updated, casuing the subsequent areInSameShard check, that is, the full_sync_required check to fail. As part of the recent fix of valkey-io#885, the sub-replica needs to decide whether a full sync is required or not when switching shards. This shard membership check is supposed to be done against sub-replica's current shard_id, which however was lost in this code path. This then leads to sub-replica joining the other shard with a completely different and incorrect replication history. This is the only place where replicaof state can be updated on this path so the most natural fix would be to pull the chain replication reduction logic into this code block and before the updateShardId call. This one follow valkey-io#885 and closes valkey-io#942. Signed-off-by: Binbin <[email protected]> Co-authored-by: Ping Xie <[email protected]> Signed-off-by: Ping Xie <[email protected]>
When reconfiguring sub-replica, there may a case that the sub-replica will
use the old offset and win the election and cause the data loss if the old
primary went down.
In this case, sender is myself's primary, when executing updateShardId,
not only the sender's shard_id is updated, but also the shard_id of
myself is updated, casuing the subsequent areInSameShard check, that is,
the full_sync_required check to fail.
As part of the recent fix of #885, the sub-replica needs to decide whether
a full sync is required or not when switching shards. This shard membership
check is supposed to be done against sub-replica's current shard_id, which
however was lost in this code path. This then leads to sub-replica joining
the other shard with a completely different and incorrect replication history.
This is the only place where replicaof state can be updated on this path
so the most natural fix would be to pull the chain replication reduction
logic into this code block and before the updateShardId call.
This one follow #885 and closes #942.