Fix crash when running rebalance command in a mixed cluster of 7.0 and 7.2#12604
Fix crash when running rebalance command in a mixed cluster of 7.0 and 7.2#12604madolson merged 3 commits intoredis:unstablefrom
Conversation
…d 7.2 In redis#10536, we introduced the assert, some older versions of servers (like 7.0) doesn't gossip shard_id, so we will not add the node to cluster->shards, and node->shard_id is filled in randomly and may not be found here. It causes that if we add a 7.2 node to a 7.0 cluster and allocate slots to the 7.2 node, the 7.2 node will crash when it hits this assert. Somehow like redis#12538. In this PR, we remove the assert and replace it with plain if. Fixes redis#12603.
|
@PingXie please also take a look, thanks |
|
@madolson @PingXie after this was merged, i found another problem. In this case, the node (7.2) 's nodes.conf will report Unrecoverable error: since the old server did not gossip the shard id, and we have these check when we are loading a node.conf: any ideas on this? |
|
Does this mean, there are still issues with migration? Or is this issue unrelated? |
|
Is there an issue open tracking the corrupted cluster config file problem that I can follow? |
|
please don't worry, we are dealing with it. |
…d 7.2 (#12604) In #10536, we introduced the assert, some older versions of servers (like 7.0) doesn't gossip shard_id, so we will not add the node to cluster->shards, and node->shard_id is filled in randomly and may not be found here. It causes that if we add a 7.2 node to a 7.0 cluster and allocate slots to the 7.2 node, the 7.2 node will crash when it hits this assert. Somehow like #12538. In this PR, we remove the assert and replace it with an unconditional removal. (cherry picked from commit e5ef161)
|
@enjoy-binbin what's the status here? |
|
I'm not quite sure what the correct fix is and am waiting for @madolson opinion. |
|
@PingXie Would you be able to take a look at this issue? |
|
ok, i think i finally have thought of a way. If the replica nodes does not send the shard-id, it means they do not support it, so set their shard-id to the shard-id of the master node. |
|
Sorry I missed this thread. The fix for the original crash seems fine to me. I will comment on the new fix in its own thread. |
In #10536, we introduced the assert, some older versions of servers
(like 7.0) doesn't gossip shard_id, so we will not add the node to
cluster->shards, and node->shard_id is filled in randomly and may not
be found here.
It causes that if we add a 7.2 node to a 7.0 cluster and allocate slots
to the 7.2 node, the 7.2 node will crash when it hits this assert. Somehow
like #12538.
In this PR, we just remove the assert and the search, and just using
removeChannelsInSlot since it will do an return if there are no active
subscription for a given slot.
Fixes #12603.