[BUG] Redis Cluster node after becoming Replica allows adding of primary slot range #12717

@codeplayer14

Bug Description:
Version: Redis version=7.0.11

Hi, I observed an issue with Redis Cluster nodes. The following sequence happened with our automated Redis management control plane:

  1. Node N1 was issued a CLUSTER REPLICATE command.
  2. The control plane checked whether the node was unassigned. Replication had not happened yet, so the check returned unassigned.
  3. Node N1 was reassigned to the primary role and was issued CLUSTER ADDSLOTSRANGE for a different slot range.
  4. In the final state, the node was both a replica and a primary:

CLUSTER NODES output for the node on port 7001

3dd70494e109dfdfa44fd2ff69d1a12be1f3642b 172.18.133.131:7001@17001,padgupta-lappy. myself,slave,nofailover c23aa8a150e0291eaad4b88cc505b88b8b2479b0 0 1698913926000 3 connected 8192-16383

c23aa8a150e0291eaad4b88cc505b88b8b2479b0 172.18.133.131:7000@17000,padgupta-lappy. master,nofailover - 0 1698914350449 3 connected 0-8191

CLUSTER NODES output for the node on port 7000

c23aa8a150e0291eaad4b88cc505b88b8b2479b0 172.18.133.131:7000@17000,padgupta-lappy. myself,master,nofailover - 0 0 3 connected 0-8191

3dd70494e109dfdfa44fd2ff69d1a12be1f3642b 172.18.133.131:7001@17001,padgupta-lappy. slave,nofailover c23aa8a150e0291eaad4b88cc505b88b8b2479b0 0 1698914345168 3 connected

After this, resending the CLUSTER REPLICATE command to node N1 crashes Redis.

I'm not sure if this dual role is expected in the first place, but it does cause a crash later.
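
For reference, here is a minimal Go sketch of how the dual role in the CLUSTER NODES output above could be detected from a single line. It is only an illustration, not part of Redis or any client library: `nodeView`, `parseNodeLine`, and `isReplicaWithSlots` are hypothetical names, and the sketch assumes the space-separated field layout seen in the 7.0.11 output here (ID, address, flags, master ID, ping, pong, epoch, link state, then slots).

```go
package main

import (
	"fmt"
	"strings"
)

// nodeView holds the CLUSTER NODES fields relevant to this report.
// Hypothetical helper type, not part of any Redis client.
type nodeView struct {
	ID       string
	Flags    []string
	MasterID string   // "-" when the node has no master
	Slots    []string // raw slot tokens, e.g. "8192-16383"
}

// parseNodeLine splits one CLUSTER NODES line into a nodeView.
// Assumes at least the 8 fixed fields, followed by zero or more slot tokens.
func parseNodeLine(line string) (nodeView, error) {
	f := strings.Fields(line)
	if len(f) < 8 {
		return nodeView{}, fmt.Errorf("expected at least 8 fields, got %d", len(f))
	}
	return nodeView{
		ID:       f[0],
		Flags:    strings.Split(f[2], ","),
		MasterID: f[3],
		Slots:    f[8:],
	}, nil
}

// isReplicaWithSlots reports whether a node is flagged "slave"
// while still listing slots of its own.
func (n nodeView) isReplicaWithSlots() bool {
	for _, fl := range n.Flags {
		if fl == "slave" {
			return len(n.Slots) > 0
		}
	}
	return false
}

func main() {
	lines := []string{
		// Node 7001 as seen by itself.
		"3dd70494e109dfdfa44fd2ff69d1a12be1f3642b 172.18.133.131:7001@17001,padgupta-lappy. myself,slave,nofailover c23aa8a150e0291eaad4b88cc505b88b8b2479b0 0 1698913926000 3 connected 8192-16383",
		// Node 7001 as seen by node 7000.
		"3dd70494e109dfdfa44fd2ff69d1a12be1f3642b 172.18.133.131:7001@17001,padgupta-lappy. slave,nofailover c23aa8a150e0291eaad4b88cc505b88b8b2479b0 0 1698914345168 3 connected",
	}
	for _, l := range lines {
		n, err := parseNodeLine(l)
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		fmt.Printf("%s replica-with-slots=%v\n", n.ID, n.isReplicaWithSlots())
	}
}
```

With the two lines from this report, only node 7001's own view is flagged; the view from node 7000 lists no slots for the same node.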

BUG REPORT:

------ STACK TRACE ------

Backtrace:
/usr/bin/redis-server *:7001 [cluster] (clusterSetMaster+0xdf)[0x563c35e2a22f]
/usr/bin/redis-server *:7001 [cluster] (clusterCommand+0x1c13)[0x563c35e31823]
/usr/bin/redis-server *:7001 [cluster] (call+0xee)[0x563c35da510e]
/usr/bin/redis-server *:7001 [cluster] (processCommand+0x6fd)[0x563c35da617d]
/usr/bin/redis-server *:7001 [cluster] (processInputBuffer+0x107)[0x563c35dc2517]
/usr/bin/redis-server *:7001 [cluster] (readQueryFromClient+0x318)[0x563c35dc2a58]
/usr/bin/redis-server *:7001 [cluster] (+0x17323c)[0x563c35e9823c]
/usr/bin/redis-server *:7001 [cluster] (aeProcessEvents+0x1e2)[0x563c35d9c182]
/usr/bin/redis-server *:7001 [cluster] (aeMain+0x1d)[0x563c35d9c4bd]
/usr/bin/redis-server *:7001 [cluster] (main+0x354)[0x563c35d93df4]
/lib/x86_64-linux-gnu/libc.so.6 (+0x29d90)[0x7f5f788ccd90]
/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main+0x80)[0x7f5f788cce40]
/usr/bin/redis-server *:7001 [cluster] (_start+0x25)[0x563c35d944c5]

Repro Steps:

Minimal Pseudo-Code for Repro:

firstShard.ClusterAddSlots(context.TODO(), 0, 8191)
firstShard.ClusterBumpEpoch(context.TODO())
secondShard.ClusterReplicate(context.TODO(), NodeIdOfFirstShard)
secondShard.ClusterAddSlots(context.TODO(), 8192, 16383)
secondShard.ClusterBumpEpoch(context.TODO())

[For the crash, issue another replicate]
secondShard.ClusterReplicate(context.TODO(), NodeIdOfFirstShard)

Expectation:

If we send CLUSTER REPLICATE to a master node, it rejects the command. I would expect the same if we send CLUSTER ADDSLOTSRANGE to a replica node (until its state is reset). At the very least, the node's view should eventually become consistent with the view from other nodes through gossip propagation. Currently, the CLUSTER NODES outputs of the two shards don't converge (before the crash).
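
To make the non-convergence concrete, here is a rough Go sketch that compares slot ownership between two CLUSTER NODES outputs. `slotsByNode` and `divergentNodes` are hypothetical helper names used only for illustration, and the sketch assumes slot tokens start at the ninth space-separated field, as in the outputs in this report. A real check would need to poll both nodes over time, since gossip takes a while to propagate.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// slotsByNode maps each node ID in a CLUSTER NODES output to its slot
// tokens joined by spaces ("" when the node lists no slots).
func slotsByNode(output string) map[string]string {
	views := make(map[string]string)
	for _, line := range strings.Split(output, "\n") {
		f := strings.Fields(line)
		if len(f) < 8 {
			continue // skip blank or malformed lines
		}
		views[f[0]] = strings.Join(f[8:], " ")
	}
	return views
}

// divergentNodes returns the sorted node IDs whose slot ownership
// differs between two views, or that appear in only one of them.
func divergentNodes(a, b map[string]string) []string {
	var ids []string
	for id, slots := range a {
		if other, ok := b[id]; !ok || other != slots {
			ids = append(ids, id)
		}
	}
	for id := range b {
		if _, ok := a[id]; !ok {
			ids = append(ids, id)
		}
	}
	sort.Strings(ids)
	return ids
}

func main() {
	// Outputs copied from this report.
	view7001 := `3dd70494e109dfdfa44fd2ff69d1a12be1f3642b 172.18.133.131:7001@17001,padgupta-lappy. myself,slave,nofailover c23aa8a150e0291eaad4b88cc505b88b8b2479b0 0 1698913926000 3 connected 8192-16383
c23aa8a150e0291eaad4b88cc505b88b8b2479b0 172.18.133.131:7000@17000,padgupta-lappy. master,nofailover - 0 1698914350449 3 connected 0-8191`

	view7000 := `c23aa8a150e0291eaad4b88cc505b88b8b2479b0 172.18.133.131:7000@17000,padgupta-lappy. myself,master,nofailover - 0 0 3 connected 0-8191
3dd70494e109dfdfa44fd2ff69d1a12be1f3642b 172.18.133.131:7001@17001,padgupta-lappy. slave,nofailover c23aa8a150e0291eaad4b88cc505b88b8b2479b0 0 1698914345168 3 connected`

	// Lists the node IDs on which the two shards disagree.
	fmt.Println(divergentNodes(slotsByNode(view7001), slotsByNode(view7000)))
}
```

For the two outputs above, the only divergent ID is node 7001 (3dd70494...), which claims 8192-16383 in its own view and no slots in the view from node 7000.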
