Drop the MEET packet if the link node is in handshake state by enjoy-binbin · Pull Request #1436 · valkey-io/valkey

enjoy-binbin · 2024-12-13T09:10:57Z

After #1307 got merged, we notice there is a assert happen in setClusterNodeToInboundClusterLink:

=== ASSERTION FAILED ===
==> '!link->node' is not true

In #778, we will call setClusterNodeToInboundClusterLink to attach the node to the link
during the MEET processing, so if we receive a another MEET packet in a short time, the
node is still in handshake state, we will meet this assert and crash the server.

If the link is bound to a node and the node is in the handshake state, and we receive
a MEET packet, it may be that the sender sent multiple MEET packets so in here we are
dropping the MEET to avoid the assert in setClusterNodeToInboundClusterLink. The assert
will happen if the other sends a MEET packet because it detects that there is no inbound
link, this node creates a new node in HANDSHAKE state (with a random node name), and
respond with a PONG. The other node receives the PONG and removes the CLUSTER_NODE_MEET
flag. This node is supposed to open an outbound connection to the other node in the next
cron cycle, but before this happens, the other node re-sends a MEET on the same link
because it still detects no inbound connection.

Note that in getNodeFromLinkAndMsg, the node in the handshake state has a random name
and not truly "known", so we don't know the sender. Dropping the MEET packet can prevent
us from creating a random node, avoid incorrect link binding, and avoid duplicate MEET
packet eliminate the handshake state.

After valkey-io#1307 got merged, we notice there is a assert happen in setClusterNodeToInboundClusterLink: ``` === ASSERTION FAILED === ==> '!link->node' is not true ``` In valkey-io#778, we will call setClusterNodeToInboundClusterLink to attach the node to the link during the MEET processing, so if we receive a another MEET packet in a short time, the node is still in handshake state, we will meet this assert and crash the server. If the link is bound to a node and the node is in the handshake state, and we receive a MEET packet, it may be that the sender sent multiple MEET packets when reconnecting, and in here we are dropping the MEET. Note that in getNodeFromLinkAndMsg, the node in the handshake state has a random name and not truly "known", so we don't know the sender. Dropping the MEET packet can prevent us from creating a random node, avoid incorrect link binding, and avoid duplicate MEET packet eliminate the handshake state. Signed-off-by: Binbin <[email protected]>

enjoy-binbin · 2024-12-13T09:12:54Z

The ci fail link: https://github.com/valkey-io/valkey/actions/runs/12306207562/job/34347454716#step:3:4364

Runing in a loop can trigger the failiure
./runtest --single unit/cluster/cluster-reliable-meet --dont-clean --dont-pre-clean --loops 10

@pieturin could you take a look and see if i do it right?

hpatro

This code flow has become too brittle and we are always in catchup mode. I think we should step back and reconsider the MEET flow.

@madolson / @PingXie / @enjoy-binbin

src/cluster_legacy.c

pieturin · 2024-12-13T21:50:18Z

Something else we could do, is to avoid sending MEET packets repeatedly. We can introduce a node field like last_meet_packet_sent and only send a MEET packet if the last packet was sent more than the handshake timeout ago here:

valkey/src/cluster_legacy.c

Lines 4978 to 4986 in 5f7fe9e

    
               if (node->link != NULL && node->inbound_link == NULL && nodeInNormalState(node) && 
        
                   now - node->inbound_link_freed_time > handshake_timeout) { 
        
                   /* Node has an outbound link, but no inbound link for more than the handshake timeout. 
        
                    * This probably means this node does not know us yet, whereas we know it. 
        
                    * So we send it a MEET packet to do a handshake with it and correct the inconsistent cluster view. */ 
        
                   node->flags |= CLUSTER_NODE_MEET; 
        
                   serverLog(LL_NOTICE, "Sending MEET packet to node %.40s because there is no inbound link for it", node->name); 
        
                   clusterSendPing(node->link, CLUSTERMSG_TYPE_MEET); 
        
               }

madolson

I'm okay merging this and following up as well, given how consistently it causes issues in the CI.

enjoy-binbin · 2024-12-14T01:48:39Z

i forget to mention if we double this line, we will get the crash in #778.

    clusterSendPing(link, nodeInMeetState(node) ? CLUSTERMSG_TYPE_MEET : CLUSTERMSG_TYPE_PING);
+    clusterSendPing(link, nodeInMeetState(node) ? CLUSTERMSG_TYPE_MEET : CLUSTERMSG_TYPE_PING);

Signed-off-by: Binbin <[email protected]>

enjoy-binbin · 2024-12-16T05:43:15Z

i am merging it to unblock the CI, please feel free to update the comment or code if needed in #1441.

codecov · 2024-12-16T05:56:57Z

Codecov Report

Attention: Patch coverage is 50.00000% with 3 lines in your changes missing coverage. Please review.

Project coverage is 70.79%. Comparing base (5f7fe9e) to head (9859060).
Report is 12 commits behind head on unstable.

Files with missing lines	Patch %	Lines
src/cluster_legacy.c	50.00%	3 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable    #1436      +/-   ##
============================================
- Coverage     70.90%   70.79%   -0.12%     
============================================
  Files           119      119              
  Lines         64631    64691      +60     
============================================
- Hits          45828    45799      -29     
- Misses        18803    18892      +89

Files with missing lines	Coverage Δ
src/cluster_legacy.c	`86.67% <50.00%> (-0.20%)`	⬇️

... and 28 files with indirect coverage changes

PingXie · 2024-12-16T08:11:57Z

This code flow has become too brittle and we are always in catchup mode. I think we should step back and reconsider the MEET flow.

100% agreed. The entropy in the MEET flow has become increasingly unmanageable. Was the MEET protocol ever formally spec'd? I suspect not. It might be worthwhile to start by drafting a formal spec of the current behavior as a baseline.

PingXie · 2024-12-16T08:21:50Z

Dropping newer MEET messages feels counterintuitive. While it may help with the assert, it seems to compromise resilience. In this situation, the node associated with the first MEET is likely stale. Could this lead to the MEET ultimately failing, effectively nullifying the retries on the sender side?

Add `meet_sent` field in `clusterNode` indicating the last time we sent a MEET packet. Use this field to only (re-)send a MEET packet once every handshake timeout period when detecting a node without an inbound link. When receiving multiple MEET packets on the same link while the node is in handshake state, instead of dropping the packet, we now simply prevent the creation of a new node. This way we still process the MEET packet's gossip and reply with a PONG as any other packets. Improve some logging messages to include `human_nodename`. Add `nodeExceedsHandshakeTimeout()` function. This is a follow-up to this previous PR: #1307 And a partial fix to the crash described in: #1436 --------- Signed-off-by: Pierre Turin <[email protected]>

…o#1436) After valkey-io#1307 got merged, we notice there is a assert happen in setClusterNodeToInboundClusterLink: ``` === ASSERTION FAILED === ==> '!link->node' is not true ``` In valkey-io#778, we will call setClusterNodeToInboundClusterLink to attach the node to the link during the MEET processing, so if we receive a another MEET packet in a short time, the node is still in handshake state, we will meet this assert and crash the server. If the link is bound to a node and the node is in the handshake state, and we receive a MEET packet, it may be that the sender sent multiple MEET packets so in here we are dropping the MEET to avoid the assert in setClusterNodeToInboundClusterLink. The assert will happen if the other sends a MEET packet because it detects that there is no inbound link, this node creates a new node in HANDSHAKE state (with a random node name), and respond with a PONG. The other node receives the PONG and removes the CLUSTER_NODE_MEET flag. This node is supposed to open an outbound connection to the other node in the next cron cycle, but before this happens, the other node re-sends a MEET on the same link because it still detects no inbound connection. Note that in getNodeFromLinkAndMsg, the node in the handshake state has a random name and not truly "known", so we don't know the sender. Dropping the MEET packet can prevent us from creating a random node, avoid incorrect link binding, and avoid duplicate MEET packet eliminate the handshake state. Signed-off-by: Binbin <[email protected]>

…y-io#1441) Add `meet_sent` field in `clusterNode` indicating the last time we sent a MEET packet. Use this field to only (re-)send a MEET packet once every handshake timeout period when detecting a node without an inbound link. When receiving multiple MEET packets on the same link while the node is in handshake state, instead of dropping the packet, we now simply prevent the creation of a new node. This way we still process the MEET packet's gossip and reply with a PONG as any other packets. Improve some logging messages to include `human_nodename`. Add `nodeExceedsHandshakeTimeout()` function. This is a follow-up to this previous PR: valkey-io#1307 And a partial fix to the crash described in: valkey-io#1436 --------- Signed-off-by: Pierre Turin <[email protected]>

enjoy-binbin requested review from hpatro and madolson December 13, 2024 09:34

hpatro approved these changes Dec 13, 2024

View reviewed changes

madolson reviewed Dec 13, 2024

View reviewed changes

src/cluster_legacy.c Show resolved Hide resolved

pieturin reviewed Dec 13, 2024

View reviewed changes

src/cluster_legacy.c Outdated Show resolved Hide resolved

pieturin mentioned this pull request Dec 13, 2024

Only (re-)send MEET packet once every handshake timeout period #1441

Merged

madolson approved these changes Dec 14, 2024

View reviewed changes

update the comment

9859060

Signed-off-by: Binbin <[email protected]>

enjoy-binbin merged commit e024b4b into valkey-io:unstable Dec 16, 2024

enjoy-binbin deleted the drop_dup_meet branch December 16, 2024 05:43

PingXie mentioned this pull request Dec 22, 2024

[NEW] Rethink and Redesign the MEET Protocol to Address Increasing Brittleness and Complexity #1471

Open

hpatro mentioned this pull request Jul 11, 2025

[CRASH] Crash in Valkey 8.0.3 during downgrade from Valkey 8.1.0 #2341

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Drop the MEET packet if the link node is in handshake state#1436

Drop the MEET packet if the link node is in handshake state#1436
enjoy-binbin merged 2 commits intovalkey-io:unstablefrom
enjoy-binbin:drop_dup_meet

enjoy-binbin commented Dec 13, 2024 •

edited

Loading

Uh oh!

enjoy-binbin commented Dec 13, 2024

Uh oh!

hpatro left a comment

Uh oh!

Uh oh!

Uh oh!

pieturin commented Dec 13, 2024

Uh oh!

madolson left a comment

Uh oh!

enjoy-binbin commented Dec 14, 2024

Uh oh!

enjoy-binbin commented Dec 16, 2024

Uh oh!

codecov bot commented Dec 16, 2024

Uh oh!

PingXie commented Dec 16, 2024

Uh oh!

PingXie commented Dec 16, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

enjoy-binbin commented Dec 13, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

enjoy-binbin commented Dec 13, 2024

Uh oh!

hpatro left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

pieturin commented Dec 13, 2024

Uh oh!

madolson left a comment

Choose a reason for hiding this comment

Uh oh!

enjoy-binbin commented Dec 14, 2024

Uh oh!

enjoy-binbin commented Dec 16, 2024

Uh oh!

codecov bot commented Dec 16, 2024

Codecov Report

Uh oh!

PingXie commented Dec 16, 2024

Uh oh!

PingXie commented Dec 16, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

enjoy-binbin commented Dec 13, 2024 •

edited

Loading