Skip to content

Drop the MEET packet if the link node is in handshake state#1436

Merged
enjoy-binbin merged 2 commits intovalkey-io:unstablefrom
enjoy-binbin:drop_dup_meet
Dec 16, 2024
Merged

Drop the MEET packet if the link node is in handshake state#1436
enjoy-binbin merged 2 commits intovalkey-io:unstablefrom
enjoy-binbin:drop_dup_meet

Conversation

@enjoy-binbin
Copy link
Member

@enjoy-binbin enjoy-binbin commented Dec 13, 2024

After #1307 got merged, we notice there is a assert happen in setClusterNodeToInboundClusterLink:

=== ASSERTION FAILED ===
==> '!link->node' is not true

In #778, we will call setClusterNodeToInboundClusterLink to attach the node to the link
during the MEET processing, so if we receive a another MEET packet in a short time, the
node is still in handshake state, we will meet this assert and crash the server.

If the link is bound to a node and the node is in the handshake state, and we receive
a MEET packet, it may be that the sender sent multiple MEET packets so in here we are
dropping the MEET to avoid the assert in setClusterNodeToInboundClusterLink. The assert
will happen if the other sends a MEET packet because it detects that there is no inbound
link, this node creates a new node in HANDSHAKE state (with a random node name), and
respond with a PONG. The other node receives the PONG and removes the CLUSTER_NODE_MEET
flag. This node is supposed to open an outbound connection to the other node in the next
cron cycle, but before this happens, the other node re-sends a MEET on the same link
because it still detects no inbound connection.

Note that in getNodeFromLinkAndMsg, the node in the handshake state has a random name
and not truly "known", so we don't know the sender. Dropping the MEET packet can prevent
us from creating a random node, avoid incorrect link binding, and avoid duplicate MEET
packet eliminate the handshake state.

After valkey-io#1307 got merged, we notice there is a assert happen in setClusterNodeToInboundClusterLink:
```
=== ASSERTION FAILED ===
==> '!link->node' is not true
```

In valkey-io#778, we will call setClusterNodeToInboundClusterLink to attach the node to the link
during the MEET processing, so if we receive a another MEET packet in a short time, the
node is still in handshake state, we will meet this assert and crash the server.

If the link is bound to a node and the node is in the handshake state, and we receive
a MEET packet, it may be that the sender sent multiple MEET packets when reconnecting,
and in here we are dropping the MEET. Note that in getNodeFromLinkAndMsg, the node in
the handshake state has a random name and not truly "known", so we don't know the sender.
Dropping the MEET packet can prevent us from creating a random node, avoid incorrect
link binding, and avoid duplicate MEET packet eliminate the handshake state.

Signed-off-by: Binbin <[email protected]>
@enjoy-binbin
Copy link
Member Author

The ci fail link: https://github.com/valkey-io/valkey/actions/runs/12306207562/job/34347454716#step:3:4364

Runing in a loop can trigger the failiure
./runtest --single unit/cluster/cluster-reliable-meet --dont-clean --dont-pre-clean --loops 10

@pieturin could you take a look and see if i do it right?

Copy link
Contributor

@hpatro hpatro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code flow has become too brittle and we are always in catchup mode. I think we should step back and reconsider the MEET flow.

@madolson / @PingXie / @enjoy-binbin

@pieturin
Copy link
Contributor

Something else we could do, is to avoid sending MEET packets repeatedly. We can introduce a node field like last_meet_packet_sent and only send a MEET packet if the last packet was sent more than the handshake timeout ago here:

valkey/src/cluster_legacy.c

Lines 4978 to 4986 in 5f7fe9e

if (node->link != NULL && node->inbound_link == NULL && nodeInNormalState(node) &&
now - node->inbound_link_freed_time > handshake_timeout) {
/* Node has an outbound link, but no inbound link for more than the handshake timeout.
* This probably means this node does not know us yet, whereas we know it.
* So we send it a MEET packet to do a handshake with it and correct the inconsistent cluster view. */
node->flags |= CLUSTER_NODE_MEET;
serverLog(LL_NOTICE, "Sending MEET packet to node %.40s because there is no inbound link for it", node->name);
clusterSendPing(node->link, CLUSTERMSG_TYPE_MEET);
}

Copy link
Member

@madolson madolson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm okay merging this and following up as well, given how consistently it causes issues in the CI.

@enjoy-binbin
Copy link
Member Author

i forget to mention if we double this line, we will get the crash in #778.

    clusterSendPing(link, nodeInMeetState(node) ? CLUSTERMSG_TYPE_MEET : CLUSTERMSG_TYPE_PING);
+    clusterSendPing(link, nodeInMeetState(node) ? CLUSTERMSG_TYPE_MEET : CLUSTERMSG_TYPE_PING);

Signed-off-by: Binbin <[email protected]>
@enjoy-binbin
Copy link
Member Author

i am merging it to unblock the CI, please feel free to update the comment or code if needed in #1441.

@enjoy-binbin enjoy-binbin merged commit e024b4b into valkey-io:unstable Dec 16, 2024
@enjoy-binbin enjoy-binbin deleted the drop_dup_meet branch December 16, 2024 05:43
@codecov
Copy link

codecov bot commented Dec 16, 2024

Codecov Report

Attention: Patch coverage is 50.00000% with 3 lines in your changes missing coverage. Please review.

Project coverage is 70.79%. Comparing base (5f7fe9e) to head (9859060).
Report is 12 commits behind head on unstable.

Files with missing lines Patch % Lines
src/cluster_legacy.c 50.00% 3 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #1436      +/-   ##
============================================
- Coverage     70.90%   70.79%   -0.12%     
============================================
  Files           119      119              
  Lines         64631    64691      +60     
============================================
- Hits          45828    45799      -29     
- Misses        18803    18892      +89     
Files with missing lines Coverage Δ
src/cluster_legacy.c 86.67% <50.00%> (-0.20%) ⬇️

... and 28 files with indirect coverage changes

@PingXie
Copy link
Member

PingXie commented Dec 16, 2024

This code flow has become too brittle and we are always in catchup mode. I think we should step back and reconsider the MEET flow.

100% agreed. The entropy in the MEET flow has become increasingly unmanageable. Was the MEET protocol ever formally spec'd? I suspect not. It might be worthwhile to start by drafting a formal spec of the current behavior as a baseline.

@PingXie
Copy link
Member

PingXie commented Dec 16, 2024

Dropping newer MEET messages feels counterintuitive. While it may help with the assert, it seems to compromise resilience. In this situation, the node associated with the first MEET is likely stale. Could this lead to the MEET ultimately failing, effectively nullifying the retries on the sender side?

hwware pushed a commit that referenced this pull request Dec 30, 2024
Add `meet_sent` field in `clusterNode` indicating the last time we sent
a MEET packet. Use this field to only (re-)send a MEET packet once every
handshake timeout period when detecting a node without an inbound link.

When receiving multiple MEET packets on the same link while the node is
in handshake state, instead of dropping the packet, we now simply
prevent the creation of a new node. This way we still process the MEET
packet's gossip and reply with a PONG as any other packets.

Improve some logging messages to include `human_nodename`. Add
`nodeExceedsHandshakeTimeout()` function.

This is a follow-up to this previous PR:
#1307
And a partial fix to the crash described in:
#1436

---------

Signed-off-by: Pierre Turin <[email protected]>
kronwerk pushed a commit to kronwerk/valkey that referenced this pull request Jan 27, 2025
…o#1436)

After valkey-io#1307 got merged, we notice there is a assert happen in setClusterNodeToInboundClusterLink:
```
=== ASSERTION FAILED ===
==> '!link->node' is not true
```

In valkey-io#778, we will call setClusterNodeToInboundClusterLink to attach the node to the link
during the MEET processing, so if we receive a another MEET packet in a short time, the
node is still in handshake state, we will meet this assert and crash the server.

If the link is bound to a node and the node is in the handshake state, and we receive
a MEET packet, it may be that the sender sent multiple MEET packets so in here we are
dropping the MEET to avoid the assert in setClusterNodeToInboundClusterLink. The assert
will happen if the other sends a MEET packet because it detects that there is no inbound
link, this node creates a new node in HANDSHAKE state (with a random node name), and
respond with a PONG. The other node receives the PONG and removes the CLUSTER_NODE_MEET
flag. This node is supposed to open an outbound connection to the other node in the next
cron cycle, but before this happens, the other node re-sends a MEET on the same link
because it still detects no inbound connection.

Note that in getNodeFromLinkAndMsg, the node in the handshake state has a random name
and not truly "known", so we don't know the sender. Dropping the MEET packet can prevent
us from creating a random node, avoid incorrect link binding, and avoid duplicate MEET
packet eliminate the handshake state.

Signed-off-by: Binbin <[email protected]>
kronwerk pushed a commit to kronwerk/valkey that referenced this pull request Jan 27, 2025
…y-io#1441)

Add `meet_sent` field in `clusterNode` indicating the last time we sent
a MEET packet. Use this field to only (re-)send a MEET packet once every
handshake timeout period when detecting a node without an inbound link.

When receiving multiple MEET packets on the same link while the node is
in handshake state, instead of dropping the packet, we now simply
prevent the creation of a new node. This way we still process the MEET
packet's gossip and reply with a PONG as any other packets.

Improve some logging messages to include `human_nodename`. Add
`nodeExceedsHandshakeTimeout()` function.

This is a follow-up to this previous PR:
valkey-io#1307
And a partial fix to the crash described in:
valkey-io#1436

---------

Signed-off-by: Pierre Turin <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants