
Don't update node ip when peer ip is unreachable #10696

Merged
madolson merged 3 commits into redis:unstable from uvletter:fix_cluster_update_ip
Jul 20, 2022

Conversation

@uvletter (Contributor) commented May 9, 2022

As discussed in #9351, @bjosv mentioned that a cluster node's IP may be updated to ? when getpeername() fails for some reason, e.g. because the fd has somehow become invalid, and the log looks like:

15:S 27 Apr 2022 00:38:49.999 # Address updated for node f55db6b9314afec8816cbb258cf8a8358ad1718c, now ?:6380
15:S 27 Apr 2022 00:38:49.999 # Address updated for node c16a99900ec94d8bfd4dc4e1b0c76340f7e853a6, now ?:6380
15:S 27 Apr 2022 00:38:49.999 # Address updated for node 530b194579251e73473b6c9f94b2ce79ab1ad7eb, now ?:6380

To solve this issue, the patch primarily does two things:

  1. Propagate the error code to the caller when getting the peer name fails, and skip the current IP update in nodeUpdateAddressIfNeeded. This doesn't affect the normal IP update process, since the update will be retried on the next round of PING.
  2. Log the getpeername() failure, for further debugging and so on.
release notes:
Fix a bug where a cluster enabled replica node may permanently set its master's hostname to '?'.
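The first point can be sketched in isolation like this (hypothetical names, not the actual Redis code; `get_peer_ip` stands in for the patched helper): the getpeername() failure is returned to the caller so the cached address is kept instead of being overwritten with `?`.

```c
/* Sketch of the fix's idea, not Redis source: return an error instead of
 * writing "?" into the node's address, so the caller can skip the update
 * and let the next PING round retry over a healthy connection.
 * get_peer_ip is a hypothetical name. */
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Return 0 and fill `ip` on success; return -1 and log when getpeername()
 * fails (e.g. ENOTCONN on a dead connection), leaving `ip` untouched. */
int get_peer_ip(int fd, char *ip, size_t len) {
    struct sockaddr_in sa;
    socklen_t salen = sizeof(sa);
    if (getpeername(fd, (struct sockaddr *)&sa, &salen) == -1) {
        fprintf(stderr, "getpeername() failed: %s\n", strerror(errno));
        return -1;
    }
    inet_ntop(AF_INET, &sa.sin_addr, ip, len);
    return 0;
}
```

On failure the caller keeps the node's last known address as-is rather than storing `?`, matching point 1 above.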

@bjosv (Contributor) left a comment
It would be nice if we had a testcase to trigger this. The current testcases all seem to have a valid ip.

@bjosv (Contributor) commented May 10, 2022

Just a thought: I think the reviewers in the core team would appreciate some details in the PR description above.
It would probably help to add some text about the reason for this change and what users can expect (new behavior, new logs, ...).

@uvletter (Contributor, Author)

@bjosv thanks a lot for your suggestions! I will deal with them one by one.

It would be nice if we had a testcase to trigger this. Current testcases seem all have a valid ip.

I will try to write a testcase for it, though I feel it's not easy...

@pieturin
I left a comment on the issue explaining the root cause.

In order to reproduce the issue, you could try to write a test that does the following:

  1. Create a cluster with 2 primaries and 2 replicas.
  2. Once the replicas are fully in sync with their primaries, block a replica Redis process for an extended period of time (in my case 10min worked). This should allow the OS to close the cluster bus connection between the blocked replica and its primary.
  3. After the wait time, block the primary of the blocked replica for a short period of time so that the replica gets the IP of the primary from another node in the cluster through gossiping.
  4. Once the primary is blocked, unblock the replica. After unblocking, syscalls on the master<->replica connection socket will return an ENOTCONN (Transport endpoint is not connected) error.
  5. Wait a few seconds (10sec should be enough), then unblock the primary.
  6. Now the replica should have ? as the IP for its primary.
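The key claim in step 4 can be checked in isolation with a small C sketch (not a Redis test): reset a loopback TCP connection from the server side and observe that getpeername() on the client socket then fails. `peername_errno_after_reset` is a made-up helper, and the exact errno (ENOTCONN) assumes Linux's handling of a reset connection.

```c
/* Sketch: simulate the abrupt teardown a blocked replica's connection
 * suffers, then check getpeername() on the surviving socket. */
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns the client's getpeername() errno after the peer resets the
 * connection, or 0 if the call unexpectedly succeeds. */
int peername_errno_after_reset(void) {
    /* Throwaway listener on an ephemeral loopback port. */
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in sa = {0};
    sa.sin_family = AF_INET;
    sa.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    sa.sin_port = 0; /* let the kernel pick a port */
    bind(lfd, (struct sockaddr *)&sa, sizeof(sa));
    listen(lfd, 1);
    socklen_t salen = sizeof(sa);
    getsockname(lfd, (struct sockaddr *)&sa, &salen);

    int cfd = socket(AF_INET, SOCK_STREAM, 0);
    connect(cfd, (struct sockaddr *)&sa, sizeof(sa));
    int sfd = accept(lfd, NULL, NULL);

    /* SO_LINGER with a zero timeout makes close() send an RST rather
     * than a FIN, i.e. an abrupt connection teardown. */
    struct linger lg = { .l_onoff = 1, .l_linger = 0 };
    setsockopt(sfd, SOL_SOCKET, SO_LINGER, &lg, sizeof(lg));
    close(sfd);
    usleep(100000); /* give the client side time to process the RST */

    struct sockaddr_in peer;
    socklen_t plen = sizeof(peer);
    int err = 0;
    if (getpeername(cfd, (struct sockaddr *)&peer, &plen) == -1) err = errno;
    close(cfd);
    close(lfd);
    return err;
}
```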

@pieturin
@uvletter - Another potential way of fixing this issue would be to test if the returned IP is ? in nodeUpdateAddressIfNeeded after:

redis/src/cluster.c, lines 1780 to 1782 in 6f7c1a8:

    nodeIp2String(ip,link,hdr->myip);
    if (node->port == port && node->cport == cport && node->pport == pport &&
        strcmp(ip,node->ip) == 0) return 0;

But I like your solution better since you are logging a warning in case we cannot get the socket info.
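For reference, that alternative would look roughly like this (a sketch with a hypothetical helper name, not the actual Redis code; the real check also compares cport and pport): the `?` sentinel that nodeIp2String() writes on getpeername() failure is treated as "address unknown" and the update is skipped.

```c
/* Sketch of the alternative fix: skip the address update when the
 * resolved IP is the "?" sentinel. addr_needs_update is a made-up name. */
#include <assert.h>
#include <string.h>

int addr_needs_update(const char *new_ip, const char *cur_ip,
                      int new_port, int cur_port) {
    if (strcmp(new_ip, "?") == 0) return 0;  /* peer address unknown */
    if (new_port == cur_port && strcmp(new_ip, cur_ip) == 0)
        return 0;                            /* nothing changed */
    return 1;
}
```

Compared to the merged fix, this silently ignores the failure, whereas propagating the error also allows it to be logged.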

@uvletter (Contributor, Author)
@pieturin Yeah, I read the comment you left on the issue; the analysis is quite reasonable. I tried to reproduce this problem by calling getpeername() on a socket that had been closed by the peer, but it failed, and your comment solved my puzzle. Luckily my solution seems to work well even though I cannot reproduce the problem.

@uvletter (Contributor, Author) commented Jul 4, 2022

@madolson @oranagra hey guys, can you give this some attention? Any advice is welcome!

@madolson (Contributor) left a comment
If there isn't a good way to test this change, I'm okay merging it as is.

@madolson madolson linked an issue Jul 8, 2022 that may be closed by this pull request
@uvletter (Contributor, Author)

If there isn't a good way to test this change, I'm okay merging it as is.

A little late to get back. It's somewhat hard to simulate this situation, since it needs something to block the process for quite a long time exactly just before handling the PING message. But at least the change has no negative effect, and if pieturin's inference is correct, it should solve the problem in theory.

I'm not sure this should be a warning, since it's being handled gracefully. It's an interesting failure, warranting a message, but it's not really an "error" which is what we use WARNING for.

Yep, in terms of the result it's not a problem, since it's being handled gracefully; a NOTICE to log the event is sufficient.

@madolson madolson added the release-notes indication that this issue needs to be mentioned in the release notes label Jul 20, 2022
@madolson madolson merged commit d00b8af into redis:unstable Jul 20, 2022
@madolson madolson removed their assignment Aug 8, 2022
@oranagra oranagra mentioned this pull request Sep 21, 2022
oranagra pushed a commit that referenced this pull request Sep 21, 2022
enjoy-binbin pushed a commit to enjoy-binbin/redis that referenced this pull request Jul 31, 2023
wangweilica pushed a commit to wangweilica/redis that referenced this pull request Apr 7, 2024
wangweilica pushed a commit to wangweilica/redis that referenced this pull request Apr 17, 2024


Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

Cluster state changed: fail -----connecting to MASTER ?:7200
