Fix occasional hangs on replication reconnection. #7830

Merged: yossigo merged 1 commit into redis:unstable from yossigo:fix-sync-reconnect-issues on Sep 22, 2020

@yossigo (Collaborator) commented Sep 22, 2020

This happens only on diskless replicas when attempting to reconnect after failing to load an RDB file. It is more likely to occur with larger datasets.

After reconnection is initiated, replicationEmptyDbCallback() may get called and try to write to an unconnected socket. This triggered another issue where the connection is put into an error state and the connect handler never gets called. The problem is a regression introduced by commit c17e597.
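
The failure sequence described above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions, not the patch from this PR: the `conn` struct, the `conn_state` values, and every function name below are hypothetical stand-ins, and the guard shown is just one plausible way to avoid writing to a socket that has not finished connecting.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified connection states (not Redis's actual types). */
typedef enum { CONN_CONNECTING, CONN_CONNECTED, CONN_ERROR } conn_state;

typedef struct {
    conn_state state;
    int connect_handler_calls; /* times the connect handler has fired */
} conn;

/* Model of a write: writing before the connection is established
 * moves the connection into the error state. */
static int conn_write(conn *c) {
    if (c->state != CONN_CONNECTED) {
        c->state = CONN_ERROR;
        return -1;
    }
    return 1;
}

/* Model of the event loop noticing that the socket connected. The
 * connect handler only fires if the connection is still connecting. */
static void conn_on_connected(conn *c) {
    if (c->state != CONN_CONNECTING) return;
    c->state = CONN_CONNECTED;
    c->connect_handler_calls++;
}

/* Unguarded callback: always writes, as in the buggy sequence. */
static void empty_db_callback_unguarded(conn *c) {
    (void)conn_write(c);
}

/* Guarded callback (assumed fix shape): skip the write unless connected. */
static void empty_db_callback_guarded(conn *c) {
    if (c->state == CONN_CONNECTED) (void)conn_write(c);
}

/* Start a reconnect, run the callback while the database is being
 * emptied, then let the connect complete. Returns true if the connect
 * handler ran, false if the reconnection is stuck. */
static bool simulate_reconnect(void (*callback)(conn *)) {
    conn c = { CONN_CONNECTING, 0 };
    callback(&c);
    conn_on_connected(&c);
    return c.connect_handler_calls == 1;
}

/* Usage: the unguarded callback hangs the reconnect, the guarded one does not. */
void demo(void) {
    assert(!simulate_reconnect(empty_db_callback_unguarded));
    assert(simulate_reconnect(empty_db_callback_guarded));
}
```

In this model the hang is simply the connect handler never running, which matches the symptom described in the PR. A longer flush gives the callback more opportunities to fire while the new connection is still pending, which fits the observation that larger datasets make the hang more likely.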

This problem reproduced quite frequently when running tests on macOS.
@yossigo requested a review from oranagra September 22, 2020 07:17
@oranagra (Member) commented

@yossigo so this happens only on diskless replicas, right?
When RDB loading is aborted midway because the master disconnected, the replica has to call emptyDb to flush the partial database, and that call could end up writing to the unconnected connection.

The reason we saw it rarely is that the database in that test wasn't very large. But on larger databases that take longer to flush, is this almost guaranteed?

I want to know all the details so that we can put them into the commit comment when squashing, and then consider cherry-picking.

@yossigo (Collaborator, Author) commented Sep 22, 2020

@oranagra updated description with all the details.

@yossigo merged commit 1980f63 into redis:unstable Sep 22, 2020
@yossigo deleted the fix-sync-reconnect-issues branch September 22, 2020 08:38
oranagra pushed a commit that referenced this pull request Oct 27, 2020
(cherry picked from commit 1980f63)
JackieXie168 pushed a commit to JackieXie168/redis that referenced this pull request Nov 4, 2020
jschmieg pushed a commit to memKeyDB/memKeyDB that referenced this pull request Nov 6, 2020
(cherry picked from commit 1980f63)