Kill disk-based fork child when all replicas drop and 'save' is not enabled#7819

Merged
oranagra merged 2 commits into redis:unstable from ShooterIT:kill-child
Sep 22, 2020

Conversation

@ShooterIT
Member

#7717

If no other slave is waiting for the RDB dump to finish, the current child process does not need to continue dumping the RDB, so we kill it. That way the child process won't keep consuming memory, and we can also fork a new child asap to dump an RDB for the next full synchronization or bgsave. But we also need to check whether the user has enabled 'save': if so, we should not kill the child, since the RDB file is important for keeping the user's data safe.

Btw, rdbRemoveTempFile in killRDBChild no longer blocks the server, so we can call killRDBChild safely.

@oranagra oranagra linked an issue Sep 20, 2020 that may be closed by this pull request
@oranagra oranagra changed the title from "Kill child process if its dumping RDB isn't useful when full sync" to "Kill disk-based fork child when all replicas drop and 'save' is not enabled" Sep 22, 2020
@oranagra oranagra merged commit 1bb5794 into redis:unstable Sep 22, 2020
@yossigo
Collaborator

yossigo commented Oct 26, 2020

Hi @ShooterIT, it looks like the test produces some false positives. Looking at the test code, it seems very timing dependent. Would you consider replacing the fixed delays with a more accurate check that probes the server state instead?

@oranagra
Member

for the record, failed on CI for MacOS:

    *** [err]: Kill rdb child process if its dumping RDB is not useful in tests/integration/replication.tcl
    Expected [s 0 rdb_bgsave_in_progress] == 1 (context: type eval line 27 cmd {assert {[s 0 rdb_bgsave_in_progress] == 1}} proc ::start_server)

@oranagra
Member

it looks like the one that fails is this one:

                # Slave1 disconnect with master
                $slave1 slaveof no one
                # Shouldn't kill child since another slave wait for rdb
                after 100
                assert {[s 0 rdb_bgsave_in_progress] == 1}

which is odd considering we already know it's in progress from the previous wait_for_condition. Maybe there was a race, and only one slave connected?
Maybe increasing repl-diskless-sync-delay to more than 5 seconds is needed?

Meanwhile, just to be on the safe side, I'm taking this commit out of 6.0.9.

@oranagra oranagra mentioned this pull request Oct 26, 2020
@ShooterIT
Member Author

Copy that @oranagra @yossigo
My bad, I found it. It fails when only one slave is connected to the master and the other slave is slow to connect. I will make a PR asap.

I could reproduce it on the oranagra:6.0.9 branch but not on unstable (I also didn't find any failed daily runs of this test in the last month). Did we improve the sync implementation in later commits?

@oranagra
Member

Ohh, now it all makes sense.
6.0 doesn't have #6271.

JackieXie168 pushed a commit to JackieXie168/redis that referenced this pull request Nov 4, 2020
…nabled (redis#7819)

When all replicas waiting for a bgsave get disconnected (possibly due to output buffer limits),
it may be good to kill the bgsave child. In diskless replication this already happens, but in
disk-based replication the child may still serve some purpose (for persistence).

By killing the child, we prevent it from eating COW memory in vain, and we also allow a new child fork sooner for the next full synchronization or bgsave.
We do that only if rdb persistence wasn't enabled in the configuration.

Btw, rdbRemoveTempFile in killRDBChild no longer blocks the server, so we can call killRDBChild safely.
@ShooterIT ShooterIT deleted the kill-child branch January 11, 2021 12:58
@oranagra oranagra mentioned this pull request Jan 13, 2021
Development

Successfully merging this pull request may close these issues.

[BUG] Disk Based Replication Causing OOM in Master (v6.0.5)