Skip to content

Remove tmp rdb file in background thread#7762

Merged
oranagra merged 2 commits intoredis:unstablefrom
ShooterIT:remove-rdb-async
Sep 17, 2020
Merged

Remove tmp rdb file in background thread#7762
oranagra merged 2 commits intoredis:unstablefrom
ShooterIT:remove-rdb-async

Conversation

@ShooterIT
Copy link
Member

Remove tmp rdb file in background thread to avoid unnecessary time cost. Maybe removing tmp aof file is similar.

@oranagra
Copy link
Member

@ShooterIT why is the loop in prepareShutdown needed?
file will be closed anyway when the process exists?
similarly, why look at shutdown_asap? why not always use bg_unlink?

@ShooterIT
Copy link
Member Author

@oranagra Just it is a good programming style, we closed all files we opened when process exited even OS could clean up everything as much as possible, actually, Linux can't clean somethings such as POSIX semaphores when process exits, https://man7.org/linux/man-pages/man7/sem_overview.7.html

just above reason, I want to unlink file exactly when shutdown_asap is set since process will exit soon in this case.

@oranagra
Copy link
Member

From my experience, cleanup code at exit can often do more harm than good (like prevent the process from exiting or delay it, thus delaying the restart and making the service unavailable).
for instance, if for some reason the bio thread is stack redis will never exit.
i'd like to see what @yossigo thinks about these.

p.s. i would rather have sigShutdownHandler call rdbRemoveTempFile with a specific force flag before doing exit rather than rely on the global (can be set to true in another scenario, before redis reaches to shutdown via serverCron)

@ShooterIT ShooterIT marked this pull request as draft September 14, 2020 01:23
@ShooterIT
Copy link
Member Author

Another thing we should care about is we only can call async-signal-safe functions from signal handlers safely, bg_unlink will call pthread_mutex_lock and zmalloc functions that are not async-signal-safem, and access global variables. What's more, server.shutdown_asap also should be volatile sig_atomic_t type, also please have a look #7778

I also like to use a force flag to specify we use unlink or bg_unlink, I used shutdown_asap just because I didn't want to change to much. But I think I am wrong, I will break encapsulation and change the original logic silently if I use the global shutdown_asap.

@ShooterIT
Copy link
Member Author

hi @oranagra I took a more look at rdbRemoveTempFile, I found we use snprintf in it, AFAIK, snprintf is not async-signal-safe function, right?

@oranagra
Copy link
Member

@ShooterIT i see that indeed snprintf isn't listed as async-signal-safe (not required to be so in POSIX), but i actually don't understand why... unlike printf which messes with stdout, sprintf is just a utility function, so maybe it is safe although not required to be.

but maybe an easier solution would be to just use sdscatfmt instead?

@ShooterIT
Copy link
Member Author

@oranagra I also don't know it clearly. There are some discussions, https://stackoverflow.com/questions/6758222/snprintf-in-signal-handler-creates-segmentation-fault-if-started-with-valgrind
we only use format specifiers %d, maybe it is right, but i can't certain.

We can't use sdscatfmt because it will call malloc.

@yossigo
Copy link
Collaborator

yossigo commented Sep 16, 2020

I think we should just adopt a async-signal-safe implementation of a limited snprintf(), there are a few floating around and used for this purpose.

Regarding cleaning up, I agree with @ShooterIT this is a good practice but I also agree with @oranagra that we probably don't want to block a terminating process only because we're waiting for ulink() to return. I think a good middle way would be to put a time limit on waiting and produce a log message if we know we're exiting while some files have not yet been fully removed.

@oranagra
Copy link
Member

@yossigo why import another snprintf implementation and not just use sdscatfmt for this one case?
it might be less efficient (heap allocations), but for this minor print it's insignificant and i don't think this case justifies adding another big pile of code.

i can argue that even when rdbRemoveTempFile is called from prepare_shutdown, we may prefer to do bg_unlink and exit rather than either use blocking unlink or bg_unlink + wait loop.
although i'm not sure what are the implications (who's time will be used to actually delete the inode)

@yossigo
Copy link
Collaborator

yossigo commented Sep 16, 2020

@oranagra Why do you assume zmalloc() is safe?
Regarding the other issue, the interesting question is would unlink complete in the background or will it keep the process defunct until it is completed. Even if it does, I think it may still be better than blocking before exit().

@ShooterIT
Copy link
Member Author

@oranagra In signal handle, we only can use async-signal-safe functions, sdscatfmt will call malloc that is not async-signal-safe.

In prepareForShutdown, If we worry about blocking, we can just use bg_unlink and exit, right?

@oranagra
Copy link
Member

sorry, my bad about malloc.
in theory we can maybe create a stack based sds that is large enough and know that sdscatfmt isn't gonna realloc, but that's really ugly, so i guess we need to bring another implementation of sprintf.

i prefer bg_unlink and exit (assuming the kernel will still make sure the file is deleted), seems better to me than a loop to wait till we close the file, or a blocking unlink

@ShooterIT ShooterIT marked this pull request as ready for review September 17, 2020 03:12
@ShooterIT
Copy link
Member Author

Hi @oranagra @yossigo I use ll2string and strncpy to replace snprintf, that references to serverLogFromHandler. And i add some tests to cover this commit. Please have a look.

@ShooterIT
Copy link
Member Author

I also remove waiting loop in prepareForShutdown

@ShooterIT
Copy link
Member Author

In order to test file still is not closed, i add sleep in bioProcessBackgroundJobs in my test code but not in commit.

@@ -199,6 +199,7 @@ void *bioProcessBackgroundJobs(void *arg) {
 
         /* Process the job accordingly to its type. */
         if (type == BIO_CLOSE_FILE) {
+            sleep(2000);
             close((long)job->arg1);
         } else if (type == BIO_AOF_FSYNC) {
             redis_fsync((long)job->arg1);

@ShooterIT
Copy link
Member Author

@oranagra Please review my comments in prepareForShutdown, also please squash into one commit if you feel ok

@oranagra oranagra merged commit b002d2b into redis:unstable Sep 17, 2020
@oranagra
Copy link
Member

for {set i 0} {$i < 20} {incr i} {
r set $i $i
}
# It will cost 2s(20 * 100ms) to dump rdb
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here, total time is 2s


# Child is dumping rdb
r bgsave
after 100
Copy link
Member Author

@ShooterIT ShooterIT Sep 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, is this problem? @oranagra yesterday, you pointed out my some similar problems, after 100 freezes tcl code?
#7819 (comment)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the assertion that fails is the one in line 42 (first assert in the second test).
and in this case what could have happened is that the after 100 was not enough, and we tested the existence of the rdb file before redis started to save it (since redis was for some reason frozen for some 100ms).

i guess we should change this to a wait_for_condition loop that will usually exit a lot sooner, but at the same time have longer timeout.
i suppose similar changes can be made in a few other places in these new tests (to reflect the spirit of the changes i asked for in #7819)

please make sure you can reproduce this (possibly bu pushing an experimental branch into github to use the actions) before attempting to solve it.

oranagra pushed a commit that referenced this pull request Oct 27, 2020
We're already using bg_unlink in several places to delete the rdb file in the background,
and avoid paying the cost of the deletion from our main thread.
This commit uses bg_unlink to remove the temporary rdb file in the background too.

However, in case we delete that rdb file just before exiting, we don't actually wait for the
background thread or the main thread to delete it, and just let the OS clean up after us.
i.e. we open the file, unlink it and exit with the fd still open.

Furthermore, rdbRemoveTempFile can be called from a thread and was using snprintf which is
not async-signal-safe, we now use ll2string instead.

(cherry picked from commit b002d2b)
JackieXie168 pushed a commit to JackieXie168/redis that referenced this pull request Nov 4, 2020
We're already using bg_unlink in several places to delete the rdb file in the background,
and avoid paying the cost of the deletion from our main thread.
This commit uses bg_unlink to remove the temporary rdb file in the background too.

However, in case we delete that rdb file just before exiting, we don't actually wait for the
background thread or the main thread to delete it, and just let the OS clean up after us.
i.e. we open the file, unlink it and exit with the fd still open.

Furthermore, rdbRemoveTempFile can be called from a thread and was using snprintf which is
not async-signal-safe, we now use ll2string instead.
jschmieg pushed a commit to memKeyDB/memKeyDB that referenced this pull request Nov 6, 2020
We're already using bg_unlink in several places to delete the rdb file in the background,
and avoid paying the cost of the deletion from our main thread.
This commit uses bg_unlink to remove the temporary rdb file in the background too.

However, in case we delete that rdb file just before exiting, we don't actually wait for the
background thread or the main thread to delete it, and just let the OS clean up after us.
i.e. we open the file, unlink it and exit with the fd still open.

Furthermore, rdbRemoveTempFile can be called from a thread and was using snprintf which is
not async-signal-safe, we now use ll2string instead.

(cherry picked from commit b002d2b)
@ShooterIT ShooterIT deleted the remove-rdb-async branch September 18, 2021 05:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants