Fix race bug (between flushAppendOnly and fsync thread)#5985
guybe7 wants to merge 1 commit into redis:unstable
Conversation
Before the fix (strace): after traffic was stopped, no fdatasync was called until shutdown, 7 (or whatever) seconds after traffic finished.

After the fix: fdatasync was called 1s after the previous fdatasync (and after traffic finished).
Tail of the AOF may never be synced to disk (until shutdown). Consider the following scenario:

1. Some constant traffic on the server.
2. beforeSleep calls flushAppendOnly, which schedules an fsync and clears aof_buf.
3. While the fsync thread is running, traffic continues; beforeSleep and flushAppendOnly run again, this time no fsync is scheduled because one is already in progress, and aof_buf is cleared again.
4. Traffic stops and the fsync thread finishes.
5. Nothing will trigger another fsync, because flushAppendOnly returns immediately when aof_buf is empty.

In order to fix the above, we remember the last synced offset and call fsync in flushAppendOnly if it's not as expected (different from aof_current_size), even if aof_buf is empty.
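The idea behind the fix can be sketched with a toy model. This is a paraphrase, not the real Redis code: the struct, the job helpers, and the counters are invented for illustration; only the field names `aof_current_size` and `aof_fsync_offset` come from the PR. The bio fsync thread samples the file size right before calling fsync and publishes it once the fsync completes.

```c
#include <stddef.h>

/* Toy model of the race and of the fix -- not the real Redis code. */
typedef struct {
    long long aof_current_size;  /* bytes written to the AOF file */
    long long aof_fsync_offset;  /* bytes guaranteed to be on disk */
    long long pending_sync_size; /* size sampled by the running fsync */
    int fsync_in_progress;
} Aof;

/* The background fsync job, squashed into a start and a finish call. */
void fsync_job_start(Aof *a) {
    a->fsync_in_progress = 1;
    a->pending_sync_size = a->aof_current_size; /* sampled before fsync */
}
void fsync_job_finish(Aof *a) {
    a->aof_fsync_offset = a->pending_sync_size; /* published after fsync */
    a->fsync_in_progress = 0;
}

/* beforeSleep -> flushAppendOnly: "write" aof_buf, maybe schedule fsync. */
void flushAppendOnly(Aof *a, size_t aof_buf_len) {
    a->aof_current_size += (long long)aof_buf_len;
    /* The fix: even with an empty buffer, an fsync is still needed
     * when the synced offset lags behind the file size. */
    if (aof_buf_len == 0 && a->aof_fsync_offset == a->aof_current_size)
        return;
    if (!a->fsync_in_progress) fsync_job_start(a);
}
```

Replaying the scenario: 100 bytes are flushed and an fsync starts; 50 more bytes are flushed while it runs; traffic stops. The next empty flush sees offset 100 != size 150 and schedules the missing fsync, which the pre-fix code would have skipped.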
Side note: the same race also exists in prepareForShutdown.
Thank you @guybe7, I have to study this a bit in depth. Will check tomorrow afternoon.
@antirez thanks!
Great discovery @guybe7, that should be fixed ASAP indeed. I have a question however: what about making the fix much simpler, just adding something like this to serverCron()?
@antirez the problem starts when aof_buf is empty but the AOF still needs to be synced to disk (the scenario is explained in the top comment and in the commit message).
@guybe7 That looks like a simple refinement on top of what I suggested: we just create a function flushAppendOnlyPeriodically() that will do what I wrote above if the buffer is non-empty; otherwise it will just schedule the background fsync if the buffer is empty, no fsync is in progress, and aof_flush_postponed_start is non-zero. And when it does, it just sets aof_flush_postponed_start to 0.
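A rough, self-contained sketch of that suggestion (the function name `flushAppendOnlyPeriodically` comes from the comment above; everything else here is a stand-in so the sketch compiles on its own — in Redis the checks would go through `sdslen(server.aof_buf)`, `bioPendingJobsOfType(BIO_AOF_FSYNC)`, `flushAppendOnlyFile()`, and `aof_background_fsync()`):

```c
#include <stddef.h>

/* Stand-in for the relevant parts of struct redisServer. */
struct {
    size_t aof_buf_len;              /* stands in for sdslen(server.aof_buf) */
    long long aof_flush_postponed_start;
    int fsync_in_progress;           /* stands in for pending bio fsync jobs */
    int flush_calls, fsync_jobs;     /* counters, for illustration only */
} server;

/* Stubs for the real Redis functions of the same names. */
void flushAppendOnlyFile(int force) { (void)force; server.flush_calls++; }
void aof_background_fsync(void) { server.fsync_jobs++; }

/* Sketch of the suggested function, to be called from serverCron(). */
void flushAppendOnlyPeriodically(void) {
    if (server.aof_buf_len > 0) {
        /* Non-empty buffer: the normal flush path does the work. */
        flushAppendOnlyFile(0);
    } else if (!server.fsync_in_progress && server.aof_flush_postponed_start) {
        /* Empty buffer, no fsync running, but a flush was postponed:
         * schedule the background fsync ourselves and clear the flag. */
        aof_background_fsync();
        server.aof_flush_postponed_start = 0;
    }
}
```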
@antirez I think I have not provided enough details; I'll describe a simple scenario:

Maybe it is possible to fix this problem in a simpler manner (with a boolean), but I'm relying on maintaining aof_sync_offset for a future PR (allowing modules to get notified when the AOF is synced up to a certain point in time; this may also be helpful for disque).

Another side note: since flushAppendOnly is called unconditionally in beforeSleep (every 100ms), we don't really need to call it from cron as well, but that's harmless I suppose.
@guybe7 ok thanks for making it clearer. Yep, I got "1", and I also understand it is not a matter of calling flushAppendOnly() in serverCron(); I suggested calling a different function from serverCron() to handle just the missing fsync, but I was not clear enough. However, now I get that this can happen even if the fsync was not in progress. The reason I want to fix it in the simplest way to start is that I think before making other PRs around that area I should start merging the

A minor thing about the PR: are you sure it is needed to declare the variable volatile when using atomic operations? I think this should not be needed. Second, we are going to merge the Redis threaded I/O soon; I think we can use _Atomic instead. I'll soon convert all Redis atomic vars to C11 _Atomic. The new Makefile in the threaded-io branch directly uses a C11 target. Cheers.
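For reference, the C11 `_Atomic` approach mentioned above needs no `volatile`. A tiny standalone illustration (not Redis code; the variable name mirrors the PR): a 64-bit offset shared between the main thread and the bio fsync thread gets atomic, tear-free accesses from `_Atomic` alone, even on 32-bit platforms, whereas `volatile` guarantees neither atomicity nor ordering.

```c
#include <stdatomic.h>

/* A 64-bit offset shared across threads; _Atomic makes every plain or
 * explicit access atomic, so no volatile qualifier is needed. */
_Atomic long long aof_fsync_offset;

/* bio thread: publish how far the file is known to be synced. */
void publish_synced_offset(long long size_at_fsync) {
    atomic_store(&aof_fsync_offset, size_at_fsync);
}

/* main thread: a torn read is impossible here, even on 32-bit systems. */
long long read_synced_offset(void) {
    return atomic_load(&aof_fsync_offset);
}
```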
Hi @antirez @guybe7, just read this PR and I understand the delayed

But some codes and comments in

```c
if (!sync_in_progress)
    aofStartBackgroundFsync(); /* was: aof_background_fsync(server.aof_fd); */
/* Note that we update this time regardless of the fact fsync
 * was actually started or not. server.aof_last_fsync is not
 * used to really know when the latest fsync succeeded, but just
 * as a timer to *try* to fsync once every second. */
server.aof_last_fsync = server.unixtime;
```

It said that we don't care about the fact
Hi @antirez @soloestoy, thanks for reviewing this PR.

TBH I can't really think of a simpler way to keep track of "how much" was fsynced except sampling the current AOF size right before calling fsync... if you have a better/simpler way please let me know. Also, as I mentioned, having an indicator of the number of bytes guaranteed to be synced may be useful for modules (even though it's possible to achieve the same result using aof_fsync_in_progress_epoch; but again, how can you tell if you're missing a call to fsync without sampling the current size before calling fsync?).

I read the wait-aof code; it seems to me like it could work together with this PR (of course there will be some conflicts, but basically it's just calling fsyncAppendOnly instead of redis_fsync).

@antirez, I think you're right about the volatile modifier: it's unnecessary and I'll drop it (if you think you'll merge this PR).

@soloestoy, IIUC aof_last_fsync is just a rough indicator of when we did the last fsync... it's not an accurate metric and it's only used in order to trigger an fsync (roughly) every second.
Here is a simpler way IMHO. I think we don't need to check how much data was synced in the bio thread; the only thing we should know is the AOF file size when we create a bio job. When we create an fsync bio job, we record the AOF file size as an fsync offset. Next time, if the AOF buffer is empty, we just check whether the recorded fsync offset is the same as the current AOF size; if not, we create a new bio job. So no race, no lock, just an offset. See PR #6053.
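That simpler idea can be sketched with the same kind of toy model (again a paraphrase, not the actual #6053 patch): the offset is recorded at job-creation time, so only the main thread ever touches `aof_fsync_offset`, which is why no lock and no atomics are needed.

```c
#include <stddef.h>

/* Toy model of the #6053 approach -- not the real patch. */
typedef struct {
    long long aof_current_size;
    long long aof_fsync_offset;  /* AOF size at the last scheduled fsync */
    int fsync_in_progress;
    int fsync_jobs;              /* bio jobs created, for illustration */
} Aof;

/* Main thread creates the bio job and records the size right away. */
void createFsyncJob(Aof *a) {
    a->fsync_jobs++;
    a->fsync_in_progress = 1;
    a->aof_fsync_offset = a->aof_current_size; /* recorded at creation */
}

/* Bio thread completed the fsync; nothing else to publish. */
void fsyncJobDone(Aof *a) { a->fsync_in_progress = 0; }

void flushAppendOnly(Aof *a, size_t aof_buf_len) {
    a->aof_current_size += (long long)aof_buf_len;
    /* If the last scheduled fsync already covered the whole file,
     * there is nothing to do -- even when the buffer is empty. */
    if (a->aof_fsync_offset == a->aof_current_size) return;
    if (!a->fsync_in_progress) createFsyncJob(a);
}
```

The design point: the guybe7 variant publishes the offset from the fsync thread after the fsync completes, while this variant records it on the main thread when the job is created, trading a slightly optimistic offset for having a single writer.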
Hi @soloestoy, thanks for your time. I reviewed the fix, LGTM. Both fixes still comply with my work on the modules "wait-aof" feature, as they both keep track of the current amount of synced bytes. @antirez it's up to you :)
Yes @guybe7, I used your idea and just simplified the implementation; hope you don't feel uncomfortable.
@soloestoy not at all :) TBH I don't think there's a better way to do it... my personal opinion is that your PR should be merged rather than this one.
You two are such a cute pair :P
Thanks both, I merged @soloestoy's PR. It's unfortunate we have to track this information (however @guybe7 said he has at least one usage for it), but it looks like there is no simple way out. There is an alternative, which also has disadvantages: just fsync every second even if the buffer is empty. However, for read-heavy workloads that looks absurd and may cause totally unneeded delays. So... I guess we picked the best option. Thank you.
aof: enhance AOF_FSYNC_EVERYSEC, more details in #5985
Note: this was backported to Redis 5 together with other changes. I think Monday is time to release a new Redis 5 patch level version. |
…YNC_ALWAYS This PR fixes two bugs:

1. About the MP-AOF one: that code assumes that we started a new AOF file just now (when we have a new base into which we're gonna incrementally write), but the fact is that with multipart AOF the fork-done handler doesn't really affect the incremental file being maintained by the parent process; there's no reason to re-issue SELECT, and I suppose there's no reason to update any of the fsync metrics. This should have been deleted with MP-AOF (introduced in redis#9788, 7.0). The aof_fsync_offset it updates will cause us to miss an fsync in flushAppendOnlyFile if we stop write commands in AOF_FSYNC_EVERYSEC.
2. About the AOF_FSYNC_ALWAYS one: following redis#6053 (details in redis#5985), we also check AOF_FSYNC_ALWAYS to handle users switching from everysec to always.

We detected these problems with the WAITAOF tests (Redis 7.2), introduced in redis#11713.
…SYNC_ALWAYS (#11973) This PR fixes several unrelated bugs that were discovered by the same set of tests (the WAITAOF tests in #11713) and could make the `WAITAOF` test hang.

The change in `backgroundRewriteDoneHandler` is about MP-AOF. That leftover old code assumes that we started a new AOF file just now (when we have a new base into which we're gonna incrementally write), but the fact is that with MP-AOF the fork-done handler doesn't really affect the incremental file being maintained by the parent process: there's no reason to re-issue `SELECT`, and no reason to update any of the fsync variables in that flow. This should have been deleted with MP-AOF (introduced in #9788, 7.0). The damage is that the update to `aof_fsync_offset` will cause us to miss an fsync in `flushAppendOnlyFile`; that happens if we stop write commands in `AOF_FSYNC_EVERYSEC` while an AOFRW is in progress. This caused a new `WAITAOF` test to sometimes hang forever.

Also because of MP-AOF, we needed to change `aof_fsync_offset` to `aof_last_incr_fsync_offset` and match it to `aof_last_incr_size` in `flushAppendOnlyFile`. This is because in the past we compared `aof_fsync_offset` and `aof_current_size`, but with MP-AOF the total AOF file may be smaller after AOFRW while the (already existing) incr file still has data that needs to be fsynced.

The change in `flushAppendOnlyFile` about `AOF_FSYNC_ALWAYS` follows #6053 (details in #5985): we also check `AOF_FSYNC_ALWAYS` to handle a case where appendfsync is changed from everysec to always while there is data that's written but not yet fsynced.
(cherry picked from commit cb17178)
Tail of the AOF may never be synced to disk (until shutdown)

Consider the following scenario:

1. beforeSleep is called, flushAppendOnly: aof_buf is written to file and cleared, the background fsync thread starts (and ends), server.aof_last_fsync is set.
2. More commands are written to aof_buf, beforeSleep is called, flushAppendOnly: aof_buf is written to file and cleared, but the background fsync thread does not start because it's been less than a second since server.aof_last_fsync was set.
3. flushAppendOnly returns immediately because aof_buf is empty, without starting a sync, and so on.

In order to fix the above, we remember the last synced offset and call fsync in flushAppendOnly if it's not as expected (different from aof_current_size), even if aof_buf is empty.

We use atomic operations on aof_current_size and aof_fsync_offset because they are now read/written from 2 parallel threads (accesses may not be atomic on 32-bit systems).