Fsync directory while persisting AOF manifest, RDB file, and config file#10737
Fsync directory while persisting AOF manifest, RDB file, and config file#10737oranagra merged 12 commits intoredis:unstablefrom
Conversation
|
@uvletter thank you. i do man for It does seem to need the |
oranagra
left a comment
There was a problem hiding this comment.
LGTM, however, i notice we have a few other places in redis where we do fsync and then rename, so i suppose all of these deserve the same fix.
can you go over them and check if that's right?
if it is, then i suppose we'd rather create a utility function to fsync a dir, so we can use it as a single call in all these places.
Sure, let me have a try |
|
@oranagra after a look around I find several places that perhaps requires a dir sync, like rdb generation, replication and config rewrite, there're also two places I didn't add a dir sync, the acl and cluster config persistence. As for acl, the file itself isn't fsync, so I think it make no sense to fsync dir only. And for cluster config, it's overwritten in place, so it make no sense to fsync dir, maybe we could refine it to the pattern create temp, write and rename in the future. I add a new util function |
oranagra
left a comment
There was a problem hiding this comment.
@uvletter thank you.
i think the two places you mentioned that aren't currently doing an fsync are things we wanna fix, but arguably not as part of this PR.
- i think that ACL SAVE is basically the same as CONFIG REWRITE, and it should have the same guarantees.
- clusterSaveConfig is also the same, and probably behaves the same as CONFIG REWRITE did before #7824
src/rdb.c
Outdated
| stopSaving(0); | ||
| return C_ERR; | ||
| } | ||
| if (fsyncFileDir(filename) == -1) goto werr; |
There was a problem hiding this comment.
same here goto werr is not valid at this point. the rename already succeeded (e.g. there's no point in unlink(tmpfile).
There was a problem hiding this comment.
If we regard unlink(tmpfile) as a best effort attempt, after rename succeeded we could also try to unlink(tmpfile)(maybe meaningless) and ignore the possible errors. I commit the same error handling in config.c just now.
| dir_fd = open(dname, O_RDONLY); | ||
| if (dir_fd == -1) { | ||
| return -1; | ||
| } |
There was a problem hiding this comment.
I think we should probably silently ignore EISDIR and EINVAL.
There was a problem hiding this comment.
If we will ignore some errno, I think it's better to have a code comment why it's special, but I have no research on these errno...
There was a problem hiding this comment.
@uvletter Please see #10737 (comment) . errno check will help to detect OSes/Filesystems that do not support fsync() on a directory.
|
I'll dump what I think. 1- We should detect OSes that do not support
|
src/util.c
Outdated
| if (dir_fd == -1) { | ||
| /* Some OSs don't allow us to open directories at all (Windows returns | ||
| * EACCES), just ignore the error in that case */ | ||
| if (errno == EISDIR || errno == EACCES) { |
There was a problem hiding this comment.
We shouldn't ignore EACCESS:
- Redis is not supported on Windows anyway.
- It's a valid error if we have
w+xon the directory: we'll be able to create/rename files but not read.
There was a problem hiding this comment.
It was just not obvious from the comment if any other system returns EACCES if opening a directory is not allowed. Anyway, case-2 is valid indeed and probably no need to be paranoid about this and no need to ignore EACCESS.
oranagra
left a comment
There was a problem hiding this comment.
I've re-reviewed this PR and all the comments.
my conclusion is that directory fsync errors are probably very rare (if they exist at all),
and that we wanna handle them like any other write or rename error of the same flow in which they occur.
at this point in time, i'd rather not have a general panic (as if it's a catastrophe), and i'd rather not suffice by logging the error and ignore it (maybe in the future we'll know better).
my reasoning is that doing that, can't be a completely invalid choice (we're much more likely to get a write error anyway), so that code flow is probably validated by real users.
i would like to revive the directory fsync in replication.c, and would later (in some future PR, would like to modify acl.c and cluster.c to do that same).
anyone think i got this wrong?
src/config.c
Outdated
| if (retval) unlink(tmp_conffile); | ||
| if (retval) { | ||
| /* try to unlink the temp, ignoring possible errors */ | ||
| int old_errno = errno; |
There was a problem hiding this comment.
why do we backup errno here? we already printed the error and errno, and the caller doesn't need it.
if ti did, then we should back it up before close
There was a problem hiding this comment.
I thought a bad directory fsync would make the state of temp file uncertain, so the unlink may fail and corrupt the errno. Your advise wrapping before close is more universal, so I refactor into this way.
src/rdb.c
Outdated
| stopSaving(0); | ||
| return C_ERR; | ||
| } | ||
| if (fsyncFileDir(filename) == -1) goto werr; |
There was a problem hiding this comment.
maybe each of these goto werr should set some string variable, so that the error log contains an indication of which call failed? (one of two fsync, fclose, etc)
|
@oranagra I agree with your analysis, |
|
@oranagra Sounds good to me. Handling The only corner case is if you need to rollback |
|
I think we can't tell in which state we landed (and more so, considering the details here suggesting that a successful subsequent fsync isn't any good either, and the writes may be lost forever). So I'd argues that for now we should not attempt anything fancy that tries to get out of the unknown state, just handle the error as if write failed, rather than panic, or ignore the error. |
|
@oranagra I completely agree as it's also my origin proposal :-) |
oranagra
left a comment
There was a problem hiding this comment.
@redis/core-team please be aware of this change and comment if you have feedback.
i would like to include it in 7.0.3.
in theory this could cause some unexpected issues:
- platforms or file systems that don't support directory fsync will start failing (we tried to resolve that inside fsyncFileDir)
- cases where a hardware issue causing the fsync to fail, in which case we can either:
- ignore the error (best effort fsync)
- panic (since in some cases we can't be sure what's the result, i.e. if the rename succeeded or failed, and there may even be cases where a subsequent successful fsync would still not mean the data of the previous one is safe)
- or just treat it the same as we treat other write errors (which is what we did for now).
|
@uvletter thank you for your patience. maybe you'd like to handle cluster and ACL next? |
Yes! let me have a try |
As an outstanding part mentioned in #10737, we could just make the cluster config file and ACL file saving done with a more safe and atomic pattern (write to temp file, fsync, rename, fsync dir). The cluster config file uses an in-place overwrite and truncation (which was also used by the main config file before #7824). The ACL file is using the temp file and rename approach, but was missing an fsync. Co-authored-by: 朱天 <[email protected]>
…ile (redis#10737) The current process to persist files is `write` the data, `fsync` and `rename` the file, but a underlying problem is that the rename may be lost when a sudden crash like power outage and the directory hasn't been persisted. The article [Ensuring data reaches disk](https://lwn.net/Articles/457667/) mentions a safe way to update file should be: 1. create a new temp file (on the same file system!) 2. write data to the temp file 3. fsync() the temp file 4. rename the temp file to the appropriate name 5. fsync() the containing directory This commit handles CONFIG REWRITE, AOF manifest, and RDB file (both for persistence, and the one the replica gets from the master). It doesn't handle (yet), ACL SAVE and Cluster configs, since these don't yet follow this pattern.
As an outstanding part mentioned in redis#10737, we could just make the cluster config file and ACL file saving done with a more safe and atomic pattern (write to temp file, fsync, rename, fsync dir). The cluster config file uses an in-place overwrite and truncation (which was also used by the main config file before redis#7824). The ACL file is using the temp file and rename approach, but was missing an fsync. Co-authored-by: 朱天 <[email protected]>
|
It appears that these atomic operation changes were only merged into version V7, and not into version V6. |
@s1191994586-max multipart AOF didn't exist in 6.x, so some these are not applicable there. Regardless, we don't normally backport non-critical fixes to old stable releases, even an innocent looking fix carries some risk, so if the bug is non-critical, we can't afford to backport it all the way back to very old releases. |
The current process to persist files is
writethe data,fsyncandrenamethe file, but a underlying problem is that the rename may be lost when a sudden crash like power outage and the directory hasn't been persisted.The article Ensuring data reaches disk mentions a safe way to update file should be:
Referring to Leveldb ,
So we should also fsync the directory to ensure the safety.
This PR handles CONFIG REWRITE, AOF manifest, and RDB file (both for persistence, and the one the replica gets from the master).
It doesn't handle (yet), ACL SAVE and Cluster configs, since these don't yet follow this pattern.