Fsync directory while persisting AOF manifest, RDB file, and config file by uvletter · Pull Request #10737 · redis/redis

uvletter · 2022-05-17T06:03:10Z

The current process to persist files is write the data, fsync and rename the file, but a underlying problem is that the rename may be lost when a sudden crash like power outage and the directory hasn't been persisted.

The article Ensuring data reaches disk mentions a safe way to update file should be:

create a new temp file (on the same file system!)
write data to the temp file
fsync() the temp file
rename the temp file to the appropriate name
fsync() the containing directory

Referring to Leveldb ,

  Status Sync() override {
    // Ensure new files referred to by the manifest are in the filesystem.
    //
    // This needs to happen before the manifest file is flushed to disk, to
    // avoid crashing in a state where the manifest refers to files that are not
    // yet on disk.
    Status status = SyncDirIfManifest();
    if (!status.ok()) {
      return status;
    }

    status = FlushBuffer();
    if (!status.ok()) {
      return status;
    }

    return SyncFd(fd_, filename_);
  }

So we should also fsync the directory to ensure the safety.

This PR handles CONFIG REWRITE, AOF manifest, and RDB file (both for persistence, and the one the replica gets from the master).
It doesn't handle (yet), ACL SAVE and Cluster configs, since these don't yet follow this pattern.

chenyang8094 · 2022-05-18T01:32:46Z

@uvletter thank you. i do man for fsync and get:

Calling fsync() does not necessarily ensure that the entry in the directory containing the 
file has also reached disk.  For that an explicit fsync() on a file descriptor for the directory
is also needed.

It does seem to need the fsync directory, but others say it depends on the specific filesystem (and mount options, etc.), anyhow I think adding fsync makes the code more portable.

src/aof.c

chenyang8094

LGTM

oranagra

LGTM, however, i notice we have a few other places in redis where we do fsync and then rename, so i suppose all of these deserve the same fix.
can you go over them and check if that's right?

if it is, then i suppose we'd rather create a utility function to fsync a dir, so we can use it as a single call in all these places.

uvletter · 2022-05-22T07:33:33Z

LGTM, however, i notice we have a few other places in redis where we do fsync and then rename, so i suppose all of these deserve the same fix. can you go over them and check if that's right?

if it is, then i suppose we'd rather create a utility function to fsync a dir, so we can use it as a single call in all these places.

Sure, let me have a try

uvletter · 2022-05-22T16:19:24Z

@oranagra after a look around I find several places that perhaps requires a dir sync, like rdb generation, replication and config rewrite, there're also two places I didn't add a dir sync, the acl and cluster config persistence. As for acl, the file itself isn't fsync, so I think it make no sense to fsync dir only. And for cluster config, it's overwritten in place, so it make no sense to fsync dir, maybe we could refine it to the pattern create temp, write and rename in the future.

I add a new util function fsyncFileDir to sync the directory. The parameter using filename not dirname is for a easy use, as there're several types of directory to be synced, pwd for rdb, aof_dirname for aof, and custom dir for config, to avoid messing them up, fsyncFileDir just receive the file that changes, and auto parse the directory name.

oranagra

@uvletter thank you.
i think the two places you mentioned that aren't currently doing an fsync are things we wanna fix, but arguably not as part of this PR.

i think that ACL SAVE is basically the same as CONFIG REWRITE, and it should have the same guarantees.
clusterSaveConfig is also the same, and probably behaves the same as CONFIG REWRITE did before #7824

src/aof.c

src/config.c

oranagra · 2022-05-23T16:07:46Z

src/rdb.c

        stopSaving(0);
        return C_ERR;
    }
+    if (fsyncFileDir(filename) == -1) goto werr;


same here goto werr is not valid at this point. the rename already succeeded (e.g. there's no point in unlink(tmpfile).

If we regard unlink(tmpfile) as a best effort attempt, after rename succeeded we could also try to unlink(tmpfile)(maybe meaningless) and ignore the possible errors. I commit the same error handling in config.c just now.

src/replication.c

yossigo

@oranagra I agree that we should not universally panic when fsync fails, and instead decide on the right course of action on a case by case basis.

src/util.c

yossigo · 2022-05-26T18:23:39Z

src/util.c

+    dir_fd = open(dname, O_RDONLY);
+    if (dir_fd == -1) {
+        return -1;
+    }


I think we should probably silently ignore EISDIR and EINVAL.

If we will ignore some errno, I think it's better to have a code comment why it's special, but I have no research on these errno...

@uvletter Please see #10737 (comment) . errno check will help to detect OSes/Filesystems that do not support fsync() on a directory.

src/util.c

src/aof.c

src/replication.c

tezc · 2022-05-28T12:25:05Z

I'll dump what I think.

1- We should detect OSes that do not support fsync() on a directory and ignore errors for these: #10737 (comment)
2- One question is how to handle if rename() succeeds and directory fsync() fails. Looks like both sqlite and postgresql do not act like it's the end of the world when fsync() on a directory fails. Sqlite ignores it and PostgreSQL is a mix (not so sure, in some places it's ignored and in some places causes failures that I couldn't evaluate its impact).

This error case is probably super rare and does not happen with all the filesystems.
Logging the error and ignoring it might be a valid approach.
IMHO abort() will be a valid approach if admin can do something better than we can or if we prevent something bad by doing that. If restarting Redis is the only thing after abort(), maybe there is no point in abort(). Probably, it's better to decide if abort() makes sense in each use case.

yossigo · 2022-05-30T11:45:37Z

src/util.c

+    if (dir_fd == -1) {
+        /* Some OSs don't allow us to open directories at all (Windows returns
+         * EACCES), just ignore the error in that case */
+        if (errno == EISDIR || errno == EACCES) {


We shouldn't ignore EACCESS:

Redis is not supported on Windows anyway.

It's a valid error if we have w+x on the directory: we'll be able to create/rename files but not read.

It was just not obvious from the comment if any other system returns EACCES if opening a directory is not allowed. Anyway, case-2 is valid indeed and probably no need to be paranoid about this and no need to ignore EACCESS.

src/replication.c

oranagra

I've re-reviewed this PR and all the comments.
my conclusion is that directory fsync errors are probably very rare (if they exist at all),
and that we wanna handle them like any other write or rename error of the same flow in which they occur.
at this point in time, i'd rather not have a general panic (as if it's a catastrophe), and i'd rather not suffice by logging the error and ignore it (maybe in the future we'll know better).

my reasoning is that doing that, can't be a completely invalid choice (we're much more likely to get a write error anyway), so that code flow is probably validated by real users.

i would like to revive the directory fsync in replication.c, and would later (in some future PR, would like to modify acl.c and cluster.c to do that same).

anyone think i got this wrong?

oranagra · 2022-06-14T14:07:53Z

src/config.c

-    if (retval) unlink(tmp_conffile);
+    if (retval) {
+        /* try to unlink the temp, ignoring possible errors */
+        int old_errno = errno;


why do we backup errno here? we already printed the error and errno, and the caller doesn't need it.
if ti did, then we should back it up before close

I thought a bad directory fsync would make the state of temp file uncertain, so the unlink may fail and corrupt the errno. Your advise wrapping before close is more universal, so I refactor into this way.

oranagra · 2022-06-14T14:20:28Z

src/rdb.c

        stopSaving(0);
        return C_ERR;
    }
+    if (fsyncFileDir(filename) == -1) goto werr;


maybe each of these goto werr should set some string variable, so that the error log contains an indication of which call failed? (one of two fsync, fclose, etc)

yossigo · 2022-06-14T16:06:45Z

@oranagra I agree with your analysis, fsync() failure after a directory operation is not different from failure of the operation itself.

tezc · 2022-06-14T16:59:02Z

@oranagra Sounds good to me. Handling fsync() failure on a directory as another write/rename error is the best solution.

The only corner case is if you need to rollback rename() (If application logic requires rollback) after subsequent fsync() failure on the directory. Then, we have two options, abort or ignore. Because, I don't know how we can rollback rename() in this case.

oranagra · 2022-06-14T17:52:59Z

I think we can't tell in which state we landed (and more so, considering the details here suggesting that a successful subsequent fsync isn't any good either, and the writes may be lost forever).

So I'd argues that for now we should not attempt anything fancy that tries to get out of the unknown state, just handle the error as if write failed, rather than panic, or ignore the error.
If some delay we'll have specific info on a certain case on a certain platform, we can reopen this.

uvletter · 2022-06-15T08:43:17Z

@oranagra I completely agree as it's also my origin proposal :-)
According to the discusses I recover the code in replication.c, and make some other fix, if I'm not misunderstanding your comments.

oranagra

@redis/core-team please be aware of this change and comment if you have feedback.
i would like to include it in 7.0.3.

in theory this could cause some unexpected issues:

platforms or file systems that don't support directory fsync will start failing (we tried to resolve that inside fsyncFileDir)
cases where a hardware issue causing the fsync to fail, in which case we can either:
1. ignore the error (best effort fsync)
2. panic (since in some cases we can't be sure what's the result, i.e. if the rename succeeded or failed, and there may even be cases where a subsequent successful fsync would still not mean the data of the previous one is safe)
3. or just treat it the same as we treat other write errors (which is what we did for now).

oranagra · 2022-06-20T16:17:55Z

@uvletter thank you for your patience. maybe you'd like to handle cluster and ACL next?

uvletter · 2022-06-21T05:17:10Z

@uvletter thank you for your patience. maybe you'd like to handle cluster and ACL next?

Yes! let me have a try

As an outstanding part mentioned in #10737, we could just make the cluster config file and ACL file saving done with a more safe and atomic pattern (write to temp file, fsync, rename, fsync dir). The cluster config file uses an in-place overwrite and truncation (which was also used by the main config file before #7824). The ACL file is using the temp file and rename approach, but was missing an fsync. Co-authored-by: 朱天 <[email protected]>

…ile (redis#10737) The current process to persist files is `write` the data, `fsync` and `rename` the file, but a underlying problem is that the rename may be lost when a sudden crash like power outage and the directory hasn't been persisted. The article [Ensuring data reaches disk](https://lwn.net/Articles/457667/) mentions a safe way to update file should be: 1. create a new temp file (on the same file system!) 2. write data to the temp file 3. fsync() the temp file 4. rename the temp file to the appropriate name 5. fsync() the containing directory This commit handles CONFIG REWRITE, AOF manifest, and RDB file (both for persistence, and the one the replica gets from the master). It doesn't handle (yet), ACL SAVE and Cluster configs, since these don't yet follow this pattern.

As an outstanding part mentioned in redis#10737, we could just make the cluster config file and ACL file saving done with a more safe and atomic pattern (write to temp file, fsync, rename, fsync dir). The cluster config file uses an in-place overwrite and truncation (which was also used by the main config file before redis#7824). The ACL file is using the temp file and rename approach, but was missing an fsync. Co-authored-by: 朱天 <[email protected]>

s1191994586-max · 2026-01-30T09:36:31Z

It appears that these atomic operation changes were only merged into version V7, and not into version V6.

oranagra · 2026-01-31T12:15:04Z

It appears that these atomic operation changes were only merged into version V7, and not into version V6.

@s1191994586-max multipart AOF didn't exist in 6.x, so some these are not applicable there.

Regardless, we don't normally backport non-critical fixes to old stable releases, even an innocent looking fix carries some risk, so if the bug is non-critical, we can't afford to backport it all the way back to very old releases.

fsync the directory while persisting aof manifest

0133d07

uvletter force-pushed the fsync_aof_dir branch from 7e29b37 to 0133d07 Compare May 17, 2022 06:07

chenyang8094 approved these changes May 18, 2022 •

edited

Loading

View reviewed changes

chenyang8094 suggested changes May 18, 2022

View reviewed changes

src/aof.c Outdated Show resolved Hide resolved

chenyang8094 approved these changes May 18, 2022

View reviewed changes

oranagra reviewed May 22, 2022

View reviewed changes

fsync the directory after rename file

0860178

uvletter force-pushed the fsync_aof_dir branch from 8ebcbe0 to 0860178 Compare May 22, 2022 15:46

oranagra reviewed May 23, 2022

View reviewed changes

oranagra added the 7.0-must-have label May 24, 2022

skip dir fsync for AIX

bcc5e65

yossigo reviewed May 26, 2022

View reviewed changes

check the filename length

1c08d8b

uvletter added 3 commits May 30, 2022 05:22

ingore os-specific error

dfff063

format the code

8495cdf

format the code

3f046b6

yossigo reviewed May 30, 2022

View reviewed changes

uvletter added 3 commits May 31, 2022 05:56

not try to sync dir for replication

8a77d90

exclude EACCESS

2ed1a16

prevent unlink from dirtying errno

124348c

oranagra assigned yossigo Jun 11, 2022

oranagra reviewed Jun 14, 2022

View reviewed changes

some fix to cr comments

c8826fb

fix

331c2de

oranagra approved these changes Jun 15, 2022

View reviewed changes

oranagra added state:major-decision Requires core team consensus approval-needed Waiting for core team approval to be merged labels Jun 15, 2022

oranagra changed the title ~~Fsync aof directory while persisting manifest~~ Fsync directory while persisting AOF manifest, RDB file, and config file Jun 15, 2022

oranagra added the release-notes indication that this issue needs to be mentioned in the release notes label Jun 15, 2022

yossigo approved these changes Jun 16, 2022

View reviewed changes

oranagra merged commit 99a425d into redis:unstable Jun 20, 2022

uvletter mentioned this pull request Jul 3, 2022

Make cluster config file saving atomic and fsync acl #10924

Merged

oranagra mentioned this pull request Jul 11, 2022

Release 7.0.3 #10962

Merged

michaelklishin mentioned this pull request Jan 5, 2026

Ra v3 rabbitmq/ra#494

Merged

Conversation

uvletter commented May 17, 2022 • edited by oranagra Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

chenyang8094 commented May 18, 2022

Uh oh!

Uh oh!

Uh oh!

chenyang8094 left a comment

Choose a reason for hiding this comment

Uh oh!

oranagra left a comment

Choose a reason for hiding this comment

Uh oh!

uvletter commented May 22, 2022

Uh oh!

uvletter commented May 22, 2022

Uh oh!

oranagra left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

yossigo left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

tezc commented May 28, 2022

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

oranagra left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

yossigo commented Jun 14, 2022

Uh oh!

tezc commented Jun 14, 2022

Uh oh!

oranagra commented Jun 14, 2022

Uh oh!

uvletter commented Jun 15, 2022

Uh oh!

oranagra left a comment

Choose a reason for hiding this comment

Uh oh!

oranagra commented Jun 20, 2022

Uh oh!

uvletter commented Jun 21, 2022

Uh oh!

s1191994586-max commented Jan 30, 2026

Uh oh!

oranagra commented Jan 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

uvletter commented May 17, 2022 •

edited by oranagra

Loading

oranagra commented Jan 31, 2026 •

edited

Loading