Skip to content

Fsync directory while persisting AOF manifest, RDB file, and config file#10737

Merged
oranagra merged 12 commits intoredis:unstablefrom
uvletter:fsync_aof_dir
Jun 20, 2022
Merged

Fsync directory while persisting AOF manifest, RDB file, and config file#10737
oranagra merged 12 commits intoredis:unstablefrom
uvletter:fsync_aof_dir

Conversation

@uvletter
Copy link
Contributor

@uvletter uvletter commented May 17, 2022

The current process to persist files is write the data, fsync and rename the file, but a underlying problem is that the rename may be lost when a sudden crash like power outage and the directory hasn't been persisted.

The article Ensuring data reaches disk mentions a safe way to update file should be:

  1. create a new temp file (on the same file system!)
  2. write data to the temp file
  3. fsync() the temp file
  4. rename the temp file to the appropriate name
  5. fsync() the containing directory

Referring to Leveldb ,

  Status Sync() override {
    // Ensure new files referred to by the manifest are in the filesystem.
    //
    // This needs to happen before the manifest file is flushed to disk, to
    // avoid crashing in a state where the manifest refers to files that are not
    // yet on disk.
    Status status = SyncDirIfManifest();
    if (!status.ok()) {
      return status;
    }

    status = FlushBuffer();
    if (!status.ok()) {
      return status;
    }

    return SyncFd(fd_, filename_);
  }

So we should also fsync the directory to ensure the safety.

This PR handles CONFIG REWRITE, AOF manifest, and RDB file (both for persistence, and the one the replica gets from the master).
It doesn't handle (yet), ACL SAVE and Cluster configs, since these don't yet follow this pattern.

@chenyang8094
Copy link
Contributor

@uvletter thank you. i do man for fsync and get:

Calling fsync() does not necessarily ensure that the entry in the directory containing the 
file has also reached disk.  For that an explicit fsync() on a file descriptor for the directory
is also needed.

It does seem to need the fsync directory, but others say it depends on the specific filesystem (and mount options, etc.), anyhow I think adding fsync makes the code more portable.

chenyang8094
chenyang8094 approved these changes May 18, 2022
Copy link
Contributor

@chenyang8094 chenyang8094 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@oranagra oranagra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, however, i notice we have a few other places in redis where we do fsync and then rename, so i suppose all of these deserve the same fix.
can you go over them and check if that's right?

if it is, then i suppose we'd rather create a utility function to fsync a dir, so we can use it as a single call in all these places.

@uvletter
Copy link
Contributor Author

LGTM, however, i notice we have a few other places in redis where we do fsync and then rename, so i suppose all of these deserve the same fix. can you go over them and check if that's right?

if it is, then i suppose we'd rather create a utility function to fsync a dir, so we can use it as a single call in all these places.

Sure, let me have a try

@uvletter
Copy link
Contributor Author

@oranagra after a look around I find several places that perhaps requires a dir sync, like rdb generation, replication and config rewrite, there're also two places I didn't add a dir sync, the acl and cluster config persistence. As for acl, the file itself isn't fsync, so I think it make no sense to fsync dir only. And for cluster config, it's overwritten in place, so it make no sense to fsync dir, maybe we could refine it to the pattern create temp, write and rename in the future.

I add a new util function fsyncFileDir to sync the directory. The parameter using filename not dirname is for a easy use, as there're several types of directory to be synced, pwd for rdb, aof_dirname for aof, and custom dir for config, to avoid messing them up, fsyncFileDir just receive the file that changes, and auto parse the directory name.

Copy link
Member

@oranagra oranagra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@uvletter thank you.
i think the two places you mentioned that aren't currently doing an fsync are things we wanna fix, but arguably not as part of this PR.

  • i think that ACL SAVE is basically the same as CONFIG REWRITE, and it should have the same guarantees.
  • clusterSaveConfig is also the same, and probably behaves the same as CONFIG REWRITE did before #7824

src/rdb.c Outdated
stopSaving(0);
return C_ERR;
}
if (fsyncFileDir(filename) == -1) goto werr;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here goto werr is not valid at this point. the rename already succeeded (e.g. there's no point in unlink(tmpfile).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we regard unlink(tmpfile) as a best effort attempt, after rename succeeded we could also try to unlink(tmpfile)(maybe meaningless) and ignore the possible errors. I commit the same error handling in config.c just now.

Copy link
Collaborator

@yossigo yossigo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@oranagra I agree that we should not universally panic when fsync fails, and instead decide on the right course of action on a case by case basis.

Comment on lines 948 to 951
dir_fd = open(dname, O_RDONLY);
if (dir_fd == -1) {
return -1;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should probably silently ignore EISDIR and EINVAL.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we will ignore some errno, I think it's better to have a code comment why it's special, but I have no research on these errno...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@uvletter Please see #10737 (comment) . errno check will help to detect OSes/Filesystems that do not support fsync() on a directory.

@tezc
Copy link
Collaborator

tezc commented May 28, 2022

I'll dump what I think.

1- We should detect OSes that do not support fsync() on a directory and ignore errors for these: #10737 (comment)
2- One question is how to handle if rename() succeeds and directory fsync() fails. Looks like both sqlite and postgresql do not act like it's the end of the world when fsync() on a directory fails. Sqlite ignores it and PostgreSQL is a mix (not so sure, in some places it's ignored and in some places causes failures that I couldn't evaluate its impact).

  • This error case is probably super rare and does not happen with all the filesystems.
  • Logging the error and ignoring it might be a valid approach.
  • IMHO abort() will be a valid approach if admin can do something better than we can or if we prevent something bad by doing that. If restarting Redis is the only thing after abort(), maybe there is no point in abort(). Probably, it's better to decide if abort() makes sense in each use case.

src/util.c Outdated
if (dir_fd == -1) {
/* Some OSs don't allow us to open directories at all (Windows returns
* EACCES), just ignore the error in that case */
if (errno == EISDIR || errno == EACCES) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't ignore EACCESS:

  1. Redis is not supported on Windows anyway.
  2. It's a valid error if we have w+x on the directory: we'll be able to create/rename files but not read.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was just not obvious from the comment if any other system returns EACCES if opening a directory is not allowed. Anyway, case-2 is valid indeed and probably no need to be paranoid about this and no need to ignore EACCESS.

Copy link
Member

@oranagra oranagra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've re-reviewed this PR and all the comments.
my conclusion is that directory fsync errors are probably very rare (if they exist at all),
and that we wanna handle them like any other write or rename error of the same flow in which they occur.
at this point in time, i'd rather not have a general panic (as if it's a catastrophe), and i'd rather not suffice by logging the error and ignore it (maybe in the future we'll know better).

my reasoning is that doing that, can't be a completely invalid choice (we're much more likely to get a write error anyway), so that code flow is probably validated by real users.

i would like to revive the directory fsync in replication.c, and would later (in some future PR, would like to modify acl.c and cluster.c to do that same).

anyone think i got this wrong?

src/config.c Outdated
if (retval) unlink(tmp_conffile);
if (retval) {
/* try to unlink the temp, ignoring possible errors */
int old_errno = errno;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we backup errno here? we already printed the error and errno, and the caller doesn't need it.
if ti did, then we should back it up before close

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought a bad directory fsync would make the state of temp file uncertain, so the unlink may fail and corrupt the errno. Your advise wrapping before close is more universal, so I refactor into this way.

src/rdb.c Outdated
stopSaving(0);
return C_ERR;
}
if (fsyncFileDir(filename) == -1) goto werr;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe each of these goto werr should set some string variable, so that the error log contains an indication of which call failed? (one of two fsync, fclose, etc)

@yossigo
Copy link
Collaborator

yossigo commented Jun 14, 2022

@oranagra I agree with your analysis, fsync() failure after a directory operation is not different from failure of the operation itself.

@tezc
Copy link
Collaborator

tezc commented Jun 14, 2022

@oranagra Sounds good to me. Handling fsync() failure on a directory as another write/rename error is the best solution.

The only corner case is if you need to rollback rename() (If application logic requires rollback) after subsequent fsync() failure on the directory. Then, we have two options, abort or ignore. Because, I don't know how we can rollback rename() in this case.

@oranagra
Copy link
Member

I think we can't tell in which state we landed (and more so, considering the details here suggesting that a successful subsequent fsync isn't any good either, and the writes may be lost forever).

So I'd argues that for now we should not attempt anything fancy that tries to get out of the unknown state, just handle the error as if write failed, rather than panic, or ignore the error.
If some delay we'll have specific info on a certain case on a certain platform, we can reopen this.

@uvletter
Copy link
Contributor Author

@oranagra I completely agree as it's also my origin proposal :-)
According to the discusses I recover the code in replication.c, and make some other fix, if I'm not misunderstanding your comments.

Copy link
Member

@oranagra oranagra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@redis/core-team please be aware of this change and comment if you have feedback.
i would like to include it in 7.0.3.

in theory this could cause some unexpected issues:

  • platforms or file systems that don't support directory fsync will start failing (we tried to resolve that inside fsyncFileDir)
  • cases where a hardware issue causing the fsync to fail, in which case we can either:
    1. ignore the error (best effort fsync)
    2. panic (since in some cases we can't be sure what's the result, i.e. if the rename succeeded or failed, and there may even be cases where a subsequent successful fsync would still not mean the data of the previous one is safe)
    3. or just treat it the same as we treat other write errors (which is what we did for now).

@oranagra oranagra added state:major-decision Requires core team consensus approval-needed Waiting for core team approval to be merged labels Jun 15, 2022
@oranagra oranagra changed the title Fsync aof directory while persisting manifest Fsync directory while persisting AOF manifest, RDB file, and config file Jun 15, 2022
@oranagra oranagra added the release-notes indication that this issue needs to be mentioned in the release notes label Jun 15, 2022
@oranagra oranagra merged commit 99a425d into redis:unstable Jun 20, 2022
@oranagra
Copy link
Member

@uvletter thank you for your patience. maybe you'd like to handle cluster and ACL next?

@uvletter
Copy link
Contributor Author

@uvletter thank you for your patience. maybe you'd like to handle cluster and ACL next?

Yes! let me have a try

@oranagra oranagra mentioned this pull request Jul 11, 2022
oranagra pushed a commit that referenced this pull request Jul 20, 2022
As an outstanding part mentioned in #10737, we could just make the cluster config file and
ACL file saving done with a more safe and atomic pattern (write to temp file, fsync, rename, fsync dir).

The cluster config file uses an in-place overwrite and truncation (which was also used by the
main config file before #7824).
The ACL file is using the temp file and rename approach, but was missing an fsync.

Co-authored-by: 朱天 <[email protected]>
enjoy-binbin pushed a commit to enjoy-binbin/redis that referenced this pull request Jul 31, 2023
…ile (redis#10737)

The current process to persist files is `write` the data, `fsync` and `rename` the file,
but a underlying problem is that the rename may be lost when a sudden crash like
power outage and the directory hasn't been persisted.

The article [Ensuring data reaches disk](https://lwn.net/Articles/457667/) mentions
a safe way to update file should be:

1. create a new temp file (on the same file system!)
2. write data to the temp file
3. fsync() the temp file
4. rename the temp file to the appropriate name
5. fsync() the containing directory

This commit handles CONFIG REWRITE, AOF manifest, and RDB file (both for persistence,
and the one the replica gets from the master).
It doesn't handle (yet), ACL SAVE and Cluster configs, since these don't yet follow this pattern.
enjoy-binbin pushed a commit to enjoy-binbin/redis that referenced this pull request Jul 31, 2023
As an outstanding part mentioned in redis#10737, we could just make the cluster config file and
ACL file saving done with a more safe and atomic pattern (write to temp file, fsync, rename, fsync dir).

The cluster config file uses an in-place overwrite and truncation (which was also used by the
main config file before redis#7824).
The ACL file is using the temp file and rename approach, but was missing an fsync.

Co-authored-by: 朱天 <[email protected]>
@michaelklishin michaelklishin mentioned this pull request Jan 5, 2026
@s1191994586-max
Copy link

It appears that these atomic operation changes were only merged into version V7, and not into version V6.

@oranagra
Copy link
Member

oranagra commented Jan 31, 2026

It appears that these atomic operation changes were only merged into version V7, and not into version V6.

@s1191994586-max multipart AOF didn't exist in 6.x, so some these are not applicable there.

Regardless, we don't normally backport non-critical fixes to old stable releases, even an innocent looking fix carries some risk, so if the bug is non-critical, we can't afford to backport it all the way back to very old releases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approval-needed Waiting for core team approval to be merged release-notes indication that this issue needs to be mentioned in the release notes state:major-decision Requires core team consensus

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

6 participants