Make cluster config file saving atomic and fsync acl by uvletter · Pull Request #10924 · redis/redis

uvletter · 2022-07-03T17:08:26Z (description edited by oranagra)

As a follow-up to an outstanding item mentioned in #10737, this PR makes saving the cluster config file and the ACL file safer and atomic, using the pattern: write to a temp file, fsync it, rename it over the original, then fsync the directory.

The cluster config file was previously saved by overwriting and truncating it in place (the approach the main config file also used before #7824).
The ACL file already used the temp-file-and-rename approach, but was missing an fsync.
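For illustration, here is a minimal POSIX sketch of the write-temp / fsync / rename / fsync-dir pattern described above. `save_file_atomically`, its helpers and the `.tmp-<pid>` suffix are hypothetical names chosen for this sketch; they are not the functions this PR adds to Redis.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <libgen.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Write the whole buffer, retrying on short writes and EINTR. */
static int write_all(int fd, const char *buf, size_t len) {
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n == -1) {
            if (errno == EINTR) continue;
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* fsync the directory holding `path`, making the rename itself durable. */
static int fsync_parent_dir(const char *path) {
    char *copy = strdup(path); /* dirname() may modify its argument */
    if (copy == NULL) return -1;
    int dfd = open(dirname(copy), O_RDONLY);
    free(copy);
    if (dfd == -1) return -1;
    int rc = fsync(dfd);
    close(dfd);
    return rc;
}

/* Hypothetical helper: replace `path` with `data` so that, after a crash,
 * the file holds either the old or the new contents, never a torn mix. */
int save_file_atomically(const char *path, const char *data, size_t len) {
    char tmp[4096];
    int n = snprintf(tmp, sizeof(tmp), "%s.tmp-%ld", path, (long)getpid());
    if (n < 0 || (size_t)n >= sizeof(tmp)) return -1;

    /* 1. Write the new contents to a temp file in the same directory. */
    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1) return -1;
    /* 2. fsync it so the data is on disk before it becomes visible. */
    if (write_all(fd, data, len) == -1 || fsync(fd) == -1) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    if (close(fd) == -1) {
        unlink(tmp);
        return -1;
    }
    /* 3. rename() atomically replaces the old file with the temp file. */
    if (rename(tmp, path) == -1) {
        unlink(tmp);
        return -1;
    }
    /* 4. fsync the directory so the new directory entry survives a crash. */
    return fsync_parent_dir(path);
}
```

The temp file must live in the same directory (same filesystem) as the target, otherwise rename() cannot replace it atomically.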

oranagra

@yossigo, do you think such a change is safe for 7.0.x, or should we postpone it to 7.2?

src/acl.c

src/cluster.c

oranagra · 2022-07-04T08:21:52Z

src/cluster.c

-    if (write(fd,ci,sdslen(ci)) != (ssize_t)sdslen(ci)) goto err;
+
    if (do_fsync) {
        server.cluster->todo_before_sleep &= ~CLUSTER_TODO_FSYNC_CONFIG;


I'm not certain this (clearing CLUSTER_TODO_FSYNC_CONFIG) should still be here, since we're fsyncing a temp file. Maybe move it to after the rename and the directory fsync?

On the other hand, I see we always used to clear CLUSTER_TODO_SAVE_CONFIG even before opening the file for writing, so maybe the very attempt to persist / fsync is enough to clear these flags.

uvletter · 2022-07-04

I think clearing the bit (server.cluster->todo_before_sleep &= ~CLUSTER_TODO_FSYNC_CONFIG) is tied only to do_fsync and has nothing to do with ordering. do_fsync indicates that server.cluster->todo_before_sleep may have CLUSTER_TODO_FSYNC_CONFIG set; to avoid a repeated fsync in the next round, we should clear that bit whenever do_fsync is set.
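A minimal sketch of the flag handling argued for in the reply above, assuming a single global stands in for `server.cluster->todo_before_sleep`. The bit values and the `save_config` function are hypothetical; only the flag names come from this thread, and the actual file I/O is elided.

```c
#include <assert.h>

/* Bit values assumed for this sketch; only the names come from the discussion. */
#define CLUSTER_TODO_SAVE_CONFIG  (1 << 2)
#define CLUSTER_TODO_FSYNC_CONFIG (1 << 3)

/* Stand-in for server.cluster->todo_before_sleep. */
static int todo_before_sleep = 0;

/* Hypothetical save routine; the file I/O is elided. */
static int save_config(int do_fsync) {
    /* The save request is consumed as soon as we attempt it. */
    todo_before_sleep &= ~CLUSTER_TODO_SAVE_CONFIG;

    /* ... write temp file, fsync it (if do_fsync), rename, fsync dir ... */

    if (do_fsync) {
        /* This call performs the requested fsync, so clear the bit to avoid
         * a redundant fsync on the next event-loop iteration. For that purpose
         * it does not matter whether this runs before or after the rename. */
        todo_before_sleep &= ~CLUSTER_TODO_FSYNC_CONFIG;
    }
    return 0;
}
```

When do_fsync is 0 the FSYNC bit is deliberately left untouched, so a pending fsync request is not silently dropped by a non-fsync save.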

oranagra · 2022-07-04T08:27:40Z

src/cluster.c

-    return 0;

-err:
+    if (do_fsync) {


I'm not too sure about the do_fsync argument and its purpose, and whether or not it should control the directory fsync as well.

I see that in some places we set both CLUSTER_TODO_SAVE_CONFIG and CLUSTER_TODO_FSYNC_CONFIG, and in others we only set CLUSTER_TODO_SAVE_CONFIG.

So this looks right, but I'd like some confirmation.

uvletter · 2022-07-04

IMO, setting CLUSTER_TODO_SAVE_CONFIG together with CLUSTER_TODO_FSYNC_CONFIG indicates a critical state update, e.g. a new config epoch or a slot reassignment, whose failure may leave the cluster not working correctly. In contrast, CLUSTER_TODO_SAVE_CONFIG alone means the config had better be saved, but a failure isn't severe, since the state can recover automatically later; a node's flags are one example. @madolson, please correct me if I'm wrong.

madolson · 2022-07-19

My understanding is consistent with @uvletter's, although to be transparent, we don't use fsync at AWS because of our architecture, so there may be caveats I'm not that familiar with. CLUSTER_TODO_FSYNC_CONFIG just means that if we fail to fsync we should crash, since we may no longer be in a recoverable state. So skipping the fsync is fine when updating the last ping time, but when bumping the epoch we need to fsync.
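To make the policy described in the two comments above concrete, here is a minimal sketch: critical updates request both bits, non-critical ones only the save bit, and the per-iteration hook treats a failed save as fatal only when an fsync was requested. The hooks `on_epoch_bump` and `on_node_flag_change`, `before_sleep`, `save_config`, the `simulate_failure` switch and the bit values are all hypothetical. The sketch models the policy as stated in this thread, which may differ from how Redis actually reacts to every save error.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Bit values assumed for this sketch. */
#define CLUSTER_TODO_SAVE_CONFIG  (1 << 2)
#define CLUSTER_TODO_FSYNC_CONFIG (1 << 3)

static int todo_before_sleep = 0;
static int simulate_failure = 0; /* lets the demo fake an I/O error */

/* Critical change (e.g. new config epoch, slot reassignment): save + fsync. */
static void on_epoch_bump(void) {
    todo_before_sleep |= CLUSTER_TODO_SAVE_CONFIG | CLUSTER_TODO_FSYNC_CONFIG;
}

/* Non-critical change (e.g. a node flag that will be relearned): save only. */
static void on_node_flag_change(void) {
    todo_before_sleep |= CLUSTER_TODO_SAVE_CONFIG;
}

/* Hypothetical save routine; returns -1 on failure. */
static int save_config(int do_fsync) {
    (void)do_fsync;
    return simulate_failure ? -1 : 0;
}

/* Runs once per event-loop iteration. Returns -1 if nothing was pending,
 * otherwise the do_fsync value used for the save (0 or 1). */
static int before_sleep(void) {
    if (!(todo_before_sleep & CLUSTER_TODO_SAVE_CONFIG)) return -1;
    int do_fsync = (todo_before_sleep & CLUSTER_TODO_FSYNC_CONFIG) != 0;
    todo_before_sleep &= ~(CLUSTER_TODO_SAVE_CONFIG | CLUSTER_TODO_FSYNC_CONFIG);
    if (save_config(do_fsync) == -1 && do_fsync) {
        /* Critical state may not be on disk: stop rather than risk
         * restarting later with a stale epoch or slot map. */
        fprintf(stderr, "Fatal: can't persist critical cluster config\n");
        exit(1);
    }
    return do_fsync;
}
```

A failed non-critical save is simply tolerated here, on the assumption from the thread that such state is recovered automatically later.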

yossigo · 2022-07-19T10:35:30Z

@oranagra I think this behavior change should be part of 7.2.

oranagra · 2022-07-19T10:52:11Z

Details about why this is a behavior change: #7824 (comment)

Co-authored-by: 朱天

write file and rename

5ad92f8

oranagra reviewed Jul 4, 2022

fix

ea3d629

oranagra approved these changes Jul 19, 2022

oranagra added the release-notes label (indicating that this issue needs to be mentioned in the release notes) on Jul 19, 2022

oranagra merged commit cc28481 into redis:unstable Jul 20, 2022

sundb mentioned this pull request Jun 19, 2024

[BUG] node restart result in truncated cluster.nodes.conf file. #13353

Closed

oranagra merged 2 commits into redis:unstable from uvletter:save-with-rename

4 participants
