Skip to content

fsync the old aof file when open a new INCR AOF#11004

Merged
oranagra merged 3 commits intoredis:unstablefrom
enjoy-binbin:fsync_aof
Jul 25, 2022
Merged

fsync the old aof file when open a new INCR AOF#11004
oranagra merged 3 commits intoredis:unstablefrom
enjoy-binbin:fsync_aof

Conversation

@enjoy-binbin
Copy link
Contributor

@enjoy-binbin enjoy-binbin commented Jul 18, 2022

In rewriteAppendOnlyFileBackground, after flushAppendOnlyFile(1),
and before openNewIncrAofForAppend, we should call redis_fsync
to fsync the aof file.

Because we may open a new INCR AOF in openNewIncrAofForAppend,
in the case of using everysec policy, the old AOF file may not
be fsynced in time (or even at all).

When using everysec, we don't want to pay the disk latency from
the main thread, so we will do a background fsync.

Adding a argument for bioCreateCloseJob, a need_fsync flag to
indicate that a fsync is required before the file is closed. So we will
fsync the old AOF file before we close it.

A cleanup, we make bio_job become a union, since the free_* args and
the fd / fsync args are never used together.

@oranagra
Copy link
Member

interesting... you're right that it'll not be fsynced in time (or even at all).
but i wonder if foreground fsync is right in this case.
i.e. in this configuration, we don't want to pay the disk latency from the main thread.
maybe we should call bioCreateFsyncJob to cause it to be fsynced in the background?
note that we also do bioCreateCloseJob from within openNewIncrAofForAppend but we can't rely on their order so we might need to create some job that does both fsync and close.

p.s. in theory we can use SYNC_FILE_RANGE_WRITE without waiting (and no need for a thread), but that's non-standard.

@enjoy-binbin
Copy link
Contributor Author

yes, bg fsync sounds like a better idea.

so we might need to create some job that does both fsync and close.

so a new bg job opcodes? like a new BIO_FSYNC_AND_CLOSE, then we call it in openNewIncrAofForAppend, replace the bioCreateCloseJob

@uvletter
Copy link
Contributor

Something irrelevant to the pr, but occasionally found when viewing the code, that bioCreateCloseJob here seems a misuse. bioCreateCloseJob is basically used in bg_unlink to forbid the blocking in unlink, here no file unlink, so maybe closing the file directly is just ok.

@chenyang8094
Copy link
Contributor

I prefer to add BIO_FSYNC_AND_CLOSE bio opcode, and use bioCreateFsyncCloseJob to replace bioCreateCloseJob .

@uvletter
Copy link
Contributor

Another finding is that flushAppendOnlyFile(1) may also block when using everysec policy

 * When the fsync policy is set to 'everysec' we may delay the flush if there
 * is still an fsync() going on in the background thread, since for instance
 * on Linux write(2) will be blocked by the background fsync anyway.
 * When this happens we remember that there is some aof buffer to be
 * flushed ASAP, and will try to do that in the serverCron() function.
 *  
 * However if force is set to 1 we'll write regardless of the background
 * fsync. *

If there's a background fsync, flushAppendOnlyFile(1) may block on write(and the fsync is skipped), the following fsync in background doesn't make many sense if it's for unblock. If there's no background fsync, flushAppendOnlyFile(1) will write aof buffer to os and submit a background fsync job, but an explicit fsync is still needed to guarantee the order between fsync and close. I hope my understanding to flushAppendOnlyFile is correct while it's kind of obscure.

@oranagra
Copy link
Member

@uvletter i'm having hard time following your post.
do you mean that regardless of the problem discussed in this PR, there was already an issue about a background fsync being used on a file descriptor that's already closed, and that in this case the fsync we intended to do would fail and be missing.

i.e. related to this comment:

            /* The fd may be closed by main thread and reused for another
             * socket, pipe, or file. We just ignore these errno because
             * aof fsync did not really fail. */

@enjoy-binbin
Copy link
Contributor Author

enjoy-binbin commented Jul 19, 2022

I try to explain uvletter's comment (hope i got it right), let's assume a modified version of this PR

  1. flushAppendOnlyFile(1)
  2. submit a bg fsync, let's assume it will be the new BIO_FSYNC_AND_CLOSE

and then here are some cases: when calling flushAppendOnlyFile, is there already bg fsync in progress (using everysec)

  1. if there is a long bg fsync in progress
    a. flushAppendOnlyFile(1) will block in write, since we need to wait the bg fsync to complete
    b. after the write, we will submit a bg fsync in flushAppendOnlyFile, the old logic
    c. after flushAppendOnlyFile, we will call BIO_FSYNC_AND_CLOSE do a bg fsync and close, a bit redundant compared to b

  2. if no bg fsync in progress
    a. flushAppendOnlyFile(1) will do a write
    b. then flushAppendOnlyFile submit a bg fsync, the old logic
    c. after flushAppendOnlyFile, we will call BIO_FSYNC_AND_CLOSE do a bg fsync and close, a bit redundant compared to b

  3. with or without bg fsync in progress, if we didn't reach the 1s (everysec)
    a. flushAppendOnlyFile(1) will do a write, but it will not submit the bg fsync
    b. so we need to call BIO_FSYNC_AND_CLOSE do the fsync

@uvletter
Copy link
Contributor

uvletter commented Jul 19, 2022

Sorry for my mess expression(or English)... I hope I make it clear this time.
It's actually two individual things:

  1. bioCreateCloseJob is originally for avoiding the blocking when closing an overwritten AOF, as the old AOF has been overwritten by rename, the oldfd is the last reference to the old overwritten AOF, so the close, which will release the resource of old AOF, may take a while for a big AOF, and should be put into background. A similar PR is Rename rdb when replica finish receiving rdb asynchronously #6114.

redis/src/aof.c

Lines 1938 to 1941 in e6f6709

/* Rename the temporary file. This will not unlink the target file if
* it exists, because we reference it with "oldfd". */
latencyStartMonitor(latency);
if (rename(tmpfile,server.aof_filename) == -1) {

redis/src/aof.c

Lines 1981 to 1982 in e6f6709

/* Asynchronously close the overwritten AOF. */
if (oldfd != -1) bioCreateCloseJob(oldfd);

But for the new INCR AOF implementation, the AOF file wouldn't be overwritten/rename directly, every AOF is a new file, so it seems bioCreateCloseJob don't make much sense.

  1. It's a little similar to what @enjoy-binbin analyze, also with some differences. I have no doubt about the necessity of a fsync after flushAppendOnlyFile(1), I just throw a new question that flushAppendOnlyFile(1) may also suffer from the disk latency, ruin the effort that put the following fsync into background. There're three cases in flushAppendOnlyFile(1), depending on where whether aofFsyncInProgress is true or false, and whether it's in or out a second since last fsync.

case 1: aofFsyncInProgress is true, whether it's in or out of one second since last fsync doesn't matter. The write to file may be clocked(suffer the disk latency), and the fsync in flushAppendOnlyFile won't be triggered for bg fsync job exists(L1231). This is the problem case I throw.

case 2: aofFsyncInProgress is false, in one second since last fsync. flushAppendOnlyFile will write to file but not fsync, the following fsync is required for the safety.

case 3: aofFsyncInProgress is false, out of one second since last fsync. flushAppendOnlyFile will write to file and submit a bg fsync job. But as the fd will be closed after a while in openNewIncrAofForAppend(), which may happen before the bg fsync job, so a explicit fsync before close is required. This case is what the following quoted statement express

If there's no background fsync, flushAppendOnlyFile(1) will write aof buffer to os and submit a background fsync job, but an explicit fsync is still needed to guarantee the order between fsync and close

Maybe I wanted to express, with case 1, bg fsync is a qualified but still not perfect solution. I have no good idea as well, so just throw some points/questions I found.

@chenyang8094
Copy link
Contributor

chenyang8094 commented Jul 20, 2022

I don't think the redundant fsync in case 1 and 2 is a problem, first of all, the probability of this case happening is not very high (the conditions are harsh, right). Secondly, this redundant fsync will also finish very quickly (because it has basically nothing to do), so bioCreateFsyncCloseJob is a safe and reliable practice.

One thing I think needs attention is to determine the order of execution of BIO_AOF_FSYNC and BIO_AOF_FSYNC_AND_CLOSE.

Since only AOF currently uses bioCreateFsyncJob to create bg fsync, we even cancel or kill fsync jobs before submitting bioCreateFsyncCloseJob (or wait for them to finish), which can solve the problem of fsync redundancy or execution order.

@oranagra WDYT?

@oranagra
Copy link
Member

i hope i'm not missing anything too important (quite a lot of details and cases, and i didn't allocate enough time to dive deep into it).

  • i think a redundant fsync is ok (we have that today), and we can just ignore it.
  • i think that getting blocked in flushAppendOnlyFile because the last fsync didn't complete in time is ok (it can happen anyway in beforeSleep).
  • i think that all we need to do is to make sure that fsync will happen before close (regardless of if that fsync is redundant or there is another one already scheduled).

please let me know if you think i'm missing and important detail.

In rewriteAppendOnlyFileBackground, after flushAppendOnlyFile(1),
and before openNewIncrAofForAppend, we should call redis_fsync
to fsync the aof file.

Because we may open a new INCR AOF in openNewIncrAofForAppend,
in the case of using everysec policy, the old AOF file may not
be flushed in time (or even at all).

When using everysec, we don't want to pay the disk latency from
the main thread, so we introduce a new BIO_FSYNC_AND_CLOSE bio
opcode, it will close the fd after the fsync, both in background.

We use bioCreateFsyncCloseJob to replace bioCreateCloseJob in
openNewIncrAofForAppend, to make sure we will fsync the aof file
before close it.
@enjoy-binbin enjoy-binbin changed the title fsync the aof in rewriteAppendOnlyFileBackground fsync the old aof file when open a new INCR AOF Jul 20, 2022
@uvletter
Copy link
Contributor

uvletter commented Jul 20, 2022

Since only AOF currently uses bioCreateFsyncJob to create bg fsync, we even cancel or kill fsync jobs before submitting bioCreateFsyncCloseJob (or wait for them to finish), which can solve the problem of fsync redundancy or execution order.

As the BIO_AOF_FSYNC has handled the case of closed fd, so I also think make sure that fsync will happen before close should be enough.

            if (redis_fsync(job->fd) == -1 &&
                errno != EBADF && errno != EINVAL)

For @oranagra 's three points I generally agree, about point 2 I wanna add(It's just my inference, not sure it's right), that the flushAppendOnlyFile(1) here is more likely getting blocked than the flushAppendOnlyFile(0) in beforeSleep, flushAppendOnlyFile(0) only try to write if it's postponed more than two seconds due to the bg fsync, but flushAppendOnlyFile(1) always write without conditions. So if nearly before the flushAppendOnlyFile(1) a flushAppendOnlyFile(0) in beforeSleep get executed and submit a bg fsync, the flushAppendOnlyFile(1) here would likely get blocked(though the condition is still harsh).

And I think BIO_AOF_FSYNC_AND_CLOSE is currently reasonable solution(except someone put forward a better one)

it's a waste to open another thread for this.
Copy link
Member

@oranagra oranagra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the amount of changes seem reasonable to me. i think we can accept that.
any other comments before i merge this?

@enjoy-binbin
Copy link
Contributor Author

any other comments before i merge this?

i am good, i don't have it here

@oranagra oranagra merged commit 03fff10 into redis:unstable Jul 25, 2022
@oranagra oranagra added the release-notes indication that this issue needs to be mentioned in the release notes label Jul 25, 2022
@enjoy-binbin enjoy-binbin deleted the fsync_aof branch July 25, 2022 06:17
@oranagra oranagra mentioned this pull request Sep 21, 2022
oranagra added a commit that referenced this pull request Sep 21, 2022
In rewriteAppendOnlyFileBackground, after flushAppendOnlyFile(1),
and before openNewIncrAofForAppend, we should call redis_fsync
to fsync the aof file.

Because we may open a new INCR AOF in openNewIncrAofForAppend,
in the case of using everysec policy, the old AOF file may not
be fsynced in time (or even at all).

When using everysec, we don't want to pay the disk latency from
the main thread, so we will do a background fsync.

Adding a argument for bioCreateCloseJob, a `need_fsync` flag to
indicate that a fsync is required before the file is closed. So we will
fsync the old AOF file before we close it.

A cleanup, we make union become a union, since the free_* args and
the fd / fsync args are never used together.

Co-authored-by: Oran Agra <[email protected]>
(cherry picked from commit 03fff10)
enjoy-binbin added a commit to enjoy-binbin/redis that referenced this pull request Jul 31, 2023
In rewriteAppendOnlyFileBackground, after flushAppendOnlyFile(1),
and before openNewIncrAofForAppend, we should call redis_fsync
to fsync the aof file.

Because we may open a new INCR AOF in openNewIncrAofForAppend,
in the case of using everysec policy, the old AOF file may not
be fsynced in time (or even at all).

When using everysec, we don't want to pay the disk latency from
the main thread, so we will do a background fsync.

Adding a argument for bioCreateCloseJob, a `need_fsync` flag to
indicate that a fsync is required before the file is closed. So we will
fsync the old AOF file before we close it.

A cleanup, we make union become a union, since the free_* args and
the fd / fsync args are never used together.

Co-authored-by: Oran Agra <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release-notes indication that this issue needs to be mentioned in the release notes

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

4 participants