Implement Multi Part AOF mechanism to avoid AOFRW overheads. by chenyang8094 · Pull Request #9788 · redis/redis

chenyang8094 · 2021-11-16T13:28:30Z

Implement Multi-Part AOF mechanism to avoid overheads during AOFRW.
Introducing a folder with multiple AOF files tracked by a manifest file.

This PR implements the Meta-File Based Multi-Part AOF mechanism described in 9539.

The main issues with the the original AOFRW mechanism are:

buffering of commands that are processed during rewrite (consuming a lot of RAM)
freezes of the main process when the AOFRW completes to drain the remaining part of the buffer and fsync it.
double disk IO for the data that arrives during AOFRW (had to be written to both the old and new AOF files)

The main modifications of this PR:

Remove the AOF rewrite buffer and related code.
Divide the AOF into multiple files, they are classified as two types, one is the the BASE type, it represents the full amount of data (Maybe AOF or RDB format) after each AOFRW, there is only one BASE file at most. The second is INCR type, may have more than one. They represent the incremental commands since the last AOFRW.
Use a AOF manifest file to record and manage these AOF files mentioned above.
The original configuration of appendfilename will be the base part of the new file name, for example: appendonly.aof.1.base.rdb and appendonly.aof.2.incr.aof
Add manifest-related TCL tests, and modified some existing tests that depend on the appendfilename
Remove the aof_rewrite_buffer_length field in info.
Add aof-disable-auto-gc configuration. By default we're automatically deleting HISTORY type AOFs. It also gives users the opportunity to preserve the history AOFs. just for testing use now.
Add AOFRW limiting measure. When the AOFRW failures reaches the threshold (3 times now), we will delay the execution
of the next AOFRW by 1 minute. If the next AOFRW also fails, it will be delayed by 2 minutes. The next is 4, 8, 16, the maximum delay is 60 minutes (1 hour). During the limit period, we can still use the 'bgrewriteaof' command to execute AOFRW immediately.
Support upgrade (load) data from old version redis.
Add appenddirname configuration, as the directory name of the append only files. All AOF files and manifest file will be placed in this directory.
Only the last AOF file (BASE or INCR) can be truncated. Otherwise redis will exit even if aof-load-truncated is enabled.

To Do:

On upgrades from old redis versions, we need to detect the old AOF file and rename it / create a meta file.
At present, if AOFRW fails, redis will automatically retry. If it continues to fail, we may get a lot of very small INCR files. So we need an AOFRW limiting measure.

eduardobr · 2021-11-16T14:29:47Z

@chenyang8094 Just out of curiosity, does this create a foundation to extend #8015 to AOF-only and not only RDB-only setups? Thank you

oranagra

@chenyang8094 Thank you.
I started reviewing the PR, and also edited the PR top comment to make it clearer (please check if it looks right).
So far i reviewed just parts of it (namely all the files excluding aof.c and the tests).
I.d like to post my comments for discussion before i proceed with the other parts, to see if we agree on things, or maybe i have a misunderstanding.

redis.conf

src/config.c

src/server.c

src/server.h

chenyang8094 · 2021-11-17T03:31:56Z

@chenyang8094 Just out of curiosity, does this create a foundation to extend #8015 to AOF-only and not only RDB-only setups? Thank you

Sure, you can see some followup PR plans mentioned by oranagra herer.

oranagra · 2021-11-17T06:23:11Z

I've just created a few new issues to track these ideas (don't like them sitting around in a comment in a closed PR)
I moved one note of that post to a todo bullet in the top comment of this PR.
The first one of the other issues should be done for 7.0 IMHO:

Additional AOFRW deprecation cleanups #9794

The other ones might not make it for this release, but we can at least start discussion and mark them for next release:

chenyang8094 · 2021-11-17T07:53:48Z

@oranagra About upgrades from old redis versions, I think we can't directly rename the old AOF to the new name, because we still can't solve the atomic problem of rename and update manifest.

We can write the old AOF name directly into the manifest, so that we can start loading normally. Even if the process crashes after writing the manifest, that is no problem, because we can find the AOF from the manifest. Once AOFRW is executed once, all files will have new file names. （current code is implemented according to this idea）

oranagra · 2021-11-17T08:02:30Z

ok, sounds good to me. so the AOF file (possibly with preamble content) will have it's old name, and will be listed in the manifest.
once a rewrite happens it'll be tagged as history and later deleted.
let's be sure to add a test for it (storing an old AOF preamble file in the assets folder)

oranagra

Done reviewing aof.c, didn't review the tests yet.

redis.conf

src/aof.c

src/server.h

src/aof.c

src/config.c

src/aof.c

Change aofRewriteLimited and Update TCL test

oranagra · 2022-01-03T17:24:15Z

🎉 more than 400 comments 2500 LOC, and some 3 months (counting form #9539)
@chenyang8094 thank you for all the time you put in.

chenyang8094 · 2022-01-03T23:46:04Z

🎉 more than 400 comments 2500 LOC, and some 3 months (counting form #9539) @chenyang8094 thank you for all the time you put in.

Thank you, now I am going to deal with the next PR.

oranagra · 2022-01-23T07:08:32Z

@chenyang8094 can you please look into this failure:
https://github.com/redis/redis/runs/4890129519?check_suite_focus=true

*** [err]: AOF will trigger limit when AOFRW fails many times in tests/integration/aof-multi-part.tcl
aof rewrite did trigger limit

another one:
https://github.com/redis/redis/runs/4875669104?check_suite_focus=true

enjoy-binbin · 2022-01-23T09:08:20Z

i also saw it in my daily. DB 9: 13640 keys
i guess r config set auto-aof-rewrite-min-size 1mb, in FreeBSD, sometimes it was so slow, and there are not enought key to tigger the rewrite.

so maybe change it to 1KB, or do a for loop and insert some keys (maybe at least 20000 keys)

oranagra · 2022-01-23T09:17:56Z

maybe..
maybe we just need a longer timeout for this check.

maybe instead of using start_write_load we rather just add just enough data manually, and only then start the wait_for_condition.
i suppose someone needs to add more prints and reproduce this so that whatever fix we add we can have some certainty it's gonna solve the problem.

enjoy-binbin · 2022-01-23T13:02:27Z

it is easy to reproduce if we delete all the start_write_load (in case they generate enough keys), and replace it with some for loop set k v, but not let it reach 20000 keys.

like in this commit (7d4f50f), if we change the 20000 to be smaller, it will fail because not able to touch 1MB. (i can trigger a CI in FreeBSD and see the result, but i think this is the right fix, the start_write_load seems a bit fragile). it also speed up the test (total: 32s -> 25s, in my machine. Centos7.

but the time it should be similar in FreeBSD, because we need to generate that many keys all the time (successful use case too))

ci: https://github.com/enjoy-binbin/redis/runs/4912825635?check_suite_focus=true#step:4:3028

maybe we just need a longer timeout for this check.

btw, i think it is not enought, the start_write_load will last 10s, in this case, i think in thoes case of failures none of them produce enough keys.

chenyang8094 · 2022-01-24T09:20:14Z

I think we can change 1MB to 1KB and Then manually construct a 1KB more (which is easy) data to get a definite test status. @oranagra @enjoy-binbin WDYT？

In redis#9788, now we stores all persistent append-only files in a dedicated directory. The name of the directory is determined by the appenddirname configuration parameter in redis.conf Update create-cluster clean to clean this default directory. Fixes redis#10222

In #9788, now we stores all persistent append-only files in a dedicated directory. The name of the directory is determined by the appenddirname configuration parameter in redis.conf. Now each node have a separate folder. Update create-cluster clean to clean this default directory.

yangbodong22011 · 2022-02-28T03:21:16Z

.gitignore

 misc/*
 src/release.h
-appendonly.aof
+appendonly.aof.*


should be appendonly.aof* instead of appendonly.aof.*
now, appendonly.aof will not be ignored.

Maybe you're right, but the current implementation doesn't create a file named appendonly.aof unless you manually copy an old-style AOF for upgrade testing. So I think we just need to ignore the file we're going to create. @oranagra WDYT?

i think it's a good idea to ignore both.
i.e. both the ones we create, and the ones we created in the past, that are still left in the folder.

Only developers pay attention to .gitignore, and when I updated the unstable brance code, appendonly.aof caught my attention, so I started this discussion.

@oranagra @yangbodong22011 Sounds reasonable, I'll make a PR to add it, thanks.

…YNC_ALWAYS This PR fix two bugs: 1. About the MP-AOF one, that code assumes that we started a new AOF file just now (when we have a new base into which we're gonna incrementally write), but the fact is that with multipart-aof, the fork done handler doesn't really affect the incremental file being maintained by the parent process, there's no reason to re-issue SELECT, and i suppose there's no reason to update any of the fsync metrics. This should have been deleted with MP-AOF (introduced in redis#9788, 7.0). The aof_fsync_offset it updates will cause us to miss an fsync in flushAppendOnlyFile, if we stop write commands in AOF_FSYNC_EVERYSEC 2. About the AOF_FSYNC_ALWAYS one, it is follow redis#6053 (the details is in redis#5985), we also check AOF_FSYNC_ALWAYS to prevent users from switching from everysec to always. We detected these problems with WAITAOF test (redis 7.2), introduced in redis#11713.

…SYNC_ALWAYS (#11973) This PR fix several unrelated bugs that were discovered by the same set of tests (WAITAOF tests in #11713), could make the `WAITAOF` test hang. The change in `backgroundRewriteDoneHandler` is about MP-AOF. That leftover / old code assumes that we started a new AOF file just now (when we have a new base into which we're gonna incrementally write), but the fact is that with MP-AOF, the fork done handler doesn't really affect the incremental file being maintained by the parent process, there's no reason to re-issue `SELECT`, and no reason to update any of the fsync variables in that flow. This should have been deleted with MP-AOF (introduced in #9788, 7.0). The damage is that the update to `aof_fsync_offset` will cause us to miss an fsync in `flushAppendOnlyFile`, that happens if we stop write commands in `AOF_FSYNC_EVERYSEC` while an AOFRW is in progress. This caused a new `WAITAOF` test to sometime hang forever. Also because of MP-AOF, we needed to change `aof_fsync_offset` to `aof_last_incr_fsync_offset` and match it to `aof_last_incr_size` in `flushAppendOnlyFile`. This is because in the past we compared `aof_fsync_offset` and `aof_current_size`, but with MP-AOF it could be the total AOF file will be smaller after AOFRW, and the (already existing) incr file still has data that needs to be fsynced. The change in `flushAppendOnlyFile`, about the `AOF_FSYNC_ALWAYS`, it is follow #6053 (the details is in #5985), we also check `AOF_FSYNC_ALWAYS` to handle a case where appendfsync is changed from everysec to always while there is data that's written but not yet fsynced.

…SYNC_ALWAYS (#11973) This PR fix several unrelated bugs that were discovered by the same set of tests (WAITAOF tests in #11713), could make the `WAITAOF` test hang. The change in `backgroundRewriteDoneHandler` is about MP-AOF. That leftover / old code assumes that we started a new AOF file just now (when we have a new base into which we're gonna incrementally write), but the fact is that with MP-AOF, the fork done handler doesn't really affect the incremental file being maintained by the parent process, there's no reason to re-issue `SELECT`, and no reason to update any of the fsync variables in that flow. This should have been deleted with MP-AOF (introduced in #9788, 7.0). The damage is that the update to `aof_fsync_offset` will cause us to miss an fsync in `flushAppendOnlyFile`, that happens if we stop write commands in `AOF_FSYNC_EVERYSEC` while an AOFRW is in progress. This caused a new `WAITAOF` test to sometime hang forever. Also because of MP-AOF, we needed to change `aof_fsync_offset` to `aof_last_incr_fsync_offset` and match it to `aof_last_incr_size` in `flushAppendOnlyFile`. This is because in the past we compared `aof_fsync_offset` and `aof_current_size`, but with MP-AOF it could be the total AOF file will be smaller after AOFRW, and the (already existing) incr file still has data that needs to be fsynced. The change in `flushAppendOnlyFile`, about the `AOF_FSYNC_ALWAYS`, it is follow #6053 (the details is in #5985), we also check `AOF_FSYNC_ALWAYS` to handle a case where appendfsync is changed from everysec to always while there is data that's written but not yet fsynced. (cherry picked from commit cb17178)

oranagra added the 7.0-must-have label Nov 16, 2021

oranagra mentioned this pull request Nov 16, 2021

A new AOF persistence mechanism #9539

Closed

6 tasks

oranagra reviewed Nov 16, 2021

View reviewed changes

chenyang8094 requested a review from oranagra November 17, 2021 06:01

oranagra mentioned this pull request Nov 17, 2021

Make full use of aofrwblock's buf #8975

Merged

oranagra reviewed Nov 17, 2021

View reviewed changes

chenyang8094 commented Nov 20, 2021

View reviewed changes