Implementing the WAITAOF command (issue #10505) #11713
oranagra merged 21 commits into redis:unstable
Conversation
oranagra
left a comment
depending on the type of interface we choose for WAITAOF maybe the race you presented is irrelevant?
i.e. if the caller explicitly asks to wait for AOF files on replicas, it should hang.
maybe another solution besides the one you presented in the commit comment could be to somehow bump pot_fsynced_reploff to the right offset (last one that existed) on the next fsync?
maybe we should discuss that on a call.
I don't think that would help if you already have clients waiting on a
Added code to keep track of replication offsets that are confirmed to have been fsynced to disk. The REPLCONF ACK command is extended to also report this offset, reported as `REPLCONF FACK <offset>`. This REPLCONF extension is backwards-compatible because the REPLCONF command handling terminates after encountering the ACK offset, so older Redises will simply ignore the unknown FACK that follows it.
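The backwards-compatible parsing described above can be sketched roughly as follows. This is an assumed illustration, not the actual Redis code: a master scans option/value pairs, and an old master that only understands `ACK` simply never looks at the trailing `FACK` pair.

```c
#include <assert.h>
#include <stdlib.h>
#include <strings.h>

/* Hypothetical sketch of parsing "REPLCONF ACK <ofs> FACK <ofs>".
 * Unknown option names are skipped, so a trailing FACK pair is harmless
 * to a master that doesn't know about it. */
typedef struct {
    long long ack_offset;   /* offset acknowledged as received */
    long long fack_offset;  /* offset acknowledged as fsynced, -1 if absent */
} ReplAck;

ReplAck parse_replconf_ack(int argc, const char **argv) {
    ReplAck r = { -1, -1 };
    for (int i = 0; i + 1 < argc; i += 2) {
        if (!strcasecmp(argv[i], "ACK"))
            r.ack_offset = strtoll(argv[i + 1], NULL, 10);
        else if (!strcasecmp(argv[i], "FACK"))
            r.fack_offset = strtoll(argv[i + 1], NULL, 10);
        /* else: unknown option, ignore (mirrors REPLCONF leniency) */
    }
    return r;
}
```

A replica talking to an old master loses nothing: the `FACK` pair is ignored rather than rejected.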
This introduces infrastructure necessary to allow BIO workers to not have a 1-1 mapping of worker to job-type. This makes it possible in the future to assign multiple job types to a single worker, either as a performance/resource optimization, or as a way of enforcing ordering between specific classes of jobs.
- Replace the close+fsync job with a new BIO_CLOSE_AOF job that will also update the field.
- Handle all AOF jobs (BIO_CLOSE_AOF, BIO_AOF_FSYNC) in the same bio worker thread, to impose ordering on their execution. This solves a race condition where a job could set `pot_fsynced_reploff` to a higher value than another pending fsync job, resulting in indicating an offset for which parts of the data have not yet actually been fsynced. Imposing an ordering on the jobs guarantees that fsync jobs are executed in increasing order of replication offset.
This should prevent a write race between updates to `pot_fsynced_reploff` in the main thread (`flushAppendOnlyFile` when set to ALWAYS fsync), and those done in the bio thread.
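One way to picture the race being prevented here is as a monotonic update to the pending fsynced offset. The sketch below is an assumed illustration (the name `fsynced_reploff_pending` is taken from the text; the actual Redis fix relies on job ordering rather than this exact loop): a compare-and-swap loop guarantees the offset can never move backwards even if the main thread and the bio thread race.

```c
#include <assert.h>
#include <stdatomic.h>

/* Illustrative sketch, not the actual Redis code: both the main thread
 * (appendfsync always) and the bio thread call this after an fsync. */
static _Atomic long long fsynced_reploff_pending = 0;

void set_fsynced_reploff_pending(long long offset) {
    long long cur = atomic_load(&fsynced_reploff_pending);
    /* Retry while our offset is still ahead of the stored one; a failed
     * CAS refreshes `cur` with the value written by the other thread. */
    while (cur < offset &&
           !atomic_compare_exchange_weak(&fsynced_reploff_pending,
                                         &cur, offset)) {
        /* loop */
    }
}

long long get_fsynced_reploff_pending(void) {
    return atomic_load(&fsynced_reploff_pending);
}
```

A stale fsync job completing late can thus never overwrite a newer offset with an older one.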
When this is the case, the `server.master_repl_offset` we usually rely on is not being updated, so we need an alternative means of tracking fsyncs. In this case, we use `pot_fsynced_reploff` as a basic "fsync epoch" counter instead, incrementing it each time the AOF is fsynced. This is inspired by antirez's original approach to the problem, here: redis/redis@unstable...wait-aof

This change introduces a possible race in a potential implementation of a `WAIT AOF` command:

1. Client executes `WAIT AOF` to wait for an fsync of its own server. Replication is currently enabled. A client waiting for fsync of the current `master_repl_offset` is added to the `clients_waiting_acks` list.
2. Suppose `server.pot_fsynced_reploff == server.master_repl_offset - n`.
3. We now disable replication.
4. Now every AOF fsync will simply increment `server.pot_fsynced_reploff`. As a result, the blocked client will unnecessarily wait `n` fsyncs before it is released.

To resolve this race, we might have to:

* When disabling replication, iterate over the `clients_waiting_acks` list and set relevant offsets to the value of `pot_fsynced_reploff`.
This reverts the previous commit that tracked fsync "epochs", as it was non-ideal and suffered from some serious bugs. Instead I've modified the code to continue incrementing the replication offset as long as AOF is configured, even with no replication backlog. This will allow us to rely on a single field for tracking AOF fsync in all scenarios, with or without replication.
* Revert refactor of if-else to switch to reduce diff size.
* Add missing initialization to `startAppendOnly` to resolve bug.
Iterating over the worker's entire jobs list wasn't very efficient. This commit adds some minimal bookkeeping to maintain a running counter per job type. This allows `bioPendingJobsOfType` to become an O(1) operation.
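As a rough sketch (assumed names and structure, not the actual Redis bio.c), the bookkeeping can look like this: one counter per job type, bumped on submit and dropped on completion, so the pending-count query never walks the job list.

```c
#include <assert.h>
#include <pthread.h>

/* Sketch of per-job-type counters making bioPendingJobsOfType O(1). */
enum { BIO_CLOSE_FILE, BIO_AOF_FSYNC, BIO_LAZY_FREE, BIO_NUM_OPS };

static pthread_mutex_t bio_mutex = PTHREAD_MUTEX_INITIALIZER;
static unsigned long bio_pending[BIO_NUM_OPS];

void bio_submit_job(int type) {
    pthread_mutex_lock(&bio_mutex);
    bio_pending[type]++;          /* job queued for some worker */
    pthread_mutex_unlock(&bio_mutex);
}

void bio_job_done(int type) {
    pthread_mutex_lock(&bio_mutex);
    bio_pending[type]--;          /* worker finished the job */
    pthread_mutex_unlock(&bio_mutex);
}

/* O(1): read the counter instead of iterating the jobs list. */
unsigned long bio_pending_jobs_of_type(int type) {
    pthread_mutex_lock(&bio_mutex);
    unsigned long n = bio_pending[type];
    pthread_mutex_unlock(&bio_mutex);
    return n;
}
```

Because the counter is keyed by job type rather than by worker, it keeps working when multiple job types share one worker thread.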
Syntax: WAITAOF <num_local> <num_replicas> <timeout>
Response: Array containing num_local, num_replicas.
Returns an error when called on replicas, or when called with non-zero num_local on a master with AOF disabled; in all other cases the response just contains the number of fsync copies.

Changes:
* Make sure the replica or master updates the fsynced offset at the end of the initial AOF rewrite (even if there are no additional writes that trigger a periodic fsync).
* Drain the pending fsync when starting over a new AOF to avoid race conditions with the previous AOF offsets overriding the new one.
* Implement the receiving side of `REPLCONF ACK <ofs> FACK <ofs>`, which will be appended only if there's an AOF on the replica, and ignored by old masters.
* WAIT command respects CLIENT_DENY_BLOCKING (not just CLIENT_MULTI).

Minor notes:
* Rename pot_fsynced_reploff to fsynced_reploff_pending.
* When an initial rewrite ends, update fsynced_reploff right away; don't wait for the next beforeSleep.
@redis/core-team this is ready for review, and the top comment is updated.
The following steps may cause the
ackreplicas = replicationCountAOFAcksByOffset(c->woff);
acklocal = server.fsynced_reploff >= c->woff;
printf("server.fsynced_reploff: %lld, c->woff: %lld\n", server.fsynced_reploff, c->woff);
./src/redis-server --appendonly yes
./src/redis-server --port 6380
./src/redis-cli -h 127.0.0.1 -p 6380
slaveof 127.0.0.1 6379
while true; do echo "waitaof 1 0 0"; done | ./src/redis-cli -x

master logs:

115202:C 08 Feb 2023 12:16:17.615 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
115202:C 08 Feb 2023 12:16:17.615 # Redis version=255.255.255, bits=64, commit=909f091f, modified=1, pid=115202, just started
115202:C 08 Feb 2023 12:16:17.615 # Configuration loaded
115202:M 08 Feb 2023 12:16:17.615 * monotonic clock: POSIX clock_gettime
115202:M 08 Feb 2023 12:16:17.616 * Running mode=standalone, port=6379.
115202:M 08 Feb 2023 12:16:17.616 # Server initialized
115202:M 08 Feb 2023 12:16:17.616 # WARNING Memory overcommit must be enabled! Without it, a background save or replication may fail under low memory condition. Being disabled, it can also cause failures without low memory condition, see https://github.com/jemalloc/jemalloc/issues/1328. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
115202:M 08 Feb 2023 12:16:17.616 * Reading RDB base file on AOF loading...
115202:M 08 Feb 2023 12:16:17.616 * Loading RDB produced by version 255.255.255
115202:M 08 Feb 2023 12:16:17.616 * RDB age 1225 seconds
115202:M 08 Feb 2023 12:16:17.616 * RDB memory usage when created 0.82 Mb
115202:M 08 Feb 2023 12:16:17.616 * RDB is base AOF
115202:M 08 Feb 2023 12:16:17.617 * Done loading RDB, keys loaded: 0, keys expired: 0.
115202:M 08 Feb 2023 12:16:17.617 * DB loaded from base file appendonly.aof.1.base.rdb: 0.000 seconds
115202:M 08 Feb 2023 12:16:17.617 * DB loaded from append only file: 0.000 seconds
115202:M 08 Feb 2023 12:16:17.617 * Opening AOF incr file appendonly.aof.1.incr.aof on server start
115202:M 08 Feb 2023 12:16:17.617 * Ready to accept connections tcp
115202:M 08 Feb 2023 12:16:25.929 * Replica 127.0.0.1:6380 asks for synchronization
115202:M 08 Feb 2023 12:16:25.929 * Partial resynchronization not accepted: Replication ID mismatch (Replica asked for 'c6eb517826ab88f5c97a2dd31195c079650a18d8', my replication IDs are 'd6b625e0f3fc09a312e26c775a5681df6f0691df' and '0000000000000000000000000000000000000000')
115202:M 08 Feb 2023 12:16:25.929 * Replication backlog created, my new replication IDs are '114682cd6adfd9801b4c0d3e02413e85234e6a84' and '0000000000000000000000000000000000000000'
115202:M 08 Feb 2023 12:16:25.929 * Delay next BGSAVE for diskless SYNC
server.fsynced_reploff: 0, c->woff: 0
....repeated the line above
server.fsynced_reploff: 0, c->woff: 14
115202:M 08 Feb 2023 12:16:30.651 * Starting BGSAVE for SYNC with target: replicas sockets
115202:M 08 Feb 2023 12:16:30.652 * Background RDB transfer started by pid 115369
115369:C 08 Feb 2023 12:16:30.653 * Fork CoW for RDB: current 0 MB, peak 0 MB, average 0 MB
115202:M 08 Feb 2023 12:16:30.654 # Diskless rdb transfer, done reading from pipe, 1 replicas still up.
115202:M 08 Feb 2023 12:16:30.659 * Background RDB transfer terminated with success
115202:M 08 Feb 2023 12:16:30.659 * Streamed RDB transfer with replica 127.0.0.1:6380 succeeded (socket). Waiting for REPLCONF ACK from replica to enable streaming
115202:M 08 Feb 2023 12:16:30.659 * Synchronization with replica 127.0.0.1:6380 succeeded

The main reason is that WAIT[AOF] waits for the repl offset of the previous command this client executed.
I should have seen this coming. One clear solution for that is replication stream multiplexing (tracked in #8440), where PINGs will not increase the replication offset. Looking for other advice.
/* Return the number of replicas that already acknowledged the specified
 * replication offset being AOF fsynced. */
int replicationCountAOFAcksByOffset(long long offset) {
Turns out it's possible to write a module and a Lua script that propagate to the AOF and don't propagate to the replication stream. See REDISMODULE_ARGV_NO_REPLICAS and luaRedisSetReplCommand.
That can result in two bad cases, scenario:
User executes command that only propagates to AOF, and then immediately issues a WAITAOF, and there's no further writes on the replication stream after that.
- if the last thing that happened on the replication stream is a PING (which increased the replication offset but won't trigger an fsync on the replica), then the client would hang forever (it will wait for an fack that the replica will never send since it doesn't trigger any fsyncs).
- if the last thing that happened is a write command that got propagated properly, then WAITAOF will be released immediately, without waiting for an fsync (since the offset didn't change)
We need to decide if we want to handle that in some way, or ignore it as an abusive case (I'm not sure what the justification is for someone to explicitly propagate only to AOF).
On the other hand, if the client asked WAITAOF to wait for remote AOF files, and propagated a command only to the local AOF, maybe it deserves to get blocked forever (even if there are further writes), since its write didn't go to replicas and it waits for something that wouldn't happen.
So the other scenario we should look at is a case where there's no replication, only a master with AOF, and a command that's explicitly only propagating to AOF, and then user uses WAITAOF to wait for the local AOF file.
in that case, with the current code, we won't increment the replication offset and the client will get released before the fsync. that's probably easy to solve (increment the replication offset in that scenario).
@redis/core-team reminder to take a look / approve this one.
I generally dislike wait, I've seen a lot of individuals use it with a 3 node configuration and just wait for one replica. While that technically increases the real world durability, they can still lose writes to the last node. When they set up wait with 2 nodes, they inevitably have a prolonged outage when one of the replicas dies. Now we're just taking more dependencies on disks, which makes me think more people are going to slam into these sharp edges. The answer to durability is not just to write data to more disks, but to write it intelligently in a way that can be recovered. With that said, I feel like that is kind of what Redis does, so practically OK with it.
FULL CI (of the one test unit): https://github.com/redis/redis/actions/runs/4418135344 |
@oranagra Will do.
WAITAOF was added in #11713 and its return is an array, but we forgot to handle WAITAOF in last_offset and last_numreplicas, causing WAITAOF to return a WAIT-like reply. Tests were added to validate that case (both WAIT and WAITAOF). This PR also refactored processClientsWaitingReplicas a bit for better maintainability and readability.
There can be a situation that satisfies WAIT but then wrongly unblocks WAITAOF, because we mix-use last_offset and last_numreplicas. We should update last_offset and last_numreplicas only when the condition matches, i.e. when the output of either replicationCountAOFAcksByOffset or replicationCountAcksByOffset is right. For this we need separate last_ variables for each of them, so a last_aof_offset and last_aof_numreplicas were added for WAITAOF. WAITAOF was added in #11713. Found while coding #11917. A test was added to validate that case.
…YNC_ALWAYS This PR fixes two bugs: 1. The MP-AOF one: that code assumes we started a new AOF file just now (when we have a new base into which we're gonna incrementally write), but the fact is that with multipart-aof, the fork done handler doesn't really affect the incremental file being maintained by the parent process; there's no reason to re-issue SELECT, and I suppose there's no reason to update any of the fsync metrics. This should have been deleted with MP-AOF (introduced in redis#9788, 7.0). The aof_fsync_offset it updates will cause us to miss an fsync in flushAppendOnlyFile if we stop write commands in AOF_FSYNC_EVERYSEC. 2. The AOF_FSYNC_ALWAYS one follows redis#6053 (the details are in redis#5985): we also check AOF_FSYNC_ALWAYS to prevent users from switching from everysec to always. We detected these problems with the WAITAOF test (redis 7.2), introduced in redis#11713.
…SYNC_ALWAYS (#11973) This PR fixes several unrelated bugs that were discovered by the same set of tests (WAITAOF tests in #11713) and could make the `WAITAOF` test hang. The change in `backgroundRewriteDoneHandler` is about MP-AOF. That leftover / old code assumes that we started a new AOF file just now (when we have a new base into which we're gonna incrementally write), but the fact is that with MP-AOF, the fork done handler doesn't really affect the incremental file being maintained by the parent process; there's no reason to re-issue `SELECT`, and no reason to update any of the fsync variables in that flow. This should have been deleted with MP-AOF (introduced in #9788, 7.0). The damage is that the update to `aof_fsync_offset` will cause us to miss an fsync in `flushAppendOnlyFile`; that happens if we stop write commands in `AOF_FSYNC_EVERYSEC` while an AOFRW is in progress. This caused a new `WAITAOF` test to sometimes hang forever. Also because of MP-AOF, we needed to change `aof_fsync_offset` to `aof_last_incr_fsync_offset` and match it to `aof_last_incr_size` in `flushAppendOnlyFile`. This is because in the past we compared `aof_fsync_offset` and `aof_current_size`, but with MP-AOF it could be that the total AOF file is smaller after AOFRW, and the (already existing) incr file still has data that needs to be fsynced. The change in `flushAppendOnlyFile`, about `AOF_FSYNC_ALWAYS`, follows #6053 (the details are in #5985): we also check `AOF_FSYNC_ALWAYS` to handle a case where appendfsync is changed from everysec to always while there is data that's written but not yet fsynced.
Implementing the WAITAOF functionality requested in issue #10505, which would allow the user to block until a specified number of Redises have fsynced all previous write commands to the AOF.
Syntax:
WAITAOF <num_local> <num_replicas> <timeout>

Response: Array containing two elements: num_local, num_replicas
num_local is always either 0 or 1 representing the local AOF on the master.
num_replicas is the number of replicas that acknowledged the replication offset of the last write being fsynced to the AOF.
Returns an error when called on replicas, or when called with non-zero num_local on a master with AOF disabled; in all other cases the response just contains the number of fsync copies.
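The unblock condition behind this response can be sketched from the debug snippet quoted earlier (`acklocal = server.fsynced_reploff >= c->woff`). The names below are assumptions based on that snippet, not the actual Redis code: the local AOF counts as one fsynced copy once the fsynced replication offset reaches the client's write offset, and each replica whose FACKed offset reached it counts as another.

```c
#include <assert.h>

/* Hypothetical sketch of the WAITAOF unblock check. */
int waitaof_satisfied(long long fsynced_reploff,  /* local fsynced offset */
                      long long client_woff,      /* client's last write offset */
                      int fack_replicas,          /* replicas with FACK >= woff */
                      int num_local, int num_replicas) {
    int acklocal = fsynced_reploff >= client_woff; /* 0 or 1 */
    return acklocal >= num_local && fack_replicas >= num_replicas;
}
```

This also shows why num_local can only be 0 or 1: there is exactly one local AOF.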
Main changes:
This way we can use this command and its mechanisms even when replication is disabled.
`REPLCONF ACK <ofs> FACK <ofs>`, the FACK will be appended only if there's an AOF on the replica, and already ignored on old masters (thus backwards compatible).

Unrelated changes:
Implementation details:
`fsynced_reploff_pending` that's updated (usually by the bio thread) and later copied to the main `fsynced_reploff` variable (only if the AOF base file exists). I.e. during the initial AOF rewrite it will not be used as the fsynced offset since the AOF base is still missing.
Handle all AOF jobs (BIO_CLOSE_AOF, BIO_AOF_FSYNC) in the same bio worker thread, to impose ordering on their execution. This solves a race condition where a job could set `fsynced_reploff_pending` to a higher value than another pending fsync job, resulting in indicating an offset for which parts of the data have not yet actually been fsynced. Imposing an ordering on the jobs guarantees that fsync jobs are executed in increasing order of replication offset.
`appendfsync` set to "always": This should prevent a write race between updates to `fsynced_reploff_pending` in the main thread (`flushAppendOnlyFile` when set to ALWAYS fsync), and those done in the bio thread.
with the previous AOF offsets overriding the new one (e.g. after switching to replicate from a new master).
a must in case there are no additional writes that trigger a periodic fsync, specifically for a replica that does a full sync.
Limitations:
It is possible to write a module and a Lua script that propagate to the AOF and don't propagate to the replication stream. See REDISMODULE_ARGV_NO_REPLICAS and luaRedisSetReplCommand.
These features are incompatible with the WAITAOF command, and can result in two bad cases.
The scenario is that the user executes a command that only propagates to AOF, and then immediately issues a WAITAOF, and there are no further writes on the replication stream after that.
Refactoring:
This introduces infrastructure necessary to allow BIO workers to
not have a 1-1 mapping of worker to job-type. This allows in the
future to assign multiple job types to a single worker, either as
a performance/resource optimization, or as a way of enforcing
ordering between specific classes of jobs.
Discussion:
[Note: currently this PR implements the most explicit approach of the options presented below]
In its most general form, the command can be used to wait for fsync by replicas, by the master (or only Redis if replication is not used), or some combination of the above. A few use-cases I can think of:
The simplest approach would be to extend the existing `WAIT` command with an optional `AOF` keyword. In which case we need to define whether `<num>` includes the master or not. If it does, use-cases 1 and 3 are covered, but use-case 2 (which would probably be a pretty common one) is impossible; if it does not, 2 is supported but 1 and 3 are not. This could be extended by giving "0" the special meaning of "wait for current Redis only" to enable use-case 1, but this might be somewhat surprising semantics.

Another option that crossed my mind is to use positive/negative numbers for the different meanings. e.g.: "-5" means "wait for any 5 Redises" while "5" means "wait for 5 replicas." This is simple and supports all 3 use-cases above, but might not be intuitive or particularly easy to remember.
Yet another possibility is of course to split this functionality into a separate command entirely, in which case we need to define the arguments and semantics of that new command.
Todo: