
Replica keep serving data during repl-diskless-load=swapdb for better availability#9323

Merged
oranagra merged 40 commits into redis:unstable from eduardobr:feature/draft-use-tempdb-swapdb
Nov 4, 2021

Conversation

@eduardobr
Contributor

@eduardobr eduardobr commented Aug 5, 2021

For diskless replication in swapdb mode, since we already spend replica memory keeping a backup of the current db to restore in case of failure, we can get the following benefits by instead swapping in the database only when we succeeded in transferring it from the master:

  • Avoid the -LOADING response, during both failed and successful synchronization, for cases where the replica is already up and running with data.
  • Faster total diskless replication time: we move from Transfer + Flush + Load to Transfer + Load only. Flushing the tempDb is done asynchronously after the swap.
  • This could be implemented also for disk replication with similar benefits if consumers are willing to spend the extra memory usage.
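
The swap-only-on-success flow above can be sketched in isolation. This is a standalone C simulation; `maindb`, `tempdb` and the helper names are simplified stand-ins for the real server structures, not actual Redis code:

```c
#include <assert.h>
#include <string.h>

/* Simplified stand-in for a Redis database. */
typedef struct db { char data[32]; } db;

static db maindb = { "old-dataset" };   /* keeps serving clients */
static db tempdb;                       /* the master's RDB loads here */

/* Load the master's payload into tempdb. Returns 0 on success,
 * -1 when the transfer broke mid-way. */
static int load_into_tempdb(const char *rdb, int transfer_ok) {
    if (!transfer_ok) return -1;
    strcpy(tempdb.data, rdb);
    return 0;
}

static void sync_with_master(const char *rdb, int transfer_ok) {
    if (load_into_tempdb(rdb, transfer_ok) == 0) {
        /* Swap only after the whole data set arrived; the old db
         * (now sitting in tempdb) can be flushed asynchronously. */
        db old = maindb;
        maindb = tempdb;
        tempdb = old;
    }
    /* On failure the main db was never touched: no flush happened
     * and the replica kept serving the old data set. */
}
```

On a broken transfer nothing is touched, so the replica keeps answering from the old data set instead of returning -LOADING.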

General notes:

  • The concept of backupDb becomes tempDb for clarity.
  • Async loading mode will only kick in if the replica is syncing from a master that has the same repl-id as the one it had before, i.e. the data it's getting belongs to a different time of the same timeline.
  • New property in INFO: async_loading, to differentiate it from the blocking loading.
  • Slot-to-key mapping is now a field of redisDb, as it's more natural to access it from both server.db and the tempDb that is passed around.
  • Because this affects replicas only, we assume that if they are not read-only and receive write commands during replication, those writes are lost after SYNC the same way as before; but we still deny CONFIG SET here anyway to avoid complications.
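
The repl-id condition in the notes above boils down to a small check, roughly like this (standalone sketch; the function and argument names are illustrative, not the actual server code):

```c
#include <assert.h>
#include <string.h>

/* Async (non-blocking) loading is only safe when the incoming data set
 * belongs to the same replication history the replica was already on:
 * an equal repl-id means a different point in time of the same timeline. */
static int use_async_loading(const char *cached_replid, const char *new_replid) {
    if (cached_replid == NULL || cached_replid[0] == '\0')
        return 0;   /* never synced before: blocking load, reply -LOADING */
    return strcmp(cached_replid, new_replid) == 0;
}
```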

Considerations for review:

  • We have many cases where the server.loading flag is used, and even though I tried my best, there may be cases where async_loading should be checked as well, and cases where it shouldn't (this would require a very good understanding of the whole code base).
  • Several places that had different behavior depending on the loading flag were actually meant to just handle commands coming from the AOF client differently from ones coming from real clients; these were changed to check CLIENT_ID_AOF instead.

Additional for Release Notes

  • Bugfix - server.dirty was not incremented for any kind of diskless replication, so it did not contribute to triggering the next database SAVE
  • New flag for RM_GetContextFlags module API: REDISMODULE_CTX_FLAGS_ASYNC_LOADING
  • Deprecated RedisModuleEvent_ReplBackup. Starting from Redis 7.0, we don't fire this event.
    Instead, we have the new RedisModuleEvent_ReplAsyncLoad holding 3 sub-events: STARTED, ABORTED and COMPLETED.
  • New module flag REDISMODULE_OPTIONS_HANDLE_REPL_ASYNC_LOAD for RedisModule_SetModuleOptions to allow modules to declare they support the diskless replication with async loading (when absent, we fall back to disk-based loading).
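
The fallback in the last bullet can be illustrated with a plain bit-flag check (standalone sketch; only the option name comes from the module API described above, and its numeric value here is made up):

```c
#include <assert.h>

/* Illustrative value; the real constant lives in redismodule.h. */
#define REDISMODULE_OPTIONS_HANDLE_REPL_ASYNC_LOAD (1 << 3)

typedef struct module { int options; } module;

/* Diskless replication with async loading is only possible when every
 * loaded module declared (via RedisModule_SetModuleOptions) that it
 * can handle it; otherwise we fall back to disk-based loading. */
static int all_modules_handle_async_load(const module *mods, int n) {
    for (int i = 0; i < n; i++)
        if (!(mods[i].options & REDISMODULE_OPTIONS_HANDLE_REPL_ASYNC_LOAD))
            return 0;
    return 1;
}
```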

@eduardobr eduardobr force-pushed the feature/draft-use-tempdb-swapdb branch from 9d16f89 to ecf3f43 on August 5, 2021 16:17
@eduardobr
Contributor Author

eduardobr commented Aug 5, 2021

Relates to #4624, resolving it in some setups

@eduardobr eduardobr force-pushed the feature/draft-use-tempdb-swapdb branch 3 times, most recently from 47b7d2e to 409c72a on August 6, 2021 13:10
Member

@oranagra oranagra left a comment


@eduardobr thank you, i think this PR shows the potential of that concept quite well, and it's nice to see that the code was nearly prepared for this concept and you were able to apply it nicely without too many changes.

I did a quick review and gave a few comments, the main issue to be resolved is probably around the module API and complex modules with out of keyspace data.

Few other things i must note:

  1. During loading, redis is not highly responsive: it only processes events for a short period once every ~2MB of RDB data, so clients working with the replica may get degraded performance, even beyond the fact that the CPU is heavily invested in loading.
  2. I think we wanna have that mode enabled only when a replica was previously in sync with a master (not depending on whether the database was empty or not, though). When a new replica comes up, we'll want to avoid using swapdb, so that we respond with -LOADING.

@oranagra oranagra added the release-notes (indication that this issue needs to be mentioned in the release notes) and state:major-decision (requires core team consensus) labels Aug 12, 2021
Member

@oranagra oranagra left a comment


LGTM with minor fix.

what's left is to handle the modules issue.

Member

@oranagra oranagra left a comment


@eduardobr thank you.. your changes seem great!
i gave a few minor comments mostly about renaming "repl async" to "repl async load".
and raised a few questions about ordering.
i think you can proceed with adjusting the tests.

@oranagra
Member

oranagra commented Sep 9, 2021

@eduardobr are you gonna finish this? any major issues? IIRC it's only renaming / doc and some minor changes.
Let me know if you want me to pick it up instead.

@oranagra oranagra added the state:to-be-merged label (the PR should be merged soon, even if not yet ready; this is used so that it won't be forgotten) Sep 9, 2021
@eduardobr
Contributor Author

@eduardobr are you gonna finish this? any major issues? IIRC it's only renaming / doc and some minor changes.
Let me know if you want me to pick it up instead.

@oranagra I can move on, I was just waiting for some answers to the questions you raised, so I can finish in one sitting.
I’ll apply the suggestions soon.
Thanks

@eduardobr eduardobr force-pushed the feature/draft-use-tempdb-swapdb branch from 3c6ad5a to 8ee0c56 on September 12, 2021 09:16
@eduardobr eduardobr force-pushed the feature/draft-use-tempdb-swapdb branch from 8ee0c56 to ca922aa on September 12, 2021 12:34
@eduardobr
Contributor Author

eduardobr commented Sep 12, 2021

@oranagra Things got quite complicated due to many conflicting changes, especially related to #9356 and #9398,
so I decided to squash everything and then rebase, to reduce my cognitive load. The rebase was committed in 2 steps due to the huge amount of changes (5676f0c and ca922aa).

In relation to "Delay to discard cached master when full synchronization #9398", at replication.c readSyncBulkPayload(), I repositioned this new block so it runs in one specific situation for SWAPDB mode and in another specific situation otherwise (please review). It could be wrapped in a method for reuse, but what should a routine like this be called?:

        /* Replica starts to apply data from new master, we must discard the cached
         * master structure. */
        serverAssert(server.master == NULL);
        replicationDiscardCachedMaster();

        /* We want our slaves to resync with us as well, if we have any sub-slaves.
         * The master already transferred us an entirely different data set and we
         * have no way to incrementally feed our slaves after that. */
        disconnectSlaves(); /* Force our slaves to resync with us as well. */
        freeReplicationBacklog(); /* Don't allow our chained slaves to PSYNC. */

In relation to "Slot-to-keys using dict entry metadata #9356", it hit this PR in some points and even required moving this to server.h:

/* State for the Slot to Key API, for a single slot. The keys in the same slot
 * are linked together using dictEntry metadata. See also "Slot to Key API" in
 * cluster.c. */
struct clusterSlotToKeys {
    uint64_t count;             /* Number of keys in the slot. */
    dictEntry *head;            /* The first key-value entry in the slot. */
};
typedef struct clusterSlotToKeys clusterSlotsToKeysData[CLUSTER_SLOTS];

Additionally, raised question about a missing bit:
Naming for Pre and Post versions of some module events

@ShooterIT
Member

For #9356, if you have uncertain things, we can ask @zuiderkwast for suggestions, he is a friendly man!

@zuiderkwast
Contributor

In relation to "Slot-to-keys using dict entry metadata #9356", it hit this PR in some points and even required moving this to server.h
struct clusterSlotToKeys [...]

I'm here to help. 😄 I tried to move everything cluster-related to cluster.{c,h}.

Isn't it enough to add #include "cluster.h" in some files? I can't see exactly why you had to move them. Maybe create some new .h file (replication.h) which includes cluster.h and defines some new structs..?

@eduardobr
Contributor Author

In relation to "Slot-to-keys using dict entry metadata #9356", it hit this PR in some points and even required moving this to server.h
struct clusterSlotToKeys [...]

I'm here to help. 😄 I tried to move everything cluster-related to cluster.{c,h}.

Isn't it enough to add #include "cluster.h" in some files? I can't see exactly why you had to move them. Maybe create some new .h file (replication.h) which includes cluster.h and defines some new structs..?

@zuiderkwast Thank you ;)
The reason it was moved is that the tempDb struct needs clusterSlotsToKeysData (tempDb itself is not a cluster-specific concept).
tempDb is a broad concept and is used by

  • db.c (depends on cluster)
  • replication.c (depends on cluster)

If it lived in one of those two, the other wouldn't be able to access it,
so keeping it in server.h was the better choice.

I'm afraid I'm still not familiar enough with the code base as a whole, or with C best practices, to know whether that was a decent solution though. I believe that could also be a separate commit after the storm of changes and rebases done here, if necessary; what do you think?

But most importantly, I believe it would be great to have your review on the correctness of this change after I integrated your PR. For example, I noticed your change includes this cleanup in db.c restoreDbBackup:

    /* Restore slots to keys map backup if enable cluster. */
    if (server.cluster_enabled) {
        serverAssert(server.cluster->slots_to_keys->numele == 0);
        raxFree(server.cluster->slots_to_keys);
        server.cluster->slots_to_keys = backup->slots_to_keys;
        memcpy(server.cluster->slots_keys_count, backup->slots_keys_count,
                sizeof(server.cluster->slots_keys_count));
    }

into

    /* Restore slots to keys map backup if enable cluster. */
    if (server.cluster_enabled) slotToKeyRestoreBackup(&backup->slots_to_keys);

Which, inlined, is:

    memcpy(server.cluster->slots_to_keys, backup, sizeof(server.cluster->slots_to_keys));

And that's what I follow here after swapping tempDb with main db:
ca922aa

Does it make sense? Is the removed part now unnecessary?
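
For context, the equivalence in question (per-field restore versus one memcpy over the whole per-slot array) can be checked in isolation. This standalone sketch reuses the struct quoted earlier, with a tiny slot count standing in for CLUSTER_SLOTS:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SLOTS 4                       /* stands in for CLUSTER_SLOTS */

typedef struct dictEntry dictEntry;   /* opaque here, pointers only */

struct clusterSlotToKeys {
    uint64_t count;                   /* Number of keys in the slot. */
    dictEntry *head;                  /* First key-value entry in the slot. */
};
typedef struct clusterSlotToKeys slotsToKeysData[SLOTS];

/* A single memcpy restores both the counts and the head pointers for
 * every slot at once, which is why the per-field copy becomes unnecessary. */
static void slots_restore_backup(slotsToKeysData dst, const slotsToKeysData src) {
    memcpy(dst, src, sizeof(slotsToKeysData));
}
```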

Member

@oranagra oranagra left a comment


I re-reviewed the whole thing from scratch, here are my new comments:

Contributor

@madolson madolson left a comment


Approving based on documentation, didn't read the implementation details.

@oranagra oranagra changed the title Swap db only at end of diskless replication for better availability Replica keep serving data during repl-diskless-sync=swapdb for better availability Nov 4, 2021
@oranagra oranagra changed the title Replica keep serving data during repl-diskless-sync=swapdb for better availability Replica keep serving data during repl-diskless-load=swapdb for better availability Nov 4, 2021
@oranagra oranagra merged commit 91d0c75 into redis:unstable Nov 4, 2021
@oranagra
Member

oranagra commented Nov 4, 2021

@eduardobr merged 🎉
Thank you for your patience.

@eduardobr
Contributor Author

@eduardobr merged 🎉
Thank you for your patience.

Thank YOU for the patience 😁

@oranagra
Member

oranagra commented Nov 7, 2021

@eduardobr i noticed some sporadic timeout errors in the valgrind runs
https://github.com/redis/redis/runs/4123007165?check_suite_focus=true

Diskless load swapdb (different replid): replica enter loading

and

Diskless load swapdb (async_loading): old database is exposed while async replication is in progress

they did not reproduce the day before / after, so it might be nothing.
maybe you wanna look into that, or we can wait to gather more evidence.

oranagra pushed a commit that referenced this pull request Nov 9, 2021
During diskless replication, the check for a broken EOF mark was misplaced
and should happen earlier. Now we do not swap db, we do proper cleanup and
correctly raise module events on this kind of failure.

This issue existed prior to #9323, but back then the side effect was not restoring
the backup and not raising the correct module events on this failure.
@oranagra
Member

another timeout in the test under valgrind:
https://github.com/redis/redis/runs/4159338778?check_suite_focus=true

[TIMEOUT]: clients state report follows.
sock55c4ba1cc3c0 => (OK) Diskless load swapdb (async_loading): old database is exposed while async replication is in progress

@oranagra
Member

i think the timeout is not the fault of the tests, hope it'll get solved by #9767

oranagra added a commit that referenced this pull request Nov 10, 2021
We saw some tests sporadically time out on valgrind (namely the ones
from #9323).

Increasing valgrind timeout from 20 mins to 40 mins in CI.
And fixing an outdated help message.
    "module_fork_in_progress:%d\r\n"
    "module_fork_last_cow_size:%zu\r\n",
    (int)server.loading,
    (int)(server.loading && !server.async_loading),
Contributor


somehow i think !server.async_loading should not be added here? (a bit confused)
btw, i think async_loading also needs documentation, which may be missing; i added it in redis/redis-doc#1686

Member


It is indeed a bit confusing that in Redis both flags are set in this mode, while in INFO it's either one or the other.
But we concluded that for each audience (users or developers) it makes sense to look at it this way.
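
In other words, the mapping from the internal flags to the two INFO fields can be read like this (standalone sketch mirroring the snippet under review; these helper functions don't exist in the code base):

```c
#include <assert.h>

/* Internally an async load sets both server.loading and
 * server.async_loading; INFO reports them as mutually exclusive:
 * "loading" is the blocking kind, "async_loading" the new kind. */
static int info_loading(int loading, int async_loading) {
    return loading && !async_loading;
}

static int info_async_loading(int loading, int async_loading) {
    (void)loading;   /* loading is implied whenever async_loading is set */
    return async_loading;
}
```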

enjoy-binbin added a commit to enjoy-binbin/redis that referenced this pull request Nov 24, 2021
In redis#9323, when `repl-diskless-load` is enabled and set to `swapdb`,
if the master replication ID hasn't changed, we can load the data set
asynchronously and serve read commands during the full resync.

In the `diskless loading short read` test, after loading successfully,
we wait for the loading to stop and continue the for loop.

After the introduction of `async_loading`, we also need to check it.
Otherwise the next loop will start too soon, which may trigger a timing issue.
oranagra pushed a commit that referenced this pull request Nov 24, 2021
hwware pushed a commit to hwware/redis that referenced this pull request Dec 20, 2021
oranagra added a commit that referenced this pull request Dec 22, 2021
…ading (#9878)

## background
Till now, CONFIG SET was blocked during loading.
(In the not so distant past, GET was disallowed too.)

We recently (not released yet) added an async-loading mode, see #9323,
during which the server will serve CONFIG SET and any other command.
And now we realized (#9770) that some configs and commands are dangerous
during async-loading.

## changes
* Allow most CONFIG SET during loading (both on async-loading and normal loading)
* Allow CONFIG REWRITE and CONFIG RESETSTAT during loading
* Block a few config during loading (`appendonly`, `repl-diskless-load`, and `dir`)
* Block a few commands during loading (list below)

## the blocked commands:
* SAVE - obviously we don't wanna start a foreground save during loading 8-)
* BGSAVE - we don't mind to schedule one, but we don't wanna fork now
* BGREWRITEAOF - we don't mind to schedule one, but we don't wanna fork now
* MODULE - we obviously don't wanna unload a module during replication / rdb loading
  (MODULE HELP and MODULE LIST are not blocked)
* SYNC / PSYNC - we're in the middle of RDB loading from master, must not allow sync
  requests now.
* REPLICAOF / SLAVEOF - we're in the middle of replicating; maybe it makes sense to let
  the user abort it, but they couldn't do that so far, and i don't wanna take any risk of bugs due to odd state.
* CLUSTER - only allow [HELP, SLOTS, NODES, INFO, MYID, LINKS, KEYSLOT, COUNTKEYSINSLOT,
  GETKEYSINSLOT, RESET, REPLICAS, COUNT_FAILURE_REPORTS], for others, preserve the status quo

## other fixes
* processEventsWhileBlocked had an issue when being nested. This could happen with a busy script
  during async loading (new), but also with a busy script during AOF loading (old); this led to a crash in
  the scenario described in #6988
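
The block list above amounts to a lookup along these lines (standalone sketch; the real implementation relies on command flags rather than a literal name table, and the subcommand exceptions are omitted here):

```c
#include <assert.h>
#include <string.h>

/* Commands that must not run while an RDB is being loaded, per the
 * list above (MODULE HELP/LIST and several CLUSTER subcommands stay
 * allowed in the real policy). */
static const char *blocked_during_loading[] = {
    "save", "bgsave", "bgrewriteaof", "module",
    "sync", "psync", "replicaof", "slaveof", "cluster",
};

static int blocked_while_loading(const char *cmd) {
    size_t n = sizeof(blocked_during_loading) / sizeof(blocked_during_loading[0]);
    for (size_t i = 0; i < n; i++)
        if (strcmp(cmd, blocked_during_loading[i]) == 0) return 1;
    return 0;
}
```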
oranagra added a commit that referenced this pull request Feb 11, 2022
The bug was introduced by #9323 (released in 7.0 RC1).
The defines of `REDISMODULE_OPTIONS_HANDLE_IO_ERRORS` and `REDISMODULE_OPTION_NO_IMPLICIT_SIGNAL_MODIFIED` have the same value.

This will result in skipping `signalModifiedKey()` after `RM_CloseKey()` if the module has set the
`REDISMODULE_OPTIONS_HANDLE_REPL_ASYNC_LOAD` option.
The implication is missing WATCH and client-side tracking invalidations.

Other changes:
- add `no-implicit-signal-modified` to the options in INFO modules

Co-authored-by: Oran Agra <[email protected]>