Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reply LOADING on replica while flushing the db #13495

Merged
merged 11 commits into from
Sep 3, 2024
Merged

Conversation

tezc
Copy link
Collaborator

@tezc tezc commented Aug 27, 2024

On a full sync, replica starts discarding existing db. If the existing db is huge and
flush is happening synchronously, replica may become unresponsive.

Adding a change to yield back to event loop while flushing db on a replica.
Replica will reply -LOADING in this case. Note that while replica is loading the new rdb,
it may get an error and start flushing the partial db. This step may take a long time as well.
Similarly, replica will reply -LOADING in this case.

To call processEventsWhileBlocked() and reply -LOADING, we need to do a few things:

  • Set connSetReadHandler() null not to process further data from the master
  • Set server.loading flag
  • Call blockingOperationStarts()

rdbload() already does these steps and calls processEventsWhileBlocked() while loading the rdb.
Added a new call rdbLoadWithEmptyCb() which accepts callback to flush db before loading rdb or
when an error happens while loading.

For diskless replication, doing something similar and calling emptyData() after setting required flags.

Additional changes:

  • Allow appendonly config change during loading.
    Config can be changed while loading data on startup or on replication when slave is loading RDB.
    We allow config change command to update server.aof_enabled and then lazily apply config change after loading operation is completed.

  • Added a test for replica-lazy-flush config

* reply -LOADING. */
emptyData(-1, empty_db_flags, replicationEmptyDbCallback);
}
loadingFireEvent(RDBFLAGS_REPLICATION);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Split startLoading() into two functions here. Before the PR, we fired module event after flushing the data. Not sure it matters but I tried to preserve same behavior.

loadingSetFlags() sets flags so emptyData() can reply -LOADING.
loadingFireEvent() fires event with an empty db same as before.

@tezc tezc requested a review from oranagra August 27, 2024 13:34
Copy link
Member

@oranagra oranagra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'd like @sundb to take a look too

@sundb
Copy link
Collaborator

sundb commented Aug 28, 2024

@oranagra is it the chance to make the config replica-lazy-flush to be yes by default.
and continue #12191

@oranagra
Copy link
Member

@oranagra is it the chance to make the config replica-lazy-flush to be yes by default. and continue #12191

i don't think this is related to lazy flush.
lazy flush means we proceed to rdb loading without waiting to release the old memory.
in this case it just means we start replying with LOADING message (or take CONFIG SET and INFO), instead of being unresponsive.

oranagra
oranagra previously approved these changes Sep 2, 2024
@oranagra
Copy link
Member

oranagra commented Sep 2, 2024

approved by accident

@tezc
Copy link
Collaborator Author

tezc commented Sep 2, 2024

redis/src/module.c

Line 13073 in 3fcddfb

if (server.aof_state != AOF_OFF) startAppendOnly();

Also, I just realized this should have been server.aof_enabled check. So, this function disables aof before loading and never restarts it again. (I added this API 😨). I think we can fix in a follow up (separate) PR.

Note: It needs retry logic as well. We can call startAppendOnlyWithRetry() here.

oranagra
oranagra previously approved these changes Sep 2, 2024
Copy link
Member

@oranagra oranagra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@oranagra oranagra added the release-notes indication that this issue needs to be mentioned in the release notes label Sep 2, 2024
@sundb
Copy link
Collaborator

sundb commented Sep 2, 2024

fully CI: https://github.com/redis/redis/actions/runs/10666481354

@tezc tezc merged commit a7afd1d into redis:unstable Sep 3, 2024
14 checks passed
@tezc tezc deleted the loading-reply branch September 3, 2024 06:49
tezc added a commit that referenced this pull request Sep 8, 2024
#13495 introduced a change to reply -LOADING while flushing existing db on a replica. Some of our tests are
 sensitive to this change and do no expect -LOADING reply.

Fixing a couple of tests that fail time to time.
@YaacovHazan YaacovHazan mentioned this pull request Sep 11, 2024
YaacovHazan added a commit that referenced this pull request Sep 12, 2024
### New Features in binary distributions

- 7 new data structures: JSON, Time series, Bloom filter, Cuckoo filter,
Count-min sketch, Top-k, t-digest
- Redis scalable query engine (including vector search)

### Potentially breaking changes

- #12272 `GETRANGE` returns an empty bulk when the negative end index is
out of range
- #12395 Optimize `SCAN` command when matching data type

### Bug fixes

- #13510 Fix `RM_RdbLoad` to enable AOF after RDB loading is completed
- #13489 `ACL CAT` - return module commands
- #13476 Fix a race condition in the `cache_memory` of `functionsLibCtx`
- #13473 Fix incorrect lag due to trimming stream via `XTRIM` command
- #13338 Fix incorrect lag field in `XINFO` when tombstone is after the
`last_id` of the consume group
- #13470 On `HDEL` of last field - update the global hash field
expiration data structure
- #13465 Cluster: Pass extensions to node if extension processing is
handled by it
- #13443 Cluster: Ensure validity of myself when loading cluster config
- #13422 Cluster: Fix `CLUSTER SHARDS` command returns empty array

### Modules API

- #13509 New API calls: `RM_DefragAllocRaw`, `RM_DefragFreeRaw`, and
`RM_RegisterDefragCallbacks` - defrag API to allocate and free raw
memory

### Performance and resource utilization improvements

- #13503 Avoid overhead of comparison function pointer calls in listpack
`lpFind`
- #13505 Optimize `STRING` datatype write commands
- #13499 Optimize `SMEMBERS` command
- #13494 Optimize `GEO*` commands reply
- #13490 Optimize `HELLO` command
- #13488 Optimize client query buffer
- #12395 Optimize `SCAN` command when matching data type
- #13529 Optimize `LREM`, `LPOS`, `LINSERT`, and `LINDEX` commands
- #13516 Optimize `LRANGE` and other commands that perform several
writes to client buffers per call
- #13431 Avoid `used_memory` contention when updating from multiple
threads

### Other general improvements

- #13495 Reply `-LOADING` on replica while flushing the db

### CLI tools

- #13411 redis-cli: Fix wrong `dbnum` showed after the client
reconnected

### Notes

- No backward compatibility for replication or persistence.
- Additional distributions, upgrade paths, features, and improvements
will be introduced in upcoming pre-releases.
- With the GA release of 8.0 we will deprecate Redis Stack.
tezc added a commit that referenced this pull request Dec 3, 2024
In #13495, we introduced a feature to
reply -LOADING while flushing a large db on a replica.
While `_dictClear()` is in progress, it calls a callback for every 65k
items and we yield back to eventloop to reply -LOADING.

This change has made some tests unstable as those tests don't expect new
-LOADING reply.
One observation, inside `_dictClear()`, we call the callback even if db
has a few keys. Most tests run with small amount of keys. So, each
replication and cluster test has to handle potential -LOADING reply now.

This PR changes this behavior, skips calling callback when `i=0` to
stabilize replication tests.
Callback will be called after the first 65k items. Most tests use less
than 65k keys and they won't get -LOADING reply.
YaacovHazan pushed a commit that referenced this pull request Jan 14, 2025
In #13495, we introduced a feature to
reply -LOADING while flushing a large db on a replica.
While `_dictClear()` is in progress, it calls a callback for every 65k
items and we yield back to eventloop to reply -LOADING.

This change has made some tests unstable as those tests don't expect new
-LOADING reply.
One observation, inside `_dictClear()`, we call the callback even if db
has a few keys. Most tests run with small amount of keys. So, each
replication and cluster test has to handle potential -LOADING reply now.

This PR changes this behavior, skips calling callback when `i=0` to
stabilize replication tests.
Callback will be called after the first 65k items. Most tests use less
than 65k keys and they won't get -LOADING reply.
tezc added a commit that referenced this pull request Mar 26, 2025
When the diskless load configuration is set to on-empty-db, we retain a
pointer to the function library context. When emptyData() is called, it
frees this function library context pointer, leading to a use-after-free
situation.

I refactored code to ensure that emptyData() is called first, followed
by retrieving the valid pointer to the function library context.

Refactored code should not introduce any runtime implications.

Bug introduced by #13495 (Redis 8.0)

Co-authored-by: Oran Agra <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-notes indication that this issue needs to be mentioned in the release notes
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

3 participants