Skip memory prefetch during loading to avoid crash in dictEmpty callback #14848

sundb merged 11 commits into redis:unstable
Conversation
    /* Get the keys ptrs - we do it here after the key obj was prefetched. */
    for (size_t i = 0; i < batch->key_count; i++) {
        if (unlikely(batch->keys[i] == NULL)) continue;
batch->keys[i] being NULL is now skipped here, but dictPrefetch()/initBatchInfo() later uses batch->keys[i] unconditionally (e.g., dictGetHash(..., batch->keys[i])), so a NULL key can still lead to a crash further down in the same call.
Severity: high
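To make the reviewer's point concrete, here is a small self-contained sketch (hypothetical types and names, not the actual Redis code): once a NULL key is tolerated in the batch, every later pass over `batch->keys[]` needs the same guard.

```c
#include <stddef.h>

#define MAX_BATCH 16

/* Hypothetical, simplified batch structure for illustration only. */
typedef struct {
    void *keys[MAX_BATCH];
    size_t key_count;
} Batch;

/* Any later pass over batch->keys[] (analogous to initBatchInfo() or
 * dictPrefetch()) must skip NULL slots too, otherwise the guard in the
 * first loop only postpones the crash. */
static void init_batch_info_sketch(Batch *batch) {
    for (size_t i = 0; i < batch->key_count; i++) {
        if (batch->keys[i] == NULL) continue; /* same guard needed here */
        /* ... e.g. compute the hash / bucket to prefetch for keys[i] ... */
    }
}
```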
@kairosci thx, did you find the argv to be NULL via debugging? But anyway, theoretically it shouldn't happen.
Yes, it happens.
Force-pushed from d8d2b16 to 00296fc
@sundb Please take another look, thank you!
Force-pushed from 30d88aa to a3e6e5c
Force-pushed from bd2378a to 76d183f
@kairosci thx, but I ran the test without this fix and it doesn't fail. Am I missing something?
The crash occurs under very specific conditions:
These conditions are rare and intermittent, which is why the test behaves the same with or without the fix; it is a preventive fix.
I'm worried that these NULL checks might hide an underlying bug. It's possible that c->argv was released somewhere else originally.
Force-pushed from 595c7e0 to 68b8cd0
- Add NULL checks for individual argv elements before redis_prefetch_read()
- Add NULL checks before accessing ->encoding field
- Add bounds checking for key_pos in addCommandToBatch()
- Handle cases where argv is NULL or partially cleaned by IO threads
- Protect against accessing invalid pointers during replication

This prevents crashes when IO threads clean up partial argv arrays during RDB replication, particularly when modules send commands while the replica is loading the snapshot.

Fixes: #14838

Test: Enhance replication test for NULL argv handling

- Add comprehensive test for prefetchCommands() NULL argv handling
- Simulate IO thread cleanup scenarios during RDB replication
- Test with external module commands during snapshot loading
- Verify data consistency in NULL argv scenarios
- Test both normal and partial cleanup cases

This test exercises the real code path where the crash could occur: prefetchIOThreadCommands() -> prefetchCommands() with IO threads active during RDB streaming to the replica.

Related to: #14838
Force-pushed from 68b8cd0 to 636da5f
- Added local variable capture for pcmd->argv and pcmd->argv[j] to prevent TOCTOU crashes from concurrent I/O threads.
- Added !pcmd->argv check in addCommandToBatch().
- Improved integration test to correctly trigger a full sync and verify stability.
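A rough self-contained sketch of the local-capture idea from the first bullet (hypothetical names and types; the actual diff hunk follows below):

```c
#include <stddef.h>

/* Hypothetical command struct standing in for the real one. */
typedef struct {
    void **argv;
    int argc;
} Cmd;

/* Read the shared fields once into locals and use only the copies, so a
 * later dereference never re-reads a field that an I/O thread may have
 * cleared between the check and the use. */
static void prefetch_one_arg_sketch(Cmd *pcmd, int j, void (*prefetch)(const void *)) {
    void **argv = pcmd->argv;                      /* capture the array pointer once */
    if (argv == NULL || j < 0 || j >= pcmd->argc) return;
    void *arg = argv[j];                           /* capture the element once */
    if (arg == NULL) return;                       /* partially cleaned argv: skip */
    prefetch(arg);                                 /* use the captured copies only */
}
```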
    int key_pos = pcmd->keys_result.keys[i].pos;
    robj **argv = pcmd->argv;
    /* Skip if argv element is NULL (can happen if argv was partially cleaned by IO threads) */
    if (unlikely(!argv || key_pos >= pcmd->argc)) continue;
Missing negative bounds check on key position
Low Severity
The bounds check key_pos >= pcmd->argc validates only the upper bound of key_pos but not the lower bound. Since key_pos is a signed int (from keyReference.pos which is int), a negative value would pass this check and cause an out-of-bounds read on argv[key_pos]. The PR description explicitly mentions "Invalid key positions in pcmd->keys_result" as a scenario this fix addresses, so the bounds validation is incomplete.
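A tiny sketch of what the completed range test could look like (illustrative helper, not the actual patch):

```c
#include <stdbool.h>

/* A position is usable only when it lies in [0, argc): this rejects
 * negative values as well as positions past the end of argv. */
static bool key_pos_valid(int key_pos, int argc) {
    return key_pos >= 0 && key_pos < argc;
}
```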
Possible reproduction steps:
@ShooterIT I'm still not sure how to handle it. Maybe disable prefetch when loading? However, there is still a risk: in the future, dictEmpty() might be used in non-loading scenarios.

    void dictEmpty(dict *d, void(callback)(dict*)) {
        ...
        _dictClear(d, 0, callback);
        /* we might prefetch commands in this callback */
        _dictClear(d, 1, callback);
        ...
    }
    test {prefetchCommands handles NULL argv and keys during RDB replication with IO threads} {
        # Enable diskless sync to trigger RDB streaming during replication
        $master config set repl-diskless-sync yes
        $master config set repl-diskless-sync-delay 0

        # Verify IO threads are enabled on the slave
        set io_threads [$slave config get io-threads]
        if {[lindex $io_threads 1] > 1} {
            # Create substantial data on the master
            for {set i 0} {$i < 100} {incr i} {
                $master set key:$i "value:$i"
            }

            # Start replication
            $slave slaveof $master_host $master_port

            # Wait for replication to complete
            wait_for_condition 100 1000 {
                [lindex [$slave role] 0] eq {slave} &&
                [string match {*master_link_status:up*} [$slave info replication]]
            } else {
                fail "Replica did not complete synchronization"
            }

            # Verify data was replicated
            for {set i 0} {$i < 100} {incr i} {
                assert_equal "value:$i" [$slave get key:$i]
            }
@kairosci I think the test is not correct; my understanding of the reproduction steps would be:
- Insert well over 65535 keys into the replica (see the sketch after this list for why the key count matters).
- Then execute REPLICAOF to trigger emptyDB for replica.
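The "well over 65535" figure matches the callback cadence described in the PR description: the clearing loop only invokes its callback once every 65536 buckets, so the replica needs enough keys for the callback (and the event processing inside it) to fire mid-clear. A self-contained sketch of that pattern (not the actual Redis code):

```c
#include <stddef.h>

/* Clear a table bucket by bucket, invoking a user hook once every 65536
 * buckets; whatever the hook does (e.g. processing pending client commands)
 * runs while the table is only partially cleared. */
static void clear_table_sketch(size_t buckets, void (*hook)(void *), void *ctx) {
    for (size_t i = 0; i < buckets; i++) {
        if (hook && (i & 65535) == 0) hook(ctx); /* periodic callback */
        /* ... free every entry chained in bucket i ... */
    }
}
```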
I added 70000 keys.
I modified your test to reproduce this crash. Please refer to it.
The main point is that after calling replicaof, the same client is used to continuously send the get command, giving emptydb the opportunity to process these commands in the middle.
make SANITIZER=address
./runtest --config io-threads 4 --single integration/replication
    start_server {tags {"repl external:skip"}} {
        set master [srv 0 client]
        set master_host [srv 0 host]
        set master_port [srv 0 port]
        start_server {overrides {io-threads 2}} {
            set slave [srv 0 client]

            test {prefetchCommands handles NULL argv and keys during RDB replication with IO threads} {
                # Enable diskless sync to trigger RDB streaming during replication
                $master config set repl-diskless-sync yes
                $master config set repl-diskless-sync-delay 0

                # Verify IO threads are enabled on the slave
                set io_threads [$slave config get io-threads]
                if {[lindex $io_threads 1] > 1} {
                    # Force a full resync by resetting the slave
                    $slave debug populate 700000
                    set rd [redis_deferring_client 0]
                    # $slave slaveof $master_host $master_port
                    $rd slaveof $master_host $master_port

                    set batch_size 1000
                    set total 10000000
                    set num_batches [expr {$total / $batch_size}]
                    for {set b 0} {$b < $num_batches} {incr b} {
                        set buf ""
                        for {set i 0} {$i < $batch_size} {incr i} {
                            append buf [format_command get key:1]
                        }
                        $rd write $buf
                        $rd flush
                        # for {set i 0} {$i < $batch_size} {incr i} { $rd read }
                    }
                    $rd close
                } else {
                    skip "Test requires IO threads to be enabled"
                }
            }
        }
    }
Maybe we can handle both.
If loading, we don't do prefetch; actually, it is useless to prefetch keys/values from the main db, we can just return in prefetchCommands.
it seems we already had, we call
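For illustration, a minimal self-contained sketch (not the actual Redis code) of such an early return behind a loading flag:

```c
#include <stdbool.h>

/* Stand-in for the global server state. */
static struct { bool loading; } server;

static void prefetch_commands_sketch(void) {
    /* While an RDB is being loaded the main dict is being rebuilt (and may
     * even be mid-dictEmpty), so prefetching keys from it is pointless and
     * unsafe: return before touching any dictionary internals. */
    if (server.loading) return;
    /* ... normal key/value prefetch batching would go here ... */
}
```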
Force-pushed from 12329a5 to 9a119b6
Co-authored-by: Yuan Wang <[email protected]>
Skip memory prefetch during loading to avoid crash in dictEmpty callback (#14848)

Fixes #14838

Co-authored-by: kairosci <[email protected]>
Co-authored-by: debing.sun <[email protected]>
Co-authored-by: Yuan Wang <[email protected]>
Fixes #14838

## Summary

Fix a crash in `prefetchCommands()` that occurs during replica full sync when the replica has existing data that needs to be emptied.

## Problem Description

During `emptyData()` → `kvstoreEmpty()` → `dictEmpty()` → `_dictClear()`, the first hash table is cleared and `d->ht_table[0]` is set to NULL via `_dictReset`. Then, while clearing the second hash table, every 65536 buckets it invokes `replicationEmptyDbCallback()` → `processEventsWhileBlocked()` → `readQueryFromClient()` → `prefetchCommands()`.

At this point, `dictSize() > 0` still holds (because the second hash table isn't fully cleared yet), but `ht_table[0]` is already NULL. The prefetch code assumed `ht_table[0]` is always valid when `dictSize() > 0`, leading to a crash.

## Solution

1. **Skip prefetch during loading**: Added a `server.loading` check at the top of `prefetchCommands()` to return early. During RDB loading, the main dictionary is being rebuilt, so prefetching keys from it is useless anyway.
2. **Add defensive assertion**: Added `serverAssert(batch->current_dicts[i]->ht_table[0])` in `initBatchInfo()` to catch any future cases where `ht_table[0]` is NULL while `dictSize() > 0` (which should only happen mid-`dictEmpty` via `_dictReset`).

Note
Medium Risk
Touches the command prefetch path during replication/RDB loading; while the behavior change is small (early return), it affects a performance-critical hot path and adds a new invariant assertion that could trigger if other edge cases exist.
Overview
Prevents a crash during replica full sync by disabling `prefetchCommands()` while `server.loading` is true, avoiding prefetching against dictionaries that may be mid-`dictEmpty`/reset.

Adds a defensive `serverAssert` in `initBatchInfo()` to ensure `ht_table[0]` is non-NULL when `dictSize() > 0`, and introduces an integration test that reproduces the replication + IO threads scenario with heavy pipelined traffic during a full resync.

Written by Cursor Bugbot for commit 98914ba.