Skip memory prefetch during loading to avoid crash in dictEmpty callback #14848

sundb merged 11 commits into redis:unstable
Conversation
    /* Get the keys ptrs - we do it here after the key obj was prefetched. */
    for (size_t i = 0; i < batch->key_count; i++) {
        if (unlikely(batch->keys[i] == NULL)) continue;
batch->keys[i] being NULL is now skipped here, but dictPrefetch()/initBatchInfo() later uses batch->keys[i] unconditionally (e.g., dictGetHash(..., batch->keys[i])), so a NULL key can still lead to a crash further down in the same call.
Severity: high
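To make the reviewer's point concrete, here is a small self-contained sketch (hypothetical types and names, not the actual Redis code): once a NULL key is tolerated in the batch, every later pass over `batch->keys[]` needs the same guard.

```c
#include <stddef.h>

#define MAX_BATCH 16

/* Hypothetical, simplified batch structure for illustration only. */
typedef struct {
    void *keys[MAX_BATCH];
    size_t key_count;
} Batch;

/* Any later pass over batch->keys[] (analogous to initBatchInfo() or
 * dictPrefetch()) must skip NULL slots too, otherwise the guard in the
 * first loop only postpones the crash. */
static void init_batch_info_sketch(Batch *batch) {
    for (size_t i = 0; i < batch->key_count; i++) {
        if (batch->keys[i] == NULL) continue; /* same guard needed here */
        /* ... e.g. compute the hash / bucket to prefetch for keys[i] ... */
    }
}
```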
@kairosci thx, did you find the argv to be NULL via debugging? But anyway, theoretically it shouldn't happen.
Yes, it happens.
Force-pushed from d8d2b16 to 00296fc
@sundb Please take another look, thank you!
Force-pushed from 30d88aa to a3e6e5c
Force-pushed from bd2378a to 76d183f
@kairosci thx, but I ran the test without this fix and it doesn't fail. Am I missing something?
The crash occurs under very specific conditions:
These conditions are rare and intermittent, which is why the test behaves the same with or without the fix; it is a preventive fix.
I'm worried that these NULL checks might hide an underlying bug. It's possible that c->argv was released somewhere else originally.
Force-pushed from 595c7e0 to 68b8cd0
- Add NULL checks for individual argv elements before redis_prefetch_read()
- Add NULL checks before accessing ->encoding field
- Add bounds checking for key_pos in addCommandToBatch()
- Handle cases where argv is NULL or partially cleaned by IO threads
- Protect against accessing invalid pointers during replication

This prevents crashes when IO threads clean up partial argv arrays during RDB replication, particularly when modules send commands while the replica is loading the snapshot.

Fixes: #14838

Test: Enhance replication test for NULL argv handling

- Add comprehensive test for prefetchCommands() NULL argv handling
- Simulate IO thread cleanup scenarios during RDB replication
- Test with external module commands during snapshot loading
- Verify data consistency in NULL argv scenarios
- Test both normal and partial cleanup cases

This test exercises the real code path where the crash could occur: prefetchIOThreadCommands() -> prefetchCommands() with IO threads active during RDB streaming to the replica.

Related to: #14838
Force-pushed from 68b8cd0 to 636da5f
- Added local variable capture for pcmd->argv and pcmd->argv[j] to prevent TOCTOU crashes from concurrent I/O threads.
- Added !pcmd->argv check in addCommandToBatch().
- Improved integration test to correctly trigger a full sync and verify stability.
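A rough self-contained sketch of the local-capture idea from the first bullet (hypothetical names and types; the actual diff hunk follows below):

```c
#include <stddef.h>

/* Hypothetical command struct standing in for the real one. */
typedef struct {
    void **argv;
    int argc;
} Cmd;

/* Read the shared fields once into locals and use only the copies, so a
 * later dereference never re-reads a field that an I/O thread may have
 * cleared between the check and the use. */
static void prefetch_one_arg_sketch(Cmd *pcmd, int j, void (*prefetch)(const void *)) {
    void **argv = pcmd->argv;                      /* capture the array pointer once */
    if (argv == NULL || j < 0 || j >= pcmd->argc) return;
    void *arg = argv[j];                           /* capture the element once */
    if (arg == NULL) return;                       /* partially cleaned argv: skip */
    prefetch(arg);                                 /* use the captured copies only */
}
```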
    int key_pos = pcmd->keys_result.keys[i].pos;
    robj **argv = pcmd->argv;
    /* Skip if argv element is NULL (can happen if argv was partially cleaned by IO threads) */
    if (unlikely(!argv || key_pos >= pcmd->argc)) continue;
Missing negative bounds check on key position
Low Severity
The bounds check key_pos >= pcmd->argc validates only the upper bound of key_pos but not the lower bound. Since key_pos is a signed int (from keyReference.pos which is int), a negative value would pass this check and cause an out-of-bounds read on argv[key_pos]. The PR description explicitly mentions "Invalid key positions in pcmd->keys_result" as a scenario this fix addresses, so the bounds validation is incomplete.
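A tiny sketch of what the completed range test could look like (illustrative helper, not the actual patch):

```c
#include <stdbool.h>

/* A position is usable only when it lies in [0, argc): this rejects
 * negative values as well as positions past the end of argv. */
static bool key_pos_valid(int key_pos, int argc) {
    return key_pos >= 0 && key_pos < argc;
}
```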
Possible reproduction steps:
@ShooterIT I'm still not sure how to handle it. Maybe disable prefetch when loading? However, there is still a risk: in the future, dictEmpty() might be used in non-loading scenarios.

    void dictEmpty(dict *d, void(callback)(dict*)) {
        ...
        _dictClear(d, 0, callback);
        /* we might prefetch commands in this callback */
        _dictClear(d, 1, callback);
        ...
    }
    test {prefetchCommands handles NULL argv and keys during RDB replication with IO threads} {
        # Enable diskless sync to trigger RDB streaming during replication
        $master config set repl-diskless-sync yes
        $master config set repl-diskless-sync-delay 0

        # Verify IO threads are enabled on the slave
        set io_threads [$slave config get io-threads]
        if {[lindex $io_threads 1] > 1} {
            # Create substantial data on the master
            for {set i 0} {$i < 100} {incr i} {
                $master set key:$i "value:$i"
            }

            # Start replication
            $slave slaveof $master_host $master_port

            # Wait for replication to complete
            wait_for_condition 100 1000 {
                [lindex [$slave role] 0] eq {slave} &&
                [string match {*master_link_status:up*} [$slave info replication]]
            } else {
                fail "Replica did not complete synchronization"
            }

            # Verify data was replicated
            for {set i 0} {$i < 100} {incr i} {
                assert_equal "value:$i" [$slave get key:$i]
            }
@kairosci I think the test is not correct; my understanding of the reproduction steps would be:
- Insert well over 65535 keys into the replica (see the sketch after this list for why the key count matters).
- Then execute REPLICAOF to trigger emptyDB for replica.
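The "well over 65535" figure matches the callback cadence described in the PR description: the clearing loop only invokes its callback once every 65536 buckets, so the replica needs enough keys for the callback (and the event processing inside it) to fire mid-clear. A self-contained sketch of that pattern (not the actual Redis code):

```c
#include <stddef.h>

/* Clear a table bucket by bucket, invoking a user hook once every 65536
 * buckets; whatever the hook does (e.g. processing pending client commands)
 * runs while the table is only partially cleared. */
static void clear_table_sketch(size_t buckets, void (*hook)(void *), void *ctx) {
    for (size_t i = 0; i < buckets; i++) {
        if (hook && (i & 65535) == 0) hook(ctx); /* periodic callback */
        /* ... free every entry chained in bucket i ... */
    }
}
```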
I added 70000 keys.
I modified your test to reproduce this crash. Please refer to it.
The main point is that after calling replicaof, the same client is used to continuously send the get command, giving emptydb the opportunity to process these commands in the middle.
make SANITIZER=address
./runtest --config io-threads 4 --single integration/replication
    start_server {tags {"repl external:skip"}} {
        set master [srv 0 client]
        set master_host [srv 0 host]
        set master_port [srv 0 port]
        start_server {overrides {io-threads 2}} {
            set slave [srv 0 client]

            test {prefetchCommands handles NULL argv and keys during RDB replication with IO threads} {
                # Enable diskless sync to trigger RDB streaming during replication
                $master config set repl-diskless-sync yes
                $master config set repl-diskless-sync-delay 0

                # Verify IO threads are enabled on the slave
                set io_threads [$slave config get io-threads]
                if {[lindex $io_threads 1] > 1} {
                    # Force a full resync by resetting the slave
                    $slave debug populate 700000
                    set rd [redis_deferring_client 0]
                    # $slave slaveof $master_host $master_port
                    $rd slaveof $master_host $master_port

                    set batch_size 1000
                    set total 10000000
                    set num_batches [expr {$total / $batch_size}]
                    for {set b 0} {$b < $num_batches} {incr b} {
                        set buf ""
                        for {set i 0} {$i < $batch_size} {incr i} {
                            append buf [format_command get key:1]
                        }
                        $rd write $buf
                        $rd flush
                        # for {set i 0} {$i < $batch_size} {incr i} { $rd read }
                    }
                    $rd close
                } else {
                    skip "Test requires IO threads to be enabled"
                }
            }
        }
    }
Maybe we can handle both.
If loading, we don't do prefetch; actually, it is useless to prefetch keys/values from the main db, we can just return in prefetchCommands.
it seems we already had, we call
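For illustration, a minimal self-contained sketch (not the actual Redis code) of such an early return behind a loading flag:

```c
#include <stdbool.h>

/* Stand-in for the global server state. */
static struct { bool loading; } server;

static void prefetch_commands_sketch(void) {
    /* While an RDB is being loaded the main dict is being rebuilt (and may
     * even be mid-dictEmpty), so prefetching keys from it is pointless and
     * unsafe: return before touching any dictionary internals. */
    if (server.loading) return;
    /* ... normal key/value prefetch batching would go here ... */
}
```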
Force-pushed from 12329a5 to 9a119b6
Co-authored-by: Yuan Wang <[email protected]>
Skip memory prefetch during loading to avoid crash in dictEmpty callback (#14848)

Fixes #14838

Co-authored-by: kairosci <[email protected]>
Co-authored-by: debing.sun <[email protected]>
Co-authored-by: Yuan Wang <[email protected]>
Fixes #14838

## Summary

Fix a crash in `prefetchCommands()` that occurs during replica full sync when the replica has existing data that needs to be emptied.

## Problem Description

During `emptyData()` → `kvstoreEmpty()` → `dictEmpty()` → `_dictClear()`, the first hash table is cleared and `d->ht_table[0]` is set to NULL via `_dictReset`. Then, while clearing the second hash table, every 65536 buckets it invokes `replicationEmptyDbCallback()` → `processEventsWhileBlocked()` → `readQueryFromClient()` → `prefetchCommands()`.

At this point, `dictSize() > 0` still holds (because the second hash table isn't fully cleared yet), but `ht_table[0]` is already NULL. The prefetch code assumed `ht_table[0]` is always valid when `dictSize() > 0`, leading to a crash.

## Solution

1. **Skip prefetch during loading**: Added a `server.loading` check at the top of `prefetchCommands()` to return early. During RDB loading, the main dictionary is being rebuilt, so prefetching keys from it is useless anyway.
2. **Add defensive assertion**: Added `serverAssert(batch->current_dicts[i]->ht_table[0])` in `initBatchInfo()` to catch any future cases where `ht_table[0]` is NULL while `dictSize() > 0` (which should only happen mid-`dictEmpty` via `_dictReset`).

Note
Medium Risk
Touches the command prefetch path during replication/RDB loading; while the behavior change is small (early return), it affects a performance-critical hot path and adds a new invariant assertion that could trigger if other edge cases exist.
Overview
Prevents a crash during replica full sync by disabling `prefetchCommands()` while `server.loading` is true, avoiding prefetching against dictionaries that may be mid-`dictEmpty`/reset.

Adds a defensive `serverAssert` in `initBatchInfo()` to ensure `ht_table[0]` is non-NULL when `dictSize() > 0`, and introduces an integration test that reproduces the replication + IO threads scenario with heavy pipelined traffic during a full resync.

Written by Cursor Bugbot for commit 98914ba.