Skip to content

Skip memory prefetch during loading to avoid crash in dictEmpty callback#14848

Merged
sundb merged 11 commits intoredis:unstablefrom
kairosci:fix/14838-prefetchcommands-crash
Mar 24, 2026
Merged

Skip memory prefetch during loading to avoid crash in dictEmpty callback#14848
sundb merged 11 commits intoredis:unstablefrom
kairosci:fix/14838-prefetchcommands-crash

Conversation

@kairosci
Copy link
Copy Markdown
Contributor

@kairosci kairosci commented Mar 4, 2026

Fixes #14838

Summary

Fix a crash in prefetchCommands() that occurs during replica full sync when the replica has existing data that needs to be emptied.

Problem Description

During emptyData()kvstoreEmpty()dictEmpty()_dictClear(), the first hash table is cleared and d->ht_table[0] is set to NULL via _dictReset. Then while clearing the second hash table, every 65536 buckets it invokes replicationEmptyDbCallback()processEventsWhileBlocked()readQueryFromClient()prefetchCommands().
At this point, dictSize() > 0 still holds (because the second hash table isn't fully cleared yet), but ht_table[0] is already NULL. The prefetch code assumed ht_table[0] is always valid when dictSize() > 0, leading to a crash.

Solution

  1. Skip prefetch during loading: Added a server.loading check at the top of prefetchCommands() to return early. During RDB loading, the main dictionary is being rebuilt, so prefetching keys from it is useless anyway.
  2. Add defensive assertion: Added serverAssert(batch->current_dicts[i]->ht_table[0]) in initBatchInfo() to catch any future cases where ht_table[0] is NULL while dictSize() > 0 (which should only happen mid-dictEmpty via _dictReset).

Note

Medium Risk
Touches the command prefetch path during replication/RDB loading; while the behavior change is small (early return), it affects a performance-critical hot path and adds a new invariant assertion that could trigger if other edge cases exist.

Overview
Prevents a crash during replica full sync by disabling prefetchCommands() while server.loading is true, avoiding prefetching against dictionaries that may be mid-dictEmpty/reset.

Adds a defensive serverAssert in initBatchInfo() to ensure ht_table[0] is non-NULL when dictSize() > 0, and introduces an integration test that reproduces the replication + IO threads scenario with heavy pipelined traffic during a full resync.

Written by Cursor Bugbot for commit 98914ba. This will update automatically on new commits. Configure here.

@augmentcode
Copy link
Copy Markdown

augmentcode Bot commented Mar 4, 2026

🤖 Augment PR Summary

Summary: Prevents a crash in prefetchCommands() during replica full sync when module traffic produces partially-initialized pending commands.

Changes:

  • Adds defensive NULL checks for pcmd->argv before prefetching argument objects/ptrs
  • Guards against NULL entries in batch->keys when converting key objects to their underlying pointers

🤖 Was this summary useful? React with 👍 or 👎

Copy link
Copy Markdown

@augmentcode augmentcode Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review completed. 1 suggestions posted.

Fix All in Augment

Comment augment review to trigger a new review at any time.

Comment thread src/memory_prefetch.c Outdated

/* Get the keys ptrs - we do it here after the key obj was prefetched. */
for (size_t i = 0; i < batch->key_count; i++) {
if (unlikely(batch->keys[i] == NULL)) continue;
Copy link
Copy Markdown

@augmentcode augmentcode Bot Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

batch->keys[i] being NULL is now skipped here, but dictPrefetch()/initBatchInfo() later uses batch->keys[i] unconditionally (e.g., dictGetHash(..., batch->keys[i])), so a NULL key can still lead to a crash further down in the same call.

Severity: high

Fix This in Augment

🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.

Comment thread src/memory_prefetch.c Outdated
@sundb
Copy link
Copy Markdown
Collaborator

sundb commented Mar 5, 2026

@kairosci thx, did you find the argv to be NULL via debug? but anyway theoretically it shouldn't happen.
can you add a corresponding test for this PR?

@sundb sundb added this to Redis 8.8 Mar 5, 2026
@github-project-automation github-project-automation Bot moved this to Todo in Redis 8.8 Mar 5, 2026
@sundb sundb added the release-notes indication that this issue needs to be mentioned in the release notes label Mar 5, 2026
@ShooterIT
Copy link
Copy Markdown
Member

argc > 1 but argv == NULL?
could it happen?

@kairosci
Copy link
Copy Markdown
Contributor Author

kairosci commented Mar 5, 2026

argc > 1 but argv == NULL?
could it happen?

Yes, it happens.

Comment thread tests/integration/replication.tcl Outdated
@kairosci kairosci force-pushed the fix/14838-prefetchcommands-crash branch from d8d2b16 to 00296fc Compare March 5, 2026 09:38
@kairosci
Copy link
Copy Markdown
Contributor Author

kairosci commented Mar 5, 2026

@sundb Please, take another look, thank you!

Comment thread tests/integration/replication.tcl
@kairosci kairosci force-pushed the fix/14838-prefetchcommands-crash branch 2 times, most recently from 30d88aa to a3e6e5c Compare March 5, 2026 09:57
Comment thread src/memory_prefetch.c Outdated
@kairosci kairosci force-pushed the fix/14838-prefetchcommands-crash branch from bd2378a to 76d183f Compare March 5, 2026 10:16
Comment thread src/memory_prefetch.c Outdated
@sundb
Copy link
Copy Markdown
Collaborator

sundb commented Mar 5, 2026

@kairosci thx, but i ran the test without this fixing code, the test doesn't fail.

./runtest --config io-threads 4 --single integration/replication --only "prefetchCommands handles NULL argv and keys during RDB replication with IO threads"

am i missing something?

@kairosci
Copy link
Copy Markdown
Contributor Author

kairosci commented Mar 5, 2026

The crash occurs under very specific conditions:

  • IO threads clean up argv arrays during RDB replication.
  • External modules send commands while replication loads the snapshot.
  • Prefetch code attempts to access argv[j] elements that have been set to NULL. (I found it in this way).
  • The code also attempts to access ->encoding on potentially invalid pointers.

These conditions are rare and intermittent, which is why the difference seems idempotent; it is a preventive fix.

@sundb
Copy link
Copy Markdown
Collaborator

sundb commented Mar 5, 2026

The crash occurs under very specific conditions:

  • IO threads clean up argv arrays during RDB replication.
  • External modules send commands while replication loads the snapshot.
  • Prefetch code attempts to access argv[j] elements that have been set to NULL. (I found it in this way).
  • The code also attempts to access ->encoding on potentially invalid pointers.

These conditions are rare and intermittent, which is why the difference seems idempotent; it is a preventive fix.

I'm worried that these NULL judgments might hide an underlying bug. It's possible that c->argv was released somewhere originally.
From the steps you provided, I suspect that it's most likely that c->argv was modified during the execution of FT.SEARCH. I'd rather we know how it touched it.
Do you have reproducible steps? I want to verify it locally as well. thank you

@kairosci kairosci force-pushed the fix/14838-prefetchcommands-crash branch from 595c7e0 to 68b8cd0 Compare March 5, 2026 11:06
- Add NULL checks for individual argv elements before redis_prefetch_read()
- Add NULL checks before accessing ->encoding field
- Add bounds checking for key_pos in addCommandToBatch()
- Handle cases where argv is NULL or partially cleaned by IO threads
- Protect against accessing invalid pointers during replication

This prevents crashes when IO threads clean up partial argv arrays during RDB replication, particularly when modules send commands while the replica is loading the snapshot.

Fixes: #14838

Test: Enhance replication test for NULL argv handling

- Add comprehensive test for prefetchCommands() NULL argv handling
- Simulate IO thread cleanup scenarios during RDB replication
- Test with external module commands during snapshot loading
- Verify data consistency in NULL argv scenarios
- Test both normal and partial cleanup cases

This test exercises the real code path where the crash could occur:
prefetchIOThreadCommands() -> prefetchCommands() with IO threads active
during RDB streaming to the replica.

Related to: #14838
@kairosci kairosci force-pushed the fix/14838-prefetchcommands-crash branch from 68b8cd0 to 636da5f Compare March 5, 2026 11:09
Comment thread tests/integration/replication.tcl Outdated
- Added local variable capture for pcmd->argv and pcmd->argv[j] to prevent TOCTOU crashes from concurrent I/O threads.
- Added !pcmd->argv check in addCommandToBatch().
- Improved integration test to correctly trigger a full sync and verify stability.
Comment thread src/memory_prefetch.c Outdated
int key_pos = pcmd->keys_result.keys[i].pos;
robj **argv = pcmd->argv;
/* Skip if argv element is NULL (can happen if argv was partially cleaned by IO threads) */
if (unlikely(!argv || key_pos >= pcmd->argc)) continue;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing negative bounds check on key position

Low Severity

The bounds check key_pos >= pcmd->argc validates only the upper bound of key_pos but not the lower bound. Since key_pos is a signed int (from keyReference.pos which is int), a negative value would pass this check and cause an out-of-bounds read on argv[key_pos]. The PR description explicitly mentions "Invalid key positions in pcmd->keys_result" as a scenario this fix addresses, so the bounds validation is incomplete.

Fix in Cursor Fix in Web

@sundb
Copy link
Copy Markdown
Collaborator

sundb commented Mar 5, 2026

possible reproduce steps:

  1. emptyData() → kvstoreEmpty() → dictEmpty() → _dictClear(d, 0, callback) clears the first hash table and then sets d->ht_table[0] = NULL (via _dictReset).
  2. Then _dictClear(d, 1, callback) runs. Every 65536 buckets it calls replicationEmptyDbCallback → processEventsWhileBlocked() → readQueryFromClient() -> prefetchCommands().
  3. However, during prefetch, we assumed that when dict size is not 0, dict[0] must not be NULL.
    if (!batch->current_dicts[i] || dictSize(batch->current_dicts[i]) == 0)

@ShooterIT I'm still not sure how to handle it. Maybe disable prefetch when loading? However, there is still a risk. In the future, dictEmpty() might be used in non-loading scenarios.
CC @tezc

void dictEmpty(dict *d, void(callback)(dict*)) {
    ...

    _dictClear(d,0,callback);
    /* we might prefetch command in this callback  */
    _dictClear(d,1,callback);

    ...
}

Comment thread tests/integration/replication.tcl Outdated
Comment on lines +1848 to +1875
test {prefetchCommands handles NULL argv and keys during RDB replication with IO threads} {
# Enable diskless sync to trigger RDB streaming during replication
$master config set repl-diskless-sync yes
$master config set repl-diskless-sync-delay 0

# Verify IO threads are enabled on the slave
set io_threads [$slave config get io-threads]
if {[lindex $io_threads 1] > 1} {
# Create substantial data on the master
for {set i 0} {$i < 100} {incr i} {
$master set key:$i "value:$i"
}

# Start replication
$slave slaveof $master_host $master_port

# Wait for replication to complete
wait_for_condition 100 1000 {
[lindex [$slave role] 0] eq {slave} &&
[string match {*master_link_status:up*} [$slave info replication]]
} else {
fail "Replica did not complete synchronization"
}

# Verify data was replicated
for {set i 0} {$i < 100} {incr i} {
assert_equal "value:$i" [$slave get key:$i]
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kairosci i think the test is not correct, my understanding of the reproduction steps would be:

  1. Insert well over 65535 keys into the replica.
  2. Then execute REPLICAOF to trigger emptyDB for replica.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I Added 70000 keys.

Copy link
Copy Markdown
Collaborator

@sundb sundb Mar 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I modified your test to reproduce this crash. Please refer to it
The main point is that after call replicaof, the same client is used to continuously call the get command, giving emptydb the opportunity to handle these commands in the middle.

make SANITIZER=address
./runtest --config io-threads 4 --single integration/replication
start_server {tags {"repl external:skip"}} {
    set master [srv 0 client]
    set master_host [srv 0 host]
    set master_port [srv 0 port]

    start_server {overrides {io-threads 2}} {
        set slave [srv 0 client]

        test {prefetchCommands handles NULL argv and keys during RDB replication with IO threads} {
            # Enable diskless sync to trigger RDB streaming during replication
            $master config set repl-diskless-sync yes
            $master config set repl-diskless-sync-delay 0

            # Verify IO threads are enabled on the slave
            set io_threads [$slave config get io-threads]
            if {[lindex $io_threads 1] > 1} {

                # Force a full resync by resetting the slave
                $slave debug populate 700000
                set rd [redis_deferring_client 0]
                # $slave slaveof $master_host $master_port
                $rd slaveof $master_host $master_port

                set batch_size 1000
                set total 10000000
                set num_batches [expr {$total / $batch_size}]
                for {set b 0} {$b < $num_batches} {incr b} {
                    set buf ""
                    for {set i 0} {$i < $batch_size} {incr i} {
                        append buf [format_command get key:1]
                    }
                    $rd write $buf
                    $rd flush
                    # for {set i 0} {$i < $batch_size} {incr i} { $rd read }
                }
                $rd close
            } else {
                skip "Test requires IO threads to be enabled"
            }
        }
    }
}

@ShooterIT
Copy link
Copy Markdown
Member

ShooterIT commented Mar 6, 2026

@ShooterIT I'm still not sure how to handle it. Maybe disable prefetch when loading? However, there is still a risk. In the future, dictEmpty() might be used in non-loading scenarios.

maybe we can handle both.

Maybe disable prefetch when loading?

if loading, we don't do prefetch, actually, it is useless to prefetch key/value from main db, we can just return in prefetchCommands

dictEmpty() might be used in non-loading scenarios.

it seems we already had, we call emptyData first in debug reload command, but without a callback. to avoid risk, we can also skip the dict which is being empty in initBatchInfo, as @sundb said, normally the ht_table[0] must not be NULL except clearing, so we can check here.

However, during prefetch, we assumed that when dict size is not 0, dict[0] must not be NULL.
if (!batch->current_dicts[i] || dictSize(batch->current_dicts[i]) == 0)

@sundb
Copy link
Copy Markdown
Collaborator

sundb commented Mar 12, 2026

@kairosci what's the status for this fix, please take a look at this comment.

Comment thread tests/integration/replication.tcl Outdated
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 3 total unresolved issues (including 2 from previous reviews).

Fix All in Cursor

Comment thread src/memory_prefetch.c Outdated
@sundb sundb force-pushed the fix/14838-prefetchcommands-crash branch from 12329a5 to 9a119b6 Compare March 23, 2026 04:56
@sundb sundb changed the title Fix crash in prefetchCommands() during replication with module commands Skip memory prefetch during loading to avoid crash in dictEmpty callback Mar 23, 2026
@sundb
Copy link
Copy Markdown
Collaborator

sundb commented Mar 23, 2026

@sundb sundb requested a review from ShooterIT March 23, 2026 06:32
Comment thread tests/integration/replication.tcl Outdated
@sundb sundb merged commit 1abd489 into redis:unstable Mar 24, 2026
18 checks passed
@github-project-automation github-project-automation Bot moved this from Todo to Done in Redis 8.8 Mar 24, 2026
@YaacovHazan YaacovHazan moved this from Todo to In Progress in Redis 8.6 Backport Mar 24, 2026
YaacovHazan pushed a commit to YaacovHazan/redis that referenced this pull request Mar 24, 2026
…ack (redis#14848)

Fixes redis#14838

## Summary

Fix a crash in `prefetchCommands()` that occurs during replica full sync
when the replica has existing data that needs to be emptied.

## Problem Description

During `emptyData()` → `kvstoreEmpty()` → `dictEmpty()` →
`_dictClear()`, the first hash table is cleared and `d->ht_table[0]` is
set to NULL via `_dictReset`. Then while clearing the second hash table,
every 65536 buckets it invokes `replicationEmptyDbCallback()` →
`processEventsWhileBlocked()` → `readQueryFromClient()` →
`prefetchCommands()`.
At this point, `dictSize() > 0` still holds (because the second hash
table isn't fully cleared yet), but `ht_table[0]` is already NULL. The
prefetch code assumed `ht_table[0]` is always valid when `dictSize() >
0`, leading to a crash.

## Solution
1. **Skip prefetch during loading**: Added a `server.loading` check at
the top of `prefetchCommands()` to return early. During RDB loading, the
main dictionary is being rebuilt, so prefetching keys from it is useless
anyway.
2. **Add defensive assertion**: Added
`serverAssert(batch->current_dicts[i]->ht_table[0])` in
`initBatchInfo()` to catch any future cases where `ht_table[0]` is NULL
while `dictSize() > 0` (which should only happen mid-`dictEmpty` via
`_dictReset`).

---------

Co-authored-by: kairosci <[email protected]>
Co-authored-by: debing.sun <[email protected]>
Co-authored-by: Yuan Wang <[email protected]>
@YaacovHazan YaacovHazan mentioned this pull request Mar 24, 2026
YaacovHazan pushed a commit that referenced this pull request Mar 24, 2026
…ack (#14848)

Fixes #14838

## Summary

Fix a crash in `prefetchCommands()` that occurs during replica full sync
when the replica has existing data that needs to be emptied.

## Problem Description

During `emptyData()` → `kvstoreEmpty()` → `dictEmpty()` →
`_dictClear()`, the first hash table is cleared and `d->ht_table[0]` is
set to NULL via `_dictReset`. Then while clearing the second hash table,
every 65536 buckets it invokes `replicationEmptyDbCallback()` →
`processEventsWhileBlocked()` → `readQueryFromClient()` →
`prefetchCommands()`.
At this point, `dictSize() > 0` still holds (because the second hash
table isn't fully cleared yet), but `ht_table[0]` is already NULL. The
prefetch code assumed `ht_table[0]` is always valid when `dictSize() >
0`, leading to a crash.

## Solution
1. **Skip prefetch during loading**: Added a `server.loading` check at
the top of `prefetchCommands()` to return early. During RDB loading, the
main dictionary is being rebuilt, so prefetching keys from it is useless
anyway.
2. **Add defensive assertion**: Added
`serverAssert(batch->current_dicts[i]->ht_table[0])` in
`initBatchInfo()` to catch any future cases where `ht_table[0]` is NULL
while `dictSize() > 0` (which should only happen mid-`dictEmpty` via
`_dictReset`).

---------

Co-authored-by: kairosci <[email protected]>
Co-authored-by: debing.sun <[email protected]>
Co-authored-by: Yuan Wang <[email protected]>
@YaacovHazan YaacovHazan moved this from In Progress to Done in Redis 8.6 Backport Mar 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release-notes indication that this issue needs to be mentioned in the release notes

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[BUG] Crash in prefetchCommands()

4 participants