Skip to content

[8.6] [MOD-14732] fix race condition on hybrid#9066

Merged
redisearch-backport-pull-request[bot] merged 1 commit into8.6from
backport-9029-to-8.6
Apr 15, 2026
Merged

[8.6] [MOD-14732] fix race condition on hybrid#9066
redisearch-backport-pull-request[bot] merged 1 commit into8.6from
backport-9029-to-8.6

Conversation

@redisearch-backport-pull-request
Copy link
Copy Markdown
Contributor

@redisearch-backport-pull-request redisearch-backport-pull-request Bot commented Apr 15, 2026

Description

Backport of #9029 to 8.6.

Describe the changes in the pull request

The race condition

ProcessHybridCursorMappings dispatches async callbacks to shards and waits for all responses. The coordinator reads numShards from the topology, allocates a shared processCursorMappingCallbackContext (ctx), and waits until responseCount == numShards before freeing ctx.

However, iterStartCb runs asynchronously on an IO thread and reads the live topology, which may have a different shard count after a migration. If the live topology has M > N shards, iterStartCb dispatches commands to all M shards, but the coordinator only waits for N responses. After receiving N responses, the coordinator frees ctx via cleanupCtx, while the remaining M - N callbacks still hold a pointer to the freed memory.

When those late callbacks fire, MRIteratorCallback_GetPrivateData(ctx) returns a dangling pointer, and the callback dereferences freed memory — reading cb_ctx->mutex, cb_ctx->searchMappings, etc.

The ASAN trace confirms this: processCursorMappingCallback (thread T4, search-uv-1) reads from the freed ctx at the very first dereference, while thread T23 (coord-5081) already freed it via cleanupCtxrm_free.

The fix

The existing privateDataInit hook in MR_IterateWithPrivateData is called from iterStartCb on the IO thread after the actual shard count is determined from the live topology but before any commands are dispatched. We leverage this hook with a new processCursorMappingInit callback that:

  1. Locks ctx->mutex
  2. Updates ctx->numShards to the actual number of shards from the live topology
  3. Signals the condition variable (in case the coordinator is already waiting)
  4. Unlocks the mutex

The coordinator's wait loop now reads ctx->numShards (under the mutex) instead of the local numShards variable, so it dynamically picks up the updated count and waits for all dispatched callbacks before freeing ctx.

Also re-enables test_add_shard_and_migrate_hybrid_BG, which was previously skipped due to a flaky hybrid cursor migration crash (SIGSEGV) caused by this race.


Note

Medium Risk
Touches async coordination between the coordinator and shard I/O threads; mistakes could cause deadlocks or incomplete cursor mapping collection under topology changes. Change is localized to hybrid cursor mapping initialization and wait logic.

Overview
Prevents rare coordinator crashes in distributed FT.HYBRID queries during shard migration/topology changes by making cursor-mapping collection wait on the actual shard count determined on the I/O thread.

ProcessHybridCursorMappings now initializes numShards via a new private-data init callback and waits until all dispatched shard replies arrive before freeing shared callback state, eliminating a use-after-free race.

Reviewed by Cursor Bugbot for commit 845231d. Bugbot is set up for automated code reviews on this repo. Configure here.

* fix race condition on hybrid

* remove unrelated file

* use Strong/WeakRef instead of refCount

* fix the real issue, count proper numShards

* fix by setting numShards from IO thread

* rephrase

* fix potential problem if shards can be 0

* use initialized flag

* update comment

* remove unused numShards

(cherry picked from commit 3bb5f03)
@jit-ci
Copy link
Copy Markdown

jit-ci Bot commented Apr 15, 2026

🛡️ Jit Security Scan Results

CRITICAL HIGH MEDIUM

✅ No security findings were detected in this PR


Security scan by Jit

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 845231d. Configure here.

// Signal so the coordinator can re-check the wait condition.
pthread_cond_signal(ctx->completionCond);
pthread_mutex_unlock(ctx->mutex);
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing GitHub release note for this PR

Low Severity

Per project rules, a GitHub release note is needed. Suggested release note: "Fixed a race condition in hybrid queries that could cause a crash (use-after-free) when the cluster topology changes during a shard migration while a hybrid query is being processed."

Fix in Cursor Fix in Web

Triggered by project rule: Please assist in writing a GitHub release note for this PR, which is concise and focused on the user impact.

Reviewed by Cursor Bugbot for commit 845231d. Configure here.

@sonarqubecloud
Copy link
Copy Markdown

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 83.89%. Comparing base (2b7a93a) to head (845231d).
⚠️ Report is 1 commits behind head on 8.6.

Additional details and impacted files
@@            Coverage Diff             @@
##              8.6    #9066      +/-   ##
==========================================
+ Coverage   83.85%   83.89%   +0.03%     
==========================================
  Files         367      367              
  Lines       55818    55825       +7     
  Branches    14322    14322              
==========================================
+ Hits        46805    46833      +28     
+ Misses       8851     8830      -21     
  Partials      162      162              
Flag Coverage Δ
flow 84.90% <100.00%> (+0.07%) ⬆️
unit 50.69% <0.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@redisearch-backport-pull-request redisearch-backport-pull-request Bot added this pull request to the merge queue Apr 15, 2026
Merged via the queue into 8.6 with commit 40a7055 Apr 15, 2026
51 of 53 checks passed
@redisearch-backport-pull-request redisearch-backport-pull-request Bot deleted the backport-9029-to-8.6 branch April 15, 2026 06:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants