[8.6] [MOD-14732] fix race condition on hybrid #9066

redisearch-backport-pull-request[bot] merged 1 commit into 8.6 from
Conversation
* fix race condition on hybrid
* remove unrelated file
* use Strong/WeakRef instead of refCount
* fix the real issue, count proper numShards
* fix by setting numShards from IO thread
* rephrase
* fix potential problem if shards can be 0
* use initialized flag
* update comment
* remove unused numShards

(cherry picked from commit 3bb5f03)
🛡️ Jit Security Scan Results: ✅ No security findings were detected in this PR
Security scan by Jit
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 845231d.
```c
// Signal so the coordinator can re-check the wait condition.
pthread_cond_signal(ctx->completionCond);
pthread_mutex_unlock(ctx->mutex);
}
```
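For context, this signal is the tail end of the per-shard completion path: each shard reply is recorded under the shared mutex and the waiting coordinator is woken to re-check its condition. A minimal sketch of what that path might look like, with assumed names (`CallbackCtxSketch`, `onShardReplySketch`); this is not the actual RediSearch code:

```c
#include <pthread.h>
#include <stddef.h>

/* Shared state between the coordinator and the per-shard reply callbacks.
 * Field names are illustrative only. */
typedef struct {
    pthread_mutex_t *mutex;
    pthread_cond_t  *completionCond;
    size_t           responseCount;  /* shard replies received so far */
    size_t           numShards;      /* replies the coordinator waits for */
} CallbackCtxSketch;

/* Invoked on an IO thread for each shard reply. */
static void onShardReplySketch(CallbackCtxSketch *ctx) {
    pthread_mutex_lock(ctx->mutex);
    ctx->responseCount++;
    /* Signal so the coordinator can re-check the wait condition. */
    pthread_cond_signal(ctx->completionCond);
    pthread_mutex_unlock(ctx->mutex);
}
```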
Missing GitHub release note for this PR
Low Severity
Per project rules, a GitHub release note is needed. Suggested release note: "Fixed a race condition in hybrid queries that could cause a crash (use-after-free) when the cluster topology changes during a shard migration while a hybrid query is being processed."
Triggered by project rule: Please assist in writing a GitHub release note for this PR, which is concise and focused on the user impact.
Reviewed by Cursor Bugbot for commit 845231d.
Codecov Report

✅ All modified and coverable lines are covered by tests.

```diff
@@            Coverage Diff             @@
##              8.6    #9066      +/-   ##
==========================================
+ Coverage   83.85%   83.89%   +0.03%
==========================================
  Files         367      367
  Lines       55818    55825       +7
  Branches    14322    14322
==========================================
+ Hits        46805    46833      +28
+ Misses       8851     8830      -21
  Partials      162      162
```

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.





Description
Backport of #9029 to 8.6.
The race condition
`ProcessHybridCursorMappings` dispatches async callbacks to shards and waits for all responses. The coordinator reads `numShards` from the topology, allocates a shared `processCursorMappingCallbackContext` (`ctx`), and waits until `responseCount == numShards` before freeing `ctx`.

However, `iterStartCb` runs asynchronously on an IO thread and reads the live topology, which may have a different shard count after a migration. If the live topology has `M > N` shards, `iterStartCb` dispatches commands to all `M` shards, but the coordinator only waits for `N` responses. After receiving `N` responses, the coordinator frees `ctx` via `cleanupCtx`, while the remaining `M - N` callbacks still hold a pointer to the freed memory.

When those late callbacks fire, `MRIteratorCallback_GetPrivateData(ctx)` returns a dangling pointer, and the callback dereferences freed memory, reading `cb_ctx->mutex`, `cb_ctx->searchMappings`, etc.
The ASAN trace confirms this: `processCursorMappingCallback` (thread T4, `search-uv-1`) reads from the freed `ctx` at the very first dereference, while thread T23 (`coord-5081`) had already freed it via `cleanupCtx` → `rm_free`.
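To make the failure mode concrete, here is a minimal sketch of the faulty coordination pattern described above. The names (`CallbackCtxSketch`, `coordinatorWaitAndFree`, freeing via `free`) are simplifications, not the actual RediSearch code:

```c
#include <pthread.h>
#include <stdlib.h>

typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  completionCond;
    size_t          responseCount;  /* incremented by each shard reply callback */
} CallbackCtxSketch;

/* Coordinator side: numShards (N) was read from a topology snapshot, but
 * iterStartCb re-reads the live topology on the IO thread and may dispatch
 * commands to M > N shards. */
static void coordinatorWaitAndFree(CallbackCtxSketch *ctx, size_t numShards) {
    pthread_mutex_lock(&ctx->mutex);
    while (ctx->responseCount < numShards) {   /* waits for only N replies */
        pthread_cond_wait(&ctx->completionCond, &ctx->mutex);
    }
    pthread_mutex_unlock(&ctx->mutex);

    free(ctx);  /* BUG: M - N late callbacks still hold this pointer (use-after-free) */
}
```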
The fix

The existing `privateDataInit` hook in `MR_IterateWithPrivateData` is called from `iterStartCb` on the IO thread after the actual shard count is determined from the live topology, but before any commands are dispatched. We leverage this hook with a new `processCursorMappingInit` callback that:

* locks `ctx->mutex`
* sets `ctx->numShards` to the actual number of shards from the live topology

The coordinator's wait loop now reads `ctx->numShards` (under the mutex) instead of the local `numShards` variable, so it dynamically picks up the updated count and waits for all dispatched callbacks before freeing `ctx`.
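A minimal sketch of the fixed pattern, again with simplified, assumed names (the `initialized` flag mirrors the "use initialized flag" commit in the backported change); this is an illustration of the approach, not the literal RediSearch code:

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  completionCond;
    size_t          responseCount;
    size_t          numShards;    /* authoritative count, set on the IO thread */
    bool            initialized;  /* true once the live shard count is known */
} CallbackCtxSketch;

/* privateDataInit-style hook: runs on the IO thread after the live shard
 * count is determined, before any commands are dispatched. */
static void processCursorMappingInitSketch(CallbackCtxSketch *ctx, size_t liveShards) {
    pthread_mutex_lock(&ctx->mutex);
    ctx->numShards = liveShards;
    ctx->initialized = true;
    pthread_cond_signal(&ctx->completionCond);
    pthread_mutex_unlock(&ctx->mutex);
}

/* Coordinator wait loop: reads ctx->numShards under the mutex instead of a
 * stale local copy, so it waits for every callback that was actually dispatched. */
static void coordinatorWaitAndFreeSketch(CallbackCtxSketch *ctx) {
    pthread_mutex_lock(&ctx->mutex);
    while (!ctx->initialized || ctx->responseCount < ctx->numShards) {
        pthread_cond_wait(&ctx->completionCond, &ctx->mutex);
    }
    pthread_mutex_unlock(&ctx->mutex);
    free(ctx);  /* safe: all dispatched callbacks have replied */
}
```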
Also re-enables `test_add_shard_and_migrate_hybrid_BG`, which was previously skipped due to a flaky hybrid cursor migration crash (SIGSEGV) caused by this race.

Note
Medium Risk
Touches async coordination between the coordinator and shard I/O threads; mistakes could cause deadlocks or incomplete cursor-mapping collection under topology changes. The change is localized to hybrid cursor-mapping initialization and wait logic.
Overview
Prevents rare coordinator crashes in distributed `FT.HYBRID` queries during shard migration/topology changes by making cursor-mapping collection wait on the actual shard count determined on the I/O thread. `ProcessHybridCursorMappings` now initializes `numShards` via a new private-data init callback and waits until all dispatched shard replies arrive before freeing shared callback state, eliminating a use-after-free race.

Reviewed by Cursor Bugbot for commit 845231d. Bugbot is set up for automated code reviews on this repo.