[8.4] [MOD-14475] Fail queries with error when topology is incomplete (#8884) #9165
Merged
Conversation
Squashed commits:

- Handle uvFanoutRequest
- Change phrasing
- HYBRID + AGGREGATE
- tests
- tests
- Prevent iterator from being freed during command dispatch loop: add reference counting around the command-sending loop to ensure the iterator remains valid throughout the entire operation, even if callbacks trigger early cleanup
- Cursor comments
- remove hallucination
- Change function name and fix cursor leak
- Revert cursor delete mechanism
- Change FT.SEARCH to use a PRE-check
- PRE-check for AGGREGATE/HYBRID
- fix validate test
- fix missing condition
- fix double done
- revert hybrid
- remove redundant MRConn_IsConnected
- Fix comment and remove ref increase
- Remove is connected
- Increase ref count
- cache it len instead of adding a reference
- fix strlen
- Fixes
- Fix
- Fix len & hybrid deadlock
🛡️ Jit Security Scan Results: ✅ No security findings were detected in this PR (scan by Jit).
Codecov Report: ✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff            @@
##              8.4    #9165     +/-  ##
=========================================
- Coverage   85.60%   85.55%   -0.05%
=========================================
  Files         337      337
  Lines       53531    53559      +28
  Branches    11023    11023
=========================================
- Hits        45826    45824       -2
- Misses       7562     7592      +30
  Partials      143      143
```
…topology

When iterStartCb detects a missing shard connection it returns early with a synthetic CLUSTER_QUERY_ERROR. Previously it did so before running ctx.privateDataInit, which left ShardResponseBarrier.numShards at 0. The Notify bounds check (shardIndex >= numShards) then short-circuited the error notification, and shardResponseBarrier_HandleTimeout replaced the real error with a misleading 'ShardResponseBarrier: Timeout...' message for FT.AGGREGATE ... WITHCOUNT.

Run privateDataInit before dispatching the synthetic error so the barrier is initialized and the real topology error propagates to the client. Add a regression test case covering FT.AGGREGATE ... WITHCOUNT in both the all-shards-unreachable and one-shard-unreachable scenarios.
The FT.AGGREGATE WITHCOUNT assertion added in the previous commit is gated behind ENABLE_UNSTABLE_FEATURES on this branch. Enable unstable features in the topology-failure tests so the WITHCOUNT case is exercised instead of being rejected at argument parsing.
GuyAv46
approved these changes
Apr 19, 2026


Description
Backport of #8884 to 8.4.

Differences from the original PR

8.4 predates the coordinator timeout / reducing-state / dispatch-time machinery that the original PR on master interleaves with the topology-validation change, so the backport keeps only the topology-validation pieces:

- `src/coord/rmr/cluster.c` — `MRCluster_FanoutCommand`: dropped the `dispatchTimeArgIndex`/`MRCommand_SetDispatchTime` hunk; neither exists in 8.4.
- `src/coord/rmr/rmr.c` — kept `MRCtx_SetValidateConnections`/`MRCtx_GetValidateConnections` from the PR's added helpers. Dropped `MRCtx_GetCommandProtocol`, `MRCtx_SetBlockedClient`, `MRCtx_SetTimedOut`, `MRCtx_IsTimedOut`, `MRCtx_TryClaimReducing`, `MRCtx_SignalReducerComplete`, `MRCtx_WaitForReducerComplete` — 8.4's `MRCtx` has no timeout/reducing fields. Kept the `if (!r)` branch in `fanoutCallback` (no `MRCtx_IsTimedOut` gate). Added `ret->validateConnections = false;` in `MR_CreateCtx` — 8.4 uses `rm_malloc`, so the field needs an explicit zero-init.
- `src/module.c` — dropped `DistSearchUnblockClient`; kept the new informative comment above the existing `RedisModule_ReplyWithError("Could not send query to cluster")`. Kept the `SearchCmdCtx* sCmdCtx = rm_malloc(sizeof(*sCmdCtx));` structure and dropped the master-only block (`initQueryTimeout`, `rscParseRequest` on the main thread, `MR_CreateCtx` + `MRCtx_SetFreePrivDataCB(DistSearchMRCtxFreePrivData)` + `DistSearchBlockClientWithTimeout` + `MRCtx_SetBlockedClient` + `RedisModule_BlockClientSetPrivateData`) — none of those symbols exist in 8.4. `MRCtx` is created in `FlatSearchCommandHandler` and `DEBUG_FlatSearchCommandHandler` (same as 8.6), so added `MRCtx_SetValidateConnections(mrctx, true)` right after `MR_CreateCtx(0, bc, req, NumShards)` in both handlers.

All other files (`src/coord/rmr/cluster.h`, `src/coord/rmr/reply.c`/`.h`, `src/coord/rmr/rmr.h`, `src/coord/hybrid/hybrid_cursor_mappings.c`, `tests/pytests/test_coordinator.py`) merged cleanly without adjustments.

Original PR description
Summary

When shards become unreachable, `FT.SEARCH` and `FT.AGGREGATE` silently ignore the connection issue, while `FT.HYBRID` hangs indefinitely. This PR adds pre-fanout connection validation to fail explicitly with a clear error message instead.

Changes

Core Fix
- Fail with an error (`"Could not send query to cluster"`) via the callback instead of attempting to send commands that will fail or hang.

Scope
- `iterStartCb`, `MRCluster_FanoutCommand`

Mark if applicable
Release Notes
Note: Medium Risk — changes fanout/iterator dispatch to pre-validate shard connections and synthesize errors, which affects distributed query behavior and error paths under cluster instability.

Overview

Release note: Distributed query commands now fail fast with `Could not send query to cluster` when the coordinator detects missing/unreachable shard connections, instead of hanging (notably `FT.HYBRID`) or proceeding inconsistently.

This adds optional pre-fanout connection validation to coordinator dispatch (`MRCluster_FanoutCommand` and iterator startup), introduces `MRReply_CreateError` to inject a synthetic error reply, updates the hybrid cursor-mapping wait condition to unblock on early failures, and enables validation for `FT.SEARCH` dispatch. New coordinator tests cover all/one-shard-unreachable scenarios to ensure `FT.SEARCH`, `FT.AGGREGATE` (including `WITHCURSOR`/`WITHCOUNT`), and `FT.HYBRID` return an error promptly.

Reviewed by Cursor Bugbot for commit 533112c.