[8.4] [MOD-14475] Fail queries with error when topology is incomplete (#8884) #9165
Merged
Conversation
Squashed commits:

- Handle uvFanoutRequest
- Change phrasing
- HYBRID + AGGREGATE
- tests
- tests
- Prevent iterator from being freed during command dispatch loop: add reference counting around the command-sending loop to ensure the iterator remains valid throughout the entire operation, even if callbacks trigger early cleanup
- Cursor comments
- remove hallucination
- Change function name and fix cursor leak
- Revert cursor delete mechanism
- Change FT.SEARCH to use a PRE-check
- PRE-check for AGGREGATE/HYBRID
- fix validate test
- fix missing condition
- fix double done
- revert hybrid
- remove redundant MRConn_IsConnected
- Fix comment and remove ref increase
- Remove is connected
- Increase ref count
- cache it len instead of adding a reference
- fix strlen
- Fixes
- Fix
- Fix len & hybrid deadlock
🛡️ Jit Security Scan Results: ✅ No security findings were detected in this PR (scan by Jit).
Codecov Report: ✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff            @@
##              8.4    #9165     +/-  ##
=========================================
- Coverage   85.60%   85.55%   -0.05%
=========================================
  Files         337      337
  Lines       53531    53559      +28
  Branches    11023    11023
=========================================
- Hits        45826    45824       -2
- Misses       7562     7592      +30
  Partials      143      143
```
…topology

When iterStartCb detects a missing shard connection it returns early with a synthetic CLUSTER_QUERY_ERROR. Previously it did so before running ctx.privateDataInit, which left ShardResponseBarrier.numShards at 0. The Notify bounds check (shardIndex >= numShards) then short-circuited the error notification, and shardResponseBarrier_HandleTimeout replaced the real error with a misleading 'ShardResponseBarrier: Timeout...' message for FT.AGGREGATE ... WITHCOUNT.

Run privateDataInit before dispatching the synthetic error so the barrier is initialized and the real topology error propagates to the client. Add a regression test case covering FT.AGGREGATE ... WITHCOUNT in both the all-shards-unreachable and one-shard-unreachable scenarios.
The FT.AGGREGATE WITHCOUNT assertion added in the previous commit is gated behind ENABLE_UNSTABLE_FEATURES on this branch. Enable unstable features in the topology-failure tests so the WITHCOUNT case is exercised instead of being rejected at argument parsing.
GuyAv46
approved these changes
Apr 19, 2026


Description
Backport of #8884 to 8.4.

Differences from the original PR

8.4 predates the coordinator timeout / reducing-state / dispatch-time machinery that the original PR on master interleaves with the topology-validation change, so the backport keeps only the topology-validation pieces:

- `src/coord/rmr/cluster.c` — `MRCluster_FanoutCommand`: dropped the `dispatchTimeArgIndex`/`MRCommand_SetDispatchTime` hunk; neither exists in 8.4.
- `src/coord/rmr/rmr.c` — kept `MRCtx_SetValidateConnections`/`MRCtx_GetValidateConnections` from the PR's added helpers. Dropped `MRCtx_GetCommandProtocol`, `MRCtx_SetBlockedClient`, `MRCtx_SetTimedOut`, `MRCtx_IsTimedOut`, `MRCtx_TryClaimReducing`, `MRCtx_SignalReducerComplete`, `MRCtx_WaitForReducerComplete` — 8.4's `MRCtx` has no timeout/reducing fields. Kept the `if (!r)` branch in `fanoutCallback` (no `MRCtx_IsTimedOut` gate). Added `ret->validateConnections = false;` in `MR_CreateCtx` — 8.4 uses `rm_malloc`, so the field needs an explicit zero-init.
- `src/module.c` — dropped `DistSearchUnblockClient`; kept the new informative comment above the existing `RedisModule_ReplyWithError("Could not send query to cluster")`. Kept the `SearchCmdCtx* sCmdCtx = rm_malloc(sizeof(*sCmdCtx));` structure and dropped the master-only block (`initQueryTimeout`, `rscParseRequest` on the main thread, `MR_CreateCtx` + `MRCtx_SetFreePrivDataCB(DistSearchMRCtxFreePrivData)` + `DistSearchBlockClientWithTimeout` + `MRCtx_SetBlockedClient` + `RedisModule_BlockClientSetPrivateData`) — none of those symbols exist in 8.4. `MRCtx` is created in `FlatSearchCommandHandler` and `DEBUG_FlatSearchCommandHandler` (same as 8.6), so added `MRCtx_SetValidateConnections(mrctx, true)` right after `MR_CreateCtx(0, bc, req, NumShards)` in both handlers.

All other files (`src/coord/rmr/cluster.h`, `src/coord/rmr/reply.c`/`.h`, `src/coord/rmr/rmr.h`, `src/coord/hybrid/hybrid_cursor_mappings.c`, `tests/pytests/test_coordinator.py`) merged cleanly without adjustments.

Original PR description
Summary

When shards become unreachable, `FT.SEARCH` and `FT.AGGREGATE` silently ignore the connection issue, while `FT.HYBRID` hangs indefinitely. This PR adds pre-fanout connection validation to fail explicitly with a clear error message instead.

Changes

Core Fix
- Fail with an error (`"Could not send query to cluster"`) via the callback instead of attempting to send commands that will fail or hang.

Scope
- `iterStartCb`, `MRCluster_FanoutCommand`

Mark if applicable
Release Notes
Note: Medium Risk — changes fanout/iterator dispatch to pre-validate shard connections and synthesize errors, which affects distributed query behavior and error paths under cluster instability.

Overview

Release note: Distributed query commands now fail fast with `Could not send query to cluster` when the coordinator detects missing/unreachable shard connections, instead of hanging (notably `FT.HYBRID`) or proceeding inconsistently.

This adds optional pre-fanout connection validation to coordinator dispatch (`MRCluster_FanoutCommand` and iterator startup), introduces `MRReply_CreateError` to inject a synthetic error reply, updates the hybrid cursor-mapping wait condition to unblock on early failures, and enables validation for `FT.SEARCH` dispatch. New coordinator tests cover all/one-shard-unreachable scenarios to ensure `FT.SEARCH`, `FT.AGGREGATE` (including `WITHCURSOR`/`WITHCOUNT`), and `FT.HYBRID` return an error promptly.

Reviewed by Cursor Bugbot for commit 533112c.