Skip to content

[8.4] [MOD-14475] Fail queries with error when topology is incomplete (#8884)#9165

Merged
lerman25 merged 3 commits into8.4from
backport-8884-to-8.4
Apr 20, 2026
Merged

[8.4] [MOD-14475] Fail queries with error when topology is incomplete (#8884)#9165
lerman25 merged 3 commits into8.4from
backport-8884-to-8.4

Conversation

@lerman25
Copy link
Copy Markdown
Collaborator

@lerman25 lerman25 commented Apr 17, 2026

Description

Backport of #8884 to 8.4.

Differences from the original PR

8.4 predates the coordinator timeout / reducing-state / dispatch-time machinery that the original PR on master interleaves with the topology-validation change, so the backport keeps only the topology-validation pieces:

src/coord/rmr/cluster.c

  • Kept the new pre-fanout connection-validation loop in MRCluster_FanoutCommand.
  • Dropped the dispatchTimeArgIndex / MRCommand_SetDispatchTime hunk — neither exists in 8.4.

src/coord/rmr/rmr.c

  • Kept only MRCtx_SetValidateConnections / MRCtx_GetValidateConnections from the PR's added helpers. Dropped MRCtx_GetCommandProtocol, MRCtx_SetBlockedClient, MRCtx_SetTimedOut, MRCtx_IsTimedOut, MRCtx_TryClaimReducing, MRCtx_SignalReducerComplete, MRCtx_WaitForReducerComplete — 8.4's MRCtx has no timeout/reducing fields.
  • Kept 8.4's simple if (!r) branch in fanoutCallback (no MRCtx_IsTimedOut gate).
  • Initialized ret->validateConnections = false; in MR_CreateCtx — 8.4 uses rm_malloc, so the field needs an explicit zero-init.

src/module.c

  • Inside DistSearchUnblockClient, kept the new informative comment above the existing RedisModule_ReplyWithError("Could not send query to cluster").
  • In the coordinator's FT.SEARCH dispatch path, kept 8.4's SearchCmdCtx* sCmdCtx = rm_malloc(sizeof(*sCmdCtx)); structure and dropped the master-only block (initQueryTimeout, rscParseRequest on the main thread, MR_CreateCtx + MRCtx_SetFreePrivDataCB(DistSearchMRCtxFreePrivData) + DistSearchBlockClientWithTimeout + MRCtx_SetBlockedClient + RedisModule_BlockClientSetPrivateData). None of those symbols exist in 8.4.
  • In 8.4 the FT.SEARCH MRCtx is created in FlatSearchCommandHandler and DEBUG_FlatSearchCommandHandler (same as 8.6), so added MRCtx_SetValidateConnections(mrctx, true) right after MR_CreateCtx(0, bc, req, NumShards) in both handlers.

All other files (src/coord/rmr/cluster.h, src/coord/rmr/reply.c/.h, src/coord/rmr/rmr.h, src/coord/hybrid/hybrid_cursor_mappings.c, tests/pytests/test_coordinator.py) merged cleanly without adjustments.


Original PR description

Summary

When shards become unreachable, FT.SEARCH and FT.AGGREGATE silently ignore the connection issue, while FT.HYBRID hangs indefinitely. This PR adds pre-fanout connection validation to fail explicitly with a clear error message instead.

Changes

Core Fix

  • Topology validation: Before sending commands to shards, validate that all connections are established. If any connection is invalid, return a synthetic error ("Could not send query to cluster") via the callback instead of attempting to send commands that will fail or hang.

Scope

  • Validates connections only on initial query fanout (iterStartCb, MRCluster_FanoutCommand)
  • Does not validate on subsequent cursor reads

Mark if applicable

  • This PR introduces API changes
  • This PR introduces serialization changes

Release Notes

  • This PR requires release notes
  • This PR does not require release notes

Note

Medium Risk
Changes fanout/iterator dispatch to pre-validate shard connections and synthesize errors, which affects distributed query behavior and error paths under cluster instability.

Overview
Release note: Distributed query commands now fail fast with Could not send query to cluster when the coordinator detects missing/unreachable shard connections, instead of hanging (notably FT.HYBRID) or proceeding inconsistently.

This adds optional pre-fanout connection validation to coordinator dispatch (MRCluster_FanoutCommand and iterator startup), introduces MRReply_CreateError to inject a synthetic error reply, updates the hybrid cursor-mapping wait condition to unblock on early failures, and enables validation for FT.SEARCH dispatch. New coordinator tests cover all/one-shard-unreachable scenarios to ensure FT.SEARCH, FT.AGGREGATE (including WITHCURSOR/WITHCOUNT), and FT.HYBRID return an error promptly.

Reviewed by Cursor Bugbot for commit 533112c. Bugbot is set up for automated code reviews on this repo. Configure here.

* Handle uvFanoutRequest

* Change phrasing

* HYBRID + AGGREGATE

* tests

* tests

* Prevent iterator from being freed during command dispatch loop

Add reference counting around the command sending loop to ensure the iterator remains valid throughout the entire operation, even if callbacks trigger early cleanup.

* Cursor commetns

* remove hallucination

* Change function name and fix cursor leak

* Revert Cursor Delete mechanism

* Change FT.SEARCH to use a PRE-check

* PRE-check for AGGREGATE/HYBRID

* fix validate test

* fix missing condition

* fix double done

* revert hybrid

* remove redudant MRConn_IsConnected

* Fix comment and remove ref increase

* Remove is connected

* Increase ref count

* cache it len instead of adding a refernce

* fix strlen

* Fixes

* Fix

* Fix len & hybrid deadlock
@jit-ci
Copy link
Copy Markdown

jit-ci Bot commented Apr 17, 2026

🛡️ Jit Security Scan Results

CRITICAL HIGH MEDIUM

✅ No security findings were detected in this PR


Security scan by Jit

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 17, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 85.55%. Comparing base (5dd82a0) to head (533112c).
⚠️ Report is 2 commits behind head on 8.4.

Additional details and impacted files
@@            Coverage Diff             @@
##              8.4    #9165      +/-   ##
==========================================
- Coverage   85.60%   85.55%   -0.05%     
==========================================
  Files         337      337              
  Lines       53531    53559      +28     
  Branches    11023    11023              
==========================================
- Hits        45826    45824       -2     
- Misses       7562     7592      +30     
  Partials      143      143              
Flag Coverage Δ
flow 84.51% <100.00%> (-0.16%) ⬇️
unit 51.20% <0.00%> (-0.04%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@lerman25 lerman25 requested a review from GuyAv46 April 17, 2026 10:42
…topology

When iterStartCb detects a missing shard connection it returns early with a
synthetic CLUSTER_QUERY_ERROR. Previously it did so before running
ctx.privateDataInit, which left ShardResponseBarrier.numShards at 0. The
Notify bounds check (shardIndex >= numShards) then short-circuited the
error notification, and shardResponseBarrier_HandleTimeout replaced the real
error with a misleading 'ShardResponseBarrier: Timeout...' message for
FT.AGGREGATE ... WITHCOUNT.

Run privateDataInit before dispatching the synthetic error so the barrier
is initialized and the real topology error propagates to the client.

Add a regression test case covering FT.AGGREGATE ... WITHCOUNT in both the
all-shards-unreachable and one-shard-unreachable scenarios.
The FT.AGGREGATE WITHCOUNT assertion added in the previous commit is
gated behind ENABLE_UNSTABLE_FEATURES on this branch. Enable unstable
features in the topology-failure tests so the WITHCOUNT case is exercised
instead of being rejected at argument parsing.
@sonarqubecloud
Copy link
Copy Markdown

Quality Gate Failed Quality Gate failed

Failed conditions
13.6% Duplication on New Code (required ≤ 3%)

See analysis details on SonarQube Cloud

@lerman25 lerman25 added this pull request to the merge queue Apr 20, 2026
Merged via the queue into 8.4 with commit 295f76d Apr 20, 2026
34 of 35 checks passed
@lerman25 lerman25 deleted the backport-8884-to-8.4 branch April 20, 2026 08:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants