Skip to content

[MOD-12419] Track Timeout errors and warnings in info#7414

Merged
lerman25 merged 22 commits intomasterfrom
omerl-track-timeout-metric
Nov 24, 2025
Merged

[MOD-12419] Track Timeout errors and warnings in info#7414
lerman25 merged 22 commits intomasterfrom
omerl-track-timeout-metric

Conversation

@lerman25
Copy link
Collaborator

@lerman25 lerman25 commented Nov 19, 2025

Continuing the work of #7386
This PR add tracking timeout warnings and errors in the global stats.

Note -

  • For FT.HYBRID there is debug timeout mechanism in cluster
  • FT.HYBRID SA has a bug in timeout warnings

Note

Track timeout errors and warnings across query paths, aggregate them in global stats, and expose them via INFO (with tests).

  • Global Stats / API:
    • Add timeout counters to QueryErrorsGlobalStats and new QueryWarningGlobalStats with QueryWarningsGlobalStats_UpdateWarning.
    • Treat standalone as coordinator for stats; include timeout in QueryErrorsGlobalStats_UpdateError.
    • Export stats in INFO MODULES with new fields for shard/coord timeout errors/warnings.
  • Query Execution:
    • Aggregate: update error/timeout handling to increment global error/warning stats; mark timeouts in profile context.
    • Hybrid: increment stats on errors/timeouts; track timeout warnings in handleAndReplyWarning; adjust cleanup error accounting.
    • Coordinator (module reducer): on shard errors/timeouts, increment coord error counters; parse warning strings and increment warning counters.
  • Rust/FFI:
    • Introduce QueryWarningCode and QueryWarningCode_GetCodeFromMessage in FFI and headers.
  • Tests/Utils:
    • Extend tests to validate timeout error/warning metrics in standalone and cluster (RESP2/RESP3).
    • Add helpers to change timeout policy on shards and minor test data adjustments.

Written by Cursor Bugbot for commit be69839. This will update automatically on new commits. Configure here.

enricozb
enricozb previously approved these changes Nov 19, 2025
Comment on lines 76 to 77
[export]
include = ["QueryWarningCode"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you do a pub use .. QueryWarningCode in src/redisearch_rs/c_entrypoint/query_error_ffi/src/lib.rs, i don't think this line is needed.

Base automatically changed from omer-query-error-metrics-base to master November 19, 2025 13:39
@lerman25 lerman25 dismissed enricozb’s stale review November 19, 2025 13:39

The base branch was changed.

@lerman25 lerman25 marked this pull request as draft November 19, 2025 14:44
@lerman25 lerman25 marked this pull request as ready for review November 19, 2025 14:44
@lerman25 lerman25 force-pushed the omerl-track-timeout-metric branch from 38967ea to 7f330c3 Compare November 19, 2025 14:46
@codecov
Copy link

codecov bot commented Nov 19, 2025

Codecov Report

❌ Patch coverage is 59.42029% with 28 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.94%. Comparing base (e56bf5b) to head (be69839).
⚠️ Report is 9 commits behind head on master.

Files with missing lines Patch % Lines
...isearch_rs/c_entrypoint/query_error_ffi/src/lib.rs 0.00% 14 Missing ⚠️
src/redisearch_rs/query_error/src/lib.rs 0.00% 11 Missing ⚠️
src/module.c 83.33% 2 Missing ⚠️
src/hybrid/hybrid_exec.c 85.71% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7414      +/-   ##
==========================================
+ Coverage   84.92%   84.94%   +0.02%     
==========================================
  Files         348      348              
  Lines       53773    53836      +63     
  Branches    14353    14378      +25     
==========================================
+ Hits        45668    45733      +65     
+ Misses       7909     7907       -2     
  Partials      196      196              
Flag Coverage Δ
flow 84.98% <93.18%> (+0.04%) ⬆️
unit 52.36% <0.00%> (-0.07%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.


if (ShouldReplyWithError(QueryError_GetCode(rp->parent->err), req->reqConfig.timeoutPolicy, IsProfile(req))) {
// Track errors in global statistics
QueryErrorsGlobalStats_UpdateError(QueryError_GetCode(rp->parent->err), 1, !IsInternal(req));
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Inconsistent error tracking between RESP2 and RESP3

In sendChunk_Resp3, the error tracking uses QueryError_GetCode(rp->parent->err) while the error reply uses QueryError_GetUserError(qctx->err). This differs from sendChunk_Resp2 which consistently uses qctx->err for both tracking and replying. If rp->parent->err and qctx->err contain different error codes, the tracked error code won't match the error actually returned to the user, leading to incorrect statistics.

Fix in Cursor Fix in Web

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I should fix misalignment between resp2 and resp3 in this PR.
Using rp->parent->err is aligned with the line above which checks errors in rp->parent->err

Copy link
Collaborator

@meiravgri meiravgri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice and clean
I agree
with the SA definiton change


if (ShouldReplyWithError(QueryError_GetCode(rp->parent->err), req->reqConfig.timeoutPolicy, IsProfile(req))) {
// Track errors in global statistics
QueryErrorsGlobalStats_UpdateError(QueryError_GetCode(rp->parent->err), 1, !IsInternal(req));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

// Update global query errors statistics before freeing the request.
// Both SA and internal are considered shards.
QueryErrorsGlobalStats_UpdateError(QueryError_GetCode(&status), 1, SHARD_ERR_WARN);
// Update global query errors statistics
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wonder why you remove " before freeing the request."

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the request is no longer relevant if I use numShards

// If an error occurred, or a timeout in strict mode - return a simple error
if (ShouldReplyWithError(QueryError_GetCode(rp->parent->err), req->reqConfig.timeoutPolicy, IsProfile(req))) {
// Track errors in global statistics
QueryErrorsGlobalStats_UpdateError(QueryError_GetCode(qctx->err), 1, !IsInternal(req));
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Inconsistent error source in RESP2 error tracking

The error tracking uses qctx->err while the error check on line 401 uses rp->parent->err. This inconsistency means the tracked error code may differ from the error that triggered the tracking. The RESP3 version at line 532 correctly uses rp->parent->err for both operations, confirming this is a bug in the RESP2 path.

Fix in Cursor Fix in Web

Comment on lines 1074 to 1077
for shardId in range(1, self.env.shardsCount + 1):
info_dict = info_modules_to_dict(self.env.getConnection(shardId))
self.env.assertEqual(info_dict[WARN_ERR_SECTION][TIMEOUT_WARNING_SHARD_METRIC], str(base_warn_shards[shardId] + 3),
message=f"Shard {shardId} AGG coordinator timeout warning should be +3 total")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one can be flaky
the coordinator might return before the shards reach to the point where they update the timeout wraning

meiravgri
meiravgri previously approved these changes Nov 23, 2025
@lerman25 lerman25 added this pull request to the merge queue Nov 23, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Nov 23, 2025
@lerman25 lerman25 enabled auto-merge November 24, 2025 13:23
@lerman25 lerman25 added this pull request to the merge queue Nov 24, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Nov 24, 2025
@lerman25 lerman25 added this pull request to the merge queue Nov 24, 2025
Merged via the queue into master with commit 6eee260 Nov 24, 2025
25 of 26 checks passed
@lerman25 lerman25 deleted the omerl-track-timeout-metric branch November 24, 2025 21:41
lerman25 added a commit that referenced this pull request Nov 25, 2025
* Track timeout in sendchunk resp2

* Track timeout warning in sendSearchResults

* Track timeout error in searchResultReducer

(cherry picked from commit fafe1dc)

* Track timeout in sendChunk_hybrid

(cherry picked from commit 8529cb9)

* test timeout metrics

(cherry picked from commit c56a0d9)

* fix isCoord check

* Add query warning code and add function and fields needed to track

(cherry picked from commit a414641)

* Track timeout in sendchunk resp3

(cherry picked from commit 6853bb9)

* readd skip

* Update syntax and args error to new SA as cluster

* format and enrico comment

* Change test to N=0 with Internal only (not working so revert afterwards)

* Revert "Change test to N=0 with Internal only (not working so revert afterwards)"

This reverts commit 829ac53.

* meirav comments

* Stablize tests

* Add resp3 test

* _disable_ hybrid sa timeout

* Make test robust

* fixup! Make test robust

* remove limits

(cherry picked from commit 6eee260)
lerman25 added a commit that referenced this pull request Nov 28, 2025
* Track timeout in sendchunk resp2

* Track timeout warning in sendSearchResults

* Track timeout error in searchResultReducer

(cherry picked from commit fafe1dc)

* Track timeout in sendChunk_hybrid

(cherry picked from commit 8529cb9)

* test timeout metrics

(cherry picked from commit c56a0d9)

* fix isCoord check

* Add query warning code and add function and fields needed to track

(cherry picked from commit a414641)

* Track timeout in sendchunk resp3

(cherry picked from commit 6853bb9)

* readd skip

* Update syntax and args error to new SA as cluster

* format and enrico comment

* Change test to N=0 with Internal only (not working so revert afterwards)

* Revert "Change test to N=0 with Internal only (not working so revert afterwards)"

This reverts commit 829ac53.

* meirav comments

* Stablize tests

* Add resp3 test

* _disable_ hybrid sa timeout

* Make test robust

* fixup! Make test robust

* remove limits

(cherry picked from commit 6eee260)
github-merge-queue bot pushed a commit that referenced this pull request Nov 30, 2025
* [MOD-12419] Track Timeout errors and warnings in info (#7414)

* Track timeout in sendchunk resp2

* Track timeout warning in sendSearchResults

* Track timeout error in searchResultReducer

(cherry picked from commit fafe1dc)

* Track timeout in sendChunk_hybrid

(cherry picked from commit 8529cb9)

* test timeout metrics

(cherry picked from commit c56a0d9)

* fix isCoord check

* Add query warning code and add function and fields needed to track

(cherry picked from commit a414641)

* Track timeout in sendchunk resp3

(cherry picked from commit 6853bb9)

* readd skip

* Update syntax and args error to new SA as cluster

* format and enrico comment

* Change test to N=0 with Internal only (not working so revert afterwards)

* Revert "Change test to N=0 with Internal only (not working so revert afterwards)"

This reverts commit 829ac53.

* meirav comments

* Stablize tests

* Add resp3 test

* _disable_ hybrid sa timeout

* Make test robust

* fixup! Make test robust

* remove limits

(cherry picked from commit 6eee260)

* Align to 8.4
lerman25 added a commit that referenced this pull request Dec 2, 2025
* [MOD-12419] Track Timeout errors and warnings in info ([#7414](#7414))

* Track timeout in sendchunk resp2

* Track timeout warning in sendSearchResults

* Track timeout error in searchResultReducer

(cherry picked from commit fafe1dc)

* Track timeout in sendChunk_hybrid

(cherry picked from commit 8529cb9)

* test timeout metrics

(cherry picked from commit c56a0d9)

* fix isCoord check

* Add query warning code and add function and fields needed to track

(cherry picked from commit a414641)

* Track timeout in sendchunk resp3

(cherry picked from commit 6853bb9)

* readd skip

* Update syntax and args error to new SA as cluster

* format and enrico comment

* Change test to N=0 with Internal only (not working so revert afterwards)

* Revert "Change test to N=0 with Internal only (not working so revert afterwards)"

This reverts commit 829ac53.

* meirav comments

* Stablize tests

* Add resp3 test

* _disable_ hybrid sa timeout

* Make test robust

* fixup! Make test robust

* remove limits

(cherry picked from commit 6eee260)

* Align to 8.4

(cherry picked from commit ac92116)
github-merge-queue bot pushed a commit that referenced this pull request Dec 3, 2025
* [MOD-12419] Track Timeout errors and warnings in info ([#7507](#7507))

* [MOD-12419] Track Timeout errors and warnings in info ([#7414](#7414))

* Track timeout in sendchunk resp2

* Track timeout warning in sendSearchResults

* Track timeout error in searchResultReducer

(cherry picked from commit fafe1dc)

* Track timeout in sendChunk_hybrid

(cherry picked from commit 8529cb9)

* test timeout metrics

(cherry picked from commit c56a0d9)

* fix isCoord check

* Add query warning code and add function and fields needed to track

(cherry picked from commit a414641)

* Track timeout in sendchunk resp3

(cherry picked from commit 6853bb9)

* readd skip

* Update syntax and args error to new SA as cluster

* format and enrico comment

* Change test to N=0 with Internal only (not working so revert afterwards)

* Revert "Change test to N=0 with Internal only (not working so revert afterwards)"

This reverts commit 829ac53.

* meirav comments

* Stablize tests

* Add resp3 test

* _disable_ hybrid sa timeout

* Make test robust

* fixup! Make test robust

* remove limits

(cherry picked from commit 6eee260)

* Align to 8.4

(cherry picked from commit ac92116)

* Fixes

* Backport more logic
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants