Skip to content

[8.4] [MOD-12419] Track Timeout errors and warnings in info#7507

Merged
lerman25 merged 2 commits into8.4from
backport-7414-to-8.4
Nov 30, 2025
Merged

[8.4] [MOD-12419] Track Timeout errors and warnings in info#7507
lerman25 merged 2 commits into8.4from
backport-7414-to-8.4

Conversation

@lerman25
Copy link
Collaborator

@lerman25 lerman25 commented Nov 25, 2025

backport of #7414 to 8.4


Note

Adds global tracking of query timeout errors and warnings (coord/shard), wires counting into SEARCH/AGGREGATE/HYBRID paths, and surfaces new metrics in INFO; includes warning code utilities and tests.

  • Stats/Accounting:
    • Track timeout errors via QueryErrorsGlobalStats_UpdateError(QUERY_ETIMEDOUT, ...) and timeout warnings via new QueryWarningsGlobalStats_UpdateWarning(...).
    • Treat standalone as coordinator for stats where applicable.
    • Add warning code enum/mapping in query_error.{h,c}.
  • Execution paths:
    • Instrument aggregate_exec.c, hybrid_exec.c, and coordinator in module.c to update global stats on errors/timeouts and to count warnings emitted in replies (RESP2/RESP3).
    • Record timeout warnings on non-fatal timeouts in replies.
  • INFO output:
    • Extend TotalGlobalStats_GetQueryStats and global_stats.{h,c} with shard/coord timeout counters for errors and warnings.
    • Expose new fields in info_redis.c: shard_total_query_errors_timeout, coord_total_query_errors_timeout, shard_total_query_warnings_timeout, coord_total_query_warnings_timeout.
  • Tests/Utilities:
    • Add/expand tests in tests/pytests/test_info_modules.py to validate new metrics in standalone/cluster (RESP2/RESP3).
    • Add helpers in tests/pytests/common.py to change timeout policy across shards.

Written by Cursor Bugbot for commit 3026209. This will update automatically on new commits. Configure here.

@lerman25 lerman25 changed the base branch from master to 8.4 November 25, 2025 14:30
Copy link

@jit-ci jit-ci bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❌ Jit has detected 8 important findings in this PR that you should review.


Jit encountered an internal error and cannot comment on each finding.

You can ask a Jit admin to comment #jit_ignore_all on this PR to ignore the findings.

Here are the findings in this PR:

  • Security Control: Static Code Analysis Semgrep Pro
    • Type: Yaml.Github-Actions.Security.Run-Shell-Injection.Run-Shell-Injection
    • Description: Using variable interpolation ${{...}} with github context data in a run: step could allow an attacker to inject their own code into the runner. This would allow them to steal secrets and code. github context data can have arbitrary user input and should be treated as untrusted. Instead, use an intermediate environment variable with env: to store the data and use the environment variable in the run: script. Be sure to use double-quotes the environment variable, like this: "$ENVVAR".
    • Severity: HIGH
    • Learn More: Link
    • Filename: .github/workflows/task-test.yml
    • Lines: 396-401
  • Security Control: Static Code Analysis Semgrep Pro
    • Type: Yaml.Github-Actions.Security.Run-Shell-Injection.Run-Shell-Injection
    • Description: Using variable interpolation ${{...}} with github context data in a run: step could allow an attacker to inject their own code into the runner. This would allow them to steal secrets and code. github context data can have arbitrary user input and should be treated as untrusted. Instead, use an intermediate environment variable with env: to store the data and use the environment variable in the run: script. Be sure to use double-quotes the environment variable, like this: "$ENVVAR".
    • Severity: HIGH
    • Learn More: Link
    • Filename: .github/workflows/generate-basic-matrix.yml
    • Lines: 39-74
  • Security Control: Static Code Analysis Semgrep Pro
    • Type: Yaml.Github-Actions.Security.Run-Shell-Injection.Run-Shell-Injection
    • Description: Using variable interpolation ${{...}} with github context data in a run: step could allow an attacker to inject their own code into the runner. This would allow them to steal secrets and code. github context data can have arbitrary user input and should be treated as untrusted. Instead, use an intermediate environment variable with env: to store the data and use the environment variable in the run: script. Be sure to use double-quotes the environment variable, like this: "$ENVVAR".
    • Severity: HIGH
    • Learn More: Link
    • Filename: .github/workflows/flow-build-artifacts.yml
    • Lines: 69-90
  • Security Control: Static Code Analysis Semgrep Pro
    • Type: Yaml.Github-Actions.Security.Run-Shell-Injection.Run-Shell-Injection
    • Description: Using variable interpolation ${{...}} with github context data in a run: step could allow an attacker to inject their own code into the runner. This would allow them to steal secrets and code. github context data can have arbitrary user input and should be treated as untrusted. Instead, use an intermediate environment variable with env: to store the data and use the environment variable in the run: script. Be sure to use double-quotes the environment variable, like this: "$ENVVAR".
    • Severity: HIGH
    • Learn More: Link
    • Filename: .github/workflows/task-get-config.yml
    • Lines: 47-271
  • Security Control: Static Code Analysis Semgrep Pro
    • Type: Yaml.Github-Actions.Security.Run-Shell-Injection.Run-Shell-Injection
    • Description: Using variable interpolation ${{...}} with github context data in a run: step could allow an attacker to inject their own code into the runner. This would allow them to steal secrets and code. github context data can have arbitrary user input and should be treated as untrusted. Instead, use an intermediate environment variable with env: to store the data and use the environment variable in the run: script. Be sure to use double-quotes the environment variable, like this: "$ENVVAR".
    • Severity: HIGH
    • Learn More: Link
    • Filename: .github/workflows/task-build-artifacts.yml
    • Lines: 312-326
  • Security Control: Static Code Analysis Semgrep Pro
    • Type: Yaml.Github-Actions.Security.Run-Shell-Injection.Run-Shell-Injection
    • Description: Using variable interpolation ${{...}} with github context data in a run: step could allow an attacker to inject their own code into the runner. This would allow them to steal secrets and code. github context data can have arbitrary user input and should be treated as untrusted. Instead, use an intermediate environment variable with env: to store the data and use the environment variable in the run: script. Be sure to use double-quotes the environment variable, like this: "$ENVVAR".
    • Severity: HIGH
    • Learn More: Link
    • Filename: .github/workflows/task-test.yml
    • Lines: 387-390
  • Security Control: Static Code Analysis Semgrep Pro
    • Type: Yaml.Github-Actions.Security.Run-Shell-Injection.Run-Shell-Injection
    • Description: Using variable interpolation ${{...}} with github context data in a run: step could allow an attacker to inject their own code into the runner. This would allow them to steal secrets and code. github context data can have arbitrary user input and should be treated as untrusted. Instead, use an intermediate environment variable with env: to store the data and use the environment variable in the run: script. Be sure to use double-quotes the environment variable, like this: "$ENVVAR".
    • Severity: HIGH
    • Learn More: Link
    • Filename: .github/workflows/generate-matrix.yml
    • Lines: 43-112
  • Security Control: Static Code Analysis Semgrep Pro
    • Type: Yaml.Github-Actions.Security.Run-Shell-Injection.Run-Shell-Injection
    • Description: Using variable interpolation ${{...}} with github context data in a run: step could allow an attacker to inject their own code into the runner. This would allow them to steal secrets and code. github context data can have arbitrary user input and should be treated as untrusted. Instead, use an intermediate environment variable with env: to store the data and use the environment variable in the run: script. Be sure to use double-quotes the environment variable, like this: "$ENVVAR".
    • Severity: HIGH
    • Learn More: Link
    • Filename: .github/workflows/task-test.yml
    • Lines: 80-101

@lerman25 lerman25 marked this pull request as draft November 25, 2025 14:36
@lerman25 lerman25 marked this pull request as ready for review November 25, 2025 14:36
@lerman25 lerman25 requested a review from meiravgri November 25, 2025 14:36

if (ShouldReplyWithError(QueryError_GetCode(rp->parent->err), req->reqConfig.timeoutPolicy, IsProfile(req))) {
// Track errors in global statistics
QueryErrorsGlobalStats_UpdateError(QueryError_GetCode(rp->parent->err), 1, !IsInternal(req));
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Inconsistent error source in statistics tracking

The error tracking on line 532 uses rp->parent->err to get the error code, but line 533 immediately uses qctx->err to reply with the error message. This inconsistency means the wrong error code may be tracked in global statistics. Line 532 should use QueryError_GetCode(qctx->err) instead of QueryError_GetCode(rp->parent->err) to match the pattern used in line 403 (RESP2 version) and to be consistent with the error being replied on line 533.

Fix in Cursor Fix in Web

MRReply *currentWarning = MRReply_ArrayElement(rCtx->warning, i);
const char *warning_str = MRReply_String(currentWarning, NULL);
QueryWarningCode warningCode = QueryWarningCode_GetCodeFromMessage(warning_str);
QueryWarningsGlobalStats_UpdateWarning(warningCode, 1, COORD_ERR_WARN);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Shard warnings tracked as coordinator warnings

Warnings received from shards are incorrectly tracked as coordinator warnings. The code iterates over rCtx->warning which contains warnings from shard replies, but tracks them using COORD_ERR_WARN (true). Since these warnings originated from shards, they should be tracked as shard warnings using SHARD_ERR_WARN (false) instead. This causes incorrect statistics where shard-level warnings are counted as coordinator-level warnings.

Fix in Cursor Fix in Web

@codecov
Copy link

codecov bot commented Nov 25, 2025

Codecov Report

❌ Patch coverage is 88.52459% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.90%. Comparing base (81137f0) to head (3026209).
⚠️ Report is 10 commits behind head on 8.4.

Files with missing lines Patch % Lines
src/query_error.c 76.47% 4 Missing ⚠️
src/module.c 83.33% 2 Missing ⚠️
src/hybrid/hybrid_exec.c 85.71% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##              8.4    #7507      +/-   ##
==========================================
- Coverage   85.92%   85.90%   -0.03%     
==========================================
  Files         330      330              
  Lines       52558    52613      +55     
  Branches    12001    12001              
==========================================
+ Hits        45161    45196      +35     
- Misses       7230     7250      +20     
  Partials      167      167              
Flag Coverage Δ
flow 84.47% <88.52%> (-0.14%) ⬇️
unit 52.43% <0.00%> (-0.06%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

meiravgri
meiravgri previously approved these changes Nov 26, 2025
@lerman25 lerman25 added this pull request to the merge queue Nov 26, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Nov 26, 2025
// If an error occurred, or a timeout in strict mode - return a simple error
if (ShouldReplyWithError(QueryError_GetCode(rp->parent->err), req->reqConfig.timeoutPolicy, IsProfile(req))) {
// Track errors in global statistics
QueryErrorsGlobalStats_UpdateError(QueryError_GetCode(qctx->err), 1, !IsInternal(req));
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Inconsistent error object used for tracking

The condition on line 401 checks QueryError_GetCode(rp->parent->err) to determine if an error should be replied, but line 403 tracks QueryError_GetCode(qctx->err) instead. This inconsistency means the tracked error code may differ from the one that triggered the error reply, potentially causing incorrect statistics. Line 532 correctly uses rp->parent->err for both checking and tracking, showing the intended pattern.

Fix in Cursor Fix in Web

meiravgri
meiravgri previously approved these changes Nov 27, 2025
lerman25 and others added 2 commits November 28, 2025 18:10
* Track timeout in sendchunk resp2

* Track timeout warning in sendSearchResults

* Track timeout error in searchResultReducer

(cherry picked from commit fafe1dc)

* Track timeout in sendChunk_hybrid

(cherry picked from commit 8529cb9)

* test timeout metrics

(cherry picked from commit c56a0d9)

* fix isCoord check

* Add query warning code and add function and fields needed to track

(cherry picked from commit a414641)

* Track timeout in sendchunk resp3

(cherry picked from commit 6853bb9)

* readd skip

* Update syntax and args error to new SA as cluster

* format and enrico comment

* Change test to N=0 with Internal only (not working so revert afterwards)

* Revert "Change test to N=0 with Internal only (not working so revert afterwards)"

This reverts commit 829ac53.

* meirav comments

* Stablize tests

* Add resp3 test

* _disable_ hybrid sa timeout

* Make test robust

* fixup! Make test robust

* remove limits

(cherry picked from commit 6eee260)
// Reply warning
MR_ReplyWithMRReply(reply, currentWarning);
}
RedisModule_Reply_ArrayEnd(reply);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Missing array wrapper for OOM warning reply

When req->queryOOM is true, the code replies with a simple string directly without wrapping it in an array. The warning field expects an array value (as seen in the rCtx->warning branch that starts an array at line 2696 and the else branch that replies with an empty array at line 2714). This causes a protocol mismatch where a string is sent instead of an array containing the warning string.

Fix in Cursor Fix in Web

@lerman25 lerman25 marked this pull request as draft November 29, 2025 13:09
@lerman25 lerman25 marked this pull request as ready for review November 29, 2025 13:09
RedisModule_Reply_SimpleString(reply, QUERY_WOOM_COORD);
}
if (rc == RS_RESULT_TIMEDOUT) {
// Track warnings in global statistics
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: OOM warnings not tracked in global statistics

When QueryError_HasQueryOOMWarning returns true or req->queryOOM is set, the code replies with QUERY_WOOM_COORD warning but doesn't track it in global statistics using QueryWarningsGlobalStats_UpdateWarning. This is inconsistent with timeout warnings which are tracked at line 614 in aggregate_exec.c and line 62 in hybrid_exec.c. The same tracking should be added for OOM warnings to maintain consistency and accurate statistics.

Additional Locations (2)

Fix in Cursor Fix in Web

@lerman25 lerman25 added this pull request to the merge queue Nov 30, 2025
Merged via the queue into 8.4 with commit ac92116 Nov 30, 2025
39 of 41 checks passed
@lerman25 lerman25 deleted the backport-7414-to-8.4 branch November 30, 2025 11:17
@lerman25
Copy link
Collaborator Author

lerman25 commented Dec 2, 2025

/backport

@redisearch-backport-pull-request
Copy link
Contributor

Backport failed for 8.2, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin 8.2
git worktree add -d .worktree/backport-7507-to-8.2 origin/8.2
cd .worktree/backport-7507-to-8.2
git switch --create backport-7507-to-8.2
git cherry-pick -x ac921169c63c4bac7176ee3501ff36e1baed2a17

lerman25 added a commit that referenced this pull request Dec 2, 2025
* [MOD-12419] Track Timeout errors and warnings in info ([#7414](#7414))

* Track timeout in sendchunk resp2

* Track timeout warning in sendSearchResults

* Track timeout error in searchResultReducer

(cherry picked from commit fafe1dc)

* Track timeout in sendChunk_hybrid

(cherry picked from commit 8529cb9)

* test timeout metrics

(cherry picked from commit c56a0d9)

* fix isCoord check

* Add query warning code and add function and fields needed to track

(cherry picked from commit a414641)

* Track timeout in sendchunk resp3

(cherry picked from commit 6853bb9)

* readd skip

* Update syntax and args error to new SA as cluster

* format and enrico comment

* Change test to N=0 with Internal only (not working so revert afterwards)

* Revert "Change test to N=0 with Internal only (not working so revert afterwards)"

This reverts commit 829ac53.

* meirav comments

* Stablize tests

* Add resp3 test

* _disable_ hybrid sa timeout

* Make test robust

* fixup! Make test robust

* remove limits

(cherry picked from commit 6eee260)

* Align to 8.4

(cherry picked from commit ac92116)
github-merge-queue bot pushed a commit that referenced this pull request Dec 3, 2025
* [MOD-12419] Track Timeout errors and warnings in info ([#7507](#7507))

* [MOD-12419] Track Timeout errors and warnings in info ([#7414](#7414))

* Track timeout in sendchunk resp2

* Track timeout warning in sendSearchResults

* Track timeout error in searchResultReducer

(cherry picked from commit fafe1dc)

* Track timeout in sendChunk_hybrid

(cherry picked from commit 8529cb9)

* test timeout metrics

(cherry picked from commit c56a0d9)

* fix isCoord check

* Add query warning code and add function and fields needed to track

(cherry picked from commit a414641)

* Track timeout in sendchunk resp3

(cherry picked from commit 6853bb9)

* readd skip

* Update syntax and args error to new SA as cluster

* format and enrico comment

* Change test to N=0 with Internal only (not working so revert afterwards)

* Revert "Change test to N=0 with Internal only (not working so revert afterwards)"

This reverts commit 829ac53.

* meirav comments

* Stablize tests

* Add resp3 test

* _disable_ hybrid sa timeout

* Make test robust

* fixup! Make test robust

* remove limits

(cherry picked from commit 6eee260)

* Align to 8.4

(cherry picked from commit ac92116)

* Fixes

* Backport more logic
alonre24 added a commit to alonre24/redis that referenced this pull request Jan 26, 2026
**Bug Fixes:**

* [redis#7385](RediSearch/RediSearch#7385) Fix high temporary memory consumption when loading multiple search indexes from RDB
* [redis#7430](RediSearch/RediSearch#7430) Fix a potential deadlock in `FT.HYBRID` in cluster mode during updates.
* [redis#7454](RediSearch/RediSearch#7454) Fix a garbage collection performence regression
* [redis#7460](RediSearch/RediSearch#7460) Fix potential double-free in Fork GC error paths
* [redis#7455](RediSearch/RediSearch#7455) Fix internal cursors not being deleted promptly in cluster mode
* [redis#7667](RediSearch/RediSearch#7667) Fix a cursor logical leak upon dropping the index
* [redis#7796](RediSearch/RediSearch#7796) Fix a potential use-after-free when removing connections
* [redis#7792](RediSearch/RediSearch#7792) Fix string comparison for binary data with embedded NULLs in TOLIST reducer in FT.AGGREGATE
* [redis#7823](RediSearch/RediSearch#7823) Update `FT.HYBRID` to accept vector blobs only via parameters
* [redis#7903](RediSearch/RediSearch#7903) Fix a memory leak in Hybrid ASM
* [redis#8052](RediSearch/RediSearch#8052) Fix `FT.HYBRID` behavior when used with `LOAD *`
* [redis#8082](RediSearch/RediSearch#8082) Fix incorrect FULLTEXT field metric counts
* [redis#8089](RediSearch/RediSearch#8089) Fix an edge case in `CLUSTERSET` handling
* [redis#8152](RediSearch/RediSearch#8152) Fix configuration registration issues

**Improvements:**

* [redis#7427](RediSearch/RediSearch#7427) Enhance `FT.PROFILE` with vector search execution details
* [redis#7431](RediSearch/RediSearch#7431) Ensure full `FT.PROFILE` output is returned on timeout with RETURN policy
* [redis#7507](RediSearch/RediSearch#7507) Track timeout warnings and errors in INFO
* [redis#7576](RediSearch/RediSearch#7576) Track OOM warnings and errors in INFO
* [redis#7612](RediSearch/RediSearch#7612) Track `maxprefixexpansions` warnings and errors in INFO
* [redis#7960](RediSearch/RediSearch#7960) Persist query warnings across cursor reads
* [redis#7551](RediSearch/RediSearch#7551), [redis#7616](RediSearch/RediSearch#7616), [redis#7622](RediSearch/RediSearch#7622), [redis#7625](RediSearch/RediSearch#7625) Add runtime thread and pending-jobs metrics
* [redis#7589](RediSearch/RediSearch#7589) Support multiple slot ranges in `search.CLUSTERSET`
* [redis#7707](RediSearch/RediSearch#7707) Add `WITHCOUNT` support to `FT.AGGREGATE`
* [redis#7862](RediSearch/RediSearch#7862) Add support for subquery `COUNT` in `FT.HYBRID`
* [redis#8087](RediSearch/RediSearch#8087) Add warnings when cursor results may be affected by ASM and expose ASM warnings in `FT.PROFILE`
* [redis#8049](RediSearch/RediSearch#8049) Add logging for index-related commands
* [redis#8150](RediSearch/RediSearch#8150) Fix shard total profile time reporting in `FT.PROFILE`
YaacovHazan pushed a commit to redis/redis that referenced this pull request Jan 26, 2026
**Bug Fixes:**

* [#7385](RediSearch/RediSearch#7385) Fix high
temporary memory consumption when loading multiple search indexes from
RDB
* [#7430](RediSearch/RediSearch#7430) Fix a
potential deadlock in `FT.HYBRID` in cluster mode during updates.
* [#7454](RediSearch/RediSearch#7454) Fix a
garbage collection performence regression
* [#7460](RediSearch/RediSearch#7460) Fix
potential double-free in Fork GC error paths
* [#7455](RediSearch/RediSearch#7455) Fix
internal cursors not being deleted promptly in cluster mode
* [#7667](RediSearch/RediSearch#7667) Fix a
cursor logical leak upon dropping the index
* [#7796](RediSearch/RediSearch#7796) Fix a
potential use-after-free when removing connections
* [#7792](RediSearch/RediSearch#7792) Fix string
comparison for binary data with embedded NULLs in TOLIST reducer in
FT.AGGREGATE
* [#7704](RediSearch/RediSearch#7704) Use
asynchronous jobs for GC in SVS to accelerate execution
* [#7823](RediSearch/RediSearch#7823) Update
`FT.HYBRID` to accept vector blobs only via parameters
* [#7903](RediSearch/RediSearch#7903) Fix a
memory leak in Hybrid ASM
* [#8052](RediSearch/RediSearch#8052) Fix
`FT.HYBRID` behavior when used with `LOAD *`
* [#8082](RediSearch/RediSearch#8082) Fix
incorrect FULLTEXT field metric counts
* [#8089](RediSearch/RediSearch#8089) Fix an
edge case in `CLUSTERSET` handling
* [#8152](RediSearch/RediSearch#8152) Fix
configuration registration issues

**Improvements:**

* [#7427](RediSearch/RediSearch#7427) Enhance
`FT.PROFILE` with vector search execution details
* [#7431](RediSearch/RediSearch#7431) Ensure
full `FT.PROFILE` output is returned on timeout with RETURN policy
* [#7507](RediSearch/RediSearch#7507) Track
timeout warnings and errors in INFO
* [#7576](RediSearch/RediSearch#7576) Track OOM
warnings and errors in INFO
* [#7612](RediSearch/RediSearch#7612) Track
`maxprefixexpansions` warnings and errors in INFO
* [#7960](RediSearch/RediSearch#7960) Persist
query warnings across cursor reads
* [#7551](RediSearch/RediSearch#7551),
[#7616](RediSearch/RediSearch#7616),
[#7622](RediSearch/RediSearch#7622),
[#7625](RediSearch/RediSearch#7625) Add runtime
thread and pending-jobs metrics
* [#7589](RediSearch/RediSearch#7589) Support
multiple slot ranges in `search.CLUSTERSET`
* [#7707](RediSearch/RediSearch#7707) Add
`WITHCOUNT` support to `FT.AGGREGATE`
* [#7862](RediSearch/RediSearch#7862) Add
support for subquery `COUNT` in `FT.HYBRID`
* [#8087](RediSearch/RediSearch#8087) Add
warnings when cursor results may be affected by ASM and expose ASM
warnings in `FT.PROFILE`
* [#8049](RediSearch/RediSearch#8049) Add
logging for index-related commands
* [#8150](RediSearch/RediSearch#8150) Fix shard
total profile time reporting in `FT.PROFILE`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants