
[8.4] [MOD-12418] Track OOM errors and warnings in info (#7452) #7576

Merged: lerman25 merged 2 commits into 8.4 from backport-7452-to-8.4 on Dec 1, 2025


lerman25 (Collaborator) commented Nov 30, 2025

Backport of #7452 to 8.4.


Note

Track and expose OOM query errors and warnings across coordinator and shards, updating execution paths and INFO output, with tests covering standalone/cluster and RESP2/RESP3.

  • Metrics/Stats:
    • Add OOM counters to QueryErrorsGlobalStats/QueryWarningGlobalStats and include them in TotalGlobalStats_GetQueryStats and INFO MODULES output (both shard and coordinator sections).
    • Extend QueryErrorsGlobalStats_UpdateError/QueryWarningsGlobalStats_UpdateWarning to handle OOM codes (QUERY_EOOM, QUERY_WARNING_CODE_OUT_OF_MEMORY_{COORD,SHARD}).
  • Query Execution/Replies:
    • Aggregate/Search/Hybrid paths update global OOM warning/error counters on guardrail OOM, query OOM warnings, and empty-reply bailouts (aggregate_exec.c, reply_empty.c, hybrid_exec.c, module.c).
    • RESP2/RESP3 replies now record OOM warnings in warning arrays where applicable.
    • Hybrid cursor mapping now preserves original error codes in aggregation of shard errors.
  • Tests:
    • Add/extend tests to validate OOM errors/warnings counting in standalone and cluster (RESP2/RESP3), and to ensure unrelated metrics remain unchanged.

Written by Cursor Bugbot for commit fdd7a03.
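The counter design summarized in the note can be pictured with a minimal sketch. It assumes one set of counters per source (shard and coordinator), an update function that switches on the error code, and an INFO-style formatter. Every name below (`SketchErrorCode`, `Sketch_UpdateError`, `Sketch_FormatInfo`, and the output field names) is a hypothetical stand-in, not the identifiers or INFO fields used by the actual change, which works through `QueryErrorsGlobalStats_UpdateError` and `QueryWarningsGlobalStats_UpdateWarning`.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical error codes. The PR itself refers to QUERY_EOOM; these
 * enum values are illustrative only. */
typedef enum {
  SKETCH_ERR_SYNTAX,
  SKETCH_ERR_TIMEOUT,
  SKETCH_ERR_OOM
} SketchErrorCode;

/* Hypothetical source selector: counters are kept separately for the
 * shard and the coordinator, mirroring the two INFO sections. */
typedef enum { SKETCH_SHARD = 0, SKETCH_COORD = 1 } SketchSource;

typedef struct {
  size_t syntax;
  size_t timeout;
  size_t oom; /* the newly added counter */
} SketchErrorCounters;

/* One counter set per source, zero-initialized. A real module would
 * likely need atomic updates; this sketch is single-threaded. */
static SketchErrorCounters sketch_errors[2];

/* Add `toAdd` to the counter that matches `code` for the given source. */
void Sketch_UpdateError(SketchErrorCode code, size_t toAdd, SketchSource src) {
  SketchErrorCounters *c = &sketch_errors[src];
  switch (code) {
    case SKETCH_ERR_SYNTAX:  c->syntax += toAdd;  break;
    case SKETCH_ERR_TIMEOUT: c->timeout += toAdd; break;
    case SKETCH_ERR_OOM:     c->oom += toAdd;     break;
  }
}

/* Render the counters of one source as INFO-style "key:value" lines.
 * The field names are assumptions made for this sketch. */
int Sketch_FormatInfo(char *buf, size_t len, SketchSource src) {
  const SketchErrorCounters *c = &sketch_errors[src];
  const char *prefix = (src == SKETCH_COORD) ? "coord" : "shard";
  return snprintf(buf, len,
                  "%s_errors_syntax:%zu\r\n"
                  "%s_errors_timeout:%zu\r\n"
                  "%s_errors_oom:%zu\r\n",
                  prefix, c->syntax, prefix, c->timeout, prefix, c->oom);
}
```

Keeping a separate counter per source lets a test check that an OOM on a shard leaves the coordinator counters, and all unrelated counters, unchanged.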

* Track timeout in sendchunk resp2

* Track timeout warning in sendSearchResults

* Track timeout error in searchResultReducer

(cherry picked from commit fafe1dc)

* Track timeout in sendChunk_hybrid

(cherry picked from commit 8529cb9)

* test timeout metrics

(cherry picked from commit c56a0d9)

* fix isCoord check

* Add query warning code and add function and fields needed to track

(cherry picked from commit a414641)

* Track timeout in sendchunk resp3

(cherry picked from commit 6853bb9)

* readd skip

* Update syntax and args error to new SA as cluster

* format and enrico comment

* Track OOM

(cherry picked from commit de1a285aac27c73d4feca50abe3c2328f6959ce2)

* fix warnings double counting

* fix missing skip and logic

* Change test to N=0 with Internal only (not working so revert afterwards)

* Revert "Change test to N=0 with Internal only (not working so revert afterwards)"

This reverts commit 829ac53.

* meirav comments

* Stabilize tests

* Add resp3 test

* _disable_ hybrid sa timeout

* Make test robust

* fixup! Make test robust

* remove limits

* comments

* Refactor warning tracking loop for clarity

* Add test for warnings metric count with timeout

* fix flaky

(cherry picked from commit 9ccdf3e)
meiravgri previously approved these changes Nov 30, 2025

if (req->queryOOM) {
  QueryWarningsGlobalStats_UpdateWarning(QUERY_WARNING_CODE_OUT_OF_MEMORY_COORD, 1, COORD_ERR_WARN);
}

Bug: OOM warning double counted for RESP3 responses

The newly added OOM warning tracking at lines 2806-2808 runs for both RESP2 and RESP3 responses, but RESP3 already tracks OOM warnings at line 2711 within its else if (req->queryOOM) branch. When req->queryOOM is true and using RESP3, QueryWarningsGlobalStats_UpdateWarning is called twice, inflating the OOM warning counter. The tracking at lines 2806-2808 appears intended only for RESP2 (which has no other OOM tracking) but is positioned outside the if/else block.
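The control-flow problem described in this comment can be reproduced in isolation. The sketch below is hypothetical: `SketchRequest`, `Sketch_ReplyBuggy`, and `Sketch_ReplyFixed` are stand-ins invented for illustration, and the "fixed" variant shows one possible shape of a correction, not necessarily the change that was merged.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the request object; only the two flags
 * relevant to the reported bug are modeled. */
typedef struct {
  bool queryOOM; /* the query hit an out-of-memory condition */
  bool resp3;    /* the client negotiated RESP3 */
} SketchRequest;

/* Stand-in for the global OOM warning counter. */
static size_t sketch_oom_warnings = 0;

static void Sketch_CountOomWarning(void) { sketch_oom_warnings++; }

/* Shape of the reported bug: the RESP3 branch already counts the
 * warning, and an unconditional check after the if/else counts again. */
void Sketch_ReplyBuggy(const SketchRequest *req) {
  if (req->resp3) {
    if (req->queryOOM) {
      Sketch_CountOomWarning(); /* RESP3 tracking */
    }
  } else {
    /* RESP2 reply: no OOM tracking of its own */
  }
  if (req->queryOOM) {
    Sketch_CountOomWarning(); /* runs for RESP2 and RESP3 */
  }
}

/* One possible correction: move the extra tracking into the RESP2
 * branch so each reply path counts the warning exactly once. */
void Sketch_ReplyFixed(const SketchRequest *req) {
  if (req->resp3) {
    if (req->queryOOM) {
      Sketch_CountOomWarning();
    }
  } else if (req->queryOOM) {
    Sketch_CountOomWarning();
  }
}
```

With `queryOOM` set, the buggy variant increments the counter twice for a RESP3 request and once for RESP2, while the corrected variant increments it once in both cases.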


@lerman25 lerman25 enabled auto-merge November 30, 2025 12:32
@lerman25 lerman25 added this pull request to the merge queue Nov 30, 2025

codecov bot commented Nov 30, 2025

Codecov Report

❌ Patch coverage is 94.28571% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.89%. Comparing base (ac92116) to head (fdd7a03).
⚠️ Report is 2 commits behind head on 8.4.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/aggregate/aggregate_exec.c | 71.42% | 2 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##              8.4    #7576      +/-   ##
==========================================
- Coverage   85.95%   85.89%   -0.06%     
==========================================
  Files         331      331              
  Lines       52667    52701      +34     
  Branches    12004    12004              
==========================================
- Hits        45272    45270       -2     
- Misses       7228     7264      +36     
  Partials      167      167              
| Flag | Coverage Δ |
|---|---|
| flow | 84.61% <94.28%> (-0.05%) ⬇️ |
| unit | 52.37% <0.00%> (-0.03%) ⬇️ |


@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Nov 30, 2025
@lerman25 lerman25 added this pull request to the merge queue Nov 30, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Nov 30, 2025
@lerman25 lerman25 added this pull request to the merge queue Nov 30, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Nov 30, 2025
@lerman25 lerman25 added this pull request to the merge queue Dec 1, 2025
Merged via the queue into 8.4 with commit fbccafd Dec 1, 2025
26 checks passed
@lerman25 lerman25 deleted the backport-7452-to-8.4 branch December 1, 2025 10:14
alonre24 added a commit to alonre24/redis that referenced this pull request Jan 26, 2026
**Bug Fixes:**

* [redis#7385](RediSearch/RediSearch#7385) Fix high temporary memory consumption when loading multiple search indexes from RDB
* [redis#7430](RediSearch/RediSearch#7430) Fix a potential deadlock in `FT.HYBRID` in cluster mode during updates.
* [redis#7454](RediSearch/RediSearch#7454) Fix a garbage collection performance regression
* [redis#7460](RediSearch/RediSearch#7460) Fix potential double-free in Fork GC error paths
* [redis#7455](RediSearch/RediSearch#7455) Fix internal cursors not being deleted promptly in cluster mode
* [redis#7667](RediSearch/RediSearch#7667) Fix a cursor logical leak upon dropping the index
* [redis#7796](RediSearch/RediSearch#7796) Fix a potential use-after-free when removing connections
* [redis#7792](RediSearch/RediSearch#7792) Fix string comparison for binary data with embedded NULLs in TOLIST reducer in FT.AGGREGATE
* [redis#7823](RediSearch/RediSearch#7823) Update `FT.HYBRID` to accept vector blobs only via parameters
* [redis#7903](RediSearch/RediSearch#7903) Fix a memory leak in Hybrid ASM
* [redis#8052](RediSearch/RediSearch#8052) Fix `FT.HYBRID` behavior when used with `LOAD *`
* [redis#8082](RediSearch/RediSearch#8082) Fix incorrect FULLTEXT field metric counts
* [redis#8089](RediSearch/RediSearch#8089) Fix an edge case in `CLUSTERSET` handling
* [redis#8152](RediSearch/RediSearch#8152) Fix configuration registration issues

**Improvements:**

* [redis#7427](RediSearch/RediSearch#7427) Enhance `FT.PROFILE` with vector search execution details
* [redis#7431](RediSearch/RediSearch#7431) Ensure full `FT.PROFILE` output is returned on timeout with RETURN policy
* [redis#7507](RediSearch/RediSearch#7507) Track timeout warnings and errors in INFO
* [redis#7576](RediSearch/RediSearch#7576) Track OOM warnings and errors in INFO
* [redis#7612](RediSearch/RediSearch#7612) Track `maxprefixexpansions` warnings and errors in INFO
* [redis#7960](RediSearch/RediSearch#7960) Persist query warnings across cursor reads
* [redis#7551](RediSearch/RediSearch#7551), [redis#7616](RediSearch/RediSearch#7616), [redis#7622](RediSearch/RediSearch#7622), [redis#7625](RediSearch/RediSearch#7625) Add runtime thread and pending-jobs metrics
* [redis#7589](RediSearch/RediSearch#7589) Support multiple slot ranges in `search.CLUSTERSET`
* [redis#7707](RediSearch/RediSearch#7707) Add `WITHCOUNT` support to `FT.AGGREGATE`
* [redis#7862](RediSearch/RediSearch#7862) Add support for subquery `COUNT` in `FT.HYBRID`
* [redis#8087](RediSearch/RediSearch#8087) Add warnings when cursor results may be affected by ASM and expose ASM warnings in `FT.PROFILE`
* [redis#8049](RediSearch/RediSearch#8049) Add logging for index-related commands
* [redis#8150](RediSearch/RediSearch#8150) Fix shard total profile time reporting in `FT.PROFILE`
YaacovHazan pushed a commit to redis/redis that referenced this pull request Jan 26, 2026
**Bug Fixes:**

* [#7385](RediSearch/RediSearch#7385) Fix high
temporary memory consumption when loading multiple search indexes from
RDB
* [#7430](RediSearch/RediSearch#7430) Fix a
potential deadlock in `FT.HYBRID` in cluster mode during updates.
* [#7454](RediSearch/RediSearch#7454) Fix a
garbage collection performance regression
* [#7460](RediSearch/RediSearch#7460) Fix
potential double-free in Fork GC error paths
* [#7455](RediSearch/RediSearch#7455) Fix
internal cursors not being deleted promptly in cluster mode
* [#7667](RediSearch/RediSearch#7667) Fix a
cursor logical leak upon dropping the index
* [#7796](RediSearch/RediSearch#7796) Fix a
potential use-after-free when removing connections
* [#7792](RediSearch/RediSearch#7792) Fix string
comparison for binary data with embedded NULLs in TOLIST reducer in
FT.AGGREGATE
* [#7704](RediSearch/RediSearch#7704) Use
asynchronous jobs for GC in SVS to accelerate execution
* [#7823](RediSearch/RediSearch#7823) Update
`FT.HYBRID` to accept vector blobs only via parameters
* [#7903](RediSearch/RediSearch#7903) Fix a
memory leak in Hybrid ASM
* [#8052](RediSearch/RediSearch#8052) Fix
`FT.HYBRID` behavior when used with `LOAD *`
* [#8082](RediSearch/RediSearch#8082) Fix
incorrect FULLTEXT field metric counts
* [#8089](RediSearch/RediSearch#8089) Fix an
edge case in `CLUSTERSET` handling
* [#8152](RediSearch/RediSearch#8152) Fix
configuration registration issues

**Improvements:**

* [#7427](RediSearch/RediSearch#7427) Enhance
`FT.PROFILE` with vector search execution details
* [#7431](RediSearch/RediSearch#7431) Ensure
full `FT.PROFILE` output is returned on timeout with RETURN policy
* [#7507](RediSearch/RediSearch#7507) Track
timeout warnings and errors in INFO
* [#7576](RediSearch/RediSearch#7576) Track OOM
warnings and errors in INFO
* [#7612](RediSearch/RediSearch#7612) Track
`maxprefixexpansions` warnings and errors in INFO
* [#7960](RediSearch/RediSearch#7960) Persist
query warnings across cursor reads
* [#7551](RediSearch/RediSearch#7551),
[#7616](RediSearch/RediSearch#7616),
[#7622](RediSearch/RediSearch#7622),
[#7625](RediSearch/RediSearch#7625) Add runtime
thread and pending-jobs metrics
* [#7589](RediSearch/RediSearch#7589) Support
multiple slot ranges in `search.CLUSTERSET`
* [#7707](RediSearch/RediSearch#7707) Add
`WITHCOUNT` support to `FT.AGGREGATE`
* [#7862](RediSearch/RediSearch#7862) Add
support for subquery `COUNT` in `FT.HYBRID`
* [#8087](RediSearch/RediSearch#8087) Add
warnings when cursor results may be affected by ASM and expose ASM
warnings in `FT.PROFILE`
* [#8049](RediSearch/RediSearch#8049) Add
logging for index-related commands
* [#8150](RediSearch/RediSearch#8150) Fix shard
total profile time reporting in `FT.PROFILE`
2 participants