Skip to content

[MOD-12418] Track OOM errors and warnings in info#7452

Merged
lerman25 merged 32 commits intomasterfrom
omerl-track-oom-metric
Nov 27, 2025
Merged

[MOD-12418] Track OOM errors and warnings in info#7452
lerman25 merged 32 commits intomasterfrom
omerl-track-oom-metric

Conversation

@lerman25
Copy link
Collaborator

@lerman25 lerman25 commented Nov 20, 2025

Continuing the work of #7386
This PR add tracking OOM warnings and errors in the global stats.

Notes -
FT.AGGREGATE doesn't track warnings coming from empty reply - so currently we can't track OOM warnings from shard in the coord.


Note

Adds OOM error/warning metrics to global stats and INFO, wiring updates across SEARCH/AGGREGATE/HYBRID paths (coord vs shard) and covering with tests.

  • Metrics/Stats:
    • Add OOM counters to QueryErrorsGlobalStats and QueryWarningGlobalStats; expose via TotalGlobalStats_GetQueryStats.
    • Increment OOM error/warning stats in memory guardrails and reply paths across SEARCH, AGGREGATE, HYBRID, cursors, and empty-reply helpers (coord vs shard-aware).
  • INFO output:
    • Export new fields: shard_total_query_errors_oom, coord_total_query_errors_oom, shard_total_query_warnings_oom, coord_total_query_warnings_oom in INFO MODULES.
  • Execution paths updated:
    • aggregate_exec.c, hybrid_exec.c, reply_empty.c, and module.c: track/query OOM warnings (QUERY_WOOM_*) and errors on guardrail failures and when composing RESP2/RESP3 replies.
    • coord/hybrid/hybrid_cursor_mappings.c: propagate specific error code instead of generic on shard failures.
  • Tests:
    • Extend warnings/errors tests (standalone, cluster, RESP3) to validate OOM error/warning metrics and coordinator/shard accounting.

Written by Cursor Bugbot for commit 2f3f64d. This will update automatically on new commits. Configure here.

@lerman25 lerman25 requested a review from meiravgri November 20, 2025 16:52
@lerman25 lerman25 removed the request for review from meiravgri November 20, 2025 16:52
@lerman25 lerman25 added the enforce:test Run the test suite even if no code has changed. label Nov 20, 2025
@lerman25 lerman25 marked this pull request as draft November 20, 2025 16:53
@lerman25 lerman25 marked this pull request as ready for review November 20, 2025 16:53
// warning
RedisModule_ReplyKV_Array(reply, "warning"); // >warnings
if (QueryError_HasQueryOOMWarning(AREQ_QueryProcessingCtx(req)->err)) {
QueryWarningsGlobalStats_UpdateWarning(QUERY_WARNING_CODE_OUT_OF_MEMORY_SHARD, 1, !IsInternal(req));
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Inconsistent OOM warning code and counter tracking

The warning code QUERY_WARNING_CODE_OUT_OF_MEMORY_SHARD is used with !IsInternal(req) as the third parameter. When !IsInternal(req) evaluates to true (coordinator/standalone), this increments the coordinator counter, not the shard counter. The warning code name suggests shard tracking but the actual behavior tracks coordinator warnings for non-internal requests. This creates inconsistent metrics where the warning code enum doesn't match which counter gets incremented.

Fix in Cursor Fix in Web

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image


if (req->queryOOM) {
QueryWarningsGlobalStats_UpdateWarning(QUERY_WARNING_CODE_OUT_OF_MEMORY_COORD, 1, COORD_ERR_WARN);
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Double counting OOM warnings in RESP3 mode

The OOM warning is tracked twice when req->queryOOM is true in RESP3 mode. First at line 2714 inside the RESP3 branch, and again at line 2810 after both RESP2 and RESP3 branches complete. This causes the coordinator OOM warning counter to be incremented twice for the same warning in RESP3 protocol, leading to inflated metrics.

Fix in Cursor Fix in Web

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resp2 & resp3 can't be both true

@codecov
Copy link

codecov bot commented Nov 20, 2025

Codecov Report

❌ Patch coverage is 94.28571% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.82%. Comparing base (8dc6f23) to head (2f3f64d).
⚠️ Report is 7 commits behind head on master.

Files with missing lines Patch % Lines
src/aggregate/aggregate_exec.c 71.42% 2 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##           master    #7452   +/-   ##
=======================================
  Coverage   84.82%   84.82%           
=======================================
  Files         349      349           
  Lines       53899    53933   +34     
  Branches    14505    14505           
=======================================
+ Hits        45720    45751   +31     
- Misses       7986     7989    +3     
  Partials      193      193           
Flag Coverage Δ
flow 85.06% <94.28%> (-0.12%) ⬇️
unit 52.18% <0.00%> (-0.04%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@lerman25 lerman25 marked this pull request as draft November 22, 2025 13:02
@lerman25 lerman25 marked this pull request as ready for review November 22, 2025 13:02
Base automatically changed from omerl-track-timeout-metric to master November 24, 2025 21:41
@lerman25 lerman25 dismissed meiravgri’s stale review November 24, 2025 21:41

The base branch was changed.

meiravgri
meiravgri previously approved these changes Nov 25, 2025
meiravgri
meiravgri previously approved these changes Nov 25, 2025
@lerman25 lerman25 added this pull request to the merge queue Nov 25, 2025
github-merge-queue bot pushed a commit that referenced this pull request Nov 25, 2025
* Track timeout in sendchunk resp2

* Track timeout warning in sendSearchResults

* Track timeout error in searchResultReducer

(cherry picked from commit fafe1dc)

* Track timeout in sendChunk_hybrid

(cherry picked from commit 8529cb9)

* test timeout metrics

(cherry picked from commit c56a0d9)

* fix isCoord check

* Add query warning code and add function and fields needed to track

(cherry picked from commit a414641)

* Track timeout in sendchunk resp3

(cherry picked from commit 6853bb9)

* readd skip

* Update syntax and args error to new SA as cluster

* format and enrico comment

* Track OOM

(cherry picked from commit de1a285aac27c73d4feca50abe3c2328f6959ce2)

* fix warnings double counting

* fix missing skip and logic

* Change test to N=0 with Internal only (not working so revert afterwards)

* Revert "Change test to N=0 with Internal only (not working so revert afterwards)"

This reverts commit 829ac53.

* meirav comments

* Stablize tests

* Add resp3 test

* _disable_ hybrid sa timeout

* Make test robust

* fixup! Make test robust

* remove limits

* comments

* Refactor warning tracking loop for clarity

* Add test for warnings metric count with timeout
@lerman25 lerman25 removed this pull request from the merge queue due to a manual request Nov 25, 2025
@lerman25 lerman25 added this pull request to the merge queue Nov 25, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Nov 25, 2025
meiravgri
meiravgri previously approved these changes Nov 26, 2025
@lerman25 lerman25 added this pull request to the merge queue Nov 26, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Nov 26, 2025
@lerman25 lerman25 enabled auto-merge November 27, 2025 11:53
@lerman25 lerman25 added this pull request to the merge queue Nov 27, 2025
Merged via the queue into master with commit 9ccdf3e Nov 27, 2025
26 checks passed
@lerman25 lerman25 deleted the omerl-track-oom-metric branch November 27, 2025 17:32
lerman25 added a commit that referenced this pull request Nov 30, 2025
* Track timeout in sendchunk resp2

* Track timeout warning in sendSearchResults

* Track timeout error in searchResultReducer

(cherry picked from commit fafe1dc)

* Track timeout in sendChunk_hybrid

(cherry picked from commit 8529cb9)

* test timeout metrics

(cherry picked from commit c56a0d9)

* fix isCoord check

* Add query warning code and add function and fields needed to track

(cherry picked from commit a414641)

* Track timeout in sendchunk resp3

(cherry picked from commit 6853bb9)

* readd skip

* Update syntax and args error to new SA as cluster

* format and enrico comment

* Track OOM

(cherry picked from commit de1a285aac27c73d4feca50abe3c2328f6959ce2)

* fix warnings double counting

* fix missing skip and logic

* Change test to N=0 with Internal only (not working so revert afterwards)

* Revert "Change test to N=0 with Internal only (not working so revert afterwards)"

This reverts commit 829ac53.

* meirav comments

* Stablize tests

* Add resp3 test

* _disable_ hybrid sa timeout

* Make test robust

* fixup! Make test robust

* remove limits

* comments

* Refactor warning tracking loop for clarity

* Add test for warnings metric count with timeout

* fix flaky

(cherry picked from commit 9ccdf3e)
github-merge-queue bot pushed a commit that referenced this pull request Dec 1, 2025
* [MOD-12418]  Track OOM errors and warnings in info (#7452)

* Track timeout in sendchunk resp2

* Track timeout warning in sendSearchResults

* Track timeout error in searchResultReducer

(cherry picked from commit fafe1dc)

* Track timeout in sendChunk_hybrid

(cherry picked from commit 8529cb9)

* test timeout metrics

(cherry picked from commit c56a0d9)

* fix isCoord check

* Add query warning code and add function and fields needed to track

(cherry picked from commit a414641)

* Track timeout in sendchunk resp3

(cherry picked from commit 6853bb9)

* readd skip

* Update syntax and args error to new SA as cluster

* format and enrico comment

* Track OOM

(cherry picked from commit de1a285aac27c73d4feca50abe3c2328f6959ce2)

* fix warnings double counting

* fix missing skip and logic

* Change test to N=0 with Internal only (not working so revert afterwards)

* Revert "Change test to N=0 with Internal only (not working so revert afterwards)"

This reverts commit 829ac53.

* meirav comments

* Stablize tests

* Add resp3 test

* _disable_ hybrid sa timeout

* Make test robust

* fixup! Make test robust

* remove limits

* comments

* Refactor warning tracking loop for clarity

* Add test for warnings metric count with timeout

* fix flaky

(cherry picked from commit 9ccdf3e)

* remove flaky part
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enforce:test Run the test suite even if no code has changed. size:M

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants