[MOD-12418] Track OOM errors and warnings in info#7452
Conversation
| // warning | ||
| RedisModule_ReplyKV_Array(reply, "warning"); // >warnings | ||
| if (QueryError_HasQueryOOMWarning(AREQ_QueryProcessingCtx(req)->err)) { | ||
| QueryWarningsGlobalStats_UpdateWarning(QUERY_WARNING_CODE_OUT_OF_MEMORY_SHARD, 1, !IsInternal(req)); |
There was a problem hiding this comment.
Bug: Inconsistent OOM warning code and counter tracking
The warning code QUERY_WARNING_CODE_OUT_OF_MEMORY_SHARD is used with !IsInternal(req) as the third parameter. When !IsInternal(req) evaluates to true (coordinator/standalone), this increments the coordinator counter, not the shard counter. The warning code name suggests shard tracking but the actual behavior tracks coordinator warnings for non-internal requests. This creates inconsistent metrics where the warning code enum doesn't match which counter gets incremented.
|
|
||
| if (req->queryOOM) { | ||
| QueryWarningsGlobalStats_UpdateWarning(QUERY_WARNING_CODE_OUT_OF_MEMORY_COORD, 1, COORD_ERR_WARN); | ||
| } |
There was a problem hiding this comment.
Bug: Double counting OOM warnings in RESP3 mode
The OOM warning is tracked twice when req->queryOOM is true in RESP3 mode. First at line 2714 inside the RESP3 branch, and again at line 2810 after both RESP2 and RESP3 branches complete. This causes the coordinator OOM warning counter to be incremented twice for the same warning in RESP3 protocol, leading to inflated metrics.
There was a problem hiding this comment.
resp2 & resp3 can't be both true
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #7452 +/- ##
=======================================
Coverage 84.82% 84.82%
=======================================
Files 349 349
Lines 53899 53933 +34
Branches 14505 14505
=======================================
+ Hits 45720 45751 +31
- Misses 7986 7989 +3
Partials 193 193
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
…afterwards)" This reverts commit 829ac53.
3659e24 to
8a6480d
Compare
* Track timeout in sendchunk resp2 * Track timeout warning in sendSearchResults * Track timeout error in searchResultReducer (cherry picked from commit fafe1dc) * Track timeout in sendChunk_hybrid (cherry picked from commit 8529cb9) * test timeout metrics (cherry picked from commit c56a0d9) * fix isCoord check * Add query warning code and add function and fields needed to track (cherry picked from commit a414641) * Track timeout in sendchunk resp3 (cherry picked from commit 6853bb9) * readd skip * Update syntax and args error to new SA as cluster * format and enrico comment * Track OOM (cherry picked from commit de1a285aac27c73d4feca50abe3c2328f6959ce2) * fix warnings double counting * fix missing skip and logic * Change test to N=0 with Internal only (not working so revert afterwards) * Revert "Change test to N=0 with Internal only (not working so revert afterwards)" This reverts commit 829ac53. * meirav comments * Stablize tests * Add resp3 test * _disable_ hybrid sa timeout * Make test robust * fixup! Make test robust * remove limits * comments * Refactor warning tracking loop for clarity * Add test for warnings metric count with timeout
* Track timeout in sendchunk resp2 * Track timeout warning in sendSearchResults * Track timeout error in searchResultReducer (cherry picked from commit fafe1dc) * Track timeout in sendChunk_hybrid (cherry picked from commit 8529cb9) * test timeout metrics (cherry picked from commit c56a0d9) * fix isCoord check * Add query warning code and add function and fields needed to track (cherry picked from commit a414641) * Track timeout in sendchunk resp3 (cherry picked from commit 6853bb9) * readd skip * Update syntax and args error to new SA as cluster * format and enrico comment * Track OOM (cherry picked from commit de1a285aac27c73d4feca50abe3c2328f6959ce2) * fix warnings double counting * fix missing skip and logic * Change test to N=0 with Internal only (not working so revert afterwards) * Revert "Change test to N=0 with Internal only (not working so revert afterwards)" This reverts commit 829ac53. * meirav comments * Stablize tests * Add resp3 test * _disable_ hybrid sa timeout * Make test robust * fixup! Make test robust * remove limits * comments * Refactor warning tracking loop for clarity * Add test for warnings metric count with timeout * fix flaky (cherry picked from commit 9ccdf3e)
* [MOD-12418] Track OOM errors and warnings in info (#7452) * Track timeout in sendchunk resp2 * Track timeout warning in sendSearchResults * Track timeout error in searchResultReducer (cherry picked from commit fafe1dc) * Track timeout in sendChunk_hybrid (cherry picked from commit 8529cb9) * test timeout metrics (cherry picked from commit c56a0d9) * fix isCoord check * Add query warning code and add function and fields needed to track (cherry picked from commit a414641) * Track timeout in sendchunk resp3 (cherry picked from commit 6853bb9) * readd skip * Update syntax and args error to new SA as cluster * format and enrico comment * Track OOM (cherry picked from commit de1a285aac27c73d4feca50abe3c2328f6959ce2) * fix warnings double counting * fix missing skip and logic * Change test to N=0 with Internal only (not working so revert afterwards) * Revert "Change test to N=0 with Internal only (not working so revert afterwards)" This reverts commit 829ac53. * meirav comments * Stablize tests * Add resp3 test * _disable_ hybrid sa timeout * Make test robust * fixup! Make test robust * remove limits * comments * Refactor warning tracking loop for clarity * Add test for warnings metric count with timeout * fix flaky (cherry picked from commit 9ccdf3e) * remove flaky part

Continuing the work of #7386
This PR add tracking OOM warnings and errors in the global stats.
Notes -
FT.AGGREGATEdoesn't track warnings coming from empty reply - so currently we can't track OOM warnings from shard in the coord.Note
Adds OOM error/warning metrics to global stats and INFO, wiring updates across SEARCH/AGGREGATE/HYBRID paths (coord vs shard) and covering with tests.
QueryErrorsGlobalStatsandQueryWarningGlobalStats; expose viaTotalGlobalStats_GetQueryStats.SEARCH,AGGREGATE,HYBRID, cursors, and empty-reply helpers (coord vs shard-aware).shard_total_query_errors_oom,coord_total_query_errors_oom,shard_total_query_warnings_oom,coord_total_query_warnings_oominINFO MODULES.aggregate_exec.c,hybrid_exec.c,reply_empty.c, andmodule.c: track/query OOM warnings (QUERY_WOOM_*) and errors on guardrail failures and when composing RESP2/RESP3 replies.coord/hybrid/hybrid_cursor_mappings.c: propagate specific error code instead of generic on shard failures.Written by Cursor Bugbot for commit 2f3f64d. This will update automatically on new commits. Configure here.