[MOD-12419] Track Timeout errors and warnings in info#7414
Conversation
| [export] | ||
| include = ["QueryWarningCode"] |
There was a problem hiding this comment.
Since you do a pub use .. QueryWarningCode in src/redisearch_rs/c_entrypoint/query_error_ffi/src/lib.rs, i don't think this line is needed.
38967ea to
7f330c3
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #7414 +/- ##
==========================================
+ Coverage 84.92% 84.94% +0.02%
==========================================
Files 348 348
Lines 53773 53836 +63
Branches 14353 14378 +25
==========================================
+ Hits 45668 45733 +65
+ Misses 7909 7907 -2
Partials 196 196
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
|
||
| if (ShouldReplyWithError(QueryError_GetCode(rp->parent->err), req->reqConfig.timeoutPolicy, IsProfile(req))) { | ||
| // Track errors in global statistics | ||
| QueryErrorsGlobalStats_UpdateError(QueryError_GetCode(rp->parent->err), 1, !IsInternal(req)); |
There was a problem hiding this comment.
Bug: Inconsistent error tracking between RESP2 and RESP3
In sendChunk_Resp3, the error tracking uses QueryError_GetCode(rp->parent->err) while the error reply uses QueryError_GetUserError(qctx->err). This differs from sendChunk_Resp2 which consistently uses qctx->err for both tracking and replying. If rp->parent->err and qctx->err contain different error codes, the tracked error code won't match the error actually returned to the user, leading to incorrect statistics.
There was a problem hiding this comment.
I don't think I should fix misalignment between resp2 and resp3 in this PR.
Using rp->parent->err is aligned with the line above which checks errors in rp->parent->err
meiravgri
left a comment
There was a problem hiding this comment.
Nice and clean
I agree
with the SA definiton change
|
|
||
| if (ShouldReplyWithError(QueryError_GetCode(rp->parent->err), req->reqConfig.timeoutPolicy, IsProfile(req))) { | ||
| // Track errors in global statistics | ||
| QueryErrorsGlobalStats_UpdateError(QueryError_GetCode(rp->parent->err), 1, !IsInternal(req)); |
| // Update global query errors statistics before freeing the request. | ||
| // Both SA and internal are considered shards. | ||
| QueryErrorsGlobalStats_UpdateError(QueryError_GetCode(&status), 1, SHARD_ERR_WARN); | ||
| // Update global query errors statistics |
There was a problem hiding this comment.
wonder why you remove " before freeing the request."
There was a problem hiding this comment.
Because the request is no longer relevant if I use numShards
| // If an error occurred, or a timeout in strict mode - return a simple error | ||
| if (ShouldReplyWithError(QueryError_GetCode(rp->parent->err), req->reqConfig.timeoutPolicy, IsProfile(req))) { | ||
| // Track errors in global statistics | ||
| QueryErrorsGlobalStats_UpdateError(QueryError_GetCode(qctx->err), 1, !IsInternal(req)); |
There was a problem hiding this comment.
Bug: Inconsistent error source in RESP2 error tracking
The error tracking uses qctx->err while the error check on line 401 uses rp->parent->err. This inconsistency means the tracked error code may differ from the error that triggered the tracking. The RESP3 version at line 532 correctly uses rp->parent->err for both operations, confirming this is a bug in the RESP2 path.
tests/pytests/test_info_modules.py
Outdated
| for shardId in range(1, self.env.shardsCount + 1): | ||
| info_dict = info_modules_to_dict(self.env.getConnection(shardId)) | ||
| self.env.assertEqual(info_dict[WARN_ERR_SECTION][TIMEOUT_WARNING_SHARD_METRIC], str(base_warn_shards[shardId] + 3), | ||
| message=f"Shard {shardId} AGG coordinator timeout warning should be +3 total") |
There was a problem hiding this comment.
This one can be flaky
the coordinator might return before the shards reach to the point where they update the timeout wraning
* Track timeout in sendchunk resp2 * Track timeout warning in sendSearchResults * Track timeout error in searchResultReducer (cherry picked from commit fafe1dc) * Track timeout in sendChunk_hybrid (cherry picked from commit 8529cb9) * test timeout metrics (cherry picked from commit c56a0d9) * fix isCoord check * Add query warning code and add function and fields needed to track (cherry picked from commit a414641) * Track timeout in sendchunk resp3 (cherry picked from commit 6853bb9) * readd skip * Update syntax and args error to new SA as cluster * format and enrico comment * Change test to N=0 with Internal only (not working so revert afterwards) * Revert "Change test to N=0 with Internal only (not working so revert afterwards)" This reverts commit 829ac53. * meirav comments * Stablize tests * Add resp3 test * _disable_ hybrid sa timeout * Make test robust * fixup! Make test robust * remove limits (cherry picked from commit 6eee260)
* Track timeout in sendchunk resp2 * Track timeout warning in sendSearchResults * Track timeout error in searchResultReducer (cherry picked from commit fafe1dc) * Track timeout in sendChunk_hybrid (cherry picked from commit 8529cb9) * test timeout metrics (cherry picked from commit c56a0d9) * fix isCoord check * Add query warning code and add function and fields needed to track (cherry picked from commit a414641) * Track timeout in sendchunk resp3 (cherry picked from commit 6853bb9) * readd skip * Update syntax and args error to new SA as cluster * format and enrico comment * Change test to N=0 with Internal only (not working so revert afterwards) * Revert "Change test to N=0 with Internal only (not working so revert afterwards)" This reverts commit 829ac53. * meirav comments * Stablize tests * Add resp3 test * _disable_ hybrid sa timeout * Make test robust * fixup! Make test robust * remove limits (cherry picked from commit 6eee260)
* [MOD-12419] Track Timeout errors and warnings in info (#7414) * Track timeout in sendchunk resp2 * Track timeout warning in sendSearchResults * Track timeout error in searchResultReducer (cherry picked from commit fafe1dc) * Track timeout in sendChunk_hybrid (cherry picked from commit 8529cb9) * test timeout metrics (cherry picked from commit c56a0d9) * fix isCoord check * Add query warning code and add function and fields needed to track (cherry picked from commit a414641) * Track timeout in sendchunk resp3 (cherry picked from commit 6853bb9) * readd skip * Update syntax and args error to new SA as cluster * format and enrico comment * Change test to N=0 with Internal only (not working so revert afterwards) * Revert "Change test to N=0 with Internal only (not working so revert afterwards)" This reverts commit 829ac53. * meirav comments * Stablize tests * Add resp3 test * _disable_ hybrid sa timeout * Make test robust * fixup! Make test robust * remove limits (cherry picked from commit 6eee260) * Align to 8.4
* [MOD-12419] Track Timeout errors and warnings in info ([#7414](#7414)) * Track timeout in sendchunk resp2 * Track timeout warning in sendSearchResults * Track timeout error in searchResultReducer (cherry picked from commit fafe1dc) * Track timeout in sendChunk_hybrid (cherry picked from commit 8529cb9) * test timeout metrics (cherry picked from commit c56a0d9) * fix isCoord check * Add query warning code and add function and fields needed to track (cherry picked from commit a414641) * Track timeout in sendchunk resp3 (cherry picked from commit 6853bb9) * readd skip * Update syntax and args error to new SA as cluster * format and enrico comment * Change test to N=0 with Internal only (not working so revert afterwards) * Revert "Change test to N=0 with Internal only (not working so revert afterwards)" This reverts commit 829ac53. * meirav comments * Stablize tests * Add resp3 test * _disable_ hybrid sa timeout * Make test robust * fixup! Make test robust * remove limits (cherry picked from commit 6eee260) * Align to 8.4 (cherry picked from commit ac92116)
* [MOD-12419] Track Timeout errors and warnings in info ([#7507](#7507)) * [MOD-12419] Track Timeout errors and warnings in info ([#7414](#7414)) * Track timeout in sendchunk resp2 * Track timeout warning in sendSearchResults * Track timeout error in searchResultReducer (cherry picked from commit fafe1dc) * Track timeout in sendChunk_hybrid (cherry picked from commit 8529cb9) * test timeout metrics (cherry picked from commit c56a0d9) * fix isCoord check * Add query warning code and add function and fields needed to track (cherry picked from commit a414641) * Track timeout in sendchunk resp3 (cherry picked from commit 6853bb9) * readd skip * Update syntax and args error to new SA as cluster * format and enrico comment * Change test to N=0 with Internal only (not working so revert afterwards) * Revert "Change test to N=0 with Internal only (not working so revert afterwards)" This reverts commit 829ac53. * meirav comments * Stablize tests * Add resp3 test * _disable_ hybrid sa timeout * Make test robust * fixup! Make test robust * remove limits (cherry picked from commit 6eee260) * Align to 8.4 (cherry picked from commit ac92116) * Fixes * Backport more logic
Continuing the work of #7386
This PR add tracking timeout warnings and errors in the global stats.
Note -
FT.HYBRIDthere is debug timeout mechanism in clusterFT.HYBRIDSA has a bug in timeout warningsNote
Track timeout errors and warnings across query paths, aggregate them in global stats, and expose them via INFO (with tests).
QueryErrorsGlobalStatsand newQueryWarningGlobalStatswithQueryWarningsGlobalStats_UpdateWarning.QueryErrorsGlobalStats_UpdateError.INFO MODULESwith new fields for shard/coord timeout errors/warnings.handleAndReplyWarning; adjust cleanup error accounting.QueryWarningCodeandQueryWarningCode_GetCodeFromMessagein FFI and headers.Written by Cursor Bugbot for commit be69839. This will update automatically on new commits. Configure here.