[MOD-12899] Add warning when Cursor may give inaccurate results because of ASM#7869
[MOD-12899] Add warning when Cursor may give inaccurate results because of ASM#7869
Conversation
bcd0a44 to
3d4b4e7
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #7869 +/- ##
==========================================
- Coverage 84.26% 84.23% -0.03%
==========================================
Files 360 360
Lines 56295 56334 +39
Branches 15717 15719 +2
==========================================
+ Hits 47437 47455 +18
- Misses 8642 8663 +21
Partials 216 216
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| CursorList *cl = getCursorList(true); | ||
| CursorList_Lock(cl); | ||
| Cursors_ForEach(cl, cursorMarkASMInaccuracyCb, NULL); | ||
| CursorList_Unlock(cl); | ||
| cl = getCursorList(false); | ||
| CursorList_Lock(cl); | ||
| Cursors_ForEach(cl, cursorMarkASMInaccuracyCb, NULL); | ||
| CursorList_Unlock(cl); |
There was a problem hiding this comment.
Did we test the case where we set the warning in internal cursors?
There was a problem hiding this comment.
I think is very hard to test that, and is very hard we would have that. But I think for completion is good to mark everything we have there.
There was a problem hiding this comment.
There is a pause mechanism introduced via FT.DEBUG command that may allow testing it more easily, consider using it (nota a must)
There was a problem hiding this comment.
feels like an overkill
| stats.shard_errors.unavailableSlots = READ(RSGlobalStats.totalStats.queries.shard_errors.unavailableSlots); | ||
| stats.coord_errors.unavailableSlots = READ(RSGlobalStats.totalStats.queries.coord_errors.unavailableSlots); |
There was a problem hiding this comment.
Automatically mapping from the error name to the Query error code.
I do not have a strong opinion, but this should anyway be handled in another PR. I think delaying too much can be more harmful to the normal operations of the engine than giving some inaccurate results on some set of queries |
| Self::UnavailableSlots => c"Query requires unavailable slots", | ||
| Self::AsmInaccurateResults => { | ||
| c"Query execution exceeded maximum delay for RediSearch to delay key trimming. Results may be incomplete or have duplications due to Atomic Slot Migration." | ||
| } |
There was a problem hiding this comment.
Warning message mismatch between Rust and C definitions
The AsmInaccurateResults warning message in Rust includes "or have duplications" but the C header definitions don't. The C messages say "Results may be incomplete due to Atomic Slot Migration" while Rust says "Results may be incomplete or have duplications due to Atomic Slot Migration." This inconsistency will cause tests to fail and creates confusion about what conditions are actually possible.
Additional Locations (2)
d281918 to
1603143
Compare
| break; | ||
| case QUERY_WARNING_CODE_ASM_INACCURATE_RESULTS: | ||
| INCR_BY(queries_warnings->asm_inaccuracy, toAdd); | ||
| break; |
There was a problem hiding this comment.
Missing handler for warning code in switch statement
The QueryWarningCode enum was extended with QUERY_WARNING_CODE_UNAVAILABLE_SLOTS in the header files, but QueryWarningsGlobalStats_UpdateWarning() has no case to handle this new value. If this warning code is ever passed to the function, it will silently do nothing instead of updating the statistics. The enum value should have a corresponding case that updates the appropriate stat counter, or it should be removed from the enum if not actually used.
@alonre24 I think it's too high for a default. This delay causes both memory spikes and blocks additional migrations. Because it also affects cursors, and their default idle time is 5 minutes, in a user-cursor use-cases we are likely to wait the full delay time. |
You mean give more time if the user does have user (coord) cursors? I think we can go with than compromise. @JoanFM WDYT? Should we add that in this PR? |
I think this should go into another PR. I also would vote for simplicity and keep as is now and potentially improve in the future. What would be the case if they have user cursors? wait 60s in that case? |
|
The other way around - if there are no user cursors, extend the max delay. |
Then I do not think complexity is justified at this moment |
|
/backport |
|
Backport failed for Please cherry-pick the changes locally and resolve any conflicts. git fetch origin 8.4
git worktree add -d .worktree/backport-7869-to-8.4 origin/8.4
cd .worktree/backport-7869-to-8.4
git switch --create backport-7869-to-8.4
git cherry-pick -x a47408423ca5e2c9e221c5f847c8a2f0eb1e82cc |
…because of ASM and add asm warnings in profile (#7869)(#7899)(#7913) (#8087) * manual backport of ASM warnings * remove some warning * test: change timeout test to not collide with RLTest timeout * asm warnings profile * add new case * test: add condition for source shard to pass migration * add back time sleep to test
Describe the changes in the pull request
A clear and concise description of what the PR is solving, including:
Which additional issues this PR fixes
Main objects this PR modified
Mark if applicable
Note
Introduces end-to-end handling for ASM-induced result inaccuracy during slot migration, including reply warnings, cursor state propagation, and global observability.
QEXEC_S_ASM_TRIMMING_DELAY_TIMEOUTtoQEStateFlags; sets it when ASM max delay elapses (viaCursorList_MarkASMInaccuracy()fromnotifications.c) and when shards return the new warning (rpnet.c)QUERY_ASM_INACCURATE_RESULTStowarningand updates global stats inaggregate_exec.cQUERY_WARNING_CODE_ASM_INACCURATE_RESULTSandQUERY_ERROR_CODE_UNAVAILABLE_SLOTS; exposes inINFO MODULES(coord/shard counters) viaglobal_stats.[ch]andinfo_redis.cquery_errorheaders and libCursorList_MarkASMInaccuracy) and propagates the flag in replies; initializescursor_reads/stateflagsrpnet.c)tests/pytests/test_asm.py) to validate RESP2/RESP3 cursor behavior, warnings text, and INFO countersWritten by Cursor Bugbot for commit a67aa8d. This will update automatically on new commits. Configure here.