Skip to content

[MOD-12899] Add warning when Cursor may give inaccurate results because of ASM#7869

Merged
JoanFM merged 12 commits intomasterfrom
joan-asm-warnings
Dec 24, 2025
Merged

[MOD-12899] Add warning when Cursor may give inaccurate results because of ASM#7869
JoanFM merged 12 commits intomasterfrom
joan-asm-warnings

Conversation

@JoanFM
Copy link
Collaborator

@JoanFM JoanFM commented Dec 23, 2025

Describe the changes in the pull request

A clear and concise description of what the PR is solving, including:

  1. Current: The current state briefly
  2. Change: What is the change
  3. Outcome: Adding the outcome

Which additional issues this PR fixes

  1. MOD-...
  2. #...

Main objects this PR modified

  1. ...

Mark if applicable

  • This PR introduces API changes
  • This PR introduces serialization changes

Note

Introduces end-to-end handling for ASM-induced result inaccuracy during slot migration, including reply warnings, cursor state propagation, and global observability.

  • Adds QEXEC_S_ASM_TRIMMING_DELAY_TIMEOUT to QEStateFlags; sets it when ASM max delay elapses (via CursorList_MarkASMInaccuracy() from notifications.c) and when shards return the new warning (rpnet.c)
  • Emits warnings in replies: RESP2 increments stats; RESP3 appends QUERY_ASM_INACCURATE_RESULTS to warning and updates global stats in aggregate_exec.c
  • Tracks new stats: QUERY_WARNING_CODE_ASM_INACCURATE_RESULTS and QUERY_ERROR_CODE_UNAVAILABLE_SLOTS; exposes in INFO MODULES (coord/shard counters) via global_stats.[ch] and info_redis.c
  • Extends Rust/C FFI with new warning codes and strings in query_error headers and lib
  • Adds cursor API to mark all active cursors as inaccurate (CursorList_MarkASMInaccuracy) and propagates the flag in replies; initializes cursor_reads/stateflags
  • Updates coordinator pipeline to parse/propagate ASM warnings from shards (rpnet.c)
  • Expands tests (tests/pytests/test_asm.py) to validate RESP2/RESP3 cursor behavior, warnings text, and INFO counters

Written by Cursor Bugbot for commit a67aa8d. This will update automatically on new commits. Configure here.

@JoanFM JoanFM changed the title ASM: Add warning when Cursor may give inaccurate results because of ASM [MOD-12899] Add warning when Cursor may give inaccurate results because of ASM Dec 23, 2025
@JoanFM JoanFM marked this pull request as ready for review December 23, 2025 16:12
@JoanFM JoanFM requested review from GuyAv46 and alonre24 December 23, 2025 16:12
@JoanFM JoanFM self-assigned this Dec 23, 2025
@codecov
Copy link

codecov bot commented Dec 23, 2025

Codecov Report

❌ Patch coverage is 87.50000% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.23%. Comparing base (701f719) to head (a67aa8d).
⚠️ Report is 73 commits behind head on master.

Files with missing lines Patch % Lines
src/info/global_stats.c 70.00% 3 Missing ⚠️
src/redisearch_rs/query_error/src/lib.rs 0.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7869      +/-   ##
==========================================
- Coverage   84.26%   84.23%   -0.03%     
==========================================
  Files         360      360              
  Lines       56295    56334      +39     
  Branches    15717    15719       +2     
==========================================
+ Hits        47437    47455      +18     
- Misses       8642     8663      +21     
  Partials      216      216              
Flag Coverage Δ
flow 85.23% <92.10%> (-0.09%) ⬇️
unit 52.05% <5.00%> (-0.04%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Collaborator

@alonre24 alonre24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Few questions
Also, what do you think increase the max trimming default from 5 seconds to ~60 seconds, to be on the safe side with long run queries? @GuyAv46 @JoanFM

Comment on lines +214 to +221
CursorList *cl = getCursorList(true);
CursorList_Lock(cl);
Cursors_ForEach(cl, cursorMarkASMInaccuracyCb, NULL);
CursorList_Unlock(cl);
cl = getCursorList(false);
CursorList_Lock(cl);
Cursors_ForEach(cl, cursorMarkASMInaccuracyCb, NULL);
CursorList_Unlock(cl);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did we test the case where we set the warning in internal cursors?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think is very hard to test that, and is very hard we would have that. But I think for completion is good to mark everything we have there.

Copy link
Collaborator

@alonre24 alonre24 Dec 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a pause mechanism introduced via FT.DEBUG command that may allow testing it more easily, consider using it (nota a must)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

feels like an overkill

Comment on lines +105 to +106
stats.shard_errors.unavailableSlots = READ(RSGlobalStats.totalStats.queries.shard_errors.unavailableSlots);
stats.coord_errors.unavailableSlots = READ(RSGlobalStats.totalStats.queries.coord_errors.unavailableSlots);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is that set?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automatically mapping from the error name to the Query error code.

@JoanFM
Copy link
Collaborator Author

JoanFM commented Dec 23, 2025

Nice! Few questions Also, what do you think increase the max trimming default from 5 seconds to ~60 seconds, to be on the safe side with long run queries? @GuyAv46 @JoanFM

I do not have a strong opinion, but this should anyway be handled in another PR.

I think delaying too much can be more harmful to the normal operations of the engine than giving some inaccurate results on some set of queries

@JoanFM JoanFM requested a review from alonre24 December 23, 2025 18:07
Self::UnavailableSlots => c"Query requires unavailable slots",
Self::AsmInaccurateResults => {
c"Query execution exceeded maximum delay for RediSearch to delay key trimming. Results may be incomplete or have duplications due to Atomic Slot Migration."
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning message mismatch between Rust and C definitions

The AsmInaccurateResults warning message in Rust includes "or have duplications" but the C header definitions don't. The C messages say "Results may be incomplete due to Atomic Slot Migration" while Rust says "Results may be incomplete or have duplications due to Atomic Slot Migration." This inconsistency will cause tests to fail and creates confusion about what conditions are actually possible.

Additional Locations (2)

Fix in Cursor Fix in Web

break;
case QUERY_WARNING_CODE_ASM_INACCURATE_RESULTS:
INCR_BY(queries_warnings->asm_inaccuracy, toAdd);
break;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing handler for warning code in switch statement

The QueryWarningCode enum was extended with QUERY_WARNING_CODE_UNAVAILABLE_SLOTS in the header files, but QueryWarningsGlobalStats_UpdateWarning() has no case to handle this new value. If this warning code is ever passed to the function, it will silently do nothing instead of updating the statistics. The enum value should have a corresponding case that updates the appropriate stat counter, or it should be removed from the enum if not actually used.

Fix in Cursor Fix in Web

@JoanFM JoanFM marked this pull request as draft December 24, 2025 07:18
@JoanFM JoanFM marked this pull request as ready for review December 24, 2025 08:20
@GuyAv46
Copy link
Collaborator

GuyAv46 commented Dec 24, 2025

What do you think about increasing the max trimming default from 5 seconds to ~60 seconds, to be on the safe side with long run queries?

@alonre24 I think it's too high for a default. This delay causes both memory spikes and blocks additional migrations. Because it also affects cursors, and their default idle time is 5 minutes, in a user-cursor use-cases we are likely to wait the full delay time.
Maybe we can detect the use case and give more time if the user doesn't have user (coord) cursors?

@alonre24
Copy link
Collaborator

What do you think about increasing the max trimming default from 5 seconds to ~60 seconds, to be on the safe side with long run queries?

@alonre24 I think it's too high for a default. This delay causes both memory spikes and blocks additional migrations. Because it also affects cursors, and their default idle time is 5 minutes, in a user-cursor use-cases we are likely to wait the full delay time. Maybe we can detect the use case and give more time if the user doesn't have user (coord) cursors?

You mean give more time if the user does have user (coord) cursors? I think we can go with than compromise. @JoanFM WDYT? Should we add that in this PR?

@JoanFM
Copy link
Collaborator Author

JoanFM commented Dec 24, 2025

What do you think about increasing the max trimming default from 5 seconds to ~60 seconds, to be on the safe side with long run queries?

@alonre24 I think it's too high for a default. This delay causes both memory spikes and blocks additional migrations. Because it also affects cursors, and their default idle time is 5 minutes, in a user-cursor use-cases we are likely to wait the full delay time. Maybe we can detect the use case and give more time if the user doesn't have user (coord) cursors?

You mean give more time if the user does have user (coord) cursors? I think we can go with than compromise. @JoanFM WDYT? Should we add that in this PR?

I think this should go into another PR. I also would vote for simplicity and keep as is now and potentially improve in the future. What would be the case if they have user cursors? wait 60s in that case?

@GuyAv46
Copy link
Collaborator

GuyAv46 commented Dec 24, 2025

The other way around - if there are no user cursors, extend the max delay.
But I’m not sure it’s a simple check, we need to know if a local internal cursor is for a user cursor on another shard. We need to add a marker to the internal command or something.
let’s think about it and do it in another PR

@JoanFM
Copy link
Collaborator Author

JoanFM commented Dec 24, 2025

The other way around - if there are no user cursors, extend the max delay. But I’m not sure it’s a simple check, we need to know if a local internal cursor is for a user cursor on another shard. We need to add a marker to the internal command or something. let’s think about it and do it in another PR

Then I do not think complexity is justified at this moment

@JoanFM JoanFM added this pull request to the merge queue Dec 24, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Dec 24, 2025
@JoanFM JoanFM added this pull request to the merge queue Dec 24, 2025
Merged via the queue into master with commit a474084 Dec 24, 2025
55 of 64 checks passed
@JoanFM JoanFM deleted the joan-asm-warnings branch December 24, 2025 17:13
@JoanFM
Copy link
Collaborator Author

JoanFM commented Jan 19, 2026

/backport

@redisearch-backport-pull-request
Copy link
Contributor

Backport failed for 8.4, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin 8.4
git worktree add -d .worktree/backport-7869-to-8.4 origin/8.4
cd .worktree/backport-7869-to-8.4
git switch --create backport-7869-to-8.4
git cherry-pick -x a47408423ca5e2c9e221c5f847c8a2f0eb1e82cc

github-merge-queue bot pushed a commit that referenced this pull request Jan 20, 2026
…because of ASM (#7869) (#8087)

* manual backport of ASM warnings

* remove some warning
github-merge-queue bot pushed a commit that referenced this pull request Jan 20, 2026
…because of ASM and add asm warnings in profile (#7869)(#7899)(#7913) (#8087)

* manual backport of ASM warnings

* remove some warning

* test: change timeout test to not collide with RLTest timeout

* asm warnings profile

* add new case
github-merge-queue bot pushed a commit that referenced this pull request Jan 20, 2026
…because of ASM and add asm warnings in profile (#7869)(#7899)(#7913) (#8087)

* manual backport of ASM warnings

* remove some warning

* test: change timeout test to not collide with RLTest timeout

* asm warnings profile

* add new case

* test: add condition for source shard to pass migration

* add back time sleep to test
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants