Skip to content

[MOD-12171] [MOD-12169] Implement ASM-aware search flow#7343

Merged
JoanFM merged 282 commits intomasterfrom
joan-mod-17271
Dec 17, 2025
Merged

[MOD-12171] [MOD-12169] Implement ASM-aware search flow#7343
JoanFM merged 282 commits intomasterfrom
joan-mod-17271

Conversation

@JoanFM
Copy link
Collaborator

@JoanFM JoanFM commented Nov 12, 2025

Describe the changes in the pull request

Put all he pieces together to make sure that queries respect ASM slot information at query time.

Depending on redis/redis#14567

Main ideas:

  • When parsing the SLOTS INFO from the requests, store the KEYSPACE version returned by the slots tracker or reject the query.

  • In result processor potentially filter out some results if the query keyspace version does not match the keyspace version of the server at that moment (Some migration happens that means that some of the results should not be owned by the shard and therefore discarded).

  • Once the migration is COMPLETE; make sure to delay trimming until one of these 2 conditions is met:
    - 5 seconds have passed since the completion of migration
    - 2 seconds have passed and all the queries run with the keyspace version before trimming have finished *

  • How do we keep track of the keyspace versions that are being run.

  • ASM keeps track with a hashmap of a counter of queries per keyspace version. These counter are increased and decreased in a thread safe manner:

    • Increase when parsing happens (In Hybrid twice (one per subquery))
    • Decrease when the free callback of the blocked client is called in Background queries
    • Decrease when AREQ_Free (when AREQ is no longer needed release the active query resource in a RAII manner)

Note

Implements ASM-aware distributed search by tracking keyspace versions, validating requested slots, filtering results on version mismatch, and adding configurable trimming delays with lifecycle integration.

  • Core query/exec flow:

    • Replace slotsVersion with keySpaceVersion across AREQ, parse contexts, pipeline params, and result processor.
    • Parse _SLOTS_INFO, validate via slots_tracker_check_availability, set/increment keySpaceVersion; reject with UnavailableSlots on failure.
    • Result filtering: when querySlots present and server key_space_version differs from request keySpaceVersion, discard docs not in querySlots.
    • Add DOCUMENT_OPEN_KEY_ACCESS_TRIMMED to query flags.
  • ASM state machine & tracking:

    • New asm_state_machine with global key_space_version, slot state transitions, and query-count tracking per version (thread-safe; sanitizer hooks).
    • Public init/end and helpers; integrate at module init/shutdown and topology updates.
  • Topology & notifications:

    • Use ASM_StateMachine_SetLocalSlots on topology refresh; reschedule updater on ASM events.
    • Implement trim enablement delays with timers: min/max delay and periodic checks; enable/disable trimming accordingly.
  • Config:

    • New tunables: _MIN_TRIM_DELAY_MS, _MAX_TRIM_DELAY_MS, _TRIMMING_STATE_CHECK_DELAY_MS with validation, GET/SET, defaults.
  • APIs & errors:

    • Add QUERY_ERROR_CODE_UNAVAILABLE_SLOTS (C/Rust) and propagate in coordinator.
    • Slots API: remove cached local slots helpers; add SlotRangeArray_ContainsSlot.
  • Hybrid parsing:

    • Honor _SLOTS_INFO with validation; propagate keySpaceVersion to subqueries and track query counts.
  • Tests:

    • Update/expand C/C++/Python tests for ASM flow, slots validation, trimming delays, topology, and cursor behavior; adjust slot range helpers and ACL tests.

Written by Cursor Bugbot for commit d782e73. This will update automatically on new commits. Configure here.

JoanFM added 30 commits November 5, 2025 17:15
…hreads may be waiting for the Shared Lock in the loop
add_shard_and_migrate_test(env, 'FT.AGGREGATE.WITHCURSOR')

#TODO: Enable once MOD-13110 is fixed
@skip
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this work? I was under the assumption that we must set the parentheses (@skip())

static bool ASM_CanStartTrimming(void) {
uint32_t current_version = __atomic_load_n(&key_space_version, __ATOMIC_RELAXED);
return ASM_KeySpaceVersionTracker_GetQueryCount(current_version) == 0;
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Trimming readiness checks wrong version

ASM_CanStartTrimming checks ASM_KeySpaceVersionTracker_GetQueryCount(current_version), but the surrounding logic/comments refer to waiting for queries on the “old” version to finish. After a version bump, this can enable trimming while old-version queries are still running (or block trimming based on unrelated new-version queries).

Fix in Cursor Fix in Web

@JoanFM JoanFM requested a review from GuyAv46 December 17, 2025 13:23
GuyAv46
GuyAv46 previously approved these changes Dec 17, 2025
Copy link
Collaborator

@GuyAv46 GuyAv46 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🏆 🍾 🥂

self.fully_available = SlotSet::new();
self.partially_available = SlotSet::new();
self.increment_version();
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Local slots update clears availability tracking

SlotsTracker::set_local_slots now resets fully_available and partially_available to empty instead of only removing overlaps with the new local ranges. This can drop in-progress migration/trim state (and its “unstable” behavior), making slots_tracker_check_availability report stability/coverage incorrectly and potentially skipping required filtering during topology transitions.

Fix in Cursor Fix in Web

@JoanFM JoanFM requested a review from GuyAv46 December 17, 2025 16:00
self.fully_available = SlotSet::new();
self.partially_available = SlotSet::new();
self.increment_version();
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Topology refresh wipes ASM migration state

SlotsTracker::set_local_slots now resets fully_available and partially_available to empty instead of removing only overlapping ranges. When ASM_StateMachine_SetLocalSlots runs on periodic topology updates, any ongoing migration/trim state for other slots is lost, so check_availability can incorrectly accept queries or skip required filtering/rejection.

Fix in Cursor Fix in Web

Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Slot filtering uses outdated slot ranges

When key_space_version changes mid-query, rpQueryItNext filters results using self->querySlots (the request’s original slot set). If the shard’s owned slots changed (migration/topology update), this check won’t drop documents from slots the shard no longer owns, so internal shard queries can return stale/misowned results after migration.

src/result_processor.c#L149-L158

}
// querySlots presence would indicate that is internal command, if querySlots is NULL, we don't need to filter as we would be in standalone.
if (self->querySlots && (__atomic_load_n(&key_space_version, __ATOMIC_RELAXED) != self->keySpaceVersion)) {
RS_ASSERT(self->querySlots != NULL);
int slot = RedisModule_ClusterKeySlotC(dmd->keyPtr, sdslen(dmd->keyPtr));
if (!SlotRangeArray_ContainsSlot(self->querySlots, slot)) {
DMD_Return(dmd);
continue;
}
}

Fix in Cursor Fix in Web


@JoanFM JoanFM added this pull request to the merge queue Dec 17, 2025
@JoanFM JoanFM removed this pull request from the merge queue due to a manual request Dec 17, 2025
@JoanFM JoanFM added this pull request to the merge queue Dec 17, 2025
Merged via the queue into master with commit a3f5468 Dec 17, 2025
26 checks passed
@JoanFM JoanFM deleted the joan-mod-17271 branch December 17, 2025 23:00
@redisearch-backport-pull-request
Copy link
Contributor

Backport failed for 8.4, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin 8.4
git worktree add -d .worktree/backport-7343-to-8.4 origin/8.4
cd .worktree/backport-7343-to-8.4
git switch --create backport-7343-to-8.4
git cherry-pick -x a3f5468b514cf8d4a57cf43ba7617f0e027ab5f3

JoanFM added a commit that referenced this pull request Dec 18, 2025
* fix: fix counts in all variables

* small refactor

* some refactoring of Shared Exclusive Lock

* simplify

* protect GILOwned simple bool

* clarify some comments

* set GILAlternativeLockHeld to true properly

* fix spelling

* add assertion

* fix: fix import RS_LOG_ASSERT

* test: add more conditions to testing

* add some more logic

* improve testing to proper signal main thread can finish while other threads may be waiting for the Shared Lock in the loop

* test: handl test properly

* add another pattern of tests

* fix comments from cursor

* fix concurrency bug

* fix: fix potential race condition at release lock time

* fix: add condition

* test: add more testing

* force testing further to capture more potential errors

* parametrize tests

* test: make tests a little faster

* test: add micro benchmark

* fix compile microbenchmarks

* fix: avoid potential reentrant deadlocks

* fix: avoid potential reentrant deadlocks

* test: avoid leak in test

* test: avoid leak in test

* fix: fix assertion

* fix: fix assertion

* Simplify Shared Lock internals (#7267)

* simplify shared lock

* small improvement to set_timeout

* fix comment

* fix nd improve comments

* condition fix

* remove lock type from release API

* Add lock type back to the release API

* remove Unlocked from enum and handle clock init for macOS

* adapt to use new API

* chnange according to comments

* handle PR comments

* handle PR comments

* fix: take shared lock in other cases

* change as PR comments

* test: simplify test, do not allow query errors

* fix tests as per comments

* fix: handle number of high priority jobs running

* fix: fix test comment

* add ASM to help slot tracking in notifications

* test: add some testing idea

* change as per PR comments

* compile and link test fix

* change draining method to drain high priority

* add ctests for ASM State Machine

* test: add ASM tests

* test: complete tests

* remove SharedSlotRangeArray from code

* implement basic logic

* change as per PR comments

* fix: remove latest SlotRange reference

* set local slots on UpdateTopology

* change as per PR comments

* fix: fix query error

* some code changes needed

* handle tests

* fix compilation in Linux

* fix loading func in module

* fix tests hybrid internal standalone

* fix test standalone acl

* fix test standalone asm

* test: handle tests internal to different

* handle more tests slowly

* improve testing

* improve testing

* fix sanitizer errors in tests

* remove ASM comment

* small improvement commenting

* schedule refresh topology in timer

* remove println

* test: fix test ASM

* test: improve ASM test

* further improve

* improve asm tests

* fix formatting issue

* try to handle hybrid test

* test: change test to compare scores

* improve testing a little

* add micro benchmarks with jobs in threads

* fix: fail on error from coordinator

* small changes in test asm

* remove changes not wanted

* checkout redis feature branch in task test

* small change

* remove draining

* move atomic to new header

* remove _internal naming

* fix formatting

* add structure of timers

* add initial version of KeySpaceVersionTracker

* implementation and unit testing of KeySpaceVersionTracker

* Plugin keyspace version tracker counters into the flow

* small code comment

* align with trimming redismodule.h

* install feature redis in CI workflow

* plug new RedisModule API to delay trimming

* remove sleep from test

* fix values and enable logging

* make query version map extern to avoid different instances per compilation unit

* rename vars

* small cleanup and rename

* use debug print

* query all shards

* fix linting issue

* align with latest commit from Redis dependency

* fix on RedisModulesSDK

* do not rely on TimerID for initialization state

* handle ASM Decrease Query Count

* fix access to keyspace version

* fix: remove ASM count from AREQ_Free

* more progress towards keyspace query counting

* handle some small PR comments

* avoid cplusplus code duplication

* align test with no cplusplus deduplication

* add helper to identify leaks of query counts on sanitizer

* add keyspaceversion to BockQueryNode

* control leaks of cursor query count when queries fail

* check if is hybrid request from AREQ itself

* explain better the should_filter_slots

* refactor how to handle topology timer reset

* remove decrease of count on cursor Read

* align with dedicated API for cluster enable/disable trimming

* make query cursor tracker map thread safe

* remove some changes

* handle Decrease of Query Count for each AREQ individually in a RAII manner

* small cleanup of PR

* fix leak: decrease without previous increase

* add log to debug

* change in github workflow

* fix a test

* Apply suggestions from code review

* all ranges are owned in StandAlone cluster

* remove txt from pytests

* all slots dynamic allocation

* fix compilation warning

* register new API to c_symbols.rs

* proper mutex lifetime management

* degine mutex as extern to avoid duplication in translation units

* reschedule with slots

* fix: fix potential leak and correct log message condition

* reset slots when local slots are updated

* fix: accept not normlized slot ranges as input

* Revert "fix: accept not normlized slot ranges as input"

This reverts commit 1babfed.

* add configuration for max min delay trimming ms

* Apply suggestions from code review

Co-authored-by: GuyAv46 <[email protected]>

* apply some suggestion changes

* fix test acl

* proper verbose level

* remove inlining and create struct in notifications

* make config underscore

* add comments in code and fix typo

* add configuration for delay check ms

* avoid trim delay if no Indices found (no module usage)

* Update tests/pytests/test_hybrid_internal.py

Co-authored-by: GuyAv46 <[email protected]>

* refactor asm state machine sanitizer code

* ASM_StartTrim properly

* refactor test_hybrid_internal

* small refactor test asm

* add tests for FT AGGREGATE WITHCURSORS

* fix naming to be private of config

* test: handle test where cursor returns empty result after trimming starts

* add proper trimmed tests

* adapt test asm

* fix use after free connection

* use arr.h to avoid reallocating in ASM Sanitizer

* Apply suggestions from code review

Co-authored-by: GuyAv46 <[email protected]>

* test against redis unstable

* update RedisModulesSDK

* fix how to pass hybrid query vectors in query

* fix .gitmodules back

* update RedisModulesSDK

* proper error handling

* clarify invariants with an ASSERT

* skip while no fix is found

* do not assert, stop timers if they are scheduled

* reply to cursor comments

* small refactoring notifications

* improve logging

* add test for empty indices

* test: improve testing

* small fix

* small test change

* include timeout message

* test no indexes

* adapt changes

* make test longer to cover timers

* fix test

* fix: avoid memory leak of ever growing dict

* Update src/asm_state_machine.h

Co-authored-by: GuyAv46 <[email protected]>

---------

Co-authored-by: GuyAv46 <[email protected]>
(cherry picked from commit a3f5468)
@JoanFM
Copy link
Collaborator Author

JoanFM commented Jan 19, 2026

/backport

@redisearch-backport-pull-request
Copy link
Contributor

Backport failed for 8.4, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin 8.4
git worktree add -d .worktree/backport-7343-to-8.4 origin/8.4
cd .worktree/backport-7343-to-8.4
git switch --create backport-7343-to-8.4
git cherry-pick -x a3f5468b514cf8d4a57cf43ba7617f0e027ab5f3

github-merge-queue bot pushed a commit that referenced this pull request Jan 19, 2026
…7829)

* [MOD-12171] [MOD-12169] Implement ASM-aware search flow (#7343)

* fix: fix counts in all variables

* small refactor

* some refactoring of Shared Exclusive Lock

* simplify

* protect GILOwned simple bool

* clarify some comments

* set GILAlternativeLockHeld to true properly

* fix spelling

* add assertion

* fix: fix import RS_LOG_ASSERT

* test: add more conditions to testing

* add some more logic

* improve testing to proper signal main thread can finish while other threads may be waiting for the Shared Lock in the loop

* test: handl test properly

* add another pattern of tests

* fix comments from cursor

* fix concurrency bug

* fix: fix potential race condition at release lock time

* fix: add condition

* test: add more testing

* force testing further to capture more potential errors

* parametrize tests

* test: make tests a little faster

* test: add micro benchmark

* fix compile microbenchmarks

* fix: avoid potential reentrant deadlocks

* fix: avoid potential reentrant deadlocks

* test: avoid leak in test

* test: avoid leak in test

* fix: fix assertion

* fix: fix assertion

* Simplify Shared Lock internals (#7267)

* simplify shared lock

* small improvement to set_timeout

* fix comment

* fix nd improve comments

* condition fix

* remove lock type from release API

* Add lock type back to the release API

* remove Unlocked from enum and handle clock init for macOS

* adapt to use new API

* chnange according to comments

* handle PR comments

* handle PR comments

* fix: take shared lock in other cases

* change as PR comments

* test: simplify test, do not allow query errors

* fix tests as per comments

* fix: handle number of high priority jobs running

* fix: fix test comment

* add ASM to help slot tracking in notifications

* test: add some testing idea

* change as per PR comments

* compile and link test fix

* change draining method to drain high priority

* add ctests for ASM State Machine

* test: add ASM tests

* test: complete tests

* remove SharedSlotRangeArray from code

* implement basic logic

* change as per PR comments

* fix: remove latest SlotRange reference

* set local slots on UpdateTopology

* change as per PR comments

* fix: fix query error

* some code changes needed

* handle tests

* fix compilation in Linux

* fix loading func in module

* fix tests hybrid internal standalone

* fix test standalone acl

* fix test standalone asm

* test: handle tests internal to different

* handle more tests slowly

* improve testing

* improve testing

* fix sanitizer errors in tests

* remove ASM comment

* small improvement commenting

* schedule refresh topology in timer

* remove println

* test: fix test ASM

* test: improve ASM test

* further improve

* improve asm tests

* fix formatting issue

* try to handle hybrid test

* test: change test to compare scores

* improve testing a little

* add micro benchmarks with jobs in threads

* fix: fail on error from coordinator

* small changes in test asm

* remove changes not wanted

* checkout redis feature branch in task test

* small change

* remove draining

* move atomic to new header

* remove _internal naming

* fix formatting

* add structure of timers

* add initial version of KeySpaceVersionTracker

* implementation and unit testing of KeySpaceVersionTracker

* Plugin keyspace version tracker counters into the flow

* small code comment

* align with trimming redismodule.h

* install feature redis in CI workflow

* plug new RedisModule API to delay trimming

* remove sleep from test

* fix values and enable logging

* make query version map extern to avoid different instances per compilation unit

* rename vars

* small cleanup and rename

* use debug print

* query all shards

* fix linting issue

* align with latest commit from Redis dependency

* fix on RedisModulesSDK

* do not rely on TimerID for initialization state

* handle ASM Decrease Query Count

* fix access to keyspace version

* fix: remove ASM count from AREQ_Free

* more progress towards keyspace query counting

* handle some small PR comments

* avoid cplusplus code duplication

* align test with no cplusplus deduplication

* add helper to identify leaks of query counts on sanitizer

* add keyspaceversion to BockQueryNode

* control leaks of cursor query count when queries fail

* check if is hybrid request from AREQ itself

* explain better the should_filter_slots

* refactor how to handle topology timer reset

* remove decrease of count on cursor Read

* align with dedicated API for cluster enable/disable trimming

* make query cursor tracker map thread safe

* remove some changes

* handle Decrease of Query Count for each AREQ individually in a RAII manner

* small cleanup of PR

* fix leak: decrease without previous increase

* add log to debug

* change in github workflow

* fix a test

* Apply suggestions from code review

* all ranges are owned in StandAlone cluster

* remove txt from pytests

* all slots dynamic allocation

* fix compilation warning

* register new API to c_symbols.rs

* proper mutex lifetime management

* degine mutex as extern to avoid duplication in translation units

* reschedule with slots

* fix: fix potential leak and correct log message condition

* reset slots when local slots are updated

* fix: accept not normlized slot ranges as input

* Revert "fix: accept not normlized slot ranges as input"

This reverts commit 1babfed.

* add configuration for max min delay trimming ms

* Apply suggestions from code review

Co-authored-by: GuyAv46 <[email protected]>

* apply some suggestion changes

* fix test acl

* proper verbose level

* remove inlining and create struct in notifications

* make config underscore

* add comments in code and fix typo

* add configuration for delay check ms

* avoid trim delay if no Indices found (no module usage)

* Update tests/pytests/test_hybrid_internal.py

Co-authored-by: GuyAv46 <[email protected]>

* refactor asm state machine sanitizer code

* ASM_StartTrim properly

* refactor test_hybrid_internal

* small refactor test asm

* add tests for FT AGGREGATE WITHCURSORS

* fix naming to be private of config

* test: handle test where cursor returns empty result after trimming starts

* add proper trimmed tests

* adapt test asm

* fix use after free connection

* use arr.h to avoid reallocating in ASM Sanitizer

* Apply suggestions from code review

Co-authored-by: GuyAv46 <[email protected]>

* test against redis unstable

* update RedisModulesSDK

* fix how to pass hybrid query vectors in query

* fix .gitmodules back

* update RedisModulesSDK

* proper error handling

* clarify invariants with an ASSERT

* skip while no fix is found

* do not assert, stop timers if they are scheduled

* reply to cursor comments

* small refactoring notifications

* improve logging

* add test for empty indices

* test: improve testing

* small fix

* small test change

* include timeout message

* test no indexes

* adapt changes

* make test longer to cover timers

* fix test

* fix: avoid memory leak of ever growing dict

* Update src/asm_state_machine.h

Co-authored-by: GuyAv46 <[email protected]>

---------

Co-authored-by: GuyAv46 <[email protected]>
(cherry picked from commit a3f5468)

* align

* fix compilation

* ci: fix

* update RedisModulesSDK

* remove unnecessary imports

* adapt to get 8.4 in checkout

* test: add test that ft hybrid does not accept slots info multiple times

* add dummy step in ci

* clean TODO

* Fix C++ compilation error by removing unused <stdatomic.h> include

---------

Co-authored-by: GuyAv46 <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport 8.4 enforce:coverage Run coverage flow even on draft pull request enforce:sanitize Run sanitizer flow even on draft pull request enforce:test Run the test suite even if no code has changed. size:L size:M size:XL

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants