[MOD-12171] [MOD-12169] Implement ASM-aware search flow#7343
[MOD-12171] [MOD-12169] Implement ASM-aware search flow#7343
Conversation
…nto joan-drain-high-priority
…nto joan-drain-high-priority
…nto joan-drain-high-priority
…hreads may be waiting for the Shared Lock in the loop
…nto joan-drain-high-priority
…nto joan-drain-high-priority
| add_shard_and_migrate_test(env, 'FT.AGGREGATE.WITHCURSOR') | ||
|
|
||
| #TODO: Enable once MOD-13110 is fixed | ||
| @skip |
There was a problem hiding this comment.
Does this work? I was under the assumption that we must set the parentheses (@skip())
a3d21ff to
b474008
Compare
| static bool ASM_CanStartTrimming(void) { | ||
| uint32_t current_version = __atomic_load_n(&key_space_version, __ATOMIC_RELAXED); | ||
| return ASM_KeySpaceVersionTracker_GetQueryCount(current_version) == 0; | ||
| } |
There was a problem hiding this comment.
Bug: Trimming readiness checks wrong version
ASM_CanStartTrimming checks ASM_KeySpaceVersionTracker_GetQueryCount(current_version), but the surrounding logic/comments refer to waiting for queries on the “old” version to finish. After a version bump, this can enable trimming while old-version queries are still running (or block trimming based on unrelated new-version queries).
| self.fully_available = SlotSet::new(); | ||
| self.partially_available = SlotSet::new(); | ||
| self.increment_version(); | ||
| } |
There was a problem hiding this comment.
Bug: Local slots update clears availability tracking
SlotsTracker::set_local_slots now resets fully_available and partially_available to empty instead of only removing overlaps with the new local ranges. This can drop in-progress migration/trim state (and its “unstable” behavior), making slots_tracker_check_availability report stability/coverage incorrectly and potentially skipping required filtering during topology transitions.
Co-authored-by: GuyAv46 <[email protected]>
| self.fully_available = SlotSet::new(); | ||
| self.partially_available = SlotSet::new(); | ||
| self.increment_version(); | ||
| } |
There was a problem hiding this comment.
Bug: Topology refresh wipes ASM migration state
SlotsTracker::set_local_slots now resets fully_available and partially_available to empty instead of removing only overlapping ranges. When ASM_StateMachine_SetLocalSlots runs on periodic topology updates, any ongoing migration/trim state for other slots is lost, so check_availability can incorrectly accept queries or skip required filtering/rejection.
There was a problem hiding this comment.
Bug: Slot filtering uses outdated slot ranges
When key_space_version changes mid-query, rpQueryItNext filters results using self->querySlots (the request’s original slot set). If the shard’s owned slots changed (migration/topology update), this check won’t drop documents from slots the shard no longer owns, so internal shard queries can return stale/misowned results after migration.
src/result_processor.c#L149-L158
RediSearch/src/result_processor.c
Lines 149 to 158 in d782e73
|
Backport failed for Please cherry-pick the changes locally and resolve any conflicts. git fetch origin 8.4
git worktree add -d .worktree/backport-7343-to-8.4 origin/8.4
cd .worktree/backport-7343-to-8.4
git switch --create backport-7343-to-8.4
git cherry-pick -x a3f5468b514cf8d4a57cf43ba7617f0e027ab5f3 |
* fix: fix counts in all variables * small refactor * some refactoring of Shared Exclusive Lock * simplify * protect GILOwned simple bool * clarify some comments * set GILAlternativeLockHeld to true properly * fix spelling * add assertion * fix: fix import RS_LOG_ASSERT * test: add more conditions to testing * add some more logic * improve testing to proper signal main thread can finish while other threads may be waiting for the Shared Lock in the loop * test: handl test properly * add another pattern of tests * fix comments from cursor * fix concurrency bug * fix: fix potential race condition at release lock time * fix: add condition * test: add more testing * force testing further to capture more potential errors * parametrize tests * test: make tests a little faster * test: add micro benchmark * fix compile microbenchmarks * fix: avoid potential reentrant deadlocks * fix: avoid potential reentrant deadlocks * test: avoid leak in test * test: avoid leak in test * fix: fix assertion * fix: fix assertion * Simplify Shared Lock internals (#7267) * simplify shared lock * small improvement to set_timeout * fix comment * fix nd improve comments * condition fix * remove lock type from release API * Add lock type back to the release API * remove Unlocked from enum and handle clock init for macOS * adapt to use new API * chnange according to comments * handle PR comments * handle PR comments * fix: take shared lock in other cases * change as PR comments * test: simplify test, do not allow query errors * fix tests as per comments * fix: handle number of high priority jobs running * fix: fix test comment * add ASM to help slot tracking in notifications * test: add some testing idea * change as per PR comments * compile and link test fix * change draining method to drain high priority * add ctests for ASM State Machine * test: add ASM tests * test: complete tests * remove SharedSlotRangeArray from code * implement basic logic * change as per PR comments * fix: remove latest SlotRange reference * set local slots on UpdateTopology * change as per PR comments * fix: fix query error * some code changes needed * handle tests * fix compilation in Linux * fix loading func in module * fix tests hybrid internal standalone * fix test standalone acl * fix test standalone asm * test: handle tests internal to different * handle more tests slowly * improve testing * improve testing * fix sanitizer errors in tests * remove ASM comment * small improvement commenting * schedule refresh topology in timer * remove println * test: fix test ASM * test: improve ASM test * further improve * improve asm tests * fix formatting issue * try to handle hybrid test * test: change test to compare scores * improve testing a little * add micro benchmarks with jobs in threads * fix: fail on error from coordinator * small changes in test asm * remove changes not wanted * checkout redis feature branch in task test * small change * remove draining * move atomic to new header * remove _internal naming * fix formatting * add structure of timers * add initial version of KeySpaceVersionTracker * implementation and unit testing of KeySpaceVersionTracker * Plugin keyspace version tracker counters into the flow * small code comment * align with trimming redismodule.h * install feature redis in CI workflow * plug new RedisModule API to delay trimming * remove sleep from test * fix values and enable logging * make query version map extern to avoid different instances per compilation unit * rename vars * small cleanup and rename * use debug print * query all shards * fix linting issue * align with latest commit from Redis dependency * fix on RedisModulesSDK * do not rely on TimerID for initialization state * handle ASM Decrease Query Count * fix access to keyspace version * fix: remove ASM count from AREQ_Free * more progress towards keyspace query counting * handle some small PR comments * avoid cplusplus code duplication * align test with no cplusplus deduplication * add helper to identify leaks of query counts on sanitizer * add keyspaceversion to BockQueryNode * control leaks of cursor query count when queries fail * check if is hybrid request from AREQ itself * explain better the should_filter_slots * refactor how to handle topology timer reset * remove decrease of count on cursor Read * align with dedicated API for cluster enable/disable trimming * make query cursor tracker map thread safe * remove some changes * handle Decrease of Query Count for each AREQ individually in a RAII manner * small cleanup of PR * fix leak: decrease without previous increase * add log to debug * change in github workflow * fix a test * Apply suggestions from code review * all ranges are owned in StandAlone cluster * remove txt from pytests * all slots dynamic allocation * fix compilation warning * register new API to c_symbols.rs * proper mutex lifetime management * degine mutex as extern to avoid duplication in translation units * reschedule with slots * fix: fix potential leak and correct log message condition * reset slots when local slots are updated * fix: accept not normlized slot ranges as input * Revert "fix: accept not normlized slot ranges as input" This reverts commit 1babfed. * add configuration for max min delay trimming ms * Apply suggestions from code review Co-authored-by: GuyAv46 <[email protected]> * apply some suggestion changes * fix test acl * proper verbose level * remove inlining and create struct in notifications * make config underscore * add comments in code and fix typo * add configuration for delay check ms * avoid trim delay if no Indices found (no module usage) * Update tests/pytests/test_hybrid_internal.py Co-authored-by: GuyAv46 <[email protected]> * refactor asm state machine sanitizer code * ASM_StartTrim properly * refactor test_hybrid_internal * small refactor test asm * add tests for FT AGGREGATE WITHCURSORS * fix naming to be private of config * test: handle test where cursor returns empty result after trimming starts * add proper trimmed tests * adapt test asm * fix use after free connection * use arr.h to avoid reallocating in ASM Sanitizer * Apply suggestions from code review Co-authored-by: GuyAv46 <[email protected]> * test against redis unstable * update RedisModulesSDK * fix how to pass hybrid query vectors in query * fix .gitmodules back * update RedisModulesSDK * proper error handling * clarify invariants with an ASSERT * skip while no fix is found * do not assert, stop timers if they are scheduled * reply to cursor comments * small refactoring notifications * improve logging * add test for empty indices * test: improve testing * small fix * small test change * include timeout message * test no indexes * adapt changes * make test longer to cover timers * fix test * fix: avoid memory leak of ever growing dict * Update src/asm_state_machine.h Co-authored-by: GuyAv46 <[email protected]> --------- Co-authored-by: GuyAv46 <[email protected]> (cherry picked from commit a3f5468)
|
/backport |
|
Backport failed for Please cherry-pick the changes locally and resolve any conflicts. git fetch origin 8.4
git worktree add -d .worktree/backport-7343-to-8.4 origin/8.4
cd .worktree/backport-7343-to-8.4
git switch --create backport-7343-to-8.4
git cherry-pick -x a3f5468b514cf8d4a57cf43ba7617f0e027ab5f3 |
…7829) * [MOD-12171] [MOD-12169] Implement ASM-aware search flow (#7343) * fix: fix counts in all variables * small refactor * some refactoring of Shared Exclusive Lock * simplify * protect GILOwned simple bool * clarify some comments * set GILAlternativeLockHeld to true properly * fix spelling * add assertion * fix: fix import RS_LOG_ASSERT * test: add more conditions to testing * add some more logic * improve testing to proper signal main thread can finish while other threads may be waiting for the Shared Lock in the loop * test: handl test properly * add another pattern of tests * fix comments from cursor * fix concurrency bug * fix: fix potential race condition at release lock time * fix: add condition * test: add more testing * force testing further to capture more potential errors * parametrize tests * test: make tests a little faster * test: add micro benchmark * fix compile microbenchmarks * fix: avoid potential reentrant deadlocks * fix: avoid potential reentrant deadlocks * test: avoid leak in test * test: avoid leak in test * fix: fix assertion * fix: fix assertion * Simplify Shared Lock internals (#7267) * simplify shared lock * small improvement to set_timeout * fix comment * fix nd improve comments * condition fix * remove lock type from release API * Add lock type back to the release API * remove Unlocked from enum and handle clock init for macOS * adapt to use new API * chnange according to comments * handle PR comments * handle PR comments * fix: take shared lock in other cases * change as PR comments * test: simplify test, do not allow query errors * fix tests as per comments * fix: handle number of high priority jobs running * fix: fix test comment * add ASM to help slot tracking in notifications * test: add some testing idea * change as per PR comments * compile and link test fix * change draining method to drain high priority * add ctests for ASM State Machine * test: add ASM tests * test: complete tests * remove SharedSlotRangeArray from code * implement basic logic * change as per PR comments * fix: remove latest SlotRange reference * set local slots on UpdateTopology * change as per PR comments * fix: fix query error * some code changes needed * handle tests * fix compilation in Linux * fix loading func in module * fix tests hybrid internal standalone * fix test standalone acl * fix test standalone asm * test: handle tests internal to different * handle more tests slowly * improve testing * improve testing * fix sanitizer errors in tests * remove ASM comment * small improvement commenting * schedule refresh topology in timer * remove println * test: fix test ASM * test: improve ASM test * further improve * improve asm tests * fix formatting issue * try to handle hybrid test * test: change test to compare scores * improve testing a little * add micro benchmarks with jobs in threads * fix: fail on error from coordinator * small changes in test asm * remove changes not wanted * checkout redis feature branch in task test * small change * remove draining * move atomic to new header * remove _internal naming * fix formatting * add structure of timers * add initial version of KeySpaceVersionTracker * implementation and unit testing of KeySpaceVersionTracker * Plugin keyspace version tracker counters into the flow * small code comment * align with trimming redismodule.h * install feature redis in CI workflow * plug new RedisModule API to delay trimming * remove sleep from test * fix values and enable logging * make query version map extern to avoid different instances per compilation unit * rename vars * small cleanup and rename * use debug print * query all shards * fix linting issue * align with latest commit from Redis dependency * fix on RedisModulesSDK * do not rely on TimerID for initialization state * handle ASM Decrease Query Count * fix access to keyspace version * fix: remove ASM count from AREQ_Free * more progress towards keyspace query counting * handle some small PR comments * avoid cplusplus code duplication * align test with no cplusplus deduplication * add helper to identify leaks of query counts on sanitizer * add keyspaceversion to BockQueryNode * control leaks of cursor query count when queries fail * check if is hybrid request from AREQ itself * explain better the should_filter_slots * refactor how to handle topology timer reset * remove decrease of count on cursor Read * align with dedicated API for cluster enable/disable trimming * make query cursor tracker map thread safe * remove some changes * handle Decrease of Query Count for each AREQ individually in a RAII manner * small cleanup of PR * fix leak: decrease without previous increase * add log to debug * change in github workflow * fix a test * Apply suggestions from code review * all ranges are owned in StandAlone cluster * remove txt from pytests * all slots dynamic allocation * fix compilation warning * register new API to c_symbols.rs * proper mutex lifetime management * degine mutex as extern to avoid duplication in translation units * reschedule with slots * fix: fix potential leak and correct log message condition * reset slots when local slots are updated * fix: accept not normlized slot ranges as input * Revert "fix: accept not normlized slot ranges as input" This reverts commit 1babfed. * add configuration for max min delay trimming ms * Apply suggestions from code review Co-authored-by: GuyAv46 <[email protected]> * apply some suggestion changes * fix test acl * proper verbose level * remove inlining and create struct in notifications * make config underscore * add comments in code and fix typo * add configuration for delay check ms * avoid trim delay if no Indices found (no module usage) * Update tests/pytests/test_hybrid_internal.py Co-authored-by: GuyAv46 <[email protected]> * refactor asm state machine sanitizer code * ASM_StartTrim properly * refactor test_hybrid_internal * small refactor test asm * add tests for FT AGGREGATE WITHCURSORS * fix naming to be private of config * test: handle test where cursor returns empty result after trimming starts * add proper trimmed tests * adapt test asm * fix use after free connection * use arr.h to avoid reallocating in ASM Sanitizer * Apply suggestions from code review Co-authored-by: GuyAv46 <[email protected]> * test against redis unstable * update RedisModulesSDK * fix how to pass hybrid query vectors in query * fix .gitmodules back * update RedisModulesSDK * proper error handling * clarify invariants with an ASSERT * skip while no fix is found * do not assert, stop timers if they are scheduled * reply to cursor comments * small refactoring notifications * improve logging * add test for empty indices * test: improve testing * small fix * small test change * include timeout message * test no indexes * adapt changes * make test longer to cover timers * fix test * fix: avoid memory leak of ever growing dict * Update src/asm_state_machine.h Co-authored-by: GuyAv46 <[email protected]> --------- Co-authored-by: GuyAv46 <[email protected]> (cherry picked from commit a3f5468) * align * fix compilation * ci: fix * update RedisModulesSDK * remove unnecessary imports * adapt to get 8.4 in checkout * test: add test that ft hybrid does not accept slots info multiple times * add dummy step in ci * clean TODO * Fix C++ compilation error by removing unused <stdatomic.h> include --------- Co-authored-by: GuyAv46 <[email protected]>
Describe the changes in the pull request
Put all he pieces together to make sure that queries respect ASM slot information at query time.
Depending on redis/redis#14567
Main ideas:
When parsing the SLOTS INFO from the requests, store the KEYSPACE version returned by the slots tracker or reject the query.
In result processor potentially filter out some results if the query keyspace version does not match the keyspace version of the server at that moment (Some migration happens that means that some of the results should not be owned by the shard and therefore discarded).
Once the migration is COMPLETE; make sure to delay trimming until one of these 2 conditions is met:
- 5 seconds have passed since the completion of migration
- 2 seconds have passed and all the queries run with the keyspace version before trimming have finished *
How do we keep track of the keyspace versions that are being run.
ASM keeps track with a hashmap of a counter of queries per keyspace version. These counter are increased and decreased in a thread safe manner:
Note
Implements ASM-aware distributed search by tracking keyspace versions, validating requested slots, filtering results on version mismatch, and adding configurable trimming delays with lifecycle integration.
Core query/exec flow:
slotsVersionwithkeySpaceVersionacrossAREQ, parse contexts, pipeline params, and result processor._SLOTS_INFO, validate viaslots_tracker_check_availability, set/incrementkeySpaceVersion; reject withUnavailableSlotson failure.querySlotspresent and serverkey_space_versiondiffers from requestkeySpaceVersion, discard docs not inquerySlots.DOCUMENT_OPEN_KEY_ACCESS_TRIMMEDto query flags.ASM state machine & tracking:
asm_state_machinewith globalkey_space_version, slot state transitions, and query-count tracking per version (thread-safe; sanitizer hooks).Topology & notifications:
ASM_StateMachine_SetLocalSlotson topology refresh; reschedule updater on ASM events.Config:
_MIN_TRIM_DELAY_MS,_MAX_TRIM_DELAY_MS,_TRIMMING_STATE_CHECK_DELAY_MSwith validation, GET/SET, defaults.APIs & errors:
QUERY_ERROR_CODE_UNAVAILABLE_SLOTS(C/Rust) and propagate in coordinator.SlotRangeArray_ContainsSlot.Hybrid parsing:
_SLOTS_INFOwith validation; propagatekeySpaceVersionto subqueries and track query counts.Tests:
Written by Cursor Bugbot for commit d782e73. This will update automatically on new commits. Configure here.