[MOD-8605] Introducing query debug mechanism 🎉 #5554
Conversation
add test
fix bug 1: in aggregate_exec:sendChunk_Resp3, don't change rc upon timeout, to avoid returning empty results.
fix bug 2: reset the sorter timeout so we return EOF after depleting the heap.
fix bug 3: if the shards didn't send any reply (currently assumed to be possible only due to timeout) but they did finish collecting the results, then inProcess would be set to 0 while pending is still larger than 0, and the cursor is not depleted. In this case, we unblock the channel and signal it to proceed. We rely on the coordinator to check the timeout; if it has timed out, it will return. To allow that, the get-next-reply step returns pending (the number of shards that potentially have results) instead of 0, so rpnet returns timeout rather than EOF.
enhancement: rpnetNext caches the results length instead of fetching it on each iteration.
API: `_FT.DEBUG <query> <DEBUG_TYPE> <DEBUG_TYPE_ARGS> ... DEBUG_PARAMS_COUNT 2`
* query can be either `FT.AGGREGATE ...` or `FT.SEARCH ...` (with all supported args)
* DEBUG_TYPE currently supports TIMEOUT_AFTER_N, followed by the number of results to return before a timeout is returned (including 0)
* can be called both in SA and cluster: `_FT.DEBUG FT.AGGREGATE`, `_FT.DEBUG FT.SEARCH`
* can be called only if the number of shards is 1: `_FT.DEBUG _FT.AGGREGATE`, `_FT.DEBUG _FT.SEARCH`

Changes taking place in production functions:

module.c:
* `DistSearchCommand` (FT.SEARCH) and `DistAggregateCommand` (FT.AGGREGATE) check whether the call is a debug command
* if debug and 1 shard: call `DEBUG_RSSearchCommand` or `DEBUG_RSAggregateCommand` for search and aggregate, respectively
* if debug and cluster: call `DEBUG_DistSearchCommandHandler` or `DEBUG_RSExecDistAggregate` for search and aggregate, respectively
* split `FlatSearchCommandHandler` into: `CreateReq`, `prepareCommand`, `MR_CreateCtx`, `MRCtx_SetReduceFunction`, `MR_Fanout`
* `DEBUG_FlatSearchCommandHandler` flow: `CreateReq`, `prepareCommand`, add `_FT.DEBUG` as the first arg, add the debug params at the end
* expose `DistAggregateCommand` and `DistSearchCommand` so they can be called from debug_commands.c

aggregate_exec.c:
* split `execCommandCommon` into `prepareRequest` and `buildPipelineAndExecute`
* check `AREQ->flags & QEXEC_F_DEBUG` in `prepareExecutionPlan`
* change the `withProfile` arg to `execOptions`; change the profile options to bit flags; remove `NO_PROFILE` and replace it with a flag indicating a profile command, `EXEC_WITH_PROFILE`; then, if it is a profile, check whether the `EXEC_WITH_PROFILE_LIMITED` bit is set; `parseProfile` also gets `execOptions`
* `DEBUG_*` commands are defined in debug_commands.h

New:
* `RPTimeoutAfterCount_New`: a result processor that simulates a timeout after n results
* `AREQ_Debug_New`: creates an `AREQ_Debug` holding the request by value, hence freed together with it
* new AREQ flag: `QEXEC_F_DEBUG`
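The profile-options-to-bit-flags change described in this commit message can be sketched with a Python `IntFlag` standing in for the C enum (the flag names come from the message; the parsing helper and its return values are illustrative, not the real `parseProfile` signature):

```python
from enum import IntFlag

class ExecOptions(IntFlag):
    """Illustrative stand-in for the C ExecOptions bit flags."""
    NONE = 0
    EXEC_WITH_PROFILE = 1 << 0          # this is a profile command
    EXEC_WITH_PROFILE_LIMITED = 1 << 1  # only meaningful together with EXEC_WITH_PROFILE

def parse_profile(opts):
    # First check whether this is a profile command at all; only then
    # look at the LIMITED bit, as the commit message describes.
    if not opts & ExecOptions.EXEC_WITH_PROFILE:
        return None
    return "limited" if opts & ExecOptions.EXEC_WITH_PROFILE_LIMITED else "full"

print(parse_profile(ExecOptions.NONE))               # None
print(parse_profile(ExecOptions.EXEC_WITH_PROFILE))  # full
print(parse_profile(ExecOptions.EXEC_WITH_PROFILE
                    | ExecOptions.EXEC_WITH_PROFILE_LIMITED))  # limited
```

Combining flags this way is what a single `withProfile` value could not express.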
Before this fix, self->remaining was decreased and then base->upstream->Next(base->upstream, r) was returned, but the ->Next call could have returned no results, so the counter was wrongly updated for it. This caused a wrong number of accumulated results returned by the cursor when a timeout was reached in one or more of the cursor's read commands: in each timed-out read we lost one result.
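The before/after behavior of this counter fix can be modeled with a small Python sketch (the real code is C in `rppagerNext_Limit`; the result codes and structure here are illustrative):

```python
# Result codes standing in for the C enum values.
RS_RESULT_OK, RS_RESULT_TIMEDOUT = 0, 1

class Pager:
    def __init__(self, upstream, limit):
        self.upstream = upstream
        self.remaining = limit

    def next_buggy(self, r):
        # BUG: the counter is decremented even when the upstream call
        # times out and yields no result.
        self.remaining -= 1
        return self.upstream(r)

    def next_fixed(self, r):
        # FIX: only count a result after the upstream actually produced one.
        rc = self.upstream(r)
        if rc == RS_RESULT_OK:
            self.remaining -= 1
        return rc

def make_upstream(script):
    it = iter(script)
    return lambda r: next(it)

# Three reads, the middle one timing out:
buggy = Pager(make_upstream([RS_RESULT_OK, RS_RESULT_TIMEDOUT, RS_RESULT_OK]), limit=3)
for _ in range(3):
    buggy.next_buggy(None)
print(buggy.remaining)  # 0: the timed-out read wrongly consumed one result slot

fixed = Pager(make_upstream([RS_RESULT_OK, RS_RESULT_TIMEDOUT, RS_RESULT_OK]), limit=3)
for _ in range(3):
    fixed.next_fixed(None)
print(fixed.remaining)  # 1: the timed-out read did not consume the budget
```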
use it in the debug command instead of having an if in the production function
add optional parameter [INTERNAL_ONLY] that adds the timeout RP only if the query runs on a shard; add test_timeout; add debug tests TestStrictPolicy and testInternalOnly; add test_cursors:testTimeoutPartialWithEmptyResults; rename parseDebugParams -> parseDebugParamsCount
remove bug fixes: 1. sorter 2. pager 3. resp3 (change rc in RS_RESULT_TIMEDOUT) 4. coord hanging. tests: revert test cursors with _FT.DEBUG QUERY; remove new test_timeout
Codecov Report
Attention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #5554 +/- ##
==========================================
+ Coverage 88.05% 88.09% +0.04%
==========================================
Files 197 198 +1
Lines 35722 35999 +277
==========================================
+ Hits 31454 31714 +260
- Misses 4268 4285 +17
isClusterCoord to check if we are the coordinator in cluster mode
* reset self->timedOut; add test
* revert original test
* support empty results (to be reverted)
* revert getting empty result
* fix bug 1 (aggregate_exec:sendChunk_Resp3, don't change rc upon timeout), bug 2 (reset the sorter timeout so we return EOF after depleting the heap) and bug 3 (shards finished collecting results without replying, leaving pending > 0), plus the rpnetNext results-length caching enhancement, as described in the earlier commit message
* revert result len
* revert resp2 reply length
* fix rppagerNext_Limit counter update: self->remaining was decreased before the upstream ->Next call, which could return no results, losing one result per timed-out read (see the earlier commit message)
* add optional parameter [INTERNAL_ONLY] that adds the timeout RP only if the query runs on a shard; add test_timeout; add debug tests TestStrictPolicy and testInternalOnly; add test_cursors:testTimeoutPartialWithEmptyResults; rename parseDebugParams -> parseDebugParamsCount
* finalize tests
* fix comment
* disable the testTimeoutPartialWithEmptyResults test
* enforce timeout when calling INTERNAL_ONLY with TIMEOUT_AFTER_N where N==0, to avoid a coordinator infinite loop; fix a missing return in debug commands
* fix tests: explicitly set the timeout policy to "on timeout return"
* fix use-after-free in rmr.c:MRIteratorCallback_ProcessDone: the problem was that we tried to decrease &ctx->it->ctx.inProcess after the shard decided to release the iterator (MRIterator_Release freed it). The solution changes the MRIterator_Release signature: it now returns whether it freed the iterator, and if it was freed, we return immediately
* return false if the free flag is set to false
* move the reader's reset of the freeFlag to the uv thread to avoid race conditions with the shards; ensure the sequence of returning from MRChannel_Pop and re-blocking the channel is atomic under the channel guard; remove MRChannel_Block
* fix leak
* simulate timeout during build (TO BE REVERTED, as it currently doesn't simulate the scenario well: the shard returns a timeout error string, but the coordinator clock is not set to time out). Trying to only return an error from the shards without any special changes in the coordinator: RPNetNext returns an error (and exits the while loop) only if the reply contains a timeout AND the policy is to fail; otherwise we enter another while-loop iteration without processing the reply, overriding it in the next iteration, which is a leak. Another approach I tried: in the callback passed to the result processor, call get-next-reply before rpnetNext checks the timeout, but it didn't work for a reason I couldn't figure out (the shards didn't shut down; might be a bug in RLTest). Returning a generic query code doesn't simulate the required scenario
* REVERT timeout on build
* enable test_cursors:testTimeoutPartialWithEmptyResults; skip testEmptyResult in coord
* fix comment; remove sortable
* stabilize test_error_with_partial_results by using query debug
* signal one reader; simplify the testCursorDepletionNonStrictTimeoutPolicySortby test
* fix typo
* revert force failure
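The MRIterator_Release signature change in the use-after-free fix above can be sketched in Python (the real code is C in rmr.c; the reference-counting details and names here are illustrative):

```python
class MRIterator:
    """Illustrative stand-in for the C MRIterator."""
    def __init__(self):
        self.refs = 2          # e.g. one ref held by the coordinator, one by the shard side
        self.in_process = 1
        self.freed = False

    def release(self):
        """Drop one reference; return True iff the iterator was freed."""
        self.refs -= 1
        if self.refs == 0:
            self.freed = True  # stands in for free(it) in the C code
            return True
        return False

def process_done(it):
    # The fix: if release() freed the iterator, return immediately
    # instead of touching it->ctx.inProcess afterwards (use-after-free).
    if it.release():
        return
    it.in_process -= 1

it = MRIterator()
process_done(it)   # refs 2 -> 1: not freed, safe to update in_process
process_done(it)   # refs 1 -> 0: freed, in_process is not touched
print(it.freed, it.in_process)
```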
alonre24
left a comment
Very good and important job! 💪
See my questions and suggestions in the comments
src/result_processor.c (Outdated)

    cur = cur->upstream;
  }

  if (cur) {
In which cases do we expect cur to be NULL? Should we debug-assert it? Also, consider storing the relevant RPIndexIterator pointer as part of the RPTimeoutAfterCount struct if it simplifies things.
There is no RP_INDEX in the coordinator pipeline.
I don't think it's complicated to search for it at runtime, since in most cases (actually all of them, except the coordinator) it is the next RP.
Just add a comment that if this is the coordinator, we are not expected to get here.
tests/pytests/test_debug_commands.py (Outdated)

    skipTest(cluster=True)
    env = self.env
    results_count = 200
    timeout_res_count = results_count - 1
To get timeout and not EOF
tests/pytests/test_debug_commands.py (Outdated)

    # if the sum of results from all shards exceeds timeout_res_count but *each* shard individually returns fewer:
    # no timeout warning is issued, the cursor is depleted.
In this case, are we stopping whenever we reach an accumulated number of results >= timeout_res_count (even if we did not collect results from some shards)?
didn't understand the question
try to refine the comment to be exact
Refining and moving this comment to the aggregate debug header as it is relevant to the general behaviour of FT.AGGREGATE and not specifically to cursors.
raz-mon
left a comment
Nice
Approving since the only comments are on the debug command, which does not have to be perfect. I think it would be better off without the dummy debug argument and with order enforcement on the arguments, but it's not a blocker for the PR AFAIC.
Backport failed for 2.8. Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin 2.8
git worktree add -d .worktree/backport-5554-to-2.8 origin/2.8
cd .worktree/backport-5554-to-2.8
git switch --create backport-5554-to-2.8
git cherry-pick -x d45701ecbb8bfad221355aa0bba7e843acdde1b0
Backport failed for 2.6. Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin 2.6
git worktree add -d .worktree/backport-5554-to-2.6 origin/2.6
cd .worktree/backport-5554-to-2.6
git switch --create backport-5554-to-2.6
git cherry-pick -x d45701ecbb8bfad221355aa0bba7e843acdde1b0
Backport failed for 2.10. Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin 2.10
git worktree add -d .worktree/backport-5554-to-2.10 origin/2.10
cd .worktree/backport-5554-to-2.10
git switch --create backport-5554-to-2.10
git cherry-pick -x d45701ecbb8bfad221355aa0bba7e843acdde1b0
Backport failed for 8.0. Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin 8.0
git worktree add -d .worktree/backport-5554-to-8.0 origin/8.0
cd .worktree/backport-5554-to-8.0
git switch --create backport-5554-to-8.0
git cherry-pick -x d45701ecbb8bfad221355aa0bba7e843acdde1b0
Squashed commit message (repeating the commit list earlier in this conversation), with the following additional entries:
* fix spelling errors
* review fixes: remove load from the AggregateDebug test; introduce DistAggregateCleanups to handle `goto err` in DEBUG_RSExecDistAggregate and RSExecDistAggregate; comments in PipelineAddTimeoutAfterCount
* move debug things to a header
* elaborate comment
* more details in aggregate_debug.h; reference the cursor test in the header; modify and add some comments
* fix comment in RP; use a cursor to parse debug args; fix
* disable INTERNAL_ONLY when used with 0 results and no user cursor
* remove DUMMY_DEBUG_OPTION
* fix

(cherry picked from commit d45701e)

* Introducing query debug mechanism 🎉 (#5554): same squashed commit message as above (cherry picked from commit d45701e)
* fix
This PR adds debugging capabilities to the query execution process by introducing a new command to the debug commands set:
New _FT.DEBUG command
Syntax

`_FT.DEBUG <QUERY> <DEBUG_QUERY_ARGS> <DEBUG_PARAMS_COUNT>`
Parameters
* `<QUERY>`: any valid `FT.SEARCH` or `FT.AGGREGATE` command (with all supported args). It supports both SA (standalone) and cluster mode.
* `<DEBUG_QUERY_ARGS>`: currently, the following argument is supported:
  * `TIMEOUT_AFTER_N <N> [INTERNAL_ONLY]`: simulates a timeout after processing `<N>` results (including `N` = 0). Internally, this adds a result processor as the downstream result processor of the final execution step (e.g., `RP_INDEX` in SA or `RP_NETWORK` in the coordinator).
  * `INTERNAL_ONLY` (optional): in cluster mode, specifies that the timeout applies only to the internal query command sent to the shards, without affecting the coordinator pipeline.
* `<DEBUG_PARAMS_COUNT>`: should follow `<DEBUG_QUERY_ARGS>`; specifies the expected number of arguments in `<DEBUG_QUERY_ARGS>`.

For example, to simulate a timeout after 100 results, you can use the following debug query:
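As a sketch of the resulting argument layout, here is a small, hypothetical Python helper (not part of the PR) that wraps a query with the debug arguments; the index name `idx` and the query `*` are placeholders:

```python
def build_debug_query(query_args, n, internal_only=False):
    """Wrap an FT.SEARCH / FT.AGGREGATE arg list in _FT.DEBUG form:
    _FT.DEBUG is prepended, and the debug args plus DEBUG_PARAMS_COUNT
    are appended after the wrapped query."""
    debug_args = ["TIMEOUT_AFTER_N", str(n)]
    if internal_only:
        debug_args.append("INTERNAL_ONLY")
    return (["_FT.DEBUG"] + list(query_args)
            + debug_args + ["DEBUG_PARAMS_COUNT", str(len(debug_args))])

cmd = build_debug_query(["FT.SEARCH", "idx", "*"], 100)
print(" ".join(cmd))
# _FT.DEBUG FT.SEARCH idx * TIMEOUT_AFTER_N 100 DEBUG_PARAMS_COUNT 2
```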
Changes in query execution code
Functions refactor
To facilitate the injection of debug logic without modifying core execution paths, several functions in the query command flow were broken down into smaller callbacks. This approach minimizes the impact of debug code on production performance.
* `execCommandCommon` was split into:
  * `prepareRequest`
  * `buildPipelineAndExecute`
* Changes in `execCommandCommon` arguments:
  * The `withProfile` argument (previously constrained to a set of macros and a single value) was replaced with `execOptions`.
  * `execOptions` is a set of flags defined in the newly introduced `ExecOptions` enum, allowing multiple flags to be set.
* `RSExecDistAggregate` was split into:
  * `prepareForExecution`
  * `executePlan`

Additions to the production code
As mentioned, the goal was to introduce minimal changes to the production code so that debug mode wouldn't affect performance for regular users. However, some changes were inevitable:
* A check for `QEXEC_F_DEBUG` in `prepareExecutionPlan` (in the production code).
* `QEXEC_F_DEBUG` was added to the `QEFlags` enum.

Debug command implementation
* `DEBUG_execCommandCommon` simulates `execCommandCommon` by creating and initializing an `AREQ_Debug` structure instead of an `AREQ`.
* `DEBUG_RSAggregateCommand` and `DEBUG_RSSearchCommand`: debug versions of the `FT.AGGREGATE` and `FT.SEARCH` commands, respectively. Each calls `DEBUG_execCommandCommon` with the respective command type flag (`COMMAND_AGGREGATE` or `COMMAND_SEARCH`).

Entry point: defined in debug_command.c.

* In SA, `_FT.DEBUG` followed by `FT.AGGREGATE` or `FT.SEARCH` invokes `DEBUG_RSAggregateCommand` or `DEBUG_RSSearchCommand`, respectively, which calls `DEBUG_execCommandCommon` with the appropriate command type hint.
* In cluster mode, the coordinator parses the wrapped `FT.AGGREGATE` or `FT.SEARCH` and calls `DistAggregateCommand` or `DistSearchCommand`, sending an `_FT.DEBUG` followed by an `_FT.AGGREGATE` or `_FT.SEARCH` command to all shards.
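The TIMEOUT_AFTER_N mechanism can be modeled conceptually in Python (the real `RPTimeoutAfterCount` is a C result processor; the result codes and names here are illustrative):

```python
# Illustrative result codes standing in for the C enum values.
RS_RESULT_OK, RS_RESULT_EOF, RS_RESULT_TIMEDOUT = 0, 1, 2

class RPTimeoutAfterCount:
    """Forwards results from its upstream and reports a timeout once
    N results have passed through (N == 0 times out immediately)."""
    def __init__(self, upstream, n):
        self.upstream = upstream
        self.remaining = n

    def next(self, r):
        if self.remaining == 0:
            return RS_RESULT_TIMEDOUT
        rc = self.upstream(r)
        if rc == RS_RESULT_OK:
            self.remaining -= 1
        return rc

def upstream_of(count):
    # An upstream that produces `count` results, then EOF.
    state = {"left": count}
    def nxt(r):
        if state["left"] == 0:
            return RS_RESULT_EOF
        state["left"] -= 1
        return RS_RESULT_OK
    return nxt

rp = RPTimeoutAfterCount(upstream_of(10), n=3)
codes = [rp.next(None) for _ in range(4)]
print(codes)  # [0, 0, 0, 2]: three results pass through, then a simulated timeout
```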