
[MOD-8605] Introducing query debug mechanism 🎉#5554

Merged
meiravgri merged 36 commits into master from meiravg_debug_query_command
Mar 6, 2025
Conversation

Collaborator

@meiravgri meiravgri commented Jan 23, 2025

This PR adds debugging capabilities to the query execution process by introducing a new command in the debug command set:

New _FT.DEBUG command

Syntax

_FT.DEBUG <QUERY> <DEBUG_QUERY_ARGS> DEBUG_PARAMS_COUNT <COUNT>

Parameters

  • <QUERY>:
    Any valid FT.SEARCH or FT.AGGREGATE command. It supports both SA (standalone) and cluster mode.

  • <DEBUG_QUERY_ARGS>:
    Currently, the following arguments are supported:

    • TIMEOUT_AFTER_N <N> [INTERNAL_ONLY]:
      Simulates a timeout after processing <N> results.
      Internally, adds a result processor as the downstream result processor of the final execution step (e.g., RP_INDEX in SA or RP_NETWORK in the coordinator).

    • INTERNAL_ONLY (optional):
      In cluster mode, specifies that the timeout applies only to the internal query command sent to the shards, without affecting the coordinator pipeline.

  • DEBUG_PARAMS_COUNT <COUNT> must follow <DEBUG_QUERY_ARGS> and specifies the expected number of arguments in <DEBUG_QUERY_ARGS>.

For example, to simulate a timeout after 100 results, you can use the following debug query:

_FT.DEBUG FT.SEARCH  TIMEOUT_AFTER_N 100 DEBUG_PARAMS_COUNT 2
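The argument layout above can be sketched with a toy parser. This is a Python illustration only (the real parser is C inside RediSearch), and the `idx`/`*` index and query arguments in the usage line are hypothetical placeholders:

```python
def split_debug_command(args):
    """Split the arguments after `_FT.DEBUG` into (query_args, debug_args).

    The last two tokens must be `DEBUG_PARAMS_COUNT <COUNT>`; <COUNT> says how
    many of the preceding tokens are debug arguments, and everything before
    those is the wrapped FT.SEARCH / FT.AGGREGATE command.
    """
    if len(args) < 2 or args[-2].upper() != "DEBUG_PARAMS_COUNT":
        raise ValueError("missing DEBUG_PARAMS_COUNT <COUNT>")
    count = int(args[-1])
    if count > len(args) - 2:
        raise ValueError("DEBUG_PARAMS_COUNT exceeds available arguments")
    debug_args = args[-2 - count:-2]   # the <COUNT> tokens before DEBUG_PARAMS_COUNT
    query_args = args[:-2 - count]     # the wrapped query, untouched
    return query_args, debug_args

query, debug = split_debug_command(
    ["FT.SEARCH", "idx", "*", "TIMEOUT_AFTER_N", "100", "DEBUG_PARAMS_COUNT", "2"])
```

Because DEBUG_PARAMS_COUNT sits at the end, the wrapped query keeps its original argument order and needs no escaping.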

Changes in query execution code

Functions refactor

To facilitate the injection of debug logic without modifying core execution paths, several functions in the query command flow were broken down into smaller callbacks. This approach minimizes the impact of debug code on production performance.

  1. execCommandCommon was split into:

    • prepareRequest
    • buildPipelineAndExecute
  2. Changes in execCommandCommon arguments:

  • The withProfile argument (previously constrained to a set of macros and a single value) was replaced with execOptions.
  • execOptions is a set of flags defined in the newly introduced ExecOptions enum, allowing multiple flags to be combined.
  3. RSExecDistAggregate was split into:
  • prepareForExecution
  • executePlan
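As a rough illustration of the execOptions change, here is a Python IntFlag standing in for the C enum; the flag names are illustrative, not the actual identifiers from the PR:

```python
from enum import IntFlag, auto

class ExecOptions(IntFlag):
    """Illustrative mirror of a bit-flag ExecOptions enum: unlike the old
    single-valued withProfile argument, several options can be combined."""
    NONE = 0
    WITH_PROFILE = auto()
    WITH_PROFILE_LIMITED = auto()
    DEBUG = auto()

def describe(opts: ExecOptions) -> str:
    """Decode the profile-related bits the way the text describes: first check
    whether this is a profile command, then whether it is the limited variant."""
    if opts & ExecOptions.WITH_PROFILE:
        if opts & ExecOptions.WITH_PROFILE_LIMITED:
            return "profile (limited)"
        return "profile (full)"
    return "regular"
```

The point of the bit-flag representation is that `ExecOptions.WITH_PROFILE | ExecOptions.DEBUG` is a valid state, which the old single-value argument could not express.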

Additions to the production code

As mentioned, the goal was to introduce minimal changes to the production code so that debug mode wouldn't affect performance for regular users. However, some changes were inevitable:

  1. Check the newly introduced QEXEC_F_DEBUG flag in prepareExecutionPlan (in the production code):
  • A new flag, QEXEC_F_DEBUG, was added to the QEFlags enum.
  • If set, a callback parses the debug parameters and adjusts the query execution plan accordingly.
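A minimal sketch of that idea, in Python with hypothetical names (the real code is C): the production path stays untouched unless the debug flag is set, in which case a callback gets a chance to adjust the plan.

```python
QEXEC_F_DEBUG = 1 << 0  # illustrative bit value, not the real one

def prepare_execution_plan(flags, plan, debug_callback=None):
    """Production path is a no-op unless QEXEC_F_DEBUG is set; then the debug
    callback may rewrite the plan (e.g. append a timeout-simulating step)."""
    if flags & QEXEC_F_DEBUG and debug_callback is not None:
        plan = debug_callback(plan)
    return plan
```

This is the mechanism that keeps the debug-code footprint in production to a single flag check.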

Debug command implementation

  1. DEBUG_execCommandCommon simulates execCommandCommon by creating and initializing an AREQ_Debug structure instead of an AREQ.

  2. DEBUG_RSAggregateCommand and DEBUG_RSSearchCommand:

  • Simulate the FT.AGGREGATE and FT.SEARCH commands, respectively.
  • Internally call DEBUG_execCommandCommon with the respective command type flags (COMMAND_AGGREGATE or COMMAND_SEARCH).

Entry point (defined in debug_commands.c):

  • Standalone (SA) Mode:
    • _FT.DEBUG followed by FT.AGGREGATE or FT.SEARCH invokes DEBUG_RSAggregateCommand or DEBUG_RSSearchCommand respectively.
    • These commands are delegated to DEBUG_execCommandCommon with the appropriate command type hint.
  • Cluster Mode:
    • The coordinator receives FT.AGGREGATE or FT.SEARCH and calls DistAggregateCommand or DistSearchCommand.
    • Internally, the coordinator sends an _FT.DEBUG command wrapping _FT.AGGREGATE or _FT.SEARCH to all shards.
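The cluster flow can be sketched as follows. This is a toy Python model, not the C coordinator code, and the `idx`/`*` arguments are hypothetical placeholders:

```python
def build_shard_command(coordinator_args, debug_args):
    """Sketch: the coordinator receives the user's FT.SEARCH / FT.AGGREGATE
    plus debug arguments, and fans out an internal _FT.DEBUG command that
    wraps the internal _FT.* variant with the debug params appended."""
    internal = "_" + coordinator_args[0].upper()  # FT.SEARCH -> _FT.SEARCH
    return (["_FT.DEBUG", internal] + coordinator_args[1:]
            + debug_args
            + ["DEBUG_PARAMS_COUNT", str(len(debug_args))])
```

With `INTERNAL_ONLY`, only this shard-side command simulates the timeout, while the coordinator pipeline is left as-is.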

Fix bug 1:
aggregate_exec:sendChunk_Resp3
Don't change rc upon timeout, to avoid returning empty results.

Bug fix 2:
Reset the sorter timeout so we return EOF after depleting the heap.

Bug fix 3:
If the shards didn't send any reply (currently assumed to be possible only due to timeout) but did finish collecting their results, then inProcess is set to 0 while pending is still greater than 0, and the cursor is not depleted.
In this case, we unblock the channel and signal it to proceed, relying on the coordinator to check the timeout; if it has timed out, it returns.

To allow that, getNextReply returns pending (the number of shards that potentially have results) instead of 0; in this case rpnet won't return EOF, but timeout.
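A toy model of this pending-versus-EOF distinction (hypothetical Python, not the actual C functions):

```python
def get_next_reply(replies, pending):
    """Sketch of bug fix 3: when no reply is available but `pending` shards may
    still hold results, report `pending` instead of 0 so the caller can tell
    'no replies yet' apart from 'all shards depleted'."""
    if replies:
        return replies.pop(0)
    return pending  # int: 0 means truly depleted, > 0 means shards may remain

def rpnet_next(replies, pending):
    """The caller maps the int sentinel to EOF only when pending == 0;
    otherwise it reports a timeout instead of wrongly closing the cursor."""
    r = get_next_reply(replies, pending)
    if isinstance(r, int):
        return "EOF" if r == 0 else "TIMEOUT"
    return r
```

Had `get_next_reply` kept returning 0 in both cases, the coordinator would emit EOF and silently drop the results still pending on the shards.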

Enhancement:
rpnetNext
Cache the results length instead of fetching it on each call.
API:
_FT.DEBUG <query>  <DEBUG_TYPE> <DEBUG_TYPE_ARGS> ... DEBUG_PARAMS_COUNT 2

query can be either FT.AGGREGATE ... or FT.SEARCH ... (with all supported args)

DEBUG_TYPE currently supports TIMEOUT_AFTER_N, followed by the number of results to return before a timeout is reported (0 included).

Can be called in both SA and cluster mode:
_FT.DEBUG FT.AGGREGATE
_FT.DEBUG FT.SEARCH

Can be called only if the number of shards is 1:
_FT.DEBUG _FT.AGGREGATE
_FT.DEBUG _FT.SEARCH

Changes taking place in production functions:
module.c
* `DistSearchCommand` (FT.SEARCH) and `DistAggregateCommand` (FT.AGGREGATE) check whether the command is a debug command.
* If debug and 1 shard -> call DEBUG_RSSearchCommand or DEBUG_RSAggregateCommand for search and aggregate, respectively.
* If debug and cluster -> call DEBUG_DistSearchCommandHandler or DEBUG_RSExecDistAggregate for search and aggregate, respectively.

* Split FlatSearchCommandHandler into:
  * CreateReq
  * prepareCommand
  * MR_CreateCtx
  * MRCtx_SetReduceFunction
  * MR_Fanout

* DEBUG_FlatSearchCommandHandler flow:
  * CreateReq
  * prepareCommand
  * add _FT.DEBUG as the first arg
  * add the debug params at the end

* Expose DistAggregateCommand and DistSearchCommand so they can be called from debug_commands.c.

aggregate_exec.c:
Split execCommandCommon into:
  * prepareRequest
  * buildPipelineAndExecute

Check AREQ->flags & QEXEC_F_DEBUG in prepareExecutionPlan.
Change the withProfile arg to execOptions.
Change the profile options to bit flags.
Remove NO_PROFILE and replace it with a flag, EXEC_WITH_PROFILE, indicating this is a profile command;
then, if it is a profile command, check whether the EXEC_WITH_PROFILE_LIMITED bit is set.
parseProfile also gets execOptions.

* DEBUG_* commands are defined in debug_commands.h.

New:
RPTimeoutAfterCount_New
A result processor that simulates a timeout after n results.
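The behavior of that result processor can be modeled like this (an illustrative Python sketch of the C result-processor chain, with hypothetical names):

```python
class ListUpstream:
    """Stand-in upstream result processor that yields a fixed list of results."""
    def __init__(self, items):
        self.items = list(items)
    def next(self):
        return ("OK", self.items.pop(0)) if self.items else ("EOF", None)

class RPTimeoutAfterCount:
    """Toy model: pass through up to `n` upstream results, then report a
    timeout instead of continuing, simulating TIMEOUT_AFTER_N <n>."""
    def __init__(self, upstream, n):
        self.upstream = upstream
        self.remaining = n
    def next(self):
        if self.remaining == 0:
            return ("TIMEDOUT", None)
        status, result = self.upstream.next()
        if status == "OK":
            self.remaining -= 1
        return (status, result)
```

Placed downstream of the final execution step, such a processor lets tests exercise timeout handling deterministically, regardless of wall-clock time.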

AREQ_Debug_New creates an AREQ_Debug holding the request by value; hence they are freed together.
New AREQ flag: QEXEC_F_DEBUG
Before this fix, self->remaining was decreased and then base->upstream->Next(base->upstream, r) was returned,
but the ->Next call could have returned no results, so the counter was wrongly updated for it.
This caused a wrong number of accumulated results returned by the cursor when a timeout was reached in one or more of the cursor's read commands. Each timed-out read lost one result.
Use GetNumShards_UnSafe in the debug command instead of an if in the production function
that will add the timeout RP only if the query runs in a shard.
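The rppagerNext_Limit fix described above boils down to reordering the decrement and the upstream call; a minimal model (hypothetical Python, not the C code):

```python
def pager_next_buggy(state, upstream_next):
    """Buggy order: decrement `remaining` before knowing whether the upstream
    actually produced a result, so a timed-out read still consumes a slot."""
    state["remaining"] -= 1
    return upstream_next()

def pager_next_fixed(state, upstream_next):
    """Fixed order: call upstream first and only count results that were
    actually returned."""
    status = upstream_next()
    if status == "OK":
        state["remaining"] -= 1
    return status
```

With the buggy order, every read that hits a timeout loses one result from the cursor's accumulated total, which is exactly the symptom described.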

Add test_timeout.
Add to debug tests:
TeststrictPolicy
testInternalOnly

Add test_cursors:testTimeoutPartialWithEmptyResults

Rename parseDebugParams -> parseDebugParamsCount
Remove bug fixes:
1. sorter
2. pager
3. resp3 change rc in RS_RESULT_TIMEDOUT
4. coord hanging

Tests:
Revert test cursors with _FT.DEBUG QUERY
Remove new test_timeout

codecov bot commented Jan 23, 2025

Codecov Report

Attention: Patch coverage is 95.00000% with 19 lines in your changes missing coverage. Please review.

Project coverage is 88.09%. Comparing base (fe9ec7a) to head (fbbdab8).
Report is 3 commits behind head on master.

Files with missing lines Patch % Lines
src/coord/dist_aggregate.c 92.04% 7 Missing ⚠️
src/module.c 93.10% 6 Missing ⚠️
src/debug_commands.c 83.33% 4 Missing ⚠️
src/aggregate/aggregate_debug.c 98.36% 1 Missing ⚠️
src/aggregate/aggregate_exec.c 98.21% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5554      +/-   ##
==========================================
+ Coverage   88.05%   88.09%   +0.04%     
==========================================
  Files         197      198       +1     
  Lines       35722    35999     +277     
==========================================
+ Hits        31454    31714     +260     
- Misses       4268     4285      +17     

☔ View full report in Codecov by Sentry.

meiravgri and others added 7 commits February 6, 2025 13:18
isClusterCoord to check if we are the coordinator in cluster mode
* reset self->timedOut
add test

* revert original test

* support empty results (to be reverted)

* revert getting empty result

fix bug 1:
aggregate_exec:sendChunk_Resp3
Don't change rc upon timeout, to avoid returning empty results.

bug fix 2:
reset the sorter timeout so we return EOF after depleting the heap.

bug fix 3:
if the shards didn't send any reply (currently assumed possible only due to timeout) but did finish collecting the results, then inProcess is set to 0 while pending is still greater than 0, and the cursor is not depleted.
in this case, we unblock the channel and signal it to proceed; we rely on the coordinator to check the timeout. If it has timed out, it will return.

to allow that, getNextReply returns pending (the number of shards that potentially have results) instead of 0; in this case rpnet won't return EOF, but timeout.

enhancement:
rpnetNext
cache the results length instead of fetching it on each call

* revert result len

* revert resp2 reply length

* fix rppagerNext_Limit counter update

before this fix, self->remaining was decreased and then base->upstream->Next(base->upstream, r) was returned,
but the ->Next call could have returned no results, so the counter was wrongly updated for it.
This caused a wrong number of accumulated results returned by the cursor when a timeout was reached in one or more of the cursor's read commands. Each timed-out read lost one result.

* add optional parameter: [INTERNAL_ONLY]
that will add the timeout RP only if the query runs in a shard

add test_timeout
add to debug tests:
TeststrictPolicy
testInternalOnly

add test_cursors:testTimeoutPartialWithEmptyResults

rename parseDebugParams -> parseDebugParamsCount

* finalize tests

* fix comment

* disable testTimeoutPartialWithEmptyResults test

* enforce timeout when calling INTERNAL_ONLY with TIMEOUT_AFTER_N where N==0 to avoid coordinator infinite loop

fix missing return in debug commands

* fix tests: explicitly set timeout policy to on_timeout return

* fix use-after-free in rmr.c:MRIteratorCallback_ProcessDone

The problem was that we were trying to decrement &ctx->it->ctx.inProcess when the shard decided to release the iterator (MRIterator_Release freed it).
The solution is to change the MRIterator_Release signature: it now returns whether it freed the iterator or not. If it was released, return.
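A toy refcount model of the new contract (hypothetical Python; the real API is C):

```python
class MRIterator:
    """Model of the fix: release() reports whether this call freed the
    iterator, so the caller knows it must not touch the object afterwards."""
    def __init__(self, refcount=2):
        self.refcount = refcount
        self.freed = False
    def release(self) -> bool:
        self.refcount -= 1
        if self.refcount == 0:
            self.freed = True  # stand-in for the actual free()
            return True        # caller must bail out immediately
        return False           # iterator still alive; safe to keep using
```

The caller pattern becomes `if it.release(): return` before any further access (such as decrementing inProcess), which is what closes the use-after-free window.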

* return false if free flag is set to false

* move the reader reset of the freeFlag to the uv thread to avoid race conditions with the shards.

Ensure the sequence of returning from MRChannel_Pop and re-blocking the channel
is atomic under the channel guard.

Remove MRChannel_Block.

* fix leak

* simulate timeout during build - TO BE REVERTED, as it currently doesn't simulate the scenario well:
i.e., the shard returns a timeout error string, but the coordinator clock is not set to time out.
Trying to only return an error from the shards without any special changes in the coordinator: RPNetNext returns an error (and exits the while loop) only if the reply contains a timeout AND the policy is to fail. Otherwise, we enter another while-loop iteration without processing the reply, overriding it on the next iteration. This is a leak.
Another approach I took: in the callback passed to the result processor, I tried to first call getNextReply before rpnetNext checks the timeout, but it didn't work for some reason I couldn't figure out (the shards didn't shut down; might be a bug in RLTest).

Returning a generic query code doesn't simulate the required scenario.

* REVERT timeout on build

* enable test_cursors:testTimeoutPartialWithEmptyResults

skip testEmptyResult in coord

* fix comment

remove sortable

* stabilize test_error_with_partial_results by using query debug

* move the reader reset of the freeFlag to the uv thread to avoid race conditions with the shards.

Ensure the sequence of returning from MRChannel_Pop and re-blocking the channel
is atomic under the channel guard.

Remove MRChannel_Block.

* signal one reader

simplify testCursorDepletionNonStrictTimeoutPolicySortby test

* fix typo

* revert force failure
@meiravgri meiravgri requested a review from alonre24 February 16, 2025 15:53
@alonre24 alonre24 changed the title Meiravg_debug_query_command Introducing query debug mechanism 🎉 Feb 16, 2025
@meiravgri meiravgri requested a review from GuyAv46 February 16, 2025 17:16
Collaborator

@alonre24 alonre24 left a comment


Very good and important job! 💪
See my questions and suggestions in the comments

cur = cur->upstream;
}

if (cur) {
Collaborator


In which cases do we expect cur to be NULL? Should we debug-assert it? Also, consider storing the relevant RPIndexIterator pointer as part of the RPTimeoutAfterCount struct if it simplifies things

Collaborator Author


There is no RP_INDEX in the coordinator pipeline.
I don't think it's complicated to search for it at runtime, as in most cases (actually all of them, except the coordinator) this is the next RP

Collaborator


just add a comment that if it is the coord, we are not expected to get here

skipTest(cluster=True)
env = self.env
results_count = 200
timeout_res_count = results_count - 1
Collaborator


why minus 1?

Collaborator Author


To get timeout and not EOF

Collaborator


maybe comment it :)

Comment on lines 544 to 545
# if the sum of results from all shards exceeds timeout_res_count but *each* shard individually returns fewer:
# no timeout warning is issued, the cursor is depleted.
Collaborator


In this case, are we stopping whenever we reach an accumulated number of results >= timeout_res_count (even if we did not collect results from some shards)?

Collaborator Author


didn't understand the question

Collaborator


try to refine the comment to be exact

Collaborator Author


Refining and moving this comment to the aggregate debug header, as it is relevant to the general behaviour of FT.AGGREGATE and not specifically to cursors.

@meiravgri meiravgri requested a review from raz-mon March 3, 2025 16:18
Collaborator Author

@meiravgri meiravgri left a comment


// commented in the wrong PR

raz-mon
raz-mon previously approved these changes Mar 5, 2025
Collaborator

@raz-mon raz-mon left a comment


Nice
Approving since the only comments are on the debug command, which does not have to be perfect. I think it would be better off without the dummy debug argument and with order enforcement on the arguments, but it's not a blocker for the PR AFAIC.

@meiravgri meiravgri requested a review from raz-mon March 6, 2025 08:20
Collaborator

@raz-mon raz-mon left a comment


🔥

@meiravgri meiravgri enabled auto-merge March 6, 2025 08:39
@meiravgri meiravgri added this pull request to the merge queue Mar 6, 2025
Merged via the queue into master with commit d45701e Mar 6, 2025
10 checks passed
@meiravgri meiravgri deleted the meiravg_debug_query_command branch March 6, 2025 09:45
@redisearch-backport-pull-request
Contributor

Backport failed for 2.8, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin 2.8
git worktree add -d .worktree/backport-5554-to-2.8 origin/2.8
cd .worktree/backport-5554-to-2.8
git switch --create backport-5554-to-2.8
git cherry-pick -x d45701ecbb8bfad221355aa0bba7e843acdde1b0

@redisearch-backport-pull-request
Contributor

Backport failed for 2.6, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin 2.6
git worktree add -d .worktree/backport-5554-to-2.6 origin/2.6
cd .worktree/backport-5554-to-2.6
git switch --create backport-5554-to-2.6
git cherry-pick -x d45701ecbb8bfad221355aa0bba7e843acdde1b0

@redisearch-backport-pull-request
Contributor

Backport failed for 2.10, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin 2.10
git worktree add -d .worktree/backport-5554-to-2.10 origin/2.10
cd .worktree/backport-5554-to-2.10
git switch --create backport-5554-to-2.10
git cherry-pick -x d45701ecbb8bfad221355aa0bba7e843acdde1b0

@redisearch-backport-pull-request
Contributor

Backport failed for 8.0, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin 8.0
git worktree add -d .worktree/backport-5554-to-8.0 origin/8.0
cd .worktree/backport-5554-to-8.0
git switch --create backport-5554-to-8.0
git cherry-pick -x d45701ecbb8bfad221355aa0bba7e843acdde1b0

meiravgri added a commit that referenced this pull request Mar 6, 2025
* Bug fixes related to Timeout with return partial results policy (#5556)
* fix spelling errors

* review fixes:

remove load from the AggregateDebug test

introduce DistAggregateCleanups to handle `goto err` in DEBUG_RSExecDistAggregate and RSExecDistAggregate

comments in PipelineAddTimeoutAfterCount

* move debug things to a header

* elaborate comment

* more details in aggregate_debug.h
reference the cursor test to the header

modify and add some comments

* fix comment in rp
use cursor to parse debug args

fix

* disable internal only when used with 0 results and no user cursor

* remove DUMMY_DEBUG_OPTION

* fix

(cherry picked from commit d45701e)
github-merge-queue bot pushed a commit that referenced this pull request Mar 9, 2025
* Introducing query debug mechanism 🎉 (#5554)

* reset self->timedOut
add test

* revert otiginal test

* suuport empty results (to bereverted)

* revert getting empty result

fix bug 1:
aggregate_exec:sendChunk_Resp3
Don't change rc upon timeout, to avoid returning empty results.

bug fix2:
reset sorter timeout so we retrun rof after depleting the heap.

bug fix3:
if shards didn't send any reply (curerntly assuming only possible due to timeout), but they did finish collecting the results, than inprocess would set to 0, but pending is still larger than 0, and the cursor is not depleted.
in this case, we unblock the chanel, signle it to proceed. we rely on the coordinator to check the timeout. If it's timeout it will return.

to allow that, get next reply returns pending (number of shard that potentially have results) instead of 0. in this case rpnet won't return EOF, but timeout.

ehancement:
rpnetNext
cache results len instead of getting it each

* revert result len

* revert resp2 reply length

* Add query debug commands:
API:
_FT.DEBUG <query>  <DEBUG_TYPE> <DEBUG_TYPE_ARGS> ... DEBUG_PARAMS_COUNT 2

query can be either FT.AGGREGATE ... or FT.SEARCH ... (with all supported args)

DEBUG_TYPE Currently supports TIMEOUT_AFTER_N, followed by the number of results to return before timeout is returned, including 0.

can be called both in SA and cluster:
_FT.DEBUG FT.AGGREGATE
_FT.DEBUG FT.SEARCH

Can be called only if num shards is 1:
_FT.DEBUG _FT.AGGREGATE
_FT.DEBUG _FT.SEARCH

Changes taking place in production functions:
module.c
*if in `DistSearchCommand` (ft.search) and `DistAggregateCommand` (ft.aggregate) to check if it's a debug command
* if debug and 1 shard -> call DEBUG_RSSearchCommand or DEBUG_RSAggregateCommand for search and aggregate, respectively.
*if debug and cluster: call DEBUG_DistSearchCommandHandler or DEBUG_RSExecDistAggregate for search and aggregate, respectively.

* split FlatSearchCommandHandler into:
  * CreateReq
  * prepareCommand
  * MR_CreateCtx
  * MRCtx_SetReduceFunction
  * MR_Fanout

* DEBUG_FlatSearchCommandHandler flow;
  * CreateReq
  * prepareCommand
  * add _FT.DEBUG as first arg
  * add debug params at the end

* expose DistAggregateCommandand DistSearchCommand so it can be called from debug_commands.c

aggregate_exec.c:
split execCommandCommon into
  * prepareRequest
  * buildPipelineAndExecute

check if AREQ->flags & QEXEC_F_DEBUG in prepareExecutionPlan
change arg withProfile to execOptions
change profile options to bit flags
remove NO_PROFILE and replace with a flag indicating this is a profile command EXEC_WITH_PROFILE
than if its profile, checking if EXEC_WITH_PROFILE_LIMITED bit is set.
parseProfile also gets execOptions

*DEBUG_* command are defined in debug_commands.h

New:
RPTimeoutAfterCount_New
a result processor sumulates timeout after n results.

AREQ_Debug_New create AREQ_Debug holding a request by value. hence freed toghther.
new AREQ flag : QEXEC_F_DEBUG

* fix rppagerNext_Limit counter update

before this fix self->remaining was decreased and than base->upstream->Next(base->upstream, r) was returned
but the ->Next call could have returned no results, so the counter was wrongly updated for it.
This caused wrong number of accumelated results retuned by cursor, when timeout was reached in one or more of the cursor's read command. In each timedout read we lost one result.

* finalize test debug comman

* add GetNumShards_UnSafe to read number of shards
use in in debug command instead of havong an if in the production function

* add sanity test

* add optional paramter: [INTERNAL_ONLY]
that will add the timeout Rp only if the query runs in a shard

add tst_timeout
add to debug tests:
TeststrictPolicy
testInternalOnly

add test_cursors:testTimeoutPartialWithEmptyResults

rename parseDebugParams->parseDebugParamsCount

* finalize tests

* fix comment

* remove bug fixes
1. sorter
2. pager
3. resp3 change rc in RS_RESULT_TIMEDOUT
4. coord hangign

tests:
revert rest cursors with _FT.DEBUG QUERY
remove new test_timeout

* add timeout build mechanism

* set non-strict mode in debug tests
add isClusterCoord to check whether we are the coordinator in cluster mode

* REVERT timeout on build

* Bug fixes related to Timeout with return partial results policy (#5556)

* reset self->timedOut
add test

* revert original test

* support empty results (to be reverted)

* revert getting empty result

fix bug 1:
aggregate_exec:sendChunk_Resp3
Don't change rc upon timeout, to avoid returning empty results.

bug fix2:
reset the sorter timeout so we return EOF after depleting the heap.

bug fix3:
if the shards didn't send any reply (currently assuming this is only possible due to timeout), but they did finish collecting the results, then inProcess would be set to 0 while pending is still larger than 0, and the cursor is not depleted.
in this case, we unblock the channel and signal it to proceed. we rely on the coordinator to check the timeout. If it has timed out, it will return.

to allow that, getNextReply returns pending (the number of shards that potentially have results) instead of 0. in this case rpnet won't return EOF, but timeout.
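The reply-count change can be illustrated with a hypothetical helper (the function name and the bare-int interface are assumptions for illustration; the real code works on the coordinator's iterator state):

```c
/* If no shard reply is in process but some shards are still pending,
 * report the pending count instead of 0 so the network result processor
 * treats the cursor as not depleted and surfaces a timeout, not EOF.
 * The coordinator's own timeout check then decides whether to return. */
static int getNextReplyCount(int inProcess, int pending) {
  if (inProcess == 0 && pending > 0) {
    return pending;
  }
  return inProcess;
}
```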

enhancement:
rpnetNext:
cache the results length instead of getting it each time

* revert result len

* revert resp2 reply length

* fix rppagerNext_Limit counter update

before this fix, self->remaining was decreased and then base->upstream->Next(base->upstream, r) was returned,
but the ->Next call could have returned no results, so the counter was wrongly updated for it.
This caused a wrong number of accumulated results returned by the cursor when a timeout was reached in one or more of the cursor's read commands. In each timed-out read we lost one result.

* add optional parameter: [INTERNAL_ONLY]
that will add the timeout RP only if the query runs in a shard

add test_timeout
add to debug tests:
TeststrictPolicy
testInternalOnly

add test_cursors:testTimeoutPartialWithEmptyResults

rename parseDebugParams->parseDebugParamsCount

* finalize tests

* fix comment

* disable testTimeoutPartialWithEmptyResults test

* enforce timeout when calling INTERNAL_ONLY with TIMEOUT_AFTER_N where N==0 to avoid coordinator infinite loop

fix missing return in debug commands

* fix tests: explicitly set the timeout policy to on_timeout return

* fix use after free in rmr.c:MRIteratorCallback_ProcessDone

the problem was that we were trying to decrease &ctx->it->ctx.inProcess when the shard decided to release the iterator (MRIterator_Release released it).
The solution is to change the MRIterator_Release signature: it now returns whether it freed the iterator or not. If it was released, return.
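A minimal sketch of the changed contract (the struct shape and refcounting below are assumptions; the real logic in rmr.c differs):

```c
#include <stdbool.h>

typedef struct {
  int refcount;
} MRIterator;

/* Returns true if the iterator was freed. The caller must then stop
 * touching it (e.g. skip decrementing ctx->it->ctx.inProcess), which is
 * exactly the use-after-free this fix prevents. */
static bool MRIterator_Release(MRIterator *it) {
  if (--it->refcount == 0) {
    /* the real code frees the iterator here; omitted so the sketch can
     * use a stack-allocated iterator */
    return true;
  }
  return false;
}
```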

* return false if free flag is set to false

* move reader reset of the freeFlag to the uv thread to avoid race conditions with the shards.

ensure the sequence of returning from MRChannel_Pop and re-blocking the channel
is atomic under the channel guard

remove MRChannel_Block

* fix leak

* simulate timeout during build - TO BE REVERTED, as it currently doesn't simulate the scenario well:
i.e., the shard returns a timeout error string, but the coordinator clock is not set to time out.
Trying to only return an error from the shards without any special changes in the coordinator: RPNetNext returns an error (and exits the while loop) only if the reply contains a timeout AND the policy is to fail. Otherwise, we enter another while-loop iteration without processing the reply, overriding it in the next iteration. This is a leak.
Another approach I took: in the callback passed to the result processor, I tried to first call getNextReply before rpnetNext checks the timeout, but it didn't work for a reason I couldn't figure out (the shards didn't shut down; might be a bug in RLTest).

Returning a generic query code doesn't simulate the required scenario.

* REVERT timeout on build

* enable test_cursors:testTimeoutPartialWithEmptyResults

skip testEmptyResult in coord

* fix comment

remove sortable

* stable test_error_with_partial_results by using query debug

* move reader reset of the freeFlag to the uv thread to avoid race conditions with the shards.

ensure the sequence of returning from MRChannel_Pop and re-blocking the channel
is atomic under the channel guard

remove MRChannel_Block

* signal one reader

simplify testCursorDepletionNonStrictTimeoutPolicySortby test

* fix typo

* revert force failure

* fix spelling errors

* review fixes:

remove load from AggregateDebug test

introduce DistAggregateCleanups to handle `goto err` in DEBUG_RSExecDistAggregate and RSExecDistAggregate

comments in PipelineAddTimeoutAfterCount

* move debug things to a header

* elaborate comment

* more details in aggregate_debug.h
reference the cursor test in the header

modify and add some comments

* fix comment in  rp
use cursor to parse debug args

fix

* disable internal only when used with 0 results and no user cursor

* remove DUMMY_DEBUG_OPTION

* fix

(cherry picked from commit d45701e)

* fix
@meiravgri meiravgri changed the title Introducing query debug mechanism 🎉 [MOD-8605] Introducing query debug mechanism 🎉 Mar 20, 2025