Skip to content

[8.0] Introducing query debug mechanism 🎉 (#5554)#5732

Merged
meiravgri merged 2 commits into8.0from
backport-5554-to-8.0
Mar 9, 2025
Merged

[8.0] Introducing query debug mechanism 🎉 (#5554)#5732
meiravgri merged 2 commits into8.0from
backport-5554-to-8.0

Conversation

@meiravgri
Copy link
Collaborator

@meiravgri meiravgri commented Mar 6, 2025

backport #5554 t0 8.0

fixing:
MOD-8483, MOD-8482, MOD-8515, MOD-8606, MOD-8794, MOD-8792

debug query mechanism mod:
MOD-8605

meiravgri and others added 2 commits March 6, 2025 16:27
* reset self->timedOut
add test

* revert otiginal test

* suuport empty results (to bereverted)

* revert getting empty result

fix bug 1:
aggregate_exec:sendChunk_Resp3
Don't change rc upon timeout, to avoid returning empty results.

bug fix2:
reset sorter timeout so we retrun rof after depleting the heap.

bug fix3:
if shards didn't send any reply (curerntly assuming only possible due to timeout), but they did finish collecting the results, than inprocess would set to 0, but pending is still larger than 0, and the cursor is not depleted.
in this case, we unblock the chanel, signle it to proceed. we rely on the coordinator to check the timeout. If it's timeout it will return.

to allow that, get next reply returns pending (number of shard that potentially have results) instead of 0. in this case rpnet won't return EOF, but timeout.

ehancement:
rpnetNext
cache results len instead of getting it each

* revert result len

* revert resp2 reply length

* Add query debug commands:
API:
_FT.DEBUG <query>  <DEBUG_TYPE> <DEBUG_TYPE_ARGS> ... DEBUG_PARAMS_COUNT 2

query can be either FT.AGGREGATE ... or FT.SEARCH ... (with all supported args)

DEBUG_TYPE Currently supports TIMEOUT_AFTER_N, followed by the number of results to return before timeout is returned, including 0.

can be called both in SA and cluster:
_FT.DEBUG FT.AGGREGATE
_FT.DEBUG FT.SEARCH

Can be called only if num shards is 1:
_FT.DEBUG _FT.AGGREGATE
_FT.DEBUG _FT.SEARCH

Changes taking place in production functions:
module.c
*if in `DistSearchCommand` (ft.search) and `DistAggregateCommand` (ft.aggregate) to check if it's a debug command
* if debug and 1 shard -> call DEBUG_RSSearchCommand or DEBUG_RSAggregateCommand for search and aggregate, respectively.
*if debug and cluster: call DEBUG_DistSearchCommandHandler or DEBUG_RSExecDistAggregate for search and aggregate, respectively.

* split FlatSearchCommandHandler into:
  * CreateReq
  * prepareCommand
  * MR_CreateCtx
  * MRCtx_SetReduceFunction
  * MR_Fanout

* DEBUG_FlatSearchCommandHandler flow;
  * CreateReq
  * prepareCommand
  * add _FT.DEBUG as first arg
  * add debug params at the end

* expose DistAggregateCommandand DistSearchCommand so it can be called from debug_commands.c

aggregate_exec.c:
split execCommandCommon into
  * prepareRequest
  * buildPipelineAndExecute

check if AREQ->flags & QEXEC_F_DEBUG in prepareExecutionPlan
change arg withProfile to execOptions
change profile options to bit flags
remove NO_PROFILE and replace with a flag indicating this is a profile command EXEC_WITH_PROFILE
than if its profile, checking if EXEC_WITH_PROFILE_LIMITED bit is set.
parseProfile also gets execOptions

*DEBUG_* command are defined in debug_commands.h

New:
RPTimeoutAfterCount_New
a result processor sumulates timeout after n results.

AREQ_Debug_New create AREQ_Debug holding a request by value. hence freed toghther.
new AREQ flag : QEXEC_F_DEBUG

* fix rppagerNext_Limit counter update

before this fix self->remaining was decreased and than base->upstream->Next(base->upstream, r) was returned
but the ->Next call could have returned no results, so the counter was wrongly updated for it.
This caused wrong number of accumelated results retuned by cursor, when timeout was reached in one or more of the cursor's read command. In each timedout read we lost one result.

* finalize test debug comman

* add GetNumShards_UnSafe to read number of shards
use in in debug command instead of havong an if in the production function

* add sanity test

* add optional paramter: [INTERNAL_ONLY]
that will add the timeout Rp only if the query runs in a shard

add tst_timeout
add to debug tests:
TeststrictPolicy
testInternalOnly

add test_cursors:testTimeoutPartialWithEmptyResults

rename parseDebugParams->parseDebugParamsCount

* finalize tests

* fix comment

* remove bug fixes
1. sorter
2. pager
3. resp3 change rc in RS_RESULT_TIMEDOUT
4. coord hangign

tests:
revert rest cursors with _FT.DEBUG QUERY
remove new test_timeout

* add timeout build mechanism

* set non strict mode to debug tests
isClusterCoord to check if we are the coordinator in cluster mode

* REVERT timeout on build

* Bug fixes related to Timeout with return partial results policy (#5556)

* reset self->timedOut
add test

* revert otiginal test

* suuport empty results (to bereverted)

* revert getting empty result

fix bug 1:
aggregate_exec:sendChunk_Resp3
Don't change rc upon timeout, to avoid returning empty results.

bug fix2:
reset sorter timeout so we retrun rof after depleting the heap.

bug fix3:
if shards didn't send any reply (curerntly assuming only possible due to timeout), but they did finish collecting the results, than inprocess would set to 0, but pending is still larger than 0, and the cursor is not depleted.
in this case, we unblock the chanel, signle it to proceed. we rely on the coordinator to check the timeout. If it's timeout it will return.

to allow that, get next reply returns pending (number of shard that potentially have results) instead of 0. in this case rpnet won't return EOF, but timeout.

ehancement:
rpnetNext
cache results len instead of getting it each

* revert result len

* revert resp2 reply length

* fix rppagerNext_Limit counter update

before this fix self->remaining was decreased and than base->upstream->Next(base->upstream, r) was returned
but the ->Next call could have returned no results, so the counter was wrongly updated for it.
This caused wrong number of accumelated results retuned by cursor, when timeout was reached in one or more of the cursor's read command. In each timedout read we lost one result.

* add optional paramter: [INTERNAL_ONLY]
that will add the timeout Rp only if the query runs in a shard

add tst_timeout
add to debug tests:
TeststrictPolicy
testInternalOnly

add test_cursors:testTimeoutPartialWithEmptyResults

rename parseDebugParams->parseDebugParamsCount

* finalize tests

* fix comment

* disable testTimeoutPartialWithEmptyResults test

* enforce timeout when calling INTERNAL_ONLY with TIMEOUT_AFTER_N where N==0 to avoid coordinator infinite loop

fix missing return in debug commands

* fix tests: explictly settimeout policy to on_timeout return

* fix use after free in rmr.c:MRIteratorCallback_ProcessDone

the problem was that we were trying to decrese &ctx->it->ctx.inProcess when the shard decided to release the iterator (MRIterator_Release released it)
The solution is to change MRIterator_Release signature. Now it returns weather it freed the iterator or not. if it was released, return

* return false if free flag is set to false

* move reader reset of the freeFlag to the uv thread to avoid race conditions with the shards.

ensure the sequence of returning  from MRChannel_Popand re - blocking the chanel
is atomic under the chanel guard

remove MRChannel_Block

* fix leak

* simulate timeout during build - TO BE REVERTED as currently doesn't simulate well the scenario:
i.e -the shard returns timeout error string, butthe coordinator clock is not set to be timeout.
Trying to only return an error from the shards without any special changes in the coordinator: RPNetNext returns error (and exits the while loop) only if the reply contains timeout AND the policy is to fail. otherwise, we enter another while loop iteration without processing the reply, overriding it with the next loop. This is a leak.
Another approach i took was: in the callback passed to the resultprocessor I tried to first call getnextreply before rpnetNext checks the timeout, but it didn't work since for some reason i coudnlt figure (the shards didn't shut down. mught be a bug in RLTest)

Returning a generic query code soen't simulate the scenario required.

* REVERT tieout on buld

* enable test_cursors:testTimeoutPartialWithEmptyResults

skip testEmptyResult in coord

* fix comment

revmove sortable

* stable test_error_with_partial_results by using query debug

* move reader reset of the freeFlag to the uv thread to avoid race conditions with the shards.

ensure the sequence of returning  from MRChannel_Popand re - blocking the chanel
is atomic under the chanel guard

remove MRChannel_Block

* signal one reader

simplify testCursorDepletionNonStrictTimeoutPolicySortby test

* fix typo

* revert force failure

* fix spelling errors

* review fixes:

remove load from AggregateDebug test

introduce DistAggregateCleanups to handle `go to err` in DEBUG_RSExecDistAggregate and RSExecDistAggregate

comments in PipelineAddTimeoutAfterCount

* move debug things to a header

* alaborate comment

* more details in aggregate_debug.h
refrence cursor test to the header

modify and add some comments

* fix comment in  rp
use cursor to parse dewbug args

fix

* disable internal only when used with 0 results and no user cursor

* remove DUMMY_DEBUG_OPTION

* fix

(cherry picked from commit d45701e)
@meiravgri meiravgri enabled auto-merge March 6, 2025 16:38
@codecov
Copy link

codecov bot commented Mar 6, 2025

Codecov Report

Attention: Patch coverage is 95.00000% with 19 lines in your changes missing coverage. Please review.

Project coverage is 88.11%. Comparing base (54dddef) to head (3d02579).
Report is 1 commits behind head on 8.0.

Files with missing lines Patch % Lines
src/coord/dist_aggregate.c 92.04% 7 Missing ⚠️
src/module.c 93.10% 6 Missing ⚠️
src/debug_commands.c 83.33% 4 Missing ⚠️
src/aggregate/aggregate_debug.c 98.36% 1 Missing ⚠️
src/aggregate/aggregate_exec.c 98.21% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##              8.0    #5732      +/-   ##
==========================================
+ Coverage   88.03%   88.11%   +0.07%     
==========================================
  Files         197      198       +1     
  Lines       35689    35966     +277     
==========================================
+ Hits        31420    31692     +272     
- Misses       4269     4274       +5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@meiravgri meiravgri requested a review from raz-mon March 7, 2025 06:19
@meiravgri meiravgri added this pull request to the merge queue Mar 9, 2025
Merged via the queue into 8.0 with commit 5b795bf Mar 9, 2025
9 checks passed
@meiravgri meiravgri deleted the backport-5554-to-8.0 branch March 9, 2025 08:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants