[8.0] Introducing query debug mechanism :tada: (#5554) by meiravgri · Pull Request #5732 · RediSearch/RediSearch

meiravgri · 2025-03-06T16:29:16Z

backport #5554 t0 8.0

fixing:
MOD-8483, MOD-8482, MOD-8515, MOD-8606, MOD-8794, MOD-8792

debug query mechanism mod:
MOD-8605

* reset self->timedOut add test * revert otiginal test * suuport empty results (to bereverted) * revert getting empty result fix bug 1: aggregate_exec:sendChunk_Resp3 Don't change rc upon timeout, to avoid returning empty results. bug fix2: reset sorter timeout so we retrun rof after depleting the heap. bug fix3: if shards didn't send any reply (curerntly assuming only possible due to timeout), but they did finish collecting the results, than inprocess would set to 0, but pending is still larger than 0, and the cursor is not depleted. in this case, we unblock the chanel, signle it to proceed. we rely on the coordinator to check the timeout. If it's timeout it will return. to allow that, get next reply returns pending (number of shard that potentially have results) instead of 0. in this case rpnet won't return EOF, but timeout. ehancement: rpnetNext cache results len instead of getting it each * revert result len * revert resp2 reply length * Add query debug commands: API: _FT.DEBUG <query> <DEBUG_TYPE> <DEBUG_TYPE_ARGS> ... DEBUG_PARAMS_COUNT 2 query can be either FT.AGGREGATE ... or FT.SEARCH ... (with all supported args) DEBUG_TYPE Currently supports TIMEOUT_AFTER_N, followed by the number of results to return before timeout is returned, including 0. can be called both in SA and cluster: _FT.DEBUG FT.AGGREGATE _FT.DEBUG FT.SEARCH Can be called only if num shards is 1: _FT.DEBUG _FT.AGGREGATE _FT.DEBUG _FT.SEARCH Changes taking place in production functions: module.c *if in `DistSearchCommand` (ft.search) and `DistAggregateCommand` (ft.aggregate) to check if it's a debug command * if debug and 1 shard -> call DEBUG_RSSearchCommand or DEBUG_RSAggregateCommand for search and aggregate, respectively. *if debug and cluster: call DEBUG_DistSearchCommandHandler or DEBUG_RSExecDistAggregate for search and aggregate, respectively. * split FlatSearchCommandHandler into: * CreateReq * prepareCommand * MR_CreateCtx * MRCtx_SetReduceFunction * MR_Fanout * DEBUG_FlatSearchCommandHandler flow; * CreateReq * prepareCommand * add _FT.DEBUG as first arg * add debug params at the end * expose DistAggregateCommandand DistSearchCommand so it can be called from debug_commands.c aggregate_exec.c: split execCommandCommon into * prepareRequest * buildPipelineAndExecute check if AREQ->flags & QEXEC_F_DEBUG in prepareExecutionPlan change arg withProfile to execOptions change profile options to bit flags remove NO_PROFILE and replace with a flag indicating this is a profile command EXEC_WITH_PROFILE than if its profile, checking if EXEC_WITH_PROFILE_LIMITED bit is set. parseProfile also gets execOptions *DEBUG_* command are defined in debug_commands.h New: RPTimeoutAfterCount_New a result processor sumulates timeout after n results. AREQ_Debug_New create AREQ_Debug holding a request by value. hence freed toghther. new AREQ flag : QEXEC_F_DEBUG * fix rppagerNext_Limit counter update before this fix self->remaining was decreased and than base->upstream->Next(base->upstream, r) was returned but the ->Next call could have returned no results, so the counter was wrongly updated for it. This caused wrong number of accumelated results retuned by cursor, when timeout was reached in one or more of the cursor's read command. In each timedout read we lost one result. * finalize test debug comman * add GetNumShards_UnSafe to read number of shards use in in debug command instead of havong an if in the production function * add sanity test * add optional paramter: [INTERNAL_ONLY] that will add the timeout Rp only if the query runs in a shard add tst_timeout add to debug tests: TeststrictPolicy testInternalOnly add test_cursors:testTimeoutPartialWithEmptyResults rename parseDebugParams->parseDebugParamsCount * finalize tests * fix comment * remove bug fixes 1. sorter 2. pager 3. resp3 change rc in RS_RESULT_TIMEDOUT 4. coord hangign tests: revert rest cursors with _FT.DEBUG QUERY remove new test_timeout * add timeout build mechanism * set non strict mode to debug tests isClusterCoord to check if we are the coordinator in cluster mode * REVERT timeout on build * Bug fixes related to Timeout with return partial results policy (#5556) * reset self->timedOut add test * revert otiginal test * suuport empty results (to bereverted) * revert getting empty result fix bug 1: aggregate_exec:sendChunk_Resp3 Don't change rc upon timeout, to avoid returning empty results. bug fix2: reset sorter timeout so we retrun rof after depleting the heap. bug fix3: if shards didn't send any reply (curerntly assuming only possible due to timeout), but they did finish collecting the results, than inprocess would set to 0, but pending is still larger than 0, and the cursor is not depleted. in this case, we unblock the chanel, signle it to proceed. we rely on the coordinator to check the timeout. If it's timeout it will return. to allow that, get next reply returns pending (number of shard that potentially have results) instead of 0. in this case rpnet won't return EOF, but timeout. ehancement: rpnetNext cache results len instead of getting it each * revert result len * revert resp2 reply length * fix rppagerNext_Limit counter update before this fix self->remaining was decreased and than base->upstream->Next(base->upstream, r) was returned but the ->Next call could have returned no results, so the counter was wrongly updated for it. This caused wrong number of accumelated results retuned by cursor, when timeout was reached in one or more of the cursor's read command. In each timedout read we lost one result. * add optional paramter: [INTERNAL_ONLY] that will add the timeout Rp only if the query runs in a shard add tst_timeout add to debug tests: TeststrictPolicy testInternalOnly add test_cursors:testTimeoutPartialWithEmptyResults rename parseDebugParams->parseDebugParamsCount * finalize tests * fix comment * disable testTimeoutPartialWithEmptyResults test * enforce timeout when calling INTERNAL_ONLY with TIMEOUT_AFTER_N where N==0 to avoid coordinator infinite loop fix missing return in debug commands * fix tests: explictly settimeout policy to on_timeout return * fix use after free in rmr.c:MRIteratorCallback_ProcessDone the problem was that we were trying to decrese &ctx->it->ctx.inProcess when the shard decided to release the iterator (MRIterator_Release released it) The solution is to change MRIterator_Release signature. Now it returns weather it freed the iterator or not. if it was released, return * return false if free flag is set to false * move reader reset of the freeFlag to the uv thread to avoid race conditions with the shards. ensure the sequence of returning from MRChannel_Popand re - blocking the chanel is atomic under the chanel guard remove MRChannel_Block * fix leak * simulate timeout during build - TO BE REVERTED as currently doesn't simulate well the scenario: i.e -the shard returns timeout error string, butthe coordinator clock is not set to be timeout. Trying to only return an error from the shards without any special changes in the coordinator: RPNetNext returns error (and exits the while loop) only if the reply contains timeout AND the policy is to fail. otherwise, we enter another while loop iteration without processing the reply, overriding it with the next loop. This is a leak. Another approach i took was: in the callback passed to the resultprocessor I tried to first call getnextreply before rpnetNext checks the timeout, but it didn't work since for some reason i coudnlt figure (the shards didn't shut down. mught be a bug in RLTest) Returning a generic query code soen't simulate the scenario required. * REVERT tieout on buld * enable test_cursors:testTimeoutPartialWithEmptyResults skip testEmptyResult in coord * fix comment revmove sortable * stable test_error_with_partial_results by using query debug * move reader reset of the freeFlag to the uv thread to avoid race conditions with the shards. ensure the sequence of returning from MRChannel_Popand re - blocking the chanel is atomic under the chanel guard remove MRChannel_Block * signal one reader simplify testCursorDepletionNonStrictTimeoutPolicySortby test * fix typo * revert force failure * fix spelling errors * review fixes: remove load from AggregateDebug test introduce DistAggregateCleanups to handle `go to err` in DEBUG_RSExecDistAggregate and RSExecDistAggregate comments in PipelineAddTimeoutAfterCount * move debug things to a header * alaborate comment * more details in aggregate_debug.h refrence cursor test to the header modify and add some comments * fix comment in rp use cursor to parse dewbug args fix * disable internal only when used with 0 results and no user cursor * remove DUMMY_DEBUG_OPTION * fix (cherry picked from commit d45701e)

codecov · 2025-03-06T17:04:28Z

Codecov Report

Attention: Patch coverage is 95.00000% with 19 lines in your changes missing coverage. Please review.

Project coverage is 88.11%. Comparing base (54dddef) to head (3d02579).
Report is 1 commits behind head on 8.0.

Files with missing lines	Patch %	Lines
src/coord/dist_aggregate.c	92.04%	7 Missing ⚠️
src/module.c	93.10%	6 Missing ⚠️
src/debug_commands.c	83.33%	4 Missing ⚠️
src/aggregate/aggregate_debug.c	98.36%	1 Missing ⚠️
src/aggregate/aggregate_exec.c	98.21%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##              8.0    #5732      +/-   ##
==========================================
+ Coverage   88.03%   88.11%   +0.07%     
==========================================
  Files         197      198       +1     
  Lines       35689    35966     +277     
==========================================
+ Hits        31420    31692     +272     
- Misses       4269     4274       +5

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

meiravgri and others added 2 commits March 6, 2025 16:27

fix

3d02579

meiravgri enabled auto-merge March 6, 2025 16:38

meiravgri requested a review from raz-mon March 7, 2025 06:19

raz-mon approved these changes Mar 9, 2025

View reviewed changes

meiravgri added this pull request to the merge queue Mar 9, 2025

Merged via the queue into 8.0 with commit 5b795bf Mar 9, 2025
9 checks passed

meiravgri deleted the backport-5554-to-8.0 branch March 9, 2025 08:03

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[8.0] Introducing query debug mechanism 🎉 (#5554)#5732

[8.0] Introducing query debug mechanism 🎉 (#5554)#5732
meiravgri merged 2 commits into8.0from
backport-5554-to-8.0

meiravgri commented Mar 6, 2025 •

edited by atlassian bot

Loading

Uh oh!

codecov bot commented Mar 6, 2025 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

meiravgri commented Mar 6, 2025 • edited by atlassian bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov bot commented Mar 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

meiravgri commented Mar 6, 2025 •

edited by atlassian bot

Loading

codecov bot commented Mar 6, 2025 •

edited

Loading