Skip to content

[8.2] [MOD-12252] [MOD-11658] reduce number of worker threads asynchronously#7534

Merged
JoanFM merged 3 commits into8.2from
backport-7410-to-8.2
Nov 26, 2025
Merged

[8.2] [MOD-12252] [MOD-11658] reduce number of worker threads asynchronously#7534
JoanFM merged 3 commits into8.2from
backport-7410-to-8.2

Conversation

@JoanFM
Copy link
Collaborator

@JoanFM JoanFM commented Nov 26, 2025

Backport #7410 to 8.2


Note

Introduce asynchronous thread reduction in the thread pool with robust initialization signaling, update worker management to use it, and add tests to prevent deadlocks when reducing WORKERS under load.

  • Threadpool (deps/thpool):
    • Add redisearch_thpool_schedule_config_reduce_threads_job(...) to reduce threads asynchronously via ADMIN jobs and barriers; set pool UNINITIALIZED on terminate-when-empty.
    • Refactor init and enqueue paths: split redisearch_thpool_push_chain(...), use per-thread started flags (via thread_do_args) to reliably wait for new threads; thread_do signature changed.
    • Strengthen checks/logging (RS_ASSERT, improved messages); remove synchronous APIs from header (redisearch_thpool_remove_threads, redisearch_thpool_terminate_when_empty).
  • Workers (src/util/workers.c):
    • Switch thread count changes to new async reduction API; keep adds synchronous with start-wait; add notices and assertions; improve drain logging.
  • Config/Debug:
    • Keep WORKERS changes triggering pool reconfig; add warning note for FT.DEBUG WORKERS DRAIN; minor assert macro swap.
  • Tests:
    • Update C++ thpool tests to use async reduction and new init semantics.
    • Add/expand Python tests to validate worker scaling (increase/decrease, zero↔nonzero), gradual reductions, and MOD-11658: avoid deadlock when setting WORKERS to 0 under load.
    • Minor test robustness tweaks (thread joins, timing).

Written by Cursor Bugbot for commit bf344b2. This will update automatically on new commits. Configure here.

#7410)

* change number of threads asynchronously

* do some changes to add thread

* improve code

* cancel old config change if happens

* more improvements

* destroy when sure noone needsit

* simplify implementation drastically

* improve async implementation

* revert barrier changes

* test: do not skip on standalone

* test: add new test

* fix tests

* remove extra tests

* remove unneeded code

* delay ensure init after

* fix: remove weird unlock

* test: handle cpp tests

* small improvement on the test

* improve the checking of thread starting

* remove some unneeded functions from thpool

* remove unused code from tests

* small cleanup of code

* update deps/thpool/thpool.c

* add checks to thpool stats in test

* fix multithread test

* fix memory leak and remove unreachable code

* assert some invariants

* fix: respect adding workers is a lazy operation even after start

* small changes

* remove double assertion

* add another assertion

* small change to avoid dangling pointer in stack

* changes according to comments to the code

* change tests of place and simplify some

* small change

* try to improve flakiness and test time using TimeLimit

* make static and reuse code

* Update deps/thpool/thpool.c

Co-authored-by: GuyAv46 <[email protected]>

* add volatile to started array

* remove extra assertion

* add warning to ft debug command code comment

* change assertion method

* remove flakiness from test

* change test query while flush test

* Update deps/thpool/thpool.c

Co-authored-by: GuyAv46 <[email protected]>

* Update deps/thpool/thpool.c

Co-authored-by: GuyAv46 <[email protected]>

---------

Co-authored-by: GuyAv46 <[email protected]>
(cherry picked from commit 874ce68)
@JoanFM JoanFM requested review from GuyAv46 and alonre24 and removed request for GuyAv46 November 26, 2025 14:03
@JoanFM JoanFM changed the title [8.2] [MOD-12252] [MOD-11658] reduce number of worker threads asynchronousl… [8.2] [MOD-12252] [MOD-11658] reduce number of worker threads asynchronously Nov 26, 2025
@JoanFM JoanFM requested a review from GuyAv46 November 26, 2025 14:05
}
// I do not need to verify init since we are actually putting in priority queue, and I do not want to wait on another barrier
// As per the input, I know i am Initialized, I do not need to verify it (and avoid waiting)
redisearch_thpool_add_n_work_not_verify_init(thpool_p, jobs, n_threads_to_remove, THPOOL_PRIORITY_ADMIN);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Ignored return value causes inconsistent state on allocation failure

In redisearch_thpool_schedule_config_reduce_threads_job, n_threads is decremented on line 1071 before checking if job addition succeeds. The return value of redisearch_thpool_add_n_work_not_verify_init on line 1087 is ignored. If create_jobs_chain fails due to memory allocation failure, the barrier and signal_job_arg are leaked, no admin jobs are queued, and no threads will actually terminate. This leaves the pool in an inconsistent state where n_threads no longer matches the actual alive threads, causing incorrect delta calculations in subsequent workersThreadPool_SetNumWorkers calls.

Fix in Cursor Fix in Web

Copy link

@jit-ci jit-ci bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❌ The following Jit checks failed to run:

  • secret-detection-trufflehog

#jit_bypass_commit in this PR to bypass, Jit Admin privileges required.

More info in the Jit platform.

GuyAv46
GuyAv46 previously approved these changes Nov 26, 2025
GuyAv46
GuyAv46 previously approved these changes Nov 26, 2025
@JoanFM JoanFM enabled auto-merge November 26, 2025 14:51
@JoanFM JoanFM self-assigned this Nov 26, 2025
@JoanFM JoanFM requested a review from GuyAv46 November 26, 2025 15:35
@JoanFM JoanFM closed this Nov 26, 2025
auto-merge was automatically disabled November 26, 2025 15:35

Pull request was closed

@JoanFM JoanFM reopened this Nov 26, 2025
@JoanFM JoanFM enabled auto-merge November 26, 2025 15:48
@codecov
Copy link

codecov bot commented Nov 26, 2025

Codecov Report

❌ Patch coverage is 97.08738% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.42%. Comparing base (72cb880) to head (bf344b2).
⚠️ Report is 2 commits behind head on 8.2.

Files with missing lines Patch % Lines
deps/thpool/thpool.c 96.80% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##              8.2    #7534      +/-   ##
==========================================
- Coverage   89.53%   89.42%   -0.12%     
==========================================
  Files         253      253              
  Lines       40914    40971      +57     
  Branches     3725     3725              
==========================================
+ Hits        36633    36637       +4     
- Misses       4232     4285      +53     
  Partials       49       49              
Flag Coverage Δ
flow 82.05% <95.14%> (-0.78%) ⬇️
unit 47.51% <85.43%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@JoanFM JoanFM added this pull request to the merge queue Nov 26, 2025
Merged via the queue into 8.2 with commit f2f765e Nov 26, 2025
36 of 40 checks passed
@JoanFM JoanFM deleted the backport-7410-to-8.2 branch November 26, 2025 17:31
alonre24 added a commit to alonre24/redis that referenced this pull request Jan 26, 2026
**Bug Fixes:**

* [redis#7219](RediSearch/RediSearch#7219) Fix a concurrency issue on Reducer in `FT.AGGREGATE`
* [redis#7255](RediSearch/RediSearch#7255) Fix `BM25STD` underflow wraparound
* [redis#7275](RediSearch/RediSearch#7275) Report used memory as `unsigned long long` to avoid underflows
* [redis#7264](RediSearch/RediSearch#7264) Fix `totalDocsLen` updates
* [redis#6995](RediSearch/RediSearch#6995) Do not fanout `FT.INFO` to replicas
* [redis#7350](RediSearch/RediSearch#7350) Fix `FT.CREATE` failure with LeanVec parameters on non-Intel architectures
* [redis#7384](RediSearch/RediSearch#7384) Fix index load from RDB temporary memory overhead
* [redis#7459](RediSearch/RediSearch#7459) Fix Fork GC potential double-free on error path
* [redis#7458](RediSearch/RediSearch#7458) Fix a GC performence regression
* [redis#7470](RediSearch/RediSearch#7470) Avoid draining worker thread pool from FLUSH callback to avoid deadlocks
* [redis#7554](RediSearch/RediSearch#7554) Handle the case where `SCORE` is sent alone without extra fields (coordinator)
* [redis#7685](RediSearch/RediSearch#7685) Fix cursor logical leak
* [redis#7794](RediSearch/RediSearch#7794) Fix `cmp_strings()` to correctly handle binary data with embedded NULLs in TOLIST reducer in FT.AGGREGATE
* [redis#7873](RediSearch/RediSearch#7873) Handle warnings in empty `FT.AGGREGATE` replies (cluster)
* [redis#7886](RediSearch/RediSearch#7886) Remove non-TEXT fields from the spec keys dictionary
* [redis#7904](RediSearch/RediSearch#7904) Refactor keys dictionary handling
* [redis#7901](RediSearch/RediSearch#7901) Support multiple warnings in reply
* [redis#8083](RediSearch/RediSearch#8083) Fix incorrect FULLTEXT field metric counts
* [redis#8153](RediSearch/RediSearch#8153) Fix configuration registration issues

**Improvements:**

* [redis#7154](RediSearch/RediSearch#7154) `FT.AGGREGATE` can return Background Indexing OOM warnings
* [redis#7083](RediSearch/RediSearch#7083) Add the default text scorer as a configuration option
* [redis#7341](RediSearch/RediSearch#7341) Rename `FT.PROFILE` counter fields
* [redis#7436](RediSearch/RediSearch#7436) Enhance `FT.PROFILE` with vector search execution details
* [redis#7435](RediSearch/RediSearch#7435) Ensure full `FT.PROFILE` output on timeout with RETURN policy
* [redis#7534](RediSearch/RediSearch#7534) Reduce the number of worker threads asynchronously to avoid deadlocks during queries
* [redis#7614](RediSearch/RediSearch#7614) Track timeout warnings and errors in INFO
* [redis#7646](RediSearch/RediSearch#7646) Track `maxprefixexpansions` warnings and errors in INFO
* [redis#7577](RediSearch/RediSearch#7577) Track query syntax/argument errors (basis for query error tracking)
* [redis#7737](RediSearch/RediSearch#7737) Add `Internal cursor reads` metric to cluster `FT.PROFILE` output
* [redis#7759](RediSearch/RediSearch#7759) Extend indexing metrics
* [redis#7710](RediSearch/RediSearch#7710) Support `WITHCOUNT` keyword in `FT.AGGREGATE`
* [redis#7957](RediSearch/RediSearch#7957) Persist query warnings across cursor reads
* [redis#8054](RediSearch/RediSearch#8054) Add logging for index-related commands
* [redis#8151](RediSearch/RediSearch#8151) Fix shard total profile time reporting in `FT.PROFILE`
* [redis#8103](RediSearch/RediSearch#8103) Output current thread IndexSpec information on crash
YaacovHazan pushed a commit to redis/redis that referenced this pull request Jan 26, 2026
**Bug Fixes:**

* [#7219](RediSearch/RediSearch#7219) Fix a
concurrency issue on Reducer in `FT.AGGREGATE`
* [#7255](RediSearch/RediSearch#7255) Fix
`BM25STD` underflow wraparound
* [#7275](RediSearch/RediSearch#7275) Report
used memory as `unsigned long long` to avoid underflows
* [#7264](RediSearch/RediSearch#7264) Fix
`totalDocsLen` updates
* [#6995](RediSearch/RediSearch#6995) Do not
fanout `FT.INFO` to replicas
* [#7350](RediSearch/RediSearch#7350) Fix
`FT.CREATE` failure with LeanVec parameters on non-Intel architectures
* [#7694](RediSearch/RediSearch#7694) Use
asynchronous jobs for GC in SVS to accelerate execution
* [#7384](RediSearch/RediSearch#7384) Fix index
load from RDB temporary memory overhead
* [#7459](RediSearch/RediSearch#7459) Fix Fork
GC potential double-free on error path
* [#7458](RediSearch/RediSearch#7458) Fix a GC
performence regression
* [#7470](RediSearch/RediSearch#7470) Avoid
draining worker thread pool from FLUSH callback to avoid deadlocks
* [#7554](RediSearch/RediSearch#7554) Handle the
case where `SCORE` is sent alone without extra fields (coordinator)
* [#7685](RediSearch/RediSearch#7685) Fix cursor
logical leak
* [#7794](RediSearch/RediSearch#7794) Fix
`cmp_strings()` to correctly handle binary data with embedded NULLs in
TOLIST reducer in FT.AGGREGATE
* [#7873](RediSearch/RediSearch#7873) Handle
warnings in empty `FT.AGGREGATE` replies (cluster)
* [#7886](RediSearch/RediSearch#7886) Remove
non-TEXT fields from the spec keys dictionary
* [#7904](RediSearch/RediSearch#7904) Refactor
keys dictionary handling
* [#7901](RediSearch/RediSearch#7901) Support
multiple warnings in reply
* [#8083](RediSearch/RediSearch#8083) Fix
incorrect FULLTEXT field metric counts
* [#8153](RediSearch/RediSearch#8153) Fix
configuration registration issues

**Improvements:**

* [#7154](RediSearch/RediSearch#7154)
`FT.AGGREGATE` can return Background Indexing OOM warnings
* [#7083](RediSearch/RediSearch#7083) Add the
default text scorer as a configuration option
* [#7341](RediSearch/RediSearch#7341) Rename
`FT.PROFILE` counter fields
* [#7436](RediSearch/RediSearch#7436) Enhance
`FT.PROFILE` with vector search execution details
* [#7435](RediSearch/RediSearch#7435) Ensure
full `FT.PROFILE` output on timeout with RETURN policy
* [#7534](RediSearch/RediSearch#7534) Reduce the
number of worker threads asynchronously to avoid deadlocks during
queries
* [#7614](RediSearch/RediSearch#7614) Track
timeout warnings and errors in INFO
* [#7646](RediSearch/RediSearch#7646) Track
`maxprefixexpansions` warnings and errors in INFO
* [#7577](RediSearch/RediSearch#7577) Track
query syntax/argument errors (basis for query error tracking)
* [#7737](RediSearch/RediSearch#7737) Add
`Internal cursor reads` metric to cluster `FT.PROFILE` output
* [#7759](RediSearch/RediSearch#7759) Extend
indexing metrics
* [#7710](RediSearch/RediSearch#7710) Support
`WITHCOUNT` keyword in `FT.AGGREGATE`
* [#7957](RediSearch/RediSearch#7957) Persist
query warnings across cursor reads
* [#8054](RediSearch/RediSearch#8054) Add
logging for index-related commands
* [#8151](RediSearch/RediSearch#8151) Fix shard
total profile time reporting in `FT.PROFILE`
* [#8103](RediSearch/RediSearch#8103) Output
current thread IndexSpec information on crash
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants