Dynamic cpu pool #8790
Merged
- OpenAPI / telemetry: user-facing cpu_cores_used description (2s window, when null).
- process_cpu_usage: backoff after procfs errors; serialize Linux unit tests on CACHE.
- Docs: decouple runtime thread comments from hardcoded 4× multiplier; name search_runtime in test.
- consensus test: replace stale runtime comment.
Made-with: Cursor
Replace hand-edited cpu_cores_used description with output from schema_generator + merge pipeline so openapi_consistency_check passes.
Made-with: Cursor
dancixx approved these changes on Apr 27, 2026
timvisee pushed a commit that referenced this pull request on May 8, 2026
* [AI] introduce CPU process measurement
* use parking_lot + 4 seconds refresh rate
* [AI] AdaptiveSearchHandle
* fmt
* openapi schema
* keep Runtime field
* fix test
* [AI] instead of async semaphore, use 2 runtimes
* Adjust usage window to 2 seconds
* Address CodeRabbit review comments for dynamic CPU pool
  - OpenAPI / telemetry: user-facing cpu_cores_used description (2s window, when null).
  - process_cpu_usage: backoff after procfs errors; serialize Linux unit tests on CACHE.
  - Docs: decouple runtime thread comments from hardcoded 4× multiplier; name search_runtime in test.
  - consensus test: replace stale runtime comment.
  Made-with: Cursor
* chore(openapi): regenerate master spec via generate_openapi_models.sh
  Replace hand-edited cpu_cores_used description with output from schema_generator + merge pipeline so openapi_consistency_check passes.
  Made-with: Cursor
---------
Co-authored-by: Cursor Agent <[email protected]>
This change is really helpful. We ran a benchmark using TurboQuant with rescoring from disk, and the RPS increased from 326 to 480, an improvement of about 47%, which closely matches your benchmark results.
Member
Glad to hear. Thanks for sharing!
Motivation
We need to monitor the CPU usage of the current qdrant process
to decide whether to shrink or grow the search thread pool at any given moment.
The thread pool used to have a size equal to the number of CPUs, but that doesn't work well under high IO.
In that case we want more threads (up to 4x the CPU count). But this hurts the fully in-RAM use case (because of CPU contention).
So we want to dynamically adjust the thread pool size based on the current CPU usage of the process.
Implementation
Step 1: Get CPU usage of the current process
We need a function that returns the CPU usage of the qdrant process over the last N seconds (N is a configured constant).
This function must work on Linux; support for other platforms is optional.
Calls to this function should be cheap, so we need "TTL cache" functionality,
so that it actually reads CPU usage from the OS only once per N seconds.
Function semantics:
If 2 CPU cores were 100% busy over the last N seconds, the function should return 2.0.
Using procfs to read the process CPU usage is preferred.
The function should be available globally and implemented in the common crate.
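A minimal sketch of how Step 1 could look, not the code merged in this PR. It assumes the usual Linux layout of /proc/self/stat (utime and stime are fields 14 and 15, in clock ticks) and a tick rate of 100 per second; a real implementation would query the tick rate instead of hardcoding it. All names (parse_cpu_ticks, cores_used, CpuUsageCache) are hypothetical, and the clock and the tick reader are passed in so the logic can be exercised without procfs.

```rust
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// Assumption: 100 clock ticks per second (typical USER_HZ on Linux).
/// A real implementation would query this value from the OS.
const ASSUMED_TICKS_PER_SEC: f64 = 100.0;

/// Extracts utime + stime (in clock ticks) from the text of /proc/<pid>/stat.
pub fn parse_cpu_ticks(stat: &str) -> Option<u64> {
    // Field 2 (comm) is wrapped in parentheses and may contain spaces,
    // so only look at what follows the last ')'.
    let rest = &stat[stat.rfind(')')? + 1..];
    let mut fields = rest.split_whitespace();
    // The first remaining field is field 3 (state), so utime (field 14)
    // sits at index 11 and stime (field 15) directly after it.
    let utime: u64 = fields.nth(11)?.parse().ok()?;
    let stime: u64 = fields.next()?.parse().ok()?;
    Some(utime + stime)
}

/// Reads the tick counter of the current process; None if procfs is unavailable.
pub fn read_self_ticks() -> Option<u64> {
    parse_cpu_ticks(&std::fs::read_to_string("/proc/self/stat").ok()?)
}

/// Converts a tick delta over a wall-clock interval into "cores used":
/// two fully busy cores over the interval yield 2.0.
pub fn cores_used(delta_ticks: u64, elapsed: Duration) -> f64 {
    delta_ticks as f64 / ASSUMED_TICKS_PER_SEC / elapsed.as_secs_f64()
}

struct Sample {
    at: Instant,
    ticks: u64,
    value: Option<f64>,
}

/// TTL cache: the OS is asked at most once per `ttl`; other calls are served
/// from the last computed value. `ttl` is expected to be non-zero.
pub struct CpuUsageCache {
    ttl: Duration,
    last: Mutex<Option<Sample>>,
}

impl CpuUsageCache {
    pub fn new(ttl: Duration) -> Self {
        Self { ttl, last: Mutex::new(None) }
    }

    /// Returns None until two samples at least `ttl` apart have been taken.
    pub fn get(&self, now: Instant, read_ticks: impl FnOnce() -> Option<u64>) -> Option<f64> {
        let mut last = self.last.lock().unwrap();
        if let Some(s) = &*last {
            if now.duration_since(s.at) < self.ttl {
                return s.value; // still fresh: no OS read
            }
        }
        let ticks = read_ticks()?;
        let value = match &*last {
            Some(s) => Some(cores_used(ticks.saturating_sub(s.ticks), now.duration_since(s.at))),
            None => None, // first sample: no baseline to diff against yet
        };
        *last = Some(Sample { at: now, ticks, value });
        value
    }
}
```

In production code the call would presumably look like `cache.get(Instant::now(), read_self_ticks)` on a global instance. The commits in this PR also mention a backoff after procfs errors, which this sketch leaves out.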
Step 2: telemetry
Additionally, we want the CPU usage value in telemetry, so that it is easy to monitor and debug.
If the function is not supported on the platform, it should return None.
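To illustrate the None semantics, here is a small hedged sketch. Only the cpu_cores_used field name comes from this PR; the HardwareTelemetry struct, collect_telemetry, and the hand-written JSON rendering are hypothetical stand-ins (the real telemetry types and serialization are not shown here).

```rust
/// Hypothetical telemetry fragment; only the field name is taken from the PR.
#[derive(Debug, Clone, PartialEq)]
pub struct HardwareTelemetry {
    /// CPU cores used by the process over the last window; None when no
    /// measurement is available (unsupported platform or no sample yet).
    pub cpu_cores_used: Option<f64>,
}

/// Assumption for this sketch: only Linux has a measurement backend.
pub const CPU_USAGE_SUPPORTED: bool = cfg!(target_os = "linux");

/// Builds the telemetry fragment; `measure` is never called on
/// platforms without a backend.
pub fn collect_telemetry(measure: impl FnOnce() -> Option<f64>) -> HardwareTelemetry {
    let cpu_cores_used = if CPU_USAGE_SUPPORTED { measure() } else { None };
    HardwareTelemetry { cpu_cores_used }
}

/// Hand-rolled rendering, just to show that None surfaces as JSON null.
pub fn to_json(t: &HardwareTelemetry) -> String {
    match t.cpu_cores_used {
        Some(v) => format!("{{\"cpu_cores_used\":{v:.2}}}"),
        None => "{\"cpu_cores_used\":null}".to_string(),
    }
}
```

Keeping the field optional lets dashboards distinguish "no measurement" from "zero load".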
Step 3: Auto-adjust thread pool
(old version, for historical reasons)
Details
Currently, we use a tokio runtime as a thread pool, and it has a fixed size of max_blocking_threads, which can't be changed dynamically. So we need a second layer of thread pool control. I propose this:
search_runtime is stored in ShardReplicaSet; it is currently propagated to all search operations and is used to spawn search tasks.
What if we create a wrapper around the Handle type (starting from main.rs) that adds a dynamic check before each spawn_blocking operation?
The check would read the current CPU usage, and if it is close to 100% across all available cores (read with the num_cpus function),
we would lower the number of available threads by 1 at a time (with an N-second cooldown) until it equals the number of CPU cores,
and if it is low (less than 50%), we would increase the number of threads by 1, up to 4x the number of CPUs.
Initially, we would allow 2x the number of CPU cores, and then adjust based on CPU usage.
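The adjustment rule of this (superseded) proposal can be sketched as follows. This is an illustration only: the ThreadLimit type is hypothetical, "close to 100%" is assumed to mean at least 90% of all cores, and the cooldown is applied to both directions, which the description does not spell out.

```rust
use std::time::{Duration, Instant};

/// Assumption: "close to 100%" is taken as >= 90% of all cores.
const HIGH_LOAD: f64 = 0.9;
/// From the description: below 50% counts as low CPU usage.
const LOW_LOAD: f64 = 0.5;
/// From the description: at most 4x the CPU count.
const MAX_FACTOR: usize = 4;

/// Hypothetical holder of the current limit on concurrent search threads.
pub struct ThreadLimit {
    cpus: usize,
    limit: usize,
    cooldown: Duration,
    last_change: Option<Instant>,
}

impl ThreadLimit {
    /// Starts at 2x the number of CPU cores.
    pub fn new(cpus: usize, cooldown: Duration) -> Self {
        let cpus = cpus.max(1);
        Self { cpus, limit: 2 * cpus, cooldown, last_change: None }
    }

    pub fn limit(&self) -> usize {
        self.limit
    }

    /// Feeds one CPU usage reading (in cores used) and returns the new limit.
    pub fn observe(&mut self, cores_used: f64, now: Instant) -> usize {
        if let Some(t) = self.last_change {
            if now.duration_since(t) < self.cooldown {
                return self.limit; // still cooling down after the last change
            }
        }
        let load = cores_used / self.cpus as f64;
        let next = if load >= HIGH_LOAD {
            // CPU-bound: step down, but never below the CPU count.
            self.limit.saturating_sub(1).max(self.cpus)
        } else if load < LOW_LOAD {
            // Likely IO-bound: step up, capped at 4x the CPU count.
            (self.limit + 1).min(MAX_FACTOR * self.cpus)
        } else {
            self.limit
        };
        if next != self.limit {
            self.limit = next;
            self.last_change = Some(now);
        }
        self.limit
    }
}
```

In the proposal, this limit would then be enforced by the wrapper before each spawn_blocking call.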
Step 3: thread pool selection
After initial experiments, it turned out that having a semaphore in spawn_blocking creates overhead.
The new approach gets rid of it by introducing 2 runtimes instead of 1: high_cpu and high_io.
The choice between the runtimes is made the same way, based on CPU usage thresholds. But the check is much cheaper: just a single atomic.
With this approach we have practically no overhead for thread selection.
The downside is that we lose granularity in the thread count, but that seems fine.
Note that having more runtimes doesn't actually introduce more threads, as tokio blocking threads are started on demand. There might be some over-use of threads during the switch from one runtime to the other, but that doesn't seem like a big problem.
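The selection logic can be sketched with a single atomic flag. This is a simplified illustration, not the merged code: the Pool enum stands in for the two tokio runtime handles, PoolSelector is a hypothetical name, the 90% threshold for "high" usage and the initial choice of high_cpu are assumptions, and the flag is expected to be refreshed by whoever samples CPU usage.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Stand-in for the two runtimes; real code would hold runtime handles.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Pool {
    HighCpu,
    HighIo,
}

/// Assumption: >= 90% of all cores counts as CPU-bound.
const HIGH_LOAD: f64 = 0.9;
/// Below 50% of all cores counts as IO-bound.
const LOW_LOAD: f64 = 0.5;

pub struct PoolSelector {
    cpus: f64,
    use_high_io: AtomicBool,
}

impl PoolSelector {
    /// Assumption: start on the high_cpu pool until the first reading arrives.
    pub fn new(cpus: usize) -> Self {
        Self { cpus: cpus.max(1) as f64, use_high_io: AtomicBool::new(false) }
    }

    /// Called when a fresh CPU usage reading (in cores used) is available.
    /// Readings between the two thresholds keep the current choice.
    pub fn update(&self, cores_used: f64) {
        let load = cores_used / self.cpus;
        if load >= HIGH_LOAD {
            self.use_high_io.store(false, Ordering::Relaxed);
        } else if load < LOW_LOAD {
            self.use_high_io.store(true, Ordering::Relaxed);
        }
    }

    /// Hot path before each spawn: a single relaxed atomic load.
    pub fn select(&self) -> Pool {
        if self.use_high_io.load(Ordering::Relaxed) {
            Pool::HighIo
        } else {
            Pool::HighCpu
        }
    }
}
```

Compared with a semaphore, nothing is acquired or released per task; the cost is that the thread count can only take two values.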
Benchmarks
BQ + rescoring from disk
3-node cluster with 20M 1024D vectors, which don't fit in RAM.
Binary quantization enabled with rescoring.
The result: the dynamic thread pool can utilize disk resources better.
Full in-RAM
1 million 128D vectors, 2 CPU limit