Thread pool: move thread creation out of lock #68694
serxa merged 26 commits into ClickHouse:master
Conversation
This is an automated comment for commit 48e4092 with a description of existing statuses. It's updated for the latest CI run ✅ Successful checks.
There are a few concerns with this approach:
- The thread limit is not respected in a corner case (see comment in code).
- We have two critical sections due to the unlock/lock introduced, plus more synchronization using `std::future`. This could hurt performance in normal situations when threads are created fast.

I think a better approach would be to delegate the task of inserting a thread into the list to the thread itself. But it should be done with some care: we need to count the number of all threads, including "initializing" threads, using atomics.
- Add `std::atomic<UInt64> ThreadPoolImpl::threads_count;`
- When we start a thread in `scheduleImpl()`, we only try to acquire a "slot" for the new thread and respect the limits without locking the `mutex` (see code below). This allows multiple threads to be created simultaneously.
- I think thread creation can be moved to the instant after the critical section (after `++scheduled_jobs;`). The main point is to have only one small critical section and avoid futures.
- When the worker starts and locks the `mutex` for the first time, it should first insert itself into the `threads` list and get the iterator it needs, then start job processing as usual.
- Take care of decrementing `threads_count` when a thread removes itself from `threads`.
- Wait for `threads_count == 0` (busy wait, maybe with sleeps) in the destructor.
```cpp
UInt64 threads_count_value = threads_count.load();
while (true)
{
    if (threads_count_value < std::min(max_threads, scheduled_jobs))
    {
        if (threads_count.compare_exchange_strong(threads_count_value, threads_count_value + 1))
        {
            // Thread creation code-path
            break;
        }
        // else retry (threads_count has changed)
    }
    else
    {
        // Limit exceeded code-path (do not create a new thread)
        break;
    }
}
```
That may lead to some other side effects: it breaks some of the invariants that exist currently. Namely, there is currently a strong guarantee that at the moment a job is accepted, at least one thread is there. I.e. consider the scenario where we push the job but later fail to create a thread; that job will stay there and will never be executed...

Actually, I tried a lot of variants — with 2 mutexes, "outsourcing" thread creation, having the worker push the thread itself — but they either just fail with deadlocks (for example, if you create a thread before the critical section in schedule, the worker can take the lock earlier and will wait on a condvar forever), or have side effects, or the complexity grows too much. @ilejn also attempted to create a lock-free thread pool, but again it is quite hard to guarantee all the existing invariants (again, no hard guarantees of having the desired number of threads), and hard to prove that everything will still work OK if they are a bit less strict... Atomics are also quite expensive here due to huge contention...

The scenario of global thread pool expansion is not a hot path for sure, so I don't think those 2 locks will hurt too much. But maybe indeed it's not the best for local pools... Anyway, thanks for the review; I will try to address the issues you mentioned.
Force-pushed from 9c95e64 to d6e662a.
Force-pushed from f7c497e to debb501.
Looks good overall. I still do not like the extra sync costs. A single `std::mutex` is better than `std::future`, and its cost is ~2 atomic ops, so it might be good enough. We have a separate mutex in every thread, which is good: less contention. And we now have a shorter critical section on the main mutex, which is also better. So it is now hard to tell which implementation is better. I want to do a few synthetic tests for ThreadPool just to make a comparison with some figures.
@serxa I was able to get rid of the extra locks using atomics.
Seems related... The thread tries to access properties of the pool, which has already been destructed. Will check it... UPD: fixed.
Synthetic test results (12-core laptop, Ubuntu 22): about 10% higher QPS after the change. Results from another test on a bigger instance (128 cores, RedHat 8), a query from number_mt with a high max_threads and another load in the background: sounds like a clear win, visible in high-concurrency scenarios.
serxa left a comment:
Looks good. Requires a little bit of polishing.
Force-pushed from f490b29 to a03822e.
RabbitMQ issue: #45160 (details in integration_run_parallel4_0.tar.zst, test_storage_rabbitmq/_instances-0-gw1). Looks like the container just starts too slowly and the 1-minute timeout is not enough: at 2024-10-02 23:22:16 the container was started; 1 minute later (at 2024-10-02 23:23:18) the test reported "can't connect to RabbitMQ", and at 2024-10-02 23:23:29 it said "Cannot wait RabbitMQ container". The rabbit logs show it actually started at 2024-10-02 23:23:21 (3 seconds after "can't connect").
Encapsulate thread management in the new subclass ThreadFromThreadPool
This reverts commit 0d8dce9.
Force-pushed from a03822e to 48e4092.
…_creation_out_of_lock Thread pool: move thread creation out of lock
…read_creation_out_of_lock 24.8.8 Backport PR ClickHouse#68694 Thread pool: move thread creation out of lock
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Optimized thread creation in the ThreadPool to minimize lock contention. Thread creation is now performed outside of the critical section to avoid delays in job scheduling and thread management under high load conditions. This leads to a much more responsive ClickHouse under heavy concurrent load.
Documentation entry for user-facing changes
The thread creation process within the ThreadPool has been modified to address potential delays caused by holding a lock during thread creation. Previously, threads were created within a locked section, which could lead to significant contention and delays, especially when thread creation is slow. To mitigate this, the thread creation has been moved outside of the critical section.
Same test https://gist.github.com/filimonov/7e7adde17421d4a9f83c6fea2be8f802 run before and after the change; results are in the comment below.
It looks like a clear win: about 10% better QPS, much better response times, and better thread pool and CPU usage, even though thread creation itself became slower (several threads can now be created simultaneously, so kernel-side contention is higher). But thread creation no longer blocks the thread pool's work.
P.S. I have no idea how to test this in CI/CD, but there is a chance that the performance tests will show something.