unix,win: limit concurrent DNS calls to nthreads/2 #1845
Conversation
Oh, also: Not sure how to write a race-condition-free test for this.
LGTM |
bnoordhuis left a comment
Left some comments but I'm okay with the general idea.
include/uv/threadpool.h
Outdated
```c
  void (*done)(struct uv__work *w, int status);
  struct uv_loop_s* loop;
  void* wq[2];
  enum uv__work_kind kind;
```
This is an ABI change and therefore semver-major. You probably don't want to wait for that to show up in a release because libuv 2.0 is still some time out.
You should be able to drop this field if you move the `slow_io_pooled_work >= (nthreads + 1) / 2` logic to `worker()`, have that function check both queues, and have `post()` simply put it in the right queue based on `kind` (and move the enum to src/uv-common.h, please).
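A rough sketch of what that suggestion could look like in src/threadpool.c (illustrative only, reusing the file's existing globals such as `mutex`, `cond`, `wq` and `idle_threads`; the queue names follow later revisions of this PR):

```c
static QUEUE slow_io_pending_wq;    /* Parked slow-I/O work items. */
static QUEUE run_slow_work_message; /* Marker node placed in wq. */

static void post(QUEUE* q, enum uv__work_kind kind) {
  uv_mutex_lock(&mutex);
  if (kind == UV__WORK_SLOW_IO) {
    /* Park the request on its own queue and leave a single marker in
     * the main queue so a worker knows slow work is pending. */
    QUEUE_INSERT_TAIL(&slow_io_pending_wq, q);
    if (!QUEUE_EMPTY(&run_slow_work_message)) {
      /* A marker is already queued; nothing more to do. */
      uv_mutex_unlock(&mutex);
      return;
    }
    q = &run_slow_work_message;
  }
  QUEUE_INSERT_TAIL(&wq, q);
  if (idle_threads > 0)
    uv_cond_signal(&cond);
  uv_mutex_unlock(&mutex);
}
```

`worker()` then checks whether the node it dequeued is the marker and, if so, fetches the real request from `slow_io_pending_wq`, applying the `(nthreads + 1) / 2` cap there.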
src/threadpool.c
Outdated
```c
static void post(QUEUE* q, enum uv__work_kind kind) {
  uv_mutex_lock(&mutex);
  if (kind == UV__WORK_SLOW_IO) {
    if (slow_io_pooled_work >= (nthreads + 1) / 2) {
```
Looks like this could result in starvation (perennial preemption by fast work reqs) when nthreads == 1.
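(With `nthreads == 1`, the threshold `(nthreads + 1) / 2` evaluates to 1, so a deferred slow request goes back to the tail of the queue each time, and a steady stream of fast requests can keep preempting it indefinitely.)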
I would like to test the case of nodejs/node#8436 with this change, but it looks difficult: current Node.js uses v1.x while this change is on top of master, and the gap seems wide. Any advice, please? Otherwise I will attempt to hand-pick the changes into the libuv present in the latest Node. Thanks.
Never mind the previous comment; I manually applied this change onto the latest Node and tested the reproduction in nodejs/node#8436, and the result is wonderful! (Results shown: without the change vs. with the change.) The first number indicates the time gap between file I/O operations, and the second indicates the number of concurrent DNS lookups being processed. The result clearly shows that the file I/O is immune to the DNS lookups.
If `nthreads / 2` (rounded up) DNS calls are outstanding, queue more work of that kind instead of letting it take over more positions in the thread pool, blocking other work such as the (usually much faster) file system I/O or user-scheduled work. Fixes: nodejs/node#8436
Force-pushed from 930213f to 0724df2.
Force-pushed from 42e0989 to 16a9fe6.
Force-pushed from 16a9fe6 to aa5506a.
@gireeshpunathil @bnoordhuis Updated the PR for 1.x with no public ABI changes, along the lines of Ben’s suggestion. Could you take another look? I’ve tried to avoid adding even more complexity here; as a result, this does have a slightly different scheduling behaviour (multiple “slow” tasks always start executing a full queue length apart from each other). I think that’s okay, though.
LGTM |
bnoordhuis left a comment
```c
static void worker(void* arg) {
  struct uv__work* w;
  QUEUE* q;
  int is_slow_work;
```
The only proper name for this variable is is_slooow_work.
I'll get my coat.
This is too tempting :)
```c
    QUEUE_REMOVE(q);
    QUEUE_INIT(q);  /* Signal uv_cancel() that the work req is executing. */

    if (q == &run_slow_work_message) {
```
For legibility, I'd put `is_slow_work = 0;` before the `if` and drop the `else`. It's below the fold, so to speak; easy to miss.
Done!
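The resulting shape, roughly (a fragment of `worker()`; illustrative, not the verbatim committed code):

```c
  /* Initialize the flag up front; no else branch needed below. */
  is_slow_work = 0;
  if (q == &run_slow_work_message) {
    is_slow_work = 1;
    /* ... dequeue the actual request from slow_io_pending_wq ... */
  }
```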
src/threadpool.c
Outdated
```c
static QUEUE run_slow_work_message;
static QUEUE slow_io_pending_wq;

#define SLOW_WORK_THREAD_THRESHOLD ((nthreads + 1) / 2)
```
Maybe make this a function. The fact that it's a macro hides that it's not a constant but depends on a global.
Done!
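The function form, as a sketch (it reads the `nthreads` global, which is exactly why a macro was misleading):

```c
/* Not a constant: depends on the nthreads global, hence a function. */
static unsigned int slow_work_thread_threshold(void) {
  return (nthreads + 1) / 2;
}
```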
bnoordhuis left a comment
Still LGTM but this should probably get sign-off from at least one more maintainer since it's a pretty subtle change.
@libuv/collaborators Ping. Can at least one additional maintainer give this a quick look-over?
santigimeno left a comment
LGTM with a couple of questions
```c
enum uv__work_kind {
  UV__WORK_CPU,
  UV__WORK_FAST_IO,
```
What's the difference between UV__WORK_CPU and UV__WORK_FAST_IO?
Nothing, at this point
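(For context: in this patch, `uv_queue_work()` posts `UV__WORK_CPU`, the file system operations post `UV__WORK_FAST_IO`, and only `getaddrinfo`/`getnameinfo` post `UV__WORK_SLOW_IO`; the scheduler currently treats the first two identically.)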
```c
    QUEUE_INSERT_TAIL(&wq, &run_slow_work_message);
    if (idle_threads > 0)
      uv_cond_signal(&cond);
  }
```
Does this mean that if we have a scenario with work items scheduled in this order:
slow_work_1
slow_work_2
fast_work_1
fast_work_2
fast_work_3
then slow_work_2 would be scheduled after the fast work items, even though the threshold for slow work was 2? If so, was it intentional?
It was born out of necessity: an earlier version of this patch treated things differently, but as far as I can tell it’s hard to do this another way without changing the `uv_work_t` layout to include a `uv__work_kind` field.
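(Concretely, as far as one can tell from the patch: posting slow_work_1 inserts the single `run_slow_work_message` marker into `wq`; slow_work_2 only joins `slow_io_pending_wq` because the marker is already queued. The three fast items then line up behind the marker. When a worker consumes the marker and starts slow_work_1, the marker is re-posted at the tail of `wq` for the remaining slow work, that is, behind fast_work_1 through fast_work_3, which is the "full queue length apart" behaviour mentioned earlier.)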
Hm, gentle ping? CI results are inaccessible by now, so I guess this needs a new one … :)
Requesting this to be taken forward, as there are no outstanding concerns.
Sorry for the delay, I was waiting for someone else to comment and forgot about it. One comment:
I think so, yes. It’s not important, though.
If `nthreads / 2` (rounded up) DNS calls are outstanding, queue more work of that kind instead of letting it take over more positions in the thread pool, blocking other work such as the (usually much faster) file system I/O or user-scheduled work.
Fixes: nodejs/node#8436
PR-URL: #1845
Reviewed-By: Ben Noordhuis <[email protected]>
Reviewed-By: Santiago Gimeno <[email protected]>
Landed in 90891b4. Thanks!
This commit implements the "pluggable threadpool" design. This commit reverts libuv#1845 because it encodes a policy decision. This decision is up to the embedders now. The implementation is Linux-only at the moment. Still have to change:
- win/fs.c
- win/getaddrinfo.c
- win/getnameinfo.c
90891b4 introduced a race condition when accessing `slow_io_work_running`: it is increased and later decreased as part of the worker thread loop, but the two accesses were protected by different mutexes. This fixes the race condition by making sure both accesses are protected by the global `mutex` of `threadpool.c`. This fixes a number of flaky Node.js tests.
Refs: #1845
Refs: nodejs/reliability#18
Refs: nodejs/node#23089
Refs: nodejs/node#23067
Refs: nodejs/node#23066
Refs: nodejs/node#23219
PR-URL: #2021
Reviewed-By: Santiago Gimeno <[email protected]>
Reviewed-By: Colin Ihrig <[email protected]>
Reviewed-By: Ben Noordhuis <[email protected]>
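A sketch of the fix, as a fragment of `worker()`'s loop (placement approximates the actual commit; illustrative only):

```c
    w->work(w);

    uv_mutex_lock(&w->loop->wq_mutex);
    w->work = NULL;  /* Signal uv_cancel() that the work req is done. */
    QUEUE_INSERT_TAIL(&w->loop->wq, &w->wq);
    uv_async_send(&w->loop->wq_async);
    uv_mutex_unlock(&w->loop->wq_mutex);

    /* Re-acquire the global mutex, as expected at the start of the next
     * iteration; the decrement now happens under the same lock as the
     * increment. */
    uv_mutex_lock(&mutex);
    if (is_slow_work)
      slow_io_work_running--;
```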
Validate that massive DNS lookups do not block filesystem I/O (or any fast I/O, for that matter). Prior to libuv/libuv#1845, a few back-to-back DNS lookups were sufficient to engage the libuv threadpool workers in a blocking manner, throttling other work items that need the pool. This test acts as a regression test for that: start slow and fast I/O together, and make sure the fast I/O completes in at most 1/100th of the time taken by the slow I/O.
Refs: libuv/libuv#1845
Refs: #8436
PR-URL: #23099
Reviewed-By: Sakthipriyan Vairamani <[email protected]>
Reviewed-By: Anna Henningsen <[email protected]>
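The actual Node.js test is written in JavaScript; an illustrative libuv-level analogue of the same idea (a hypothetical stand-alone program, not taken from either repository) could look like this:

```c
#include <assert.h>
#include <stdio.h>
#include <uv.h>

#define N_LOOKUPS 32  /* Enough to saturate the default 4-thread pool. */

static uint64_t start_time;
static uint64_t fs_done_time;
static uint64_t dns_done_time;
static int dns_outstanding = N_LOOKUPS;

static void on_getaddrinfo(uv_getaddrinfo_t* req,
                           int status,
                           struct addrinfo* res) {
  (void) req;
  (void) status;  /* Lookup failures are fine; only timing matters here. */
  uv_freeaddrinfo(res);
  if (--dns_outstanding == 0)
    dns_done_time = uv_hrtime();
}

static void on_stat(uv_fs_t* req) {
  fs_done_time = uv_hrtime();
  uv_fs_req_cleanup(req);
}

int main(void) {
  static uv_getaddrinfo_t dns_reqs[N_LOOKUPS];
  uv_fs_t fs_req;
  uv_loop_t* loop;
  int i;

  loop = uv_default_loop();
  start_time = uv_hrtime();

  /* Queue the slow DNS work first, then a single fast fs operation. */
  for (i = 0; i < N_LOOKUPS; i++)
    uv_getaddrinfo(loop, &dns_reqs[i], on_getaddrinfo,
                   "example.org", NULL, NULL);
  uv_fs_stat(loop, &fs_req, ".", on_stat);

  uv_run(loop, UV_RUN_DEFAULT);

  printf("fs: %.2f ms, dns: %.2f ms\n",
         (fs_done_time - start_time) / 1e6,
         (dns_done_time - start_time) / 1e6);

  /* With the slow-work cap, the fs op finishes far sooner than the
   * lookups; the assertion here is deliberately weaker than the 1/100
   * ratio the Node.js test checks. */
  assert(fs_done_time - start_time < dns_done_time - start_time);
  return 0;
}
```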
Not sure about semverness here, so I went with PRing against `master` first. Happy to retarget to `v1.x` if that’s deemed okay as well. Also happy to throw this PR in the garbage bin if this approach doesn’t work for people.

If `nthreads / 2` (rounded up) DNS calls are outstanding, queue more work of that kind instead of letting it take over more positions in the thread pool, blocking other work such as the (usually much faster) file system I/O or user-scheduled work.

Fixes: nodejs/node#8436