linux: cap number of in-flight io_uring SQEs #4601
bnoordhuis wants to merge 1 commit into libuv:v1.x
Conversation
Limit the number of in-flight submission entries to the size of the completion queue. The kernel's sqpoll thread hoovers them up really fast and we may end up posting way more submissions than fit in the completion queue.

That's not fatal as such, the kernel stores them in an overflow queue that we flush eventually, but that queue can grow pretty big if the program submits lots of file operations.

Salient questions are:

1. Is this change a bug fix? If yes, why? If not, why not?
2. Is this change a (de)optimization? Is the kernel's overflow queue more or less efficient than libuv's thread pool?

This commit removes the IORING_SQ_CQ_OVERFLOW handling that was added in commit 30fc896 ("unix: handle CQ overflow in iou ring") from May 2023 because it is no longer needed.

Fixes: libuv#4598
if (iou->in_flight >= iou->max_in_flight)
  return NULL;
What about if, when we reach the condition, we enter the ring to retrieve whatever completions are already there and push those to a pending_queue, to make sure no request is completed synchronously, and decrease iou->in_flight accordingly? Only if iou->in_flight reaches the limit with no completions left would we go through the threadpool.
It's probably because it's a sleepy Saturday night but I'm not quite sure what you're asking. To call io_uring_enter when we hit the limit?
I'm guessing you maybe intended to write s/What about/How about/?
Yes, sorry about not being clear, even incorrect. My thinking is: by the time the limit is hit, it's very likely completions are already available, so grab them and keep on submitting to the ring, taking advantage of how fast those ops sometimes are. I said "enter" and that's not correct, as we're not actually using io_uring_enter() to grab the completions (I conflated it with the io_uring_enter(IORING_ENTER_GETEVENTS) we were using to make more completions available when the OVERFLOW state was reached).
Right, I believe I'm following now. I don't think we can call uv__poll_io_uring() from here because that runs user callbacks.
Well, I guess we could if we retrieved the uv_fs_t pointer from the CQE and queued it somewhere, but... that's yet another queue :-)
> Well, I guess we could if we retrieved the uv_fs_t pointer from the CQE and queued it somewhere, but... that's yet another queue :-)
Yes, that was what I was thinking when I mentioned pushing completions to a pending_queue...
So, do you think it would make a difference? Yes, it's another queue, but we would be sending more requests to the ring than to the threadpool, which should have better performance?
Well... maybe. The counterargument is that it's default-disabled code so it's not going to get used or tested much. An easier fix is to make the SQ and CQ bigger; they're quite modestly sized now. I.e.:
diff --git a/src/unix/linux.c b/src/unix/linux.c
index 279ca2c8..c63484d6 100644
--- a/src/unix/linux.c
+++ b/src/unix/linux.c
@@ -773,7 +773,7 @@ static struct uv__io_uring_sqe* uv__iou_get_sqe(struct uv__iou* iou,
return NULL;
}
- uv__iou_init(loop->backend_fd, iou, 64, UV__IORING_SETUP_SQPOLL);
+ uv__iou_init(loop->backend_fd, iou, 1024, UV__IORING_SETUP_SQPOLL);
if (iou->ringfd == -2)
iou->ringfd = -1; /* "failed" */
}
So I've benchmarked this last change, and performance drops significantly on my laptop. I'm running the fs_stat benchmark with these changes. Could you check I'm not doing anything obviously wrong?
diff --git a/src/unix/linux.c b/src/unix/linux.c
index 857a4ef8a..293cc1ce3 100644
--- a/src/unix/linux.c
+++ b/src/unix/linux.c
@@ -773,7 +773,7 @@ static struct uv__io_uring_sqe* uv__iou_get_sqe(struct uv__iou* iou,
return NULL;
}
- uv__iou_init(loop->backend_fd, iou, 64, UV__IORING_SETUP_SQPOLL);
+ uv__iou_init(loop->backend_fd, iou, 1024, UV__IORING_SETUP_SQPOLL);
if (iou->ringfd == -2)
iou->ringfd = -1; /* "failed" */
}
diff --git a/test/benchmark-fs-stat.c b/test/benchmark-fs-stat.c
index c41062241..c9d1da07d 100644
--- a/test/benchmark-fs-stat.c
+++ b/test/benchmark-fs-stat.c
@@ -27,7 +27,8 @@
#define NUM_SYNC_REQS (10 * 1e5)
#define NUM_ASYNC_REQS (1 * (int) 1e5)
-#define MAX_CONCURRENT_REQS 32
+/* To be sure it overflows with a ring size of 64 */
+#define MAX_CONCURRENT_REQS 128
#define sync_stat(req, path) \
do { \
@@ -129,6 +130,7 @@ static void async_bench(const char* path) {
* out of the equation.
*/
BENCHMARK_IMPL(fs_stat) {
+ uv_loop_configure(uv_default_loop(), UV_LOOP_USE_IO_URING_SQPOLL);
const char path[] = ".";
warmup(path);
sync_bench(path);
I ran fs_stat with 32 and 128 concurrent requests (in steps of 8) in the following configurations on x86_64 Linux 6.5.0-18-generic:
1. no io_uring, i.e., thread pool
2. io_uring the way it is in v1.x
3. io_uring + this pull request
4. io_uring + this pull request + 1024 sqes, 2048 cqes
For reqs=32, the results are uninteresting: the thread pool is around 50% slower, the other three are indistinguishable from each other.
For reqs=128, things get more interesting. At low concurrency rates (let's say < 32) results are much the same as for reqs=32, but at high concurrency rates (> 64):
- (1) is still slower but closing the gap (15%)
- (4) is 35% slower than (2) and (3) - I think that's what you are seeing too?
fs_stat stress-tests system call overhead, not disk I/O (only hits the dentry cache), so it seems we're falling off some io_uring performance cliff in (4).
I'm not sure it's a bug fix. It's probably equally difficult for the kernel to run out of memory as for the application running libuv.
From the documentation, it seems to be a deoptimization. Whether it's better or worse than going through the threadpool... that's a different story. We may want to run some benchmarks.
@santigimeno Santi, what's the conclusion here? Merge, not merge? I'm guessing the latter?
I tend to agree. I was planning to make a PoC of the queue solution but I don't know when I'll have the time.
Okay, thanks - I'll go ahead and close this and #4598. Feel free to reopen if necessary.