linux: cap number of in-flight io_uring SQEs#4601

Closed
bnoordhuis wants to merge 1 commit into libuv:v1.x from bnoordhuis:fix4598

Conversation

@bnoordhuis
Member

Limit the number of in-flight submission entries to the size of the completion queue. The kernel's sqpoll thread hoovers them up really fast and we may end up posting way more submissions than fit in the completion queue.

That's not fatal as such; the kernel stores them in an overflow queue that we flush eventually, but that queue can grow pretty big if the program submits lots of file operations.

Salient questions are:

  1. Is this change a bug fix? If yes, why? If not, why not?

  2. Is this change a (de)optimization? Is the kernel's overflow queue more or less efficient than libuv's thread pool?

This commit removes the IORING_SQ_CQ_OVERFLOW handling that was added in commit 30fc896 ("unix: handle CQ overflow in iou ring") from May 2023 because it is no longer needed.

Fixes: #4598

Comment on lines +784 to +785
if (iou->in_flight >= iou->max_in_flight)
return NULL;
Member

What about if, when we reach the condition, we enter the ring to retrieve whatever completions are already there, push those to a pending_queue to make sure no request is completed synchronously, and decrease iou->in_flight accordingly? Only if iou->in_flight reaches the limit with no completions left would we go through the threadpool.

Member Author

It's probably because it's a sleepy Saturday night but I'm not quite sure what you're asking. To call io_uring_enter when we hit the limit?

I'm guessing you maybe intended to write s/What about/How about/?

Member

@santigimeno santigimeno Nov 9, 2024

Yes, sorry about not being clear, or even correct. My thinking is: by the time the limit is hit, it's very likely completions are already available, so grab them and keep submitting to the ring, taking advantage of how fast those ops are sometimes performed. I said "enter" and that's not correct, as we're not actually using io_uring_enter() to grab the completions (I conflated it with the io_uring_enter(IORING_ENTER_GETEVENTS) we were using to make more completions available when the OVERFLOW state was reached).

Member Author

Right, I believe I'm following now. I don't think we can call uv__poll_io_uring() from here because that runs user callbacks.

Well, I guess we could if we retrieved the uv_fs_t pointer from the CQE and queued it somewhere, but... that's yet another queue :-)

Member

Well, I guess we could if we retrieved the uv_fs_t pointer from the CQE and queued it somewhere, but... that's yet another queue :-)

Yes, that was what I was thinking when I mentioned pushing completions to a pending_queue...
So, do you think it would make a difference? Yes, it's another queue, but we would be sending more requests to the ring than to the threadpool, which should have better performance?

Member Author

Well... maybe. The counterargument is that it's default-disabled code so it's not going to get used or tested much. An easier fix is to make the SQ and CQ bigger; they're quite modestly sized now. I.e.:

diff --git a/src/unix/linux.c b/src/unix/linux.c
index 279ca2c8..c63484d6 100644
--- a/src/unix/linux.c
+++ b/src/unix/linux.c
@@ -773,7 +773,7 @@ static struct uv__io_uring_sqe* uv__iou_get_sqe(struct uv__iou* iou,
       return NULL;
     }
 
-    uv__iou_init(loop->backend_fd, iou, 64, UV__IORING_SETUP_SQPOLL);
+    uv__iou_init(loop->backend_fd, iou, 1024, UV__IORING_SETUP_SQPOLL);
     if (iou->ringfd == -2)
       iou->ringfd = -1;  /* "failed" */
   }

Member

So I've benchmarked this last change, and performance drops significantly on my laptop. I'm running the fs_stat benchmark with these changes. Could you check that I'm not doing anything obviously wrong?

diff --git a/src/unix/linux.c b/src/unix/linux.c
index 857a4ef8a..293cc1ce3 100644
--- a/src/unix/linux.c
+++ b/src/unix/linux.c
@@ -773,7 +773,7 @@ static struct uv__io_uring_sqe* uv__iou_get_sqe(struct uv__iou* iou,
       return NULL;
     }
 
-    uv__iou_init(loop->backend_fd, iou, 64, UV__IORING_SETUP_SQPOLL);
+    uv__iou_init(loop->backend_fd, iou, 1024, UV__IORING_SETUP_SQPOLL);
     if (iou->ringfd == -2)
       iou->ringfd = -1;  /* "failed" */
   }
diff --git a/test/benchmark-fs-stat.c b/test/benchmark-fs-stat.c
index c41062241..c9d1da07d 100644
--- a/test/benchmark-fs-stat.c
+++ b/test/benchmark-fs-stat.c
@@ -27,7 +27,8 @@
 
 #define NUM_SYNC_REQS         (10 * 1e5)
 #define NUM_ASYNC_REQS        (1 * (int) 1e5)
-#define MAX_CONCURRENT_REQS   32
 +/* To be sure it overflows with a ring size of 64 */
+#define MAX_CONCURRENT_REQS   128
 
 #define sync_stat(req, path)                                                  \
   do {                                                                        \
@@ -129,6 +130,7 @@ static void async_bench(const char* path) {
  * out of the equation.
  */
 BENCHMARK_IMPL(fs_stat) {
+  uv_loop_configure(uv_default_loop(), UV_LOOP_USE_IO_URING_SQPOLL);
   const char path[] = ".";
   warmup(path);
   sync_bench(path);

Member Author

I ran fs_stat with 32 and 128 (in steps of 8) concurrent requests in the following configurations on x86_64 Linux 6.5.0-18-generic:

  1. no io_uring, i.e., thread pool
  2. io_uring the way it is in v1.x
  3. io_uring + this pull request
  4. io_uring + this pull request + 1024 sqes, 2048 cqes

For reqs=32, the results are uninteresting: the thread pool is around 50% slower, the other three are indistinguishable from each other.

For reqs=128, things get more interesting. At low concurrency rates (let's say < 32) results are much the same as for reqs=32 but at high concurrency rates (> 64):

  • (1) is still slower but closing the gap (15%)
  • (4) is 35% slower than (2) and (3) - I think that's what you're seeing too?

fs_stat stress-tests system call overhead, not disk I/O (it only hits the dentry cache), so it seems we're falling off some io_uring performance cliff in (4).

@santigimeno
Member

santigimeno commented Nov 9, 2024

  1. Is this change a bug fix? If yes, why? If not, why not?

I'm not sure it's a bug fix. It's probably about as easy for the kernel to run out of memory as it is for the application running libuv.

  2. Is this change a (de)optimization? Is the kernel's overflow queue more or less efficient than libuv's thread pool?

From the documentation, it seems to be a deoptimization. Whether it's better or worse than going through the threadpool... that's a different story. We may want to run some benchmarks.

@bnoordhuis
Member Author

@santigimeno Santi, what's the conclusion here? Merge, not merge? I'm guessing the latter?

@santigimeno
Member

@santigimeno Santi, what's the conclusion here? Merge, not merge? I'm guessing the latter?

I tend to agree. I was planning to make a PoC of the queue solution but I don't know when I'll have the time.

@bnoordhuis
Member Author

Okay, thanks - I'll go ahead and close this and #4598. Feel free to reopen if necessary.

@bnoordhuis bnoordhuis closed this Nov 19, 2024


Successfully merging this pull request may close these issues.

io_uring: possible issue handling CQ overflow

2 participants