IoUring: Reduce unnecessary io_uring_enter syscalls on non-blocking path #16259

Merged
normanmaurer merged 6 commits into netty:4.2 from dreamlike-ocean:reduce_io_uring on Mar 2, 2026
Conversation

@dreamlike-ocean
Contributor

@dreamlike-ocean dreamlike-ocean commented Feb 11, 2026

Motivation:

In IoUringIoHandler.run(), the non-blocking path unconditionally calls io_uring_enter via submitAndClearNow() even when there are no pending SQEs and no deferred task work to flush.

Modification:

  • Enable IORING_SETUP_TASKRUN_FLAG when IORING_SETUP_DEFER_TASKRUN is set,
    so the kernel signals IORING_SQ_TASKRUN when deferred completions are pending.
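
The flag coupling described above can be sketched as follows. This is an illustration only, not Netty's actual code: the class and method names are hypothetical, and only the two constants (taken from the Linux `io_uring.h` UAPI header) are real.

```java
// Illustrative sketch only: class and method names are hypothetical, not Netty's API.
final class SetupFlagsSketch {
    // Values as defined in the Linux UAPI header <linux/io_uring.h>.
    static final int IORING_SETUP_TASKRUN_FLAG = 1 << 9;
    static final int IORING_SETUP_DEFER_TASKRUN = 1 << 13;

    /** Adds TASKRUN_FLAG whenever DEFER_TASKRUN is requested; other flags are left untouched. */
    static int withTaskrunFlag(int setupFlags) {
        if ((setupFlags & IORING_SETUP_DEFER_TASKRUN) != 0) {
            return setupFlags | IORING_SETUP_TASKRUN_FLAG;
        }
        return setupFlags;
    }

    public static void main(String[] args) {
        int flags = withTaskrunFlag(IORING_SETUP_DEFER_TASKRUN);
        System.out.println(Integer.toHexString(flags)); // prints "2200"
    }
}
```

With the flag set, user space can read the SQ ring flags to learn whether deferred completions are waiting, instead of entering the kernel to find out.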

Result:

Fixes #16247.

No regression under high-concurrency load. In the IO-idle non-blocking
path scenario, io_uring_enter calls dropped from 486 to 65 (-86.6%).

@normanmaurer
Member

@dreamlike-ocean let me know once you have run some benchmarks

@franz1981
Contributor

franz1981 commented Feb 15, 2026

While benchmarking, it would be great to collect statistics for the different cases (e.g. total in-memory hits/sec, io_uring_enter syscalls caused by pending CQEs versus SQEs to submit) and eventually flame graphs, to make sure we have enough concurrent and small I/O.

@franz1981
Contributor

Any news @dreamlike-ocean ?

@dreamlike-ocean
Contributor Author

Any news @dreamlike-ocean ?

Sorry, I’m currently on vacation. I’ll move this forward once I’m back.

@normanmaurer
Member

@dreamlike-ocean enjoy your time off!

@dreamlike-ocean
Contributor Author

Benchmark Results

Test Setup

  • Simple HTTP/1.1 server returning a 512-byte response body
  • Using buffer ring + zero-copy write
  • Load testing tool: oha

1. High Concurrency Load Test (oha -c 500 -z 30s)

| Metric | OLD | PATCH | Change |
| --- | --- | --- | --- |
| RPS | 117,147 | 117,649 | +0.43% |
| P50 Latency | 4.13 ms | 4.12 ms | -0.24% |
| P95 Latency | 4.87 ms | 4.76 ms | -2.26% |

strace statistics for io_uring_enter:

| Metric | OLD | PATCH | Change |
| --- | --- | --- | --- |
| Calls | 3,084,144 | 3,076,794 | -0.24% |

Conclusion: No performance regression under high concurrency, but the optimization shows minimal effect. This is expected — under saturated IO, needSubmit() almost always returns true because the SQ continuously has pending SQEs (count > 0), or IORING_SQ_TASKRUN is persistently set due to IORING_SETUP_DEFER_TASKRUN keeping deferred task work pending in the kernel.
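
The check described in this conclusion can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the PR's implementation: the class name and the parameters `pendingSqes` and `sqFlags` are hypothetical stand-ins for state the real handler would read from the ring, and only the `IORING_SQ_TASKRUN` value comes from the Linux `io_uring.h` header.

```java
// Illustrative sketch only: not the PR's code. Parameters are hypothetical
// stand-ins for state that would be read from the submission queue ring.
final class NeedSubmitSketch {
    // SQ ring flag from <linux/io_uring.h>: deferred task work is pending.
    static final int IORING_SQ_TASKRUN = 1 << 2;

    /**
     * Returns true when entering the kernel is worthwhile: either SQEs are
     * waiting to be submitted, or the kernel has flagged pending task work.
     */
    static boolean needSubmit(int pendingSqes, int sqFlags) {
        return pendingSqes > 0 || (sqFlags & IORING_SQ_TASKRUN) != 0;
    }

    public static void main(String[] args) {
        System.out.println(needSubmit(0, 0));                 // idle ring: syscall can be skipped
        System.out.println(needSubmit(3, 0));                 // SQEs pending
        System.out.println(needSubmit(0, IORING_SQ_TASKRUN)); // deferred completions pending
    }
}
```

Under saturated IO at least one of the two conditions almost always holds, which is consistent with the near-identical call counts measured above.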

2. Non-blocking Path Validation (IO idle + busy task queue)

To isolate and validate the effect of needSubmit() on the non-blocking path, I designed a scenario that forces the event loop to always take the non-blocking path (submitAndClearNow) with zero IO traffic. This is achieved by submitting an infinitely recursive task to each worker EventLoop, keeping hasTasks() permanently true and canBlock() permanently false:

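// Keep every worker EventLoop permanently busy.
worker.forEach(Main::executorInfiniteTask);

// Re-submits itself forever, so hasTasks() stays true and the loop never blocks.
private static void executorInfiniteTask(EventExecutor executor) {
    executor.execute(() -> executor.execute(() -> executorInfiniteTask(executor)));
}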

strace statistics for io_uring_enter (30 seconds):

| Metric | OLD | PATCH | Change |
| --- | --- | --- | --- |
| io_uring_enter calls | 486 | 65 | -86.6% |
| Total time | 1.69 s | 1.21 s | -28.3% |
| Avg time/call | 3,482 μs | 18,672 μs | |

The remaining 65 calls in PATCH come from the acceptor thread (which does not have the infinite task) performing normal blocking waits. The OLD version's low average of roughly 3.5 ms per call indicates that most of its calls are empty submits (submit=0, minComplete=0) returning quickly from the kernel, which are essentially wasted syscalls. The PATCH version's average of roughly 18.7 ms per call is consistent with the remaining calls being exclusively legitimate blocking waits from the acceptor.

Conclusion: On the non-blocking path with no IO activity, needSubmit() effectively eliminates unnecessary io_uring_enter syscalls, achieving an 86.6% reduction.
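
For readers who want to re-derive the "Change" figures, the arithmetic is a plain relative difference against the OLD value. The helper below is a hypothetical convenience, not part of the benchmark tooling; the inputs are the numbers reported in the tables above.

```java
import java.util.Locale;

// Hypothetical helper for re-checking the reported "Change" columns.
final class BenchmarkMath {
    /** Relative change from oldValue to newValue, in percent. */
    static double percentChange(double oldValue, double newValue) {
        return (newValue - oldValue) / oldValue * 100.0;
    }

    public static void main(String[] args) {
        // io_uring_enter calls on the non-blocking path: 486 -> 65
        System.out.printf(Locale.ROOT, "%.1f%%%n", percentChange(486, 65));
        // RPS under high concurrency: 117,147 -> 117,649
        System.out.printf(Locale.ROOT, "%.2f%%%n", percentChange(117_147, 117_649));
    }
}
```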

raw benchmark result
benchmark.zip

| Directory | Scenario | Purpose |
| --- | --- | --- |
| 500/ | oha -c 500 -z 30s, 512B HTTP/1.1 response | Verify no performance regression under high concurrency |
| non-blocking/ | No IO traffic; each worker EventLoop has a perpetually pending task (executorInfiniteTask), forcing the non-blocking path | Validate that needSubmit() effectively skips empty io_uring_enter on the non-blocking path |

| File | Description |
| --- | --- |
| old_result.txt / patch_result.txt | oha load test output (RPS, latency distribution, etc.) combined with strace -f -e trace=io_uring_enter -c syscall statistics |
| flame_*.html | CPU flame graphs generated by async-profiler 4.x (collected via -agentpath), viewable interactively in a browser |
| strace/ | Summary output of strace -f -e trace=io_uring_enter -c, comparing OLD vs PATCH io_uring_enter call counts |

@franz1981

@dreamlike-ocean dreamlike-ocean changed the title Enable IORING_SETUP_TASKRUN_FLAG and use IORING_SQ_TASKRUN to decide … IoUring: Reduce unnecessary io_uring_enter syscalls on non-blocking path Feb 24, 2026
@dreamlike-ocean dreamlike-ocean marked this pull request as ready for review February 24, 2026 03:04
@franz1981
Contributor

Looks great 😃 I will take a deeper look today.

@chrisvest chrisvest added the needs-cherry-pick-5.0 This PR should be cherry-picked to 5.0 once merged. label Feb 26, 2026
@chrisvest chrisvest added this to the 4.2.11.Final milestone Feb 26, 2026
@normanmaurer normanmaurer merged commit bedd0ac into netty:4.2 Mar 2, 2026
35 of 37 checks passed
@normanmaurer
Member

@dreamlike-ocean thanks a lot!

netty-project-bot pushed a commit that referenced this pull request Mar 2, 2026
…ath (#16259)


Co-authored-by: Norman Maurer <[email protected]>
(cherry picked from commit bedd0ac)
@netty-project-bot
Contributor

Auto-port PR for 5.0: #16397

chrisvest pushed a commit that referenced this pull request Mar 3, 2026
… non-blocking path (#16397)

Auto-port of #16259 to 5.0
Cherry-picked commit: bedd0ac

Co-authored-by: Mengyang Li <[email protected]>
Co-authored-by: Norman Maurer <[email protected]>
Development

Successfully merging this pull request may close these issues.

IoUringIoHandler run() performs unnecessary io_uring_enter syscalls in non-blocking path