Add generic FileRegion support in io_uring stream channel#16571
Conversation
|
@LuciferYang did you sign our ICLA ? https://netty.io/s/icla |
done |
| * Maximum bytes per chunk when converting a generic {@link FileRegion} | ||
| * to {@link ByteBuf} for the io_uring async send path. | ||
| */ | ||
| private static final int FILE_REGION_MAX_CHUNK_SIZE = 64 * 1024; |
There was a problem hiding this comment.
nit: Maybe add a system property ?
| } | ||
| long address = IoUring.memoryAddress(buf) + buf.readerIndex(); | ||
| int length = buf.readableBytes(); | ||
| ops = IoUringIoOps.newSend(fd, (byte) 0, 0, address, length, nextOpsId()); |
There was a problem hiding this comment.
can we factor this out into another method and call the methods from here ?
| */ | ||
| private static final int FILE_REGION_MAX_CHUNK_SIZE = 64 * 1024; | ||
| private static final int FILE_REGION_MAX_CHUNK_SIZE = | ||
| SystemPropertyUtil.getInt("io.netty.iouring.fileRegionChunkSize", 64 * 1024); |
There was a problem hiding this comment.
Is this property name OK?
|
I have a few thoughts related to this PR for your consideration.
If the fallback path performs better under certain workloads, would it make sense to allow selecting it explicitly via a command-line option, instead of relying purely on automatic fallback?
|
|
Thanks for the thoughtful suggestions, @dreamlike-ocean!
This PR focuses on correctness — making any |
|
@chrisvest PTAL |
|
What's the progress on this pr ... |
| try { | ||
| ByteBufWritableByteChannel ch = new ByteBufWritableByteChannel(buf); | ||
| int written = 0; | ||
| while (written < chunkSize) { |
There was a problem hiding this comment.
We can use buf.writableBytes() > 0 as a condition and remove written
| } | ||
| } else { | ||
| // Don't release the chunk buffer before checking the error — | ||
| // for retryable errors (EAGAIN) we need to re-send the same data. |
There was a problem hiding this comment.
Where are we re-sending the same data? The FileRegion.transferred() got incremented in the scheduleWriteFileRegion call.
| } catch (Throwable cause) { | ||
| releaseFileRegionChunkBuf(); | ||
| handleWriteError(cause); | ||
| return true; |
There was a problem hiding this comment.
This return true looks redundant as we could fall through to the one below, right?
| @Override | ||
| protected boolean supportsCustomFileRegion() { | ||
| return false; | ||
| return true; |
There was a problem hiding this comment.
We should add some tests that exercises the chunking, and also a test where we send two FileRegions in a row where at least the first one exercises the chunking.
The tests enabled by this flag only send 8 KiB of data, I think.
276cbfa to
28c2be7
Compare
28c2be7 to
9db54a0
Compare
|
All five addressed (9db54a0):
A few small adjacent fixes came up while wiring this up:
|
|
@chrisvest bc2c593 try to fix the failed test The chunking assertions in The tests run through Fix: after
|
|
Could not create auto-port PR. |
|
Thank you @normanmaurer and @chrisvest |
What do I need to do for this? |
|
@LuciferYang nothing ... PR for 5.0 is #16677 |
…16677) `AbstractIoUringStreamChannel` only handles `DefaultFileRegion` (via splice). Any other `FileRegion` implementation hits `UnsupportedOperationException` from the parent's `filterOutboundMessage`. This is inconsistent with EPOLL and NIO, which accept any `FileRegion` via a `transferTo` fallback. Frameworks like Apache Spark wrap `DefaultFileRegion` inside composite `FileRegion` implementations (e.g., `MessageWithHeader`) that work on NIO and EPOLL but crash on io_uring. See #16570 For generic `FileRegion` (non-`DefaultFileRegion` or when splice is unsupported), let it pass through `filterOutboundMessage` into the outbound buffer, then convert it to a direct `ByteBuf` in **64 KB chunks** via `FileRegion.transferTo` in the write path (`scheduleWriteSingle` / `writeComplete0`). - `DefaultFileRegion` + splice path is unchanged. - Data is chunked to limit memory usage — each chunk is read via `transferTo` into a capped `ByteBufWritableByteChannel`, then submitted as an io_uring async send. - Partial sends and EAGAIN preserve the chunk buffer for retry; unrecoverable errors release it and fail the write. - Empty FileRegions (count = 0) submit a 0-byte send so the completion path can remove them. Also enabled `testCustomFileRegion` in `IoUringSocketFileRegionTest` and `IoUringBufferRingSocketFileRegionTest`. Any `FileRegion` implementation is now writable on io_uring, consistent with EPOLL and NIO. | Transport | `DefaultFileRegion` | Generic `FileRegion` | |---|---|---| | NIO | `FileChannel.transferTo` | `region.transferTo` | | EPOLL | native sendfile | `region.transferTo` fallback | | io_uring (before) | splice | **UnsupportedOperationException** | | io_uring (after) | splice (unchanged) | chunked `transferTo` → ByteBuf → async send | --------- Co-authored-by: Norman Maurer <[email protected]> Co-authored-by: YangJie <[email protected]>
…s count() Motivation: The generic FileRegion fallback added in netty#16571 drains the region by looping on `buf.writableBytes() > 0` only. epoll's writeFileRegion() guards the same call with `if (region.transferred() >= region.count()) { in.remove(); return 0; }` -- the FileRegion contract permits implementations to assume transferTo() is not invoked past transferred() == count(). Calling transferTo() past completion can trigger side effects in custom AbstractFileRegion implementations -- for example emitting an extra framing prefix or trailer onto the wire -- which then corrupts the protocol on the receiving side. Modification: Tighten the io_uring drain-loop condition with `region.transferred() < region.count()` in addition to `buf.writableBytes() > 0`, mirroring epoll's short-circuit. Add IoUringSocketFileRegionTest#testFileRegionDrainStopsAtCompletion plus an OvershootDetectingFileRegion fixture that records every transferTo() call made past transferred() == count(). The test fails on the prior implementation and passes with this change. Result: io_uring's generic FileRegion path no longer over-calls transferTo(). Custom AbstractFileRegion implementations that rely on the documented contract (e.g. Apache Spark's SaslEncryption, where SparkSaslSuite fails deterministically under io_uring on 4.2.13) work as expected.
…s count() Motivation: The generic FileRegion fallback added in netty#16571 drains the region by looping on `buf.writableBytes() > 0` only. epoll's writeFileRegion() guards the same call with `if (region.transferred() >= region.count()) { in.remove(); return 0; }` -- the FileRegion contract permits implementations to assume transferTo() is not invoked past transferred() == count(). Calling transferTo() past completion can trigger side effects in custom FileRegion implementations that lazily emit per-chunk framing -- for example an encryption, compression, or framing layer that flushes a completed inner chunk on each call -- producing a phantom chunk header that lands on the wire and corrupts the receiver's framing. Modification: Tighten the io_uring drain-loop condition with `region.transferred() < region.count()` in addition to `buf.writableBytes() > 0`, mirroring epoll's short-circuit. Add IoUringSocketFileRegionTest#testFileRegionDrainStopsAtCompletion plus an OvershootDetectingFileRegion fixture that records every transferTo() call made past transferred() == count(). The test fails on the prior implementation and passes with this change. Result: io_uring's generic FileRegion path no longer over-calls transferTo(). Custom FileRegion implementations that rely on the documented contract work under io_uring with the same semantics they already have under NIO and epoll.
…s count() Motivation: The generic FileRegion fallback added in netty#16571 drains the region by looping on `buf.writableBytes() > 0` only. epoll's writeFileRegion() guards the same call with `if (region.transferred() >= region.count()) { in.remove(); return 0; }` -- the FileRegion contract permits implementations to assume transferTo() is not invoked past transferred() == count(). Calling transferTo() past completion can trigger side effects in custom FileRegion implementations that lazily emit per-chunk framing -- for example an encryption, compression, or framing layer that flushes a completed inner chunk on each call -- producing a phantom chunk header that lands on the wire and corrupts the receiver's framing. Modification: Tighten the io_uring drain-loop condition with `region.transferred() < region.count()` in addition to `buf.writableBytes() > 0`, mirroring epoll's short-circuit. Add IoUringSocketFileRegionTest#testFileRegionDrainStopsAtCompletion plus an OvershootDetectingFileRegion fixture that records every transferTo() call made past transferred() == count(). The test fails on the prior implementation and passes with this change. Result: io_uring's generic FileRegion path no longer over-calls transferTo(). Custom FileRegion implementations that rely on the documented contract work under io_uring with the same semantics they already have under NIO and epoll.
…s count() Motivation: The generic FileRegion fallback added in netty#16571 drains the region by looping on `buf.writableBytes() > 0` only. epoll's writeFileRegion() guards the same call with `if (region.transferred() >= region.count()) { in.remove(); return 0; }` -- the FileRegion contract permits implementations to assume transferTo() is not invoked past transferred() == count(). Calling transferTo() past completion can trigger side effects in custom FileRegion implementations that lazily emit per-chunk framing -- for example an encryption, compression, or framing layer that flushes a completed inner chunk on each call -- producing a phantom chunk header that lands on the wire and corrupts the receiver's framing. Modification: Tighten the io_uring drain-loop condition with `region.transferred() < region.count()` in addition to `buf.writableBytes() > 0`, mirroring epoll's short-circuit. Add IoUringSocketFileRegionTest#testFileRegionDrainStopsAtCompletion plus an OvershootDetectingFileRegion fixture that records every transferTo() call made past transferred() == count(). The test fails on the prior implementation and passes with this change. Result: io_uring's generic FileRegion path no longer over-calls transferTo(). Custom FileRegion implementations that rely on the documented contract work under io_uring with the same semantics they already have under NIO and epoll.
…s count() Motivation: The generic FileRegion fallback added in netty#16571 drains the region by looping on `buf.writableBytes() > 0` only. epoll's writeFileRegion() guards the same call with `if (region.transferred() >= region.count()) { in.remove(); return 0; }` -- the FileRegion contract permits implementations to assume transferTo() is not invoked past transferred() == count(). Calling transferTo() past completion can trigger side effects in custom FileRegion implementations that lazily emit per-chunk framing -- for example an encryption, compression, or framing layer that flushes a completed inner chunk on each call -- producing a phantom chunk header that lands on the wire and corrupts the receiver's framing. Modification: Tighten the io_uring drain-loop condition with `region.transferred() < region.count()` in addition to `buf.writableBytes() > 0`, mirroring epoll's short-circuit. Add IoUringSocketFileRegionTest#testFileRegionDrainStopsAtCompletion plus an OvershootDetectingFileRegion fixture that records every transferTo() call made past transferred() == count(). The test fails on the prior implementation and passes with this change. Result: io_uring's generic FileRegion path no longer over-calls transferTo(). Custom FileRegion implementations that rely on the documented contract work under io_uring with the same semantics they already have under NIO and epoll.
…tual bytes
Allocate the full FILE_REGION_MAX_CHUNK_SIZE for the chunk buffer instead
of `min(count() - transferred(), MAX)`. The count()-based sizing assumes
that count() is an accurate upper bound on the bytes a FileRegion will
write to its target -- this holds for DefaultFileRegion but is violated by
encryption-on-write implementations like Spark's
SaslEncryption.EncryptedMessage, which:
- Returns count() = unencrypted message size (or buf.readableBytes() for
its ByteBuf-backed variant, which decreases as the underlying buffer
drains).
- Emits encrypted output through transferTo() that is LARGER than count()
advertises (per-chunk framing + cipher overhead, plus the buf.skipBytes
side effect that mutates count() mid-stream).
When the chunk buf is sized to count() - transferred() and that's smaller
than the encrypted output, transferTo() returns the +1 fallback (advancing
transferred by 1 to signal "I have more"), the buf is sent with only some
of the encrypted bytes, and the completion handler sees
`region.transferred() >= region.count()` (true now that count() has
dropped to 0) and removes the entry from the outbound buffer. The unsent
bytes still in the FileRegion's internal currentChunk are lost, the peer
gets a partial SASL frame, and the connection stalls.
Sizing the buffer to FILE_REGION_MAX_CHUNK_SIZE (default 64 KiB) ensures
small-payload encrypted writes fit in one chunk, so the +1 fallback never
fires and the count() vs transferred() race never matters. For large
payloads the buffer caps at MAX_CHUNK_SIZE as before -- no regression in
direct-memory headroom.
This is the second of two bugs in netty#16571 found while running Spark's
SparkSaslSuite under io_uring; the first (drain loop overshooting
transferred()==count()) is fixed in the previous commit.
Verified: SparkSaslSuite + RpcIntegrationSuite (21 tests covering small
RPCs, encrypted RPCs, encrypted FileRegion fetches, multi-message stream
uploads up to 512 KiB) pass cleanly across three back-to-back runs in an
OrbStack Linux container with --security-opt seccomp=unconfined.
…llback Motivation: The generic FileRegion fallback added in netty#16571 sizes its chunk buffer to `min(region.count() - region.transferred(), FILE_REGION_MAX_CHUNK_SIZE)`. That sizing treats count() as an upper bound on the bytes transferTo() will emit -- which holds for DefaultFileRegion but custom FileRegion implementations may violate it. An on-the-fly encryption or framing layer, for example, exposes count() as the source size while emitting larger output through transferTo() (ciphertext, per-chunk framing, etc.). With a too-small chunk buffer the region's output is truncated; if the implementation then advances transferred() to count() once its source is drained, the completion handler removes the entry from the outbound buffer while the rest of the bytes are still sitting inside the region. The peer sees a partial frame and the connection stalls. Modification: Allocate the full FILE_REGION_MAX_CHUNK_SIZE unconditionally. Add IoUringSocketFileRegionTest#testFileRegionEmittingMoreThanCount plus an EmitMoreThanCountFileRegion fixture that reports count = 5 while transferTo() emits 20 bytes in one call. The receiver-side byte count distinguishes the buggy path (5 bytes received) from the fixed path (20 bytes received). The test skips when the operator has configured a chunk size smaller than the region's per-call output -- the fix is bounded to outputs that fit in one chunk. Result: io_uring's generic FileRegion path no longer truncates the per-call output of custom FileRegion implementations whose actual emit size exceeds count(), so long as the per-call output fits in FILE_REGION_MAX_CHUNK_SIZE (default 64 KiB, tunable via the io.netty.iouring.fileRegionChunkSize system property). Memory cost: every generic-FileRegion send now allocates the full chunk size up front even for tiny regions, but the buffer is per in-flight write and the direct-buffer allocator pools it on release.
…llback Motivation: The generic FileRegion fallback added in netty#16571 sizes its chunk buffer to `min(region.count() - region.transferred(), FILE_REGION_MAX_CHUNK_SIZE)`. That sizing treats count() as an upper bound on the bytes transferTo() will emit -- which holds for DefaultFileRegion but custom FileRegion implementations may violate it. An on-the-fly encryption or framing layer, for example, exposes count() as the source size while emitting larger output through transferTo() (ciphertext, per-chunk framing, etc.). With a too-small chunk buffer the region's output is truncated; if the implementation then advances transferred() to count() once its source is drained, the completion handler removes the entry from the outbound buffer while the rest of the bytes are still sitting inside the region. The peer sees a partial frame and the connection stalls. Modification: Allocate the full FILE_REGION_MAX_CHUNK_SIZE unconditionally. Add IoUringSocketFileRegionTest#testFileRegionEmittingMoreThanCount plus an EmitMoreThanCountFileRegion fixture that reports count = 5 while transferTo() emits 20 bytes in one call. The receiver-side byte count distinguishes the buggy path (5 bytes received) from the fixed path (20 bytes received). The test skips when the operator has configured a chunk size smaller than the region's per-call output -- the fix is bounded to outputs that fit in one chunk. Result: io_uring's generic FileRegion path no longer truncates the per-call output of custom FileRegion implementations whose actual emit size exceeds count(), so long as the per-call output fits in FILE_REGION_MAX_CHUNK_SIZE (default 64 KiB, tunable via the io.netty.iouring.fileRegionChunkSize system property). Memory cost: every generic-FileRegion send now reserves the full chunk size up front even for tiny regions. The buffer is per in-flight write and is released when the write completes; with the default pooled/adaptive allocator it is recycled, while UnpooledByteBufAllocator pays a direct malloc/free per send.
…llback Motivation: The generic FileRegion fallback added in netty#16571 sizes its chunk buffer to `min(region.count() - region.transferred(), FILE_REGION_MAX_CHUNK_SIZE)`. That sizing treats count() as an upper bound on the bytes transferTo() will emit -- which holds for DefaultFileRegion but custom FileRegion implementations may violate it. An on-the-fly encryption or framing layer, for example, exposes count() as the source size while emitting larger output through transferTo() (ciphertext, per-chunk framing, etc.). With a too-small chunk buffer the region's output is truncated; if the implementation then advances transferred() to count() once its source is drained, the completion handler removes the entry from the outbound buffer while the rest of the bytes are still sitting inside the region. The peer sees a partial frame and the connection stalls. Modification: Allocate the full FILE_REGION_MAX_CHUNK_SIZE unconditionally. Add IoUringSocketFileRegionTest#testFileRegionEmittingMoreThanCount plus an EmitMoreThanCountFileRegion fixture that reports count = 5 while transferTo() emits 20 bytes in one call. With the fix applied the receiver observes all 20 bytes; against unpatched code the transport's 5-byte chunk buffer accepts only 5 bytes of the emitted chunk and the fixture's fail-fast short-write check throws IOException, which surfaces on the write future and fails the test. The test skips when the operator has configured a chunk size smaller than the region's per-call output -- the fix is bounded to outputs that fit in one chunk. Result: io_uring's generic FileRegion path no longer truncates the per-call output of custom FileRegion implementations whose actual emit size exceeds count(), so long as the per-call output fits in FILE_REGION_MAX_CHUNK_SIZE (default 64 KiB, tunable via the io.netty.iouring.fileRegionChunkSize system property). Memory cost: every generic-FileRegion send now reserves the full chunk size up front even for tiny regions. The buffer is per in-flight write and is released when the write completes; with the default pooled/adaptive allocator it is recycled, while UnpooledByteBufAllocator pays a direct malloc/free per send.
…llback Motivation: The generic FileRegion fallback added in netty#16571 sizes its chunk buffer to `min(region.count() - region.transferred(), FILE_REGION_MAX_CHUNK_SIZE)`. That sizing treats count() as an upper bound on the bytes transferTo() will emit -- which holds for DefaultFileRegion but custom FileRegion implementations may violate it. An on-the-fly encryption or framing layer, for example, exposes count() as the source size while emitting larger output through transferTo() (ciphertext, per-chunk framing, etc.). With a too-small chunk buffer the region's output is truncated; if the implementation then advances transferred() to count() once its source is drained, the completion handler removes the entry from the outbound buffer while the rest of the bytes are still sitting inside the region. The peer sees a partial frame and the connection stalls. Modification: Allocate the full FILE_REGION_MAX_CHUNK_SIZE unconditionally. Add IoUringSocketFileRegionTest#testFileRegionEmittingMoreThanCount plus an EmitMoreThanCountFileRegion fixture that reports count = 5 while transferTo() emits 20 bytes in one call. With the fix applied the receiver observes all 20 bytes; against unpatched code the transport's 5-byte chunk buffer accepts only 5 bytes of the emitted chunk and the fixture's fail-fast short-write check throws IOException, which surfaces on the write future and fails the test. The test skips when the operator has configured a chunk size smaller than the region's per-call output -- the fix is bounded to outputs that fit in one chunk. Result: io_uring's generic FileRegion path no longer truncates the per-call output of custom FileRegion implementations whose actual emit size exceeds count(), so long as the per-call output fits in FILE_REGION_MAX_CHUNK_SIZE (default 64 KiB, tunable via the io.netty.iouring.fileRegionChunkSize system property). Memory cost: every generic-FileRegion send now reserves the full chunk size up front even for tiny regions. The buffer is per in-flight write and is released when the write completes; with the default pooled/adaptive allocator it is recycled, while UnpooledByteBufAllocator pays a fresh native allocation/free per send.
…llback Motivation: The generic FileRegion fallback added in netty#16571 sizes its chunk buffer to `min(region.count() - region.transferred(), FILE_REGION_MAX_CHUNK_SIZE)`. That sizing treats count() as an upper bound on the bytes transferTo() will emit -- which holds for DefaultFileRegion but custom FileRegion implementations may violate it. An on-the-fly encryption or framing layer, for example, exposes count() as the source size while emitting larger output through transferTo() (ciphertext, per-chunk framing, etc.). With a too-small chunk buffer the region's output is truncated; if the implementation then advances transferred() to count() once its source is drained, the completion handler removes the entry from the outbound buffer while the rest of the bytes are still sitting inside the region. The peer sees a partial frame and the connection stalls. Modification: Allocate the full FILE_REGION_MAX_CHUNK_SIZE unconditionally. Widen the chunk-allocation try/catch from Exception to Throwable so an OutOfDirectMemoryError from alloc().directBuffer(...) -- or any Error from transferTo() -- still releases the chunk buffer (when already allocated) and surfaces on the write future instead of leaking and stalling the channel. Add SocketFileRegionTest#testFileRegionEmittingMoreThanCount plus an EmitMoreThanCountFileRegion fixture that reports count = 5 while transferTo() emits 20 bytes in one call. The test lives on the base class so every FileRegion-supporting transport (NIO, epoll, kqueue, io_uring) must satisfy the same invariant. With the fix applied the receiver observes all 20 bytes; against unpatched io_uring code the transport's 5-byte chunk buffer accepts only 5 bytes of the emitted chunk and the fixture's fail-fast short-write check throws IOException, which surfaces on the write future and fails the test. Introduce SocketFileRegionTest#maxFileRegionPerCallEmit() (default Long.MAX_VALUE) so transports that copy through an intermediate buffer can declare their per-call output cap. The two io_uring subclasses override it to CONFIGURED_CHUNK_SIZE so the test skips cleanly when an operator has shrunk the chunk size below the region's per-call output. Document the per-call-output cap on the FILE_REGION_MAX_CHUNK_SIZE field so operators tuning the system property know not to size it below the largest emit any FileRegion implementation can produce. Result: io_uring's generic FileRegion path no longer truncates the per-call output of custom FileRegion implementations whose actual emit size exceeds count(), so long as the per-call output fits in FILE_REGION_MAX_CHUNK_SIZE (default 64 KiB, tunable via the io.netty.iouring.fileRegionChunkSize system property). The new regression test is shared across all FileRegion-supporting transports. Memory cost: every generic-FileRegion send now reserves the full chunk size up front even for tiny regions. The buffer is per in-flight write and is released when the write completes; with the default pooled/adaptive allocator it is recycled, while UnpooledByteBufAllocator pays a fresh native allocation/free per send.
…llback Motivation: The generic FileRegion fallback added in netty#16571 sizes its chunk buffer to `min(region.count() - region.transferred(), FILE_REGION_MAX_CHUNK_SIZE)`. That sizing treats count() as an upper bound on the bytes transferTo() will emit -- which holds for DefaultFileRegion but custom FileRegion implementations may violate it. An on-the-fly encryption or framing layer, for example, exposes count() as the source size while emitting larger output through transferTo() (ciphertext, per-chunk framing, etc.). With a too-small chunk buffer the region's output is truncated; if the implementation then advances transferred() to count() once its source is drained, the completion handler removes the entry from the outbound buffer while the rest of the bytes are still sitting inside the region. The peer sees a partial frame and the connection stalls. Modification: Allocate the full FILE_REGION_MAX_CHUNK_SIZE unconditionally. Widen the chunk-allocation try/catch from Exception to Throwable so an OutOfDirectMemoryError from alloc().directBuffer(...) -- or any Error from transferTo() -- still releases the chunk buffer (when already allocated) and surfaces on the write future instead of leaking and stalling the channel. Add SocketFileRegionTest#testFileRegionEmittingMoreThanCount plus an EmitMoreThanCountFileRegion fixture that reports count = 5 while transferTo() emits 20 bytes in one call. The test lives on the base class so every FileRegion-supporting transport (NIO, epoll, kqueue, io_uring) must satisfy the same invariant. With the fix applied the receiver observes all 20 bytes; against unpatched io_uring code the transport's 5-byte chunk buffer accepts only 5 bytes of the emitted chunk and the fixture's fail-fast short-write check throws IOException, which surfaces on the write future and fails the test. Introduce SocketFileRegionTest#maxFileRegionPerCallEmit() (default Long.MAX_VALUE) so transports that copy through an intermediate buffer can declare their per-call output cap. The two io_uring subclasses override it to CONFIGURED_CHUNK_SIZE so the test skips cleanly when an operator has shrunk the chunk size below the region's per-call output. Document the per-call-output cap on the FILE_REGION_MAX_CHUNK_SIZE field so operators tuning the system property know not to size it below the largest emit any FileRegion implementation can produce. Result: io_uring's generic FileRegion path no longer truncates the per-call output of custom FileRegion implementations whose actual emit size exceeds count(), so long as the per-call output fits in FILE_REGION_MAX_CHUNK_SIZE (default 64 KiB, tunable via the io.netty.iouring.fileRegionChunkSize system property). The new regression test is shared across all FileRegion-supporting transports. Memory cost: every generic-FileRegion send now reserves the full chunk size up front even for tiny regions. The buffer is per in-flight write and is released when the write completes; with the default pooled/adaptive allocator it is recycled, while UnpooledByteBufAllocator pays a fresh native allocation/free per send.
…s count() (#16826) Closes #16825. ### Motivation The generic `FileRegion` fallback added in #16571 drains the region only on `buf.writableBytes() > 0`: ```java while (buf.writableBytes() > 0) { long t = region.transferTo(ch, region.transferred()); if (t <= 0) { break; } } ``` epoll's equivalent at `AbstractEpollStreamChannel#writeFileRegion` guards the call with: ```java if (region.transferred() >= region.count()) { in.remove(); return 0; } ``` so epoll never invokes `transferTo()` past `transferred() == count()`. The io_uring fallback can, which violates the `FileRegion` contract for implementations that lazily emit per-chunk framing — the extra call builds an empty next chunk and ships its framing into the buffer, breaking the receiver. See #16825 for the full diagnosis. ### Modification Bound the inner drain loop with `region.transferred() < region.count()` in addition to `buf.writableBytes() > 0`, mirroring epoll's short-circuit. Add `IoUringSocketFileRegionTest#testFileRegionDrainStopsAtCompletion` plus an `OvershootDetectingFileRegion` fixture that records every `transferTo()` invocation made past `transferred() == count()`. The fixture writes one byte and advances `transferred` to `count` in one shot, exercising the buggy path (15 overshoots on default chunk size) and the fixed path (0 overshoots) deterministically. A few test-design notes for review: - Skips on non-io_uring client permutations — the drain loop is on the sending side. - Skips when `io.netty.iouring.fileRegionChunkSize` is set to 1, since that leaves no spare capacity for the overshoot to land in. - Asserts on the writer-side counter synchronously after `writeAndFlush(...).sync()`; no time-based wait. - Routes contract violations through `IOException` rather than `AssertionError` so they surface on the write future's cause through the transport's `catch (Exception)` instead of wedging the EventLoop. - `@Timeout(value = 30, unit = SECONDS)` as a safety net for any future regression that wedges `sync()`. ### Result io_uring's generic `FileRegion` path stops over-calling `transferTo()`. Custom `FileRegion` implementations that lazily emit per-chunk framing work under io_uring with the same semantics they already have under NIO and epoll.
#16832) …s count() (#16826) Closes #16825. The generic `FileRegion` fallback added in #16571 drains the region only on `buf.writableBytes() > 0`: ```java while (buf.writableBytes() > 0) { long t = region.transferTo(ch, region.transferred()); if (t <= 0) { break; } } ``` epoll's equivalent at `AbstractEpollStreamChannel#writeFileRegion` guards the call with: ```java if (region.transferred() >= region.count()) { in.remove(); return 0; } ``` so epoll never invokes `transferTo()` past `transferred() == count()`. The io_uring fallback can, which violates the `FileRegion` contract for implementations that lazily emit per-chunk framing — the extra call builds an empty next chunk and ships its framing into the buffer, breaking the receiver. See #16825 for the full diagnosis. Bound the inner drain loop with `region.transferred() < region.count()` in addition to `buf.writableBytes() > 0`, mirroring epoll's short-circuit. Add `IoUringSocketFileRegionTest#testFileRegionDrainStopsAtCompletion` plus an `OvershootDetectingFileRegion` fixture that records every `transferTo()` invocation made past `transferred() == count()`. The fixture writes one byte and advances `transferred` to `count` in one shot, exercising the buggy path (15 overshoots on default chunk size) and the fixed path (0 overshoots) deterministically. A few test-design notes for review: - Skips on non-io_uring client permutations — the drain loop is on the sending side. - Skips when `io.netty.iouring.fileRegionChunkSize` is set to 1, since that leaves no spare capacity for the overshoot to land in. - Asserts on the writer-side counter synchronously after `writeAndFlush(...).sync()`; no time-based wait. - Routes contract violations through `IOException` rather than `AssertionError` so they surface on the write future's cause through the transport's `catch (Exception)` instead of wedging the EventLoop. - `@Timeout(value = 30, unit = SECONDS)` as a safety net for any future regression that wedges `sync()`. io_uring's generic `FileRegion` path stops over-calling `transferTo()`. Custom `FileRegion` implementations that lazily emit per-chunk framing work under io_uring with the same semantics they already have under NIO and epoll. Co-authored-by: YangJie <[email protected]>
Motivation
AbstractIoUringStreamChannelonly handlesDefaultFileRegion(via splice). Any otherFileRegionimplementation hitsUnsupportedOperationExceptionfrom the parent'sfilterOutboundMessage.This is inconsistent with EPOLL and NIO, which accept any
FileRegionvia atransferTofallback. Frameworks like Apache Spark wrapDefaultFileRegioninside compositeFileRegionimplementations (e.g.,MessageWithHeader) that work on NIO and EPOLL but crash on io_uring.See #16570
Modification
For generic
FileRegion(non-DefaultFileRegionor when splice is unsupported), let it pass throughfilterOutboundMessageinto the outbound buffer, then convert it to a directByteBufin 64 KB chunks viaFileRegion.transferToin the write path (scheduleWriteSingle/writeComplete0).DefaultFileRegion+ splice path is unchanged.transferTointo a cappedByteBufWritableByteChannel, then submitted as an io_uring async send.Also enabled
testCustomFileRegioninIoUringSocketFileRegionTestandIoUringBufferRingSocketFileRegionTest.Result
Any
FileRegionimplementation is now writable on io_uring, consistent with EPOLL and NIO.DefaultFileRegionFileRegionFileChannel.transferToregion.transferToregion.transferTofallbacktransferTo→ ByteBuf → async send