Skip to content

Add generic FileRegion support in io_uring stream channel#16571

Merged
normanmaurer merged 9 commits into
netty:4.2from
LuciferYang:fix-iouring-fileregion-fallback
Apr 21, 2026
Merged

Add generic FileRegion support in io_uring stream channel#16571
normanmaurer merged 9 commits into
netty:4.2from
LuciferYang:fix-iouring-fileregion-fallback

Conversation

@LuciferYang
Copy link
Copy Markdown
Contributor

@LuciferYang LuciferYang commented Mar 27, 2026

Motivation

AbstractIoUringStreamChannel only handles DefaultFileRegion (via splice). Any other FileRegion implementation hits UnsupportedOperationException from the parent's filterOutboundMessage.

This is inconsistent with EPOLL and NIO, which accept any FileRegion via a transferTo fallback. Frameworks like Apache Spark wrap DefaultFileRegion inside composite FileRegion implementations (e.g., MessageWithHeader) that work on NIO and EPOLL but crash on io_uring.

See #16570

Modification

For generic FileRegion (non-DefaultFileRegion or when splice is unsupported), let it pass through filterOutboundMessage into the outbound buffer, then convert it to a direct ByteBuf in 64 KB chunks via FileRegion.transferTo in the write path (scheduleWriteSingle / writeComplete0).

  • DefaultFileRegion + splice path is unchanged.
  • Data is chunked to limit memory usage — each chunk is read via transferTo into a capped ByteBufWritableByteChannel, then submitted as an io_uring async send.
  • Partial sends and EAGAIN preserve the chunk buffer for retry; unrecoverable errors release it and fail the write.
  • Empty FileRegions (count = 0) submit a 0-byte send so the completion path can remove them.

Also enabled testCustomFileRegion in IoUringSocketFileRegionTest and IoUringBufferRingSocketFileRegionTest.

Result

Any FileRegion implementation is now writable on io_uring, consistent with EPOLL and NIO.

Transport DefaultFileRegion Generic FileRegion
NIO FileChannel.transferTo region.transferTo
EPOLL native sendfile region.transferTo fallback
io_uring (before) splice UnsupportedOperationException
io_uring (after) splice (unchanged) chunked transferTo → ByteBuf → async send

@LuciferYang
Copy link
Copy Markdown
Contributor Author

cc @normanmaurer

@normanmaurer
Copy link
Copy Markdown
Member

@normanmaurer normanmaurer added this to the 4.2.13.Final milestone Mar 30, 2026
@normanmaurer normanmaurer added the needs-cherry-pick-5.0 This PR should be cherry-picked to 5.0 once merged. label Mar 30, 2026
@LuciferYang
Copy link
Copy Markdown
Contributor Author

* Maximum bytes per chunk when converting a generic {@link FileRegion}
* to {@link ByteBuf} for the io_uring async send path.
*/
private static final int FILE_REGION_MAX_CHUNK_SIZE = 64 * 1024;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Maybe add a system property ?

}
long address = IoUring.memoryAddress(buf) + buf.readerIndex();
int length = buf.readableBytes();
ops = IoUringIoOps.newSend(fd, (byte) 0, 0, address, length, nextOpsId());
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we factor this out into another method and call the methods from here ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

*/
private static final int FILE_REGION_MAX_CHUNK_SIZE = 64 * 1024;
private static final int FILE_REGION_MAX_CHUNK_SIZE =
SystemPropertyUtil.getInt("io.netty.iouring.fileRegionChunkSize", 64 * 1024);
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this property name OK?

@normanmaurer normanmaurer requested a review from chrisvest March 30, 2026 16:32
@dreamlike-ocean
Copy link
Copy Markdown
Contributor

I have a few thoughts related to this PR for your consideration.
Perhaps we can explore them in the future

  1. There was a previous report about splice performance in Proposal: configurable sendfile and pipe size for io_uring transport #15747.
    I’m curious whether the current fallback implementation shows measurable performance differences compared to splice.

If the fallback path performs better under certain workloads, would it make sense to allow selecting it explicitly via a command-line option, instead of relying purely on automatic fallback?

  1. If the chunk size exceeds the user-configured IO_URING_WRITE_ZERO_COPY_THRESHOLD, it may fall into a range where zero-copy is beneficial. In that case, would it make sense to route it through the io_uring_send_zc path instead?

@LuciferYang
Copy link
Copy Markdown
Contributor Author

Thanks for the thoughtful suggestions, @dreamlike-ocean!

  1. Splice vs fallback performance: Good point linking to Proposal: configurable sendfile and pipe size for io_uring transport #15747. In the current implementation, DefaultFileRegion takes the splice path when IoUring.isSpliceSupported() is true; otherwise it falls through to the same chunked transferToByteBuf → send fallback as generic FileRegion. So the fallback path already serves as an implicit alternative when splice is unavailable. Making it explicitly selectable (e.g., via a config option to bypass splice even when supported) could be a useful follow-up, but I think that fits better as part of Proposal: configurable sendfile and pipe size for io_uring transport #15747 rather than in this PR.

  2. Zero-copy send for large chunks: Yes, this is a valid optimization opportunity. Currently the fallback path submits chunks via IoUringIoOps.newSend() unconditionally — it doesn't check shouldWriteZeroCopy() / route through send_zc, even when the chunk size exceeds IO_URING_WRITE_ZERO_COPY_THRESHOLD. This could be added as a follow-up enhancement.

This PR focuses on correctness — making any FileRegion work on io_uring instead of throwing UnsupportedOperationException, consistent with NIO and EPOLL. Both of your suggestions are good directions for future performance optimization.

@normanmaurer
Copy link
Copy Markdown
Member

@chrisvest PTAL

@LuciferYang
Copy link
Copy Markdown
Contributor Author

What's the progress on this pr ...

Copy link
Copy Markdown
Member

@chrisvest chrisvest left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had some comments.

try {
ByteBufWritableByteChannel ch = new ByteBufWritableByteChannel(buf);
int written = 0;
while (written < chunkSize) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use buf.writableBytes() > 0 as a condition and remove written

}
} else {
// Don't release the chunk buffer before checking the error —
// for retryable errors (EAGAIN) we need to re-send the same data.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where are we re-sending the same data? The FileRegion.transferred() got incremented in the scheduleWriteFileRegion call.

} catch (Throwable cause) {
releaseFileRegionChunkBuf();
handleWriteError(cause);
return true;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This return true looks redundant as we could fall through to the one below, right?

@Override
protected boolean supportsCustomFileRegion() {
return false;
return true;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add some tests that exercises the chunking, and also a test where we send two FileRegions in a row where at least the first one exercises the chunking.

The tests enabled by this flag only send 8 KiB of data, I think.

@LuciferYang LuciferYang force-pushed the fix-iouring-fileregion-fallback branch from 276cbfa to 28c2be7 Compare April 17, 2026 05:48
@LuciferYang LuciferYang force-pushed the fix-iouring-fileregion-fallback branch from 28c2be7 to 9db54a0 Compare April 17, 2026 05:54
@LuciferYang
Copy link
Copy Markdown
Contributor Author

All five addressed (9db54a0):

  1. while (buf.writableBytes() > 0), written removed.

  2. Added a note on FileRegion.transferred() — with the io_uring chunked fallback the counter advances when bytes are queued for submission, not when they're confirmed on the wire.

  3. Reworded the EAGAIN comment. ioResult returns 0, and scheduleWriteFileRegion resubmits the same chunk buffer once POLLOUT fires (transferred() is already advanced, so the buffer holds the only copy of those bytes).

  4. Dropped the inner try/catch — the outer catch handles the throw path the same way and falls through to return true.

  5. Added testCustomFileRegionChunking and testTwoCustomFileRegionsWithChunking, and flipped IoUringBufferRingSocketFileRegionTest#supportsCustomFileRegion to true so the buffer-ring variant runs them too. Both tests wrap a DefaultFileRegion in a CountingFileRegion that delegates all methods but counts transferTo calls, so I can assert transferToCalls >= ceil(payload/chunkSize) — a mutation disabling chunking would fail.

A few small adjacent fixes came up while wiring this up:

  • Capped FILE_REGION_MAX_CHUNK_SIZE at 16 MiB so a bad io.netty.iouring.fileRegionChunkSize override can't OOM the process.
  • Included count and transferred in the "transferTo produced 0 bytes" ChannelException message so it's clear which region failed.

@LuciferYang
Copy link
Copy Markdown
Contributor Author

@chrisvest bc2c593 try to fix the failed test

The chunking assertions in testCustomFileRegionChunking and testTwoCustomFileRegionsWithChunking expected at least 8 transferTo calls for a 512 KB payload with a 64 KB chunk, but CI only saw 1.

The tests run through AbstractComboTestsuiteTest, which cycles through three combos from IoUringSocketTestPermutation: io_uring+io_uring, io_uring+NIO client, and NIO+io_uring. The chunking code in AbstractIoUringStreamChannel.scheduleWriteFileRegion only runs when the client is IoUringSocketChannel. With an NIO client, NioSocketChannel.doWriteFileRegion passes the socket channel straight into region.transferTo(javaChannel(), position), and the JDK routes that through sendfile(2) — one kernel call copies the whole 512 KB, so transferTo is invoked exactly once. That's the correct behavior for NIO; the assertion just doesn't apply there.

Fix: after connect, capture boolean ioUringClient = cc instanceof IoUringSocketChannel; and gate the chunking count check with assumeTrue(ioUringClient, ...). The byte-count and exception assertions still run for all three combos, so data correctness is covered everywhere; only the chunking-specific count is skipped on the NIO-client combo.

IoUringSocketChannel is final and clientSocket() only hands out IoUringSocketChannel or NioSocketChannel, so the instanceof check covers every case the permutation can produce.

@normanmaurer normanmaurer merged commit 8204af4 into netty:4.2 Apr 21, 2026
19 checks passed
@netty-project-bot
Copy link
Copy Markdown
Contributor

Could not create auto-port PR.
Got conflicts when cherry-picking onto 5.0.

@LuciferYang
Copy link
Copy Markdown
Contributor Author

Thank you @normanmaurer and @chrisvest

@LuciferYang
Copy link
Copy Markdown
Contributor Author

Could not create auto-port PR. Got conflicts when cherry-picking onto 5.0.

What do I need to do for this?

@normanmaurer
Copy link
Copy Markdown
Member

@LuciferYang nothing ... PR for 5.0 is #16677

@chrisvest chrisvest removed the needs-cherry-pick-5.0 This PR should be cherry-picked to 5.0 once merged. label Apr 21, 2026
normanmaurer added a commit that referenced this pull request Apr 22, 2026
…16677)

`AbstractIoUringStreamChannel` only handles `DefaultFileRegion` (via
splice). Any other `FileRegion` implementation hits
`UnsupportedOperationException` from the parent's
`filterOutboundMessage`.

This is inconsistent with EPOLL and NIO, which accept any `FileRegion`
via a `transferTo` fallback. Frameworks like Apache Spark wrap
`DefaultFileRegion` inside composite `FileRegion` implementations (e.g.,
`MessageWithHeader`) that work on NIO and EPOLL but crash on io_uring.

See #16570

For generic `FileRegion` (non-`DefaultFileRegion` or when splice is
unsupported), let it pass through `filterOutboundMessage` into the
outbound buffer, then convert it to a direct `ByteBuf` in **64 KB
chunks** via `FileRegion.transferTo` in the write path
(`scheduleWriteSingle` / `writeComplete0`).

- `DefaultFileRegion` + splice path is unchanged.
- Data is chunked to limit memory usage — each chunk is read via
`transferTo` into a capped `ByteBufWritableByteChannel`, then submitted
as an io_uring async send.
- Partial sends and EAGAIN preserve the chunk buffer for retry;
unrecoverable errors release it and fail the write.
- Empty FileRegions (count = 0) submit a 0-byte send so the completion
path can remove them.

Also enabled `testCustomFileRegion` in `IoUringSocketFileRegionTest` and
`IoUringBufferRingSocketFileRegionTest`.

Any `FileRegion` implementation is now writable on io_uring, consistent
with EPOLL and NIO.

| Transport | `DefaultFileRegion` | Generic `FileRegion` |
|---|---|---|
| NIO | `FileChannel.transferTo` | `region.transferTo` |
| EPOLL | native sendfile | `region.transferTo` fallback |
| io_uring (before) | splice | **UnsupportedOperationException** |
| io_uring (after) | splice (unchanged) | chunked `transferTo` → ByteBuf
→ async send |

---------

Co-authored-by: Norman Maurer <[email protected]>

Co-authored-by: YangJie <[email protected]>
LuciferYang added a commit to LuciferYang/netty that referenced this pull request May 19, 2026
…s count()

Motivation:

The generic FileRegion fallback added in netty#16571 drains the region by
looping on `buf.writableBytes() > 0` only. epoll's writeFileRegion()
guards the same call with `if (region.transferred() >= region.count())
{ in.remove(); return 0; }` -- the FileRegion contract permits
implementations to assume transferTo() is not invoked past
transferred() == count(). Calling transferTo() past completion can
trigger side effects in custom AbstractFileRegion implementations --
for example emitting an extra framing prefix or trailer onto the wire
-- which then corrupts the protocol on the receiving side.

Modification:

Tighten the io_uring drain-loop condition with
`region.transferred() < region.count()` in addition to
`buf.writableBytes() > 0`, mirroring epoll's short-circuit.

Add IoUringSocketFileRegionTest#testFileRegionDrainStopsAtCompletion
plus an OvershootDetectingFileRegion fixture that records every
transferTo() call made past transferred() == count(). The test fails
on the prior implementation and passes with this change.

Result:

io_uring's generic FileRegion path no longer over-calls transferTo().
Custom AbstractFileRegion implementations that rely on the documented
contract (e.g. Apache Spark's SaslEncryption, where SparkSaslSuite
fails deterministically under io_uring on 4.2.13) work as expected.
LuciferYang added a commit to LuciferYang/netty that referenced this pull request May 19, 2026
…s count()

Motivation:

The generic FileRegion fallback added in netty#16571 drains the region by
looping on `buf.writableBytes() > 0` only. epoll's writeFileRegion()
guards the same call with `if (region.transferred() >= region.count())
{ in.remove(); return 0; }` -- the FileRegion contract permits
implementations to assume transferTo() is not invoked past
transferred() == count(). Calling transferTo() past completion can
trigger side effects in custom FileRegion implementations that lazily
emit per-chunk framing -- for example an encryption, compression, or
framing layer that flushes a completed inner chunk on each call --
producing a phantom chunk header that lands on the wire and corrupts
the receiver's framing.

Modification:

Tighten the io_uring drain-loop condition with
`region.transferred() < region.count()` in addition to
`buf.writableBytes() > 0`, mirroring epoll's short-circuit.

Add IoUringSocketFileRegionTest#testFileRegionDrainStopsAtCompletion
plus an OvershootDetectingFileRegion fixture that records every
transferTo() call made past transferred() == count(). The test fails
on the prior implementation and passes with this change.

Result:

io_uring's generic FileRegion path no longer over-calls transferTo().
Custom FileRegion implementations that rely on the documented contract
work under io_uring with the same semantics they already have under
NIO and epoll.
LuciferYang added a commit to LuciferYang/netty that referenced this pull request May 19, 2026
…s count()

Motivation:

The generic FileRegion fallback added in netty#16571 drains the region by
looping on `buf.writableBytes() > 0` only. epoll's writeFileRegion()
guards the same call with `if (region.transferred() >= region.count())
{ in.remove(); return 0; }` -- the FileRegion contract permits
implementations to assume transferTo() is not invoked past
transferred() == count(). Calling transferTo() past completion can
trigger side effects in custom FileRegion implementations that lazily
emit per-chunk framing -- for example an encryption, compression, or
framing layer that flushes a completed inner chunk on each call --
producing a phantom chunk header that lands on the wire and corrupts
the receiver's framing.

Modification:

Tighten the io_uring drain-loop condition with
`region.transferred() < region.count()` in addition to
`buf.writableBytes() > 0`, mirroring epoll's short-circuit.

Add IoUringSocketFileRegionTest#testFileRegionDrainStopsAtCompletion
plus an OvershootDetectingFileRegion fixture that records every
transferTo() call made past transferred() == count(). The test fails
on the prior implementation and passes with this change.

Result:

io_uring's generic FileRegion path no longer over-calls transferTo().
Custom FileRegion implementations that rely on the documented contract
work under io_uring with the same semantics they already have under
NIO and epoll.
LuciferYang added a commit to LuciferYang/netty that referenced this pull request May 19, 2026
…s count()

Motivation:

The generic FileRegion fallback added in netty#16571 drains the region by
looping on `buf.writableBytes() > 0` only. epoll's writeFileRegion()
guards the same call with `if (region.transferred() >= region.count())
{ in.remove(); return 0; }` -- the FileRegion contract permits
implementations to assume transferTo() is not invoked past
transferred() == count(). Calling transferTo() past completion can
trigger side effects in custom FileRegion implementations that lazily
emit per-chunk framing -- for example an encryption, compression, or
framing layer that flushes a completed inner chunk on each call --
producing a phantom chunk header that lands on the wire and corrupts
the receiver's framing.

Modification:

Tighten the io_uring drain-loop condition with
`region.transferred() < region.count()` in addition to
`buf.writableBytes() > 0`, mirroring epoll's short-circuit.

Add IoUringSocketFileRegionTest#testFileRegionDrainStopsAtCompletion
plus an OvershootDetectingFileRegion fixture that records every
transferTo() call made past transferred() == count(). The test fails
on the prior implementation and passes with this change.

Result:

io_uring's generic FileRegion path no longer over-calls transferTo().
Custom FileRegion implementations that rely on the documented contract
work under io_uring with the same semantics they already have under
NIO and epoll.
LuciferYang added a commit to LuciferYang/netty that referenced this pull request May 19, 2026
…s count()

Motivation:

The generic FileRegion fallback added in netty#16571 drains the region by
looping on `buf.writableBytes() > 0` only. epoll's writeFileRegion()
guards the same call with `if (region.transferred() >= region.count())
{ in.remove(); return 0; }` -- the FileRegion contract permits
implementations to assume transferTo() is not invoked past
transferred() == count(). Calling transferTo() past completion can
trigger side effects in custom FileRegion implementations that lazily
emit per-chunk framing -- for example an encryption, compression, or
framing layer that flushes a completed inner chunk on each call --
producing a phantom chunk header that lands on the wire and corrupts
the receiver's framing.

Modification:

Tighten the io_uring drain-loop condition with
`region.transferred() < region.count()` in addition to
`buf.writableBytes() > 0`, mirroring epoll's short-circuit.

Add IoUringSocketFileRegionTest#testFileRegionDrainStopsAtCompletion
plus an OvershootDetectingFileRegion fixture that records every
transferTo() call made past transferred() == count(). The test fails
on the prior implementation and passes with this change.

Result:

io_uring's generic FileRegion path no longer over-calls transferTo().
Custom FileRegion implementations that rely on the documented contract
work under io_uring with the same semantics they already have under
NIO and epoll.
LuciferYang added a commit to LuciferYang/netty that referenced this pull request May 19, 2026
…tual bytes

Allocate the full FILE_REGION_MAX_CHUNK_SIZE for the chunk buffer instead
of `min(count() - transferred(), MAX)`. The count()-based sizing assumes
that count() is an accurate upper bound on the bytes a FileRegion will
write to its target -- this holds for DefaultFileRegion but is violated by
encryption-on-write implementations like Spark's
SaslEncryption.EncryptedMessage, which:

  - Returns count() = unencrypted message size (or buf.readableBytes() for
    its ByteBuf-backed variant, which decreases as the underlying buffer
    drains).
  - Emits encrypted output through transferTo() that is LARGER than count()
    advertises (per-chunk framing + cipher overhead, plus the buf.skipBytes
    side effect that mutates count() mid-stream).

When the chunk buf is sized to count() - transferred() and that's smaller
than the encrypted output, transferTo() returns the +1 fallback (advancing
transferred by 1 to signal "I have more"), the buf is sent with only some
of the encrypted bytes, and the completion handler sees
`region.transferred() >= region.count()` (true now that count() has
dropped to 0) and removes the entry from the outbound buffer. The unsent
bytes still in the FileRegion's internal currentChunk are lost, the peer
gets a partial SASL frame, and the connection stalls.

Sizing the buffer to FILE_REGION_MAX_CHUNK_SIZE (default 64 KiB) ensures
small-payload encrypted writes fit in one chunk, so the +1 fallback never
fires and the count() vs transferred() race never matters. For large
payloads the buffer caps at MAX_CHUNK_SIZE as before -- no regression in
direct-memory headroom.

This is the second of two bugs in netty#16571 found while running Spark's
SparkSaslSuite under io_uring; the first (drain loop overshooting
transferred()==count()) is fixed in the previous commit.

Verified: SparkSaslSuite + RpcIntegrationSuite (21 tests covering small
RPCs, encrypted RPCs, encrypted FileRegion fetches, multi-message stream
uploads up to 512 KiB) pass cleanly across three back-to-back runs in an
OrbStack Linux container with --security-opt seccomp=unconfined.
LuciferYang added a commit to LuciferYang/netty that referenced this pull request May 19, 2026
…llback

Motivation:

The generic FileRegion fallback added in netty#16571 sizes its chunk buffer
to `min(region.count() - region.transferred(), FILE_REGION_MAX_CHUNK_SIZE)`.
That sizing treats count() as an upper bound on the bytes transferTo()
will emit -- which holds for DefaultFileRegion but custom FileRegion
implementations may violate it. An on-the-fly encryption or framing
layer, for example, exposes count() as the source size while emitting
larger output through transferTo() (ciphertext, per-chunk framing,
etc.). With a too-small chunk buffer the region's output is truncated;
if the implementation then advances transferred() to count() once its
source is drained, the completion handler removes the entry from the
outbound buffer while the rest of the bytes are still sitting inside
the region. The peer sees a partial frame and the connection stalls.

Modification:

Allocate the full FILE_REGION_MAX_CHUNK_SIZE unconditionally.

Add IoUringSocketFileRegionTest#testFileRegionEmittingMoreThanCount
plus an EmitMoreThanCountFileRegion fixture that reports count = 5
while transferTo() emits 20 bytes in one call. The receiver-side byte
count distinguishes the buggy path (5 bytes received) from the fixed
path (20 bytes received). The test skips when the operator has
configured a chunk size smaller than the region's per-call output --
the fix is bounded to outputs that fit in one chunk.

Result:

io_uring's generic FileRegion path no longer truncates the per-call
output of custom FileRegion implementations whose actual emit size
exceeds count(), so long as the per-call output fits in
FILE_REGION_MAX_CHUNK_SIZE (default 64 KiB, tunable via the
io.netty.iouring.fileRegionChunkSize system property). Memory cost:
every generic-FileRegion send now allocates the full chunk size up
front even for tiny regions, but the buffer is per in-flight write
and the direct-buffer allocator pools it on release.
LuciferYang added a commit to LuciferYang/netty that referenced this pull request May 19, 2026
…llback

Motivation:

The generic FileRegion fallback added in netty#16571 sizes its chunk buffer
to `min(region.count() - region.transferred(), FILE_REGION_MAX_CHUNK_SIZE)`.
That sizing treats count() as an upper bound on the bytes transferTo()
will emit -- which holds for DefaultFileRegion but custom FileRegion
implementations may violate it. An on-the-fly encryption or framing
layer, for example, exposes count() as the source size while emitting
larger output through transferTo() (ciphertext, per-chunk framing,
etc.). With a too-small chunk buffer the region's output is truncated;
if the implementation then advances transferred() to count() once its
source is drained, the completion handler removes the entry from the
outbound buffer while the rest of the bytes are still sitting inside
the region. The peer sees a partial frame and the connection stalls.

Modification:

Allocate the full FILE_REGION_MAX_CHUNK_SIZE unconditionally.

Add IoUringSocketFileRegionTest#testFileRegionEmittingMoreThanCount
plus an EmitMoreThanCountFileRegion fixture that reports count = 5
while transferTo() emits 20 bytes in one call. The receiver-side byte
count distinguishes the buggy path (5 bytes received) from the fixed
path (20 bytes received). The test skips when the operator has
configured a chunk size smaller than the region's per-call output --
the fix is bounded to outputs that fit in one chunk.

Result:

io_uring's generic FileRegion path no longer truncates the per-call
output of custom FileRegion implementations whose actual emit size
exceeds count(), so long as the per-call output fits in
FILE_REGION_MAX_CHUNK_SIZE (default 64 KiB, tunable via the
io.netty.iouring.fileRegionChunkSize system property). Memory cost:
every generic-FileRegion send now reserves the full chunk size up
front even for tiny regions. The buffer is per in-flight write and is
released when the write completes; with the default pooled/adaptive
allocator it is recycled, while UnpooledByteBufAllocator pays a
direct malloc/free per send.
LuciferYang added a commit to LuciferYang/netty that referenced this pull request May 19, 2026
…llback

Motivation:

The generic FileRegion fallback added in netty#16571 sizes its chunk buffer
to `min(region.count() - region.transferred(), FILE_REGION_MAX_CHUNK_SIZE)`.
That sizing treats count() as an upper bound on the bytes transferTo()
will emit -- which holds for DefaultFileRegion but custom FileRegion
implementations may violate it. An on-the-fly encryption or framing
layer, for example, exposes count() as the source size while emitting
larger output through transferTo() (ciphertext, per-chunk framing,
etc.). With a too-small chunk buffer the region's output is truncated;
if the implementation then advances transferred() to count() once its
source is drained, the completion handler removes the entry from the
outbound buffer while the rest of the bytes are still sitting inside
the region. The peer sees a partial frame and the connection stalls.

Modification:

Allocate the full FILE_REGION_MAX_CHUNK_SIZE unconditionally.

Add IoUringSocketFileRegionTest#testFileRegionEmittingMoreThanCount
plus an EmitMoreThanCountFileRegion fixture that reports count = 5
while transferTo() emits 20 bytes in one call. With the fix applied
the receiver observes all 20 bytes; against unpatched code the
transport's 5-byte chunk buffer accepts only 5 bytes of the emitted
chunk and the fixture's fail-fast short-write check throws IOException,
which surfaces on the write future and fails the test. The test skips
when the operator has configured a chunk size smaller than the
region's per-call output -- the fix is bounded to outputs that fit
in one chunk.

Result:

io_uring's generic FileRegion path no longer truncates the per-call
output of custom FileRegion implementations whose actual emit size
exceeds count(), so long as the per-call output fits in
FILE_REGION_MAX_CHUNK_SIZE (default 64 KiB, tunable via the
io.netty.iouring.fileRegionChunkSize system property). Memory cost:
every generic-FileRegion send now reserves the full chunk size up
front even for tiny regions. The buffer is per in-flight write and is
released when the write completes; with the default pooled/adaptive
allocator it is recycled, while UnpooledByteBufAllocator pays a
direct malloc/free per send.
LuciferYang added a commit to LuciferYang/netty that referenced this pull request May 19, 2026
…llback

Motivation:

The generic FileRegion fallback added in netty#16571 sizes its chunk buffer
to `min(region.count() - region.transferred(), FILE_REGION_MAX_CHUNK_SIZE)`.
That sizing treats count() as an upper bound on the bytes transferTo()
will emit -- which holds for DefaultFileRegion but custom FileRegion
implementations may violate it. An on-the-fly encryption or framing
layer, for example, exposes count() as the source size while emitting
larger output through transferTo() (ciphertext, per-chunk framing,
etc.). With a too-small chunk buffer the region's output is truncated;
if the implementation then advances transferred() to count() once its
source is drained, the completion handler removes the entry from the
outbound buffer while the rest of the bytes are still sitting inside
the region. The peer sees a partial frame and the connection stalls.

Modification:

Allocate the full FILE_REGION_MAX_CHUNK_SIZE unconditionally.

Add IoUringSocketFileRegionTest#testFileRegionEmittingMoreThanCount
plus an EmitMoreThanCountFileRegion fixture that reports count = 5
while transferTo() emits 20 bytes in one call. With the fix applied
the receiver observes all 20 bytes; against unpatched code the
transport's 5-byte chunk buffer accepts only 5 bytes of the emitted
chunk and the fixture's fail-fast short-write check throws IOException,
which surfaces on the write future and fails the test. The test skips
when the operator has configured a chunk size smaller than the
region's per-call output -- the fix is bounded to outputs that fit
in one chunk.

Result:

io_uring's generic FileRegion path no longer truncates the per-call
output of custom FileRegion implementations whose actual emit size
exceeds count(), so long as the per-call output fits in
FILE_REGION_MAX_CHUNK_SIZE (default 64 KiB, tunable via the
io.netty.iouring.fileRegionChunkSize system property). Memory cost:
every generic-FileRegion send now reserves the full chunk size up
front even for tiny regions. The buffer is per in-flight write and is
released when the write completes; with the default pooled/adaptive
allocator it is recycled, while UnpooledByteBufAllocator pays a
fresh native allocation/free per send.
LuciferYang added a commit to LuciferYang/netty that referenced this pull request May 19, 2026
…llback

Motivation:

The generic FileRegion fallback added in netty#16571 sizes its chunk buffer
to `min(region.count() - region.transferred(), FILE_REGION_MAX_CHUNK_SIZE)`.
That sizing treats count() as an upper bound on the bytes transferTo()
will emit -- which holds for DefaultFileRegion but custom FileRegion
implementations may violate it. An on-the-fly encryption or framing
layer, for example, exposes count() as the source size while emitting
larger output through transferTo() (ciphertext, per-chunk framing,
etc.). With a too-small chunk buffer the region's output is truncated;
if the implementation then advances transferred() to count() once its
source is drained, the completion handler removes the entry from the
outbound buffer while the rest of the bytes are still sitting inside
the region. The peer sees a partial frame and the connection stalls.

Modification:

Allocate the full FILE_REGION_MAX_CHUNK_SIZE unconditionally.

Widen the chunk-allocation try/catch from Exception to Throwable so an
OutOfDirectMemoryError from alloc().directBuffer(...) -- or any Error
from transferTo() -- still releases the chunk buffer (when already
allocated) and surfaces on the write future instead of leaking and
stalling the channel.

Add SocketFileRegionTest#testFileRegionEmittingMoreThanCount plus an
EmitMoreThanCountFileRegion fixture that reports count = 5 while
transferTo() emits 20 bytes in one call. The test lives on the base
class so every FileRegion-supporting transport (NIO, epoll, kqueue,
io_uring) must satisfy the same invariant. With the fix applied the
receiver observes all 20 bytes; against unpatched io_uring code the
transport's 5-byte chunk buffer accepts only 5 bytes of the emitted
chunk and the fixture's fail-fast short-write check throws IOException,
which surfaces on the write future and fails the test.

Introduce SocketFileRegionTest#maxFileRegionPerCallEmit() (default
Long.MAX_VALUE) so transports that copy through an intermediate buffer
can declare their per-call output cap. The two io_uring subclasses
override it to CONFIGURED_CHUNK_SIZE so the test skips cleanly when an
operator has shrunk the chunk size below the region's per-call output.

Document the per-call-output cap on the FILE_REGION_MAX_CHUNK_SIZE
field so operators tuning the system property know not to size it
below the largest emit any FileRegion implementation can produce.

Result:

io_uring's generic FileRegion path no longer truncates the per-call
output of custom FileRegion implementations whose actual emit size
exceeds count(), so long as the per-call output fits in
FILE_REGION_MAX_CHUNK_SIZE (default 64 KiB, tunable via the
io.netty.iouring.fileRegionChunkSize system property). The new
regression test is shared across all FileRegion-supporting transports.

Memory cost: every generic-FileRegion send now reserves the full chunk
size up front even for tiny regions. The buffer is per in-flight write
and is released when the write completes; with the default
pooled/adaptive allocator it is recycled, while
UnpooledByteBufAllocator pays a fresh native allocation/free per send.
LuciferYang added a commit to LuciferYang/netty that referenced this pull request May 19, 2026
…llback

Motivation:

The generic FileRegion fallback added in netty#16571 sizes its chunk buffer
to `min(region.count() - region.transferred(), FILE_REGION_MAX_CHUNK_SIZE)`.
That sizing treats count() as an upper bound on the bytes transferTo()
will emit -- which holds for DefaultFileRegion but custom FileRegion
implementations may violate it. An on-the-fly encryption or framing
layer, for example, exposes count() as the source size while emitting
larger output through transferTo() (ciphertext, per-chunk framing,
etc.). With a too-small chunk buffer the region's output is truncated;
if the implementation then advances transferred() to count() once its
source is drained, the completion handler removes the entry from the
outbound buffer while the rest of the bytes are still sitting inside
the region. The peer sees a partial frame and the connection stalls.

Modification:

Allocate the full FILE_REGION_MAX_CHUNK_SIZE unconditionally.

Widen the chunk-allocation try/catch from Exception to Throwable so an
OutOfDirectMemoryError from alloc().directBuffer(...) -- or any Error
from transferTo() -- still releases the chunk buffer (when already
allocated) and surfaces on the write future instead of leaking and
stalling the channel.

Add SocketFileRegionTest#testFileRegionEmittingMoreThanCount plus an
EmitMoreThanCountFileRegion fixture that reports count = 5 while
transferTo() emits 20 bytes in one call. The test lives on the base
class so every FileRegion-supporting transport (NIO, epoll, kqueue,
io_uring) must satisfy the same invariant. With the fix applied the
receiver observes all 20 bytes; against unpatched io_uring code the
transport's 5-byte chunk buffer accepts only 5 bytes of the emitted
chunk and the fixture's fail-fast short-write check throws IOException,
which surfaces on the write future and fails the test.

Introduce SocketFileRegionTest#maxFileRegionPerCallEmit() (default
Long.MAX_VALUE) so transports that copy through an intermediate buffer
can declare their per-call output cap. The two io_uring subclasses
override it to CONFIGURED_CHUNK_SIZE so the test skips cleanly when an
operator has shrunk the chunk size below the region's per-call output.

Document the per-call-output cap on the FILE_REGION_MAX_CHUNK_SIZE
field so operators tuning the system property know not to size it
below the largest emit any FileRegion implementation can produce.

Result:

io_uring's generic FileRegion path no longer truncates the per-call
output of custom FileRegion implementations whose actual emit size
exceeds count(), so long as the per-call output fits in
FILE_REGION_MAX_CHUNK_SIZE (default 64 KiB, tunable via the
io.netty.iouring.fileRegionChunkSize system property). The new
regression test is shared across all FileRegion-supporting transports.

Memory cost: every generic-FileRegion send now reserves the full chunk
size up front even for tiny regions. The buffer is per in-flight write
and is released when the write completes; with the default
pooled/adaptive allocator it is recycled, while
UnpooledByteBufAllocator pays a fresh native allocation/free per send.
normanmaurer pushed a commit that referenced this pull request May 19, 2026
…s count() (#16826)

Closes #16825.

### Motivation

The generic `FileRegion` fallback added in #16571 drains the region only
on `buf.writableBytes() > 0`:

```java
while (buf.writableBytes() > 0) {
    long t = region.transferTo(ch, region.transferred());
    if (t <= 0) {
        break;
    }
}
```

epoll's equivalent at `AbstractEpollStreamChannel#writeFileRegion`
guards the call with:

```java
if (region.transferred() >= region.count()) {
    in.remove();
    return 0;
}
```

so epoll never invokes `transferTo()` past `transferred() == count()`.
The io_uring fallback can, which violates the `FileRegion` contract for
implementations that lazily emit per-chunk framing — the extra call
builds an empty next chunk and ships its framing into the buffer,
breaking the receiver. See #16825 for the full diagnosis.

### Modification

Bound the inner drain loop with `region.transferred() < region.count()`
in addition to `buf.writableBytes() > 0`, mirroring epoll's
short-circuit.

Add `IoUringSocketFileRegionTest#testFileRegionDrainStopsAtCompletion`
plus an `OvershootDetectingFileRegion` fixture that records every
`transferTo()` invocation made past `transferred() == count()`. The
fixture writes one byte and advances `transferred` to `count` in one
shot, exercising the buggy path (15 overshoots on default chunk size)
and the fixed path (0 overshoots) deterministically.

A few test-design notes for review:

- Skips on non-io_uring client permutations — the drain loop is on the
sending side.
- Skips when `io.netty.iouring.fileRegionChunkSize` is set to 1, since
that leaves no spare capacity for the overshoot to land in.
- Asserts on the writer-side counter synchronously after
`writeAndFlush(...).sync()`; no time-based wait.
- Routes contract violations through `IOException` rather than
`AssertionError` so they surface on the write future's cause through the
transport's `catch (Exception)` instead of wedging the EventLoop.
- `@Timeout(value = 30, unit = SECONDS)` as a safety net for any future
regression that wedges `sync()`.

### Result

io_uring's generic `FileRegion` path stops over-calling `transferTo()`.
Custom `FileRegion` implementations that lazily emit per-chunk framing
work under io_uring with the same semantics they already have under NIO
and epoll.
normanmaurer added a commit that referenced this pull request May 19, 2026
#16832)

…s count() (#16826)

Closes #16825.

The generic `FileRegion` fallback added in #16571 drains the region only
on `buf.writableBytes() > 0`:

```java
while (buf.writableBytes() > 0) {
    long t = region.transferTo(ch, region.transferred());
    if (t <= 0) {
        break;
    }
}
```

epoll's equivalent at `AbstractEpollStreamChannel#writeFileRegion`
guards the call with:

```java
if (region.transferred() >= region.count()) {
    in.remove();
    return 0;
}
```

so epoll never invokes `transferTo()` past `transferred() == count()`.
The io_uring fallback can, which violates the `FileRegion` contract for
implementations that lazily emit per-chunk framing — the extra call
builds an empty next chunk and ships its framing into the buffer,
breaking the receiver. See #16825 for the full diagnosis.

Bound the inner drain loop with `region.transferred() < region.count()`
in addition to `buf.writableBytes() > 0`, mirroring epoll's
short-circuit.

Add `IoUringSocketFileRegionTest#testFileRegionDrainStopsAtCompletion`
plus an `OvershootDetectingFileRegion` fixture that records every
`transferTo()` invocation made past `transferred() == count()`. The
fixture writes one byte and advances `transferred` to `count` in one
shot, exercising the buggy path (15 overshoots on default chunk size)
and the fixed path (0 overshoots) deterministically.

A few test-design notes for review:

- Skips on non-io_uring client permutations — the drain loop is on the
sending side.
- Skips when `io.netty.iouring.fileRegionChunkSize` is set to 1, since
that leaves no spare capacity for the overshoot to land in.
- Asserts on the writer-side counter synchronously after
`writeAndFlush(...).sync()`; no time-based wait.
- Routes contract violations through `IOException` rather than
`AssertionError` so they surface on the write future's cause through the
transport's `catch (Exception)` instead of wedging the EventLoop.
- `@Timeout(value = 30, unit = SECONDS)` as a safety net for any future
regression that wedges `sync()`.

io_uring's generic `FileRegion` path stops over-calling `transferTo()`.
Custom `FileRegion` implementations that lazily emit per-chunk framing
work under io_uring with the same semantics they already have under NIO
and epoll.

Co-authored-by: YangJie <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants