Skip to content

Save slicing HTTP 2 headers & data#13783

Open
franz1981 wants to merge 2 commits into
netty:4.1from
franz1981:4.1_no_slice
Open

Save slicing HTTP 2 headers & data#13783
franz1981 wants to merge 2 commits into
netty:4.1from
franz1981:4.1_no_slice

Conversation

@franz1981
Copy link
Copy Markdown
Contributor

Motivation:

DefaultHttp2FrameWriter's always create aggregate promises and slices out of headers and data: both could be saved while reducing the amount of pipeline traversals in case the additional cost of creating a sliced buffer surpass the data to be written.

Modifications:

Small header and data could be copied directly in a single buffer, without any need to create aggregate promises.

Result:

Faster small data's writes

@franz1981
Copy link
Copy Markdown
Contributor Author

franz1981 commented Jan 16, 2024

@normanmaurer I'm not quite sure i could reuse the promise, so please take a look if I'm not assuming anything wrong...

The cutoff value has been decided by summing the required bytes for both a new sliced buffer (around 40 bytes ) + outbound buffer's entries (64 or 64 * 2 bytes, depending if headers/data paths) + promise aggregator (64 bytes), which means > 128 bytes.
Choosing 128 as a cutoff value has been conservative but able to capture many existing cases.
@idelpivnitskiy are you aware if this is something useful beyond microbenchmark, in term of cut-off values?

From what I can see, encoded response headers below 128 bytes are quite common, while for the data, depends, but it is still a possible scenario, although more typical in benchmarks IMO.

@franz1981
Copy link
Copy Markdown
Contributor Author

This is just a preliminary example of the improvement, given that it doesn't account for the saving of the aggregate promise (which translate in less morphism of call-sites around the promises handling):

before:

Benchmark                                (padding)  (payloadSize)  (pooled)  Mode  Cnt    Score   Error  Units
Http2FrameWriterDataBenchmark.newWriter          0             64      true  avgt   10  112.406 ± 1.096  ns/op
Http2FrameWriterDataBenchmark.newWriter          0             64     false  avgt   10  110.992 ± 0.120  ns/op
Http2FrameWriterDataBenchmark.newWriter          0           1024      true  avgt   10  102.480 ± 1.899  ns/op
Http2FrameWriterDataBenchmark.newWriter          0           1024     false  avgt   10  112.736 ± 5.971  ns/op

now:


Benchmark                                (padding)  (payloadSize)  (pooled)  Mode  Cnt    Score   Error  Units
Http2FrameWriterDataBenchmark.newWriter          0             64      true  avgt   10   91.973 ± 0.343  ns/op
Http2FrameWriterDataBenchmark.newWriter          0             64     false  avgt   10  100.445 ± 0.238  ns/op
Http2FrameWriterDataBenchmark.newWriter          0           1024      true  avgt   10   94.347 ± 0.157  ns/op
Http2FrameWriterDataBenchmark.newWriter          0           1024     false  avgt   10   98.489 ± 0.196  ns/op

which is already ~10-20% improvement just on headers/data writing.

@franz1981
Copy link
Copy Markdown
Contributor Author

franz1981 commented Jan 16, 2024

An additional benefit (specific to Vertx, but can really happen regardless), is that the data written by users could uses heap-based ByteBuf and embedding earlier their content into the direct buffer to send would save performing the copy, see

image

this code path just disappear with this PR, given that no heap buffer is passed to the transport.

I've added 003d1e29350c47fe50d40456312b25d5e91720da to ignore the cutoff limits if the data is held in an array buffer, so we would perform the copy earlier, saving creating the aggregate promise and the additional pipeline traversal

@franz1981
Copy link
Copy Markdown
Contributor Author

franz1981 commented Jan 16, 2024

I think I have a leak when the data buffer is not sent anymore across the transport, and be released after flushed. I should release it right after copied

@franz1981
Copy link
Copy Markdown
Contributor Author

@ejona86 @carl-mastrangelo do you have any chance (or any pointer to make me try it) to verify/validate some of the changes I am sending recently, performance-wise? I'm thinking about gRpc over HTTP 2 and having something different from our Quarkus/Vertx stack would help a lot

@franz1981
Copy link
Copy Markdown
Contributor Author

@bryce-anderson this is using a very similar optimization although I couldn't push it as far as http 1.1 and make it produce a single buffer overall (we don't even presize the header here :/)

@carl-mastrangelo
Copy link
Copy Markdown
Member

Alas, I have moved on from the gRPC team, and don't have access to sample http2 traffic.

@franz1981
Copy link
Copy Markdown
Contributor Author

Thanks @carl-mastrangelo to have answered; if there is some Google users interested into grpc Netty, please reach me out here

@normanmaurer
Copy link
Copy Markdown
Member

@franz1981 Please let me know once this is ready again for review

@franz1981
Copy link
Copy Markdown
Contributor Author

Yep, @normanmaurer this should be ready to go as well

@franz1981 franz1981 force-pushed the 4.1_no_slice branch 2 times, most recently from 66cca39 to 157a4a4 Compare January 31, 2024 16:43
@franz1981
Copy link
Copy Markdown
Contributor Author

mmm @normanmaurer what happened for 33f5c56? :P

@normanmaurer
Copy link
Copy Markdown
Member

@franz1981 what you mean ?

@franz1981
Copy link
Copy Markdown
Contributor Author

@normanmaurer I've received notification of Merge branch '4.1' into 4.1_no_slice changes, but IDK what it is :P

@normanmaurer
Copy link
Copy Markdown
Member

@franz1981 ah this was just to bring it up to date with current 4.1

@franz1981
Copy link
Copy Markdown
Contributor Author

franz1981 commented Feb 16, 2024

@normanmaurer there is something more you want me to take a look for this before getting another review round?

As said, I am not very confident the original code was correctly handling leaks here, but is really tons of code and we should address it separately (eg all the header retained slices around, the thrown you noticed in some earlier comment) - ideally first, if we think them to be critic.

What I could do is to try address them already for the fast-paths I've introduced, but will create the weird situation, where the optimized (although common enough, I suppose) new stuff is more correct than what exists from some time already...

@normanmaurer
Copy link
Copy Markdown
Member

@franz1981 just hold of a bit... I will try to find time to review this one first

@franz1981
Copy link
Copy Markdown
Contributor Author

@bryce-anderson if you want to give it another shot, I have tried implementing your suggestion to further reduce the number of allocations

@franz1981
Copy link
Copy Markdown
Contributor Author

Any other concerns @bryce-anderson or @normanmaurer ?
I can try reducing the amount of changes if it helps

@franz1981
Copy link
Copy Markdown
Contributor Author

@normanmaurer @chrisvest any news for this folks?
Not urgent but the performance effects was rather positives....

@normanmaurer
Copy link
Copy Markdown
Member

@ejona86 @idelpivnitskiy @bryce-anderson can you check again ?

@franz1981
Copy link
Copy Markdown
Contributor Author

franz1981 commented Aug 9, 2024

As soon as I return from holidays I would like to proceed on this... @bryce-anderson @ejona86 @idelpivnitskiy wdyt about the approach? Any other concern?

@franz1981
Copy link
Copy Markdown
Contributor Author

I can rebase this and retry - resolving conflicts.
Perf-wise it was very beneficial so...i don't want let this go ;)

@normanmaurer
Copy link
Copy Markdown
Member

sure lets do it

@normanmaurer
Copy link
Copy Markdown
Member

@franz1981 do we want to rebase and get this in ?

@franz1981
Copy link
Copy Markdown
Contributor Author

franz1981 commented Aug 28, 2025

Try it if you can @normanmaurer it was a good improvement at the time I have done it

I will be back next week so cannot 🙏

@chrisvest
Copy link
Copy Markdown
Member

I tried cherry-picking this to 4.2, but it doesn't look beneficial on the face of it:

Current 4.2:

Benchmark                                (padding)  (payloadSize)  (pooled)  Mode  Cnt    Score   Error  Units
Http2FrameWriterDataBenchmark.newWriter          0             64      true  avgt    5   64.933 ± 0.610  ns/op
Http2FrameWriterDataBenchmark.newWriter          0             64     false  avgt    5  318.025 ± 3.257  ns/op
Http2FrameWriterDataBenchmark.newWriter          0           1024      true  avgt    5   60.911 ± 0.777  ns/op
Http2FrameWriterDataBenchmark.newWriter          0           1024     false  avgt    5  309.507 ± 3.505  ns/op
Benchmark                                (padding)  (payloadSize)  (pooled)  Mode  Cnt    Score   Error  Units
Http2FrameWriterDataBenchmark.newWriter          0             64      true  avgt    5   67.270 ± 6.872  ns/op
Http2FrameWriterDataBenchmark.newWriter          0             64     false  avgt    5  310.112 ± 4.378  ns/op
Http2FrameWriterDataBenchmark.newWriter          0           1024      true  avgt    5   60.223 ± 1.268  ns/op
Http2FrameWriterDataBenchmark.newWriter          0           1024     false  avgt    5  306.737 ± 4.231  ns/op

This PR cherry-picked:

Benchmark                                (padding)  (payloadSize)  (pooled)  Mode  Cnt    Score   Error  Units
Http2FrameWriterDataBenchmark.newWriter          0             64      true  avgt    5   71.655 ± 0.913  ns/op
Http2FrameWriterDataBenchmark.newWriter          0             64     false  avgt    5  303.061 ± 5.332  ns/op
Http2FrameWriterDataBenchmark.newWriter          0           1024      true  avgt    5   60.831 ± 1.914  ns/op
Http2FrameWriterDataBenchmark.newWriter          0           1024     false  avgt    5  304.938 ± 7.290  ns/op
Benchmark                                (padding)  (payloadSize)  (pooled)  Mode  Cnt    Score   Error  Units
Http2FrameWriterDataBenchmark.newWriter          0             64      true  avgt    5   69.571 ± 0.767  ns/op
Http2FrameWriterDataBenchmark.newWriter          0             64     false  avgt    5  300.887 ± 5.980  ns/op
Http2FrameWriterDataBenchmark.newWriter          0           1024      true  avgt    5   64.977 ± 2.050  ns/op
Http2FrameWriterDataBenchmark.newWriter          0           1024     false  avgt    5  303.579 ± 3.097  ns/op

@franz1981
Copy link
Copy Markdown
Contributor Author

franz1981 commented Oct 3, 2025

This was beneficial with POSIX writev last time I checked it and due to pipeline traversal, which:

  • has been speed up last year by @chrisvest
  • and still requires realistic pipeline settings (in Quarkus we have idle state handlers and few others floating around)

And the other reason was to save aggregate promises; IDK if that benchmark stress it.
Another is at #13783 (comment): if users have heap based small data (which I believe it was the case for techempower) having this PR help to saving copying it into an off-heap one (with its own ref cnt) for sake of sending it via transport.

I suggest you try a simple real benchmark (a la plaintext techempower) with H2load. I remember it was what I have used when I created it

But I see in some prior comments I've used Http2FrameWriterDataBenchmark myself: can you share your 4.2 branch with the cp?

@franz1981
Copy link
Copy Markdown
Contributor Author

franz1981 commented Oct 3, 2025

@chrisvest Running the JMH benchmark with the existing implementation (not this PR) I see few interesting things

image

This is where I have highlighted the allocation of the promise aggregator; it's very tiny in the benchmark, so i can see why it is not that important.
io/netty/handler/codec/http2/DefaultHttp2FrameWriter.writeData's buffer allocation (I've replaced the pooled allocator with the adaptive one) is very dominant there.
More: io/netty/channel/embedded/EmbeddedChannel.maybeRunPendingTasks is the most dominant cost (on the right) and there's no actual syscall writev or whatever.
In short, the main reasons why this PR was beneficial are gone

@chrisvest
Copy link
Copy Markdown
Member

@franz1981
Copy link
Copy Markdown
Contributor Author

franz1981 commented Oct 3, 2025

I've just tried @chrisvest

image

writeBytes strikes again: the performance IS improved but this writeBytes make it on par, again.

while this is what you got with the unpooled variant of the bench

image

still not good as it create on the flight a direct instance...

I would stick with the pooled variant of the benchmark as more realistic, but the writeBytes performance is really a concern

I think it's something we have to think about...

@franz1981
Copy link
Copy Markdown
Contributor Author

franz1981 commented Oct 3, 2025

FYI @chrisvest I've applied a small patch to the adaptive byte buf (explained at #15723 (comment)) in order to make more evident the improvement on this PR:
before:

Benchmark                                       (padding)  (payloadSize)  (pooled)  Mode  Cnt   Score   Error  Units
Http2FrameWriterDataBenchmark.newWriter                 0            128      true  avgt    5  63.886 ± 0.933  ns/op

after:

Benchmark                                       (padding)  (payloadSize)  (pooled)  Mode  Cnt   Score   Error  Units
Http2FrameWriterDataBenchmark.newWriter                 0            128      true  avgt    5  50.693 ± 0.823  ns/op

which is similar to what I was claiming at the beginning for this PR.

NOTE:
while profiling this I've found some disappointing non inlined method too (in the after case), so probably this can be even faster:
image

@franz1981
Copy link
Copy Markdown
Contributor Author

@chrisvest I could revamp this WDYT?
I was waiting some perf changes from you which has been merged ❤️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants