Improve adaptive allocator thread local performance by franz1981 · Pull Request #15741 · netty/netty

franz1981 · 2025-10-10T12:31:26Z

Motivation:

Adaptive allocator perform costly atomic operations in the thread local path, which reduce its performance

Modification:

Reduce the amount of atomic operations in the thread local allocation's fast path

Result:

Fixes #15571

These are the different variations I want to test:

Uses unguarded Recyclers
Implements "compressed" local free list (LIFO)
Use a mpsc q for the reuse chunk q in the thread-local case
NO VISIBLE IMPROVEMENTS
Guards nextInLine's getAndSet with a null check via volatile get first, since size classed chunks rarely end up into nextInLine (i.e. which is mostly null)
NO VISIBLE IMPROVEMENTS
Implements a var handle based MpscIntQueue (done at 1c4e1e4)
NO VISIBLE IMPROVEMENTS
Remove the live/raw ref cnt as mentioned at Make AdaptiveByteBuf.setBytes faster #15736 (comment)
Remove the ref count for size classed chunks (see 8953bbe and 8cb1bf0)
Use the "pinned" Recycler instead of the FastThreadLocal-based one

franz1981 · 2025-10-10T12:33:14Z

FYI @georgebanasios and @chrisvest

Each of the commit has its own "cumulative" performance results, to help understand the effects performance-wise.

Despite been WIP this is still in a decent state thanks to the work done by @georgebanasios at #15530

laosijikaichele · 2025-10-26T06:24:20Z

I benchmarked with ByteBufAllocatorAllocPatternBenchmark, the adaptive code based on the latest commit of this PR, the others code based on #15525 with the latest 4.2-branch merged:

Benchmark                                              Mode  Cnt         Score         Error  Units
ByteBufAllocatorAllocPatternBenchmark.adaptiveDirect  thrpt    5  16114924.284 ±  609889.153  ops/s
ByteBufAllocatorAllocPatternBenchmark.mimallocDirect  thrpt    5  23553245.698 ±   84843.835  ops/s
ByteBufAllocatorAllocPatternBenchmark.pooledDirect    thrpt    5  13523664.203 ± 1699723.688  ops/s

The flame of adaptive:

franz1981 · 2025-10-26T06:46:16Z

So something is off here (which I am not hitting on my machine...); it seems the chunk queue capacity is not enough and you got allocations in the hot paths. You can increase the capacity of the reusable chunk q but I don't understand why I cannot hit this...did you changed the benchmark to retain more buffers?

Try use the Benchmark I have on this PR which is saving more hotspot to cheat and if you want add more variability on it.
Re the branchy hot spot fill; I have already complained on the hotspot GitHub prs around unsafe::set memory (now intrinsics) to be less branchy on the alignment stride

laosijikaichele · 2025-10-26T07:31:17Z

On my machine, the NettyRuntime.availableProcessors() returns 8, what's the value of NettyRuntime.availableProcessors() on your machine?

I increased the CHUNK_REUSE_QUEUE capacity from NettyRuntime.availableProcessors() * 2 to NettyRuntime.availableProcessors() * 4, and run benchmark again:

Benchmark                                              Mode  Cnt         Score         Error  Units
ByteBufAllocatorAllocPatternBenchmark.adaptiveDirect  thrpt    5  26091400.366 ±  177201.421  ops/s
ByteBufAllocatorAllocPatternBenchmark.mimallocDirect  thrpt    5  30200336.405 ± 1365627.802  ops/s
ByteBufAllocatorAllocPatternBenchmark.pooledDirect    thrpt    5  13082622.112 ±  130244.414  ops/s

The number of adaptive number improved a lot, which indicates that the capacity of CHUNK_REUSE_QUEUE does matter.
Also I noticed the mimalloc number increased too, but I did not change any code of mimalloc since the previous run, I will try to check the reason...

franz1981 · 2025-10-26T08:51:00Z

Consider that I run the benchmark with harness executor as event loops and for fast thread local chunk queues it shouldn't matter, since are not shared at all...
I would work to change that; the capacity shouldn't really be bound to the core count although releases can happen concurrently

laosijikaichele · 2025-10-26T09:14:53Z

Benchmark                                              Mode  Cnt         Score         Error  Units
ByteBufAllocatorAllocPatternBenchmark.adaptiveDirect  thrpt    5  26091400.366 ±  177201.421  ops/s
ByteBufAllocatorAllocPatternBenchmark.mimallocDirect  thrpt    5  30200336.405 ± 1365627.802  ops/s
ByteBufAllocatorAllocPatternBenchmark.pooledDirect    thrpt    5  13082622.112 ±  130244.414  ops/s
The number of adaptive number improved a lot, which indicates that the capacity of CHUNK_REUSE_QUEUE does matter. Also I noticed the mimalloc number increased too, but I did not change any code of mimalloc since the previous run, I will try to check the reason...

I found the reason why the mimalloc number increased: I have increased the 'Xms,Xmx and MaxDirectMemorySize' from '768 m' to '2768 m' for this run. So it seems 'Xms,Xmx and MaxDirectMemorySize' take effect on the performance of mimalloc.

franz1981 · 2025-10-26T09:16:36Z

Do you have the flame-graphs for the adaptive one? You can use cstack vm as rawCommand

laosijikaichele · 2025-10-26T09:17:39Z

Do you have the flame-graphs for the adaptive one? You can use cstack vm as rawCommand

Did you mean for the second run?

laosijikaichele · 2025-10-26T09:24:26Z

Do you have the flame-graphs for the adaptive one? You can use cstack vm as rawCommand

Did you mean for the second run?

This is the adaptive flame-graph of the second run:

franz1981 · 2025-10-26T09:37:15Z

The rightmost part at..

?

franz1981 · 2025-10-26T09:43:29Z

Regardless, without the changes in this PR we should be too much distant from mimalloc, I think (@chrisvest can verify for M1 too, I don't want to bother you too much @laosijikaichele !) - whilst with these we are in the 86% of its performance.
But we should better size the chunk queue for size classed for not shared magazines, so I am taking note (i.e. is not subject to high concurrency and chunks can be recycled into it even if temporary fully consumed)

We have yet few improvements which are not in, but I plan to remove this work once I have progressed with the Netty loom scheduler I am working on

laosijikaichele · 2025-10-26T09:51:41Z

The rightmost part

you mean this?

franz1981 · 2025-10-26T09:56:56Z

Yep. That's why I prefer to use larger sample size space and no random in the hot path...it has some variability of cost too 😢
While larger space has "just" memory accesses

franz1981 · 2025-11-07T10:17:44Z

After #15764 I have to rebase this PR and I've already updated the description checklist at

Remove the live/raw ref cnt as mentioned at #15736 (comment)

I think

Remove the ref count for size classed chunks (see 8953bbe and 8cb1bf0)

is even more important to be brought in, because of the additional indirection level, and despite overall it would save inlining issues (already verified at #15764 (comment))

Benchmark (allocatorType) (pollutionIterations) Mode Cnt Score Error Units ByteBufAllocatorAllocPatternBenchmark.directAllocation ADAPTIVE 0 avgt 20 45.499 ± 0.249 ns/op ByteBufAllocatorAllocPatternBenchmark.directAllocation ADAPTIVE 200000 avgt 20 46.075 ± 0.414 ns/op

Benchmark (allocatorType) (pollutionIterations) Mode Cnt Score Error Units ByteBufAllocatorAllocPatternBenchmark.directAllocation ADAPTIVE 0 avgt 20 40.054 ± 0.104 ns/op ByteBufAllocatorAllocPatternBenchmark.directAllocation ADAPTIVE 200000 avgt 20 45.149 ± 0.648 ns/op

Benchmark (allocatorType) (pollutionIterations) Mode Cnt Score Error Units ByteBufAllocatorAllocPatternBenchmark.directAllocation ADAPTIVE 0 avgt 20 37.085 ± 0.322 ns/op ByteBufAllocatorAllocPatternBenchmark.directAllocation ADAPTIVE 200000 avgt 20 43.207 ± 0.175 ns/op

franz1981 · 2025-12-16T05:44:51Z

I've rebased over 4.2 and updates the numbers in the commits messages.
I'm running tests with JDK 25 and without enabling native access i.e. we're using VarHandle based reference counting.
Clearly this is not great for the JCTools queue used as chunk reuse q because it will be based on atomics and not Unsafe - which can have some visible performance difference.

franz1981 · 2025-12-16T06:17:15Z

It looks like #15799 has broken 2ecb257 because i cannot compress anymore the offsets in the same way

franz1981 · 2025-12-28T16:17:52Z

This is ready for review @chrisvest ; it has a minimal set of changes which already help performance (although with w an increased memory footprint due to the local free list)
I plan to implement separately a change to remove the reference count on the sized class case and piggyback on the free list only as mentioned in one of the bullet points in this PR head comment.

chrisvest

One comment, but it looks pretty good.

chrisvest · 2026-01-06T00:05:43Z

+            if (localFreeList != null) {
+                assert Thread.currentThread() == ownerThread;
+                if (localFreeList.isEmpty()) {
+                    startIndex = externalFreeList.poll();


We could perhaps move entries into the local list in a batch. Otherwise the local freelist is only populated by releases done by the owning thread.

We fallback here once there is nothing in the local free list, so the only benefit of copying is that we drain LIFO instead of FIFO from the external one, but to be fair I prefer to prioritize draining from the local one instead of mixing segments from the two free lists; the local segments have the benefit that they are more likely to be "hot" on the owner thread core

franz1981 · 2026-01-06T08:31:47Z

One note for @chrisvest

#15741 (comment)

We need to think about the size classed chunk q sizing (and how to use it, since it doesn't prioritize the available capacity, which is mutable) and I haven't gotten rid of the ref cnt additional operations on the size classed chunks which I know already (tested eons ago) it deliver the biggest boost.
I plan to tackle these in separate PRs.

chrisvest · 2026-01-06T18:10:39Z

Sounds good, @franz1981. Let's merge this and think about further improvements in future PRs.

Motivation: Adaptive allocator perform costly atomic operations in the thread local path, which reduce its performance Modification: Reduce the amount of atomic operations in the thread local allocation's fast path Result: Fixes netty#15571 These are the different variations I want to test: - [x] Uses unguarded `Recycler`s - [x] Implements "compressed" local free list (LIFO) - [x] Use a mpsc q for the reuse chunk q in the thread-local case **NO VISIBLE IMPROVEMENTS** - [x] Guards `nextInLine`'s `getAndSet` with a null check via volatile `get` first, since size classed chunks rarely end up into `nextInLine` (i.e. which is mostly `null`) **NO VISIBLE IMPROVEMENTS** - [x] Implements a var handle based `MpscIntQueue` (done at 1c4e1e4) **NO VISIBLE IMPROVEMENTS** - [x] Remove the live/raw ref cnt as mentioned at netty#15736 (comment) - [ ] Remove the ref count for size classed chunks (see 8953bbe and 8cb1bf0) - [ ] Use the "pinned" Recycler instead of the `FastThreadLocal`-based one (cherry picked from commit accd981)

Motivation: Adaptive allocator perform costly atomic operations in the thread local path, which reduce its performance Modification: Reduce the amount of atomic operations in the thread local allocation's fast path Result: Fixes #15571 These are the different variations I want to test: - [x] Uses unguarded `Recycler`s - [x] Implements "compressed" local free list (LIFO) - [x] Use a mpsc q for the reuse chunk q in the thread-local case **NO VISIBLE IMPROVEMENTS** - [x] Guards `nextInLine`'s `getAndSet` with a null check via volatile `get` first, since size classed chunks rarely end up into `nextInLine` (i.e. which is mostly `null`) **NO VISIBLE IMPROVEMENTS** - [x] Implements a var handle based `MpscIntQueue` (done at 1c4e1e4) **NO VISIBLE IMPROVEMENTS** - [x] Remove the live/raw ref cnt as mentioned at #15736 (comment) - [ ] Remove the ref count for size classed chunks (see 8953bbe and 8cb1bf0) - [ ] Use the "pinned" Recycler instead of the `FastThreadLocal`-based one (cherry picked from commit accd981) Co-authored-by: Francesco Nigro <[email protected]>

franz1981 mentioned this pull request Oct 17, 2025

Simplify reference counting #15764

Merged

laosijikaichele mentioned this pull request Oct 26, 2025

Improve the fast-path for adaptive unshared size class allocations. #15571

Closed

franz1981 added 3 commits December 16, 2025 06:41

franz1981 force-pushed the adaptive_size_class_work branch from 314351c to 2ecb257 Compare December 16, 2025 05:42

franz1981 requested review from chrisvest and normanmaurer December 16, 2025 06:09

franz1981 added 2 commits December 16, 2025 15:24

Remove compressed stack

f779089

Tame footprint: compute free lists on flight

76999f8

franz1981 marked this pull request as ready for review December 16, 2025 21:51

chrisvest approved these changes Jan 6, 2026

View reviewed changes

chrisvest added this to the 4.2.10.Final milestone Jan 6, 2026

chrisvest added the needs-cherry-pick-5.0 This PR should be cherry-picked to 5.0 once merged. label Jan 6, 2026

chrisvest merged commit accd981 into netty:4.2 Jan 6, 2026
19 checks passed

franz1981 deleted the adaptive_size_class_work branch January 8, 2026 16:01

franz1981 mentioned this pull request Jan 13, 2026

Experimental: introduce mimalloc MiMallocByteBufAllocator #15525

Closed

franz1981 mentioned this pull request Feb 19, 2026

Remove reference counting from size classed chunks #16306

Merged

nsnmurthyk mentioned this pull request Apr 16, 2026

Netty Issue: GZIP trailer corruption with large payloads after upgrading from 4.2.9 to 4.2.11 #16656

Closed

Uh oh!

Conversation

franz1981 commented Oct 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

franz1981 commented Oct 10, 2025

Uh oh!

laosijikaichele commented Oct 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

franz1981 commented Oct 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

laosijikaichele commented Oct 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

franz1981 commented Oct 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

laosijikaichele commented Oct 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

franz1981 commented Oct 26, 2025

Uh oh!

laosijikaichele commented Oct 26, 2025

Uh oh!

laosijikaichele commented Oct 26, 2025

Uh oh!

franz1981 commented Oct 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

franz1981 commented Oct 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

laosijikaichele commented Oct 26, 2025

Uh oh!

franz1981 commented Oct 26, 2025

Uh oh!

franz1981 commented Nov 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

franz1981 commented Dec 16, 2025

Uh oh!

franz1981 commented Dec 16, 2025

Uh oh!

franz1981 commented Dec 28, 2025

Uh oh!

chrisvest left a comment

Choose a reason for hiding this comment

Uh oh!

chrisvest Jan 6, 2026

Choose a reason for hiding this comment

Uh oh!

franz1981 Jan 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

franz1981 commented Jan 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

chrisvest commented Jan 6, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

franz1981 commented Oct 10, 2025 •

edited

Loading

laosijikaichele commented Oct 26, 2025 •

edited

Loading

franz1981 commented Oct 26, 2025 •

edited

Loading

laosijikaichele commented Oct 26, 2025 •

edited

Loading

franz1981 commented Oct 26, 2025 •

edited

Loading

laosijikaichele commented Oct 26, 2025 •

edited

Loading

franz1981 commented Oct 26, 2025 •

edited

Loading

franz1981 commented Oct 26, 2025 •

edited

Loading

franz1981 commented Nov 7, 2025 •

edited

Loading

franz1981 Jan 6, 2026 •

edited

Loading

franz1981 commented Jan 6, 2026 •

edited

Loading