Skip to content

Improve adaptive allocator thread local performance#15741

Merged
chrisvest merged 5 commits into
netty:4.2from
franz1981:adaptive_size_class_work
Jan 6, 2026
Merged

Improve adaptive allocator thread local performance#15741
chrisvest merged 5 commits into
netty:4.2from
franz1981:adaptive_size_class_work

Conversation

@franz1981
Copy link
Copy Markdown
Contributor

@franz1981 franz1981 commented Oct 10, 2025

Motivation:

Adaptive allocator perform costly atomic operations in the thread local path, which reduce its performance

Modification:

Reduce the amount of atomic operations in the thread local allocation's fast path

Result:

Fixes #15571

These are the different variations I want to test:

  • Uses unguarded Recyclers
  • Implements "compressed" local free list (LIFO)
  • Use a mpsc q for the reuse chunk q in the thread-local case
    NO VISIBLE IMPROVEMENTS
  • Guards nextInLine's getAndSet with a null check via volatile get first, since size classed chunks rarely end up into nextInLine (i.e. which is mostly null)
    NO VISIBLE IMPROVEMENTS
  • Implements a var handle based MpscIntQueue (done at 1c4e1e4)
    NO VISIBLE IMPROVEMENTS
  • Remove the live/raw ref cnt as mentioned at Make AdaptiveByteBuf.setBytes faster #15736 (comment)
  • Remove the ref count for size classed chunks (see 8953bbe and 8cb1bf0)
  • Use the "pinned" Recycler instead of the FastThreadLocal-based one

@franz1981
Copy link
Copy Markdown
Contributor Author

FYI @georgebanasios and @chrisvest

Each of the commit has its own "cumulative" performance results, to help understand the effects performance-wise.

Despite been WIP this is still in a decent state thanks to the work done by @georgebanasios at #15530

@laosijikaichele
Copy link
Copy Markdown
Contributor

laosijikaichele commented Oct 26, 2025

I benchmarked with ByteBufAllocatorAllocPatternBenchmark, the adaptive code based on the latest commit of this PR, the others code based on #15525 with the latest 4.2-branch merged:

Benchmark                                              Mode  Cnt         Score         Error  Units
ByteBufAllocatorAllocPatternBenchmark.adaptiveDirect  thrpt    5  16114924.284 ±  609889.153  ops/s
ByteBufAllocatorAllocPatternBenchmark.mimallocDirect  thrpt    5  23553245.698 ±   84843.835  ops/s
ByteBufAllocatorAllocPatternBenchmark.pooledDirect    thrpt    5  13523664.203 ± 1699723.688  ops/s

The flame of adaptive:
下载 (5)

@franz1981
Copy link
Copy Markdown
Contributor Author

franz1981 commented Oct 26, 2025

So something is off here (which I am not hitting on my machine...); it seems the chunk queue capacity is not enough and you got allocations in the hot paths. You can increase the capacity of the reusable chunk q but I don't understand why I cannot hit this...did you changed the benchmark to retain more buffers?

Try use the Benchmark I have on this PR which is saving more hotspot to cheat and if you want add more variability on it.
Re the branchy hot spot fill; I have already complained on the hotspot GitHub prs around unsafe::set memory (now intrinsics) to be less branchy on the alignment stride

@laosijikaichele
Copy link
Copy Markdown
Contributor

laosijikaichele commented Oct 26, 2025

On my machine, the NettyRuntime.availableProcessors() returns 8, what's the value of NettyRuntime.availableProcessors() on your machine?

I increased the CHUNK_REUSE_QUEUE capacity from NettyRuntime.availableProcessors() * 2 to NettyRuntime.availableProcessors() * 4, and run benchmark again:

Benchmark                                              Mode  Cnt         Score         Error  Units
ByteBufAllocatorAllocPatternBenchmark.adaptiveDirect  thrpt    5  26091400.366 ±  177201.421  ops/s
ByteBufAllocatorAllocPatternBenchmark.mimallocDirect  thrpt    5  30200336.405 ± 1365627.802  ops/s
ByteBufAllocatorAllocPatternBenchmark.pooledDirect    thrpt    5  13082622.112 ±  130244.414  ops/s

The number of adaptive number improved a lot, which indicates that the capacity of CHUNK_REUSE_QUEUE does matter.
Also I noticed the mimalloc number increased too, but I did not change any code of mimalloc since the previous run, I will try to check the reason...

@franz1981
Copy link
Copy Markdown
Contributor Author

franz1981 commented Oct 26, 2025

Consider that I run the benchmark with harness executor as event loops and for fast thread local chunk queues it shouldn't matter, since are not shared at all...
I would work to change that; the capacity shouldn't really be bound to the core count although releases can happen concurrently

@laosijikaichele
Copy link
Copy Markdown
Contributor

laosijikaichele commented Oct 26, 2025

Benchmark                                              Mode  Cnt         Score         Error  Units
ByteBufAllocatorAllocPatternBenchmark.adaptiveDirect  thrpt    5  26091400.366 ±  177201.421  ops/s
ByteBufAllocatorAllocPatternBenchmark.mimallocDirect  thrpt    5  30200336.405 ± 1365627.802  ops/s
ByteBufAllocatorAllocPatternBenchmark.pooledDirect    thrpt    5  13082622.112 ±  130244.414  ops/s

The number of adaptive number improved a lot, which indicates that the capacity of CHUNK_REUSE_QUEUE does matter. Also I noticed the mimalloc number increased too, but I did not change any code of mimalloc since the previous run, I will try to check the reason...

I found the reason why the mimalloc number increased: I have increased the 'Xms,Xmx and MaxDirectMemorySize' from '768 m' to '2768 m' for this run. So it seems 'Xms,Xmx and MaxDirectMemorySize' take effect on the performance of mimalloc.

@franz1981
Copy link
Copy Markdown
Contributor Author

Do you have the flame-graphs for the adaptive one? You can use cstack vm as rawCommand

@laosijikaichele
Copy link
Copy Markdown
Contributor

Do you have the flame-graphs for the adaptive one? You can use cstack vm as rawCommand

Did you mean for the second run?

@laosijikaichele
Copy link
Copy Markdown
Contributor

Do you have the flame-graphs for the adaptive one? You can use cstack vm as rawCommand

Did you mean for the second run?

This is the adaptive flame-graph of the second run:

下载 (6)

@franz1981
Copy link
Copy Markdown
Contributor Author

franz1981 commented Oct 26, 2025

The rightmost part at..
截屏2025-10-26 17 48 58
?

@franz1981
Copy link
Copy Markdown
Contributor Author

franz1981 commented Oct 26, 2025

Regardless, without the changes in this PR we should be too much distant from mimalloc, I think (@chrisvest can verify for M1 too, I don't want to bother you too much @laosijikaichele !) - whilst with these we are in the 86% of its performance.
But we should better size the chunk queue for size classed for not shared magazines, so I am taking note (i.e. is not subject to high concurrency and chunks can be recycled into it even if temporary fully consumed)

We have yet few improvements which are not in, but I plan to remove this work once I have progressed with the Netty loom scheduler I am working on

@laosijikaichele
Copy link
Copy Markdown
Contributor

The rightmost part

截屏2025-10-26 17 48 58

you mean this?

@franz1981
Copy link
Copy Markdown
Contributor Author

Yep. That's why I prefer to use larger sample size space and no random in the hot path...it has some variability of cost too 😢
While larger space has "just" memory accesses

@franz1981
Copy link
Copy Markdown
Contributor Author

franz1981 commented Nov 7, 2025

After #15764 I have to rebase this PR and I've already updated the description checklist at

Remove the live/raw ref cnt as mentioned at #15736 (comment)

I think

Remove the ref count for size classed chunks (see 8953bbe and 8cb1bf0)

is even more important to be brought in, because of the additional indirection level, and despite overall it would save inlining issues (already verified at #15764 (comment))

Benchmark                                               (allocatorType)  (pollutionIterations)  Mode  Cnt   Score   Error  Units
ByteBufAllocatorAllocPatternBenchmark.directAllocation         ADAPTIVE                      0  avgt   20  45.499 ± 0.249  ns/op
ByteBufAllocatorAllocPatternBenchmark.directAllocation         ADAPTIVE                 200000  avgt   20  46.075 ± 0.414  ns/op
Benchmark                                               (allocatorType)  (pollutionIterations)  Mode  Cnt   Score   Error  Units
ByteBufAllocatorAllocPatternBenchmark.directAllocation         ADAPTIVE                      0  avgt   20  40.054 ± 0.104  ns/op
ByteBufAllocatorAllocPatternBenchmark.directAllocation         ADAPTIVE                 200000  avgt   20  45.149 ± 0.648  ns/op
Benchmark                                               (allocatorType)  (pollutionIterations)  Mode  Cnt   Score   Error  Units
ByteBufAllocatorAllocPatternBenchmark.directAllocation         ADAPTIVE                      0  avgt   20  37.085 ± 0.322  ns/op
ByteBufAllocatorAllocPatternBenchmark.directAllocation         ADAPTIVE                 200000  avgt   20  43.207 ± 0.175  ns/op
@franz1981 franz1981 force-pushed the adaptive_size_class_work branch from 314351c to 2ecb257 Compare December 16, 2025 05:42
@franz1981
Copy link
Copy Markdown
Contributor Author

I've rebased over 4.2 and updates the numbers in the commits messages.
I'm running tests with JDK 25 and without enabling native access i.e. we're using VarHandle based reference counting.
Clearly this is not great for the JCTools queue used as chunk reuse q because it will be based on atomics and not Unsafe - which can have some visible performance difference.

@franz1981
Copy link
Copy Markdown
Contributor Author

It looks like #15799 has broken 2ecb257 because i cannot compress anymore the offsets in the same way

@franz1981 franz1981 marked this pull request as ready for review December 16, 2025 21:51
@franz1981
Copy link
Copy Markdown
Contributor Author

This is ready for review @chrisvest ; it has a minimal set of changes which already help performance (although with w an increased memory footprint due to the local free list)
I plan to implement separately a change to remove the reference count on the sized class case and piggyback on the free list only as mentioned in one of the bullet points in this PR head comment.

Copy link
Copy Markdown
Member

@chrisvest chrisvest left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One comment, but it looks pretty good.

if (localFreeList != null) {
assert Thread.currentThread() == ownerThread;
if (localFreeList.isEmpty()) {
startIndex = externalFreeList.poll();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could perhaps move entries into the local list in a batch. Otherwise the local freelist is only populated by releases done by the owning thread.

Copy link
Copy Markdown
Contributor Author

@franz1981 franz1981 Jan 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We fallback here once there is nothing in the local free list, so the only benefit of copying is that we drain LIFO instead of FIFO from the external one, but to be fair I prefer to prioritize draining from the local one instead of mixing segments from the two free lists; the local segments have the benefit that they are more likely to be "hot" on the owner thread core

@franz1981
Copy link
Copy Markdown
Contributor Author

franz1981 commented Jan 6, 2026

One note for @chrisvest

#15741 (comment)

We need to think about the size classed chunk q sizing (and how to use it, since it doesn't prioritize the available capacity, which is mutable) and I haven't gotten rid of the ref cnt additional operations on the size classed chunks which I know already (tested eons ago) it deliver the biggest boost.
I plan to tackle these in separate PRs.

@chrisvest
Copy link
Copy Markdown
Member

Sounds good, @franz1981. Let's merge this and think about further improvements in future PRs.

@chrisvest chrisvest added this to the 4.2.10.Final milestone Jan 6, 2026
@chrisvest chrisvest added the needs-cherry-pick-5.0 This PR should be cherry-picked to 5.0 once merged. label Jan 6, 2026
@chrisvest chrisvest merged commit accd981 into netty:4.2 Jan 6, 2026
19 checks passed
chrisvest pushed a commit to chrisvest/netty that referenced this pull request Jan 6, 2026
Motivation:

Adaptive allocator perform costly atomic operations in the thread local
path, which reduce its performance

Modification:

Reduce the amount of atomic operations in the thread local allocation's
fast path

Result:

Fixes netty#15571

These are the different variations I want to test:

- [x] Uses unguarded `Recycler`s
- [x] Implements "compressed" local free list (LIFO)
- [x] Use a mpsc q for the reuse chunk q in the thread-local case
**NO VISIBLE IMPROVEMENTS**
- [x] Guards `nextInLine`'s `getAndSet` with a null check via volatile
`get` first, since size classed chunks rarely end up into `nextInLine`
(i.e. which is mostly `null`)
**NO VISIBLE IMPROVEMENTS**
- [x] Implements a var handle based `MpscIntQueue` (done at
1c4e1e4)
**NO VISIBLE IMPROVEMENTS**
- [x] Remove the live/raw ref cnt as mentioned at
netty#15736 (comment)
- [ ] Remove the ref count for size classed chunks (see
8953bbe and
8cb1bf0)
- [ ] Use the "pinned" Recycler instead of the `FastThreadLocal`-based
one

(cherry picked from commit accd981)
chrisvest added a commit that referenced this pull request Jan 7, 2026
Motivation:

Adaptive allocator perform costly atomic operations in the thread local
path, which reduce its performance

Modification:

Reduce the amount of atomic operations in the thread local allocation's
fast path

Result:

Fixes #15571

These are the different variations I want to test:

- [x] Uses unguarded `Recycler`s
- [x] Implements "compressed" local free list (LIFO)
- [x] Use a mpsc q for the reuse chunk q in the thread-local case **NO
VISIBLE IMPROVEMENTS**
- [x] Guards `nextInLine`'s `getAndSet` with a null check via volatile
`get` first, since size classed chunks rarely end up into `nextInLine`
(i.e. which is mostly `null`)
**NO VISIBLE IMPROVEMENTS**
- [x] Implements a var handle based `MpscIntQueue` (done at
1c4e1e4)
**NO VISIBLE IMPROVEMENTS**
- [x] Remove the live/raw ref cnt as mentioned at
#15736 (comment)
- [ ] Remove the ref count for size classed chunks (see
8953bbe and
8cb1bf0)
- [ ] Use the "pinned" Recycler instead of the `FastThreadLocal`-based
one

(cherry picked from commit accd981)

Co-authored-by: Francesco Nigro <[email protected]>
@franz1981 franz1981 deleted the adaptive_size_class_work branch January 8, 2026 16:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

needs-cherry-pick-5.0 This PR should be cherry-picked to 5.0 once merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants