
Improve the fast-path for adaptive unshared size class allocations. #15571

Closed
georgebanasios wants to merge 29 commits into netty:4.2 from georgebanasios:adaptive-unshared-fast-path

Conversation

@georgebanasios
Contributor

@georgebanasios georgebanasios commented Aug 19, 2025

Motivation:

The original implementation had a performance gap in the unshared, size-classed allocation path.
Key bottlenecks on the hot path:

  1. Excessive Atomic Operations: Every buffer allocation from a SizeClassedChunk triggered atomic retain()/release() operations on the chunk's reference count, creating overhead even in an uncontended, single-threaded context.
  2. Generic Recycling Overhead: The use of the general-purpose Recycler for pooling AdaptiveByteBuf instances in the thread-local path introduced unnecessary overhead compared to a simpler, non-thread-safe pooling mechanism.
  3. Unspecialized Logic: The single Magazine class used conditional logic (if (shareable)) to handle both shared and unshared scenarios. This prevented the JIT compiler from creating an optimized, lock-free fast path for thread-local operations.
  4. Inefficient Caching: The logic for reusing chunks from the shared queue was not optimized for the thread-local case, leading to frequent and costly interactions with concurrent queues.

Modification:

  1. Magazine Specialization via Polymorphism
    The single Magazine class was replaced by an abstract class AbstractMagazine with two concrete implementations:
    a. SharedMagazine: Retains the original StampedLock and atomic operations, ensuring thread-safety for the multi-threaded, contended path.
    b. ThreadLocalMagazine: A new, specialized class for the unshared path. It is completely lock-free and uses non-atomic operations.
  2. Elimination of Reference Counting for SizeClassedChunk
    a. The atomic reference count (refCnt) was completely removed from SizeClassedChunk. It is now implemented in a new BumpChunk class which handles non-size-classed allocations and retains the original ref-counting logic.
    b. A new state-based, "racy-by-design" (based on: https://www.scylladb.com/2018/02/15/memory-barriers-seastar-linux/) deallocation mechanism was introduced using a volatile int state field. A chunk is only deallocated when the total count of returned segments (both local and external) equals its total segment count.
  3. The SizeClassedChunk free list was split for optimization
    a. IntStack (localFreeList): A non-thread-safe stack used only by the owner thread. It removes the atomic operation from the hot path (previously required because releases can happen concurrently) and improves the locality of reused segments.
    b. MpscIntQueue (externalFreeList): A concurrent queue for other threads to safely return segments.
  4. Optimized Buffer and Chunk Caching
    A new ThreadLocalCache class was introduced to manage all thread-local resources.
    a. Buffer Recycling: The generic Recycler was replaced with a simple FIFO ArrayDeque for pooling AdaptiveByteBuf objects on the hot path, which is much faster in a single-threaded context.
    b. Chunk Caching: The ThreadLocalMagazine now maintains a localChunkCache(ArrayList) to reduce the interactions with the external chunk reuse queue, serving as a faster, first-level cache.
    c. Queue Specialization: The chunk reuse queue for thread-local magazines is now a MpscQueue instead of the original MpmcQueue.
  5. Refined Allocation Logic
    The logic was made more direct for the specialized classes.
    a. The allocation logic for SizeClassedChunk no longer relies on calculating remainingCapacity(). Its tryAllocate() method now directly attempts to pop() a segment from its free list and if it fails it proceeds to allocate a new chunk.
    b. BumpChunk (for non-size-classed allocations) retains the original remainingCapacity() check, as it allocates memory sequentially.
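
The split free list (item 3) and the "all segments returned" check (item 2b) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `SplitFreeListChunk` and its method names are hypothetical, a `ConcurrentLinkedQueue<Integer>` stands in for `MpscIntQueue`, and the external list is drained only when the local stack is empty.

```java
import java.util.concurrent.ConcurrentLinkedQueue;

/** Hypothetical sketch of a chunk with a split free list; not Netty's SizeClassedChunk. */
final class SplitFreeListChunk {
    // Owner-thread-only LIFO stack of free segment indices (plain array, no atomics).
    private final int[] localFreeList;
    private int localTop;
    // Stand-in for MpscIntQueue: any thread may offer, only the owner thread polls.
    private final ConcurrentLinkedQueue<Integer> externalFreeList = new ConcurrentLinkedQueue<>();

    SplitFreeListChunk(int segments) {
        localFreeList = new int[segments];
        for (int i = 0; i < segments; i++) {
            localFreeList[localTop++] = segments - 1 - i; // segment 0 ends up on top
        }
    }

    /** Owner thread only. Returns a free segment index, or -1 if none is available. */
    int tryAllocate() {
        if (localTop == 0) {
            drainExternal();
        }
        return localTop == 0 ? -1 : localFreeList[--localTop];
    }

    /** Owner thread only: a plain array write on the hot path. */
    void releaseLocal(int segment) {
        localFreeList[localTop++] = segment;
    }

    /** Any thread: the segment is handed back through the concurrent queue. */
    void releaseExternal(int segment) {
        externalFreeList.offer(segment);
    }

    /** Owner thread only: true once local plus external returns cover every segment. */
    boolean allSegmentsReturned() {
        drainExternal();
        return localTop == localFreeList.length;
    }

    private void drainExternal() {
        Integer segment;
        while ((segment = externalFreeList.poll()) != null) {
            localFreeList[localTop++] = segment;
        }
    }
}
```

The sketch assumes each segment is released exactly once per allocation; it does no bounds or double-free checking.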

Result:

Improved performance on the unshared size class allocation fast-path.

Fixes #15530

@georgebanasios
Contributor Author

georgebanasios commented Aug 19, 2025

Before:
Benchmark                                              Mode  Cnt         Score         Error  Units
ByteBufAllocatorAllocPatternBenchmark.adaptiveDirect  thrpt   20  32605243.552 ±  411123.604  ops/s
ByteBufAllocatorAllocPatternBenchmark.fakeDirect      thrpt   20  57945840.039 ±  538341.192  ops/s
ByteBufAllocatorAllocPatternBenchmark.mimallocDirect  thrpt   20  56046524.622 ± 1070831.040  ops/s
ByteBufAllocatorAllocPatternBenchmark.pooledDirect    thrpt   20  19418436.675 ± 1397337.546  ops/s

After:
Benchmark                                              Mode  Cnt         Score         Error  Units
ByteBufAllocatorAllocPatternBenchmark.adaptiveDirect  thrpt   20  62902285.890 ± 1640950.799  ops/s
ByteBufAllocatorAllocPatternBenchmark.fakeDirect      thrpt   20  55194085.541 ±  874428.134  ops/s
ByteBufAllocatorAllocPatternBenchmark.mimallocDirect  thrpt   20  57301416.844 ±  335091.187  ops/s
ByteBufAllocatorAllocPatternBenchmark.pooledDirect    thrpt   20  17731544.634 ±  278838.282  ops/s

The reference benchmark is in the original ticket.

@georgebanasios
Contributor Author

georgebanasios commented Aug 19, 2025

@franz1981 I opened this draft PR as @chrisvest suggested, just so it's easier for everyone to track the changes in one place. Let's continue the feedback and discussion here.
@laosijikaichele

The primary changes since your last review are the removal of FastThreadLocal from MagazineGroup and the logic for the magazine queues. I think this has introduced a performance drop, which I am currently checking to ensure I haven't missed anything in the implementation.
The performance on the shared path should be slightly improved as well compared to the original code. (not on mimalloc's level though)

Comment thread buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java Outdated
Comment thread buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java Outdated
Comment thread buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java Outdated
Comment thread buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java Outdated
Comment thread buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java Outdated
Comment thread buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java Outdated
Comment thread buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java Outdated
Comment thread buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java Outdated
@franz1981
Contributor

Check the failures too @georgebanasios. I have no laptop until September, so I will review as best I can via phone 🤳

In terms of abstractions I have mixed feelings: we could avoid inheritance and simply duplicate the code that looks the same, specialising the behaviours which are specific to the size-classed case, which will be very common (and has been designed to be like that!).
This leaves more room to avoid opaque types whose methods are named similarly but behave very differently, which usually benefits both performance and readability.

I used inheritance in my first PoCs because I didn't want to change too much code, but for maximum performance and stability (when all the types and code paths are used in a real application) avoiding abstractions can only be of benefit.

Comment thread buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java Outdated
Comment thread buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java Outdated
@georgebanasios
Contributor Author

georgebanasios commented Aug 20, 2025

There are two issues in the CI.
One is related to the addition of the local free list on the shared path, which I've fixed.
The other is also in the shared path and seems to be a race condition in the chunk's lifecycle (not sure yet). I'm seeing illegal reference counting errors, attaching to a non-null magazine, releasing from a null magazine, etc. Once I figure that out, I'll push the fixes together.
The issue is related to the AdaptiveByteBufAllocatorTest > testAllocateWithoutLock test, which has become flaky.

Edit: I've found the issue, I'm working on it and I'll push the fixes.

@georgebanasios georgebanasios force-pushed the adaptive-unshared-fast-path branch from b06a348 to 05061fd Compare August 21, 2025 17:19
@georgebanasios
Contributor Author

Hey @franz1981, I made some changes to the queues of the magazine group. Previously I had placed a non-thread-safe queue for shared magazines at the magazine group level, which was causing a race condition. Funnily enough, I think it also revealed something about the local free list on the shared path.

Right now, we have an external shared queue in the MagazineGroup and a fast local queue for each Magazine (accessed by the owner thread and when we have the lock).

As for the local free list on the shared path (let me know if I missed something): the magazine's allocation path is locked, but the lock doesn't guarantee exclusive access to the chunk itself (only to the magazine's state).
A single size-classed chunk can be concurrently accessed by multiple threads through different SharedMagazine instances. If we allow the shared path to use the chunk's non-thread-safe localFreeList, this causes a race condition on the stack.
So I reverted the shared path to use the external free list for now (we can revisit this with the shared path changes; I think moving the stack to the magazine level solves it).

@franz1981
Contributor

franz1981 commented Aug 21, 2025

> but the lock doesn't guarantee exclusive access to the chunk itself (only to the magazine's state)

As far as I know the allocation paths (remaining capacity and read into) are protected and never shared across threads.
It's the release segment which can happen from multiple threads (deallocation too, by consequence).
Marking for deallocation happens from a single thread as well afaik since it is used to mark chunks which didn't go into the shared chunk queue.

Chunks afaik belong to a specific magazine and could be used without locks too, but won't be accessed (for allocations) from the actual owning magazine while it happens (the so called allocate without lock code path).

Comment thread buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java Outdated
@franz1981
Contributor

franz1981 commented Aug 22, 2025

Now, thanks to your latest changes, the shared magazines implement something which we have yet to prove beneficial:

  • segment allocation always on local free list
  • segment release always on the external one (with the usual costly atomic operation)

This could be tested (before/after) with the existing benchmark without setting the harness custom executor.

Since the MPSC queue is single-consumer, it uses volatile loads on the consumer side, whilst the IntStack doesn't, although a volatile load should be fast enough (that's why I didn't implement the local free list for shared magazines in my first iteration).

Now we have to understand whether it is worth it in the current form (guesses):

  • If the MPSC queue is implemented with an int[] and VarHandle, it could benefit from a batch drain into the int[] of the IntStack using an array copy (losing the LIFO order), or even one by one, but without repeatedly calling poll
  • thanks to a local stack we are now LIFO, but since the refill is not frequent it doesn't matter as much as with the unshared magazine

Why the last point?
In theory, if we had frequent local refills (which we don't), we could avoid polluting the CPU caches with all the segments and just reuse the most recently accessed ones, reducing potential cache misses.
I have pinged @laosijikaichele about this behaviour on the mimalloc PR, since mimalloc uses "double" local free lists which always fully use all the available buffers before refilling, causing a similar potential problem.
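
A rough sketch of the "one by one, but without repeatedly calling poll" drain mentioned above. Everything here is hypothetical: `DrainableIntQueue` is not the PR's `MpscIntQueue`, it uses `AtomicIntegerArray` rather than a `VarHandle` over a plain `int[]`, and it assumes the queue never holds more than `capacity` entries (for example because the capacity is at least the chunk's segment count).

```java
import java.util.concurrent.atomic.AtomicIntegerArray;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical bounded multi-producer, single-consumer int queue with a batch drain. */
final class DrainableIntQueue {
    private final AtomicIntegerArray slots; // 0 means empty, otherwise (segment + 1)
    private final int mask;
    private final AtomicLong producerIndex = new AtomicLong();
    private long consumerIndex; // touched by the consumer thread only

    DrainableIntQueue(int capacityPowerOfTwo) {
        if (Integer.bitCount(capacityPowerOfTwo) != 1) {
            throw new IllegalArgumentException("capacity must be a power of two");
        }
        slots = new AtomicIntegerArray(capacityPowerOfTwo);
        mask = capacityPowerOfTwo - 1;
    }

    /** Any thread. Assumes the queue cannot overflow (see the lead-in). */
    void offer(int segment) {
        long index = producerIndex.getAndIncrement(); // claim a slot
        slots.lazySet((int) index & mask, segment + 1); // publish the value
    }

    /**
     * Consumer thread only. Moves every published entry into the stack in one pass
     * and returns the new stack top. The loop stops at the first unpublished slot
     * or when the stack is full.
     */
    int drainTo(int[] stack, int top) {
        long index = consumerIndex;
        int value;
        while (top < stack.length && (value = slots.get((int) index & mask)) != 0) {
            slots.lazySet((int) index & mask, 0); // mark the slot empty again
            stack[top++] = value - 1;
            index++;
        }
        consumerIndex = index; // a single index update for the whole batch
        return top;
    }
}
```

Compared with calling `poll` per element, the consumer index is written once per batch; whether that matters in practice would need the profiling discussed in this thread.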

@georgebanasios
Contributor Author

It turns out the local IntStack for shared magazines causes a slight performance regression.

Original (Netty 4.2):

Benchmark                                              Mode  Cnt         Score        Error  Units
ByteBufAllocatorAllocPatternBenchmark.adaptiveDirect  thrpt   20  35719614.293 ± 690591.331  ops/s
ByteBufAllocatorAllocPatternBenchmark.fakeDirect      thrpt   20  36387469.016 ± 286736.991  ops/s
ByteBufAllocatorAllocPatternBenchmark.mimallocDirect  thrpt   20  50486234.891 ± 448406.151  ops/s
ByteBufAllocatorAllocPatternBenchmark.pooledDirect    thrpt   20  11517729.535 ± 526859.543  ops/s

Current code without stack for shared path:

Benchmark                                              Mode  Cnt         Score        Error  Units
ByteBufAllocatorAllocPatternBenchmark.adaptiveDirect  thrpt   20  36739216.384 ± 637446.926  ops/s
ByteBufAllocatorAllocPatternBenchmark.fakeDirect      thrpt   20  35812329.244 ± 475334.160  ops/s
ByteBufAllocatorAllocPatternBenchmark.mimallocDirect  thrpt   20  49950844.079 ± 617093.970  ops/s
ByteBufAllocatorAllocPatternBenchmark.pooledDirect    thrpt   20  11483370.411 ± 342917.106  ops/s

Current code with stack for shared path:

Benchmark                                              Mode  Cnt         Score         Error  Units
ByteBufAllocatorAllocPatternBenchmark.adaptiveDirect  thrpt   20  33515346.111 ±  344516.536  ops/s
ByteBufAllocatorAllocPatternBenchmark.fakeDirect      thrpt   20  37706844.983 ± 1864495.394  ops/s
ByteBufAllocatorAllocPatternBenchmark.mimallocDirect  thrpt   20  52113564.202 ±  260621.124  ops/s
ByteBufAllocatorAllocPatternBenchmark.pooledDirect    thrpt   20  12103734.808 ±  137343.439  ops/s

@franz1981
Contributor

I am a bit surprised that the shared magazine case is not improved tbh, since we saved the (size-classed) chunk reference count, but we probably have too many atomics there, and not regressing is still fine.
In theory, what we care about most are allocations in the event loops...

Now an interesting experiment:
You can add a JMH param, e.g. int pollutionIterations,
to the existing benchmark, where you create both a normal thread and an event loop thread and run half of the configured iterations on each, for each benchmarked method.
This is a "trick" to pollute the type profile of allocate/release with both shared and unshared magazines, so that when you run the actual benchmark you can observe whether the abstractions and the attempt to be DRY backfire on us.
In the real world the allocations are not just shared or unshared, and we want to know how both perform there too.
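
To illustrate the idea, here is a plain-Java sketch (deliberately not JMH, and not the benchmark from this PR). All names are hypothetical: `Magazine`, `SharedLike` and `ThreadLocalLike` are stand-ins for the real magazine types, and the two receivers are exercised on the calling thread instead of on a normal thread and an event loop thread. The point is only that the same call site sees both receiver types before measurement starts.

```java
/** Hypothetical sketch of type-profile pollution; not the PR's benchmark code. */
final class ProfilePollution {
    interface Magazine {
        int allocate(int size);
    }

    /** Stand-in for the shared (locked) magazine. */
    static final class SharedLike implements Magazine {
        @Override
        public int allocate(int size) {
            return size + 1;
        }
    }

    /** Stand-in for the thread-local (unshared) magazine. */
    static final class ThreadLocalLike implements Magazine {
        @Override
        public int allocate(int size) {
            return size + 2;
        }
    }

    /** The shared call site: the JIT profiles which receiver types reach m.allocate(). */
    static long run(Magazine m, int iterations) {
        long sum = 0;
        for (int i = 0; i < iterations; i++) {
            sum += m.allocate(i);
        }
        return sum;
    }

    /** Runs half of the iterations with each implementation before the real measurement. */
    static long pollute(int pollutionIterations) {
        int half = pollutionIterations / 2;
        long sink = run(new SharedLike(), half);
        sink += run(new ThreadLocalLike(), pollutionIterations - half);
        return sink; // returned so the work cannot be treated as dead code
    }
}
```

In a JMH benchmark, `pollutionIterations` would typically be a `@Param` field and `pollute` would be called from a setup method, with `0` meaning "no pollution".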

@franz1981
Contributor

I still believe that an optimized copy from the MPSC queue to the int stack is needed to keep the unshared case from regressing when it is not used from the owner thread...
But we first need to verify it via profiling.
If that helps, I can follow up on it once I'm back.

@georgebanasios
Contributor Author

georgebanasios commented Aug 23, 2025

Regarding pollutionIterations and the abstractions, here's the benchmark: georgebanasios@fc94123. Let me know if you were thinking of something different. I changed the initialization of the allocators from static to a fresh one on each trial to avoid counterintuitive results, even though the outcome is the same in both cases.

Benchmark                                             (pollutionIterations)   Mode  Cnt         Score         Error  Units
ByteBufAllocatorAllocPatternBenchmark.adaptiveDirect                      0  thrpt   20  53041903.394 ± 2965899.618  ops/s
ByteBufAllocatorAllocPatternBenchmark.adaptiveDirect                 200000  thrpt   20  48668912.190 ± 1352680.061  ops/s
ByteBufAllocatorAllocPatternBenchmark.mimallocDirect                      0  thrpt   20  57201513.978 ±  356555.319  ops/s
ByteBufAllocatorAllocPatternBenchmark.mimallocDirect                 200000  thrpt   20  34807679.057 ±  358323.970  ops/s

@franz1981
Contributor

franz1981 commented Aug 23, 2025

Mimalloc has never been that bad, what's going on?

I am checking the benchmark code 🙏

@georgebanasios georgebanasios changed the title Improve adaptive unshared size class allocation's fast-path Improve the fast-path for adaptive unshared size class allocations. Sep 17, 2025
@georgebanasios
Contributor Author

georgebanasios commented Sep 17, 2025

> @georgebanasios I'm trying to extract from this PR the minimal set of changes which improve the performance at the expected level: I invite you to do the same @georgebanasios because after so much work I see that I'm lost into the number of changes 🙏

I added a description to state the changes so far, feel free to take a look if anything's wrong/missing.

@franz1981
Contributor

FYI I've tried to reduce the amount of changes (the code is ugly, but it is only meant to reduce the diff): https://github.com/franz1981/netty/commits/adaptive_size_class_improvements/

This does not optimize the chunk reuse queue/reuse logic and does not (yet) use the new recycler, but on my machine it delivers the highest IPC and performance so far.
Feel free to give it a shot on the M4; maybe I will be surprised once again to see how much it differs from x86 performance.

@georgebanasios
Contributor Author

@franz1981 I checked out your branch and these are the results I got:

Benchmark                                               (allocatorType)  (pollutionIterations)   Mode  Cnt         Score         Error  Units
ByteBufAllocatorAllocPatternBenchmark.directAllocation         ADAPTIVE                      0  thrpt   20  55150266.972 ± 1225229.957  ops/s
ByteBufAllocatorAllocPatternBenchmark.directAllocation         ADAPTIVE                 200000  thrpt   20  49043596.882 ± 2025469.514  ops/s

@franz1981
Contributor

Thanks @georgebanasios

Consider that on both a Ryzen 4 box and a Xeon I was getting ~40 M op/s with the latest commit in this PR and ~44 M op/s on the branch you tested...
Since we are CPU-bound, the different architectures of these CPUs make it very hard to pick the right version.
I am not sure at this point whether we should favour the numbers on Apple, since it is a consumer CPU and may not be representative of server ones, but it feels bad that we cannot achieve the best numbers on both...

@georgebanasios
Contributor Author

georgebanasios commented Sep 18, 2025

@franz1981 Yeah, the results make it difficult to choose. Unfortunately my knowledge of such CPU architectures is limited, otherwise I'd be happy to help more, so if you think the new commit is better for Netty (the server argument makes more sense to me tbh), we should go with that.
After tomorrow morning I'll be without a laptop for a week, so I'll post my results from the new one to see what we get there too (probably an Intel, not sure).

@franz1981
Contributor

No worries, same for me regarding Apple... and I know very few people who can claim to know it well.
My branch isn't there to override the work of this PR, since I value the big work you have done here; I wanted to isolate the few (if few!) critical changes which bring the most benefit.
If we can split the whole work into 2 PRs:

  1. One with a few critical perf changes and little risk
  2. Another one modifying the behaviour and types, with higher risk

we could get the first one merged asap and gather feedback from the community, whilst taking our time on the second one.

Wdyt @normanmaurer?
Would you prefer us to handle this differently?

@georgebanasios
Contributor Author

georgebanasios commented Sep 18, 2025

No worries about overriding, etc from my side at all! Totally fine with you taking it in the other branch and continue there.
Yeah let's get @normanmaurer's opinion on this and the architectural differences you pointed out too.

@franz1981
Contributor

Did you have any chance to move this forward @georgebanasios?
I'm fixing other Netty related issues ATM

@chrisvest
Member

Netty usage, where peak performance and efficiency are a concern, is still mostly x86_64.
These microarchitecture optimizations tend to wash out a bit in real-world workloads, and become a lot smaller than our focused benchmarks suggest.
There are also other things we can do in other PRs, that Franz mentioned, like improving the reference counting and the inlining of same. On balance we should be able to get wins on all archs.

@georgebanasios
Contributor Author

georgebanasios commented Oct 6, 2025

@franz1981 I didn't have a chance no. I'm planning on continuing this weekend.
What are we thinking regarding this pr?
(I've also updated the description to reflect the changes that have been made so far.)

@franz1981
Contributor

franz1981 commented Oct 6, 2025

If we can bring in smaller changes which improve perf, that would be better IMO.
Feel free to pick the most relevant changes with the smallest number of code modifications (those which remove the ref count on some types of chunks and use cheaper queues where needed), and we can ask @chrisvest to review it as well.
WDYT @chrisvest?

@laosijikaichele
Contributor

laosijikaichele commented Oct 25, 2025

Sorry for my late reply, hope it's not too late :).

Great work @franz1981 @georgebanasios.

I recently got some new observations to share:

Before, ByteBufAllocatorAllocPatternBenchmark set up random sizes once, and every benchmark iteration used the same loop of sizes for its allocations. It looks random, but it is actually one big loop of sizes (correct me if I'm wrong), so I think it's better to improve it.

So I changed ByteBufAllocatorAllocPatternBenchmark to pick a random size on every invocation, which makes the size pattern more random. The new code: ByteBufAllocatorAllocPatternBenchmark.

Running the new ByteBufAllocatorAllocPatternBenchmark on an M1 chip:

Benchmark                                              Mode  Cnt         Score         Error  Units
ByteBufAllocatorAllocPatternBenchmark.adaptiveDirect  thrpt    5  14180087.585 ±  582466.544  ops/s
ByteBufAllocatorAllocPatternBenchmark.mimallocDirect  thrpt    5  23090530.626 ±  239564.032  ops/s
ByteBufAllocatorAllocPatternBenchmark.pooledDirect    thrpt    5  12727080.639 ± 2273078.227  ops/s

Further, I added a more random benchmark, ByteBufAllocatorRandomBenchmark, which picks sizes randomly from the range [0, 8 KiB).

Running ByteBufAllocatorRandomBenchmark:

Benchmark                                        Mode  Cnt         Score        Error  Units
ByteBufAllocatorRandomBenchmark.adaptiveDirect  thrpt    5   8437937.643 ± 339035.807  ops/s
ByteBufAllocatorRandomBenchmark.mimallocDirect  thrpt    5  17154273.444 ± 876545.019  ops/s
ByteBufAllocatorRandomBenchmark.pooledDirect    thrpt    5  11540785.897 ± 115708.053  ops/s

So I think that if the allocation pattern is less random, the adaptive allocator gets better performance, and if the allocation pattern is more random, the mimalloc allocator gets better performance.

@franz1981
Contributor

Which commit is this referring to?
I would profile or use Linux perf to make sure which effects we have here.
The size class choice is not using any branch in the latest commits, so I am not sure it is due to this.

@laosijikaichele
Contributor

> Which commit is this referring to? I would profile or use Linux perf to make sure which effects we have here. The size class choice is not using any branch in the latest commits, so I am not sure it is due to this.

I used the AdaptivePoolingAllocator from this PR's latest commit.

@franz1981
Contributor

I have opened an alternative PR using the latest recycler etc.
That will be the starting point for confined improvements (easier to review), hopefully with a similar or better outcome (in the end) performance-wise.
In the new PR I have changed the benchmark because I found it a bit broken.

That said, I have not profiled what we have here with perfnorm or async-profiler.

@laosijikaichele
Contributor

> I have opened an alternative PR using the latest recycler etc etc. And that will be the starting points for confined

Cool, I will try to run some benchmarks on it tomorrow.

@franz1981
Contributor

franz1981 commented Oct 25, 2025

Thinking about it twice, there is still the difference between bump and size-classed chunks, which would force the compiler to branch and make a speculative decision, likely a wrong one depending on the distribution; but without a profiler it is difficult to know whether that's the culprit.

Another option is that the increased randomness is achieved by increasing the sample count, which may just mean that we increased the cache usage, and adaptive is not that cheap (yet) in that respect....
As said, it could be anything, so please, if you have the chance, attach a profiler so we can better understand what's going on 🙏

@franz1981
Contributor

Re

> which looks random but actually it's a big loop of sizes(correct me if I'm wrong

That is fine, but it is important that the sequence of choices in the loop is long and random enough. The required length depends on the branch predictor's state capacity.
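
As a rough illustration of the "precomputed loop of sizes" approach, here is a sketch in plain Java. `SizeSequence` is a hypothetical helper, not the benchmark's code; the sequence length and the seed are assumptions the caller must choose, and how long is "long enough" for a given branch predictor is not something this sketch can determine.

```java
import java.util.SplittableRandom;

/** Hypothetical helper: a fixed, pre-generated sequence of random sizes, replayed in a loop. */
final class SizeSequence {
    private final int[] sizes;
    private final int mask;
    private int next;

    SizeSequence(int lengthPowerOfTwo, int maxSizeExclusive, long seed) {
        if (Integer.bitCount(lengthPowerOfTwo) != 1) {
            throw new IllegalArgumentException("length must be a power of two");
        }
        SplittableRandom random = new SplittableRandom(seed);
        sizes = new int[lengthPowerOfTwo];
        for (int i = 0; i < sizes.length; i++) {
            sizes[i] = random.nextInt(maxSizeExclusive); // uniform in [0, maxSizeExclusive)
        }
        mask = lengthPowerOfTwo - 1;
    }

    /** Cheap per-invocation lookup: no random number generation on the measured path. */
    int nextSize() {
        return sizes[next++ & mask];
    }
}
```

Picking a fresh random size on every invocation removes the repetition entirely, but it also adds the cost of the random number generator to each measured operation; the precomputed array trades that cost for a pattern that repeats after `lengthPowerOfTwo` invocations.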

@laosijikaichele
Contributor

laosijikaichele commented Oct 26, 2025

> I have opened an alternative PR using the latest recycler etc etc. And that will be the starting points for confined

> Cool, I will try to run some benchmarks on it tomorrow.

I posted the benchmark in #15741 (comment).

chrisvest pushed a commit that referenced this pull request Jan 6, 2026
Motivation:

Adaptive allocator perform costly atomic operations in the thread local
path, which reduce its performance

Modification:

Reduce the amount of atomic operations in the thread local allocation's
fast path

Result:

Fixes #15571


These are the different variations I want to test:

- [x] Uses unguarded `Recycler`s
- [x] Implements "compressed" local free list (LIFO)
- [x] Use a mpsc q for the reuse chunk q in the thread-local case **NO VISIBLE IMPROVEMENTS**
- [x] Guards `nextInLine`'s `getAndSet` with a null check via volatile `get` first, since size classed chunks rarely end up into `nextInLine` (i.e. which is mostly `null`) **NO VISIBLE IMPROVEMENTS**
- [x] Implements a var handle based `MpscIntQueue` (done at 1c4e1e4) **NO VISIBLE IMPROVEMENTS**
- [x] Remove the live/raw ref cnt as mentioned at #15736 (comment)
- [ ] Remove the ref count for size classed chunks (see 8953bbe and 8cb1bf0)
- [ ] Use the "pinned" Recycler instead of the `FastThreadLocal`-based one
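
The "null check before `getAndSet`" item in the list above can be sketched as follows. This is a minimal illustration, not the allocator's code: `NextInLineSlot` is a hypothetical wrapper around a plain `AtomicReference`, used here only to show the pattern.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical sketch of guarding an atomic swap with a cheaper volatile read. */
final class NextInLineSlot<T> {
    private final AtomicReference<T> nextInLine = new AtomicReference<>();

    /** Publishes a value into the slot, replacing whatever was there. */
    void put(T value) {
        nextInLine.set(value);
    }

    /**
     * Takes and clears the slot. When the slot is observed empty, the atomic
     * read-modify-write is skipped and only a volatile load is paid.
     */
    T take() {
        if (nextInLine.get() == null) {
            return null; // common case when the slot is mostly empty
        }
        // May still return null if another thread emptied the slot in between.
        return nextInLine.getAndSet(null);
    }
}
```

The guard only pays off when the slot is empty most of the time, which is the situation the list item describes; the list itself records that this variation showed no visible improvement.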
chrisvest pushed a commit to chrisvest/netty that referenced this pull request Jan 6, 2026

(cherry picked from commit accd981)
chrisvest added a commit that referenced this pull request Jan 7, 2026

(cherry picked from commit accd981)

Co-authored-by: Francesco Nigro <[email protected]>