Improve the fast-path for adaptive unshared size class allocations. by georgebanasios · Pull Request #15571 · netty/netty

georgebanasios · 2025-08-19T13:44:50Z

Motivation:

The original implementation had a performance gap in the unshared, size-classed allocation path.
Key bottlenecks on the hot path:

Excessive Atomic Operations: Every buffer allocation from a SizeClassedChunk triggered atomic retain()/release() operations on the chunk's reference count, creating overhead even in an uncontended, single-threaded context.
Generic Recycling Overhead: The use of the general-purpose Recycler for pooling AdaptiveByteBuf instances in the thread-local path introduced unnecessary overhead compared to a simpler, non-thread-safe pooling mechanism.
Unspecialized Logic: The single Magazine class used conditional logic (if (shareable)) to handle both shared and unshared scenarios. This prevented the JIT compiler from creating an optimized, lock-free fast path for thread-local operations.
Inefficient Caching: The logic for reusing chunks from the shared queue was not optimized for the thread-local case, leading to frequent and costly interactions with concurrent queues.

Modification:

1. Magazine Specialization via Polymorphism
   The single Magazine class was replaced by an abstract class AbstractMagazine with two concrete implementations:
   a. SharedMagazine: Retains the original StampedLock and atomic operations, ensuring thread-safety for the multi-threaded, contended path.
   b. ThreadLocalMagazine: A new, specialized class for the unshared path. It is completely lock-free and uses non-atomic operations.
2. Elimination of Reference Counting for SizeClassedChunk
   a. The atomic reference count (refCnt) was completely removed from SizeClassedChunk. It is now implemented in a new BumpChunk class which handles non-size-classed allocations and retains the original ref-counting logic.
   b. A new state-based, "racy-by-design" (based on: https://www.scylladb.com/2018/02/15/memory-barriers-seastar-linux/) deallocation mechanism was introduced using a volatile int state field. A chunk is only deallocated when the total count of returned segments (both local and external) equals its total segment count.
3. The SizeClassedChunk free list was split for optimization
   a. IntStack (localFreeList): A non-thread-safe stack for the owner thread to use that removes the atomic operation in the hot path (due to release that can happen concurrently) and improves the locality of reused segments.
   b. MpscIntQueue (externalFreeList): A concurrent queue for other threads to safely return segments.
4. Optimized Buffer and Chunk Caching
   A new ThreadLocalCache class was introduced to manage all thread-local resources.
   a. Buffer Recycling: The generic Recycler was replaced with a simple FIFO ArrayDeque for pooling AdaptiveByteBuf objects on the hot path, which is much faster in a single-threaded context.
   b. Chunk Caching: The ThreadLocalMagazine now maintains a localChunkCache(ArrayList) to reduce the interactions with the external chunk reuse queue, serving as a faster, first-level cache.
   c. Queue Specialization: The chunk reuse queue for thread-local magazines is now a MpscQueue instead of the original MpmcQueue.
5. Refined Allocation Logic
   The logic was made more direct for the specialized classes.
   a. The allocation logic for SizeClassedChunk no longer relies on calculating remainingCapacity(). Its tryAllocate() method now directly attempts to pop() a segment from its free list and if it fails it proceeds to allocate a new chunk.
   b. BumpChunk (for non-size-classed allocations) retains the original remainingCapacity() check, as it allocates memory sequentially.
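
To make the split free list idea more concrete, here is a minimal sketch of how an owner-only stack and a concurrent return queue could fit together. It is not the code from this PR: SplitFreeListChunk and its methods are hypothetical names, a ConcurrentLinkedQueue stands in for the MpscIntQueue, and refilling the local stack from the external queue inside tryAllocate() is an assumption made for illustration.

```java
import java.util.concurrent.ConcurrentLinkedQueue;

/** Hypothetical, simplified stand-in for a size-classed chunk; not Netty's class. */
final class SplitFreeListChunk {
    // LIFO stack of free segment indexes, touched only by the owner thread.
    private final int[] localFreeList;
    private int localSize;
    // Stand-in for an MPSC int queue: any thread may offer, only the owner polls.
    private final ConcurrentLinkedQueue<Integer> externalFreeList = new ConcurrentLinkedQueue<>();

    SplitFreeListChunk(int segments) {
        localFreeList = new int[segments];
        // Push in reverse so that segment 0 is popped first.
        for (int i = segments - 1; i >= 0; i--) {
            localFreeList[localSize++] = i;
        }
    }

    /** Owner thread only: returns a free segment index, or -1 if none is available. */
    int tryAllocate() {
        if (localSize == 0) {
            drainExternal(); // assumption: refill lazily when the local stack runs dry
        }
        return localSize == 0 ? -1 : localFreeList[--localSize];
    }

    /** Owner thread only: a plain array write, no atomic operation involved. */
    void releaseLocal(int segment) {
        localFreeList[localSize++] = segment;
    }

    /** Any thread: hands the segment back through the concurrent queue. */
    void releaseExternal(int segment) {
        externalFreeList.offer(segment);
    }

    private void drainExternal() {
        Integer segment;
        while ((segment = externalFreeList.poll()) != null) {
            localFreeList[localSize++] = segment;
        }
    }
}
```

A caller that gets -1 back would fall through to allocating a new chunk, matching the tryAllocate() behaviour described in the last item above.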

Result:

Improved performance on the unshared size class allocation fast-path.

Fixes #15530

This reverts commit ea20a8d.

georgebanasios · 2025-08-19T14:08:09Z

Before:
Benchmark                                              Mode  Cnt         Score         Error  Units
ByteBufAllocatorAllocPatternBenchmark.adaptiveDirect  thrpt   20  32605243.552 ±  411123.604  ops/s
ByteBufAllocatorAllocPatternBenchmark.fakeDirect      thrpt   20  57945840.039 ±  538341.192  ops/s
ByteBufAllocatorAllocPatternBenchmark.mimallocDirect  thrpt   20  56046524.622 ± 1070831.040  ops/s
ByteBufAllocatorAllocPatternBenchmark.pooledDirect    thrpt   20  19418436.675 ± 1397337.546  ops/s

After:
Benchmark                                              Mode  Cnt         Score         Error  Units
ByteBufAllocatorAllocPatternBenchmark.adaptiveDirect  thrpt   20  62902285.890 ± 1640950.799  ops/s
ByteBufAllocatorAllocPatternBenchmark.fakeDirect      thrpt   20  55194085.541 ±  874428.134  ops/s
ByteBufAllocatorAllocPatternBenchmark.mimallocDirect  thrpt   20  57301416.844 ±  335091.187  ops/s
ByteBufAllocatorAllocPatternBenchmark.pooledDirect    thrpt   20  17731544.634 ±  278838.282  ops/s

The reference benchmark is in the original ticket.

georgebanasios · 2025-08-19T14:16:53Z

@franz1981 I opened this draft PR as @chrisvest suggested, just so it's easier for everyone to track the changes in one place. Let's continue the feedback and discussion here.
@laosijikaichele

The primary changes since your last review are the removal of FastThreadLocal from MagazineGroup and the logic for the magazine queues. I think this has introduced a performance drop, which I am currently checking to ensure I haven't missed anything in the implementation.
The performance on the shared path should be slightly improved as well compared to the original code. (not on mimalloc's level though)

franz1981 · 2025-08-19T15:39:24Z

Check the failures too @georgebanasios. I have no laptop till Sept, so I will review as best I can via phone 🤳

In terms of abstractions I have mixed feelings: we could drop inheritance and simply duplicate the code which looks the same, specialising the behaviours which are specific to the size-classed case, which will be very common (and has been designed to be like that!).
This will give more room to avoid opaque types with methods which, although named similarly, behave very differently, something which usually benefits performance and readability.

I tried to use inheritance in my first PoCs because I didn't want to change too much code, but for maximum performance and stability (when all the types and code paths are used in a real application) dropping abstractions can only be of benefit.

georgebanasios · 2025-08-20T17:54:44Z

There are two issues in the CI.
One is related to the addition of the local free list on the shared path, which I've fixed.
The other is also in the shared path and seems to be a race condition in the chunk's lifecycle (not sure yet). I'm seeing illegal reference counting errors, attaching to a non-null magazine, releasing from a null magazine, etc. Once I figure that out, I'll push the fixes together.
The issue is related to the AdaptiveByteBufAllocatorTest > testAllocateWithoutLock test, which has become flaky.

Edit: I've found the issue, I'm working on it and I'll push the fixes.

georgebanasios · 2025-08-21T17:50:46Z

Hey @franz1981, I made some changes to the queues of the magazine group: previously I had placed a non-thread-safe queue for shared magazines at the magazine group level, and it was causing a race condition which, funnily enough, I think revealed something regarding the local free list on the shared path.

Right now, we have an external shared queue in the MagazineGroup and a fast local queue for each Magazine (accessed by the owner thread and when we have the lock).

As for the local free list on the shared path (and let me know if I missed something): the magazine's allocation path is locked, but the lock doesn't guarantee exclusive access to the chunk itself (only to the magazine's state).
A single size classed chunk can be concurrently accessed by multiple threads through different SharedMagazine instances. If we allow the shared path to use the chunk's non-thread-safe localFreeList, this causes a race condition on the stack.
So I reverted the shared path to use the external free list for now (we can revisit on the shared path changes, I think moving the stack to the magazine level solves this).

franz1981 · 2025-08-21T18:01:12Z

but the lock doesn't guarantee exclusive access to the chunk itself (only to the magazine's state)

As far as I know the allocation paths (remaining capacity and read into) are protected and never shared across threads.
It's the release segment which can happen from multiple threads (deallocation too, by consequence).
Marking for deallocation happens from a single thread as well afaik since it is used to mark chunks which didn't go into the shared chunk queue.

Chunks afaik belong to a specific magazine and could be used without locks too, but won't be accessed (for allocations) from the actual owning magazine while it happens (the so called allocate without lock code path).

franz1981 · 2025-08-22T13:59:21Z

Now, thanks to your latest changes, the shared magazines implement something which we have yet to prove beneficial:

segment allocation always on local free list
segment release always on the external one (with the usual costly atomic operation)

This could be tested (before/after) with the existing benchmark without setting the harness custom executor.

Since the Mpsc q is single consumer, it uses volatile loads on the consumer side, whilst the IntStack does not, although a volatile load should be fast enough (that's why in my first iteration I didn't implement the local free list for shared magazines).

Now we have to understand if it is worth it in the current form (guesses):

If the Mpsc q is implemented with an int[] and VarHandle, it could benefit from a batch drain into the int[] of IntStack using array copy (losing the LIFO order), or even one by one, but without repeatedly calling poll
thanks to a local stack we are now LIFO, but since the refill is not frequent it doesn't matter as much as with the unshared magazine

Why the last point?
In theory, if we had frequent local refills (which we don't), we could avoid polluting the CPU caches with all the segments and just reuse the most recently accessed ones, reducing potential cache line misses.
I have pinged @laosijikaichele about this behaviour on the mimalloc PR, since mimalloc is using "double" local free lists which always fully use all the available "buffer"s before refilling, causing a similar potential problem.
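
A batch refill along the lines suggested above might be sketched as follows. The names here (LocalIntStack, pushAll) are hypothetical, and the int[] batch stands in for whatever an int[]-backed MPSC queue would hand over in a single drain; the queue itself is left out of the sketch.

```java
/** Hypothetical owner-thread-only int stack; not the IntStack from this PR. */
final class LocalIntStack {
    private final int[] elements;
    private int size;

    LocalIntStack(int capacity) {
        elements = new int[capacity];
    }

    boolean isEmpty() {
        return size == 0;
    }

    void push(int value) {
        elements[size++] = value;
    }

    int pop() {
        return elements[--size];
    }

    /**
     * Copies up to {@code count} values from {@code batch} with a single
     * array copy instead of one push per polled element.
     * Returns how many values actually fit.
     */
    int pushAll(int[] batch, int count) {
        int n = Math.min(count, elements.length - size);
        System.arraycopy(batch, 0, elements, size, n);
        size += n;
        return n;
    }
}
```

In this sketch the last copied value of the batch ends up on top of the stack, so the order in which refilled segments are reused depends entirely on the order in which the batch was filled.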

georgebanasios · 2025-08-22T16:23:14Z

It turns out the local IntStack for shared magazines causes a slight performance regression.

Original (Netty 4.2):

Benchmark                                              Mode  Cnt         Score        Error  Units
ByteBufAllocatorAllocPatternBenchmark.adaptiveDirect  thrpt   20  35719614.293 ± 690591.331  ops/s
ByteBufAllocatorAllocPatternBenchmark.fakeDirect      thrpt   20  36387469.016 ± 286736.991  ops/s
ByteBufAllocatorAllocPatternBenchmark.mimallocDirect  thrpt   20  50486234.891 ± 448406.151  ops/s
ByteBufAllocatorAllocPatternBenchmark.pooledDirect    thrpt   20  11517729.535 ± 526859.543  ops/s

Current code without stack for shared path:

Benchmark                                              Mode  Cnt         Score        Error  Units
ByteBufAllocatorAllocPatternBenchmark.adaptiveDirect  thrpt   20  36739216.384 ± 637446.926  ops/s
ByteBufAllocatorAllocPatternBenchmark.fakeDirect      thrpt   20  35812329.244 ± 475334.160  ops/s
ByteBufAllocatorAllocPatternBenchmark.mimallocDirect  thrpt   20  49950844.079 ± 617093.970  ops/s
ByteBufAllocatorAllocPatternBenchmark.pooledDirect    thrpt   20  11483370.411 ± 342917.106  ops/s

Current code with stack for shared path:

Benchmark                                              Mode  Cnt         Score         Error  Units
ByteBufAllocatorAllocPatternBenchmark.adaptiveDirect  thrpt   20  33515346.111 ±  344516.536  ops/s
ByteBufAllocatorAllocPatternBenchmark.fakeDirect      thrpt   20  37706844.983 ± 1864495.394  ops/s
ByteBufAllocatorAllocPatternBenchmark.mimallocDirect  thrpt   20  52113564.202 ±  260621.124  ops/s
ByteBufAllocatorAllocPatternBenchmark.pooledDirect    thrpt   20  12103734.808 ±  137343.439  ops/s

franz1981 · 2025-08-22T16:51:07Z

I am a bit surprised that the shared magazine case is not improved tbh, since we saved the (size classed) chunk reference count, but probably we have too many atomics there and not regressing is still fine.
In theory, most of what we care about is allocations in the event loops...

Now an interesting experiment:
You can add a JMH param, i.e. int pollutionIterations, to the existing benchmark, where you create both a normal and an event loop thread and run half of the configured iterations on each of them for every benchmarked method.
This is a "trick" to pollute the type profile of allocate/release with both shared and unshared magazines, so that when you run the actual benchmark you can observe whether the abstractions and the attempt to be DRY have backfired on us.
In the real world the allocations are not just shared or just unshared, and we want to know how both perform there too.
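
Outside of JMH, the pollution trick described above could look roughly like this. It is only a sketch: ToyMagazine, ToyShared, ToyUnshared and ProfilePollution are made-up names, and the real experiment would drive the actual allocator from a normal thread and an event loop thread in the benchmark's setup rather than call two toy classes.

```java
/** Hypothetical abstraction with two implementations, standing in for the magazines. */
interface ToyMagazine {
    int allocate(int size);
}

final class ToyShared implements ToyMagazine {
    @Override
    public int allocate(int size) {
        return size + 1;
    }
}

final class ToyUnshared implements ToyMagazine {
    @Override
    public int allocate(int size) {
        return size + 2;
    }
}

final class ProfilePollution {
    // The single call site whose receiver type profile should see both implementations.
    static int allocateThrough(ToyMagazine magazine, int size) {
        return magazine.allocate(size);
    }

    /**
     * Runs half of the iterations through each implementation so the JIT
     * cannot treat the call site as having a single receiver type.
     * The returned sum only exists to keep the work from being optimized away.
     */
    static long pollute(int pollutionIterations, ToyMagazine first, ToyMagazine second) {
        long sink = 0;
        int half = pollutionIterations / 2;
        for (int i = 0; i < half; i++) {
            sink += allocateThrough(first, i);
        }
        for (int i = 0; i < half; i++) {
            sink += allocateThrough(second, i);
        }
        return sink;
    }
}
```

With pollutionIterations set to 0 the loops do nothing, which gives the unpolluted baseline to compare against.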

This reverts commit ea68585.

This reverts commit 07a48e4.

franz1981 · 2025-08-23T10:02:33Z

I still believe that an optimized copy from the Mpsc to the int stack is needed to keep the unshared case from regressing when it is not used from the owner thread...
But I first need to verify it via profiling.
If that helps, I can follow up on it once I'm back.

georgebanasios · 2025-08-23T12:08:30Z

Regarding the pollutionIterations and abstractions, here's the benchmark: georgebanasios@fc94123. Let me know if you were thinking of something different. I changed the initialization of the allocators from static to a fresh one on each trial to avoid counterintuitive results, even though the outcome is the same in both cases.

Benchmark                                             (pollutionIterations)   Mode  Cnt         Score         Error  Units
ByteBufAllocatorAllocPatternBenchmark.adaptiveDirect                      0  thrpt   20  53041903.394 ± 2965899.618  ops/s
ByteBufAllocatorAllocPatternBenchmark.adaptiveDirect                 200000  thrpt   20  48668912.190 ± 1352680.061  ops/s
ByteBufAllocatorAllocPatternBenchmark.mimallocDirect                      0  thrpt   20  57201513.978 ±  356555.319  ops/s
ByteBufAllocatorAllocPatternBenchmark.mimallocDirect                 200000  thrpt   20  34807679.057 ±  358323.970  ops/s

franz1981 · 2025-08-23T12:17:01Z

Mimalloc has never been that bad, what's going on?

I am checking the benchmark code 🙏

georgebanasios · 2025-09-17T13:47:52Z

@georgebanasios I'm trying to extract from this PR the minimal set of changes which improve the performance at the expected level: I invite you to do the same @georgebanasios because after so much work I see that I'm lost into the number of changes 🙏

I added a description to state the changes so far, feel free to take a look if anything's wrong/missing.

franz1981 · 2025-09-17T18:13:27Z

FYI I've tried to reduce the amount of changes (the code is ugly, but it is meant just to reduce the diff): https://github.com/franz1981/netty/commits/adaptive_size_class_improvements/

This is not optimizing the chunk reuse queue/reuse logic and is not using (yet) the new recycler, but on my machine it delivers the highest IPC and performance so far.
Feel free to give it a shot on M4; maybe I will be surprised once again to see how different it is from x86 performance.

georgebanasios · 2025-09-18T02:31:27Z

@franz1981 I checked out your branch and these are the results I got:

Benchmark                                               (allocatorType)  (pollutionIterations)   Mode  Cnt         Score         Error  Units
ByteBufAllocatorAllocPatternBenchmark.directAllocation         ADAPTIVE                      0  thrpt   20  55150266.972 ± 1225229.957  ops/s
ByteBufAllocatorAllocPatternBenchmark.directAllocation         ADAPTIVE                 200000  thrpt   20  49043596.882 ± 2025469.514  ops/s

franz1981 · 2025-09-18T03:45:35Z

Thanks @georgebanasios

Consider that in both a Ryzen 4 box and a Xeon I was getting ~40 M op/s with the latest commit in this PR and ~44 M op/s on the branch you tested...
I see that since we are CPU bound, the different architectures of these CPUs make it very complex to pick the right version.
I am not sure at this point whether we should favour the numbers on Apple, only because it is a consumer CPU and I'm not sure it is representative of server ones, but it feels bad that we cannot achieve the best numbers on both...

georgebanasios · 2025-09-18T06:37:22Z

@franz1981 Yeah, the results make it difficult to choose. Unfortunately my knowledge of such CPU architectures is limited, otherwise I'd be happy to help more, so if you think that new commit is better for netty (the server argument makes more sense to me tbh) we should go with that.
After tomorrow morning I'll be without a laptop for a week, so I'll post my results from the new one to see what we get there too (probably an Intel, not sure).

franz1981 · 2025-09-18T07:48:54Z

No worries, same for me regarding Apple... and I know very few people who can claim to know it well.
My branch isn't there to override the work of this PR, since I value the big work you have done here, but I wanted to isolate the few (if few!) critical changes which bring the most benefits.
If we can split the whole work into 2 PRs:

One with a few critical changes perf wise, with little risk
Another one modifying the behaviour and types, with higher risk

we could get the first one merged asap and get feedback from the community, whilst we take our time on the second one..

Wdyt @normanmaurer ?
Would you prefer us to lead this differently?

georgebanasios · 2025-09-18T09:31:14Z

No worries about overriding, etc from my side at all! Totally fine with you taking it in the other branch and continue there.
Yeah let's get @normanmaurer's opinion on this and the architectural differences you pointed out too.

franz1981 · 2025-10-06T04:21:40Z

Did you have any chance to move this forward @georgebanasios ?
I'm fixing other Netty related issues ATM

chrisvest · 2025-10-06T17:00:40Z

Netty usage, where peak performance and efficiency is a concern, is still mostly x86_64.
These micro arch optimizations tend to wash out a bit in real world workloads, and become a lot smaller than our focused benchmarks suggest.
There are also other things we can do in other PRs, that Franz mentioned, like improving the reference counting and inlining of same. On balance we should be able to get wins on all archs.

georgebanasios · 2025-10-06T17:58:46Z

@franz1981 I didn't have a chance no. I'm planning on continuing this weekend.
What are we thinking regarding this pr?
(I've also updated the description to reflect the changes that have been made so far.)

franz1981 · 2025-10-06T18:07:05Z

If we can bring in smaller changes which improve perf, would be better IMO
Feel free to pick the most relevant changes with the smallest number of code modifications (which remove ref cnt on some type of chunks, and use cheaper queues where needed) and we can ask @chrisvest to review it as well
WDYT @chrisvest ?

laosijikaichele · 2025-10-25T14:32:14Z

Sorry for my late reply, hope it's not too late :).

Great work @franz1981 @georgebanasios.

I recently got some new observations to share:

Before, the ByteBufAllocatorAllocPatternBenchmark set up random sizes, and every benchmark iteration used the same loop of random sizes to do the allocations, which looks random but actually it's a big loop of sizes (correct me if I'm wrong). I think it's better to optimize it.

So I optimized the ByteBufAllocatorAllocPatternBenchmark to pick a random size on every invocation, which makes the size pattern more random. The new code: ByteBufAllocatorAllocPatternBenchmark.

Run the new ByteBufAllocatorAllocPatternBenchmark on an M1 chip:

Benchmark                                              Mode  Cnt         Score         Error  Units
ByteBufAllocatorAllocPatternBenchmark.adaptiveDirect  thrpt    5  14180087.585 ±  582466.544  ops/s
ByteBufAllocatorAllocPatternBenchmark.mimallocDirect  thrpt    5  23090530.626 ±  239564.032  ops/s
ByteBufAllocatorAllocPatternBenchmark.pooledDirect    thrpt    5  12727080.639 ± 2273078.227  ops/s

Further, I added a benchmark with more random sizes: ByteBufAllocatorRandomBenchmark, which randomly picks a size from the range [0, 8 KiB).

Run the ByteBufAllocatorRandomBenchmark:

Benchmark                                        Mode  Cnt         Score        Error  Units
ByteBufAllocatorRandomBenchmark.adaptiveDirect  thrpt    5   8437937.643 ± 339035.807  ops/s
ByteBufAllocatorRandomBenchmark.mimallocDirect  thrpt    5  17154273.444 ± 876545.019  ops/s
ByteBufAllocatorRandomBenchmark.pooledDirect    thrpt    5  11540785.897 ± 115708.053  ops/s

So I think that if the allocation pattern is less random, the adaptive allocator will get better performance, and if the allocation pattern is more random, the mimalloc allocator will get better performance.

franz1981 · 2025-10-25T14:35:04Z

Which commit is this referring to?
I would profile or use Linux perf to make sure of which effects we have here.
The size class choice is not using any branch in the latest commits, so I am not sure it is due to this.

laosijikaichele · 2025-10-25T14:36:49Z

Which commit is this referring to? I would profile or use Linux perf to make sure of which effects we have here. The size class choice is not using any branch in the latest commits, so I am not sure it is due to this.

I use AdaptivePoolingAllocator of this PR's latest commit.

franz1981 · 2025-10-25T14:44:22Z

I have opened an alternative PR using the latest recycler etc etc.
And that will be the starting points for confined improvements (easier to review) hopefully with similar or better (at the end) outcome, performance wise.
In the new PR I have changed the benchmark because I found it a bit broken.

That said, I have not profiled what we have here with perfnorm nor async profiler.

laosijikaichele · 2025-10-25T14:46:47Z

I have opened an alternative PR using the latest recycler etc etc. And that will be the starting points for confined

Cool, I will try to run some benchmarks on it tomorrow.

franz1981 · 2025-10-25T14:59:13Z

Thinking about it twice, there is still the difference between bump and size-classed chunks, and it would force the compiler to branch and perform a speculative decision, likely wrong based on the distribution; but without a profiler it is difficult to know if that's the culprit.

Another option is that the increased randomness is achieved by increasing the sample count, which may just mean that we increased the cache usage, and adaptive is not that cheap (yet) on that....
As said, it could be anything, so please if you have the chance, attach a profiler so we can better understand what's going on 🙏

franz1981 · 2025-10-25T15:20:54Z

Re

which looks random but actually it's a big loop of sizes(correct me if I'm wrong

That is fine, but it is important that the sequence of choices in the loop is long and random enough. The required length depends on the branch predictor's state capacity.
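
As a rough illustration of the "big loop of sizes" approach, a pre-generated table could be cycled like this. SizeSequence and its parameters are hypothetical and not taken from the benchmark; the table length is left to the caller because how long is "long enough" depends on the machine.

```java
import java.util.SplittableRandom;

/** Hypothetical helper: a fixed, pre-generated sequence of random allocation sizes. */
final class SizeSequence {
    private final int[] sizes;
    private final int mask;
    private int next;

    SizeSequence(int lengthPowerOfTwo, int maxSizeExclusive, long seed) {
        if (Integer.bitCount(lengthPowerOfTwo) != 1) {
            throw new IllegalArgumentException("length must be a power of two");
        }
        // A fixed seed keeps the sequence reproducible across runs.
        SplittableRandom random = new SplittableRandom(seed);
        sizes = new int[lengthPowerOfTwo];
        for (int i = 0; i < sizes.length; i++) {
            sizes[i] = random.nextInt(maxSizeExclusive);
        }
        mask = lengthPowerOfTwo - 1;
    }

    /** Returns the next size, wrapping around to the start of the table. */
    int nextSize() {
        // Masking keeps the index valid even after the counter overflows.
        return sizes[next++ & mask];
    }
}
```

Generating the sizes up front keeps the cost of the random number generator out of the measured path, at the price of a repeating pattern that a short table would make easy to predict.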

laosijikaichele · 2025-10-26T06:07:20Z

I have opened an alternative PR using the latest recycler etc etc. And that will be the starting points for confined

Cool, I will try to run some benchmarks on it tomorrow.

I posted the benchmark in #15741 (comment).

Motivation:
Adaptive allocator perform costly atomic operations in the thread local path, which reduce its performance

Modification:
Reduce the amount of atomic operations in the thread local allocation's fast path

Result:
Fixes #15571

These are the different variations I want to test:
- [x] Uses unguarded `Recycler`s
- [x] Implements "compressed" local free list (LIFO)
- [x] Use a mpsc q for the reuse chunk q in the thread-local case **NO VISIBLE IMPROVEMENTS**
- [x] Guards `nextInLine`'s `getAndSet` with a null check via volatile `get` first, since size classed chunks rarely end up into `nextInLine` (i.e. which is mostly `null`) **NO VISIBLE IMPROVEMENTS**
- [x] Implements a var handle based `MpscIntQueue` (done at 1c4e1e4) **NO VISIBLE IMPROVEMENTS**
- [x] Remove the live/raw ref cnt as mentioned at #15736 (comment)
- [ ] Remove the ref count for size classed chunks (see 8953bbe and 8cb1bf0)
- [ ] Use the "pinned" Recycler instead of the `FastThreadLocal`-based one
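
The `nextInLine` guard mentioned in the checklist above can be sketched like this. NextInLineSlot is a hypothetical wrapper, not the field as it exists in the allocator; it only shows the pattern of doing a volatile read before the more expensive atomic swap.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical single-slot holder illustrating a read-before-swap guard. */
final class NextInLineSlot<T> {
    private final AtomicReference<T> nextInLine = new AtomicReference<>();

    void put(T chunk) {
        nextInLine.set(chunk);
    }

    /** Returns the parked value and clears the slot, or null if the slot is empty. */
    T take() {
        // Cheap volatile read first: when the slot is usually empty,
        // this skips the atomic read-modify-write entirely.
        if (nextInLine.get() == null) {
            return null;
        }
        // Another thread may have emptied the slot in between, so this can still return null.
        return nextInLine.getAndSet(null);
    }
}
```

As the checklist notes, this variation showed no visible improvement in the author's measurements, so the sketch illustrates the idea rather than a recommended change.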


georgebanasios added 5 commits August 11, 2025 16:24

improve free-list performance

ea20a8d

Revert "improve free-list performance"

dcdfa6e

This reverts commit ea20a8d.

Merge remote-tracking branch 'upstream/4.2' into 4.2

93999ab

Merge branch 'netty:4.2' into 4.2

689e058

Merge branch 'netty:4.2' into 4.2

ab8773e

franz1981 reviewed Aug 19, 2025


franz1981 reviewed Aug 20, 2025


Comment thread buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java Outdated

franz1981 reviewed Aug 20, 2025


Comment thread buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java Outdated

georgebanasios added 4 commits August 21, 2025 09:54

Merge branch 'netty:4.2' into 4.2

48c2762

test

478f75a

improvements v1

4d78e85

improvements v2 & magazine group queues fixes

05061fd

georgebanasios force-pushed the adaptive-unshared-fast-path branch from b06a348 to 05061fd Compare August 21, 2025 17:19

local free list on shared path

07a48e4

franz1981 reviewed Aug 22, 2025


Comment thread buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java Outdated

start with empty mpsc q

ea68585

georgebanasios added 2 commits August 23, 2025 11:09

Revert "start with empty mpsc q"

6a3acca

This reverts commit ea68585.

Revert "local free list on shared path"

2372161

This reverts commit 07a48e4.

franz1981 mentioned this pull request Sep 17, 2025

Remove usedMemory atomic counters on adaptive magazines #15677

Merged

remove usedMemory from magazines

42522dc

georgebanasios changed the title ~~Improve adaptive unshared size class allocation's fast-path~~ Improve the fast-path for adaptive unshared size class allocations. Sep 17, 2025

merge conflicts

b7fb931

franz1981 mentioned this pull request Oct 6, 2025

Make AdaptiveByteBuf.setBytes faster #15736

Merged

franz1981 mentioned this pull request Oct 10, 2025

Improve adaptive allocator thread local performance #15741

Merged

8 tasks

chrisvest closed this in #15741 Jan 6, 2026

chrisvest mentioned this pull request Jan 6, 2026

Improve adaptive allocator thread local performance (#15741) #16107

Merged

8 tasks
