Improve the fast-path for adaptive unshared size class allocations.#15571
Improve the fast-path for adaptive unshared size class allocations.#15571georgebanasios wants to merge 29 commits into
Conversation
The reference benchmark is in the original ticket. |
|
@franz1981 I opened this draft PR as @chrisvest suggested, just so it's easier for everyone to track the changes in one place. Let's continue the feedback and discussion here. The primary changes since your last review are the removal of FastThreadLocal from MagazineGroup and the logic for the magazine queues. I think this has introduced a performance drop, which I am currently checking to ensure I haven't missed anything in the implementation. |
|
Check the failures too @georgebanasios I have no laptop till sept so I will review the better I can via phone 🤳 In term of abstractions I have some mixed feeling that we could save inheritance and make code which look the same to just be duplicated, specialising behaviours which are specific for the size classed case which will be very common (and has been designed to be like that!) I have tried in my first PoCs to use inheritance because I didn't wanted to change too much code but for maximum performance and stability (when all the types and code paths are used in a real application) saving abstractions can just be of benefit. |
|
There are two issues in the CI. Edit: I've found the issue, I'm working on it and I'll push the fixes. |
b06a348 to
05061fd
Compare
|
Hey @franz1981 I made some changes to the queues of the magazine group because previously I had placed a non thread safe q for shared magazines on the magazine group level and was causing a race condition which funny enough I think it revealed something regarding the local free list on the shared path. Right now, we have an external shared queue in the MagazineGroup and a fast local queue for each Magazine (accessed by the owner thread and when we have the lock). As far as the local free list on the shared path (and let me know if I missed something), the magazine's allocation path is locked, but the lock doesn't guarantee exclusive access to the chunk itself (only to the magazine's state). |
As far as I know the allocation paths (remaining capacity and read into) are protected and never shared across threads. Chunks afaik belong to a specific magazine and could be used without locks too, but won't be accessed (for allocations) from the actual owning magazine while it happens (the so called allocate without lock code path). |
|
Now, thanks to your latest changes, the shared magazines implements something which we need yet prove to be beneficial:
This could be tested (before/after) with the existing benchmark without setting the harness custom executor. Since the Mpsc q is single consumer, it uses volatile loads on consumer side, whilst the IntStack nope, although a volatile load should be fast enough (that's why in my first iteration I didn't implemented the local free list for shared magazines). Now we have to understand if it is worthy, and in the current form (guesses):
Why the last point? |
|
It turns out the local IntStack for shared magazines causes a slight performance regression. Original (Netty 4.2): Current code without stack for shared path: Current code with stack for shared path: |
|
I am a bit surprised that the shared magazine case is not improved tbh, since we saved the (size classed) chunk reference count, but probably we have too many atomics there and not regressing is still fine. Now an interesting experiment: |
|
I still believe that an optimized copy from the Mpsc to the int stack is needed to save the unshared case to regress when it is not used from the owner thread... |
|
Regarding the pollutionIterations and abstractions (here's the benchmark georgebanasios@fc94123, let me know if you were thinking something different. I changed the initialization of the allocators from static to a fresh one on each trial to not get counter intuitive results, even though in both cases the outcome is the same). |
|
Mimalloc has never been that bad, what's going on? I am checking the benchmark code 🙏 |
I added a description to state the changes so far, feel free to take a look if anything's wrong/missing. |
|
FYI I've tried to reduce the amount of changes (despite the ugly code, but is meant just to reduce the diff): https://github.com/franz1981/netty/commits/adaptive_size_class_improvements/ This is not optimizing the chunk reuse queue/reuse logic and is not using (yet) the new recycler, but on my machine it delivers the highest IPC and performance so far. |
|
@franz1981 I checked out your branch and these are the results I got: |
|
Thanks @georgebanasios Consider that in both a Ryzen 4 box and a Xeon I was getting ~40 M op/s with the latest commit in this PR and ~44 M op/s on the branch you tested... |
|
@franz1981 Yeah the results makes it difficult to choose. Unfortunately my knowledge on such CPU architectures is limited otherwise I'd be happy to help more, so if you think that new commit is better for netty (the server argument makes more sense to me tbh) we should go with that. |
|
No worries, me too related Apple...and I know very few which can claim to know it well.
We could make the first one to be merged asap, and have feedbacks by the community, whilst we can take our time to do the second one.. Wdyt @normanmaurer ? |
|
No worries about overriding, etc from my side at all! Totally fine with you taking it in the other branch and continue there. |
|
Did you had any chance to move this forward @georgebanasios ? |
|
Netty usage, where peak performance and efficiency is a concern, is still mostly x86_64. |
|
@franz1981 I didn't have a chance no. I'm planning on continuing this weekend. |
|
If we can bring in smaller changes which improve perf, would be better IMO |
|
Sorry for my late reply, hope it's not too late :). Great work @franz1981 @georgebanasios. I recently got some new observations to share: Before, the So I optimized the Run the new Further, I added a more random sizes benchmark: ByteBufAllocatorRandomBenchmark, which randomly pick size from range: [0 - 8Kib). Run the So, I think, if the allocation pattern is less random, the adaptive will get better performance, if the allocation pattern is more random, then the mimalloc allocator will get better performance. |
|
This which commit is referring too? |
I use AdaptivePoolingAllocator of this PR's latest commit. |
|
I have opened an alternative PR using the latest recycler etc etc. Said that, I have not profiled what we have here with perfnorm nor async profiler. |
Cool, I will try to do some benchmark on it tommorrow. |
|
Thinking about it twice, there is still the difference between bump and sized class chunks and it would force the compiler to branch and perform a speculative decision, likely wrong based on the distribution; but without a profiler is difficult to know if that's the culprit. Another option is that the increase randomness is achieved by increasing the sample count, which just mean that we maybe increased the cache usage, and adaptive is not that cheap (yet) on that.... |
|
Re
That is fine, but is important that the sequence of choices in the loop is long and random enough.the length depends by the branch predictor state capacity |
I posted the benchmark in #15741 (comment). |
Motivation: Adaptive allocator perform costly atomic operations in the thread local path, which reduce its performance Modification: Reduce the amount of atomic operations in the thread local allocation's fast path Result: Fixes #15571 These are the different variations I want to test: - [x] Uses unguarded `Recycler`s - [x] Implements "compressed" local free list (LIFO) - [x] Use a mpsc q for the reuse chunk q in the thread-local case **NO VISIBLE IMPROVEMENTS** - [x] Guards `nextInLine`'s `getAndSet` with a null check via volatile `get` first, since size classed chunks rarely end up into `nextInLine` (i.e. which is mostly `null`) **NO VISIBLE IMPROVEMENTS** - [x] Implements a var handle based `MpscIntQueue` (done at 1c4e1e4) **NO VISIBLE IMPROVEMENTS** - [x] Remove the live/raw ref cnt as mentioned at #15736 (comment) - [ ] Remove the ref count for size classed chunks (see 8953bbe and 8cb1bf0) - [ ] Use the "pinned" Recycler instead of the `FastThreadLocal`-based one
Motivation: Adaptive allocator perform costly atomic operations in the thread local path, which reduce its performance Modification: Reduce the amount of atomic operations in the thread local allocation's fast path Result: Fixes netty#15571 These are the different variations I want to test: - [x] Uses unguarded `Recycler`s - [x] Implements "compressed" local free list (LIFO) - [x] Use a mpsc q for the reuse chunk q in the thread-local case **NO VISIBLE IMPROVEMENTS** - [x] Guards `nextInLine`'s `getAndSet` with a null check via volatile `get` first, since size classed chunks rarely end up into `nextInLine` (i.e. which is mostly `null`) **NO VISIBLE IMPROVEMENTS** - [x] Implements a var handle based `MpscIntQueue` (done at 1c4e1e4) **NO VISIBLE IMPROVEMENTS** - [x] Remove the live/raw ref cnt as mentioned at netty#15736 (comment) - [ ] Remove the ref count for size classed chunks (see 8953bbe and 8cb1bf0) - [ ] Use the "pinned" Recycler instead of the `FastThreadLocal`-based one (cherry picked from commit accd981)
Motivation: Adaptive allocator perform costly atomic operations in the thread local path, which reduce its performance Modification: Reduce the amount of atomic operations in the thread local allocation's fast path Result: Fixes #15571 These are the different variations I want to test: - [x] Uses unguarded `Recycler`s - [x] Implements "compressed" local free list (LIFO) - [x] Use a mpsc q for the reuse chunk q in the thread-local case **NO VISIBLE IMPROVEMENTS** - [x] Guards `nextInLine`'s `getAndSet` with a null check via volatile `get` first, since size classed chunks rarely end up into `nextInLine` (i.e. which is mostly `null`) **NO VISIBLE IMPROVEMENTS** - [x] Implements a var handle based `MpscIntQueue` (done at 1c4e1e4) **NO VISIBLE IMPROVEMENTS** - [x] Remove the live/raw ref cnt as mentioned at #15736 (comment) - [ ] Remove the ref count for size classed chunks (see 8953bbe and 8cb1bf0) - [ ] Use the "pinned" Recycler instead of the `FastThreadLocal`-based one (cherry picked from commit accd981) Co-authored-by: Francesco Nigro <[email protected]>
Motivation:
The original implementation, had a performance gap in the unshared, size-classed allocation path.
Key bottlenecks on the hot path:
SizeClassedChunktriggered atomic retain()/release() operations on the chunk's reference count, creating overhead even in an uncontended, single-threaded context.Recyclerfor poolingAdaptiveByteBufinstances in the thread-local path introduced unnecessary overhead compared to a simpler, non-thread-safe pooling mechanism.Magazineclass used conditional logic (if (shareable)) to handle both shared and unshared scenarios. This prevented the JIT compiler from creating an optimized, lock-free fast path for thread-local operations.Modification:
The single Magazine class was replaced by an abstract class
AbstractMagazinewith two concrete implementations:a.
SharedMagazine: Retains the original StampedLock and atomic operations, ensuring thread-safety for the multi-threaded, contended path.b.
ThreadLocalMagazine: A new, specialized class for the unshared path. It is completely lock-free and uses non-atomic operations.SizeClassedChunka. The atomic reference count (
refCnt) was completely removed fromSizeClassedChunk. It is now implemented in a newBumpChunkclass which handles non-size-classed allocations and retains the original ref-counting logic.b. A new state-based, "racy-by-design" (based on: https://www.scylladb.com/2018/02/15/memory-barriers-seastar-linux/) deallocation mechanism was introduced using a volatile int state field. A chunk is only deallocated when the total count of returned segments (both local and external) equals its total segment count.
SizeClassedChunkfree list was split for optimizationa.
IntStack(localFreeList): A non-thread-safe stack for the owner thread to use that removes the atomic operation in the hot path (due to release that can happen concurrently) and improves the locality of reused segments.b.
MpscIntQueue(externalFreeList): A concurrent queue for other threads to safely return segments.A new
ThreadLocalCacheclass was introduced to manage all thread-local resources.a. Buffer Recycling: The generic
Recyclerwas replaced with a simple FIFOArrayDequefor poolingAdaptiveByteBufobjects on the hot path, which is much faster in a single-threaded context.b. Chunk Caching: The
ThreadLocalMagazinenow maintains alocalChunkCache(ArrayList) to reduce the interactions with the external chunk reuse queue, serving as a faster, first-level cache.c. Queue Specialization: The chunk reuse queue for thread-local magazines is now a MpscQueue instead of the original MpmcQueue.
The logic was made more direct for the specialized classes.
a. The allocation logic for
SizeClassedChunkno longer relies on calculatingremainingCapacity(). ItstryAllocate()method now directly attempts topop()a segment from its free list and if it fails it proceeds to allocate a new chunk.b.
BumpChunk(for non-size-classed allocations) retains the originalremainingCapacity()check, as it allocates memory sequentially.Result:
Improved performance on the unshared size class allocation fast-path.
Fixes #15530