Improve adaptive allocator thread local performance#15741
Conversation
|
FYI @georgebanasios and @chrisvest Each of the commit has its own "cumulative" performance results, to help understand the effects performance-wise. Despite been WIP this is still in a decent state thanks to the work done by @georgebanasios at #15530 |
|
I benchmarked with ByteBufAllocatorAllocPatternBenchmark, the adaptive code based on the latest commit of this PR, the others code based on #15525 with the latest 4.2-branch merged: |
|
So something is off here (which I am not hitting on my machine...); it seems the chunk queue capacity is not enough and you got allocations in the hot paths. You can increase the capacity of the reusable chunk q but I don't understand why I cannot hit this...did you changed the benchmark to retain more buffers? Try use the Benchmark I have on this PR which is saving more hotspot to cheat and if you want add more variability on it. |
|
On my machine, the I increased the The number of adaptive number improved a lot, which indicates that the capacity of |
|
Consider that I run the benchmark with harness executor as event loops and for fast thread local chunk queues it shouldn't matter, since are not shared at all... |
I found the reason why the mimalloc number increased: I have increased the 'Xms,Xmx and MaxDirectMemorySize' from '768 m' to '2768 m' for this run. So it seems 'Xms,Xmx and MaxDirectMemorySize' take effect on the performance of mimalloc. |
|
Do you have the flame-graphs for the adaptive one? You can use cstack vm as rawCommand |
Did you mean for the second run? |
|
Regardless, without the changes in this PR we should be too much distant from mimalloc, I think (@chrisvest can verify for M1 too, I don't want to bother you too much @laosijikaichele !) - whilst with these we are in the 86% of its performance. We have yet few improvements which are not in, but I plan to remove this work once I have progressed with the Netty loom scheduler I am working on |
|
Yep. That's why I prefer to use larger sample size space and no random in the hot path...it has some variability of cost too 😢 |
|
After #15764 I have to rebase this PR and I've already updated the description checklist at
I think
is even more important to be brought in, because of the additional indirection level, and despite overall it would save inlining issues (already verified at #15764 (comment)) |
Benchmark (allocatorType) (pollutionIterations) Mode Cnt Score Error Units ByteBufAllocatorAllocPatternBenchmark.directAllocation ADAPTIVE 0 avgt 20 45.499 ± 0.249 ns/op ByteBufAllocatorAllocPatternBenchmark.directAllocation ADAPTIVE 200000 avgt 20 46.075 ± 0.414 ns/op
Benchmark (allocatorType) (pollutionIterations) Mode Cnt Score Error Units ByteBufAllocatorAllocPatternBenchmark.directAllocation ADAPTIVE 0 avgt 20 40.054 ± 0.104 ns/op ByteBufAllocatorAllocPatternBenchmark.directAllocation ADAPTIVE 200000 avgt 20 45.149 ± 0.648 ns/op
Benchmark (allocatorType) (pollutionIterations) Mode Cnt Score Error Units ByteBufAllocatorAllocPatternBenchmark.directAllocation ADAPTIVE 0 avgt 20 37.085 ± 0.322 ns/op ByteBufAllocatorAllocPatternBenchmark.directAllocation ADAPTIVE 200000 avgt 20 43.207 ± 0.175 ns/op
314351c to
2ecb257
Compare
|
I've rebased over |
|
This is ready for review @chrisvest ; it has a minimal set of changes which already help performance (although with w an increased memory footprint due to the local free list) |
chrisvest
left a comment
There was a problem hiding this comment.
One comment, but it looks pretty good.
| if (localFreeList != null) { | ||
| assert Thread.currentThread() == ownerThread; | ||
| if (localFreeList.isEmpty()) { | ||
| startIndex = externalFreeList.poll(); |
There was a problem hiding this comment.
We could perhaps move entries into the local list in a batch. Otherwise the local freelist is only populated by releases done by the owning thread.
There was a problem hiding this comment.
We fallback here once there is nothing in the local free list, so the only benefit of copying is that we drain LIFO instead of FIFO from the external one, but to be fair I prefer to prioritize draining from the local one instead of mixing segments from the two free lists; the local segments have the benefit that they are more likely to be "hot" on the owner thread core
|
One note for @chrisvest We need to think about the size classed chunk q sizing (and how to use it, since it doesn't prioritize the available capacity, which is mutable) and I haven't gotten rid of the ref cnt additional operations on the size classed chunks which I know already (tested eons ago) it deliver the biggest boost. |
|
Sounds good, @franz1981. Let's merge this and think about further improvements in future PRs. |
Motivation: Adaptive allocator perform costly atomic operations in the thread local path, which reduce its performance Modification: Reduce the amount of atomic operations in the thread local allocation's fast path Result: Fixes netty#15571 These are the different variations I want to test: - [x] Uses unguarded `Recycler`s - [x] Implements "compressed" local free list (LIFO) - [x] Use a mpsc q for the reuse chunk q in the thread-local case **NO VISIBLE IMPROVEMENTS** - [x] Guards `nextInLine`'s `getAndSet` with a null check via volatile `get` first, since size classed chunks rarely end up into `nextInLine` (i.e. which is mostly `null`) **NO VISIBLE IMPROVEMENTS** - [x] Implements a var handle based `MpscIntQueue` (done at 1c4e1e4) **NO VISIBLE IMPROVEMENTS** - [x] Remove the live/raw ref cnt as mentioned at netty#15736 (comment) - [ ] Remove the ref count for size classed chunks (see 8953bbe and 8cb1bf0) - [ ] Use the "pinned" Recycler instead of the `FastThreadLocal`-based one (cherry picked from commit accd981)
Motivation: Adaptive allocator perform costly atomic operations in the thread local path, which reduce its performance Modification: Reduce the amount of atomic operations in the thread local allocation's fast path Result: Fixes #15571 These are the different variations I want to test: - [x] Uses unguarded `Recycler`s - [x] Implements "compressed" local free list (LIFO) - [x] Use a mpsc q for the reuse chunk q in the thread-local case **NO VISIBLE IMPROVEMENTS** - [x] Guards `nextInLine`'s `getAndSet` with a null check via volatile `get` first, since size classed chunks rarely end up into `nextInLine` (i.e. which is mostly `null`) **NO VISIBLE IMPROVEMENTS** - [x] Implements a var handle based `MpscIntQueue` (done at 1c4e1e4) **NO VISIBLE IMPROVEMENTS** - [x] Remove the live/raw ref cnt as mentioned at #15736 (comment) - [ ] Remove the ref count for size classed chunks (see 8953bbe and 8cb1bf0) - [ ] Use the "pinned" Recycler instead of the `FastThreadLocal`-based one (cherry picked from commit accd981) Co-authored-by: Francesco Nigro <[email protected]>



Motivation:
Adaptive allocator perform costly atomic operations in the thread local path, which reduce its performance
Modification:
Reduce the amount of atomic operations in the thread local allocation's fast path
Result:
Fixes #15571
These are the different variations I want to test:
RecyclersNO VISIBLE IMPROVEMENTS
nextInLine'sgetAndSetwith a null check via volatilegetfirst, since size classed chunks rarely end up intonextInLine(i.e. which is mostlynull)NO VISIBLE IMPROVEMENTS
MpscIntQueue(done at 1c4e1e4)NO VISIBLE IMPROVEMENTS
FastThreadLocal-based one