Experimental: introduce mimalloc MiMallocByteBufAllocator#15525
laosijikaichele wants to merge 121 commits into
To better simulate the real world allocations, the following benchmarks copied size array
private MiByteBuf allocate(int size, int maxCapacity, MiByteBuf byteBuf) {
    LocalHeap localHeap = THREAD_LOCAL_HEAP.get();
    int wSize = toWordSize(size);
    if (size <= PAGES_FREE_DIRECT_SIZE_MAX) {
The fast path here is rather different from adaptive's (the size classes): it has 128 size classes spaced 8 bytes apart.
This saves the double lookup otherwise needed to compute the class first.
Additionally, this always uses a thread-local; for a fair comparison we should make adaptive always use a thread-local as well (I made that possible at the time, IIRC).
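To make the "single lookup vs. double lookup" point concrete, here is a minimal sketch. It assumes 128 direct classes of one 8-byte word each (so sizes up to 1024 bytes), and the coarse `CLASS_SIZES` table is purely illustrative; none of these names or values are taken from the PR or from adaptive.

```java
final class SizeClassLookup {
    // Assumption: 128 "direct" classes, one per 8-byte word, covering 1..1024 bytes.
    static final int WORD_SIZE = 8;
    static final int DIRECT_CLASSES = 128;

    // Single lookup: the rounded-up word count is already the class index.
    static int directIndex(int size) {
        return (size + WORD_SIZE - 1) >>> 3;
    }

    // Illustrative coarse classes for the two-step variant (hypothetical values).
    static final int[] CLASS_SIZES = {32, 64, 128, 256, 512, 1024};
    static final byte[] SIZE_TO_CLASS = buildTable();

    // Maps every word count to the smallest coarse class that fits it.
    static byte[] buildTable() {
        byte[] table = new byte[DIRECT_CLASSES + 1];
        int cls = 0;
        for (int words = 0; words <= DIRECT_CLASSES; words++) {
            while (CLASS_SIZES[cls] < words * WORD_SIZE) {
                cls++;
            }
            table[words] = (byte) cls;
        }
        return table;
    }

    // Double lookup: the second array load depends on the result of the first.
    static int twoStepClassSize(int size) {
        int cls = SIZE_TO_CLASS[directIndex(size)];
        return CLASS_SIZES[cls];
    }
}
```

With the direct scheme the index comes from pure arithmetic on the requested size, so the only memory access left is the load of the per-class free list itself.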
> to have a fair comparison we should enable adaptive to always use thread local as well

Adaptive already uses thread-locals in the benchmark too, because the benchmark uses event-loop threads; check the second parameter of the benchmark class constructor:
Yeah, but the number of hops/indirections is still different because of the different ways buffers are recycled...
That said, my comment was more of a general one.
Block block = page.freeList;
if (block != null) {
    if (byteBuf == null) {
        byteBuf = block;
In the best-case scenario, pooled buffers don't use separate storage: the pooled object is both of the right size (class) and holds the wrapper too.
This saves additional atomic operations (or the use of an array deque to hold the empty shells, as adaptive does).
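A minimal sketch of that "the wrapper is the free-list node" idea, assuming a page owned by a single thread. `Block` and `Page` here are hypothetical stand-ins written for illustration, not the classes from this PR.

```java
final class IntrusiveFreeListSketch {
    // Hypothetical pooled buffer: the wrapper object itself is the free-list node.
    static final class Block {
        Block next;          // intrusive link, only meaningful while the block is free
        final int capacity;

        Block(int capacity) {
            this.capacity = capacity;
        }
    }

    // Hypothetical page for one size class, touched only by its owner thread,
    // so neither allocate() nor free() needs an atomic operation.
    static final class Page {
        Block freeList;
        final int blockSize;

        Page(int blockSize) {
            this.blockSize = blockSize;
        }

        Block allocate() {
            Block block = freeList;
            if (block == null) {
                return new Block(blockSize); // slow path: nothing to reuse
            }
            freeList = block.next;           // pop the head of the list
            block.next = null;
            return block;
        }

        void free(Block block) {
            block.next = freeList;           // push back onto the same list
            freeList = block;
        }
    }
}
```

Because a freed block is already the right size and already carries its wrapper, reuse is a pointer pop; no separate queue of empty shells is involved.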
As explained in a few comments, there are a few key differences which can explain the performance advantage in microbenchmarks (#15509 should help as well, but there is still a double lookup + recycling cost).
After a first round of inspecting the assembly behind the performance results, beware:
This can likely be fixed easily by making the base benchmark call non-inlineable, but that is not guaranteed!
You can use async profiler with `cstack vm` to "see" it easily
To test how the allocators reuse memory, it would instead be necessary to "touch" and dirty enough memory to spill into L2 or even the LLC: then the way the allocators reuse memory will dominate over the orchestration required to get/release buffers (which is still important, but is a different type of quality we want to improve on).
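As a rough illustration of what "touching" a buffer could mean, the sketch below writes one byte per cache line of a `java.nio.ByteBuffer`. The 64-byte line size is an assumption and the helper name is made up; in a JMH benchmark the returned count would typically be handed to a `Blackhole` so the writes are not optimized away.

```java
import java.nio.ByteBuffer;

final class TouchMemorySketch {
    // Assumption: 64-byte cache lines, typical for current x86 and many ARM cores.
    static final int CACHE_LINE = 64;

    // Writes one byte into every cache line of the buffer so each line becomes dirty.
    // Returns the number of lines touched.
    static int touch(ByteBuffer buffer) {
        int touched = 0;
        for (int i = 0; i < buffer.capacity(); i += CACHE_LINE) {
            buffer.put(i, (byte) 1); // absolute put: position is left unchanged
            touched++;
        }
        return touched;
    }
}
```

Calling something like `touch(buffer)` between allocate and release, with a working set larger than L1, shifts the measurement from get/release orchestration toward how well each allocator hands back recently used memory.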
… list; use one look up for size, instead of double data-dependent look up
…d of many, for each successful queue.offer() operation
There was inappropriate logic in
So I re-ran the benchmarks to check the numbers:
1. Allocate and release in the same thread: jmh-link
2. Allocate in one thread and release in another thread: jmh-link
As we can see, for the second benchmark, the
Ywc 🙏
The numbers still don't match the ones on my old Xeon (I used numactl + localalloc and a single thread so that false sharing would not add noise), but atomic ops are a dominant factor in the fast path (we don't do anything with the allocated buffers!) and adaptive performs twice as many of them (ref cnt of the buffer + release of the size class segment id + release of the reusable buffer wrapper), so the performance difference is expected.
Yes, the sizes array is now pre-shuffled in the setup() method and is accessed with a single lookup instead of the earlier double lookup:
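For readers following along, a minimal sketch of what a pre-shuffled size array with a single lookup might look like. The class, its fields, and the power-of-two length requirement are assumptions made for illustration and are not the benchmark code from this PR.

```java
import java.util.Random;

final class PreShuffledSizes {
    private final int[] sizes; // random picks resolved once, at setup time
    private int next;

    // length must be a power of two so the index can be wrapped with a mask.
    PreShuffledSizes(int[] distribution, int length, long seed) {
        if (Integer.bitCount(length) != 1) {
            throw new IllegalArgumentException("length must be a power of two");
        }
        Random random = new Random(seed);
        sizes = new int[length];
        for (int i = 0; i < length; i++) {
            // The two-step pick (random index, then size) happens here, not in the hot path.
            sizes[i] = distribution[random.nextInt(distribution.length)];
        }
    }

    // Hot path: one sequential array load, no data-dependent second lookup.
    int nextSize() {
        return sizes[next++ & (sizes.length - 1)];
    }
}
```

The alternative of keeping an index array and resolving `distribution[indices[i]]` per operation would put two dependent loads in the measured loop.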
Added 1. Allocate and release in same threads: jmh-link
2. Allocate in one thread, and release in another thread: jmh-link
Thanks. I will check tomorrow whether I still see other suspect calls in the hot path's assembly 🙏
I cannot see it yet..
These are the numbers on my machine with JDK 21 (using the commit at #15509):
$ java -Djmh.executor=CUSTOM -Djmh.executor.class=io.netty.microbench.util.AbstractMicrobenchmark\$HarnessExecutor -jar microbench/target/microbenchmarks.jar io.netty.microbench.buffer.ByteBufAllocatorAllocPatternBenchmark.*Direct -f 1 -t 1 -prof perfasm
Benchmark                                              Mode  Cnt         Score     Error  Units
ByteBufAllocatorAllocPatternBenchmark.adaptiveDirect  thrpt   10  19872573.143 ± 176.568  ops/s
ByteBufAllocatorAllocPatternBenchmark.mimallocDirect  thrpt   10  24398458.564 ± 130.254  ops/s
ByteBufAllocatorAllocPatternBenchmark.pooledDirect    thrpt   10  13815460.976 ± 238.903  ops/s
The adaptive results are very interesting but, as expected, are mostly explained by release being more costly due to the additional atomic operation on the segment, i.e.:
I added it locally and will push it soon.
I used JDK 17; I will switch to JDK 21 to test it too.
Another thing which requires some attention in the current reference count scheme is
Which doesn't look well optimized and is related to calling this on the chunk:
private T retain0(T instance, final int increment, final int rawIncrement) {
    int oldRef = getAndAddRawRefCnt(instance, rawIncrement);
    if (oldRef != 2 && oldRef != 4 && (oldRef & 1) != 0) {
        throw new IllegalReferenceCountException(0, increment);
    }
    // don't pass 0!
    if ((oldRef <= 0 && oldRef + rawIncrement >= 0)
            || (oldRef >= 0 && oldRef + rawIncrement < oldRef)) {
        // overflow case
        getAndAddRawRefCnt(instance, -rawIncrement);
        throw new IllegalReferenceCountException(realRefCnt(oldRef), increment);
    }
    return instance;
}
with
A similar problem happens with the segment's release (which performs a chunk release too!):
public final boolean release(T instance) {
    int rawCnt = getRawRefCnt(instance);
    return rawCnt == 2 ? tryFinalRelease0(instance, 2) || retryRelease0(instance, 1)
            : nonFinalRelease0(instance, 1, rawCnt, toLiveRealRefCnt(rawCnt, 1));
}
for many cases of
This case could be improved (assuming there will be a single in-flight chunk buffer segment allocation) as
public final boolean release(T instance) {
    int rawCnt = getRawRefCnt(instance);
    if (rawCnt == 2) {
        return tryFinalRelease0(instance, 2) || retryRelease0(instance, 1);
    }
    // this is a fast-path useful for the adaptive chunk case
    if (rawCnt == 4) {
        // this is saving an expensive computation (using lea -0x2(%rax),%ebp) to the new ref cnt
        return nonFinalRelease0(instance, 1, 4, 2);
    }
    return nonFinalRelease0(instance, 1, rawCnt, toLiveRealRefCnt(rawCnt, 1));
}
Sadly this is not a great solution, since the number of in-flight segments usually exceeds 1, where the "slow path" release requires computing the "live real ref cnt" which is not needed for
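For context on the constants 2 and 4 in the snippets above: the raw counter is kept at twice the real count while the object is live, and an odd raw value marks it as released. The following is a simplified, self-contained sketch of that encoding built on `AtomicInteger`; it ignores overflow handling and is not Netty's actual updater code.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class RawRefCntSketch {
    // raw == 2 * real while live; an odd raw value means "already released".
    private final AtomicInteger raw = new AtomicInteger(2);

    static int realRefCnt(int rawCnt) {
        return (rawCnt & 1) != 0 ? 0 : rawCnt >>> 1;
    }

    int refCnt() {
        return realRefCnt(raw.get());
    }

    void retain() {
        int old = raw.getAndAdd(2);          // one atomic op on the happy path
        if ((old & 1) != 0) {
            raw.getAndAdd(-2);               // undo, the object was already released
            throw new IllegalStateException("already released");
        }
    }

    // Returns true only for the final release (real count 1 -> 0).
    boolean release() {
        for (;;) {
            int rawCnt = raw.get();
            if ((rawCnt & 1) != 0) {
                throw new IllegalStateException("already released");
            }
            if (rawCnt == 2) {
                if (raw.compareAndSet(2, 1)) {   // flip to odd: released
                    return true;
                }
            } else if (raw.compareAndSet(rawCnt, rawCnt - 2)) {
                return false;                    // e.g. raw 4 -> 2, real 2 -> 1
            }
        }
    }
}
```

Under this encoding, a raw value of 4 corresponds to a real count of 2, which is why a dedicated `rawCnt == 4` branch can pass the real count as a constant instead of deriving it.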
…eBufAllocatorProducerConsumerBenchmark
…ept for the page mark or abandoning
…e a local heap instance
…ntegrated with mimalloc allocator" This reverts commit 120b63f.
The
Done some benchmarks on normal platform threads.
Benchmark code: MiMallocByteBufAllocator.
Test server: ARM, 8 cores, 32 GB RAM.
JDK: OpenJDK-21.0.7.
The code has been merged from the latest 4.2 branch into this PR.
We use 16 threads to simulate the common use case (8 cores * 2).
First round:
We can see the
Second round:
We can see the huge RSS spike disappear. But the performance of
Third round:
We can see the
Observation:
@laosijikaichele is the benchmark configured to run on event-loop threads?
See #15741 (comment), the only unchecked point, in case you want to try it 🙏
It runs on normal platform threads (non-event-loop).
Will look into it.
To quickly check the performance effect on atomic For We use 16 event-loop threads:
Observation:
It's a bit weird that 256 in adaptive gets better performance than 128. I would likely profile what's going on there ❤️
I'd also be interested in trying to improve the















Motivation:
For thread-limited use cases, including thread-limited, long-running virtual-thread use cases, we can still utilize `threadlocal` to improve performance; for example, our existing allocators still use `threadlocal` for event-loop thread allocation.

According to the mimalloc-paper, mimalloc is designed to better deal with reference counting use cases and has good performance advantages. This seems naturally suitable for our `ByteBuf` allocation.

This PR implements a Netty version of the mimalloc allocator, `MiMallocByteBufAllocator`, which takes reference from https://github.com/microsoft/mimalloc and shows better performance on initial benchmarks; the benchmark numbers will be shown later.

Modification:
Added `MiMallocByteBufAllocator` and related classes.

Result:
Better performance for thread-limited use cases.
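
The thread-local idea from the motivation can be sketched roughly as follows. `LocalHeap` and `THREAD_LOCAL_HEAP` echo names visible in the reviewed diff, but the bodies here are assumptions written for illustration only.

```java
final class ThreadLocalHeapSketch {
    // Hypothetical per-thread heap: state that only its owner thread mutates.
    static final class LocalHeap {
        final Thread owner = Thread.currentThread();
        int allocations;
    }

    // Each thread lazily gets its own heap on first use.
    static final ThreadLocal<LocalHeap> THREAD_LOCAL_HEAP =
            ThreadLocal.withInitial(LocalHeap::new);

    static LocalHeap allocateFromLocalHeap() {
        LocalHeap heap = THREAD_LOCAL_HEAP.get();
        heap.allocations++; // no synchronization needed: owner-only access
        return heap;
    }
}
```

With a bounded number of threads, the number of such heaps stays bounded too, which is the case the PR targets.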