Conversation
int alignCapacity(int reqCapacity) {
    int delta = reqCapacity & cacheAlignmentMask;
    if (delta == 0) {
        return reqCapacity;
// testInternalNioBuffer(128);
// testInternalNioBuffer(1024);
// testInternalNioBuffer(4 * 1024);
// testInternalNioBuffer(64 * 1024);
PooledByteBufAllocator pool = new PooledByteBufAllocator(true, 2, 2, 8192, 11, 1000, 1000, 1000, true, 64);
ByteBuf buff = pool.directBuffer(4096);
for (int i = 0; i < 4096; i++) {
    buff.writeByte(100);
public void testArenaMetricsCacheAlign() {
    testArenaMetrics0(new PooledByteBufAllocator(true, 2, 2, 8192, 11, 1000, 1000, 1000, true, 64), 100, 1, 1, 0);
}
@Test
}
    buff.release();
}
public PooledByteBufAllocator(boolean preferDirect, int nHeapArena, int nDirectArena, int pageSize, int maxOrder,
                              int tinyCacheSize, int smallCacheSize, int normalCacheSize,
                              boolean useCacheForAllThreads) {
                              boolean useCacheForAllThreads, int cacheAlignment) {
@kiril-me we also need to keep the old constructor to not break the API.
    return reqCapacity;
} else {
    return alignCapacity(reqCapacity);
}
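The rounding trick being reviewed here can be sketched standalone. The class name and the 64-byte alignment value below are illustrative assumptions, not the Netty source: for a power-of-two alignment, masking with `alignment - 1` yields how far past the last boundary a capacity sits, and the remainder tells how much padding to add.

```java
// Standalone sketch of power-of-two capacity alignment (not the Netty code).
public class AlignExample {
    static final int CACHE_ALIGNMENT = 64;              // assumed cache-line size, must be a power of two
    static final int CACHE_ALIGNMENT_MASK = CACHE_ALIGNMENT - 1;

    static int alignCapacity(int reqCapacity) {
        int delta = reqCapacity & CACHE_ALIGNMENT_MASK; // bytes past the last 64-byte boundary
        return delta == 0 ? reqCapacity : reqCapacity + CACHE_ALIGNMENT - delta;
    }

    public static void main(String[] args) {
        System.out.println(alignCapacity(64));   // 64  (already aligned)
        System.out.println(alignCapacity(100));  // 128 (100 + 64 - 36)
        System.out.println(alignCapacity(1));    // 64
    }
}
```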
make this:
return cacheAlignment == 0 ? reqCapacity : alignCapacity(reqCapacity);

@Override
protected PoolChunk<byte[]> newUnpooledChunk(int capacity) {
    return new PoolChunk<byte[]>(this, new byte[capacity], capacity);
    return new PoolChunk<byte[]>(this, new byte[capacity], capacity, 0);
Since we only do this for direct buffers, why not rename it to directMemoryCacheAlignment or something like this? The same also applies to the system property etc.
    memory = allocateDirect(capacity + cacheAlignment);
    offset = offsetCacheLine(memory, cacheAlignmentMask);
}
return new PoolChunk<ByteBuffer>(this, memory, capacity, offset);
consider changing this to:
return cacheAlignment == 0 ? new PoolChunk<ByteBuffer>(this, allocateDirect(capacity), capacity, 0) :
new PoolChunk<ByteBuffer>(this, allocateDirect(capacity + cacheAlignment), capacity, offsetCacheLine(memory, cacheAlignmentMask));

protected PoolChunk<ByteBuffer> newChunk(int pageSize, int maxOrder, int pageShifts, int chunkSize) {
final ByteBuffer memory;
final int offset;
if (cacheAlignment == 0) {
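One plausible way to compute the skip distance used as the chunk offset is sketched below. This is a simplified illustration working on a raw address value, not Netty's actual offsetCacheLine, which reads the native address of the direct ByteBuffer: the chunk is over-allocated by one alignment unit, and the offset moves the usable region to the next 64-byte boundary.

```java
// Illustrative sketch: offset from a raw address to the next cache-line boundary.
public class OffsetExample {
    static final int CACHE_ALIGNMENT = 64;               // assumed cache-line size
    static final int CACHE_ALIGNMENT_MASK = CACHE_ALIGNMENT - 1;

    // Bytes to skip so that (address + result) lands on a 64-byte boundary.
    static int offsetCacheLine(long address) {
        int remainder = (int) (address & CACHE_ALIGNMENT_MASK);
        return remainder == 0 ? 0 : CACHE_ALIGNMENT - remainder;
    }

    public static void main(String[] args) {
        System.out.println(offsetCacheLine(128));  // 0  (already on a boundary)
        System.out.println(offsetCacheLine(130));  // 62 (130 + 62 = 192 = 3 * 64)
    }
}
```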
    pooledDirectBuffers[i].writeBytes(bytes);
}
}
int block = size / 128;
for (int i = 0; i < pooledDirectBuffers.length; i++) {
    byte[] bytes = new byte[block];
    rand.nextBytes(bytes);
@kiril-me the allocation and filling of the byte[] should not happen in the benchmark itself but in the @Setup method, otherwise it will affect the benchmark. The same goes for everything else that is not pooledDirectBuffers[i].writeBytes(bytes). Even better would be to also remove the array access here.
}
@Benchmark
public void writeRead() {
import java.util.HashMap;
import java.util.Map;
/*
move the copyright to the top of the file and also change the year to 2017
public PooledByteBufAllocator(boolean preferDirect, int nHeapArena, int nDirectArena, int pageSize, int maxOrder,
                              int tinyCacheSize, int smallCacheSize, int normalCacheSize,
                              boolean useCacheForAllThreads) {
                              boolean useCacheForAllThreads, int cacheAlignment) {
verify cacheAlignment is >= 0
@kiril-me also please re-run the benchmarks and update here once you are done
private PooledByteBufAllocator pooledAllocator;
private ByteBuf pooledDirectBuffers;
nit: pooledDirectBuffers -> pooledDirectBuffer
@Benchmark
public void write() {
    pooledDirectBuffers.writeBytes(bytes);
}
@kiril-me also add a benchmark which just reads? For this you will need to write in the doSetup() method though.
@kiril-me also ensure you show the new numbers after the changes are in.
@kiril-me please rebase on top of current 4.1 so it only includes your commit.
@normanmaurer I made the changes and reworked the benchmarks. I still need to research how to make the benchmarks stable.
@kiril-me let me know once I should check again
@normanmaurer I changed the benchmarks. There are now two direct buffers. The first is a default direct buffer, for which I calculate an offset in case it happened to be aligned, because we want a misaligned buffer. The second is 64-byte aligned. The performance difference is visible for large buffer sizes.
@kiril-me please share the new results.
Result:
@kiril-me didn't you say that there is a performance win with cache alignment? All the numbers in your results suggest otherwise.
I used an average-time benchmark, so the lower the value the better. Here are the results for a 1048576-byte buffer.
@kiril-me ah doh! I did not notice you used
Throughput measurement.
couldn't you share almost all the code, while the only difference would be the alignOffset in some cases?
What do you mean? I use alignOffset for the misaligned line. Yes, it will be used in some cases. I didn't find a way to change the offset inside a buffer once it was created.
can you add a comment that explains the 1137?
This can be static final, and please also add a comment explaining why you used 4.
@kiril-me also thanks for all the effort! Looks very good :)
Scottmitch
left a comment
Changes look good! A few small comments.
does this have to be a power of 2 for the mask below to work? If so, should we enforce that somewhere, for example warn and go to the next positive power of 2, or set it to 0?
MathUtil.safeFindNextPositivePowerOfTwo may be useful here.
I added a check inside PooledByteBufAllocator. Should I add it in PoolArena too?
PooledByteBufAllocator is good enough IMHO
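Rounding an arbitrary alignment value up to the next power of two, as suggested above, can be sketched like this. This is a guess at the kind of normalization MathUtil.safeFindNextPositivePowerOfTwo performs; the class and method names below are illustrative, and the real Netty implementation may differ:

```java
// Illustrative sketch: normalize a value to the next power of two so that
// mask arithmetic (value - 1) works for alignment computations.
public class PowerOfTwoExample {
    // Smallest power of two >= value; non-positive inputs fall back to 1.
    static int nextPositivePowerOfTwo(int value) {
        if (value <= 1) {
            return 1;
        }
        return 1 << (32 - Integer.numberOfLeadingZeros(value - 1));
    }

    public static void main(String[] args) {
        System.out.println(nextPositivePowerOfTwo(64));  // 64 (already a power of two)
        System.out.println(nextPositivePowerOfTwo(65));  // 128
        System.out.println(nextPositivePowerOfTwo(-5));  // 1
    }
}
```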
nit: could be a ternary for slightly less code:
return delta == 0 ? reqCapacity : reqCapacity + directMemoryCacheAlignment - delta;
nit: else is not necessary because you return in the if statement above
you make a temporary for size and sizeMask but not for alignOffset ... do we need any temporaries?
yes, I made it temporary as well.
just curious why the temporaries are necessary ... is this just preference or habit from dealing with volatiles/mutable state?
it's a habit, to be sure the data doesn't change underneath
@kiril-me please squash
Motivation:
64-byte alignment is recommended by the Intel performance guide (https://software.intel.com/en-us/articles/practical-intel-avx-optimization-on-2nd-generation-intel-core-processors) for data structures over 64 bytes. Requiring padding to a multiple of 64 bytes allows SIMD instructions to be used consistently in loops without additional conditional checks. This should allow for simpler and more efficient code.
Modification:
At the moment cache alignment must be set up manually, but it could probably be taken from the system. The original code was introduced by @normanmaurer https://github.com/netty/netty/pull/4726/files
Result:
Buffer alignment works better than a misaligned cache.
@kiril-me once you have signed the ICLA I can merge this one... Thanks!
Done. When are you planning to release 4.0.45.Final?
@kiril-me thanks... within the next two weeks.


buffer/src/main/java/io/netty/buffer/PoolArena.java
buffer/src/main/java/io/netty/buffer/PoolChunk.java
buffer/src/main/java/io/netty/buffer/PooledByteBuf.java
buffer/src/main/java/io/netty/buffer/PooledByteBufAllocator.java
buffer/src/test/java/io/netty/buffer/AbstractByteBufTest.java
buffer/src/test/java/io/netty/buffer/PoolArenaTest.java
buffer/src/test/java/io/netty/buffer/PooledByteBufAllocatorTest.java
microbench/src/main/java/io/netty/microbench/buffer/ByteBufAllocatorBenchmark.java
microbench/src/main/java/io/netty/microbench/buffer/PooledByteBufAllocatorAlignBenchmark.java
Benchmarks show better write and read performance for large buffer sizes.