
Conversation

@chrisvest
Member

@chrisvest chrisvest commented Apr 13, 2024

See PR #13075 for the motivations.

The AdaptivePoolingAllocator is modified to work with the ByteBuf API. This required adding another implementation of ByteBuf, because in this API a buffer and its allocator (pooling or otherwise) are tightly coupled in their implementations.

Note that the AdaptivePoolingAllocator requires at least Java 8, because it relies on StampedLock. The constructor performs a version check, and throws an exception on older Java versions.
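A minimal sketch of the kind of runtime version guard described here (hypothetical names; not Netty's actual code):

```java
// Hypothetical sketch: reject pre-Java-8 runtimes before any
// StampedLock-dependent state is built.
public final class JavaVersionGuard {
    private JavaVersionGuard() {
    }

    // "java.specification.version" is "1.6"/"1.7"/"1.8" before Java 9,
    // and "9", "10", ... afterwards.
    static int majorVersion() {
        String spec = System.getProperty("java.specification.version", "1.6");
        String[] parts = spec.split("\\.");
        return parts[0].equals("1") ? Integer.parseInt(parts[1]) : Integer.parseInt(parts[0]);
    }

    public static void requireAtLeastJava8() {
        if (majorVersion() < 8) {
            throw new UnsupportedOperationException(
                    "This allocator requires Java 8 or newer (found specification version "
                            + System.getProperty("java.specification.version") + ')');
        }
    }
}
```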

This allocator, if merged, will be strictly experimental in Netty 4.1.

@chrisvest chrisvest requested a review from normanmaurer April 13, 2024 22:26
@chrisvest chrisvest marked this pull request as draft April 13, 2024 22:26
@chrisvest chrisvest marked this pull request as ready for review April 14, 2024 01:05
@franz1981
Contributor

franz1981 commented Apr 14, 2024

I like the design of this a lot; it resembles what modern GCs do with TLAB allocations:

  • dedicated fast path for bump-only allocation
  • estimation of "burst" capacity based on allocation telemetry

It should also make it easy to implement some sort of soft limits, I think.
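The bump-only fast path mentioned above can be sketched roughly like this (a single-owner chunk with a plain bump pointer; illustrative names, not Netty's):

```java
// Rough sketch of a TLAB-style bump allocation fast path: a chunk owned
// by a single thread, so the allocation pointer needs no atomics.
final class BumpChunk {
    private final byte[] memory;
    private int allocated; // bump pointer; single-owner, so a plain int suffices

    BumpChunk(int capacity) {
        memory = new byte[capacity];
    }

    // Returns the offset of the new allocation, or -1 if the chunk is
    // exhausted and a fresh chunk must be fetched (the slow path).
    int bumpAllocate(int size) {
        if (allocated + size > memory.length) {
            return -1;
        }
        int offset = allocated;
        allocated += size;
        return offset;
    }
}
```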

This should also be very virtual-thread friendly, but the devil is in the details: one thing not addressed here is the Recycler used by pooled allocation, which still relies on ThreadLocal. To fully close the "circle", it would be better to fix that too.

A separate suggestion: if a thread is a fast (event loop) thread, why not pin it to a specific magazine, and let the threads that are not (alien thread pools and virtual threads) compete for a striped set of magazines, using the thread id to pick one?

This would combine the benefits of the two approaches: the allocation statistics of event loop threads would not be disturbed by others, and it would likely save the atomic operations for those threads (effectively implementing TLAB machinery).
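The striping idea could look something like the following sketch, assuming Magazine is an opaque per-thread allocation area (names here are illustrative, not Netty's):

```java
// Sketch of magazine striping: non-event-loop ("alien") threads pick a
// stripe derived from their thread id; event-loop threads would instead
// hold a pinned magazine of their own.
final class MagazineSelector {
    static final class Magazine { } // stand-in for the real allocation area

    private final Magazine[] stripes;

    MagazineSelector(int stripeCount) {
        if (Integer.bitCount(stripeCount) != 1) {
            throw new IllegalArgumentException("stripeCount must be a power of two");
        }
        stripes = new Magazine[stripeCount];
        for (int i = 0; i < stripeCount; i++) {
            stripes[i] = new Magazine();
        }
    }

    Magazine selectFor(Thread thread) {
        // Mask works because the stripe count is a power of two.
        int index = (int) (mix(thread.getId()) & (stripes.length - 1));
        return stripes[index];
    }

    private static long mix(long z) { // SplitMix64 finalizer: spreads sequential ids
        z = (z ^ (z >>> 33)) * 0xff51afd7ed558ccdL;
        z = (z ^ (z >>> 33)) * 0xc4ceb9fe1a85ec53L;
        return z ^ (z >>> 33);
    }
}
```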

@chrisvest
Member Author

> A separate suggestion: if a thread is a fast (event loop) thread, why not pin it to a specific magazine, and let the threads that are not (alien thread pools and virtual threads) compete for a striped set of magazines, using the thread id to pick one?
>
> This would combine the benefits of the two approaches: the allocation statistics of event loop threads would not be disturbed by others, and it would likely save the atomic operations for those threads (effectively implementing TLAB machinery).

Allocation needs to coordinate with any concurrent close() call, so saving the atomic operations might not be possible.

@franz1981
Contributor

franz1981 commented Apr 14, 2024

That's OK, but considering that we only care about retiring the current chunk once it is exhausted, maybe there is something that can be stripped out of the hot path for such threads. I have to play with it more before I can suggest something specific; in any case it would be an addition, because as it is, this is already very nice!

@chrisvest chrisvest force-pushed the 41-adaptive-allocator branch from f27032e to cbf0bda Compare April 14, 2024 23:52
@chrisvest chrisvest requested review from franz1981 and jchrys and removed request for jchrys April 15, 2024 19:55
@chrisvest chrisvest force-pushed the 41-adaptive-allocator branch from cbf0bda to f7f3005 Compare April 16, 2024 18:45
@chrisvest
Member Author

The usedHeapMemory() and usedDirectMemory() accounting doesn't quite work. It's somehow missing decrements.

@chrisvest
Member Author

Local benchmark numbers on Java 11:

Benchmark                                                  (size)   Mode  Cnt         Score        Error  Units
ByteBufAllocatorBenchmark.adaptiveDirectAllocAndFree        00256  thrpt   20  14659507.477 ± 165289.355  ops/s
ByteBufAllocatorBenchmark.adaptiveHeapAllocAndFree          00256  thrpt   20  22227938.709 ± 449200.309  ops/s
ByteBufAllocatorBenchmark.defaultPooledDirectAllocAndFree   00256  thrpt   20  13129535.618 ± 108988.642  ops/s
ByteBufAllocatorBenchmark.defaultPooledHeapAllocAndFree     00256  thrpt   20  13023927.401 ± 104843.216  ops/s
ByteBufAllocatorBenchmark.pooledDirectAllocAndFree          00256  thrpt   20  12122912.035 ± 337387.354  ops/s
ByteBufAllocatorBenchmark.pooledHeapAllocAndFree            00256  thrpt   20  12586541.154 ±  58123.116  ops/s
ByteBufAllocatorBenchmark.unpooledDirectAllocAndFree        00256  thrpt   20   2121884.292 ±  23003.444  ops/s
ByteBufAllocatorBenchmark.unpooledHeapAllocAndFree          00256  thrpt   20  19869435.020 ± 122243.355  ops/s

@chrisvest chrisvest marked this pull request as draft April 17, 2024 17:09
Member

@normanmaurer normanmaurer left a comment

First round of review... I am super excited about this @chrisvest

chrisvest and others added 11 commits April 18, 2024 08:38
See PR netty#13075 for the motivations.

The AdaptivePoolingAllocator is modified to work with the ByteBuf API.
This required adding another implementation of ByteBuf, because in this API a buffer and its allocator (pooling or otherwise) are tightly coupled in their implementations.

Note that the AdaptivePoolingAllocator requires at least Java 8, because it relies on StampedLock.
The constructor performs a version check, and throws an exception on older Java versions.
The tests passed with "adaptive" as the default allocator.
@chrisvest
Member Author

And the concurrent allocator benchmark, Java 17:

Benchmark                                                    (size)   Mode  Cnt          Score          Error  Units
ByteBufAllocatorConcurrentBenchmark.allocateReleaseAdaptive   00256  thrpt   20  164186255.001 ± 80355784.409  ops/s
ByteBufAllocatorConcurrentBenchmark.allocateReleasePooled     00256  thrpt   20   73136079.467 ±  2706876.076  ops/s
ByteBufAllocatorConcurrentBenchmark.allocateReleaseUnpooled   00256  thrpt   20    3022716.439 ±   137359.369  ops/s

@chrisvest chrisvest marked this pull request as ready for review April 18, 2024 22:13
@normanmaurer normanmaurer added this to the 4.1.110.Final milestone Apr 19, 2024
@normanmaurer
Member

@chrisvest as this is marked as @UnstableApi we should pull it in once you are happy with it :) I am ....

@chrisvest chrisvest marked this pull request as draft April 19, 2024 20:39
@chrisvest
Member Author

SocketObjectEchoTest and SocketRstTest are consistently complaining. Needs to be looked into.

By having a predictable pattern in the strings, it becomes much easier to reason about the cause of any test failure.
…e lock

We must never touch the chunk-delegate buffer's internal NIO buffer without holding the magazine lock.
Getting the internal NIO buffer involves modifying its position and limit, which are a shared mutable resource.
Therefore, we cannot lazily obtain this buffer, and have to grab it eagerly when we initialize the AdaptiveByteBuf under the magazine lock, which ensures that no other thread is touching that buffer.
This does unfortunately extend the critical section a bit.
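The locking rule above can be illustrated with a small sketch: obtain and slice the shared internal NIO buffer only while holding the magazine's StampedLock write lock (MagazineState here is a hypothetical stand-in, not Netty's actual class):

```java
import java.nio.ByteBuffer;
import java.util.concurrent.locks.StampedLock;

// Illustrative sketch of eager, lock-protected access to a shared NIO buffer.
final class MagazineState {
    private final StampedLock lock = new StampedLock();
    private final ByteBuffer chunkDelegate = ByteBuffer.allocate(1024);

    // Producing a slice mutates position/limit on a duplicate of the shared
    // buffer, so it happens eagerly inside the write lock, never lazily later.
    ByteBuffer sliceUnderLock(int offset, int length) {
        long stamp = lock.writeLock();
        try {
            ByteBuffer dup = chunkDelegate.duplicate(); // shares memory, not position/limit
            dup.position(offset).limit(offset + length);
            return dup.slice(); // captured while no other thread can interfere
        } finally {
            lock.unlockWrite(stamp);
        }
    }
}
```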
The timeout is applied to the combined run time of all the test cases.
Since there are more cases, the test needs more time to complete.
@chrisvest chrisvest marked this pull request as ready for review April 19, 2024 23:38
@chrisvest
Member Author

Reran the benchmarks on Java 11:

Benchmark                                                  (size)   Mode  Cnt         Score        Error  Units
ByteBufAllocatorBenchmark.adaptiveDirectAllocAndFree        00256  thrpt   20  21938808.710 ± 267349.998  ops/s
ByteBufAllocatorBenchmark.adaptiveHeapAllocAndFree          00256  thrpt   20  22934470.112 ± 612294.938  ops/s
ByteBufAllocatorBenchmark.defaultPooledDirectAllocAndFree   00256  thrpt   20  12384652.456 ±  73364.731  ops/s
ByteBufAllocatorBenchmark.defaultPooledHeapAllocAndFree     00256  thrpt   20  12404587.038 ±  54743.049  ops/s
ByteBufAllocatorBenchmark.pooledDirectAllocAndFree          00256  thrpt   20  11271708.987 ±  45600.629  ops/s
ByteBufAllocatorBenchmark.pooledHeapAllocAndFree            00256  thrpt   20  11613996.778 ± 112700.407  ops/s
ByteBufAllocatorBenchmark.unpooledDirectAllocAndFree        00256  thrpt   20   2136907.759 ±  20863.526  ops/s
ByteBufAllocatorBenchmark.unpooledHeapAllocAndFree          00256  thrpt   20  20517192.830 ± 103002.065  ops/s

@chrisvest
Member Author

@normanmaurer @jchrys I think this is ready, now.

Contributor

@jchrys jchrys left a comment

lgtm. I'm genuinely excited about this. 🚀

@normanmaurer
Member

@chrisvest ship it :shipit:

@chrisvest chrisvest merged commit 0dc95ef into netty:4.1 Apr 20, 2024
@chrisvest chrisvest deleted the 41-adaptive-allocator branch April 20, 2024 14:59
Contributor

@franz1981 franz1981 left a comment

Now checking the benchmark: sorry, had some family things going on recently...

LuciferYang pushed a commit to apache/spark that referenced this pull request May 28, 2024
### What changes were proposed in this pull request?
The pr aims to upgrade `netty` from `4.1.109.Final` to `4.1.110.Final`.

### Why are the changes needed?
- https://netty.io/news/2024/05/22/4-1-110-Final.html
  This version has brought some bug fixes and improvements, such as:
  Fix Zstd throws Exception on read-only volumes (netty/netty#13982)
  Add unix domain socket transport in netty 4.x via JDK16+ ([#13965](netty/netty#13965))
  Backport #13075: Add the AdaptivePoolingAllocator ([#13976](netty/netty#13976))
  Add no-value key handling only for form body ([#13998](netty/netty#13998))
  Add support for specifying SecureRandom in SSLContext initialization ([#14058](netty/netty#14058))

- https://github.com/netty/netty/issues?q=milestone%3A4.1.110.Final+is%3Aclosed

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46744 from panbingkun/SPARK-48420.

Authored-by: panbingkun <[email protected]>
Signed-off-by: yangjie01 <[email protected]>

private final ChunkAllocator chunkAllocator;
private final Queue<ChunkByteBuf> centralQueue;
private final StampedLock magazineExpandLock;
Contributor

Unless I am missing something obvious, this is going to cause the class to not be loadable in Java 6 and 7 and the version check in the constructor won't help because it's executed too late.
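One common way to address this kind of loading problem (a sketch of the general idiom, not necessarily what Netty did): keep the Java-8-only types behind a nested holder class, so they are only loaded after the runtime version check has passed. All class names below are illustrative.

```java
// Sketch of the lazy-holder idiom: StampedLock is only referenced from the
// holder, so older JVMs never resolve it unless the version check passes.
final class AdaptiveAllocatorSketch {
    private static final class StampedLockHolder {
        static final java.util.concurrent.locks.StampedLock LOCK =
                new java.util.concurrent.locks.StampedLock();
    }

    AdaptiveAllocatorSketch() {
        requireAtLeastJava8(); // runs before any Java-8-only class is touched
    }

    long writeLock() {
        return StampedLockHolder.LOCK.writeLock(); // triggers holder loading here
    }

    void unlockWrite(long stamp) {
        StampedLockHolder.LOCK.unlockWrite(stamp);
    }

    private static void requireAtLeastJava8() {
        String spec = System.getProperty("java.specification.version", "1.6");
        if (spec.startsWith("1.") && Integer.parseInt(spec.substring(2)) < 8) {
            throw new UnsupportedOperationException("requires Java 8+");
        }
    }
}
```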

Member Author

I think you're right. I'll put up a PR.

Contributor

👌

@geoand
Contributor

geoand commented Jun 4, 2024

Awesome to see this backported!

We will definitely start using it in Quarkus - I know @franz1981 is eager to see how it performs in real workloads.

@He-Pin
Contributor

He-Pin commented Jun 20, 2024

[attached screenshot: memory usage comparison]

Seems like the new AdaptivePoolingAllocator uses more memory.
Env: Java 21 with G1 and libjemalloc

@chrisvest
Member Author

@He-Pin can you start a discussion and describe the workload? Would also be helpful if you can look into a heap dump and see if there's too many chunks in the central queue, or being held up by long-lived ByteBuf objects, or something else.

@He-Pin
Contributor

He-Pin commented Jun 20, 2024

I will describe it in detail after work today.

@franz1981
Contributor

Many thanks @He-Pin I am super curious as well!
