Buddy allocation for large buffers in adaptive allocator#16053
Conversation
…t side of any subtree
This is temporary. We should do something faster than this.
Releasing could end up iterating past its sibling pair, so where releasing a node could end up updating a parent path that wasn't a parent of the released node.
When releasing parent nodes we were only considering the siblings one level down, not whether any other child of a given node had been claimed. This could lead to grand-parents getting marked as free, if we returned from a child that had a free sibling. Now we return whether the given subtree is free and only consider the sibling if so.
They were used for debugging and are no longer needed.
|
With the fixes to the freeing code, the buddy allocator is now both faster and use less memory (no longer accidentally keeping memory marked as claimed). However, it's not nearly enough to make up for the removal of the 32k and 64k size classes. FYI @franz1981 |
I think it might be better to look at that problem in a separate follow-up PR. Don't want each PR to get too big.
|
@franz1981 I think it'd be better to do chunk picking in a separate PR, so I brought back the 32k and 64k size classes. Those bring the memory usage back to previous levels. I marked this PR ready for review. |
# Conflicts: # buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java
franz1981
left a comment
There was a problem hiding this comment.
I will be soon back after holiday and look at this 🙏
normanmaurer
left a comment
There was a problem hiding this comment.
Generally looking good to me... just a a few nits and a question
| @Override | ||
| public int remainingCapacity() { | ||
| if (!freeList.isEmpty()) { | ||
| freeList.drain(256, this); |
There was a problem hiding this comment.
Also it feels kind of odd that a "getter" would have such a "side-effect".
There was a problem hiding this comment.
Perhaps, but the alternative is to iterate and sum the sizes of pending frees. It'd require adding some reduce-on-range method to the MpscIntQueue, which is of course possible.
The whole remainingCapacity business is a hold-over from bump allocation, so it's something we should move away from entirely, and instead rely on readInitInto returning a boolean.
Fixing this up is something I want to do in a future PR, perhaps as part of improving chunk picking.
| } else { | ||
| RefCnt.resetRefCnt(refCnt); | ||
| delegate.setIndex(0, 0); | ||
| allocatedBytes = 0; |
There was a problem hiding this comment.
Yes, because updates to this field may be delayed, e.g. by the BuddyChunk freeList. So we can't reset it in case there's relative updates pending. Instead we need to let the concrete Chunk implementations manage it.
Just like SizeClassedChunks, the BuddyChunks can be released directly back to the shared pool in the magazine group, without waiting for the chunk to be fully freed. This allows it to be reused much sooner, and reduces memory usage. This is technically enough to remove the 32k and 64k size classes. However, if we do that, then it would currently cause an increase in chunk churn, where chunks are allocated and released a lot. The churn needs to come under control, before those two size classes can be removed.
|
@franz1981 @normanmaurer Added one more commit where I fixed a missing piece that was causing the increased memory usage seen in #16053 (comment) Like the This means it's now technically possible to remove the 32k and 64k. However, doing so introduces a fair bit of chunk churn, as can be seen in this chart: Getting the churn under control, and making smarter picking decisions, is what I'll work on next in a separate PR. |
|
@chrisvest don't we also want to cherry-pick to 4.1 ? |
|
Yeah, I'll add it. |
Motivation: The histogram bump allocating chunks have poor chunk and memory reuse in practice, which leads to higher memory usage than the pooled allocator whenever an application performs enough allocations of buffers that don't fit in our size-classed chunks. Buddy allocation should in theory reduce memory consumption by allowing memory reuse within a chunk, similar to the size-classed chunks, but for variable power-of-two sized allocations. We've found that beyond 16k-20k buffer sizes, allocations predominantly comes in power-of-two sizes, hence buddy allocation should be a good fit for this size regime. Modification: * Implement a new chunk type that does buddy allocation, based on an array-embedded binary search tree. * The tree is encoded as a dense byte array, with two bits marking node or child-node usage, and six bits to encode the node size. * The histogram pointer-bump allocating chunk implementation is removed, which unlocks potential simplifications and optimizations that will benefit both buddy and size-classed chunks. * The 32k and 64k size classes are kept for the time being, to keep chunk churn under control, but they are planned to be removed in a follow-up PR. Result: We generally get improvements to memory usage, because the buddy allocator is able to reuse its chunks before they are fully deallocated. If the 32k and 64k size-classes are removed, then the improvements continue to hold up, but we see an increase in allocation churn for buddy chunks. This needs to be investigated and solved before we can remove the 32k and 64k size-classes. Presumably, it comes down to making better decisions about the size of the buddy chunks, and in picking which chunks to allocate from next once a magazine has exhausted its current chunk. (cherry picked from commit 2dbc1e7)
Motivation: The histogram bump allocating chunks have poor chunk and memory reuse in practice, which leads to higher memory usage than the pooled allocator whenever an application performs enough allocations of buffers that don't fit in our size-classed chunks. Buddy allocation should in theory reduce memory consumption by allowing memory reuse within a chunk, similar to the size-classed chunks, but for variable power-of-two sized allocations. We've found that beyond 16k-20k buffer sizes, allocations predominantly comes in power-of-two sizes, hence buddy allocation should be a good fit for this size regime. Modification: * Implement a new chunk type that does buddy allocation, based on an array-embedded binary search tree. * The tree is encoded as a dense byte array, with two bits marking node or child-node usage, and six bits to encode the node size. * The histogram pointer-bump allocating chunk implementation is removed, which unlocks potential simplifications and optimizations that will benefit both buddy and size-classed chunks. * The 32k and 64k size classes are kept for the time being, to keep chunk churn under control, but they are planned to be removed in a follow-up PR. Result: We generally get improvements to memory usage, because the buddy allocator is able to reuse its chunks before they are fully deallocated. If the 32k and 64k size-classes are removed, then the improvements continue to hold up, but we see an increase in allocation churn for buddy chunks. This needs to be investigated and solved before we can remove the 32k and 64k size-classes. Presumably, it comes down to making better decisions about the size of the buddy chunks, and in picking which chunks to allocate from next once a magazine has exhausted its current chunk. (cherry picked from commit 2dbc1e7)
…6133) Motivation: The histogram bump allocating chunks have poor chunk and memory reuse in practice, which leads to higher memory usage than the pooled allocator whenever an application performs enough allocations of buffers that don't fit in our size-classed chunks. Buddy allocation should in theory reduce memory consumption by allowing memory reuse within a chunk, similar to the size-classed chunks, but for variable power-of-two sized allocations. We've found that beyond 16k-20k buffer sizes, allocations predominantly comes in power-of-two sizes, hence buddy allocation should be a good fit for this size regime. Modification: * Implement a new chunk type that does buddy allocation, based on an array-embedded binary search tree. * The tree is encoded as a dense byte array, with two bits marking node or child-node usage, and six bits to encode the node size. * The histogram pointer-bump allocating chunk implementation is removed, which unlocks potential simplifications and optimizations that will benefit both buddy and size-classed chunks. * The 32k and 64k size classes are kept for the time being, to keep chunk churn under control, but they are planned to be removed in a follow-up PR. Result: We generally get improvements to memory usage, because the buddy allocator is able to reuse its chunks before they are fully deallocated. If the 32k and 64k size-classes are removed, then the improvements continue to hold up, but we see an increase in allocation churn for buddy chunks. This needs to be investigated and solved before we can remove the 32k and 64k size-classes. Presumably, it comes down to making better decisions about the size of the buddy chunks, and in picking which chunks to allocate from next once a magazine has exhausted its current chunk. (cherry picked from commit 2dbc1e7)
…6132) Motivation: The histogram bump allocating chunks have poor chunk and memory reuse in practice, which leads to higher memory usage than the pooled allocator whenever an application performs enough allocations of buffers that don't fit in our size-classed chunks. Buddy allocation should in theory reduce memory consumption by allowing memory reuse within a chunk, similar to the size-classed chunks, but for variable power-of-two sized allocations. We've found that beyond 16k-20k buffer sizes, allocations predominantly comes in power-of-two sizes, hence buddy allocation should be a good fit for this size regime. Modification: * Implement a new chunk type that does buddy allocation, based on an array-embedded binary search tree. * The tree is encoded as a dense byte array, with two bits marking node or child-node usage, and six bits to encode the node size. * The histogram pointer-bump allocating chunk implementation is removed, which unlocks potential simplifications and optimizations that will benefit both buddy and size-classed chunks. * The 32k and 64k size classes are kept for the time being, to keep chunk churn under control, but they are planned to be removed in a follow-up PR. Result: We generally get improvements to memory usage, because the buddy allocator is able to reuse its chunks before they are fully deallocated. If the 32k and 64k size-classes are removed, then the improvements continue to hold up, but we see an increase in allocation churn for buddy chunks. This needs to be investigated and solved before we can remove the 32k and 64k size-classes. Presumably, it comes down to making better decisions about the size of the buddy chunks, and in picking which chunks to allocate from next once a magazine has exhausted its current chunk. (cherry picked from commit 2dbc1e7)


Motivation:
The histogram bump allocating chunks have poor chunk and memory reuse in practice, which leads to higher memory usage than the pooled allocator whenever an application performs enough allocations of buffers that don't fit in our size-classed chunks.
Buddy allocation should in theory reduce memory consumption by allowing memory reuse within a chunk, similar to the size-classed chunks, but for variable power-of-two sized allocations.
We've found that beyond 16k-20k buffer sizes, allocations predominantly comes in power-of-two sizes, hence buddy allocation should be a good fit for this size regime.
Modification:
Result:
We generally get improvements to memory usage, because the buddy allocator is able to reuse its chunks before they are fully deallocated.
If the 32k and 64k size-classes are removed, then the improvements continue to hold up, but we see an increase in allocation churn for buddy chunks.
This needs to be investigated and solved before we can remove the 32k and 64k size-classes.
Presumably, it comes down to making better decisions about the size of the buddy chunks, and in picking which chunks to allocate from next once a magazine has exhausted its current chunk.