
Buddy allocation for large buffers in adaptive allocator #16053

Merged
chrisvest merged 15 commits into netty:4.2 from chrisvest:4.2-buddy-alloc
Jan 12, 2026

@chrisvest
Member

@chrisvest chrisvest commented Dec 16, 2025

Motivation:

The histogram bump allocating chunks have poor chunk and memory reuse in practice, which leads to higher memory usage than the pooled allocator whenever an application performs enough allocations of buffers that don't fit in our size-classed chunks.

Buddy allocation should in theory reduce memory consumption by allowing memory reuse within a chunk, similar to the size-classed chunks, but for variable power-of-two sized allocations.

We've found that beyond 16k-20k buffer sizes, allocations predominantly come in power-of-two sizes, hence buddy allocation should be a good fit for this size regime.

Modification:

  • Implement a new chunk type that does buddy allocation, based on an array-embedded binary search tree.
  • The tree is encoded as a dense byte array, with two bits marking node or child-node usage, and six bits to encode the node size.
  • The histogram pointer-bump allocating chunk implementation is removed, which unlocks potential simplifications and optimizations that will benefit both buddy and size-classed chunks.
  • The 32k and 64k size classes are kept for the time being, to keep chunk churn under control, but they are planned to be removed in a follow-up PR.
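
To make the encoding in the list above more concrete, here is a minimal sketch. It is not the code from this PR. It assumes a heap-style layout (root at index 1, children of node `i` at `2i` and `2i + 1`), and it assumes that bit 6 means "this node is claimed" and bit 7 means "something below this node is claimed"; the description only says that two bits mark usage and six bits encode the node size. The class `BuddyTreeSketch` and its method names are hypothetical, and the search is a plain depth-first walk rather than anything optimized.

```java
final class BuddyTreeSketch {
    private static final int SIZE_MASK = 0x3F;     // low six bits: log2 of the node's size in bytes
    private static final int CLAIMED = 0x40;       // assumed meaning: this whole node is handed out
    private static final int CHILD_CLAIMED = 0x80; // assumed meaning: something below is handed out

    private final byte[] tree; // index 0 is unused, the root lives at index 1
    private final int minShift;

    BuddyTreeSketch(int chunkShift, int minShift) {
        if (minShift < 0 || chunkShift < minShift || chunkShift > 30) {
            throw new IllegalArgumentException("unsupported sizes");
        }
        this.minShift = minShift;
        int levels = chunkShift - minShift + 1;
        tree = new byte[1 << levels];
        for (int node = 1; node < tree.length; node++) {
            int depth = 31 - Integer.numberOfLeadingZeros(node);
            tree[node] = (byte) (chunkShift - depth); // each level halves the node size
        }
    }

    /** Returns the byte offset of the claimed block inside the chunk, or -1 if nothing fits. */
    int allocate(int size) {
        // ceil(log2(size)): requests are rounded up to the next power of two.
        int wantShift = 32 - Integer.numberOfLeadingZeros(Math.max(size, 1) - 1);
        return claim(1, 0, Math.max(wantShift, minShift));
    }

    private int claim(int node, int offset, int wantShift) {
        int bits = tree[node] & 0xFF;
        int shift = bits & SIZE_MASK;
        if ((bits & CLAIMED) != 0 || shift < wantShift) {
            return -1; // already handed out, or too small
        }
        if (shift == wantShift) {
            if ((bits & CHILD_CLAIMED) != 0) {
                return -1; // part of this node is in use
            }
            tree[node] |= CLAIMED;
            return offset;
        }
        // The node is larger than needed: try the left half, then the right half.
        int result = claim(2 * node, offset, wantShift);
        if (result < 0) {
            result = claim(2 * node + 1, offset + (1 << (shift - 1)), wantShift);
        }
        if (result >= 0) {
            tree[node] |= CHILD_CLAIMED;
        }
        return result;
    }

    public static void main(String[] args) {
        BuddyTreeSketch chunk = new BuddyTreeSketch(20, 15); // 1 MiB chunk, 32 KiB smallest block
        System.out.println(chunk.allocate(32 * 1024)); // 0
        System.out.println(chunk.allocate(64 * 1024)); // 65536
        System.out.println(chunk.allocate(1 << 20));   // -1, the chunk is partly in use
    }
}
```

One byte per node keeps the whole tree for such a chunk in 64 bytes, which is the appeal of a dense encoding; how the real implementation lays out and searches the tree may differ.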

Result:

We generally get improvements to memory usage, because the buddy allocator is able to reuse its chunks before they are fully deallocated.
If the 32k and 64k size-classes are removed, then the improvements continue to hold up, but we see an increase in allocation churn for buddy chunks.
This needs to be investigated and solved before we can remove the 32k and 64k size-classes.
Presumably, it comes down to making better decisions about the size of the buddy chunks, and in picking which chunks to allocate from next once a magazine has exhausted its current chunk.

This is temporary. We should do something faster than this.
Releasing could end up iterating past its sibling pair, so releasing a node could end up updating a parent path that wasn't a parent of the released node.
When releasing parent nodes we were only considering the siblings one level down, not whether any other child of a given node had been claimed.
This could lead to grand-parents getting marked as free, if we returned from a child that had a free sibling.
Now we return whether the given subtree is free and only consider the sibling if so.
They were used for debugging and are no longer needed.
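
The release fix described in the commit notes above ("return whether the given subtree is free and only consider the sibling if so") can be sketched as follows. This is an illustration under assumptions, not the PR's code: the class `BuddyReleaseSketch`, the flag values, and the use of heap-style node indices (root at 1, children of `i` at `2i` and `2i + 1`) are all hypothetical.

```java
final class BuddyReleaseSketch {
    private static final int CLAIMED = 0x40;       // assumed flag: node handed out
    private static final int CHILD_CLAIMED = 0x80; // assumed flag: a descendant is handed out
    private static final int USAGE = CLAIMED | CHILD_CLAIMED;

    private final byte[] tree; // index 0 unused, root at index 1

    BuddyReleaseSketch(int levels) {
        tree = new byte[1 << levels];
    }

    /** Marks a node as handed out and flags every ancestor. */
    void claim(int node) {
        tree[node] |= CLAIMED;
        for (int parent = node >> 1; parent >= 1; parent >>= 1) {
            tree[parent] |= CHILD_CLAIMED;
        }
    }

    boolean isFree(int node) {
        return (tree[node] & USAGE) == 0;
    }

    /** Releases a node; returns true if the whole tree became free. */
    boolean release(int node) {
        return release(1, node);
    }

    // Walks down from 'current' to 'target' and reports, on the way back up,
    // whether the subtree rooted at 'current' is now entirely free.
    private boolean release(int current, int target) {
        if (current == target) {
            tree[current] &= ~CLAIMED;
            return isFree(current);
        }
        int levelsBelow = depth(target) - depth(current);
        int child = target >> (levelsBelow - 1); // the child of 'current' on the path to 'target'
        boolean childFree = release(child, target);
        // Only look at the sibling when the child's subtree is known to be free.
        if (childFree && isFree(child ^ 1)) {
            tree[current] &= ~CHILD_CLAIMED;
            return isFree(current);
        }
        return false; // something below is still claimed, so the ancestors stay marked
    }

    private static int depth(int node) {
        return 31 - Integer.numberOfLeadingZeros(node);
    }

    public static void main(String[] args) {
        BuddyReleaseSketch tree = new BuddyReleaseSketch(4); // leaves are nodes 8..15
        tree.claim(8);
        tree.claim(10);
        tree.release(8);
        System.out.println(tree.isFree(4)); // true: both 8 and 9 are free
        System.out.println(tree.isFree(1)); // false: node 10 is still claimed
    }
}
```

In the example, a version that only checked the sibling at each level would see node 3 free while returning through node 2 and wrongly clear the root, even though node 10 is still in use. Propagating the "subtree is free" result stops the upward walk at node 2.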
@chrisvest
Member Author

With the fixes to the freeing code, the buddy allocator is now both faster and uses less memory (no longer accidentally keeping memory marked as claimed). However, it's not nearly enough to make up for the removal of the 32k and 64k size classes.
This is especially true in cases where there's high buffer retention, with many buffers alive at the same time. In that case, we end up with many chunks and make poor decisions when reusing them, because we just have a queue per magazine group.

FYI @franz1981

@chrisvest
Member Author

Here's a simulation run with the e-commerce data:
image
If I comment out the largeBufferMagazineGroup path, so large buffers are unpooled instead of using the buddy allocator for pooling, then memory usage is cut in half, which gives us a picture of how far away from the limit we are.

I think it might be better to look at that problem in a separate follow-up PR.
Don't want each PR to get too big.
@chrisvest chrisvest marked this pull request as ready for review January 6, 2026 00:25
@chrisvest
Member Author

@franz1981 I think it'd be better to do chunk picking in a separate PR, so I brought back the 32k and 64k size classes. Those bring the memory usage back to previous levels. I marked this PR ready for review.

@chrisvest chrisvest requested a review from franz1981 January 6, 2026 18:49
# Conflicts:
#	buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java
Contributor

@franz1981 franz1981 left a comment

I will be back from holiday soon and will look at this 🙏

@chrisvest chrisvest added the needs-cherry-pick-5.0 This PR should be cherry-picked to 5.0 once merged. label Jan 7, 2026
@chrisvest chrisvest requested a review from normanmaurer January 8, 2026 21:03
Member

@normanmaurer normanmaurer left a comment

Generally looking good to me... just a few nits and a question

Comment thread buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java Outdated
@Override
public int remainingCapacity() {
    if (!freeList.isEmpty()) {
        freeList.drain(256, this);
Member

See above

Member

Also it feels kind of odd that a "getter" would have such a "side-effect".

Member Author

Perhaps, but the alternative is to iterate and sum the sizes of pending frees. It'd require adding some reduce-on-range method to the MpscIntQueue, which is of course possible.

The whole remainingCapacity business is a hold-over from bump allocation, so it's something we should move away from entirely, and instead rely on readInitInto returning a boolean.

Fixing this up is something I want to do in a future PR, perhaps as part of improving chunk picking.
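
The trade-off discussed in this thread can be sketched as follows: either the capacity query drains the queue of pending frees (a side effect), or it sums the pending entries without consuming them. This is only an illustration. `java.util.concurrent.ConcurrentLinkedQueue` stands in for Netty's internal `MpscIntQueue`, and the class `DeferredFreeSketch` and its method names are hypothetical; only the drain limit of 256 is taken from the snippet under review.

```java
import java.util.concurrent.ConcurrentLinkedQueue;

final class DeferredFreeSketch {
    // Sizes freed by other threads, not yet applied to freeBytes.
    private final ConcurrentLinkedQueue<Integer> pendingFrees = new ConcurrentLinkedQueue<>();
    private int freeBytes; // only touched by the owning thread

    DeferredFreeSketch(int capacity) {
        freeBytes = capacity;
    }

    /** Called by the owning thread. */
    boolean allocate(int size) {
        drain(256);
        if (size > freeBytes) {
            return false;
        }
        freeBytes -= size;
        return true;
    }

    /** May be called by any thread; the accounting is applied later by the owner. */
    void free(int size) {
        pendingFrees.add(size);
    }

    /** Variant 1: the "getter" applies up to 256 pending frees before answering. */
    int remainingCapacityDraining() {
        drain(256);
        return freeBytes;
    }

    /** Variant 2: no side effect, but it has to walk the pending entries. */
    int remainingCapacitySumming() {
        int pending = 0;
        for (int size : pendingFrees) {
            pending += size;
        }
        return freeBytes + pending;
    }

    private void drain(int limit) {
        for (int i = 0; i < limit; i++) {
            Integer size = pendingFrees.poll();
            if (size == null) {
                break;
            }
            freeBytes += size;
        }
    }

    public static void main(String[] args) {
        DeferredFreeSketch chunk = new DeferredFreeSketch(100);
        chunk.allocate(60);
        chunk.free(60);
        System.out.println(chunk.remainingCapacitySumming());  // 100
        System.out.println(chunk.remainingCapacityDraining()); // 100
    }
}
```

The summing variant is what a reduce-on-range method on the queue would enable. A boolean result from the allocation call, as `allocate` returns here, avoids the need for a capacity query at all.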

Comment thread buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java Outdated
} else {
    RefCnt.resetRefCnt(refCnt);
    delegate.setIndex(0, 0);
    allocatedBytes = 0;
Member

Is this correct?

Member Author

Yes, because updates to this field may be delayed, e.g. by the BuddyChunk freeList. So we can't reset it, in case there are relative updates pending. Instead, we need to let the concrete Chunk implementations manage it.

@chrisvest chrisvest requested a review from normanmaurer January 9, 2026 21:11
Just like SizeClassedChunks, the BuddyChunks can be released directly back to the shared pool in the magazine group, without waiting for the chunk to be fully freed.
This allows it to be reused much sooner, and reduces memory usage.

This is technically enough to remove the 32k and 64k size classes.
However, if we do that, then it would currently cause an increase in chunk churn, where chunks are allocated and released a lot.
The churn needs to be brought under control before those two size classes can be removed.
@chrisvest
Member Author

@franz1981 @normanmaurer Added one more commit where I fixed a missing piece that was causing the increased memory usage seen in #16053 (comment)

Like the SizeClassedChunks, the BuddyChunks can of course be reused before all their buffers have been freed.

This means it's now technically possible to remove the 32k and 64k size classes. However, doing so introduces a fair bit of chunk churn, as can be seen in this chart:
image

Getting the churn under control, and making smarter picking decisions, is what I'll work on next in a separate PR.

@normanmaurer normanmaurer added this to the 4.2.10.Final milestone Jan 12, 2026
@chrisvest chrisvest merged commit 2dbc1e7 into netty:4.2 Jan 12, 2026
47 of 51 checks passed
@chrisvest chrisvest deleted the 4.2-buddy-alloc branch January 12, 2026 22:32
@normanmaurer
Member

@chrisvest don't we also want to cherry-pick to 4.1?

@chrisvest
Member Author

Yeah, I'll add it.

@chrisvest chrisvest added the needs-cherry-pick-4.1 This PR should be cherry-picked to 4.1 once merged. label Jan 12, 2026
chrisvest added a commit to chrisvest/netty that referenced this pull request Jan 12, 2026
(cherry picked from commit 2dbc1e7)
chrisvest added a commit to chrisvest/netty that referenced this pull request Jan 12, 2026
(cherry picked from commit 2dbc1e7)
chrisvest added a commit that referenced this pull request Jan 13, 2026
…6133)

(cherry picked from commit 2dbc1e7)
chrisvest added a commit that referenced this pull request Jan 13, 2026
…6132)

(cherry picked from commit 2dbc1e7)