UPTEC IT 24022
Abstract
In the current software development environment, Java remains one of the major languages,
powering numerous applications. Central to Java's effectiveness is the Java Virtual Machine
(JVM), with HotSpot being a key implementation. Within HotSpot, garbage collection (GC) is
critical for efficient memory management, where one collector is Z (ZGC), designed for minimal
latency and high throughput.
ZGC primarily uses bump-pointer allocation, which, while fast, can lead to fragmentation issues.
An alternative allocation strategy involves using free-lists to dynamically manage memory
blocks of various sizes, such as the buddy allocator. This thesis explores the adaptation and
evaluation of buddy allocators for potential integration within ZGC, aiming to enhance memory
allocation efficiency and minimize fragmentation.
The thesis investigates the binary buddy allocator, the binary tree buddy allocator, and the
inverse buddy allocator, assessing their performance and suitability for ZGC. Although not
integrated into ZGC, these exploratory modifications and evaluations provide insight into their
behavior and performance in a GC context. The study reveals that while buddy allocators offer
promising solutions to fragmentation, they require careful adaptation to handle the unique
demands of ZGC.
The conclusions drawn from this research highlight the potential of free-list-based allocators to
improve memory management in Java applications. These advances could reduce GC-induced
latency and enhance the scalability of Java-based systems, addressing the growing demands of
modern software applications.
Faculty of Science and Technology, Uppsala University, Uppsala.
Supervisor: Erik Österlund. Subject reader: Tobias Wrigstad. Examiner: Roland Bol.
Contents
1 Introduction
1.1 Purpose and Goals
1.2 Delimitations
1.3 Individual Contributions
1.4 Outline
2 Background
2.1 Memory Management
2.2 Fragmentation
2.3 Memory Allocation
2.3.1 Sequential Allocation
2.3.2 Free-List Allocation
2.3.3 Segregated-Fits Allocation
2.4 Garbage Collection
2.5 OpenJDK
2.6 The Z Garbage Collector
2.6.1 Pages
2.6.2 The Garbage Collection Cycle
3 Related Work
3.1 Memory Allocators
3.2 Free-Lists in Garbage Collection
4 Binary Buddy Allocator
4.2 Determining Buddy State
5 Method
6 Improved Buddy Allocators
7 Adapting the Buddy Allocator for Use With ZGC
8 Evaluation Methodology
8.1 Allocator Configurations
8.2 Performance
8.2.1 Measuring Performance
8.2.2 System Specifications
8.3 Fragmentation
8.3.1 Internal Fragmentation
8.3.2 External Fragmentation
8.3.3 Measuring Fragmentation
9 Results
9.1 Memory Overhead
9.2 Performance
9.3 Fragmentation
9.3.1 Internal Fragmentation
9.3.2 External Fragmentation
10 Discussion
11 Future Work
11.1 Integration Into ZGC
11.2 Smaller Minimum Allocation Size
11.3 Further Allocation Optimizations
11.4 Combining Allocation Strategies
12 Conclusions
1 Introduction
Effective memory management is crucial for any software system. It can be catego-
rized into either manual memory management, where the developer is responsible for
both allocating and deallocating memory, or automatic memory management, where the
system handles memory on behalf of the developer. Garbage collection, a type of au-
tomatic memory management, identifies and reclaims memory that is no longer in use.
There are various implementations of garbage collection that achieve this goal.
Java applications run within a Java Virtual Machine (JVM); one implementation is provided by the Open Java Development Kit (OpenJDK). OpenJDK includes various garbage col-
lectors, including the Z garbage collector (ZGC). ZGC organizes memory into many
pages, which are operated on concurrently during garbage collection. Objects are allo-
cated sequentially on these pages through a method known as bump-pointer allocation,
where a pointer tracks the position of the most recently allocated object, increasing it,
or “bumping” it, with each new allocation.
Because bump-pointer allocation always starts from the end address of the last alloca-
tion, it is unable to reuse previously allocated memory on the same page that has since
become free. To resolve this, ZGC either relocates all active objects to a new page,
allowing the original page to be reset, or moves all objects one by one to the top of
the page. An alternative to this method is employing a free-list-based allocator, which
maintains a list of all unoccupied memory, thereby allowing allocations in any available
space, independent of prior allocations.
Using free-lists in garbage collectors is nothing new. Concurrent Mark Sweep (CMS) is
a garbage collector that relies on free-lists. It was part of OpenJDK until its deprecation
and subsequent removal. CMS could perform allocations without the constraints im-
posed by bump-pointer allocation methods, thanks to its use of free-lists. It is clear that
free-lists offer certain advantages, but they may not always be the optimal choice for
every situation. Using a free-list-based allocator within an existing garbage collector
can offer significant advantages, as the garbage collector has more information available about the objects it allocates, which the allocator can exploit. This thesis investigates the
potential integration of a free-list-based allocator within ZGC, with focus on examin-
ing possible adaptations that can boost allocator efficiency within a garbage collection
context, without actual implementation in ZGC.
1.1 Purpose and Goals
The purpose of this thesis is to investigate how it is possible to adapt an existing free-list-
based allocator for use within ZGC, and to evaluate the effects of these adaptations on
allocator performance. The goal is to leverage the insights of this work to direct subsequent
efforts in the domain of free-list-based allocators in garbage collection.
The research questions this thesis seeks to answer are the following:
1. What changes are necessary when adapting an allocator for use within ZGC?
1.2 Delimitations
The primary delimitation of this thesis is the exclusion of allocator integration within
ZGC. The focus is on examining the allocators and their adaptations independently.
Pages in ZGC can be of three types: small, medium, or large. The modifications made
focus solely on small pages, as these are the most common. However, insights gained
from small pages may guide potential adaptations for medium and possibly large pages.
1.3 Individual Contributions
The primary author and principal contributor to this thesis is Casper Norrbin. Casper
independently developed all sections, except Section 2, which was collaboratively au-
thored with Niclas Gärds and Joel Sikström. Both Niclas and Joel pursued their theses
in parallel with this one, with Joel focusing on adapting the two-level segregated fit al-
locator [19] and Niclas on integrating an allocator into ZGC [4]. Joel’s work resulted
in distinct adaptations and outcomes, while Niclas explored the challenges and oppor-
tunities of integrating an allocator into ZGC, a natural extension of both this work and
Joel’s research.
For the section developed in collaboration with Niclas and Joel, the contributions are:
Sections 2.1, 2.2 and 2.3 are authored solely by Joel, Sections 2.4 and 2.5 are authored
solely by Casper, and Section 2.6 is authored solely by Niclas.
1.4 Outline
The rest of this thesis is organized as follows: Section 1 outlines the purpose, goals,
delimitations, and individual contributions. Section 2 provides context on memory
management, fragmentation, memory allocation strategies, and ZGC. Section 3 reviews
existing memory allocators and the use of free-lists in garbage collection. Section 4
details the binary buddy allocator and its allocation, deallocation, and state determina-
tion methods. Section 5 describes the implementation process, from reference design
to evaluation. Section 6 introduces enhancements such as the binary tree and inverse
buddy allocators. Section 7 discusses specific adaptations of the allocators for ZGC.
Section 8 outlines the performance and fragmentation testing approaches. Section 9
presents the findings on page overhead, performance, and fragmentation. Section 10
interprets these results, and Section 11 suggests further research directions. Finally,
Section 12 summarizes key insights and their implications.
2 Background
2.2 Fragmentation
Figure 1 A memory region containing one allocated piece of memory that is 128 bytes
large. However, the user only requires 100 bytes and thus, 28 bytes are wasted.
External fragmentation is illustrated in Figure 2, where a total of 80 bytes is available but dispersed across the
memory region. Consequently, the largest allocation that the system can accommodate
is 32 bytes, as the single largest contiguous chunk of memory is of this size.
Figure 2 A memory region containing several allocated blocks with unused space be-
tween them. Although the total sum of the unused portions is 80 bytes, a single request
of more than 32 bytes cannot be fulfilled.
Effectively managing and reducing fragmentation is crucial for optimizing memory us-
age in long-running programs [5]. For example, if memory becomes too fragmented,
the system may not be able to satisfy allocation requests and be forced to either collect
garbage sooner than necessary or terminate if not using a garbage collector [5].
2.3 Memory Allocation
In this section, we describe two fundamental memory allocation strategies, called sequential allocation and free-list allocation [5]. Additionally, we discuss the more complex case of using multiple free-lists, called segregated-fits.
2.3.1 Sequential Allocation
Sequential allocation is a method used for allocating memory within a contiguous chunk
of memory. In this approach, a pointer is used to track the current position within the
memory chunk. As new objects are allocated, the pointer is advanced by the size of
the object, along with extra memory that might be needed for alignment purposes. For
this reason, sequential allocation is also known as bump-pointer allocation due to the
incremental “bumping” of the pointer with each new allocation.
This approach is simple and efficient, and is highly effective in situations where memory fragmentation is not a significant concern and where a predictable, sequential layout
is desirable. However, it may not be the most suitable choice for all scenarios, espe-
cially where memory is not available in a large contiguous chunk. In that case, a more
sophisticated memory management strategy might be required.
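To make the mechanism concrete, a minimal bump-pointer allocator can be sketched in a few lines of C++ (the structure and names are illustrative, not taken from ZGC):

#include <cstddef>
#include <cstdint>

// Minimal bump-pointer allocator over one contiguous chunk.
// 'alignment' is assumed to be a power of two.
struct BumpAllocator {
  uintptr_t top;  // address of the next free byte
  uintptr_t end;  // one past the last usable byte

  void* allocate(size_t size, size_t alignment) {
    uintptr_t aligned = (top + alignment - 1) & ~(alignment - 1);
    if (aligned + size > end) return nullptr;  // chunk exhausted
    top = aligned + size;                      // "bump" the pointer
    return reinterpret_cast<void*>(aligned);
  }
};

Note that individual objects cannot be freed; the only way to reclaim memory is to reset top for the whole chunk, which is essentially what ZGC does when it resets a page (Section 2.6).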
2.3.2 Free-List Allocation
A free-list allocator maintains a list of the free blocks in memory, from which a block is chosen to serve each request. Common policies for choosing a block are the following:
First-fit The first block in the free-list that is large enough to fulfill the memory alloca-
tion request is selected. This method minimizes search time but does not consider
the possibility of a more suitable block elsewhere in the list. This often leads to
more fragmentation than is necessary, because blocks are split more often. Each
request searches from the beginning of the list.
Next-fit The search for a suitable block begins in the free-list, following a similar pro-
cess to that described for first-fit. However, in subsequent searches, it resumes
from where the previous search ended, improving efficiency when locating a new
block. This strategy is based on the observation that smaller blocks tend to ac-
cumulate at the beginning of the free-list [5], optimizing the search process by
starting further into the list in each iteration.
Best-fit The entire free-list is searched until the smallest available block that can fulfill
a request is located. This method minimizes fragmentation by selecting the block
that best matches the size of the requested memory, but it comes with the trade-off
of increased search time.
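As an illustration of the first policy, a first-fit search over an intrusive singly linked free-list might look as follows in C++ (a sketch; next-fit and best-fit differ only in how the list is scanned, and splitting of a too-large block is discussed below):

#include <cstddef>

struct FreeBlock {
  size_t size;
  FreeBlock* next;
};

// First-fit: unlink and return the first block that is large enough,
// or nullptr if no block in the list fits the request.
FreeBlock* first_fit(FreeBlock** head, size_t size) {
  for (FreeBlock** link = head; *link != nullptr; link = &(*link)->next) {
    if ((*link)->size >= size) {
      FreeBlock* block = *link;
      *link = block->next;  // unlink from the free-list
      return block;
    }
  }
  return nullptr;
}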
The three policies described above have in common that when a block larger than requested is found, it is split up.
Ideally, blocks are split as rarely as possible to maintain the ability to serve large allo-
cations. If blocks are split too often, many small blocks might accumulate, which might
increase external fragmentation to a level where new requests cannot be fulfilled. Addi-
tionally, splitting blocks less frequently will also mean less merging, or coalescing, of
blocks to larger sizes, which could improve performance.
2.3.3 Segregated-Fits Allocation
Instead of using a single free-list where blocks of many sizes are stored, multiple free-lists that store blocks of similar sizes or size ranges can be used, called segregated-fits. The goal of using multiple free-lists is to narrow down the search space
to fewer blocks, minimizing the time to find blocks large enough to satisfy a request.
It is crucial to note that blocks are logically segregated into their respective free-lists
based on size but are not required to be physically adjacent in memory within the same
free-list.
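A sketch of the core idea in C++, with illustrative class boundaries (each class holds blocks of one power-of-two size, and a request only searches the list of its own class):

#include <array>
#include <cstddef>

struct FreeBlock {
  size_t size;
  FreeBlock* next;
};

constexpr size_t kMinBlock = 16;   // smallest size class, illustrative
constexpr size_t kNumClasses = 8;  // classes of 16, 32, 64, ... bytes

// One free-list per size class.
std::array<FreeBlock*, kNumClasses> free_lists = {};

// Map a request to the index of the smallest class that fits it.
size_t size_class(size_t size) {
  size_t cls = 0;
  for (size_t block = kMinBlock; block < size; block <<= 1) ++cls;
  return cls;  // the caller must check that cls < kNumClasses
}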
Segregated-fits is often employed in real-time systems where predictable and efficient
memory allocation is crucial [5, 8]. The reduced search space and minimized search
time to find suitable blocks increase the probability of timing constraints being met.
2.4 Garbage Collection
This section draws heavily from The Garbage Collection Handbook: The Art of Au-
tomatic Memory Management [5], by Jones et al., which provides a comprehensive
overview of garbage collection methods, both historical and current.
As mentioned in Section 2.1, garbage collection is a form of automatic memory man-
agement where the system identifies and cleans up unused objects, which are considered
garbage. This removes the requirement of managing memory manually, which reduces
the possibility of user-related memory issues occurring. However, this also removes
some control from the user and could lead to an increased memory footprint.
From the perspective of the GC, the user program can be perceived solely as a memory
mutator. This is because the GC is only concerned with the memory management of the
program, not its functionality. Therefore, the threads of the user program are commonly
referred to as mutator threads, or just mutators.
2.5 OpenJDK
OpenJDK [11] is a set of tools for creating and running Java programs, maintained by
Oracle. HotSpot [14] is one of these tools and the reference implementation of the Java
Virtual Machine (JVM) [10]. Specifically, a virtual machine (VM) emulates a distinct
computer to run various programs, which could be anything from full operating systems to more specialized machines. The JVM is designed specifically to execute Java programs.
2.6 The Z Garbage Collector
The following sections will cover the basics of how ZGC allocates data and how it can
dynamically collect garbage. The information is based partly on an overview by Yang
and Wrigstad [21] and partly on the source code of ZGC itself, as available in OpenJDK
version 22.32 [12].
2.6.1 Pages
Table 1 Page sizes in ZGC (taken from Yang and Wrigstad [20]).

Page Type   Page Size   Object Size
Small       2 MB        [16 B, 256 KB]
Medium      32 MB       (256 KB, 4 MB)
Large       ≥ 4 MB      ≥ 4 MB
Pages in ZGC include metadata that allows them to perform allocations efficiently, as well as to ensure safe execution while multiple threads run concurrently. An overview of this metadata is shown in Figure 3; the most important attributes are the Bump Pointer, Live Map, Sequence Number, and Age, which are explained in detail below.
Bump Pointer The bump pointer is a pointer within the page’s memory range that
stores the location of where new allocations are made inside the page. Its func-
tionality and use-case are explained in detail in Section 2.3.1. All allocations
inside pages in ZGC are currently done using bump pointers. In Figure 3, the
bump pointer is displayed below four allocated objects, indicating that the next
position to allocate objects at would be below the fourth object.
Live Map The live map is used to keep track of which objects are currently live and
have been marked as reachable by the program. Objects not marked in the live
map are considered dead. The live map is shown on the right-hand side of Fig-
ure 3. It is constructed during the marking phase of the GC cycle as live objects
are encountered during memory traversal and marked in the live map. The live
map is represented by a bitmap, where each bit is mapped to an 8-byte chunk of
the page. If a bit is set, there is a live object at the corresponding address on the
page. The live map is used during the compacting phase of the GC in order to
know how much memory is alive inside the page.
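Since each bit corresponds to an 8-byte chunk, the bit for a given address can be computed directly. A sketch of the idea (field and function names are illustrative, not ZGC's):

#include <cstddef>
#include <cstdint>
#include <vector>

// Live map sketch: one bit per 8-byte chunk of the page.
struct LiveMap {
  uintptr_t page_start;
  std::vector<uint64_t> bits;  // page_size / 8 bits in total

  void mark(uintptr_t addr) {
    size_t chunk = (addr - page_start) >> 3;          // 8-byte granularity
    bits[chunk >> 6] |= uint64_t(1) << (chunk & 63);  // set the bit
  }

  bool is_live(uintptr_t addr) const {
    size_t chunk = (addr - page_start) >> 3;
    return (bits[chunk >> 6] >> (chunk & 63)) & 1;
  }
};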
Sequence Number Every page has a sequence number denoting during which GC cycle it was created. If the GC is on its Nth cycle, pages created during that GC cycle will have sequence number N. Pages with sequence number N are called allocating pages, and are the only pages on which new objects are allocated for mutator threads. Pages with a sequence number below the current GC cycle (0 to N−1) are called relocatable pages, since there will be no additional allocations done inside them during the GC cycle, and garbage can be reclaimed with the use of relocation. For example, if the current Java program has performed 5 GC cycles, any
allocations after that will be exclusively done on pages with sequence number 5.
If the program decides to run a 6th cycle, only garbage from pages with sequence
numbers 0 through 5 will be collected, and new allocations will be done on pages
with sequence number 6.
Age ZGC is a generational garbage collector, meaning it treats objects of varying ages
differently to improve performance. This is based on the observation that most
objects survive for only a short period and those that survive for longer tend to
live for a long time. This is known as the weak generational hypothesis. ZGC
applies this approach on a page-by-page basis, where all allocations on a page
share the same age, which is conveniently stored on each page instead of for every
object. Objects are grouped into three different age categories: Eden, Survivor
and Old. The Eden age is for first-time allocation, Survivor signifies objects that
have survived one or more GC cycles and Old those that have survived a threshold
of cycles in the Survivor age. Classifying pages this way allows the GC to treat
pages of a certain age differently, which is especially useful for Eden pages, which
tend to accumulate garbage quickly.
Figure 3 Illustration of a page in ZGC showing what kind of metadata is kept track
of to facilitate allocation and object bookkeeping. The page contains three live objects
marked in green and with the corresponding bit set in the live map to the right. Red
objects are previously allocated objects which have since become unreachable garbage
memory.
2.6.2 The Garbage Collection Cycle
Figure 4 shows the state of the heap during the different phases of a garbage collection cycle:
a) The initial state of the heap before the GC cycle starts. The left page has about 30% free memory and the right page has about 50%. The current GC cycle is 1, which tells us that pages with sequence number 1 are allocating objects.
b) The GC cycle has started, and the old pages have been locked. New allocations are
now done on new pages instead of the old ones. The old pages are now guaranteed
to not allocate any new objects, which makes it possible for the marking phase to
begin to traverse the live objects.
c) The GC has finished the marking phase and now has liveness data that indicate
where live objects are allocated inside the relocatable pages.
(a) Initial state of the heap before the first GC cycle. The gray color indicates the page is allocating objects.
(b) The state of the heap after marking has started. The blue color indicates that the page is no longer being allocated on.
(c) The state of the heap after all live objects are marked. The red and green colors indicate that the marking phase has finished, and the pages have liveness data that explain where live objects (green) are located.
Figure 4 Timeline of a garbage collection cycle, showing the state of the heap in the
different phases.
ZGC frees memory marked as garbage by copying live objects from one page to another and re-purposing the source page for new allocations. This compacts the fragmented layout of allocated memory, which makes it possible to fit the live objects of multiple pages onto a single one. The process of copying live objects is also referred to as relocation. When performing relocation, two different scenarios can occur: either there is enough free memory available in some page where we can allocate, or the heap is full. If the heap is full, objects need to be compacted onto the same page that they are already located on, which is called in-place compaction. The first scenario is illustrated in Figure 5. It requires the heap to have enough free space available to create a new page at the time of relocation. The GC can reclaim garbage memory by relocating live objects from sparsely populated pages into a denser configuration inside a new page. Both fragmentation and total heap usage are therefore reduced.
(a) A page has been selected for relocation due to high fragmentation.
(b) A new page is created with a new sequence number that is used as a relocation target.
(c) The objects from the first page are relocated to the new page.
(d) All objects have been copied to the new page and the old page is re-purposed.
Figure 5 An example of how a successful relocation is done when the heap has enough
space to allocate a new page.
Since a garbage collection is started as a reaction to high memory pressure, the available
memory might be limited. If a new page cannot be allocated, the first page is “created”
by compacting its objects in-place, and then continuing by relocating the live objects of
other pages onto that page. An in-place compaction is a potentially expensive operation, which requires the thread performing the re-arrangement to write to the memory of the page while other threads may read its contents. To do this safely, any threads trying to read from the page during this process must be paused, which prevents concurrent access to the page. The process of in-place compaction of a page is illustrated in
Figure 6.
(a) The first live object on the page is moved to the start of the page.
(b) The second live object is moved as close to the start as possible.
(c) All objects have been moved, and the space after is made available.
(d) The bump pointer is moved to the beginning of available space.
Figure 6 An example of how in-place compaction is done to remove fragmentation.
3 Related Work
Given the large number of existing allocator designs, one might conclude that most
problems with dynamic memory allocation have already been solved. In most cases, a
general-purpose allocator designed for use in any system or environment performs well
on average in terms of response time and fragmentation. However, there may be areas
for improvement depending on how much the use case can be narrowed down. For
example, a garbage collector has access to more information about the objects being
allocated than most other users of an allocator.
3.1 Memory Allocators
One of the first memory allocation techniques is the simple buddy allocator, conceived
by Knowlton [6]. The buddy allocator partitions memory into blocks, each sized as a
power of two. Initially, the entire memory is set as a single block. In the first allocation,
this block is divided recursively into halves until the block size closest to the allocation
size is obtained. These new differently-sized blocks can then be utilized for subsequent
allocations. The two resulting blocks from the splitting of a block are called “buddies”.
When a block is deallocated and its buddy is also unallocated, the two blocks are com-
bined into a larger block. This merging process continues recursively as long as the
buddy is also unallocated.
The simple design of the buddy allocator provides ample room for enhancements. Pe-
terson and Norman presented a more generalized version of the buddy allocator [17],
allowing block sizes that are not powers of two and allowing buddies of unequal sizes.
Recent advances have also included making the allocator lock-free [7] and enabling the
same block to exist in multiple sizes simultaneously to improve the efficiency of smaller
allocations [16].
The buddy allocator has seen wide use in real environments and applications. A notable
example is its use within the Linux kernel for allocating physical memory pages [3]. In
addition, the buddy allocator can be combined with various other allocation techniques
to improve efficiency. For example, jemalloc [2], created for the FreeBSD operating
system, is a general-purpose allocator that incorporates multiple strategies, including
the buddy allocator, to achieve overall efficiency.
3.2 Free-Lists in Garbage Collection
Using free-lists in garbage collection systems is not a new concept. For example, the ba-
sic mark-sweep algorithm, as outlined in Section 2.4, necessitates the use of a free-list.
The static position of live objects can lead to significant fragmentation. Consequently,
simpler allocation methods, such as bump-pointer allocation, become impractical. In-
stead, employing a strategy based on free-lists for allocation proves to be a more effec-
tive approach to managing and allocating memory surrounding live objects.
Among the free-list strategies outlined in Section 2.3, the segregated-fits method is particularly noted for its effectiveness [5]. Programs that assume garbage collection generally allocate more than those that handle memory manually. The significance of segregated-fits free-lists therefore grows as the number of allocations a program makes increases.
The Concurrent Mark Sweep (CMS) garbage collector [9], introduced in JDK 1.4, is a
concrete example of free-lists being used within garbage collectors. CMS implements a
concurrent variant of the mark-sweep algorithm, utilizing free-lists for managing mem-
ory. Although it was deprecated in JDK 9, CMS demonstrates the practicality of using
free-lists in garbage collectors. This highlights their versatility and adaptability across
different garbage collection algorithms and implementations.
There is potential for free-list-based allocators to exist within already existing garbage
collection systems. Using multiple allocation strategies in tandem can enhance alloca-
tion efficiency, as can be seen in various allocators [2]. With a free-list-based allocator
integrated into a garbage collection system, the system can leverage the benefits of both
strategies to improve performance. For example, integrating a buddy allocator into ZGC
would allow for fast allocation using bump-pointer allocation, while also providing the
benefits of free-lists for managing memory surrounding live objects in situations where
bump-pointer allocation is not as efficient.
4 Binary Buddy Allocator
The concept of a buddy memory allocator was first introduced by Knowlton [6]. This
section explains how the original buddy memory allocator operates, as designed by
Knowlton. The core idea of the buddy allocator involves dividing memory into blocks
and pairing them as “buddies”. Buddy pairs are split when allocating memory and
merged when deallocating memory to allow allocations of different sizes. The simple
design of the buddy allocator makes it attractive for various applications. For example,
the Linux kernel implements a modified binary buddy allocator to manage physical
memory [3], and jemalloc adopts a buddy allocator for one of its memory allocation
methods [2].
The binary buddy allocator is the original and most basic type of buddy allocator. It
partitions memory into blocks with sizes that are powers of 2. A free list of available
blocks of different sizes is maintained to keep track of the free blocks. Initially, all
memory is one large block. Figure 7 shows the initial state of a binary buddy allocator
with a total memory of 128 bytes. It can be seen that the entire memory is contained
within a single block, which is also included in the free list.
Figure 7 The initial state of the free-list and memory in a binary buddy allocator. Shown
are the memory and the free-list of available blocks.
When requesting memory, a block of the smallest power-of-two size that can accommo-
date the requested memory will be returned. If there are no blocks of that size available,
a larger block will be recursively split into two until a block of the correct size is avail-
able. The split blocks are added to the free list for future allocations. An algorithmic
explanation of this process can be found in Algorithm 1.
Procedure Allocate(size)
1: level ← smallest power of 2 ≥ size
Find a suitable block
2: block ← no_block
3: for i ← level to max_level do
4: if free list for level i is not empty then
5: block ← remove first block from free list of level i
6: break
7: end if
8: end for
9: if block = no_block then
10: return allocation_failed
11: end if
Split the block if necessary
12: while level of block > level do
13: block, buddy ← split block into two buddy blocks
14: add buddy to its corresponding free list
15: end while
16: return block
Algorithm 1 Binary buddy allocation algorithm
As an example, consider Figure 8, which shows the resulting state of the previous figure
after requesting 16 bytes of memory. The initial 128-byte block has been split multiple
times to fulfill the request, with the leftmost block being returned and the addition of
various block sizes to the free list.
Figure 8 The state of the free-list and memory in a binary buddy allocator after one
16-byte allocation. Shown are the memory and the free-list of available blocks. Each
new row represents one step in the allocation process.
Procedure Deallocate(block)
1: level ← level of block
2: while true do
3: buddy ← find buddy of block
4: if buddy is not free then
5: add block to free list of level level
6: break
7: else
8: remove buddy from free list of level level
9: block ← merge block and buddy
10: level ← level + 1
11: end if
12: end while
Algorithm 2 Binary buddy deallocation algorithm
4.2 Determining Buddy State
When deallocating a block, the allocator must determine the state of its corresponding
buddy. This is done using a bitmap, with each entry indicating if a block has been allo-
cated or not. The allocator assigns a unique number to each block that can potentially
be allocated. Figure 9 illustrates all possible blocks and their respective numbering. The
bitmap requires a total of 2n − 1 entries, where n is the number of levels of potential
block sizes.
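Assuming the common numbering where the root block is 1 and the children of block i are 2i and 2i+1 (the numbering in Figure 9 may differ in detail), the index and address arithmetic is simple. A sketch:

#include <cstddef>
#include <cstdint>

// Implicit-tree numbering: root = 1, children of block i are 2i and 2i+1.
size_t parent(size_t i) { return i / 2; }
size_t buddy(size_t i)  { return i ^ 1; }  // sibling of any non-root block

// A block's buddy can also be found from its address alone: flip the
// bit corresponding to the block size in the page-relative offset.
uintptr_t buddy_address(uintptr_t page_start, uintptr_t block, size_t size) {
  return page_start + ((block - page_start) ^ size);
}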
The bitmap is correlated with the available blocks in the free-list and is modified both
when splitting and merging blocks. For example, Figure 10 shows the state of the bitmap
of a binary buddy allocator after one 16-byte allocation. It can be seen that the largest
block has been split down to the smallest size and that the buddies of the used blocks
are marked in the bitmap as available.
Figure 10 The state of the bitmap and free-list of a binary buddy allocator after one
16-byte allocation.
5 Method
The work in this thesis was carried out in five phases:
1. Implement the Reference Design: Implement the original binary buddy alloca-
tor to establish a foundation for further development.
2. Investigate Key Aspects: Identify important aspects of using a memory alloca-
tor within a garbage collection context, particularly ZGC, through research and
experimentation with source code.
3. Make Modifications Based on Existing Research: Modify the reference alloca-
tor based on prior research to enhance its performance.
4. Adapt for Garbage Collection: Further modify the allocator for specific use
cases within garbage collection.
5. Evaluate Performance: Measure memory fragmentation and allocation speed
for both the reference allocator and the modified versions.
The initial phase involved implementing the original binary buddy allocator designed
by Knowlton [6]. This implementation provided a deeper understanding of its function-
ality and established a solid foundation for subsequent development. The base allocator
was tested using both unit tests and real-world applications. By exporting the alloca-
tor as a shared library within a wrapper, it was possible to test real-world programs by
overriding the malloc/free operations using the LD_PRELOAD environment variable.
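A sketch of such a wrapper, assuming hypothetical buddy_alloc/buddy_free entry points exposed by the allocator library (a complete wrapper must also override calloc, realloc, and related functions):

// wrapper.cpp -- build with: g++ -shared -fPIC wrapper.cpp -o libbuddy.so
// run a program with it: LD_PRELOAD=./libbuddy.so ./program
#include <cstddef>

// Hypothetical entry points of the buddy allocator library.
void* buddy_alloc(size_t size);
void  buddy_free(void* ptr);

extern "C" void* malloc(size_t size) { return buddy_alloc(size); }
extern "C" void  free(void* ptr)     { buddy_free(ptr); }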
Once the base allocator was implemented, the goal of phase two was to understand the
nuances of memory allocation within the context of garbage collection and ZGC. This
involved conducting a literature review, examining specific parts of the ZGC source
code, and performing experiments. This phase was crucial for grasping the complexities
of ZGC and its memory allocation mechanisms, identifying potential areas for modifi-
cation, and determining which changes would have the greatest impact. The changes
made and the reasons behind them can be found in Sections 6 and 7.
With a thorough understanding of the binary buddy allocator and ZGC, the third phase
involved integrating modifications inspired by prior research. The most significant
changes were applied, sometimes requiring the implementation of conflicting concepts,
which were then compared and evaluated. The focus was on leveraging existing work
to enhance overall performance, rather than to start from scratch. These adjustments
were not exclusively tailored for garbage collection but aimed to improve general per-
formance or produce positive results when used in garbage collection processes.
In the fourth phase, improvements were implemented that were specifically relevant
to the allocator’s use in a garbage collection environment. These modifications were
derived from insights and knowledge acquired in the preceding phases, rather than based
on existing research. The goal was to introduce features tailored for specific scenarios
within ZGC without compromising performance in other aspects of ZGC or elsewhere.
The implementation process was incremental, with each minor modification being tested
and validated before proceeding to the next step. As the allocator evolved, the number
of unit tests increased, which ensured comprehensive test coverage of its capabilities.
Additionally, the allocator was tested with preexisting programs after each modification
to ensure stability and performance.
The final phase involved evaluating the reference allocator and the modifications. Sev-
eral key metrics were considered during the evaluation, including fragmentation, mem-
ory usage, and allocation speed. These metrics were essential for assessing the alloca-
tor’s performance and comparing different versions. Which metrics matter most depends on the requirements of the specific usage scenario. A detailed explanation of the evaluation process can be found in Section 8.
6 Improved Buddy Allocators
As of the time of writing, the original binary buddy allocator is nearly 60 years old. Over
the years, numerous improvements and advances have been made, including refining
the initial design [17] and adapting it for new applications and scenarios [7, 16]. This
section covers two notable adaptations of the binary buddy allocator: the binary tree
buddy allocator and the inverse buddy allocator. These adaptations aim to improve
the performance of the original buddy allocator in specific scenarios, such as when
allocating small blocks or when minimizing block splitting and merging.
The original binary buddy allocator keeps track of available blocks by organizing them
into different free-lists. Consequently, every time an allocation or deallocation occurs,
entries must be added to and removed from these lists. If a block needs to be split or can
be merged, more operations are needed. To avoid these costly free-list manipulations,
an implicit free-list can be implemented using a binary tree structure. This tree includes
all the information in the original free-list.
An example of such an implementation is that of Restioson [18]. In this implementation,
the tree stores the maximum allocation size for each block or its children. Since blocks
are evenly divided into two, only the block’s “level” needs to be stored to determine its
size. The initial state of a binary tree buddy allocator is illustrated in Figure 11. Each
level of the tree represents a specific size, and the root level corresponds to the entire
memory space.
Figure 11 The initial state of the binary tree and its free-list equivalent in a binary tree
buddy allocator.
When allocating with a binary tree, the level required for the allocation size is first
determined. Then, a binary search is started from the root node of the tree. The node
is checked to see if the available level is high enough; if it is, it checks the left child.
If the left child does not have a large enough block available, the right child is chosen
instead. This search process continues recursively on the child node until the correct
level within the tree is reached. Upon finding a suitable node, its level is set to 0, and
the parent nodes are updated to reflect the new state of their children. An algorithmic
explanation of this process can be found in Algorithm 3.
Procedure Allocate(size)
1: level ← smallest power of 2 ≥ size
2: node ← root of tree
Find an available node
3: while level of node > level do
4: left, right ← get children of node
5: if level available in left ≥ level then
6: node ← left
7: else
8: node ← right
9: end if
10: end while
11: if level available in node < level then
12: return allocation_failed
13: end if
14: mark node as unavailable
Update the tree
15: for all parent nodes starting from node do
16: level available in parent ← max(left, right)
17: end for
18: return node
Algorithm 3 Binary tree allocation algorithm
To illustrate this, Figure 12 shows the state of a binary tree buddy allocator following
the allocation of a 32-byte and a 16-byte block. It can be seen that the initial allocation
is placed in the leftmost node at the second-lowest level, while the subsequent allocation
is placed to the right of it at the lowest level. Consequently, the entire left branch of the
binary tree now only contains one 16-byte block, as indicated by one being stored in the
respective nodes.
Figure 12 The state of the binary tree and its free-list equivalent in a binary tree buddy
allocator after one 32-byte and one 16-byte allocation.
Deallocation in a binary tree buddy allocator is also a simple process. Initially, the
node that has been deallocated is set to its corresponding level, followed by updating its
parent nodes to reflect this change. When the buddy of a block is also free, signified by
it having the same level as the recently deallocated node, they are merged. The merging
operation in a binary tree involves increasing the level of the parent by one, indicating
that the entire memory space it represents is now available. An algorithmic explanation
of this process can be found in Algorithm 4.
Procedure Deallocate(block)
1: level ← level of block
Set the level of the node
2: node ← level
Update the tree
3: for all parent nodes starting from node do
4: if left = right and left is completely free then
5: level available in parent ← left + 1
6: else
7: level available in parent ← max(left, right)
8: end if
9: end for
Algorithm 4 Binary tree deallocation algorithm
Allocating the smallest block size in a binary buddy allocator requires the most total
block splits. To improve the speed of allocating these blocks, adjustments are necessary.
Park et al. [16] proposed an alternative method for organizing the metadata of a binary
buddy allocator, which they call an inverse buddy allocator (iBuddy allocator). An
iBuddy allocator can allocate individual blocks in constant time, although at the expense
of slower allocations for larger blocks. Rather than dividing larger blocks into smaller
ones to fulfill allocations, all blocks are initially split, and merging is only done during
larger allocations.
In an iBuddy allocator, blocks of different sizes that overlap the same memory are all
present in the free-list and bitmap at the same time. In the iBuddy allocator, the bitmap
takes on additional meaning. When a block is marked as free, it implies that all smaller
blocks at that memory location are also free and available for allocation. This enables
the possibility of allocating a smaller size in a larger block, and keeping the rest of the
space in that block available for future allocations. Consider Figure 13, which shows a
valid initial state of an iBuddy allocator. It can be seen that all potential addresses for
the smallest-size blocks are present in the free-list but at different levels. For example,
the first block is available at the uppermost level, while the second block is available at
the lowest level.
Figure 13 The initial state of the free-list and bitmap in an inverse binary buddy alloca-
tor.
Due to the new structure of the bitmap, it is possible to allocate single-sized blocks
in place of larger blocks without affecting other blocks in the bitmap. When using an
iBuddy allocator, the largest available block in the free-list is always utilized for alloca-
tion, irrespective of the allocation size. Consider Figure 14, which shows an allocation
of one block of the smallest size. The largest block from the preceding figure has been
taken out of the free-list and bitmap, whereas the remainder of the free-list and bitmap
remains unaltered. This allocation did not influence other blocks, and all other smallest-
size blocks are still accessible at different levels. For the next smallest-size allocation,
the 64-byte block would be used. No explicit block splitting is needed, and by storing
the highest level of the free-list with available blocks, the allocation of smallest-sized
blocks can be done in constant time.
Figure 14 The state of the free-list and bitmap in an inverse binary buddy allocator after
one 16-byte allocation.
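A sketch of this constant-time path (illustrative names; the corresponding bitmap update is elided):

#include <array>
#include <cstddef>

struct FreeBlock { FreeBlock* next; };
constexpr int kLevels = 4;  // number of block sizes, illustrative

// O(1) smallest-size allocation: pop the head of the highest non-empty
// free-list. No splitting is needed, since a free block implies that
// all smaller blocks inside it are free as well.
FreeBlock* ibuddy_alloc_small(std::array<FreeBlock*, kLevels>& lists,
                              int& highest) {
  if (highest < 0) return nullptr;  // no memory available
  FreeBlock* block = lists[highest];
  lists[highest] = block->next;     // unlink the head
  while (highest >= 0 && lists[highest] == nullptr)
    --highest;                      // maintain the highest-level hint
  return block;
}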
Fast allocation speed for small blocks comes at the cost of slow allocation speed for
larger blocks. In contrast to a binary buddy allocator, blocks cannot simply be removed,
as there are now overlapping smaller blocks present in the free-list and bitmap. These
must be removed to avoid allocating the same memory location twice. An algorithmic
explanation of this process can be found in Algorithm 5.
Consider Figure 15, which shows the same allocator as in the previous figure, now with
an additional 64-byte allocation. It can once again be seen that the highest-level block
has been removed, but now, all the blocks under it have also been cleared from both the
free-list and the bitmap. For larger blocks, the allocation speed is proportional to the
number of smallest-sized blocks needed to fill the requested size.
Procedure Allocate(size)
1: level ← smallest power of 2 ≥ size
2: block ← remove the first block from the highest-level non-empty free list
3: if level = 0 then
4: return block
5: else if level of block < level then
6: return allocation_failed
7: end if
Remove all lower blocks
8: for i ← level to 0 do
9: for all possible blocks in level i do
10: remove block from free list at level i
11: end for
12: end for
13: return block
Algorithm 5 iBuddy allocation algorithm
Figure 15 The state of the free-list and bitmap in an inverse binary buddy allocator after
one 16-byte allocation and one 64-byte allocation.
In an inverse buddy allocator, no blocks are explicitly merged during deallocation; in-
stead, the level at which they are placed back into the free-list indicates the size of the
available block. During deallocation, the status of the block’s buddy is checked; if it is
free, instead of merging, the deallocated block rises one level. This process repeats until
a buddy is no longer free, after which the block is inserted at that layer. An algorithmic
explanation of this process can be found in Algorithm 6.
Procedure Deallocate(block)
1: for all lowest-level blocks current_block that fit in block do
2: level ← 0
3: buddy ← find buddy of current_block at level
4: while buddy is free do
5: level ← level + 1
6: buddy ← find buddy of current_block at level
7: end while
8: add current_block to free list at level level
9: end for
Algorithm 6 iBuddy deallocation algorithm
As an example of the deallocation process, consider Figure 16, which shows the previ-
ous figure, now after deallocating the first 16-byte allocation. The buddy is first checked
at the bottom level, then again at the level above that, to finally end up at the third level,
where it is inserted and marked in the bitmap as free. The complexity of deallocating
single blocks is the same as that of the binary buddy allocator, but less work needs to be
done, as no explicit merging is done, only checks for the blocks’ buddies.
Figure 16 The state of the free-list and bitmap in an inverse binary buddy allocator after
one 16-byte allocation, one 64-byte allocation, and one 16-byte deallocation.
Similar to allocation, the deallocation speedup for smaller blocks comes at the cost
of larger blocks. When deallocating a large block, all the smaller blocks within that
block also need to be inserted back into the free-list and bitmap. Thus, the operation of
deallocating a large block is equivalent in cost to individually deallocating each of the
smallest-size blocks that make up the larger block. This is more costly than a binary
buddy, where larger blocks require the least work as they merge the fewest number of
times.
7 Adapting the Buddy Allocator for Use With ZGC
As discussed in Section 6, there is ample room to improve the original binary buddy
allocator. The two designs discussed have different strengths, so both are implemented
to compare and contrast with each other and the original design. The binary tree allo-
cator closely resembles the binary buddy allocator in its allocation method, whereas the
iBuddy allocator differs significantly. Both allocators are investigated to determine the
most efficient allocation and deallocation processes for all possible allocatable sizes.
Further enhancements can be developed that are orthogonal to the specific allocator
implementations. These enhancements can be general optimizations or specifically tai-
lored to the context of operating within ZGC. Constraining the use case enables the
allocators to assume access to additional information or to narrow their functionality.
This section discusses the adaptations made to the allocators to improve their perfor-
mance within ZGC.
There are numerous scenarios in which a garbage collector could benefit from an ad-
vanced memory allocator, many of which are beyond the scope of this thesis. Each
scenario has distinct characteristics and unique requirements. Improvements should
enhance specific scenarios without causing adverse effects in other situations. If a mod-
ification improves one use case but negatively impacts another, it should be possible
to disable that change. This flexibility allows each scenario to utilize the most suit-
able configuration for that specific case by enabling or disabling the desired features.
The adaptations made are independent of one another, allowing for easy configuration changes of specific features.
Making the allocator configurable improves its applicability to a wide range of scenar-
ios. Although this versatility is advantageous, it may lead to limitations in certain in-
stances. Prioritizing configurability can hinder the allocator from being fully optimized
for one particular use case. Significant modifications to enhance one specific scenario
could adversely affect others. The choice was made to prioritize configurability instead
of restricting the allocator to a single use case. This decision allows the allocator to be
used in a variety of scenarios, with the user selecting the most suitable configuration for
their specific needs.
For some scenarios, a specific metadata implementation may be more beneficial than
others. The metadata structures can be altered to either prioritize memory efficiency or
speed, depending on needs. This section covers the choices made regarding the metadata
storage of the allocators.
The design of the buddy allocator requires storing additional metadata separate from
the data on the page. This metadata, which stores the status of possible blocks, incurs
a fixed overhead for each page allocated, rather than increasing with each allocation.
The size of the metadata is influenced by the total number of possible blocks rather than
their size. Consequently, larger block sizes lead to less proportional overhead, whereas smaller block sizes result in significantly higher proportional overhead.
The basic binary buddy allocator is the most efficient in terms of memory usage, re-
quiring the least data. The state of each potential block must be stored, which can be
represented by a single bit. This data is only used during deallocation, so the states of
two buddy blocks can be merged into one bit using XOR operations, effectively halving
the memory requirement. This optimization is not possible for the iBuddy allocator, as
blocks can be deallocated without knowing the state of their buddy.
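A sketch of the XOR trick, with one bit per buddy pair (for buddies numbered i and i XOR 1, the pair index can be taken as i/2): the bit holds the XOR of the two buddies' allocated-states, so it is simply flipped on every allocation and deallocation of either buddy.

#include <cstddef>
#include <vector>

// bits[pair] = allocated(left buddy) XOR allocated(right buddy).
void on_alloc(std::vector<bool>& bits, size_t pair) {
  bits[pair] = !bits[pair];
}

// When freeing one buddy, flip the bit first. A resulting 0 means the
// other buddy is free as well, so the pair can be merged.
bool free_and_should_merge(std::vector<bool>& bits, size_t pair) {
  bits[pair] = !bits[pair];
  return !bits[pair];
}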
The binary tree allocator retains the design of the original binary buddy allocator but
stores its metadata in a different format. This approach allows for faster allocation and
deallocation, as there is no explicit free-list that needs to be managed.
The binary tree is stored as a flattened byte array, with each level stored contiguously
following the preceding one. Most levels have each byte representing the value of a
single block, but the lowest levels deviate from this pattern. Each node in the tree holds
the maximum possible level that can be allocated within it or its children. The smallest
blocks can only store 1 or 0, enabling memory optimization by compacting and storing
8 blocks within each byte in the array. Optimizing this level, which contains half of
the total blocks, results in memory savings of 43.75%. Similarly, the data for the next
two levels can be compacted to 2 bits per block, further reducing memory by 28.13%.
Higher levels require at least 4 bits of information and constitute a smaller fraction of
total blocks, so their memory is not optimized to avoid performance impacts from bit
operations.
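A sketch of the layout arithmetic for such a flattened tree (root at index 1, children of node i at 2i and 2i+1, level L occupying indices [2^L, 2^(L+1))); the 1- and 2-bit packing of the lowest levels and the merging rule of Algorithm 4 are omitted here:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct FlatTree {
  std::vector<uint8_t> levels;  // max allocatable level per node

  // Propagate a change upward after allocating a node, as in
  // Algorithm 3: a parent offers the best of its two children.
  void update_parents(size_t node) {
    for (node /= 2; node >= 1; node /= 2)
      levels[node] = std::max(levels[2 * node], levels[2 * node + 1]);
  }
};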
When deallocating, the allocator must determine the correct block to deallocate based
on the memory address. Since an address can belong to any block level, additional
information is required. The simplest solution is to require the user to provide the
allocated size, allowing the allocator to calculate the deallocation level. This method
is fast and incurs minimal overhead as the allocator does not need to store additional
data. However, it requires the user to keep track of the allocated size, which may not
always be feasible.
Alternatively, the level of each allocation can be stored inside the block or in a sepa-
rate data structure. Although this is simple and allows for fast lookups, it is not very
memory-efficient. If the data is stored within a block, the usable space decreases. Stor-
ing data separately for each possible block location results in a slightly larger but con-
stant overhead.
A third approach uses a bitmap to track which blocks have been split. This method is
very memory-efficient but requires more operations to find the correct blocks. Figure 17
illustrates this bitmap after a 16-byte block allocation, showing all blocks above the
allocated block marked as split. The smallest blocks do not require an entry in the
bitmap, as they cannot be split. During deallocation, the allocator starts at the top of the
bitmap and traverses down until the block is no longer marked as split.
Figure 17 The state of the split blocks-bitmap and free-list of a binary buddy allocator
after one 16-byte allocation.
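A sketch of this downward traversal (offsets and block sizes measured in units of the minimum block size; names are illustrative):

#include <cstddef>
#include <vector>

// Recover the level of an allocated block from its offset by walking
// the split bitmap from the root until the block is no longer split.
size_t find_level(const std::vector<bool>& split, size_t offset,
                  size_t top_level) {
  size_t index = 1;                            // root of the implicit tree
  size_t level = top_level;
  size_t half = size_t(1) << (top_level - 1);  // half of the root block
  while (level > 0 && split[index]) {
    index = 2 * index + ((offset & half) ? 1 : 0);  // child holding offset
    half >>= 1;
    --level;
  }
  return level;  // the block is deallocated at this level
}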
All three options are implemented and can be configured in the allocators, with the most suitable option depending on the specific usage scenario. For general use, the bitmap approach is the most memory-efficient across various allocator configurations.
For use within ZGC, data structures responsible for tracking size can be omitted, dele-
gating this function to the garbage collector. The collector has sufficient data from its
live analysis and object headers to compute the size of each object and allocation. This
approach eliminates additional overhead, making optimal use of existing resources.
In certain situations, using an advanced allocator to manage memory may not be the
most efficient option. For instance, bump-pointer allocation could be used initially,
switching to a more advanced allocator when significant external fragmentation arises.
This approach maximizes the speed advantage of bump-pointer allocations while lever-
aging advanced allocators when necessary.
In the context of ZGC, the allocator must be able to allocate around memory already
in use. This imposes limitations on the allocator, as it no longer “owns” the memory it
uses for allocation. The allocator instead has to allocate in holes of available memory.
This affects where the allocator metadata can be stored, as it cannot be placed within the allocator's own memory region.
When working with memory not fully controlled by the allocator, the allocator starts
with no available memory for allocation. The user must indicate the locations of oc-
cupied memory or the intervals between them, defining free regions that the allocator
can use. In ZGC, live analysis of a page provides information on live objects, enabling
deallocation of the intervals between them.
To implement functionality for this scenario, the allocator’s initial state is fully occupied
with the smallest-sized blocks, avoiding accidental merging of blocks overlapping with
occupied regions. During the deallocation of a specific range, the largest fitting blocks
are added to the free-list, and each block and its split children are marked as no longer
split. Figure 18 illustrates this process, showing blocks added to the free-list covering
the entire free range.
Figure 18 The blocks added to the free-list after deallocating the range up until the last
block.
Splitting and merging blocks is costly and should be avoided when possible. The binary buddy allocator merges blocks whenever possible, often requiring immediate re-splitting for subsequent allocations. Barkley and Lee [1] designed a modified buddy allocator that delays block merging; as long as the allocation size distribution remains consistent, the costs of splitting and merging can be avoided. A simpler version of Barkley and Lee’s design is implemented, with a second free-list (the lazy layer) on top of the buddy allocator (the buddy layer). This separate free-list performs no splitting or merging and simply stores blocks of different sizes.
When deallocating with a lazy layer, the block is inserted into the lazy layer’s free-list. If the lazy layer has already reached a certain threshold of blocks for that size, the block is instead inserted into the buddy layer as normal. During allocation, the lazy layer is checked first for a suitable block, followed by splitting blocks from the buddy layer if necessary. If both steps fail, the lazy layer can be emptied to allow smaller blocks to be merged to meet the required allocation size.
Different block sizes warrant different thresholds, as the frequency of each size differs. Since most allocations are small, smaller blocks should have higher thresholds. Keeping large blocks in the lazy layer would also reserve them for that size alone, which could prevent smaller allocations from being fulfilled. In the implementation, the smallest block size therefore has the highest threshold, with each level above it halving the threshold of the level below. The default threshold for the smallest size is 1000. A high value is desirable because ZGC often deallocates many objects and then allocates many objects of the same size; the lazy layer needs the capacity to absorb such bursts.
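A condensed sketch of the lazy layer is shown below (hypothetical names; the private members stand in for the buddy layer). The threshold table starts at 1000 for the smallest size class and halves per level, as described above:

#include <cstddef>
#include <vector>

struct Block { Block* next; };

// One lazy free-list per size class (level 0 = smallest block size).
class LazyLayer {
public:
    explicit LazyLayer(std::size_t numLevels)
        : head_(numLevels, nullptr), count_(numLevels, 0),
          threshold_(numLevels, 0) {
        std::size_t t = 1000;           // default for the smallest size
        for (std::size_t lvl = 0; lvl < numLevels; ++lvl) {
            threshold_[lvl] = t;
            t /= 2;                     // halve per level above
        }
    }

    // Deallocation: keep the block lazily unless this size class has
    // reached its threshold, in which case it goes to the buddy layer.
    void deallocate(Block* b, std::size_t level) {
        if (count_[level] < threshold_[level]) {
            b->next = head_[level];
            head_[level] = b;
            ++count_[level];
        } else {
            buddyFree(b, level);
        }
    }

    // Allocation: lazy layer first, then the buddy layer; as a last
    // resort, flush the lazy layer so smaller blocks can merge.
    Block* allocate(std::size_t level) {
        if (Block* b = head_[level]) {
            head_[level] = b->next;
            --count_[level];
            return b;
        }
        if (Block* b = buddyAllocate(level))
            return b;
        flushToBuddyLayer();
        return buddyAllocate(level);
    }

private:
    void flushToBuddyLayer() {}                           // stand-in
    Block* buddyAllocate(std::size_t) { return nullptr; } // stand-in
    void buddyFree(Block*, std::size_t) {}                // stand-in

    std::vector<Block*> head_;
    std::vector<std::size_t> count_;
    std::vector<std::size_t> threshold_;
};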
In the original binary buddy allocator, the largest block equals the entire provided mem-
ory, which is then split into smaller blocks during allocation. This approach can degrade
performance in multithreaded programs with concurrent allocations, as only one alloca-
tion can split a large block at a time.
To avoid this, the maximum block size is set lower than the entire memory size, dividing
the memory into smaller regions, each with a normal buddy block structure. Park et
al. [16] suggest this for the iBuddy allocator. For example, splitting memory into two
regions, each with a maximum block size of half the total memory, allows for two
concurrent allocations. If one region is prioritized, fragmentation would be reduced as
one region would fulfill smaller allocations, leaving space for larger allocations in the
second region.
In ZGC, the maximum allocation size in a small page is 256 KiB, a fraction of the page
size of 2 MiB. This allows for eight allocations of the maximum size within a page. The
allocator can be divided into eight regions, each covering one-eighth of a page. Fewer
regions would limit the number of concurrent allocations, while more regions would
restrict the maximum allocation size.
One allocator controls all regions, keeping shared metadata such as the lazy layer, while each region has its own buddy structure. This design allows concurrent allocations in different regions, provided the allocations are evenly distributed. Scalability extends up to the number of regions; additional allocations must wait for prior allocations to complete.
Situations could arise where some regions become more heavily utilized than others,
decreasing throughput. When a thread initiates an allocation, it is assigned a specific
region. If another thread is allocating in that region, it checks the subsequent regions. If
all regions are in use, the thread waits. Upon entering a region, the necessary allocation
logic is performed. If an allocation fails due to insufficient space, the thread exits the
region and retries in the next one. An allocation request fails only when no region can accommodate it.
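The probing logic could look roughly like the following sketch (hypothetical names; for brevity it spins past busy regions, whereas the description above has fully blocked threads wait):

#include <cstddef>
#include <cstdint>
#include <mutex>

constexpr std::size_t kNumRegions = 8;

struct Region {
    std::mutex lock;
    void* tryAllocate(std::size_t) { return nullptr; } // stand-in for buddy logic
};

Region regions[kNumRegions];

void* allocate(std::size_t size, std::size_t home) {
    std::uint32_t failed = 0;  // bitmask of regions that lacked space
    for (std::size_t i = 0; failed != (1u << kNumRegions) - 1;
         i = (i + 1) % kNumRegions) {
        std::size_t idx = (home + i) % kNumRegions;
        std::unique_lock<std::mutex> guard(regions[idx].lock,
                                           std::try_to_lock);
        if (!guard.owns_lock())
            continue;                 // busy: check the subsequent region
        if (void* p = regions[idx].tryAllocate(size))
            return p;
        failed |= 1u << idx;          // insufficient space in this region
    }
    return nullptr;  // every region was entered and could not fit the request
}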
All the mentioned adaptations can be applied on top of any of the three buddy allocators discussed in Sections 4 and 6. Each adaptation can be enabled or disabled independently, allowing a wide range of configurations from which the user can select the most suitable one for their specific use case. While the adaptations are designed to enhance the allocator’s performance within ZGC, this configurability allows the allocator to be tuned for any use case.
To configure the allocator, the user first selects which base allocator to use. The bi-
nary buddy allocator is the most memory-efficient, the binary tree allocator is a faster
alternative at the cost of memory, and the iBuddy allocator is designed for fast small
block allocations. The user then configures the adaptations: selecting the minimum and
maximum block sizes, the number of regions, the metadata storage method, and the lazy
layer thresholds. The allocator is then ready for use, optimized for the user’s specific
needs.
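As an illustration, such a configuration could be expressed as a simple struct (hypothetical names; the defaults mirror the ZGC-oriented values used in this thesis):

#include <cstddef>

struct AllocatorConfig {
    enum class Base { BinaryBuddy, BinaryTree, IBuddy };
    enum class Metadata { InBlock, Separate, SplitBitmap };

    Base        base          = Base::BinaryTree;  // speed over memory
    std::size_t minBlockSize  = 16;                // bytes
    std::size_t maxBlockSize  = 256 * 1024;        // ZGC's max small allocation
    std::size_t numRegions    = 8;                 // concurrent allocation regions
    Metadata    metadata      = Metadata::SplitBitmap;
    bool        useLazyLayer  = true;
    std::size_t lazyThreshold = 1000;              // smallest size; halves per level
};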
8 Evaluation Methodology
Performance, memory usage, and fragmentation of the allocators are critical metrics
that were evaluated in this study. These are important factors for a memory allocator,
and improvements in one often affect others. This section outlines the evaluation pro-
cess used to assess the performance and memory usage of the allocators, as well as
the fragmentation they introduce. Performance was measured by the time required for
allocation and deallocation requests. Memory usage refers to the additional memory
utilized by the allocator for its operations, and fragmentation refers to the inefficiencies
in memory utilization resulting from the allocation strategy.
As previously stated, a major limitation of this project was that the allocator was not integrated into ZGC, even though this was its intended purpose. Such an integration would have provided the simplest and most intuitive way to test the allocator, as this is the environment in which it would be used. Without it, the challenge was to identify other benchmarks that could indicate the allocator’s likely performance within ZGC.
8.1 Allocator Configurations
The three versions of the buddy allocator—the binary buddy allocator, the binary tree
allocator, and the iBuddy allocator—were tested and compared against each other. They
all share the same base configuration of block sizes and regions, with a minimum block
size of 16 bytes, a maximum block size of 256 KiB, and a total of 8 regions.
8.2 Performance
Performance is influenced by both the type of allocator used and the context of its use.
Different allocators perform differently depending on the task; therefore, their perfor-
mance varies across different programs. Instead of relying on a single program to com-
pare all allocators, the time required for the allocation and deallocation operations was
measured for each allocator. These measurements were used to compare the allocators’
performance across different block sizes and allocation patterns.
8.2.1 Measuring Performance
The time required for a single allocation or deallocation for each block size was ob-
served to provide insight into the performance of each allocator. Additionally, timing
the allocation of a constant memory size using various block sizes demonstrated the
performance of repeated allocations and deallocations.
For the single allocation and deallocation benchmarks, time was measured for every possible block size, ranging from 2⁴ to 2¹⁸ bytes. These sizes match the potential block sizes in a ZGC small page. To minimize noise, the average of 100 000 allocations was calculated for each size and each allocator. The duration was recorded before and after the allocation call using the POSIX function clock_gettime() with the CLOCK_MONOTONIC_RAW clock.
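The measurement loop can be sketched as follows (Allocator is a hypothetical interface, stubbed here with malloc/free; the clock calls are the ones named above):

#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <time.h>

struct Allocator {  // stand-in for the allocators under test
    void* allocate(std::size_t size) { return std::malloc(size); }
    void deallocate(void* p, std::size_t) { std::free(p); }
};

static std::int64_t nowNanos() {
    timespec ts;
    clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
    return std::int64_t(ts.tv_sec) * 1000000000 + ts.tv_nsec;
}

// Mean time of a single allocation of `size` bytes over `iterations`
// runs; the matching deallocation happens outside the timed section.
// Deallocation is measured analogously, timing the deallocate call.
double timeSingleAllocation(Allocator& a, std::size_t size, int iterations) {
    std::int64_t total = 0;
    for (int i = 0; i < iterations; ++i) {
        std::int64_t start = nowNanos();
        void* p = a.allocate(size);
        total += nowNanos() - start;
        a.deallocate(p, size);
    }
    return double(total) / iterations;
}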
For the page-fill benchmark, a contiguous 2 MiB memory block was filled with every possible block size, ranging from 2⁴ to 2¹⁸ bytes, matching ZGC’s allocation sizes and page size. For each allocator and block size, the memory was filled 10 000 times. The duration was recorded using the POSIX function clock_gettime() with the CLOCK_MONOTONIC_RAW clock, starting before the initial allocation and ending after the final one.
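Under the same assumptions, the page-fill loop can be sketched as follows, reusing nowNanos() and the hypothetical Allocator from the previous sketch:

#include <vector>

// Mean time to fill a 2 MiB page with blockSize-byte allocations;
// each round frees everything outside the timed section.
double timePageFill(Allocator& a, std::size_t blockSize, int rounds) {
    constexpr std::size_t kPageSize = 2 * 1024 * 1024;
    std::vector<void*> blocks;
    blocks.reserve(kPageSize / blockSize);
    std::int64_t total = 0;
    for (int r = 0; r < rounds; ++r) {
        std::int64_t start = nowNanos();
        for (std::size_t filled = 0; filled + blockSize <= kPageSize;
             filled += blockSize)
            blocks.push_back(a.allocate(blockSize));
        total += nowNanos() - start;
        for (void* p : blocks)
            a.deallocate(p, blockSize);
        blocks.clear();
    }
    return double(total) / rounds;
}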
These tests showed different performance aspects of the different allocators for various
allocation sizes, which were used to draw several conclusions. To estimate the per-
formance of a particular program, one would examine its allocation distribution and
compare it with the test results to determine the most suitable allocator for that specific
application.
Performance evaluations were performed using two different machines, detailed in Ta-
ble 2. Each memory allocator was compiled using GCC 11.4.0, adhering to the C++14
standard (-std=c++14) and using optimization level two (-O2).
8.3 Fragmentation
Internal fragmentation occurs because allocation sizes are rounded up to the next power of two to align with block sizes, and it is identical across all the allocators. This results in unused memory for every allocation that is not an exact power of two. To quantify
this, two counters were used: one for the requested allocation size and another for the
total allocated size. Each allocation operation incremented these counters, while deal-
location operations decremented them. At any point during program execution, these
two counters could be compared to measure the amount of internal fragmentation. The
percentage of internal fragmentation, or wasted memory, was calculated as:
\[
\text{Internal fragmentation} = \frac{\text{Used memory} - \text{Requested memory}}{\text{Used memory}}
\]
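These counters translate directly into code (a sketch; the atomic types are an assumption for thread safety, not taken from the thesis implementation):

#include <atomic>
#include <cstddef>

std::atomic<std::size_t> requestedBytes{0}; // sizes as requested
std::atomic<std::size_t> usedBytes{0};      // sizes after rounding

void onAllocate(std::size_t requested, std::size_t rounded) {
    requestedBytes += requested;
    usedBytes += rounded;
}

void onDeallocate(std::size_t requested, std::size_t rounded) {
    requestedBytes -= requested;
    usedBytes -= rounded;
}

// (Used memory - Requested memory) / Used memory, at any point in time.
double internalFragmentation() {
    std::size_t used = usedBytes.load();
    return used == 0 ? 0.0
                     : double(used - requestedBytes.load()) / double(used);
}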
External fragmentation occurs when allocated blocks are placed in memory in a way that
inhibits larger allocations. Even if the total space available is greater than the requested
size, there may be no single contiguous block large enough to satisfy that request. The
different allocators have different policies about where blocks are placed, so fragmen-
tation differs between them.
9 Results
This section presents the results of the evaluation described in Section 8. The three
buddy allocators presented in Sections 4 and 6, as well as the adaptations made in Sec-
tion 7, were evaluated based on their performance, memory overhead, and fragmenta-
tion.
9.1 Memory Usage
The binary buddy allocator is the most efficient in terms of memory usage, requiring the least data. As discussed in Section 7.2.1, the allocator only needs to store the state of each block, which can be represented by a single bit. Additionally, the binary buddy allocator can merge the states of two buddy blocks into a single bit. In the ZGC configuration, this resulted in a total overhead of 16 KiB per 2 MiB page, or 0.78%.
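As a rough sanity check of this figure (assuming one state bit per buddy pair over a full binary hierarchy of blocks with 16-byte leaves):

\[
2 \cdot \frac{2\,\text{MiB}}{16\,\text{B}} - 1 \approx 2^{18} \text{ blocks}, \qquad
\frac{2^{18}}{2}\,\text{bits} = 2^{17}\,\text{bits} = 16\,\text{KiB}, \qquad
\frac{16\,\text{KiB}}{2\,\text{MiB}} \approx 0.78\%.
\]

Storing a full bit per block rather than per pair doubles this to 32 KiB, which matches the iBuddy figure below.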
The iBuddy allocator also requires storing each block’s state, but the same optimizations are not possible: since operations can be executed on multiple blocks at once, each block needs to set and read its state individually. In the ZGC configuration, this led to an overall overhead of 32 KiB per 2 MiB page, or 1.56%.
The binary tree allocator requires additional information to be stored. Storing only the
state of each block is insufficient, as each block needs to maintain a record of the highest
level accessible below it. However, significant optimizations were applied to the lower
layers, as described in Section 7.2.1, reducing the total overhead to 57 KiB per 2 MiB
page, or 2.78%.
9.2 Performance
The results of the single allocation benchmark defined in Section 8.2.1 are presented in
Figure 19. Figures 19a and 19b show the results for machines A and B, respectively. The
performance of the allocators was similar across both machines, differing by a constant
speed factor.
Both the binary buddy and binary tree allocators demonstrated faster allocation of larger
block sizes, with the latter showing superior performance for smaller block sizes. The
iBuddy allocator was faster at allocating smaller blocks but had reduced speed for larger
blocks. Utilizing the lazy layer significantly improved performance across all block
sizes, with the greatest impact at smaller sizes.
The results of the single deallocation benchmark defined in Section 8.2.1 are presented
in Figure 20. Figures 20a and 20b show the results for machines A and B, respectively.
The performance of the allocators was similar across both machines, differing by a
constant speed factor.
Both the binary buddy and binary tree allocators demonstrated faster deallocation of
larger block sizes, with the latter showing superior performance for smaller block sizes,
although less than during allocation. The iBuddy allocator was faster at deallocating
smaller blocks but had reduced speed for larger blocks. Utilizing the lazy layer signifi-
cantly improved performance across all block sizes, with the greatest impact at smaller
sizes.
The results of the page-fill benchmark defined in Section 8.2.1 are presented in Fig-
ure 21. Figures 21a and 21b show the results for machines A and B, respectively. The
performance of the allocators was similar across both machines, differing by a constant
speed factor.
The results illustrate the characteristics of each allocator when making multiple alloca-
tions. The binary buddy and binary tree allocators scale linearly with block size due to
their improved allocation speed for larger blocks, with the binary tree allocator being
the faster of the two. The iBuddy allocator was fastest for the smallest block sizes, but
its inverse scaling makes it only marginally faster when allocating larger blocks, quickly
flattening out as block size increases.
Figure 21 Time taken to allocate 2 MiB across various block sizes and allocator versions ((a) results from machine A, (b) results from machine B). The benchmark was run on both machines, and each point displays the mean of 1 000 allocations.
9.3 Fragmentation
Internal fragmentation remains consistent across all allocator versions, as they round
block sizes equally and use the same allocation pattern. Table 3 presents measures that
quantify the internal fragmentation created by the allocators. Although fragmentation
was considerable, the standard deviation is low, indicating consistent fragmentation with
a low spread. The chosen allocation pattern greatly influenced fragmentation levels,
causing nearly all allocations to require rounding up.
External fragmentation manifested itself differently in each allocator version due to their
different strategies for placing allocated blocks. Figure 22 illustrate the distribution of
free blocks in terms of number and total size for the binary buddy, binary tree, and
iBuddy allocators, respectively. The total free space is less important than the distribu-
tion of free blocks across different block sizes.
The binary buddy allocator showed the largest number of free blocks around 2⁷ bytes, but these constituted a minor portion of the overall free space. Most of the space was occupied by blocks around 2¹⁴ bytes, suggesting that allocations of this size were possible and that most unsuccessful allocations exceeded it.
The binary tree allocator behaved similarly to the binary buddy allocator, with the largest number of blocks around 2⁷ bytes. Although it had more smaller-sized blocks, these made up only a small fraction of the total free space and could be attributed to a different allocation order.
Most of the space was occupied by blocks around 2¹³ bytes, a smaller size than for the binary buddy allocator, and there were fewer large blocks overall. The total free space was also smaller than that of the binary buddy allocator, both of which indicate that the binary tree allocator used memory more efficiently.
The iBuddy allocator stands out due to its aggressive splitting policy. The most common size of free blocks was 2⁶ bytes, which also constituted a substantial portion of the total free space. The largest total space was found around 2¹³ bytes, although it was more evenly distributed between sizes. The concentration at 2⁶ bytes resulted from most allocations exceeding this size, which led to blocks being split down to it. The overall free space was greater than that of the other two allocators, suggesting less efficient use of memory.
10 Discussion
The unique behavior of the iBuddy allocator is evident in the benchmarks. Notably, its performance for large block allocations is significantly worse than that of the other two allocators. Its only advantage is a slight increase in speed at the smallest allocation sizes, and its strategy of aggressively splitting blocks quickly inhibits the allocation of larger blocks. The iBuddy allocator is therefore only feasible when nearly all allocations are among the smallest sizes; if a considerable number of allocations exceed this size, both performance and usable memory decrease notably.
Considering all factors, the binary tree allocator emerges as the most promising option
for implementation in ZGC among the three discussed. It consistently outperforms the
binary buddy allocator in all benchmarks, while the iBuddy allocator is impractical due
to the reasons discussed. Since memory consumption is not a critical concern, the addi-
tional memory expense of the binary tree allocator is justified by its speed improvement.
Although the binary tree allocator is slower at smaller allocation sizes, this drawback
can be mitigated by using the lazy layer, maintaining good performance across all allocation
sizes. This configuration is deemed the most promising for integrating a buddy allocator
within ZGC.
However, the intrinsic weaknesses of buddy allocators remain. Internal fragmentation will be significant, even if it can be reduced. Blocks still need to be aligned to their size, which could become problematic if already-occupied memory is awkwardly placed. In addition, the cost of frequent allocator initialization could become non-negligible.
These issues should be taken into account when integrating with ZGC, as they could
affect both performance and memory efficiency. Future work should focus on strategies
to mitigate these weaknesses and explore alternative allocation algorithms that may offer
improved performance and reduced fragmentation in the context of ZGC.
11 Future Work
The most significant future step is to integrate the allocators into ZGC. As highlighted in Section 1.3, Gärds [4] has, concurrently with this thesis, explored the integration of an allocator into ZGC, focusing on a free-list-based allocator to enhance the efficiency of relocating objects. Future efforts could build on Gärds’ work by incorporating a buddy allocator into ZGC or exploring alternative integration scenarios not considered by Gärds.
When integrating a buddy allocator specifically, the garbage collector can implement strategies to mitigate some of its drawbacks. The primary issue is the considerable internal fragmentation during allocation. To minimize this, ZGC could allocate multiple objects simultaneously to align more closely with block sizes, or retain the fragmented remainder of an allocation for later use. This approach requires that objects allocated together be deallocated collectively; however, as detailed in Section 7.3, this is not an issue when deallocating ranges.
Currently, the minimum allocation size in ZGC is 16 bytes, constrained by the header
size of Java objects. Efforts are underway to decrease the size of these headers, po-
tentially lowering the smallest object size to 8 bytes [13]. Since this development is
ongoing, it was not considered in this thesis.
Reducing the minimum allocation size would introduce a new bottom layer for the al-
locators. Practically, this results in memory overhead per page doubling as it becomes
necessary to manage and track smaller allocations. Additionally, for the binary buddy
and iBuddy allocators, storing the two pointers needed for the free-list would no longer
be feasible. These pointers would need to be transformed into offsets from the begin-
ning of the page. Although the size of the pages makes this adjustment manageable, it
would still introduce some performance overhead.
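A sketch of that transformation (hypothetical names; 32-bit offsets comfortably cover a 2 MiB page and keep both links within an 8-byte block):

#include <cstdint>

// Free-list links stored as offsets from the page base instead of
// full 64-bit pointers.
struct OffsetLinks {
    std::uint32_t next; // offset of the next free block from the page base
    std::uint32_t prev; // offset of the previous free block
};

static inline OffsetLinks* linksAt(std::uintptr_t pageBase,
                                   std::uint32_t offset) {
    return reinterpret_cast<OffsetLinks*>(pageBase + offset);
}

static inline std::uint32_t offsetOf(std::uintptr_t pageBase,
                                     const void* block) {
    return static_cast<std::uint32_t>(
        reinterpret_cast<std::uintptr_t>(block) - pageBase);
}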
Improving the binary tree storage format in the binary tree allocator might be feasible.
Currently, the levels of the tree are stored sequentially, requiring progressively larger
jumps when iterating over the levels. To increase spatial locality, the larger tree could
be broken down into several subtrees stored consecutively. This would place the seg-
ments closer together, reducing the likelihood of cache misses and potentially increasing
performance.
Currently, all the adapted allocators use the same concurrency approach, which involves
dividing memory into regions and only permitting concurrent allocations in different re-
gions. However, since their designs differ, it would be worth investigating different
designs for each allocator. For example, a lock-free mechanism could achieve concur-
rency for the binary tree allocator. As the binary tree allocator navigates its tree struc-
ture, compare-and-swap (CAS) atomic operations could replace the current subtraction
operations. Conflicts could arise with other threads, but threads can back off to resolve
these conflicts. Consequently, it is advisable to keep distributing allocations evenly us-
ing regions, but with the locking code removed, thereby only allocating concurrently
within the same region when necessary.
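A sketch of such a CAS-based claim on a tree node follows (hypothetical; avail stands for a node’s record of the highest level reachable below it):

#include <atomic>
#include <cstdint>

// Try to atomically lower a node's availability from a value that can
// satisfy requiredLevel down to newAvail. On a conflicting update the
// loop re-reads and retries; a caller may instead back off and try a
// different subtree.
bool tryClaim(std::atomic<std::uint8_t>& avail, std::uint8_t requiredLevel,
              std::uint8_t newAvail) {
    std::uint8_t current = avail.load(std::memory_order_acquire);
    while (current >= requiredLevel) {
        if (avail.compare_exchange_weak(current, newAvail,
                                        std::memory_order_acq_rel,
                                        std::memory_order_acquire))
            return true;   // claimed: this thread owns the capacity
        // compare_exchange_weak refreshed `current`; re-check and retry.
    }
    return false;          // another thread consumed the capacity first
}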
Another area of future research could involve exploring hybrid approaches that combine the strengths of various buddy allocator designs. This concept parallels the strategies used in other allocators, such as those in the Linux kernel [3] and jemalloc [2], which combine multiple allocation techniques to optimize performance and memory utilization.
For instance, a hybrid buddy allocator could employ a binary tree allocator for large
block sizes and an iBuddy allocator for small block sizes, thereby leveraging the advan-
tages of both approaches. These allocators could be distributed across different memory
regions, optimizing their performance based on specific memory requirements. Alterna-
tively, one could investigate methods to enable these allocators to work in tandem on the
same memory region, dynamically switching between allocation strategies as needed to
maximize efficiency and minimize fragmentation.
12 Conclusions
This thesis has explored the adaptation and evaluation of different buddy allocators for
integration within the Z Garbage Collector (ZGC). Through analysis and benchmark-
ing, the binary tree allocator emerged as the most promising candidate, consistently
outperforming the traditional binary buddy allocator in all key benchmarks. Although
the iBuddy allocator demonstrated faster allocation speeds for smaller blocks, its poor
performance with larger block allocations and significant internal fragmentation makes
it impractical for uses where larger allocations occur regularly.
By tailoring the allocator to the specific needs of ZGC, several changes have been imple-
mented to enhance efficiency and introduce additional features. Using already-available
data allowed for further optimizations. Notably, a novel approach to initializing and
employing the allocator was implemented: it begins fully occupied, with the user iden-
tifying which sections are empty and available for allocation. These modifications are
designed to facilitate the seamless integration of the allocator into ZGC in the future.
To fully understand how the allocator would perform in ZGC, additional research is
needed. So far, only proxy benchmarks have been utilized, showing promise for poten-
tial real-world applications. Therefore, future efforts should focus on developing strate-
gies to mitigate the fundamental shortcomings of buddy allocators in terms of internal
fragmentation and alignment. Additionally, integrating the allocator within ZGC would
enable extensive testing and would be crucial in refining the allocator’s performance.
References
[1] R. Barkley and T. Lee, “A lazy buddy system bounded by two coalescing delays,”
ACM SIGOPS Operating Systems Review, vol. 23, no. 5, pp. 167–176, 1989.
[2] J. Evans, “A Scalable Concurrent malloc(3) Implementation for FreeBSD,” in Proc. of the BSDCan Conference, Ottawa, Canada, 2006.
[3] M. Gorman, Understanding the Linux virtual memory manager. Prentice Hall
Upper Saddle River, 2004, vol. 352.
[4] N. Gärds, “Partial Compaction in ZGC Using Free-Lists,” Master’s thesis, Uppsala
University, Department of Information Technology, 2024.
[5] R. Jones, A. Hosking, and E. Moss, The Garbage Collection Handbook, 2nd ed. Chapman & Hall/CRC, 2023.
[8] M. Masmano, I. Ripoll, A. Crespo, and J. Real, “TLSF: a New Dynamic Memory
Allocator for Real-Time Systems,” in Proceedings. 16th Euromicro Conference
on Real-Time Systems, 2004. ECRTS 2004., 2004, pp. 79–88. [Online]. Available:
[Link]
[9] Oracle. (2014) Concurrent Mark Sweep (CMS) Collector. Oracle. Accessed
2024-05-10. [Online]. Available: [Link]
guides/vm/gctuning/[Link]
[10] Oracle. (2023) Java Virtual Machine Guide. Accessed 2024-03-08. [Online].
Available: [Link]
[Link]
[13] Oracle. (2024) Project Lilliput. Oracle. Accessed 2024-05-03. [Online]. Available:
[Link]
[14] Oracle. (2024) The Java HotSpot Performance Engine Architecture. Accessed
2024-03-08. [Online]. Available: [Link]
[Link]
[16] H. Park, J. Choi, D. Lee, and S. H. Noh, “iBuddy: Inverse buddy for enhancing
memory allocation/deallocation performance on multi-core systems,” IEEE Trans-
actions on Computers, vol. 64, no. 3, pp. 720–732, 2014.
[17] J. L. Peterson and T. A. Norman, “Buddy systems,” Commun. ACM, vol. 20, no. 6, pp. 421–431, Jun. 1977. [Online]. Available: [Link]
[21] A. M. Yang and T. Wrigstad, “Deep Dive into ZGC: A Modern Garbage Collector in OpenJDK,” ACM Trans. Program. Lang. Syst., vol. 44, no. 4, Sep. 2022. [Online]. Available: [Link]