
UPTEC IT 24022

Degree project 30 credits


July 2024

Adapting and Evaluating Buddy Allocators for Use Within ZGC

ZGC's New Best Friend

Casper Norrbin

Master's Programme in Computer and Information Engineering



Adapting and Evaluating Buddy Allocators for Use Within ZGC


Casper Norrbin

Abstract
In the current software development environment, Java remains one of the major languages,
powering numerous applications. Central to Java's effectiveness is the Java Virtual Machine
(JVM), with HotSpot being a key implementation. Within HotSpot, garbage collection (GC) is
critical for efficient memory management; one of its collectors is the Z Garbage Collector (ZGC), designed for minimal
latency and high throughput.
ZGC primarily uses bump-pointer allocation, which, while fast, can lead to fragmentation issues.
An alternative allocation strategy involves using free-lists to dynamically manage memory
blocks of various sizes, such as the buddy allocator. This thesis explores the adaptation and
evaluation of buddy allocators for potential integration within ZGC, aiming to enhance memory
allocation efficiency and minimize fragmentation.
The thesis investigates the binary buddy allocator, the binary tree buddy allocator, and the
inverse buddy allocator, assessing their performance and suitability for ZGC. Although not
integrated into ZGC, these exploratory modifications and evaluations provide insight into their
behavior and performance in a GC context. The study reveals that while buddy allocators offer
promising solutions to fragmentation, they require careful adaptation to handle the unique
demands of ZGC.
The conclusions drawn from this research highlight the potential of free-list-based allocators to
improve memory management in Java applications. These advances could reduce GC-induced
latency and enhance the scalability of Java-based systems, addressing the growing demands of
modern software applications.

Faculty of Science and Technology


Uppsala University, Uppsala

Supervisor: Erik Österlund
Subject reader: Tobias Wrigstad


Examiner: Roland Bol
Sammanfattning

In today's software development environment, Java remains one of the leading languages, powering many applications. Central to Java's effectiveness is the Java Virtual Machine (JVM), with HotSpot being a key implementation. Within HotSpot, garbage collection (GC) is critical for efficient memory management; one of the collectors is Z (ZGC), designed for minimal latency and high throughput.

ZGC primarily uses bump-pointer allocation, which is fast but can lead to fragmentation problems. An alternative allocation strategy involves using free-lists to dynamically manage memory blocks of various sizes, such as the buddy allocator. This thesis explores the adaptation and evaluation of buddy allocators for potential integration within ZGC, with the goal of improving memory allocation efficiency and minimizing fragmentation.

The thesis investigates the binary buddy allocator, the binary tree buddy allocator, and the inverse buddy allocator, and assesses their performance and suitability for ZGC. Although they are not integrated into ZGC, these exploratory modifications and evaluations provide insight into their behavior and performance in a GC context. The study shows that while buddy allocators offer promising solutions to fragmentation, they require careful adaptation to handle the unique demands of ZGC.

The conclusions drawn from this research highlight the potential of free-list-based allocators to improve memory management in Java applications. These advances could reduce GC-induced latency and improve the scalability of Java-based systems, meeting the growing demands of modern software applications.

Contents

1 Introduction 1
1.1 Purpose and Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Delimitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Individual Contributions . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Background 4
2.1 Memory Management . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Fragmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.3 Memory Allocation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3.1 Sequential Allocation . . . . . . . . . . . . . . . . . . . . . . . 6
2.3.2 Free-List Allocation . . . . . . . . . . . . . . . . . . . . . . . 6
2.3.3 Segregated-Fits Allocation . . . . . . . . . . . . . . . . . . . . 7
2.4 Garbage Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.5 OpenJDK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.6 The Z Garbage Collector . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.6.1 Pages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.6.2 The Garbage Collection Cycle . . . . . . . . . . . . . . . . . . 12

3 Related Work 15
3.1 Memory Allocators . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2 Free-Lists in Garbage Collection . . . . . . . . . . . . . . . . . . . . . 16

4 Binary Buddy Allocator 17


4.1 Allocation and Deallocation . . . . . . . . . . . . . . . . . . . . . . . 17

4.2 Determining Buddy State . . . . . . . . . . . . . . . . . . . . . . . . . 20

5 Method 21

6 Improved Buddy Allocators 23


6.1 Binary Tree Buddy Allocator . . . . . . . . . . . . . . . . . . . . . . . 23
6.1.1 Allocation in a Binary Tree Buddy Allocator . . . . . . . . . . 24
6.1.2 Deallocation in a Binary Tree Buddy Allocator . . . . . . . . . 25
6.2 Inverse Buddy Allocator . . . . . . . . . . . . . . . . . . . . . . . . . 26
6.2.1 Allocation in an Inverse Buddy Allocator . . . . . . . . . . . . 26
6.2.2 Deallocation in an Inverse Buddy Allocator . . . . . . . . . . . 28

7 Adapting the Buddy Allocator for Use With ZGC 30


7.1 Allocator Configurability . . . . . . . . . . . . . . . . . . . . . . . . . 30
7.2 Allocator Metadata Implementations . . . . . . . . . . . . . . . . . . . 31
7.2.1 Page Metadata . . . . . . . . . . . . . . . . . . . . . . . . . . 31
7.2.2 Finding Block Buddies . . . . . . . . . . . . . . . . . . . . . . 32
7.3 Using the Allocators on Already Allocated Memory . . . . . . . . . . . 33
7.4 Lazy Splitting and Merging of Blocks . . . . . . . . . . . . . . . . . . 34
7.5 Allocator Regions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
7.6 Combining and Configuring the Adaptations . . . . . . . . . . . . . . . 36

8 Evaluation Methodology 37
8.1 Allocator Configurations . . . . . . . . . . . . . . . . . . . . . . . . . 37
8.2 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
8.2.1 Measuring Performance . . . . . . . . . . . . . . . . . . . . . 38
8.2.2 System Specifications . . . . . . . . . . . . . . . . . . . . . . 38

8.3 Fragmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
8.3.1 Internal Fragmentation . . . . . . . . . . . . . . . . . . . . . . 39
8.3.2 External Fragmentation . . . . . . . . . . . . . . . . . . . . . . 39
8.3.3 Measuring Fragmentation . . . . . . . . . . . . . . . . . . . . 40

9 Results 41
9.1 Memory Overhead . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
9.2 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
9.3 Fragmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
9.3.1 Internal Fragmentation . . . . . . . . . . . . . . . . . . . . . . 44
9.3.2 External Fragmentation . . . . . . . . . . . . . . . . . . . . . . 45

10 Discussion 47

11 Future Work 49
11.1 Integration Into ZGC . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
11.2 Smaller Minimum Allocation Size . . . . . . . . . . . . . . . . . . . . 49
11.3 Further Allocation Optimizations . . . . . . . . . . . . . . . . . . . . . 50
11.4 Combining Allocation Strategies . . . . . . . . . . . . . . . . . . . . . 50

12 Conclusions 51


1 Introduction

Effective memory management is crucial for any software system. It can be categorized into either manual memory management, where the developer is responsible for both allocating and deallocating memory, or automatic memory management, where the system handles memory on behalf of the developer. Garbage collection, a type of automatic memory management, identifies and reclaims memory that is no longer in use. There are various implementations of garbage collection that achieve this goal.
Java applications run within a Java Virtual Machine (JVM), one implementation of which is provided by the Open Java Development Kit (OpenJDK). OpenJDK includes various garbage col-
lectors, including the Z garbage collector (ZGC). ZGC organizes memory into many
pages, which are operated on concurrently during garbage collection. Objects are allo-
cated sequentially on these pages through a method known as bump-pointer allocation,
where a pointer tracks the position of the most recently allocated object, increasing it,
or “bumping” it, with each new allocation.
Because bump-pointer allocation always starts from the end address of the last allocation, it cannot reuse memory on the same page that has since become free. To resolve this, ZGC either relocates all active objects to a new page,
allowing the original page to be reset, or moves all objects one by one to the top of
the page. An alternative to this method is employing a free-list-based allocator, which
maintains a list of all unoccupied memory, thereby allowing allocations in any available
space, independent of prior allocations.
Using free-lists in garbage collectors is nothing new. Concurrent Mark Sweep (CMS) is
a garbage collector that relies on free-lists. It was part of OpenJDK until its deprecation
and subsequent removal. CMS could perform allocations without the constraints im-
posed by bump-pointer allocation methods, thanks to its use of free-lists. It is clear that
free-lists offer certain advantages, but they may not always be the optimal choice for
every situation. Using a free-list-based allocator within an existing garbage collector can offer significant advantages, as the garbage collector has more information available about the objects it allocates, information the allocator can exploit. This thesis investigates the potential integration of a free-list-based allocator within ZGC, focusing on possible adaptations that can boost allocator efficiency within a garbage collection context, without an actual implementation in ZGC.


1.1 Purpose and Goals

The purpose of this thesis is to investigate how it is possible to adapt an existing free-list-
based allocator for use within ZGC, and to evaluate the effects of these adaptations on
allocator performance. The goal is to leverage insights from this work to direct subsequent
efforts in the domain of free-list-based allocators in garbage collection.
The research questions this thesis seeks to answer are the following:

1. What changes are necessary when adapting an allocator for use within ZGC?

2. Is it possible to leverage knowledge of integration into ZGC to improve allocator


performance?

1.2 Delimitations

The primary delimitation of this thesis is the exclusion of allocator integration within
ZGC. The focus is on examining the allocators and their adaptations independently.
Pages in ZGC can be of three types: small, medium, or large. The modifications made
focus solely on small pages, as these are the most common. However, insights gained
from small pages may guide potential adaptations for medium and possibly large pages.

1.3 Individual Contributions

The primary author and principal contributor to this thesis is Casper Norrbin. Casper
independently developed all sections, except Section 2, which was collaboratively au-
thored with Niclas Gärds and Joel Sikström. Both Niclas and Joel pursued their theses
in parallel with this one, with Joel focusing on adapting the two-level segregated fit al-
locator [19] and Niclas on integrating an allocator into ZGC [4]. Joel’s work resulted
in distinct adaptations and outcomes, while Niclas explored the challenges and oppor-
tunities of integrating an allocator into ZGC, a natural extension of both this work and
Joel’s research.
For the section developed in collaboration with Niclas and Joel, the contributions are:
Sections 2.1, 2.2 and 2.3 are authored solely by Joel, Sections 2.4 and 2.5 are authored
solely by Casper, and Section 2.6 is authored solely by Niclas.


1.4 Outline

The rest of this thesis is organized as follows: Section 1 outlines the purpose, goals,
delimitations, and individual contributions. Section 2 provides context on memory
management, fragmentation, memory allocation strategies, and ZGC. Section 3 reviews
existing memory allocators and the use of free-lists in garbage collection. Section 4
details the binary buddy allocator and its allocation, deallocation, and state determina-
tion methods. Section 5 describes the implementation process, from reference design
to evaluation. Section 6 introduces enhancements such as the binary tree and inverse
buddy allocators. Section 7 discusses specific adaptations of the allocators for ZGC.
Section 8 outlines the performance and fragmentation testing approaches. Section 9
presents the findings on page overhead, performance, and fragmentation. Section 10
interprets these results, and Section 11 suggests further research directions. Finally,
Section 12 summarizes key insights and their implications.


2 Background

2.1 Memory Management

Memory management [5] is typically categorized as either manual or automatic. Manual memory management involves the programmer explicitly managing memory, which is common in low-level languages like C and C++. Automatic memory management is handled by the system, without the need for the programmer to intervene. The most common technique for automatic memory management is garbage collection, which automatically tracks the liveness of memory and reclaims unused memory. Python and Java are examples of languages that do this.
Memory is most commonly allocated during runtime of the program as opposed to
statically allocating everything in advance, during compilation for example. The process
of allocating memory during runtime is referred to as dynamic memory management,
which presents several challenges for reliably being able to satisfy allocation requests
and maintain operational stability over extended periods.
The main challenge with dynamic memory allocation is that an allocation might fail
due to memory exhaustion. Exhaustion may arise either due to the program request-
ing more memory than is available in the system or from the circumstance where free
memory is available but cannot be reused. The latter case is sometimes referred to as
just wasted memory, but to be precise we will classify it as either internal or external
fragmentation [5], which will be discussed further below.

2.2 Fragmentation

We classify fragmentation as being either internal or external. Internal fragmentation is wasted space within an allocated block, typically caused by alignment: extra memory is allocated to meet requirements imposed by the allocator or the hardware, for example. Figure 1 shows an example where a user has requested 100 bytes, but the allocator has instead allocated 128 bytes to meet its own requirements. The last 28 bytes are wasted space, or internal fragmentation, as the user does not know they exist and they will remain unused.
External fragmentation occurs when there is enough memory in total available to satisfy
a request, but the request nevertheless cannot be satisfied because the available memory
is dispersed in several chunks, none of which is large enough to satisfy the request. This


Figure 1 A memory region containing one allocated piece of memory that is 128 bytes
large. However, the user only requires 100 bytes and thus, 28 bytes are wasted.

is illustrated in Figure 2, where a total of 80 bytes is available but dispersed across the
memory region. Consequently, the largest allocation that the system can accommodate
is 32 bytes, as the single largest contiguous chunk of memory is of this size.

Figure 2 A memory region containing several allocated blocks with unused space be-
tween them. Although the total sum of the unused portions is 80 bytes, a single request
of more than 32 bytes cannot be fulfilled.

Effectively managing and reducing fragmentation is crucial for optimizing memory us-
age in long-running programs [5]. For example, if memory becomes too fragmented,
the system may not be able to satisfy allocation requests and be forced to either collect
garbage sooner than necessary or terminate if not using a garbage collector [5].

2.3 Memory Allocation

In this section, we describe two fundamental memory allocation strategies, called se-
quential allocation and free-list allocation [5]. Additionally, we discuss the more com-
plex case of using multiple free-lists, called segregated-fits.


2.3.1 Sequential Allocation

Sequential allocation is a method used for allocating memory within a contiguous chunk
of memory. In this approach, a pointer is used to track the current position within the
memory chunk. As new objects are allocated, the pointer is advanced by the size of
the object, along with extra memory that might be needed for alignment purposes. For
this reason, sequential allocation is also known as bump-pointer allocation due to the
incremental “bumping” of the pointer with each new allocation.
This approach is simple and efficient and is highly effective in situations where memory
fragmentation is not a significant concern and where a predictable, sequential layout
is desirable. However, it may not be the most suitable choice for all scenarios, espe-
cially where memory is not available in a large contiguous chunk. In that case, a more
sophisticated memory management strategy might be required.
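
As a concrete illustration, the following is a minimal sketch of a bump-pointer allocator over one contiguous chunk, written in C++; the class and its names are illustrative assumptions, not taken from any particular runtime.

    #include <cstddef>
    #include <cstdint>

    // A minimal bump-pointer allocator over one contiguous chunk.
    class BumpAllocator {
        std::uintptr_t top_;  // address of the next free byte
        std::uintptr_t end_;  // one past the last usable byte
    public:
        BumpAllocator(void* start, std::size_t size)
            : top_(reinterpret_cast<std::uintptr_t>(start)), end_(top_ + size) {}

        // Allocate size bytes aligned to align (a power of two);
        // returns nullptr when the chunk is exhausted.
        void* allocate(std::size_t size, std::size_t align = 8) {
            std::uintptr_t aligned = (top_ + align - 1) & ~(align - 1);
            if (aligned + size > end_) return nullptr;  // out of space
            top_ = aligned + size;                      // "bump" the pointer
            return reinterpret_cast<void*>(aligned);
        }
    };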

2.3.2 Free-List Allocation

An alternative to sequential allocation is free-list allocation, which involves maintaining a record of the location and size of available blocks, for example in a linked list. In its simplest form, a single list is used to track free blocks. The allocator then sequentially considers each block and selects one according to a specified policy. Below, we provide an overview of the most common policies used in free-list allocation.

First-fit The first block in the free-list that is large enough to fulfill the memory alloca-
tion request is selected. This method minimizes search time but does not consider
the possibility of a more suitable block elsewhere in the list. This often leads to
more fragmentation than is necessary, because blocks are split more often. Each
request searches from the beginning of the list.

Next-fit The search for a suitable block begins in the free-list, following a similar pro-
cess to that described for first-fit. However, in subsequent searches, it resumes
from where the previous search ended, improving efficiency when locating a new
block. This strategy is based on the observation that smaller blocks tend to ac-
cumulate at the beginning of the free-list [5], optimizing the search process by
starting further into the list in each iteration.

Best-fit The entire free-list is searched until the smallest available block that can fulfill
a request is located. This method minimizes fragmentation by selecting the block
that best matches the size of the requested memory, but it comes with the trade-off
of increased search time.


The three policies described above have in common that when a block that is larger than
requested is found, it is split up.
Ideally, blocks are split as rarely as possible to maintain the ability to serve large allo-
cations. If blocks are split too often, many small blocks might accumulate, which might
increase external fragmentation to a level where new requests cannot be fulfilled. Addi-
tionally, splitting blocks less frequently will also mean less merging, or coalescing, of
blocks to larger sizes, which could improve performance.
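
To make the first-fit policy concrete, here is a minimal sketch over a singly linked free-list; the FreeBlock layout and the splitting threshold are illustrative assumptions. Next-fit would differ only in remembering the search position between calls, and best-fit in scanning the whole list for the smallest fit.

    #include <cstddef>

    // One node of a singly linked free-list, stored inside the free memory
    // itself. Block sizes here include the header; payload bookkeeping is
    // omitted for brevity.
    struct FreeBlock {
        std::size_t size;  // total bytes in this block
        FreeBlock*  next;  // next free block, or nullptr
    };

    // First-fit: take the first block that is large enough, splitting off
    // the remainder back onto the list when it can still hold a header.
    FreeBlock* first_fit(FreeBlock*& head, std::size_t size) {
        FreeBlock** prev = &head;
        for (FreeBlock* cur = head; cur != nullptr;
             prev = &cur->next, cur = cur->next) {
            if (cur->size < size) continue;        // too small, keep searching
            std::size_t remainder = cur->size - size;
            if (remainder >= sizeof(FreeBlock)) {  // split the block
                FreeBlock* rest = reinterpret_cast<FreeBlock*>(
                    reinterpret_cast<char*>(cur) + size);
                rest->size = remainder;
                rest->next = cur->next;
                *prev = rest;
                cur->size = size;
            } else {
                *prev = cur->next;                 // take the whole block
            }
            return cur;
        }
        return nullptr;                            // no block large enough
    }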

2.3.3 Segregated-Fits Allocation

Instead of using a single free-list where blocks of many sizes are stored, multiple
free-lists that store blocks of similar sizes or size ranges can instead be used, called
segregated-fits. The goal of using multiple free-lists is to narrow down the search space
to fewer blocks, minimizing the time to find blocks large enough to satisfy a request.
It is crucial to note that blocks are logically segregated into their respective free-lists
based on size but are not required to be physically adjacent in memory within the same
free-list.
Segregated-fits is often employed in real-time systems where predictable and efficient
memory allocation is crucial [5, 8]. The reduced search space and minimized search
time to find suitable blocks increase the probability of timing constraints being met.
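
As a small illustration of how segregated-fits narrows the search, the following sketch maps a request size to a power-of-two size-class index, which would select the free-list to search; the 16-byte minimum class is an assumption for the example.

    #include <bit>      // std::bit_width (C++20)
    #include <cstddef>

    constexpr std::size_t MIN_SHIFT = 4;  // assumed 16-byte smallest class

    // Index of the free-list whose class is the smallest power of two >= size.
    std::size_t size_class(std::size_t size) {
        if (size <= (std::size_t{1} << MIN_SHIFT)) return 0;
        return std::bit_width(size - 1) - MIN_SHIFT;
    }
    // size_class(16) == 0, size_class(17) == 1, size_class(100) == 3 (128 bytes)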

2.4 Garbage Collection

This section draws heavily from The Garbage Collection Handbook: The Art of Au-
tomatic Memory Management [5], by Jones et al., which provides a comprehensive
overview of garbage collection methods, both historical and current.
As mentioned in Section 2.1, garbage collection is a form of automatic memory man-
agement where the system identifies and cleans up unused objects, which are considered
garbage. This removes the requirement of managing memory manually, which reduces
the possibility of user-related memory issues occurring. However, this also removes
some control from the user and could lead to an increased memory footprint.
From the perspective of the GC, the user program can be perceived solely as a memory
mutator. This is because the GC is only concerned with the memory management of the
program, not its functionality. Therefore, the threads of the user program are commonly
referred to as mutator threads, or just mutators.


During GC execution, it may be necessary to temporarily halt the mutator threads to give the GC exclusive memory access. This is known as a stop-the-world approach.
From the mutator’s perspective, the GC appears to be atomic. This permits a simple
implementation and avoids synchronization between GC and mutators while the GC is
running, but also introduces pauses in the application which may be undesirable if la-
tency is a quality metric. The GC could also operate concurrently and in parallel with the
mutator threads. This leads to increased complexity as the memory can mutate during
GC execution. Most GCs fall somewhere in between these ends, executing concurrently
with the mutator threads while occasionally requiring pauses in their execution.
There exist numerous garbage collection strategies. One is the mark-sweep algorithm,
which begins by identifying a set of accessible “root” objects. Roots are objects that
serve as entry points into the object graph, such as global variables, local variables in
stack frames, and registers. Roots are then marked as live and scanned to identify other
objects in the object graph. This process continues recursively until all live objects have
been identified. Afterwards, the heap is swept, meaning that the space not occupied by
live objects is made available for new allocations.
The mark-sweep algorithm does not move objects, which can result in higher levels of
external fragmentation. The mark-compact algorithm addresses this issue. Like mark-
sweep, it traverses the heap and marks all live objects. However, instead of sweeping,
live objects are moved and compacted. This approach reduces fragmentation and can
be managed using a simple bump-pointer allocator.
A copying garbage collector divides the heap into many subheaps. A subheap is first
chosen and marked as non-empty. Allocations are then made sequentially onto this sub-
heap using bump-pointer allocation. When garbage collection is initiated, live objects
are copied from their current subheap to another subheap. Once all live objects have
been copied, the original subheap is marked as empty and its bump-pointer reset. Com-
pared to mark-compact, the usable heap size is significantly reduced when using few subheaps.
Using many subheaps will make each subheap small, which could result in more, and
likely shorter, garbage collection cycles.

2.5 OpenJDK

OpenJDK [11] is a set of tools for creating and running Java programs, maintained by
Oracle. HotSpot [14] is one of these tools and the reference implementation of the Java
Virtual Machine (JVM) [10]. A virtual machine (VM) emulates a distinct computer to run various programs, which could be anything from full operating systems to more specialized machines. The JVM is designed specifically to execute Java programs. It translates programs into instructions for the underlying machine, creating an abstraction of the physical machine's hardware and allowing Java programs to run on any platform that the JVM itself runs on.
HotSpot comprises several components, such as an interpreter, a Just-In-Time
(JIT) compiler, and a garbage collector (GC). In combination, these components provide
the means for running different types of Java programs on the platforms supported by
HotSpot.
HotSpot provides several garbage collectors, each with different characteristics and per-
formance profiles. Different garbage collectors are optimized for different use cases,
such as throughput and latency, and can be tuned to varying degrees. One of the garbage
collectors in HotSpot is the Z Garbage Collector (ZGC) [15]. It was introduced as an
experimental feature in OpenJDK 11, and declared production ready in OpenJDK 15.
ZGC is a modern, generational, region-based, concurrent garbage collector that aims to
keep pause times constant at any given heap size.

2.6 The Z Garbage Collector

The following sections will cover the basics of how ZGC allocates data and how it can
dynamically collect garbage. The information is based partly on an overview by Yang
and Wrigstad [21] and partly on the source code of ZGC itself, as available in OpenJDK
version 22.32 [12].

2.6.1 Pages

ZGC is a region-based garbage collector which allocates objects inside differently-sized
regions of memory that are referred to as pages. When new objects are created, they are
allocated inside pages with respect to their size. Pages are classified as one of three
different types: Small, Medium or Large, which support allocations of different size
ranges, as specified in Table 1. All allocations are aligned to 8 bytes and the smallest
supported allocation size is, at the time of writing, 16 bytes. As a result of all allocations
having the same alignment, their offset from the start of the page will also be a multiple
of 8 bytes.


Table 1 Page sizes in ZGC (taken from Yang and Wrigstad [20]).

    Page Type   Page Size   Object Size
    Small       2 MB        [16 B, 256 KB]
    Medium      32 MB       (256 KB, 4 MB)
    Large       >= 4 MB     >= 4 MB
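
As an illustration of Table 1, a sketch of classifying an allocation by these size ranges follows; the enum and helper are illustrative, not part of ZGC's API.

    #include <cstddef>

    enum class PageType { Small, Medium, Large };

    constexpr std::size_t KB = 1024;
    constexpr std::size_t MB = 1024 * KB;

    // Pick the page type for an object of `size` bytes, per Table 1.
    PageType page_type_for(std::size_t size) {
        if (size <= 256 * KB) return PageType::Small;   // [16 B, 256 KB]
        if (size < 4 * MB)    return PageType::Medium;  // (256 KB, 4 MB)
        return PageType::Large;                         // >= 4 MB
    }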

Pages in ZGC include metadata that allows them to perform allocations efficiently, as well as to ensure safe execution while threads operate on them concurrently. An overview of this metadata is shown in Figure 3; the most important attributes are the Bump Pointer, Live Map, Sequence Number, and Age, which are explained in detail below.

Bump Pointer The bump pointer is a pointer within the page’s memory range that
stores the location of where new allocations are made inside the page. Its func-
tionality and use-case are explained in detail in Section 2.3.1. All allocations
inside pages in ZGC are currently done using bump pointers. In Figure 3, the
bump pointer is displayed below four allocated objects, indicating that the next
position to allocate objects at would be below the fourth object.

Live Map The live map is used to keep track of which objects are currently live and
have been marked as reachable by the program. Objects not marked in the live
map are considered dead. The live map is shown on the right-hand side of Fig-
ure 3. It is constructed during the marking phase of the GC cycle as live objects
are encountered during memory traversal and marked in the live map. The live
map is represented by a bitmap, where each bit is mapped to an 8-byte chunk of the page (see the sketch after Figure 3). If a bit is set, there is a live object at the corresponding address on the
page. The live map is used during the compacting phase of the GC in order to
know how much memory is alive inside the page.


Sequence Number Every page has a sequence number denoting during which GC cycle it was created. If the GC is on its Nth cycle, pages created during that GC cycle will have sequence number N. Pages with sequence number N are called allocating pages, and are the only pages on which new objects are allocated for mutator threads. Pages with a sequence number below the current GC cycle (0 to N−1) are called relocatable pages, since no additional allocations will be done inside them during the GC cycle, and garbage can be reclaimed with the use of relocation. For example, if the current Java program has performed 5 GC cycles, any allocations after that will be done exclusively on pages with sequence number 5. If the program decides to run a 6th cycle, only garbage from pages with sequence numbers 0 through 5 will be collected, and new allocations will be done on pages with sequence number 6.

Age ZGC is a generational garbage collector, meaning it treats objects of varying ages
differently to improve performance. This is based on the observation that most
objects survive for only a short period and those that survive for longer tend to
live for a long time. This is known as the weak generational hypothesis. ZGC
applies this approach on a page-by-page basis, where all allocations on a page
share the same age, which is conveniently stored on each page instead of for every
object. Objects are grouped into three different age categories: Eden, Survivor
and Old. The Eden age is for first-time allocation, Survivor signifies objects that
have survived one or more GC cycles and Old those that have survived a threshold
of cycles in the Survivor age. Classifying pages this way allows the GC to treat
pages of a certain age differently, which is especially useful for Eden pages, which
tend to accumulate garbage quickly.



Figure 3 Illustration of a page in ZGC showing what kind of metadata is kept track
of to facilitate allocation and object bookkeeping. The page contains three live objects
marked in green and with the corresponding bit set in the live map to the right. Red
objects are previously allocated objects which have since become unreachable garbage
memory.
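
The following sketch shows how such a live map can be indexed: because all allocations are 8-byte aligned, the offset from the page start divided by 8 gives a unique bit per possible object address. The class and names are illustrative, not ZGC's internal types.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // A live map sketch: one bit per 8-byte granule of a page.
    class LiveMap {
        std::vector<std::uint64_t> bits_;
        std::uintptr_t page_start_;
    public:
        LiveMap(std::uintptr_t page_start, std::size_t page_size)
            : bits_((page_size / 8 + 63) / 64, 0), page_start_(page_start) {}

        // (addr - page_start) / 8 is unique for every 8-byte-aligned object.
        void mark(std::uintptr_t addr) {
            std::size_t idx = (addr - page_start_) / 8;
            bits_[idx / 64] |= std::uint64_t{1} << (idx % 64);
        }
        bool is_live(std::uintptr_t addr) const {
            std::size_t idx = (addr - page_start_) / 8;
            return (bits_[idx / 64] >> (idx % 64)) & 1;
        }
    };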

2.6.2 The Garbage Collection Cycle

Figure 4 shows a timeline of a garbage collection cycle in ZGC, made up of an example
heap and pages. The timeline shows the different phases of a cycle and how the garbage
collector prepares for relocating memory to free unused memory. The three different
steps in the timeline show:

a) The initial state of the heap before the GC cycle starts. The left page has about
30% free memory and the right page has about 50%. The current GC cycle is 1,
which tells us that pages with sequence number 1 are allocating objects.

b) The GC cycle has started, and the old pages have been locked. New allocations are
now done on new pages instead of the old ones. The old pages are now guaranteed
to not allocate any new objects, which makes it possible for the marking phase to
begin to traverse the live objects.

c) The GC has finished the marking phase and now has liveness data that indicate
where live objects are allocated inside the relocatable pages.


(a) The initial state of the heap before the first GC cycle. The gray color indicates the page is allocating objects.
(b) The state of the heap after marking has started. The blue color indicates that the page is no longer being allocated on.
(c) The state of the heap after all live objects are marked. The red and green colors indicate that the marking phase has finished, and the pages have liveness data that explain where live objects (green) are located.
Figure 4 Timeline of a garbage collection cycle, showing the state of the heap in the
different phases.

ZGC frees memory marked as garbage by copying live objects from one page to another and re-purposing the source page for new allocations. This causes the fragmented layout of allocated memory to be compacted, which makes it possible to fit the live objects of multiple pages onto a single one. The process of copying live objects is also referred to as relocation. When performing relocation, two different scenarios can occur: either there is enough free memory available in some page to allocate into, or the heap is full. If the heap is full, objects need to be compacted onto the same page they are already located on, which is called in-place compaction. The first scenario is illustrated in Figure 5. It requires the heap to have enough free space available to create a new page at the time of relocation. The GC can then reclaim garbage memory by relocating live objects from sparsely populated pages into a denser configuration inside a new page. Both fragmentation and total heap usage are thereby reduced.


(a) A page has been selected for relocation due to high fragmentation.
(b) A new page is created with a new sequence number that is used as a relocation target.
(c) The objects from the first page are relocated to the new page.
(d) All objects have been copied to the new page and the old page is re-purposed.
Figure 5 An example of how a successful relocation is done when the heap has enough
space to allocate a new page.

Since a garbage collection is started as a reaction to high memory pressure, the available
memory might be limited. If a new page cannot be allocated, the first page is “created”
by compacting its objects in-place, and then continuing by relocating the live objects of
other pages onto that page. An in-place compaction is potentially an expensive opera-
tion, which requires the thread performing the re-arrangement to write to the memory of
the page while other threads may read its contents. To do this safely, any threads trying
to read from the page during this process must be paused, which suspends concurrent execution for that page. The process of in-place compaction of a page is illustrated in
Figure 6.

(a) The first live object on the page is moved to the start of the page.
(b) The second live object is moved as close to the start as possible.
(c) All objects have been moved, and the space after is made available.
(d) The bump pointer is moved to the beginning of available space.
Figure 6 An example of how in-place compaction is done to remove fragmentation.


3 Related Work

Given the large number of existing allocator designs, one might conclude that most
problems with dynamic memory allocation have already been solved. In most cases, a
general-purpose allocator designed for use in any system or environment performs well
on average in terms of response time and fragmentation. However, there may be areas
for improvement depending on how much the use case can be narrowed down. For
example, a garbage collector has access to more information about the objects being
allocated than most other users of an allocator.

3.1 Memory Allocators

One of the first memory allocation techniques is the simple buddy allocator, conceived by Knowlton [6]. The buddy allocator partitions memory into blocks, each sized as a power of two. Initially, the entire memory is treated as a single block. On the first allocation, this block is divided recursively into halves until the smallest block size that still fits the allocation is obtained. The resulting differently-sized blocks can then be used for subsequent allocations. The two blocks resulting from a split are called “buddies”. When a block is deallocated and its buddy is also unallocated, the two blocks are combined into a larger block. This merging process continues recursively as long as the buddy is also unallocated.
The simple design of the buddy allocator provides ample room for enhancements. Pe-
terson and Norman presented a more generalized version of the buddy allocator [17],
allowing block sizes that are not powers of two and allowing buddies of unequal sizes.
Recent advances have also included making the allocator lock-free [7] and enabling the
same block to exist in multiple sizes simultaneously to improve the efficiency of smaller
allocations [16].
The buddy allocator has seen wide use in real environments and applications. A notable
example is its use within the Linux kernel for allocating physical memory pages [3]. In
addition, the buddy allocator can be combined with various other allocation techniques
to improve efficiency. For example, jemalloc [2], created for the FreeBSD operating
system, is a general-purpose allocator that incorporates multiple strategies, including
the buddy allocator, to achieve overall efficiency.


3.2 Free-Lists in Garbage Collection

Using free-lists in garbage collection systems is not a new concept. For example, the ba-
sic mark-sweep algorithm, as outlined in Section 2.4, necessitates the use of a free-list.
The static position of live objects can lead to significant fragmentation. Consequently,
simpler allocation methods, such as bump-pointer allocation, become impractical. In-
stead, employing a strategy based on free-lists for allocation proves to be a more effec-
tive approach to managing and allocating memory surrounding live objects.
Among the free-list strategies outlined in Section 2.3.2, the segregated-fit method is
particularly noted for its effectiveness [5]. Programs that rely on garbage collection generally allocate more than those that manage memory manually. The significance of segregated-fits free-lists therefore becomes particularly important as program allocations increase.
The Concurrent Mark Sweep (CMS) garbage collector [9], introduced in JDK 1.4, is a
concrete example of free-lists being used within garbage collectors. CMS implements a
concurrent variant of the mark-sweep algorithm, utilizing free-lists for managing mem-
ory. Although it was deprecated in JDK 9, CMS demonstrates the practicality of using
free-lists in garbage collectors. This highlights their versatility and adaptability across
different garbage collection algorithms and implementations.
There is potential for free-list-based allocators to exist within already existing garbage
collection systems. Using multiple allocation strategies in tandem can enhance alloca-
tion efficiency, as can be seen in various allocators [2]. With a free-list-based allocator
integrated into a garbage collection system, the system can leverage the benefits of both
strategies to improve performance. For example, integrating a buddy allocator into ZGC
would allow for fast allocation using bump-pointer allocation, while also providing the
benefits of free-lists for managing memory surrounding live objects in situations where
bump-pointer allocation is not as efficient.


4 Binary Buddy Allocator

The concept of a buddy memory allocator was first introduced by Knowlton [6]. This
section explains how the original buddy memory allocator operates, as designed by
Knowlton. The core idea of the buddy allocator involves dividing memory into blocks
and pairing them as “buddies”. Buddy pairs are split when allocating memory and
merged when deallocating memory to allow allocations of different sizes. The simple
design of the buddy allocator makes it attractive for various applications. For example,
the Linux kernel implements a modified binary buddy allocator to manage physical
memory [3], and jemalloc adopts a buddy allocator for one of its memory allocation
methods [2].

4.1 Allocation and Deallocation

The binary buddy allocator is the original and most basic type of buddy allocator. It
partitions memory into blocks with sizes that are powers of 2. A free-list is maintained to keep track of the available blocks of each size. Initially, all
memory is one large block. Figure 7 shows the initial state of a binary buddy allocator
with a total memory of 128 bytes. It can be seen that the entire memory is contained
within a single block, which is also included in the free list.

Figure 7 The initial state of the free-list and memory in a binary buddy allocator. Shown
are the memory and the free-list of available blocks.

When requesting memory, a block of the smallest power-of-two size that can accommo-
date the requested memory will be returned. If there are no blocks of that size available,
a larger block will be recursively split into two until a block of the correct size is avail-
able. The split blocks are added to the free list for future allocations. An algorithmic
explanation of this process can be found in Algorithm 1.


Procedure Allocate(size)
 1: level ← smallest power of 2 ≥ size
    Find a suitable block
 2: block ← no_block
 3: for i ← level to max_level do
 4:     if free list for level i is not empty then
 5:         block ← remove first block from free list of level i
 6:         break
 7:     end if
 8: end for
 9: if block = no_block then
10:     return allocation_failed
11: end if
    Split the block if necessary
12: while level of block > level do
13:     block, buddy ← split block into two buddy blocks
14:     add buddy to its corresponding free list
15: end while
16: return block
Algorithm 1 Binary buddy allocation algorithm
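
A sketch of Algorithm 1 in C++ follows, using one intrusive free-list per level; the constants match the 128-byte example of Figure 7 and all names are illustrative assumptions. Setup consists of pushing the single top-level block onto free_lists[LEVELS - 1].

    #include <array>
    #include <cstddef>

    constexpr std::size_t LEVELS    = 4;   // 16 B .. 128 B, as in Figure 7
    constexpr std::size_t LEAF_SIZE = 16;  // smallest block, level 0

    struct FreeNode { FreeNode* next; };   // stored inside each free block

    struct BuddyAllocator {
        std::array<FreeNode*, LEVELS> free_lists{};  // one list per level

        static std::size_t block_size(std::size_t level) {
            return LEAF_SIZE << level;
        }

        void* allocate(std::size_t size) {
            // Smallest level whose block size can hold the request.
            std::size_t level = 0;
            while (level < LEVELS && block_size(level) < size) ++level;

            // Find the first non-empty free-list at or above that level.
            std::size_t i = level;
            while (i < LEVELS && free_lists[i] == nullptr) ++i;
            if (i == LEVELS) return nullptr;         // allocation failed

            FreeNode* block = free_lists[i];
            free_lists[i] = block->next;

            // Split down to the requested level; each split pushes the
            // upper half (the buddy) onto the free-list one level below.
            while (i > level) {
                --i;
                FreeNode* buddy = reinterpret_cast<FreeNode*>(
                    reinterpret_cast<char*>(block) + block_size(i));
                buddy->next = free_lists[i];
                free_lists[i] = buddy;
            }
            return block;
        }
    };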

As an example, consider Figure 8, which shows the resulting state of the previous figure
after requesting 16 bytes of memory. The initial 128-byte block has been split multiple
times to fulfill the request, with the leftmost block being returned and the addition of
various block sizes to the free list.


Figure 8 The state of the free-list and memory in a binary buddy allocator after one
16-byte allocation. Shown are the memory and the free-list of available blocks. Each
new row represents one step in the allocation process.

Deallocation is the reverse process of allocation. When a block is deallocated, it is merged with its buddy if the buddy is also deallocated. Merging creates a new, larger block. After merging, the original block and its buddy are removed from the free list, and the new, larger block is inserted. The allocator then checks the status of the larger block's buddy, and the merging process continues recursively until merging is no longer possible. An algorithmic explanation of this process can be found in Algorithm 2. To demonstrate this process, consider the situation where the 16-byte block in Figure 8 is deallocated. In this scenario, the block would go through a recursive merging process with its buddies until it reaches the initial state shown in Figure 7.

Procedure Deallocate(block)
 1: level ← level of block
 2: while true do
 3:     buddy ← find buddy of block
 4:     if buddy is not free then
 5:         add block to free list of level level
 6:         break
 7:     else
 8:         remove buddy from free list of level level
 9:         block ← merge block and buddy
10:         level ← level + 1
11:     end if
12: end while
Algorithm 2 Binary buddy deallocation algorithm
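
A matching sketch of Algorithm 2 follows. For clarity it tracks free blocks in one std::set per level so the buddy lookup is explicit; a real buddy allocator would consult the bitmap of Section 4.2 instead. The XOR trick for finding a buddy's address is a common implementation technique and an assumption here, not something the algorithm itself prescribes.

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <set>
    #include <vector>

    constexpr std::size_t LEAF_SIZE = 16;

    struct BuddyFreeSets {
        std::uintptr_t base;                            // start of managed memory
        std::vector<std::set<std::uintptr_t>> free_at;  // one set per level

        static std::size_t block_size(std::size_t level) {
            return LEAF_SIZE << level;
        }
        // The buddy's address is the block's offset with its size bit flipped.
        std::uintptr_t buddy_of(std::uintptr_t block, std::size_t level) const {
            return base + ((block - base) ^ block_size(level));
        }

        void deallocate(std::uintptr_t block, std::size_t level) {
            while (level + 1 < free_at.size()) {
                auto it = free_at[level].find(buddy_of(block, level));
                if (it == free_at[level].end()) break;  // buddy in use: stop
                block = std::min(block, *it);           // merged block starts low
                free_at[level].erase(it);               // buddy leaves its list
                ++level;                                // continue one level up
            }
            free_at[level].insert(block);
        }
    };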


4.2 Determining Buddy State

When deallocating a block, the allocator must determine the state of its corresponding buddy. This is done using a bitmap, with each entry indicating whether a block has been allocated. The allocator assigns a unique number to each block that can potentially be allocated. Figure 9 illustrates all possible blocks and their respective numbering. The bitmap requires a total of 2^n − 1 entries, where n is the number of levels of potential block sizes.

Figure 9 Unique numbering of each possible block in a binary buddy allocator.

The bitmap is correlated with the available blocks in the free-list and is modified both
when splitting and merging blocks. For example, Figure 10 shows the state of the bitmap
of a binary buddy allocator after one 16-byte allocation. It can be seen that the largest
block has been split down to the smallest size and that the buddies of the used blocks
are marked in the bitmap as available.

Figure 10 The state of the bitmap and free-list of a binary buddy allocator after one
16-byte allocation.
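
The numbering of Figure 9 can be computed like a binary heap laid out in an array, which is one common way (an assumption here, not necessarily the thesis implementation) to index such a bitmap:

    #include <cstddef>

    // Heap-style numbering: the root block is index 0 and the row at
    // depth d (0 at the top) starts at 2^d - 1.
    std::size_t block_index(std::size_t depth, std::size_t position) {
        return ((std::size_t{1} << depth) - 1) + position;
    }
    std::size_t parent(std::size_t i)     { return (i - 1) / 2; }
    std::size_t left_child(std::size_t i) { return 2 * i + 1; }
    std::size_t buddy(std::size_t i)      { return ((i - 1) ^ 1) + 1; }
    // With n levels of block sizes, indices run 0 .. 2^n - 2,
    // matching the 2^n - 1 bitmap entries mentioned above.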


5 Method

The methodology was divided into five distinct phases:

1. Implement the Reference Design: Implement the original binary buddy alloca-
tor to establish a foundation for further development.
2. Investigate Key Aspects: Identify important aspects of using a memory alloca-
tor within a garbage collection context, particularly ZGC, through research and
experimentation with source code.
3. Make Modifications Based on Existing Research: Modify the reference alloca-
tor based on prior research to enhance its performance.
4. Adapt for Garbage Collection: Further modify the allocator for specific use
cases within garbage collection.
5. Evaluate Performance: Measure memory fragmentation and allocation speed
for both the reference allocator and the modified versions.

The initial phase involved implementing the original binary buddy allocator designed
by Knowlton [6]. This implementation provided a deeper understanding of its function-
ality and established a solid foundation for subsequent development. The base allocator
was tested using both unit tests and real-world applications. By exporting the alloca-
tor as a shared library within a wrapper, it was possible to test real-world programs by
overriding the malloc/free operations using the LD_PRELOAD environment variable.
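
A minimal sketch of such a wrapper is shown below; buddy_alloc and buddy_free are illustrative stand-ins for the allocator's real entry points, and a complete wrapper would also have to cover calloc, realloc, and aligned allocation.

    // wrapper.cpp - override malloc/free with the custom allocator.
    #include <cstddef>

    extern "C" void* buddy_alloc(std::size_t size);  // assumed allocator API
    extern "C" void  buddy_free(void* ptr);

    extern "C" void* malloc(std::size_t size) { return buddy_alloc(size); }
    extern "C" void  free(void* ptr) { if (ptr) buddy_free(ptr); }

Built with something like g++ -shared -fPIC wrapper.cpp allocator.cpp -o liballoc.so, the wrapper is then injected into an unmodified program via LD_PRELOAD=./liballoc.so ./program.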
Once the base allocator was implemented, the goal of phase two was to understand the
nuances of memory allocation within the context of garbage collection and ZGC. This
involved conducting a literature review, examining specific parts of the ZGC source
code, and performing experiments. This phase was crucial for grasping the complexities
of ZGC and its memory allocation mechanisms, identifying potential areas for modifi-
cation, and determining which changes would have the greatest impact. The changes
made and the reasons behind them can be found in Sections 6 and 7.
With a thorough understanding of the binary buddy allocator and ZGC, the third phase
involved integrating modifications inspired by prior research. The most significant
changes were applied, sometimes requiring the implementation of conflicting concepts,
which were then compared and evaluated. The focus was on leveraging existing work
to enhance overall performance, rather than to start from scratch. These adjustments
were not exclusively tailored for garbage collection but aimed to improve general per-
formance or produce positive results when used in garbage collection processes.


In the fourth phase, improvements were implemented that were specifically relevant
to the allocator’s use in a garbage collection environment. These modifications were
derived from insights and knowledge acquired in the preceding phases, rather than based
on existing research. The goal was to introduce features tailored for specific scenarios
within ZGC without compromising performance in other aspects of ZGC or elsewhere.
The implementation process was incremental, with each minor modification being tested
and validated before proceeding to the next step. As the allocator evolved, the number
of unit tests increased, which ensured comprehensive test coverage of its capabilities.
Additionally, the allocator was tested with preexisting programs after each modification
to ensure stability and performance.
The final phase involved evaluating the reference allocator and the modifications. Sev-
eral key metrics were considered during the evaluation, including fragmentation, mem-
ory usage, and allocation speed. These metrics were essential for assessing the alloca-
tor’s performance and comparing different versions. Different metrics may hold greater
significance in different scenarios, while others may have different requirements. A
detailed explanation of the evaluation process can be found in Section 8.


6 Improved Buddy Allocators

As of the time of writing, the original binary buddy allocator is nearly 60 years old. Over
the years, numerous improvements and advances have been made, including refining
the initial design [17] and adapting it for new applications and scenarios [7, 16]. This
section covers two notable adaptations of the binary buddy allocator: the binary tree
buddy allocator and the inverse buddy allocator. These adaptations aim to improve
the performance of the original buddy allocator in specific scenarios, such as when
allocating small blocks or when minimizing block splitting and merging.

6.1 Binary Tree Buddy Allocator

The original binary buddy allocator keeps track of available blocks by organizing them
into different free-lists. Consequently, every time an allocation or deallocation occurs,
entries must be added to and removed from these lists. If a block needs to be split or can
be merged, more operations are needed. To avoid these costly free-list manipulations,
an implicit free-list can be implemented using a binary tree structure. This tree includes
all the information in the original free-list.
An example of such an implementation is that of Restioson [18]. In this implementation,
the tree stores the maximum allocation size for each block or its children. Since blocks
are evenly divided into two, only the block’s “level” needs to be stored to determine its
size. The initial state of a binary tree buddy allocator is illustrated in Figure 11. Each
level of the tree represents a specific size, and the root level corresponds to the entire
memory space.

Figure 11 The initial state of the binary tree and its free-list equivalent in a binary tree
buddy allocator.


6.1.1 Allocation in a Binary Tree Buddy Allocator

When allocating with a binary tree, the level required for the allocation size is first determined. Then, a search starts from the root node of the tree. The allocator checks whether the node's available level is high enough; if it is, it examines the left child. If the left child does not have a large enough block available, the right child is chosen instead. This search continues recursively down the tree until the correct level is reached. Upon finding a suitable node, its level is set to 0, and the parent nodes are updated to reflect the new state of their children. An algorithmic explanation of this process can be found in Algorithm 3.

Procedure Allocate(size)
 1: level ← smallest power of 2 ≥ size
 2: node ← root of tree
    Find an available node
 3: while level of node > level do
 4:     left, right ← get children of node
 5:     if level available in left ≥ level then
 6:         node ← left
 7:     else
 8:         node ← right
 9:     end if
10: end while
11: if level available in node < level then
12:     return allocation_failed
13: end if
14: mark node as unavailable
    Update the tree
15: for all parent nodes starting from node do
16:     level available in parent ← max(left, right)
17: end for
18: return node
Algorithm 3 Binary tree allocation algorithm
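
A sketch of Algorithm 3 over a heap-ordered array follows. Each node stores the largest level still allocatable in its subtree, offset by one so that 0 means nothing is free there; this encoding and all names are illustrative assumptions rather than a specific implementation's layout.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    struct TreeBuddy {
        int rows;               // tree rows; the bottom row holds the leaves
        std::vector<int> tree;  // (1 << rows) - 1 nodes, root at index 0

        explicit TreeBuddy(int r) : rows(r), tree((std::size_t{1} << r) - 1) {
            // A node at depth d has level rows - 1 - d; store level + 1.
            for (int d = 0, i = 0; d < rows; ++d)
                for (int n = 0; n < (1 << d); ++n, ++i)
                    tree[i] = rows - d;
        }

        // Returns the node index of a block at `level`, or -1 on failure.
        int allocate(int level) {
            if (tree[0] < level + 1) return -1;  // nothing big enough anywhere
            int node = 0;
            for (int d = rows - 1; d > level; --d) {  // d = level of `node`
                int left = 2 * node + 1;
                // Prefer the left child when it can still serve this level.
                node = (tree[left] >= level + 1) ? left : left + 1;
            }
            tree[node] = 0;                      // mark the node unavailable
            // Walk up, refreshing each parent from its two children.
            for (int p = node; p > 0; ) {
                p = (p - 1) / 2;
                tree[p] = std::max(tree[2 * p + 1], tree[2 * p + 2]);
            }
            return node;
        }
    };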

To illustrate this, Figure 12 shows the state of a binary tree buddy allocator following
the allocation of a 32-byte and a 16-byte block. It can be seen that the initial allocation
is placed in the leftmost node at the second-lowest level, while the subsequent allocation
is placed to the right of it at the lowest level. Consequently, the entire left branch of the
binary tree now only contains one 16-byte block, as indicated by one being stored in the
respective nodes.


Figure 12 The state of the binary tree and its free-list equivalent in a binary tree buddy
allocator after one 32-byte and one 16-byte allocation.

6.1.2 Deallocation in a Binary Tree Buddy Allocator

Deallocation in a binary tree buddy allocator is also a simple process. Initially, the
node that has been deallocated is set to its corresponding level, followed by updating its
parent nodes to reflect this change. When the buddy of a block is also free, signified by
it having the same level as the recently deallocated node, they are merged. The merging
operation in a binary tree involves increasing the level of the parent by one, indicating
that the entire memory space it represents is now available. An algorithmic explanation
of this process can be found in Algorithm 4.

Procedure Deallocate(block)
 1: level ← level of block
    Set the level of the node
 2: node ← level
    Update the tree
 3: for all parent nodes starting from node do
 4:     if left = right and left is completely free then
 5:         level available in parent ← left + 1
 6:     else
 7:         level available in parent ← max(left, right)
 8:     end if
 9: end for
Algorithm 4 Binary tree deallocation algorithm
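
The corresponding sketch of Algorithm 4, written against the same heap-ordered array and level-plus-one encoding as the previous sketch (still an illustrative assumption):

    #include <algorithm>
    #include <vector>

    // `node` is the index returned by allocate and `level` its own level.
    void tree_deallocate(std::vector<int>& tree, int node, int level) {
        tree[node] = level + 1;          // the node is fully free again
        int child_level = level;         // level of the children of `p`
        for (int p = node; p > 0; ++child_level) {
            p = (p - 1) / 2;
            int left = tree[2 * p + 1], right = tree[2 * p + 2];
            tree[p] = (left == right && left == child_level + 1)
                          ? child_level + 2         // both halves free: merge
                          : std::max(left, right);  // else best child wins
        }
    }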


6.2 Inverse Buddy Allocator

Allocating the smallest block size in a binary buddy allocator requires the most total
block splits. To improve the speed of allocating these blocks, adjustments are necessary.
Park et al. [16] proposed an alternative method for organizing the metadata of a binary
buddy allocator, which they call an inverse buddy allocator (iBuddy allocator). An
iBuddy allocator can allocate individual blocks in constant time, although at the expense
of slower allocations for larger blocks. Rather than dividing larger blocks into smaller
ones to fulfill allocations, all blocks are initially split, and merging is only done during
larger allocations.

6.2.1 Allocation in an Inverse Buddy Allocator

In an iBuddy allocator, blocks of different sizes that overlap the same memory are all
present in the free-list and bitmap at the same time. In the iBuddy allocator, the bitmap
takes on additional meaning. When a block is marked as free, it implies that all smaller
blocks at that memory location are also free and available for allocation. This enables
the possibility of allocating a smaller size in a larger block, and keeping the rest of the
space in that block available for future allocations. Consider Figure 13, which shows a
valid initial state of an iBuddy allocator. It can be seen that all potential addresses for
the smallest-size blocks are present in the free-list but at different levels. For example,
the first block is available at the uppermost level, while the second block is available at
the lowest level.

Figure 13 The initial state of the free-list and bitmap in an inverse binary buddy alloca-
tor.


Due to the new structure of the bitmap, it is possible to allocate single-sized blocks
in place of larger blocks without affecting other blocks in the bitmap. When using an
iBuddy allocator, the largest available block in the free-list is always utilized for alloca-
tion, irrespective of the allocation size. Consider Figure 14, which shows an allocation
of one block of the smallest size. The largest block from the preceding figure has been
taken out of the free-list and bitmap, whereas the remainder of the free-list and bitmap
remains unaltered. This allocation did not influence other blocks, and all other smallest-
size blocks are still accessible at different levels. For the next smallest-size allocation,
the 64-byte block would be used. No explicit block splitting is needed, and by storing
the highest level of the free-list with available blocks, the allocation of smallest-sized
blocks can be done in constant time.

Figure 14 The state of the free-list and bitmap in an inverse binary buddy allocator after
one 16-byte allocation.

Fast allocation speed for small blocks comes at the cost of slow allocation speed for
larger blocks. In contrast to a binary buddy allocator, blocks cannot simply be removed,
as there are now overlapping smaller blocks present in the free-list and bitmap. These
must be removed to avoid allocating the same memory location twice. An algorithmic
explanation of this process can be found in Algorithm 5.
Consider Figure 15, which shows the same allocator as in the previous figure, now with
an additional 64-byte allocation. It can once again be seen that the highest-level block
has been removed, but now, all the blocks under it have also been cleared from both the
free-list and the bitmap. For larger blocks, the allocation speed is proportional to the
number of smallest-sized blocks needed to fill the requested size.


Procedure Allocate(size)
 1: level ← level of the smallest block size ≥ size
 2: block ← remove the first block from the highest-level non-empty free-list
 3: if level = 0 then
 4:     return block
 5: else if level of block < level then
 6:     return allocation_failed
 7: end if
    ▷ Remove all lower blocks
 8: for i ← level down to 0 do
 9:     for all possible blocks at level i within block do
10:         remove that block from the free-list at level i
11:     end for
12: end for
13: return block
Algorithm 5 iBuddy allocation algorithm

Figure 15 The state of the free-list and bitmap in an inverse binary buddy allocator after
one 16-byte allocation and one 64-byte allocation.

6.2.2 Deallocation in an Inverse Buddy Allocator

In an inverse buddy allocator, no blocks are explicitly merged during deallocation; in-
stead, the level at which they are placed back into the free-list indicates the size of the
available block. During deallocation, the status of the block’s buddy is checked; if it is
free, instead of merging, the deallocated block rises one level. This process repeats until
a buddy is no longer free, after which the block is inserted at that layer. An algorithmic
explanation of this process can be found in Algorithm 6.


Procedure Deallocate(block)
 1: for all lowest-level blocks current_block that fit in block do
 2:     level ← 0
 3:     buddy ← find buddy of current_block at level
 4:     while buddy is free do
 5:         level ← level + 1
 6:         buddy ← find buddy of current_block at level
 7:     end while
 8:     add current_block to the free-list at level
 9: end for
Algorithm 6 iBuddy deallocation algorithm
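The following C++ sketch illustrates Algorithm 6; buddy_is_free, push_free_list, and mark_free are hypothetical helpers standing in for the free-list and bitmap operations, and the 16-byte minimum block matches the ZGC configuration.

```cpp
#include <cstddef>
#include <cstdint>

bool buddy_is_free(uintptr_t block, int level);   // bitmap lookup (assumed helper)
void push_free_list(uintptr_t block, int level);  // free-list insert (assumed helper)
void mark_free(uintptr_t block, int level);       // bitmap update (assumed helper)

void ibuddy_deallocate(uintptr_t addr, size_t size) {
    const size_t kMinBlock = 16;  // smallest block, as in the ZGC configuration
    const int kMaxLevel = 14;     // 16 B .. 256 KiB
    // Deallocate every smallest block in the range individually.
    for (uintptr_t b = addr; b < addr + size; b += kMinBlock) {
        int level = 0;
        // Rise one level for every free buddy; no explicit merging is done.
        while (level < kMaxLevel && buddy_is_free(b, level))
            ++level;
        push_free_list(b, level);  // insert at the final level
        mark_free(b, level);
    }
}
```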

As an example of the deallocation process, consider Figure 16, which shows the previ-
ous figure, now after deallocating the first 16-byte allocation. The buddy is first checked
at the bottom level, then again at the level above that, to finally end up at the third level,
where it is inserted and marked in the bitmap as free. The complexity of deallocating
single blocks is the same as in the binary buddy allocator, but less work is required,
since no explicit merging is performed; only the blocks’ buddies are checked.

Figure 16 The state of the free-list and bitmap in an inverse binary buddy allocator after
one 16-byte allocation, one 64-byte allocation, and one 16-byte deallocation.

Similar to allocation, the deallocation speedup for smaller blocks comes at the cost
of larger blocks. When deallocating a large block, all the smaller blocks within that
block also need to be inserted back into the free-list and bitmap. Thus, the operation of
deallocating a large block is equivalent in cost to individually deallocating each of the
smallest-size blocks that make up the larger block. This is more costly than a binary
buddy, where larger blocks require the least work as they merge the fewest number of
times.


7 Adapting the Buddy Allocator for use With ZGC

As discussed in Section 6, there is ample room to improve the original binary buddy
allocator. The two designs discussed have different strengths, so both are implemented
to compare and contrast with each other and the original design. The binary tree allo-
cator closely resembles the binary buddy allocator in its allocation method, whereas the
iBuddy allocator differs significantly. Both allocators are investigated to determine the
most efficient allocation and deallocation processes for all possible allocatable sizes.
Further enhancements can be developed that are orthogonal to the specific allocator
implementations. These enhancements can be general optimizations or specifically tai-
lored to the context of operating within ZGC. Constraining the use case enables the
allocators to assume access to additional information or to narrow their functionality.
This section discusses the adaptations made to the allocators to improve their perfor-
mance within ZGC.

7.1 Allocator Configurability

There are numerous scenarios in which a garbage collector could benefit from an ad-
vanced memory allocator, many of which are beyond the scope of this thesis. Each
scenario has distinct characteristics and unique requirements. Improvements should
enhance specific scenarios without causing adverse effects in other situations. If a mod-
ification improves one use case but negatively impacts another, it should be possible
to disable that change. This flexibility allows each scenario to utilize the most suit-
able configuration for that specific case by enabling or disabling the desired features.
The adaptations made are independent of one another, allowing easy configuration
of specific features.
Making the allocator configurable improves its applicability to a wide range of scenar-
ios. Although this versatility is advantageous, it may lead to limitations in certain in-
stances. Prioritizing configurability can hinder the allocator from being fully optimized
for one particular use case. Significant modifications to enhance one specific scenario
could adversely affect others. The choice was made to prioritize configurability instead
of restricting the allocator to a single use case. This decision allows the allocator to be
used in a variety of scenarios, with the user selecting the most suitable configuration for
their specific needs.


7.2 Allocator Metadata Implementations

For some scenarios, a specific metadata implementation may be more beneficial than
others. The metadata structures can be altered to either prioritize memory efficiency or
speed, depending on needs. This section covers the choices made regarding the metadata
storage of the allocators.

7.2.1 Page Metadata

The design of the buddy allocator requires storing additional metadata separate from
the data on the page. This metadata, which stores the status of possible blocks, incurs
a fixed overhead for each page allocated, rather than increasing with each allocation.
The size of the metadata is influenced by the total number of possible blocks rather than
their size. Consequently, larger block sizes lead to a smaller proportional overhead,
whereas smaller block sizes result in a significantly higher proportional overhead.
The basic binary buddy allocator is the most efficient in terms of memory usage, re-
quiring the least data. The state of each potential block must be stored, which can be
represented by a single bit. This data is only used during deallocation, so the states of
two buddy blocks can be merged into one bit using XOR operations, effectively halving
the memory requirement. This optimization is not possible for the iBuddy allocator, as
blocks can be deallocated without knowing the state of their buddy.
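A minimal sketch of the XOR optimization, under the assumption that one bit stores the combined state of each buddy pair (the names are illustrative):

```cpp
#include <cstddef>
#include <vector>

struct PairBitmap {
    std::vector<bool> bits;  // one bit per buddy pair: state(A) XOR state(B)

    // Flip the pair bit on every allocation or deallocation of either buddy.
    void toggle(size_t block_index) {
        bits[block_index / 2] = !bits[block_index / 2];
    }

    // Checked before toggling when freeing one buddy of the pair.
    bool buddy_free_on_dealloc(size_t block_index) const {
        return bits[block_index / 2];
    }
};
```

On deallocation of one buddy, a set bit means exactly one block of the pair was allocated, namely the one being freed, so its buddy must be free and the pair can be merged.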
The binary tree allocator retains the design of the original binary buddy allocator but
stores its metadata in a different format. This approach allows for faster allocation and
deallocation, as there is no explicit free-list that needs to be managed.
The binary tree is stored as a flattened byte array, with each level stored contiguously
following the preceding one. Most levels have each byte representing the value of a
single block, but the lowest levels deviate from this pattern. Each node in the tree holds
the maximum possible level that can be allocated within it or its children. The smallest
blocks can only store 1 or 0, enabling memory optimization by compacting and storing
8 blocks within each byte in the array. Optimizing this level, which contains half of
the total blocks, results in memory savings of 43.75%. Similarly, the data for the next
two levels can be compacted to 2 bits per block, further reducing memory by 28.13%.
Higher levels require at least 4 bits of information and constitute a smaller fraction of
total blocks, so their memory is not optimized to avoid performance impacts from bit
operations.
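A sketch of how the lowest level could be packed at one bit per block is shown below; the layout is an assumption for illustration, not the exact storage format used.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Packed bottom level of the flattened tree: 8 leaf blocks per byte.
struct PackedLeaves {
    std::vector<uint8_t> bytes;

    bool is_free(size_t leaf) const {
        return (bytes[leaf / 8] >> (leaf % 8)) & 1u;
    }
    void set_free(size_t leaf, bool free) {
        uint8_t mask = uint8_t(1u << (leaf % 8));
        if (free) bytes[leaf / 8] |= mask;
        else      bytes[leaf / 8] &= uint8_t(~mask);
    }
};
```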


7.2.2 Finding Block Buddies

When deallocating, the allocator must determine the correct block to deallocate based
on the memory address. Since an address can belong to any block level, additional
information is required. The simplest solution is to require the user to provide the
allocated size, allowing the allocator to calculate the deallocation level. This method
is fast and incurs minimal overhead as the allocator does not need to store additional
data. However, it requires the user to keep track of the allocated size, which may not
always be feasible.
Alternatively, the level of each allocation can be stored inside the block or in a sepa-
rate data structure. Although this is simple and allows for fast lookups, it is not very
memory-efficient. If the data is stored within a block, the usable space decreases. Stor-
ing data separately for each possible block location results in a slightly larger but con-
stant overhead.
A third approach uses a bitmap to track which blocks have been split. This method is
very memory-efficient but requires more operations to find the correct blocks. Figure 17
illustrates this bitmap after a 16-byte block allocation, showing all blocks above the
allocated block marked as split. The smallest blocks do not require an entry in the
bitmap, as they cannot be split. During deallocation, the allocator starts at the top of the
bitmap and traverses down until the block is no longer marked as split.
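The following sketch illustrates such a top-down traversal of the split bitmap; the heap-style indexing is an assumption for illustration.

```cpp
#include <cstdint>
#include <vector>

// split_[i] is true if block i (heap order, root at index 1) has been
// split into its two children. Names and layout are illustrative.
struct SplitBitmap {
    std::vector<bool> split_;
    uintptr_t base_;    // start of the managed memory
    int max_level_;     // level of the root block
    size_t min_block_;  // smallest block size in bytes

    int level_of(uintptr_t addr) const {
        size_t index = 1;
        int level = max_level_;
        // Descend while the block containing addr is marked as split.
        while (level > 0 && split_[index]) {
            size_t block_size = min_block_ << level;
            size_t offset = (addr - base_) % block_size;
            // Pick the child half that contains addr.
            index = 2 * index + (offset >= block_size / 2 ? 1 : 0);
            --level;
        }
        return level;  // first non-split block holding addr
    }
};
```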

Figure 17 The state of the split blocks-bitmap and free-list of a binary buddy allocator
after one 16-byte allocation.

All three options are implemented and can be configured in the allocators, with the
most suitable option depending on the specific usage scenario. For general use, the
bitmap approach is the most memory-efficient across various allocator configurations.
For use within ZGC, data structures responsible for tracking size can be omitted, dele-
gating this function to the garbage collector. The collector has sufficient data from its
live analysis and object headers to compute the size of each object and allocation. This
approach eliminates additional overhead, making optimal use of existing resources.


7.3 Using the Allocators on Already Allocated Memory

In certain situations, using an advanced allocator to manage memory may not be the
most efficient option. For instance, bump-pointer allocation could be used initially,
switching to a more advanced allocator when significant external fragmentation arises.
This approach maximizes the speed advantage of bump-pointer allocations while lever-
aging advanced allocators when necessary.
In the context of ZGC, the allocator must be able to allocate around memory already
in use. This imposes limitations on the allocator, as it no longer “owns” the memory it
uses for allocation. The allocator instead has to allocate in holes of available memory.
This affects where the allocator metadata is stored, as it cannot be placed within the
allocator’s own memory region.
When working with memory not fully controlled by the allocator, the allocator starts
with no available memory for allocation. The user must indicate the locations of oc-
cupied memory or the intervals between them, defining free regions that the allocator
can use. In ZGC, live analysis of a page provides information on live objects, enabling
deallocation of the intervals between them.
To implement functionality for this scenario, the allocator’s initial state is fully occupied
with the smallest-sized blocks, avoiding accidental merging of blocks overlapping with
occupied regions. During the deallocation of a specific range, the largest fitting blocks
are added to the free-list, and each block and its split children are marked as no longer
split. Figure 18 illustrates this process, showing blocks added to the free-list covering
the entire free range.
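A minimal sketch of deallocating a range this way, where the alignment of the current offset bounds the largest block that can be inserted (insert_free_block is a hypothetical helper adding a block to the free-list and clearing its split marks):

```cpp
#include <algorithm>
#include <cstdint>

void insert_free_block(uintptr_t offset, uintptr_t size);  // assumed helper

// Largest power of two <= x (x > 0).
static uintptr_t floor_pow2(uintptr_t x) {
    while (x & (x - 1)) x &= x - 1;  // clear low bits until one remains
    return x;
}

// start and end are offsets from the page base.
void free_range(uintptr_t start, uintptr_t end) {
    while (start < end) {
        // The alignment of start limits the largest block placeable here.
        uintptr_t align = start & (~start + 1);
        uintptr_t limit = align ? std::min(align, end - start) : end - start;
        uintptr_t block = floor_pow2(limit);
        insert_free_block(start, block);
        start += block;
    }
}
```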

Figure 18 The blocks added to the free-list after deallocating the range up until the last
block.


7.4 Lazy Splitting and Merging of Blocks

Splitting and merging blocks is costly and should be avoided if possible. The binary
buddy allocator merges blocks whenever possible, often requiring immediate splitting
for subsequent allocations. Barkley and Lee [1] designed a modified buddy allocator
that delays block merging. As long as the allocation size distribution remains consistent,
costs related to splitting and merging can be avoided. A simpler version of Barkley
and Lee’s design is implemented, with a second free-list (lazy layer) on top of the buddy
allocator (buddy layer). This separate free-list does not perform any splitting or merging
and solely stores blocks of different sizes.
When deallocating using a lazy layer, the block is inserted into the lazy layer’s free-
list. If the lazy layer reaches a certain threshold of blocks, the block is inserted into the
buddy layer as normal. During allocation, the lazy layer is checked first for a suitable
block, followed by splitting blocks from the buddy layer if necessary. If both steps fail,
the lazy layer can be emptied to allow smaller blocks to be merged to meet the required
allocation size.
It is logical for different block sizes to have different thresholds, as the frequency of
sizes differs. Since most allocations are small, smaller blocks should have higher
thresholds. Storing large blocks in the lazy layer would also reserve them for
allocations of only that size, which could prevent smaller allocations from being
fulfilled; thresholds therefore decrease for larger block sizes. In the implementation,
the smallest block size has the highest threshold, set to 1000 by default, with each
level above halving the threshold of the level below. A high threshold is desired, as
ZGC can often deallocate many objects and then allocate many objects of the same
size, and the lazy layer needs the capacity to support this pattern.
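A minimal sketch of such a lazy layer, assuming per-level thresholds that halve above the smallest size (class and member names are illustrative, not the thesis code):

```cpp
#include <cstddef>
#include <vector>

struct LazyLayer {
    struct Level { std::vector<void*> blocks; size_t threshold; };
    std::vector<Level> levels;

    explicit LazyLayer(size_t num_levels, size_t base_threshold = 1000) {
        levels.resize(num_levels);
        for (size_t i = 0; i < num_levels; ++i)
            levels[i].threshold = base_threshold >> i;  // roughly halve per level
    }

    // Returns a cached block, or nullptr to fall through to the buddy layer.
    void* allocate(size_t level) {
        auto& l = levels[level];
        if (l.blocks.empty()) return nullptr;
        void* b = l.blocks.back();
        l.blocks.pop_back();
        return b;
    }

    // Returns true if the block was cached; false means the caller should
    // deallocate into the buddy layer as normal.
    bool deallocate(void* block, size_t level) {
        auto& l = levels[level];
        if (l.blocks.size() >= l.threshold) return false;
        l.blocks.push_back(block);
        return true;
    }
};
```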


7.5 Allocator Regions

In the original binary buddy allocator, the largest block equals the entire provided mem-
ory, which is then split into smaller blocks during allocation. This approach can degrade
performance in multithreaded programs with concurrent allocations, as only one alloca-
tion can split a large block at a time.
To avoid this, the maximum block size is set lower than the entire memory size, dividing
the memory into smaller regions, each with a normal buddy block structure. Park et
al. [16] suggest this for the iBuddy allocator. For example, splitting memory into two
regions, each with a maximum block size of half the total memory, allows for two
concurrent allocations. If one region is prioritized, fragmentation would be reduced as
one region would fulfill smaller allocations, leaving space for larger allocations in the
second region.
In ZGC, the maximum allocation size in a small page is 256 KiB, a fraction of the page
size of 2 MiB. This allows for eight allocations of the maximum size within a page. The
allocator can be divided into eight regions, each covering one-eighth of a page. Fewer
regions would limit the number of concurrent allocations, while more regions would
restrict the maximum allocation size.
One allocator controls all regions, keeping shared metadata such as the lazy layer, while
each region has its own buddy structure. This design allows for concurrent allocations in
different regions, as long as the allocations are evenly distributed. Scalability extends up
to the number of regions, with additional allocations needing to wait for prior allocations
to complete.
Situations could arise where some regions become more heavily utilized than others,
decreasing throughput. When a thread initiates an allocation, it is assigned a specific
region. If another thread is allocating in that region, it checks the subsequent regions. If
all regions are in use, the thread waits. Upon entering a region, the necessary allocation
logic is performed. If an allocation fails due to insufficient space, the thread exits the
region and retries in the next one. Allocation requests fail when all regions cannot
accommodate the request.
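The following sketch outlines this region-selection loop using try-locks; names are illustrative, and a real implementation would wait when all regions are busy rather than fail immediately.

```cpp
#include <array>
#include <cstddef>
#include <mutex>

struct RegionedAllocator {
    static constexpr size_t kRegions = 8;  // as in the ZGC configuration
    std::array<std::mutex, kRegions> locks_;

    void* region_allocate(size_t region, size_t size);  // per-region buddy (assumed)

    void* allocate(size_t size, size_t home_region) {
        // Try each region once, starting from the thread's assigned one.
        for (size_t i = 0; i < kRegions; ++i) {
            size_t r = (home_region + i) % kRegions;
            std::unique_lock<std::mutex> guard(locks_[r], std::try_to_lock);
            if (!guard.owns_lock())
                continue;                       // region busy, try the next one
            if (void* p = region_allocate(r, size))
                return p;                       // success in this region
            // Insufficient space here: release the lock and retry in the next.
        }
        return nullptr;  // no region could accommodate the request
    }
};
```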


7.6 Combining and Configuring the Adaptations

All the mentioned adaptations build on top of any of the three different buddy allocators
discussed in Sections 4 and 6. Each adaptation can be enabled or disabled indepen-
dently, allowing for a wide range of configurations. The user can select the most suitable
configuration for their specific use case, optimizing the allocator for their needs. While
the adaptations are designed to enhance the allocator’s performance within ZGC, the
configurability allows the user to optimize the allocator for any use case.
To configure the allocator, the user first selects which base allocator to use. The bi-
nary buddy allocator is the most memory-efficient, the binary tree allocator is a faster
alternative at the cost of memory, and the iBuddy allocator is designed for fast small
block allocations. The user then configures the adaptations: selecting the minimum and
maximum block sizes, the number of regions, the metadata storage method, and the lazy
layer thresholds. The allocator is then ready for use, optimized for the user’s specific
needs.
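An illustrative configuration struct reflecting these options might look as follows; the field names are assumptions, not the implemented API.

```cpp
#include <cstddef>

enum class BaseAllocator { BinaryBuddy, BinaryTree, IBuddy };
enum class BuddyTracking { UserProvidedSize, StoredLevel, SplitBitmap };

struct AllocatorConfig {
    BaseAllocator base     = BaseAllocator::BinaryTree;
    size_t min_block       = 16;          // bytes
    size_t max_block       = 256 * 1024;  // bytes (256 KiB)
    size_t regions         = 8;
    BuddyTracking tracking = BuddyTracking::SplitBitmap;
    bool lazy_layer        = true;
    size_t lazy_threshold  = 1000;        // smallest size; halves per level above
};
```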


8 Evaluation Methodology

Performance, memory usage, and fragmentation of the allocators are critical metrics
that were evaluated in this study. These are important factors for a memory allocator,
and improvements in one often affect others. This section outlines the evaluation pro-
cess used to assess the performance and memory usage of the allocators, as well as
the fragmentation they introduce. Performance was measured by the time required for
allocation and deallocation requests. Memory usage refers to the additional memory
utilized by the allocator for its operations, and fragmentation refers to the inefficiencies
in memory utilization resulting from the allocation strategy.
As previously stated, a major limitation of this project was that the allocator was not
integrated into ZGC, even though this was its intended purpose. Such an integration
would have provided the simplest and most intuitive method of testing the allocator, as
this is the environment in which the allocator would be used. The challenge, in the
absence of such integration, was to identify other benchmarks that could indicate the
allocator’s performance within ZGC.

8.1 Allocator Configurations

The three versions of the buddy allocator—the binary buddy allocator, the binary tree
allocator, and the iBuddy allocator—were tested and compared against each other. They
all share the same base configuration of block sizes and regions, with a minimum block
size of 16 bytes, a maximum block size of 256 KiB, and a total of 8 regions.

8.2 Performance

Performance is influenced by both the type of allocator used and the context of its use.
Different allocators perform differently depending on the task; therefore, their perfor-
mance varies across different programs. Instead of relying on a single program to com-
pare all allocators, the time required for the allocation and deallocation operations was
measured for each allocator. These measurements were used to compare the allocators’
performance across different block sizes and allocation patterns.


8.2.1 Measuring Performance

The time required for a single allocation or deallocation for each block size was ob-
served to provide insight into the performance of each allocator. Additionally, timing
the allocation of a constant memory size using various block sizes demonstrated the
performance of repeated allocations and deallocations.
For the single allocation and deallocation benchmarks, time was measured for every
possible block size ranging from 2⁴ to 2¹⁸. These figures match the potential block
sizes in a ZGC small page. To minimize noise, the average of 100 000 allocations
was calculated for each size and each allocator. The duration was recorded before
and after the allocation call using the POSIX function clock_gettime() with the
CLOCK_MONOTONIC_RAW clock.
For the page-fill benchmark, a contiguous 2 MiB memory block was filled with ev-
ery possible block size ranging from 2⁴ to 2¹⁸, matching ZGC’s allocation sizes and
page size. For each allocator and block size, the memory was filled 10 000 times.
The duration was recorded using the POSIX function clock_gettime() with the
CLOCK_MONOTONIC_RAW clock, starting before the initial allocation and ending after
the final one.
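A sketch of the timing harness is shown below; the allocator calls are stand-ins (malloc/free here), whereas the real benchmark invokes the buddy allocator under test.

```cpp
#include <cstdint>
#include <cstdlib>
#include <ctime>

// Stand-ins for the allocator under test.
static void* bench_allocate(size_t size) { return std::malloc(size); }
static void  bench_free(void* p)         { std::free(p); }

static inline uint64_t now_ns() {
    timespec ts{};
    clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
    return uint64_t(ts.tv_sec) * 1000000000ull + uint64_t(ts.tv_nsec);
}

// Mean time of a single allocation for one block size.
double mean_alloc_ns(size_t size, int iterations) {
    uint64_t total = 0;
    for (int i = 0; i < iterations; ++i) {
        uint64_t t0 = now_ns();
        void* p = bench_allocate(size);
        total += now_ns() - t0;  // only the allocation call is timed
        bench_free(p);
    }
    return double(total) / double(iterations);
}
```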
These tests showed different performance aspects of the different allocators for various
allocation sizes, which were used to draw several conclusions. To estimate the per-
formance of a particular program, one would examine its allocation distribution and
compare it with the test results to determine the most suitable allocator for that specific
application.

8.2.2 System Specifications

Performance evaluations were performed using two different machines, detailed in Ta-
ble 2. Each memory allocator was compiled using GCC 11.4.0, adhering to the C++14
standard (-std=c++14) and using optimization level two (-O2).


Table 2 Machines used for performance benchmarks.


Configuration               Machine A                      Machine B
CPU                         Intel® Core™ i7-1270P          AMD Opteron™ 6282 SE
Sockets / Cores / Threads   1 / 4P + 8E / 16               2 / 16 / 32
Frequency (Base / Turbo)    2.2 GHz / 4.8 GHz              2.6 GHz / 3.3 GHz
L1 Cache                    448 KiB                        512 KiB
L2 Cache                    9 MiB                          32 MiB
L3 Cache                    18 MiB                         24 MiB
Memory                      16 GiB                         126 GiB
OS                          Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Kernel                      6.5.0-17-generic               5.15.0-101-generic

8.3 Fragmentation

8.3.1 Internal Fragmentation

Internal fragmentation occurs when allocation sizes are rounded up to the next power
of two to align with block sizes, and it is identical in all the allocators. This results
in unused memory for every allocation that is not a perfect power of two. To quantify
this, two counters were used: one for the requested allocation size and another for the
total allocated size. Each allocation operation incremented these counters, while deal-
location operations decremented them. At any point during program execution, these
two counters could be compared to measure the amount of internal fragmentation. The
percentage of internal fragmentation, or wasted memory, was calculated as:
    Internal fragmentation = (Used memory − Requested memory) / Used memory
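A minimal sketch of these two counters (names are illustrative):

```cpp
#include <cstddef>

struct FragmentationCounters {
    size_t requested = 0;  // sum of requested allocation sizes
    size_t used      = 0;  // sum of rounded-up block sizes

    void on_allocate(size_t req, size_t block)   { requested += req; used += block; }
    void on_deallocate(size_t req, size_t block) { requested -= req; used -= block; }

    double internal_fragmentation() const {
        return used ? double(used - requested) / double(used) : 0.0;
    }
};
```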

8.3.2 External Fragmentation

External fragmentation occurs when allocated blocks are placed in memory in a way that
inhibits larger allocations. Even if the total space available is greater than the requested
size, there may be no single contiguous block large enough to satisfy that request. The
different allocators have different policies about where blocks are placed, so fragmen-
tation differs between them.


8.3.3 Measuring Fragmentation

Fragmentation is highly dependent on a specific pattern of allocation and deallocation.
A program that consecutively allocates only power-of-two sizes would experience no internal
or external fragmentation. However, this scenario is unrealistic, as programs typically
mix allocations and deallocations across a broad range of sizes. To illustrate potential
fragmentation levels in a program, a simulation was performed that used the modified
allocators in a way that created high fragmentation. This simulation provided a rough
estimate of the maximum fragmentation that could arise from typical use of the alloca-
tor.
This test used two modules: one for creating a distribution of allocation sizes and an-
other for producing a sequence of allocations and deallocations based on that distribu-
tion. Using these, it was possible to measure both internal and external fragmentation at
any point within the sequence of allocations.
The chosen allocation distribution was based on the observation that most allocations
in Java programs are small, with a few large allocations and occasionally very large
ones. Samples were drawn from a Poisson distribution with a mean of λ = 6. The
samples were converted to allocation sizes by using them as exponents with base 2, and
a variance of ±50% was applied to the converted samples to introduce spread.
To make the evaluation targeted towards ZGC, the values were aligned to 8, and any
values smaller than 2⁴ or larger than 2¹⁸ were excluded to adhere to the limited allocation
size range for small pages. This resulted in a distribution centered on 2⁶, with most
values clustering near this point and a steep decrease in the frequency of
higher values.
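The sampling procedure could be sketched as follows, assuming the standard library’s Poisson distribution; parameter names are illustrative.

```cpp
#include <cmath>
#include <cstddef>
#include <random>

// Poisson(λ = 6) samples used as base-2 exponents, with ±50% spread,
// aligned to 8 and restricted to [2^4, 2^18].
size_t sample_alloc_size(std::mt19937& rng) {
    std::poisson_distribution<int> exponent(6.0);             // λ = 6
    std::uniform_real_distribution<double> spread(0.5, 1.5);  // ±50% variance
    for (;;) {
        double raw = std::ldexp(1.0, exponent(rng)) * spread(rng);  // 2^k ± 50%
        size_t aligned = (static_cast<size_t>(raw) + 7) & ~size_t(7);  // align to 8
        if (aligned >= (1u << 4) && aligned <= (1u << 18))
            return aligned;
        // Out-of-range samples are excluded and redrawn.
    }
}
```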
The sequence of allocations and deallocations was created to introduce significant frag-
mentation without using a deliberate pattern. First, a sequence based on the allocation
size distribution was generated. Following this sequence, allocations were made until
a failure occurred. Subsequently, half of the allocations by size were randomly deal-
located, resulting in fragmentation throughout the entire memory. This process was
iterated numerous times to ensure extensive fragmentation.
External fragmentation was measured each time an allocation attempt was unsuccess-
ful, providing insight into the allocator’s ability to manage fragmented memory. Inter-
nal fragmentation was measured simultaneously, capturing the impact of random size
variations on memory utilization. This approach ensured a thorough evaluation of the
allocator’s performance under conditions of high fragmentation.


9 Results

This section presents the results of the evaluation described in Section 8. The three
buddy allocators presented in Sections 4 and 6, as well as the adaptations made in Sec-
tion 7, were evaluated based on their performance, memory overhead, and fragmenta-
tion.

9.1 Memory Overhead

The binary buddy allocator is the most efficient in terms of memory usage, requiring
the least data. As discussed in Section 7.2.1, the allocator only needs to store the state
of each block, which can be represented by a single bit. Additionally, the binary buddy
allocator can merge the states of two buddy blocks into a single bit. In the ZGC config-
uration, this resulted in a total overhead of 16 KiB per 2 MiB page, or 0.78%.
The iBuddy allocator also requires storing each block’s state, but the same optimization
is not possible due to its design. Each block’s state must be set and read individually,
as operations can be executed on multiple blocks at once. In the ZGC configuration,
this led to an overall overhead of 32 KiB per 2 MiB page, or 1.56%.
The binary tree allocator requires additional information to be stored. Storing only the
state of each block is insufficient, as each block needs to maintain a record of the highest
level accessible below it. However, significant optimizations were applied to the lower
layers, as described in Section 7.2.1, reducing the total overhead to 57 KiB per 2 MiB
page, or 2.78%.

9.2 Performance

The results of the single allocation benchmark defined in Section 8.2.1 are presented in
Figure 19. Figures 19a and 19b show the results for machines A and B, respectively. The
performance of the allocators was similar across both machines, differing by a constant
speed factor.
Both the binary buddy and binary tree allocators demonstrated faster allocation of larger
block sizes, with the latter showing superior performance for smaller block sizes. The
iBuddy allocator was faster at allocating smaller blocks but had reduced speed for larger
blocks. Utilizing the lazy layer significantly improved performance across all block
sizes, with the greatest impact at smaller sizes.


(a) Allocation benchmark results from machine A.

(b) Allocation benchmark results from machine B.


Figure 19 Performance of individual allocations across various block sizes and alloca-
tors. The benchmark is run on both machines, and each bar displays the mean results
from 100 000 allocations.


The results of the single deallocation benchmark defined in Section 8.2.1 are presented
in Figure 20. Figures 20a and 20b show the results for machines A and B, respectively.
The performance of the allocators was similar across both machines, differing by a
constant speed factor.
Both the binary buddy and binary tree allocators demonstrated faster deallocation of
larger block sizes, with the latter showing superior performance for smaller block sizes,
although less than during allocation. The iBuddy allocator was faster at deallocating
smaller blocks but had reduced speed for larger blocks. Utilizing the lazy layer signifi-
cantly improved performance across all block sizes, with the greatest impact at smaller
sizes.

(a) Deallocation benchmark results from machine A

(b) Deallocation benchmark results from machine B


Figure 20 Performance of individual deallocations across various block sizes and alloca-
tors. The benchmark is run on both machines, and each bar displays the mean results
from 100 000 deallocations.


The results of the page-fill benchmark defined in Section 8.2.1 are presented in Fig-
ure 21. Figures 21a and 21b show the results for machines A and B, respectively. The
performance of the allocators was similar across both machines, differing by a constant
speed factor.
The results illustrate the characteristics of each allocator when making multiple alloca-
tions. The binary buddy and binary tree allocators scale linearly with block size due to
their improved allocation speed for larger blocks, with the binary tree allocator being
the faster of the two. The iBuddy allocator was fastest for the smallest block sizes, but
its inverse scaling makes it only marginally faster when allocating larger blocks, quickly
flattening out as block size increases.

(a) Allocation results from machine A (b) Allocation results from machine B
Figure 21 Time taken to allocate 2 MiB across various block sizes and allocator ver-
sions. The benchmark is run on both machines, and each point displays the mean results
from 1 000 allocations.

9.3 Fragmentation

9.3.1 Internal Fragmentation

Internal fragmentation remains consistent across all allocator versions, as they round
block sizes equally and use the same allocation pattern. Table 3 presents measures that
quantify the internal fragmentation created by the allocators. Although fragmentation
was considerable, the standard deviation is low, indicating consistent fragmentation with
a low spread. The chosen allocation pattern greatly influenced fragmentation levels,
causing nearly all allocations to require rounding up.


Table 3 Measurements of internal fragmentation based on 1 000 observations. All allo-
cator versions experience the same internal fragmentation.

Measure               Value
Minimum               24.4%
Maximum               40.9%
Mean                  32.0%
Median                32.3%
Standard Deviation    2.1%

9.3.2 External Fragmentation

External fragmentation manifested itself differently in each allocator version due to their
different strategies for placing allocated blocks. Figure 22 illustrates the distribution of
free blocks in terms of number and total size for the binary buddy, binary tree, and
iBuddy allocators, respectively. The total free space is less important than the distribu-
tion of free blocks across different block sizes.
The binary buddy allocator showed the largest number of free blocks around 2⁷, but
these constituted a minor portion of the overall free space. Most of the space was oc-
cupied by blocks around 2¹⁴, suggesting that allocations of this size were possible, and
most unsuccessful allocations exceeded this size.
The binary tree allocator behaved similarly to the binary buddy allocator, with the largest
number of blocks around 2⁷. Although it had more smaller-sized blocks, these made
up only a small fraction of the total free space and could be attributed to a different
allocation order.
Most of the space was occupied by blocks around 2¹³, which is less than for the binary
buddy allocator. This suggests that the binary tree allocator was better at utilizing mem-
ory, as it had fewer large blocks. The overall free space was also less than that of the
binary buddy allocator, further indicating that the binary tree allocator used memory
more efficiently.
The iBuddy allocator stands out due to its aggressive splitting policy. The most com-
mon size of free blocks was 2⁶, which also constituted a substantial portion of the total
free space. The largest total space was found around 2¹³, although it was more evenly
distributed between sizes. The concentration at 2⁶ resulted from most allocations ex-
ceeding this size, which led to blocks being split down to this size. The overall free space
was greater than that of the other two allocators, suggesting less efficient use of
memory.


(a) Mean external fragmentation of the binary buddy allocator.

(b) Mean external fragmentation of the binary tree allocator.

(c) Mean external fragmentation of the iBuddy allocator.


Figure 22 Mean external fragmentation over 1 000 observations of the three allocators
when allocating and deallocating randomly.


10 Discussion

The evaluation of the adapted allocators presents considerable challenges. Because
these allocators are not integrated into ZGC, definitive conclusions regarding their per-
formance are difficult to reach. Only proxy benchmarks are possible, serving as indica-
tors of potential performance. Isolating the allocators for testing removes the influence
of ZGC or other user-defined logic. This isolation can result in, for example, cache-
locality effects that do not accurately reflect real-world scenarios. However, these as-
sessments enable a comparison of the allocators on a uniformly defined basis and allow
for a discussion of their comparative effectiveness.
Performance and external fragmentation benchmarks demonstrate that the binary tree
allocator consistently outperforms the binary buddy allocator, but at the cost of increased
memory overhead. Furthermore, the iBuddy allocator exhibits significant differences in
allocation performance and external fragmentation distribution compared to the other
two allocators. The choice of an allocator for a particular scenario should depend on the
most critical performance metric for that use case.
Internal fragmentation is the most significant issue faced by all the allocators. The re-
quirement to round up block sizes leads to considerable waste, especially with larger
allocations. Since most allocations do not align with a perfect power of two, fragmen-
tation occurs with each allocation, diminishing the efficiency of any buddy allocator
in utilizing memory space. In the context of ZGC, this problem can be mitigated by
strategically grouping object allocations to minimize internal fragmentation. Further
discussion of this strategy is found in Section 11.1 as future work.
External fragmentation can only be measured empirically, which is why the alloca-
tion/deallocation pattern is chosen to induce significant fragmentation without maxi-
mizing it. Under this pattern, all allocator versions experience high external fragmen-
tation, particularly at larger block sizes, where most of the total space is located. This
suggests that failures predominantly occur with very large allocation requests and that
a wide range of allocation sizes presents a major weakness, increasing the likelihood of
failures for large requests. Limiting the range of sizes would likely decrease external
fragmentation. Consequently, within ZGC, it should be presumed that additional allo-
catable space exists even if an allocation attempt fails. Assuming that memory is fully
utilized could lead to considerable memory wastage.


The unique behavior of the iBuddy allocator is evident in the benchmarks. Notably, its
performance for large block allocations is significantly worse compared with small allo-
cations in the other two allocators. Its only advantage is a slight increase in speed at the
smallest allocation sizes. However, its strategy of aggressively splitting blocks quickly
inhibits the allocation of larger blocks. Therefore, the iBuddy allocator is only feasible
when nearly all allocations are among the smallest sizes. If a considerable number of
allocations exceed this size, there will be a notable decrease in both performance and
usable memory.
Considering all factors, the binary tree allocator emerges as the most promising option
for implementation in ZGC among the three discussed. It consistently outperforms the
binary buddy allocator in all benchmarks, while the iBuddy allocator is impractical due
to the reasons discussed. Since memory consumption is not a critical concern, the addi-
tional memory expense of the binary tree allocator is justified by its speed improvement.
Although the binary tree allocator is slower at smaller allocation sizes, this drawback
can be mitigated by using the lazy layer, maintaining good performance across all allocation
sizes. This configuration is deemed the most promising for integrating a buddy allocator
within ZGC.
However, the intrinsic weaknesses of buddy allocators remain. Internal fragmentation
will be significant, even if it is possible to reduce it. Blocks still need to be aligned to
their size, which could become problematic if previous memory is awkwardly placed.
In addition, the cost of frequent allocator initialization could become non-negligible.
These issues should be taken into account when integrating with ZGC, as they could
affect both performance and memory efficiency. Future work should focus on strategies
to mitigate these weaknesses and explore alternative allocation algorithms that may offer
improved performance and reduced fragmentation in the context of ZGC.


11 Future Work

11.1 Integration Into ZGC

The most significant future step is to integrate the allocators into ZGC. As highlighted
in Section 1.3, Gärds [4] has, concurrently with this thesis, explored the integration of an
allocator into ZGC, focusing on a free-list-based allocator to enhance the efficiency of
relocating objects within ZGC. Future efforts could build on Gärds’ work by incor-
porating a buddy allocator into ZGC or exploring alternative integration scenarios not
considered by Gärds.
When integrating a buddy allocator specifically, the garbage collector can implement
strategies to mitigate some of its drawbacks. The primary issue is the considerable
internal fragmentation during allocation. To minimize this, ZGC could allocate multiple
objects simultaneously to align more closely with size requirements or use fragmented
memory from an allocation for future use. This approach requires that objects allocated
together be deallocated collectively. However, as detailed in Section 7.3, this is not an
issue when deallocating ranges.

11.2 Smaller Minimum Allocation Size

Currently, the minimum allocation size in ZGC is 16 bytes, constrained by the header
size of Java objects. Efforts are underway to decrease the size of these headers, po-
tentially lowering the smallest object size to 8 bytes [13]. Since this development is
ongoing, it was not considered in this thesis.
Reducing the minimum allocation size would introduce a new bottom layer for the al-
locators. Practically, this results in memory overhead per page doubling as it becomes
necessary to manage and track smaller allocations. Additionally, for the binary buddy
and iBuddy allocators, storing the two pointers needed for the free-list would no longer
be feasible. These pointers would need to be transformed into offsets from the begin-
ning of the page. Although the size of the pages makes this adjustment manageable, it
would still introduce some performance overhead.


11.3 Further Allocation Optimizations

Improving the binary tree storage format in the binary tree allocator might be feasible.
Currently, the levels of the tree are stored sequentially, requiring progressively larger
jumps when iterating over the levels. To increase spatial locality, the larger tree could
be broken down into several subtrees stored consecutively. This would place the seg-
ments closer together, reducing the likelihood of cache misses and potentially increasing
performance.
Currently, all the adapted allocators use the same concurrency approach, which involves
dividing memory into regions and only permitting concurrent allocations in different re-
gions. However, since their designs differ, it would be worth investigating different
designs for each allocator. For example, a lock-free mechanism could achieve concur-
rency for the binary tree allocator. As the binary tree allocator navigates its tree struc-
ture, compare-and-swap (CAS) atomic operations could replace the current subtraction
operations. Conflicts could arise with other threads, but threads can back off to resolve
these conflicts. Consequently, it is advisable to keep distributing allocations evenly us-
ing regions, but with the locking code removed, thereby only allocating concurrently
within the same region when necessary.
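A speculative sketch of such a CAS-based node update is shown below; this is a suggested direction, not an implemented design from the thesis.

```cpp
#include <atomic>
#include <cstdint>

// Replace the plain node update in the binary tree with a compare-and-swap.
// On failure another thread changed the node first; the caller can back off
// and retry, or move to a different subtree.
bool try_update_node(std::atomic<uint8_t>& node, uint8_t expected,
                     uint8_t desired) {
    return node.compare_exchange_strong(expected, desired);
}
```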

11.4 Combining Allocation Strategies

Another area of future research could involve exploring hybrid approaches that combine
the strengths of various buddy allocator designs. This concept parallels the strategies
used in other allocators like the Linux kernel [3] and jemalloc [2], which combine
multiple allocation techniques to optimize performance and memory utilization.
For instance, a hybrid buddy allocator could employ a binary tree allocator for large
block sizes and an iBuddy allocator for small block sizes, thereby leveraging the advan-
tages of both approaches. These allocators could be distributed across different memory
regions, optimizing their performance based on specific memory requirements. Alterna-
tively, one could investigate methods to enable these allocators to work in tandem on the
same memory region, dynamically switching between allocation strategies as needed to
maximize efficiency and minimize fragmentation.


12 Conclusions

This thesis has explored the adaptation and evaluation of different buddy allocators for
integration within the Z Garbage Collector (ZGC). Through analysis and benchmark-
ing, the binary tree allocator emerged as the most promising candidate, consistently
outperforming the traditional binary buddy allocator in all key benchmarks. Although
the iBuddy allocator demonstrated faster allocation speeds for smaller blocks, its poor
performance with larger block allocations and significant internal fragmentation makes
it impractical for uses where larger allocations occur regularly.
By tailoring the allocator to the specific needs of ZGC, several changes have been imple-
mented to enhance efficiency and introduce additional features. Using already-available
data allowed for further optimizations. Notably, a novel approach to initializing and
employing the allocator was implemented: it begins fully occupied, with the user iden-
tifying which sections are empty and available for allocation. These modifications are
designed to facilitate the seamless integration of the allocator into ZGC in the future.
To fully understand how the allocator would perform in ZGC, additional research is
needed. So far, only proxy benchmarks have been utilized, showing promise for poten-
tial real-world applications. Therefore, future efforts should focus on developing strate-
gies to mitigate the fundamental shortcomings of buddy allocators in terms of internal
fragmentation and alignment. Additionally, integrating the allocator within ZGC would
enable extensive testing and would be crucial in refining the allocator’s performance.

51
References

References

[1] R. Barkley and T. Lee, “A lazy buddy system bounded by two coalescing delays,”
ACM SIGOPS Operating Systems Review, vol. 23, no. 5, pp. 167–176, 1989.

[2] J. Evans, “A scalable concurrent malloc(3) implementation for FreeBSD,” in Proc.
of the BSDCan Conference, Ottawa, Canada, 2006.

[3] M. Gorman, Understanding the Linux virtual memory manager. Prentice Hall
Upper Saddle River, 2004, vol. 352.

[4] N. Gärds, “Partial Compaction in ZGC Using Free-Lists,” Master’s thesis, Uppsala
University, Department of Information Technology, 2024.

[5] R. Jones, A. Hosking, and E. Moss, The Garbage Collection Handbook, 2nd ed.
Chapman & Hall/CRC, 2023.

[6] K. C. Knowlton, “A fast storage allocator,” Communications of the ACM, vol. 8,
no. 10, pp. 623–624, 1965.

[7] R. Marotta, M. Ianni, A. Scarselli, A. Pellegrini, and F. Quaglia, “A non-blocking
buddy system for scalable memory allocation on multi-core machines,” in 2018
IEEE International Conference on Cluster Computing (CLUSTER), 2018, pp.
164–165.

[8] M. Masmano, I. Ripoll, A. Crespo, and J. Real, “TLSF: a New Dynamic Memory
Allocator for Real-Time Systems,” in Proceedings. 16th Euromicro Conference
on Real-Time Systems, 2004. ECRTS 2004., 2004, pp. 79–88. [Online]. Available:
[Link]

[9] Oracle. (2014) Concurrent Mark Sweep (CMS) Collector. Oracle. Accessed
2024-05-10. [Online]. Available: [Link]

[10] Oracle. (2023) Java Virtual Machine Guide. Accessed 2024-03-08. [Online].
Available: [Link]

[11] Oracle. (2024) OpenJDK. Accessed 2024-03-08. [Online]. Available: [Link]

[12] Oracle. (2024) openjdk/jdk (abbreviated commit: b5ed8cc). Accessed
2024-05-07. [Online]. Available: [Link]


[13] Oracle. (2024) Project Lilliput. Oracle. Accessed 2024-05-03. [Online]. Available:
[Link]

[14] Oracle. (2024) The Java HotSpot Performance Engine Architecture. Accessed
2024-03-08. [Online]. Available: [Link]

[15] Oracle. (2024) The Z Garbage Collector. Accessed 2024-05-15. [Online].
Available: [Link]

[16] H. Park, J. Choi, D. Lee, and S. H. Noh, “iBuddy: Inverse buddy for enhancing
memory allocation/deallocation performance on multi-core systems,” IEEE Trans-
actions on Computers, vol. 64, no. 3, pp. 720–732, 2014.

[17] J. L. Peterson and T. A. Norman, “Buddy systems,” Commun. ACM, vol. 20, no. 6,
pp. 421–431, Jun. 1977. [Online]. Available: [Link]

[18] Restioson. (2024) Buddy allocator workshop. Accessed 2024-04-21. [Online].
Available: [Link]

[19] J. Sikström, “Addressing Fragmentation in ZGC Through Custom Allocators,”
Master’s thesis, Uppsala University, Department of Information Technology, 2024.

[20] A. M. Yang, E. Österlund, and T. Wrigstad, “Improving program locality in
the GC using hotness,” in Proceedings of the 41st ACM SIGPLAN Conference
on Programming Language Design and Implementation, ser. PLDI 2020. New
York, NY, USA: Association for Computing Machinery, 2020, pp. 301–313.
[Online]. Available: [Link]

[21] A. M. Yang and T. Wrigstad, “Deep Dive into ZGC: A Modern Garbage Collector
in OpenJDK,” ACM Trans. Program. Lang. Syst., vol. 44, no. 4, Sep. 2022.
[Online]. Available: [Link]

