Is Parallel Programming Hard
Is Parallel Programming Hard
Edited by:
Paul E. McKenney
Facebook
[email protected]
v2023.06.11a
ii
Legal Statement
This work represents the views of the editor and the authors and does not necessarily
represent the view of their respective employers.
Trademarks:
• IBM, z Systems, and PowerPC are trademarks or registered trademarks of Inter-
national Business Machines Corporation in the United States, other countries, or
both.
• Linux is a registered trademark of Linus Torvalds.
• Intel, Itanium, Intel Core, and Intel Xeon are trademarks of Intel Corporation or
its subsidiaries in the United States, other countries, or both.
• Arm is a registered trademark of Arm Limited (or its subsidiaries) in the US and/or
elsewhere.
• SPARC is a registered trademark of SPARC International, Inc. Products bearing
SPARC trademarks are based on an architecture developed by Sun Microsystems,
Inc.
• Other company, product, and service names may be trademarks or service marks
of such companies.
The non-source-code text and images in this document are provided under the terms
of the Creative Commons Attribution-Share Alike 3.0 United States license.1 In brief,
you may use the contents of this document for any purpose, personal, commercial, or
otherwise, so long as attribution to the authors is maintained. Likewise, the document
may be modified, and derivative works and translations made available, so long as
such modifications and derivations are offered to the public on equal terms as the
non-source-code text and images in the original document.
Source code is covered by various versions of the GPL.2 Some of this code is
GPLv2-only, as it derives from the Linux kernel, while other code is GPLv2-or-later. See
the comment headers of the individual source files within the CodeSamples directory in
the git archive3 for the exact licenses. If you are unsure of the license for a given code
fragment, you should assume GPLv2-only.
Combined work © 2005–2023 by Paul E. McKenney. Each individual contribution
is copyright by its contributor at the time of contribution, as recorded in the git archive.
1 https://creativecommons.org/licenses/by-sa/3.0/us/
2 https://www.gnu.org/licenses/gpl-2.0.html
3 git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git
v2023.06.11a
Contents
2 Introduction 7
2.1 Historic Parallel Programming Difficulties . . . . . . . . . . . . . . . 7
2.2 Parallel Programming Goals . . . . . . . . . . . . . . . . . . . . . . 8
2.2.1 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.2 Productivity . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.3 Generality . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 Alternatives to Parallel Programming . . . . . . . . . . . . . . . . . . 12
2.3.1 Multiple Instances of a Sequential Application . . . . . . . . 12
2.3.2 Use Existing Parallel Software . . . . . . . . . . . . . . . . . 12
2.3.3 Performance Optimization . . . . . . . . . . . . . . . . . . . 13
2.4 What Makes Parallel Programming Hard? . . . . . . . . . . . . . . . 13
2.4.1 Work Partitioning . . . . . . . . . . . . . . . . . . . . . . . 14
2.4.2 Parallel Access Control . . . . . . . . . . . . . . . . . . . . 14
2.4.3 Resource Partitioning and Replication . . . . . . . . . . . . . 15
2.4.4 Interacting With Hardware . . . . . . . . . . . . . . . . . . . 15
2.4.5 Composite Capabilities . . . . . . . . . . . . . . . . . . . . 15
2.4.6 How Do Languages and Environments Assist With These Tasks? 16
2.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
iii
v2023.06.11a
iv CONTENTS
5 Counting 49
5.1 Why Isn’t Concurrent Counting Trivial? . . . . . . . . . . . . . . . . 49
5.2 Statistical Counters . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.2.1 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.2.2 Array-Based Implementation . . . . . . . . . . . . . . . . . 51
5.2.3 Per-Thread-Variable-Based Implementation . . . . . . . . . . 52
5.2.4 Eventually Consistent Implementation . . . . . . . . . . . . 54
5.2.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.3 Approximate Limit Counters . . . . . . . . . . . . . . . . . . . . . . 55
5.3.1 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.3.2 Simple Limit Counter Implementation . . . . . . . . . . . . 56
5.3.3 Simple Limit Counter Discussion . . . . . . . . . . . . . . . 59
5.3.4 Approximate Limit Counter Implementation . . . . . . . . . 60
5.3.5 Approximate Limit Counter Discussion . . . . . . . . . . . . 61
5.4 Exact Limit Counters . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.4.1 Atomic Limit Counter Implementation . . . . . . . . . . . . 61
5.4.2 Atomic Limit Counter Discussion . . . . . . . . . . . . . . . 64
5.4.3 Signal-Theft Limit Counter Design . . . . . . . . . . . . . . 64
5.4.4 Signal-Theft Limit Counter Implementation . . . . . . . . . . 65
5.4.5 Signal-Theft Limit Counter Discussion . . . . . . . . . . . . 67
5.4.6 Applying Exact Limit Counters . . . . . . . . . . . . . . . . 68
v2023.06.11a
CONTENTS v
7 Locking 101
7.1 Staying Alive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
7.1.1 Deadlock . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
7.1.2 Livelock and Starvation . . . . . . . . . . . . . . . . . . . . 109
7.1.3 Unfairness . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
7.1.4 Inefficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
7.2 Types of Locks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
7.2.1 Exclusive Locks . . . . . . . . . . . . . . . . . . . . . . . . 111
7.2.2 Reader-Writer Locks . . . . . . . . . . . . . . . . . . . . . . 111
7.2.3 Beyond Reader-Writer Locks . . . . . . . . . . . . . . . . . 112
7.2.4 Scoped Locking . . . . . . . . . . . . . . . . . . . . . . . . 113
7.3 Locking Implementation Issues . . . . . . . . . . . . . . . . . . . . . 115
7.3.1 Sample Exclusive-Locking Implementation Based on Atomic
Exchange . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
7.3.2 Other Exclusive-Locking Implementations . . . . . . . . . . 115
7.4 Lock-Based Existence Guarantees . . . . . . . . . . . . . . . . . . . 117
7.5 Locking: Hero or Villain? . . . . . . . . . . . . . . . . . . . . . . . 118
7.5.1 Locking For Applications: Hero! . . . . . . . . . . . . . . . 119
v2023.06.11a
vi CONTENTS
v2023.06.11a
CONTENTS vii
11 Validation 209
11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
11.1.1 Where Do Bugs Come From? . . . . . . . . . . . . . . . . . 209
11.1.2 Required Mindset . . . . . . . . . . . . . . . . . . . . . . . 210
11.1.3 When Should Validation Start? . . . . . . . . . . . . . . . . 212
11.1.4 The Open Source Way . . . . . . . . . . . . . . . . . . . . . 213
11.2 Tracing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
11.3 Assertions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
11.4 Static Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
11.5 Code Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
11.5.1 Inspection . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
11.5.2 Walkthroughs . . . . . . . . . . . . . . . . . . . . . . . . . . 216
11.5.3 Self-Inspection . . . . . . . . . . . . . . . . . . . . . . . . . 216
11.6 Probability and Heisenbugs . . . . . . . . . . . . . . . . . . . . . . . 217
11.6.1 Statistics for Discrete Testing . . . . . . . . . . . . . . . . . 218
11.6.2 Statistics Abuse for Discrete Testing . . . . . . . . . . . . . . 219
11.6.3 Statistics for Continuous Testing . . . . . . . . . . . . . . . . 219
11.6.4 Hunting Heisenbugs . . . . . . . . . . . . . . . . . . . . . . 220
11.7 Performance Estimation . . . . . . . . . . . . . . . . . . . . . . . . . 224
11.7.1 Benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . 224
11.7.2 Profiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
11.7.3 Differential Profiling . . . . . . . . . . . . . . . . . . . . . . 225
11.7.4 Microbenchmarking . . . . . . . . . . . . . . . . . . . . . . 225
11.7.5 Isolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
11.7.6 Detecting Interference . . . . . . . . . . . . . . . . . . . . . 227
11.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
v2023.06.11a
viii CONTENTS
v2023.06.11a
CONTENTS ix
v2023.06.11a
x CONTENTS
v2023.06.11a
CONTENTS xi
v2023.06.11a
xii CONTENTS
Glossary 585
Bibliography 595
Credits 639
LATEX Advisor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 639
Reviewers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 639
Machine Owners . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 639
Original Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 639
Figure Credits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 640
Other Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 641
Acronyms 643
Index 645
v2023.06.11a
If you would only recognize that life is hard, things
would be so much easier for you.
The purpose of this book is to help you program shared- 1.1 Roadmap
memory parallel systems without risking your sanity.1
Nevertheless, you should think of the information in this
book as a foundation on which to build, rather than as Cat: Where are you going?
a completed cathedral. Your mission, if you choose to Alice: Which way should I go?
accept, is to help make further progress in the exciting Cat: That depends on where you are going.
Alice: I don’t know.
field of parallel programming—progress that will in time
Cat: Then it doesn’t matter which way you go.
render this book obsolete.
Parallel programming in the 21st century is no longer Lewis Carroll, Alice in Wonderland
focused solely on science, research, and grand-challenge
projects. And this is all to the good, because it means This book is a handbook of widely applicable and heav-
that parallel programming is becoming an engineering ily used design techniques, rather than a collection of
discipline. Therefore, as befits an engineering discipline, optimal algorithms with tiny areas of applicability. You
this book examines specific parallel-programming tasks are currently reading Chapter 1, but you knew that al-
and describes how to approach them. In some surprisingly ready. Chapter 2 gives a high-level overview of parallel
common cases, these tasks can be automated. programming.
This book is written in the hope that presenting the
Chapter 3 introduces shared-memory parallel hardware.
engineering discipline underlying successful parallel-
After all, it is difficult to write good parallel code un-
programming projects will free a new generation of par-
less you understand the underlying hardware. Because
allel hackers from the need to slowly and painstakingly
hardware constantly evolves, this chapter will always be
reinvent old wheels, enabling them to instead focus their
out of date. We will nevertheless do our best to keep up.
energy and creativity on new frontiers. However, what
Chapter 4 then provides a very brief overview of common
you get from this book will be determined by what you
shared-memory parallel-programming primitives.
put into it. It is hoped that simply reading this book will
be helpful, and that working the Quick Quizzes will be Chapter 5 takes an in-depth look at parallelizing one
even more helpful. However, the best results come from of the simplest problems imaginable, namely counting.
applying the techniques taught in this book to real-life Because almost everyone has an excellent grasp of count-
problems. As always, practice makes perfect. ing, this chapter is able to delve into many important
But no matter how you approach it, we sincerely hope parallel-programming issues without the distractions of
that parallel programming brings you at least as much fun, more-typical computer-science problems. My impression
excitement, and challenge that it has brought to us! is that this chapter has seen the greatest use in parallel-
programming coursework.
Chapter 6 introduces a number of design-level methods
of addressing the issues identified in Chapter 5. It turns out
1 Or, perhaps more accurately, without much greater risk to your that it is important to address parallelism at the design level
sanity than that incurred by non-parallel programming. Which, come to when feasible: To paraphrase Dijkstra [Dij68], “retrofitted
think of it, might not be saying all that much. parallelism considered grossly suboptimal” [McK12c].
v2023.06.11a
2 CHAPTER 1. HOW TO USE THIS BOOK
The next three chapters examine three important ap- Some of them are based on material in which that quick
proaches to synchronization. Chapter 7 covers locking, quiz appears, but others require you to think beyond that
which is still not only the workhorse of production-quality section, and, in some cases, beyond the realm of current
parallel programming, but is also widely considered to knowledge. As with most endeavors, what you get out of
be parallel programming’s worst villain. Chapter 8 gives this book is largely determined by what you are willing to
a brief overview of data ownership, an often overlooked put into it. Therefore, readers who make a genuine effort
but remarkably pervasive and powerful approach. Finally, to solve a quiz before looking at the answer find their
Chapter 9 introduces a number of deferred-processing effort repaid handsomely with increased understanding of
mechanisms, including reference counting, hazard point- parallel programming.
ers, sequence locking, and RCU. Quick Quiz 1.1: Where are the answers to the Quick Quizzes
Chapter 10 applies the lessons of previous chapters to found?
hash tables, which are heavily used due to their excel-
lent partitionability, which (usually) leads to excellent Quick Quiz 1.2: Some of the Quick Quiz questions seem to
performance and scalability. be from the viewpoint of the reader rather than the author. Is
As many have learned to their sorrow, parallel program- that really the intent?
ming without validation is a sure path to abject failure.
Chapter 11 covers various forms of testing. It is of course Quick Quiz 1.3: These Quick Quizzes are just not my cup of
impossible to test reliability into your program after the tea. What can I do about it?
fact, so Chapter 12 follows up with a brief overview of a
In short, if you need a deep understanding of the mate-
couple of practical approaches to formal verification.
rial, then you should invest some time into answering the
Chapter 13 contains a series of moderate-sized parallel Quick Quizzes. Don’t get me wrong, passively reading the
programming problems. The difficulty of these problems material can be quite valuable, but gaining full problem-
vary, but should be appropriate for someone who has solving capability really does require that you practice
mastered the material in the previous chapters. solving problems. Similarly, gaining full code-production
Chapter 14 looks at advanced synchronization meth- capability really does require that you practice producing
ods, including non-blocking synchronization and parallel code.
real-time computing, while Chapter 15 covers the ad-
Quick Quiz 1.4: If passively reading this book doesn’t get me
vanced topic of memory ordering. Chapter 16 follows up
full problem-solving and code-production capabilities, what
with some ease-of-use advice. Chapter 17 looks at a few on earth is the point???
possible future directions, including shared-memory par-
allel system design, software and hardware transactional I learned this the hard way during coursework for my
memory, and functional programming for parallelism. Fi- late-in-life Ph.D. I was studying a familiar topic, and
nally, Chapter 18 reviews the material in this book and its was surprised at how few of the chapter’s exercises I
origins. could answer off the top of my head.2 Forcing myself to
This chapter is followed by a number of appendices. The answer the questions greatly increased my retention of the
most popular of these appears to be Appendix C, which material. So with these Quick Quizzes I am not asking
delves even further into memory ordering. Appendix E you to do anything that I have not been doing myself.
contains the answers to the infamous Quick Quizzes, Finally, the most common learning disability is thinking
which are discussed in the next section. that you already understand the material at hand. The
quick quizzes can be an extremely effective cure.
“Quick quizzes” appear throughout this book, and the 2 So I suppose that it was just as well that my professors refused to
answers may be found in Appendix E starting on page 473. let me waive that class!
v2023.06.11a
1.3. ALTERNATIVES TO THIS BOOK 3
1.3 Alternatives to This Book any data structure.” These are clearly not the words
of someone who is hostile towards RCU.
Between two evils I always pick the one I never tried 2. If you would like an academic treatment of parallel
before. programming from a programming-language-prag-
Mae West matics viewpoint, you might be interested in the
concurrency chapter from Scott’s textbook [Sco06,
As Knuth learned the hard way, if you want your book Sco15] on programming-language pragmatics.
to be finite, it must be focused. This book focuses on
3. If you are interested in an object-oriented patternist
shared-memory parallel programming, with an emphasis
treatment of parallel programming focussing on C++,
on software that lives near the bottom of the software
you might try Volumes 2 and 4 of Schmidt’s POSA
stack, such as operating-system kernels, parallel data-
series [SSRB00, BHS07]. Volume 4 in particular
management systems, low-level libraries, and the like.
has some interesting chapters applying this work to a
The programming language used by this book is C.
warehouse application. The realism of this example
If you are interested in other aspects of parallelism,
is attested to by the section entitled “Partitioning the
you might well be better served by some other book.
Big Ball of Mud”, in which the problems inherent
Fortunately, there are many alternatives available to you:
in parallelism often take a back seat to getting one’s
1. If you prefer a more academic and rigorous treatment head around a real-world application.
of parallel programming, you might like Herlihy’s 4. If you want to work with Linux-kernel device driv-
and Shavit’s textbook [HS08, HSLS20]. This book ers, then Corbet’s, Rubini’s, and Kroah-Hartman’s
starts with an interesting combination of low-level “Linux Device Drivers” [CRKH05] is indispensable,
primitives at high levels of abstraction from the hard- as is the Linux Weekly News web site (https:
ware, and works its way through locking and simple //lwn.net/). There is a large number of books and
data structures including lists, queues, hash tables, resources on the more general topic of Linux kernel
and counters, culminating with transactional mem- internals.
ory, all in Java. Michael Scott’s textbook [Sco13]
approaches similar material with more of a software- 5. If your primary focus is scientific and technical com-
engineering focus, and, as far as I know, is the first puting, and you prefer a patternist approach, you
formally published academic textbook with section might try Mattson et al.’s textbook [MSM05]. It
devoted to RCU. covers Java, C/C++, OpenMP, and MPI. Its pat-
Herlihy, Shavit, Luchangco, and Spear did catch up terns are admirably focused first on design, then on
in their second edition [HSLS20] by adding short implementation.
sections on hazard pointers and on RCU, with the
6. If your primary focus is scientific and technical com-
latter in the guise of EBR.3 They also include
puting, and you are interested in GPUs, CUDA, and
a brief history of both, albeit with an abbreviated
MPI, you might check out Norm Matloff’s “Program-
history of RCU that picks up almost a year after it was
ming on Parallel Machines” [Mat17]. Of course, the
accepted into the Linux kernel and more than 20 years
GPU vendors have quite a bit of additional informa-
after Kung’s and Lehman’s landmark paper [KL80].
tion [AMD20, Zel11, NVi17a, NVi17b].
Those wishing a deeper view of the history may find
it in this book’s Section 9.5.5. 7. If you are interested in POSIX Threads, you might
However, readers who might otherwise suspect a take a look at David R. Butenhof’s book [But97]. In
hostile attitude towards RCU on the part of this text- addition, W. Richard Stevens’s book [Ste92, Ste13]
book’s first author should refer to the last full sentence covers UNIX and POSIX, and Stewart Weiss’s lecture
on the first page of one of his papers [BGHZ16]. This notes [Wei13] provide an thorough and accessible
sentence reads “QSBR [a particular class of RCU im- introduction with a good set of examples.
plementations] is fast and can be applied to virtually
8. If you are interested in C++11, you might like An-
3 Albeitan implementation that contains a reader-preemption bug thony Williams’s “C++ Concurrency in Action: Prac-
noted by Richard Bornat. tical Multithreading” [Wil12, Wil19].
v2023.06.11a
4 CHAPTER 1. HOW TO USE THIS BOOK
9. If you are interested in C++, but in a Windows Listing 1.1: Creating an Up-To-Date PDF
environment, you might try Herb Sutter’s “Effective git clone git://git.kernel.org/pub/scm/linux/kernel/git/↵
↩→ paulmck/perfbook.git
Concurrency” series in Dr. Dobbs Journal [Sut08]. cd perfbook
This series does a reasonable job of presenting a # You may need to install a font. See item 1 in FAQ.txt.
make # -jN for parallel build
commonsense approach to parallelism. evince perfbook.pdf & # Two-column version
make perfbook-1c.pdf
10. If you want to try out Intel Threading Building Blocks, evince perfbook-1c.pdf & # One-column version for e-readers
make help # Display other build options
then perhaps James Reinders’s book [Rei07] is what
you are looking for.
11. Those interested in learning how various types of
multi-processor hardware cache organizations affect find CodeSamples -name rcu_rcpls.c -print
the implementation of kernel internals should take
a look at Curt Schimmel’s classic treatment of this This command will locate the file rcu_rcpls.c, which
subject [Sch94]. is called out in Appendix B. Non-UNIX systems have
12. If you are looking for a hardware view, Hennessy’s their own well-known ways of locating files by filename.
and Patterson’s classic textbook [HP17, HP11] is
well worth a read. A “Readers Digest” version of
this tome geared for scientific and technical work-
1.5 Whose Book Is This?
loads (bashing big arrays) may be found in Andrew
Chien’s textbook [Chi22]. If you are looking for an If you become a teacher, by your pupils you’ll be
academic textbook on memory ordering from a more taught.
hardware-centric viewpoint, that of Daniel Sorin Oscar Hammerstein II
et al. [SHW11, NSHW20] is highly recommended.
For a memory-ordering tutorial from a Linux-kernel As the cover says, the editor is one Paul E. McKenney.
viewpoint, Paolo Bonzini’s LWN series is a good However, the editor does accept contributions via the
place to start [Bon21a, Bon21e, Bon21c, Bon21b, [email protected] email list. These contri-
Bon21d, Bon21f]. butions can be in pretty much any form, with popular
approaches including text emails, patches against the
13. Those wishing to learn about the Rust language’s
book’s LATEX source, and even git pull requests. Use
support for low-level concurrency should refer to
whatever form works best for you.
Mara Bos’s book [Bos23].
To create patches or git pull requests, you
14. Finally, those using Java might be well-served by will need the LATEX source to the book, which
Doug Lea’s textbooks [Lea97, GPB+ 07]. is at git://git.kernel.org/pub/scm/linux/
kernel/git/paulmck/perfbook.git, or, alterna-
However, if you are interested in principles of parallel
tively, https://git.kernel.org/pub/scm/linux/
design for low-level software, especially software written
kernel/git/paulmck/perfbook.git. You will of
in C, read on!
course also need git and LATEX, which are available
as part of most mainstream Linux distributions. Other
1.4 Sample Source Code packages may be required, depending on the distribution
you use. The required list of packages for a few popular
distributions is listed in the file FAQ-BUILD.txt in the
Use the source, Luke! LATEX source to the book.
Unknown Star Wars fan To create and display a current LATEX source tree of this
book, use the list of Linux commands shown in Listing 1.1.
This book discusses its fair share of source code, and In some environments, the evince command that displays
in many cases this source code may be found in the perfbook.pdf may need to be replaced, for example,
CodeSamples directory of this book’s git tree. For ex- with acroread. The git clone command need only be
ample, on UNIX systems, you should be able to type the used the first time you create a PDF, subsequently, you
following: can run the commands shown in Listing 1.2 to pull in any
v2023.06.11a
1.5. WHOSE BOOK IS THIS? 5
Listing 1.2: Generating an Updated PDF may be redistributed consistent with this project or
git remote update the open source license(s) involved.
git checkout origin/master
make # -jN for parallel build
evince perfbook.pdf & # Two-column version This is quite similar to the Developer’s Certificate
make perfbook-1c.pdf
evince perfbook-1c.pdf & # One-column version for e-readers
of Origin (DCO) 1.1 used by the Linux kernel. You
must use your real name: I unfortunately cannot accept
pseudonymous or anonymous contributions.
updates and generate an updated PDF. The commands The language of this book is American English, however,
in Listing 1.2 must be run within the perfbook directory the open-source nature of this book permits translations,
created by the commands shown in Listing 1.1. and I personally encourage them. The open-source li-
PDFs of this book are sporadically posted at censes covering this book additionally allow you to sell
https://kernel.org/pub/linux/kernel/people/ your translation, if you wish. I do request that you send
paulmck/perfbook/perfbook.html and at http: me a copy of the translation (hardcopy if available), but
//www.rdrop.com/users/paulmck/perfbook/. this is a request made as a professional courtesy, and
The actual process of contributing patches and is not in any way a prerequisite to the permission that
sending git pull requests is similar to that of you already have under the Creative Commons and GPL
the Linux kernel, which is documented here: licenses. Please see the FAQ.txt file in the source tree
https://www.kernel.org/doc/html/latest/ for a list of translations currently in progress. I consider
process/submitting-patches.html. One important a translation effort to be “in progress” once at least one
requirement is that each patch (or commit, in the chapter has been fully translated.
case of a git pull request) must contain a valid There are many styles under the “American English”
Signed-off-by: line, which has the following format: rubric. The style for this particular book is documented
in Appendix D.
Signed-off-by: My Name <[email protected]> As noted at the beginning of this section, I am this
book’s editor. However, if you choose to contribute, it will
Please see https://lkml.org/lkml/2007/1/15/ be your book as well. In that spirit, I offer you Chapter 2,
219 for an example patch with a Signed-off-by: line. our introduction.
Note well that the Signed-off-by: line has a very spe-
cific meaning, namely that you are certifying that:
v2023.06.11a
6 CHAPTER 1. HOW TO USE THIS BOOK
v2023.06.11a
If parallel programming is so hard, why are there so
many parallel programs?
Chapter 2 Unknown
Introduction
Parallel programming has earned a reputation as one of 2.1 Historic Parallel Programming
the most difficult areas a hacker can tackle. Papers and
textbooks warn of the perils of deadlock, livelock, race Difficulties
conditions, non-determinism, Amdahl’s-Law limits to
scaling, and excessive realtime latencies. And these perils Not the power to remember, but its very opposite,
are quite real; we authors have accumulated uncounted the power to forget, is a necessary condition for our
years of experience along with the resulting emotional existence.
scars, grey hairs, and hair loss. Sholem Asch
However, new technologies that are difficult to use at
introduction invariably become easier over time. For As indicated by its title, this book takes a different ap-
example, the once-rare ability to drive a car is now com- proach. Rather than complain about the difficulty of
monplace in many countries. This dramatic change came parallel programming, it instead examines the reasons
about for two basic reasons: (1) Cars became cheaper why parallel programming is difficult, and then works to
and more readily available, so that more people had the help the reader to overcome these difficulties. As will be
opportunity to learn to drive, and (2) Cars became easier to seen, these difficulties have historically fallen into several
operate due to automatic transmissions, automatic chokes, categories, including:
automatic starters, greatly improved reliability, and a host
1. The historic high cost and relative rarity of parallel
of other technological improvements.
systems.
The same is true for many other technologies, includ-
ing computers. It is no longer necessary to operate a 2. The typical researcher’s and practitioner’s lack of
keypunch in order to program. Spreadsheets allow most experience with parallel systems.
non-programmers to get results from their computers that 3. The paucity of publicly accessible parallel code.
would have required a team of specialists a few decades
ago. Perhaps the most compelling example is web-surfing 4. The lack of a widely understood engineering disci-
and content creation, which since the early 2000s has been pline of parallel programming.
easily done by untrained, uneducated people using various
now-commonplace social-networking tools. As recently 5. The high overhead of communication relative to that
as 1968, such content creation was a far-out research of processing, even in tightly coupled shared-memory
project [Eng68], described at the time as “like a UFO computers.
landing on the White House lawn” [Gri00].
Many of these historic difficulties are well on the way
Therefore, if you wish to argue that parallel program- to being overcome. First, over the past few decades,
ming will remain as difficult as it is currently perceived the cost of parallel systems has decreased from many
by many to be, it is you who bears the burden of proof, multiples of that of a house to that of a modest meal,
keeping in mind the many centuries of counter-examples courtesy of Moore’s Law [Moo65]. Papers calling out the
in many fields of endeavor. advantages of multicore CPUs were published as early
v2023.06.11a
8 CHAPTER 2. INTRODUCTION
as 1996 [ONH+ 96]. IBM introduced simultaneous multi- hardware will be more friendly to parallel software, as
threading into its high-end POWER family in 2000, and discussed in Section 3.3.
multicore in 2001. Intel introduced hyperthreading into Quick Quiz 2.1: Come on now!!! Parallel programming has
its commodity Pentium line in November 2000, and both been known to be exceedingly hard for many decades. You
AMD and Intel introduced dual-core CPUs in 2005. Sun seem to be hinting that it is not so hard. What sort of game
followed with the multicore/multi-threaded Niagara in are you playing?
late 2005. In fact, by 2008, it was becoming difficult to
find a single-CPU desktop system, with single-core CPUs However, even though parallel programming might not
being relegated to netbooks and embedded devices. By be as hard as is commonly advertised, it is often more
2012, even smartphones were starting to sport multiple work than is sequential programming.
CPUs. By 2020, safety-critical software standards started Quick Quiz 2.2: How could parallel programming ever be
addressing concurrency. as easy as sequential programming?
Second, the advent of low-cost and readily available
multicore systems means that the once-rare experience It therefore makes sense to consider alternatives to
of parallel programming is now available to almost all parallel programming. However, it is not possible to
researchers and practitioners. In fact, parallel systems reasonably consider parallel-programming alternatives
have long been within the budget of students and hobbyists. without understanding parallel-programming goals. This
We can therefore expect greatly increased levels of inven- topic is addressed in the next section.
tion and innovation surrounding parallel systems, and that
increased familiarity will over time make the once pro- 2.2 Parallel Programming Goals
hibitively expensive field of parallel programming much
more friendly and commonplace.
If you don’t know where you are going, you will end
Third, in the 20th century, large systems of highly par-
up somewhere else.
allel software were almost always closely guarded propri-
etary secrets. In happy contrast, the 21st century has seen Yogi Berra
numerous open-source (and thus publicly available) paral-
lel software projects, including the Linux kernel [Tor03], The three major goals of parallel programming (over and
database systems [Pos08, MS08], and message-passing above those of sequential programming) are as follows:
systems [The08, Uni08a]. This book will draw primarily
from the Linux kernel, but will provide much material 1. Performance.
suitable for user-level applications. 2. Productivity.
Fourth, even though the large-scale parallel-program-
ming projects of the 1980s and 1990s were almost all 3. Generality.
proprietary projects, these projects have seeded other
communities with cadres of developers who understand Unfortunately, given the current state of the art, it is
the engineering discipline required to develop production- possible to achieve at best two of these three goals for any
quality parallel code. A major purpose of this book is to given parallel program. These three goals therefore form
present this engineering discipline. the iron triangle of parallel programming, a triangle upon
which overly optimistic hopes all too often come to grief.1
Unfortunately, the fifth difficulty, the high cost of com-
munication relative to that of processing, remains largely Quick Quiz 2.3: Oh, really??? What about correctness,
in force. This difficulty has been receiving increasing maintainability, robustness, and so on?
attention during the new millennium. However, accord-
ing to Stephen Hawking, the finite speed of light and Quick Quiz 2.4: And if correctness, maintainability, and
robustness don’t make the list, why do productivity and gener-
the atomic nature of matter will limit progress in this
ality?
area [Gar07, Moo03]. Fortunately, this difficulty has been
in force since the late 1980s, so that the aforementioned
engineering discipline has evolved practical and effective
strategies for handling it. In addition, hardware designers
are increasingly aware of these issues, so perhaps future 1 Kudos to Michael Wong for naming the iron triangle.
v2023.06.11a
2.2. PARALLEL PROGRAMMING GOALS 9
v2023.06.11a
10 CHAPTER 2. INTRODUCTION
Quick Quiz 2.9: Why all this prattling on about non-technical 100000
issues??? And not just any non-technical issue, but productivity 10000
1975
1980
1985
1990
1995
2000
2005
2010
2015
2020
One such machine was the CSIRAC, the oldest still-
intact stored-program computer, which was put into op-
eration in 1949 [Mus04, Dep06]. Because this machine Year
was built before the transistor era, it was constructed of Figure 2.2: MIPS per Die for Intel CPUs
2,000 vacuum tubes, ran with a clock frequency of 1 kHz,
consumed 30 kW of power, and weighed more than three
metric tons. Given that this machine had but 768 words cently has high productivity become critically important
of RAM, it is safe to say that it did not suffer from the when creating parallel software.
productivity issues that often plague today’s large-scale
Quick Quiz 2.10: Given how cheap parallel systems have
software projects.
become, how can anyone afford to pay people to program
Today, it would be quite difficult to purchase a machine them?
with so little computing power. Perhaps the closest equiv-
alents are 8-bit embedded microprocessors exemplified Perhaps at one time, the sole purpose of parallel software
by the venerable Z80 [Wik08], but even the old Z80 had was performance. Now, however, productivity is gaining
a CPU clock frequency more than 1,000 times faster than the spotlight.
the CSIRAC. The Z80 CPU had 8,500 transistors, and
could be purchased in 2008 for less than $2 US per unit 2.2.3 Generality
in 1,000-unit quantities. In stark contrast to the CSIRAC,
software-development costs are anything but insignificant One way to justify the high cost of developing parallel
for the Z80. software is to strive for maximal generality. All else being
The CSIRAC and the Z80 are two points in a long- equal, the cost of a more-general software artifact can be
term trend, as can be seen in Figure 2.2. This figure spread over more users than that of a less-general one. In
plots an approximation to computational power per die fact, this economic force explains much of the maniacal
over the past four decades, showing an impressive six- focus on portability, which can be seen as an important
order-of-magnitude increase over a period of forty years. special case of generality.4
Note that the advent of multicore CPUs has permitted this Unfortunately, generality often comes at the cost of per-
increase to continue apace despite the clock-frequency wall formance, productivity, or both. For example, portability
encountered in 2003, albeit courtesy of dies supporting is often achieved via adaptation layers, which inevitably
more than 50 hardware threads each. exact a performance penalty. To see this more gener-
One of the inescapable consequences of the rapid de- ally, consider the following popular parallel programming
crease in the cost of hardware is that software productivity environments:
becomes increasingly important. It is no longer sufficient
C/C++ “Locking Plus Threads”: This category, which
merely to make efficient use of the hardware: It is now
includes POSIX Threads (pthreads) [Ope97], Win-
necessary to make extremely efficient use of software
dows Threads, and numerous operating-system ker-
developers as well. This has long been the case for se-
nel environments, offers excellent performance (at
quential hardware, but parallel hardware has become a
low-cost commodity only recently. Therefore, only re- 4 Kudos to Michael Wong for pointing this out.
v2023.06.11a
2.2. PARALLEL PROGRAMMING GOALS 11
Performance
much higher productivity than C or C++, courtesy
Generality
System Libraries
of the automatic garbage collector and the rich set
of class libraries. However, its performance, though Container
greatly improved in the early 2000s, lags that of C Operating System Kernel
and C++.
Hypervisor
MPI: This Message Passing Interface [MPI08] powers
Firmware
the largest scientific and technical computing clusters
in the world and offers unparalleled performance
Hardware
and scalability. In theory, it is general purpose,
but it is mainly used for scientific and technical
computing. Its productivity is believed by many Figure 2.3: Software Layers and Performance, Productiv-
to be even lower than that of C/C++ “locking plus ity, and Generality
threads” environments.
OpenMP: This set of compiler directives can be used to Special−Purpose
User 1 Env Productive User 2
parallelize loops. It is thus quite specific to this task, for User 1
and this specificity often limits its performance. It
is, however, much easier to use than MPI or C/C++
HW / Special−Purpose
“locking plus threads.” Abs Environment
Productive for User 2
SQL: Structured Query Language [Int92] is specific to
relational database queries. However, its perfor-
mance is quite good as measured by the Transaction User 3
General−Purpose User 4
Processing Performance Council (TPC) benchmark Environment
results [Tra01]. Productivity is excellent; in fact, this
parallel programming environment enables people to Special−Purpose Environment
Special−Purpose
make good use of a large parallel system despite hav- Productive for User 3
Environment
ing little or no knowledge of parallel programming Productive for User 4
concepts. Figure 2.4: Tradeoff Between Productivity and Generality
The nirvana of parallel programming environments,
one that offers world-class performance, productivity, and
lost in lower layers cannot easily be recovered further up
generality, simply does not yet exist. Until such a nirvana
the stack. In the upper layers of the stack, there might be
appears, it will be necessary to make engineering tradeoffs
very few users for a given specific application, in which
among performance, productivity, and generality. One
case productivity concerns are paramount. This explains
such tradeoff is depicted by the green “iron triangle”5
the tendency towards “bloatware” further up the stack:
shown in Figure 2.3, which shows how productivity be-
Extra hardware is often cheaper than extra developers.
comes increasingly important at the upper layers of the
This book is intended for developers working near the
system stack, while performance and generality become
bottom of the stack, where performance and generality
increasingly important at the lower layers of the system
are of greatest concern.
stack. The huge development costs incurred at the lower
It is important to note that a tradeoff between produc-
layers must be spread over equally huge numbers of users
tivity and generality has existed for centuries in many
(hence the importance of generality), and performance
fields. For but one example, a nailgun is more productive
5 Kudos to Michael Wong for coining “iron triangle.” than a hammer for driving nails, but in contrast to the
v2023.06.11a
12 CHAPTER 2. INTRODUCTION
nailgun, a hammer can be used for many things besides performance, productivity, and generality. Because this
driving nails. It should therefore be no surprise to see book is intended for developers working on performance-
similar tradeoffs appear in the field of parallel computing. critical code near the bottom of the software stack, the
This tradeoff is shown schematically in Figure 2.4. Here, remainder of this section focuses primarily on performance
users 1, 2, 3, and 4 have specific jobs that they need the improvement.
computer to help them with. The most productive possible It is important to keep in mind that parallelism is but
language or environment for a given user is one that simply one way to improve performance. Other well-known
does that user’s job, without requiring any programming, approaches include the following, in roughly increasing
configuration, or other setup. order of difficulty:
Quick Quiz 2.11: This is a ridiculously unachievable ideal!
1. Run multiple instances of a sequential application.
Why not focus on something that is achievable in practice?
2. Make the application use existing parallel software.
Unfortunately, a system that does the job required by
user 1 is unlikely to do user 2’s job. In other words, the 3. Optimize the serial application.
most productive languages and environments are domain-
specific, and thus by definition lacking generality. These approaches are covered in the following sections.
Another option is to tailor a given programming lan-
guage or environment to the hardware system (for example, 2.3.1 Multiple Instances of a Sequential Ap-
low-level languages such as assembly, C, C++, or Java)
or to some abstraction (for example, Haskell, Prolog, or
plication
Snobol), as is shown by the circular region near the center Running multiple instances of a sequential application can
of Figure 2.4. These languages can be considered to be allow you to do parallel programming without actually
general in the sense that they are equally ill-suited to the doing parallel programming. There are a large number of
jobs required by users 1, 2, 3, and 4. In other words, ways to approach this, depending on the structure of the
their generality comes at the expense of decreased produc- application.
tivity when compared to domain-specific languages and If your program is analyzing a large number of different
environments. Worse yet, a language that is tailored to a scenarios, or is analyzing a large number of independent
given abstraction is likely to suffer from performance and data sets, one easy and effective approach is to create a
scalability problems unless and until it can be efficiently single sequential program that carries out a single analysis,
mapped to real hardware. then use any of a number of scripting environments (for
Is there no escape from iron triangle’s three conflicting example the bash shell) to run a number of instances of
goals of performance, productivity, and generality? that sequential program in parallel. In some cases, this
It turns out that there often is an escape, for example, approach can be easily extended to a cluster of machines.
using the alternatives to parallel programming discussed This approach may seem like cheating, and in fact some
in the next section. After all, parallel programming can denigrate such programs as “embarrassingly parallel”.
be a great deal of fun, but it is not always the best tool for And in fact, this approach does have some potential dis-
the job. advantages, including increased memory consumption,
waste of CPU cycles recomputing common intermediate
results, and increased copying of data. However, it is of-
2.3 Alternatives to Parallel Pro- ten extremely productive, garnering extreme performance
gramming gains with little or no added effort.
Experiment is folly when experience shows the way. 2.3.2 Use Existing Parallel Software
Roger M. Babson There is no longer any shortage of parallel software envi-
ronments that can present a single-threaded programming
In order to properly consider alternatives to parallel pro- environment, including relational databases [Dat82], web-
gramming, you must first decide on what exactly you application servers, and map-reduce environments. For
expect the parallelism to do for you. As seen in Sec- example, a common design provides a separate process for
tion 2.2, the primary goals of parallel programming are each user, each of which generates SQL from user queries.
v2023.06.11a
2.4. WHAT MAKES PARALLEL PROGRAMMING HARD? 13
This per-user SQL is run against a common relational Furthermore, different programs might have different
database, which automatically runs the users’ queries performance bottlenecks. For example, if your program
concurrently. The per-user programs are responsible only spends most of its time waiting on data from your disk
for the user interface, with the relational database tak- drive, using multiple CPUs will probably just increase the
ing full responsibility for the difficult issues surrounding time wasted waiting for the disks. In fact, if the program
parallelism and persistence. was reading from a single large file laid out sequentially
In addition, there are a growing number of parallel on a rotating disk, parallelizing your program might well
library functions, particularly for numeric computation. make it a lot slower due to the added seek overhead. You
Even better, some libraries take advantage of special- should instead optimize the data layout so that the file can
purpose hardware such as vector units and general-purpose be smaller (thus faster to read), split the file into chunks
graphical processing units (GPGPUs). which can be accessed in parallel from different drives,
Taking this approach often sacrifices some performance, cache frequently accessed data in main memory, or, if
at least when compared to carefully hand-coding a fully possible, reduce the amount of data that must be read.
parallel application. However, such sacrifice is often well Quick Quiz 2.13: What other bottlenecks might prevent
repaid by a huge reduction in development effort. additional CPUs from providing additional performance?
Quick Quiz 2.12: Wait a minute! Doesn’t this approach
simply shift the development effort from you to whoever wrote Parallelism can be a powerful optimization technique,
the existing parallel software you are using? but it is not the only such technique, nor is it appropriate
for all situations. Of course, the easier it is to parallelize
your program, the more attractive parallelization becomes
as an optimization. Parallelization has a reputation of
2.3.3 Performance Optimization being quite difficult, which leads to the question “exactly
Up through the early 2000s, CPU clock frequencies dou- what makes parallel programming so difficult?”
bled every 18 months. It was therefore usually more impor-
tant to create new functionality than to carefully optimize
performance. Now that Moore’s Law is “only” increasing 2.4 What Makes Parallel Program-
transistor density instead of increasing both transistor ming Hard?
density and per-transistor performance, it might be a good
time to rethink the importance of performance optimiza-
tion. After all, new hardware generations no longer bring Real difficulties can be overcome; it is only the
imaginary ones that are unconquerable.
significant single-threaded performance improvements.
Furthermore, many performance optimizations can also Theodore N. Vail
conserve energy.
From this viewpoint, parallel programming is but an- It is important to note that the difficulty of parallel pro-
other performance optimization, albeit one that is be- gramming is as much a human-factors issue as it is a set of
coming much more attractive as parallel systems become technical properties of the parallel programming problem.
cheaper and more readily available. However, it is wise We do need human beings to be able to tell parallel sys-
to keep in mind that the speedup available from paral- tems what to do, otherwise known as programming. But
lelism is limited to roughly the number of CPUs (but parallel programming involves two-way communication,
see Section 6.5 for an interesting exception). In contrast, with a program’s performance and scalability being the
the speedup available from traditional single-threaded communication from the machine to the human. In short,
software optimizations can be much larger. For example, the human writes a program telling the computer what
replacing a long linked list with a hash table or a search to do, and the computer critiques this program via the
tree can improve performance by many orders of mag- resulting performance and scalability. Therefore, appeals
nitude. This highly optimized single-threaded program to abstractions or to mathematical analyses will often be
might run much faster than its unoptimized parallel coun- of severely limited utility.
terpart, making parallelization unnecessary. Of course, a In the Industrial Revolution, the interface between hu-
highly optimized parallel program would be even better, man and machine was evaluated by human-factor studies,
aside from the added development effort required. then called time-and-motion studies. Although there have
v2023.06.11a
14 CHAPTER 2. INTRODUCTION
Performance Productivity errors and events: A parallel program may need to carry
out non-trivial synchronization in order to safely process
Work
Partitioning
such global events. More generally, each partition requires
some sort of communication: After all, if a given thread
Resource did not communicate at all, it would have no effect and
Parallel
Partitioning and would thus not need to be executed. However, because
Access Control Replication
communication incurs overhead, careless partitioning
choices can result in severe performance degradation.
Interacting
With Hardware
Furthermore, the number of concurrent threads must
often be controlled, as each such thread occupies common
Generality resources, for example, space in CPU caches. If too
many threads are permitted to execute concurrently, the
Figure 2.5: Categories of Tasks Required of Parallel CPU caches will overflow, resulting in high cache miss
Programmers rate, which in turn degrades performance. Conversely,
large numbers of threads are often required to overlap
computation and I/O so as to fully utilize I/O devices.
been a few human-factor studies examining parallel pro-
gramming [ENS05, ES05, HCS+ 05, SS94], these studies Quick Quiz 2.14: Other than CPU cache capacity, what
might require limiting the number of concurrent threads?
have been extremely narrowly focused, and hence unable
to demonstrate any general results. Furthermore, given Finally, permitting threads to execute concurrently
that the normal range of programmer productivity spans greatly increases the program’s state space, which can
more than an order of magnitude, it is unrealistic to expect make the program difficult to understand and debug, de-
an affordable study to be capable of detecting (say) a 10 % grading productivity. All else being equal, smaller state
difference in productivity. Although the multiple-order-of- spaces having more regular structure are more easily un-
magnitude differences that such studies can reliably detect derstood, but this is a human-factors statement as much as
are extremely valuable, the most impressive improvements it is a technical or mathematical statement. Good parallel
tend to be based on a long series of 10 % improvements. designs might have extremely large state spaces, but never-
We must therefore take a different approach. theless be easy to understand due to their regular structure,
One such approach is to carefully consider the tasks that while poor designs can be impenetrable despite having a
parallel programmers must undertake that are not required comparatively small state space. The best designs exploit
of sequential programmers. We can then evaluate how embarrassing parallelism, or transform the problem to
well a given programming language or environment assists one having an embarrassingly parallel solution. In either
the developer with these tasks. These tasks fall into the case, “embarrassingly parallel” is in fact an embarrass-
four categories shown in Figure 2.5, each of which is ment of riches. The current state of the art enumerates
covered in the following sections. good designs; more work is required to make more general
judgments on state-space size and structure.
2.4.1 Work Partitioning
2.4.2 Parallel Access Control
Work partitioning is absolutely required for parallel ex-
ecution: If there is but one “glob” of work, then it can Given a single-threaded sequential program, that single
be executed by at most one CPU at a time, which is by thread has full access to all of the program’s resources.
definition sequential execution. However, partitioning the These resources are most often in-memory data structures,
work requires great care. For example, uneven partitioning but can be CPUs, memory (including caches), I/O devices,
can result in sequential execution once the small partitions computational accelerators, files, and much else besides.
have completed [Amd67]. In less extreme cases, load The first parallel-access-control issue is whether the
balancing can be used to fully utilize available hardware form of access to a given resource depends on that re-
and restore performance and scalability. source’s location. For example, in many message-passing
Although partitioning can greatly improve performance environments, local-variable access is via expressions
and scalability, it can also increase complexity. For and assignments, while remote-variable access uses an
example, partitioning can complicate handling of global entirely different syntax, usually involving messaging.
v2023.06.11a
2.4. WHAT MAKES PARALLEL PROGRAMMING HARD? 15
v2023.06.11a
16 CHAPTER 2. INTRODUCTION
2.4.6 How Do Languages and Environments Assist With These Tasks?

Although many environments require the developer to deal manually with these tasks, there are long-standing environments that bring significant automation to bear. The poster child for these environments is SQL, many implementations of which automatically parallelize single large queries and also automate concurrent execution of independent queries and updates.

These four categories of tasks must be carried out in all parallel programs, but that of course does not necessarily mean that the developer must manually carry out these tasks. We can expect to see ever-increasing automation of these four tasks as parallel systems continue to become cheaper and more readily available.

Quick Quiz 2.16: Are there any other obstacles to parallel programming?

2.5 Discussion

Until you try, you don’t know what you can’t do.

Henry James

…to the parallel-programming challenge here in the 21st century!

We are now ready to proceed to the next chapter, which dives into the relevant properties of the parallel hardware underlying our parallel software.
Chapter 3

Hardware and Its Habits

Premature abstraction is the root of all evil.

A cast of thousands
If the CPU were not constrained to execute these statements in the order shown, the effect would be that the variable “a” would be incremented without the protection of “mylock”, which would certainly defeat the purpose of acquiring it. To prevent such destructive reordering, locking primitives contain either explicit or implicit memory barriers. Because the whole purpose of these memory barriers is to prevent reorderings that the CPU would otherwise undertake in order to increase performance, memory barriers almost always reduce performance, as depicted in Figure 3.7.

As with atomic operations, CPU designers have been working hard to reduce memory-barrier overhead, and have made substantial progress.
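For concreteness, here is a minimal sketch of the kind of lock-protected increment being discussed, rendered as userspace POSIX code (the names mylock and a come from the surrounding text; everything else is illustrative rather than the exact statements the figure refers to):

#include <pthread.h>

static pthread_mutex_t mylock = PTHREAD_MUTEX_INITIALIZER;
static int a;

void increment_a(void)
{
        pthread_mutex_lock(&mylock);   /* Implicit memory barrier: the increment cannot be reordered before the acquisition. */
        a++;                           /* Protected by mylock. */
        pthread_mutex_unlock(&mylock); /* Implicit memory barrier: the increment cannot be reordered after the release. */
}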
3.1.5 Thermal Throttling

One increasingly common frustrating experience is to carefully micro-optimize a critical code path, greatly reducing the number of clock cycles consumed by that code path, only to find that the wall-clock time consumed by that code has actually increased.

Welcome to modern thermal throttling.

If you reduced the number of clock cycles by making more effective use of the CPU’s functional units, you will have increased the power consumed by that CPU. This will in turn increase the amount of heat dissipated by that CPU. If this heat dissipation exceeds the cooling system’s capacity, the system will thermally throttle that CPU, for
example, by reducing its clock frequency, as fancifully depicted by the snow penguin in Figure 3.8.

If performance is of the essence, the proper fix is improved cooling, an approach loved by serious gamers and by overclockers.3 But if you cannot modify your computer’s cooling system, perhaps because you are renting it from a cloud provider, then you will need to take some other optimization approach. For example, you might need to apply algorithmic optimizations instead of hardware-centric micro-optimizations. Alternatively, perhaps you can parallelize your code, spreading the work (and thus the heat) over multiple CPU cores.

3 Some of whom make good use of liquid nitrogen.

3.1.6 Cache Misses

An additional multi-threading obstacle to CPU performance is the “cache miss”. As noted earlier, modern CPUs sport large caches in order to reduce the performance penalty that would otherwise be incurred due to high memory latencies. However, these caches are actually counter-productive for variables that are frequently shared among CPUs. This is because when a given CPU wishes to modify the variable, it is most likely the case that some other CPU has modified it recently. In this case, the variable will be in that other CPU’s cache, but not in this CPU’s cache, which will therefore incur an expensive cache miss (see Appendix C.1 for more detail). Such cache misses form a major obstacle to CPU performance, as shown in Figure 3.9.

Figure 3.9: CPU Meets a Cache Miss

Quick Quiz 3.3: So have CPU designers also greatly reduced the overhead of cache misses?

3.1.7 I/O Operations

A cache miss can be thought of as a CPU-to-CPU I/O operation, and as such is one of the cheapest I/O operations available. I/O operations involving networking, mass storage, or (worse yet) human beings pose much greater obstacles than the internal obstacles called out in the prior sections, as illustrated by Figure 3.10.

Figure 3.10: CPU Waits for I/O Completion

This is one of the differences between shared-memory and distributed-system parallelism: Shared-memory parallel programs must normally deal with no obstacle worse than a cache miss, while a distributed parallel program will typically incur the larger network communication latencies. In both cases, the relevant latencies can be thought of as a cost of communication—a cost that would be absent in a sequential program. Therefore, the ratio between the overhead of the communication to that of the actual work being performed is a key design parameter. A major goal of parallel hardware design is to reduce this ratio as needed to achieve the relevant performance and scalability goals. In turn, as will be seen in Chapter 6, a major goal of parallel software design is to reduce the frequency of expensive operations like communications cache misses.

Of course, it is one thing to say that a given operation is an obstacle, and quite another to show that the operation is a significant obstacle. This distinction is discussed in the following sections.
Figure 3.11: System Hardware Architecture

3.2 Overheads

Don’t design bridges in ignorance of materials, and don’t design low-level software in ignorance of the underlying hardware.

Unknown

This section presents actual overheads of the obstacles to performance listed out in the previous section. However, it is first necessary to get a rough view of hardware system architecture, which is the subject of the next section.

3.2.1 Hardware System Architecture

Figure 3.11 shows a rough schematic of an eight-core computer system. Each die has a pair of CPU cores, each with its cache, as well as an interconnect allowing the pair of CPUs to communicate with each other. The system interconnect allows the four dies to communicate with each other and with main memory.

Data moves through this system in units of “cache lines”, which are power-of-two fixed-size aligned blocks of memory, usually ranging from 32 to 256 bytes in size. When a CPU loads a variable from memory to one of its registers, it must first load the cacheline containing that variable into its cache. Similarly, when a CPU stores a value from one of its registers into memory, it must also load the cacheline containing that variable into its cache, but must also ensure that no other CPU has a copy of that cacheline.

Consider, for example, the sequence of events triggered when CPU 0 performs a CAS operation on a variable whose cacheline resides in CPU 7’s cache:

1. CPU 0 checks its local cache, does not find the cacheline, and therefore records the new value in its store buffer.

2. A request for this cacheline is forwarded to CPU 0’s and 1’s interconnect, which checks CPU 1’s local cache, and does not find the cacheline.

3. This request is forwarded to the system interconnect, which checks with the other three dies, learning that the cacheline is held by the die containing CPUs 6 and 7.

4. This request is forwarded to CPU 6’s and 7’s interconnect, which checks both CPUs’ caches, finding the value in CPU 7’s cache.

5. CPU 7 forwards the cacheline to its interconnect, and also flushes the cacheline from its cache.

6. CPU 6’s and 7’s interconnect forwards the cacheline to the system interconnect.

7. The system interconnect forwards the cacheline to CPU 0’s and 1’s interconnect.

8. CPU 0’s and 1’s interconnect forwards the cacheline to CPU 0’s cache.

9. CPU 0 can now complete the write, updating the relevant portions of the newly arrived cacheline from the value previously recorded in the store buffer.

Quick Quiz 3.4: This is a simplified sequence of events? How could it possibly be any more complex?

Quick Quiz 3.5: Why is it necessary to flush the cacheline from CPU 7’s cache?

This simplified sequence is just the beginning of a discipline called cache-coherency protocols [HP95, CSG99, MHS12, SHW11], which is discussed in more detail in Appendix C. As can be seen in the sequence of events triggered by a CAS operation, a single instruction can cause considerable protocol traffic, which can significantly degrade your parallel program’s performance.

Fortunately, if a given variable is being frequently read during a time interval during which it is never updated, that variable can be replicated across all CPUs’ caches.
This replication permits all CPUs to enjoy extremely fast access to this read-mostly variable. Chapter 9 presents synchronization mechanisms that take full advantage of this important hardware read-mostly optimization.

3.2.2 Costs of Operations

The overheads of some common operations important to parallel programs are displayed in Table 3.1. This system’s clock period rounds to 0.5 ns. Although it is not unusual for modern microprocessors to be able to retire multiple instructions per clock period, the operations’ costs are nevertheless normalized to a clock period in the third column, labeled “Ratio”. The first thing to note about this table is the large values of many of the ratios.

Table 3.1: CPU 0 View of Synchronization Mechanisms on 8-Socket System With Intel Xeon Platinum 8176 CPUs @ 2.10 GHz

  Operation               Cost (ns)    Ratio (cost/clock)    CPUs
  Clock period                  0.5                   1.0
  Same-CPU                                                   0
    CAS                         7.0                  14.6
    lock                       15.4                  32.3
  On-Core                                                    224
    Blind CAS                   7.2                  15.2
    CAS                        18.0                  37.7
  Off-Core                                                   1–27, 225–251
    Blind CAS                  47.5                  99.8
    CAS                       101.9                 214.0
  Off-Socket                                                 28–111, 252–335
    Blind CAS                 148.8                 312.5
    CAS                       442.9                 930.1
  Cross-Interconnect                                         112–223, 336–447
    Blind CAS                 336.6                 706.8
    CAS                       944.8               1,984.2
  Off-System
    Comms Fabric            5,000                10,500
    Global Comms      195,000,000           409,500,000

The same-CPU compare-and-swap (CAS) operation consumes about seven nanoseconds, a duration more than ten times that of the clock period. CAS is an atomic operation in which the hardware compares the contents of the specified memory location to a specified “old” value, and if they compare equal, stores a specified “new” value, in which case the CAS operation succeeds. If they compare unequal, the memory location keeps its (unexpected) value, and the CAS operation fails. The operation is atomic in that the hardware guarantees that the memory location will not be changed between the compare and the store. CAS functionality is provided by the lock;cmpxchg instruction on x86.

The “same-CPU” prefix means that the CPU now performing the CAS operation on a given variable was also the last CPU to access this variable, so that the corresponding cacheline is already held in that CPU’s cache. Similarly, the same-CPU lock operation (a “round trip” pair consisting of a lock acquisition and release) consumes more than fifteen nanoseconds, or more than thirty clock cycles. The lock operation is more expensive than CAS because it requires two atomic operations on the lock data structure, one for acquisition and the other for release.

On-core operations involving interactions between the hardware threads sharing a single core are about the same cost as same-CPU operations. This should not be too surprising, given that these two hardware threads also share the full cache hierarchy.

In the case of the blind CAS, the software specifies the old value without looking at the memory location. This approach is appropriate when attempting to acquire a lock. If the unlocked state is represented by zero and the locked state is represented by the value one, then a CAS operation on the lock that specifies zero for the old value and one for the new value will acquire the lock if it is not already held. The key point is that there is only one access to the memory location, namely the CAS operation itself.

In contrast, a normal CAS operation’s old value is derived from some earlier load. For example, to implement an atomic increment, the current value of that location is loaded and that value is incremented to produce the new value. Then in the CAS operation, the value actually loaded would be specified as the old value and the incremented value as the new value. If the value had not been changed between the load and the CAS, this would increment the memory location. However, if the value had in fact changed, then the old value would not match, causing a miscompare that would result in the CAS operation failing. The key point is that there are now two accesses to the memory location, the load and the CAS.

Thus, it is not surprising that on-core blind CAS consumes only about seven nanoseconds, while on-core CAS consumes about 18 nanoseconds. The non-blind case’s extra load does not come for free. That said, the overheads of these operations are similar to same-CPU CAS and lock, respectively.
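To make the blind/non-blind distinction concrete, here is an illustrative sketch using the GCC __atomic builtins discussed in Section 4.2.7 (the zero/one lock encoding follows the text above; the function names and the counter are inventions of this sketch, not code from the book’s test programs):

/* Blind CAS: the old value (zero, meaning "unlocked") is specified
 * without first loading the lock word, so the lock word is accessed
 * exactly once, by the CAS itself. */
int try_lock(int *lock)
{
        int expected = 0;

        return __atomic_compare_exchange_n(lock, &expected, 1, 0,
                                           __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
}

/* Non-blind CAS: an atomic increment derives the old value from an
 * earlier load, so the counter is accessed twice, once by the load
 * and once by the CAS, which is retried if some other CPU intervened. */
void atomic_inc(int *ctr)
{
        int old = __atomic_load_n(ctr, __ATOMIC_RELAXED);

        while (!__atomic_compare_exchange_n(ctr, &old, old + 1, 0,
                                            __ATOMIC_RELAXED, __ATOMIC_RELAXED))
                ;       /* On failure, old now holds the current value. */
}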
Quick Quiz 3.6: Table 3.1 shows CPU 0 sharing a core with CPU 224. Shouldn’t that instead be CPU 1???

A blind CAS involving CPUs in different cores but on the same socket consumes almost fifty nanoseconds, or almost one hundred clock cycles. The code used for this cache-miss measurement passes the cache line back and forth between a pair of CPUs, so this cache miss is satisfied not from memory, but rather from the other CPU’s cache. A non-blind CAS operation, which as noted earlier must look at the old value of the variable as well as store a new value, consumes over one hundred nanoseconds, or more than two hundred clock cycles. Think about this a bit. In the time required to do one CAS operation, the CPU could have executed more than two hundred normal instructions. This should demonstrate the limitations not only of fine-grained locking, but of any other synchronization mechanism relying on fine-grained global agreement.

If the pair of CPUs are on different sockets, the operations are considerably more expensive. A blind CAS operation consumes almost 150 nanoseconds, or more than three hundred clock cycles. A normal CAS operation consumes more than 400 nanoseconds, or almost one thousand clock cycles.

Worse yet, not all pairs of sockets are created equal. This particular system appears to be constructed as a pair of four-socket components, with additional latency penalties when the CPUs reside in different components. In this case, a blind CAS operation consumes more than three hundred nanoseconds, or more than seven hundred clock cycles. A CAS operation consumes almost a full microsecond, or almost two thousand clock cycles.

Quick Quiz 3.7: Surely the hardware designers could be persuaded to improve this situation! Why have they been content with such abysmal performance for these single-instruction operations?

Quick Quiz 3.8: Table E.1 in the answer to Quick Quiz 3.7 on page 480 says that on-core CAS is faster than both of same-CPU CAS and on-core blind CAS. What is happening there?

Unfortunately, the high speed of within-core and within-socket communication does not come for free. First, there are only two CPUs within a given core and only 56 within a given socket, compared to 448 across the system. Second, as shown in Table 3.2, the on-core caches are quite small compared to the on-socket caches, which are in turn quite small compared to the 1.4 TB of memory configured on this system. Third, again referring to Table 3.2, the caches are organized as a hardware hash table with a limited number of items per bucket. For example, the raw size of the L3 cache (“Size”) is almost 40 MB, but each bucket (“Line”) can only hold 11 blocks of memory (“Ways”), each of which can be at most 64 bytes (“Line Size”). This means that only 12 bytes of memory (admittedly at carefully chosen addresses) are required to overflow this 40 MB cache. On the other hand, equally careful choice of addresses might make good use of the entire 40 MB. Spatial locality of reference is clearly extremely important, as is spreading the data across memory.

Table 3.2: Cache Geometry for 8-Socket System With Intel Xeon Platinum 8176 CPUs @ 2.10 GHz

  Level    Scope     Line Size    Sets      Ways    Size
  L0       Core      64           64        8       32K
  L1       Core      64           64        8       32K
  L2       Core      64           1024      16      1024K
  L3       Socket    64           57,344    11      39,424K
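As a quick back-of-the-envelope check (an illustration added here, not part of the measurements), the L3 row’s columns are mutually consistent:

57{,}344~\text{sets} \times 11~\text{ways} \times 64~\text{B/line} = 40{,}370{,}176~\text{B} = 39{,}424~\text{KB}

And because each set holds only 11 ways, touching one byte in each of 12 cachelines whose addresses all map to the same set is enough to force an eviction from that set.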
I/O operations are even more expensive. As shown in the “Comms Fabric” row, high performance (and expensive!) communications fabric, such as InfiniBand or any number of proprietary interconnects, has a latency of roughly five microseconds for an end-to-end round trip, during which time more than ten thousand instructions might have been executed. Standards-based communications networks often require some sort of protocol processing, which further increases the latency. Of course, geographic distance also increases latency, with the speed-of-light through optical fiber latency around the world coming to roughly 195 milliseconds, or more than 400 million clock cycles, as shown in the “Global Comms” row.

Quick Quiz 3.9: These numbers are insanely large! How can I possibly get my head around them?

3.2.3 Hardware Optimizations

It is only natural to ask how the hardware is helping, and the answer is “Quite a bit!”

One hardware optimization is large cachelines. This can provide a big performance boost, especially when software is accessing memory sequentially. For example, given a 64-byte cacheline and software accessing 64-bit variables, the first access will still be slow due to speed-of-light delays (if nothing else), but the remaining seven can be quite fast. However, this optimization has…
…in a vacuum, and common clocked logic constructs run still more slowly, for example, a memory reference may need to wait for a local cache lookup to complete before the request may be passed on to the rest of the system. Furthermore, relatively low speed and high power drivers are required to move electrical signals from one silicon die to another, for example, to communicate between a CPU and main memory.

Quick Quiz 3.10: But individual electrons don’t move anywhere near that fast, even in conductors!!! The electron drift velocity in a conductor under semiconductor voltage levels is on the order of only one millimeter per second. What gives???

There are nevertheless some technologies (both hardware and software) that might help improve matters:

1. 3D integration,

2. Novel materials and processes,

3. Substituting light for electricity,

4. Special-purpose accelerators, and

5. Existing parallel software.

Each of these is described in one of the following sections.

3.3.1 3D Integration

3-dimensional integration (3DI) is the practice of bonding very thin silicon dies to each other in a vertical stack. This practice provides potential benefits, but also poses significant fabrication challenges [Kni08].

Perhaps the most important benefit of 3DI is decreased path length through the system, as shown in Figure 3.13. A 3-centimeter silicon die is replaced with a stack of four 1.5-centimeter dies, in theory decreasing the maximum path through the system by a factor of two, keeping in mind that each layer is quite thin. In addition, given proper attention to design and placement, long horizontal electrical connections (which are both slow and power hungry) can be replaced by short vertical electrical connections, which are both faster and more power efficient.

Figure 3.13: Latency Benefit of 3D Integration

However, delays due to levels of clocked logic will not be decreased by 3D integration, and significant manufacturing, testing, power-supply, and heat-dissipation problems must be solved for 3D integration to reach production while still delivering on its promise. The heat-dissipation problems might be solved using semiconductors based on diamond, which is a good conductor for heat, but an electrical insulator. That said, it remains difficult to grow large single diamond crystals, to say nothing of slicing them into wafers. In addition, it seems unlikely that any of these technologies will be able to deliver the exponential increases to which some people have become accustomed. That said, they may be necessary steps on the path to the late Jim Gray’s “smoking hairy golf balls” [Gra02].

3.3.2 Novel Materials and Processes

Stephen Hawking is said to have claimed that semiconductor manufacturers have but two fundamental problems: (1) The finite speed of light and (2) The atomic nature of matter [Gar07]. It is possible that semiconductor manufacturers are approaching these limits, but there are nevertheless a few avenues of research and development focused on working around these fundamental limits.

One workaround for the atomic nature of matter is the use of so-called “high-K dielectric” materials, which allow larger devices to mimic the electrical properties of infeasibly small devices. These materials pose some severe fabrication challenges, but nevertheless may help push the frontiers out a bit farther. Another more-exotic workaround stores multiple bits in a single electron, relying on the fact that a given electron can exist at a number of energy levels. It remains to be seen if this particular approach can be made to work reliably in production semiconductor devices.

Another proposed workaround is the “quantum dot” approach that allows much smaller device sizes, but which is still in the research stage.

One challenge is that many recent hardware-device-level breakthroughs require very tight control of which atoms are placed where [Kel17]. It therefore seems likely that whoever finds a good way to hand-place atoms on each of the billions of devices on a chip will have most excellent bragging rights, if nothing else!
3.3.3 Light, Not Electrons

Although the speed of light would be a hard limit, the fact is that semiconductor devices are limited by the speed of electricity rather than that of light, given that electric waves in semiconductor materials move at between 3 % and 30 % of the speed of light in a vacuum. The use of copper connections on silicon devices is one way to increase the speed of electricity, and it is quite possible that additional advances will push closer still to the actual speed of light. In addition, there have been some experiments with tiny optical fibers as interconnects within and between chips, based on the fact that the speed of light in glass is more than 60 % of the speed of light in a vacuum. One obstacle to such optical fibers is the inefficiency of conversion between electricity and light and vice versa, resulting in both power-consumption and heat-dissipation problems.

That said, absent some fundamental advances in the field of physics, any exponential increases in the speed of data flow will be sharply limited by the actual speed of light in a vacuum.

…must be sufficiently generally useful that the high up-front hardware-design costs can be spread over enough users to make the specialized hardware affordable. In part due to these sorts of economic considerations, specialized hardware has thus far appeared only for a few application areas, including graphics processing (GPUs), vector processors (MMX, SSE, and VMX instructions), and, to a lesser extent, encryption. And even in these areas, it is not always easy to realize the expected performance gains, for example, due to thermal throttling [Kra17, Lem18, Dow20].

Unlike the server and PC arena, smartphones have long used a wide variety of hardware accelerators. These hardware accelerators are often used for media decoding, so much so that a high-end MP3 player might be able to play audio for several minutes—with its CPU fully powered off the entire time. The purpose of these accelerators is to improve energy efficiency and thus extend battery life: Special-purpose hardware can often compute more efficiently than can a general-purpose CPU. This is another example of the principle called out in Section 2.2.3: Generality is almost never free.
3.4 Software Design Implications

One ship drives east and another west
While the self-same breezes blow;
’Tis the set of the sail and not the gale
That bids them where to go.

Ella Wheeler Wilcox

The values of the ratios in Table 3.1 are critically important, as they limit the efficiency of a given parallel application. To see this, suppose that the parallel application uses CAS operations to communicate among threads. These CAS operations will typically involve a cache miss, that is, assuming that the threads are communicating primarily with each other rather than with themselves. Suppose further that the unit of work corresponding to each CAS communication operation takes 300 ns, which is sufficient time to compute several floating-point transcendental functions. Then about half of the execution time will be consumed by the CAS communication operations! This in turn means that a two-CPU system running such a parallel program would run no faster than a sequential implementation running on a single CPU.
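To make the “about half” figure concrete, a rough back-of-the-envelope calculation (assuming, for illustration, that each CAS-induced cache miss costs on the order of the roughly 300 ns cross-socket operations in Table 3.1):

\frac{t_{\mathrm{comm}}}{t_{\mathrm{comm}} + t_{\mathrm{work}}} \approx \frac{300\,\mathrm{ns}}{300\,\mathrm{ns} + 300\,\mathrm{ns}} = 50\,\%

So the two CPUs together deliver only about the throughput of one CPU doing pure work.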
The situation is even worse in the distributed-system case, where the latency of a single communications operation might take as long as thousands or even millions of floating-point operations. This illustrates how important it is for communications operations to be extremely infrequent and to enable very large quantities of processing.

Quick Quiz 3.11: Given that distributed-systems communication is so horribly expensive, why does anyone bother with such systems?

…algorithms and implementations, whether by careful choice of data structures and algorithms, use of existing parallel applications and environments, or transforming the problem into an embarrassingly parallel form.

Quick Quiz 3.12: OK, if we are going to have to apply distributed-programming techniques to shared-memory parallel programs, why not just always use these distributed techniques and dispense with shared memory?

So, to sum up:

1. The good news is that multicore systems are inexpensive and readily available.

2. More good news: The overhead of many synchronization operations is much lower than it was on parallel systems from the early 2000s.

3. The bad news is that the overhead of cache misses is still high, especially on large systems.

The remainder of this book describes ways of handling this bad news.

In particular, Chapter 4 will cover some of the low-level tools used for parallel programming, Chapter 5 will investigate problems and solutions to parallel counting, and Chapter 6 will discuss design disciplines that promote performance and scalability.
Chapter 4

Tools of the Trade

You are only as good as your tools, and your tools are only as good as you are.

Unknown

…helps to choose the tool that will get the job done.

Quick Quiz 4.1: You call these tools??? They look more like low-level synchronization primitives to me!

Please note that this chapter provides but a brief introduction. More detail is available from the references (and from the Internet), and more information will be provided in later chapters.
4.1 Scripting Languages

The supreme excellence is simplicity.

Henry Wadsworth Longfellow, simplified

The Linux shell scripting languages provide simple but effective ways of managing parallelism. For example, suppose that you had a program compute_it that you needed to run twice with two different sets of arguments. This can be accomplished using UNIX shell scripting as follows:

1 compute_it 1 > compute_it.1.out &
2 compute_it 2 > compute_it.2.out &
3 wait
4 cat compute_it.1.out
5 cat compute_it.2.out

Lines 1 and 2 launch two instances of this program, redirecting their output to two separate files, with the & character directing the shell to run the two instances of the program in the background. Line 3 waits for both instances to complete, and lines 4 and 5 display their output. The resulting execution is as shown in Figure 4.1: The two instances of compute_it execute in parallel, wait completes after both of them do, and then the two instances of cat execute sequentially.

Figure 4.1: Execution Diagram for Parallel Shell Execution

Quick Quiz 4.2: But this silly shell script isn’t a real parallel program! Why bother with such trivia???

Quick Quiz 4.3: Is there a simpler way to create a parallel shell script? If so, how? If not, why not?

For another example, the make software-build scripting language provides a -j option that specifies how much parallelism should be introduced into the build process. Thus, typing make -j4 when building a Linux kernel specifies that up to four build steps be executed concurrently.
It is hoped that these simple examples convince you that parallel programming need not always be complex or difficult.

Quick Quiz 4.4: But if script-based parallel programming is so easy, why bother with anything else?

4.2 POSIX Multiprocessing

A camel is a horse designed by committee.

Unknown

This section scratches the surface of the POSIX environment, including pthreads [Ope97], as this environment is readily available and widely implemented. Section 4.2.1 provides a glimpse of the POSIX fork() and related primitives, Section 4.2.2 touches on thread creation and destruction, Section 4.2.3 gives a brief overview of POSIX locking, and, finally, Section 4.2.4 describes a specific lock which can be used for data that is read by many threads and only occasionally updated.

4.2.1 POSIX Process Creation and Destruction

Processes are created using the fork() primitive, they may be destroyed using the kill() primitive, and they may destroy themselves using the exit() primitive. A process executing a fork() primitive is said to be the “parent” of the newly created process. A parent may wait on its children using the wait() primitive.

Please note that the examples in this section are quite simple. Real-world applications using these primitives might need to manipulate signals, file descriptors, shared memory segments, and any number of other resources. In addition, some applications need to take specific actions if a given child terminates, and might also need to be concerned with the reason that the child terminated. These issues can of course add substantial complexity to the code. For more information, see any of a number of textbooks on the subject [Ste92, Wei13].

Listing 4.1: Using the fork() Primitive
1 pid = fork();
2 if (pid == 0) {
3   /* child */
4 } else if (pid < 0) {
5   /* parent, upon error */
6   perror("fork");
7   exit(EXIT_FAILURE);
8 } else {
9   /* parent, pid == child ID */
10 }

If fork() succeeds, it returns twice, once for the parent and again for the child. The value returned from fork() allows the caller to tell the difference, as shown in Listing 4.1 (forkjoin.c). Line 1 executes the fork() primitive, and saves its return value in local variable pid. Line 2 checks to see if pid is zero, in which case, this is the child, which continues on to execute line 3. As noted earlier, the child may terminate via the exit() primitive. Otherwise, this is the parent, which checks for an error return from the fork() primitive on line 4, and prints an error and exits on lines 5–7 if so. Otherwise, the fork() has executed successfully, and the parent therefore executes line 9 with the variable pid containing the process ID of the child.

Listing 4.2: Using the wait() Primitive
1 static __inline__ void waitall(void)
2 {
3   int pid;
4   int status;
5
6   for (;;) {
7     pid = wait(&status);
8     if (pid == -1) {
9       if (errno == ECHILD)
10         break;
11       perror("wait");
12       exit(EXIT_FAILURE);
13     }
14   }
15 }

The parent process may use the wait() primitive to wait for its children to complete. However, use of this primitive is a bit more complicated than its shell-script counterpart, as each invocation of wait() waits for but one child process. It is therefore customary to wrap wait() into a function similar to the waitall() function shown in Listing 4.2 (api-pthreads.h), with this waitall() function having semantics similar to the shell-script wait command. Each pass through the loop spanning lines 6–14 waits on one child process. Line 7 invokes the wait() primitive, which blocks until a child process exits, and returns that child’s process ID. If the process ID is instead −1, this indicates that the wait() primitive was unable to wait on a child. If so, line 9 checks for the ECHILD errno, which indicates that there are no more child processes, so that line 10 exits the loop. Otherwise, lines 11 and 12 print an error and exit.

Quick Quiz 4.5: Why does this wait() primitive need to be so complicated? Why not just make it work like the shell-script wait does?
Listing 4.3: Processes Created Via fork() Do Not Share Memory
1 int x = 0;
2
3 int main(int argc, char *argv[])
4 {
5   int pid;
6
7   pid = fork();
8   if (pid == 0) { /* child */
9     x = 1;
10     printf("Child process set x=1\n");
11     exit(EXIT_SUCCESS);
12   }
13   if (pid < 0) { /* parent, upon error */
14     perror("fork");
15     exit(EXIT_FAILURE);
16   }
17
18   /* parent */
19
20   waitall();
21   printf("Parent process sees x=%d\n", x);
22
23   return EXIT_SUCCESS;
24 }

Listing 4.4: Threads Created Via pthread_create() Share Memory
1 int x = 0;
2
3 void *mythread(void *arg)
4 {
5   x = 1;
6   printf("Child process set x=1\n");
7   return NULL;
8 }
9
10 int main(int argc, char *argv[])
11 {
12   int en;
13   pthread_t tid;
14   void *vp;
15
16   if ((en = pthread_create(&tid, NULL,
17                            mythread, NULL)) != 0) {
18     fprintf(stderr, "pthread_create: %s\n", strerror(en));
19     exit(EXIT_FAILURE);
20   }
21
22   /* parent */
23
24   if ((en = pthread_join(tid, &vp)) != 0) {
25     fprintf(stderr, "pthread_join: %s\n", strerror(en));
26     exit(EXIT_FAILURE);
27   }
28   printf("Parent process sees x=%d\n", x);
29
30   return EXIT_SUCCESS;
31 }

It is critically important to note that the parent and child do not share memory. This is illustrated by the program shown in Listing 4.3 (forkjoinvar.c), in which the…
…and initializes a POSIX lock named lock_a, while line 2 similarly defines and initializes a lock named lock_b. Line 4 defines and initializes a shared variable x.

Lines 6–33 define a function lock_reader() which repeatedly reads the shared variable x while holding the lock specified by arg. Line 12 casts arg to a pointer to a pthread_mutex_t, as required by the pthread_mutex_lock() and pthread_mutex_unlock() primitives.

Quick Quiz 4.10: Why not simply make the argument to lock_reader() on line 6 of Listing 4.5 be a pointer to a pthread_mutex_t?

Quick Quiz 4.11: What is the READ_ONCE() on lines 20 and 47 and the WRITE_ONCE() on line 47 of Listing 4.5?

Lines 14–18 acquire the specified pthread_mutex_t, checking for errors and exiting the program if any occur. Lines 19–26 repeatedly check the value of x, printing the new value each time that it changes. Line 25 sleeps for one millisecond, which allows this demonstration to run nicely on a uniprocessor machine. Lines 27–31 release the pthread_mutex_t, again checking for errors and exiting the program if any occur. Finally, line 32 returns NULL, again to match the function type required by pthread_create().

Quick Quiz 4.12: Writing four lines of code for each acquisition and release of a pthread_mutex_t sure seems painful! Isn’t there a better way?
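Listing 4.5 itself is not reproduced on this page, so as a rough sketch of the acquire/check/release pattern just described (the line-number comments follow the prose above; the error-message wording and the elided loop body are assumptions of this sketch):

void *lock_reader(void *arg)
{
        int en;
        pthread_mutex_t *pmlp = (pthread_mutex_t *)arg;  /* the cast described for line 12 */

        if ((en = pthread_mutex_lock(pmlp)) != 0) {      /* lines 14-18 */
                fprintf(stderr, "lock_reader:pthread_mutex_lock: %s\n",
                        strerror(en));
                exit(EXIT_FAILURE);
        }
        /* ... lines 19-26: repeatedly read x, print changes, sleep ... */
        if ((en = pthread_mutex_unlock(pmlp)) != 0) {    /* lines 27-31 */
                fprintf(stderr, "lock_reader:pthread_mutex_unlock: %s\n",
                        strerror(en));
                exit(EXIT_FAILURE);
        }
        return NULL;                                     /* line 32 */
}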
Listing 4.6: Demonstration of Same Exclusive Lock
1 printf("Creating two threads using same lock:\n");
2 en = pthread_create(&tid1, NULL, lock_reader, &lock_a);
3 if (en != 0) {
4   fprintf(stderr, "pthread_create: %s\n", strerror(en));
5   exit(EXIT_FAILURE);
6 }
7 en = pthread_create(&tid2, NULL, lock_writer, &lock_a);
8 if (en != 0) {
9   fprintf(stderr, "pthread_create: %s\n", strerror(en));
10   exit(EXIT_FAILURE);
11 }
12 if ((en = pthread_join(tid1, &vp)) != 0) {
13   fprintf(stderr, "pthread_join: %s\n", strerror(en));
14   exit(EXIT_FAILURE);
15 }
16 if ((en = pthread_join(tid2, &vp)) != 0) {
17   fprintf(stderr, "pthread_join: %s\n", strerror(en));
18   exit(EXIT_FAILURE);
19 }

Listing 4.7: Demonstration of Different Exclusive Locks
1 printf("Creating two threads w/different locks:\n");
2 x = 0;
3 en = pthread_create(&tid1, NULL, lock_reader, &lock_a);
4 if (en != 0) {
5   fprintf(stderr, "pthread_create: %s\n", strerror(en));
6   exit(EXIT_FAILURE);
7 }
8 en = pthread_create(&tid2, NULL, lock_writer, &lock_b);
9 if (en != 0) {
10   fprintf(stderr, "pthread_create: %s\n", strerror(en));
11   exit(EXIT_FAILURE);
12 }
13 if ((en = pthread_join(tid1, &vp)) != 0) {
14   fprintf(stderr, "pthread_join: %s\n", strerror(en));
15   exit(EXIT_FAILURE);
16 }
17 if ((en = pthread_join(tid2, &vp)) != 0) {
18   fprintf(stderr, "pthread_join: %s\n", strerror(en));
19   exit(EXIT_FAILURE);
20 }
Creating two threads using same lock:
lock_reader(): x = 0

Because the two threads are using different locks, they do not exclude each other, and can run concurrently. The…
…thinktime argument controlling the time between the release of the reader-writer lock and the next acquisition, line 4 defines the readcounts array into which each reader thread places the number of times it acquired the…

Quick Quiz 4.17: Instead of using READ_ONCE() everywhere, why not just declare goflag as volatile on line 10 of Listing 4.8?

Quick Quiz 4.18: READ_ONCE() only affects the compiler, not the CPU. Don’t we also need memory barriers to make sure that the change in goflag’s value propagates to the CPU in a timely fashion in Listing 4.8?

Quick Quiz 4.19: Would it ever be necessary to use READ_ONCE() when accessing a per-thread variable, for example, a variable declared using GCC’s __thread storage class?

The loop spanning lines 23–41 carries out the performance test. Lines 24–28 acquire the lock, lines 29–31 hold the lock for the specified number of microseconds, lines 32–36 release the lock, and lines 37–39 wait for the specified number of microseconds before re-acquiring the lock. Line 40 counts this lock acquisition.

Line 42 moves the lock-acquisition count to this thread’s element of the readcounts[] array, and line 43 returns, terminating this thread.
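Listing 4.8 is not shown on this page; as a rough sketch only (not the actual listing), the measurement loop just described is essentially the following, with error checking omitted and several names assumed:

/* Sketch: rwl, goflag, holdtime, thinktime, readcounts[], and me stand in
 * for the corresponding items in Listing 4.8; wait_microseconds() is a
 * hypothetical busy-wait helper. */
long loopcnt = 0;

while (READ_ONCE(goflag)) {
        pthread_rwlock_rdlock(&rwl);   /* lines 24-28, error checks omitted */
        wait_microseconds(holdtime);   /* lines 29-31: hold the lock */
        pthread_rwlock_unlock(&rwl);   /* lines 32-36, error checks omitted */
        wait_microseconds(thinktime);  /* lines 37-39: wait before re-acquiring */
        loopcnt++;                     /* line 40: count this acquisition */
}
readcounts[me] = loopcnt;              /* line 42: me is this thread's index */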
Figure 4.2 shows the results of running this test on a 224-core Xeon system with two hardware threads per core for a total of 448 software-visible CPUs. The thinktime parameter was zero for all these tests, and the holdtime parameter set to values ranging from one microsecond (“1us” on the graph) to 10,000 microseconds (“10000us” on the graph). The actual value plotted is:

\frac{L_N}{N L_1}  \qquad (4.1)

where N is the number of threads in the current run, L_N is the total number of lock acquisitions by all N threads in the current run, and L_1 is the number of lock acquisitions in a single-threaded run. Given ideal hardware and software scalability, this value will always be 1.0.

As can be seen in the figure, reader-writer locking scalability is decidedly non-ideal, especially for smaller sizes of critical sections. To see why read-acquisition can be so slow, consider that all the acquiring threads must update the pthread_rwlock_t data structure. Therefore, if all 448 executing threads attempt to read-acquire the reader-writer lock concurrently, they must update this underlying pthread_rwlock_t one at a time. One lucky thread might do so almost immediately, but the least-lucky thread must wait for all the other 447 threads to do their updates. This situation will only get worse as you add CPUs. Note also the logscale y-axis. Even though the 10,000 microsecond trace appears quite ideal, it has in fact degraded by about 10 % from ideal.

Quick Quiz 4.20: Isn’t comparing against single-CPU throughput a bit harsh?

Quick Quiz 4.21: But one microsecond is not a particularly small size for a critical section. What do I do if I need a much
smaller critical section, for example, one containing only a few instructions?

Quick Quiz 4.22: The system used is a few years old, and new hardware should be faster. So why should anyone worry about reader-writer locks being slow?

Listing 4.9: Compiler Barrier Primitive (for GCC)
#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))
#define READ_ONCE(x) \
        ({ typeof(x) ___x = ACCESS_ONCE(x); ___x; })
#define WRITE_ONCE(x, val) \
        do { ACCESS_ONCE(x) = (val); } while (0)
#define barrier() __asm__ __volatile__("": : :"memory")
…write atomics. The read-modify-write atomics include atomic_fetch_add(), atomic_fetch_sub(), atomic_fetch_and(), atomic_fetch_xor(), atomic_exchange(), atomic_compare_exchange_strong(), and atomic_compare_exchange_weak(). These operate in a manner similar to those described in Section 4.2.5, but with the addition of memory-order arguments to _explicit variants of all of the operations. Without memory-order arguments, all the atomic operations are fully ordered, and the arguments permit weaker orderings. For example, “atomic_load_explicit(&a, memory_order_relaxed)” is vaguely similar to the Linux kernel’s “READ_ONCE()”.1

1 Memory ordering is described in more detail in Chapter 15 and…

4.2.7 Atomic Operations (Modern GCC)

One restriction of the C11 atomics is that they apply only to special atomic types, which can be problematic. The GNU C compiler therefore provides atomic intrinsics, including __atomic_load(), __atomic_load_n(), __atomic_store(), __atomic_store_n(), __atomic_thread_fence(), etc. These intrinsics offer the same semantics as their C11 counterparts, but may be used on plain non-atomic objects. Some of these intrinsics may be passed a memory-order argument from this list: __ATOMIC_RELAXED, __ATOMIC_CONSUME, __ATOMIC_ACQUIRE, __ATOMIC_RELEASE, __ATOMIC_ACQ_REL, and __ATOMIC_SEQ_CST.
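As a brief illustrative sketch (the flag/data names and the producer/consumer framing are inventions of this example, not code from the book), these intrinsics can be applied to ordinary ints:

int flag;  /* plain non-atomic objects */
int data;

void producer(void)
{
        data = 42;
        __atomic_store_n(&flag, 1, __ATOMIC_RELEASE);  /* publish data */
}

int consumer(void)
{
        while (!__atomic_load_n(&flag, __ATOMIC_ACQUIRE))
                continue;  /* wait for the producer's release store */
        return data;       /* the acquire/release pairing guarantees 42 is seen */
}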
4.2.8 Per-Thread Variables

Per-thread variables, also called thread-specific data, thread-local storage, and other less-polite names, are used extremely heavily in concurrent code, as will be explored in Chapters 5 and 8. POSIX supplies the pthread_key_create() function to create a per-thread variable (and return the corresponding key), pthread_key_delete() to delete the per-thread variable corresponding to key, pthread_setspecific() to set the value of the current thread’s variable corresponding to the specified key, and pthread_getspecific() to return that value.

A number of compilers (including GCC) provide a __thread specifier that may be used in a variable definition to designate that variable as being per-thread. The name of the variable may then be used normally to access the value of the current thread’s instance of that variable. Of course, __thread is much easier to use than the POSIX thread-specific data, and so __thread is usually preferred for code that is to be built only with GCC or other compilers supporting __thread.

Fortunately, the C11 standard introduced a _Thread_local keyword that can be used in place of __thread. In the fullness of time, this new keyword should combine the ease of use of __thread with the portability of POSIX thread-specific data.
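A minimal illustrative sketch contrasting the two spellings (the counter and function names are invented for this example):

__thread unsigned long counter;              /* GCC-style per-thread variable */
/* _Thread_local unsigned long counter; */   /* C11 equivalent */

void count_event(void)
{
        counter++;  /* each thread increments only its own instance */
}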
4.3 Alternatives to POSIX Operations

The strategic marketing paradigm of Open Source is a massively parallel drunkard’s walk filtered by a Darwinistic process.

Bruce Perens

Unfortunately, threading operations, locking primitives, and atomic operations were in reasonably wide use long before the various standards committees got around to them. As a result, there is considerable variation in how these operations are supported. It is still quite common to find these operations implemented in assembly language, either for historical reasons or to obtain better performance in specialized circumstances. For example, GCC’s __sync_ family of primitives all provide full memory-ordering semantics, which in the past motivated many developers to create their own implementations for situations where the full memory ordering semantics are not required. The following sections show some alternatives from the Linux kernel and some historical primitives used by this book’s sample code.

4.3.1 Organization and Initialization

Although many environments do not require any special initialization code, the code samples in this book start with a call to smp_init(), which initializes a mapping from pthread_t to consecutive integers. The userspace RCU library2 similarly requires a call to rcu_init(). Although these calls can be hidden in environments (such as that of GCC) that support constructors, most of the RCU flavors supported by the userspace RCU library also require that each thread invoke rcu_register_thread() upon thread creation and rcu_unregister_thread() before thread exit.
Listing 4.10: Thread API
int smp_thread_id(void)
thread_id_t create_thread(void *(*func)(void *), void *arg)
for_each_thread(t)
for_each_running_thread(t)
void *wait_thread(thread_id_t tid)
void wait_all_threads(void)

In the case of the Linux kernel, it is a philosophical question as to whether the kernel does not require calls to special initialization code or whether the kernel’s boot-time code is in fact the required initialization code.

4.3.2 Thread Creation, Destruction, and Control

The Linux kernel uses struct task_struct pointers to track kthreads, kthread_create() to create them, kthread_should_stop() to externally suggest that they stop (which has no POSIX equivalent),3 kthread_stop() to wait for them to stop, and schedule_timeout_interruptible() for a timed wait. There are quite a few additional kthread-management APIs, but this provides a good start, as well as good search terms.

The CodeSamples API focuses on “threads”, which are a locus of control.4 Each such thread has an identifier of type thread_id_t, and no two threads running at a given time will have the same identifier. Threads share everything except for per-thread local state,5 which includes program counter and stack.

The thread API is shown in Listing 4.10, and members are described in the following section.

3 POSIX environments can work around the lack of kthread_should_stop() by using a properly synchronized boolean flag in conjunction with pthread_join().
4 There are many other names for similar software constructs, including “process”, “task”, “fiber”, “event”, “execution agent”, and so on. Similar design principles apply to all of them.
5 How is that for a circular definition?

4.3.2.1 API Members

create_thread()
The create_thread() primitive creates a new thread, starting the new thread’s execution at the function func specified by create_thread()’s first argument, and passing it the argument specified by create_thread()’s second argument. This newly created thread will terminate when it returns from the starting function specified by func. The create_thread() primitive returns the thread_id_t corresponding to the newly created child thread.
This primitive will abort the program if more than NR_THREADS threads are created, counting the one implicitly created by running the program. NR_THREADS is a compile-time constant that may be modified, though some systems may have an upper bound for the allowable number of threads.

smp_thread_id()
Because the thread_id_t returned from create_thread() is system-dependent, the smp_thread_id() primitive returns a thread index corresponding to the thread making the request. This index is guaranteed to be less than the maximum number of threads that have been in existence since the program started, and is therefore useful for bitmasks, array indices, and the like.

for_each_thread()
The for_each_thread() macro loops through all threads that exist, including all threads that would exist if created. This macro is useful for handling the per-thread variables introduced in Section 4.2.8.

for_each_running_thread()
The for_each_running_thread() macro loops through only those threads that currently exist. It is the caller’s responsibility to synchronize with thread creation and deletion if required.

wait_thread()
The wait_thread() primitive waits for completion of the thread specified by the thread_id_t passed to it. This in no way interferes with the execution of the specified thread; instead, it merely waits for it. Note that wait_thread() returns the value that was returned by the corresponding thread.

wait_all_threads()
The wait_all_threads() primitive waits for completion of all currently running threads. It is the caller’s responsibility to synchronize with thread creation and deletion if required. However, this primitive is normally used to clean up at the end of a run, so such synchronization is normally not needed.
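For illustration only (this is not Listing 4.11, which is described next), a minimal use of this API might look as follows, assuming the CodeSamples api.h environment and an earlier call to smp_init(); the thread body and its message are invented for this sketch:

void *mythread(void *arg)
{
        printf("Hello from thread %d\n", smp_thread_id());
        return NULL;
}

int main(int argc, char *argv[])
{
        smp_init();
        create_thread(mythread, NULL);  /* aborts if more than NR_THREADS are created */
        wait_all_threads();             /* wait for the child to finish */
        return 0;
}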
4.3.2.2 Example Usage

Listing 4.11 (threadcreate.c) shows an example hello-world-like child thread. As noted earlier, each thread…
4.3.3 Locking

A good starting subset of the Linux kernel’s locking API is shown in Listing 4.13, each API element being described in…

However, the spin_lock() and spin_unlock() primitives do have performance consequences, as will be seen in Chapter 10.
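Listing 4.13 itself is not reproduced on this page, but as a rough sketch (the lock and counter names are invented here), kernel-style code using this API typically looks like the following:

static DEFINE_SPINLOCK(mylock);
static int counter;

void increment_counter(void)
{
        spin_lock(&mylock);    /* begin critical section */
        counter++;             /* protected by mylock */
        spin_unlock(&mylock);  /* end critical section */
}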
…of transformations, including load tearing, store tearing, load fusing, store fusing, code reordering, invented loads, invented stores, store-to-load transformations, and dead-code elimination, all of which work just fine in single-threaded code. But concurrent code can be broken by each of these transformations, or shared-variable shenanigans, as described below.

Listing 4.14: Living Dangerously Early 1990s Style
1 ptr = global_ptr;
2 if (ptr != NULL && ptr < high_address)
3   do_low(ptr);

Listing 4.15: C Compilers Can Invent Loads
1 if (global_ptr != NULL &&
2     global_ptr < high_address)
3   do_low(global_ptr);
…bit system. But for properly aligned machine-sized stores, WRITE_ONCE() will prevent store tearing.

Load fusing occurs when the compiler uses the result of a prior load from a given variable instead of repeating the load. Not only is this sort of optimization just fine in single-threaded code, it is often just fine in multithreaded code. Unfortunately, the word “often” hides some truly annoying exceptions.

For example, suppose that a real-time system needs to invoke a function named do_something_quickly() repeatedly until the variable need_to_stop was set, and that the compiler can see that do_something_quickly() does not store to need_to_stop. One (unsafe) way to code this is shown in Listing 4.16. The compiler might reasonably unroll this loop sixteen times in order to reduce the per-invocation overhead of the backwards branch at the end of the loop. Worse yet, because the compiler knows that do_something_quickly() does not store to need_to_stop, the compiler could quite reasonably decide to check this variable only once, resulting in the code shown in Listing 4.17. Once entered, the loop on lines 2–19 will never exit, regardless of how many times some other thread stores a non-zero value to need_to_stop. The result will at best be consternation, and might well also include severe physical damage.

Listing 4.16: Inviting Load Fusing
1 while (!need_to_stop)
2   do_something_quickly();

Listing 4.17: C Compilers Can Fuse Loads
1 if (!need_to_stop)
2   for (;;) {
3     do_something_quickly();
4     do_something_quickly();
5     do_something_quickly();
6     do_something_quickly();
7     do_something_quickly();
8     do_something_quickly();
9     do_something_quickly();
10     do_something_quickly();
11     do_something_quickly();
12     do_something_quickly();
13     do_something_quickly();
14     do_something_quickly();
15     do_something_quickly();
16     do_something_quickly();
17     do_something_quickly();
18     do_something_quickly();
19   }

The compiler can fuse loads across surprisingly large spans of code. For example, in Listing 4.18, t0() and t1() run concurrently, and do_something() and do_something_else() are inline functions. Line 1 declares pointer gp, which C initializes to NULL by default. At some point, line 5 of t0() stores a non-NULL pointer to gp. Meanwhile, t1() loads from gp three times on lines 10, 12, and 15. Given that line 13 finds that gp is non-NULL, one might hope that the dereference on line 15 would be guaranteed never to fault. Unfortunately, the compiler is within its rights to fuse the read on lines 10 and 15, which means that if line 10 loads NULL and line 12 loads &myvar, line 15 could load NULL, resulting in a fault.8 Note that the intervening READ_ONCE() does not prevent the other two loads from being fused, despite the fact that all three are loading from the same variable.

Listing 4.18: C Compilers Can Fuse Non-Adjacent Loads
1 int *gp;
2
3 void t0(void)
4 {
5   WRITE_ONCE(gp, &myvar);
6 }
7
8 void t1(void)
9 {
10   p1 = gp;
11   do_something(p1);
12   p2 = READ_ONCE(gp);
13   if (p2) {
14     do_something_else();
15     p3 = *gp;
16   }
17 }

8 Will Deacon reports that this happened in the Linux kernel.

Quick Quiz 4.29: Why does it matter whether do_something() and do_something_else() in Listing 4.18 are inline functions?

Store fusing can occur when the compiler notices a pair of successive stores to a given variable with no intervening loads from that variable. In this case, the compiler is within its rights to omit the first store. This is never a problem in single-threaded code, and in fact it is usually not a problem in correctly written concurrent code. After all, if the two stores are executed in quick succession, there is very little chance that some other thread could load the value from the first store.

However, there are exceptions, for example as shown in Listing 4.19. The function shut_it_down() stores to the shared variable status on lines 3 and 8, and so assuming that neither start_shutdown()
Listing 4.19: C Compilers Can Fuse Stores
1 void shut_it_down(void)
2 {
3   status = SHUTTING_DOWN; /* BUGGY!!! */
4   start_shutdown();
5   while (!other_task_ready) /* BUGGY!!! */
6     continue;
7   finish_shutdown();
8   status = SHUT_DOWN; /* BUGGY!!! */
9   do_something_else();
10 }
11
12 void work_until_shut_down(void)
13 {
14   while (status != SHUTTING_DOWN) /* BUGGY!!! */
15     do_more_work();
16   other_task_ready = 1; /* BUGGY!!! */
17 }

nor finish_shutdown() access status, the compiler could reasonably remove the store to status on line 3. Unfortunately, this would mean that work_until_shut_down() would never exit its loop spanning lines 14 and 15, and thus would never set other_task_ready, which would in turn mean that shut_it_down() would never exit its loop spanning lines 5 and 6, even if the compiler chooses not to fuse the successive loads from other_task_ready on line 5.

And there are more problems with the code in Listing 4.19, including code reordering.

Code reordering is a common compilation technique used to combine common subexpressions, reduce register pressure, and improve utilization of the many functional units available on modern superscalar microprocessors. It is also another reason why the code in Listing 4.19 is buggy. For example, suppose that the do_more_work() function on line 15 does not access other_task_ready. Then the compiler would be within its rights to move the assignment to other_task_ready on line 16 to precede line 14, which might be a great disappointment for anyone hoping that the last call to do_more_work() on line 15 happens before the call to finish_shutdown() on line 7.

It might seem futile to prevent the compiler from changing the order of accesses in cases where the underlying hardware is free to reorder them. However, modern machines have exact exceptions and exact interrupts, meaning that any interrupt or exception will appear to have happened at a specific place in the instruction stream. This means that the handler will see the effect of all prior instructions, but won’t see the effect of any subsequent instructions. READ_ONCE() and WRITE_ONCE() can therefore be used to control communication between interrupted code and interrupt handlers, independent of the ordering provided by the underlying hardware.9

9 That said, the various standards committees would prefer that you use atomics or variables of type sig_atomic_t, instead of READ_ONCE() and WRITE_ONCE().

Invented loads were illustrated by the code in Listings 4.14 and 4.15, in which the compiler optimized away a temporary variable, thus loading from a shared variable more often than intended.

Invented loads can be a performance hazard. These hazards can occur when a load of a variable in a “hot” cacheline is hoisted out of an if statement. These hoisting optimizations are not uncommon, and can cause significant increases in cache misses, and thus significant degradation of both performance and scalability.

Invented stores can occur in a number of situations. For example, a compiler emitting code for work_until_shut_down() in Listing 4.19 might notice that other_task_ready is not accessed by do_more_work(), and is stored to on line 16. If do_more_work() was a complex inline function, it might be necessary to do a register spill, in which case one attractive place to use for temporary storage is other_task_ready. After all, there are no accesses to it, so what is the harm?

Of course, a non-zero store to this variable at just the wrong time would result in the while loop on line 5 terminating prematurely, again allowing finish_shutdown() to run concurrently with do_more_work(). Given that the entire point of this while appears to be to prevent such concurrency, this is not a good thing.

Listing 4.20: Inviting an Invented Store
1 if (condition)
2   a = 1;
3 else
4   do_a_bunch_of_stuff(&a);

Listing 4.21: Compiler Invents an Invited Store
1 a = 1;
2 if (!condition) {
3   a = 0;
4   do_a_bunch_of_stuff(&a);
5 }
v2023.06.11a
4.3. ALTERNATIVES TO POSIX OPERATIONS 43
Listing 4.22: Inviting a Store-to-Load Conversion Listing 4.23: Compiler Converts a Store to a Load
1 r1 = p; 1 r1 = p;
2 if (unlikely(r1)) 2 if (unlikely(r1))
3 do_something_with(r1); 3 do_something_with(r1);
4 barrier(); 4 barrier();
5 p = NULL; 5 if (p != NULL)
6 p = NULL;
v2023.06.11a
44 CHAPTER 4. TOOLS OF THE TRADE
volatile is a hint to the implementation to Listing 4.24: Avoiding Danger, 2018 Style
avoid aggressive optimization involving the ob- 1 ptr = READ_ONCE(global_ptr);
2 if (ptr != NULL && ptr < high_address)
ject because the value of the object might be 3 do_low(ptr);
changed by means undetectable by an implemen-
tation. Furthermore, for some implementations, Listing 4.25: Preventing Load Fusing
volatile might indicate that special hardware 1 while (!READ_ONCE(need_to_stop))
instructions are required to access the object. 2 do_something_quickly();
See 6.8.1 for detailed semantics. In general, the
semantics of volatile are intended to be the Listing 4.26: Preventing Store Fusing and Invented Stores
same in C++ as they are in C. 1 void shut_it_down(void)
2 {
3 WRITE_ONCE(status, SHUTTING_DOWN); /* BUGGY!!! */
This wording might be reassuring to those writing low- 4 start_shutdown();
level code, except for the fact that compiler writers are 5 while (!READ_ONCE(other_task_ready)) /* BUGGY!!! */
6 continue;
free to completely ignore non-normative notes. Parallel 7 finish_shutdown();
programmers might instead reassure themselves that com- 8 WRITE_ONCE(status, SHUT_DOWN); /* BUGGY!!! */
9 do_something_else();
piler writers would like to avoid breaking device drivers 10 }
(though perhaps only after a few “frank and open” discus- 11
12 void work_until_shut_down(void)
sions with device-driver developers), and device drivers 13 {
impose at least the following constraints [MWPF18]: 14 while (READ_ONCE(status) != SHUTTING_DOWN) /* BUGGY!!! */
15 do_more_work();
16 WRITE_ONCE(other_task_ready, 1); /* BUGGY!!! */
1. Implementations are forbidden from tearing an 17 }
aligned volatile access when machine instructions of
that access’s size and type are available.12 Concur-
rent code relies on this constraint to avoid unneces- non-atomic or non-volatile, assuming that all accesses are
sary load and store tearing. aligned and machine-sized. The semantics of mixed-size
accesses to the same locations are more complex, and are
2. Implementations must not assume anything about the left aside for the time being.
semantics of a volatile access, nor, for any volatile So how does volatile stack up against the earlier
access that returns a value, about the possible set of examples?
values that might be returned.13 Concurrent code
Using READ_ONCE() on line 1 of Listing 4.14 avoids
relies on this constraint to avoid optimizations that
invented loads, resulting in the code shown in Listing 4.24.
are inapplicable given that other processors might be
As shown in Listing 4.25, READ_ONCE() can also pre-
concurrently accessing the location in question.
vent the loop unrolling in Listing 4.17.
3. Aligned machine-sized non-mixed-size volatile ac- READ_ONCE() and WRITE_ONCE() can also be used
cesses interact naturally with volatile assembly-code to prevent the store fusing and invented stores that were
sequences before and after. This is necessary because shown in Listing 4.19, with the result shown in List-
some devices must be accessed using a combina- ing 4.26. However, this does nothing to prevent code
tion of volatile MMIO accesses and special-purpose reordering, which requires some additional tricks taught
assembly-language instructions. Concurrent code in Section 4.3.4.3.
relies on this constraint in order to achieve the desired Finally, WRITE_ONCE() can be used to prevent the store
ordering properties from combinations of volatile ac- invention shown in Listing 4.20, with the resulting code
cesses and other means discussed in Section 4.3.4.3. shown in Listing 4.27.
To summarize, the volatile keyword can prevent
Concurrent code also relies on the first two constraints load tearing and store tearing in cases where the loads
to avoid undefined behavior that could result due to data
races if any of the accesses to a given object was either
Listing 4.27: Disinviting an Invented Store
12 Note that this leaves unspecified what to do with 128-bit loads and 1 if (condition)
stores on CPUs having 128-bit CAS but not 128-bit loads and stores. 2 WRITE_ONCE(a, 1);
13 This is strongly implied by the implementation-defined semantics 3 else
4 do_a_bunch_of_stuff();
called out above.
v2023.06.11a
4.3. ALTERNATIVES TO POSIX OPERATIONS 45
Listing 4.28: Preventing C Compilers From Fusing Loads Listing 4.29: Preventing Reordering
1 while (!need_to_stop) { 1 void shut_it_down(void)
2 barrier(); 2 {
3 do_something_quickly(); 3 WRITE_ONCE(status, SHUTTING_DOWN);
4 barrier(); 4 smp_mb();
5 } 5 start_shutdown();
6 while (!READ_ONCE(other_task_ready))
7 continue;
8 smp_mb();
and stores are machine-sized and properly aligned. It 9 finish_shutdown();
10 smp_mb();
can also prevent load fusing, store fusing, invented loads, 11 WRITE_ONCE(status, SHUT_DOWN);
and invented stores. However, although it does prevent 12 do_something_else();
13 }
the compiler from reordering volatile accesses with 14
each other, it does nothing to prevent the CPU from 15 void work_until_shut_down(void)
16 {
reordering these accesses. Furthermore, it does nothing 17 while (READ_ONCE(status) != SHUTTING_DOWN) {
to prevent either compiler or CPU from reordering non- 18 smp_mb();
19 do_more_work();
volatile accesses with each other or with volatile 20 }
accesses. Preventing these types of reordering requires 21 smp_mb();
22 WRITE_ONCE(other_task_ready, 1);
the techniques described in the next section. 23 }
v2023.06.11a
46 CHAPTER 4. TOOLS OF THE TRADE
Here is a list of situations allowing plain loads and stores READ_ONCE() nor WRITE_ONCE() provide any ordering
for some accesses to a given variable, while requiring guarantees other than within the compiler. See the above
markings (such as READ_ONCE() and WRITE_ONCE()) for Section 4.3.4.3 or Chapter 15 for information on such
other accesses to that same variable: guarantees.
Examples of many of these data-race-avoidance patterns
1. A shared variable is only modified by a given owning are presented in Chapter 5.
CPU or thread, but is read by other CPUs or threads.
All stores must use WRITE_ONCE(). The owning
CPU or thread may use plain loads. Everything else 4.3.5 Atomic Operations
must use READ_ONCE() for loads. The Linux kernel provides a wide variety of atomic opera-
tions, but those defined on type atomic_t provide a good
2. A shared variable is only modified while holding a
start. Normal non-tearing reads and stores are provided by
given lock, but is read by code not holding that lock.
atomic_read() and atomic_set(), respectively. Ac-
All stores must use WRITE_ONCE(). CPUs or threads
quire load is provided by smp_load_acquire() and
holding the lock may use plain loads. Everything
release store by smp_store_release().
else must use READ_ONCE() for loads.
Non-value-returning fetch-and-add operations are pro-
3. A shared variable is only modified while holding a vided by atomic_add(), atomic_sub(), atomic_
given lock by a given owning CPU or thread, but is inc(), and atomic_dec(), among others. An atomic
read by other CPUs or threads or by code not holding decrement that returns a reached-zero indication is pro-
that lock. All stores must use WRITE_ONCE(). The vided by both atomic_dec_and_test() and atomic_
owning CPU or thread may use plain loads, as may sub_and_test(). An atomic add that returns the
any CPU or thread holding the lock. Everything else new value is provided by atomic_add_return().
must use READ_ONCE() for loads. Both atomic_add_unless() and atomic_inc_not_
zero() provide conditional atomic operations, where
4. A shared variable is only accessed by a given CPU or nothing happens unless the original value of the atomic
thread and by a signal or interrupt handler running variable is different than the value specified (these are very
in that CPU’s or thread’s context. The handler can handy for managing reference counters, for example).
use plain loads and stores, as can any code that An atomic exchange operation is provided by atomic_
has prevented the handler from being invoked, that xchg(), and the celebrated compare-and-swap (CAS)
is, code that has blocked signals and/or interrupts. operation is provided by atomic_cmpxchg(). Both
All other code must use READ_ONCE() and WRITE_ of these return the old value. Many additional atomic
ONCE(). RMW primitives are available in the Linux kernel, see
5. A shared variable is only accessed by a given CPU or the Documentation/atomic_t.txt file in the Linux-
thread and by a signal or interrupt handler running kernel source tree.14
in that CPU’s or thread’s context, and the handler This book’s CodeSamples API closely follows that of
always restores the values of any variables that it the Linux kernel.
has written before return. The handler can use plain
loads and stores, as can any code that has prevented 4.3.6 Per-CPU Variables
the handler from being invoked, that is, code that
The Linux kernel uses DEFINE_PER_CPU() to define a
has blocked signals and/or interrupts. All other code
per-CPU variable, this_cpu_ptr() to form a reference
can use plain loads, but must use WRITE_ONCE()
to this CPU’s instance of a given per-CPU variable, per_
to prevent store tearing, store fusing, and invented
cpu() to access a specified CPU’s instance of a given
stores.
per-CPU variable, along with many other special-purpose
per-CPU operations.
Quick Quiz 4.32: What needs to happen if an interrupt or
signal handler might itself be interrupted? Listing 4.30 shows this book’s per-thread-variable API,
which is patterned after the Linux kernel’s per-CPU-
In most other cases, loads from and stores to a shared variable API. This API provides the per-thread equivalent
variable must use READ_ONCE() and WRITE_ONCE() or
stronger, respectively. But it bears repeating that neither 14 As of Linux kernel v5.11.
v2023.06.11a
4.4. THE RIGHT TOOL FOR THE JOB: HOW TO CHOOSE? 47
Listing 4.30: Per-Thread-Variable API using a per-thread variable. Such a variable can be defined
DEFINE_PER_THREAD(type, name) as follows:
DECLARE_PER_THREAD(type, name)
per_thread(name, thread)
DEFINE_PER_THREAD(int, counter);
__get_thread_var(name)
init_per_thread(name, v)
of global variables. Although this API is, strictly speaking, init_per_thread(counter, 0);
4.3.6.1 API Members The value of the counter is then the sum of its instances.
DEFINE_PER_THREAD() A snapshot of the value of the counter can thus be collected
The DEFINE_PER_THREAD() primitive defines a per- as follows:
thread variable. Unfortunately, it is not possible for_each_thread(t)
to provide an initializer in the way permitted by sum += READ_ONCE(per_thread(counter, t));
init_per_thread() As a rough rule of thumb, use the simplest tool that will
The init_per_thread() primitive sets all threads’ get the job done. If you can, simply program sequentially.
instances of the specified variable to the specified If that is insufficient, try using a shell script to mediate
value. The Linux kernel accomplishes this via normal parallelism. If the resulting shell-script fork()/exec()
C initialization, relying in clever use of linker scripts overhead (about 480 microseconds for a minimal C pro-
and code executed during the CPU-online process. gram on an Intel Core Duo laptop) is too large, try using
the C-language fork() and wait() primitives. If the
4.3.6.2 Usage Example overhead of these primitives (about 80 microseconds for
a minimal child process) is still too large, then you might
Suppose that we have a counter that is incremented very
need to use the POSIX threading primitives, choosing the
frequently but read out quite rarely. As will become clear
appropriate locking and/or atomic-operation primitives.
in Section 5.2, it is helpful to implement such a counter
If the overhead of the POSIX threading primitives (typi-
15You could instead use __thread or _Thread_local. cally sub-microsecond) is too great, then the primitives
v2023.06.11a
48 CHAPTER 4. TOOLS OF THE TRADE
v2023.06.11a
As easy as 1, 2, 3!
Unknown
Chapter 5
Counting
Counting is perhaps the simplest and most natural thing number of structures in use exceeds an exact limit (again, say
a computer can do. However, counting efficiently and 10,000). Suppose further that these structures are short-lived,
scalably on a large shared-memory multiprocessor can and that the limit is rarely exceeded, that there is almost always
be quite challenging. Furthermore, the simplicity of the at least one structure in use, and suppose further still that it is
underlying concept of counting allows us to explore the necessary to know exactly when this counter reaches zero, for
example, in order to free up some memory that is not required
fundamental issues of concurrency without the distractions
unless there is at least one structure in use.
of elaborate data structures or complex synchronization
primitives. Counting therefore provides an excellent
Quick Quiz 5.5: Removable I/O device access-count
introduction to parallel programming. problem. Suppose that you need to maintain a reference count
This chapter covers a number of special cases for which on a heavily used removable mass-storage device, so that you
there are simple, fast, and scalable counting algorithms. can tell the user when it is safe to remove the device. As usual,
But first, let us find out how much you already know about the user indicates a desire to remove the device, and the system
concurrent counting. tells the user when it is safe to do so.
Quick Quiz 5.1: Why should efficient and scalable counting Section 5.1 shows why counting is non-trivial. Sec-
be hard??? After all, computers have special hardware for the
tions 5.2 and 5.3 investigate network-packet counting
sole purpose of doing counting!!!
and approximate structure-allocation limits, respectively.
Section 5.4 takes on exact structure-allocation limits. Fi-
Quick Quiz 5.2: Network-packet counting problem. Sup-
nally, Section 5.5 presents performance measurements
pose that you need to collect statistics on the number of
networking packets transmitted and received. Packets might and discussion.
be transmitted or received by any CPU on the system. Suppose Sections 5.1 and 5.2 contain introductory material,
further that your system is capable of handling millions of while the remaining sections are more advanced.
packets per second per CPU, and that a systems-monitoring
package reads the count every five seconds. How would you
implement this counter? 5.1 Why Isn’t Concurrent Counting
Quick Quiz 5.3: Approximate structure-allocation limit
Trivial?
problem. Suppose that you need to maintain a count of the
number of structures allocated in order to fail any allocations Seek simplicity, and distrust it.
once the number of structures in use exceeds a limit (say,
10,000). Suppose further that the structures are short-lived, Alfred North Whitehead
the limit is rarely exceeded, and a “sloppy” approximate limit
is acceptable. Let’s start with something simple, for example, the
straightforward use of arithmetic shown in Listing 5.1
Quick Quiz 5.4: Exact structure-allocation limit problem. (count_nonatomic.c). Here, we have a counter on
Suppose that you need to maintain a count of the number of line 1, we increment it on line 5, and we read out its value
structures allocated in order to fail any allocations once the on line 10. What could be simpler?
49
v2023.06.11a
50 CHAPTER 5. COUNTING
10
100
1 atomic_t counter = ATOMIC_INIT(0);
2
3 static __inline__ void inc_count(void) Number of CPUs (Threads)
4 {
5 atomic_inc(&counter);
6 }
Figure 5.1: Atomic Increment Scalability on x86
7
8 static __inline__ long read_count(void)
9 {
10 return atomic_read(&counter); times slower than non-atomic increment, even when only
11 } a single thread is incrementing.1
This poor performance should not be a surprise, given
the discussion in Chapter 3, nor should it be a surprise
Quick Quiz 5.6: One thing that could be simpler is ++ instead that the performance of atomic increment gets slower
of that concatenation of READ_ONCE() and WRITE_ONCE(). as the number of CPUs and threads increase, as shown
Why all that extra typing??? in Figure 5.1. In this figure, the horizontal dashed line
resting on the x axis is the ideal performance that would
This approach has the additional advantage of being be achieved by a perfectly scalable algorithm: With
blazingly fast if you are doing lots of reading and almost such an algorithm, a given increment would incur the
no incrementing, and on small systems, the performance same overhead that it would in a single-threaded program.
is excellent. Atomic increment of a single global variable is clearly
There is just one large fly in the ointment: This approach decidedly non-ideal, and gets multiple orders of magnitude
can lose counts. On my six-core x86 laptop, a short run worse with additional CPUs.
invoked inc_count() 285,824,000 times, but the final
Quick Quiz 5.9: Why doesn’t the horizontal dashed line on
value of the counter was only 35,385,525. Although the x axis meet the diagonal line at 𝑥 = 1?
approximation does have a large place in computing, loss
of 87 % of the counts is a bit excessive. Quick Quiz 5.10: But atomic increment is still pretty fast.
Quick Quiz 5.7: But can’t a smart compiler prove that line 5 And incrementing a single variable in a tight loop sounds pretty
of Listing 5.1 is equivalent to the ++ operator and produce an unrealistic to me, after all, most of the program’s execution
x86 add-to-memory instruction? And won’t the CPU cache should be devoted to actually doing work, not accounting for
cause this to be atomic? the work it has done! Why should I care about making this go
faster?
Quick Quiz 5.8: The 8-figure accuracy on the number of
For another perspective on global atomic increment,
failures indicates that you really did test this. Why would it be
necessary to test such a trivial program, especially when the
consider Figure 5.2. In order for each CPU to get a
bug is easily seen by inspection? chance to increment a given global variable, the cache
line containing that variable must circulate among all
The straightforward way to count accurately is to use 1 Interestingly enough, non-atomically incrementing a counter will
atomic operations, as shown in Listing 5.2 (count_ advance the counter more quickly than atomically incrementing the
atomic.c). Line 1 defines an atomic variable, line 5 counter. Of course, if your only goal is to make the counter increase
quickly, an easier approach is to simply assign a large value to the counter.
atomically increments it, and line 10 reads it out. Be- Nevertheless, there is likely to be a role for algorithms that use carefully
cause this is atomic, it keeps perfect count. However, it is relaxed notions of correctness in order to gain greater performance and
slower: On my six-core x86 laptop, it is more than twenty scalability [And91, ACMS03, Rin13, Ung11].
v2023.06.11a
5.2. STATISTICAL COUNTERS 51
5.2.1 Design
Statistical counting is typically handled by providing a
counter per thread (or CPU, when running in the kernel),
Figure 5.3: Waiting to Count so that each thread updates its own counter, as was fore-
shadowed in Section 4.3.6 on page 46. The aggregate
value of the counters is read out by simply summing up
all of the threads’ counters, relying on the commutative
and associative properties of addition. This is an example
the CPUs, as shown by the red arrows. Such circulation of the Data Ownership pattern that will be introduced in
will take significant time, resulting in the poor perfor- Section 6.3.4 on page 86.
mance seen in Figure 5.1, which might be thought of as Quick Quiz 5.12: But doesn’t the fact that C’s “integers” are
shown in Figure 5.3. The following sections discuss high- limited in size complicate things?
performance counting, which avoids the delays inherent
in such circulation.
5.2.2 Array-Based Implementation
One way to provide per-thread variables is to allocate
Quick Quiz 5.11: But why can’t CPU designers simply
ship the addition operation to the data, avoiding the need to
an array with one element per thread (presumably cache
circulate the cache line containing the global variable being aligned and padded to avoid false sharing).
incremented? Quick Quiz 5.13: An array??? But doesn’t that limit the
number of threads?
v2023.06.11a
52 CHAPTER 5. COUNTING
Such an array can be wrapped into per-thread primitives, CPU 0 CPU 1 CPU 2 CPU 3
as shown in Listing 5.3 (count_stat.c). Line 1 defines
Cache Cache Cache Cache
an array containing a set of per-thread counters of type
Interconnect Interconnect
unsigned long named, creatively enough, counter.
Lines 3–8 show a function that increments the counters,
using the __get_thread_var() primitive to locate the Memory System Interconnect Memory
currently running thread’s element of the counter array.
Because this element is modified only by the correspond-
ing thread, non-atomic increment suffices. However, this Interconnect Interconnect
code uses WRITE_ONCE() to prevent destructive compiler Cache Cache Cache Cache
optimizations. For but one example, the compiler is within CPU 4 CPU 5 CPU 6 CPU 7
its rights to use a to-be-stored-to location as temporary
storage, thus writing what would be for all intents and Figure 5.4: Data Flow For Per-Thread Increment
purposes garbage to that location just before doing the
desired store. This could of course be rather confusing
to anything attempting to read out the count. The use the network-packet counting problem presented at the
of WRITE_ONCE() prevents this optimization and others beginning of this chapter.
besides. Quick Quiz 5.17: The read operation takes time to sum
Quick Quiz 5.14: What other nasty optimizations could up the per-thread values, and during that time, the counter
GCC apply? could well be changing. This means that the value returned
by read_count() in Listing 5.3 will not necessarily be exact.
Lines 10–18 show a function that reads out the aggregate Assume that the counter is being incremented at rate 𝑟 counts
value of the counter, using the for_each_thread() per unit time, and that read_count()’s execution consumes
𝛥 units of time. What is the expected error in the return value?
primitive to iterate over the list of currently running
threads, and using the per_thread() primitive to fetch
the specified thread’s counter. This code also uses READ_ However, many implementations provide cheaper mech-
ONCE() to ensure that the compiler doesn’t optimize these anisms for per-thread data that are free from arbitrary
loads into oblivion. For but one example, a pair of array-size limits. This is the topic of the next section.
consecutive calls to read_count() might be inlined, and
an intrepid optimizer might notice that the same locations 5.2.3 Per-Thread-Variable-Based Imple-
were being summed and thus incorrectly conclude that it
would be simply wonderful to sum them once and use the
mentation
resulting value twice. This sort of optimization might be The C language, since C11, features a _Thread_local
rather frustrating to people expecting later read_count() storage class that provides per-thread storage.2 This can be
calls to account for the activities of other threads. The use used as shown in Listing 5.4 (count_end.c) to implement
of READ_ONCE() prevents this optimization and others a statistical counter that not only scales well and avoids
besides. arbitrary thread-number limits, but that also incurs little
Quick Quiz 5.15: How does the per-thread counter variable or no performance penalty to incrementers compared to
in Listing 5.3 get initialized? simple non-atomic increment.
Lines 1–4 define needed variables: counter is the
per-thread counter variable, the counterp[] array allows
Quick Quiz 5.16: How is the code in Listing 5.3 supposed
to permit more than one counter?
threads to access each others’ counters, finalcount ac-
cumulates the total as individual threads exit, and final_
This approach scales linearly with increasing number mutex coordinates between threads accumulating the total
of updater threads invoking inc_count(). As is shown value of the counter and exiting threads.
by the green arrows on each CPU in Figure 5.4, the
reason for this is that each CPU can make rapid progress 2 GCC provides its own __thread storage class, which was used
incrementing its thread’s variable, without any expensive in previous versions of this book. The two methods for specifying a
cross-system communication. As such, this section solves thread-local variable are interchangeable when using GCC.
v2023.06.11a
5.2. STATISTICAL COUNTERS 53
Listing 5.4: Per-Thread Statistical Counters counter-pointers to that variable rather than setting them to
1 unsigned long _Thread_local counter = 0; NULL?
2 unsigned long *counterp[NR_THREADS] = { NULL };
3 unsigned long finalcount = 0;
4 DEFINE_SPINLOCK(final_mutex); Quick Quiz 5.20: Why on earth do we need something as
5
heavyweight as a lock guarding the summation in the function
6 static inline void inc_count(void)
7 { read_count() in Listing 5.4?
8 WRITE_ONCE(counter, counter + 1);
9 } Lines 25–32 show the count_register_thread()
10
11 static inline unsigned long read_count(void) function, which must be called by each thread before its
12 { first use of this counter. This function simply sets up this
13 int t;
14 unsigned long sum; thread’s element of the counterp[] array to point to its
15
16 spin_lock(&final_mutex);
per-thread counter variable.
17 sum = finalcount;
18 for_each_thread(t)
Quick Quiz 5.21: Why on earth do we need to acquire the
19 if (counterp[t] != NULL) lock in count_register_thread() in Listing 5.4? It is a
20 sum += READ_ONCE(*counterp[t]); single properly aligned machine-word store to a location that
21 spin_unlock(&final_mutex);
22 return sum; no other thread is modifying, so it should be atomic anyway,
23 } right?
24
25 void count_register_thread(unsigned long *p)
26 { Lines 34–42 show the count_unregister_
27 int idx = smp_thread_id(); thread() function, which must be called prior to exit
28
29 spin_lock(&final_mutex); by each thread that previously called count_register_
30 counterp[idx] = &counter; thread(). Line 38 acquires the lock, and line 41 releases
31 spin_unlock(&final_mutex);
32 } it, thus excluding any calls to read_count() as well as
33 other calls to count_unregister_thread(). Line 39
34 void count_unregister_thread(int nthreadsexpected)
35 { adds this thread’s counter to the global finalcount,
36 int idx = smp_thread_id(); and then line 40 NULLs out its counterp[] array entry.
37
38 spin_lock(&final_mutex); A subsequent call to read_count() will see the exiting
39 finalcount += counter; thread’s count in the global finalcount, and will
40 counterp[idx] = NULL;
41 spin_unlock(&final_mutex); skip the exiting thread when sequencing through the
42 } counterp[] array, thus obtaining the correct total.
This approach gives updaters almost exactly the same
performance as a non-atomic add, and also scales linearly.
Quick Quiz 5.18: Doesn’t that explicit counterp array On the other hand, concurrent reads contend for a sin-
in Listing 5.4 reimpose an arbitrary limit on the number
gle global lock, and therefore perform poorly and scale
of threads? Why doesn’t the C language provide a per_
abysmally. However, this is not a problem for statistical
thread() interface, similar to the Linux kernel’s per_cpu()
primitive, to allow threads to more easily access each others’ counters, where incrementing happens often and readout
per-thread variables? happens almost never. Of course, this approach is consid-
erably more complex than the array-based scheme, due to
The inc_count() function used by updaters is quite the fact that a given thread’s per-thread variables vanish
simple, as can be seen on lines 6–9. when that thread exits.
The read_count() function used by readers is a bit Quick Quiz 5.22: Fine, but the Linux kernel doesn’t have
more complex. Line 16 acquires a lock to exclude exiting to acquire a lock when reading out the aggregate value of
threads, and line 21 releases it. Line 17 initializes the per-CPU counters. So why should user-space code need to do
sum to the count accumulated by those threads that have this???
already exited, and lines 18–20 sum the counts being
Both the array-based and _Thread_local-based ap-
accumulated by threads currently running. Finally, line 22
proaches offer excellent update-side performance and
returns the sum.
scalability. However, these benefits result in large read-
Quick Quiz 5.19: Doesn’t the check for NULL on line 19 side expense for large numbers of threads. The next
of Listing 5.4 add extra branch mispredictions? Why not
section shows one way to reduce read-side expense while
have a variable set permanently to zero, and point unused
still retaining the update-side scalability.
v2023.06.11a
54 CHAPTER 5. COUNTING
counter. However, updaters only manipulate their per- 9 WRITE_ONCE(*p_counter, *p_counter + 1);
10 }
thread counters. A separate thread is provided to transfer 11
counts from the per-thread counters to the global counter. 12 static __inline__ unsigned long read_count(void)
13 {
Readers simply access the value of the global counter. If 14 return READ_ONCE(global_count);
updaters are active, the value used by the readers will 15 }
16
be out of date, however, once updates cease, the global 17 void *eventual(void *arg)
counter will eventually converge on the true value—hence 18 {
19 int t;
this approach qualifies as eventually consistent. 20 unsigned long sum;
The implementation is shown in Listing 5.5 (count_ 21
22 while (READ_ONCE(stopflag) < 3) {
stat_eventual.c). Lines 1–2 show the per-thread vari- 23 sum = 0;
able and the global variable that track the counter’s value, 24 for_each_thread(t)
25 sum += READ_ONCE(per_thread(counter, t));
and line 3 shows stopflag which is used to coordinate 26 WRITE_ONCE(global_count, sum);
termination (for the case where we want to terminate 27 poll(NULL, 0, 1);
28 if (READ_ONCE(stopflag))
the program with an accurate counter value). The inc_ 29 smp_store_release(&stopflag, stopflag + 1);
count() function shown on lines 5–10 is similar to its 30 }
31 return NULL;
counterpart in Listing 5.3. The read_count() function 32 }
shown on lines 12–15 simply returns the value of the 33
34 void count_init(void)
global_count variable. 35 {
36 int en;
However, the count_init() function on lines 34–44 37 pthread_t tid;
creates the eventual() thread shown on lines 17–32, 38
39 en = pthread_create(&tid, NULL, eventual, NULL);
which cycles through all the threads, summing the per- 40 if (en != 0) {
thread local counter and storing the sum to the global_ 41 fprintf(stderr, "pthread_create: %s\n", strerror(en));
42 exit(EXIT_FAILURE);
count variable. The eventual() thread waits an arbi- 43 }
trarily chosen one millisecond between passes. 44 }
45
The count_cleanup() function on lines 46–51 46 void count_cleanup(void)
coordinates termination. The call to smp_load_ 47 {
48 WRITE_ONCE(stopflag, 1);
acquire() here and the call to smp_store_release() 49 while (smp_load_acquire(&stopflag) < 3)
in eventual() ensure that all updates to global_ 50 poll(NULL, 0, 1);
51 }
count are visible to code following the call to count_
cleanup().
This approach gives extremely fast counter read-out
while still supporting linear counter-update scalability.
However, this excellent read-side performance and update-
side scalability comes at the cost of the additional thread
running eventual().
v2023.06.11a
5.3. APPROXIMATE LIMIT COUNTERS 55
Quick Quiz 5.23: Why doesn’t inc_count() in Listing 5.5 5.3 Approximate Limit Counters
need to use atomic instructions? After all, we now have
multiple threads accessing the per-thread counters!
An approximate answer to the right problem is worth
a good deal more than an exact answer to an
approximate problem.
Quick Quiz 5.24: Won’t the single global thread in the func-
tion eventual() of Listing 5.5 be just as severe a bottleneck John Tukey
as a global lock would be?
Another special case of counting involves limit-checking.
For example, as noted in the approximate structure-
Quick Quiz 5.25: Won’t the estimate returned by read_ allocation limit problem in Quick Quiz 5.3, suppose that
count() in Listing 5.5 become increasingly inaccurate as the you need to maintain a count of the number of structures
number of threads rises? allocated in order to fail any allocations once the number
of structures in use exceeds a limit, in this case, 10,000.
Suppose further that these structures are short-lived, that
Quick Quiz 5.26: Given that in the eventually-consistent this limit is rarely exceeded, and that this limit is approx-
algorithm shown in Listing 5.5 both reads and updates have
imate in that it is OK either to exceed it sometimes by
extremely low overhead and are extremely scalable, why
some bounded amount or to fail to reach it sometimes,
would anyone bother with the implementation described in
Section 5.2.2, given its costly read-side code? again by some bounded amount. See Section 5.4 if you
instead need the limit to be exact.
Quick Quiz 5.27: What is the accuracy of the estimate 5.3.1 Design
returned by read_count() in Listing 5.5?
One possible design for limit counters is to divide the
limit of 10,000 by the number of threads, and give each
thread a fixed pool of structures. For example, given 100
threads, each thread would manage its own pool of 100
structures. This approach is simple, and in some cases
5.2.5 Discussion works well, but it does not handle the common case where
a given structure is allocated by one thread and freed by
These three implementations show that it is possible another [MS93]. On the one hand, if a given thread takes
to obtain near-uniprocessor performance for statistical credit for any structures it frees, then the thread doing
counters, despite running on a parallel machine. most of the allocating runs out of structures, while the
threads doing most of the freeing have lots of credits that
Quick Quiz 5.28: What fundamental difference is there they cannot use. On the other hand, if freed structures
between counting packets and counting the total number of are credited to the CPU that allocated them, it will be
bytes in the packets, given that the packets vary in size? necessary for CPUs to manipulate each others’ counters,
which will require expensive atomic instructions or other
means of communicating between threads.3
Quick Quiz 5.29: Given that the reader must sum all the In short, for many important workloads, we cannot fully
threads’ counters, this counter-read operation could take a long partition the counter. Given that partitioning the counters
time given large numbers of threads. Is there any way that was what brought the excellent update-side performance
the increment operation can remain fast and scalable while for the three schemes discussed in Section 5.2, this might
allowing readers to also enjoy not only reasonable performance be grounds for some pessimism. However, the eventually
and scalability, but also good accuracy?
consistent algorithm presented in Section 5.2.4 provides
an interesting hint. Recall that this algorithm kept two sets
Given what has been presented in this section, you of books, a per-thread counter variable for updaters and a
should now be able to answer the Quick Quiz about 3 That said, if each structure will always be freed by the same CPU
statistical counters for networking near the beginning of (or thread) that allocated it, then this simple partitioning approach works
this chapter. extremely well.
v2023.06.11a
56 CHAPTER 5. COUNTING
global_count variable for readers, with an eventual() Listing 5.6: Simple Limit Counter Variables
thread that periodically updated global_count to be 1 unsigned long __thread counter = 0;
2 unsigned long __thread countermax = 0;
eventually consistent with the values of the per-thread 3 unsigned long globalcountmax = 10000;
counter. The per-thread counter perfectly partitioned 4 unsigned long globalcount = 0;
5 unsigned long globalreserve = 0;
the counter value, while global_count kept the full 6 unsigned long *counterp[NR_THREADS] = { NULL };
value. 7 DEFINE_SPINLOCK(gblcnt_mutex);
v2023.06.11a
5.3. APPROXIMATE LIMIT COUNTERS 57
5 return 1;
6 }
countermax 1 counter 1 7 spin_lock(&gblcnt_mutex);
8 globalize_count();
countermax 0
counter 0 9 if (globalcountmax -
10 globalcount - globalreserve < delta) {
11 spin_unlock(&gblcnt_mutex);
12 return 0;
globalcount
13 }
14 globalcount += delta;
15 balance_count();
16 spin_unlock(&gblcnt_mutex);
17 return 1;
18 }
19
20 static __inline__ int sub_count(unsigned long delta)
21 {
22 if (counter >= delta) {
Figure 5.5: Simple Limit Counter Variable Relationships 23 WRITE_ONCE(counter, counter - delta);
24 return 1;
25 }
26 spin_lock(&gblcnt_mutex);
in other words, no thread is permitted to access or modify 27 globalize_count();
28 if (globalcount < delta) {
any of the global variables unless it has acquired gblcnt_ 29 spin_unlock(&gblcnt_mutex);
mutex. 30 return 0;
31 }
Listing 5.7 shows the add_count(), sub_count(), 32 globalcount -= delta;
and read_count() functions (count_lim.c). 33 balance_count();
34 spin_unlock(&gblcnt_mutex);
Quick Quiz 5.30: Why does Listing 5.7 provide add_ 35 return 1;
36 }
count() and sub_count() instead of the inc_count() and 37
dec_count() interfaces show in Section 5.2? 38 static __inline__ unsigned long read_count(void)
39 {
40 int t;
Lines 1–18 show add_count(), which adds the speci- 41 unsigned long sum;
fied value delta to the counter. Line 3 checks to see if 42
43 spin_lock(&gblcnt_mutex);
there is room for delta on this thread’s counter, and, if 44 sum = globalcount;
so, line 4 adds it and line 5 returns success. This is the 45 for_each_thread(t) {
46 if (counterp[t] != NULL)
add_counter() fastpath, and it does no atomic opera- 47 sum += READ_ONCE(*counterp[t]);
tions, references only per-thread variables, and should not 48 }
49 spin_unlock(&gblcnt_mutex);
incur any cache misses. 50 return sum;
51 }
Quick Quiz 5.31: What is with the strange form of the
condition on line 3 of Listing 5.7? Why not the more intuitive
form of the fastpath shown in Listing 5.8?
v2023.06.11a
58 CHAPTER 5. COUNTING
the expression preceding the less-than sign shown in Fig- Listing 5.9: Simple Limit Counter Utility Functions
ure 5.5 as the difference in height of the two red (leftmost) 1 static __inline__ void globalize_count(void)
2 {
bars. If the addition of delta cannot be accommodated, 3 globalcount += counter;
then line 11 (as noted earlier) releases gblcnt_mutex 4 counter = 0;
5 globalreserve -= countermax;
and line 12 returns indicating failure. 6 countermax = 0;
Otherwise, we take the slowpath. Line 14 adds delta 7 }
8
to globalcount, and then line 15 invokes balance_ 9 static __inline__ void balance_count(void)
count() (shown in Listing 5.9) in order to update both the 10 {
11 countermax = globalcountmax -
global and the per-thread variables. This call to balance_ 12 globalcount - globalreserve;
count() will usually set this thread’s countermax to 13 countermax /= num_online_threads();
14 globalreserve += countermax;
re-enable the fastpath. Line 16 then releases gblcnt_ 15 counter = countermax / 2;
mutex (again, as noted earlier), and, finally, line 17 returns 16 if (counter > globalcount)
17 counter = globalcount;
indicating success. 18 globalcount -= counter;
19 }
Quick Quiz 5.32: Why does globalize_count() zero the 20
v2023.06.11a
5.3. APPROXIMATE LIMIT COUNTERS 59
this function does not change the aggregate value of the by the bottommost dotted line connecting the leftmost
counter, but instead changes how the counter’s current and center configurations. In other words, the sum of
value is represented. Line 3 adds the thread’s counter globalcount and the four threads’ counter variables is
variable to globalcount, and line 4 zeroes counter. the same in both configurations. Similarly, this change did
Similarly, line 5 subtracts the per-thread countermax not affect the sum of globalcount and globalreserve,
from globalreserve, and line 6 zeroes countermax. It as indicated by the upper dotted line.
is helpful to refer to Figure 5.5 when reading both this The rightmost configuration shows the relationship
function and balance_count(), which is next. of these counters after balance_count() is executed,
Lines 9–19 show balance_count(), which is roughly again by thread 0. One-quarter of the remaining count,
speaking the inverse of globalize_count(). This func- denoted by the vertical line extending up from all three
tion’s job is to set the current thread’s countermax vari- configurations, is added to thread 0’s countermax and
able to the largest value that avoids the risk of the counter half of that to thread 0’s counter. The amount added to
exceeding the globalcountmax limit. Changing the thread 0’s counter is also subtracted from globalcount
current thread’s countermax variable of course requires in order to avoid changing the overall value of the counter
corresponding adjustments to counter, globalcount (which is again the sum of globalcount and the three
and globalreserve, as can be seen by referring back to threads’ counter variables), again as indicated by the
Figure 5.5. By doing this, balance_count() maximizes lowermost of the two dotted lines connecting the center and
use of add_count()’s and sub_count()’s low-overhead rightmost configurations. The globalreserve variable
fastpaths. As with globalize_count(), balance_ is also adjusted so that this variable remains equal to the
count() is not permitted to change the aggregate value sum of the four threads’ countermax variables. Because
of the counter. thread 0’s counter is less than its countermax, thread 0
Lines 11–13 compute this thread’s share of that por- can once again increment the counter locally.
tion of globalcountmax that is not already covered by
Quick Quiz 5.37: In Figure 5.6, even though a quarter of the
either globalcount or globalreserve, and assign the remaining count up to the limit is assigned to thread 0, only an
computed quantity to this thread’s countermax. Line 14 eighth of the remaining count is consumed, as indicated by the
makes the corresponding adjustment to globalreserve. uppermost dotted line connecting the center and the rightmost
Line 15 sets this thread’s counter to the middle of the configurations. Why is that?
range from zero to countermax. Line 16 checks to
see whether globalcount can in fact accommodate this Lines 21–28 show count_register_thread(),
value of counter, and, if not, line 17 decreases counter which sets up state for newly created threads. This
accordingly. Finally, in either case, line 18 makes the function simply installs a pointer to the newly created
corresponding adjustment to globalcount. thread’s counter variable into the corresponding entry of
the counterp[] array under the protection of gblcnt_
Quick Quiz 5.36: Why set counter to countermax / 2
mutex.
in line 15 of Listing 5.9? Wouldn’t it be simpler to just take
countermax counts? Finally, lines 30–38 show count_unregister_
thread(), which tears down state for a soon-to-be-exiting
It is helpful to look at a schematic depicting how the thread. Line 34 acquires gblcnt_mutex and line 37 re-
relationship of the counters changes with the execution of leases it. Line 35 invokes globalize_count() to clear
first globalize_count() and then balance_count(), out this thread’s counter state, and line 36 clears this
as shown in Figure 5.6. Time advances from left to right, thread’s entry in the counterp[] array.
with the leftmost configuration roughly that of Figure 5.5.
The center configuration shows the relationship of these
5.3.3 Simple Limit Counter Discussion
same counters after globalize_count() is executed by
thread 0. As can be seen from the figure, thread 0’s This type of counter is quite fast when aggregate val-
counter (“c 0” in the figure) is added to globalcount, ues are near zero, with some overhead due to the com-
while the value of globalreserve is reduced by this same parison and branch in both add_count()’s and sub_
amount. Both thread 0’s counter and its countermax count()’s fastpaths. However, the use of a per-thread
(“cm 0” in the figure) are reduced to zero. The other three countermax reserve means that add_count() can fail
threads’ counters are unchanged. Note that this change even when the aggregate value of the counter is nowhere
did not affect the overall value of the counter, as indicated near globalcountmax. Similarly, sub_count() can fail
v2023.06.11a
60 CHAPTER 5. COUNTING
globalize_count() balance_count()
cm 3
globalreserve
c 3
globalreserve
cm 3 cm 3
globalreserve
c 3 c 3
cm 2
c 2
cm 2 cm 2
c 2 c 2
cm 1 c 1
cm 1 c 1 cm 1 c 1
cm 0
c 0
cm 0 c 0
globalcount
globalcount
globalcount
v2023.06.11a
5.4. EXACT LIMIT COUNTERS 61
5.3.5 Approximate Limit Counter Discus- Listing 5.12: Atomic Limit Counter Variables and Access
sion Functions
1 atomic_t __thread counterandmax = ATOMIC_INIT(0);
2 unsigned long globalcountmax = 1 << 25;
These changes greatly reduce the limit inaccuracy seen in 3 unsigned long globalcount = 0;
the previous version, but present another problem: Any 4 unsigned long globalreserve = 0;
5 atomic_t *counterp[NR_THREADS] = { NULL };
given value of MAX_COUNTERMAX will cause a workload- 6 DEFINE_SPINLOCK(gblcnt_mutex);
dependent fraction of accesses to fall off the fastpath. As 7 #define CM_BITS (sizeof(atomic_t) * 4)
8 #define MAX_COUNTERMAX ((1 << CM_BITS) - 1)
the number of threads increase, non-fastpath execution 9
will become both a performance and a scalability problem. 10 static __inline__ void
11 split_counterandmax_int(int cami, int *c, int *cm)
However, we will defer this problem and turn instead to 12 {
counters with exact limits. 13 *c = (cami >> CM_BITS) & MAX_COUNTERMAX;
14 *cm = cami & MAX_COUNTERMAX;
15 }
16
5.4 Exact Limit Counters 17
18
static __inline__ void
split_counterandmax(atomic_t *cam, int *old, int *c, int *cm)
19 {
20 unsigned int cami = atomic_read(cam);
Exactitude can be expensive. Spend wisely. 21
22 *old = cami;
Unknown 23 split_counterandmax_int(cami, c, cm);
24 }
25
To solve the exact structure-allocation limit problem noted 26 static __inline__ int merge_counterandmax(int c, int cm)
in Quick Quiz 5.4, we need a limit counter that can 27 {
28 unsigned int cami;
tell exactly when its limits are exceeded. One way of 29
implementing such a limit counter is to cause threads 30 cami = (c << CM_BITS) | cm;
31 return ((int)cami);
that have reserved counts to give them up. One way to 32 }
do this is to use atomic instructions. Of course, atomic
instructions will slow down the fastpath, but on the other
hand, it would be silly not to at least give them a try. Lines 2–6 show the definitions for globalcountmax,
globalcount, globalreserve, counterp, and
5.4.1 Atomic Limit Counter Implementa- gblcnt_mutex, all of which take on roles similar to
tion their counterparts in Listing 5.10. Line 7 defines CM_
BITS, which gives the number of bits in each half of
Unfortunately, if one thread is to safely remove counts counterandmax, and line 8 defines MAX_COUNTERMAX,
from another thread, both threads will need to atomically which gives the maximum value that may be held in either
manipulate that thread’s counter and countermax vari- half of counterandmax.
ables. The usual way to do this is to combine these two
Quick Quiz 5.39: In what way does line 7 of Listing 5.12
variables into a single variable, for example, given a 32-bit violate the C standard?
variable, using the high-order 16 bits to represent counter
and the low-order 16 bits to represent countermax. Lines 10–15 show the split_counterandmax_
Quick Quiz 5.38: Why is it necessary to atomically manip- int() function, which, when given the underlying int
ulate the thread’s counter and countermax variables as a from the atomic_t counterandmax variable, splits it
unit? Wouldn’t it be good enough to atomically manipulate into its counter (c) and countermax (cm) components.
them individually? Line 13 isolates the most-significant half of this int,
placing the result as specified by argument c, and line 14
The variables and access functions for a simple atomic isolates the least-significant half of this int, placing the
limit counter are shown in Listing 5.12 (count_lim_ result as specified by argument cm.
atomic.c). The counter and countermax variables in Lines 17–24 show the split_counterandmax() func-
earlier algorithms are combined into the single variable tion, which picks up the underlying int from the spec-
counterandmax shown on line 1, with counter in the ified variable on line 20, stores it as specified by the
upper half and countermax in the lower half. This old argument on line 22, and then invokes split_
variable is of type atomic_t, which has an underlying counterandmax_int() to split it on line 23.
representation of int.
v2023.06.11a
62 CHAPTER 5. COUNTING
v2023.06.11a
5.4. EXACT LIMIT COUNTERS 63
Listing 5.14: Atomic Limit Counter Read Listing 5.15: Atomic Limit Counter Utility Functions 1
1 unsigned long read_count(void) 1 static void globalize_count(void)
2 { 2 {
3 int c; 3 int c;
4 int cm; 4 int cm;
5 int old; 5 int old;
6 int t; 6
7 unsigned long sum; 7 split_counterandmax(&counterandmax, &old, &c, &cm);
8 8 globalcount += c;
9 spin_lock(&gblcnt_mutex); 9 globalreserve -= cm;
10 sum = globalcount; 10 old = merge_counterandmax(0, 0);
11 for_each_thread(t) { 11 atomic_set(&counterandmax, old);
12 if (counterp[t] != NULL) { 12 }
13 split_counterandmax(counterp[t], &old, &c, &cm); 13
14 sum += c; 14 static void flush_local_count(void)
15 } 15 {
16 } 16 int c;
17 spin_unlock(&gblcnt_mutex); 17 int cm;
18 return sum; 18 int old;
19 } 19 int t;
20 int zero;
21
22 if (globalreserve == 0)
of delta still cannot be accommodated, then line 24 23 return;
24 zero = merge_counterandmax(0, 0);
releases gblcnt_mutex (as noted earlier), and then line 25 25 for_each_thread(t)
returns failure. 26 if (counterp[t] != NULL) {
27 old = atomic_xchg(counterp[t], zero);
Otherwise, line 28 adds delta to the global counter, 28 split_counterandmax_int(old, &c, &cm);
line 29 spreads counts to the local state if appropriate, 29 globalcount += c;
30 globalreserve -= cm;
line 30 releases gblcnt_mutex (again, as noted earlier), 31 }
and finally, line 31 returns success. 32 }
v2023.06.11a
64 CHAPTER 5. COUNTING
uses a signal handler to steal counts from other threads. READY are green, REQ is red, and ACK is blue.
v2023.06.11a
5.4. EXACT LIMIT COUNTERS 65
Listing 5.17: Signal-Theft Limit Counter Data variables. Lines 1–7 show globalize_count(), which
1 #define THEFT_IDLE 0 is identical to earlier implementations. Lines 9–16 show
2 #define THEFT_REQ 1
3 #define THEFT_ACK 2 flush_local_count_sig(), which is the signal han-
4 #define THEFT_READY 3 dler used in the theft process. Lines 11 and 12 check
5
6 int __thread theft = THEFT_IDLE; to see if the theft state is REQ, and, if not returns
7 int __thread counting = 0; without change. Line 13 sets the theft state to ACK,
8 unsigned long __thread counter = 0;
9 unsigned long __thread countermax = 0; and, if line 14 sees that this thread’s fastpaths are not
10 unsigned long globalcountmax = 10000; running, line 15 uses smp_store_release() to set the
11 unsigned long globalcount = 0;
12 unsigned long globalreserve = 0; theft state to READY, further ensuring that any change
13 unsigned long *counterp[NR_THREADS] = { NULL }; to counter in the fastpath happens before this change of
14 unsigned long *countermaxp[NR_THREADS] = { NULL };
15 int *theftp[NR_THREADS] = { NULL }; theft to READY.
16 DEFINE_SPINLOCK(gblcnt_mutex);
17 #define MAX_COUNTERMAX 100 Quick Quiz 5.50: In Listing 5.18, doesn’t flush_local_
count_sig() need stronger memory barriers?
handler is not permitted to change the state, and therefore Lines 18–47 show flush_local_count(), which is
simply returns. Otherwise, if the counting variable is set, called from the slowpath to flush all threads’ local counts.
indicating that the current thread’s fastpath is in progress, The loop spanning lines 23–32 advances the theft state
the signal handler sets the theft state to ACK, otherwise for each thread that has local count, and also sends that
to READY. thread a signal. Line 24 skips any non-existent threads.
If the theft state is ACK, only the fastpath is permitted Otherwise, line 25 checks to see if the current thread
to change the theft state, as indicated by the blue color. holds any local count, and, if not, line 26 sets the thread’s
When the fastpath completes, it sets the theft state to theft state to READY and line 27 skips to the next thread.
READY. Otherwise, line 29 sets the thread’s theft state to REQ
Once the slowpath sees a thread’s theft state is and line 30 sends the thread a signal.
READY, the slowpath is permitted to steal that thread’s Quick Quiz 5.51: In Listing 5.18, why is it safe for line 25 to
count. The slowpath then sets that thread’s theft state to directly access the other thread’s countermax variable?
IDLE.
Quick Quiz 5.48: In Figure 5.7, why is the REQ theft state Quick Quiz 5.52: In Listing 5.18, why doesn’t line 30 check
colored red? for the current thread sending itself a signal?
Quick Quiz 5.49: In Figure 5.7, what is the point of having Quick Quiz 5.53: The code shown in Listings 5.17 and 5.18
separate REQ and ACK theft states? Why not simplify the works with GCC and POSIX. What would be required to make
state machine by collapsing them into a single REQACK state? it also conform to the ISO C standard?
Then whichever of the signal handler or the fastpath gets there
first could set the state to READY. The loop spanning lines 33–46 waits until each thread
reaches READY state, then steals that thread’s count.
Lines 34–35 skip any non-existent threads, and the loop
5.4.4 Signal-Theft Limit Counter Imple- spanning lines 36–40 waits until the current thread’s
theft state becomes READY. Line 37 blocks for a
mentation
millisecond to avoid priority-inversion problems, and if
Listing 5.17 (count_lim_sig.c) shows the data struc- line 38 determines that the thread’s signal has not yet
tures used by the signal-theft based counter implemen- arrived, line 39 resends the signal. Execution reaches
tation. Lines 1–7 define the states and values for the line 41 when the thread’s theft state becomes READY,
per-thread theft state machine described in the preceding so lines 41–44 do the thieving. Line 45 then sets the
section. Lines 8–17 are similar to earlier implementa- thread’s theft state back to IDLE.
tions, with the addition of lines 14 and 15 to allow remote Quick Quiz 5.54: In Listing 5.18, why does line 39 resend
access to a thread’s countermax and theft variables, the signal?
respectively.
Listing 5.18 shows the functions responsible for migrat- Lines 49–61 show balance_count(), which is similar
ing counts between per-thread variables and the global to that of earlier examples.
v2023.06.11a
66 CHAPTER 5. COUNTING
v2023.06.11a
5.4. EXACT LIMIT COUNTERS 67
Listing 5.21: Signal-Theft Limit Counter Read Function
 1 unsigned long read_count(void)
 2 {
 3   int t;
 4   unsigned long sum;
 5
 6   spin_lock(&gblcnt_mutex);
 7   sum = globalcount;
 8   for_each_thread(t) {
 9     if (counterp[t] != NULL)
10       sum += READ_ONCE(*counterp[t]);
11   }
12   spin_unlock(&gblcnt_mutex);
13   return sum;
14 }

Listing 5.22: Signal-Theft Limit Counter Initialization Functions
 1 void count_init(void)
 2 {
 3   struct sigaction sa;
 4
 5   sa.sa_handler = flush_local_count_sig;
 6   sigemptyset(&sa.sa_mask);
 7   sa.sa_flags = 0;
 8   if (sigaction(SIGUSR1, &sa, NULL) != 0) {
 9     perror("sigaction");
10     exit(EXIT_FAILURE);
11   }
12 }
13
14 void count_register_thread(void)
15 {
16   int idx = smp_thread_id();
17
18   spin_lock(&gblcnt_mutex);
19   counterp[idx] = &counter;
20   countermaxp[idx] = &countermax;
21   theftp[idx] = &theft;
22   spin_unlock(&gblcnt_mutex);
23 }
24
25 void count_unregister_thread(int nthreadsexpected)
26 {
27   int idx = smp_thread_id();
28

Listing 5.19 shows the add_count() function. The fastpath spans lines 5–18, and the slowpath lines 19–33. Line 5 sets the per-thread counting variable to 1 so that any subsequent signal handlers interrupting this thread will set the theft state to ACK rather than READY, allowing this fastpath to complete properly. Line 6 prevents the compiler from reordering any of the fastpath body to precede the setting of counting. Lines 7 and 8 check
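The full add_count() appears in Listing 5.19; the following is a minimal sketch of just the fastpath protocol described above. The per-thread counting, counter, countermax, and theft variables follow Listing 5.17, but the exact fastpath condition and the add_count_slowpath() helper are illustrative assumptions rather than the book's code.

/* Sketch of the add_count() fastpath protocol only; see Listing 5.19
 * for the real thing.  add_count_slowpath() stands in for the real
 * slowpath, which acquires gblcnt_mutex and rebalances the counters. */
static int add_count_sketch(unsigned long delta)
{
  int fastpath = 0;

  WRITE_ONCE(counting, 1);  /* Signal handler now answers ACK, not READY. */
  barrier();                /* Keep the fastpath body after the store. */
  if ((READ_ONCE(theft) == THEFT_IDLE ||
       READ_ONCE(theft) == THEFT_REQ) &&
      countermax - counter >= delta) {
    WRITE_ONCE(counter, counter + delta);
    fastpath = 1;
  }
  barrier();
  WRITE_ONCE(counting, 0);  /* Fastpath complete. */
  barrier();
  if (READ_ONCE(theft) == THEFT_ACK)       /* Handler deferred to us, so */
    WRITE_ONCE(theft, THEFT_READY);        /* hand off to the slowpath.  */
  if (fastpath)
    return 1;
  return add_count_slowpath(delta);        /* Hypothetical slowpath. */
}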
as required by ever-changing hardware performance characteristics.

Quick Quiz 5.56: What if you want an exact limit counter to be exact only for its lower limit, but to allow the upper limit to be inexact?

5.4.6 Applying Exact Limit Counters

Although the exact limit counter implementations presented in this section can be very useful, they are not much help if the counter's value remains near zero at all times, as it might when counting the number of outstanding accesses to an I/O device. The high overhead of such near-zero counting is especially painful given that we normally don't care how many references there are. As noted in the removable I/O device access-count problem posed by Quick Quiz 5.5, the number of accesses is irrelevant except in those rare cases when someone is actually trying to remove the device.

One simple solution to this problem is to add a large "bias" (for example, one billion) to the counter in order to ensure that the value is far enough from zero that the counter can operate efficiently. When someone wants to remove the device, this bias is subtracted from the counter value. Counting the last few accesses will be quite inefficient, but the important point is that the many prior accesses will have been counted at full speed.

Quick Quiz 5.57: What else had you better have done when using a biased counter?

Although a biased counter can be quite helpful and useful, it is only a partial solution to the removable I/O device access-count problem called out on page 49. When attempting to remove a device, we must not only know the precise number of current I/O accesses, we also need to prevent any future accesses from starting. One way to accomplish this is to read-acquire a reader-writer lock when updating the counter, and to write-acquire that same reader-writer lock when checking the counter. Code for doing I/O might be as follows:

 1 read_lock(&mylock);
 2 if (removing) {
 3   read_unlock(&mylock);
 4   cancel_io();
 5 } else {
 6   add_count(1);
 7   read_unlock(&mylock);
 8   do_io();
 9   sub_count(1);
10 }

Line 1 read-acquires the lock, and either line 3 or 7 releases it. Line 2 checks to see if the device is being removed, and, if so, line 3 releases the lock and line 4 cancels the I/O, or takes whatever action is appropriate given that the device is to be removed. Otherwise, line 6 increments the access count, line 7 releases the lock, line 8 performs the I/O, and line 9 decrements the access count.

Quick Quiz 5.58: This is ridiculous! We are read-acquiring a reader-writer lock to update the counter? What are you playing at???

The code to remove the device might be as follows:

1 write_lock(&mylock);
2 removing = 1;
3 sub_count(mybias);
4 write_unlock(&mylock);
5 while (read_count() != 0)
6   poll(NULL, 0, 1);
7 remove_device();

Line 1 write-acquires the lock and line 4 releases it. Line 2 notes that the device is being removed, and the loop spanning lines 5–6 waits for any I/O operations to complete. Finally, line 7 does any additional processing needed to prepare for device removal.

Quick Quiz 5.59: What other issues would need to be accounted for in a real system?

5.5 Parallel Counting Discussion

This idea that there is generality in the specific is of far-reaching importance.

Douglas R. Hofstadter

This chapter has presented the reliability, performance, and scalability problems with traditional counting primitives. The C-language ++ operator is not guaranteed to function reliably in multithreaded code, and atomic operations to a single variable neither perform nor scale well. This chapter therefore presented a number of counting algorithms that perform and scale extremely well in certain special cases.

It is well worth reviewing the lessons from these counting algorithms. To that end, Section 5.5.1 overviews requisite validation, Section 5.5.2 summarizes performance and scalability, Section 5.5.3 discusses the need for specialization, and finally, Section 5.5.4 enumerates lessons learned and calls attention to later chapters that will expand on these lessons.
Algorithm      Section   Exact?   Updates   Reads (ns)
(count_*.c)                       (ns)      1 CPU    8 CPUs   64 CPUs   420 CPUs
stat           5.2.2              6.3       294      303      315       612
stat_eventual  5.2.4              6.4       1        1        1         1
end            5.2.3              2.9       301      6,309    147,594   239,683
end_rcu        13.5.1             2.9       454      481      508       2,317
lim            5.3.2     N        3.2       435      6,678    156,175   239,422
lim_app        5.3.4     N        2.4       485      7,041    173,108   239,682
lim_atomic     5.4.1     Y        19.7      513      7,085    199,957   239,450
lim_sig        5.4.4     Y        4.7       519      6,805    120,000   238,811
In short, this chapter has demonstrated a number of counting algorithms that perform and scale extremely well in a number of special cases. But must our parallel counting be confined to special cases? Wouldn't it be better to have a general algorithm that operated efficiently in all cases? The next section looks at these questions.

5.5.3 Parallel Counting Specializations

The fact that these algorithms only work well in their respective special cases might be considered a major problem with parallel programming in general. After all, the C-language ++ operator works just fine in single-threaded code, and not just for special cases, but in general, right?

This line of reasoning does contain a grain of truth, but is in essence misguided. The problem is not parallelism as such, but rather scalability. To understand this, first consider the C-language ++ operator. The fact is that it does not work in general, only for a restricted range of numbers. If you need to deal with 1,000-digit decimal numbers, the C-language ++ operator will not work for you.

Quick Quiz 5.64: The ++ operator works just fine for 1,000-digit numbers! Haven't you heard of operator overloading???

This problem is not specific to arithmetic. Suppose you need to store and query data. Should you use an ASCII file? XML? A relational database? A linked list? A dense array? A B-tree? A radix tree? Or one of the plethora of other data structures and environments that permit data to be stored and queried? It depends on what you need to do, how fast you need it done, and how large your data set is—even on sequential systems.

Similarly, if you need to count, your solution will depend on how large of numbers you need to work with, how many CPUs need to be manipulating a given number concurrently, how the number is to be used, and what level of performance and scalability you will need.

Nor is this problem specific to software. The design for a bridge meant to allow people to walk across a small brook might be as simple as a single wooden plank. But you would probably not use a plank to span the kilometers-wide mouth of the Columbia River, nor would such a design be advisable for bridges carrying concrete trucks. In short, just as bridge design must change with increasing span and load, so must software design change as the number of CPUs increases. That said, it would be good to automate this process, so that the software adapts to changes in hardware configuration and in workload. There has in fact been some research into this sort of automation [AHS+03, SAH+03], and the Linux kernel does some boot-time reconfiguration, including limited binary rewriting. This sort of adaptation will become increasingly important as the number of CPUs on mainstream systems continues to increase.

In short, as discussed in Chapter 3, the laws of physics constrain parallel software just as surely as they constrain mechanical artifacts such as bridges. These constraints force specialization, though in the case of software it might be possible to automate the choice of specialization to fit the hardware and workload in question.

Of course, even generalized counting is quite specialized. We need to do a great number of other things with computers. The next section relates what we have learned from counters to topics taken up later in this book.

5.5.4 Parallel Counting Lessons

The opening paragraph of this chapter promised that our study of counting would provide an excellent introduction to parallel programming. This section makes explicit connections between the lessons from this chapter and the material presented in a number of later chapters.

The examples in this chapter have shown that an important scalability and performance tool is partitioning. The counters might be fully partitioned, as in the statistical counters discussed in Section 5.2, or partially partitioned as in the limit counters discussed in Sections 5.3 and 5.4. Partitioning will be considered in far greater depth in Chapter 6, and partial parallelization in particular in Section 6.4, where it is called parallel fastpath.

Quick Quiz 5.65: But if we are going to have to partition everything, why bother with shared-memory multithreading? Why not just partition the problem completely and run as multiple processes, each in its own address space?

The partially partitioned counting algorithms used locking to guard the global data, and locking is the subject of Chapter 7. In contrast, the partitioned data tended to be fully under the control of the corresponding thread, so that no synchronization whatsoever was required. This data ownership will be introduced in Section 6.3.4 and discussed in more detail in Chapter 8.

Because integer addition and subtraction are extremely cheap compared to typical synchronization operations, achieving reasonable scalability requires synchronization operations be used sparingly. One way of achieving this is to batch the addition and subtraction operations, so that
Chapter 6
Partitioning and Synchronization Design

Divide and rule.

Philip II of Macedon
This chapter describes how to design software to take advantage of modern commodity multicore systems by using idioms, or "design patterns" [Ale79, GHJV95, SSRB00], to balance performance, scalability, and response time. Correctly partitioned problems lead to simple, scalable, and high-performance solutions, while poorly partitioned problems result in slow and complex solutions. This chapter will help you design partitioning into your code, with some discussion of batching and weakening as well. The word "design" is very important: You should partition first, batch second, weaken third, and code fourth. Changing this order often leads to poor performance and scalability along with great frustration.1

This chapter will also look at some specific problems, including:

1. Constraints on the classic Dining Philosophers problem requiring that all the philosophers be able to dine concurrently.

2. Lock-based double-ended queue implementations that provide concurrency between operations on both ends of a given queue when there are many elements in the queue, but still work correctly when the queue contains only a few elements. (Or, for that matter, no elements.)

3. Summarizing the rough quality of a concurrent algorithm with only a few numbers.

4. Selecting the right granularity of partitioning.

5. Concurrent designs for applications that do not fully partition.

6. Obtaining more than 2x speedup from two CPUs.

1 That other great dodge around the Laws of Physics, read-only replication, is covered in Chapter 9.

To this end, Section 6.1 presents partitioning exercises, Section 6.2 reviews partitionability design criteria, Section 6.3 discusses synchronization granularity selection, Section 6.4 overviews important parallel-fastpath design patterns that provide speed and scalability on common-case fastpaths while using simpler less-scalable "slow path" fallbacks for unusual situations, and finally Section 6.5 takes a brief look beyond partitioning.

6.1 Partitioning Exercises

Whenever a theory appears to you as the only possible one, take this as a sign that you have neither understood the theory nor the problem which it was intended to solve.

Karl Popper

Although partitioning is more widely understood than it was in the early 2000s, its value is still underappreciated. Section 6.1.1 therefore takes a more highly parallel look at the classic Dining Philosophers problem and Section 6.1.2 revisits the double-ended queue.

6.1.1 Dining Philosophers Problem

Figure 6.1 shows a diagram of the classic Dining Philosophers problem [Dij71]. This problem features five philosophers who do nothing but think and eat a "very difficult kind of spaghetti" which requires two forks to eat.2 A given philosopher is permitted to use only the forks to his or her immediate right and left, but will not put a given fork down until sated.

The object is to construct an algorithm that, quite literally, prevents starvation. One starvation scenario

2 But feel free to instead think in terms of chopsticks.
3. P4 picks up fork 3.

4. P5 picks up fork 4.
section shows how a partitioning design strategy can result in a reasonably simple implementation, looking at three general approaches in the following sections. But first, how should we validate a concurrent double-ended queue?

elements pushed onto it must have already been popped from it.

The beginnings of a test suite for concurrent double-ended queues ("deqtorture.h") provides the following checks:
Figure 6.5: Double-Ended Queue With Left- and Right-Hand Locks

code, you should test considerably more thoroughly for code intended for production use. Chapters 11 and 12 cover a large array of validation tools and techniques.

But with a prototype test suite in place, we are ready to look at the double-ended-queue algorithms in the next sections.

the list can shift the queue from one special case to another at any time. It is far better to consider other designs.

6.1.2.3 Compound Double-Ended Queue

One way of forcing non-overlapping lock domains is shown in Figure 6.6. Two separate double-ended queues are run in tandem, each protected by its own lock. This means that elements must occasionally be shuttled from one of the double-ended queues to the other, in which case both locks must be held. A simple lock hierarchy may be used to avoid deadlock, for example, always acquiring the left-hand lock before acquiring the right-hand lock. This will be much simpler than applying two locks to the same double-ended queue, as we can unconditionally left-enqueue elements to the left-hand queue and right-enqueue elements to the right-hand queue. The main complication arises when dequeuing from an empty queue, in which case it is necessary to:

1. If holding the right-hand lock, release it and acquire the left-hand lock.
Figure 6.7: Hashed Double-Ended Queue
the right-hand queue, and line 12 releases the right-hand lock. The element, if any, that was dequeued on line 9 will be returned.

The pdeq_pop_r() implementation is shown on lines 18–38 of Listing 6.3. As before, line 22 acquires the right-hand lock (and line 36 releases it), and line 23 attempts to right-dequeue an element from the right-hand queue, and, if successful, skips lines 25–35 to simply return this element. However, if line 24 determines that there was no element to dequeue, line 25 releases the right-hand lock and lines 26–27 acquire both locks in the proper order. Line 28 then attempts to right-dequeue an element from the right-hand list again, and if line 29 determines that this second attempt has failed, line 30 right-dequeues an element from the left-hand queue (if there is one available), line 31 moves any remaining elements from the left-hand queue to the right-hand queue, and line 32 initializes the left-hand queue. Either way, line 34 releases the left-hand lock.

Quick Quiz 6.8: Why is it necessary to retry the right-dequeue operation on line 28 of Listing 6.3?

Quick Quiz 6.9: Surely the left-hand lock must sometimes be available!!! So why is it necessary that line 25 of Listing 6.3 unconditionally release the right-hand lock?

The pdeq_push_l() implementation is shown on lines 40–45 of Listing 6.3. Line 42 acquires the left-hand spinlock, line 43 left-enqueues the element onto the left-hand queue, and finally line 44 releases the lock. The pdeq_push_r() implementation (shown on lines 47–52) is quite similar.
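As a rough illustration of the pop-right logic just described, the sketch below follows the same steps. The struct pdeq layout and the single-threaded deq_pop_r(), deq_splice_l_to_r(), and init_deq() helpers are placeholders, not the actual code of Listing 6.3.

/* Illustrative sketch only; field and helper names are placeholders. */
struct element *pdeq_pop_r_sketch(struct pdeq *d)
{
  struct element *e;

  spin_lock(&d->rlock);                      /* Right-hand lock only. */
  e = deq_pop_r(&d->rdeq);
  if (e != NULL) {
    spin_unlock(&d->rlock);
    return e;                                /* Common case. */
  }
  spin_unlock(&d->rlock);                    /* Respect the lock hierarchy: */
  spin_lock(&d->llock);                      /* left-hand lock first, */
  spin_lock(&d->rlock);                      /* then right-hand lock. */
  e = deq_pop_r(&d->rdeq);                   /* Retry: might be non-empty now. */
  if (e == NULL) {
    e = deq_pop_r(&d->ldeq);                 /* Steal from the left-hand queue. */
    deq_splice_l_to_r(&d->ldeq, &d->rdeq);   /* Move any leftovers right. */
    init_deq(&d->ldeq);                      /* Reinitialize the left-hand queue. */
  }
  spin_unlock(&d->rlock);
  spin_unlock(&d->llock);
  return e;
}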
Quick Quiz 6.10: But in the case where data is flowing in only one direction, the algorithm shown in Listing 6.3 will have both ends attempting to acquire the same lock whenever the consuming end empties its underlying double-ended queue. Doesn't that mean that sometimes this algorithm fails to provide concurrent access to both ends of the queue even when the queue contains an arbitrarily large number of elements?

6.1.2.6 Double-Ended Queue Discussion

The compound implementation is somewhat more complex than the hashed variant presented in Section 6.1.2.4, but is still reasonably simple. Of course, a more intelligent rebalancing scheme could be arbitrarily complex, but the simple scheme shown here has been shown to perform well compared to software alternatives [DCW+11] and even compared to algorithms using hardware assist [DLM+10]. Nevertheless, the best we can hope for from such a scheme is 2x scalability, as at most two threads can be holding the dequeue's locks concurrently. This limitation also applies to algorithms based on non-blocking synchronization, such as the compare-and-swap-based dequeue algorithm of Michael [Mic03].5

Quick Quiz 6.11: Why are there not one but two solutions to the double-ended queue problem?

In fact, as noted by Dice et al. [DLM+10], an unsynchronized single-threaded double-ended queue significantly outperforms any of the parallel implementations they studied. Therefore, the key point is that there can be significant overhead enqueuing to or dequeuing from a shared queue, regardless of implementation. This should come as no surprise in light of the material in Chapter 3, given the strict first-in-first-out (FIFO) nature of these queues.

Furthermore, these strict FIFO queues are strictly FIFO only with respect to linearization points [HW90]6 that are not visible to the caller; in fact, in these examples, the linearization points are buried in the lock-based critical sections. These queues are not strictly FIFO with respect to (say) the times at which the individual operations started [HKLP12]. This indicates that the strict FIFO property is not all that valuable in concurrent programs, and in fact, Kirsch et al. present less-strict queues that provide improved performance and scalability [KLP12].7 All that said, if you are pushing all the data used by your concurrent program through a single queue, you really need to rethink your overall design.

5 This paper is interesting in that it showed that special double-compare-and-swap (DCAS) instructions are not needed for lock-free implementations of double-ended queues. Instead, the common compare-and-swap (e.g., x86 cmpxchg) suffices.

6 In short, a linearization point is a single point within a given function where that function can be said to have taken effect. In this lock-based implementation, the linearization points can be said to be anywhere within the critical section that does the work.

7 Nir Shavit produced relaxed stacks for roughly the same reasons [Sha11]. This situation leads some to believe that the linearization points are useful to theorists rather than developers, and leads others to wonder to what extent the designers of such data structures and algorithms were considering the needs of their users.

6.1.3 Partitioning Example Discussion

The optimal solution to the dining philosophers problem given in the answer to the Quick Quiz in Section 6.1.1 is an excellent example of "horizontal parallelism" or "data parallelism". The synchronization overhead in this case is nearly (or even exactly) zero. In contrast, the double-ended queue implementations are examples of "vertical parallelism" or "pipelining", given that data moves from
one thread to another. The tighter coordination required for pipelining in turn requires larger units of work to obtain a given level of efficiency.

Quick Quiz 6.12: The tandem double-ended queue runs about twice as fast as the hashed double-ended queue, even when I increase the size of the hash table to an insanely large number. Why is that?

Quick Quiz 6.13: Is there a significantly better way of handling concurrency for double-ended queues?

These two examples show just how powerful partitioning can be in devising parallel algorithms. Section 6.3.5 looks briefly at a third example, matrix multiply. However, all three of these examples beg for more and better design criteria for parallel programs, a topic taken up in the next section.

6.2 Design Criteria

One pound of learning requires ten pounds of commonsense to apply it.

Persian proverb

One way to obtain the best performance and scalability is to simply hack away until you converge on the best possible parallel program. Unfortunately, if your program is other than microscopically tiny, the space of possible parallel programs is so huge that convergence is not guaranteed in the lifetime of the universe. Besides, what exactly is the "best possible parallel program"? After all, Section 2.2 called out no fewer than three parallel-programming goals of performance, productivity, and generality, and the best possible performance will likely come at a cost in terms of productivity and generality. We clearly need to be able to make higher-level choices at design time in order to arrive at an acceptably good parallel program before that program becomes obsolete.

However, more detailed design criteria are required to actually produce a real-world design, a task taken up in this section. This being the real world, these criteria often conflict to a greater or lesser degree, requiring that the designer carefully balance the resulting tradeoffs.

As such, these criteria may be thought of as the "forces" acting on the design, with particularly good tradeoffs between these forces being called "design patterns" [Ale79, GHJV95].

The design criteria for attaining the three parallel-programming goals are speedup, contention, overhead, read-to-write ratio, and complexity:

Speedup: As noted in Section 2.2, increased performance is the major reason to go to all of the time and trouble required to parallelize it. Speedup is defined to be the ratio of the time required to run a sequential version of the program to the time required to run a parallel version.

Contention: If more CPUs are applied to a parallel program than can be kept busy by that program, the excess CPUs are prevented from doing useful work by contention. This may be lock contention, memory contention, or a host of other performance killers.

Work-to-Synchronization Ratio: A uniprocessor, single-threaded, non-preemptible, and non-interruptible8 version of a given parallel program would not need any synchronization primitives. Therefore, any time consumed by these primitives (including communication cache misses as well as message latency, locking primitives, atomic instructions, and memory barriers) is overhead that does not contribute directly to the useful work that the program is intended to accomplish. Note that the important measure is the relationship between the synchronization overhead and the overhead of the code in the critical section, with larger critical sections able to tolerate greater synchronization overhead. The work-to-synchronization ratio is related to the notion of synchronization efficiency.

8 Either by masking interrupts or by being oblivious to them.

Read-to-Write Ratio: A data structure that is rarely updated may often be replicated rather than partitioned, and furthermore may be protected with asymmetric synchronization primitives that reduce readers' synchronization overhead at the expense of that of writers, thereby reducing overall synchronization overhead. Corresponding optimizations are possible for frequently updated data structures, as discussed in Chapter 5.

Complexity: A parallel program is more complex than an equivalent sequential program because the parallel program has a much larger state space than does the sequential program, although large state spaces having regular structures can in some cases be easily understood. A parallel programmer must consider
synchronization primitives, messaging, locking design, critical-section identification, and deadlock in the context of this larger state space.

This greater complexity often translates to higher development and maintenance costs. Therefore, budgetary constraints can limit the number and types of modifications made to an existing program, since a given degree of speedup is worth only so much time and trouble. Worse yet, added complexity can actually reduce performance and scalability.

Therefore, beyond a certain point, there may be potential sequential optimizations that are cheaper and more effective than parallelization. As noted in Section 2.2.1, parallelization is but one performance optimization of many, and is furthermore an optimization that applies most readily to CPU-based bottlenecks.

These criteria will act together to enforce a maximum speedup. The first three criteria are deeply interrelated, so the remainder of this section analyzes these interrelationships.9

9 A real-world parallel system will be subject to many additional design criteria, such as data-structure layout, memory size, memory-hierarchy latencies, bandwidth limitations, and I/O issues.

Note that these criteria may also appear as part of the requirements specification, and further that they are one solution to the problem of summarizing the quality of a concurrent algorithm from page 73. For example, speedup may act as a relative desideratum ("the faster, the better") or as an absolute requirement of the workload ("the system must support at least 1,000,000 web hits per second"). Classic design pattern languages describe relative desiderata as forces and absolute requirements as context.

An understanding of the relationships between these design criteria can be very helpful when identifying appropriate design tradeoffs for a parallel program.

1. The less time a program spends in exclusive-lock critical sections, the greater the potential speedup. This is a consequence of Amdahl's Law [Amd67] because only one CPU may execute within a given exclusive-lock critical section at a given time.

   More specifically, for unbounded linear scalability, the fraction of time that the program spends in a given exclusive critical section must decrease as the number of CPUs increases. For example, a program will not scale to 10 CPUs unless it spends much less than one tenth of its time in the most-restrictive exclusive-lock critical section.

2. Contention effects consume the excess CPU and/or wallclock time when the actual speedup is less than the number of available CPUs. The larger the gap between the number of CPUs and the actual speedup, the less efficiently the CPUs will be used. Similarly, the greater the desired efficiency, the smaller the achievable speedup.

3. If the available synchronization primitives have high overhead compared to the critical sections that they guard, the best way to improve speedup is to reduce the number of times that the primitives are invoked. This can be accomplished by batching critical sections, using data ownership (see Chapter 8), using asymmetric primitives (see Chapter 9), or by using a coarse-grained design such as code locking.

4. If the critical sections have high overhead compared to the primitives guarding them, the best way to improve speedup is to increase parallelism by moving to reader/writer locking, data locking, asymmetric, or data ownership.

5. If the critical sections have high overhead compared to the primitives guarding them and the data structure being guarded is read much more often than modified, the best way to increase parallelism is to move to reader/writer locking or asymmetric primitives.

6. Many changes that improve SMP performance, for example, reducing lock contention, also improve real-time latencies [McK05c].

Quick Quiz 6.14: Don't all these problems with critical sections mean that we should just always use non-blocking synchronization [Her90], which don't have critical sections?

It is worth reiterating that contention has many guises, including lock contention, memory contention, cache overflow, thermal throttling, and much else besides. This chapter looks primarily at lock and memory contention.
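As a rough back-of-the-envelope check on the first item in the list above (an illustration, not a result from this chapter), treat the critical-section fraction f as the serial fraction in Amdahl's Law, which bounds the speedup on n CPUs:

    S(n) = 1 / (f + (1 − f) / n)

With f = 0.1, S(10) ≈ 5.3 and S(∞) = 10: a program that spends a full tenth of its time in its most-restrictive exclusive-lock critical section cannot make good use of 10 CPUs, which is why that fraction must keep shrinking as CPUs are added.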
Figure 6.10: Design Patterns and Lock Granularity
Figure 6.12: Ethernet Bandwidth vs. Intel x86 CPU Performance

 1 spinlock_t hash_lock;
 2
 3 struct hash_table
 4 {
 5   long nbuckets;
 6   struct node **buckets;
 7 };
 8
 9 typedef struct node {
10   unsigned long key;
11   struct node *next;
12 } node_t;
13
14 int hash_search(struct hash_table *h, long key)
15 {
16   struct node *cur;
17   int retval;
18
19   spin_lock(&hash_lock);
20   cur = h->buckets[key % h->nbuckets];
21   while (cur != NULL) {
22     if (cur->key >= key) {
23       retval = (cur->key == key);
24       spin_unlock(&hash_lock);
25       return retval;
26     }
27     cur = cur->next;
28   }
29   spin_unlock(&hash_lock);
30   return 0;
31 }
Figure 6.15: Data Locking and Skew

can arise in SMP programs. For example, the Linux kernel maintains a cache of files and directories (called "dcache"). Each entry in this cache has its own lock, but the entries corresponding to the root directory and its direct descendants are much more likely to be traversed than are more obscure entries. This can result in many CPUs contending for the locks of these popular entries, resulting in a situation not unlike that shown in Figure 6.15.

In many cases, algorithms can be designed to reduce the instance of data skew, and in some cases eliminate it entirely (for example, in the Linux kernel's dcache [MSS04, Cor10a, Bro15a, Bro15b, Bro15c]). Data locking is often used for partitionable data structures

1. Any variables accessible by only one CPU or thread (such as auto variables in C and C++) are owned by that CPU or process.

2. An instance of a user interface owns the corresponding user's context. It is very common for applications interacting with parallel database engines to be written as if they were entirely sequential programs. Such applications own the user interface and his current action. Explicit parallelism is thus confined to the database engine itself.

3. Parametric simulations are often trivially parallelized by granting each thread ownership of a particular region of the parameter space. There are also computing frameworks designed for this type of problem [Uni08a].
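As a minimal, self-contained illustration of this style of per-thread ownership (not taken from the book's CodeSamples), each thread below owns a disjoint slice of an array plus its own partial-sum slot, so the workers need no synchronization at all; pthread_join() orders the final reads:

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N 1000000

static double data[N];
static double partial[NTHREADS];      /* Element t is owned by thread t. */

static void *worker(void *arg)
{
  long me = (long)arg;
  long lo = me * (N / NTHREADS);
  long hi = lo + N / NTHREADS;
  double sum = 0.0;

  for (long i = lo; i < hi; i++)      /* Each thread reads only its slice. */
    sum += data[i];
  partial[me] = sum;                  /* Sole writer of this element. */
  return NULL;
}

int main(void)
{
  pthread_t tid[NTHREADS];
  double total = 0.0;

  for (long i = 0; i < N; i++)
    data[i] = 1.0;
  for (long t = 0; t < NTHREADS; t++)
    pthread_create(&tid[t], NULL, worker, (void *)t);
  for (long t = 0; t < NTHREADS; t++) {
    pthread_join(tid[t], NULL);       /* Orders the read of partial[t]. */
    total += partial[t];
  }
  printf("total = %f\n", total);
  return 0;
}

In a production version, the partial[] elements would be padded or made per-thread variables to avoid false sharing, but the ownership structure would be unchanged.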
If there is significant sharing, communication between the threads or CPUs can result in significant complexity and overhead. Furthermore, if the most-heavily used data happens to be that owned by a single CPU, that CPU will be a "hot spot", sometimes with results resembling that shown in Figure 6.15. However, in situations where no sharing is required, data ownership achieves ideal performance, and with code that can be as simple as the sequential-program case shown in Listing 6.4. Such situations are often referred to as "embarrassingly parallel", and, in the best case, resemble the situation previously shown in Figure 6.14.

Another important instance of data ownership occurs when the data is read-only, in which case, all threads can "own" it via replication.

Where data locking partitions both the address space (with one hash bucket per partition) and time (using per-bucket locks), data ownership partitions only the address space. The reason that data ownership need not partition time is because a given thread or CPU is assigned permanent ownership of a given address-space partition.

Quick Quiz 6.18: But won't system boot and shutdown (or application startup and shutdown) be partitioning time, even for data ownership?

Data ownership will be presented in more detail in Chapter 8.

6.3.5 Locking Granularity and Performance

This section looks at locking granularity and performance from a mathematical synchronization-efficiency viewpoint. Readers who are uninspired by mathematics might choose to skip this section.

The approach is to use a crude queueing model for the efficiency of synchronization mechanisms that operate on a single shared global variable, based on an M/M/1 queue. M/M/1 queuing models are based on an exponentially distributed "inter-arrival rate" 𝜆 and an exponentially distributed "service rate" 𝜇. The inter-arrival rate 𝜆 can be thought of as the average number of synchronization operations per second that the system would process if the synchronization were free, in other words, 𝜆 is an inverse measure of the overhead of each non-synchronization unit of work. For example, if each unit of work was a transaction, and if each transaction took one millisecond to process, excluding synchronization overhead, then 𝜆 would be 1,000 transactions per second.

The service rate 𝜇 is defined similarly, but for the average number of synchronization operations per second that the system would process if the overhead of each transaction was zero, and ignoring the fact that CPUs must wait on each other to complete their synchronization operations, in other words, 𝜇 can be roughly thought of as the synchronization overhead in absence of contention. For example, suppose that each transaction's synchronization operation involves an atomic increment instruction, and that a computer system is able to do a private-variable atomic increment every 5 nanoseconds on each CPU (see Figure 5.1).13 The value of 𝜇 is therefore about 200,000,000 atomic increments per second.

Of course, the value of 𝜆 increases as increasing numbers of CPUs increment a shared variable because each CPU is capable of processing transactions independently (again, ignoring synchronization):

    𝜆 = 𝑛𝜆0    (6.1)

Here, 𝑛 is the number of CPUs and 𝜆0 is the transaction-processing capability of a single CPU. Note that the expected time for a single CPU to execute a single transaction in the absence of contention is 1/𝜆0.

Because the CPUs have to "wait in line" behind each other to get their chance to increment the single shared variable, we can use the M/M/1 queueing-model expression for the expected total waiting time:

    𝑇 = 1 / (𝜇 − 𝜆)    (6.2)

Substituting the above value of 𝜆:

    𝑇 = 1 / (𝜇 − 𝑛𝜆0)    (6.3)

Now, the efficiency is just the ratio of the time required to process a transaction in absence of synchronization (1/𝜆0) to the time required including synchronization (𝑇 + 1/𝜆0):

    𝑒 = (1/𝜆0) / (𝑇 + 1/𝜆0)    (6.4)

Substituting the above value for 𝑇 and simplifying:

    𝑒 = (𝜇/𝜆0 − 𝑛) / (𝜇/𝜆0 − (𝑛 − 1))    (6.5)

13 Of course, if there are 8 CPUs all incrementing the same shared variable, then each CPU must wait at least 35 nanoseconds for each of the other CPUs to do its increment before consuming an additional 5 nanoseconds doing its own increment. In fact, the wait will be longer due to the need to move the variable from one CPU to another.
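A quick numeric check of Eq. (6.5), not part of the book's CodeSamples: with 𝜇/𝜆0 = 10, that is, with per-transaction work only ten times the cost of its synchronization operation, efficiency collapses as 𝑛 approaches 𝜇/𝜆0.

#include <stdio.h>

int main(void)
{
  double f = 10.0;                  /* f stands in for the ratio mu / lambda0. */

  for (int n = 1; n <= 10; n++)     /* The model requires n <= mu / lambda0. */
    printf("n = %2d  e = %.3f\n", n, (f - n) / (f - (n - 1)));
  return 0;
}

The printed efficiencies fall from 0.900 at n = 1 to 0.500 at n = 9 and to zero at n = 10, illustrating why either a much larger ratio of work to synchronization overhead or a partitioned design is needed for larger numbers of CPUs.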
[Figure: Synchronization Efficiency]
Quick Quiz 6.20: How are data-parallel techniques going to help with matrix multiply? It is already data parallel!!!

Quick Quiz 6.21: What did you do to validate this matrix multiply algorithm?

6.4 Parallel Fastpath

There are two ways of meeting difficulties: You alter the difficulties, or you alter yourself to meet them.

Phyllis Bottome

Fine-grained (and therefore usually higher-performance) designs are typically more complex than are coarser-grained designs. In many cases, most of the overhead is incurred by a small fraction of the code [Knu73]. So why not focus effort on that small fraction?

This is the idea behind the parallel-fastpath design pattern, to aggressively parallelize the common-case code path without incurring the complexity that would be required to aggressively parallelize the entire algorithm. You must understand not only the specific algorithm you wish to parallelize, but also the workload that the algorithm will be subjected to. Great creativity and design effort is often required to construct a parallel fastpath.

Parallel fastpath combines different patterns (one for the fastpath, one elsewhere) and is therefore a template pattern. The following instances of parallel fastpath occur often enough to warrant their own patterns, as depicted in Figure 6.18:

6.4.1 Reader/Writer Locking

If synchronization overhead is negligible (for example, if the program uses coarse-grained parallelism with large critical sections), and if only a small fraction of the critical sections modify data, then allowing multiple readers to proceed in parallel can greatly increase scalability. Writers exclude both readers and each other. There are many implementations of reader-writer locking, including the POSIX implementation described in Section 4.2.4. Listing 6.7 shows how the hash search might be implemented using reader-writer locking.

Reader/writer locking is a simple instance of asymmetric locking. Snaman [ST87] describes a more ornate six-mode asymmetric locking design used in several clustered systems. Locking in general and reader-writer locking in particular is described extensively in Chapter 7.

6.4.2 Hierarchical Locking

The idea behind hierarchical locking is to have a coarse-grained lock that is held only long enough to work out which fine-grained lock to acquire. Listing 6.8 shows how our hash-table search might be adapted to do hierarchical locking, but also shows the great weakness of this approach: We have paid the overhead of acquiring a second lock, but we only hold it for a short time. In this case, the data-locking approach would be simpler and likely perform better.

Quick Quiz 6.22: In what situation would hierarchical locking work well?
Listing 6.7: Reader-Writer-Locking Hash Table Search
 1 rwlock_t hash_lock;
 2
 3 struct hash_table
 4 {
 5   long nbuckets;
 6   struct node **buckets;
 7 };
 8
 9 typedef struct node {
10   unsigned long key;
11   struct node *next;
12 } node_t;
13
14 int hash_search(struct hash_table *h, long key)
15 {
16   struct node *cur;
17   int retval;
18
19   read_lock(&hash_lock);
20   cur = h->buckets[key % h->nbuckets];
21   while (cur != NULL) {
22     if (cur->key >= key) {
23       retval = (cur->key == key);
24       read_unlock(&hash_lock);
25       return retval;
26     }
27     cur = cur->next;
28   }
29   read_unlock(&hash_lock);
30   return 0;
31 }

Listing 6.8: Hierarchical-Locking Hash Table Search
 1 struct hash_table
 2 {
 3   long nbuckets;
 4   struct bucket **buckets;
 5 };
 6
 7 struct bucket {
 8   spinlock_t bucket_lock;
 9   node_t *list_head;
10 };
11
12 typedef struct node {
13   spinlock_t node_lock;
14   unsigned long key;
15   struct node *next;
16 } node_t;
17
18 int hash_search(struct hash_table *h, long key)
19 {
20   struct bucket *bp;
21   struct node *cur;
22   int retval;
23
24   bp = h->buckets[key % h->nbuckets];
25   spin_lock(&bp->bucket_lock);
26   cur = bp->list_head;
27   while (cur != NULL) {
28     if (cur->key >= key) {
29       spin_lock(&cur->node_lock);
30       spin_unlock(&bp->bucket_lock);
31       retval = (cur->key == key);
32       spin_unlock(&cur->node_lock);
33       return retval;
34     }
35     cur = cur->next;
36   }
37   spin_unlock(&bp->bucket_lock);
38   return 0;
39 }

6.4.3 Resource Allocator Caches

This section presents a simplified schematic of a parallel fixed-block-size memory allocator. More detailed descriptions may be found in the literature [MG92, MS93, BA01, MSK01, Eva11, Ken20] or in the Linux kernel [Tor03].

6.4.3.1 Parallel Resource Allocation Problem

The basic problem facing a parallel memory allocator is the tension between the need to provide extremely fast memory allocation and freeing in the common case and the need to efficiently distribute memory in the face of unfavorable allocation and freeing patterns.

To see this tension, consider a straightforward application of data ownership to this problem—simply carve up memory so that each CPU owns its share. For example, suppose that a system with 12 CPUs has 64 gigabytes of memory, for example, the laptop I am using right now. We could simply assign each CPU a five-gigabyte region of memory, and allow each CPU to allocate from its own region, without the need for locking and its complexities and overheads. Unfortunately, this scheme fails when CPU 0 only allocates memory and CPU 1 only frees it, as happens in simple producer-consumer workloads.

The other extreme, code locking, suffers from excessive lock contention and overhead [MS93].

6.4.3.2 Parallel Fastpath for Resource Allocation

The commonly used solution uses parallel fastpath with each CPU owning a modest cache of blocks, and with a large code-locked shared pool for additional blocks. To prevent any given CPU from monopolizing the memory blocks, we place a limit on the number of blocks that can be in each CPU's cache. In a two-CPU system, the flow of memory blocks will be as shown in Figure 6.19: When a given CPU is trying to free a block when its pool is full, it sends blocks to the global pool, and, similarly, when that CPU is trying to allocate a block when its pool is empty, it retrieves blocks from the global pool.

6.4.3.3 Data Structures

The actual data structures for a "toy" implementation of allocator caches are shown in Listing 6.9 ("smpalloc.c"). The "Global Pool" of Figure 6.19 is implemented by globalmem of type struct globalmempool, and the two CPU pools by the per-thread variable perthreadmem
of type struct perthreadmempool. Both of these data structures have arrays of pointers to blocks in their pool fields, which are filled from index zero upwards. Thus, if globalmem.pool[3] is NULL, then the remainder of the array from index 4 up must also be NULL. The cur fields contain the index of the highest-numbered full element of the pool array, or −1 if all elements are empty. All elements from globalmem.pool[0] through globalmem.pool[globalmem.cur] must be full, and all the rest must be empty.15

15 Both pool sizes (TARGET_POOL_SIZE and GLOBAL_POOL_SIZE) are unrealistically small, but this small size makes it easier to single-step the program in order to get a feel for its operation.

Listing 6.9: Allocator-Cache Data Structures
 1 #define TARGET_POOL_SIZE 3
 2 #define GLOBAL_POOL_SIZE 40
 3
 4 struct globalmempool {
 5   spinlock_t mutex;
 6   int cur;
 7   struct memblock *pool[GLOBAL_POOL_SIZE];
 8 } globalmem;
 9
10 struct perthreadmempool {
11   int cur;
12   struct memblock *pool[2 * TARGET_POOL_SIZE];
13 };
14
15 DEFINE_PER_THREAD(struct perthreadmempool, perthreadmem);

Figure 6.19: Allocator Cache Schematic

Figure 6.20: Allocator Pool Schematic

The operation of the pool data structures is illustrated by Figure 6.20, with the six boxes representing the array of pointers making up the pool field, and the number preceding them representing the cur field. The shaded boxes represent non-NULL pointers, while the empty boxes represent NULL pointers. An important, though potentially confusing, invariant of this data structure is that the cur field is always one smaller than the number of non-NULL pointers.

6.4.3.4 Allocation Function

The allocation function memblock_alloc() may be seen in Listing 6.10. Line 7 picks up the current thread's per-thread pool, and line 8 checks to see if it is empty. If so, lines 9–16 attempt to refill it from the global pool under the spinlock acquired on line 9 and released on line 16. Lines 10–14 move blocks from the global to the per-thread pool until either the local pool reaches its target size (half full) or the global pool is exhausted, and line 15 sets the per-thread pool's count to the proper value.

In either case, line 18 checks for the per-thread pool still being empty, and if not, lines 19–21 remove a block and return it. Otherwise, line 23 tells the sad tale of memory exhaustion.

6.4.3.5 Free Function

Listing 6.11 shows the memory-block free function. Line 6 gets a pointer to this thread's pool, and line 7 checks to see if this per-thread pool is full.

If so, lines 8–15 empty half of the per-thread pool into the global pool, with lines 8 and 14 acquiring and releasing the spinlock. Lines 9–12 implement the loop moving blocks from the local to the global pool, and line 13 sets the per-thread pool's count to the proper value. In either case, line 16 then places the newly freed block into the per-thread pool.
Listing 6.10: Allocator-Cache Allocator Function
 1 struct memblock *memblock_alloc(void)
 2 {
 3   int i;
 4   struct memblock *p;
 5   struct perthreadmempool *pcpp;
 6
 7   pcpp = &__get_thread_var(perthreadmem);
 8   if (pcpp->cur < 0) {
 9     spin_lock(&globalmem.mutex);
10     for (i = 0; i < TARGET_POOL_SIZE &&
11          globalmem.cur >= 0; i++) {
12       pcpp->pool[i] = globalmem.pool[globalmem.cur];
13       globalmem.pool[globalmem.cur--] = NULL;
14     }
15     pcpp->cur = i - 1;
16     spin_unlock(&globalmem.mutex);
17   }
18   if (pcpp->cur >= 0) {
19     p = pcpp->pool[pcpp->cur];
20     pcpp->pool[pcpp->cur--] = NULL;
21     return p;
22   }
23   return NULL;
24 }

Listing 6.11: Allocator-Cache Free Function
 1 void memblock_free(struct memblock *p)
 2 {
 3   int i;
 4   struct perthreadmempool *pcpp;
 5
 6   pcpp = &__get_thread_var(perthreadmem);
 7   if (pcpp->cur >= 2 * TARGET_POOL_SIZE - 1) {
 8     spin_lock(&globalmem.mutex);
 9     for (i = pcpp->cur; i >= TARGET_POOL_SIZE; i--) {
10       globalmem.pool[++globalmem.cur] = pcpp->pool[i];
11       pcpp->pool[i] = NULL;
12     }
13     pcpp->cur = i;
14     spin_unlock(&globalmem.mutex);
15   }
16   pcpp->pool[++pcpp->cur] = p;
17 }

Figure 6.21: Allocator Cache Performance

Quick Quiz 6.23: Doesn't this resource-allocator design resemble that of the approximate limit counters covered in Section 5.3?

6.4.3.6 Performance

Rough performance results16 are shown in Figure 6.21, running on a dual-core Intel x86 running at 1 GHz (4300 bogomips per CPU) with at most six blocks allowed in each CPU's cache. In this micro-benchmark, each thread repeatedly allocates a group of blocks and then frees all the blocks in that group, with the number of blocks in the group being the "allocation run length" displayed on the x-axis. The y-axis shows the number of successful allocation/free pairs per microsecond—failed allocations are not counted. The "X"s are from a two-thread run, while the "+"s are from a single-threaded run.

Note that run lengths up to six scale linearly and give excellent performance, while run lengths greater than six show poor performance and almost always also show negative scaling. It is therefore quite important to size TARGET_POOL_SIZE sufficiently large, which fortunately is usually quite easy to do in actual practice [MSK01], especially given today's large memories. For example, in most systems, it is quite reasonable to set TARGET_POOL_SIZE to 100, in which case allocations and frees
are guaranteed to be confined to per-thread pools at least 99 % of the time.

As can be seen from the figure, the situations where the common-case data-ownership applies (run lengths up to six) provide greatly improved performance compared to the cases where locks must be acquired. Avoiding synchronization in the common case will be a recurring theme through this book.

Quick Quiz 6.24: In Figure 6.21, there is a pattern of performance rising with increasing run length in groups of three samples, for example, for run lengths 10, 11, and 12. Why?

Table 6.1: Schematic of Real-World Parallel Allocator

Level               Locking          Purpose
Per-thread pool     Data ownership   High-speed allocation
Global block pool   Data locking     Distributing blocks among threads
Coalescing          Data locking     Combining blocks into pages
System memory       Code locking     Memory from/to system
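Returning to the toy allocator, the run-length behavior discussed in Section 6.4.3.6 can be reproduced with a loop along the following lines (an illustrative harness, not the book's actual benchmark):

/* Illustrative allocation-run-length loop.  Run lengths no larger than
 * the per-thread pool capacity (2 * TARGET_POOL_SIZE) tend to be
 * satisfied entirely by the per-thread fastpath in steady state. */
void alloc_free_runs(int runlength, long npasses)
{
  struct memblock *blocks[runlength];

  for (long pass = 0; pass < npasses; pass++) {
    int n = 0;

    while (n < runlength &&
           (blocks[n] = memblock_alloc()) != NULL)
      n++;                          /* Allocate one run of blocks. */
    while (n > 0)
      memblock_free(blocks[--n]);   /* Then free the entire run. */
  }
}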