
arXiv:1701.00854v6 [cs.DC] 12 Jun 2023

Is Parallel Programming Hard, And, If So, What Can You Do About It?

Edited by:

Paul E. McKenney
Facebook
[email protected]

June 11, 2023


Release v2023.06.11a


Legal Statement
This work represents the views of the editor and the authors and does not necessarily
represent the view of their respective employers.

Trademarks:
• IBM, z Systems, and PowerPC are trademarks or registered trademarks of Inter-
national Business Machines Corporation in the United States, other countries, or
both.
• Linux is a registered trademark of Linus Torvalds.
• Intel, Itanium, Intel Core, and Intel Xeon are trademarks of Intel Corporation or
its subsidiaries in the United States, other countries, or both.
• Arm is a registered trademark of Arm Limited (or its subsidiaries) in the US and/or
elsewhere.
• SPARC is a registered trademark of SPARC International, Inc. Products bearing
SPARC trademarks are based on an architecture developed by Sun Microsystems,
Inc.
• Other company, product, and service names may be trademarks or service marks
of such companies.
The non-source-code text and images in this document are provided under the terms
of the Creative Commons Attribution-Share Alike 3.0 United States license.1 In brief,
you may use the contents of this document for any purpose, personal, commercial, or
otherwise, so long as attribution to the authors is maintained. Likewise, the document
may be modified, and derivative works and translations made available, so long as
such modifications and derivations are offered to the public on equal terms as the
non-source-code text and images in the original document.
Source code is covered by various versions of the GPL.2 Some of this code is
GPLv2-only, as it derives from the Linux kernel, while other code is GPLv2-or-later. See
the comment headers of the individual source files within the CodeSamples directory in
the git archive3 for the exact licenses. If you are unsure of the license for a given code
fragment, you should assume GPLv2-only.
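
For example, assuming a local clone of the git archive, the per-file license headers can be skimmed with standard shell tools (a minimal sketch; adjust the file pattern to taste):

# Show the top of each C source file, where the license comment appears:
find CodeSamples -name '*.c' -print0 | xargs -0 head -n 20 | less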
Combined work © 2005–2023 by Paul E. McKenney. Each individual contribution
is copyright by its contributor at the time of contribution, as recorded in the git archive.

1 https://creativecommons.org/licenses/by-sa/3.0/us/
2 https://www.gnu.org/licenses/gpl-2.0.html
3 git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git

Contents

1 How To Use This Book 1


1.1 Roadmap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Quick Quizzes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Alternatives to This Book . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Sample Source Code . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.5 Whose Book Is This? . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2 Introduction 7
2.1 Historic Parallel Programming Difficulties . . . . . . . . . . . . . . . 7
2.2 Parallel Programming Goals . . . . . . . . . . . . . . . . . . . . . . 8
2.2.1 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.2 Productivity . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.3 Generality . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 Alternatives to Parallel Programming . . . . . . . . . . . . . . . . . . 12
2.3.1 Multiple Instances of a Sequential Application . . . . . . . . 12
2.3.2 Use Existing Parallel Software . . . . . . . . . . . . . . . . . 12
2.3.3 Performance Optimization . . . . . . . . . . . . . . . . . . . 13
2.4 What Makes Parallel Programming Hard? . . . . . . . . . . . . . . . 13
2.4.1 Work Partitioning . . . . . . . . . . . . . . . . . . . . . . . 14
2.4.2 Parallel Access Control . . . . . . . . . . . . . . . . . . . . 14
2.4.3 Resource Partitioning and Replication . . . . . . . . . . . . . 15
2.4.4 Interacting With Hardware . . . . . . . . . . . . . . . . . . . 15
2.4.5 Composite Capabilities . . . . . . . . . . . . . . . . . . . . 15
2.4.6 How Do Languages and Environments Assist With These Tasks? 16
2.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

3 Hardware and its Habits 17


3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.1.1 Pipelined CPUs . . . . . . . . . . . . . . . . . . . . . . . . 17
3.1.2 Memory References . . . . . . . . . . . . . . . . . . . . . . 19
3.1.3 Atomic Operations . . . . . . . . . . . . . . . . . . . . . . . 19
3.1.4 Memory Barriers . . . . . . . . . . . . . . . . . . . . . . . . 20
3.1.5 Thermal Throttling . . . . . . . . . . . . . . . . . . . . . . . 20
3.1.6 Cache Misses . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1.7 I/O Operations . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2 Overheads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.1 Hardware System Architecture . . . . . . . . . . . . . . . . 22
3.2.2 Costs of Operations . . . . . . . . . . . . . . . . . . . . . . 23


3.2.3 Hardware Optimizations . . . . . . . . . . . . . . . . . . . . 24


3.3 Hardware Free Lunch? . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.3.1 3D Integration . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3.2 Novel Materials and Processes . . . . . . . . . . . . . . . . . 26
3.3.3 Light, Not Electrons . . . . . . . . . . . . . . . . . . . . . . 27
3.3.4 Special-Purpose Accelerators . . . . . . . . . . . . . . . . . 27
3.3.5 Existing Parallel Software . . . . . . . . . . . . . . . . . . . 27
3.4 Software Design Implications . . . . . . . . . . . . . . . . . . . . . . 28

4 Tools of the Trade 29


4.1 Scripting Languages . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2 POSIX Multiprocessing . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.2.1 POSIX Process Creation and Destruction . . . . . . . . . . . 30
4.2.2 POSIX Thread Creation and Destruction . . . . . . . . . . . 31
4.2.3 POSIX Locking . . . . . . . . . . . . . . . . . . . . . . . . 32
4.2.4 POSIX Reader-Writer Locking . . . . . . . . . . . . . . . . 34
4.2.5 Atomic Operations (GCC Classic) . . . . . . . . . . . . . . . 36
4.2.6 Atomic Operations (C11) . . . . . . . . . . . . . . . . . . . 36
4.2.7 Atomic Operations (Modern GCC) . . . . . . . . . . . . . . 37
4.2.8 Per-Thread Variables . . . . . . . . . . . . . . . . . . . . . . 37
4.3 Alternatives to POSIX Operations . . . . . . . . . . . . . . . . . . . 37
4.3.1 Organization and Initialization . . . . . . . . . . . . . . . . . 37
4.3.2 Thread Creation, Destruction, and Control . . . . . . . . . . 38
4.3.3 Locking . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.3.4 Accessing Shared Variables . . . . . . . . . . . . . . . . . . 40
4.3.5 Atomic Operations . . . . . . . . . . . . . . . . . . . . . . . 46
4.3.6 Per-CPU Variables . . . . . . . . . . . . . . . . . . . . . . . 46
4.4 The Right Tool for the Job: How to Choose? . . . . . . . . . . . . . . 47

5 Counting 49
5.1 Why Isn’t Concurrent Counting Trivial? . . . . . . . . . . . . . . . . 49
5.2 Statistical Counters . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.2.1 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.2.2 Array-Based Implementation . . . . . . . . . . . . . . . . . 51
5.2.3 Per-Thread-Variable-Based Implementation . . . . . . . . . . 52
5.2.4 Eventually Consistent Implementation . . . . . . . . . . . . 54
5.2.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.3 Approximate Limit Counters . . . . . . . . . . . . . . . . . . . . . . 55
5.3.1 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.3.2 Simple Limit Counter Implementation . . . . . . . . . . . . 56
5.3.3 Simple Limit Counter Discussion . . . . . . . . . . . . . . . 59
5.3.4 Approximate Limit Counter Implementation . . . . . . . . . 60
5.3.5 Approximate Limit Counter Discussion . . . . . . . . . . . . 61
5.4 Exact Limit Counters . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.4.1 Atomic Limit Counter Implementation . . . . . . . . . . . . 61
5.4.2 Atomic Limit Counter Discussion . . . . . . . . . . . . . . . 64
5.4.3 Signal-Theft Limit Counter Design . . . . . . . . . . . . . . 64
5.4.4 Signal-Theft Limit Counter Implementation . . . . . . . . . . 65
5.4.5 Signal-Theft Limit Counter Discussion . . . . . . . . . . . . 67
5.4.6 Applying Exact Limit Counters . . . . . . . . . . . . . . . . 68


5.5 Parallel Counting Discussion . . . . . . . . . . . . . . . . . . . . . . 68


5.5.1 Parallel Counting Validation . . . . . . . . . . . . . . . . . . 69
5.5.2 Parallel Counting Performance . . . . . . . . . . . . . . . . 69
5.5.3 Parallel Counting Specializations . . . . . . . . . . . . . . . 70
5.5.4 Parallel Counting Lessons . . . . . . . . . . . . . . . . . . . 70

6 Partitioning and Synchronization Design 73


6.1 Partitioning Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . 73
6.1.1 Dining Philosophers Problem . . . . . . . . . . . . . . . . . 73
6.1.2 Double-Ended Queue . . . . . . . . . . . . . . . . . . . . . 74
6.1.3 Partitioning Example Discussion . . . . . . . . . . . . . . . 80
6.2 Design Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.3 Synchronization Granularity . . . . . . . . . . . . . . . . . . . . . . 83
6.3.1 Sequential Program . . . . . . . . . . . . . . . . . . . . . . 83
6.3.2 Code Locking . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.3.3 Data Locking . . . . . . . . . . . . . . . . . . . . . . . . . . 85
6.3.4 Data Ownership . . . . . . . . . . . . . . . . . . . . . . . . 86
6.3.5 Locking Granularity and Performance . . . . . . . . . . . . . 87
6.4 Parallel Fastpath . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.4.1 Reader/Writer Locking . . . . . . . . . . . . . . . . . . . . . 89
6.4.2 Hierarchical Locking . . . . . . . . . . . . . . . . . . . . . . 89
6.4.3 Resource Allocator Caches . . . . . . . . . . . . . . . . . . 90
6.5 Beyond Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.5.1 Work-Queue Parallel Maze Solver . . . . . . . . . . . . . . . 94
6.5.2 Alternative Parallel Maze Solver . . . . . . . . . . . . . . . . 95
6.5.3 Maze Validation . . . . . . . . . . . . . . . . . . . . . . . . 96
6.5.4 Performance Comparison I . . . . . . . . . . . . . . . . . . 97
6.5.5 Alternative Sequential Maze Solver . . . . . . . . . . . . . . 98
6.5.6 Performance Comparison II . . . . . . . . . . . . . . . . . . 99
6.5.7 Future Directions and Conclusions . . . . . . . . . . . . . . 99
6.6 Partitioning, Parallelism, and Optimization . . . . . . . . . . . . . . . 100

7 Locking 101
7.1 Staying Alive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
7.1.1 Deadlock . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
7.1.2 Livelock and Starvation . . . . . . . . . . . . . . . . . . . . 109
7.1.3 Unfairness . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
7.1.4 Inefficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
7.2 Types of Locks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
7.2.1 Exclusive Locks . . . . . . . . . . . . . . . . . . . . . . . . 111
7.2.2 Reader-Writer Locks . . . . . . . . . . . . . . . . . . . . . . 111
7.2.3 Beyond Reader-Writer Locks . . . . . . . . . . . . . . . . . 112
7.2.4 Scoped Locking . . . . . . . . . . . . . . . . . . . . . . . . 113
7.3 Locking Implementation Issues . . . . . . . . . . . . . . . . . . . . . 115
7.3.1 Sample Exclusive-Locking Implementation Based on Atomic
Exchange . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
7.3.2 Other Exclusive-Locking Implementations . . . . . . . . . . 115
7.4 Lock-Based Existence Guarantees . . . . . . . . . . . . . . . . . . . 117
7.5 Locking: Hero or Villain? . . . . . . . . . . . . . . . . . . . . . . . 118
7.5.1 Locking For Applications: Hero! . . . . . . . . . . . . . . . 119


7.5.2 Locking For Parallel Libraries: Just Another Tool . . . . . . 119


7.5.3 Locking For Parallelizing Sequential Libraries: Villain! . . . 121
7.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123

8 Data Ownership 125


8.1 Multiple Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
8.2 Partial Data Ownership and pthreads . . . . . . . . . . . . . . . . . . 126
8.3 Function Shipping . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
8.4 Designated Thread . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
8.5 Privatization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
8.6 Other Uses of Data Ownership . . . . . . . . . . . . . . . . . . . . . 127

9 Deferred Processing 129


9.1 Running Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
9.2 Reference Counting . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
9.3 Hazard Pointers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
9.4 Sequence Locks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
9.5 Read-Copy Update (RCU) . . . . . . . . . . . . . . . . . . . . . . . 139
9.5.1 Introduction to RCU . . . . . . . . . . . . . . . . . . . . . . 140
9.5.2 RCU Fundamentals . . . . . . . . . . . . . . . . . . . . . . 146
9.5.3 RCU Linux-Kernel API . . . . . . . . . . . . . . . . . . . . 152
9.5.4 RCU Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
9.5.5 RCU Related Work . . . . . . . . . . . . . . . . . . . . . . . 179
9.6 Which to Choose? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
9.6.1 Which to Choose? (Overview) . . . . . . . . . . . . . . . . . 182
9.6.2 Which to Choose? (Details) . . . . . . . . . . . . . . . . . . 183
9.6.3 Which to Choose? (Production Use) . . . . . . . . . . . . . . 185
9.7 What About Updates? . . . . . . . . . . . . . . . . . . . . . . . . . . 186

10 Data Structures 187


10.1 Motivating Application . . . . . . . . . . . . . . . . . . . . . . . . . 187
10.2 Partitionable Data Structures . . . . . . . . . . . . . . . . . . . . . . 188
10.2.1 Hash-Table Design . . . . . . . . . . . . . . . . . . . . . . . 188
10.2.2 Hash-Table Implementation . . . . . . . . . . . . . . . . . . 188
10.2.3 Hash-Table Performance . . . . . . . . . . . . . . . . . . . . 190
10.3 Read-Mostly Data Structures . . . . . . . . . . . . . . . . . . . . . . 191
10.3.1 RCU-Protected Hash Table Implementation . . . . . . . . . . 191
10.3.2 RCU-Protected Hash Table Validation . . . . . . . . . . . . . 192
10.3.3 RCU-Protected Hash Table Performance . . . . . . . . . . . 192
10.3.4 RCU-Protected Hash Table Discussion . . . . . . . . . . . . 195
10.4 Non-Partitionable Data Structures . . . . . . . . . . . . . . . . . . . 196
10.4.1 Resizable Hash Table Design . . . . . . . . . . . . . . . . . 196
10.4.2 Resizable Hash Table Implementation . . . . . . . . . . . . . 197
10.4.3 Resizable Hash Table Discussion . . . . . . . . . . . . . . . 202
10.4.4 Other Resizable Hash Tables . . . . . . . . . . . . . . . . . . 203
10.5 Other Data Structures . . . . . . . . . . . . . . . . . . . . . . . . . . 205
10.6 Micro-Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
10.6.1 Specialization . . . . . . . . . . . . . . . . . . . . . . . . . 205
10.6.2 Bits and Bytes . . . . . . . . . . . . . . . . . . . . . . . . . 206
10.6.3 Hardware Considerations . . . . . . . . . . . . . . . . . . . 206


10.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207

11 Validation 209
11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
11.1.1 Where Do Bugs Come From? . . . . . . . . . . . . . . . . . 209
11.1.2 Required Mindset . . . . . . . . . . . . . . . . . . . . . . . 210
11.1.3 When Should Validation Start? . . . . . . . . . . . . . . . . 212
11.1.4 The Open Source Way . . . . . . . . . . . . . . . . . . . . . 213
11.2 Tracing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
11.3 Assertions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
11.4 Static Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
11.5 Code Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
11.5.1 Inspection . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
11.5.2 Walkthroughs . . . . . . . . . . . . . . . . . . . . . . . . . . 216
11.5.3 Self-Inspection . . . . . . . . . . . . . . . . . . . . . . . . . 216
11.6 Probability and Heisenbugs . . . . . . . . . . . . . . . . . . . . . . . 217
11.6.1 Statistics for Discrete Testing . . . . . . . . . . . . . . . . . 218
11.6.2 Statistics Abuse for Discrete Testing . . . . . . . . . . . . . . 219
11.6.3 Statistics for Continuous Testing . . . . . . . . . . . . . . . . 219
11.6.4 Hunting Heisenbugs . . . . . . . . . . . . . . . . . . . . . . 220
11.7 Performance Estimation . . . . . . . . . . . . . . . . . . . . . . . . . 224
11.7.1 Benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . 224
11.7.2 Profiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
11.7.3 Differential Profiling . . . . . . . . . . . . . . . . . . . . . . 225
11.7.4 Microbenchmarking . . . . . . . . . . . . . . . . . . . . . . 225
11.7.5 Isolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
11.7.6 Detecting Interference . . . . . . . . . . . . . . . . . . . . . 227
11.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229

12 Formal Verification 231


12.1 State-Space Search . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
12.1.1 Promela and Spin . . . . . . . . . . . . . . . . . . . . . . . 231
12.1.2 How to Use Promela . . . . . . . . . . . . . . . . . . . . . . 233
12.1.3 Promela Example: Locking . . . . . . . . . . . . . . . . . . 236
12.1.4 Promela Example: QRCU . . . . . . . . . . . . . . . . . . . 238
12.1.5 Promela Parable: dynticks and Preemptible RCU . . . . . . . 243
12.1.6 Validating Preemptible RCU and dynticks . . . . . . . . . . . 246
12.2 Special-Purpose State-Space Search . . . . . . . . . . . . . . . . . . 259
12.2.1 Anatomy of a Litmus Test . . . . . . . . . . . . . . . . . . . 259
12.2.2 What Does This Litmus Test Mean? . . . . . . . . . . . . . . 260
12.2.3 Running a Litmus Test . . . . . . . . . . . . . . . . . . . . . 260
12.2.4 PPCMEM Discussion . . . . . . . . . . . . . . . . . . . . . 261
12.3 Axiomatic Approaches . . . . . . . . . . . . . . . . . . . . . . . . . 262
12.3.1 Axiomatic Approaches and Locking . . . . . . . . . . . . . . 263
12.3.2 Axiomatic Approaches and RCU . . . . . . . . . . . . . . . 264
12.4 SAT Solvers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
12.5 Stateless Model Checkers . . . . . . . . . . . . . . . . . . . . . . . . 266
12.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
12.7 Choosing a Validation Plan . . . . . . . . . . . . . . . . . . . . . . . 268


13 Putting It All Together 271


13.1 Counter Conundrums . . . . . . . . . . . . . . . . . . . . . . . . . . 271
13.1.1 Counting Updates . . . . . . . . . . . . . . . . . . . . . . . 271
13.1.2 Counting Lookups . . . . . . . . . . . . . . . . . . . . . . . 271
13.2 Refurbish Reference Counting . . . . . . . . . . . . . . . . . . . . . 272
13.2.1 Implementation of Reference-Counting Categories . . . . . . 273
13.2.2 Counter Optimizations . . . . . . . . . . . . . . . . . . . . . 276
13.3 Hazard-Pointer Helpers . . . . . . . . . . . . . . . . . . . . . . . . . 276
13.3.1 Scalable Reference Count . . . . . . . . . . . . . . . . . . . 276
13.3.2 Long-Duration Accesses . . . . . . . . . . . . . . . . . . . . 276
13.4 Sequence-Locking Specials . . . . . . . . . . . . . . . . . . . . . . . 276
13.4.1 Dueling Sequence Locks . . . . . . . . . . . . . . . . . . . . 276
13.4.2 Correlated Data Elements . . . . . . . . . . . . . . . . . . . 277
13.4.3 Atomic Move . . . . . . . . . . . . . . . . . . . . . . . . . . 278
13.4.4 Upgrade to Writer . . . . . . . . . . . . . . . . . . . . . . . 278
13.5 RCU Rescues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
13.5.1 RCU and Per-Thread-Variable-Based Statistical Counters . . . 279
13.5.2 RCU and Counters for Removable I/O Devices . . . . . . . . 280
13.5.3 Array and Length . . . . . . . . . . . . . . . . . . . . . . . 281
13.5.4 Correlated Fields . . . . . . . . . . . . . . . . . . . . . . . . 282
13.5.5 Update-Friendly Traversal . . . . . . . . . . . . . . . . . . . 282
13.5.6 Scalable Reference Count Two . . . . . . . . . . . . . . . . . 282
13.5.7 Retriggered Grace Periods . . . . . . . . . . . . . . . . . . . 283
13.5.8 Long-Duration Accesses Two . . . . . . . . . . . . . . . . . 284

14 Advanced Synchronization 287


14.1 Avoiding Locks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287
14.2 Non-Blocking Synchronization . . . . . . . . . . . . . . . . . . . . . 287
14.2.1 Simple NBS . . . . . . . . . . . . . . . . . . . . . . . . . . 288
14.2.2 Applicability of NBS Benefits . . . . . . . . . . . . . . . . . 290
14.2.3 NBS Discussion . . . . . . . . . . . . . . . . . . . . . . . . 293
14.3 Parallel Real-Time Computing . . . . . . . . . . . . . . . . . . . . . 294
14.3.1 What is Real-Time Computing? . . . . . . . . . . . . . . . . 294
14.3.2 Who Needs Real-Time? . . . . . . . . . . . . . . . . . . . . 298
14.3.3 Who Needs Parallel Real-Time? . . . . . . . . . . . . . . . . 298
14.3.4 Implementing Parallel Real-Time Systems . . . . . . . . . . 299
14.3.5 Implementing Parallel Real-Time Operating Systems . . . . . 300
14.3.6 Implementing Parallel Real-Time Applications . . . . . . . . 310
14.3.7 Real Time vs. Real Fast: How to Choose? . . . . . . . . . . . 313

15 Advanced Synchronization: Memory Ordering 315


15.1 Ordering: Why and How? . . . . . . . . . . . . . . . . . . . . . . . 315
15.1.1 Why Hardware Misordering? . . . . . . . . . . . . . . . . . 316
15.1.2 How to Force Ordering? . . . . . . . . . . . . . . . . . . . . 318
15.1.3 Basic Rules of Thumb . . . . . . . . . . . . . . . . . . . . . 321
15.2 Tricks and Traps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
15.2.1 Variables With Multiple Values . . . . . . . . . . . . . . . . 322
15.2.2 Memory-Reference Reordering . . . . . . . . . . . . . . . . 325
15.2.3 Address Dependencies . . . . . . . . . . . . . . . . . . . . . 326
15.2.4 Data Dependencies . . . . . . . . . . . . . . . . . . . . . . . 328


15.2.5 Control Dependencies . . . . . . . . . . . . . . . . . . . . . 328


15.2.6 Cache Coherence . . . . . . . . . . . . . . . . . . . . . . . . 329
15.2.7 Multicopy Atomicity . . . . . . . . . . . . . . . . . . . . . . 330
15.2.8 A Counter-Intuitive Case Study . . . . . . . . . . . . . . . . 337
15.3 Compile-Time Consternation . . . . . . . . . . . . . . . . . . . . . . 340
15.3.1 Memory-Reference Restrictions . . . . . . . . . . . . . . . . 342
15.3.2 Address- and Data-Dependency Difficulties . . . . . . . . . . 342
15.3.3 Control-Dependency Calamities . . . . . . . . . . . . . . . . 345
15.4 Higher-Level Primitives . . . . . . . . . . . . . . . . . . . . . . . . . 348
15.4.1 Memory Allocation . . . . . . . . . . . . . . . . . . . . . . 348
15.4.2 Locking . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348
15.4.3 RCU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352
15.4.4 Higher-Level Primitives: Discussion . . . . . . . . . . . . . 358
15.5 Hardware Specifics . . . . . . . . . . . . . . . . . . . . . . . . . . . 358
15.5.1 Alpha . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 361
15.5.2 Armv7-A/R . . . . . . . . . . . . . . . . . . . . . . . . . . . 362
15.5.3 Armv8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363
15.5.4 Itanium . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363
15.5.5 MIPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364
15.5.6 POWER / PowerPC . . . . . . . . . . . . . . . . . . . . . . 365
15.5.7 SPARC TSO . . . . . . . . . . . . . . . . . . . . . . . . . . 366
15.5.8 x86 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366
15.5.9 z Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 367
15.5.10 Hardware Specifics: Discussion . . . . . . . . . . . . . . . . 367
15.6 Memory-Model Intuitions . . . . . . . . . . . . . . . . . . . . . . . 367
15.6.1 Transitive Intuitions . . . . . . . . . . . . . . . . . . . . . . 367
15.6.2 Rules of Thumb . . . . . . . . . . . . . . . . . . . . . . . . 372

16 Ease of Use 375


16.1 What is Easy? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 375
16.2 Rusty Scale for API Design . . . . . . . . . . . . . . . . . . . . . . . 375
16.3 Shaving the Mandelbrot Set . . . . . . . . . . . . . . . . . . . . . . . 376

17 Conflicting Visions of the Future 379


17.1 The Future of CPU Technology Ain’t What it Used to Be . . . . . . . 379
17.1.1 Uniprocessor Über Alles . . . . . . . . . . . . . . . . . . . . 379
17.1.2 Multithreaded Mania . . . . . . . . . . . . . . . . . . . . . . 381
17.1.3 More of the Same . . . . . . . . . . . . . . . . . . . . . . . 381
17.1.4 Crash Dummies Slamming into the Memory Wall . . . . . . 382
17.1.5 Astounding Accelerators . . . . . . . . . . . . . . . . . . . . 383
17.2 Transactional Memory . . . . . . . . . . . . . . . . . . . . . . . . . 383
17.2.1 Outside World . . . . . . . . . . . . . . . . . . . . . . . . . 383
17.2.2 Process Modification . . . . . . . . . . . . . . . . . . . . . . 386
17.2.3 Synchronization . . . . . . . . . . . . . . . . . . . . . . . . 390
17.2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 393
17.3 Hardware Transactional Memory . . . . . . . . . . . . . . . . . . . . 395
17.3.1 HTM Benefits WRT Locking . . . . . . . . . . . . . . . . . 395
17.3.2 HTM Weaknesses WRT Locking . . . . . . . . . . . . . . . 396
17.3.3 HTM Weaknesses WRT Locking When Augmented . . . . . 400
17.3.4 Where Does HTM Best Fit In? . . . . . . . . . . . . . . . . . 403


17.3.5 Potential Game Changers . . . . . . . . . . . . . . . . . . . 403


17.3.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . 407
17.4 Formal Regression Testing? . . . . . . . . . . . . . . . . . . . . . . . 407
17.4.1 Automatic Translation . . . . . . . . . . . . . . . . . . . . . 407
17.4.2 Environment . . . . . . . . . . . . . . . . . . . . . . . . . . 408
17.4.3 Overhead . . . . . . . . . . . . . . . . . . . . . . . . . . . . 408
17.4.4 Locate Bugs . . . . . . . . . . . . . . . . . . . . . . . . . . 409
17.4.5 Minimal Scaffolding . . . . . . . . . . . . . . . . . . . . . . 410
17.4.6 Relevant Bugs . . . . . . . . . . . . . . . . . . . . . . . . . 410
17.4.7 Formal Regression Scorecard . . . . . . . . . . . . . . . . . 411
17.5 Functional Programming for Parallelism . . . . . . . . . . . . . . . . 412
17.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413

18 Looking Forward and Back 415

A Important Questions 419


A.1 Why Aren’t Parallel Programs Always Faster? . . . . . . . . . . . . . 419
A.2 Why Not Remove Locking? . . . . . . . . . . . . . . . . . . . . . . . 419
A.3 What Time Is It? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 420
A.4 What Does “After” Mean? . . . . . . . . . . . . . . . . . . . . . . . 421
A.5 How Much Ordering Is Needed? . . . . . . . . . . . . . . . . . . . . 423
A.5.1 Where is the Defining Data? . . . . . . . . . . . . . . . . . . 423
A.5.2 Consistent Data Used Consistently? . . . . . . . . . . . . . . 424
A.5.3 Is the Problem Partitionable? . . . . . . . . . . . . . . . . . 424
A.5.4 None of the Above? . . . . . . . . . . . . . . . . . . . . . . 424
A.6 What is the Difference Between “Concurrent” and “Parallel”? . . . . . 424
A.7 Why Is Software Buggy? . . . . . . . . . . . . . . . . . . . . . . . . 425

B “Toy” RCU Implementations 427


B.1 Lock-Based RCU . . . . . . . . . . . . . . . . . . . . . . . . . . . . 427
B.2 Per-Thread Lock-Based RCU . . . . . . . . . . . . . . . . . . . . . . 428
B.3 Simple Counter-Based RCU . . . . . . . . . . . . . . . . . . . . . . 428
B.4 Starvation-Free Counter-Based RCU . . . . . . . . . . . . . . . . . . 429
B.5 Scalable Counter-Based RCU . . . . . . . . . . . . . . . . . . . . . . 431
B.6 Scalable Counter-Based RCU With Shared Grace Periods . . . . . . . 432
B.7 RCU Based on Free-Running Counter . . . . . . . . . . . . . . . . . 434
B.8 Nestable RCU Based on Free-Running Counter . . . . . . . . . . . . 435
B.9 RCU Based on Quiescent States . . . . . . . . . . . . . . . . . . . . 437
B.10 Summary of Toy RCU Implementations . . . . . . . . . . . . . . . . 438

C Why Memory Barriers? 441


C.1 Cache Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 441
C.2 Cache-Coherence Protocols . . . . . . . . . . . . . . . . . . . . . . . 443
C.2.1 MESI States . . . . . . . . . . . . . . . . . . . . . . . . . . 443
C.2.2 MESI Protocol Messages . . . . . . . . . . . . . . . . . . . 443
C.2.3 MESI State Diagram . . . . . . . . . . . . . . . . . . . . . . 444
C.2.4 MESI Protocol Example . . . . . . . . . . . . . . . . . . . . 445
C.3 Stores Result in Unnecessary Stalls . . . . . . . . . . . . . . . . . . . 445
C.3.1 Store Buffers . . . . . . . . . . . . . . . . . . . . . . . . . . 446
C.3.2 Store Forwarding . . . . . . . . . . . . . . . . . . . . . . . . 447


C.3.3 Store Buffers and Memory Barriers . . . . . . . . . . . . . . 447


C.4 Store Sequences Result in Unnecessary Stalls . . . . . . . . . . . . . 449
C.4.1 Invalidate Queues . . . . . . . . . . . . . . . . . . . . . . . 449
C.4.2 Invalidate Queues and Invalidate Acknowledge . . . . . . . . 450
C.4.3 Invalidate Queues and Memory Barriers . . . . . . . . . . . 450
C.5 Read and Write Memory Barriers . . . . . . . . . . . . . . . . . . . 452
C.6 Example Memory-Barrier Sequences . . . . . . . . . . . . . . . . . . 452
C.6.1 Ordering-Hostile Architecture . . . . . . . . . . . . . . . . . 452
C.6.2 Example 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 453
C.6.3 Example 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 454
C.6.4 Example 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . 454
C.7 Are Memory Barriers Forever? . . . . . . . . . . . . . . . . . . . . . 454
C.8 Advice to Hardware Designers . . . . . . . . . . . . . . . . . . . . . 455

D Style Guide 457


D.1 Paul’s Conventions . . . . . . . . . . . . . . . . . . . . . . . . . . . 457
D.2 NIST Style Guide . . . . . . . . . . . . . . . . . . . . . . . . . . . . 458
D.2.1 Unit Symbol . . . . . . . . . . . . . . . . . . . . . . . . . . 458
D.2.2 NIST Guide Yet To Be Followed . . . . . . . . . . . . . . . 459
D.3 LATEX Conventions . . . . . . . . . . . . . . . . . . . . . . . . . . . 459
D.3.1 Monospace Font . . . . . . . . . . . . . . . . . . . . . . . . 459
D.3.2 Cross-reference . . . . . . . . . . . . . . . . . . . . . . . . . 463
D.3.3 Non Breakable Spaces . . . . . . . . . . . . . . . . . . . . . 463
D.3.4 Hyphenation and Dashes . . . . . . . . . . . . . . . . . . . . 464
D.3.5 Punctuation . . . . . . . . . . . . . . . . . . . . . . . . . . . 465
D.3.6 Floating Object Format . . . . . . . . . . . . . . . . . . . . 465
D.3.7 Improvement Candidates . . . . . . . . . . . . . . . . . . . . 466

E Answers to Quick Quizzes 473


E.1 How To Use This Book . . . . . . . . . . . . . . . . . . . . . . . . . 473
E.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 474
E.3 Hardware and its Habits . . . . . . . . . . . . . . . . . . . . . . . . . 478
E.4 Tools of the Trade . . . . . . . . . . . . . . . . . . . . . . . . . . . . 482
E.5 Counting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 488
E.6 Partitioning and Synchronization Design . . . . . . . . . . . . . . . . 501
E.7 Locking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 507
E.8 Data Ownership . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 514
E.9 Deferred Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . 515
E.10 Data Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 532
E.11 Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 536
E.12 Formal Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . 543
E.13 Putting It All Together . . . . . . . . . . . . . . . . . . . . . . . . . 550
E.14 Advanced Synchronization . . . . . . . . . . . . . . . . . . . . . . . 553
E.15 Advanced Synchronization: Memory Ordering . . . . . . . . . . . . 556
E.16 Ease of Use . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 567
E.17 Conflicting Visions of the Future . . . . . . . . . . . . . . . . . . . . 567
E.18 Important Questions . . . . . . . . . . . . . . . . . . . . . . . . . . 572
E.19 “Toy” RCU Implementations . . . . . . . . . . . . . . . . . . . . . . 573
E.20 Why Memory Barriers? . . . . . . . . . . . . . . . . . . . . . . . . . 579


Glossary 585

Bibliography 595

Credits 639
LATEX Advisor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 639
Reviewers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 639
Machine Owners . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 639
Original Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 639
Figure Credits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 640
Other Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 641

Acronyms 643

Index 645

API Index 649

Chapter 1

How To Use This Book

If you would only recognize that life is hard, things would be so much easier for you.
Louis D. Brandeis

The purpose of this book is to help you program shared-memory parallel systems without risking your sanity.1 Nevertheless, you should think of the information in this book as a foundation on which to build, rather than as a completed cathedral. Your mission, if you choose to accept, is to help make further progress in the exciting field of parallel programming—progress that will in time render this book obsolete.

Parallel programming in the 21st century is no longer focused solely on science, research, and grand-challenge projects. And this is all to the good, because it means that parallel programming is becoming an engineering discipline. Therefore, as befits an engineering discipline, this book examines specific parallel-programming tasks and describes how to approach them. In some surprisingly common cases, these tasks can be automated.

This book is written in the hope that presenting the engineering discipline underlying successful parallel-programming projects will free a new generation of parallel hackers from the need to slowly and painstakingly reinvent old wheels, enabling them to instead focus their energy and creativity on new frontiers. However, what you get from this book will be determined by what you put into it. It is hoped that simply reading this book will be helpful, and that working the Quick Quizzes will be even more helpful. However, the best results come from applying the techniques taught in this book to real-life problems. As always, practice makes perfect.

But no matter how you approach it, we sincerely hope that parallel programming brings you at least as much fun, excitement, and challenge as it has brought to us!

1 Or, perhaps more accurately, without much greater risk to your sanity than that incurred by non-parallel programming. Which, come to think of it, might not be saying all that much.

1.1 Roadmap

Cat: Where are you going?
Alice: Which way should I go?
Cat: That depends on where you are going.
Alice: I don’t know.
Cat: Then it doesn’t matter which way you go.
Lewis Carroll, Alice in Wonderland

This book is a handbook of widely applicable and heavily used design techniques, rather than a collection of optimal algorithms with tiny areas of applicability. You are currently reading Chapter 1, but you knew that already. Chapter 2 gives a high-level overview of parallel programming.

Chapter 3 introduces shared-memory parallel hardware. After all, it is difficult to write good parallel code unless you understand the underlying hardware. Because hardware constantly evolves, this chapter will always be out of date. We will nevertheless do our best to keep up. Chapter 4 then provides a very brief overview of common shared-memory parallel-programming primitives.

Chapter 5 takes an in-depth look at parallelizing one of the simplest problems imaginable, namely counting. Because almost everyone has an excellent grasp of counting, this chapter is able to delve into many important parallel-programming issues without the distractions of more-typical computer-science problems. My impression is that this chapter has seen the greatest use in parallel-programming coursework.

Chapter 6 introduces a number of design-level methods of addressing the issues identified in Chapter 5. It turns out that it is important to address parallelism at the design level when feasible: To paraphrase Dijkstra [Dij68], “retrofitted parallelism considered grossly suboptimal” [McK12c].

The next three chapters examine three important approaches to synchronization. Chapter 7 covers locking, which is still not only the workhorse of production-quality parallel programming, but is also widely considered to be parallel programming’s worst villain. Chapter 8 gives a brief overview of data ownership, an often overlooked but remarkably pervasive and powerful approach. Finally, Chapter 9 introduces a number of deferred-processing mechanisms, including reference counting, hazard pointers, sequence locking, and RCU.

Chapter 10 applies the lessons of previous chapters to hash tables, which are heavily used due to their excellent partitionability, which (usually) leads to excellent performance and scalability.

As many have learned to their sorrow, parallel programming without validation is a sure path to abject failure. Chapter 11 covers various forms of testing. It is of course impossible to test reliability into your program after the fact, so Chapter 12 follows up with a brief overview of a couple of practical approaches to formal verification.

Chapter 13 contains a series of moderate-sized parallel programming problems. The difficulty of these problems varies, but they should be appropriate for someone who has mastered the material in the previous chapters.

Chapter 14 looks at advanced synchronization methods, including non-blocking synchronization and parallel real-time computing, while Chapter 15 covers the advanced topic of memory ordering. Chapter 16 follows up with some ease-of-use advice. Chapter 17 looks at a few possible future directions, including shared-memory parallel system design, software and hardware transactional memory, and functional programming for parallelism. Finally, Chapter 18 reviews the material in this book and its origins.

This chapter is followed by a number of appendices. The most popular of these appears to be Appendix C, which delves even further into memory ordering. Appendix E contains the answers to the infamous Quick Quizzes, which are discussed in the next section.

1.2 Quick Quizzes

Undertake something difficult, otherwise you will never grow.
Abbreviated from Ronald E. Osburn

“Quick quizzes” appear throughout this book, and the answers may be found in Appendix E starting on page 473. Some of them are based on material in which that quick quiz appears, but others require you to think beyond that section, and, in some cases, beyond the realm of current knowledge. As with most endeavors, what you get out of this book is largely determined by what you are willing to put into it. Therefore, readers who make a genuine effort to solve a quiz before looking at the answer find their effort repaid handsomely with increased understanding of parallel programming.

Quick Quiz 1.1: Where are the answers to the Quick Quizzes found?

Quick Quiz 1.2: Some of the Quick Quiz questions seem to be from the viewpoint of the reader rather than the author. Is that really the intent?

Quick Quiz 1.3: These Quick Quizzes are just not my cup of tea. What can I do about it?

In short, if you need a deep understanding of the material, then you should invest some time into answering the Quick Quizzes. Don’t get me wrong, passively reading the material can be quite valuable, but gaining full problem-solving capability really does require that you practice solving problems. Similarly, gaining full code-production capability really does require that you practice producing code.

Quick Quiz 1.4: If passively reading this book doesn’t get me full problem-solving and code-production capabilities, what on earth is the point???

I learned this the hard way during coursework for my late-in-life Ph.D. I was studying a familiar topic, and was surprised at how few of the chapter’s exercises I could answer off the top of my head.2 Forcing myself to answer the questions greatly increased my retention of the material. So with these Quick Quizzes I am not asking you to do anything that I have not been doing myself.

Finally, the most common learning disability is thinking that you already understand the material at hand. The quick quizzes can be an extremely effective cure.

2 So I suppose that it was just as well that my professors refused to let me waive that class!

1.3 Alternatives to This Book

Between two evils I always pick the one I never tried before.
Mae West

As Knuth learned the hard way, if you want your book to be finite, it must be focused. This book focuses on shared-memory parallel programming, with an emphasis on software that lives near the bottom of the software stack, such as operating-system kernels, parallel data-management systems, low-level libraries, and the like. The programming language used by this book is C.

If you are interested in other aspects of parallelism, you might well be better served by some other book. Fortunately, there are many alternatives available to you:

1. If you prefer a more academic and rigorous treatment of parallel programming, you might like Herlihy’s and Shavit’s textbook [HS08, HSLS20]. This book starts with an interesting combination of low-level primitives at high levels of abstraction from the hardware, and works its way through locking and simple data structures including lists, queues, hash tables, and counters, culminating with transactional memory, all in Java. Michael Scott’s textbook [Sco13] approaches similar material with more of a software-engineering focus, and, as far as I know, is the first formally published academic textbook with a section devoted to RCU.

Herlihy, Shavit, Luchangco, and Spear did catch up in their second edition [HSLS20] by adding short sections on hazard pointers and on RCU, with the latter in the guise of EBR.3 They also include a brief history of both, albeit with an abbreviated history of RCU that picks up almost a year after it was accepted into the Linux kernel and more than 20 years after Kung’s and Lehman’s landmark paper [KL80]. Those wishing a deeper view of the history may find it in this book’s Section 9.5.5.

However, readers who might otherwise suspect a hostile attitude towards RCU on the part of this textbook’s first author should refer to the last full sentence on the first page of one of his papers [BGHZ16]. This sentence reads “QSBR [a particular class of RCU implementations] is fast and can be applied to virtually any data structure.” These are clearly not the words of someone who is hostile towards RCU.

2. If you would like an academic treatment of parallel programming from a programming-language-pragmatics viewpoint, you might be interested in the concurrency chapter from Scott’s textbook [Sco06, Sco15] on programming-language pragmatics.

3. If you are interested in an object-oriented patternist treatment of parallel programming focussing on C++, you might try Volumes 2 and 4 of Schmidt’s POSA series [SSRB00, BHS07]. Volume 4 in particular has some interesting chapters applying this work to a warehouse application. The realism of this example is attested to by the section entitled “Partitioning the Big Ball of Mud”, in which the problems inherent in parallelism often take a back seat to getting one’s head around a real-world application.

4. If you want to work with Linux-kernel device drivers, then Corbet’s, Rubini’s, and Kroah-Hartman’s “Linux Device Drivers” [CRKH05] is indispensable, as is the Linux Weekly News web site (https://lwn.net/). There is a large number of books and resources on the more general topic of Linux kernel internals.

5. If your primary focus is scientific and technical computing, and you prefer a patternist approach, you might try Mattson et al.’s textbook [MSM05]. It covers Java, C/C++, OpenMP, and MPI. Its patterns are admirably focused first on design, then on implementation.

6. If your primary focus is scientific and technical computing, and you are interested in GPUs, CUDA, and MPI, you might check out Norm Matloff’s “Programming on Parallel Machines” [Mat17]. Of course, the GPU vendors have quite a bit of additional information [AMD20, Zel11, NVi17a, NVi17b].

7. If you are interested in POSIX Threads, you might take a look at David R. Butenhof’s book [But97]. In addition, W. Richard Stevens’s book [Ste92, Ste13] covers UNIX and POSIX, and Stewart Weiss’s lecture notes [Wei13] provide a thorough and accessible introduction with a good set of examples.

8. If you are interested in C++11, you might like Anthony Williams’s “C++ Concurrency in Action: Practical Multithreading” [Wil12, Wil19].

3 Albeit an implementation that contains a reader-preemption bug noted by Richard Bornat.

9. If you are interested in C++, but in a Windows environment, you might try Herb Sutter’s “Effective Concurrency” series in Dr. Dobbs Journal [Sut08]. This series does a reasonable job of presenting a commonsense approach to parallelism.

10. If you want to try out Intel Threading Building Blocks, then perhaps James Reinders’s book [Rei07] is what you are looking for.

11. Those interested in learning how various types of multi-processor hardware cache organizations affect the implementation of kernel internals should take a look at Curt Schimmel’s classic treatment of this subject [Sch94].

12. If you are looking for a hardware view, Hennessy’s and Patterson’s classic textbook [HP17, HP11] is well worth a read. A “Readers Digest” version of this tome geared for scientific and technical workloads (bashing big arrays) may be found in Andrew Chien’s textbook [Chi22]. If you are looking for an academic textbook on memory ordering from a more hardware-centric viewpoint, that of Daniel Sorin et al. [SHW11, NSHW20] is highly recommended. For a memory-ordering tutorial from a Linux-kernel viewpoint, Paolo Bonzini’s LWN series is a good place to start [Bon21a, Bon21e, Bon21c, Bon21b, Bon21d, Bon21f].

13. Those wishing to learn about the Rust language’s support for low-level concurrency should refer to Mara Bos’s book [Bos23].

14. Finally, those using Java might be well-served by Doug Lea’s textbooks [Lea97, GPB+07].

However, if you are interested in principles of parallel design for low-level software, especially software written in C, read on!

1.4 Sample Source Code

Use the source, Luke!
Unknown Star Wars fan

This book discusses its fair share of source code, and in many cases this source code may be found in the CodeSamples directory of this book’s git tree. For example, on UNIX systems, you should be able to type the following:

find CodeSamples -name rcu_rcpls.c -print

This command will locate the file rcu_rcpls.c, which is called out in Appendix B. Non-UNIX systems have their own well-known ways of locating files by filename.

1.5 Whose Book Is This?

If you become a teacher, by your pupils you’ll be taught.
Oscar Hammerstein II

As the cover says, the editor is one Paul E. McKenney. However, the editor does accept contributions via the [email protected] email list. These contributions can be in pretty much any form, with popular approaches including text emails, patches against the book’s LaTeX source, and even git pull requests. Use whatever form works best for you.

To create patches or git pull requests, you will need the LaTeX source to the book, which is at git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git, or, alternatively, https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git. You will of course also need git and LaTeX, which are available as part of most mainstream Linux distributions. Other packages may be required, depending on the distribution you use. The required list of packages for a few popular distributions is listed in the file FAQ-BUILD.txt in the LaTeX source to the book.

To create and display a current LaTeX source tree of this book, use the list of Linux commands shown in Listing 1.1.

Listing 1.1: Creating an Up-To-Date PDF
git clone git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git
cd perfbook
# You may need to install a font. See item 1 in FAQ.txt.
make # -jN for parallel build
evince perfbook.pdf & # Two-column version
make perfbook-1c.pdf
evince perfbook-1c.pdf & # One-column version for e-readers
make help # Display other build options

In some environments, the evince command that displays perfbook.pdf may need to be replaced, for example, with acroread.

The git clone command need only be used the first time you create a PDF; subsequently, you can run the commands shown in Listing 1.2 to pull in any updates and generate an updated PDF. The commands in Listing 1.2 must be run within the perfbook directory created by the commands shown in Listing 1.1.

Listing 1.2: Generating an Updated PDF
git remote update
git checkout origin/master
make # -jN for parallel build
evince perfbook.pdf & # Two-column version
make perfbook-1c.pdf
evince perfbook-1c.pdf & # One-column version for e-readers

PDFs of this book are sporadically posted at https://kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html and at https://www.rdrop.com/users/paulmck/perfbook/.

The actual process of contributing patches and sending git pull requests is similar to that of the Linux kernel, which is documented here: https://www.kernel.org/doc/html/latest/process/submitting-patches.html. One important requirement is that each patch (or commit, in the case of a git pull request) must contain a valid Signed-off-by: line, which has the following format:

Signed-off-by: My Name <[email protected]>

Please see https://lkml.org/lkml/2007/1/15/219 for an example patch with a Signed-off-by: line. (A sketch of one way to generate this line with git appears at the end of this section.) Note well that the Signed-off-by: line has a very specific meaning, namely that you are certifying that:

(a) The contribution was created in whole or in part by me and I have the right to submit it under the open source license indicated in the file; or

(b) The contribution is based upon previous work that, to the best of my knowledge, is covered under an appropriate open source license and I have the right under that license to submit that work with modifications, whether created in whole or in part by me, under the same open source license (unless I am permitted to submit under a different license), as indicated in the file; or

(c) The contribution was provided directly to me by some other person who certified (a), (b) or (c) and I have not modified it.

(d) I understand and agree that this project and the contribution are public and that a record of the contribution (including all personal information I submit with it, including my sign-off) is maintained indefinitely and may be redistributed consistent with this project or the open source license(s) involved.

This is quite similar to the Developer’s Certificate of Origin (DCO) 1.1 used by the Linux kernel. You must use your real name: I unfortunately cannot accept pseudonymous or anonymous contributions.

The language of this book is American English; however, the open-source nature of this book permits translations, and I personally encourage them. The open-source licenses covering this book additionally allow you to sell your translation, if you wish. I do request that you send me a copy of the translation (hardcopy if available), but this is a request made as a professional courtesy, and is not in any way a prerequisite to the permission that you already have under the Creative Commons and GPL licenses. Please see the FAQ.txt file in the source tree for a list of translations currently in progress. I consider a translation effort to be “in progress” once at least one chapter has been fully translated.

There are many styles under the “American English” rubric. The style for this particular book is documented in Appendix D.

As noted at the beginning of this section, I am this book’s editor. However, if you choose to contribute, it will be your book as well. In that spirit, I offer you Chapter 2, our introduction.
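As a minimal sketch of one way to generate the Signed-off-by: line mentioned above (assuming git’s user.name and user.email are set to your real name and address; the address below is a placeholder to be replaced with your own):

git config user.name "My Name"            # your real name, as required above
git config user.email "[email protected]"    # placeholder address; substitute your own
git commit -s       # -s (--signoff) appends the Signed-off-by: line to the commit message
git format-patch -1 # emit the most recent commit as an emailable patch file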
Chapter 2

Introduction

If parallel programming is so hard, why are there so many parallel programs?
Unknown

Parallel programming has earned a reputation as one of the most difficult areas a hacker can tackle. Papers and textbooks warn of the perils of deadlock, livelock, race conditions, non-determinism, Amdahl’s-Law limits to scaling, and excessive realtime latencies. And these perils are quite real; we authors have accumulated uncounted years of experience along with the resulting emotional scars, grey hairs, and hair loss.

However, new technologies that are difficult to use at introduction invariably become easier over time. For example, the once-rare ability to drive a car is now commonplace in many countries. This dramatic change came about for two basic reasons: (1) Cars became cheaper and more readily available, so that more people had the opportunity to learn to drive, and (2) Cars became easier to operate due to automatic transmissions, automatic chokes, automatic starters, greatly improved reliability, and a host of other technological improvements.

The same is true for many other technologies, including computers. It is no longer necessary to operate a keypunch in order to program. Spreadsheets allow most non-programmers to get results from their computers that would have required a team of specialists a few decades ago. Perhaps the most compelling example is web-surfing and content creation, which since the early 2000s has been easily done by untrained, uneducated people using various now-commonplace social-networking tools. As recently as 1968, such content creation was a far-out research project [Eng68], described at the time as “like a UFO landing on the White House lawn” [Gri00].

Therefore, if you wish to argue that parallel programming will remain as difficult as it is currently perceived by many to be, it is you who bears the burden of proof, keeping in mind the many centuries of counter-examples in many fields of endeavor.

2.1 Historic Parallel Programming Difficulties

Not the power to remember, but its very opposite, the power to forget, is a necessary condition for our existence.
Sholem Asch

As indicated by its title, this book takes a different approach. Rather than complain about the difficulty of parallel programming, it instead examines the reasons why parallel programming is difficult, and then works to help the reader to overcome these difficulties. As will be seen, these difficulties have historically fallen into several categories, including:

1. The historic high cost and relative rarity of parallel systems.

2. The typical researcher’s and practitioner’s lack of experience with parallel systems.

3. The paucity of publicly accessible parallel code.

4. The lack of a widely understood engineering discipline of parallel programming.

5. The high overhead of communication relative to that of processing, even in tightly coupled shared-memory computers.

Many of these historic difficulties are well on the way to being overcome. First, over the past few decades, the cost of parallel systems has decreased from many multiples of that of a house to that of a modest meal, courtesy of Moore’s Law [Moo65].

as 1996 [ONH+96]. IBM introduced simultaneous multithreading into its high-end POWER family in 2000, and multicore in 2001. Intel introduced hyperthreading into its commodity Pentium line in November 2000, and both AMD and Intel introduced dual-core CPUs in 2005. Sun followed with the multicore/multi-threaded Niagara in late 2005. In fact, by 2008, it was becoming difficult to find a single-CPU desktop system, with single-core CPUs being relegated to netbooks and embedded devices. By 2012, even smartphones were starting to sport multiple CPUs. By 2020, safety-critical software standards started addressing concurrency.

Second, the advent of low-cost and readily available multicore systems means that the once-rare experience of parallel programming is now available to almost all researchers and practitioners. In fact, parallel systems have long been within the budget of students and hobbyists. We can therefore expect greatly increased levels of invention and innovation surrounding parallel systems, and that increased familiarity will over time make the once prohibitively expensive field of parallel programming much more friendly and commonplace.

Third, in the 20th century, large systems of highly parallel software were almost always closely guarded proprietary secrets. In happy contrast, the 21st century has seen numerous open-source (and thus publicly available) parallel software projects, including the Linux kernel [Tor03], database systems [Pos08, MS08], and message-passing systems [The08, Uni08a]. This book will draw primarily from the Linux kernel, but will provide much material suitable for user-level applications.

Fourth, even though the large-scale parallel-programming projects of the 1980s and 1990s were almost all proprietary projects, these projects have seeded other communities with cadres of developers who understand the engineering discipline required to develop production-quality parallel code. A major purpose of this book is to present this engineering discipline.

Unfortunately, the fifth difficulty, the high cost of communication relative to that of processing, remains largely in force. This difficulty has been receiving increasing attention during the new millennium. However, according to Stephen Hawking, the finite speed of light and the atomic nature of matter will limit progress in this area [Gar07, Moo03]. Fortunately, this difficulty has been in force since the late 1980s, so that the aforementioned engineering discipline has evolved practical and effective strategies for handling it. In addition, hardware designers are increasingly aware of these issues, so perhaps future hardware will be more friendly to parallel software, as discussed in Section 3.3.

Quick Quiz 2.1: Come on now!!! Parallel programming has been known to be exceedingly hard for many decades. You seem to be hinting that it is not so hard. What sort of game are you playing?

However, even though parallel programming might not be as hard as is commonly advertised, it is often more work than is sequential programming.

Quick Quiz 2.2: How could parallel programming ever be as easy as sequential programming?

It therefore makes sense to consider alternatives to parallel programming. However, it is not possible to reasonably consider parallel-programming alternatives without understanding parallel-programming goals. This topic is addressed in the next section.

2.2 Parallel Programming Goals

If you don't know where you are going, you will end up somewhere else.
Yogi Berra

The three major goals of parallel programming (over and above those of sequential programming) are as follows:

1. Performance.

2. Productivity.

3. Generality.

Unfortunately, given the current state of the art, it is possible to achieve at best two of these three goals for any given parallel program. These three goals therefore form the iron triangle of parallel programming, a triangle upon which overly optimistic hopes all too often come to grief.1

1 Kudos to Michael Wong for naming the iron triangle.

Quick Quiz 2.3: Oh, really??? What about correctness, maintainability, robustness, and so on?

Quick Quiz 2.4: And if correctness, maintainability, and robustness don't make the list, why do productivity and generality?

[Figure 2.1: MIPS/Clock-Frequency Trend for Intel CPUs (log-scale plot of CPU clock frequency / MIPS versus year, 1975 to 2020).]

Quick Quiz 2.5: Given that parallel programs are much harder to prove correct than are sequential programs, again, shouldn't correctness really be on the list?

Quick Quiz 2.6: What about just having fun?

Each of these goals is elaborated upon in the following sections.

2.2.1 Performance

Performance is the primary goal behind most parallel-programming effort. After all, if performance is not a concern, why not do yourself a favor: Just write sequential code, and be happy? It will very likely be easier and you will probably get done much more quickly.

Quick Quiz 2.7: Are there no cases where parallel programming is about something other than performance?

Note that "performance" is interpreted broadly here, including for example scalability (performance per CPU) and efficiency (performance per watt).

That said, the focus of performance has shifted from hardware to parallel software. This change in focus is due to the fact that, although Moore's Law continues to deliver increases in transistor density, it has ceased to provide the traditional single-threaded performance increases. This can be seen in Figure 2.1,2 which shows that writing single-threaded code and simply waiting a year or two for the CPUs to catch up may no longer be an option. Given the recent trends on the part of all major manufacturers towards multicore/multithreaded systems, parallelism is the way to go for those wanting to avail themselves of the full performance of their systems.

2 This plot shows clock frequencies for newer CPUs theoretically capable of retiring one or more instructions per clock, and MIPS (millions of instructions per second, usually from the old Dhrystone benchmark) for older CPUs requiring multiple clocks to execute even the simplest instruction. The reason for shifting between these two measures is that the newer CPUs' ability to retire multiple instructions per clock is typically limited by memory-system performance. Furthermore, the benchmarks commonly used on the older CPUs are obsolete, and it is difficult to run the newer benchmarks on systems containing the old CPUs, in part because it is hard to find working instances of the old CPUs.

Quick Quiz 2.8: Why not instead rewrite programs from inefficient scripting languages to C or C++?

Even so, the first goal is performance rather than scalability, especially given that the easiest way to attain linear scalability is to reduce the performance of each CPU [Tor01]. Given a four-CPU system, which would you prefer? A program that provides 100 transactions per second on a single CPU, but does not scale at all? Or a program that provides 10 transactions per second on a single CPU, but scales perfectly? The first program seems like a better bet, though the answer might change if you happened to have a 32-CPU system.

That said, just because you have multiple CPUs is not necessarily in and of itself a reason to use them all, especially given the recent decreases in price of multi-CPU systems. The key point to understand is that parallel programming is primarily a performance optimization, and, as such, it is one potential optimization of many. If your program is fast enough as currently written, there is no reason to optimize, either by parallelizing it or by applying any of a number of potential sequential optimizations.3 By the same token, if you are looking to apply parallelism as an optimization to a sequential program, then you will need to compare parallel algorithms to the best sequential algorithms. This may require some care, as far too many publications ignore the sequential case when analyzing the performance of parallel algorithms.

3 Of course, if you are a hobbyist whose primary interest is writing parallel software, that is more than enough reason to parallelize whatever software you are interested in.

2.2.2 Productivity

Quick Quiz 2.9: Why all this prattling on about non-technical issues??? And not just any non-technical issue, but productivity of all things? Who cares?

Productivity has been becoming increasingly important in recent decades. To see this, consider that the price of early computers was tens of millions of dollars at a time when engineering salaries were but a few thousand dollars a year. If dedicating a team of ten engineers to such a machine would improve its performance, even by only 10 %, then their salaries would be repaid many times over.

One such machine was the CSIRAC, the oldest still-intact stored-program computer, which was put into operation in 1949 [Mus04, Dep06]. Because this machine was built before the transistor era, it was constructed of 2,000 vacuum tubes, ran with a clock frequency of 1 kHz, consumed 30 kW of power, and weighed more than three metric tons. Given that this machine had but 768 words of RAM, it is safe to say that it did not suffer from the productivity issues that often plague today's large-scale software projects.

Today, it would be quite difficult to purchase a machine with so little computing power. Perhaps the closest equivalents are 8-bit embedded microprocessors exemplified by the venerable Z80 [Wik08], but even the old Z80 had a CPU clock frequency more than 1,000 times faster than the CSIRAC. The Z80 CPU had 8,500 transistors, and could be purchased in 2008 for less than $2 US per unit in 1,000-unit quantities. In stark contrast to the CSIRAC, software-development costs are anything but insignificant for the Z80.

The CSIRAC and the Z80 are two points in a long-term trend, as can be seen in Figure 2.2. This figure plots an approximation to computational power per die over the past four decades, showing an impressive six-order-of-magnitude increase over a period of forty years. Note that the advent of multicore CPUs has permitted this increase to continue apace despite the clock-frequency wall encountered in 2003, albeit courtesy of dies supporting more than 50 hardware threads each.

[Figure 2.2: MIPS per Die for Intel CPUs (log-scale plot of MIPS per die versus year, 1975 to 2020).]

One of the inescapable consequences of the rapid decrease in the cost of hardware is that software productivity becomes increasingly important. It is no longer sufficient merely to make efficient use of the hardware: It is now necessary to make extremely efficient use of software developers as well. This has long been the case for sequential hardware, but parallel hardware has become a low-cost commodity only recently. Therefore, only recently has high productivity become critically important when creating parallel software.

Quick Quiz 2.10: Given how cheap parallel systems have become, how can anyone afford to pay people to program them?

Perhaps at one time, the sole purpose of parallel software was performance. Now, however, productivity is gaining the spotlight.

2.2.3 Generality

One way to justify the high cost of developing parallel software is to strive for maximal generality. All else being equal, the cost of a more-general software artifact can be spread over more users than that of a less-general one. In fact, this economic force explains much of the maniacal focus on portability, which can be seen as an important special case of generality.4

4 Kudos to Michael Wong for pointing this out.

Unfortunately, generality often comes at the cost of performance, productivity, or both. For example, portability is often achieved via adaptation layers, which inevitably exact a performance penalty. To see this more generally, consider the following popular parallel programming environments:

C/C++ "Locking Plus Threads": This category, which includes POSIX Threads (pthreads) [Ope97], Windows Threads, and numerous operating-system kernel environments, offers excellent performance (at

least within the confines of a single SMP system) and also offers good generality. Pity about the relatively low productivity.

Java: This general purpose and inherently multithreaded programming environment is widely believed to offer much higher productivity than C or C++, courtesy of the automatic garbage collector and the rich set of class libraries. However, its performance, though greatly improved in the early 2000s, lags that of C and C++.

MPI: This Message Passing Interface [MPI08] powers the largest scientific and technical computing clusters in the world and offers unparalleled performance and scalability. In theory, it is general purpose, but it is mainly used for scientific and technical computing. Its productivity is believed by many to be even lower than that of C/C++ "locking plus threads" environments.

OpenMP: This set of compiler directives can be used to parallelize loops. It is thus quite specific to this task, and this specificity often limits its performance. It is, however, much easier to use than MPI or C/C++ "locking plus threads."

SQL: Structured Query Language [Int92] is specific to relational database queries. However, its performance is quite good as measured by the Transaction Processing Performance Council (TPC) benchmark results [Tra01]. Productivity is excellent; in fact, this parallel programming environment enables people to make good use of a large parallel system despite having little or no knowledge of parallel programming concepts.

The nirvana of parallel programming environments, one that offers world-class performance, productivity, and generality, simply does not yet exist. Until such a nirvana appears, it will be necessary to make engineering tradeoffs among performance, productivity, and generality. One such tradeoff is depicted by the green "iron triangle"5 shown in Figure 2.3, which shows how productivity becomes increasingly important at the upper layers of the system stack, while performance and generality become increasingly important at the lower layers of the system stack. The huge development costs incurred at the lower layers must be spread over equally huge numbers of users (hence the importance of generality), and performance lost in lower layers cannot easily be recovered further up the stack. In the upper layers of the stack, there might be very few users for a given specific application, in which case productivity concerns are paramount. This explains the tendency towards "bloatware" further up the stack: Extra hardware is often cheaper than extra developers. This book is intended for developers working near the bottom of the stack, where performance and generality are of greatest concern.

5 Kudos to Michael Wong for coining "iron triangle."

[Figure 2.3: Software Layers and Performance, Productivity, and Generality (a stack of Application, Middleware (e.g., DBMS), System Libraries, Container, Operating System Kernel, Hypervisor, Firmware, and Hardware, with productivity mattering most at the top and performance and generality at the bottom).]

[Figure 2.4: Tradeoff Between Productivity and Generality (special-purpose environments productive for users 1 through 4 surround a general-purpose environment near the hardware/abstraction center).]

It is important to note that a tradeoff between productivity and generality has existed for centuries in many fields. For but one example, a nailgun is more productive than a hammer for driving nails, but in contrast to the

nailgun, a hammer can be used for many things besides driving nails. It should therefore be no surprise to see similar tradeoffs appear in the field of parallel computing. This tradeoff is shown schematically in Figure 2.4. Here, users 1, 2, 3, and 4 have specific jobs that they need the computer to help them with. The most productive possible language or environment for a given user is one that simply does that user's job, without requiring any programming, configuration, or other setup.

Quick Quiz 2.11: This is a ridiculously unachievable ideal! Why not focus on something that is achievable in practice?

Unfortunately, a system that does the job required by user 1 is unlikely to do user 2's job. In other words, the most productive languages and environments are domain-specific, and thus by definition lacking generality.

Another option is to tailor a given programming language or environment to the hardware system (for example, low-level languages such as assembly, C, C++, or Java) or to some abstraction (for example, Haskell, Prolog, or Snobol), as is shown by the circular region near the center of Figure 2.4. These languages can be considered to be general in the sense that they are equally ill-suited to the jobs required by users 1, 2, 3, and 4. In other words, their generality comes at the expense of decreased productivity when compared to domain-specific languages and environments. Worse yet, a language that is tailored to a given abstraction is likely to suffer from performance and scalability problems unless and until it can be efficiently mapped to real hardware.

Is there no escape from the iron triangle's three conflicting goals of performance, productivity, and generality?

It turns out that there often is an escape, for example, using the alternatives to parallel programming discussed in the next section. After all, parallel programming can be a great deal of fun, but it is not always the best tool for the job.

2.3 Alternatives to Parallel Programming

Experiment is folly when experience shows the way.
Roger M. Babson

In order to properly consider alternatives to parallel programming, you must first decide on what exactly you expect the parallelism to do for you. As seen in Section 2.2, the primary goals of parallel programming are performance, productivity, and generality. Because this book is intended for developers working on performance-critical code near the bottom of the software stack, the remainder of this section focuses primarily on performance improvement.

It is important to keep in mind that parallelism is but one way to improve performance. Other well-known approaches include the following, in roughly increasing order of difficulty:

1. Run multiple instances of a sequential application.

2. Make the application use existing parallel software.

3. Optimize the serial application.

These approaches are covered in the following sections.

2.3.1 Multiple Instances of a Sequential Application

Running multiple instances of a sequential application can allow you to do parallel programming without actually doing parallel programming. There are a large number of ways to approach this, depending on the structure of the application.

If your program is analyzing a large number of different scenarios, or is analyzing a large number of independent data sets, one easy and effective approach is to create a single sequential program that carries out a single analysis, then use any of a number of scripting environments (for example the bash shell) to run a number of instances of that sequential program in parallel, as sketched below. In some cases, this approach can be easily extended to a cluster of machines.

This approach may seem like cheating, and in fact some denigrate such programs as "embarrassingly parallel". And in fact, this approach does have some potential disadvantages, including increased memory consumption, waste of CPU cycles recomputing common intermediate results, and increased copying of data. However, it is often extremely productive, garnering extreme performance gains with little or no added effort.
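For example, a small C driver along the following lines accomplishes the same thing as a shell loop. This is only a sketch: the "./analyze" program and its input files are hypothetical stand-ins for whatever sequential analysis you already have.

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/* Launch one instance of a sequential analysis program per input
 * file, all running in parallel, then wait for every instance. */
int main(int argc, char *argv[])
{
	int i;

	for (i = 1; i < argc; i++) {
		pid_t pid = fork();

		if (pid < 0) {
			perror("fork");
			exit(EXIT_FAILURE);
		} else if (pid == 0) {
			/* Child: run one analysis on one data set. */
			execl("./analyze", "analyze", argv[i], (char *)NULL);
			perror("execl");
			_exit(EXIT_FAILURE);
		}
	}
	while (wait(NULL) > 0)
		continue;	/* Parent: reap all the children. */
	return 0;
}

Invoking this driver with a list of data files (for example, "./run_analyses data1 data2 data3", both names being hypothetical) runs the analyses concurrently with no changes to the sequential program itself.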

2.3.2 Use Existing Parallel Software

There is no longer any shortage of parallel software environments that can present a single-threaded programming environment, including relational databases [Dat82], web-application servers, and map-reduce environments. For example, a common design provides a separate process for each user, each of which generates SQL from user queries. This per-user SQL is run against a common relational database, which automatically runs the users' queries concurrently. The per-user programs are responsible only for the user interface, with the relational database taking full responsibility for the difficult issues surrounding parallelism and persistence.

In addition, there are a growing number of parallel library functions, particularly for numeric computation. Even better, some libraries take advantage of special-purpose hardware such as vector units and general-purpose graphical processing units (GPGPUs).

Taking this approach often sacrifices some performance, at least when compared to carefully hand-coding a fully parallel application. However, such sacrifice is often well repaid by a huge reduction in development effort.

Quick Quiz 2.12: Wait a minute! Doesn't this approach simply shift the development effort from you to whoever wrote the existing parallel software you are using?

2.3.3 Performance Optimization

Up through the early 2000s, CPU clock frequencies doubled every 18 months. It was therefore usually more important to create new functionality than to carefully optimize performance. Now that Moore's Law is "only" increasing transistor density instead of increasing both transistor density and per-transistor performance, it might be a good time to rethink the importance of performance optimization. After all, new hardware generations no longer bring significant single-threaded performance improvements. Furthermore, many performance optimizations can also conserve energy.

From this viewpoint, parallel programming is but another performance optimization, albeit one that is becoming much more attractive as parallel systems become cheaper and more readily available. However, it is wise to keep in mind that the speedup available from parallelism is limited to roughly the number of CPUs (but see Section 6.5 for an interesting exception). In contrast, the speedup available from traditional single-threaded software optimizations can be much larger. For example, replacing a long linked list with a hash table or a search tree can improve performance by many orders of magnitude. This highly optimized single-threaded program might run much faster than its unoptimized parallel counterpart, making parallelization unnecessary. Of course, a highly optimized parallel program would be even better, aside from the added development effort required.

Furthermore, different programs might have different performance bottlenecks. For example, if your program spends most of its time waiting on data from your disk drive, using multiple CPUs will probably just increase the time wasted waiting for the disks. In fact, if the program was reading from a single large file laid out sequentially on a rotating disk, parallelizing your program might well make it a lot slower due to the added seek overhead. You should instead optimize the data layout so that the file can be smaller (thus faster to read), split the file into chunks which can be accessed in parallel from different drives, cache frequently accessed data in main memory, or, if possible, reduce the amount of data that must be read.

Quick Quiz 2.13: What other bottlenecks might prevent additional CPUs from providing additional performance?

Parallelism can be a powerful optimization technique, but it is not the only such technique, nor is it appropriate for all situations. Of course, the easier it is to parallelize your program, the more attractive parallelization becomes as an optimization. Parallelization has a reputation of being quite difficult, which leads to the question "exactly what makes parallel programming so difficult?"

2.4 What Makes Parallel Programming Hard?

Real difficulties can be overcome; it is only the imaginary ones that are unconquerable.
Theodore N. Vail

It is important to note that the difficulty of parallel programming is as much a human-factors issue as it is a set of technical properties of the parallel programming problem. We do need human beings to be able to tell parallel systems what to do, otherwise known as programming. But parallel programming involves two-way communication, with a program's performance and scalability being the communication from the machine to the human. In short, the human writes a program telling the computer what to do, and the computer critiques this program via the resulting performance and scalability. Therefore, appeals to abstractions or to mathematical analyses will often be of severely limited utility.

In the Industrial Revolution, the interface between human and machine was evaluated by human-factor studies, then called time-and-motion studies. Although there have

been a few human-factor studies examining parallel programming [ENS05, ES05, HCS+05, SS94], these studies have been extremely narrowly focused, and hence unable to demonstrate any general results. Furthermore, given that the normal range of programmer productivity spans more than an order of magnitude, it is unrealistic to expect an affordable study to be capable of detecting (say) a 10 % difference in productivity. Although the multiple-order-of-magnitude differences that such studies can reliably detect are extremely valuable, the most impressive improvements tend to be based on a long series of 10 % improvements. We must therefore take a different approach.

[Figure 2.5: Categories of Tasks Required of Parallel Programmers (work partitioning, parallel access control, resource partitioning and replication, and interacting with hardware, set against performance, productivity, and generality).]

One such approach is to carefully consider the tasks that parallel programmers must undertake that are not required of sequential programmers. We can then evaluate how well a given programming language or environment assists the developer with these tasks. These tasks fall into the four categories shown in Figure 2.5, each of which is covered in the following sections.

2.4.1 Work Partitioning

Work partitioning is absolutely required for parallel execution: If there is but one "glob" of work, then it can be executed by at most one CPU at a time, which is by definition sequential execution. However, partitioning the work requires great care. For example, uneven partitioning can result in sequential execution once the small partitions have completed [Amd67]. In less extreme cases, load balancing can be used to fully utilize available hardware and restore performance and scalability.

Although partitioning can greatly improve performance and scalability, it can also increase complexity. For example, partitioning can complicate handling of global errors and events: A parallel program may need to carry out non-trivial synchronization in order to safely process such global events. More generally, each partition requires some sort of communication: After all, if a given thread did not communicate at all, it would have no effect and would thus not need to be executed. However, because communication incurs overhead, careless partitioning choices can result in severe performance degradation.

Furthermore, the number of concurrent threads must often be controlled, as each such thread occupies common resources, for example, space in CPU caches. If too many threads are permitted to execute concurrently, the CPU caches will overflow, resulting in high cache miss rate, which in turn degrades performance. Conversely, large numbers of threads are often required to overlap computation and I/O so as to fully utilize I/O devices.

Quick Quiz 2.14: Other than CPU cache capacity, what might require limiting the number of concurrent threads?

Finally, permitting threads to execute concurrently greatly increases the program's state space, which can make the program difficult to understand and debug, degrading productivity. All else being equal, smaller state spaces having more regular structure are more easily understood, but this is a human-factors statement as much as it is a technical or mathematical statement. Good parallel designs might have extremely large state spaces, but nevertheless be easy to understand due to their regular structure, while poor designs can be impenetrable despite having a comparatively small state space. The best designs exploit embarrassing parallelism, or transform the problem to one having an embarrassingly parallel solution. In either case, "embarrassingly parallel" is in fact an embarrassment of riches. The current state of the art enumerates good designs; more work is required to make more general judgments on state-space size and structure.

2.4.2 Parallel Access Control

Given a single-threaded sequential program, that single thread has full access to all of the program's resources. These resources are most often in-memory data structures, but can be CPUs, memory (including caches), I/O devices, computational accelerators, files, and much else besides.

The first parallel-access-control issue is whether the form of access to a given resource depends on that resource's location. For example, in many message-passing environments, local-variable access is via expressions and assignments, while remote-variable access uses an entirely different syntax, usually involving messaging.

The POSIX Threads environment [Ope97], Structured Query Language (SQL) [Int92], and partitioned global address-space (PGAS) environments such as Universal Parallel C (UPC) [EGCD03, CBF13] offer implicit access, while Message Passing Interface (MPI) [MPI08] offers explicit access because access to remote data requires explicit messaging.

The other parallel-access-control issue is how threads coordinate access to the resources. This coordination is carried out by the very large number of synchronization mechanisms provided by various parallel languages and environments, including message passing, locking, transactions, reference counting, explicit timing, shared atomic variables, and data ownership. Many traditional parallel-programming concerns such as deadlock, livelock, and transaction rollback stem from this coordination. This framework can be elaborated to include comparisons of these synchronization mechanisms, for example locking vs. transactional memory [MMW07], but such elaboration is beyond the scope of this section. (See Sections 17.2 and 17.3 for more information on transactional memory.)

Quick Quiz 2.15: Just what is "explicit timing"???

2.4.3 Resource Partitioning and Replication

The most effective parallel algorithms and systems exploit resource parallelism, so much so that it is usually wise to begin parallelization by partitioning your write-intensive resources and replicating frequently accessed read-mostly resources. The resource in question is most frequently data, which might be partitioned over computer systems, mass-storage devices, NUMA nodes, CPU cores (or dies or hardware threads), pages, cache lines, instances of synchronization primitives, or critical sections of code. For example, partitioning over locking primitives is termed "data locking" [BK85].

Resource partitioning is frequently application dependent. For example, numerical applications frequently partition matrices by row, column, or sub-matrix, while commercial applications frequently partition write-intensive data structures and replicate read-mostly data structures. Thus, a commercial application might assign the data for a given customer to a given few computers out of a large cluster. An application might statically partition data, or dynamically change the partitioning over time.

Resource partitioning is extremely effective, but it can be quite challenging for complex multilinked data structures.

[Figure 2.6: Ordering of Parallel-Programming Tasks (the same four task categories as Figure 2.5, arranged in the order used by the data-parallel approach).]

2.4.4 Interacting With Hardware

Hardware interaction is normally the domain of the operating system, the compiler, libraries, or other software-environment infrastructure. However, developers working with novel hardware features and components will often need to work directly with such hardware. In addition, direct access to the hardware can be required when squeezing the last drop of performance out of a given system. In this case, the developer may need to tailor or configure the application to the cache geometry, system topology, or interconnect protocol of the target hardware.

In some cases, hardware may be considered to be a resource which is subject to partitioning or access control, as described in the previous sections.

2.4.5 Composite Capabilities

Although these four capabilities are fundamental, good engineering practice uses composites of these capabilities. For example, the data-parallel approach first partitions the data so as to minimize the need for inter-partition communication, partitions the code accordingly, and finally maps data partitions and threads so as to maximize throughput while minimizing inter-thread communication, as shown in Figure 2.6. The developer can then consider each partition separately, greatly reducing the size of the relevant state space, in turn increasing productivity. Even though some problems are non-partitionable, clever transformations into forms permitting partitioning can sometimes greatly enhance both performance and scalability [Met99].
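As a small concrete illustration of partitioning a write-intensive resource, the "data locking" of Section 2.4.3, the following sketch gives each bucket of a hash table its own lock, so that updates to different buckets proceed in parallel rather than serializing on a single global lock. The bucket count and hash function are arbitrary assumptions made for the example, not anything prescribed by this book.

#include <pthread.h>

#define NBUCKETS 64		/* Hypothetical number of hash buckets. */

struct node {
	struct node *next;
	unsigned long key;
};

/* One lock per bucket ("data locking"): contention is partitioned
 * across the buckets instead of concentrating on one lock. */
static struct {
	pthread_mutex_t lock;
	struct node *head;
} table[NBUCKETS];

void table_init(void)
{
	for (int i = 0; i < NBUCKETS; i++)
		pthread_mutex_init(&table[i].lock, NULL);
}

void table_add(struct node *node)
{
	int b = node->key % NBUCKETS;

	pthread_mutex_lock(&table[b].lock);
	node->next = table[b].head;	/* Push onto this bucket's list. */
	table[b].head = node;
	pthread_mutex_unlock(&table[b].lock);
}

Threads inserting keys that hash to different buckets never touch the same lock, which is exactly the partition-then-map pattern described in Section 2.4.5.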

2.4.6 How Do Languages and Environments Assist With These Tasks?

Although many environments require the developer to deal manually with these tasks, there are long-standing environments that bring significant automation to bear. The poster child for these environments is SQL, many implementations of which automatically parallelize single large queries and also automate concurrent execution of independent queries and updates.

These four categories of tasks must be carried out in all parallel programs, but that of course does not necessarily mean that the developer must manually carry out these tasks. We can expect to see ever-increasing automation of these four tasks as parallel systems continue to become cheaper and more readily available.

Quick Quiz 2.16: Are there any other obstacles to parallel programming?

2.5 Discussion

Until you try, you don't know what you can't do.
Henry James

This section has given an overview of the difficulties with, goals of, and alternatives to parallel programming. This overview was followed by a discussion of what can make parallel programming hard, along with a high-level approach for dealing with parallel programming's difficulties. Those who still insist that parallel programming is impossibly difficult should review some of the older guides to parallel programming [Seq88, Bir89, BK85, Inm85]. The following quote from Andrew Birrell's monograph [Bir89] is especially telling:

Writing concurrent programs has a reputation for being exotic and difficult. I believe it is neither. You need a system that provides you with good primitives and suitable libraries, you need a basic caution and carefulness, you need an armory of useful techniques, and you need to know of the common pitfalls. I hope that this paper has helped you towards sharing my belief.

The authors of these older guides were well up to the parallel programming challenge back in the 1980s. As such, there are simply no excuses for refusing to step up to the parallel-programming challenge here in the 21st century!

We are now ready to proceed to the next chapter, which dives into the relevant properties of the parallel hardware underlying our parallel software.
Premature abstraction is the root of all evil.

A cast of thousands

Chapter 3

Hardware and its Habits

Most people intuitively understand that passing messages between systems is more expensive than performing simple calculations within the confines of a single system. But it is also the case that communicating among threads within the confines of a single shared-memory system can also be quite expensive. This chapter therefore looks at the cost of synchronization and communication within a shared-memory system. These few pages can do no more than scratch the surface of shared-memory parallel hardware design; readers desiring more detail would do well to start with a recent edition of Hennessy's and Patterson's classic text [HP17, HP95].

Quick Quiz 3.1: Why should parallel programmers bother learning low-level properties of the hardware? Wouldn't it be easier, better, and more elegant to remain at a higher level of abstraction?

3.1 Overview

Mechanical Sympathy: Hardware and software working together in harmony.
Martin Thompson

Careless reading of computer-system specification sheets might lead one to believe that CPU performance is a footrace on a clear track, as illustrated in Figure 3.1, where the race always goes to the swiftest.

[Figure 3.1: CPU Performance at its Best (cartoon of a "CPU Benchmark Trackmeet").]

Although there are a few CPU-bound benchmarks that approach the ideal case shown in Figure 3.1, the typical program more closely resembles an obstacle course than a race track. This is because the internal architecture of CPUs has changed dramatically over the past few decades, courtesy of Moore's Law. These changes are described in the following sections.

3.1.1 Pipelined CPUs

In the 1980s, the typical microprocessor fetched an instruction, decoded it, and executed it, typically taking at least three clock cycles to complete one instruction before even starting the next. In contrast, the CPUs of the late 1990s and of the 2000s execute many instructions simultaneously, using pipelines; superscalar techniques; out-of-order instruction and data handling; speculative execution, and more [HP17, HP11] in order to optimize the flow of instructions and data through the CPU. Some cores have more than one hardware thread, which is variously called simultaneous multithreading (SMT) or hyperthreading (HT) [Fen73], each of which appears as an independent CPU to software, at least from a functional viewpoint. These modern hardware features can greatly improve performance, as illustrated by Figure 3.2.

Achieving full performance with a CPU having a long pipeline requires highly predictable control flow through

[Figure 3.2: CPUs Old and New (cartoon contrasting a new CPU, "4.0 GHz clock, 20 MB L3 cache, 20 stage pipeline...", with an old-school CPU).]

[Figure 3.3: CPU Meets a Pipeline Flush (cartoon of a CPU derailed by a branch misprediction).]

[Figure 3.4: Rough View of Modern Micro-Architecture (two hardware threads feeding instructions through decode/translate and a micro-op scheduler into shared execution units and hundreds of registers).]

the program. Suitable control flow can be provided by a program that executes primarily in tight loops, for example, arithmetic on large matrices or vectors. The CPU can then correctly predict that the branch at the end of the loop will be taken in almost all cases, allowing the pipeline to be kept full and the CPU to execute at full speed.

However, branch prediction is not always so easy. For example, consider a program with many loops, each of which iterates a small but random number of times. For another example, consider an old-school object-oriented program with many virtual objects that can reference many different real objects, all with different implementations for frequently invoked member functions, resulting in many calls through pointers. In these cases, it is difficult or even impossible for the CPU to predict where the next branch might lead. Then either the CPU must stall waiting for execution to proceed far enough to be certain where that branch leads, or it must guess and then proceed using speculative execution. Although guessing works extremely well for programs with predictable control flow, for unpredictable branches (such as those in binary search) the guesses will frequently be wrong. A wrong guess can be expensive because the CPU must discard any speculatively executed instructions following the corresponding branch, resulting in a pipeline flush. If pipeline flushes appear too frequently, they drastically reduce overall performance, as fancifully depicted in Figure 3.3.

This gets even worse in the increasingly common case of hyperthreading (or SMT, if you prefer), especially on pipelined superscalar out-of-order CPUs featuring speculative execution. In this increasingly common case, all the hardware threads sharing a core also share that core's resources, including registers, cache, execution units, and so on. The instructions are often decoded into micro-operations, and use of the shared execution units and the hundreds of hardware registers is often coordinated by a micro-operation scheduler. A rough diagram of such a two-threaded core is shown in Figure 3.4, and more accurate (and thus more complex) diagrams are available in textbooks and scholarly papers.1 Therefore, the execution of one hardware thread can and often is perturbed by the actions of other hardware threads sharing that core.

1 Here is one example for a late-2010s Intel core: https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(server).

Even if only one hardware thread is active (for example, in old-school CPU designs where there is only one thread), counterintuitive results are quite common. Execution units often have overlapping capabilities, so that a CPU's

choice of execution unit can result in pipeline stalls due to contention for that execution unit from later instructions. In theory, this contention is avoidable, but in practice CPUs must choose very quickly and without the benefit of clairvoyance. In particular, adding an instruction to a tight loop can sometimes actually cause execution to speed up.

Unfortunately, pipeline flushes and shared-resource contention are not the only hazards in the obstacle course that modern CPUs must run. The next section covers the hazards of referencing memory.

3.1.2 Memory References

In the 1980s, it often took less time for a microprocessor to load a value from memory than it did to execute an instruction. More recently, microprocessors might execute hundreds or even thousands of instructions in the time required to access memory. This disparity is due to the fact that Moore's Law has increased CPU performance at a much greater rate than it has decreased memory latency, in part due to the rate at which memory sizes have grown. For example, a typical 1970s minicomputer might have 4 KB (yes, kilobytes, not megabytes, let alone gigabytes or terabytes) of main memory, with single-cycle access.2 Present-day CPU designers still can construct a 4 KB memory with single-cycle access, even on systems with multi-GHz clock frequencies. And in fact they frequently do construct such memories, but they now call them "level-0 caches", plus they can be quite a bit bigger than 4 KB.

2 It is only fair to add that each of these single cycles lasted no less than 1.6 microseconds.

Although the large caches found on modern microprocessors can do quite a bit to help combat memory-access latencies, these caches require highly predictable data-access patterns to successfully hide those latencies. Unfortunately, common operations such as traversing a linked list have extremely unpredictable memory-access patterns—after all, if the pattern was predictable, us software types would not bother with the pointers, right? Therefore, as shown in Figure 3.5, memory references often pose severe obstacles to modern CPUs.

[Figure 3.5: CPU Meets a Memory Reference.]

Thus far, we have only been considering obstacles that can arise during a given CPU's execution of single-threaded code. Multi-threading presents additional obstacles to the CPU, as described in the following sections.

3.1.3 Atomic Operations

One such obstacle is atomic operations. The problem here is that the whole idea of an atomic operation conflicts with the piece-at-a-time assembly-line operation of a CPU pipeline. To hardware designers' credit, modern CPUs use a number of extremely clever tricks to make such operations look atomic even though they are in fact being executed piece-at-a-time, with one common trick being to identify all the cachelines containing the data to be atomically operated on, ensure that these cachelines are owned by the CPU executing the atomic operation, and only then proceed with the atomic operation while ensuring that these cachelines remained owned by this CPU. Because all the data is private to this CPU, other CPUs are unable to interfere with the atomic operation despite the piece-at-a-time nature of the CPU's pipeline. Needless to say, this sort of trick can require that the pipeline must be delayed or even flushed in order to perform the setup operations that permit a given atomic operation to complete correctly.

In contrast, when executing a non-atomic operation, the CPU can load values from cachelines as they appear and place the results in the store buffer, without the need to wait for cacheline ownership. Although there are a number of hardware optimizations that can sometimes hide cache latencies, the resulting effect on performance is all too often as depicted in Figure 3.6.
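For example, the following minimal user-level sketch, which uses C11 atomics rather than anything from this book's CodeSamples, updates a shared counter with a single atomic read-modify-write. The hardware performs the cacheline-ownership dance described above on the counter's cacheline for the duration of the update.

#include <stdatomic.h>

static atomic_long counter;	/* Shared event counter. */

/* One indivisible read-modify-write of a single variable. */
void count_event(void)
{
	atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
}

Note that the atomicity covers only this one counter, which leads directly to the next point.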

[Figure 3.6: CPU Meets an Atomic Operation.]

Unfortunately, atomic operations usually apply only to single elements of data. Because many parallel algorithms require that ordering constraints be maintained between updates of multiple data elements, most CPUs provide memory barriers. These memory barriers also serve as performance-sapping obstacles, as described in the next section.

Quick Quiz 3.2: What types of machines would allow atomic operations on multiple data elements?

3.1.4 Memory Barriers

Memory barriers will be considered in more detail in Chapter 15 and Appendix C. In the meantime, consider the following simple lock-based critical section:

	spin_lock(&mylock);
	a = a + 1;
	spin_unlock(&mylock);

If the CPU were not constrained to execute these statements in the order shown, the effect would be that the variable "a" would be incremented without the protection of "mylock", which would certainly defeat the purpose of acquiring it. To prevent such destructive reordering, locking primitives contain either explicit or implicit memory barriers. Because the whole purpose of these memory barriers is to prevent reorderings that the CPU would otherwise undertake in order to increase performance, memory barriers almost always reduce performance, as depicted in Figure 3.7.

[Figure 3.7: CPU Meets a Memory Barrier.]

As with atomic operations, CPU designers have been working hard to reduce memory-barrier overhead, and have made substantial progress.
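To see where those implicit barriers might live, here is a toy user-level spinlock built from C11 atomics. This is only a sketch under my own assumptions, not the Linux-kernel spin_lock() used above: the acquire and release orderings supply the memory barriers that confine the increment to the critical section.

#include <stdatomic.h>

static atomic_flag mylock = ATOMIC_FLAG_INIT;
static long a;

void toy_spin_lock(void)
{
	/* Acquire ordering: later accesses cannot move before this. */
	while (atomic_flag_test_and_set_explicit(&mylock, memory_order_acquire))
		continue;	/* Spin until the flag was previously clear. */
}

void toy_spin_unlock(void)
{
	/* Release ordering: earlier accesses cannot move after this. */
	atomic_flag_clear_explicit(&mylock, memory_order_release);
}

void increment_a(void)
{
	toy_spin_lock();
	a = a + 1;	/* Stays confined to the critical section. */
	toy_spin_unlock();
}

On many CPUs the compiler lowers those orderings to explicit barrier or fenced instructions, which is exactly the overhead that Figure 3.7 laments.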

3.1.5 Thermal Throttling

One increasingly common frustrating experience is to carefully micro-optimize a critical code path, greatly reducing the number of clock cycles consumed by that code path, only to find that the wall-clock time consumed by that code has actually increased.

Welcome to modern thermal throttling.

If you reduced the number of clock cycles by making more effective use of the CPU's functional units, you will have increased the power consumed by that CPU. This will in turn increase the amount of heat dissipated by that CPU. If this heat dissipation exceeds the cooling system's capacity, the system will thermally throttle that CPU, for example, by reducing its clock frequency, as fancifully depicted by the snow penguin in Figure 3.8.

[Figure 3.8: CPU Encounters Thermal Throttling.]

If performance is of the essence, the proper fix is improved cooling, an approach loved by serious gamers and by overclockers.3 But if you cannot modify your computer's cooling system, perhaps because you are renting it from a cloud provider, then you will need to take some other optimization approach. For example, you might need to apply algorithmic optimizations instead of hardware-centric micro-optimizations. Alternatively, perhaps you can parallelize your code, spreading the work (and thus the heat) over multiple CPU cores.

3 Some of whom make good use of liquid nitrogen.

3.1.6 Cache Misses

An additional multi-threading obstacle to CPU performance is the "cache miss". As noted earlier, modern CPUs sport large caches in order to reduce the performance penalty that would otherwise be incurred due to high memory latencies. However, these caches are actually counter-productive for variables that are frequently shared among CPUs. This is because when a given CPU wishes to modify the variable, it is most likely the case that some other CPU has modified it recently. In this case, the variable will be in that other CPU's cache, but not in this CPU's cache, which will therefore incur an expensive cache miss (see Appendix C.1 for more detail). Such cache misses form a major obstacle to CPU performance, as shown in Figure 3.9.

[Figure 3.9: CPU Meets a Cache Miss (cartoon of a "cache-miss toll booth").]

Quick Quiz 3.3: So have CPU designers also greatly reduced the overhead of cache misses?

3.1.7 I/O Operations

A cache miss can be thought of as a CPU-to-CPU I/O operation, and as such is one of the cheapest I/O operations available. I/O operations involving networking, mass storage, or (worse yet) human beings pose much greater obstacles than the internal obstacles called out in the prior sections, as illustrated by Figure 3.10.

[Figure 3.10: CPU Waits for I/O Completion ("Please stay on the line. Your call is very important to us...").]

This is one of the differences between shared-memory and distributed-system parallelism: Shared-memory parallel programs must normally deal with no obstacle worse than a cache miss, while a distributed parallel program will typically incur the larger network communication latencies. In both cases, the relevant latencies can be thought of as a cost of communication—a cost that would be absent in a sequential program. Therefore, the ratio between the overhead of the communication to that of the actual work being performed is a key design parameter. A major goal of parallel hardware design is to reduce this ratio as needed to achieve the relevant performance and scalability goals. In turn, as will be seen in Chapter 6, a major goal of parallel software design is to reduce the frequency of expensive operations like communications cache misses.

Of course, it is one thing to say that a given operation is an obstacle, and quite another to show that the operation is a significant obstacle. This distinction is discussed in the following sections.

[Figure 3.11: System Hardware Architecture (eight CPUs in pairs, each pair sharing a per-die cache and interconnect, with a system interconnect joining the four dies and memory; the inset notes the speed-of-light round-trip distance in vacuum for a 1.8 GHz clock period, 8 cm).]

3.2 Overheads

Don't design bridges in ignorance of materials, and don't design low-level software in ignorance of the underlying hardware.
Unknown

This section presents actual overheads of the obstacles to performance listed out in the previous section. However, it is first necessary to get a rough view of hardware system architecture, which is the subject of the next section.

3.2.1 Hardware System Architecture

Figure 3.11 shows a rough schematic of an eight-core computer system. Each die has a pair of CPU cores, each with its cache, as well as an interconnect allowing the pair of CPUs to communicate with each other. The system interconnect allows the four dies to communicate with each other and with main memory.

Data moves through this system in units of "cache lines", which are power-of-two fixed-size aligned blocks of memory, usually ranging from 32 to 256 bytes in size. When a CPU loads a variable from memory to one of its registers, it must first load the cacheline containing that variable into its cache. Similarly, when a CPU stores a value from one of its registers into memory, it must also load the cacheline containing that variable into its cache, but must also ensure that no other CPU has a copy of that cacheline.

For example, if CPU 0 were to write to a variable whose cacheline resided in CPU 7's cache, the following over-simplified sequence of events might ensue:

1. CPU 0 checks its local cache, and does not find the cacheline. It therefore records the write in its store buffer.

2. A request for this cacheline is forwarded to CPU 0's and 1's interconnect, which checks CPU 1's local cache, and does not find the cacheline.

3. This request is forwarded to the system interconnect, which checks with the other three dies, learning that the cacheline is held by the die containing CPU 6 and 7.

4. This request is forwarded to CPU 6's and 7's interconnect, which checks both CPUs' caches, finding the value in CPU 7's cache.

5. CPU 7 forwards the cacheline to its interconnect, and also flushes the cacheline from its cache.

6. CPU 6's and 7's interconnect forwards the cacheline to the system interconnect.

7. The system interconnect forwards the cacheline to CPU 0's and 1's interconnect.

8. CPU 0's and 1's interconnect forwards the cacheline to CPU 0's cache.

9. CPU 0 can now complete the write, updating the relevant portions of the newly arrived cacheline from the value previously recorded in the store buffer.

Quick Quiz 3.4: This is a simplified sequence of events? How could it possibly be any more complex?

Quick Quiz 3.5: Why is it necessary to flush the cacheline from CPU 7's cache?

This simplified sequence is just the beginning of a discipline called cache-coherency protocols [HP95, CSG99, MHS12, SHW11], which is discussed in more detail in Appendix C. As can be seen in the sequence of events triggered by a CAS operation, a single instruction can cause considerable protocol traffic, which can significantly degrade your parallel program's performance.

Fortunately, if a given variable is being frequently read during a time interval during which it is never updated, that variable can be replicated across all CPUs' caches. This replication permits all CPUs to enjoy extremely fast access to this read-mostly variable. Chapter 9 presents synchronization mechanisms that take full advantage of this important hardware read-mostly optimization.
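The flip side is that frequently written variables that merely share a cache line will drag that line back and forth between caches in exactly the manner of the sequence above. One common countermeasure, sketched here under the assumptions of 64-byte cache lines (as in Table 3.2) and a hypothetical thread count, is to give each thread's hot data its own cache line:

#include <stdalign.h>

#define CACHE_LINE_SIZE 64	/* Assumed line size. */
#define NR_THREADS 4		/* Hypothetical thread count. */

/* One counter per thread, each padded out to its own cache line so
 * that updates by different threads do not bounce a shared line
 * between caches. */
struct percpu_counter {
	alignas(CACHE_LINE_SIZE) unsigned long count;
};

static struct percpu_counter counters[NR_THREADS];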

This replication permits all CPUs to enjoy extremely fast access to this read-mostly
variable. Chapter 9 presents synchronization mechanisms that take full advantage of this
important hardware read-mostly optimization.

3.2.2 Costs of Operations

The overheads of some common operations important to parallel programs are displayed in
Table 3.1. This system’s clock period rounds to 0.5 ns. Although it is not unusual for
modern microprocessors to be able to retire multiple instructions per clock period, the
operations’ costs are nevertheless normalized to a clock period in the third column,
labeled “Ratio”. The first thing to note about this table is the large values of many of
the ratios.

Table 3.1: CPU 0 View of Synchronization Mechanisms on 8-Socket System With Intel Xeon
Platinum 8176 CPUs @ 2.10 GHz

  Operation             Cost (ns)   Ratio (cost/clock)   CPUs
  Clock period                0.5          1.0
  Same-CPU                                                0
    CAS                       7.0         14.6
    lock                     15.4         32.3
  On-Core                                                 224
    Blind CAS                 7.2         15.2
    CAS                      18.0         37.7
  Off-Core                                                1–27, 225–251
    Blind CAS                47.5         99.8
    CAS                     101.9        214.0
  Off-Socket                                              28–111, 252–335
    Blind CAS               148.8        312.5
    CAS                     442.9        930.1
  Cross-Interconnect                                      112–223, 336–447
    Blind CAS               336.6        706.8
    CAS                     944.8      1,984.2
  Off-System
    Comms Fabric            5,000       10,500
    Global Comms      195,000,000  409,500,000

The same-CPU compare-and-swap (CAS) operation consumes about seven nanoseconds, a
duration more than ten times that of the clock period. CAS is an atomic operation in
which the hardware compares the contents of the specified memory location to a specified
“old” value, and if they compare equal, stores a specified “new” value, in which case the
CAS operation succeeds. If they compare unequal, the memory location keeps its
(unexpected) value, and the CAS operation fails. The operation is atomic in that the
hardware guarantees that the memory location will not be changed between the compare and
the store. CAS functionality is provided by the lock;cmpxchg instruction on x86.

The “same-CPU” prefix means that the CPU now performing the CAS operation on a given
variable was also the last CPU to access this variable, so that the corresponding
cacheline is already held in that CPU’s cache. Similarly, the same-CPU lock operation (a
“round trip” pair consisting of a lock acquisition and release) consumes more than
fifteen nanoseconds, or more than thirty clock cycles. The lock operation is more
expensive than CAS because it requires two atomic operations on the lock data structure,
one for acquisition and the other for release.

On-core operations involving interactions between the hardware threads sharing a single
core are about the same cost as same-CPU operations. This should not be too surprising,
given that these two hardware threads also share the full cache hierarchy.

In the case of the blind CAS, the software specifies the old value without looking at the
memory location. This approach is appropriate when attempting to acquire a lock. If the
unlocked state is represented by zero and the locked state is represented by the value
one, then a CAS operation on the lock that specifies zero for the old value and one for
the new value will acquire the lock if it is not already held. The key point is that
there is only one access to the memory location, namely the CAS operation itself.

In contrast, a normal CAS operation’s old value is derived from some earlier load. For
example, to implement an atomic increment, the current value of that location is loaded
and that value is incremented to produce the new value. Then in the CAS operation, the
value actually loaded would be specified as the old value and the incremented value as
the new value. If the value had not been changed between the load and the CAS, this would
increment the memory location. However, if the value had in fact changed, then the old
value would not match, causing a miscompare that would result in the CAS operation
failing. The key point is that there are now two accesses to the memory location, the
load and the CAS.

Thus, it is not surprising that on-core blind CAS consumes only about seven nanoseconds,
while on-core CAS consumes about 18 nanoseconds. The non-blind case’s extra load does not
come for free. That said, the overheads of these operations are similar to same-CPU CAS
and lock, respectively.
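To make the blind-versus-value-returning distinction concrete, the following sketch uses
GCC’s __sync_bool_compare_and_swap() builtin (introduced in Section 4.2.5). It
illustrates the two usage patterns described above and is not taken from the book’s
CodeSamples.

/* Blind CAS: try to acquire a simple spinlock.  The "old" value
 * (zero, meaning unlocked) is known without loading the lock word,
 * so the CAS is the only access to the memory location. */
static int trylock(int *lck)
{
	return __sync_bool_compare_and_swap(lck, 0, 1);
}

/* Value-returning pattern: atomic increment built from a load plus
 * a CAS, retrying if some other CPU changed the value in between.
 * The load is a second access to the memory location, which is why
 * this pattern costs more than the blind CAS above. */
static void atomic_inc(unsigned long *ctr)
{
	unsigned long old;

	do {
		old = *ctr;                 /* the earlier load */
	} while (!__sync_bool_compare_and_swap(ctr, old, old + 1));
}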


Quick Quiz 3.6: Table 3.1 shows CPU 0 sharing a core with CPU 224. Shouldn’t that instead
be CPU 1???

A blind CAS involving CPUs in different cores but on the same socket consumes almost
fifty nanoseconds, or almost one hundred clock cycles. The code used for this cache-miss
measurement passes the cache line back and forth between a pair of CPUs, so this cache
miss is satisfied not from memory, but rather from the other CPU’s cache. A non-blind CAS
operation, which as noted earlier must look at the old value of the variable as well as
store a new value, consumes over one hundred nanoseconds, or more than two hundred clock
cycles. Think about this a bit. In the time required to do one CAS operation, the CPU
could have executed more than two hundred normal instructions. This should demonstrate
the limitations not only of fine-grained locking, but of any other synchronization
mechanism relying on fine-grained global agreement.

If the pair of CPUs are on different sockets, the operations are considerably more
expensive. A blind CAS operation consumes almost 150 nanoseconds, or more than three
hundred clock cycles. A normal CAS operation consumes more than 400 nanoseconds, or
almost one thousand clock cycles.

Worse yet, not all pairs of sockets are created equal. This particular system appears to
be constructed as a pair of four-socket components, with additional latency penalties
when the CPUs reside in different components. In this case, a blind CAS operation
consumes more than three hundred nanoseconds, or more than seven hundred clock cycles. A
CAS operation consumes almost a full microsecond, or almost two thousand clock cycles.

Quick Quiz 3.7: Surely the hardware designers could be persuaded to improve this
situation! Why have they been content with such abysmal performance for these
single-instruction operations?

Quick Quiz 3.8: Table E.1 in the answer to Quick Quiz 3.7 on page 480 says that on-core
CAS is faster than both of same-CPU CAS and on-core blind CAS. What is happening there?

Table 3.2: Cache Geometry for 8-Socket System With Intel Xeon Platinum 8176 CPUs @
2.10 GHz

  Level   Scope    Line Size    Sets   Ways      Size
  L0      Core            64      64      8       32K
  L1      Core            64      64      8       32K
  L2      Core            64    1024     16     1024K
  L3      Socket          64  57,344     11   39,424K

Unfortunately, the high speed of within-core and within-socket communication does not
come for free. First, there are only two CPUs within a given core and only 56 within a
given socket, compared to 448 across the system. Second, as shown in Table 3.2, the
on-core caches are quite small compared to the on-socket caches, which are in turn quite
small compared to the 1.4 TB of memory configured on this system. Third, again referring
to Table 3.2, the caches are organized as a hardware hash table with a limited number of
items per bucket. For example, the raw size of the L3 cache (“Size”) is almost 40 MB, but
each bucket (“Line”) can only hold 11 blocks of memory (“Ways”), each of which can be at
most 64 bytes (“Line Size”). This means that only 12 bytes of memory (admittedly at
carefully chosen addresses) are required to overflow this 40 MB cache. On the other hand,
equally careful choice of addresses might make good use of the entire 40 MB. Spatial
locality of reference is clearly extremely important, as is spreading the data across
memory.
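To make the hash-table analogy concrete, the following sketch computes which L3 set
(bucket) a given address falls into, assuming the geometry of Table 3.2 (64-byte lines,
57,344 sets, 11 ways) and a simple modulo indexing scheme. Real hardware may hash the
index bits, so this is an illustration rather than a description of any particular CPU.

#include <stdint.h>
#include <stdio.h>

#define L3_LINE_SIZE 64      /* bytes per cacheline (Table 3.2) */
#define L3_SETS      57344   /* sets ("buckets") in the L3 cache */
#define L3_WAYS      11      /* cachelines per set */

/* Simple modulo set index; real CPUs may hash these bits. */
static unsigned long l3_set_index(uintptr_t addr)
{
	return (addr / L3_LINE_SIZE) % L3_SETS;
}

int main(void)
{
	uintptr_t base = 0x100000UL;
	int i;

	/* Twelve addresses exactly L3_SETS cachelines apart all map to
	 * the same set, overflowing its 11 ways even though only twelve
	 * bytes of data are touched. */
	for (i = 0; i <= L3_WAYS; i++) {
		uintptr_t a = base + (uintptr_t)i * L3_LINE_SIZE * L3_SETS;
		printf("addr %#lx -> set %lu\n",
		       (unsigned long)a, l3_set_index(a));
	}
	return 0;
}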


I/O operations are even more expensive. As shown in the “Comms Fabric” row, high
performance (and expensive!) communications fabric, such as InfiniBand or any number of
proprietary interconnects, has a latency of roughly five microseconds for an end-to-end
round trip, during which time more than ten thousand instructions might have been
executed. Standards-based communications networks often require some sort of protocol
processing, which further increases the latency. Of course, geographic distance also
increases latency, with the speed-of-light through optical fiber latency around the world
coming to roughly 195 milliseconds, or more than 400 million clock cycles, as shown in
the “Global Comms” row.

Quick Quiz 3.9: These numbers are insanely large! How can I possibly get my head around
them?

3.2.3 Hardware Optimizations

It is only natural to ask how the hardware is helping, and the answer is “Quite a bit!”

One hardware optimization is large cachelines. This can provide a big performance boost,
especially when software is accessing memory sequentially. For example, given a 64-byte
cacheline and software accessing 64-bit variables, the first access will still be slow
due to speed-of-light delays (if nothing else), but the remaining seven can be quite
fast. However, this optimization has a dark side, namely false sharing, which happens
when different variables in the same cacheline are being updated by different CPUs,
resulting in a high cache-miss rate. Software can use the alignment directives available
in many compilers to avoid false sharing, and adding such directives is a common step in
tuning parallel software.
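As a rough illustration of such alignment directives, the following sketch pads
per-thread counters out to separate cachelines using a GCC-style aligned attribute. The
64-byte line size and the attribute syntax are assumptions (C11 alignas() or the Linux
kernel’s ____cacheline_aligned serve the same purpose); measure before and after on your
own hardware to see the effect.

#include <pthread.h>

#define CACHE_LINE_SIZE 64  /* assumed cacheline size */

/* Unpadded: counters for different threads share cachelines,
 * so concurrent increments cause false sharing. */
struct counters_bad {
	unsigned long count[8];
};

/* Padded: each counter gets its own cacheline, so each CPU
 * can keep "its" line in its cache indefinitely. */
struct counter_pad {
	unsigned long count;
} __attribute__((__aligned__(CACHE_LINE_SIZE)));

struct counters_good {
	struct counter_pad count[8];
};

static struct counters_good counters;

/* Each thread increments only its own element. */
void *incrementer(void *arg)
{
	int me = (int)(long)arg;
	long i;

	for (i = 0; i < 100000000L; i++)
		counters.count[me].count++;
	return NULL;
}

With the padded layout, each CPU’s counter stays resident in that CPU’s cache; with the
unpadded layout, the very same increments force the shared cachelines to ping-pong among
the CPUs.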


A second related hardware optimization is cache prefetching, in which the hardware reacts
to consecutive accesses by prefetching subsequent cachelines, thereby evading
speed-of-light delays for these subsequent cachelines. Of course, the hardware must use
simple heuristics to determine when to prefetch, and these heuristics can be fooled by
the complex data-access patterns in many applications. Fortunately, some CPU families
allow for this by providing special prefetch instructions. Unfortunately, the
effectiveness of these instructions in the general case is subject to some dispute.

A third hardware optimization is the store buffer, which allows a string of store
instructions to execute quickly even when the stores are to non-consecutive addresses and
when none of the needed cachelines are present in the CPU’s cache. The dark side of this
optimization is memory misordering, for which see Chapter 15.

A fourth hardware optimization is speculative execution, which can allow the hardware to
make good use of the store buffers without resulting in memory misordering. The dark side
of this optimization can be energy inefficiency and lowered performance if the
speculative execution goes awry and must be rolled back and retried. Worse yet, the
advent of Spectre and Meltdown [Hor18] made it apparent that hardware speculation can
also enable side-channel attacks that defeat memory-protection hardware so as to allow
unprivileged processes to read memory that they should not have access to. It is clear
that the combination of speculative execution and cloud computing needs more than a bit
of rework!

A fifth hardware optimization is large caches, allowing individual CPUs to operate on
larger datasets without incurring expensive cache misses. Although large caches can
degrade energy efficiency and cache-miss latency, the ever-growing cache sizes on
production microprocessors attest to the power of this optimization.

A final hardware optimization is read-mostly replication, in which data that is
frequently read but rarely updated is present in all CPUs’ caches. This optimization
allows the read-mostly data to be accessed exceedingly efficiently, and is the subject of
Chapter 9.

In short, hardware and software engineers are really on the same side, with both trying
to make computers go fast despite the best efforts of the laws of physics, as fancifully
depicted in Figure 3.12 where our data stream is trying its best to exceed the speed of
light. The next section discusses some additional things that the hardware engineers
might (or might not) be able to do, depending on how well recent research translates to
practice. Software’s contribution to this noble goal is outlined in the remaining
chapters of this book.

Figure 3.12: Hardware and Software: On Same Side

3.3 Hardware Free Lunch?

The great trouble today is that there are too many people looking for someone else to do
something for them. The solution to most of our troubles is to be found in everyone doing
something for themselves.

Henry Ford, updated

The major reason that concurrency has been receiving so much focus over the past few
years is the end of Moore’s-Law induced single-threaded performance increases (or “free
lunch” [Sut08]), as shown in Figure 2.1 on page 9. This section briefly surveys a few
ways that hardware designers might bring back the “free lunch”.

However, the preceding section presented some substantial hardware obstacles to
exploiting concurrency. One severe physical limitation that hardware designers face is
the finite speed of light. As noted in Figure 3.11 on page 22, light can manage only
about an 8-centimeter round trip in a vacuum during the duration of a 1.8 GHz clock
period. This distance drops to about 3 centimeters for a 5 GHz clock. Both of these
distances are relatively small compared to the size of a modern computer system.

To make matters even worse, electric waves in silicon move from three to thirty times
more slowly than does light in a vacuum, and common clocked logic constructs run still
more slowly, for example, a memory reference may need to wait for a local cache lookup to
complete before the request may be passed on to the rest of the system. Furthermore,
relatively low speed and high power drivers are required to move electrical signals from
one silicon die to another, for example, to communicate between a CPU and main memory.

Quick Quiz 3.10: But individual electrons don’t move anywhere near that fast, even in
conductors!!! The electron drift velocity in a conductor under semiconductor voltage
levels is on the order of only one millimeter per second. What gives???

There are nevertheless some technologies (both hardware and software) that might help
improve matters:

1. 3D integration,

2. Novel materials and processes,

3. Substituting light for electricity,

4. Special-purpose accelerators, and

5. Existing parallel software.

Each of these is described in one of the following sections.

3.3.1 3D Integration

3-dimensional integration (3DI) is the practice of bonding very thin silicon dies to each
other in a vertical stack. This practice provides potential benefits, but also poses
significant fabrication challenges [Kni08].

Perhaps the most important benefit of 3DI is decreased path length through the system, as
shown in Figure 3.13. A 3-centimeter silicon die is replaced with a stack of four
1.5-centimeter dies, in theory decreasing the maximum path through the system by a factor
of two, keeping in mind that each layer is quite thin. In addition, given proper
attention to design and placement, long horizontal electrical connections (which are both
slow and power hungry) can be replaced by short vertical electrical connections, which
are both faster and more power efficient.

Figure 3.13: Latency Benefit of 3D Integration

However, delays due to levels of clocked logic will not be decreased by 3D integration,
and significant manufacturing, testing, power-supply, and heat-dissipation problems must
be solved for 3D integration to reach production while still delivering on its promise.
The heat-dissipation problems might be solved using semiconductors based on diamond,
which is a good conductor for heat, but an electrical insulator. That said, it remains
difficult to grow large single diamond crystals, to say nothing of slicing them into
wafers. In addition, it seems unlikely that any of these technologies will be able to
deliver the exponential increases to which some people have become accustomed. That said,
they may be necessary steps on the path to the late Jim Gray’s “smoking hairy golf
balls” [Gra02].

3.3.2 Novel Materials and Processes

Stephen Hawking is said to have claimed that semiconductor manufacturers have but two
fundamental problems: (1) The finite speed of light and (2) The atomic nature of
matter [Gar07]. It is possible that semiconductor manufacturers are approaching these
limits, but there are nevertheless a few avenues of research and development focused on
working around these fundamental limits.

One workaround for the atomic nature of matter is so-called “high-K dielectric”
materials, which allow larger devices to mimic the electrical properties of infeasibly
small devices. These materials pose some severe fabrication challenges, but nevertheless
may help push the frontiers out a bit farther. Another more-exotic workaround stores
multiple bits in a single electron, relying on the fact that a given electron can exist
at a number of energy levels. It remains to be seen if this particular approach can be
made to work reliably in production semiconductor devices.

Another proposed workaround is the “quantum dot” approach that allows much smaller device
sizes, but which is still in the research stage.

One challenge is that many recent hardware-device-level breakthroughs require very tight
control of which atoms are placed where [Kel17]. It therefore seems likely that whoever
finds a good way to hand-place atoms on each of the billions of devices on a chip will
have most excellent bragging rights, if nothing else!


3.3.3 Light, Not Electrons

Although the speed of light would be a hard limit, the fact is that semiconductor devices
are limited by the speed of electricity rather than that of light, given that electric
waves in semiconductor materials move at between 3 % and 30 % of the speed of light in a
vacuum. The use of copper connections on silicon devices is one way to increase the speed
of electricity, and it is quite possible that additional advances will push closer still
to the actual speed of light. In addition, there have been some experiments with tiny
optical fibers as interconnects within and between chips, based on the fact that the
speed of light in glass is more than 60 % of the speed of light in a vacuum. One obstacle
to such optical fibers is the inefficient conversion between electricity and light and
vice versa, resulting in both power-consumption and heat-dissipation problems.

That said, absent some fundamental advances in the field of physics, any exponential
increases in the speed of data flow will be sharply limited by the actual speed of light
in a vacuum.

3.3.4 Special-Purpose Accelerators

A general-purpose CPU working on a specialized problem is often spending significant time
and energy doing work that is only tangentially related to the problem at hand. For
example, when taking the dot product of a pair of vectors, a general-purpose CPU will
normally use a loop (possibly unrolled) with a loop counter. Decoding the instructions,
incrementing the loop counter, testing this counter, and branching back to the top of the
loop are in some sense wasted effort: The real goal is instead to multiply corresponding
elements of the two vectors. Therefore, a specialized piece of hardware designed
specifically to multiply vectors could get the job done more quickly and with less energy
consumed.

This is in fact the motivation for the vector instructions present in many commodity
microprocessors. Because these instructions operate on multiple data items
simultaneously, they would permit a dot product to be computed with less
instruction-decode and loop overhead.

Similarly, specialized hardware can more efficiently encrypt and decrypt, compress and
decompress, encode and decode, and many other tasks besides. Unfortunately, this
efficiency does not come for free. A computer system incorporating this specialized
hardware will contain more transistors, which will consume some power even when not in
use. Software must be modified to take advantage of this specialized hardware, and this
specialized hardware must be sufficiently generally useful that the high up-front
hardware-design costs can be spread over enough users to make the specialized hardware
affordable. In part due to these sorts of economic considerations, specialized hardware
has thus far appeared only for a few application areas, including graphics processing
(GPUs), vector processors (MMX, SSE, and VMX instructions), and, to a lesser extent,
encryption. And even in these areas, it is not always easy to realize the expected
performance gains, for example, due to thermal throttling [Kra17, Lem18, Dow20].

Unlike the server and PC arena, smartphones have long used a wide variety of hardware
accelerators. These hardware accelerators are often used for media decoding, so much so
that a high-end MP3 player might be able to play audio for several minutes—with its CPU
fully powered off the entire time. The purpose of these accelerators is to improve energy
efficiency and thus extend battery life: Special purpose hardware can often compute more
efficiently than can a general-purpose CPU. This is another example of the principle
called out in Section 2.2.3: Generality is almost never free.

Nevertheless, given the end of Moore’s-Law-induced single-threaded performance increases,
it seems safe to assume that increasing varieties of special-purpose hardware will
appear.

3.3.5 Existing Parallel Software

Although multicore CPUs seem to have taken the computing industry by surprise, the fact
remains that shared-memory parallel computer systems have been commercially available for
more than a quarter century. This is more than enough time for significant parallel
software to make its appearance, and it indeed has. Parallel operating systems are quite
commonplace, as are parallel threading libraries, parallel relational database management
systems, and parallel numerical software. Use of existing parallel software can go a long
ways towards solving any parallel-software crisis we might encounter.

Perhaps the most common example is the parallel relational database management system. It
is not unusual for single-threaded programs, often written in high-level scripting
languages, to access a central relational database concurrently. In the resulting highly
parallel system, only the database need actually deal directly with parallelism. A very
nice trick when it works!


3.4 Software Design Implications

One ship drives east and another west
While the self-same breezes blow;
’Tis the set of the sail and not the gale
That bids them where to go.

Ella Wheeler Wilcox

The values of the ratios in Table 3.1 are critically important, as they limit the
efficiency of a given parallel application. To see this, suppose that the parallel
application uses CAS operations to communicate among threads. These CAS operations will
typically involve a cache miss, that is, assuming that the threads are communicating
primarily with each other rather than with themselves. Suppose further that the unit of
work corresponding to each CAS communication operation takes 300 ns, which is sufficient
time to compute several floating-point transcendental functions. Then about half of the
execution time will be consumed by the CAS communication operations! This in turn means
that a two-CPU system running such a parallel program would run no faster than a
sequential implementation running on a single CPU.

The situation is even worse in the distributed-system case, where the latency of a single
communications operation might take as long as thousands or even millions of
floating-point operations. This illustrates how important it is for communications
operations to be extremely infrequent and to enable very large quantities of processing.

Quick Quiz 3.11: Given that distributed-systems communication is so horribly expensive,
why does anyone bother with such systems?

The lesson should be quite clear: Parallel algorithms must be explicitly designed with
these hardware properties firmly in mind. One approach is to run nearly independent
threads. The less frequently the threads communicate, whether by atomic operations,
locks, or explicit messages, the better the application’s performance and scalability
will be. This approach will be touched on in Chapter 5, explored in Chapter 6, and taken
to its logical extreme in Chapter 8.

Another approach is to make sure that any sharing be read-mostly, which allows the CPUs’
caches to replicate the read-mostly data, in turn allowing all CPUs fast access. This
approach is touched on in Section 5.2.4, and explored more deeply in Chapter 9.

In short, achieving excellent parallel performance and scalability means striving for
embarrassingly parallel algorithms and implementations, whether by careful choice of data
structures and algorithms, use of existing parallel applications and environments, or
transforming the problem into an embarrassingly parallel form.

Quick Quiz 3.12: OK, if we are going to have to apply distributed-programming techniques
to shared-memory parallel programs, why not just always use these distributed techniques
and dispense with shared memory?

So, to sum up:

1. The good news is that multicore systems are inexpensive and readily available.

2. More good news: The overhead of many synchronization operations is much lower than it
   was on parallel systems from the early 2000s.

3. The bad news is that the overhead of cache misses is still high, especially on large
   systems.

The remainder of this book describes ways of handling this bad news.

In particular, Chapter 4 will cover some of the low-level tools used for parallel
programming, Chapter 5 will investigate problems and solutions to parallel counting, and
Chapter 6 will discuss design disciplines that promote performance and scalability.
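Before moving on, here is a small sketch (not from the book’s CodeSamples; the batch
size, the __thread per-thread variable, and the __sync_fetch_and_add() primitive used
here are introduced in Chapter 4) of what “handling this bad news” can look like: each
thread does nearly all of its work on thread-local data, touching shared state only once
per large batch, so that the expensive cache misses of Table 3.1 are amortized over many
units of work.

#include <pthread.h>

#define BATCH_SHIFT 20  /* combine once per roughly a million units */

static unsigned long total;         /* shared, updated rarely */
static __thread unsigned long mine; /* per-thread, updated constantly */

static void do_unit_of_work(void)
{
	/* ... purely thread-local computation ... */
}

void *worker(void *arg)
{
	unsigned long n = *(unsigned long *)arg;
	unsigned long i;

	for (i = 0; i < n; i++) {
		do_unit_of_work();
		mine++;
		/* Communicate only once per batch, amortizing the
		 * cache-miss cost over many units of work. */
		if (!(mine & ((1UL << BATCH_SHIFT) - 1)))
			__sync_fetch_and_add(&total, 1UL << BATCH_SHIFT);
	}
	/* Flush the remainder before exiting. */
	__sync_fetch_and_add(&total, mine & ((1UL << BATCH_SHIFT) - 1));
	return NULL;
}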

Chapter 4

Tools of the Trade

You are only as good as your tools, and your tools
are only as good as you are.

Unknown

This chapter provides a brief introduction to some basic tools of the
parallel-programming trade, focusing mainly on those available to user applications
running on operating systems similar to Linux. Section 4.1 begins with scripting
languages, Section 4.2 describes the multi-process parallelism supported by the POSIX API
and touches on POSIX threads, Section 4.3 presents analogous operations in other
environments, and finally, Section 4.4 helps to choose the tool that will get the job
done.

Quick Quiz 4.1: You call these tools??? They look more like low-level synchronization
primitives to me!

Please note that this chapter provides but a brief introduction. More detail is available
from the references (and from the Internet), and more information will be provided in
later chapters.

4.1 Scripting Languages

The supreme excellence is simplicity.

Henry Wadsworth Longfellow, simplified

The Linux shell scripting languages provide simple but effective ways of managing
parallelism. For example, suppose that you had a program compute_it that you needed to
run twice with two different sets of arguments. This can be accomplished using UNIX shell
scripting as follows:

1 compute_it 1 > compute_it.1.out &
2 compute_it 2 > compute_it.2.out &
3 wait
4 cat compute_it.1.out
5 cat compute_it.2.out

Lines 1 and 2 launch two instances of this program, redirecting their output to two
separate files, with the & character directing the shell to run the two instances of the
program in the background. Line 3 waits for both instances to complete, and lines 4 and 5
display their output. The resulting execution is as shown in Figure 4.1: The two
instances of compute_it execute in parallel, wait completes after both of them do, and
then the two instances of cat execute sequentially.

Figure 4.1: Execution Diagram for Parallel Shell Execution

Quick Quiz 4.2: But this silly shell script isn’t a real parallel program! Why bother
with such trivia???

Quick Quiz 4.3: Is there a simpler way to create a parallel shell script? If so, how? If
not, why not?

For another example, the make software-build scripting language provides a -j option that
specifies how much parallelism should be introduced into the build process. Thus, typing
make -j4 when building a Linux kernel specifies that up to four build steps be executed
concurrently.


It is hoped that these simple examples convince you Listing 4.1: Using the fork() Primitive
that parallel programming need not always be complex or 1 pid = fork();
2 if (pid == 0) {
difficult. 3 /* child */
4 } else if (pid < 0) {
Quick Quiz 4.4: But if script-based parallel programming is 5 /* parent, upon error */
so easy, why bother with anything else? 6 perror("fork");
7 exit(EXIT_FAILURE);
8 } else {
9 /* parent, pid == child ID */
10 }
4.2 POSIX Multiprocessing
Listing 4.2: Using the wait() Primitive
1 static __inline__ void waitall(void)
A camel is a horse designed by committee. 2 {
3 int pid;
Unknown 4 int status;
5
6 for (;;) {
This section scratches the surface of the POSIX environ- 7 pid = wait(&status);
ment, including pthreads [Ope97], as this environment is 8 if (pid == -1) {
9 if (errno == ECHILD)
readily available and widely implemented. Section 4.2.1 10 break;
provides a glimpse of the POSIX fork() and related 11 perror("wait");
12 exit(EXIT_FAILURE);
primitives, Section 4.2.2 touches on thread creation and 13 }
destruction, Section 4.2.3 gives a brief overview of POSIX 14 }
15 }
locking, and, finally, Section 4.2.4 describes a specific
lock which can be used for data that is read by many
threads and only occasionally updated. noted earlier, the child may terminate via the exit()
primitive. Otherwise, this is the parent, which checks for
4.2.1 POSIX Process Creation and Destruc- an error return from the fork() primitive on line 4, and
tion prints an error and exits on lines 5–7 if so. Otherwise,
the fork() has executed successfully, and the parent
Processes are created using the fork() primitive, they therefore executes line 9 with the variable pid containing
may be destroyed using the kill() primitive, they may the process ID of the child.
destroy themselves using the exit() primitive. A process The parent process may use the wait() primitive to
executing a fork() primitive is said to be the “parent” wait for its children to complete. However, use of this
of the newly created process. A parent may wait on its primitive is a bit more complicated than its shell-script
children using the wait() primitive. counterpart, as each invocation of wait() waits for but one
Please note that the examples in this section are quite child process. It is therefore customary to wrap wait()
simple. Real-world applications using these primitives into a function similar to the waitall() function shown
might need to manipulate signals, file descriptors, shared in Listing 4.2 (api-pthreads.h), with this waitall()
memory segments, and any number of other resources. In function having semantics similar to the shell-script wait
addition, some applications need to take specific actions command. Each pass through the loop spanning lines 6–14
if a given child terminates, and might also need to be waits on one child process. Line 7 invokes the wait()
concerned with the reason that the child terminated. These primitive, which blocks until a child process exits, and
issues can of course add substantial complexity to the code. returns that child’s process ID. If the process ID is instead
For more information, see any of a number of textbooks −1, this indicates that the wait() primitive was unable to
on the subject [Ste92, Wei13]. wait on a child. If so, line 9 checks for the ECHILD errno,
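For cases where the parent does need to know why a child terminated, a hedged sketch (not
from the book’s code) can use the standard waitpid() primitive together with the POSIX
status macros:

#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Wait for one specific child and report how it terminated. */
static void reap_child(pid_t pid)
{
	int status;

	if (waitpid(pid, &status, 0) == -1) {
		perror("waitpid");
		return;
	}
	if (WIFEXITED(status))
		printf("child %ld exited with status %d\n",
		       (long)pid, WEXITSTATUS(status));
	else if (WIFSIGNALED(status))
		printf("child %ld killed by signal %d\n",
		       (long)pid, WTERMSIG(status));
}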
If fork() succeeds, it returns twice, once for the which indicates that there are no more child processes, so
parent and again for the child. The value returned from that line 10 exits the loop. Otherwise, lines 11 and 12
fork() allows the caller to tell the difference, as shown in print an error and exit.
Listing 4.1 (forkjoin.c). Line 1 executes the fork()
Quick Quiz 4.5: Why does this wait() primitive need to be
primitive, and saves its return value in local variable pid.
so complicated? Why not just make it work like the shell-script
Line 2 checks to see if pid is zero, in which case, this wait does?
is the child, which continues on to execute line 3. As


Listing 4.3: Processes Created Via fork() Do Not Share Listing 4.4: Threads Created Via pthread_create() Share
Memory Memory
1 int x = 0; 1 int x = 0;
2 2
3 int main(int argc, char *argv[]) 3 void *mythread(void *arg)
4 { 4 {
5 int pid; 5 x = 1;
6 6 printf("Child process set x=1\n");
7 pid = fork(); 7 return NULL;
8 if (pid == 0) { /* child */ 8 }
9 x = 1; 9
10 printf("Child process set x=1\n"); 10 int main(int argc, char *argv[])
11 exit(EXIT_SUCCESS); 11 {
12 } 12 int en;
13 if (pid < 0) { /* parent, upon error */ 13 pthread_t tid;
14 perror("fork"); 14 void *vp;
15 exit(EXIT_FAILURE); 15
16 } 16 if ((en = pthread_create(&tid, NULL,
17 17 mythread, NULL)) != 0) {
18 /* parent */ 18 fprintf(stderr, "pthread_create: %s\n", strerror(en));
19 19 exit(EXIT_FAILURE);
20 waitall(); 20 }
21 printf("Parent process sees x=%d\n", x); 21
22 22 /* parent */
23 return EXIT_SUCCESS; 23
24 } 24 if ((en = pthread_join(tid, &vp)) != 0) {
25 fprintf(stderr, "pthread_join: %s\n", strerror(en));
26 exit(EXIT_FAILURE);
27 }
It is critically important to note that the parent and child 28 printf("Parent process sees x=%d\n", x);
29
do not share memory. This is illustrated by the program 30 return EXIT_SUCCESS;
shown in Listing 4.3 (forkjoinvar.c), in which the 31 }

child sets a global variable x to 1 on line 9, prints a


message on line 10, and exits on line 11. The parent
continues at line 20, where it waits on the child, and on that is to be invoked by the new thread, and the last
line 21 finds that its copy of the variable x is still zero. NULL argument is the argument that will be passed to
The output is thus as follows: mythread().
In this example, mythread() simply returns, but it
Child process set x=1 could instead call pthread_exit().
Parent process sees x=0
Quick Quiz 4.7: If the mythread() function in Listing 4.4
can simply return, why bother with pthread_exit()?
Quick Quiz 4.6: Isn’t there a lot more to fork() and wait()
than discussed here?
The pthread_join() primitive, shown on line 24, is
The finest-grained parallelism requires shared memory, analogous to the fork-join wait() primitive. It blocks
and this is covered in Section 4.2.2. That said, shared- until the thread specified by the tid variable completes
memory parallelism can be significantly more complex execution, either by invoking pthread_exit() or by re-
than fork-join parallelism. turning from the thread’s top-level function. The thread’s
exit value will be stored through the pointer passed as
the second argument to pthread_join(). The thread’s
4.2.2 POSIX Thread Creation and Destruc- exit value is either the value passed to pthread_exit()
tion or the value returned by the thread’s top-level function,
To create a thread within an existing process, invoke the depending on how the thread in question exits.
pthread_create() primitive, for example, as shown The program shown in Listing 4.4 produces output
on lines 16 and 17 of Listing 4.4 (pcreate.c). The as follows, demonstrating that memory is in fact shared
first argument is a pointer to a pthread_t in which to between the two threads:
store the ID of the thread to be created, the second NULL
argument is a pointer to an optional pthread_attr_t, the Child process set x=1
Parent process sees x=1
third argument is the function (in this case, mythread())


Note that this program carefully makes sure that only


one of the threads stores a value to variable x at a time.
Any situation in which one thread might be storing a
value to a given variable while some other thread either
loads from or stores to that same variable is termed a data
race. Because the C language makes no guarantee that
Listing 4.5: Demonstration of Exclusive Locks
the results of a data race will be in any way reasonable, 1 pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
we need some way of safely accessing and modifying data 2 pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;
3
concurrently, such as the locking primitives discussed in 4 int x = 0;
the following section. 5
6 void *lock_reader(void *arg)
But your data races are benign, you say? Well, maybe 7 {
they are. But please do everyone (yourself included) a 8 int en;
9 int i;
big favor and read Section 4.3.4.1 very carefully. As 10 int newx = -1;
compilers optimize more and more aggressively, there are 11 int oldx = -1;
12 pthread_mutex_t *pmlp = (pthread_mutex_t *)arg;
fewer and fewer truly benign data races. 13
14 if ((en = pthread_mutex_lock(pmlp)) != 0) {
Quick Quiz 4.8: If the C language makes no guarantees in 15 fprintf(stderr, "lock_reader:pthread_mutex_lock: %s\n",
presence of a data race, then why does the Linux kernel have 16 strerror(en));
so many data races? Are you trying to tell me that the Linux 17 exit(EXIT_FAILURE);
18 }
kernel is completely broken??? 19 for (i = 0; i < 100; i++) {
20 newx = READ_ONCE(x);
21 if (newx != oldx) {
22 printf("lock_reader(): x = %d\n", newx);
}
4.2.3 POSIX Locking 23
24 oldx = newx;
25 poll(NULL, 0, 1);
The POSIX standard allows the programmer to avoid 26 }
data races via “POSIX locking”. POSIX locking fea- 27 if ((en = pthread_mutex_unlock(pmlp)) != 0) {
28 fprintf(stderr, "lock_reader:pthread_mutex_unlock: %s\n",
tures a number of primitives, the most fundamental 29 strerror(en));
of which are pthread_mutex_lock() and pthread_ 30 exit(EXIT_FAILURE);
31 }
mutex_unlock(). These primitives operate on locks, 32 return NULL;
which are of type pthread_mutex_t. These locks may be 33 }
34
declared statically and initialized with PTHREAD_MUTEX_ 35 void *lock_writer(void *arg)
INITIALIZER, or they may be allocated dynamically and 36 {
37 int en;
initialized using the pthread_mutex_init() primitive. 38 int i;
The demonstration code in this section will take the former 39 pthread_mutex_t *pmlp = (pthread_mutex_t *)arg;
40
course. 41 if ((en = pthread_mutex_lock(pmlp)) != 0) {
The pthread_mutex_lock() primitive “acquires” the 42 fprintf(stderr, "lock_writer:pthread_mutex_lock: %s\n",
43 strerror(en));
specified lock, and the pthread_mutex_unlock() “re- 44 exit(EXIT_FAILURE);
leases” the specified lock. Because these are “exclusive” 45 }
46 for (i = 0; i < 3; i++) {
locking primitives, only one thread at a time may “hold” 47 WRITE_ONCE(x, READ_ONCE(x) + 1);
a given lock at a given time. For example, if a pair of 48 poll(NULL, 0, 5);
49 }
threads attempt to acquire the same lock concurrently, 50 if ((en = pthread_mutex_unlock(pmlp)) != 0) {
one of the pair will be “granted” the lock first, and the 51 fprintf(stderr, "lock_writer:pthread_mutex_unlock: %s\n",
52 strerror(en));
other will wait until the first thread releases the lock. A 53 exit(EXIT_FAILURE);
simple and reasonably useful programming model permits 54 }
55 return NULL;
a given data item to be accessed only while holding the 56 }
corresponding lock [Hoa74].
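A minimal sketch of that programming model, assuming a simple counter as the protected
data item and omitting the error checking used by the listings in this section: every
access to the data goes through functions that hold the corresponding lock.

#include <pthread.h>

static pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long count;  /* accessed only while holding count_lock */

void count_inc(void)
{
	pthread_mutex_lock(&count_lock);
	count++;
	pthread_mutex_unlock(&count_lock);
}

unsigned long count_get(void)
{
	unsigned long ret;

	pthread_mutex_lock(&count_lock);
	ret = count;
	pthread_mutex_unlock(&count_lock);
	return ret;
}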
Quick Quiz 4.9: What if I want several threads to hold the
same lock at the same time?

This exclusive-locking property is demonstrated using


the code shown in Listing 4.5 (lock.c). Line 1 defines


and initializes a POSIX lock named lock_a, while line 2 Listing 4.6: Demonstration of Same Exclusive Lock
similarly defines and initializes a lock named lock_b. 1 printf("Creating two threads using same lock:\n");
2 en = pthread_create(&tid1, NULL, lock_reader, &lock_a);
Line 4 defines and initializes a shared variable x. 3 if (en != 0) {
4 fprintf(stderr, "pthread_create: %s\n", strerror(en));
Lines 6–33 define a function lock_reader() which 5 exit(EXIT_FAILURE);
repeatedly reads the shared variable x while holding the 6 }
7 en = pthread_create(&tid2, NULL, lock_writer, &lock_a);
lock specified by arg. Line 12 casts arg to a pointer to a 8 if (en != 0) {
pthread_mutex_t, as required by the pthread_mutex_ 9 fprintf(stderr, "pthread_create: %s\n", strerror(en));
10 exit(EXIT_FAILURE);
lock() and pthread_mutex_unlock() primitives. 11 }
12 if ((en = pthread_join(tid1, &vp)) != 0) {
Quick Quiz 4.10: Why not simply make the argument to 13 fprintf(stderr, "pthread_join: %s\n", strerror(en));
lock_reader() on line 6 of Listing 4.5 be a pointer to a 14 exit(EXIT_FAILURE);
15 }
pthread_mutex_t? 16 if ((en = pthread_join(tid2, &vp)) != 0) {
17 fprintf(stderr, "pthread_join: %s\n", strerror(en));
18 exit(EXIT_FAILURE);
Quick Quiz 4.11: What is the READ_ONCE() on lines 20 19 }
and 47 and the WRITE_ONCE() on line 47 of Listing 4.5?
Listing 4.7: Demonstration of Different Exclusive Locks
Lines 14–18 acquire the specified pthread_mutex_t, 1 printf("Creating two threads w/different locks:\n");
2 x = 0;
checking for errors and exiting the program if any occur. 3 en = pthread_create(&tid1, NULL, lock_reader, &lock_a);
Lines 19–26 repeatedly check the value of x, printing 4 if (en != 0) {
5 fprintf(stderr, "pthread_create: %s\n", strerror(en));
the new value each time that it changes. Line 25 sleeps 6 exit(EXIT_FAILURE);
for one millisecond, which allows this demonstration 7 }
8 en = pthread_create(&tid2, NULL, lock_writer, &lock_b);
to run nicely on a uniprocessor machine. Lines 27–31 9 if (en != 0) {
release the pthread_mutex_t, again checking for errors 10 fprintf(stderr, "pthread_create: %s\n", strerror(en));
11 exit(EXIT_FAILURE);
and exiting the program if any occur. Finally, line 32 12 }
returns NULL, again to match the function type required 13 if ((en = pthread_join(tid1, &vp)) != 0) {
14 fprintf(stderr, "pthread_join: %s\n", strerror(en));
by pthread_create(). 15 exit(EXIT_FAILURE);
16 }
Quick Quiz 4.12: Writing four lines of code for each 17 if ((en = pthread_join(tid2, &vp)) != 0) {
18 fprintf(stderr, "pthread_join: %s\n", strerror(en));
acquisition and release of a pthread_mutex_t sure seems 19 exit(EXIT_FAILURE);
painful! Isn’t there a better way? 20 }

Lines 35–56 of Listing 4.5 show lock_writer(),


which periodically updates the shared variable x while Because both threads are using the same lock, the lock_
holding the specified pthread_mutex_t. As with lock_ reader() thread cannot see any of the intermediate values
reader(), line 39 casts arg to a pointer to pthread_ of x produced by lock_writer() while holding the lock.
mutex_t, lines 41–45 acquire the specified lock, and Quick Quiz 4.13: Is “x = 0” the only possible output from
lines 50–54 release it. While holding the lock, lines 46–49 the code fragment shown in Listing 4.6? If so, why? If not,
increment the shared variable x, sleeping for five millisec- what other output could appear, and why?
onds between each increment. Finally, lines 50–54 release
the lock. Listing 4.7 shows a similar code fragment, but this time
Listing 4.6 shows a code fragment that runs lock_ using different locks: lock_a for lock_reader() and
reader() and lock_writer() as threads using the same lock_b for lock_writer(). The output of this code
lock, namely, lock_a. Lines 2–6 create a thread running fragment is as follows:
lock_reader(), and then lines 7–11 create a thread Creating two threads w/different locks:
running lock_writer(). Lines 12–19 wait for both lock_reader(): x = 0
lock_reader(): x = 1
threads to complete. The output of this code fragment is lock_reader(): x = 2
as follows: lock_reader(): x = 3

Creating two threads using same lock: Because the two threads are using different locks, they
lock_reader(): x = 0
do not exclude each other, and can run concurrently. The


lock_reader() function can therefore see the interme-


diate values of x stored by lock_writer().
Quick Quiz 4.14: Using different locks could cause quite
a bit of confusion, what with threads seeing each others’
intermediate states. So should well-written parallel programs
restrict themselves to using a single lock in order to avoid this
kind of confusion?

Quick Quiz 4.15: In the code shown in Listing 4.7, is


lock_reader() guaranteed to see all the values produced by
Listing 4.8: Measuring Reader-Writer Lock Scalability
lock_writer()? Why or why not?
1 pthread_rwlock_t rwl = PTHREAD_RWLOCK_INITIALIZER;
2 unsigned long holdtime = 0;
3 unsigned long thinktime = 0;
Quick Quiz 4.16: Wait a minute here!!! Listing 4.6 didn’t 4 long long *readcounts;
initialize shared variable x, so why does it need to be initialized 5 int nreadersrunning = 0;
in Listing 4.7? 6
7 #define GOFLAG_INIT 0
8 #define GOFLAG_RUN 1
Although there is quite a bit more to POSIX exclusive 9 #define GOFLAG_STOP 2
10 char goflag = GOFLAG_INIT;
locking, these primitives provide a good start and are in 11
fact sufficient in a great many situations. The next section 12 void *reader(void *arg)
13 {
takes a brief look at POSIX reader-writer locking. 14 int en;
15 int i;
16 long long loopcnt = 0;
4.2.4 POSIX Reader-Writer Locking 17 long me = (long)arg;
18
19 __sync_fetch_and_add(&nreadersrunning, 1);
The POSIX API provides a reader-writer lock, which 20 while (READ_ONCE(goflag) == GOFLAG_INIT) {
is represented by a pthread_rwlock_t. As with 21 continue;
22 }
pthread_mutex_t, pthread_rwlock_t may be stat- 23 while (READ_ONCE(goflag) == GOFLAG_RUN) {
ically initialized via PTHREAD_RWLOCK_INITIALIZER 24 if ((en = pthread_rwlock_rdlock(&rwl)) != 0) {
25 fprintf(stderr,
or dynamically initialized via the pthread_rwlock_ 26 "pthread_rwlock_rdlock: %s\n", strerror(en));
init() primitive. The pthread_rwlock_rdlock() 27 exit(EXIT_FAILURE);
28 }
primitive read-acquires the specified pthread_rwlock_ 29 for (i = 1; i < holdtime; i++) {
t, the pthread_rwlock_wrlock() primitive write- 30 wait_microseconds(1);
31 }
acquires it, and the pthread_rwlock_unlock() prim- 32 if ((en = pthread_rwlock_unlock(&rwl)) != 0) {
itive releases it. Only a single thread may write-hold a 33 fprintf(stderr,
34 "pthread_rwlock_unlock: %s\n", strerror(en));
given pthread_rwlock_t at any given time, but multiple 35 exit(EXIT_FAILURE);
threads may read-hold a given pthread_rwlock_t, at 36 }
37 for (i = 1; i < thinktime; i++) {
least while there is no thread currently write-holding it. 38 wait_microseconds(1);
As you might expect, reader-writer locks are designed 39 }
40 loopcnt++;
for read-mostly situations. In these situations, a reader- 41 }
writer lock can provide greater scalability than can an 42 readcounts[me] = loopcnt;
43 return NULL;
exclusive lock because the exclusive lock is by definition 44 }
limited to a single thread holding the lock at any given time,
while the reader-writer lock permits an arbitrarily large
number of readers to concurrently hold the lock. How-
ever, in practice, we need to know how much additional
scalability is provided by reader-writer locks.
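Before turning to that measurement, here is a minimal usage sketch of the reader-writer
primitives just described (not from the book’s CodeSamples, error checking omitted):
readers may run concurrently with each other, while an occasional writer excludes
everyone.

#include <pthread.h>
#include <string.h>

static pthread_rwlock_t cfg_rwlock = PTHREAD_RWLOCK_INITIALIZER;
static char cfg_value[64];  /* read-mostly data protected by cfg_rwlock */

/* Readers may hold the lock concurrently with each other. */
void cfg_read(char *buf, size_t len)
{
	pthread_rwlock_rdlock(&cfg_rwlock);
	strncpy(buf, cfg_value, len - 1);
	buf[len - 1] = '\0';
	pthread_rwlock_unlock(&cfg_rwlock);
}

/* A writer excludes both readers and other writers. */
void cfg_update(const char *new_value)
{
	pthread_rwlock_wrlock(&cfg_rwlock);
	strncpy(cfg_value, new_value, sizeof(cfg_value) - 1);
	cfg_value[sizeof(cfg_value) - 1] = '\0';
	pthread_rwlock_unlock(&cfg_rwlock);
}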
Listing 4.8 (rwlockscale.c) shows one way of mea-
suring reader-writer lock scalability. Line 1 shows the
definition and initialization of the reader-writer lock, line 2
shows the holdtime argument controlling the time each
thread holds the reader-writer lock, line 3 shows the


thinktime argument controlling the time between the release of the reader-writer lock and
the next acquisition, line 4 defines the readcounts array into which each reader thread
places the number of times it acquired the lock, and line 5 defines the nreadersrunning
variable, which determines when all reader threads have started running.

Lines 7–10 define goflag, which synchronizes the start and the end of the test. This
variable is initially set to GOFLAG_INIT, then set to GOFLAG_RUN after all the reader
threads have started, and finally set to GOFLAG_STOP to terminate the test run.

Lines 12–44 define reader(), which is the reader thread. Line 19 atomically increments
the nreadersrunning variable to indicate that this thread is now running, and lines 20–22
wait for the test to start. The READ_ONCE() primitive forces the compiler to fetch goflag
on each pass through the loop—the compiler would otherwise be within its rights to assume
that the value of goflag would never change.

Quick Quiz 4.17: Instead of using READ_ONCE() everywhere, why not just declare goflag as
volatile on line 10 of Listing 4.8?

Quick Quiz 4.18: READ_ONCE() only affects the compiler, not the CPU. Don’t we also need
memory barriers to make sure that the change in goflag’s value propagates to the CPU in a
timely fashion in Listing 4.8?

Quick Quiz 4.19: Would it ever be necessary to use READ_ONCE() when accessing a
per-thread variable, for example, a variable declared using GCC’s __thread storage class?

The loop spanning lines 23–41 carries out the performance test. Lines 24–28 acquire the
lock, lines 29–31 hold the lock for the specified number of microseconds, lines 32–36
release the lock, and lines 37–39 wait for the specified number of microseconds before
re-acquiring the lock. Line 40 counts this lock acquisition.

Line 42 moves the lock-acquisition count to this thread’s element of the readcounts[]
array, and line 43 returns, terminating this thread.

Figure 4.2 shows the results of running this test on a 224-core Xeon system with two
hardware threads per core for a total of 448 software-visible CPUs. The thinktime
parameter was zero for all these tests, and the holdtime parameter set to values ranging
from one microsecond (“1us” on the graph) to 10,000 microseconds (“10000us” on the
graph). The actual value plotted is:

    L_N / (N L_1)                                                  (4.1)

where N is the number of threads in the current run, L_N is the total number of lock
acquisitions by all N threads in the current run, and L_1 is the number of lock
acquisitions in a single-threaded run. Given ideal hardware and software scalability,
this value will always be 1.0.

Figure 4.2: Reader-Writer Lock Scalability vs. Microseconds in Critical Section on
8-Socket System With Intel Xeon Platinum 8176 CPUs @ 2.10 GHz (x-axis: Number of CPUs
(Threads); y-axis: Critical Section Performance, log scale; traces: ideal, 10000us,
1000us, 100us, 10us, 1us)

As can be seen in the figure, reader-writer locking scalability is decidedly non-ideal,
especially for smaller sizes of critical sections. To see why read-acquisition can be so
slow, consider that all the acquiring threads must update the pthread_rwlock_t data
structure. Therefore, if all 448 executing threads attempt to read-acquire the
reader-writer lock concurrently, they must update this underlying pthread_rwlock_t one at
a time. One lucky thread might do so almost immediately, but the least-lucky thread must
wait for all the other 447 threads to do their updates. This situation will only get
worse as you add CPUs. Note also the logscale y-axis. Even though the 10,000 microsecond
trace appears quite ideal, it has in fact degraded by about 10 % from ideal.

Quick Quiz 4.20: Isn’t comparing against single-CPU throughput a bit harsh?

Quick Quiz 4.21: But one microsecond is not a particularly small size for a critical
section. What do I do if I need a much


smaller critical section, for example, one containing only a few Listing 4.9: Compiler Barrier Primitive (for GCC)
instructions? #define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))
#define READ_ONCE(x) \
({ typeof(x) ___x = ACCESS_ONCE(x); ___x; })
Quick Quiz 4.22: The system used is a few years old, and #define WRITE_ONCE(x, val) \
new hardware should be faster. So why should anyone worry do { ACCESS_ONCE(x) = (val); } while (0)
#define barrier() __asm__ __volatile__("": : :"memory")
about reader-writer locks being slow?

Despite these limitations, reader-writer locking is quite


useful in many cases, for example when the readers must is “universal” in the sense that any atomic operation
do high-latency file or network I/O. There are alternatives, on a single location can be implemented in terms of
some of which will be presented in Chapters 5 and 9. compare-and-swap, though the earlier operations are often
more efficient where they apply. The compare-and-swap
operation is also capable of serving as the basis for a
4.2.5 Atomic Operations (GCC Classic) wider set of atomic operations, though the more elaborate
of these often suffer from complexity, scalability, and
Figure 4.2 shows that the overhead of reader-writer locking
performance problems [Her90].
is most severe for the smallest critical sections, so it would
be nice to have some other way of protecting tiny critical Quick Quiz 4.24: Given that these atomic operations will
sections. One such way uses atomic operations. We have often be able to generate single atomic instructions that are
seen an atomic operation already, namely the __sync_ directly supported by the underlying instruction set, shouldn’t
fetch_and_add() primitive on line 19 of Listing 4.8. they be the fastest possible way to get things done?
This primitive atomically adds the value of its second
argument to the value referenced by its first argument, The __sync_synchronize() primitive issues a
returning the old value (which was ignored in this case). “memory barrier”, which constrains both the compiler’s
If a pair of threads concurrently execute __sync_fetch_ and the CPU’s ability to reorder operations, as discussed in
and_add() on the same variable, the resulting value of Chapter 15. In some cases, it is sufficient to constrain the
the variable will include the result of both additions. compiler’s ability to reorder operations, while allowing the
The GNU C compiler offers a number of addi- CPU free rein, in which case the barrier() primitive may
tional atomic operations, including __sync_fetch_and_ be used. In some cases, it is only necessary to ensure that
sub(), __sync_fetch_and_or(), __sync_fetch_ the compiler avoids optimizing away a given memory read,
and_and(), __sync_fetch_and_xor(), and __sync_ in which case the READ_ONCE() primitive may be used,
fetch_and_nand(), all of which return the old value. as it was on line 20 of Listing 4.5. Similarly, the WRITE_
If you instead need the new value, you can instead ONCE() primitive may be used to prevent the compiler
use the __sync_add_and_fetch(), __sync_sub_ from optimizing away a given memory write. These last
and_fetch(), __sync_or_and_fetch(), __sync_ three primitives are not provided directly by GCC, but may
and_and_fetch(), __sync_xor_and_fetch(), and be implemented straightforwardly as shown in Listing 4.9,
__sync_nand_and_fetch() primitives. and all three are discussed at length in Section 4.3.4. Al-
ternatively, READ_ONCE(x) has much in common with
Quick Quiz 4.23: Is it really necessary to have both sets of the GCC intrinsic __atomic_load_n(&x, __ATOMIC_
primitives? RELAXED) and WRITE_ONCE() has much in common
with the GCC intrinsic __atomic_store_n(&x, v,
The classic compare-and-swap operation is provided __ATOMIC_RELAXED).
by a pair of primitives, __sync_bool_compare_and_
swap() and __sync_val_compare_and_swap(). Both Quick Quiz 4.25: What happened to ACCESS_ONCE()?
of these primitives atomically update a location to a new
value, but only if its prior value was equal to the specified
old value. The first variant returns 1 if the operation 4.2.6 Atomic Operations (C11)
succeeded and 0 if it failed, for example, if the prior value
was not equal to the specified old value. The second The C11 standard added atomic operations, in-
variant returns the prior value of the location, which, if cluding loads (atomic_load()), stores (atomic_
equal to the specified old value, indicates that the operation store()), memory barriers (atomic_thread_fence()
succeeded. Either of the compare-and-swap operation and atomic_signal_fence()), and read-modify-


write atomics. The read-modify-write atom- code that is to be built only with GCC or other compilers
ics include atomic_fetch_add(), atomic_fetch_ supporting __thread.
sub(), atomic_fetch_and(), atomic_fetch_xor(), Fortunately, the C11 standard introduced a _Thread_
atomic_exchange(), atomic_compare_exchange_ local keyword that can be used in place of __thread. In
strong(), and atomic_compare_exchange_weak(). the fullness of time, this new keyword should combine the
These operate in a manner similar to those described ease of use of __thread with the portability of POSIX
in Section 4.2.5, but with the addition of memory-order thread-specific data.
arguments to _explicit variants of all of the opera-
tions. Without memory-order arguments, all the atomic
operations are fully ordered, and the arguments per- 4.3 Alternatives to POSIX Opera-
mit weaker orderings. For example, “atomic_load_ tions
explicit(&a, memory_order_relaxed)” is vaguely
similar to the Linux kernel’s “READ_ONCE()”.1
The strategic marketing paradigm of Open Source is
a massively parallel drunkard’s walk filtered by a
4.2.7 Atomic Operations (Modern GCC) Darwinistic process.
One restriction of the C11 atomics is that they apply Bruce Perens
only to special atomic types, which can be problematic.
The GNU C compiler therefore provides atomic intrin- Unfortunately, threading operations, locking primitives,
sics, including __atomic_load(), __atomic_load_ and atomic operations were in reasonably wide use long
n(), __atomic_store(), __atomic_store_n(), __ before the various standards committees got around to
atomic_thread_fence(), etc. These intrinsics offer them. As a result, there is considerable variation in how
the same semantics as their C11 counterparts, but may these operations are supported. It is still quite common to
be used on plain non-atomic objects. Some of these in- find these operations implemented in assembly language,
trinsics may be passed a memory-order argument from either for historical reasons or to obtain better perfor-
this list: __ATOMIC_RELAXED, __ATOMIC_CONSUME, mance in specialized circumstances. For example, GCC’s
__ATOMIC_ACQUIRE, __ATOMIC_RELEASE, __ATOMIC_ __sync_ family of primitives all provide full memory-
ACQ_REL, and __ATOMIC_SEQ_CST. ordering semantics, which in the past motivated many
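The following sketch applies these intrinsics to a plain long; the hitcount variable and the function names are invented for illustration:

#include <stdio.h>

static long hitcount;   /* Plain, non-atomic object. */

void record_hit(void)
{
    /* Relaxed read-modify-write on a plain long. */
    __atomic_fetch_add(&hitcount, 1, __ATOMIC_RELAXED);
}

long read_hits(void)
{
    /* Roughly what the Linux kernel's READ_ONCE() provides. */
    return __atomic_load_n(&hitcount, __ATOMIC_RELAXED);
}

void reset_hits(void)
{
    /* Roughly what WRITE_ONCE() provides. */
    __atomic_store_n(&hitcount, 0, __ATOMIC_RELAXED);
}

int main(void)
{
    record_hit();
    printf("%ld\n", read_hits());
    reset_hits();
    return 0;
}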
4.2.8 Per-Thread Variables

Per-thread variables, also called thread-specific data, thread-local storage, and other less-polite names, are used extremely heavily in concurrent code, as will be explored in Chapters 5 and 8. POSIX supplies the pthread_key_create() function to create a per-thread variable (and return the corresponding key), pthread_key_delete() to delete the per-thread variable corresponding to key, pthread_setspecific() to set the value of the current thread’s variable corresponding to the specified key, and pthread_getspecific() to return that value.

A number of compilers (including GCC) provide a __thread specifier that may be used in a variable definition to designate that variable as being per-thread. The name of the variable may then be used normally to access the value of the current thread’s instance of that variable. Of course, __thread is much easier to use than the POSIX thread-specific data, and so __thread is usually preferred for code that is to be built only with GCC or other compilers supporting __thread.

Fortunately, the C11 standard introduced a _Thread_local keyword that can be used in place of __thread. In the fullness of time, this new keyword should combine the ease of use of __thread with the portability of POSIX thread-specific data.
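To make the contrast concrete, here is a small pthreads program that maintains a per-thread value both ways, once via the C11 _Thread_local specifier and once via POSIX thread-specific data. The variable and function names are this sketch's own:

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

static _Thread_local long tl_count;     /* One instance per thread. */
static pthread_key_t count_key;         /* POSIX thread-specific data key. */

static void *worker(void *arg)
{
    long posix_count = 0;

    tl_count++;                          /* Just use the name. */
    posix_count++;
    pthread_setspecific(count_key, &posix_count);

    printf("thread %ld: tl_count=%ld posix=%ld\n",
           (long)(intptr_t)arg, tl_count,
           *(long *)pthread_getspecific(count_key));
    return NULL;
}

int main(void)
{
    pthread_t tid[2];
    long i;

    pthread_key_create(&count_key, NULL);
    for (i = 0; i < 2; i++)
        pthread_create(&tid[i], NULL, worker, (void *)(intptr_t)i);
    for (i = 0; i < 2; i++)
        pthread_join(tid[i], NULL);
    pthread_key_delete(count_key);
    return 0;
}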
4.3 Alternatives to POSIX Operations

The strategic marketing paradigm of Open Source is a massively parallel drunkard’s walk filtered by a Darwinistic process.

Bruce Perens

Unfortunately, threading operations, locking primitives, and atomic operations were in reasonably wide use long before the various standards committees got around to them. As a result, there is considerable variation in how these operations are supported. It is still quite common to find these operations implemented in assembly language, either for historical reasons or to obtain better performance in specialized circumstances. For example, GCC’s __sync_ family of primitives all provide full memory-ordering semantics, which in the past motivated many developers to create their own implementations for situations where the full memory ordering semantics are not required. The following sections show some alternatives from the Linux kernel and some historical primitives used by this book’s sample code.

4.3.1 Organization and Initialization

Although many environments do not require any special initialization code, the code samples in this book start with a call to smp_init(), which initializes a mapping from pthread_t to consecutive integers. The userspace RCU library2 similarly requires a call to rcu_init(). Although these calls can be hidden in environments (such as that of GCC) that support constructors, most of the RCU flavors supported by the userspace RCU library also require each thread invoke rcu_register_thread() upon thread creation and rcu_unregister_thread() before thread exit.

1 Memory ordering is described in more detail in Chapter 15 and Appendix C.
2 See Section 9.5 for more information on RCU.

Listing 4.10: Thread API thread() primitive returns the thread_id_t cor-
int smp_thread_id(void) responding to the newly created child thread.
thread_id_t create_thread(void *(*func)(void *), void *arg)
for_each_thread(t) This primitive will abort the program if more than
for_each_running_thread(t)
void *wait_thread(thread_id_t tid) NR_THREADS threads are created, counting the one
void wait_all_threads(void) implicitly created by running the program. NR_
THREADS is a compile-time constant that may be
modified, though some systems may have an upper
In the case of the Linux kernel, it is a philosophical bound for the allowable number of threads.
question as to whether the kernel does not require calls
to special initialization code or whether the kernel’s boot- smp_thread_id()
time code is in fact the required initialization code. Because the thread_id_t returned from create_
thread() is system-dependent, the smp_thread_
4.3.2 Thread Creation, Destruction, and id() primitive returns a thread index corresponding
to the thread making the request. This index is
Control
guaranteed to be less than the maximum number of
The Linux kernel uses struct task_struct pointers threads that have been in existence since the program
to track kthreads, kthread_create() to create them, started, and is therefore useful for bitmasks, array
kthread_should_stop() to externally suggest that they indices, and the like.
stop (which has no POSIX equivalent),3 kthread_
stop() to wait for them to stop, and schedule_ for_each_thread()
timeout_interruptible() for a timed wait. There The for_each_thread() macro loops through all
are quite a few additional kthread-management APIs, but threads that exist, including all threads that would
this provides a good start, as well as good search terms. exist if created. This macro is useful for handling the
The CodeSamples API focuses on “threads”, which are a per-thread variables introduced in Section 4.2.8.
locus of control.4 Each such thread has an identifier of type
for_each_running_thread()
thread_id_t, and no two threads running at a given time
The for_each_running_thread() macro loops
will have the same identifier. Threads share everything
through only those threads that currently exist. It is
except for per-thread local state,5 which includes program
the caller’s responsibility to synchronize with thread
counter and stack.
creation and deletion if required.
The thread API is shown in Listing 4.10, and members
are described in the following section. wait_thread()
The wait_thread() primitive waits for completion
4.3.2.1 API Members of the thread specified by the thread_id_t passed
to it. This in no way interferes with the execution
create_thread()
of the specified thread; instead, it merely waits for
The create_thread() primitive creates a new
it. Note that wait_thread() returns the value that
thread, starting the new thread’s execution at the func-
was returned by the corresponding thread.
tion func specified by create_thread()’s first ar-
gument, and passing it the argument specified by wait_all_threads()
create_thread()’s second argument. This newly The wait_all_threads() primitive waits for com-
created thread will terminate when it returns from the pletion of all currently running threads. It is the
starting function specified by func. The create_ caller’s responsibility to synchronize with thread
creation and deletion if required. However, this prim-
itive is normally used to clean up at the end of a run,
3 POSIX environments can work around the lack of kthread_ so such synchronization is normally not needed.
should_stop() by using a properly synchronized boolean flag in
conjunction with pthread_join().
4 There are many other names for similar software constructs, in- 4.3.2.2 Example Usage
cluding “process”, “task”, “fiber”, “event”, “execution agent”, and so on.
Similar design principles apply to all of them. Listing 4.11 (threadcreate.c) shows an example hello-
5 How is that for a circular definition? world-like child thread. As noted earlier, each thread

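As a concrete illustration of the workaround mentioned in the footnote to Section 4.3.2 (a properly synchronized boolean flag used together with pthread_join() in place of kthread_should_stop() and kthread_stop()), consider the following sketch. The should_stop flag and the function names are this example's inventions, not part of any standard API:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static int should_stop;   /* Accessed only via __atomic operations. */

static void *worker(void *arg)
{
    long iterations = 0;

    /* Poll the flag, analogous to kthread_should_stop(). */
    while (!__atomic_load_n(&should_stop, __ATOMIC_ACQUIRE))
        iterations++;                 /* Do one unit of work. */
    printf("worker did %ld iterations\n", iterations);
    return NULL;
}

int main(void)
{
    pthread_t tid;

    pthread_create(&tid, NULL, worker, NULL);
    sleep(1);                         /* Let the worker run briefly. */

    /* Analogous to kthread_stop(): request the stop, then wait. */
    __atomic_store_n(&should_stop, 1, __ATOMIC_RELEASE);
    pthread_join(tid, NULL);
    return 0;
}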
Listing 4.11: Example Child Thread
  1 void *thread_test(void *arg)
  2 {
  3   int myarg = (intptr_t)arg;
  4
  5   printf("child thread %d: smp_thread_id() = %d\n",
  6          myarg, smp_thread_id());
  7   return NULL;
  8 }

Listing 4.12: Example Parent Thread
  1 int main(int argc, char *argv[])
  2 {
  3   int i;
  4   int nkids = 1;
  5
  6   smp_init();
  7
  8   if (argc > 1) {
  9     nkids = strtoul(argv[1], NULL, 0);
 10     if (nkids > NR_THREADS) {
 11       fprintf(stderr, "nkids = %d too large, max = %d\n",
 12               nkids, NR_THREADS);
 13       usage(argv[0]);
 14     }
 15   }
 16   printf("Parent thread spawning %d threads.\n", nkids);
 17
 18   for (i = 0; i < nkids; i++)
 19     create_thread(thread_test, (void *)(intptr_t)i);
 20
 21   wait_all_threads();
 22
 23   printf("All spawned threads completed.\n");
 24
 25   exit(0);
 26 }

is allocated its own stack, so each thread has its own private arg argument and myarg variable. Each child simply prints its argument and its smp_thread_id() before exiting. Note that the return statement on line 7 terminates the thread, returning a NULL to whoever invokes wait_thread() on this thread.

The parent program is shown in Listing 4.12. It invokes smp_init() to initialize the threading system on line 6, parses arguments on lines 8–15, and announces its presence on line 16. It creates the specified number of child threads on lines 18–19, and waits for them to complete on line 21. Note that wait_all_threads() discards the threads’ return values, as in this case they are all NULL, which is not very interesting.

Quick Quiz 4.26: What happened to the Linux-kernel equivalents to fork() and wait()?

4.3.3 Locking

A good starting subset of the Linux kernel’s locking API is shown in Listing 4.13, each API element being described in the following section. This book’s CodeSamples locking API closely follows that of the Linux kernel.

Listing 4.13: Locking API
void spin_lock_init(spinlock_t *sp);
void spin_lock(spinlock_t *sp);
int spin_trylock(spinlock_t *sp);
void spin_unlock(spinlock_t *sp);

4.3.3.1 API Members

spin_lock_init()
    The spin_lock_init() primitive initializes the specified spinlock_t variable, and must be invoked before this variable is passed to any other spinlock primitive.

spin_lock()
    The spin_lock() primitive acquires the specified spinlock, if necessary, waiting until the spinlock becomes available. In some environments, such as pthreads, this waiting will involve blocking, while in others, such as the Linux kernel, it might involve a CPU-bound spin loop.
    The key point is that only one thread may hold a spinlock at any given time.

spin_trylock()
    The spin_trylock() primitive acquires the specified spinlock, but only if it is immediately available. It returns true if it was able to acquire the spinlock and false otherwise.

spin_unlock()
    The spin_unlock() primitive releases the specified spinlock, allowing other threads to acquire it.

4.3.3.2 Example Usage

A spinlock named mutex may be used to protect a variable counter as follows:

spin_lock(&mutex);
counter++;
spin_unlock(&mutex);

Quick Quiz 4.27: What problems could occur if the variable counter were incremented without the protection of mutex?

However, the spin_lock() and spin_unlock() primitives do have performance consequences, as will be seen in Chapter 10.
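When a thread has other useful work to do if a lock is busy, a conditional acquisition primitive such as the spin_trylock() described above is the natural tool. One way to express the same pattern in plain pthreads, shown here purely as an illustration (the counter and deferred variables are invented for this sketch), is pthread_mutex_trylock():

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static long counter;
static long deferred;

/* Increment counter if the lock is free, otherwise note the miss. */
void maybe_update(void)
{
    if (pthread_mutex_trylock(&mutex) == 0) {
        counter++;
        pthread_mutex_unlock(&mutex);
    } else {
        deferred++;    /* Lock busy: go do other work instead. */
    }
}

int main(void)
{
    maybe_update();
    printf("counter=%ld deferred=%ld\n", counter, deferred);
    return 0;
}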
Listing 4.14: Living Dangerously Early 1990s Style of transformations, including load tearing, store tearing,
1 ptr = global_ptr; load fusing, store fusing, code reordering, invented loads,
2 if (ptr != NULL && ptr < high_address)
3 do_low(ptr); invented stores, store-to-load transformations, and dead-
code elimination, all of which work just fine in single-
Listing 4.15: C Compilers Can Invent Loads threaded code. But concurrent code can be broken by each
1 if (global_ptr != NULL && of these transformations, or shared-variable shenanigans,
2 global_ptr < high_address) as described below.
3 do_low(global_ptr);

Load tearing occurs when the compiler uses multiple


load instructions for a single access. For example,
4.3.4 Accessing Shared Variables the compiler could in theory compile the load from
It was not until 2011 that the C standard defined seman- global_ptr (see line 1 of Listing 4.14) as a series of
tics for concurrent read/write access to shared variables. one-byte loads. If some other thread was concurrently
However, concurrent C code was being written at least setting global_ptr to NULL, the result might have
a quarter century earlier [BK85, Inm85]. This raises the all but one byte of the pointer set to zero, thus forming
question as to what today’s greybeards did back in long- a “wild pointer”. Stores using such a wild pointer
past pre-C11 days. A short answer to this question is “they could corrupt arbitrary regions of memory, resulting
lived dangerously”. in rare and difficult-to-debug crashes.
At least they would have been living dangerously had Worse yet, on (say) an 8-bit system with 16-bit
they been using 2021 compilers. In (say) the early 1990s, pointers, the compiler might have no choice but to
compilers did fewer optimizations, in part because there use a pair of 8-bit instructions to access a given
were fewer compiler writers and in part due to the relatively pointer. Because the C standard must support all
small memories of that era. Nevertheless, problems did manner of systems, the standard cannot rule out load
arise, as shown in Listing 4.14, which the compiler is tearing in the general case.
within its rights to transform into Listing 4.15. As you
can see, the temporary on line 1 of Listing 4.14 has been Store tearing occurs when the compiler uses multiple
optimized away, so that global_ptr will be loaded up to store instructions for a single access. For example,
three times. one thread might store 0x12345678 to a four-byte
Quick Quiz 4.28: What is wrong with loading Listing 4.14’s
integer variable at the same time another thread stored
global_ptr up to three times? 0xabcdef00. If the compiler used 16-bit stores for
either access, the result might well be 0x1234ef00,
Section 4.3.4.1 describes additional problems caused by which could come as quite a surprise to code loading
plain accesses, Sections 4.3.4.2 and 4.3.4.3 describe some from this integer. Nor is this a strictly theoretical
pre-C11 solutions. Of course, where practical, direct issue. For example, there are CPUs that feature small
C-language memory references should be replaced by immediate instruction fields, and on such CPUs, the
the primitives described in Section 4.2.5 or (especially) compiler might split a 64-bit store into two 32-bit
Section 4.2.6. Use these primitives to avoid data races, stores in order to reduce the overhead of explicitly
that is, ensure that if there are multiple concurrent C- forming the 64-bit constant in a register, even on
language accesses to a given variable, all of those accesses a 64-bit CPU. There are historical reports of this
are loads. actually happening in the wild (e.g. [KM13]), but
there is also a recent report [Dea19].7
4.3.4.1 Shared-Variable Shenanigans Of course, the compiler simply has no choice but to
tear some stores in the general case, given the possi-
Given code that does plain loads and stores,6 the compiler
bility of code using 64-bit integers running on a 32-
is within its rights to assume that the affected variables are
neither accessed nor modified by any other thread. This
7 Note that this tearing can happen even on properly aligned and
assumption allows the compiler to carry out a large number
machine-word-sized accesses, and in this particular case, even for volatile
stores. Some might argue that this behavior constitutes a bug in the
6 That is, normal loads and stores instead of C11 atomics, inline compiler, but either way it illustrates the perceived value of store tearing
assembly, or volatile accesses. from a compiler-writer viewpoint.

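Section 4.3.4 leans heavily on READ_ONCE(), WRITE_ONCE(), and barrier() to constrain the compiler. For readers wondering what such primitives might expand to in userspace code, one plausible set of GCC-style definitions is sketched below; the book's own definitions appear in Listing 4.9, so treat these as illustrative rather than authoritative:

/* Simplified sketch of volatile-access wrappers and a compiler barrier.
 * These rely on GCC's typeof and asm extensions. */
#define READ_ONCE(x)       (*(volatile typeof(x) *)&(x))
#define WRITE_ONCE(x, val) do { *(volatile typeof(x) *)&(x) = (val); } while (0)
#define barrier()          __asm__ __volatile__("" : : : "memory")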
Listing 4.16: Inviting Load Fusing Listing 4.18: C Compilers Can Fuse Non-Adjacent Loads
1 while (!need_to_stop) 1 int *gp;
2 do_something_quickly(); 2
3 void t0(void)
4 {
Listing 4.17: C Compilers Can Fuse Loads 5 WRITE_ONCE(gp, &myvar);
6 }
1 if (!need_to_stop) 7
2 for (;;) { 8 void t1(void)
3 do_something_quickly(); 9 {
4 do_something_quickly(); 10 p1 = gp;
5 do_something_quickly(); 11 do_something(p1);
6 do_something_quickly(); 12 p2 = READ_ONCE(gp);
7 do_something_quickly(); 13 if (p2) {
8 do_something_quickly(); 14 do_something_else();
9 do_something_quickly(); 15 p3 = *gp;
10 do_something_quickly(); 16 }
11 do_something_quickly(); 17 }
12 do_something_quickly();
13 do_something_quickly();
14 do_something_quickly();
15 do_something_quickly(); t1() run concurrently, and do_something() and
16 do_something_quickly();
17 do_something_quickly(); do_something_else() are inline functions. Line 1
18 do_something_quickly(); declares pointer gp, which C initializes to NULL by
19 }
default. At some point, line 5 of t0() stores a non-
NULL pointer to gp. Meanwhile, t1() loads from gp
bit system. But for properly aligned machine-sized three times on lines 10, 12, and 15. Given that line 13
stores, WRITE_ONCE() will prevent store tearing. finds that gp is non-NULL, one might hope that the
dereference on line 15 would be guaranteed never to
Load fusing occurs when the compiler uses the result of a fault. Unfortunately, the compiler is within its rights
prior load from a given variable instead of repeating to fuse the read on lines 10 and 15, which means
the load. Not only is this sort of optimization just that if line 10 loads NULL and line 12 loads &myvar,
fine in single-threaded code, it is often just fine in line 15 could load NULL, resulting in a fault.8 Note
multithreaded code. Unfortunately, the word “often” that the intervening READ_ONCE() does not prevent
hides some truly annoying exceptions. the other two loads from being fused, despite the fact
For example, suppose that a real-time system that all three are loading from the same variable.
needs to invoke a function named do_something_ Quick Quiz 4.29: Why does it matter whether
quickly() repeatedly until the variable need_to_ do_something() and do_something_else() in List-
stop was set, and that the compiler can see that do_ ing 4.18 are inline functions?
something_quickly() does not store to need_
to_stop. One (unsafe) way to code this is shown
Store fusing can occur when the compiler notices a pair
in Listing 4.16. The compiler might reasonably un-
of successive stores to a given variable with no
roll this loop sixteen times in order to reduce the
intervening loads from that variable. In this case, the
per-invocation of the backwards branch at the end
compiler is within its rights to omit the first store.
of the loop. Worse yet, because the compiler knows
This is never a problem in single-threaded code,
that do_something_quickly() does not store to
and in fact it is usually not a problem in correctly
need_to_stop, the compiler could quite reasonably
written concurrent code. After all, if the two stores
decide to check this variable only once, resulting
are executed in quick succession, there is very little
in the code shown in Listing 4.17. Once entered,
chance that some other thread could load the value
the loop on lines 2–19 will never exit, regardless of
from the first store.
how many times some other thread stores a non-zero
value to need_to_stop. The result will at best be However, there are exceptions, for example as shown
consternation, and might well also include severe in Listing 4.19. The function shut_it_down()
physical damage. stores to the shared variable status on lines 3 and 8,
and so assuming that neither start_shutdown()
The compiler can fuse loads across surprisingly large
spans of code. For example, in Listing 4.18, t0() and 8 Will Deacon reports that this happened in the Linux kernel.

Listing 4.19: C Compilers Can Fuse Stores Listing 4.20: Inviting an Invented Store
1 void shut_it_down(void) 1 if (condition)
2 { 2 a = 1;
3 status = SHUTTING_DOWN; /* BUGGY!!! */ 3 else
4 start_shutdown(); 4 do_a_bunch_of_stuff(&a);
5 while (!other_task_ready) /* BUGGY!!! */
6 continue;
7 finish_shutdown(); Listing 4.21: Compiler Invents an Invited Store
8 status = SHUT_DOWN; /* BUGGY!!! */
1 a = 1;
9 do_something_else();
2 if (!condition) {
10 }
3 a = 0;
11
4 do_a_bunch_of_stuff(&a);
12 void work_until_shut_down(void)
5 }
13 {
14 while (status != SHUTTING_DOWN) /* BUGGY!!! */
15 do_more_work();
16 other_task_ready = 1; /* BUGGY!!! */
17 } see the effect of any subsequent instructions. READ_
ONCE() and WRITE_ONCE() can therefore be used
to control communication between interrupted code
nor finish_shutdown() access status, the com- and interrupt handlers, independent of the ordering
piler could reasonably remove the store to status provided by the underlying hardware.9
on line 3. Unfortunately, this would mean that
work_until_shut_down() would never exit its Invented loads were illustrated by the code in List-
loop spanning lines 14 and 15, and thus would never ings 4.14 and 4.15, in which the compiler optimized
set other_task_ready, which would in turn mean away a temporary variable, thus loading from a
that shut_it_down() would never exit its loop span- shared variable more often than intended.
ning lines 5 and 6, even if the compiler chooses not to Invented loads can be a performance hazard. These
fuse the successive loads from other_task_ready hazards can occur when a load of variable in a “hot”
on line 5. cacheline is hoisted out of an if statement. These
And there are more problems with the code in List- hoisting optimizations are not uncommon, and can
ing 4.19, including code reordering. cause significant increases in cache misses, and thus
significant degradation of both performance and
Code reordering is a common compilation technique scalability.
used to combine common subexpressions, reduce
Invented stores can occur in a number of situations.
register pressure, and improve utilization of the many
For example, a compiler emitting code for work_
functional units available on modern superscalar mi-
until_shut_down() in Listing 4.19 might notice
croprocessors. It is also another reason why the code
that other_task_ready is not accessed by do_
in Listing 4.19 is buggy. For example, suppose that
more_work(), and stored to on line 16. If do_more_
the do_more_work() function on line 15 does not ac-
work() was a complex inline function, it might
cess other_task_ready. Then the compiler would
be necessary to do a register spill, in which case
be within its rights to move the assignment to other_
one attractive place to use for temporary storage is
task_ready on line 16 to precede line 14, which
other_task_ready. After all, there are no accesses
might be a great disappointment for anyone hoping
to it, so what is the harm?
that the last call to do_more_work() on line 15
happens before the call to finish_shutdown() on Of course, a non-zero store to this variable at just the
line 7. wrong time would result in the while loop on line 5
terminating prematurely, again allowing finish_
It might seem futile to prevent the compiler from
shutdown() to run concurrently with do_more_
changing the order of accesses in cases where the
work(). Given that the entire point of this while
underlying hardware is free to reorder them. However,
appears to be to prevent such concurrency, this is not
modern machines have exact exceptions and exact
a good thing.
interrupts, meaning that any interrupt or exception
will appear to have happened at a specific place in 9 That said, the various standards committees would prefer that
the instruction stream. This means that the handler you use atomics or variables of type sig_atomic_t, instead of READ_
will see the effect of all prior instructions, but won’t ONCE() and WRITE_ONCE().

Listing 4.22: Inviting a Store-to-Load Conversion Listing 4.23: Compiler Converts a Store to a Load
1 r1 = p; 1 r1 = p;
2 if (unlikely(r1)) 2 if (unlikely(r1))
3 do_something_with(r1); 3 do_something_with(r1);
4 barrier(); 4 barrier();
5 p = NULL; 5 if (p != NULL)
6 p = NULL;

Using a stored-to variable as a temporary might


desirable, it could be problematic if the actual store
seem outlandish, but it is permitted by the standard.
was required for ordering. For example, a write
Nevertheless, readers might be justified in wanting
memory barrier (Linux kernel smp_wmb()) would
a less outlandish example, which is provided by
order the store, but not the load. This situation
Listings 4.20 and 4.21.
might suggest use of smp_store_release() over
A compiler emitting code for Listing 4.20 might know smp_wmb().
that the value of a is initially zero, which might be
a strong temptation to optimize away one branch by Dead-code elimination can occur when the compiler
transforming this code to that in Listing 4.21. Here, notices that the value from a load is never used, or
line 1 unconditionally stores 1 to a, then resets the when a variable is stored to, but never loaded from.
value back to zero on line 3 if condition was not This can of course eliminate an access to a shared
set. This transforms the if-then-else into an if-then, variable, which can in turn defeat a memory-ordering
saving one branch. primitive, which could cause your concurrent code to
act in surprising ways. Experience thus far indicates
Quick Quiz 4.30: Ouch! So can’t the compiler invent a that relatively few such surprises will be at all pleas-
store to a normal variable pretty much any time it likes? ant. Elimination of store-only variables is especially
dangerous in cases where external code locates the
variable via symbol tables: The compiler is neces-
Finally, pre-C11 compilers could invent writes to sarily ignorant of such external-code accesses, and
unrelated variables that happened to be adjacent might thus eliminate a variable that the external code
to written-to variables [Boe05, Section 4.2]. This relies upon.
variant of invented stores has been outlawed by the
prohibition against compiler optimizations that invent Reliable concurrent code clearly needs a way to cause
data races. the compiler to preserve the number, order, and type of
important accesses to shared memory, a topic taken up by
Store-to-load transformations can occur when the com- Sections 4.3.4.2 and 4.3.4.3, which are up next.
piler notices that a plain store might not actually
change the value in memory. For example, consider 4.3.4.2 A Volatile Solution
Listing 4.22. Line 1 fetches p, but the “if” statement
on line 2 also tells the compiler that the developer Although it is now much maligned, before the advent of
thinks that p is usually zero.10 The barrier() state- C11 and C++11 [Bec11], the volatile keyword was an
ment on line 4 forces the compiler to forget the value indispensable tool in the parallel programmer’s toolbox.
of p, but one could imagine a compiler choosing This raises the question of exactly what volatile means,
to remember the hint—or getting an additional hint a question that is not answered with excessive precision
via feedback-directed optimization. Doing so would even by more recent versions of this standard [Smi19].11
cause the compiler to realize that line 5 is often an This version guarantees that “Accesses through volatile
expensive no-op. glvalues are evaluated strictly according to the rules of
the abstract machine”, that volatile accesses are side
Such a compiler might therefore guard the store of
effects, that they are one of the four forward-progress indi-
NULL with a check, as shown on lines 5 and 6 of
cators, and that their exact semantics are implementation-
Listing 4.23. Although this transformation is often
defined. Perhaps the clearest guidance is provided by this
10 The unlikely() function provides this hint to the compiler, non-normative note:
and different compilers provide different ways of implementing 11 JF Bastien thoroughly documented the history and use cases for

unlikely(). the volatile keyword in C++ [Bas18].

volatile is a hint to the implementation to Listing 4.24: Avoiding Danger, 2018 Style
avoid aggressive optimization involving the ob- 1 ptr = READ_ONCE(global_ptr);
2 if (ptr != NULL && ptr < high_address)
ject because the value of the object might be 3 do_low(ptr);
changed by means undetectable by an implemen-
tation. Furthermore, for some implementations, Listing 4.25: Preventing Load Fusing
volatile might indicate that special hardware 1 while (!READ_ONCE(need_to_stop))
instructions are required to access the object. 2 do_something_quickly();
See 6.8.1 for detailed semantics. In general, the
semantics of volatile are intended to be the Listing 4.26: Preventing Store Fusing and Invented Stores
same in C++ as they are in C. 1 void shut_it_down(void)
2 {
3 WRITE_ONCE(status, SHUTTING_DOWN); /* BUGGY!!! */
This wording might be reassuring to those writing low- 4 start_shutdown();
level code, except for the fact that compiler writers are 5 while (!READ_ONCE(other_task_ready)) /* BUGGY!!! */
6 continue;
free to completely ignore non-normative notes. Parallel 7 finish_shutdown();
programmers might instead reassure themselves that com- 8 WRITE_ONCE(status, SHUT_DOWN); /* BUGGY!!! */
9 do_something_else();
piler writers would like to avoid breaking device drivers 10 }
(though perhaps only after a few “frank and open” discus- 11
12 void work_until_shut_down(void)
sions with device-driver developers), and device drivers 13 {
impose at least the following constraints [MWPF18]: 14 while (READ_ONCE(status) != SHUTTING_DOWN) /* BUGGY!!! */
15 do_more_work();
16 WRITE_ONCE(other_task_ready, 1); /* BUGGY!!! */
1. Implementations are forbidden from tearing an 17 }
aligned volatile access when machine instructions of
that access’s size and type are available.12 Concur-
rent code relies on this constraint to avoid unneces- non-atomic or non-volatile, assuming that all accesses are
sary load and store tearing. aligned and machine-sized. The semantics of mixed-size
accesses to the same locations are more complex, and are
2. Implementations must not assume anything about the left aside for the time being.
semantics of a volatile access, nor, for any volatile So how does volatile stack up against the earlier
access that returns a value, about the possible set of examples?
values that might be returned.13 Concurrent code
Using READ_ONCE() on line 1 of Listing 4.14 avoids
relies on this constraint to avoid optimizations that
invented loads, resulting in the code shown in Listing 4.24.
are inapplicable given that other processors might be
As shown in Listing 4.25, READ_ONCE() can also pre-
concurrently accessing the location in question.
vent the loop unrolling in Listing 4.17.
3. Aligned machine-sized non-mixed-size volatile ac- READ_ONCE() and WRITE_ONCE() can also be used
cesses interact naturally with volatile assembly-code to prevent the store fusing and invented stores that were
sequences before and after. This is necessary because shown in Listing 4.19, with the result shown in List-
some devices must be accessed using a combina- ing 4.26. However, this does nothing to prevent code
tion of volatile MMIO accesses and special-purpose reordering, which requires some additional tricks taught
assembly-language instructions. Concurrent code in Section 4.3.4.3.
relies on this constraint in order to achieve the desired Finally, WRITE_ONCE() can be used to prevent the store
ordering properties from combinations of volatile ac- invention shown in Listing 4.20, with the resulting code
cesses and other means discussed in Section 4.3.4.3. shown in Listing 4.27.
To summarize, the volatile keyword can prevent
Concurrent code also relies on the first two constraints load tearing and store tearing in cases where the loads
to avoid undefined behavior that could result due to data
races if any of the accesses to a given object was either
Listing 4.27: Disinviting an Invented Store
12 Note that this leaves unspecified what to do with 128-bit loads and 1 if (condition)
stores on CPUs having 128-bit CAS but not 128-bit loads and stores. 2 WRITE_ONCE(a, 1);
13 This is strongly implied by the implementation-defined semantics 3 else
4 do_a_bunch_of_stuff();
called out above.

Listing 4.28: Preventing C Compilers From Fusing Loads Listing 4.29: Preventing Reordering
1 while (!need_to_stop) { 1 void shut_it_down(void)
2 barrier(); 2 {
3 do_something_quickly(); 3 WRITE_ONCE(status, SHUTTING_DOWN);
4 barrier(); 4 smp_mb();
5 } 5 start_shutdown();
6 while (!READ_ONCE(other_task_ready))
7 continue;
8 smp_mb();
and stores are machine-sized and properly aligned. It 9 finish_shutdown();
10 smp_mb();
can also prevent load fusing, store fusing, invented loads, 11 WRITE_ONCE(status, SHUT_DOWN);
and invented stores. However, although it does prevent 12 do_something_else();
13 }
the compiler from reordering volatile accesses with 14

each other, it does nothing to prevent the CPU from 15 void work_until_shut_down(void)
16 {
reordering these accesses. Furthermore, it does nothing 17 while (READ_ONCE(status) != SHUTTING_DOWN) {
to prevent either compiler or CPU from reordering non- 18 smp_mb();
19 do_more_work();
volatile accesses with each other or with volatile 20 }
accesses. Preventing these types of reordering requires 21 smp_mb();
22 WRITE_ONCE(other_task_ready, 1);
the techniques described in the next section. 23 }

4.3.4.3 Assembling the Rest of a Solution


Quick Quiz 4.31: But aren’t full memory barriers very
Additional ordering has traditionally been provided by heavyweight? Isn’t there a cheaper way to enforce the ordering
recourse to assembly language, for example, GCC asm needed in Listing 4.29?
directives. Oddly enough, these directives need not ac-
tually contain assembly language, as exemplified by the Ordering is also provided by some read-modify-write
barrier() macro shown in Listing 4.9. atomic operations, some of which are presented in Sec-
In the barrier() macro, the __asm__ introduces the tion 4.3.5. In the general case, memory ordering can be
asm directive, the __volatile__ prevents the compiler quite subtle, as discussed in Chapter 15. The next section
from optimizing the asm away, the empty string specifies covers an alternative to memory ordering, namely limiting
that no actual instructions are to be emitted, and the or even entirely avoiding data races.
final "memory" tells the compiler that this do-nothing
asm can arbitrarily change memory. In response, the
4.3.4.4 Avoiding Data Races
compiler will avoid moving any memory references across
the barrier() macro. This means that the real-time- “Doctor, it hurts my head when I think about
destroying loop unrolling shown in Listing 4.17 can be concurrently accessing shared variables!”
prevented by adding barrier() calls as shown on lines 2
“Then stop concurrently accessing shared vari-
and 4 of Listing 4.28. These two lines of code prevent the
ables!!!”
compiler from pushing the load from need_to_stop into
or past do_something_quickly() from either direction.
The doctor’s advice might seem unhelpful, but one
However, this does nothing to prevent the CPU from
time-tested way to avoid concurrently accessing shared
reordering the references. In many cases, this is not
variables is access those variables only when holding a
a problem because the hardware can only do a certain
particular lock, as will be discussed in Chapter 7. Another
amount of reordering. However, there are cases such
way is to access a given “shared” variable only from a
as Listing 4.19 where the hardware must be constrained.
given CPU or thread, as will be discussed in Chapter 8. It
Listing 4.26 prevented store fusing and invention, and
is possible to combine these two approaches, for example,
Listing 4.29 further prevents the remaining reordering
a given variable might be modified only by a given CPU or
by addition of smp_mb() on lines 4, 8, 10, 18, and 21.
thread while holding a particular lock, and might be read
The smp_mb() macro is similar to barrier() shown in
either from that same CPU or thread on the one hand, or
Listing 4.9, but with the empty string replaced by a string
from some other CPU or thread while holding that same
containing the instruction for a full memory barrier, for
lock on the other. In all of these situations, all accesses to
example, "mfence" on x86 or "sync" on PowerPC.
the shared variables may be plain C-language accesses.

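As a concrete (and x86-specific) illustration of the smp_mb() construction described in Section 4.3.4.3 above, one plausible userspace definition replaces barrier()'s empty asm template with a full memory-barrier instruction; real implementations select the instruction per architecture:

/* Illustrative only: a full memory barrier for x86-64. */
#define smp_mb() __asm__ __volatile__("mfence" : : : "memory")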
Here is a list of situations allowing plain loads and stores for some accesses to a given variable, while requiring markings (such as READ_ONCE() and WRITE_ONCE()) for other accesses to that same variable:

1. A shared variable is only modified by a given owning CPU or thread, but is read by other CPUs or threads. All stores must use WRITE_ONCE(). The owning CPU or thread may use plain loads. Everything else must use READ_ONCE() for loads.

2. A shared variable is only modified while holding a given lock, but is read by code not holding that lock. All stores must use WRITE_ONCE(). CPUs or threads holding the lock may use plain loads. Everything else must use READ_ONCE() for loads.

3. A shared variable is only modified while holding a given lock by a given owning CPU or thread, but is read by other CPUs or threads or by code not holding that lock. All stores must use WRITE_ONCE(). The owning CPU or thread may use plain loads, as may any CPU or thread holding the lock. Everything else must use READ_ONCE() for loads.

4. A shared variable is only accessed by a given CPU or thread and by a signal or interrupt handler running in that CPU’s or thread’s context. The handler can use plain loads and stores, as can any code that has prevented the handler from being invoked, that is, code that has blocked signals and/or interrupts. All other code must use READ_ONCE() and WRITE_ONCE().

5. A shared variable is only accessed by a given CPU or thread and by a signal or interrupt handler running in that CPU’s or thread’s context, and the handler always restores the values of any variables that it has written before return. The handler can use plain loads and stores, as can any code that has prevented the handler from being invoked, that is, code that has blocked signals and/or interrupts. All other code can use plain loads, but must use WRITE_ONCE() to prevent store tearing, store fusing, and invented stores.

Quick Quiz 4.32: What needs to happen if an interrupt or signal handler might itself be interrupted?

In most other cases, loads from and stores to a shared variable must use READ_ONCE() and WRITE_ONCE() or stronger, respectively. But it bears repeating that neither READ_ONCE() nor WRITE_ONCE() provide any ordering guarantees other than within the compiler. See the above Section 4.3.4.3 or Chapter 15 for information on such guarantees.

Examples of many of these data-race-avoidance patterns are presented in Chapter 5.
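As a concrete illustration of the first situation above, the following sketch has a single owning thread update a progress counter with WRITE_ONCE() while any other thread samples it with READ_ONCE(). The variable and function names, and the simplified macro definitions, are this example's own:

#include <stdio.h>

#define READ_ONCE(x)       (*(volatile typeof(x) *)&(x))
#define WRITE_ONCE(x, val) do { *(volatile typeof(x) *)&(x) = (val); } while (0)

static long progress;   /* Stored to only by the owning thread. */

/* Owner thread: plain loads are fine, but stores use WRITE_ONCE(). */
static void owner_advance(void)
{
    WRITE_ONCE(progress, progress + 1);
}

/* Every other thread: loads use READ_ONCE(). */
static long observer_sample(void)
{
    return READ_ONCE(progress);
}

int main(void)
{
    owner_advance();
    printf("progress = %ld\n", observer_sample());
    return 0;
}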
4.3.5 Atomic Operations

The Linux kernel provides a wide variety of atomic operations, but those defined on type atomic_t provide a good start. Normal non-tearing reads and stores are provided by atomic_read() and atomic_set(), respectively. Acquire load is provided by smp_load_acquire() and release store by smp_store_release().

Non-value-returning fetch-and-add operations are provided by atomic_add(), atomic_sub(), atomic_inc(), and atomic_dec(), among others. An atomic decrement that returns a reached-zero indication is provided by both atomic_dec_and_test() and atomic_sub_and_test(). An atomic add that returns the new value is provided by atomic_add_return(). Both atomic_add_unless() and atomic_inc_not_zero() provide conditional atomic operations, where nothing happens unless the original value of the atomic variable is different than the value specified (these are very handy for managing reference counters, for example).

An atomic exchange operation is provided by atomic_xchg(), and the celebrated compare-and-swap (CAS) operation is provided by atomic_cmpxchg(). Both of these return the old value. Many additional atomic RMW primitives are available in the Linux kernel, see the Documentation/atomic_t.txt file in the Linux-kernel source tree.14

14 As of Linux kernel v5.11.

This book’s CodeSamples API closely follows that of the Linux kernel.

4.3.6 Per-CPU Variables

The Linux kernel uses DEFINE_PER_CPU() to define a per-CPU variable, this_cpu_ptr() to form a reference to this CPU’s instance of a given per-CPU variable, per_cpu() to access a specified CPU’s instance of a given per-CPU variable, along with many other special-purpose per-CPU operations.

Listing 4.30 shows this book’s per-thread-variable API, which is patterned after the Linux kernel’s per-CPU-variable API. This API provides the per-thread equivalent of global variables.

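To see how the atomic_t operations of Section 4.3.5 fit together, here is a small reference-counting sketch written against the Linux-kernel API, so it is meant to be read in a kernel context rather than compiled standalone. The my_ref counter and the release() cleanup hook are placeholders invented for this example:

#include <linux/atomic.h>

static atomic_t my_ref = ATOMIC_INIT(1);   /* Initial reference. */

static void my_get(void)
{
    atomic_inc(&my_ref);                   /* Take a reference. */
}

static void my_put(void)
{
    /* Drop a reference; the last one out does the cleanup. */
    if (atomic_dec_and_test(&my_ref))
        release();                         /* Placeholder cleanup hook. */
}

static int my_get_not_zero(void)
{
    /* Conditional acquisition: fails once the count has hit zero. */
    return atomic_inc_not_zero(&my_ref);
}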
Listing 4.30: Per-Thread-Variable API
DEFINE_PER_THREAD(type, name)
DECLARE_PER_THREAD(type, name)
per_thread(name, thread)
__get_thread_var(name)
init_per_thread(name, v)

Although this API is, strictly speaking, not necessary,15 it can provide a good userspace analogy to Linux kernel code.

15 You could instead use __thread or _Thread_local.

Quick Quiz 4.33: How could you work around the lack of a per-thread-variable API on systems that do not provide it?

4.3.6.1 API Members

DEFINE_PER_THREAD()
    The DEFINE_PER_THREAD() primitive defines a per-thread variable. Unfortunately, it is not possible to provide an initializer in the way permitted by the Linux kernel’s DEFINE_PER_CPU() primitive, but there is an init_per_thread() primitive that permits easy runtime initialization.

DECLARE_PER_THREAD()
    The DECLARE_PER_THREAD() primitive is a declaration in the C sense, as opposed to a definition. Thus, a DECLARE_PER_THREAD() primitive may be used to access a per-thread variable defined in some other file.

per_thread()
    The per_thread() primitive accesses the specified thread’s variable.

__get_thread_var()
    The __get_thread_var() primitive accesses the current thread’s variable.

init_per_thread()
    The init_per_thread() primitive sets all threads’ instances of the specified variable to the specified value. The Linux kernel accomplishes this via normal C initialization, relying on clever use of linker scripts and code executed during the CPU-online process.

4.3.6.2 Usage Example

Suppose that we have a counter that is incremented very frequently but read out quite rarely. As will become clear in Section 5.2, it is helpful to implement such a counter using a per-thread variable. Such a variable can be defined as follows:

DEFINE_PER_THREAD(int, counter);

The counter must be initialized as follows:

init_per_thread(counter, 0);

A thread can increment its instance of this counter as follows:

p_counter = &__get_thread_var(counter);
WRITE_ONCE(*p_counter, *p_counter + 1);

The value of the counter is then the sum of its instances. A snapshot of the value of the counter can thus be collected as follows:

for_each_thread(t)
    sum += READ_ONCE(per_thread(counter, t));

Again, it is possible to gain a similar effect using other mechanisms, but per-thread variables combine convenience and high performance, as will be shown in more detail in Section 5.2.

Quick Quiz 4.34: What do you do if you need a per-thread (not per-CPU!) variable in the Linux kernel?

4.4 The Right Tool for the Job: How to Choose?

If you get stuck, change your tools; it may free your thinking.

Paul Arden, abbreviated

As a rough rule of thumb, use the simplest tool that will get the job done. If you can, simply program sequentially. If that is insufficient, try using a shell script to mediate parallelism. If the resulting shell-script fork()/exec() overhead (about 480 microseconds for a minimal C program on an Intel Core Duo laptop) is too large, try using the C-language fork() and wait() primitives. If the overhead of these primitives (about 80 microseconds for a minimal child process) is still too large, then you might need to use the POSIX threading primitives, choosing the appropriate locking and/or atomic-operation primitives. If the overhead of the POSIX threading primitives (typically sub-microsecond) is too great, then the primitives

introduced in Chapter 9 may be required. Of course, the


actual overheads will depend not only on your hardware,
but most critically on the manner in which you use the
primitives. Furthermore, always remember that inter-
process communication and message-passing can be good
alternatives to shared-memory multithreaded execution,
especially when your code makes good use of the design
principles called out in Chapter 6.
Quick Quiz 4.35: Wouldn’t the shell normally use vfork()
rather than fork()?

Because concurrency was added to the C standard


several decades after the C language was first used to
build concurrent systems, there are a number of ways
of concurrently accessing shared variables. All else
being equal, the C11 standard operations described in
Section 4.2.6 should be your first stop. If you need to
access a given shared variable both with plain accesses and
atomically, then the modern GCC atomics described in
Section 4.2.7 might work well for you. If you are working
on an old codebase that uses the classic GCC __sync
API, then you should review Section 4.2.5 as well as the
relevant GCC documentation. If you are working on the
Linux kernel or similar codebase that combines use of the
volatile keyword with inline assembly, or if you need
dependencies to provide ordering, look at the material
presented in Section 4.3.4 as well as that in Chapter 15.
Whatever approach you take, please keep in mind that
randomly hacking multi-threaded code is a spectacularly
bad idea, especially given that shared-memory parallel sys-
tems use your own intelligence against you: The smarter
you are, the deeper a hole you will dig for yourself before
you realize that you are in trouble [Pok16]. Therefore,
it is necessary to make the right design choices as well
as the correct choice of individual primitives, as will be
discussed at length in subsequent chapters.

As easy as 1, 2, 3!
Unknown
Chapter 5

Counting

Counting is perhaps the simplest and most natural thing number of structures in use exceeds an exact limit (again, say
a computer can do. However, counting efficiently and 10,000). Suppose further that these structures are short-lived,
scalably on a large shared-memory multiprocessor can and that the limit is rarely exceeded, that there is almost always
be quite challenging. Furthermore, the simplicity of the at least one structure in use, and suppose further still that it is
underlying concept of counting allows us to explore the necessary to know exactly when this counter reaches zero, for
example, in order to free up some memory that is not required
fundamental issues of concurrency without the distractions
unless there is at least one structure in use.
of elaborate data structures or complex synchronization
primitives. Counting therefore provides an excellent
Quick Quiz 5.5: Removable I/O device access-count
introduction to parallel programming. problem. Suppose that you need to maintain a reference count
This chapter covers a number of special cases for which on a heavily used removable mass-storage device, so that you
there are simple, fast, and scalable counting algorithms. can tell the user when it is safe to remove the device. As usual,
But first, let us find out how much you already know about the user indicates a desire to remove the device, and the system
concurrent counting. tells the user when it is safe to do so.
Quick Quiz 5.1: Why should efficient and scalable counting Section 5.1 shows why counting is non-trivial. Sec-
be hard??? After all, computers have special hardware for the
tions 5.2 and 5.3 investigate network-packet counting
sole purpose of doing counting!!!
and approximate structure-allocation limits, respectively.
Section 5.4 takes on exact structure-allocation limits. Fi-
Quick Quiz 5.2: Network-packet counting problem. Sup-
nally, Section 5.5 presents performance measurements
pose that you need to collect statistics on the number of
networking packets transmitted and received. Packets might and discussion.
be transmitted or received by any CPU on the system. Suppose Sections 5.1 and 5.2 contain introductory material,
further that your system is capable of handling millions of while the remaining sections are more advanced.
packets per second per CPU, and that a systems-monitoring
package reads the count every five seconds. How would you
implement this counter? 5.1 Why Isn’t Concurrent Counting
Quick Quiz 5.3: Approximate structure-allocation limit
Trivial?
problem. Suppose that you need to maintain a count of the
number of structures allocated in order to fail any allocations Seek simplicity, and distrust it.
once the number of structures in use exceeds a limit (say,
10,000). Suppose further that the structures are short-lived, Alfred North Whitehead
the limit is rarely exceeded, and a “sloppy” approximate limit
is acceptable. Let’s start with something simple, for example, the
straightforward use of arithmetic shown in Listing 5.1
Quick Quiz 5.4: Exact structure-allocation limit problem. (count_nonatomic.c). Here, we have a counter on
Suppose that you need to maintain a count of the number of line 1, we increment it on line 5, and we read out its value
structures allocated in order to fail any allocations once the on line 10. What could be simpler?

Listing 5.1: Just Count!
  1 unsigned long counter = 0;
  2
  3 static __inline__ void inc_count(void)
  4 {
  5   WRITE_ONCE(counter, READ_ONCE(counter) + 1);
  6 }
  7
  8 static __inline__ unsigned long read_count(void)
  9 {
 10   return READ_ONCE(counter);
 11 }

Listing 5.2: Just Count Atomically!
  1 atomic_t counter = ATOMIC_INIT(0);
  2
  3 static __inline__ void inc_count(void)
  4 {
  5   atomic_inc(&counter);
  6 }
  7
  8 static __inline__ long read_count(void)
  9 {
 10   return atomic_read(&counter);
 11 }

Figure 5.1: Atomic Increment Scalability on x86 (y axis: Time Per Increment (ns); x axis: Number of CPUs (Threads))

Quick Quiz 5.6: One thing that could be simpler is ++ instead of that concatenation of READ_ONCE() and WRITE_ONCE(). Why all that extra typing???

This approach has the additional advantage of being blazingly fast if you are doing lots of reading and almost no incrementing, and on small systems, the performance is excellent.

There is just one large fly in the ointment: This approach can lose counts. On my six-core x86 laptop, a short run invoked inc_count() 285,824,000 times, but the final value of the counter was only 35,385,525. Although approximation does have a large place in computing, loss of 87 % of the counts is a bit excessive.

Quick Quiz 5.7: But can’t a smart compiler prove that line 5 of Listing 5.1 is equivalent to the ++ operator and produce an x86 add-to-memory instruction? And won’t the CPU cache cause this to be atomic?

Quick Quiz 5.8: The 8-figure accuracy on the number of failures indicates that you really did test this. Why would it be necessary to test such a trivial program, especially when the bug is easily seen by inspection?

The straightforward way to count accurately is to use atomic operations, as shown in Listing 5.2 (count_atomic.c). Line 1 defines an atomic variable, line 5 atomically increments it, and line 10 reads it out. Because this is atomic, it keeps perfect count. However, it is slower: On my six-core x86 laptop, it is more than twenty times slower than non-atomic increment, even when only a single thread is incrementing.1

This poor performance should not be a surprise, given the discussion in Chapter 3, nor should it be a surprise that the performance of atomic increment gets slower as the number of CPUs and threads increase, as shown in Figure 5.1. In this figure, the horizontal dashed line resting on the x axis is the ideal performance that would be achieved by a perfectly scalable algorithm: With such an algorithm, a given increment would incur the same overhead that it would in a single-threaded program. Atomic increment of a single global variable is clearly decidedly non-ideal, and gets multiple orders of magnitude worse with additional CPUs.

Quick Quiz 5.9: Why doesn’t the horizontal dashed line on the x axis meet the diagonal line at 𝑥 = 1?

Quick Quiz 5.10: But atomic increment is still pretty fast. And incrementing a single variable in a tight loop sounds pretty unrealistic to me, after all, most of the program’s execution should be devoted to actually doing work, not accounting for the work it has done! Why should I care about making this go faster?

For another perspective on global atomic increment, consider Figure 5.2. In order for each CPU to get a chance to increment a given global variable, the cache line containing that variable must circulate among all

1 Interestingly enough, non-atomically incrementing a counter will advance the counter more quickly than atomically incrementing the counter. Of course, if your only goal is to make the counter increase quickly, an easier approach is to simply assign a large value to the counter. Nevertheless, there is likely to be a role for algorithms that use carefully relaxed notions of correctness in order to gain greater performance and scalability [And91, ACMS03, Rin13, Ung11].
the CPUs, as shown by the red arrows. Such circulation will take significant time, resulting in the poor performance seen in Figure 5.1, which might be thought of as shown in Figure 5.3. The following sections discuss high-performance counting, which avoids the delays inherent in such circulation.

Quick Quiz 5.11: But why can’t CPU designers simply ship the addition operation to the data, avoiding the need to circulate the cache line containing the global variable being incremented?

Figure 5.2: Data Flow For Global Atomic Increment (diagram of CPUs 0–7, their caches, interconnects, and memory)

Figure 5.3: Waiting to Count (“One one thousand. Two one thousand. Three one thousand...”)

5.2 Statistical Counters

Facts are stubborn things, but statistics are pliable.

Mark Twain

This section covers the common special case of statistical counters, where the count is updated extremely frequently and the value is read out rarely, if ever. These will be used to solve the network-packet counting problem posed in Quick Quiz 5.2.

5.2.1 Design

Statistical counting is typically handled by providing a counter per thread (or CPU, when running in the kernel), so that each thread updates its own counter, as was foreshadowed in Section 4.3.6 on page 46. The aggregate value of the counters is read out by simply summing up all of the threads’ counters, relying on the commutative and associative properties of addition. This is an example of the Data Ownership pattern that will be introduced in Section 6.3.4 on page 86.

Quick Quiz 5.12: But doesn’t the fact that C’s “integers” are limited in size complicate things?

5.2.2 Array-Based Implementation

One way to provide per-thread variables is to allocate an array with one element per thread (presumably cache aligned and padded to avoid false sharing).

Quick Quiz 5.13: An array??? But doesn’t that limit the number of threads?

Listing 5.3: Array-Based Per-Thread Statistical Counters
  1 DEFINE_PER_THREAD(unsigned long, counter);
  2
  3 static __inline__ void inc_count(void)
  4 {
  5   unsigned long *p_counter = &__get_thread_var(counter);
  6
  7   WRITE_ONCE(*p_counter, *p_counter + 1);
  8 }
  9
 10 static __inline__ unsigned long read_count(void)
 11 {
 12   int t;
 13   unsigned long sum = 0;
 14
 15   for_each_thread(t)
 16     sum += READ_ONCE(per_thread(counter, t));
 17   return sum;
 18 }

52 CHAPTER 5. COUNTING

Such an array can be wrapped into per-thread primitives, as shown in Listing 5.3 (count_stat.c). Line 1 defines an array containing a set of per-thread counters of type unsigned long named, creatively enough, counter.

Lines 3-8 show a function that increments the counters, using the __get_thread_var() primitive to locate the currently running thread's element of the counter array. Because this element is modified only by the corresponding thread, non-atomic increment suffices. However, this code uses WRITE_ONCE() to prevent destructive compiler optimizations. For but one example, the compiler is within its rights to use a to-be-stored-to location as temporary storage, thus writing what would be for all intents and purposes garbage to that location just before doing the desired store. This could of course be rather confusing to anything attempting to read out the count. The use of WRITE_ONCE() prevents this optimization and others besides.

Quick Quiz 5.14: What other nasty optimizations could GCC apply?

Lines 10-18 show a function that reads out the aggregate value of the counter, using the for_each_thread() primitive to iterate over the list of currently running threads, and using the per_thread() primitive to fetch the specified thread's counter. This code also uses READ_ONCE() to ensure that the compiler doesn't optimize these loads into oblivion. For but one example, a pair of consecutive calls to read_count() might be inlined, and an intrepid optimizer might notice that the same locations were being summed and thus incorrectly conclude that it would be simply wonderful to sum them once and use the resulting value twice. This sort of optimization might be rather frustrating to people expecting later read_count() calls to account for the activities of other threads. The use of READ_ONCE() prevents this optimization and others besides.
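The definitions of READ_ONCE() and WRITE_ONCE() are not reproduced in this chapter. As a rough sketch only, showing one common way to obtain this effect in GCC-style C rather than the exact definitions used by this book's CodeSamples, they can be thought of as volatile accesses:

/* Sketch only: force the compiler to emit exactly one plain
 * load or store for the named location.  Hypothetical
 * definitions for illustration purposes. */
#define READ_ONCE(x) \
        (*(volatile typeof(x) *)&(x))
#define WRITE_ONCE(x, val) \
        do { *(volatile typeof(x) *)&(x) = (val); } while (0)

Because the access is volatile, the compiler may neither fuse nor refetch it, nor may it use the location as scratch storage.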
Quick Quiz 5.15: How does the per-thread counter variable in Listing 5.3 get initialized?

Quick Quiz 5.16: How is the code in Listing 5.3 supposed to permit more than one counter?

Figure 5.4: Data Flow For Per-Thread Increment

This approach scales linearly with increasing number of updater threads invoking inc_count(). As is shown by the green arrows on each CPU in Figure 5.4, the reason for this is that each CPU can make rapid progress incrementing its thread's variable, without any expensive cross-system communication. As such, this section solves the network-packet counting problem presented at the beginning of this chapter.

Quick Quiz 5.17: The read operation takes time to sum up the per-thread values, and during that time, the counter could well be changing. This means that the value returned by read_count() in Listing 5.3 will not necessarily be exact. Assume that the counter is being incremented at rate 𝑟 counts per unit time, and that read_count()'s execution consumes 𝛥 units of time. What is the expected error in the return value?

However, many implementations provide cheaper mechanisms for per-thread data that are free from arbitrary array-size limits. This is the topic of the next section.

5.2.3 Per-Thread-Variable-Based Implementation

The C language, since C11, features a _Thread_local storage class that provides per-thread storage.2 This can be used as shown in Listing 5.4 (count_end.c) to implement a statistical counter that not only scales well and avoids arbitrary thread-number limits, but that also incurs little or no performance penalty to incrementers compared to simple non-atomic increment.

Lines 1-4 define needed variables: counter is the per-thread counter variable, the counterp[] array allows threads to access each others' counters, finalcount accumulates the total as individual threads exit, and final_mutex coordinates between threads accumulating the total value of the counter and exiting threads.

2 GCC provides its own __thread storage class, which was used in previous versions of this book. The two methods for specifying a thread-local variable are interchangeable when using GCC.


Listing 5.4: Per-Thread Statistical Counters
1 unsigned long _Thread_local counter = 0;
2 unsigned long *counterp[NR_THREADS] = { NULL };
3 unsigned long finalcount = 0;
4 DEFINE_SPINLOCK(final_mutex);
5
6 static inline void inc_count(void)
7 {
8   WRITE_ONCE(counter, counter + 1);
9 }
10
11 static inline unsigned long read_count(void)
12 {
13   int t;
14   unsigned long sum;
15
16   spin_lock(&final_mutex);
17   sum = finalcount;
18   for_each_thread(t)
19     if (counterp[t] != NULL)
20       sum += READ_ONCE(*counterp[t]);
21   spin_unlock(&final_mutex);
22   return sum;
23 }
24
25 void count_register_thread(unsigned long *p)
26 {
27   int idx = smp_thread_id();
28
29   spin_lock(&final_mutex);
30   counterp[idx] = &counter;
31   spin_unlock(&final_mutex);
32 }
33
34 void count_unregister_thread(int nthreadsexpected)
35 {
36   int idx = smp_thread_id();
37
38   spin_lock(&final_mutex);
39   finalcount += counter;
40   counterp[idx] = NULL;
41   spin_unlock(&final_mutex);
42 }

Quick Quiz 5.18: Doesn't that explicit counterp array in Listing 5.4 reimpose an arbitrary limit on the number of threads? Why doesn't the C language provide a per_thread() interface, similar to the Linux kernel's per_cpu() primitive, to allow threads to more easily access each others' per-thread variables?

The inc_count() function used by updaters is quite simple, as can be seen on lines 6-9.

The read_count() function used by readers is a bit more complex. Line 16 acquires a lock to exclude exiting threads, and line 21 releases it. Line 17 initializes the sum to the count accumulated by those threads that have already exited, and lines 18-20 sum the counts being accumulated by threads currently running. Finally, line 22 returns the sum.

Quick Quiz 5.19: Doesn't the check for NULL on line 19 of Listing 5.4 add extra branch mispredictions? Why not have a variable set permanently to zero, and point unused counter-pointers to that variable rather than setting them to NULL?

Quick Quiz 5.20: Why on earth do we need something as heavyweight as a lock guarding the summation in the function read_count() in Listing 5.4?

Lines 25-32 show the count_register_thread() function, which must be called by each thread before its first use of this counter. This function simply sets up this thread's element of the counterp[] array to point to its per-thread counter variable.

Quick Quiz 5.21: Why on earth do we need to acquire the lock in count_register_thread() in Listing 5.4? It is a single properly aligned machine-word store to a location that no other thread is modifying, so it should be atomic anyway, right?

Lines 34-42 show the count_unregister_thread() function, which must be called prior to exit by each thread that previously called count_register_thread(). Line 38 acquires the lock, and line 41 releases it, thus excluding any calls to read_count() as well as other calls to count_unregister_thread(). Line 39 adds this thread's counter to the global finalcount, and then line 40 NULLs out its counterp[] array entry. A subsequent call to read_count() will see the exiting thread's count in the global finalcount, and will skip the exiting thread when sequencing through the counterp[] array, thus obtaining the correct total.
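Putting the lifetime rules together, a thread using this counter might be structured as in the following sketch. The worker() function, its loop, and the argument values are purely illustrative and are not part of Listing 5.4; only the calls to count_register_thread(), inc_count(), and count_unregister_thread() come from the listing:

void *worker(void *arg)
{
        int nthreadsexpected = *(int *)arg;         /* illustrative only */

        count_register_thread(&counter);            /* before first use */
        for (int i = 0; i < 1000000; i++)
                inc_count();                        /* per-thread fastpath */
        count_unregister_thread(nthreadsexpected);  /* just before exit */
        return NULL;
}

Any thread, registered or not, may invoke read_count() at any time to obtain the aggregate value.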
This approach gives updaters almost exactly the same performance as a non-atomic add, and also scales linearly. On the other hand, concurrent reads contend for a single global lock, and therefore perform poorly and scale abysmally. However, this is not a problem for statistical counters, where incrementing happens often and readout happens almost never. Of course, this approach is considerably more complex than the array-based scheme, due to the fact that a given thread's per-thread variables vanish when that thread exits.

Quick Quiz 5.22: Fine, but the Linux kernel doesn't have to acquire a lock when reading out the aggregate value of per-CPU counters. So why should user-space code need to do this???

Both the array-based and _Thread_local-based approaches offer excellent update-side performance and scalability. However, these benefits result in large read-side expense for large numbers of threads. The next section shows one way to reduce read-side expense while still retaining the update-side scalability.


5.2.4 Eventually Consistent Implementation

One way to retain update-side scalability while greatly improving read-side performance is to weaken consistency requirements. The counting algorithm in the previous section is guaranteed to return a value between the value that an ideal counter would have taken on near the beginning of read_count()'s execution and that near the end of read_count()'s execution. Eventual consistency [Vog09] provides a weaker guarantee: In absence of calls to inc_count(), calls to read_count() will eventually return an accurate count.

We exploit eventual consistency by maintaining a global counter. However, updaters only manipulate their per-thread counters. A separate thread is provided to transfer counts from the per-thread counters to the global counter. Readers simply access the value of the global counter. If updaters are active, the value used by the readers will be out of date, however, once updates cease, the global counter will eventually converge on the true value—hence this approach qualifies as eventually consistent.

Listing 5.5: Array-Based Per-Thread Eventually Consistent Counters
1 DEFINE_PER_THREAD(unsigned long, counter);
2 unsigned long global_count;
3 int stopflag;
4
5 static __inline__ void inc_count(void)
6 {
7   unsigned long *p_counter = &__get_thread_var(counter);
8
9   WRITE_ONCE(*p_counter, *p_counter + 1);
10 }
11
12 static __inline__ unsigned long read_count(void)
13 {
14   return READ_ONCE(global_count);
15 }
16
17 void *eventual(void *arg)
18 {
19   int t;
20   unsigned long sum;
21
22   while (READ_ONCE(stopflag) < 3) {
23     sum = 0;
24     for_each_thread(t)
25       sum += READ_ONCE(per_thread(counter, t));
26     WRITE_ONCE(global_count, sum);
27     poll(NULL, 0, 1);
28     if (READ_ONCE(stopflag))
29       smp_store_release(&stopflag, stopflag + 1);
30   }
31   return NULL;
32 }
33
34 void count_init(void)
35 {
36   int en;
37   pthread_t tid;
38
39   en = pthread_create(&tid, NULL, eventual, NULL);
40   if (en != 0) {
41     fprintf(stderr, "pthread_create: %s\n", strerror(en));
42     exit(EXIT_FAILURE);
43   }
44 }
45
46 void count_cleanup(void)
47 {
48   WRITE_ONCE(stopflag, 1);
49   while (smp_load_acquire(&stopflag) < 3)
50     poll(NULL, 0, 1);
51 }

The implementation is shown in Listing 5.5 (count_stat_eventual.c). Lines 1-2 show the per-thread variable and the global variable that track the counter's value, and line 3 shows stopflag which is used to coordinate termination (for the case where we want to terminate the program with an accurate counter value). The inc_count() function shown on lines 5-10 is similar to its counterpart in Listing 5.3. The read_count() function shown on lines 12-15 simply returns the value of the global_count variable.

However, the count_init() function on lines 34-44 creates the eventual() thread shown on lines 17-32, which cycles through all the threads, summing the per-thread local counter and storing the sum to the global_count variable. The eventual() thread waits an arbitrarily chosen one millisecond between passes.

The count_cleanup() function on lines 46-51 coordinates termination. The call to smp_load_acquire() here and the call to smp_store_release() in eventual() ensure that all updates to global_count are visible to code following the call to count_cleanup().

This approach gives extremely fast counter read-out while still supporting linear counter-update scalability. However, this excellent read-side performance and update-side scalability comes at the cost of the additional thread running eventual().
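The intended lifecycle can be pictured with the following sketch; the main() scaffolding and the worker-thread management are illustrative only, with just count_init(), inc_count(), read_count(), and count_cleanup() coming from Listing 5.5:

int main(int argc, char *argv[])
{
        count_init();    /* spawn the eventual() thread */

        /* ... create worker threads, each of which calls inc_count() ... */
        /* ... any thread may call read_count() for a cheap estimate ... */
        /* ... wait for the worker threads to finish ... */

        count_cleanup(); /* after this, global_count reflects all prior updates */
        return 0;
}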


Quick Quiz 5.23: Why doesn't inc_count() in Listing 5.5 need to use atomic instructions? After all, we now have multiple threads accessing the per-thread counters!

Quick Quiz 5.24: Won't the single global thread in the function eventual() of Listing 5.5 be just as severe a bottleneck as a global lock would be?

Quick Quiz 5.25: Won't the estimate returned by read_count() in Listing 5.5 become increasingly inaccurate as the number of threads rises?

Quick Quiz 5.26: Given that in the eventually-consistent algorithm shown in Listing 5.5 both reads and updates have extremely low overhead and are extremely scalable, why would anyone bother with the implementation described in Section 5.2.2, given its costly read-side code?

Quick Quiz 5.27: What is the accuracy of the estimate returned by read_count() in Listing 5.5?

5.2.5 Discussion

These three implementations show that it is possible to obtain near-uniprocessor performance for statistical counters, despite running on a parallel machine.

Quick Quiz 5.28: What fundamental difference is there between counting packets and counting the total number of bytes in the packets, given that the packets vary in size?

Quick Quiz 5.29: Given that the reader must sum all the threads' counters, this counter-read operation could take a long time given large numbers of threads. Is there any way that the increment operation can remain fast and scalable while allowing readers to also enjoy not only reasonable performance and scalability, but also good accuracy?

Given what has been presented in this section, you should now be able to answer the Quick Quiz about statistical counters for networking near the beginning of this chapter.

5.3 Approximate Limit Counters

An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem.

John Tukey

Another special case of counting involves limit-checking. For example, as noted in the approximate structure-allocation limit problem in Quick Quiz 5.3, suppose that you need to maintain a count of the number of structures allocated in order to fail any allocations once the number of structures in use exceeds a limit, in this case, 10,000. Suppose further that these structures are short-lived, that this limit is rarely exceeded, and that this limit is approximate in that it is OK either to exceed it sometimes by some bounded amount or to fail to reach it sometimes, again by some bounded amount. See Section 5.4 if you instead need the limit to be exact.

5.3.1 Design

One possible design for limit counters is to divide the limit of 10,000 by the number of threads, and give each thread a fixed pool of structures. For example, given 100 threads, each thread would manage its own pool of 100 structures. This approach is simple, and in some cases works well, but it does not handle the common case where a given structure is allocated by one thread and freed by another [MS93]. On the one hand, if a given thread takes credit for any structures it frees, then the thread doing most of the allocating runs out of structures, while the threads doing most of the freeing have lots of credits that they cannot use. On the other hand, if freed structures are credited to the CPU that allocated them, it will be necessary for CPUs to manipulate each others' counters, which will require expensive atomic instructions or other means of communicating between threads.3

3 That said, if each structure will always be freed by the same CPU (or thread) that allocated it, then this simple partitioning approach works extremely well.

In short, for many important workloads, we cannot fully partition the counter. Given that partitioning the counters was what brought the excellent update-side performance for the three schemes discussed in Section 5.2, this might be grounds for some pessimism. However, the eventually consistent algorithm presented in Section 5.2.4 provides an interesting hint. Recall that this algorithm kept two sets of books, a per-thread counter variable for updaters and a
global_count variable for readers, with an eventual() thread that periodically updated global_count to be eventually consistent with the values of the per-thread counter. The per-thread counter perfectly partitioned the counter value, while global_count kept the full value.

For limit counters, we can use a variation on this theme where we partially partition the counter. For example, consider four threads with each having not only a per-thread counter, but also a per-thread maximum value (call it countermax).

But then what happens if a given thread needs to increment its counter, but counter is equal to its countermax? The trick here is to move half of that thread's counter value to a globalcount, then increment counter. For example, if a given thread's counter and countermax variables were both equal to 10, we do the following:

1. Acquire a global lock.

2. Add five to globalcount.

3. To balance out the addition, subtract five from this thread's counter.

4. Release the global lock.

5. Increment this thread's counter, resulting in a value of six.
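In code, this trick might be sketched as follows, using the variable and lock names that appear in Listings 5.6-5.9. This is only an illustration of the five steps above, not the actual implementation, which appears in Listings 5.7 and 5.9:

static void slowpath_increment(void)  /* hypothetical helper */
{
        unsigned long half = counter / 2;

        spin_lock(&gblcnt_mutex);      /* Step 1: acquire the global lock. */
        globalcount += half;           /* Step 2: add five when counter is ten. */
        counter -= half;               /* Step 3: balance out the addition. */
        spin_unlock(&gblcnt_mutex);    /* Step 4: release the global lock. */
        counter++;                     /* Step 5: counter is now six. */
}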
Although this procedure still requires a global lock, that lock need only be acquired once for every five increment operations, greatly reducing that lock's level of contention. We can reduce this contention as low as we wish by increasing the value of countermax. However, the corresponding penalty for increasing the value of countermax is reduced accuracy of globalcount. To see this, note that on a four-CPU system, if countermax is equal to ten, globalcount will be in error by at most 40 counts. In contrast, if countermax is increased to 100, globalcount might be in error by as much as 400 counts.
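The same point can be put in equation form, as a restatement of the bound just described rather than an additional guarantee. Because each thread's counter always lies between zero and that thread's countermax, globalcount can fall short of the true aggregate value by at most the sum of the per-thread countermax values, so that with N threads each having the same countermax:

    0 <= (aggregate value) - globalcount <= N * countermax

which yields the 40-count and 400-count figures given above for N = 4.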
This raises the question of just how much we care about globalcount's deviation from the aggregate value of the counter, where this aggregate value is the sum of globalcount and each thread's counter variable. The answer to this question depends on how far the aggregate value is from the counter's limit (call it globalcountmax). The larger the difference between these two values, the larger countermax can be without risk of exceeding the globalcountmax limit. This means that the value of a given thread's countermax variable can be set based on this difference. When far from the limit, the countermax per-thread variables are set to large values to optimize for performance and scalability, while when close to the limit, these same variables are set to small values to minimize the error in the checks against the globalcountmax limit.

This design is an example of parallel fastpath, which is an important design pattern in which the common case executes with no expensive instructions and no interactions between threads, but where occasional use is also made of a more conservatively designed (and higher overhead) global algorithm. This design pattern is covered in more detail in Section 6.4.

5.3.2 Simple Limit Counter Implementation

Listing 5.6: Simple Limit Counter Variables
1 unsigned long __thread counter = 0;
2 unsigned long __thread countermax = 0;
3 unsigned long globalcountmax = 10000;
4 unsigned long globalcount = 0;
5 unsigned long globalreserve = 0;
6 unsigned long *counterp[NR_THREADS] = { NULL };
7 DEFINE_SPINLOCK(gblcnt_mutex);

Listing 5.6 shows both the per-thread and global variables used by this implementation. The per-thread counter and countermax variables are the corresponding thread's local counter and the upper bound on that counter, respectively. The globalcountmax variable on line 3 contains the upper bound for the aggregate counter, and the globalcount variable on line 4 is the global counter. The sum of globalcount and each thread's counter gives the aggregate value of the overall counter. The globalreserve variable on line 5 is at least the sum of all of the per-thread countermax variables. The relationship among these variables is shown by Figure 5.5:

1. The sum of globalcount and globalreserve must be less than or equal to globalcountmax.

2. The sum of all threads' countermax values must be less than or equal to globalreserve.

3. Each thread's counter must be less than or equal to that thread's countermax.

Each element of the counterp[] array references the corresponding thread's counter variable, and, finally, the gblcnt_mutex spinlock guards all of the global variables,
in other words, no thread is permitted to access or modify any of the global variables unless it has acquired gblcnt_mutex.

Figure 5.5: Simple Limit Counter Variable Relationships

Listing 5.7 shows the add_count(), sub_count(), and read_count() functions (count_lim.c).

Listing 5.7: Simple Limit Counter Add, Subtract, and Read
1 static __inline__ int add_count(unsigned long delta)
2 {
3   if (countermax - counter >= delta) {
4     WRITE_ONCE(counter, counter + delta);
5     return 1;
6   }
7   spin_lock(&gblcnt_mutex);
8   globalize_count();
9   if (globalcountmax -
10       globalcount - globalreserve < delta) {
11     spin_unlock(&gblcnt_mutex);
12     return 0;
13   }
14   globalcount += delta;
15   balance_count();
16   spin_unlock(&gblcnt_mutex);
17   return 1;
18 }
19
20 static __inline__ int sub_count(unsigned long delta)
21 {
22   if (counter >= delta) {
23     WRITE_ONCE(counter, counter - delta);
24     return 1;
25   }
26   spin_lock(&gblcnt_mutex);
27   globalize_count();
28   if (globalcount < delta) {
29     spin_unlock(&gblcnt_mutex);
30     return 0;
31   }
32   globalcount -= delta;
33   balance_count();
34   spin_unlock(&gblcnt_mutex);
35   return 1;
36 }
37
38 static __inline__ unsigned long read_count(void)
39 {
40   int t;
41   unsigned long sum;
42
43   spin_lock(&gblcnt_mutex);
44   sum = globalcount;
45   for_each_thread(t) {
46     if (counterp[t] != NULL)
47       sum += READ_ONCE(*counterp[t]);
48   }
49   spin_unlock(&gblcnt_mutex);
50   return sum;
51 }

Quick Quiz 5.30: Why does Listing 5.7 provide add_count() and sub_count() instead of the inc_count() and dec_count() interfaces shown in Section 5.2?

Lines 1-18 show add_count(), which adds the specified value delta to the counter. Line 3 checks to see if there is room for delta on this thread's counter, and, if so, line 4 adds it and line 5 returns success. This is the add_count() fastpath, and it does no atomic operations, references only per-thread variables, and should not incur any cache misses.

Quick Quiz 5.31: What is with the strange form of the condition on line 3 of Listing 5.7? Why not the more intuitive form of the fastpath shown in Listing 5.8?

Listing 5.8: Intuitive Fastpath
3 if (counter + delta <= countermax) {
4   WRITE_ONCE(counter, counter + delta);
5   return 1;
6 }

If the test on line 3 fails, we must access global variables, and thus must acquire gblcnt_mutex on line 7, which we release on line 11 in the failure case or on line 16 in the success case. Line 8 invokes globalize_count(), shown in Listing 5.9, which clears the thread-local variables, adjusting the global variables as needed, thus simplifying global processing. (But don't take my word for it, try coding it yourself!) Lines 9 and 10 check to see if addition of delta can be accommodated, with the meaning of
the expression preceding the less-than sign shown in Figure 5.5 as the difference in height of the two red (leftmost) bars. If the addition of delta cannot be accommodated, then line 11 (as noted earlier) releases gblcnt_mutex and line 12 returns indicating failure.

Otherwise, we take the slowpath. Line 14 adds delta to globalcount, and then line 15 invokes balance_count() (shown in Listing 5.9) in order to update both the global and the per-thread variables. This call to balance_count() will usually set this thread's countermax to re-enable the fastpath. Line 16 then releases gblcnt_mutex (again, as noted earlier), and, finally, line 17 returns indicating success.

Quick Quiz 5.32: Why does globalize_count() zero the per-thread variables, only to later call balance_count() to refill them in Listing 5.7? Why not just leave the per-thread variables non-zero?

Lines 20-36 show sub_count(), which subtracts the specified delta from the counter. Line 22 checks to see if the per-thread counter can accommodate this subtraction, and, if so, line 23 does the subtraction and line 24 returns success. These lines form sub_count()'s fastpath, and, as with add_count(), this fastpath executes no costly operations.

If the fastpath cannot accommodate subtraction of delta, execution proceeds to the slowpath on lines 26-35. Because the slowpath must access global state, line 26 acquires gblcnt_mutex, which is released either by line 29 (in case of failure) or by line 34 (in case of success). Line 27 invokes globalize_count(), shown in Listing 5.9, which again clears the thread-local variables, adjusting the global variables as needed. Line 28 checks to see if the counter can accommodate subtracting delta, and, if not, line 29 releases gblcnt_mutex (as noted earlier) and line 30 returns failure.

Quick Quiz 5.33: Given that globalreserve counted against us in add_count(), why doesn't it count for us in sub_count() in Listing 5.7?

Quick Quiz 5.34: Suppose that one thread invokes add_count() shown in Listing 5.7, and then another thread invokes sub_count(). Won't sub_count() return failure even though the value of the counter is non-zero?

If, on the other hand, line 28 finds that the counter can accommodate subtracting delta, we complete the slowpath. Line 32 does the subtraction and then line 33 invokes balance_count() (shown in Listing 5.9) in order to update both global and per-thread variables (hopefully re-enabling the fastpath). Then line 34 releases gblcnt_mutex, and line 35 returns success.

Quick Quiz 5.35: Why have both add_count() and sub_count() in Listing 5.7? Why not simply pass a negative number to add_count()?

Lines 38-51 show read_count(), which returns the aggregate value of the counter. It acquires gblcnt_mutex on line 43 and releases it on line 49, excluding global operations from add_count() and sub_count(), and, as we will see, also excluding thread creation and exit. Line 44 initializes local variable sum to the value of globalcount, and then the loop spanning lines 45-48 sums the per-thread counter variables. Line 50 then returns the sum.

Listing 5.9 shows a number of utility functions used by the add_count(), sub_count(), and read_count() primitives shown in Listing 5.7.

Listing 5.9: Simple Limit Counter Utility Functions
1 static __inline__ void globalize_count(void)
2 {
3   globalcount += counter;
4   counter = 0;
5   globalreserve -= countermax;
6   countermax = 0;
7 }
8
9 static __inline__ void balance_count(void)
10 {
11   countermax = globalcountmax -
12                globalcount - globalreserve;
13   countermax /= num_online_threads();
14   globalreserve += countermax;
15   counter = countermax / 2;
16   if (counter > globalcount)
17     counter = globalcount;
18   globalcount -= counter;
19 }
20
21 void count_register_thread(void)
22 {
23   int idx = smp_thread_id();
24
25   spin_lock(&gblcnt_mutex);
26   counterp[idx] = &counter;
27   spin_unlock(&gblcnt_mutex);
28 }
29
30 void count_unregister_thread(int nthreadsexpected)
31 {
32   int idx = smp_thread_id();
33
34   spin_lock(&gblcnt_mutex);
35   globalize_count();
36   counterp[idx] = NULL;
37   spin_unlock(&gblcnt_mutex);
38 }

Lines 1-7 show globalize_count(), which zeros the current thread's per-thread counters, adjusting the global variables appropriately. It is important to note that
this function does not change the aggregate value of the counter, but instead changes how the counter's current value is represented. Line 3 adds the thread's counter variable to globalcount, and line 4 zeroes counter. Similarly, line 5 subtracts the per-thread countermax from globalreserve, and line 6 zeroes countermax. It is helpful to refer to Figure 5.5 when reading both this function and balance_count(), which is next.

Lines 9-19 show balance_count(), which is roughly speaking the inverse of globalize_count(). This function's job is to set the current thread's countermax variable to the largest value that avoids the risk of the counter exceeding the globalcountmax limit. Changing the current thread's countermax variable of course requires corresponding adjustments to counter, globalcount and globalreserve, as can be seen by referring back to Figure 5.5. By doing this, balance_count() maximizes use of add_count()'s and sub_count()'s low-overhead fastpaths. As with globalize_count(), balance_count() is not permitted to change the aggregate value of the counter.

Lines 11-13 compute this thread's share of that portion of globalcountmax that is not already covered by either globalcount or globalreserve, and assign the computed quantity to this thread's countermax. Line 14 makes the corresponding adjustment to globalreserve. Line 15 sets this thread's counter to the middle of the range from zero to countermax. Line 16 checks to see whether globalcount can in fact accommodate this value of counter, and, if not, line 17 decreases counter accordingly. Finally, in either case, line 18 makes the corresponding adjustment to globalcount.

Quick Quiz 5.36: Why set counter to countermax / 2 in line 15 of Listing 5.9? Wouldn't it be simpler to just take countermax counts?
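A concrete (and purely illustrative) set of numbers may help. Suppose globalcountmax is 10,000, there are four online threads, globalcount is 1,000, globalreserve is 900, and the current thread has counter equal to 50 and countermax equal to 100. Then globalize_count() leaves the aggregate value unchanged while setting globalcount to 1,050, globalreserve to 800, and this thread's counter and countermax both to zero. A subsequent call to balance_count() computes countermax = (10,000 - 1,050 - 800) / 4 = 2,037, increases globalreserve to 2,837, sets counter to 2,037 / 2 = 1,018, and decreases globalcount by that same amount, to 32, once again leaving the aggregate value unchanged.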
thread(), which tears down state for a soon-to-be-exiting
It is helpful to look at a schematic depicting how the thread. Line 34 acquires gblcnt_mutex and line 37 re-
relationship of the counters changes with the execution of leases it. Line 35 invokes globalize_count() to clear
first globalize_count() and then balance_count(), out this thread’s counter state, and line 36 clears this
as shown in Figure 5.6. Time advances from left to right, thread’s entry in the counterp[] array.
with the leftmost configuration roughly that of Figure 5.5.
The center configuration shows the relationship of these
5.3.3 Simple Limit Counter Discussion
same counters after globalize_count() is executed by
thread 0. As can be seen from the figure, thread 0’s This type of counter is quite fast when aggregate val-
counter (“c 0” in the figure) is added to globalcount, ues are near zero, with some overhead due to the com-
while the value of globalreserve is reduced by this same parison and branch in both add_count()’s and sub_
amount. Both thread 0’s counter and its countermax count()’s fastpaths. However, the use of a per-thread
(“cm 0” in the figure) are reduced to zero. The other three countermax reserve means that add_count() can fail
threads’ counters are unchanged. Note that this change even when the aggregate value of the counter is nowhere
did not affect the overall value of the counter, as indicated near globalcountmax. Similarly, sub_count() can fail

v2023.06.11a
60 CHAPTER 5. COUNTING

even when the aggregate value of the counter is nowhere near zero.

In many cases, this is unacceptable. Even if the globalcountmax is intended to be an approximate limit, there is usually a limit to exactly how much approximation can be tolerated. One way to limit the degree of approximation is to impose an upper limit on the value of the per-thread countermax instances. This task is undertaken in the next section.

5.3.4 Approximate Limit Counter Implementation

Because this implementation (count_lim_app.c) is quite similar to that in the previous section (Listings 5.6, 5.7, and 5.9), only the changes are shown here. Listing 5.10 is identical to Listing 5.6, with the addition of MAX_COUNTERMAX, which sets the maximum permissible value of the per-thread countermax variable.

Listing 5.10: Approximate Limit Counter Variables
1 unsigned long __thread counter = 0;
2 unsigned long __thread countermax = 0;
3 unsigned long globalcountmax = 10000;
4 unsigned long globalcount = 0;
5 unsigned long globalreserve = 0;
6 unsigned long *counterp[NR_THREADS] = { NULL };
7 DEFINE_SPINLOCK(gblcnt_mutex);
8 #define MAX_COUNTERMAX 100

Similarly, Listing 5.11 is identical to the balance_count() function in Listing 5.9, with the addition of lines 6 and 7, which enforce the MAX_COUNTERMAX limit on the per-thread countermax variable.

Listing 5.11: Approximate Limit Counter Balancing
1 static void balance_count(void)
2 {
3   countermax = globalcountmax -
4                globalcount - globalreserve;
5   countermax /= num_online_threads();
6   if (countermax > MAX_COUNTERMAX)
7     countermax = MAX_COUNTERMAX;
8   globalreserve += countermax;
9   counter = countermax / 2;
10   if (counter > globalcount)
11     counter = globalcount;
12   globalcount -= counter;
13 }
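Roughly speaking (this bound is implied by the constraints listed in Section 5.3.2 rather than stated there explicitly), imposing MAX_COUNTERMAX caps globalreserve at num_online_threads() * MAX_COUNTERMAX, so that globalcount can now deviate from the aggregate value, and add_count() can now spuriously fail, by at most about that many counts. For example, four threads and the MAX_COUNTERMAX of 100 shown in Listing 5.10 gives a bound of 400 counts.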


5.3.5 Approximate Limit Counter Discussion

These changes greatly reduce the limit inaccuracy seen in the previous version, but present another problem: Any given value of MAX_COUNTERMAX will cause a workload-dependent fraction of accesses to fall off the fastpath. As the number of threads increases, non-fastpath execution will become both a performance and a scalability problem. However, we will defer this problem and turn instead to counters with exact limits.

5.4 Exact Limit Counters

Exactitude can be expensive. Spend wisely.

Unknown

To solve the exact structure-allocation limit problem noted in Quick Quiz 5.4, we need a limit counter that can tell exactly when its limits are exceeded. One way of implementing such a limit counter is to cause threads that have reserved counts to give them up. One way to do this is to use atomic instructions. Of course, atomic instructions will slow down the fastpath, but on the other hand, it would be silly not to at least give them a try.

5.4.1 Atomic Limit Counter Implementation

Unfortunately, if one thread is to safely remove counts from another thread, both threads will need to atomically manipulate that thread's counter and countermax variables. The usual way to do this is to combine these two variables into a single variable, for example, given a 32-bit variable, using the high-order 16 bits to represent counter and the low-order 16 bits to represent countermax.

Quick Quiz 5.38: Why is it necessary to atomically manipulate the thread's counter and countermax variables as a unit? Wouldn't it be good enough to atomically manipulate them individually?

The variables and access functions for a simple atomic limit counter are shown in Listing 5.12 (count_lim_atomic.c). The counter and countermax variables in earlier algorithms are combined into the single variable counterandmax shown on line 1, with counter in the upper half and countermax in the lower half. This variable is of type atomic_t, which has an underlying representation of int.

Listing 5.12: Atomic Limit Counter Variables and Access Functions
1 atomic_t __thread counterandmax = ATOMIC_INIT(0);
2 unsigned long globalcountmax = 1 << 25;
3 unsigned long globalcount = 0;
4 unsigned long globalreserve = 0;
5 atomic_t *counterp[NR_THREADS] = { NULL };
6 DEFINE_SPINLOCK(gblcnt_mutex);
7 #define CM_BITS (sizeof(atomic_t) * 4)
8 #define MAX_COUNTERMAX ((1 << CM_BITS) - 1)
9
10 static __inline__ void
11 split_counterandmax_int(int cami, int *c, int *cm)
12 {
13   *c = (cami >> CM_BITS) & MAX_COUNTERMAX;
14   *cm = cami & MAX_COUNTERMAX;
15 }
16
17 static __inline__ void
18 split_counterandmax(atomic_t *cam, int *old, int *c, int *cm)
19 {
20   unsigned int cami = atomic_read(cam);
21
22   *old = cami;
23   split_counterandmax_int(cami, c, cm);
24 }
25
26 static __inline__ int merge_counterandmax(int c, int cm)
27 {
28   unsigned int cami;
29
30   cami = (c << CM_BITS) | cm;
31   return ((int)cami);
32 }

Lines 2-6 show the definitions for globalcountmax, globalcount, globalreserve, counterp, and gblcnt_mutex, all of which take on roles similar to their counterparts in Listing 5.10. Line 7 defines CM_BITS, which gives the number of bits in each half of counterandmax, and line 8 defines MAX_COUNTERMAX, which gives the maximum value that may be held in either half of counterandmax.

Quick Quiz 5.39: In what way does line 7 of Listing 5.12 violate the C standard?

Lines 10-15 show the split_counterandmax_int() function, which, when given the underlying int from the atomic_t counterandmax variable, splits it into its counter (c) and countermax (cm) components. Line 13 isolates the most-significant half of this int, placing the result as specified by argument c, and line 14 isolates the least-significant half of this int, placing the result as specified by argument cm.

Lines 17-24 show the split_counterandmax() function, which picks up the underlying int from the specified variable on line 20, stores it as specified by the old argument on line 22, and then invokes split_counterandmax_int() to split it on line 23.


Quick Quiz 5.40: Given that there is only one counterandmax variable, why bother passing in a pointer to it on line 18 of Listing 5.12?

Lines 26-32 show the merge_counterandmax() function, which can be thought of as the inverse of split_counterandmax(). Line 30 merges the counter and countermax values passed in c and cm, respectively, and returns the result.

Quick Quiz 5.41: Why does merge_counterandmax() in Listing 5.12 return an int rather than storing directly into an atomic_t?

Listing 5.13 shows the add_count() and sub_count() functions.

Listing 5.13: Atomic Limit Counter Add and Subtract
1 int add_count(unsigned long delta)
2 {
3   int c;
4   int cm;
5   int old;
6   int new;
7
8   do {
9     split_counterandmax(&counterandmax, &old, &c, &cm);
10     if (delta > MAX_COUNTERMAX || c + delta > cm)
11       goto slowpath;
12     new = merge_counterandmax(c + delta, cm);
13   } while (atomic_cmpxchg(&counterandmax,
14                           old, new) != old);
15   return 1;
16 slowpath:
17   spin_lock(&gblcnt_mutex);
18   globalize_count();
19   if (globalcountmax - globalcount -
20       globalreserve < delta) {
21     flush_local_count();
22     if (globalcountmax - globalcount -
23         globalreserve < delta) {
24       spin_unlock(&gblcnt_mutex);
25       return 0;
26     }
27   }
28   globalcount += delta;
29   balance_count();
30   spin_unlock(&gblcnt_mutex);
31   return 1;
32 }
33
34 int sub_count(unsigned long delta)
35 {
36   int c;
37   int cm;
38   int old;
39   int new;
40
41   do {
42     split_counterandmax(&counterandmax, &old, &c, &cm);
43     if (delta > c)
44       goto slowpath;
45     new = merge_counterandmax(c - delta, cm);
46   } while (atomic_cmpxchg(&counterandmax,
47                           old, new) != old);
48   return 1;
49 slowpath:
50   spin_lock(&gblcnt_mutex);
51   globalize_count();
52   if (globalcount < delta) {
53     flush_local_count();
54     if (globalcount < delta) {
55       spin_unlock(&gblcnt_mutex);
56       return 0;
57     }
58   }
59   globalcount -= delta;
60   balance_count();
61   spin_unlock(&gblcnt_mutex);
62   return 1;
63 }

Lines 1-32 show add_count(), whose fastpath spans lines 8-15, with the remainder of the function being the slowpath. Lines 8-14 of the fastpath form a compare-and-swap (CAS) loop, with the atomic_cmpxchg() primitive on lines 13-14 performing the actual CAS. Line 9 splits the current thread's counterandmax variable into its counter (in c) and countermax (in cm) components, while placing the underlying int into old. Line 10 checks whether the amount delta can be accommodated locally (taking care to avoid integer overflow), and if not, line 11 transfers to the slowpath. Otherwise, line 12 combines an updated counter value with the original countermax value into new. The atomic_cmpxchg() primitive on lines 13-14 then atomically compares this thread's counterandmax variable to old, updating its value to new if the comparison succeeds. If the comparison succeeds, line 15 returns success, otherwise, execution continues in the loop at line 8.

Quick Quiz 5.42: Yecch! Why the ugly goto on line 11 of Listing 5.13? Haven't you heard of the break statement???

Quick Quiz 5.43: Why would the atomic_cmpxchg() primitive at lines 13-14 of Listing 5.13 ever fail? After all, we picked up its old value on line 9 and have not changed it!

Lines 16-31 of Listing 5.13 show add_count()'s slowpath, which is protected by gblcnt_mutex, which is acquired on line 17 and released on lines 24 and 30. Line 18 invokes globalize_count(), which moves this thread's state to the global counters. Lines 19-20 check whether the delta value can be accommodated by the current global state, and, if not, line 21 invokes flush_local_count() to flush all threads' local state to the global counters, and then lines 22-23 recheck whether delta can be accommodated. If, after all that, the addition
of delta still cannot be accommodated, then line 24 releases gblcnt_mutex (as noted earlier), and then line 25 returns failure.

Otherwise, line 28 adds delta to the global counter, line 29 spreads counts to the local state if appropriate, line 30 releases gblcnt_mutex (again, as noted earlier), and finally, line 31 returns success.

Lines 34-63 of Listing 5.13 show sub_count(), which is structured similarly to add_count(), having a fastpath on lines 41-48 and a slowpath on lines 49-62. A line-by-line analysis of this function is left as an exercise to the reader.

Listing 5.14 shows read_count(). Line 9 acquires gblcnt_mutex and line 17 releases it. Line 10 initializes local variable sum to the value of globalcount, and the loop spanning lines 11-16 adds the per-thread counters to this sum, isolating each per-thread counter using split_counterandmax on line 13. Finally, line 18 returns the sum.

Listing 5.14: Atomic Limit Counter Read
1 unsigned long read_count(void)
2 {
3   int c;
4   int cm;
5   int old;
6   int t;
7   unsigned long sum;
8
9   spin_lock(&gblcnt_mutex);
10   sum = globalcount;
11   for_each_thread(t) {
12     if (counterp[t] != NULL) {
13       split_counterandmax(counterp[t], &old, &c, &cm);
14       sum += c;
15     }
16   }
17   spin_unlock(&gblcnt_mutex);
18   return sum;
19 }

Listings 5.15 and 5.16 show the utility functions globalize_count(), flush_local_count(), balance_count(), count_register_thread(), and count_unregister_thread(). The code for globalize_count() is shown on lines 1-12 of Listing 5.15, and is similar to that of previous algorithms, with the addition of line 7, which is now required to split out counter and countermax from counterandmax.

Listing 5.15: Atomic Limit Counter Utility Functions 1
1 static void globalize_count(void)
2 {
3   int c;
4   int cm;
5   int old;
6
7   split_counterandmax(&counterandmax, &old, &c, &cm);
8   globalcount += c;
9   globalreserve -= cm;
10   old = merge_counterandmax(0, 0);
11   atomic_set(&counterandmax, old);
12 }
13
14 static void flush_local_count(void)
15 {
16   int c;
17   int cm;
18   int old;
19   int t;
20   int zero;
21
22   if (globalreserve == 0)
23     return;
24   zero = merge_counterandmax(0, 0);
25   for_each_thread(t)
26     if (counterp[t] != NULL) {
27       old = atomic_xchg(counterp[t], zero);
28       split_counterandmax_int(old, &c, &cm);
29       globalcount += c;
30       globalreserve -= cm;
31     }
32 }

The code for flush_local_count(), which moves all threads' local counter state to the global counter, is shown on lines 14-32. Line 22 checks to see if the value of globalreserve permits any per-thread counts, and, if not, line 23 returns. Otherwise, line 24 initializes local variable zero to a combined zeroed counter and countermax. The loop spanning lines 25-31 sequences through each thread. Line 26 checks to see if the current thread has counter state, and, if so, lines 27-30 move that state to the global counters. Line 27 atomically fetches the current thread's state while replacing it with zero. Line 28 splits this state into its counter (in local variable c) and countermax (in local variable cm) components. Line 29 adds this thread's counter to globalcount, while line 30 subtracts this thread's countermax from globalreserve.

Quick Quiz 5.44: What stops a thread from simply refilling its counterandmax variable immediately after flush_local_count() on line 14 of Listing 5.15 empties it?

Quick Quiz 5.45: What prevents concurrent execution of the fastpath of either add_count() or sub_count() from interfering with the counterandmax variable while flush_local_count() is accessing it on line 27 of Listing 5.15?

Lines 1-22 on Listing 5.16 show the code for balance_count(), which refills the calling thread's local counterandmax variable. This function is quite similar to that of the preceding algorithms, with changes required to handle the merged counterandmax variable. Detailed analysis of the code is left as an exercise for the reader, as it is with the count_register_thread() function starting
on line 24 and the count_unregister_thread() function starting on line 33.

Listing 5.16: Atomic Limit Counter Utility Functions 2
1 static void balance_count(void)
2 {
3   int c;
4   int cm;
5   int old;
6   unsigned long limit;
7
8   limit = globalcountmax - globalcount -
9           globalreserve;
10   limit /= num_online_threads();
11   if (limit > MAX_COUNTERMAX)
12     cm = MAX_COUNTERMAX;
13   else
14     cm = limit;
15   globalreserve += cm;
16   c = cm / 2;
17   if (c > globalcount)
18     c = globalcount;
19   globalcount -= c;
20   old = merge_counterandmax(c, cm);
21   atomic_set(&counterandmax, old);
22 }
23
24 void count_register_thread(void)
25 {
26   int idx = smp_thread_id();
27
28   spin_lock(&gblcnt_mutex);
29   counterp[idx] = &counterandmax;
30   spin_unlock(&gblcnt_mutex);
31 }
32
33 void count_unregister_thread(int nthreadsexpected)
34 {
35   int idx = smp_thread_id();
36
37   spin_lock(&gblcnt_mutex);
38   globalize_count();
39   counterp[idx] = NULL;
40   spin_unlock(&gblcnt_mutex);
41 }

Quick Quiz 5.46: Given that the atomic_set() primitive does a simple store to the specified atomic_t, how can line 21 of balance_count() in Listing 5.16 work correctly in face of concurrent flush_local_count() updates to this variable?

The next section qualitatively evaluates this design.

5.4.2 Atomic Limit Counter Discussion

This is the first implementation that actually allows the counter to be run all the way to either of its limits, but it does so at the expense of adding atomic operations to the fastpaths, which slow down the fastpaths significantly on some systems. Although some workloads might tolerate this slowdown, it is worthwhile looking for algorithms with better write-side performance. One such algorithm uses a signal handler to steal counts from other threads. Because signal handlers run in the context of the signaled thread, atomic operations are not necessary, as shown in the next section.

Quick Quiz 5.47: But signal handlers can be migrated to some other CPU while running. Doesn't this possibility require that atomic instructions and memory barriers are required to reliably communicate between a thread and a signal handler that interrupts that thread?

5.4.3 Signal-Theft Limit Counter Design

Even though per-thread state will now be manipulated only by the corresponding thread, there will still need to be synchronization with the signal handlers. This synchronization is provided by the state machine shown in Figure 5.7.

Figure 5.7: Signal-Theft State Machine

The state machine starts out in the IDLE state, and when add_count() or sub_count() find that the combination of the local thread's count and the global count cannot accommodate the request, the corresponding slowpath sets each thread's theft state to REQ (unless that thread has no count, in which case it transitions directly to READY). Only the slowpath, which holds the gblcnt_mutex lock, is permitted to transition from the IDLE state, as indicated by the green color.4 The slowpath then sends a signal to each thread, and the corresponding signal handler checks the corresponding thread's theft and counting variables. If the theft state is not REQ, then the signal
handler is not permitted to change the state, and therefore simply returns. Otherwise, if the counting variable is set, indicating that the current thread's fastpath is in progress, the signal handler sets the theft state to ACK, otherwise to READY.

4 For those with black-and-white versions of this book, IDLE and READY are green, REQ is red, and ACK is blue.

If the theft state is ACK, only the fastpath is permitted to change the theft state, as indicated by the blue color. When the fastpath completes, it sets the theft state to READY.

Once the slowpath sees a thread's theft state is READY, the slowpath is permitted to steal that thread's count. The slowpath then sets that thread's theft state to IDLE.

Quick Quiz 5.48: In Figure 5.7, why is the REQ theft state colored red?

Quick Quiz 5.49: In Figure 5.7, what is the point of having separate REQ and ACK theft states? Why not simplify the state machine by collapsing them into a single REQACK state? Then whichever of the signal handler or the fastpath gets there first could set the state to READY.

5.4.4 Signal-Theft Limit Counter Implementation

Listing 5.17 (count_lim_sig.c) shows the data structures used by the signal-theft based counter implementation. Lines 1-7 define the states and values for the per-thread theft state machine described in the preceding section. Lines 8-17 are similar to earlier implementations, with the addition of lines 14 and 15 to allow remote access to a thread's countermax and theft variables, respectively.

Listing 5.17: Signal-Theft Limit Counter Data
1 #define THEFT_IDLE 0
2 #define THEFT_REQ 1
3 #define THEFT_ACK 2
4 #define THEFT_READY 3
5
6 int __thread theft = THEFT_IDLE;
7 int __thread counting = 0;
8 unsigned long __thread counter = 0;
9 unsigned long __thread countermax = 0;
10 unsigned long globalcountmax = 10000;
11 unsigned long globalcount = 0;
12 unsigned long globalreserve = 0;
13 unsigned long *counterp[NR_THREADS] = { NULL };
14 unsigned long *countermaxp[NR_THREADS] = { NULL };
15 int *theftp[NR_THREADS] = { NULL };
16 DEFINE_SPINLOCK(gblcnt_mutex);
17 #define MAX_COUNTERMAX 100

Listing 5.18 shows the functions responsible for migrating counts between per-thread variables and the global variables. Lines 1-7 show globalize_count(), which is identical to earlier implementations. Lines 9-16 show flush_local_count_sig(), which is the signal handler used in the theft process. Lines 11 and 12 check to see if the theft state is REQ, and, if not, returns without change. Line 13 sets the theft state to ACK, and, if line 14 sees that this thread's fastpaths are not running, line 15 uses smp_store_release() to set the theft state to READY, further ensuring that any change to counter in the fastpath happens before this change of theft to READY.

Quick Quiz 5.50: In Listing 5.18, doesn't flush_local_count_sig() need stronger memory barriers?

Lines 18-47 show flush_local_count(), which is called from the slowpath to flush all threads' local counts. The loop spanning lines 23-32 advances the theft state for each thread that has local count, and also sends that thread a signal. Line 24 skips any non-existent threads. Otherwise, line 25 checks to see if the current thread holds any local count, and, if not, line 26 sets the thread's theft state to READY and line 27 skips to the next thread. Otherwise, line 29 sets the thread's theft state to REQ and line 30 sends the thread a signal.

Quick Quiz 5.51: In Listing 5.18, why is it safe for line 25 to directly access the other thread's countermax variable?

Quick Quiz 5.52: In Listing 5.18, why doesn't line 30 check for the current thread sending itself a signal?

Quick Quiz 5.53: The code shown in Listings 5.17 and 5.18 works with GCC and POSIX. What would be required to make it also conform to the ISO C standard?

The loop spanning lines 33-46 waits until each thread reaches READY state, then steals that thread's count. Lines 34-35 skip any non-existent threads, and the loop spanning lines 36-40 waits until the current thread's theft state becomes READY. Line 37 blocks for a millisecond to avoid priority-inversion problems, and if line 38 determines that the thread's signal has not yet arrived, line 39 resends the signal. Execution reaches line 41 when the thread's theft state becomes READY, so lines 41-44 do the thieving. Line 45 then sets the thread's theft state back to IDLE.

Quick Quiz 5.54: In Listing 5.18, why does line 39 resend the signal?

Lines 49-61 show balance_count(), which is similar to that of earlier examples.


Listing 5.18: Signal-Theft Limit Counter Value-Migration Functions
1 static void globalize_count(void)
2 {
3   globalcount += counter;
4   counter = 0;
5   globalreserve -= countermax;
6   countermax = 0;
7 }
8
9 static void flush_local_count_sig(int unused)
10 {
11   if (READ_ONCE(theft) != THEFT_REQ)
12     return;
13   WRITE_ONCE(theft, THEFT_ACK);
14   if (!counting)
15     smp_store_release(&theft, THEFT_READY);
16 }
17
18 static void flush_local_count(void)
19 {
20   int t;
21   thread_id_t tid;
22
23   for_each_tid(t, tid) {
24     if (theftp[t] != NULL) {
25       if (*countermaxp[t] == 0) {
26         WRITE_ONCE(*theftp[t], THEFT_READY);
27         continue;
28       }
29       WRITE_ONCE(*theftp[t], THEFT_REQ);
30       pthread_kill(tid, SIGUSR1);
31     }
32   }
33   for_each_tid(t, tid) {
34     if (theftp[t] == NULL)
35       continue;
36     while (smp_load_acquire(theftp[t]) != THEFT_READY) {
37       poll(NULL, 0, 1);
38       if (READ_ONCE(*theftp[t]) == THEFT_REQ)
39         pthread_kill(tid, SIGUSR1);
40     }
41     globalcount += *counterp[t];
42     *counterp[t] = 0;
43     globalreserve -= *countermaxp[t];
44     *countermaxp[t] = 0;
45     smp_store_release(theftp[t], THEFT_IDLE);
46   }
47 }
48
49 static void balance_count(void)
50 {
51   countermax = globalcountmax - globalcount -
52                globalreserve;
53   countermax /= num_online_threads();
54   if (countermax > MAX_COUNTERMAX)
55     countermax = MAX_COUNTERMAX;
56   globalreserve += countermax;
57   counter = countermax / 2;
58   if (counter > globalcount)
59     counter = globalcount;
60   globalcount -= counter;
61 }

Listing 5.19: Signal-Theft Limit Counter Add Function
1 int add_count(unsigned long delta)
2 {
3   int fastpath = 0;
4
5   WRITE_ONCE(counting, 1);
6   barrier();
7   if (smp_load_acquire(&theft) <= THEFT_REQ &&
8       countermax - counter >= delta) {
9     WRITE_ONCE(counter, counter + delta);
10     fastpath = 1;
11   }
12   barrier();
13   WRITE_ONCE(counting, 0);
14   barrier();
15   if (READ_ONCE(theft) == THEFT_ACK)
16     smp_store_release(&theft, THEFT_READY);
17   if (fastpath)
18     return 1;
19   spin_lock(&gblcnt_mutex);
20   globalize_count();
21   if (globalcountmax - globalcount -
22       globalreserve < delta) {
23     flush_local_count();
24     if (globalcountmax - globalcount -
25         globalreserve < delta) {
26       spin_unlock(&gblcnt_mutex);
27       return 0;
28     }
29   }
30   globalcount += delta;
31   balance_count();
32   spin_unlock(&gblcnt_mutex);
33   return 1;
34 }

Listing 5.20: Signal-Theft Limit Counter Subtract Function
1 int sub_count(unsigned long delta)
2 {
3   int fastpath = 0;
4
5   WRITE_ONCE(counting, 1);
6   barrier();
7   if (smp_load_acquire(&theft) <= THEFT_REQ &&
8       counter >= delta) {
9     WRITE_ONCE(counter, counter - delta);
10     fastpath = 1;
11   }
12   barrier();
13   WRITE_ONCE(counting, 0);
14   barrier();
15   if (READ_ONCE(theft) == THEFT_ACK)
16     smp_store_release(&theft, THEFT_READY);
17   if (fastpath)
18     return 1;
19   spin_lock(&gblcnt_mutex);
20   globalize_count();
21   if (globalcount < delta) {
22     flush_local_count();
23     if (globalcount < delta) {
24       spin_unlock(&gblcnt_mutex);
25       return 0;
26     }
27   }
28   globalcount -= delta;
29   balance_count();
30   spin_unlock(&gblcnt_mutex);
31   return 1;
32 }


Listing 5.21: Signal-Theft Limit Counter Read Function
1 unsigned long read_count(void)
2 {
3   int t;
4   unsigned long sum;
5
6   spin_lock(&gblcnt_mutex);
7   sum = globalcount;
8   for_each_thread(t) {
9     if (counterp[t] != NULL)
10       sum += READ_ONCE(*counterp[t]);
11   }
12   spin_unlock(&gblcnt_mutex);
13   return sum;
14 }

Listing 5.19 shows the add_count() function. The fastpath spans lines 5-18, and the slowpath lines 19-33. Line 5 sets the per-thread counting variable to 1 so that any subsequent signal handlers interrupting this thread will set the theft state to ACK rather than READY, allowing this fastpath to complete properly. Line 6 prevents the compiler from reordering any of the fastpath body to precede the setting of counting. Lines 7 and 8 check to see if the per-thread data can accommodate the add_count() and if there is no ongoing theft in progress, and if so line 9 does the fastpath addition and line 10 notes that the fastpath was taken.

In either case, line 12 prevents the compiler from reordering the fastpath body to follow line 13, which permits any subsequent signal handlers to undertake theft. Line 14 again disables compiler reordering, and then line 15 checks to see if the signal handler deferred the theft state-change to READY, and, if so, line 16 uses smp_store_release() to set the theft state to READY, further ensuring that any CPU that sees the READY state also sees the effects of line 9. If the fastpath addition at line 9 was executed, then line 18 returns success.

Otherwise, we fall through to the slowpath starting at line 19. The structure of the slowpath is similar to those of earlier examples, so its analysis is left as an exercise to the reader. Similarly, the structure of sub_count() on Listing 5.20 is the same as that of add_count(), so the analysis of sub_count() is also left as an exercise for the reader, as is the analysis of read_count() in Listing 5.21.

Lines 1-12 of Listing 5.22 show count_init(), which sets up flush_local_count_sig() as the signal handler for SIGUSR1, enabling the pthread_kill() calls in flush_local_count() to invoke flush_local_count_sig(). The code for thread registry and unregistry is similar to that of earlier examples, so its analysis is left as an exercise for the reader.

Listing 5.22: Signal-Theft Limit Counter Initialization Functions
1 void count_init(void)
2 {
3   struct sigaction sa;
4
5   sa.sa_handler = flush_local_count_sig;
6   sigemptyset(&sa.sa_mask);
7   sa.sa_flags = 0;
8   if (sigaction(SIGUSR1, &sa, NULL) != 0) {
9     perror("sigaction");
10     exit(EXIT_FAILURE);
11   }
12 }
13
14 void count_register_thread(void)
15 {
16   int idx = smp_thread_id();
17
18   spin_lock(&gblcnt_mutex);
19   counterp[idx] = &counter;
20   countermaxp[idx] = &countermax;
21   theftp[idx] = &theft;
22   spin_unlock(&gblcnt_mutex);
23 }
24
25 void count_unregister_thread(int nthreadsexpected)
26 {
27   int idx = smp_thread_id();
28
29   spin_lock(&gblcnt_mutex);
30   globalize_count();
31   counterp[idx] = NULL;
32   countermaxp[idx] = NULL;
33   theftp[idx] = NULL;
34   spin_unlock(&gblcnt_mutex);
35 }

5.4.5 Signal-Theft Limit Counter Discussion

The signal-theft implementation runs more than eight times as fast as the atomic implementation on my six-core x86 laptop. Is it always preferable?

The signal-theft implementation would be vastly preferable on Pentium-4 systems, given their slow atomic instructions, but the old 80386-based Sequent Symmetry systems would do much better with the shorter path length of the atomic implementation. However, this increased update-side performance comes at the price of higher read-side overhead: Those POSIX signals are not free. If ultimate performance is of the essence, you will need to measure them both on the system that your application is to be deployed on.

Quick Quiz 5.55: Not only are POSIX signals slow, sending one to each thread simply does not scale. What would you do if you had (say) 10,000 threads and needed the read side to be fast?

This is but one reason why high-quality APIs are so important: They permit implementations to be changed

as required by ever-changing hardware performance char- Line 1 read-acquires the lock, and either line 3 or 7
acteristics. releases it. Line 2 checks to see if the device is being
Quick Quiz 5.56: What if you want an exact limit counter to removed, and, if so, line 3 releases the lock and line 4
be exact only for its lower limit, but to allow the upper limit to cancels the I/O, or takes whatever action is appropriate
be inexact? given that the device is to be removed. Otherwise, line 6
increments the access count, line 7 releases the lock, line 8
performs the I/O, and line 9 decrements the access count.
5.4.6 Applying Exact Limit Counters Quick Quiz 5.58: This is ridiculous! We are read-acquiring
a reader-writer lock to update the counter? What are you
Although the exact limit counter implementations pre- playing at???
sented in this section can be very useful, they are not
much help if the counter’s value remains near zero at The code to remove the device might be as follows:
all times, as it might when counting the number of out-
standing accesses to an I/O device. The high overhead 1 write_lock(&mylock);
2 removing = 1;
of such near-zero counting is especially painful given 3 sub_count(mybias);
that we normally don’t care how many references there 4 write_unlock(&mylock);
5 while (read_count() != 0)
are. As noted in the removable I/O device access-count 6 poll(NULL, 0, 1);
problem posed by Quick Quiz 5.5, the number of accesses 7 remove_device();
is irrelevant except in those rare cases when someone is
actually trying to remove the device. Line 1 write-acquires the lock and line 4 releases it.
One simple solution to this problem is to add a large Line 2 notes that the device is being removed, and the
“bias” (for example, one billion) to the counter in order loop spanning lines 5–6 waits for any I/O operations to
to ensure that the value is far enough from zero that the complete. Finally, line 7 does any additional processing
counter can operate efficiently. When someone wants needed to prepare for device removal.
to remove the device, this bias is subtracted from the
counter value. Counting the last few accesses will be quite Quick Quiz 5.59: What other issues would need to be
inefficient, but the important point is that the many prior accounted for in a real system?
accesses will have been counted at full speed.
Quick Quiz 5.57: What else had you better have done when
using a biased counter? 5.5 Parallel Counting Discussion
Although a biased counter can be quite helpful and This idea that there is generality in the specific is of
useful, it is only a partial solution to the removable I/O far-reaching importance.
device access-count problem called out on page 49. When
attempting to remove a device, we must not only know Douglas R. Hofstadter
the precise number of current I/O accesses, we also need
to prevent any future accesses from starting. One way This chapter has presented the reliability, performance, and
to accomplish this is to read-acquire a reader-writer lock scalability problems with traditional counting primitives.
when updating the counter, and to write-acquire that same The C-language ++ operator is not guaranteed to function
reader-writer lock when checking the counter. Code for reliably in multithreaded code, and atomic operations to a
doing I/O might be as follows: single variable neither perform nor scale well. This chapter
therefore presented a number of counting algorithms that
1 read_lock(&mylock); perform and scale extremely well in certain special cases.
2 if (removing) { It is well worth reviewing the lessons from these count-
3 read_unlock(&mylock);
4 cancel_io();
ing algorithms. To that end, Section 5.5.1 overviews
5 } else { requisite validation, Section 5.5.2 summarizes perfor-
6 add_count(1); mance and scalability, Section 5.5.3 discusses the need
7 read_unlock(&mylock);
8 do_io(); for specialization, and finally, Section 5.5.4 enumerates
9 sub_count(1); lessons learned and calls attention to later chapters that
10 }
will expand on these lessons.

v2023.06.11a
5.5. PARALLEL COUNTING DISCUSSION 69

Table 5.1: Statistical/Limit Counter Performance on x86

Exact?
Reads (ns)
Algorithm Updates
(count_*.c) Section (ns) 1 CPU 8 CPUs 64 CPUs 420 CPUs
stat 5.2.2 6.3 294 303 315 612
stat_eventual 5.2.4 6.4 1 1 1 1
end 5.2.3 2.9 301 6,309 147,594 239,683
end_rcu 13.5.1 2.9 454 481 508 2,317
lim 5.3.2 N 3.2 435 6,678 156,175 239,422
lim_app 5.3.4 N 2.4 485 7,041 173,108 239,682
lim_atomic 5.4.1 Y 19.7 513 7,085 199,957 239,450
lim_sig 5.4.4 Y 4.7 519 6,805 120,000 238,811

5.5.1 Parallel Counting Validation The per-thread-variable implementation (count_end.c)


is significantly faster on updates than the array-based
Many of the algorithms in this section are quite simple, so implementation (count_stat.c), but is slower at reads
much so that it is tempting to declare them to be correct on large numbers of core, and suffers severe lock con-
by construction or by inspection. Unfortunately, it is all tention when there are many parallel readers. This con-
too easy for those carrying out the construction or the tention can be addressed using the deferred-processing
inspection to become overconfident, tired, confused, or techniques introduced in Chapter 9, as shown on the
just plain sloppy, all of which can result in bugs. And count_end_rcu.c row of Table 5.1. Deferred process-
early implementations of these limit counters have in fact ing also shines on the count_stat_eventual.c row,
contained bugs, in some cases aided and abetted by the courtesy of eventual consistency.
complexities inherent in maintaining a 64-bit count on a
32-bit system. Therefore, validation is not optional, even Quick Quiz 5.60: On the count_stat.c row of Table 5.1,
for the simple algorithms presented in this chapter. we see that the read-side scales linearly with the number of
The statistical counters are tested for acting like counters threads. How is that possible given that the more threads there
(“counttorture.h”), that is, that the aggregate sum in are, the more per-thread counters must be summed up?
the counter changes by the sum of the amounts added by
the various update-side threads. Quick Quiz 5.61: Even on the fourth row of Table 5.1,
The limit counters are also tested for acting like counters the read-side performance of these statistical counter imple-
(“limtorture.h”), and additionally checked for their mentations is pretty horrible. So why bother with them?
ability to accommodate the specified limit.
Both of these test suites produce performance data that
The bottom half of Table 5.1 shows the performance of
is used in Section 5.5.2.
the parallel limit-counting algorithms. Exact enforcement
Although this level of validation is good and sufficient of the limits incurs a substantial update-side performance
for textbook implementations such as these, it would be penalty, although on this x86 system that penalty can
wise to apply additional validation before putting similar be reduced by substituting signals for atomic operations.
algorithms into production. Chapter 11 describes addi- All of these implementations suffer from read-side lock
tional approaches to testing, and given the simplicity of contention in the face of concurrent readers.
most of these counting algorithms, most of the techniques
described in Chapter 12 can also be quite helpful. Quick Quiz 5.62: Given the performance data shown in the
bottom half of Table 5.1, we should always prefer signals over
atomic operations, right?
5.5.2 Parallel Counting Performance
The top half of Table 5.1 shows the performance of the Quick Quiz 5.63: Can advanced techniques be applied to
four parallel statistical counting algorithms. All four algo- address the lock contention for readers seen in the bottom half
rithms provide near-perfect linear scalability for updates. of Table 5.1?

v2023.06.11a
70 CHAPTER 5. COUNTING

In short, this chapter has demonstrated a number of hardware configuration and in workload. There has in fact
counting algorithms that perform and scale extremely been some research into this sort of automation [AHS+ 03,
well in a number of special cases. But must our parallel SAH+ 03], and the Linux kernel does some boot-time
counting be confined to special cases? Wouldn’t it be reconfiguration, including limited binary rewriting. This
better to have a general algorithm that operated efficiently sort of adaptation will become increasingly important as
in all cases? The next section looks at these questions. the number of CPUs on mainstream systems continues to
increase.
5.5.3 Parallel Counting Specializations In short, as discussed in Chapter 3, the laws of physics
constrain parallel software just as surely as they constrain
The fact that these algorithms only work well in their mechanical artifacts such as bridges. These constraints
respective special cases might be considered a major force specialization, though in the case of software it
problem with parallel programming in general. After might be possible to automate the choice of specialization
all, the C-language ++ operator works just fine in single- to fit the hardware and workload in question.
threaded code, and not just for special cases, but in general, Of course, even generalized counting is quite special-
right? ized. We need to do a great number of other things with
This line of reasoning does contain a grain of truth, but computers. The next section relates what we have learned
is in essence misguided. The problem is not parallelism from counters to topics taken up later in this book.
as such, but rather scalability. To understand this, first
consider the C-language ++ operator. The fact is that it 5.5.4 Parallel Counting Lessons
does not work in general, only for a restricted range of
numbers. If you need to deal with 1,000-digit decimal The opening paragraph of this chapter promised that our
numbers, the C-language ++ operator will not work for study of counting would provide an excellent introduction
you. to parallel programming. This section makes explicit
connections between the lessons from this chapter and the
Quick Quiz 5.64: The ++ operator works just fine for 1,000-
digit numbers! Haven’t you heard of operator overloading??? material presented in a number of later chapters.
The examples in this chapter have shown that an impor-
tant scalability and performance tool is partitioning. The
This problem is not specific to arithmetic. Suppose you counters might be fully partitioned, as in the statistical
need to store and query data. Should you use an ASCII counters discussed in Section 5.2, or partially partitioned
file? XML? A relational database? A linked list? A dense as in the limit counters discussed in Sections 5.3 and 5.4.
array? A B-tree? A radix tree? Or one of the plethora of Partitioning will be considered in far greater depth in Chap-
other data structures and environments that permit data to ter 6, and partial parallelization in particular in Section 6.4,
be stored and queried? It depends on what you need to where it is called parallel fastpath.
do, how fast you need it done, and how large your data set
Quick Quiz 5.65: But if we are going to have to partition
is—even on sequential systems. everything, why bother with shared-memory multithreading?
Similarly, if you need to count, your solution will Why not just partition the problem completely and run as
depend on how large of numbers you need to work with, multiple processes, each in its own address space?
how many CPUs need to be manipulating a given number
concurrently, how the number is to be used, and what level The partially partitioned counting algorithms used lock-
of performance and scalability you will need. ing to guard the global data, and locking is the subject
Nor is this problem specific to software. The design of Chapter 7. In contrast, the partitioned data tended to
for a bridge meant to allow people to walk across a small be fully under the control of the corresponding thread, so
brook might be a simple as a single wooden plank. But you that no synchronization whatsoever was required. This
would probably not use a plank to span the kilometers-wide data ownership will be introduced in Section 6.3.4 and
mouth of the Columbia River, nor would such a design be discussed in more detail in Chapter 8.
advisable for bridges carrying concrete trucks. In short, Because integer addition and subtraction are extremely
just as bridge design must change with increasing span cheap compared to typical synchronization operations,
and load, so must software design change as the number of achieving reasonable scalability requires synchronization
CPUs increases. That said, it would be good to automate operations be used sparingly. One way of achieving this
this process, so that the software adapts to changes in is to batch the addition and subtraction operations, so that

v2023.06.11a
5.5. PARALLEL COUNTING DISCUSSION 71

a great many of these cheap operations are handled by a Batch


single synchronization operation. Batching optimizations
Work
of one sort or another are used by each of the counting
Partitioning
algorithms listed in Table 5.1.
Finally, the eventually consistent statistical counter dis- Resource
Parallel
cussed in Section 5.2.4 showed how deferring activity Partitioning and
Access Control Replication
(in that case, updating the global counter) can provide
substantial performance and scalability benefits. This
Interacting
approach allows common case code to use much cheaper
With Hardware
synchronization operations than would otherwise be pos- Weaken Partition
sible. Chapter 9 will examine a number of additional ways
that deferral can improve performance, scalability, and
even real-time response. Figure 5.8: Optimization and the Four Parallel-
Summarizing the summary: Programming Tasks

1. Partitioning promotes performance and scalability.


acceptable for a two-CPU system, but nevertheless
2. Partial partitioning, that is, partitioning applied only be completely inadequate for an eight-CPU system.
to common code paths, works almost as well.
Summarizing still further, we have the “big three” meth-
3. Partial partitioning can be applied to code (as in ods of increasing performance and scalability, namely
Section 5.2’s statistical counters’ partitioned updates (1) partitioning over CPUs or threads, (2) batching so that
and non-partitioned reads), but also across time (as in more work can be done by each expensive synchronization
Section 5.3’s and Section 5.4’s limit counters running operation, and (3) weakening synchronization operations
fast when far from the limit, but slowly when close where feasible. As a rough rule of thumb, you should
to the limit). apply these methods in this order, as was noted earlier
in the discussion of Figure 2.6 on page 15. The parti-
4. Partitioning across time often batches updates locally
tioning optimization applies to the “Resource Partitioning
in order to reduce the number of expensive global
and Replication” bubble, the batching optimization to the
operations, thereby decreasing synchronization over-
“Work Partitioning” bubble, and the weakening optimiza-
head, in turn improving performance and scalability.
tion to the “Parallel Access Control” bubble, as shown in
All the algorithms shown in Table 5.1 make heavy
Figure 5.8. Of course, if you are using special-purpose
use of batching.
hardware such as digital signal processors (DSPs), field-
5. Read-only code paths should remain read-only: Spu- programmable gate arrays (FPGAs), or general-purpose
rious synchronization writes to shared memory graphical processing units (GPGPUs), you may need to
kill performance and scalability, as seen in the pay close attention to the “Interacting With Hardware”
count_end.c row of Table 5.1. bubble throughout the design process. For example, the
structure of a GPGPU’s hardware threads and memory
6. Judicious use of delay promotes performance and connectivity might richly reward very careful partitioning
scalability, as seen in Section 5.2.4. and batching design decisions.
In short, as noted at the beginning of this chapter, the
7. Parallel performance and scalability is usually a
simplicity of counting have allowed us to explore many
balancing act: Beyond a certain point, optimizing
fundamental concurrency issues without the distraction
some code paths will degrade others. The count_
of complex synchronization primitives or elaborate data
stat.c and count_end_rcu.c rows of Table 5.1
structures. Such synchronization primitives and data
illustrate this point.
structures are covered in later chapters.
8. Different levels of performance and scalability will
affect algorithm and data-structure design, as do a
large number of other factors. Figure 5.1 illustrates
this point: Atomic increment might be completely

v2023.06.11a
72 CHAPTER 5. COUNTING

v2023.06.11a
Divide and rule.
Philip II of Macedon
Chapter 6

Partitioning and Synchronization Design

This chapter describes how to design software to take ad- To this end, Section 6.1 presents partitioning exercises,
vantage of modern commodity multicore systems by using Section 6.2 reviews partitionability design criteria, Sec-
idioms, or “design patterns” [Ale79, GHJV95, SSRB00], tion 6.3 discusses synchronization granularity selection,
to balance performance, scalability, and response time. Section 6.4 overviews important parallel-fastpath design
Correctly partitioned problems lead to simple, scalable, patterns that provide speed and scalability on common-
and high-performance solutions, while poorly partitioned case fastpaths while using simpler less-scalable “slow path”
problems result in slow and complex solutions. This fallbacks for unusual situations, and finally Section 6.5
chapter will help you design partitioning into your code, takes a brief look beyond partitioning.
with some discussion of batching and weakening as well.
The word “design” is very important: You should parti-
tion first, batch second, weaken third, and code fourth. 6.1 Partitioning Exercises
Changing this order often leads to poor performance and
scalability along with great frustration.1 Whenever a theory appears to you as the only
This chapter will also look at some specific problems, possible one, take this as a sign that you have
including: neither understood the theory nor the problem
which it was intended to solve.
1. Constraints on the classic Dining Philosophers prob-
Karl Popper
lem requiring that all the philophers be able to dine
concurrently. Although partitioning is more widely understood than it
2. Lock-based double-ended queue implementations was in the early 2000s, its value is still underappreciated.
that provide concurrency between operations on both Section 6.1.1 therefore takes more highly parallel look at
ends of a given queue when there are many elements the classic Dining Philosophers problem and Section 6.1.2
in the queue, but still work correctly when the queue revisits the double-ended queue.
contains only a few elements. (Or, for that matter, no
elements.) 6.1.1 Dining Philosophers Problem
3. Summarizing the rough quality of a concurrent algo- Figure 6.1 shows a diagram of the classic Dining Phi-
rithm with only a few numbers. losophers problem [Dij71]. This problem features five
philosophers who do nothing but think and eat a “very
4. Selecting the right granularity of partitioning. difficult kind of spaghetti” which requires two forks to
eat.2 A given philosopher is permitted to use only the
5. Comcurrent designs for applications that do not fully
forks to his or her immediate right and left, but will not
partition.
put a given fork down until sated.
6. Obtaining more than 2x speedup from two CPUs. The object is to construct an algorithm that, quite
literally, prevents starvation. One starvation scenario
1 That other great dodge around the Laws of Physics, read-only

replication, is covered in Chapter 9. 2 But feel free to instead think in terms of chopsticks.

73

v2023.06.11a
74 CHAPTER 6. PARTITIONING AND SYNCHRONIZATION DESIGN

1. P2 picks up fork 1, preventing P1 from taking a fork.


P1
2. P3 picks up fork 2.

3. P4 picks up fork 3.

P5 P2 4. P5 picks up fork 4.

5. P5 picks up fork 5 and eats.

6. P5 puts down forks 4 and 5.

7. P4 picks up fork 4 and eats.

In short, this algorithm can result in only one philoso-


P4 P3
pher eating at a given time, even when all five philosophers
are hungry, despite the fact that there are more than enough
Figure 6.1: Dining Philosophers Problem forks for two philosophers to eat concurrently. It should
be possible to do better than this!
One approach is shown in Figure 6.4, which includes
would be if all of the philosophers picked up their leftmost four philosophers rather than five to better illustrate the
forks simultaneously. Because none of them will put down partition technique. Here the upper and rightmost philos-
their fork until after they finished eating, and because none ophers share a pair of forks, while the lower and leftmost
of them may pick up their second fork until at least one philosophers share another pair of forks. If all philoso-
of them has finished eating, they all starve. Please note phers are simultaneously hungry, at least two will always
that it is not sufficient to allow at least one philosopher to be able to eat concurrently. In addition, as shown in the
eat. As Figure 6.2 shows, starvation of even a few of the figure, the forks can now be bundled so that the pair are
philosophers is to be avoided. picked up and put down simultaneously, simplifying the
Dijkstra’s solution used a global semaphore, which acquisition and release algorithms.
works fine assuming negligible communications delays,
Quick Quiz 6.1: Is there a better solution to the Dining
an assumption that became invalid in the late 1980s or Philosophers Problem?
early 1990s.3 More recent solutions number the forks
as shown in Figure 6.3. Each philosopher picks up the Quick Quiz 6.2: How would you valididate an algorithm
lowest-numbered fork next to his or her plate, then picks alleged to solve the Dining Philosophers Problem?
up the other fork. The philosopher sitting in the uppermost
position in the diagram thus picks up the leftmost fork first, This is an example of “horizontal parallelism” [Inm85]
then the rightmost fork, while the rest of the philosophers or “data parallelism”, so named because there is no depen-
instead pick up their rightmost fork first. Because two of dency among the pairs of philosophers. In a horizontally
the philosophers will attempt to pick up fork 1 first, and parallel data-processing system, a given item of data would
because only one of those two philosophers will succeed, be processed by only one of a replicated set of software
there will be five forks available to four philosophers. At components.
least one of these four will have two forks, and will thus
Quick Quiz 6.3: And in just what sense can this “horizontal
be able to eat. parallelism” be said to be “horizontal”?
This general technique of numbering resources and
acquiring them in numerical order is heavily used as a
deadlock-prevention technique. However, it is easy to
6.1.2 Double-Ended Queue
imagine a sequence of events that will result in only one
philosopher eating at a time even though all are hungry: A double-ended queue is a data structure containing a
3 It is all too easy to denigrate Dijkstra from the viewpoint of the
list of elements that may be inserted or removed from
year 2021, more than 50 years after the fact. If you still feel the need
either end [Knu73]. It has been claimed that a lock-based
to denigrate Dijkstra, my advice is to publish something, wait 50 years, implementation permitting concurrent operations on both
and then see how well your ideas stood the test of time. ends of the double-ended queue is difficult [Gro07]. This

v2023.06.11a
6.1. PARTITIONING EXERCISES 75

Figure 6.2: Partial Starvation Is Also Bad

P1 P1

5 1

P5 P2

P4 P2

4 2

P4 P3

P3

Figure 6.3: Dining Philosophers Problem, Textbook


Solution Figure 6.4: Dining Philosophers Problem, Partitioned

section shows how a partitioning design strategy can result elements pushed onto it must have already been popped
in a reasonably simple implementation, looking at three from it.
general approaches in the following sections. But first, The beginnings of a test suite for concurrent double-
how should we validate a concurrent double-ended queue? ended queues (“deqtorture.h”) provides the following
checks:

6.1.2.1 Double-Ended Queue Validation 1. Element-ordering checks provided by CHECK_


SEQUENCE_PAIR().
A good place to start is with invariants. For example, if
elements are pushed onto one end of a double-ended queue 2. Checks that elements popped were most recently
and popped off of the other, the order of those elements pushed, provided by melee().
must be preserved. Similarly, if elements are pushed onto 3. Checks that elements pushed are popped before the
one end of the queue and popped off of that same end, the queue is emptied, also provided by melee().
order of those elements must be reversed. Any element
popped from the queue must have been most recently This suite includes both sequential and concurrent tests.
pushed onto that queue, and if the queue is emptied, all Although this suite is good and sufficient for textbook

v2023.06.11a
76 CHAPTER 6. PARTITIONING AND SYNCHRONIZATION DESIGN

Lock L Lock R Lock L Lock R


Header L Header R
DEQ L DEQ R

Figure 6.6: Compound Double-Ended Queue


Lock L Lock R

Header L 0 Header R

the list can shift the queue from one special case to another
at any time. It is far better to consider other designs.
Lock L Lock R

Header L 0 1 Header R
6.1.2.3 Compound Double-Ended Queue

Lock L Lock R
One way of forcing non-overlapping lock domains is
shown in Figure 6.6. Two separate double-ended queues
Header L 0 1 2 Header R
are run in tandem, each protected by its own lock. This
means that elements must occasionally be shuttled from
Lock L Lock R
one of the double-ended queues to the other, in which case
both locks must be held. A simple lock hierarchy may
Header L 0 1 2 3 Header R
be used to avoid deadlock, for example, always acquiring
the left-hand lock before acquiring the right-hand lock.
Figure 6.5: Double-Ended Queue With Left- and Right-
This will be much simpler than applying two locks to the
Hand Locks
same double-ended queue, as we can unconditionally left-
enqueue elements to the left-hand queue and right-enqueue
code, you should test considerably more thoroughly for elements to the right-hand queue. The main complication
code intended for production use. Chapters 11 and 12 arises when dequeuing from an empty queue, in which
cover a large array of validation tools and techniques. case it is necessary to:
But with a prototype test suite in place, we are ready
to look at the double-ended-queue algorithms in the next 1. If holding the right-hand lock, release it and acquire
sections. the left-hand lock.

2. Acquire the right-hand lock.


6.1.2.2 Left- and Right-Hand Locks
3. Rebalance the elements across the two queues.
One seemingly straightforward approach would be to use
a doubly linked list with a left-hand lock for left-hand-end
4. Remove the required element if there is one.
enqueue and dequeue operations along with a right-hand
lock for right-hand-end operations, as shown in Figure 6.5.
However, the problem with this approach is that the two 5. Release both locks.
locks’ domains must overlap when there are fewer than
four elements on the list. This overlap is due to the fact Quick Quiz 6.4: In this compound double-ended queue
that removing any given element affects not only that implementation, what should be done if the queue has become
element, but also its left- and right-hand neighbors. These non-empty while releasing and reacquiring the lock?
domains are indicated by color in the figure, with blue with
downward stripes indicating the domain of the left-hand The resulting code (locktdeq.c) is quite straightfor-
lock, red with upward stripes indicating the domain of the ward. The rebalancing operation might well shuttle a given
right-hand lock, and purple (with no stripes) indicating element back and forth between the two queues, wasting
overlapping domains. Although it is possible to create time and possibly requiring workload-dependent heuris-
an algorithm that works this way, the fact that it has no tics to obtain optimal performance. Although this might
fewer than five special cases should raise a big red flag, well be the best approach in some cases, it is interesting
especially given that concurrent activity at the other end of to try for an algorithm with greater determinism.

v2023.06.11a
6.1. PARTITIONING EXERCISES 77

DEQ 0 DEQ 1 DEQ 2 DEQ 3

Lock 0 Lock 1 Lock 2 Lock 3

Index L Index R

Lock L Lock R
R1
Figure 6.7: Hashed Double-Ended Queue

DEQ 0 DEQ 1 DEQ 2 DEQ 3


6.1.2.4 Hashed Double-Ended Queue
Index L Index R
One of the simplest and most effective ways to determinis-
tically partition a data structure is to hash it. It is possible Enq 3R
to trivially hash a double-ended queue by assigning each
element a sequence number based on its position in the list,
so that the first element left-enqueued into an empty queue
is numbered zero and the first element right-enqueued R4 R1 R2 R3
into an empty queue is numbered one. A series of ele-
ments left-enqueued into an otherwise-idle queue would
be assigned decreasing numbers (−1, −2, −3, . . .), while DEQ 0 DEQ 1 DEQ 2 DEQ 3
a series of elements right-enqueued into an otherwise-idle
queue would be assigned increasing numbers (2, 3, 4, . . .). Index L Index R
A key point is that it is not necessary to actually represent
a given element’s number, as this number will be implied Enq 3L1R
by its position in the queue.
Given this approach, we assign one lock to guard the
left-hand index, one to guard the right-hand index, and one
lock for each hash chain. Figure 6.7 shows the resulting R4 R5 R2 R3
data structure given four hash chains. Note that the lock
domains do not overlap, and that deadlock is avoided by
acquiring the index locks before the chain locks, and by L0 R1 L −2 L −1
never acquiring more than one lock of a given type (index
or chain) at a time.
Each hash chain is itself a double-ended queue, and DEQ 0 DEQ 1 DEQ 2 DEQ 3
in this example, each holds every fourth element. The
uppermost portion of Figure 6.8 shows the state after a
Index L Index R
single element (“R1 ”) has been right-enqueued, with the
right-hand index having been incremented to reference Figure 6.8: Hashed Double-Ended Queue After Insertions
hash chain 2. The middle portion of this same figure
shows the state after three more elements have been
right-enqueued. As you can see, the indexes are back to
their initial states (see Figure 6.7), however, each hash
chain is now non-empty. The lower portion of this figure
shows the state after three additional elements have been
left-enqueued and an additional element has been right-
enqueued.

v2023.06.11a
78 CHAPTER 6. PARTITIONING AND SYNCHRONIZATION DESIGN

R4 R5 R6 R7

L0 R1 R2 R3

L−4 L−3 L−2 L−1

L−8 L−7 L−6 L−5

Figure 6.9: Hashed Double-Ended Queue With 16 Ele-


ments Listing 6.2: Lock-Based Parallel Double-Ended Queue Imple-
mentation
1 struct cds_list_head *pdeq_pop_l(struct pdeq *d)
Listing 6.1: Lock-Based Parallel Double-Ended Queue Data 2 {
Structure 3 struct cds_list_head *e;
1 struct pdeq { 4 int i;
2 spinlock_t llock; 5

3 int lidx; 6 spin_lock(&d->llock);


4 spinlock_t rlock; 7 i = moveright(d->lidx);
5 int ridx; 8 e = deq_pop_l(&d->bkt[i]);
6 struct deq bkt[PDEQ_N_BKTS]; 9 if (e != NULL)
7 }; 10 d->lidx = i;
11 spin_unlock(&d->llock);
12 return e;
13 }
14
From the last state shown in Figure 6.8, a left-dequeue 15 struct cds_list_head *pdeq_pop_r(struct pdeq *d)
operation would return element “L−2 ” and leave the left- 16 {
17 struct cds_list_head *e;
hand index referencing hash chain 2, which would then 18 int i;
contain only a single element (“R2 ”). In this state, a 19
20 spin_lock(&d->rlock);
left-enqueue running concurrently with a right-enqueue 21 i = moveleft(d->ridx);
would result in lock contention, but the probability of 22 e = deq_pop_r(&d->bkt[i]);
23 if (e != NULL)
such contention can be reduced to arbitrarily low levels 24 d->ridx = i;
by using a larger hash table. 25 spin_unlock(&d->rlock);
26 return e;
Figure 6.9 shows how 16 elements would be organized 27 }
in a four-hash-bucket parallel double-ended queue. Each 28
29 void pdeq_push_l(struct cds_list_head *e, struct pdeq *d)
underlying single-lock double-ended queue holds a one- 30 {
quarter slice of the full parallel double-ended queue. 31 int i;
32
Listing 6.1 shows the corresponding C-language data 33 spin_lock(&d->llock);
structure, assuming an existing struct deq that provides 34 i = d->lidx;
35 deq_push_l(e, &d->bkt[i]);
a trivially locked double-ended-queue implementation. 36 d->lidx = moveleft(d->lidx);
This data structure contains the left-hand lock on line 2, 37 spin_unlock(&d->llock);
38 }
the left-hand index on line 3, the right-hand lock on line 4 39
(which is cache-aligned in the actual implementation), the 40 void pdeq_push_r(struct cds_list_head *e, struct pdeq *d)
41 {
right-hand index on line 5, and, finally, the hashed array 42 int i;
of simple lock-based double-ended queues on line 6. A 43
44 spin_lock(&d->rlock);
high-performance implementation would of course use 45 i = d->ridx;
padding or special alignment directives to avoid false 46 deq_push_r(e, &d->bkt[i]);
47 d->ridx = moveright(d->ridx);
sharing. 48 spin_unlock(&d->rlock);
Listing 6.2 (lockhdeq.c) shows the implementation of 49 }
the enqueue and dequeue functions.4 Discussion will focus
on the left-hand operations, as the right-hand operations
are trivially derived from them.
Lines 1–13 show pdeq_pop_l(), which left-dequeues
and returns an element if possible, returning NULL other-

4 One could easily create a polymorphic implementation in any

number of languages, but doing so is left as an exercise for the reader.

v2023.06.11a
6.1. PARTITIONING EXERCISES 79

wise. Line 6 acquires the left-hand spinlock, and line 7


computes the index to be dequeued from. Line 8 dequeues
the element, and, if line 9 finds the result to be non-NULL,
line 10 records the new left-hand index. Either way, line 11
releases the lock, and, finally, line 12 returns the element
if there was one, or NULL otherwise.
Lines 29–38 show pdeq_push_l(), which left- Listing 6.3: Compound Parallel Double-Ended Queue Imple-
enqueues the specified element. Line 33 acquires the mentation
left-hand lock, and line 34 picks up the left-hand in- 1 struct cds_list_head *pdeq_pop_l(struct pdeq *d)
dex. Line 35 left-enqueues the specified element onto 2 {
3 struct cds_list_head *e;
the double-ended queue indexed by the left-hand index. 4

Line 36 then updates the left-hand index and line 37 5 spin_lock(&d->llock);


6 e = deq_pop_l(&d->ldeq);
releases the lock. 7 if (e == NULL) {
As noted earlier, the right-hand operations are com- 8 spin_lock(&d->rlock);
9 e = deq_pop_l(&d->rdeq);
pletely analogous to their left-handed counterparts, so 10 cds_list_splice(&d->rdeq.chain, &d->ldeq.chain);
their analysis is left as an exercise for the reader. 11 CDS_INIT_LIST_HEAD(&d->rdeq.chain);
12 spin_unlock(&d->rlock);
13 }
Quick Quiz 6.5: Is the hashed double-ended queue a good 14 spin_unlock(&d->llock);
solution? Why or why not? 15 return e;
16 }
17
18 struct cds_list_head *pdeq_pop_r(struct pdeq *d)
6.1.2.5 Compound Double-Ended Queue Revisited 19 {
20 struct cds_list_head *e;
21
This section revisits the compound double-ended queue, 22 spin_lock(&d->rlock);
using a trivial rebalancing scheme that moves all the 23 e = deq_pop_r(&d->rdeq);
24 if (e == NULL) {
elements from the non-empty queue to the now-empty 25 spin_unlock(&d->rlock);
queue. 26 spin_lock(&d->llock);
27 spin_lock(&d->rlock);
Quick Quiz 6.6: Move all the elements to the queue that 28 e = deq_pop_r(&d->rdeq);
29 if (e == NULL) {
became empty? In what possible universe is this brain-dead 30 e = deq_pop_r(&d->ldeq);
solution in any way optimal??? 31 cds_list_splice(&d->ldeq.chain, &d->rdeq.chain);
32 CDS_INIT_LIST_HEAD(&d->ldeq.chain);
33 }
In contrast to the hashed implementation presented in 34 spin_unlock(&d->llock);
the previous section, the compound implementation will 35 }
36 spin_unlock(&d->rlock);
build on a sequential implementation of a double-ended 37 return e;
queue that uses neither locks nor atomic operations. 38 }
39
Listing 6.3 shows the implementation. Unlike the 40 void pdeq_push_l(struct cds_list_head *e, struct pdeq *d)
hashed implementation, this compound implementation is 41 {
42 spin_lock(&d->llock);
asymmetric, so that we must consider the pdeq_pop_l() 43 deq_push_l(e, &d->ldeq);
and pdeq_pop_r() implementations separately. 44 spin_unlock(&d->llock);
45 }
Quick Quiz 6.7: Why can’t the compound parallel double- 46
47 void pdeq_push_r(struct cds_list_head *e, struct pdeq *d)
ended queue implementation be symmetric? 48 {
49 spin_lock(&d->rlock);
The pdeq_pop_l() implementation is shown on 50 deq_push_r(e, &d->rdeq);
51 spin_unlock(&d->rlock);
lines 1–16 of the figure. Line 5 acquires the left-hand lock, 52 }
which line 14 releases. Line 6 attempts to left-dequeue
an element from the left-hand underlying double-ended
queue, and, if successful, skips lines 8–13 to simply return
this element. Otherwise, line 8 acquires the right-hand
lock, line 9 left-dequeues an element from the right-hand
queue, and line 10 moves any remaining elements on the
right-hand queue to the left-hand queue, line 11 initializes

v2023.06.11a
80 CHAPTER 6. PARTITIONING AND SYNCHRONIZATION DESIGN

the right-hand queue, and line 12 releases the right-hand is 2x scalability, as at most two threads can be holding the
lock. The element, if any, that was dequeued on line 9 dequeue’s locks concurrently. This limitation also applies
will be returned. to algorithms based on non-blocking synchronization,
The pdeq_pop_r() implementation is shown on such as the compare-and-swap-based dequeue algorithm
lines 18–38 of the figure. As before, line 22 acquires of Michael [Mic03].5
the right-hand lock (and line 36 releases it), and line 23 Quick Quiz 6.11: Why are there not one but two solutions to
attempts to right-dequeue an element from the right-hand the double-ended queue problem?
queue, and, if successful, skips lines 25–35 to simply
return this element. However, if line 24 determines that In fact, as noted by Dice et al. [DLM+ 10], an unsynchro-
there was no element to dequeue, line 25 releases the nized single-threaded double-ended queue significantly
right-hand lock and lines 26–27 acquire both locks in outperforms any of the parallel implementations they stud-
the proper order. Line 28 then attempts to right-dequeue ied. Therefore, the key point is that there can be significant
an element from the right-hand list again, and if line 29 overhead enqueuing to or dequeuing from a shared queue,
determines that this second attempt has failed, line 30 regardless of implementation. This should come as no
right-dequeues an element from the left-hand queue (if surprise in light of the material in Chapter 3, given the
there is one available), line 31 moves any remaining ele- strict first-in-first-out (FIFO) nature of these queues.
ments from the left-hand queue to the right-hand queue, Furthermore, these strict FIFO queues are strictly FIFO
and line 32 initializes the left-hand queue. Either way, only with respect to linearization points [HW90]6 that
line 34 releases the left-hand lock. are not visible to the caller, in fact, in these examples, the
Quick Quiz 6.8: Why is it necessary to retry the right-dequeue linearization points are buried in the lock-based critical
operation on line 28 of Listing 6.3? sections. These queues are not strictly FIFO with respect
to (say) the times at which the individual operations
Quick Quiz 6.9: Surely the left-hand lock must sometimes be started [HKLP12]. This indicates that the strict FIFO
available!!! So why is it necessary that line 25 of Listing 6.3 property is not all that valuable in concurrent programs,
unconditionally release the right-hand lock? and in fact, Kirsch et al. present less-strict queues that
provide improved performance and scalability [KLP12].7
The pdeq_push_l() implementation is shown on All that said, if you are pushing all the data used by your
lines 40–45 of Listing 6.3. Line 42 acquires the left- concurrent program through a single queue, you really
hand spinlock, line 43 left-enqueues the element onto the need to rethink your overall design.
left-hand queue, and finally line 44 releases the lock. The
pdeq_push_r() implementation (shown on lines 47–52)
is quite similar. 6.1.3 Partitioning Example Discussion
Quick Quiz 6.10: But in the case where data is flowing in The optimal solution to the dining philosophers problem
only one direction, the algorithm shown in Listing 6.3 will given in the answer to the Quick Quiz in Section 6.1.1 is
have both ends attempting to acquire the same lock whenever an excellent example of “horizontal parallelism” or “data
the consuming end empties its underlying double-ended queue. parallelism”. The synchronization overhead in this case
Doesn’t that mean that sometimes this algorithm fails to provide is nearly (or even exactly) zero. In contrast, the double-
concurrent access to both ends of the queue even when the ended queue implementations are examples of “vertical
queue contains an arbitrarily large number of elements?
parallelism” or “pipelining”, given that data moves from
5 This paper is interesting in that it showed that special double-

6.1.2.6 Double-Ended Queue Discussion compare-and-swap (DCAS) instructions are not needed for lock-free
implementations of double-ended queues. Instead, the common compare-
The compound implementation is somewhat more com- and-swap (e.g., x86 cmpxchg) suffices.
6 In short, a linearization point is a single point within a given
plex than the hashed variant presented in Section 6.1.2.4,
function where that function can be said to have taken effect. In this
but is still reasonably simple. Of course, a more intelligent lock-based implementation, the linearization points can be said to be
rebalancing scheme could be arbitrarily complex, but the anywhere within the critical section that does the work.
7 Nir Shavit produced relaxed stacks for roughly the same rea-
simple scheme shown here has been shown to perform well
sons [Sha11]. This situation leads some to believe that the linearization
compared to software alternatives [DCW+ 11] and even points are useful to theorists rather than developers, and leads others
compared to algorithms using hardware assist [DLM+ 10]. to wonder to what extent the designers of such data structures and
Nevertheless, the best we can hope for from such a scheme algorithms were considering the needs of their users.

v2023.06.11a
6.2. DESIGN CRITERIA 81

one thread to another. The tighter coordination required The design criteria for attaining the three parallel-
for pipelining in turn requires larger units of work to obtain programming goals are speedup, contention, overhead,
a given level of efficiency. read-to-write ratio, and complexity:
Quick Quiz 6.12: The tandem double-ended queue runs Speedup: As noted in Section 2.2, increased performance
about twice as fast as the hashed double-ended queue, even is the major reason to go to all of the time and trouble
when I increase the size of the hash table to an insanely large required to parallelize it. Speedup is defined to be the
number. Why is that? ratio of the time required to run a sequential version
of the program to the time required to run a parallel
Quick Quiz 6.13: Is there a significantly better way of version.
handling concurrency for double-ended queues?
Contention: If more CPUs are applied to a parallel pro-
These two examples show just how powerful partition- gram than can be kept busy by that program, the
ing can be in devising parallel algorithms. Section 6.3.5 excess CPUs are prevented from doing useful work
looks briefly at a third example, matrix multiply. However, by contention. This may be lock contention, memory
all three of these examples beg for more and better design contention, or a host of other performance killers.
criteria for parallel programs, a topic taken up in the next
Work-to-Synchronization Ratio: A uniprocessor, sin-
section.
gle-threaded, non-preemptible, and non-interrupt-
ible8 version of a given parallel program would not
need any synchronization primitives. Therefore,
6.2 Design Criteria any time consumed by these primitives (including
communication cache misses as well as message
One pound of learning requires ten pounds of latency, locking primitives, atomic instructions, and
commonsense to apply it. memory barriers) is overhead that does not contrib-
ute directly to the useful work that the program is
Persian proverb
intended to accomplish. Note that the important
measure is the relationship between the synchroniza-
One way to obtain the best performance and scalability is tion overhead and the overhead of the code in the
to simply hack away until you converge on the best possible critical section, with larger critical sections able to
parallel program. Unfortunately, if your program is other tolerate greater synchronization overhead. The work-
than microscopically tiny, the space of possible parallel to-synchronization ratio is related to the notion of
programs is so huge that convergence is not guaranteed in synchronization efficiency.
the lifetime of the universe. Besides, what exactly is the
“best possible parallel program”? After all, Section 2.2 Read-to-Write Ratio: A data structure that is rarely up-
called out no fewer than three parallel-programming goals dated may often be replicated rather than partitioned,
of performance, productivity, and generality, and the best and furthermore may be protected with asymmet-
possible performance will likely come at a cost in terms ric synchronization primitives that reduce readers’
of productivity and generality. We clearly need to be able synchronization overhead at the expense of that of
to make higher-level choices at design time in order to writers, thereby reducing overall synchronization
arrive at an acceptably good parallel program before that overhead. Corresponding optimizations are possible
program becomes obsolete. for frequently updated data structures, as discussed
However, more detailed design criteria are required to in Chapter 5.
actually produce a real-world design, a task taken up in
Complexity: A parallel program is more complex than an
this section. This being the real world, these criteria often
equivalent sequential program because the parallel
conflict to a greater or lesser degree, requiring that the
program has a much larger state space than does
designer carefully balance the resulting tradeoffs.
the sequential program, although large state spaces
As such, these criteria may be thought of as the
having regular structures can in some cases be easily
“forces” acting on the design, with particularly good
understood. A parallel programmer must consider
tradeoffs between these forces being called “design pat-
terns” [Ale79, GHJV95]. 8 Either by masking interrupts or by being oblivious to them.

v2023.06.11a
82 CHAPTER 6. PARTITIONING AND SYNCHRONIZATION DESIGN

synchronization primitives, messaging, locking de- less than one tenth of its time in the most-restrictive
sign, critical-section identification, and deadlock in exclusive-lock critical section.
the context of this larger state space.
This greater complexity often translates to higher
2. Contention effects consume the excess CPU and/or
development and maintenance costs. Therefore, bud-
wallclock time when the actual speedup is less than
getary constraints can limit the number and types
the number of available CPUs. The larger the gap
of modifications made to an existing program, since
between the number of CPUs and the actual speedup,
a given degree of speedup is worth only so much
the less efficiently the CPUs will be used. Similarly,
time and trouble. Worse yet, added complexity can
the greater the desired efficiency, the smaller the
actually reduce performance and scalability.
achievable speedup.
Therefore, beyond a certain point, there may be
potential sequential optimizations that are cheaper
and more effective than parallelization. As noted 3. If the available synchronization primitives have high
in Section 2.2.1, parallelization is but one perfor- overhead compared to the critical sections that they
mance optimization of many, and is furthermore an guard, the best way to improve speedup is to reduce
optimization that applies most readily to CPU-based the number of times that the primitives are invoked.
bottlenecks. This can be accomplished by batching critical sec-
tions, using data ownership (see Chapter 8), using
These criteria will act together to enforce a maximum asymmetric primitives (see Chapter 9), or by using a
speedup. The first three criteria are deeply interrelated, coarse-grained design such as code locking.
so the remainder of this section analyzes these interrela-
tionships.9
Note that these criteria may also appear as part of 4. If the critical sections have high overhead compared
the requirements specification, and further that they are to the primitives guarding them, the best way to
one solution to the problem of summarizing the quality improve speedup is to increase parallelism by moving
of a concurrent algorithm from page 73. For example, to reader/writer locking, data locking, asymmetric,
speedup may act as a relative desideratum (“the faster, the or data ownership.
better”) or as an absolute requirement of the workload
(“the system must support at least 1,000,000 web hits
per second”). Classic design pattern languages describe 5. If the critical sections have high overhead compared
relative desiderata as forces and absolute requirements as to the primitives guarding them and the data structure
context. being guarded is read much more often than modified,
An understanding of the relationships between these the best way to increase parallelism is to move to
design criteria can be very helpful when identifying ap- reader/writer locking or asymmetric primitives.
propriate design tradeoffs for a parallel program.
1. The less time a program spends in exclusive-lock 6. Many changes that improve SMP performance, for
critical sections, the greater the potential speedup. example, reducing lock contention, also improve
This is a consequence of Amdahl’s Law [Amd67] real-time latencies [McK05c].
because only one CPU may execute within a given
exclusive-lock critical section at a given time.
Quick Quiz 6.14: Don’t all these problems with critical
More specifically, for unbounded linear scalability,
sections mean that we should just always use non-blocking
the fraction of time that the program spends in a
synchronization [Her90], which don’t have critical sections?
given exclusive critical section must decrease as the
number of CPUs increases. For example, a program
will not scale to 10 CPUs unless it spends much
It is worth reiterating that contention has many guises,
9 A real-world parallel system will be subject to many additional including lock contention, memory contention, cache
design criteria, such as data-structure layout, memory size, memory- overflow, thermal throttling, and much else besides. This
hierarchy latencies, bandwidth limitations, and I/O issues. chapter looks primarily at lock and memory contention.

v2023.06.11a
6.3. SYNCHRONIZATION GRANULARITY 83

[Figure 6.10: Design Patterns and Lock Granularity. Boxes: Sequential Program, Code Locking, Data Locking, and Data Ownership, with downward arrows labeled "Partition" and "Own" and upward arrows labeled "Batch" and "Disown".]

[Figure 6.11: MIPS/Clock-Frequency Trend for Intel CPUs. Y-axis: CPU Clock Frequency / MIPS (0.1 to 10000, log scale); X-axis: Year (1975-2020).]

6.3 Synchronization Granularity

  Doing little things well is a step toward doing big things better.

  Harry F. Banks

Figure 6.10 gives a pictorial view of different levels of synchronization granularity, each of which is described in one of the following sections. These sections focus primarily on locking, but similar granularity issues arise with all forms of synchronization.

6.3.1 Sequential Program

If the program runs fast enough on a single processor, and has no interactions with other processes, threads, or interrupt handlers, you should remove the synchronization primitives and spare yourself their overhead and complexity. Some years back, there were those who would argue that Moore's Law would eventually force all programs into this category. However, as can be seen in Figure 6.11, the exponential increase in single-threaded performance halted in about 2003. Therefore, increasing performance will increasingly require parallelism.10 Given that back in 2006 Paul typed the first version of this sentence on a dual-core laptop, and further given that many of the graphs added in 2020 were generated on a system with 56 hardware threads per socket, parallelism is well and truly here. It is also important to note that Ethernet bandwidth is continuing to grow, as shown in Figure 6.12. This growth will continue to motivate multithreaded servers in order to handle the communications load.

10 This plot shows clock frequencies for newer CPUs theoretically capable of retiring one or more instructions per clock, and MIPS for older CPUs requiring multiple clocks to execute even the simplest instruction. The reason for taking this approach is that the newer CPUs' ability to retire multiple instructions per clock is typically limited by memory-system performance.

Please note that this does not mean that you should code each and every program in a multi-threaded manner. Again, if a program runs quickly enough on a single processor, spare yourself the overhead and complexity of SMP synchronization primitives. The simplicity of the hash-table lookup code in Listing 6.4 underscores this point.11 A key point is that speedups due to parallelism are normally limited to the number of CPUs. In contrast, speedups due to sequential optimizations, for example, careful choice of data structure, can be arbitrarily large.

11 The examples in this section are taken from Hart et al. [HMB06], adapted for clarity by gathering related code from multiple files.

Quick Quiz 6.15: What should you do to validate a hash table?

On the other hand, if you are not in this happy situation, read on!

6.3.2 Code Locking

Code locking is quite simple due to the fact that it uses only global locks.12 It is especially easy to retrofit an existing program to use code locking in order to run it on a multiprocessor. If the program has only a single shared resource, code locking will even give optimal performance. However, many of the larger and more complex programs require much of the execution to occur in critical sections, which in turn causes code locking to sharply limit their scalability.

12 If your program instead has locks in data structures, or, in the case of Java, uses classes with synchronized instances, you are instead using "data locking", described in Section 6.3.3.


Listing 6.4: Sequential-Program Hash Table Search
 1 struct hash_table
 2 {
 3   long nbuckets;
 4   struct node **buckets;
 5 };
 6
 7 typedef struct node {
 8   unsigned long key;
 9   struct node *next;
10 } node_t;
11
12 int hash_search(struct hash_table *h, long key)
13 {
14   struct node *cur;
15
16   cur = h->buckets[key % h->nbuckets];
17   while (cur != NULL) {
18     if (cur->key >= key) {
19       return (cur->key == key);
20     }
21     cur = cur->next;
22   }
23   return 0;
24 }

Listing 6.5: Code-Locking Hash Table Search
 1 spinlock_t hash_lock;
 2
 3 struct hash_table
 4 {
 5   long nbuckets;
 6   struct node **buckets;
 7 };
 8
 9 typedef struct node {
10   unsigned long key;
11   struct node *next;
12 } node_t;
13
14 int hash_search(struct hash_table *h, long key)
15 {
16   struct node *cur;
17   int retval;
18
19   spin_lock(&hash_lock);
20   cur = h->buckets[key % h->nbuckets];
21   while (cur != NULL) {
22     if (cur->key >= key) {
23       retval = (cur->key == key);
24       spin_unlock(&hash_lock);
25       return retval;
26     }
27     cur = cur->next;
28   }
29   spin_unlock(&hash_lock);
30   return 0;
31 }

[Figure 6.12: Ethernet Bandwidth vs. Intel x86 CPU Performance. Y-axis: Relative Performance (0.1 to 1x10^6, log scale); X-axis: Year (1970-2020); traces: Ethernet, x86 CPUs.]

Therefore, you should use code locking on programs that spend only a small fraction of their execution time in critical sections or from which only modest scaling is required. In addition, programs that primarily use the more scalable approaches described in later sections often use code locking to handle rare error cases or significant state transitions. In these cases, code locking will provide a relatively simple program that is very similar to its sequential counterpart, as can be seen in Listing 6.5. However, note that the simple return of the comparison in hash_search() in Listing 6.4 has now become three statements due to the need to release the lock before returning.

Note that the hash_lock acquisition and release statements on lines 19, 24, and 29 are mediating ownership of the hash table among the CPUs wishing to concurrently access that hash table. Another way of looking at this is that hash_lock is partitioning time, thus giving each requesting CPU its own partition of time during which it owns this hash table. In addition, in a well-designed algorithm, there should be ample partitions of time during which no CPU owns this hash table.


Quick Quiz 6.16: "Partitioning time"? Isn't that an odd turn of phrase?

Unfortunately, code locking is particularly prone to "lock contention", where multiple CPUs need to acquire the lock concurrently. SMP programmers who have taken care of groups of small children (or groups of older people who are acting like children) will immediately recognize the danger of having only one of something, as illustrated in Figure 6.13.

[Figure 6.13: Lock Contention]

One solution to this problem, named "data locking", is described in the next section.

6.3.3 Data Locking

Many data structures may be partitioned, with each partition of the data structure having its own lock. Then the critical sections for each part of the data structure can execute in parallel, although only one instance of the critical section for a given part could be executing at a given time. You should use data locking when contention must be reduced, and where synchronization overhead is not limiting speedups. Data locking reduces contention by distributing the instances of the overly-large critical section across multiple data structures, for example, maintaining per-hash-bucket critical sections in a hash table, as shown in Listing 6.6. The increased scalability again results in a slight increase in complexity in the form of an additional data structure, the struct bucket.

Listing 6.6: Data-Locking Hash Table Search
 1 struct hash_table
 2 {
 3   long nbuckets;
 4   struct bucket **buckets;
 5 };
 6
 7 struct bucket {
 8   spinlock_t bucket_lock;
 9   node_t *list_head;
10 };
11
12 typedef struct node {
13   unsigned long key;
14   struct node *next;
15 } node_t;
16
17 int hash_search(struct hash_table *h, long key)
18 {
19   struct bucket *bp;
20   struct node *cur;
21   int retval;
22
23   bp = h->buckets[key % h->nbuckets];
24   spin_lock(&bp->bucket_lock);
25   cur = bp->list_head;
26   while (cur != NULL) {
27     if (cur->key >= key) {
28       retval = (cur->key == key);
29       spin_unlock(&bp->bucket_lock);
30       return retval;
31     }
32     cur = cur->next;
33   }
34   spin_unlock(&bp->bucket_lock);
35   return 0;
36 }

In contrast with the contentious situation shown in Figure 6.13, data locking helps promote harmony, as illustrated by Figure 6.14—and in parallel programs, this almost always translates into increased performance and scalability. For this reason, data locking was heavily used by Sequent in its kernels [BK85, Inm85, Gar90, Dov90, MD92, MG92, MS93].

Another way of looking at this is to think of each ->bucket_lock as mediating ownership not of the entire hash table as was done for code locking, but only for the bucket corresponding to that ->bucket_lock. Each lock still partitions time, but the per-bucket-locking technique also partitions the address space, so that the overall technique can be said to partition spacetime. If the number of buckets is large enough, this partitioning of space should with high probability permit a given CPU immediate access to a given hash bucket.


such as hash tables, as well as in situations where multiple


entities are each represented by an instance of a given
data structure. The Linux-kernel task list is an example of
the latter, each task structure having its own alloc_lock
and pi_lock.
A key challenge with data locking on dynamically
allocated structures is ensuring that the structure remains
in existence while the lock is being acquired [GKAS99].
toy
The code in Listing 6.6 finesses this challenge by placing
the locks in the statically allocated hash buckets, which
yot are never freed. However, this trick would not work if
the hash table were resizeable, so that the locks were now
dynamically allocated. In this case, there would need to
toy be some means to prevent the hash bucket from being
freed during the time that its lock was being acquired.
Quick Quiz 6.17: What are some ways of preventing a
Figure 6.14: Data Locking structure from being freed while its lock is being acquired?
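One such means is a per-bucket reference count that pins the bucket while its lock is being acquired. The sketch below is illustrative only: the table_lock, bucket_get(), and bucket_put() names are hypothetical additions rather than part of the listings above, and the table-level lock used to take the reference would itself limit scalability. More scalable existence guarantees, such as hazard pointers and RCU, are covered in Chapter 9.

#include <pthread.h>
#include <stdlib.h>

struct node;   /* Element type, as in Listing 6.6. */

struct bucket {
	int refcnt;                  /* Protected by ->table_lock. */
	pthread_mutex_t bucket_lock;
	struct node *list_head;
};

struct hash_table {
	pthread_mutex_t table_lock;  /* Protects ->buckets[] and ->refcnt. */
	long nbuckets;
	struct bucket **buckets;
};

/* Pin the bucket for "key", then acquire its lock. */
static struct bucket *bucket_get(struct hash_table *h, long key)
{
	struct bucket *bp;

	pthread_mutex_lock(&h->table_lock);    /* Bucket cannot be freed here... */
	bp = h->buckets[key % h->nbuckets];
	bp->refcnt++;                          /* ...and this pin keeps it alive... */
	pthread_mutex_unlock(&h->table_lock);
	pthread_mutex_lock(&bp->bucket_lock);  /* ...while its lock is acquired. */
	return bp;
}

/* Release the bucket's lock and drop the pin, freeing on last use. */
static void bucket_put(struct hash_table *h, struct bucket *bp)
{
	int last;

	pthread_mutex_unlock(&bp->bucket_lock);
	pthread_mutex_lock(&h->table_lock);
	last = --bp->refcnt == 0;              /* A resizer drops its own reference. */
	pthread_mutex_unlock(&h->table_lock);
	if (last)
		free(bp);
}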

6.3.4 Data Ownership

Data ownership partitions a given data structure over the threads or CPUs, so that each thread/CPU accesses its subset of the data structure without any synchronization overhead whatsoever. However, if one thread wishes to access some other thread's data, the first thread is unable to do so directly. Instead, the first thread must communicate with the second thread, so that the second thread performs the operation on behalf of the first, or, alternatively, migrates the data to the first thread.

Data ownership might seem arcane, but it is used very frequently:

1. Any variables accessible by only one CPU or thread (such as auto variables in C and C++) are owned by that CPU or process.

2. An instance of a user interface owns the corresponding user's context. It is very common for applications interacting with parallel database engines to be written as if they were entirely sequential programs. Such applications own the user interface and the user's current action. Explicit parallelism is thus confined to the database engine itself.

3. Parametric simulations are often trivially parallelized by granting each thread ownership of a particular region of the parameter space. There are also computing frameworks designed for this type of problem [Uni08a]. A minimal sketch of this style of ownership appears just after this list.
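As a concrete illustration of item 3, the sketch below partitions an array across threads, each of which owns, and therefore updates without any synchronization, its own contiguous slice. The per-row computation, sizes, and thread count are hypothetical stand-ins; pthread_join() provides the only synchronization.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NROWS 1024

static double results[NROWS];

struct slice {
	int first;
	int last;    /* Exclusive. */
};

static void *worker(void *arg)
{
	struct slice *s = arg;

	/* Hypothetical per-row computation standing in for a simulation. */
	for (int i = s->first; i < s->last; i++)
		results[i] = (double)i * i;
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];
	struct slice s[NTHREADS];

	for (int t = 0; t < NTHREADS; t++) {
		s[t].first = t * NROWS / NTHREADS;       /* Rows owned by thread t... */
		s[t].last = (t + 1) * NROWS / NTHREADS;  /* ...and by no other thread. */
		pthread_create(&tid[t], NULL, worker, &s[t]);
	}
	for (int t = 0; t < NTHREADS; t++)
		pthread_join(tid[t], NULL);              /* The only synchronization. */
	printf("results[%d] = %g\n", NROWS - 1, results[NROWS - 1]);
	return 0;
}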


If there is significant sharing, communication between the threads or CPUs can result in significant complexity and overhead. Furthermore, if the most-heavily used data happens to be that owned by a single CPU, that CPU will be a "hot spot", sometimes with results resembling that shown in Figure 6.15. However, in situations where no sharing is required, data ownership achieves ideal performance, and with code that can be as simple as the sequential-program case shown in Listing 6.4. Such situations are often referred to as "embarrassingly parallel", and, in the best case, resemble the situation previously shown in Figure 6.14.

Another important instance of data ownership occurs when the data is read-only, in which case, all threads can "own" it via replication.

Where data locking partitions both the address space (with one hash bucket per partition) and time (using per-bucket locks), data ownership partitions only the address space. The reason that data ownership need not partition time is because a given thread or CPU is assigned permanent ownership of a given address-space partition.

Quick Quiz 6.18: But won't system boot and shutdown (or application startup and shutdown) be partitioning time, even for data ownership?

Data ownership will be presented in more detail in Chapter 8.

6.3.5 Locking Granularity and Performance

This section looks at locking granularity and performance from a mathematical synchronization-efficiency viewpoint. Readers who are uninspired by mathematics might choose to skip this section.

The approach is to use a crude queueing model for the efficiency of synchronization mechanisms that operate on a single shared global variable, based on an M/M/1 queue. M/M/1 queueing models are based on an exponentially distributed "inter-arrival rate" 𝜆 and an exponentially distributed "service rate" 𝜇. The inter-arrival rate 𝜆 can be thought of as the average number of synchronization operations per second that the system would process if the synchronization were free, in other words, 𝜆 is an inverse measure of the overhead of each non-synchronization unit of work. For example, if each unit of work was a transaction, and if each transaction took one millisecond to process, excluding synchronization overhead, then 𝜆 would be 1,000 transactions per second.

The service rate 𝜇 is defined similarly, but for the average number of synchronization operations per second that the system would process if the overhead of each transaction was zero, and ignoring the fact that CPUs must wait on each other to complete their synchronization operations. In other words, 𝜇 can be roughly thought of as an inverse measure of the synchronization overhead in the absence of contention. For example, suppose that each transaction's synchronization operation involves an atomic increment instruction, and that a computer system is able to do a private-variable atomic increment every 5 nanoseconds on each CPU (see Figure 5.1).13 The value of 𝜇 is therefore about 200,000,000 atomic increments per second.

13 Of course, if there are 8 CPUs all incrementing the same shared variable, then each CPU must wait at least 35 nanoseconds for each of the other CPUs to do its increment before consuming an additional 5 nanoseconds doing its own increment. In fact, the wait will be longer due to the need to move the variable from one CPU to another.

Of course, the value of 𝜆 increases as increasing numbers of CPUs increment a shared variable because each CPU is capable of processing transactions independently (again, ignoring synchronization):

    𝜆 = 𝑛𝜆₀    (6.1)

Here, 𝑛 is the number of CPUs and 𝜆₀ is the transaction-processing capability of a single CPU. Note that the expected time for a single CPU to execute a single transaction in the absence of contention is 1/𝜆₀.

Because the CPUs have to "wait in line" behind each other to get their chance to increment the single shared variable, we can use the M/M/1 queueing-model expression for the expected total waiting time:

    𝑇 = 1 / (𝜇 − 𝜆)    (6.2)

Substituting the above value of 𝜆:

    𝑇 = 1 / (𝜇 − 𝑛𝜆₀)    (6.3)

Now, the efficiency is just the ratio of the time required to process a transaction in absence of synchronization (1/𝜆₀) to the time required including synchronization (𝑇 + 1/𝜆₀):

    𝑒 = (1/𝜆₀) / (𝑇 + 1/𝜆₀)    (6.4)

Substituting the above value for 𝑇 and simplifying:

    𝑒 = (𝜇/𝜆₀ − 𝑛) / (𝜇/𝜆₀ − (𝑛 − 1))    (6.5)

[Figure 6.16: Synchronization Efficiency. Y-axis: Synchronization Efficiency (0.1 to 1); X-axis: Number of CPUs (Threads), 10 to 100; one trace for each overhead ratio f = 10, 25, 50, 75, and 100.]

[Figure 6.17: Matrix Multiply Efficiency. Y-axis: Matrix Multiply Efficiency (0 to 1.2); X-axis: Number of CPUs (Threads), 1 to 100; one trace for each of the 64, 128, 256, 512, and 1024 matrix sizes.]

But the value of 𝜇/𝜆₀ is just the ratio of the time required to process the transaction (absent synchronization overhead) to that of the synchronization overhead itself (absent contention). If we call this ratio 𝑓, we have:

    𝑒 = (𝑓 − 𝑛) / (𝑓 − (𝑛 − 1))    (6.6)

Figure 6.16 plots the synchronization efficiency 𝑒 as a function of the number of CPUs/threads 𝑛 for a few values of the overhead ratio 𝑓. For example, again using the 5-nanosecond atomic increment, the 𝑓 = 10 line corresponds to each CPU attempting an atomic increment every 50 nanoseconds, and the 𝑓 = 100 line corresponds to each CPU attempting an atomic increment every 500 nanoseconds, which in turn corresponds to some hundreds (perhaps thousands) of instructions. Given that each trace drops off sharply with increasing numbers of CPUs or threads, we can conclude that synchronization mechanisms based on atomic manipulation of a single global shared variable will not scale well if used heavily on current commodity hardware. This is an abstract mathematical depiction of the forces leading to the parallel counting algorithms that were discussed in Chapter 5. Your real-world mileage may differ.
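Equation 6.6 is simple enough to evaluate directly. The short program below was written for this discussion (it is not one of the book's code samples); it tabulates the efficiency for the same overhead ratios plotted in Figure 6.16, printing a dash where the M/M/1 model no longer applies, that is, where 𝑛 reaches or exceeds 𝑓.

#include <stdio.h>

/* Equation 6.6: e = (f - n) / (f - (n - 1)), valid only while n < f. */
static double efficiency(double f, double n)
{
	return (f - n) / (f - (n - 1.0));
}

int main(void)
{
	double fvals[] = { 10.0, 25.0, 50.0, 75.0, 100.0 };
	int nvals[] = { 1, 2, 5, 10, 25, 50, 75, 100 };

	printf("%6s", "n");
	for (int i = 0; i < 5; i++)
		printf("  f=%-6.0f", fvals[i]);
	printf("\n");
	for (int j = 0; j < 8; j++) {
		printf("%6d", nvals[j]);
		for (int i = 0; i < 5; i++) {
			if (nvals[j] < fvals[i])
				printf("  %8.3f", efficiency(fvals[i], nvals[j]));
			else
				printf("  %8s", "-");   /* M/M/1 breaks down: n*lambda0 >= mu. */
		}
		printf("\n");
	}
	return 0;
}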
Nevertheless, the concept of efficiency is useful even in cases having little or no formal synchronization. Consider for example a matrix multiply, in which the columns of one matrix are multiplied (via "dot product") by the rows of another, resulting in an entry in a third matrix. Because none of these operations conflict, it is possible to partition the columns of the first matrix among a group of threads, with each thread computing the corresponding columns of the result matrix. The threads can therefore operate entirely independently, with no synchronization overhead whatsoever, as is done in matmul.c. One might therefore expect a perfect efficiency of 1.0.
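The matmul.c just mentioned is the code actually behind Figure 6.17; the independent sketch below merely illustrates the column-partitioning idea, with arbitrary matrix size and thread count, and with each thread owning a block of result columns so that no locking is required.

#include <pthread.h>

#define N 256
#define NTHREADS 4

static double a[N][N], b[N][N], c[N][N];

struct colrange {
	int firstcol;
	int lastcol;   /* Exclusive. */
};

static void *matmul_worker(void *arg)
{
	struct colrange *r = arg;

	for (int j = r->firstcol; j < r->lastcol; j++)   /* Columns owned by this thread. */
		for (int i = 0; i < N; i++) {
			double sum = 0.0;

			for (int k = 0; k < N; k++)
				sum += a[i][k] * b[k][j];
			c[i][j] = sum;                   /* No other thread writes column j. */
		}
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];
	struct colrange r[NTHREADS];

	for (int i = 0; i < N; i++)
		for (int j = 0; j < N; j++) {
			a[i][j] = 1.0;
			b[i][j] = 2.0;
		}
	for (int t = 0; t < NTHREADS; t++) {
		r[t].firstcol = t * N / NTHREADS;
		r[t].lastcol = (t + 1) * N / NTHREADS;
		pthread_create(&tid[t], NULL, matmul_worker, &r[t]);
	}
	for (int t = 0; t < NTHREADS; t++)
		pthread_join(tid[t], NULL);
	return c[0][0] == 2.0 * N ? 0 : 1;   /* Trivial sanity check. */
}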
However, Figure 6.17 tells a different story, especially for a 64-by-64 matrix multiply, which never gets above an efficiency of about 0.3, even when running single-threaded, and drops sharply as more threads are added.14 The 128-by-128 matrix does better, but still fails to demonstrate much performance increase with added threads. The 256-by-256 matrix does scale reasonably well, but only up to a handful of CPUs. The 512-by-512 matrix multiply's efficiency is measurably less than 1.0 on as few as 10 threads, and even the 1024-by-1024 matrix multiply deviates noticeably from perfection at a few tens of threads. Nevertheless, this figure clearly demonstrates the performance and scalability benefits of batching: If you must incur synchronization overhead, you may as well get your money's worth, which is the solution to the problem of deciding on granularity of synchronization put forth on page 73.

14 In contrast to the smooth traces of Figure 6.16, the wide error bars and jagged traces of Figure 6.17 give evidence of its real-world nature.

Quick Quiz 6.19: How can a single-threaded 64-by-64 matrix multiply possibly have an efficiency of less than 1.0? Shouldn't all of the traces in Figure 6.17 have efficiency of exactly 1.0 when running on one thread?

Given these inefficiencies, it is worthwhile to look into more-scalable approaches such as the data locking described in Section 6.3.3 or the parallel-fastpath approach discussed in the next section.


Quick Quiz 6.20: How are data-parallel techniques going to help with matrix multiply? It is already data parallel!!!

Quick Quiz 6.21: What did you do to validate this matrix multiply algorithm?

6.4 Parallel Fastpath

  There are two ways of meeting difficulties: You alter the difficulties, or you alter yourself to meet them.

  Phyllis Bottome

Fine-grained (and therefore usually higher-performance) designs are typically more complex than are coarser-grained designs. In many cases, most of the overhead is incurred by a small fraction of the code [Knu73]. So why not focus effort on that small fraction?

This is the idea behind the parallel-fastpath design pattern, to aggressively parallelize the common-case code path without incurring the complexity that would be required to aggressively parallelize the entire algorithm. You must understand not only the specific algorithm you wish to parallelize, but also the workload that the algorithm will be subjected to. Great creativity and design effort are often required to construct a parallel fastpath.

Parallel fastpath combines different patterns (one for the fastpath, one elsewhere) and is therefore a template pattern. The following instances of parallel fastpath occur often enough to warrant their own patterns, as depicted in Figure 6.18:

[Figure 6.18: Parallel-Fastpath Design Patterns. Parallel Fastpath feeds into Reader/Writer Locking, RCU, Hierarchical Locking, and Allocator Caches.]

1. Reader/Writer Locking (described below in Section 6.4.1).

2. Read-copy update (RCU), which may be used as a high-performance replacement for reader/writer locking, is introduced in Section 9.5. Other alternatives include hazard pointers (Section 9.3) and sequence locking (Section 9.4). These alternatives will not be discussed further in this chapter.

3. Hierarchical Locking ([McK96a]), which is touched upon in Section 6.4.2.

4. Resource Allocator Caches ([McK96a, MS93]). See Section 6.4.3 for more detail.

6.4.1 Reader/Writer Locking

If synchronization overhead is negligible (for example, if the program uses coarse-grained parallelism with large critical sections), and if only a small fraction of the critical sections modify data, then allowing multiple readers to proceed in parallel can greatly increase scalability. Writers exclude both readers and each other. There are many implementations of reader-writer locking, including the POSIX implementation described in Section 4.2.4. Listing 6.7 shows how the hash search might be implemented using reader-writer locking.

Listing 6.7: Reader-Writer-Locking Hash Table Search
 1 rwlock_t hash_lock;
 2
 3 struct hash_table
 4 {
 5   long nbuckets;
 6   struct node **buckets;
 7 };
 8
 9 typedef struct node {
10   unsigned long key;
11   struct node *next;
12 } node_t;
13
14 int hash_search(struct hash_table *h, long key)
15 {
16   struct node *cur;
17   int retval;
18
19   read_lock(&hash_lock);
20   cur = h->buckets[key % h->nbuckets];
21   while (cur != NULL) {
22     if (cur->key >= key) {
23       retval = (cur->key == key);
24       read_unlock(&hash_lock);
25       return retval;
26     }
27     cur = cur->next;
28   }
29   read_unlock(&hash_lock);
30   return 0;
31 }

Reader/writer locking is a simple instance of asymmetric locking. Snaman [ST87] describes a more ornate six-mode asymmetric locking design used in several clustered systems. Locking in general and reader-writer locking in particular is described extensively in Chapter 7.
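Listing 6.7 is written in terms of rwlock_t and read_lock()/read_unlock(). For comparison, here is a sketch of the same search using the POSIX reader-writer lock mentioned above; it assumes the hash_table and node_t definitions of Listing 6.7.

#include <pthread.h>

pthread_rwlock_t hash_lock = PTHREAD_RWLOCK_INITIALIZER;

int hash_search(struct hash_table *h, long key)
{
	struct node *cur;
	int retval;

	pthread_rwlock_rdlock(&hash_lock);      /* Readers may proceed in parallel. */
	cur = h->buckets[key % h->nbuckets];
	while (cur != NULL) {
		if (cur->key >= key) {
			retval = (cur->key == key);
			pthread_rwlock_unlock(&hash_lock);
			return retval;
		}
		cur = cur->next;
	}
	pthread_rwlock_unlock(&hash_lock);
	return 0;
}

/* Updaters would instead call pthread_rwlock_wrlock(&hash_lock), which
 * excludes both readers and other writers. */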
6.4.2 Hierarchical Locking

The idea behind hierarchical locking is to have a coarse-grained lock that is held only long enough to work out which fine-grained lock to acquire. Listing 6.8 shows how our hash-table search might be adapted to do hierarchical locking, but also shows the great weakness of this approach: We have paid the overhead of acquiring a second lock, but we only hold it for a short time. In this case, the data-locking approach would be simpler and likely perform better.

Listing 6.8: Hierarchical-Locking Hash Table Search
 1 struct hash_table
 2 {
 3   long nbuckets;
 4   struct bucket **buckets;
 5 };
 6
 7 struct bucket {
 8   spinlock_t bucket_lock;
 9   node_t *list_head;
10 };
11
12 typedef struct node {
13   spinlock_t node_lock;
14   unsigned long key;
15   struct node *next;
16 } node_t;
17
18 int hash_search(struct hash_table *h, long key)
19 {
20   struct bucket *bp;
21   struct node *cur;
22   int retval;
23
24   bp = h->buckets[key % h->nbuckets];
25   spin_lock(&bp->bucket_lock);
26   cur = bp->list_head;
27   while (cur != NULL) {
28     if (cur->key >= key) {
29       spin_lock(&cur->node_lock);
30       spin_unlock(&bp->bucket_lock);
31       retval = (cur->key == key);
32       spin_unlock(&cur->node_lock);
33       return retval;
34     }
35     cur = cur->next;
36   }
37   spin_unlock(&bp->bucket_lock);
38   return 0;
39 }

Quick Quiz 6.22: In what situation would hierarchical locking work well?


6.4.3 Resource Allocator Caches

This section presents a simplified schematic of a parallel fixed-block-size memory allocator. More detailed descriptions may be found in the literature [MG92, MS93, BA01, MSK01, Eva11, Ken20] or in the Linux kernel [Tor03].

6.4.3.1 Parallel Resource Allocation Problem

The basic problem facing a parallel memory allocator is the tension between the need to provide extremely fast memory allocation and freeing in the common case and the need to efficiently distribute memory in the face of unfavorable allocation and freeing patterns.

To see this tension, consider a straightforward application of data ownership to this problem—simply carve up memory so that each CPU owns its share. For example, suppose that a system with 12 CPUs has 64 gigabytes of memory, for example, the laptop I am using right now. We could simply assign each CPU a five-gigabyte region of memory, and allow each CPU to allocate from its own region, without the need for locking and its complexities and overheads. Unfortunately, this scheme fails when CPU 0 only allocates memory and CPU 1 only frees it, as happens in simple producer-consumer workloads.

The other extreme, code locking, suffers from excessive lock contention and overhead [MS93].

6.4.3.2 Parallel Fastpath for Resource Allocation

The commonly used solution uses parallel fastpath with each CPU owning a modest cache of blocks, and with a large code-locked shared pool for additional blocks. To prevent any given CPU from monopolizing the memory blocks, we place a limit on the number of blocks that can be in each CPU's cache. In a two-CPU system, the flow of memory blocks will be as shown in Figure 6.19: When a given CPU tries to free a block while its pool is full, it sends blocks to the global pool, and, similarly, when that CPU tries to allocate a block while its pool is empty, it retrieves blocks from the global pool.

[Figure 6.19: Allocator Cache Schematic. A code-locked Global Pool exchanges blocks with the CPU 0 Pool and CPU 1 Pool (each owned by its CPU) via Overflow and Empty transfers; application Allocate/Free requests are served from the per-CPU pools.]


6.4.3.3 Data Structures

The actual data structures for a "toy" implementation of allocator caches are shown in Listing 6.9 ("smpalloc.c"). The "Global Pool" of Figure 6.19 is implemented by globalmem of type struct globalmempool, and the two CPU pools by the per-thread variable perthreadmem of type struct perthreadmempool. Both of these data structures have arrays of pointers to blocks in their pool fields, which are filled from index zero upwards. Thus, if globalmem.pool[3] is NULL, then the remainder of the array from index 4 up must also be NULL. The cur fields contain the index of the highest-numbered full element of the pool array, or −1 if all elements are empty. All elements from globalmem.pool[0] through globalmem.pool[globalmem.cur] must be full, and all the rest must be empty.15

15 Both pool sizes (TARGET_POOL_SIZE and GLOBAL_POOL_SIZE) are unrealistically small, but this small size makes it easier to single-step the program in order to get a feel for its operation.

Listing 6.9: Allocator-Cache Data Structures
 1 #define TARGET_POOL_SIZE 3
 2 #define GLOBAL_POOL_SIZE 40
 3
 4 struct globalmempool {
 5   spinlock_t mutex;
 6   int cur;
 7   struct memblock *pool[GLOBAL_POOL_SIZE];
 8 } globalmem;
 9
10 struct perthreadmempool {
11   int cur;
12   struct memblock *pool[2 * TARGET_POOL_SIZE];
13 };
14
15 DEFINE_PER_THREAD(struct perthreadmempool, perthreadmem);

The operation of the pool data structures is illustrated by Figure 6.20, with the six boxes representing the array of pointers making up the pool field, and the number preceding them representing the cur field. The shaded boxes represent non-NULL pointers, while the empty boxes represent NULL pointers. An important, though potentially confusing, invariant of this data structure is that the cur field is always one smaller than the number of non-NULL pointers.

[Figure 6.20: Allocator Pool Schematic. Example six-box pool arrays shown with their cur values; shaded boxes are non-NULL pointers and empty boxes are NULL pointers.]
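This invariant is easy to check in a debug build. The helper below is not part of smpalloc.c; it is a sketch that assumes the data structures of Listing 6.9 and that the caller holds the relevant pool's lock (or is single-threaded).

#include <assert.h>
#include <stddef.h>

struct memblock;

/* Verify the pool invariant described above: entries 0 through cur are
 * non-NULL, all later entries are NULL, and cur is one less than the
 * number of non-NULL pointers. */
static void pool_check(struct memblock **pool, int npool, int cur)
{
	int nonnull = 0;

	assert(cur >= -1 && cur < npool);
	for (int i = 0; i < npool; i++) {
		if (pool[i] != NULL)
			nonnull++;
		if (i <= cur)
			assert(pool[i] != NULL);   /* Filled from index zero upwards... */
		else
			assert(pool[i] == NULL);   /* ...with nothing beyond cur. */
	}
	assert(nonnull == cur + 1);                /* cur is one less than the count. */
}

/* Example use: pool_check(globalmem.pool, GLOBAL_POOL_SIZE, globalmem.cur)
 * while holding globalmem.mutex. */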
6.4.3.4 Allocation Function

The allocation function memblock_alloc() may be seen in Listing 6.10. Line 7 picks up the current thread's per-thread pool, and line 8 checks to see if it is empty.

If so, lines 9–16 attempt to refill it from the global pool under the spinlock acquired on line 9 and released on line 16. Lines 10–14 move blocks from the global to the per-thread pool until either the local pool reaches its target size (half full) or the global pool is exhausted, and line 15 sets the per-thread pool's count to the proper value.

In either case, line 18 checks for the per-thread pool still being empty, and if not, lines 19–21 remove a block and return it. Otherwise, line 23 tells the sad tale of memory exhaustion.

Listing 6.10: Allocator-Cache Allocator Function
 1 struct memblock *memblock_alloc(void)
 2 {
 3   int i;
 4   struct memblock *p;
 5   struct perthreadmempool *pcpp;
 6
 7   pcpp = &__get_thread_var(perthreadmem);
 8   if (pcpp->cur < 0) {
 9     spin_lock(&globalmem.mutex);
10     for (i = 0; i < TARGET_POOL_SIZE &&
11          globalmem.cur >= 0; i++) {
12       pcpp->pool[i] = globalmem.pool[globalmem.cur];
13       globalmem.pool[globalmem.cur--] = NULL;
14     }
15     pcpp->cur = i - 1;
16     spin_unlock(&globalmem.mutex);
17   }
18   if (pcpp->cur >= 0) {
19     p = pcpp->pool[pcpp->cur];
20     pcpp->pool[pcpp->cur--] = NULL;
21     return p;
22   }
23   return NULL;
24 }

6.4.3.5 Free Function

Listing 6.11 shows the memory-block free function. Line 6 gets a pointer to this thread's pool, and line 7 checks to see if this per-thread pool is full.

If so, lines 8–15 empty half of the per-thread pool into the global pool, with lines 8 and 14 acquiring and releasing the spinlock. Lines 9–12 implement the loop moving blocks from the local to the global pool, and line 13 sets the per-thread pool's count to the proper value.

In either case, line 16 then places the newly freed block into the per-thread pool.

Listing 6.11: Allocator-Cache Free Function
 1 void memblock_free(struct memblock *p)
 2 {
 3   int i;
 4   struct perthreadmempool *pcpp;
 5
 6   pcpp = &__get_thread_var(perthreadmem);
 7   if (pcpp->cur >= 2 * TARGET_POOL_SIZE - 1) {
 8     spin_lock(&globalmem.mutex);
 9     for (i = pcpp->cur; i >= TARGET_POOL_SIZE; i--) {
10       globalmem.pool[++globalmem.cur] = pcpp->pool[i];
11       pcpp->pool[i] = NULL;
12     }
13     pcpp->cur = i;
14     spin_unlock(&globalmem.mutex);
15   }
16   pcpp->pool[++pcpp->cur] = p;
17 }


Quick Quiz 6.23: Doesn't this resource-allocator design resemble that of the approximate limit counters covered in Section 5.3?

6.4.3.6 Performance

Rough performance results16 are shown in Figure 6.21, running on a dual-core Intel x86 running at 1 GHz (4300 bogomips per CPU) with at most six blocks allowed in each CPU's cache. In this micro-benchmark, each thread repeatedly allocates a group of blocks and then frees all the blocks in that group, with the number of blocks in the group being the "allocation run length" displayed on the x-axis. The y-axis shows the number of successful allocation/free pairs per microsecond—failed allocations are not counted. The "X"s are from a two-thread run, while the "+"s are from a single-threaded run.

16 This data was not collected in a statistically meaningful way, and therefore should be viewed with great skepticism and suspicion. Good data-collection and -reduction practice is discussed in Chapter 11. That said, repeated runs gave similar results, and these results match more careful evaluations of similar algorithms.

[Figure 6.21: Allocator Cache Performance. Y-axis: Allocations/Frees Per Microsecond (0 to 30); X-axis: Allocation Run Length (0 to 25).]

Note that run lengths up to six scale linearly and give excellent performance, while run lengths greater than six show poor performance and almost always also show negative scaling. It is therefore quite important to size TARGET_POOL_SIZE sufficiently large, which fortunately is usually quite easy to do in actual practice [MSK01], especially given today's large memories. For example, in most systems, it is quite reasonable to set TARGET_POOL_SIZE to 100, in which case allocations and frees are guaranteed to be confined to per-thread pools at least 99 % of the time.


As can be seen from the figure, the situations where the common-case data-ownership applies (run lengths up to six) provide greatly improved performance compared to the cases where locks must be acquired. Avoiding synchronization in the common case will be a recurring theme through this book.

Quick Quiz 6.24: In Figure 6.21, there is a pattern of performance rising with increasing run length in groups of three samples, for example, for run lengths 10, 11, and 12. Why?

Quick Quiz 6.25: Allocation failures were observed in the two-thread tests at run lengths of 19 and greater. Given the global-pool size of 40 and the per-thread target pool size 𝑠 of three, number of threads 𝑛 equal to two, and assuming that the per-thread pools are initially empty with none of the memory in use, what is the smallest allocation run length 𝑚 at which failures can occur? (Recall that each thread repeatedly allocates 𝑚 blocks of memory, and then frees the 𝑚 blocks of memory.) Alternatively, given 𝑛 threads each with pool size 𝑠, and where each thread repeatedly first allocates 𝑚 blocks of memory and then frees those 𝑚 blocks, how large must the global pool size be? Note: Obtaining the correct answer will require you to examine the smpalloc.c source code, and very likely single-step it as well. You have been warned!

6.4.3.7 Validation

Validation of this simple allocator spawns a specified number of threads, with each thread repeatedly allocating a specified number of memory blocks and then deallocating them. This simple regimen suffices to exercise both the per-thread caches and the global pool, as can be seen in Figure 6.21.

Much more aggressive validation is required for memory allocators that are to be used in production. The test suites for tcmalloc [Ken20] and jemalloc [Eva11] are instructive, as are the tests for the Linux kernel's memory allocator.
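A minimal harness implementing this regimen might look as follows. This sketch is not the actual test code (which uses the book's own thread-creation and per-thread-variable primitives); it merely declares memblock_alloc() and memblock_free() as in Listings 6.10 and 6.11, and the thread count, run length, and iteration count are arbitrary.

#include <pthread.h>
#include <stdio.h>

struct memblock;
struct memblock *memblock_alloc(void);   /* Listing 6.10. */
void memblock_free(struct memblock *p);  /* Listing 6.11. */

#define NTHREADS 2
#define NALLOC 13        /* The "allocation run length". */
#define NITER 100000

static void *stress(void *arg)
{
	struct memblock *blocks[NALLOC];

	(void)arg;
	for (long iter = 0; iter < NITER; iter++) {
		int n = 0;

		/* Allocate a group of blocks; failure simply shortens the group. */
		while (n < NALLOC && (blocks[n] = memblock_alloc()) != NULL)
			n++;
		/* Then free every block actually obtained, so nothing leaks. */
		for (int i = 0; i < n; i++)
			memblock_free(blocks[i]);
	}
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];

	for (int t = 0; t < NTHREADS; t++)
		pthread_create(&tid[t], NULL, stress, NULL);
	for (int t = 0; t < NTHREADS; t++)
		pthread_join(tid[t], NULL);
	printf("allocator stress test completed\n");
	return 0;
}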
6.4.3.8 Real-World Design

The toy parallel resource allocator was quite simple, but real-world designs expand on this approach in a number of ways.

First, real-world allocators are required to handle a wide range of allocation sizes, as opposed to the single size shown in this toy example. One popular way to do this is to offer a fixed set of sizes, spaced so as to balance external and internal fragmentation, such as in the late-1980s BSD memory allocator [MK88]. Doing this would mean that the "globalmem" variable would need to be replicated on a per-size basis, and that the associated lock would similarly be replicated, resulting in data locking rather than the toy program's code locking.
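A sketch of that per-size replication appears below. The size classes and the size_to_class() helper are hypothetical, and struct globalmempool is assumed to be defined as in Listing 6.9 (minus its single global instance); the point is simply that each size class gets its own pool, and therefore its own lock, yielding data locking.

#define NSIZES 8

static const int blocksize[NSIZES] = {
	16, 32, 64, 128, 256, 512, 1024, 2048    /* Hypothetical size classes. */
};

/* One pool, and thus one lock, per size class: data locking. */
struct globalmempool globalmem[NSIZES];

/* Map an allocation request to the smallest size class that fits. */
static int size_to_class(int size)
{
	for (int i = 0; i < NSIZES; i++)
		if (size <= blocksize[i])
			return i;
	return -1;   /* Too large: defer to the underlying memory system. */
}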
Second, production-quality systems must be able to repurpose memory, meaning that they must be able to coalesce blocks into larger structures, such as pages [MS93]. This coalescing will also need to be protected by a lock, which again could be replicated on a per-size basis.

Third, coalesced memory must be returned to the underlying memory system, and pages of memory must also be allocated from the underlying memory system. The locking required at this level will depend on that of the underlying memory system, but could well be code locking. Code locking can often be tolerated at this level, because this level is so infrequently reached in well-designed systems [MSK01].

Concurrent userspace allocators face similar challenges [Ken20, Eva11].

Despite this real-world design's greater complexity, the underlying idea is the same—repeated application of parallel fastpath, as shown in Table 6.1.

Table 6.1: Schematic of Real-World Parallel Allocator

  Level               Locking          Purpose
  Per-thread pool     Data ownership   High-speed allocation
  Global block pool   Data locking     Distributing blocks among threads
  Coalescing          Data locking     Combining blocks into pages
  System memory       Code locking     Memory from/to system

And "parallel fastpath" is one of the solutions to the non-partitionable application problem put forth on page 73.


6.5 Beyond Partitioning

  It is all right to aim high if you have plenty of ammunition.

  Hawley R. Everhart

This chapter has discussed how data partitioning can be used to design simple linearly scalable parallel programs. Section 6.3.4 hinted at the possibilities of data replication, which will be used to great effect in Section 9.5.

The main goal of applying partitioning and replication is to achieve linear speedups, in other words, to ensure that the total amount of work required does not increase significantly as the number of CPUs or threads increases. A problem that can be solved via partitioning and/or replication, resulting in linear speedups, is embarrassingly parallel. But can we do better?

To answer this question, let us examine the solution of labyrinths and mazes. Of course, labyrinths and mazes have been objects of fascination for millennia [Wik12], so it should come as no surprise that they are generated and solved using computers, including biological computers [Ada11], GPGPUs [Eri08], and even discrete hardware [KFC11]. Parallel solution of mazes is sometimes used as a class project in universities [ETH11, Uni10] and as a vehicle to demonstrate the benefits of parallel-programming frameworks [Fos10].

Common advice is to use a parallel work-queue algorithm (PWQ) [ETH11, Fos10]. This section evaluates this advice by comparing PWQ against a sequential algorithm (SEQ) and also against an alternative parallel algorithm, in all cases solving randomly generated square mazes. Section 6.5.1 discusses PWQ, Section 6.5.2 discusses an alternative parallel algorithm, Section 6.5.4 analyzes its anomalous performance, Section 6.5.5 derives an improved sequential algorithm from the alternative parallel algorithm, Section 6.5.6 makes further performance comparisons, and finally Section 6.5.7 presents future directions and concluding remarks.

6.5.1 Work-Queue Parallel Maze Solver

PWQ is based on SEQ, which is shown in Listing 6.12 (pseudocode for maze_seq.c). The maze is represented by a 2D array of cells and a linear-array-based work queue named ->visited.

Listing 6.12: SEQ Pseudocode
 1 int maze_solve(maze *mp, cell sc, cell ec)
 2 {
 3   cell c = sc;
 4   cell n;
 5   int vi = 0;
 6
 7   maze_try_visit_cell(mp, c, c, &n, 1);
 8   for (;;) {
 9     while (!maze_find_any_next_cell(mp, c, &n)) {
10       if (++vi >= mp->vi)
11         return 0;
12       c = mp->visited[vi].c;
13     }
14     do {
15       if (n == ec) {
16         return 1;
17       }
18       c = n;
19     } while (maze_find_any_next_cell(mp, c, &n));
20     c = mp->visited[vi].c;
21   }
22 }

Line 7 visits the initial cell, and each iteration of the loop spanning lines 8–21 traverses passages headed by one cell. The loop spanning lines 9–13 scans the ->visited[] array for a visited cell with an unvisited neighbor, and the loop spanning lines 14–19 traverses one fork of the submaze headed by that neighbor. Line 20 initializes for the next pass through the outer loop.

The pseudocode for maze_try_visit_cell() is shown on lines 1–12 of Listing 6.13 (maze.c). Line 4 checks to see if cells c and t are adjacent and connected, while line 5 checks to see if cell t has not yet been visited. The celladdr() function returns the address of the specified cell. If either check fails, line 6 returns failure. Line 7 indicates the next cell, line 8 records this cell in the next slot of the ->visited[] array, line 9 indicates that this slot is now full, and line 10 marks this cell as visited and also records the distance from the maze start. Line 11 then returns success.

The pseudocode for maze_find_any_next_cell() is shown on lines 14–28 of Listing 6.13 (maze.c). Line 17 picks up the current cell's distance plus 1, while lines 19, 21, 23, and 25 check the cell in each direction, and lines 20, 22, 24, and 26 return true if the corresponding cell is a candidate next cell. The prevcol(), nextcol(), prevrow(), and nextrow() functions each do the specified array-index-conversion operation. If none of the cells is a candidate, line 27 returns false.

The path is recorded in the maze by counting the number of cells from the starting point, as shown in Figure 6.22, where the starting cell is in the upper left and the ending cell is in the lower right. Starting at the ending cell and following consecutively decreasing cell numbers traverses the solution.


Listing 6.13: SEQ Helper Pseudocode
 1 int maze_try_visit_cell(struct maze *mp, cell c, cell t,
 2                         cell *n, int d)
 3 {
 4   if (!maze_cells_connected(mp, c, t) ||
 5       (*celladdr(mp, t) & VISITED))
 6     return 0;
 7   *n = t;
 8   mp->visited[mp->vi] = t;
 9   mp->vi++;
10   *celladdr(mp, t) |= VISITED | d;
11   return 1;
12 }
13
14 int maze_find_any_next_cell(struct maze *mp, cell c,
15                             cell *n)
16 {
17   int d = (*celladdr(mp, c) & DISTANCE) + 1;
18
19   if (maze_try_visit_cell(mp, c, prevcol(c), n, d))
20     return 1;
21   if (maze_try_visit_cell(mp, c, nextcol(c), n, d))
22     return 1;
23   if (maze_try_visit_cell(mp, c, prevrow(c), n, d))
24     return 1;
25   if (maze_try_visit_cell(mp, c, nextrow(c), n, d))
26     return 1;
27   return 0;
28 }

The parallel work-queue solver is a straightforward parallelization of the algorithm shown in Listings 6.12 and 6.13. Line 10 of Listing 6.12 must use fetch-and-add, and the local variable vi must be shared among the various threads. Lines 5 and 10 of Listing 6.13 must be combined into a CAS loop, with CAS failure indicating a loop in the maze. Lines 8–9 of this listing must use fetch-and-add to arbitrate concurrent attempts to record cells in the ->visited[] array.
This approach does provide significant speedups on a dual-CPU Lenovo W500 running at 2.53 GHz, as shown in Figure 6.23, which displays the cumulative distribution functions (CDFs) for the solution times of the two algorithms, based on the solution of 500 different square 500-by-500 randomly generated mazes. The substantial overlap of the projection of the CDFs onto the x-axis will be addressed in Section 6.5.4.

Interestingly enough, the sequential solution-path track-