
Is Parallel Programming Hard, And, If So, What

Can You Do About It?

Edited by:

Paul E. McKenney
Linux Technology Center
IBM Beaverton
[email protected]

January 2, 2017

Legal Statement
This work represents the views of the editor and the authors and does not necessarily
represent the view of their respective employers.

Trademarks:
• IBM, zSeries, and PowerPC are trademarks or registered trademarks of Interna-
tional Business Machines Corporation in the United States, other countries, or
both.
• Linux is a registered trademark of Linus Torvalds.

• i386 is a trademark of Intel Corporation or its subsidiaries in the United States,
other countries, or both.
• Other company, product, and service names may be trademarks or service marks
of such companies.

The non-source-code text and images in this document are provided under the terms
of the Creative Commons Attribution-Share Alike 3.0 United States license.1 In brief,
you may use the contents of this document for any purpose, personal, commercial, or
otherwise, so long as attribution to the authors is maintained. Likewise, the document
may be modified, and derivative works and translations made available, so long as
such modifications and derivations are offered to the public on equal terms as the
non-source-code text and images in the original document.
Source code is covered by various versions of the GPL.2 Some of this code is
GPLv2-only, as it derives from the Linux kernel, while other code is GPLv2-or-later.
See the comment headers of the individual source files within the CodeSamples directory
in the git archive3 for the exact licenses. If you are unsure of the license for a given
code fragment, you should assume GPLv2-only.
Combined work © 2005-2016 by Paul E. McKenney.

1 http://creativecommons.org/licenses/by-sa/3.0/us/
2 http://www.gnu.org/licenses/gpl-2.0.html
3 git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git
Contents

1 How To Use This Book 1


1.1 Roadmap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Quick Quizzes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Alternatives to This Book . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Sample Source Code . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.5 Whose Book Is This? . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2 Introduction 7
2.1 Historic Parallel Programming Difficulties . . . . . . . . . . . . . . . . 7
2.2 Parallel Programming Goals . . . . . . . . . . . . . . . . . . . . . . 9
2.2.1 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.2 Productivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.3 Generality . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Alternatives to Parallel Programming . . . . . . . . . . . . . . . . . . 14
2.3.1 Multiple Instances of a Sequential Application . . . . . . . . 15
2.3.2 Use Existing Parallel Software . . . . . . . . . . . . . . . . . 15
2.3.3 Performance Optimization . . . . . . . . . . . . . . . . . . . 15
2.4 What Makes Parallel Programming Hard? . . . . . . . . . . . . . . . 16
2.4.1 Work Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.4.2 Parallel Access Control . . . . . . . . . . . . . . . . . . . . . 18
2.4.3 Resource Partitioning and Replication . . . . . . . . . . . . . 18
2.4.4 Interacting With Hardware . . . . . . . . . . . . . . . . . . . 19
2.4.5 Composite Capabilities . . . . . . . . . . . . . . . . . . . . . 19
2.4.6 How Do Languages and Environments Assist With These Tasks? 19
2.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

3 Hardware and its Habits 21


3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1.1 Pipelined CPUs . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1.2 Memory References . . . . . . . . . . . . . . . . . . . . . . 23
3.1.3 Atomic Operations . . . . . . . . . . . . . . . . . . . . . . . 24
3.1.4 Memory Barriers . . . . . . . . . . . . . . . . . . . . . . . . 25
3.1.5 Cache Misses . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.1.6 I/O Operations . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2 Overheads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.1 Hardware System Architecture . . . . . . . . . . . . . . . . . . 27
3.2.2 Costs of Operations . . . . . . . . . . . . . . . . . . . . . . . 28
3.3 Hardware Free Lunch? . . . . . . . . . . . . . . . . . . . . . . . . . 30


3.3.1 3D Integration . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.3.2 Novel Materials and Processes . . . . . . . . . . . . . . . . . . 31
3.3.3 Light, Not Electrons . . . . . . . . . . . . . . . . . . . . . . 32
3.3.4 Special-Purpose Accelerators . . . . . . . . . . . . . . . . . 32
3.3.5 Existing Parallel Software . . . . . . . . . . . . . . . . . . . 33
3.4 Software Design Implications . . . . . . . . . . . . . . . . . . . . . . 33

4 Tools of the Trade 35


4.1 Scripting Languages . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.2 POSIX Multiprocessing . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.2.1 POSIX Process Creation and Destruction . . . . . . . . . . . 36
4.2.2 POSIX Thread Creation and Destruction . . . . . . . . . . . 38
4.2.3 POSIX Locking . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.2.4 POSIX Reader-Writer Locking . . . . . . . . . . . . . . . . . 42
4.2.5 Atomic Operations (gcc Classic) . . . . . . . . . . . . . . . . 45
4.2.6 Atomic Operations (C11) . . . . . . . . . . . . . . . . . . . . 46
4.2.7 Per-Thread Variables . . . . . . . . . . . . . . . . . . . . . . . 47
4.3 Alternatives to POSIX Operations . . . . . . . . . . . . . . . . . . . . 47
4.3.1 Organization and Initialization . . . . . . . . . . . . . . . . . . 47
4.3.2 Thread Creation, Destruction, and Control . . . . . . . . . . . . 47
4.3.3 Locking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.3.4 Atomic Operations . . . . . . . . . . . . . . . . . . . . . . . . 51
4.3.5 Per-CPU Variables . . . . . . . . . . . . . . . . . . . . . . . . 51
4.3.6 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.4 The Right Tool for the Job: How to Choose? . . . . . . . . . . . . . . 53

5 Counting 55
5.1 Why Isn’t Concurrent Counting Trivial? . . . . . . . . . . . . . . . . 56
5.2 Statistical Counters . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.2.1 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.2.2 Array-Based Implementation . . . . . . . . . . . . . . . . . . 59
5.2.3 Eventually Consistent Implementation . . . . . . . . . . . . . 60
5.2.4 Per-Thread-Variable-Based Implementation . . . . . . . . . . 63
5.2.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.3 Approximate Limit Counters . . . . . . . . . . . . . . . . . . . . . . 64
5.3.1 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.3.2 Simple Limit Counter Implementation . . . . . . . . . . . . . 65
5.3.3 Simple Limit Counter Discussion . . . . . . . . . . . . . . . . 71
5.3.4 Approximate Limit Counter Implementation . . . . . . . . . 72
5.3.5 Approximate Limit Counter Discussion . . . . . . . . . . . . 72
5.4 Exact Limit Counters . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.4.1 Atomic Limit Counter Implementation . . . . . . . . . . . . 73
5.4.2 Atomic Limit Counter Discussion . . . . . . . . . . . . . . . . 77
5.4.3 Signal-Theft Limit Counter Design . . . . . . . . . . . . . . . 77
5.4.4 Signal-Theft Limit Counter Implementation . . . . . . . . . . 79
5.4.5 Signal-Theft Limit Counter Discussion . . . . . . . . . . . . 85
5.5 Applying Specialized Parallel Counters . . . . . . . . . . . . . . . . 85
5.6 Parallel Counting Discussion . . . . . . . . . . . . . . . . . . . . . . 86
5.6.1 Parallel Counting Performance . . . . . . . . . . . . . . . . . 86
5.6.2 Parallel Counting Specializations . . . . . . . . . . . . . . . . 87

5.6.3 Parallel Counting Lessons . . . . . . . . . . . . . . . . . . . 88

6 Partitioning and Synchronization Design 91


6.1 Partitioning Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.1.1 Dining Philosophers Problem . . . . . . . . . . . . . . . . . . 91
6.1.2 Double-Ended Queue . . . . . . . . . . . . . . . . . . . . . . 95
6.1.3 Partitioning Example Discussion . . . . . . . . . . . . . . . . 103
6.2 Design Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.3 Synchronization Granularity . . . . . . . . . . . . . . . . . . . . . . 106
6.3.1 Sequential Program . . . . . . . . . . . . . . . . . . . . . . . 106
6.3.2 Code Locking . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.3.3 Data Locking . . . . . . . . . . . . . . . . . . . . . . . . . . 109
6.3.4 Data Ownership . . . . . . . . . . . . . . . . . . . . . . . . . 111
6.3.5 Locking Granularity and Performance . . . . . . . . . . . . . 112
6.4 Parallel Fastpath . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
6.4.1 Reader/Writer Locking . . . . . . . . . . . . . . . . . . . . . 116
6.4.2 Hierarchical Locking . . . . . . . . . . . . . . . . . . . . . . 116
6.4.3 Resource Allocator Caches . . . . . . . . . . . . . . . . . . . . 117
6.5 Beyond Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
6.5.1 Work-Queue Parallel Maze Solver . . . . . . . . . . . . . . . 123
6.5.2 Alternative Parallel Maze Solver . . . . . . . . . . . . . . . . 126
6.5.3 Performance Comparison I . . . . . . . . . . . . . . . . . . . . 127
6.5.4 Alternative Sequential Maze Solver . . . . . . . . . . . . . . 130
6.5.5 Performance Comparison II . . . . . . . . . . . . . . . . . . . 131
6.5.6 Future Directions and Conclusions . . . . . . . . . . . . . . . 132
6.6 Partitioning, Parallelism, and Optimization . . . . . . . . . . . . . . . 133

7 Locking 135
7.1 Staying Alive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
7.1.1 Deadlock . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
7.1.2 Livelock and Starvation . . . . . . . . . . . . . . . . . . . . 145
7.1.3 Unfairness . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
7.1.4 Inefficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
7.2 Types of Locks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
7.2.1 Exclusive Locks . . . . . . . . . . . . . . . . . . . . . . . . 148
7.2.2 Reader-Writer Locks . . . . . . . . . . . . . . . . . . . . . . 148
7.2.3 Beyond Reader-Writer Locks . . . . . . . . . . . . . . . . . 148
7.2.4 Scoped Locking . . . . . . . . . . . . . . . . . . . . . . . . 150
7.3 Locking Implementation Issues . . . . . . . . . . . . . . . . . . . . . 152
7.3.1 Sample Exclusive-Locking Implementation Based on Atomic
Exchange . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
7.3.2 Other Exclusive-Locking Implementations . . . . . . . . . . 153
7.4 Lock-Based Existence Guarantees . . . . . . . . . . . . . . . . . . . 155
7.5 Locking: Hero or Villain? . . . . . . . . . . . . . . . . . . . . . . . . . 157
7.5.1 Locking For Applications: Hero! . . . . . . . . . . . . . . . . . 157
7.5.2 Locking For Parallel Libraries: Just Another Tool . . . . . . . . 157
7.5.3 Locking For Parallelizing Sequential Libraries: Villain! . . . . . 161
7.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163

8 Data Ownership 165


8.1 Multiple Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
8.2 Partial Data Ownership and pthreads . . . . . . . . . . . . . . . . . . 166
8.3 Function Shipping . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
8.4 Designated Thread . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
8.5 Privatization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
8.6 Other Uses of Data Ownership . . . . . . . . . . . . . . . . . . . . . . 167

9 Deferred Processing 169


9.1 Running Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
9.2 Reference Counting . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
9.3 Hazard Pointers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
9.4 Sequence Locks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
9.5 Read-Copy Update (RCU) . . . . . . . . . . . . . . . . . . . . . . . 183
9.5.1 Introduction to RCU . . . . . . . . . . . . . . . . . . . . . . 184
9.5.2 RCU Fundamentals . . . . . . . . . . . . . . . . . . . . . . . . 187
9.5.3 RCU Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
9.5.4 RCU Linux-Kernel API . . . . . . . . . . . . . . . . . . . . 212
9.5.5 “Toy” RCU Implementations . . . . . . . . . . . . . . . . . . 218
9.5.6 RCU Exercises . . . . . . . . . . . . . . . . . . . . . . . . . 236
9.6 Which to Choose? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
9.7 What About Updates? . . . . . . . . . . . . . . . . . . . . . . . . . . 238

10 Data Structures 241


10.1 Motivating Application . . . . . . . . . . . . . . . . . . . . . . . . . . 241
10.2 Partitionable Data Structures . . . . . . . . . . . . . . . . . . . . . . 242
10.2.1 Hash-Table Design . . . . . . . . . . . . . . . . . . . . . . . 242
10.2.2 Hash-Table Implementation . . . . . . . . . . . . . . . . . . 242
10.2.3 Hash-Table Performance . . . . . . . . . . . . . . . . . . . . 246
10.3 Read-Mostly Data Structures . . . . . . . . . . . . . . . . . . . . . . . 247
10.3.1 RCU-Protected Hash Table Implementation . . . . . . . . . . 248
10.3.2 RCU-Protected Hash Table Performance . . . . . . . . . . . 249
10.3.3 RCU-Protected Hash Table Discussion . . . . . . . . . . . . 252
10.4 Non-Partitionable Data Structures . . . . . . . . . . . . . . . . . . . 253
10.4.1 Resizable Hash Table Design . . . . . . . . . . . . . . . . . . 253
10.4.2 Resizable Hash Table Implementation . . . . . . . . . . . . . 255
10.4.3 Resizable Hash Table Discussion . . . . . . . . . . . . . . . . 261
10.4.4 Other Resizable Hash Tables . . . . . . . . . . . . . . . . . . 263
10.5 Other Data Structures . . . . . . . . . . . . . . . . . . . . . . . . . . 266
10.6 Micro-Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
10.6.1 Specialization . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
10.6.2 Bits and Bytes . . . . . . . . . . . . . . . . . . . . . . . . . . 267
10.6.3 Hardware Considerations . . . . . . . . . . . . . . . . . . . . 268
10.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270

11 Validation 271
11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
11.1.1 Where Do Bugs Come From? . . . . . . . . . . . . . . . . . 272
11.1.2 Required Mindset . . . . . . . . . . . . . . . . . . . . . . . . 273
11.1.3 When Should Validation Start? . . . . . . . . . . . . . . . . . 274

11.1.4 The Open Source Way . . . . . . . . . . . . . . . . . . . . . 276


11.2 Tracing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
11.3 Assertions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
11.4 Static Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
11.5 Code Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
11.5.1 Inspection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
11.5.2 Walkthroughs . . . . . . . . . . . . . . . . . . . . . . . . . . 279
11.5.3 Self-Inspection . . . . . . . . . . . . . . . . . . . . . . . . . 280
11.6 Probability and Heisenbugs . . . . . . . . . . . . . . . . . . . . . . . 282
11.6.1 Statistics for Discrete Testing . . . . . . . . . . . . . . . . . 283
11.6.2 Abusing Statistics for Discrete Testing . . . . . . . . . . . . . 285
11.6.3 Statistics for Continuous Testing . . . . . . . . . . . . . . . . 285
11.6.4 Hunting Heisenbugs . . . . . . . . . . . . . . . . . . . . . . . 287
11.7 Performance Estimation . . . . . . . . . . . . . . . . . . . . . . . . . 290
11.7.1 Benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . . 291
11.7.2 Profiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
11.7.3 Differential Profiling . . . . . . . . . . . . . . . . . . . . . . 292
11.7.4 Microbenchmarking . . . . . . . . . . . . . . . . . . . . . . 292
11.7.5 Isolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
11.7.6 Detecting Interference . . . . . . . . . . . . . . . . . . . . . 294
11.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297

12 Formal Verification 301


12.1 General-Purpose State-Space Search . . . . . . . . . . . . . . . . . . . 301
12.1.1 Promela and Spin . . . . . . . . . . . . . . . . . . . . . . . . . 301
12.1.2 How to Use Promela . . . . . . . . . . . . . . . . . . . . . . 305
12.1.3 Promela Example: Locking . . . . . . . . . . . . . . . . . . 308
12.1.4 Promela Example: QRCU . . . . . . . . . . . . . . . . . . . 310
12.1.5 Promela Parable: dynticks and Preemptible RCU . . . . . . . . 317
12.1.6 Validating Preemptible RCU and dynticks . . . . . . . . . . . 322
12.2 Special-Purpose State-Space Search . . . . . . . . . . . . . . . . . . 342
12.2.1 Anatomy of a Litmus Test . . . . . . . . . . . . . . . . . . . 343
12.2.2 What Does This Litmus Test Mean? . . . . . . . . . . . . . . 344
12.2.3 Running a Litmus Test . . . . . . . . . . . . . . . . . . . . . 345
12.2.4 PPCMEM Discussion . . . . . . . . . . . . . . . . . . . . . 346
12.3 Axiomatic Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . 347
12.4 SAT Solvers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
12.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349

13 Putting It All Together 353


13.1 Counter Conundrums . . . . . . . . . . . . . . . . . . . . . . . . . . 353
13.1.1 Counting Updates . . . . . . . . . . . . . . . . . . . . . . . 353
13.1.2 Counting Lookups . . . . . . . . . . . . . . . . . . . . . . . 353
13.2 Refurbish Reference Counting . . . . . . . . . . . . . . . . . . . . . 354
13.2.1 Implementation of Reference-Counting Categories . . . . . . 355
13.2.2 Linux Primitives Supporting Reference Counting . . . . . . . 360
13.2.3 Counter Optimizations . . . . . . . . . . . . . . . . . . . . . . 361
13.3 RCU Rescues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 361
13.3.1 RCU and Per-Thread-Variable-Based Statistical Counters . . . 362
13.3.2 RCU and Counters for Removable I/O Devices . . . . . . . . 364

13.3.3 Array and Length . . . . . . . . . . . . . . . . . . . . . . . . 365


13.3.4 Correlated Fields . . . . . . . . . . . . . . . . . . . . . . . . 366
13.4 Hashing Hassles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 367
13.4.1 Correlated Data Elements . . . . . . . . . . . . . . . . . . . . 367
13.4.2 Update-Friendly Hash-Table Traversal . . . . . . . . . . . . . 368

14 Advanced Synchronization 369


14.1 Avoiding Locks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
14.2 Memory Barriers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 370
14.2.1 Memory Ordering and Memory Barriers . . . . . . . . . . . . 370
14.2.2 If B Follows A, and C Follows B, Why Doesn’t C Follow A? . . 371
14.2.3 Variables Can Have More Than One Value . . . . . . . . . . 373
14.2.4 What Can You Trust? . . . . . . . . . . . . . . . . . . . . . . 374
14.2.5 Review of Locking Implementations . . . . . . . . . . . . . . 382
14.2.6 A Few Simple Rules . . . . . . . . . . . . . . . . . . . . . . 382
14.2.7 Abstract Memory Access Model . . . . . . . . . . . . . . . . 383
14.2.8 Device Operations . . . . . . . . . . . . . . . . . . . . . . . 384
14.2.9 Guarantees . . . . . . . . . . . . . . . . . . . . . . . . . . . 385
14.2.10 What Are Memory Barriers? . . . . . . . . . . . . . . . . . . 386
14.2.11 Locking Constraints . . . . . . . . . . . . . . . . . . . . . . 399
14.2.12 Memory-Barrier Examples . . . . . . . . . . . . . . . . . . . 400
14.2.13 The Effects of the CPU Cache . . . . . . . . . . . . . . . . . 402
14.2.14 Where Are Memory Barriers Needed? . . . . . . . . . . . . . 404
14.3 Non-Blocking Synchronization . . . . . . . . . . . . . . . . . . . . . 404
14.3.1 Simple NBS . . . . . . . . . . . . . . . . . . . . . . . . . . 406
14.3.2 NBS Discussion . . . . . . . . . . . . . . . . . . . . . . . . . 407

15 Parallel Real-Time Computing 409


15.1 What is Real-Time Computing? . . . . . . . . . . . . . . . . . . . . 409
15.1.1 Soft Real Time . . . . . . . . . . . . . . . . . . . . . . . . . 409
15.1.2 Hard Real Time . . . . . . . . . . . . . . . . . . . . . . . . . 410
15.1.3 Real-World Real Time . . . . . . . . . . . . . . . . . . . . . . 411
15.2 Who Needs Real-Time Computing? . . . . . . . . . . . . . . . . . . 415
15.3 Who Needs Parallel Real-Time Computing? . . . . . . . . . . . . . . 416
15.4 Implementing Parallel Real-Time Systems . . . . . . . . . . . . . . . . 417
15.4.1 Implementing Parallel Real-Time Operating Systems . . . . . 418
15.4.2 Implementing Parallel Real-Time Applications . . . . . . . . . 431
15.4.3 The Role of RCU . . . . . . . . . . . . . . . . . . . . . . . . 434
15.5 Real Time vs. Real Fast: How to Choose? . . . . . . . . . . . . . . . 435

16 Ease of Use 437


16.1 What is Easy? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 437
16.2 Rusty Scale for API Design . . . . . . . . . . . . . . . . . . . . . . . . 437
16.3 Shaving the Mandelbrot Set . . . . . . . . . . . . . . . . . . . . . . . 439

17 Conflicting Visions of the Future 441


17.1 The Future of CPU Technology Ain’t What it Used to Be . . . . . . . . 441
17.1.1 Uniprocessor Über Alles . . . . . . . . . . . . . . . . . . . . . 441
17.1.2 Multithreaded Mania . . . . . . . . . . . . . . . . . . . . . . 442
17.1.3 More of the Same . . . . . . . . . . . . . . . . . . . . . . . . 443

17.1.4 Crash Dummies Slamming into the Memory Wall . . . . . . . 444


17.2 Transactional Memory . . . . . . . . . . . . . . . . . . . . . . . . . . 447
17.2.1 Outside World . . . . . . . . . . . . . . . . . . . . . . . . . . 447
17.2.2 Process Modification . . . . . . . . . . . . . . . . . . . . . . . 451
17.2.3 Synchronization . . . . . . . . . . . . . . . . . . . . . . . . 456
17.2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 460
17.3 Hardware Transactional Memory . . . . . . . . . . . . . . . . . . . . 462
17.3.1 HTM Benefits WRT to Locking . . . . . . . . . . . . . . . . 463
17.3.2 HTM Weaknesses WRT Locking . . . . . . . . . . . . . . . 465
17.3.3 HTM Weaknesses WRT to Locking When Augmented . . . . . 471
17.3.4 Where Does HTM Best Fit In? . . . . . . . . . . . . . . . . . 474
17.3.5 Potential Game Changers . . . . . . . . . . . . . . . . . . . . 475
17.3.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 477
17.4 Functional Programming for Parallelism . . . . . . . . . . . . . . . . 478

A Important Questions 481


A.1 What Does “After” Mean? . . . . . . . . . . . . . . . . . . . . . . . . 481
A.2 What is the Difference Between “Concurrent” and “Parallel”? . . . . 484
A.3 What Time Is It? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 485

B Why Memory Barriers? 487


B.1 Cache Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 487
B.2 Cache-Coherence Protocols . . . . . . . . . . . . . . . . . . . . . . . 489
B.2.1 MESI States . . . . . . . . . . . . . . . . . . . . . . . . . . 490
B.2.2 MESI Protocol Messages . . . . . . . . . . . . . . . . . . . . 490
B.2.3 MESI State Diagram . . . . . . . . . . . . . . . . . . . . . . . 491
B.2.4 MESI Protocol Example . . . . . . . . . . . . . . . . . . . . 493
B.3 Stores Result in Unnecessary Stalls . . . . . . . . . . . . . . . . . . . 494
B.3.1 Store Buffers . . . . . . . . . . . . . . . . . . . . . . . . . . 494
B.3.2 Store Forwarding . . . . . . . . . . . . . . . . . . . . . . . . 495
B.3.3 Store Buffers and Memory Barriers . . . . . . . . . . . . . . 496
B.4 Store Sequences Result in Unnecessary Stalls . . . . . . . . . . . . . 499
B.4.1 Invalidate Queues . . . . . . . . . . . . . . . . . . . . . . . . 499
B.4.2 Invalidate Queues and Invalidate Acknowledge . . . . . . . . 500
B.4.3 Invalidate Queues and Memory Barriers . . . . . . . . . . . . 500
B.5 Read and Write Memory Barriers . . . . . . . . . . . . . . . . . . . . 503
B.6 Example Memory-Barrier Sequences . . . . . . . . . . . . . . . . . . 504
B.6.1 Ordering-Hostile Architecture . . . . . . . . . . . . . . . . . 504
B.6.2 Example 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 505
B.6.3 Example 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 505
B.6.4 Example 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . 506
B.7 Memory-Barrier Instructions For Specific CPUs . . . . . . . . . . . . . 507
B.7.1 Alpha . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 509
B.7.2 AMD64 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 511
B.7.3 ARMv7-A/R . . . . . . . . . . . . . . . . . . . . . . . . . . . 511
B.7.4 IA64 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 512
B.7.5 MIPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 513
B.7.6 PA-RISC . . . . . . . . . . . . . . . . . . . . . . . . . . . . 514
B.7.7 POWER / PowerPC . . . . . . . . . . . . . . . . . . . . . . . 514
B.7.8 SPARC RMO, PSO, and TSO . . . . . . . . . . . . . . . . . 515

B.7.9 x86 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 516


B.7.10 zSeries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 517
B.8 Are Memory Barriers Forever? . . . . . . . . . . . . . . . . . . . . . . 517
B.9 Advice to Hardware Designers . . . . . . . . . . . . . . . . . . . . . 518

C Answers to Quick Quizzes 521


C.1 How To Use This Book . . . . . . . . . . . . . . . . . . . . . . . . . . 521
C.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 522
C.3 Hardware and its Habits . . . . . . . . . . . . . . . . . . . . . . . . . 528
C.4 Tools of the Trade . . . . . . . . . . . . . . . . . . . . . . . . . . . . 532
C.5 Counting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 540
C.6 Partitioning and Synchronization Design . . . . . . . . . . . . . . . . 560
C.7 Locking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 566
C.8 Data Ownership . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 576
C.9 Deferred Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . 579
C.10 Data Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 603
C.11 Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 607
C.12 Formal Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . 614
C.13 Putting It All Together . . . . . . . . . . . . . . . . . . . . . . . . . . 621
C.14 Advanced Synchronization . . . . . . . . . . . . . . . . . . . . . . . 625
C.15 Parallel Real-Time Computing . . . . . . . . . . . . . . . . . . . . . 630
C.16 Ease of Use . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 632
C.17 Conflicting Visions of the Future . . . . . . . . . . . . . . . . . . . . 633
C.18 Important Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 637
C.19 Why Memory Barriers? . . . . . . . . . . . . . . . . . . . . . . . . . 638

D Glossary and Bibliography 643

E Credits 679
E.1 Authors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 679
E.2 Reviewers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 679
E.3 Machine Owners . . . . . . . . . . . . . . . . . . . . . . . . . . . . 680
E.4 Original Publications . . . . . . . . . . . . . . . . . . . . . . . . . . 680
E.5 Figure Credits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 680
E.6 Other Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 682
Chapter 1

How To Use This Book

The purpose of this book is to help you program shared-memory parallel machines
without risking your sanity.1 We hope that this book’s design principles will help
you avoid at least some parallel-programming pitfalls. That said, you should think
of this book as a foundation on which to build, rather than as a completed cathedral.
Your mission, if you choose to accept, is to help make further progress in the exciting
field of parallel programming—progress that will in time render this book obsolete.
Parallel programming is not as hard as some say, and we hope that this book makes your
parallel-programming projects easier and more fun.
In short, where parallel programming once focused on science, research, and grand-
challenge projects, it is quickly becoming an engineering discipline. We therefore
examine specific parallel-programming tasks and describe how to approach them. In
some surprisingly common cases, they can even be automated.
This book is written in the hope that presenting the engineering discipline underlying
successful parallel-programming projects will free a new generation of parallel hackers
from the need to slowly and painstakingly reinvent old wheels, enabling them to instead
focus their energy and creativity on new frontiers. We sincerely hope that parallel
programming brings you at least as much fun, excitement, and challenge as it has
brought to us!

1.1 Roadmap
This book is a handbook of widely applicable and heavily used design techniques, rather
than a collection of optimal algorithms with tiny areas of applicability. You are currently
reading Chapter 1, but you knew that already. Chapter 2 gives a high-level overview of
parallel programming.
Chapter 3 introduces shared-memory parallel hardware. After all, it is difficult
to write good parallel code unless you understand the underlying hardware. Because
hardware constantly evolves, this chapter will always be out of date. We will nevertheless
do our best to keep up. Chapter 4 then provides a very brief overview of common shared-
memory parallel-programming primitives.
Chapter 5 takes an in-depth look at parallelizing one of the simplest problems
imaginable, namely counting. Because almost everyone has an excellent grasp of
1 Or, perhaps more accurately, without much greater risk to your sanity than that incurred by non-parallel

programming. Which, come to think of it, might not be saying all that much.


counting, this chapter is able to delve into many important parallel-programming issues
without the distractions of more-typical computer-science problems. My impression is
that this chapter has seen the greatest use in parallel-programming coursework.
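To give a rough flavor of the issues that Chapter 5 dissects, consider the following
minimal sketch (not taken from the book’s CodeSamples tree) in which two POSIX
threads increment a shared counter, first with a plain increment and then with the gcc
__sync_fetch_and_add() primitive covered in Chapter 4. The volatile keyword simply
keeps the compiler from collapsing the loop, so that each iteration really performs a
separate load and store.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 2
#define NLOOPS 1000000

static volatile unsigned long counter;

static void *inc_racy(void *arg)      /* plain increment: updates can be lost */
{
        for (int i = 0; i < NLOOPS; i++)
                counter++;
        return NULL;
}

static void *inc_atomic(void *arg)    /* atomic read-modify-write: no lost updates */
{
        for (int i = 0; i < NLOOPS; i++)
                __sync_fetch_and_add(&counter, 1);
        return NULL;
}

static void run(void *(*func)(void *), const char *label)
{
        pthread_t tid[NTHREADS];

        counter = 0;
        for (int i = 0; i < NTHREADS; i++)
                pthread_create(&tid[i], NULL, func, NULL);
        for (int i = 0; i < NTHREADS; i++)
                pthread_join(tid[i], NULL);
        printf("%s: %lu (expected %d)\n", label, counter, NTHREADS * NLOOPS);
}

int main(void)
{
        run(inc_racy, "racy");        /* usually prints less than 2000000 */
        run(inc_atomic, "atomic");    /* always prints exactly 2000000 */
        return 0;
}

Built with, for example, gcc -pthread, the racy total typically falls short by a different
amount on each run, a small taste of the non-determinism that the rest of the book
teaches you to design for and to test for.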
Chapter 6 introduces a number of design-level methods of addressing the issues
identified in Chapter 5. It turns out that it is important to address parallelism at the
design level when feasible: To paraphrase Dijkstra [Dij68], “retrofitted parallelism
considered grossly suboptimal” [McK12b].
The next three chapters examine three important approaches to synchronization.
Chapter 7 covers locking, which in 2014 is not only the workhorse of production-quality
parallel programming, but is also widely considered to be parallel programming’s worst
villain. Chapter 8 gives a brief overview of data ownership, an often overlooked but
remarkably pervasive and powerful approach. Finally, Chapter 9 introduces a number
of deferred-processing mechanisms, including reference counting, hazard pointers,
sequence locking, and RCU.
Chapter 10 applies the lessons of previous chapters to hash tables, which are heavily
used due to their excellent partitionability, which (usually) leads to excellent perfor-
mance and scalability.
As many have learned to their sorrow, parallel programming without validation is a
sure path to abject failure. Chapter 11 covers various forms of testing. It is of course
impossible to test reliability into your program after the fact, so Chapter 12 follows up
with a brief overview of a couple of practical approaches to formal verification.
Chapter 13 contains a series of moderate-sized parallel programming problems.
The difficulty of these problems varies, but they should be appropriate for someone who has
mastered the material in the previous chapters.
Chapter 14 looks at advanced synchronization methods, including memory barriers
and non-blocking synchronization, while Chapter 15 looks at the nascent field of
parallel real-time computing. Chapter 16 follows up with some ease-of-use advice.
Finally, Chapter 17 looks at a few possible future directions, including shared-memory
parallel system design, software and hardware transactional memory, and functional
programming for parallelism.
This chapter is followed by a number of appendices. The most popular of these
appears to be Appendix B, which covers memory barriers. Appendix C contains the
answers to the infamous Quick Quizzes, which are discussed in the next section.

1.2 Quick Quizzes


“Quick quizzes” appear throughout this book, and the answers may be found in Ap-
pendix C starting on page 521. Some of them are based on material in which that quick
quiz appears, but others require you to think beyond that section, and, in some cases,
beyond the realm of current knowledge. As with most endeavors, what you get out of
this book is largely determined by what you are willing to put into it. Therefore, readers
who make a genuine effort to solve a quiz before looking at the answer find their effort
repaid handsomely with increased understanding of parallel programming.
Quick Quiz 1.1: Where are the answers to the Quick Quizzes found?
Quick Quiz 1.2: Some of the Quick Quiz questions seem to be from the viewpoint
of the reader rather than the author. Is that really the intent?
Quick Quiz 1.3: These Quick Quizzes are just not my cup of tea. What can I do
about it?

In short, if you need a deep understanding of the material, then you should invest
some time into answering the Quick Quizzes. Don’t get me wrong, passively reading
the material can be quite valuable, but gaining full problem-solving capability really
does require that you practice solving problems.
I learned this the hard way during coursework for my late-in-life Ph.D. I was
studying a familiar topic, and was surprised at how few of the chapter’s exercises I
could answer off the top of my head.2 Forcing myself to answer the questions greatly
increased my retention of the material. So with these Quick Quizzes I am not asking
you to do anything that I have not been doing myself!
Finally, the most common learning disability is thinking that you already know. The
quick quizzes can be an extremely effective cure.

1.3 Alternatives to This Book


As Knuth learned, if you want your book to be finite, it must be focused. This book
focuses on shared-memory parallel programming, with an emphasis on software that
lives near the bottom of the software stack, such as operating-system kernels, parallel
data-management systems, low-level libraries, and the like. The programming language
used by this book is C.
If you are interested in other aspects of parallelism, you might well be better served
by some other book. Fortunately, there are many alternatives available to you:

1. If you prefer a more academic and rigorous treatment of parallel programming,
you might like Herlihy’s and Shavit’s textbook [HS08]. This book starts with
an interesting combination of low-level primitives at high levels of abstraction
from the hardware, and works its way through locking and simple data structures
including lists, queues, hash tables, and counters, culminating with transactional
memory. Michael Scott’s textbook [Sco13] approaches similar material with
more of a software-engineering focus, and, as far as I know, is the first formally
published academic textbook to include a section devoted to RCU.
2. If you would like an academic treatment of parallel programming from a program-
ming-language-pragmatics viewpoint, you might be interested in the concurrency
chapter from Scott’s textbook [Sco06] on programming-language pragmatics.
3. If you are interested in an object-oriented patternist treatment of parallel pro-
gramming focusing on C++, you might try Volumes 2 and 4 of Schmidt’s POSA
series [SSRB00, BHS07]. Volume 4 in particular has some interesting chapters
applying this work to a warehouse application. The realism of this example is
attested to by the section entitled “Partitioning the Big Ball of Mud”, wherein the
problems inherent in parallelism often take a back seat to the problems inherent
in getting one’s head around a real-world application.
4. If you want to work with Linux-kernel device drivers, then Corbet’s, Rubini’s, and
Kroah-Hartman’s “Linux Device Drivers” [CRKH05] is indispensable, as is the
Linux Weekly News web site (http://lwn.net/). There is a large number
of books and resources on the more general topic of Linux kernel internals.

2 So I suppose that it was just as well that my professors refused to let me waive that class!

5. If your primary focus is scientific and technical computing, and you prefer a
patternist approach, you might try Mattson et al.’s textbook [MSM05]. It covers
Java, C/C++, OpenMP, and MPI. Its patterns are admirably focused first on design,
then on implementation.
6. If your primary focus is scientific and technical computing, and you are interested
in GPUs, CUDA, and MPI, you might check out Norm Matloff’s “Programming
on Parallel Machines” [Mat13].
7. If you are interested in POSIX Threads, you might take a look at David R. Buten-
hof’s book [But97]. In addition, W. Richard Stevens’s book [Ste92] covers UNIX
and POSIX, and Stewart Weiss’s lecture notes [Wei13] provide a thorough and
accessible introduction with a good set of examples.
8. If you are interested in C++11, you might like Anthony Williams’s “C++ Concur-
rency in Action: Practical Multithreading” [Wil12].
9. If you are interested in C++, but in a Windows environment, you might try Herb
Sutter’s “Effective Concurrency” series in Dr. Dobb’s Journal [Sut08]. This series
does a reasonable job of presenting a commonsense approach to parallelism.
10. If you want to try out Intel Threading Building Blocks, then perhaps James
Reinders’s book [Rei07] is what you are looking for.
11. Those interested in learning how various types of multi-processor hardware cache
organizations affect the implementation of kernel internals should take a look at
Curt Schimmel’s classic treatment of this subject [Sch94].
12. Finally, those using Java might be well-served by Doug Lea’s textbooks [Lea97,
GPB+07].

However, if you are interested in principles of parallel design for low-level software,
especially software written in C, read on!

1.4 Sample Source Code


This book discusses its fair share of source code, and in many cases this source code
may be found in the CodeSamples directory of this book’s git tree. For example, on
UNIX systems, you should be able to type the following:
find CodeSamples -name rcu_rcpls.c -print

This command will locate the file rcu_rcpls.c, which is called out in Sec-
tion 9.5.5. Other types of systems have well-known ways of locating files by filename.

1.5 Whose Book Is This?


As the cover says, the editor is one Paul E. McKenney. However, the editor does accept
contributions via the [email protected] email list. These contributions
can be in pretty much any form, with popular approaches including text emails, patches
against the book’s LaTeX source, and even git pull requests. Use whatever form
works best for you.

git clone git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git
cd perfbook
# You may need to install a font here. See item 1 in FAQ.txt.
make
evince perfbook.pdf &        # Two-column version
make perfbook-1c.pdf
evince perfbook-1c.pdf &     # One-column version for e-readers

Figure 1.1: Creating an Up-To-Date PDF


git remote update
git checkout origin/master
make
evince perfbook.pdf &        # Two-column version
make perfbook-1c.pdf
evince perfbook-1c.pdf &     # One-column version for e-readers

Figure 1.2: Generating an Updated PDF

To create patches or git pull requests, you will need the LaTeX source to the book,
which is at git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git.
You will of course also need git and LaTeX, which are available as part of most
mainstream Linux distributions. Other packages may be required, depending on the
distribution you use. The required list of packages for a few popular distributions is
listed in the file FAQ-BUILD.txt in the LaTeX source to the book.
To create and display a current LaTeX source tree of this book, use the list of Linux
commands shown in Figure 1.1. In some environments, the evince command that
displays perfbook.pdf may need to be replaced, for example, with acroread. The
git clone command need only be used the first time you create a PDF; subsequently,
you can run the commands shown in Figure 1.2 to pull in any updates and generate an
updated PDF. The commands in Figure 1.2 must be run within the perfbook directory
created by the commands shown in Figure 1.1.
PDFs of this book are sporadically posted at
http://kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html and at
http://www.rdrop.com/users/paulmck/perfbook/.
The actual process of contributing patches and sending git pull requests is
similar to that of the Linux kernel, which is documented in the Documentation/
SubmittingPatches file in the Linux source tree. One important requirement is
that each patch (or commit, in the case of a git pull request) must contain a valid
Signed-off-by: line, which has the following format:

Signed-off-by: My Name <[email protected]>

Please see http://lkml.org/lkml/2007/1/15/219 for an example patch
containing a Signed-off-by: line.
It is important to note that the Signed-off-by: line has a very specific meaning,
namely that you are certifying that:

1. The contribution was created in whole or in part by me and I have the right to
submit it under the open source license indicated in the file; or
2. The contribution is based upon previous work that, to the best of my knowledge,
is covered under an appropriate open source License and I have the right under

that license to submit that work with modifications, whether created in whole
or in part by me, under the same open source license (unless I am permitted to
submit under a different license), as indicated in the file; or
3. The contribution was provided directly to me by some other person who certified
(a), (b) or (c) and I have not modified it.

4. I understand and agree that this project and the contribution are public and that
a record of the contribution (including all personal information I submit with
it, including my sign-off) is maintained indefinitely and may be redistributed
consistent with this project or the open source license(s) involved.

This is similar to the Developer’s Certificate of Origin (DCO) 1.1 used by the
Linux kernel. The only addition is item #4. This added item says that you wrote the
contribution yourself, as opposed to having (say) copied it from somewhere. If multiple
people authored a contribution, each should have a Signed-off-by: line.
You must use your real name: I unfortunately cannot accept pseudonymous or
anonymous contributions.
The language of this book is American English; however, the open-source nature
of this book permits translations, and I personally encourage them. The open-source
licenses covering this book additionally allow you to sell your translation, if you wish. I
do request that you send me a copy of the translation (hardcopy if available), but this
is a request made as a professional courtesy, and is not in any way a prerequisite to
the permission that you already have under the Creative Commons and GPL licenses.
Please see the FAQ.txt file in the source tree for a list of translations currently in
progress. I consider a translation effort to be “in progress” once at least one chapter has
been fully translated.
As noted at the beginning of this section, I am this book’s editor. However, if you
choose to contribute, it will be your book as well. With that, I offer you Chapter 2, our
introduction.
If parallel programming is so hard, why are there any
parallel programs?

Unknown

Chapter 2

Introduction

Parallel programming has earned a reputation as one of the most difficult areas a hacker
can tackle. Papers and textbooks warn of the perils of deadlock, livelock, race conditions,
non-determinism, Amdahl’s-Law limits to scaling, and excessive realtime latencies. And
these perils are quite real; we authors have accumulated uncounted years of experience
dealing with them, and all of the emotional scars, grey hairs, and hair loss that go with
such experiences.
However, new technologies that are difficult to use at introduction invariably become
easier over time. For example, the once-rare ability to drive a car is now commonplace
in many countries. This dramatic change came about for two basic reasons: (1) cars
became cheaper and more readily available, so that more people had the opportunity
to learn to drive, and (2) cars became easier to operate due to automatic transmissions,
automatic chokes, automatic starters, greatly improved reliability, and a host of other
technological improvements.
The same is true of many other technologies, including computers. It is no
longer necessary to operate a keypunch in order to program. Spreadsheets allow
most non-programmers to get results from their computers that would have required
a team of specialists a few decades ago. Perhaps the most compelling example is
web-surfing and content creation, which since the early 2000s has been easily done
by untrained, uneducated people using various now-commonplace social-networking
tools. As recently as 1968, such content creation was a far-out research project [Eng68],
described at the time as “like a UFO landing on the White House lawn” [Gri00].
Therefore, if you wish to argue that parallel programming will remain as difficult as
it is currently perceived by many to be, it is you who bears the burden of proof, keeping
in mind the many centuries of counter-examples in a variety of fields of endeavor.

2.1 Historic Parallel Programming Difficulties


As indicated by its title, this book takes a different approach. Rather than complain about
the difficulty of parallel programming, it instead examines the reasons why parallel
programming is difficult, and then works to help the reader to overcome these difficulties.
As will be seen, these difficulties have fallen into several categories, including:

1. The historic high cost and relative rarity of parallel systems.


2. The typical researcher’s and practitioner’s lack of experience with parallel sys-
tems.

3. The paucity of publicly accessible parallel code.

4. The lack of a widely understood engineering discipline of parallel programming.

5. The high overhead of communication relative to that of processing, even in tightly
coupled shared-memory computers.

Many of these historic difficulties are well on the way to being overcome. First, over
the past few decades, the cost of parallel systems has decreased from many multiples of
that of a house to a fraction of that of a bicycle, courtesy of Moore’s Law. Papers calling
out the advantages of multicore CPUs were published as early as 1996 [ONH+96]. IBM
introduced simultaneous multi-threading into its high-end POWER family in 2000, and
multicore in 2001. Intel introduced hyperthreading into its commodity Pentium line
in November 2000, and both AMD and Intel introduced dual-core CPUs in 2005. Sun
followed with the multicore/multi-threaded Niagara in late 2005. In fact, by 2008, it
was becoming difficult to find a single-CPU desktop system, with single-core CPUs
being relegated to netbooks and embedded devices. By 2012, even smartphones were
starting to sport multiple CPUs.
Second, the advent of low-cost and readily available multicore systems means
that the once-rare experience of parallel programming is now available to almost all
researchers and practitioners. In fact, parallel systems are now well within the budget of
students and hobbyists. We can therefore expect greatly increased levels of invention
and innovation surrounding parallel systems, and that increased familiarity will over
time make the once prohibitively expensive field of parallel programming much more
friendly and commonplace.
Third, in the 20th century, large systems of highly parallel software were almost
always closely guarded proprietary secrets. In happy contrast, the 21st century has
seen numerous open-source (and thus publicly available) parallel software projects,
including the Linux kernel [Tor03], database systems [Pos08, MS08], and message-
passing systems [The08, UoC08]. This book will draw primarily from the Linux kernel,
but will provide much material suitable for user-level applications.
Fourth, even though the large-scale parallel-programming projects of the 1980s and
1990s were almost all proprietary projects, these projects have seeded other communities
with a cadre of developers who understand the engineering discipline required to develop
production-quality parallel code. A major purpose of this book is to present this
engineering discipline.
Unfortunately, the fifth difficulty, the high cost of communication relative to that
of processing, remains largely in force. Although this difficulty has been receiving
increasing attention during the new millennium, according to Stephen Hawking, the
finite speed of light and the atomic nature of matter is likely to limit progress in this
area [Gar07, Moo03]. Fortunately, this difficulty has been in force since the late 1980s,
so that the aforementioned engineering discipline has evolved practical and effective
strategies for handling it. In addition, hardware designers are increasingly aware of
these issues, so perhaps future hardware will be more friendly to parallel software as
discussed in Section 3.3.
Quick Quiz 2.1: Come on now!!! Parallel programming has been known to be
exceedingly hard for many decades. You seem to be hinting that it is not so hard. What
sort of game are you playing?

However, even though parallel programming might not be as hard as is commonly
advertised, it is often more work than is sequential programming.
Quick Quiz 2.2: How could parallel programming ever be as easy as sequential
programming?
It therefore makes sense to consider alternatives to parallel programming. However,
it is not possible to reasonably consider parallel-programming alternatives without
understanding parallel-programming goals. This topic is addressed in the next section.

2.2 Parallel Programming Goals


The three major goals of parallel programming (over and above those of sequential
programming) are as follows:

1. Performance.
2. Productivity.
3. Generality.

Unfortunately, given the current state of the art, it is possible to achieve at best two
of these three goals for any given parallel program. These three goals therefore form the
iron triangle of parallel programming, a triangle upon which overly optimistic hopes all
too often come to grief.1
Quick Quiz 2.3: Oh, really??? What about correctness, maintainability, robustness,
and so on?
Quick Quiz 2.4: And if correctness, maintainability, and robustness don’t make the
list, why do productivity and generality?
Quick Quiz 2.5: Given that parallel programs are much harder to prove correct than
are sequential programs, again, shouldn’t correctness really be on the list?
Quick Quiz 2.6: What about just having fun?
Each of these goals is elaborated upon in the following sections.

2.2.1 Performance
Performance is the primary goal behind most parallel-programming effort. After all, if
performance is not a concern, why not do yourself a favor: Just write sequential code,
and be happy? It will very likely be easier and you will probably get done much more
quickly.
Quick Quiz 2.7: Are there no cases where parallel programming is about something
other than performance?
Note that “performance” is interpreted quite broadly here, including scalability
(performance per CPU) and efficiency (for example, performance per watt).
That said, the focus of performance has shifted from hardware to parallel software.
This change in focus is due to the fact that, although Moore’s Law continues to deliver
increases in transistor density, it has ceased to provide the traditional single-threaded

1 Kudos to Michael Wong for naming the iron triangle.



[Figure 2.1: MIPS/Clock-Frequency Trend for Intel CPUs. Axes: CPU clock
frequency / MIPS (log scale) versus year, 1975 to 2015.]

performance increases. This can be seen in Figure 2.1,2 which shows that writing
single-threaded code and simply waiting a year or two for the CPUs to catch up may
no longer be an option. Given the recent trends on the part of all major manufacturers
towards multicore/multithreaded systems, parallelism is the way to go for those wanting
to avail themselves of the full performance of their systems.
Even so, the first goal is performance rather than scalability, especially given that the
easiest way to attain linear scalability is to reduce the performance of each CPU [Tor01].
Given a four-CPU system, which would you prefer? A program that provides 100
transactions per second on a single CPU, but does not scale at all? Or a program that
provides 10 transactions per second on a single CPU, but scales perfectly? The first
program seems like a better bet, though the answer might change if you happened to
have a 32-CPU system.
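To see where this trade-off flips, treat throughput as a function of the number of CPUs
n, using the numbers from the preceding paragraph (a back-of-the-envelope illustration,
not a calculation from the book):

    T_first(n)  = 100      transactions per second (no scaling)
    T_second(n) = 10 n     transactions per second (perfect scaling)

    Break-even:  10 n = 100, so n = 10 CPUs.
    On the four-CPU system:  100 versus 40, and the first program wins.
    On a 32-CPU system:      100 versus 320, and the second program wins.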
That said, just because you have multiple CPUs is not necessarily in and of itself
a reason to use them all, especially given the recent decreases in price of multi-CPU
systems. The key point to understand is that parallel programming is primarily a
performance optimization, and, as such, it is one potential optimization of many. If your
program is fast enough as currently written, there is no reason to optimize, either by
parallelizing it or by applying any of a number of potential sequential optimizations.3
By the same token, if you are looking to apply parallelism as an optimization to a
sequential program, then you will need to compare parallel algorithms to the best
sequential algorithms. This may require some care, as far too many publications ignore
the sequential case when analyzing the performance of parallel algorithms.

2 This plot shows clock frequencies for newer CPUs theoretically capable of retiring one or more
instructions per clock, and MIPS (millions of instructions per second, usually from the old Dhrystone
benchmark) for older CPUs requiring multiple clocks to execute even the simplest instruction. The reason for
shifting between these two measures is that the newer CPUs’ ability to retire multiple instructions per clock is
typically limited by memory-system performance. Furthermore, the benchmarks commonly used on the older
CPUs are obsolete, and it is difficult to run the newer benchmarks on systems containing the old CPUs, in part
because it is hard to find working instances of the old CPUs.
3 Of course, if you are a hobbyist whose primary interest is writing parallel software, that is more than
enough reason to parallelize whatever software you are interested in.


Figure 2.2: MIPS per Die for Intel CPUs (MIPS per die versus year, 1975-2015, log scale)

the sequential case when analyzing the performance of parallel algorithms.

2.2.2 Productivity
Quick Quiz 2.8: Why all this prattling on about non-technical issues??? And not just
any non-technical issue, but productivity of all things? Who cares?
Productivity has been becoming increasingly important in recent decades. To see
this, consider that the price of early computers was tens of millions of dollars at a time
when engineering salaries were but a few thousand dollars a year. If dedicating a team
of ten engineers to such a machine would improve its performance, even by only 10%,
then their salaries would be repaid many times over.
One such machine was the CSIRAC, the oldest still-intact stored-program computer,
which was put into operation in 1949 [Mus04, Dep06]. Because this machine was built
before the transistor era, it was constructed of 2,000 vacuum tubes, ran with a clock
frequency of 1kHz, consumed 30kW of power, and weighed more than three metric tons.
Given that this machine had but 768 words of RAM, it is safe to say that it did not suffer
from the productivity issues that often plague today’s large-scale software projects.
Today, it would be quite difficult to purchase a machine with so little computing
power. Perhaps the closest equivalents are 8-bit embedded microprocessors exemplified
by the venerable Z80 [Wik08], but even the old Z80 had a CPU clock frequency more
than 1,000 times faster than the CSIRAC. The Z80 CPU had 8,500 transistors, and could
be purchased in 2008 for less than $2 US per unit in 1,000-unit quantities. In stark
contrast to the CSIRAC, software-development costs are anything but insignificant for
the Z80.
The CSIRAC and the Z80 are two points in a long-term trend, as can be seen in
Figure 2.2. This figure plots an approximation to computational power per die over the
past three decades, showing a consistent four-order-of-magnitude increase. Note that
the advent of multicore CPUs has permitted this increase to continue unabated despite
the clock-frequency wall encountered in 2003.
One of the inescapable consequences of the rapid decrease in the cost of hardware
is that software productivity becomes increasingly important. It is no longer sufficient


merely to make efficient use of the hardware: It is now necessary to make extremely
efficient use of software developers as well. This has long been the case for sequential
hardware, but parallel hardware has become a low-cost commodity only recently. There-
fore, only recently has high productivity become critically important when creating
parallel software.
Quick Quiz 2.9: Given how cheap parallel systems have become, how can anyone
afford to pay people to program them?
Perhaps at one time, the sole purpose of parallel software was performance. Now,
however, productivity is gaining the spotlight.

2.2.3 Generality
One way to justify the high cost of developing parallel software is to strive for maximal
generality. All else being equal, the cost of a more-general software artifact can be
spread over more users than that of a less-general one. In fact, this economic force
explains much of the maniacal focus on portability, which can be seen as an important
special case of generality.4
Unfortunately, generality often comes at the cost of performance, productivity, or
both. For example, portability is often achieved via adaptation layers, which inevitably
exact a performance penalty. To see this more generally, consider the following popular
parallel programming environments:

C/C++ “Locking Plus Threads” : This category, which includes POSIX Threads
(pthreads) [Ope97], Windows Threads, and numerous operating-system kernel
environments, offers excellent performance (at least within the confines of a
single SMP system) and also offers good generality. Pity about the relatively low
productivity.

Java : This general purpose and inherently multithreaded programming environment


is widely believed to offer much higher productivity than C or C++, courtesy of
the automatic garbage collector and the rich set of class libraries. However, its
performance, though greatly improved in the early 2000s, lags that of C and C++.

MPI : This Message Passing Interface [MPI08] powers the largest scientific and
technical computing clusters in the world and offers unparalleled performance
and scalability. In theory, it is general purpose, but it is mainly used for scientific
and technical computing. Its productivity is believed by many to be even lower
than that of C/C++ “locking plus threads” environments.

OpenMP : This set of compiler directives can be used to parallelize loops. It is thus
quite specific to this task, and this specificity often limits its performance. It is,
however, much easier to use than MPI or C/C++ “locking plus threads.”

SQL : Structured Query Language [Int92] is specific to relational database queries.


However, its performance is quite good as measured by the Transaction Processing
Performance Council (TPC) benchmark results [Tra01]. Productivity is excellent;
in fact, this parallel programming environment enables people to make good
use of a large parallel system despite having little or no knowledge of parallel
programming concepts.
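
Returning to the first category in this list, the following minimal sketch (not from the book's code samples; the thread and iteration counts are arbitrary) shows the C “locking plus threads” style, using POSIX threads and a pthread_mutex_t to serialize updates to a shared counter:

/* Illustrative sketch only: NTHREADS threads each increment a shared
 * counter under a pthread mutex. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

pthread_mutex_t mylock = PTHREAD_MUTEX_INITIALIZER;
unsigned long counter = 0;

void *worker(void *arg)
{
        int i;

        for (i = 0; i < 1000000; i++) {
                pthread_mutex_lock(&mylock);    /* exclude other threads */
                counter++;                      /* critical section */
                pthread_mutex_unlock(&mylock);
        }
        return NULL;
}

int main(void)
{
        pthread_t tid[NTHREADS];
        int i;

        for (i = 0; i < NTHREADS; i++)
                pthread_create(&tid[i], NULL, worker, NULL);
        for (i = 0; i < NTHREADS; i++)
                pthread_join(tid[i], NULL);
        printf("counter = %lu\n", counter);
        return 0;
}

The locking is explicit and manual, which is exactly the productivity cost noted above.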
Figure 2.3: Software Layers and Performance, Productivity, and Generality (the stack runs from Application at the top, through Middleware (e.g., DBMS), System Libraries, Operating System Kernel, and Firmware, to Hardware at the bottom; productivity matters most near the top, performance and generality near the bottom)

Figure 2.4: Tradeoff Between Productivity and Generality (users 1-4 each sit nearest a special-purpose environment that is productive for their particular job, while general-purpose environments and hardware/software abstractions occupy the middle ground)

The nirvana of parallel programming environments, one that offers world-class


performance, productivity, and generality, simply does not yet exist. Until such a
nirvana appears, it will be necessary to make engineering tradeoffs among performance,
productivity, and generality. One such tradeoff is shown in Figure 2.3, which shows how
productivity becomes increasingly important at the upper layers of the system stack,
while performance and generality become increasingly important at the lower layers
of the system stack. The huge development costs incurred at the lower layers must
be spread over equally huge numbers of users (hence the importance of generality),
and performance lost in lower layers cannot easily be recovered further up the stack.
In the upper layers of the stack, there might be very few users for a given specific
application, in which case productivity concerns are paramount. This explains the
tendency towards “bloatware” further up the stack: extra hardware is often cheaper than
the extra developers. This book is intended for developers working near the bottom of
the stack, where performance and generality are of great concern.

4 Kudos to Michael Wong for pointing this out.


It is important to note that a tradeoff between productivity and generality has existed
for centuries in many fields. For but one example, a nailgun is more productive than
a hammer for driving nails, but in contrast to the nailgun, a hammer can be used for
many things besides driving nails. It should therefore be no surprise to see similar
tradeoffs appear in the field of parallel computing. This tradeoff is shown schematically
in Figure 2.4. Here, users 1, 2, 3, and 4 have specific jobs that they need the computer to
help them with. The most productive possible language or environment for a given user is
one that simply does that user’s job, without requiring any programming, configuration,
or other setup.
Quick Quiz 2.10: This is a ridiculously unachievable ideal! Why not focus on
something that is achievable in practice?
Unfortunately, a system that does the job required by user 1 is unlikely to do
user 2’s job. In other words, the most productive languages and environments are
domain-specific, and thus by definition lacking generality.
Another option is to tailor a given programming language or environment to the
hardware system (for example, low-level languages such as assembly, C, C++, or Java)
or to some abstraction (for example, Haskell, Prolog, or Snobol), as is shown by the
circular region near the center of Figure 2.4. These languages can be considered to
be general in the sense that they are equally ill-suited to the jobs required by users 1,
2, 3, and 4. In other words, their generality is purchased at the expense of decreased
productivity when compared to domain-specific languages and environments. Worse yet,
a language that is tailored to a given abstraction is also likely to suffer from performance
and scalability problems unless and until someone figures out how to efficiently map
that abstraction to real hardware.
Is there no escape from the iron triangle's three conflicting goals of performance,
productivity, and generality?
It turns out that there often is an escape, for example, using the alternatives to
parallel programming discussed in the next section. After all, parallel programming can
be a great deal of fun, but it is not always the best tool for the job.

2.3 Alternatives to Parallel Programming


In order to properly consider alternatives to parallel programming, you must first decide
on what exactly you expect the parallelism to do for you. As seen in Section 2.2, the
primary goals of parallel programming are performance, productivity, and generality.
Because this book is intended for developers working on performance-critical code near
the bottom of the software stack, the remainder of this section focuses primarily on
performance improvement.
It is important to keep in mind that parallelism is but one way to improve perfor-
mance. Other well-known approaches include the following, in roughly increasing order
of difficulty:

1. Run multiple instances of a sequential application.

2. Make the application use existing parallel software.

3. Apply performance optimization to the serial application.

These approaches are covered in the following sections.


2.3.1 Multiple Instances of a Sequential Application


Running multiple instances of a sequential application can allow you to do parallel
programming without actually doing parallel programming. There are a large number
of ways to approach this, depending on the structure of the application.
If your program is analyzing a large number of different scenarios, or is analyzing a
large number of independent data sets, one easy and effective approach is to create a
single sequential program that carries out a single analysis, then use any of a number of
scripting environments (for example the bash shell) to run a number of instances of
that sequential program in parallel. In some cases, this approach can be easily extended
to a cluster of machines.
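
For example, the following shell fragment is a minimal sketch of this approach; the analyze program and the data-set file names are purely illustrative:

# Illustrative only: run one instance of the (hypothetical) sequential
# "analyze" program per input file, in parallel, then wait for all of them.
for f in dataset.*.in
do
        analyze < $f > $f.out &
done
wait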
This approach may seem like cheating, and in fact some denigrate such programs
as “embarrassingly parallel”. It is true that this approach does have some potential
disadvantages, including increased memory consumption, waste of CPU cycles recom-
puting common intermediate results, and increased copying of data. However, it is
often extremely productive, garnering extreme performance gains with little or no added
effort.

2.3.2 Use Existing Parallel Software


There is no longer any shortage of parallel software environments that can present
a single-threaded programming environment, including relational databases [Dat82],
web-application servers, and map-reduce environments. For example, a common design
provides a separate program for each user, each of which generates SQL programs.
These per-user SQL programs are run concurrently against a common relational database,
which automatically runs the users’ queries concurrently. The per-user programs are
responsible only for the user interface, with the relational database taking full responsi-
bility for the difficult issues surrounding parallelism and persistence.
In addition, there are a growing number of parallel library functions, particularly
for numeric computation. Even better, some libraries take advantage of special-purpose
hardware such as vector units and general-purpose graphical processing units (GPGPUs).
Taking this approach often sacrifices some performance, at least when compared to
carefully hand-coding a fully parallel application. However, such sacrifice is often well
repaid by a huge reduction in development effort.
Quick Quiz 2.11: Wait a minute! Doesn’t this approach simply shift the develop-
ment effort from you to whoever wrote the existing parallel software you are using?

2.3.3 Performance Optimization


Up through the early 2000s, CPU performance was doubling every 18 months. In such
an environment, it is often much more important to create new functionality than to do
careful performance optimization. Now that Moore’s Law is “only” increasing transistor
density instead of increasing both transistor density and per-transistor performance, it
might be a good time to rethink the importance of performance optimization. After
all, new hardware generations no longer bring significant single-threaded performance
improvements. Furthermore, many performance optimizations can also conserve energy.
From this viewpoint, parallel programming is but another performance optimization,
albeit one that is becoming much more attractive as parallel systems become cheaper and
more readily available. However, it is wise to keep in mind that the speedup available
from parallelism is limited to roughly the number of CPUs (but see Section 6.5 for an
interesting exception). In contrast, the speedup available from traditional single-threaded
software optimizations can be much larger. For example, replacing a long linked list with
a hash table or a search tree can improve performance by many orders of magnitude. This
highly optimized single-threaded program might run much faster than its unoptimized
parallel counterpart, making parallelization unnecessary. Of course, a highly optimized
parallel program would be even better, aside from the added development effort required.
Furthermore, different programs might have different performance bottlenecks. For
example, if your program spends most of its time waiting on data from your disk drive,
using multiple CPUs will probably just increase the time wasted waiting for the disks.
In fact, if the program was reading from a single large file laid out sequentially on a
rotating disk, parallelizing your program might well make it a lot slower due to the
added seek overhead. You should instead optimize the data layout so that the file can be
smaller (thus faster to read), split the file into chunks which can be accessed in parallel
from different drives, cache frequently accessed data in main memory, or, if possible,
reduce the amount of data that must be read.
Quick Quiz 2.12: What other bottlenecks might prevent additional CPUs from
providing additional performance?
Parallelism can be a powerful optimization technique, but it is not the only such
technique, nor is it appropriate for all situations. Of course, the easier it is to parallelize
your program, the more attractive parallelization becomes as an optimization. Paral-
lelization has a reputation of being quite difficult, which leads to the question “exactly
what makes parallel programming so difficult?”

2.4 What Makes Parallel Programming Hard?


It is important to note that the difficulty of parallel programming is as much a human-
factors issue as it is a set of technical properties of the parallel programming problem.
We do need human beings to be able to tell parallel systems what to do, otherwise known
as programming. But parallel programming involves two-way communication, with
a program’s performance and scalability being the communication from the machine
to the human. In short, the human writes a program telling the computer what to do,
and the computer critiques this program via the resulting performance and scalability.
Therefore, appeals to abstractions or to mathematical analyses will often be of severely
limited utility.
In the Industrial Revolution, the interface between human and machine was eval-
uated by human-factor studies, then called time-and-motion studies. Although there
have been a few human-factor studies examining parallel programming [ENS05, ES05,
HCS+ 05, SS94], these studies have been extremely narrowly focused, and hence unable
to demonstrate any general results. Furthermore, given that the normal range of pro-
grammer productivity spans more than an order of magnitude, it is unrealistic to expect
an affordable study to be capable of detecting (say) a 10% difference in productivity.
Although the multiple-order-of-magnitude differences that such studies can reliably
detect are extremely valuable, the most impressive improvements tend to be based on a
long series of 10% improvements.
We must therefore take a different approach.
One such approach is to carefully consider the tasks that parallel programmers must
undertake that are not required of sequential programmers. We can then evaluate how
well a given programming language or environment assists the developer with these
Figure 2.5: Categories of Tasks Required of Parallel Programmers (work partitioning, parallel access control, resource partitioning and replication, and interacting with hardware, set among the performance, productivity, and generality goals)

tasks. These tasks fall into the four categories shown in Figure 2.5, each of which is
covered in the following sections.

2.4.1 Work Partitioning


Work partitioning is absolutely required for parallel execution: if there is but one “glob”
of work, then it can be executed by at most one CPU at a time, which is by definition
sequential execution. However, partitioning the code requires great care. For example,
uneven partitioning can result in sequential execution once the small partitions have
completed [Amd67]. In less extreme cases, load balancing can be used to fully utilize
available hardware and restore performance and scalability.
Although partitioning can greatly improve performance and scalability, it can also
increase complexity. For example, partitioning can complicate handling of global errors
and events: A parallel program may need to carry out non-trivial synchronization in order
to safely process such global events. More generally, each partition requires some sort of
communication: After all, if a given thread did not communicate at all, it would have no
effect and would thus not need to be executed. However, because communication incurs
overhead, careless partitioning choices can result in severe performance degradation.
Furthermore, the number of concurrent threads must often be controlled, as each
such thread occupies common resources, for example, space in CPU caches. If too many
threads are permitted to execute concurrently, the CPU caches will overflow, resulting
in high cache miss rate, which in turn degrades performance. Conversely, large numbers
of threads are often required to overlap computation and I/O so as to fully utilize I/O
devices.
Quick Quiz 2.13: Other than CPU cache capacity, what might require limiting the
number of concurrent threads?
Finally, permitting threads to execute concurrently greatly increases the program’s
state space, which can make the program difficult to understand and debug, degrading
productivity. All else being equal, smaller state spaces having more regular structure
are more easily understood, but this is a human-factors statement as much as it is a
technical or mathematical statement. Good parallel designs might have extremely large
state spaces, but nevertheless be easy to understand due to their regular structure, while
poor designs can be impenetrable despite having a comparatively small state space. The
best designs exploit embarrassing parallelism, or transform the problem to one having
an embarrassingly parallel solution. In either case, “embarrassingly parallel” is in fact


an embarrassment of riches. The current state of the art enumerates good designs; more
work is required to make more general judgments on state-space size and structure.

2.4.2 Parallel Access Control


Given a single-threaded sequential program, that single thread has full access to all of
the program’s resources. These resources are most often in-memory data structures, but
can be CPUs, memory (including caches), I/O devices, computational accelerators, files,
and much else besides.
The first parallel-access-control issue is whether the form of the access to a given
resource depends on that resource’s location. For example, in many message-passing
environments, local-variable access is via expressions and assignments, while remote-
variable access uses an entirely different syntax, usually involving messaging. The
POSIX Threads environment [Ope97], Structured Query Language (SQL) [Int92], and
partitioned global address-space (PGAS) environments such as Universal Parallel C
(UPC) [EGCD03] offer implicit access, while Message Passing Interface (MPI) [MPI08]
offers explicit access because access to remote data requires explicit messaging.
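
To illustrate explicit access, the following minimal MPI program (illustrative only; the value and the message tag are arbitrary) has rank 0 explicitly send a value that rank 1 must explicitly receive, where a shared-memory program would simply have the second thread read the variable:

/* Illustrative sketch: explicit message passing with MPI.  Run with at
 * least two ranks, for example via "mpirun -np 2 ./a.out". */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
        int rank, x = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
                x = 42;                         /* local assignment */
                MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
                MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                printf("rank 1 received %d\n", x);
        }
        MPI_Finalize();
        return 0;
}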
The other parallel-access-control issue is how threads coordinate access to the re-
sources. This coordination is carried out by the very large number of synchronization
mechanisms provided by various parallel languages and environments, including mes-
sage passing, locking, transactions, reference counting, explicit timing, shared atomic
variables, and data ownership. Many traditional parallel-programming concerns such as
deadlock, livelock, and transaction rollback stem from this coordination. This frame-
work can be elaborated to include comparisons of these synchronization mechanisms,
for example locking vs. transactional memory [MMW07], but such elaboration is be-
yond the scope of this section. (See Sections 17.2 and 17.3 for more information on
transactional memory.)

2.4.3 Resource Partitioning and Replication


The most effective parallel algorithms and systems exploit resource parallelism, so much
so that it is usually wise to begin parallelization by partitioning your write-intensive
resources and replicating frequently accessed read-mostly resources. The resource in
question is most frequently data, which might be partitioned over computer systems,
mass-storage devices, NUMA nodes, CPU cores (or dies or hardware threads), pages,
cache lines, instances of synchronization primitives, or critical sections of code. For
example, partitioning over locking primitives is termed “data locking” [BK85].
Resource partitioning is frequently application dependent. For example, numerical
applications frequently partition matrices by row, column, or sub-matrix, while com-
mercial applications frequently partition write-intensive data structures and replicate
read-mostly data structures. Thus, a commercial application might assign the data for a
given customer to a given few computers out of a large cluster. An application might
statically partition data, or dynamically change the partitioning over time.
Resource partitioning is extremely effective, but it can be quite challenging for
complex multilinked data structures.
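
As a concrete illustration of “data locking”, the following sketch (names and sizes are illustrative, not from the book's code samples) gives each hash bucket its own lock, so that insertions into different buckets can proceed in parallel:

/* Illustrative sketch of per-bucket ("data") locking. */
#include <pthread.h>

#define NBUCKETS 64

struct element {
        struct element *next;
        unsigned long key;
};

struct bucket {
        pthread_mutex_t lock;
        struct element *head;
};

struct bucket table[NBUCKETS];

void table_init(void)
{
        int i;

        for (i = 0; i < NBUCKETS; i++)
                pthread_mutex_init(&table[i].lock, NULL);
}

void table_insert(struct element *e)
{
        struct bucket *b = &table[e->key % NBUCKETS];

        pthread_mutex_lock(&b->lock);   /* lock only this one bucket */
        e->next = b->head;
        b->head = e;
        pthread_mutex_unlock(&b->lock);
}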
Figure 2.6: Ordering of Parallel-Programming Tasks (the same four categories as Figure 2.5)

2.4.4 Interacting With Hardware


Hardware interaction is normally the domain of the operating system, the compiler,
libraries, or other software-environment infrastructure. However, developers working
with novel hardware features and components will often need to work directly with such
hardware. In addition, direct access to the hardware can be required when squeezing
the last drop of performance out of a given system. In this case, the developer may
need to tailor or configure the application to the cache geometry, system topology, or
interconnect protocol of the target hardware.
In some cases, hardware may be considered to be a resource which is subject to
partitioning or access control, as described in the previous sections.

2.4.5 Composite Capabilities


Although these four capabilities are fundamental, good engineering practice uses com-
posites of these capabilities. For example, the data-parallel approach first partitions the
data so as to minimize the need for inter-partition communication, partitions the code
accordingly, and finally maps data partitions and threads so as to maximize throughput
while minimizing inter-thread communication, as shown in Figure 2.6. The developer
can then consider each partition separately, greatly reducing the size of the relevant state
space, in turn increasing productivity. Even though some problems are non-partitionable,
clever transformations into forms permitting partitioning can sometimes greatly enhance
both performance and scalability [Met99].

2.4.6 How Do Languages and Environments Assist With These Tasks?


Although many environments require the developer to deal manually with these tasks,
there are long-standing environments that bring significant automation to bear. The
poster child for these environments is SQL, many implementations of which auto-
matically parallelize single large queries and also automate concurrent execution of
independent queries and updates.
These four categories of tasks must be carried out in all parallel programs, but that
of course does not necessarily mean that the developer must manually carry out these
tasks. We can expect to see ever-increasing automation of these four tasks as parallel
systems continue to become cheaper and more readily available.


Quick Quiz 2.14: Are there any other obstacles to parallel programming?

2.5 Discussion
This section has given an overview of the difficulties with, goals of, and alternatives
to parallel programming. This overview was followed by a discussion of what can
make parallel programming hard, along with a high-level approach for dealing with
parallel programming’s difficulties. Those who still insist that parallel programming
is impossibly difficult should review some of the older guides to parallel program-
ming [Seq88, Dig89, BK85, Inm85]. The following quote from Andrew Birrell’s
monograph [Dig89] is especially telling:

Writing concurrent programs has a reputation for being exotic and difficult.
I believe it is neither. You need a system that provides you with good
primitives and suitable libraries, you need a basic caution and carefulness,
you need an armory of useful techniques, and you need to know of the
common pitfalls. I hope that this paper has helped you towards sharing my
belief.

The authors of these older guides were well up to the parallel programming challenge
back in the 1980s. As such, there are simply no excuses for refusing to step up to the
parallel-programming challenge here in the 21st century!
We are now ready to proceed to the next chapter, which dives into the relevant
properties of the parallel hardware underlying our parallel software.
Premature abstraction is the root of all evil.

A cast of thousands

Chapter 3

Hardware and its Habits

Most people have an intuitive understanding that passing messages between systems is
considerably more expensive than performing simple calculations within the confines of
a single system. However, it is not always so clear that communicating among threads
within the confines of a single shared-memory system can also be quite expensive. This
chapter therefore looks at the cost of synchronization and communication within a
shared-memory system. These few pages can do no more than scratch the surface of
shared-memory parallel hardware design; readers desiring more detail would do well to
start with a recent edition of Hennessy and Patterson’s classic text [HP11, HP95].
Quick Quiz 3.1: Why should parallel programmers bother learning low-level prop-
erties of the hardware? Wouldn’t it be easier, better, and more general to remain at a
higher level of abstraction?

3.1 Overview
Careless reading of computer-system specification sheets might lead one to believe that
CPU performance is a footrace on a clear track, as illustrated in Figure 3.1, where the
race always goes to the swiftest.
Although there are a few CPU-bound benchmarks that approach the ideal shown
in Figure 3.1, the typical program more closely resembles an obstacle course than a
race track. This is because the internal architecture of CPUs has changed dramatically
over the past few decades, courtesy of Moore’s Law. These changes are described in the
following sections.

3.1.1 Pipelined CPUs


In the early 1980s, the typical microprocessor fetched an instruction, decoded it, and
executed it, typically taking at least three clock cycles to complete one instruction
before proceeding to the next. In contrast, the CPUs of the late 1990s and early 2000s
execute many instructions simultaneously, using a deep “pipeline” to control
the flow of instructions internally to the CPU. These modern hardware features can
greatly improve performance, as illustrated by Figure 3.2.
Achieving full performance with a CPU having a long pipeline requires highly
predictable control flow through the program. Suitable control flow can be provided
by a program that executes primarily in tight loops, for example, arithmetic on large


Figure 3.1: CPU Performance at its Best (cartoon: a “CPU Benchmark Trackmeet”)

Figure 3.2: CPUs Old and New (cartoon: one CPU boasts “4.0 GHz clock, 20MB L3 cache, 20 stage pipeline...”, while the other retorts “The only pipeline I need is to cool off that hot-headed brat.”)

matrices or vectors. The CPU can then correctly predict that the branch at the end of the
loop will be taken in almost all cases, allowing the pipeline to be kept full and the CPU
to execute at full speed.
However, branch prediction is not always so easy. For example, consider a program
with many loops, each of which iterates a small but random number of times. For
another example, consider an object-oriented program with many virtual objects that can
reference many different real objects, all with different implementations for frequently
invoked member functions. In these cases, it is difficult or even impossible for the
CPU to predict where the next branch might lead. Then either the CPU must stall
waiting for execution to proceed far enough to be certain where that branch leads, or
it must guess. Although guessing works extremely well for programs with predictable
control flow, for unpredictable branches (such as those in binary search) the guesses will
frequently be wrong. A wrong guess can be expensive because the CPU must discard
any speculatively executed instructions following the corresponding branch, resulting in
a pipeline flush. If pipeline flushes appear too frequently, they drastically reduce overall
performance, as fancifully depicted in Figure 3.3.
Figure 3.3: CPU Meets a Pipeline Flush

Unfortunately, pipeline flushes are not the only hazards in the obstacle course that
modern CPUs must run. The next section covers the hazards of referencing memory.

3.1.2 Memory References


In the 1980s, it often took less time for a microprocessor to load a value from memory
than it did to execute an instruction. In 2006, a microprocessor might be capable of exe-
cuting hundreds or even thousands of instructions in the time required to access memory.
This disparity is due to the fact that Moore’s Law has increased CPU performance at a
much greater rate than it has decreased memory latency, in part due to the rate at which
memory sizes have grown. For example, a typical 1970s minicomputer might have 4KB
(yes, kilobytes, not megabytes, let alone gigabytes) of main memory, with single-cycle
access.1 In 2008, CPU designers still can construct a 4KB memory with single-cycle
access, even on systems with multi-GHz clock frequencies. And in fact they frequently
do construct such memories, but they now call them “level-0 caches”, and they can be
quite a bit bigger than 4KB.
Although the large caches found on modern microprocessors can do quite a bit to
help combat memory-access latencies, these caches require highly predictable data-
access patterns to successfully hide those latencies. Unfortunately, common operations
such as traversing a linked list have extremely unpredictable memory-access patterns—
after all, if the pattern were predictable, we software types would not bother with the
pointers, right? Therefore, as shown in Figure 3.4, memory references often pose severe
obstacles to modern CPUs.
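
For example, compare the data-dependent loads of a linked-list traversal with the sequential loads of an array traversal (an illustrative sketch, not from the book's code samples):

/* Illustrative sketch of predictable vs. unpredictable access patterns. */
#include <stddef.h>

struct node {
        struct node *next;
        long data;
};

/* Each iteration's load address depends on the previous load, so the
 * hardware prefetcher gets little traction. */
long list_sum(struct node *head)
{
        long sum = 0;

        while (head != NULL) {
                sum += head->data;
                head = head->next;
        }
        return sum;
}

/* Sequential addresses, by contrast, are easy to predict and prefetch. */
long array_sum(const long *a, int n)
{
        long sum = 0;
        int i;

        for (i = 0; i < n; i++)
                sum += a[i];
        return sum;
}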
Thus far, we have only been considering obstacles that can arise during a given
CPU’s execution of single-threaded code. Multi-threading presents additional obstacles
to the CPU, as described in the following sections.

1 It is only fair to add that each of these single cycles lasted no less than 1.6 microseconds.
Figure 3.4: CPU Meets a Memory Reference

3.1.3 Atomic Operations

One such obstacle is atomic operations. The problem here is that the whole idea of an
atomic operation conflicts with the piece-at-a-time assembly-line operation of a CPU
pipeline. To hardware designers’ credit, modern CPUs use a number of extremely clever
tricks to make such operations look atomic even though they are in fact being executed
piece-at-a-time, with one common trick being to identify all the cachelines containing
the data to be atomically operated on, ensure that these cachelines are owned by the
CPU executing the atomic operation, and only then proceed with the atomic operation
while ensuring that these cachelines remained owned by this CPU. Because all the data
is private to this CPU, other CPUs are unable to interfere with the atomic operation
despite the piece-at-a-time nature of the CPU’s pipeline. Needless to say, this sort of
trick can require that the pipeline must be delayed or even flushed in order to perform
the setup operations that permit a given atomic operation to complete correctly.
In contrast, when executing a non-atomic operation, the CPU can load values from
cachelines as they appear and place the results in the store buffer, without the need
to wait for cacheline ownership. Fortunately, CPU designers have focused heavily on
atomic operations, so that as of early 2014 they have greatly reduced their overhead.
Even so, the resulting effect on performance is all too often as depicted in Figure 3.5.
Unfortunately, atomic operations usually apply only to single elements of data. Be-
cause many parallel algorithms require that ordering constraints be maintained between
updates of multiple data elements, most CPUs provide memory barriers. These memory
barriers also serve as performance-sapping obstacles, as described in the next section.
Quick Quiz 3.2: What types of machines would allow atomic operations on multiple
data elements?
Figure 3.5: CPU Meets an Atomic Operation

Figure 3.6: CPU Meets a Memory Barrier

3.1.4 Memory Barriers


Memory barriers will be considered in more detail in Section 14.2 and Appendix B. In
the meantime, consider the following simple lock-based critical section:
1 spin_lock(&mylock);
2 a = a + 1;
3 spin_unlock(&mylock);

If the CPU were not constrained to execute these statements in the order shown, the
effect would be that the variable “a” would be incremented without the protection of
“mylock”, which would certainly defeat the purpose of acquiring it. To prevent such
destructive reordering, locking primitives contain either explicit or implicit memory
barriers. Because the whole purpose of these memory barriers is to prevent reorderings
Figure 3.7: CPU Meets a Cache Miss (cartoon: a “cache-miss toll booth”)

that the CPU would otherwise undertake in order to increase performance, memory
barriers almost always reduce performance, as depicted in Figure 3.6.
As with atomic operations, CPU designers have been working hard to reduce
memory-barrier overhead, and have made substantial progress.

3.1.5 Cache Misses


An additional multi-threading obstacle to CPU performance is the “cache miss”. As
noted earlier, modern CPUs sport large caches in order to reduce the performance
penalty that would otherwise be incurred due to high memory latencies. However, these
caches are actually counter-productive for variables that are frequently shared among
CPUs. This is because when a given CPU wishes to modify the variable, it is most likely
the case that some other CPU has modified it recently. In this case, the variable will be
in that other CPU’s cache, but not in this CPU’s cache, which will therefore incur an
expensive cache miss (see Section B.1 for more detail). Such cache misses form a major
obstacle to CPU performance, as shown in Figure 3.7.
Quick Quiz 3.3: So have CPU designers also greatly reduced the overhead of cache
misses?

3.1.6 I/O Operations


A cache miss can be thought of as a CPU-to-CPU I/O operation, and as such is one
of the cheapest I/O operations available. I/O operations involving networking, mass
storage, or (worse yet) human beings pose much greater obstacles than the internal
obstacles called out in the prior sections, as illustrated by Figure 3.8.
This is one of the differences between shared-memory and distributed-system paral-
lelism: shared-memory parallel programs must normally deal with no obstacle worse
than a cache miss, while a distributed parallel program will typically incur the larger
network communication latencies. In both cases, the relevant latencies can be thought
of as a cost of communication—a cost that would be absent in a sequential program.
Therefore, the ratio between the overhead of the communication to that of the actual
Figure 3.8: CPU Waits for I/O Completion (cartoon: “Please stay on the line. Your call is very important to us...”)

work being performed is a key design parameter. A major goal of parallel hardware de-
sign is to reduce this ratio as needed to achieve the relevant performance and scalability
goals. In turn, as will be seen in Chapter 6, a major goal of parallel software design is to
reduce the frequency of expensive operations like communications cache misses.
Of course, it is one thing to say that a given operation is an obstacle, and quite
another to show that the operation is a significant obstacle. This distinction is discussed
in the following sections.

3.2 Overheads
This section presents actual overheads of the obstacles to performance listed out in the
previous section. However, it is first necessary to get a rough view of hardware system
architecture, which is the subject of the next section.

3.2.1 Hardware System Architecture


Figure 3.9 shows a rough schematic of an eight-core computer system. Each die has a
pair of CPU cores, each with its cache, as well as an interconnect allowing the pair of
CPUs to communicate with each other. The system interconnect in the middle of the
diagram allows the four dies to communicate, and also connects them to main memory.
Data moves through this system in units of “cache lines”, which are power-of-two
fixed-size aligned blocks of memory, usually ranging from 32 to 256 bytes in size.
When a CPU loads a variable from memory to one of its registers, it must first load
the cacheline containing that variable into its cache. Similarly, when a CPU stores a
value from one of its registers into memory, it must load the cacheline containing
that variable into its cache, and must also ensure that no other CPU has a copy of that
cacheline.
For example, if CPU 0 were to perform a compare-and-swap (CAS) operation on
a variable whose cacheline resided in CPU 7’s cache, the following over-simplified
sequence of events might ensue:
Figure 3.9: System Hardware Architecture (eight CPUs, two cores per die, each core with its own cache and each die with an interconnect between its pair of cores; a system interconnect joins the four dies and main memory; the scale bar shows the 8cm speed-of-light round-trip distance in vacuum for a 1.8GHz clock period)

1. CPU 0 checks its local cache, and does not find the cacheline.

2. The request is forwarded to CPU 0’s and 1’s interconnect, which checks CPU 1’s
local cache, and does not find the cacheline.

3. The request is forwarded to the system interconnect, which checks with the other
three dies, learning that the cacheline is held by the die containing CPU 6 and 7.

4. The request is forwarded to CPU 6’s and 7’s interconnect, which checks both
CPUs’ caches, finding the value in CPU 7’s cache.

5. CPU 7 forwards the cacheline to its interconnect, and also flushes the cacheline
from its cache.

6. CPU 6’s and 7’s interconnect forwards the cacheline to the system interconnect.

7. The system interconnect forwards the cacheline to CPU 0’s and 1’s interconnect.

8. CPU 0’s and 1’s interconnect forwards the cacheline to CPU 0’s cache.

9. CPU 0 can now perform the CAS operation on the value in its cache.

Quick Quiz 3.4: This is a simplified sequence of events? How could it possibly be
any more complex?
Quick Quiz 3.5: Why is it necessary to flush the cacheline from CPU 7’s cache?
This simplified sequence is just the beginning of a discipline called cache-coherency
protocols [HP95, CSG99, MHS12, SHW11].
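
For reference, the CAS operation itself might be written as follows in portable C11; this is a sketch only, the variable and function names are made up, and production code (including this book's code samples and the Linux kernel) uses its own primitives:

/* Illustrative sketch of a compare-and-swap using C11 atomics. */
#include <stdatomic.h>
#include <stdbool.h>

_Atomic int shared_var;

bool try_update(int old, int new)
{
        /* If shared_var still equals old, atomically replace it with new
         * and return true; otherwise return false.  Before this can
         * happen, the hardware must carry out the cacheline shuffling
         * described above to gain exclusive ownership of the line. */
        return atomic_compare_exchange_strong(&shared_var, &old, new);
}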

3.2.2 Costs of Operations


The overheads of some common operations important to parallel programs are displayed
in Table 3.1. This system’s clock period rounds to 0.6ns. Although it is not unusual
for modern microprocessors to be able to retire multiple instructions per clock period,
Operation             Cost (ns)    Ratio (cost/clock)
Clock period                0.6                   1.0
Best-case CAS              37.9                  63.2
Best-case lock             65.6                 109.3
Single cache miss         139.5                 232.5
CAS cache miss            306.0                 510.0
Comms Fabric            5,000.0               8,330.0
Global Comms      195,000,000.0         325,000,000.0

Table 3.1: Performance of Synchronization Mechanisms on 4-CPU 1.8GHz AMD Opteron 844 System

the operations’ costs are nevertheless normalized to a clock period in the third column,
labeled “Ratio”. The first thing to note about this table is the large values of many of the
ratios.
The best-case compare-and-swap (CAS) operation consumes almost forty nanosec-
onds, a duration more than sixty times that of the clock period. Here, “best case” means
that the same CPU now performing the CAS operation on a given variable was the
last CPU to operate on this variable, so that the corresponding cache line is already
held in that CPU’s cache. Similarly, the best-case lock operation (a “round trip” pair
consisting of a lock acquisition followed by a lock release) consumes more than sixty
nanoseconds, or more than one hundred clock cycles. Again, “best case” means that
the data structure representing the lock is already in the cache belonging to the CPU
acquiring and releasing the lock. The lock operation is more expensive than CAS
because it requires two atomic operations on the lock data structure.
An operation that misses the cache consumes almost one hundred and forty nanosec-
onds, or more than two hundred clock cycles. The code used for this cache-miss
measurement passes the cache line back and forth between a pair of CPUs, so this cache
miss is satisfied not from memory, but rather from the other CPU’s cache. A CAS
operation, which must look at the old value of the variable as well as store a new value,
consumes over three hundred nanoseconds, or more than five hundred clock cycles.
Think about this a bit. In the time required to do one CAS operation, the CPU could
have executed more than five hundred normal instructions. This should demonstrate the
limitations not only of fine-grained locking, but of any other synchronization mechanism
relying on fine-grained global agreement.
Quick Quiz 3.6: Surely the hardware designers could be persuaded to improve
this situation! Why have they been content with such abysmal performance for these
single-instruction operations?
I/O operations are even more expensive. As shown in the “Comms Fabric” row,
high performance (and expensive!) communications fabric, such as InfiniBand or any
number of proprietary interconnects, has a latency of roughly five microseconds for an
end-to-end round trip, during which time more than eight thousand instructions might
have been executed. Standards-based communications networks often require some
sort of protocol processing, which further increases the latency. Of course, geographic
distance also increases latency, with the speed-of-light through optical fiber latency
around the world coming to roughly 195 milliseconds, or more than 300 million clock
Figure 3.10: Hardware and Software: On Same Side

cycles, as shown in the “Global Comms” row.


Quick Quiz 3.7: These numbers are insanely large! How can I possibly get my
head around them?
In short, hardware and software engineers are really fighting on the same side, trying
to make computers go fast despite the best efforts of the laws of physics, as fancifully
depicted in Figure 3.10 where our data stream is trying its best to exceed the speed
of light. The next section discusses some of the things that the hardware engineers
might (or might not) be able to do. Software’s contribution to this fight is outlined in the
remaining chapters of this book.

3.3 Hardware Free Lunch?


The major reason that concurrency has been receiving so much focus over the past few
years is the end of Moore’s-Law induced single-threaded performance increases (or
“free lunch” [Sut08]), as shown in Figure 2.1 on page 10. This section briefly surveys a
few ways that hardware designers might be able to bring back some form of the “free
lunch”.
However, the preceding section presented some substantial hardware obstacles to
exploiting concurrency. One severe physical limitation that hardware designers face is
the finite speed of light. As noted in Figure 3.9 on page 28, light can travel only about
an 8-centimeters round trip in a vacuum during the duration of a 1.8 GHz clock period.
This distance drops to about 3 centimeters for a 5 GHz clock. Both of these distances
are relatively small compared to the size of a modern computer system.
To make matters even worse, electric waves in silicon move from three to thirty
times more slowly than does light in a vacuum, and common clocked logic constructs
run still more slowly, for example, a memory reference may need to wait for a local
cache lookup to complete before the request may be passed on to the rest of the system.
Furthermore, relatively low speed and high power drivers are required to move electrical
signals from one silicon die to another, for example, to communicate between a CPU
and main memory.
Quick Quiz 3.8: But individual electrons don’t move anywhere near that fast, even
in conductors!!! The electron drift velocity in a conductor under the low voltages found
in semiconductors is on the order of only one millimeter per second. What gives???
There are nevertheless some technologies (both hardware and software) that might
help improve matters:
Figure 3.11: Latency Benefit of 3D Integration (a 3cm die versus a stack of 1.5cm dies, each layer roughly 70um thick)

1. 3D integration,
2. Novel materials and processes,
3. Substituting light for electricity,
4. Special-purpose accelerators, and
5. Existing parallel software.

Each of these is described in one of the following sections.

3.3.1 3D Integration
3-dimensional integration (3DI) is the practice of bonding very thin silicon dies to
each other in a vertical stack. This practice provides potential benefits, but also poses
significant fabrication challenges [Kni08].
Perhaps the most important benefit of 3DI is decreased path length through the
system, as shown in Figure 3.11. A 3-centimeter silicon die is replaced with a stack of
four 1.5-centimeter dies, in theory decreasing the maximum path through the system by
a factor of two, keeping in mind that each layer is quite thin. In addition, given proper
attention to design and placement, long horizontal electrical connections (which are
both slow and power hungry) can be replaced by short vertical electrical connections,
which are both faster and more power efficient.
However, delays due to levels of clocked logic will not be decreased by 3D in-
tegration, and significant manufacturing, testing, power-supply, and heat-dissipation
problems must be solved for 3D integration to reach production while still delivering on
its promise. The heat-dissipation problems might be solved using semiconductors based
on diamond, which is a good conductor for heat, but an electrical insulator. That said, it
remains difficult to grow large single diamond crystals, to say nothing of slicing them
into wafers. In addition, it seems unlikely that any of these technologies will be able to
deliver the exponential increases to which some people have become accustomed. That
said, they may be necessary steps on the path to the late Jim Gray’s “smoking hairy golf
balls” [Gra02].

3.3.2 Novel Materials and Processes


Stephen Hawking is said to have claimed that semiconductor manufacturers have but
two fundamental problems: (1) the finite speed of light and (2) the atomic nature of
matter [Gar07]. It is possible that semiconductor manufacturers are approaching these
limits, but there are nevertheless a few avenues of research and development focused on
working around these fundamental limits.
One workaround for the atomic nature of matter are so-called “high-K dielectric”
materials, which allow larger devices to mimic the electrical properties of infeasibly
small devices. These materials pose some severe fabrication challenges, but nevertheless
may help push the frontiers out a bit farther. Another more-exotic workaround stores
multiple bits in a single electron, relying on the fact that a given electron can exist at a
number of energy levels. It remains to be seen if this particular approach can be made
to work reliably in production semiconductor devices.
Another proposed workaround is the “quantum dot” approach that allows much
smaller device sizes, but which is still in the research stage.

3.3.3 Light, Not Electrons


Although the speed of light would be a hard limit, the fact is that semiconductor devices
are limited by the speed of electricity rather than that of light, given that electric waves
in semiconductor materials move at between 3% and 30% of the speed of light in a
vacuum. The use of copper connections on silicon devices is one way to increase the
speed of electricity, and it is quite possible that additional advances will push closer still
to the actual speed of light. In addition, there have been some experiments with tiny
optical fibers as interconnects within and between chips, based on the fact that the speed
of light in glass is more than 60% of the speed of light in a vacuum. One obstacle to
such optical fibers is the inefficiency of conversion between electricity and light and vice
versa, resulting in both power-consumption and heat-dissipation problems.
That said, absent some fundamental advances in the field of physics, any exponential
increases in the speed of data flow will be sharply limited by the actual speed of light in
a vacuum.

3.3.4 Special-Purpose Accelerators


A general-purpose CPU working on a specialized problem is often spending significant
time and energy doing work that is only tangentially related to the problem at hand. For
example, when taking the dot product of a pair of vectors, a general-purpose CPU will
normally use a loop (possibly unrolled) with a loop counter. Decoding the instructions,
incrementing the loop counter, testing this counter, and branching back to the top of the
loop are in some sense wasted effort: the real goal is instead to multiply corresponding
elements of the two vectors. Therefore, a specialized piece of hardware designed
specifically to multiply vectors could get the job done more quickly and with less energy
consumed.
This is in fact the motivation for the vector instructions present in many commodity
microprocessors. Because these instructions operate on multiple data items simultane-
ously, they would permit a dot product to be computed with less instruction-decode and
loop overhead.
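
To make this concrete, the following sketch (x86-specific, purely illustrative, and assuming the length is a multiple of four) contrasts a scalar dot product with a version using SSE intrinsics that operate on four floats at a time:

/* Illustrative sketch: scalar vs. SSE dot product (x86 only). */
#include <immintrin.h>

float dot_scalar(const float *a, const float *b, int n)
{
        float sum = 0.0f;
        int i;

        for (i = 0; i < n; i++)         /* decode, increment, test, branch... */
                sum += a[i] * b[i];     /* ...all to do one multiply-add */
        return sum;
}

float dot_sse(const float *a, const float *b, int n)
{
        __m128 acc = _mm_setzero_ps();
        float tmp[4];
        int i;

        for (i = 0; i < n; i += 4)      /* four elements per iteration */
                acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(&a[i]),
                                                 _mm_loadu_ps(&b[i])));
        _mm_storeu_ps(tmp, acc);
        return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}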
Similarly, specialized hardware can more efficiently encrypt and decrypt, compress
and decompress, encode and decode, and many other tasks besides. Unfortunately, this
efficiency does not come for free. A computer system incorporating this specialized
hardware will contain more transistors, which will consume some power even when not
in use. Software must be modified to take advantage of this specialized hardware, and
this specialized hardware must be sufficiently generally useful that the high up-front
hardware-design costs can be spread over enough users to make the specialized hardware
affordable. In part due to these sorts of economic considerations, specialized hardware


has thus far appeared only for a few application areas, including graphics processing
(GPUs), vector processors (MMX, SSE, and VMX instructions), and, to a lesser extent,
encryption.
Unlike the server and PC arena, smartphones have long used a wide variety of
hardware accelerators. These hardware accelerators are often used for media decoding,
so much so that a high-end MP3 player might be able to play audio for several minutes—
with its CPU fully powered off the entire time. The purpose of these accelerators is
to improve energy efficiency and thus extend battery life: special purpose hardware
can often compute more efficiently than can a general-purpose CPU. This is another
example of the principle called out in Section 2.2.3: Generality is almost never free.
Nevertheless, given the end of Moore’s-Law-induced single-threaded performance
increases, it seems safe to predict that there will be an increasing variety of special-
purpose hardware going forward.

3.3.5 Existing Parallel Software


Although multicore CPUs seem to have taken the computing industry by surprise, the
fact remains that shared-memory parallel computer systems have been commercially
available for more than a quarter century. This is more than enough time for significant
parallel software to make its appearance, and it indeed has. Parallel operating systems
are quite commonplace, as are parallel threading libraries, parallel relational database
management systems, and parallel numerical software. Use of existing parallel software
can go a long ways towards solving any parallel-software crisis we might encounter.
Perhaps the most common example is the parallel relational database management
system. It is not unusual for single-threaded programs, often written in high-level
scripting languages, to access a central relational database concurrently. In the resulting
highly parallel system, only the database need actually deal directly with parallelism. A
very nice trick when it works!

3.4 Software Design Implications


The values of the ratios in Table 3.1 are critically important, as they limit the efficiency
of a given parallel application. To see this, suppose that the parallel application uses
CAS operations to communicate among threads. These CAS operations will typically
involve a cache miss, that is, assuming that the threads are communicating primarily
with each other rather than with themselves. Suppose further that the unit of work
corresponding to each CAS communication operation takes 300ns, which is sufficient
time to compute several floating-point transcendental functions. Then about half of the
execution time will be consumed by the CAS communication operations! This in turn
means that a two-CPU system running such a parallel program would run no faster than
a sequential implementation running on a single CPU.
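To put rough numbers on this, Table 3.1 puts the CAS cache miss at about 306ns. Each 300ns unit of work is then paired with roughly 306ns of communication, so communication consumes about 306/(306+300), or just over 50%, of the execution time. Each of the two CPUs therefore completes a unit of work only about every 600ns, so their combined throughput roughly matches that of a single CPU spending all of its time on the 300ns units of actual work.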
The situation is even worse in the distributed-system case, where the latency of
a single communications operation might take as long as thousands or even millions
of floating-point operations. This illustrates how important it is for communications
operations to be extremely infrequent and to enable very large quantities of processing.
Quick Quiz 3.9: Given that distributed-systems communication is so horribly
expensive, why does anyone bother with such systems?

The lesson should be quite clear: parallel algorithms must be explicitly designed with
these hardware properties firmly in mind. One approach is to run nearly independent
threads. The less frequently the threads communicate, whether by atomic operations,
locks, or explicit messages, the better the application’s performance and scalability will
be. This approach will be touched on in Chapter 5, explored in Chapter 6, and taken to
its logical extreme in Chapter 8.
Another approach is to make sure that any sharing be read-mostly, which allows the
CPUs’ caches to replicate the read-mostly data, in turn allowing all CPUs fast access.
This approach is touched on in Section 5.2.3, and explored more deeply in Chapter 9.
In short, achieving excellent parallel performance and scalability means striving for
embarrassingly parallel algorithms and implementations, whether by careful choice of
data structures and algorithms, use of existing parallel applications and environments, or
transforming the problem into one for which an embarrassingly parallel solution exists.
Quick Quiz 3.10: OK, if we are going to have to apply distributed-programming
techniques to shared-memory parallel programs, why not just always use these dis-
tributed techniques and dispense with shared memory?
So, to sum up:

1. The good news is that multicore systems are inexpensive and readily available.
2. More good news: The overhead of many synchronization operations is much
lower than it was on parallel systems from the early 2000s.
3. The bad news is that the overhead of cache misses is still high, especially on large
systems.

The remainder of this book describes ways of handling this bad news.
In particular, Chapter 4 will cover some of the low-level tools used for parallel
programming, Chapter 5 will investigate problems and solutions to parallel counting,
and Chapter 6 will discuss design disciplines that promote performance and scalability.
You are only as good as your tools, and your tools
are only as good as you are.

Unknown

Chapter 4

Tools of the Trade

This chapter provides a brief introduction to some basic tools of the parallel-programming
trade, focusing mainly on those available to user applications running on operating
systems similar to Linux. Section 4.1 begins with scripting languages, Section 4.2
describes the multi-process parallelism supported by the POSIX API and touches on
POSIX threads, Section 4.3 presents analogous operations in other environments, and
finally, Section 4.4 helps to choose the tool that will get the job done.
Quick Quiz 4.1: You call these tools??? They look more like low-level synchro-
nization primitives to me!
Please note that this chapter provides but a brief introduction. More detail is available
from the references cited (and especially from the Internet), and more information on how
best to use these tools will be provided in later chapters.

4.1 Scripting Languages


The Linux shell scripting languages provide simple but effective ways of managing
parallelism. For example, suppose that you had a program compute_it that you
needed to run twice with two different sets of arguments. This can be accomplished
using UNIX shell scripting as follows:
1 compute_it 1 > compute_it.1.out &
2 compute_it 2 > compute_it.2.out &
3 wait
4 cat compute_it.1.out
5 cat compute_it.2.out

Lines 1 and 2 launch two instances of this program, redirecting their output to two
separate files, with the & character directing the shell to run the two instances of the
program in the background. Line 3 waits for both instances to complete, and lines 4
and 5 display their output. The resulting execution is as shown in Figure 4.1: the two
instances of compute_it execute in parallel, wait completes after both of them do,
and then the two instances of cat execute sequentially.
Quick Quiz 4.2: But this silly shell script isn’t a real parallel program! Why bother
with such trivia???
Quick Quiz 4.3: Is there a simpler way to create a parallel shell script? If so, how?
If not, why not?
[Figure 4.1 (diagram): the two compute_it instances, "compute_it 1 > compute_it.1.out &" and "compute_it 2 > compute_it.2.out &", run in parallel; both complete before wait returns, after which the two cat commands execute sequentially.]

Figure 4.1: Execution Diagram for Parallel Shell Execution

For another example, the make software-build scripting language provides a -j
option that specifies how much parallelism should be introduced into the build process.
For example, typing make -j4 when building a Linux kernel specifies that up to four
parallel compiles be carried out concurrently.
It is hoped that these simple examples convince you that parallel programming need
not always be complex or difficult.
Quick Quiz 4.4: But if script-based parallel programming is so easy, why bother
with anything else?

4.2 POSIX Multiprocessing


This section scratches the surface of the POSIX environment, including pthreads [Ope97],
as this environment is readily available and widely implemented. Section 4.2.1 provides
a glimpse of the POSIX fork() and related primitives, Section 4.2.2 touches on thread
creation and destruction, Section 4.2.3 gives a brief overview of POSIX locking, and,
finally, Section 4.2.4 describes a specific lock which can be used for data that is read by
many threads and only occasionally updated.

4.2.1 POSIX Process Creation and Destruction


Processes are created using the fork() primitive, they may be destroyed using the
kill() primitive, and they may destroy themselves using the exit() primitive. A
process executing a fork() primitive is said to be the “parent” of the newly created
process. A parent may wait on its children using the wait() primitive.
Please note that the examples in this section are quite simple. Real-world applica-
tions using these primitives might need to manipulate signals, file descriptors, shared
memory segments, and any number of other resources. In addition, some applications
need to take specific actions if a given child terminates, and might also need to be
concerned with the reason that the child terminated. These concerns can of course
add substantial complexity to the code. For more information, see any of a number of
textbooks on the subject [Ste92, Wei13].
If fork() succeeds, it returns twice, once for the parent and again for the child.
The value returned from fork() allows the caller to tell the difference, as shown in

1 pid = fork();
2 if (pid == 0) {
3 /* child */
4 } else if (pid < 0) {
5 /* parent, upon error */
6 perror("fork");
7 exit(-1);
8 } else {
9 /* parent, pid == child ID */
10 }

Figure 4.2: Using the fork() Primitive

1 void waitall(void)
2 {
3 int pid;
4 int status;
5
6 for (;;) {
7 pid = wait(&status);
8 if (pid == -1) {
9 if (errno == ECHILD)
10 break;
11 perror("wait");
12 exit(-1);
13 }
14 }
15 }

Figure 4.3: Using the wait() Primitive

Figure 4.2 (forkjoin.c). Line 1 executes the fork() primitive, and saves its return
value in local variable pid. Line 2 checks to see if pid is zero, in which case, this is the
child, which continues on to execute line 3. As noted earlier, the child may terminate via
the exit() primitive. Otherwise, this is the parent, which checks for an error return
from the fork() primitive on line 4, and prints an error and exits on lines 5-7 if so.
Otherwise, the fork() has executed successfully, and the parent therefore executes
line 9 with the variable pid containing the process ID of the child.
The parent process may use the wait() primitive to wait for its children to com-
plete. However, use of this primitive is a bit more complicated than its shell-script
counterpart, as each invocation of wait() waits for but one child process. It is there-
fore customary to wrap wait() into a function similar to the waitall() function
shown in Figure 4.3 (api-pthread.h), with this waitall() function having se-
mantics similar to the shell-script wait command. Each pass through the loop spanning
lines 6-15 waits on one child process. Line 7 invokes the wait() primitive, which
blocks until a child process exits, and returns that child’s process ID. If the process ID
is instead −1, this indicates that the wait() primitive was unable to wait on a child. If
so, line 9 checks for the ECHILD errno, which indicates that there are no more child
processes, so that line 10 exits the loop. Otherwise, lines 11 and 12 print an error and
exit.
Quick Quiz 4.5: Why does this wait() primitive need to be so complicated? Why
not just make it work like the shell-script wait does?
It is critically important to note that the parent and child do not share memory. This
is illustrated by the program shown in Figure 4.4 (forkjoinvar.c), in which the
child sets a global variable x to 1 on line 6, prints a message on line 7, and exits on
line 8. The parent continues at line 14, where it waits on the child, and on line 15 finds
that its copy of the variable x is still zero. The output is thus as follows:

1 int x = 0;
2 int pid;
3
4 pid = fork();
5 if (pid == 0) { /* child */
6 x = 1;
7 printf("Child process set x=1\n");
8 exit(0);
9 }
10 if (pid < 0) { /* parent, upon error */
11 perror("fork");
12 exit(-1);
13 }
14 waitall();
15 printf("Parent process sees x=%d\n", x);

Figure 4.4: Processes Created Via fork() Do Not Share Memory

Child process set x=1
Parent process sees x=0

Quick Quiz 4.6: Isn’t there a lot more to fork() and wait() than discussed
here?
The finest-grained parallelism requires shared memory, and this is covered in Sec-
tion 4.2.2. That said, shared-memory parallelism can be significantly more complex
than fork-join parallelism.

4.2.2 POSIX Thread Creation and Destruction


To create a thread within an existing process, invoke the pthread_create() primi-
tive, for example, as shown on lines 15 and 16 of Figure 4.5 (pcreate.c). The first
argument is a pointer to a pthread_t in which to store the ID of the thread to be
created, the second NULL argument is a pointer to an optional pthread_attr_t,
the third argument is the function (in this case, mythread()) that is to be invoked
by the new thread, and the last NULL argument is the argument that will be passed to
mythread.
In this example, mythread() simply returns, but it could instead call pthread_
exit().
Quick Quiz 4.7: If the mythread() function in Figure 4.5 can simply return,
why bother with pthread_exit()?
The pthread_join() primitive, shown on line 20, is analogous to the fork-join
wait() primitive. It blocks until the thread specified by the tid variable completes
execution, either by invoking pthread_exit() or by returning from the thread’s
top-level function. The thread’s exit value will be stored through the pointer passed as
the second argument to pthread_join(). The thread’s exit value is either the value
passed to pthread_exit() or the value returned by the thread’s top-level function,
depending on how the thread in question exits.
The program shown in Figure 4.5 produces output as follows, demonstrating that
memory is in fact shared between the two threads:
Child process set x=1
Parent process sees x=1

Note that this program carefully makes sure that only one of the threads stores a
value to variable x at a time. Any situation in which one thread might be storing a
value to a given variable while some other thread either loads from or stores to that

1 int x = 0;
2
3 void *mythread(void *arg)
4 {
5 x = 1;
6 printf("Child process set x=1\n");
7 return NULL;
8 }
9
10 int main(int argc, char *argv[])
11 {
12 pthread_t tid;
13 void *vp;
14
15 if (pthread_create(&tid, NULL,
16 mythread, NULL) != 0) {
17 perror("pthread_create");
18 exit(-1);
19 }
20 if (pthread_join(tid, &vp) != 0) {
21 perror("pthread_join");
22 exit(-1);
23 }
24 printf("Parent process sees x=%d\n", x);
25 return 0;
26 }

Figure 4.5: Threads Created Via pthread_create() Share Memory

same variable is termed a “data race”. Because the C language makes no guarantee that
the results of a data race will be in any way reasonable, we need some way of safely
accessing and modifying data concurrently, such as the locking primitives discussed in
the following section.
Quick Quiz 4.8: If the C language makes no guarantees in presence of a data race,
then why does the Linux kernel have so many data races? Are you trying to tell me that
the Linux kernel is completely broken???

4.2.3 POSIX Locking


The POSIX standard allows the programmer to avoid data races via “POSIX locking”.
POSIX locking features a number of primitives, the most fundamental of which are
pthread_mutex_lock() and pthread_mutex_unlock(). These primitives
operate on locks, which are of type pthread_mutex_t. These locks may be declared
statically and initialized with PTHREAD_MUTEX_INITIALIZER, or they may be
allocated dynamically and initialized using the pthread_mutex_init() primitive.
The demonstration code in this section will take the former course.
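By way of contrast, a minimal sketch of the latter, dynamic-initialization course might look as follows. This fragment is not part of the book's code samples; the lock and function names are hypothetical, the error handling follows the style of Figure 4.6, and the usual pthread.h, stdio.h, and stdlib.h includes are assumed:

pthread_mutex_t dynlock;        /* hypothetical dynamically initialized lock */

void dynlock_setup(void)
{
        /* NULL requests the default mutex attributes. */
        if (pthread_mutex_init(&dynlock, NULL) != 0) {
                perror("pthread_mutex_init");
                exit(-1);
        }
}

void dynlock_teardown(void)
{
        /* A dynamically initialized lock may be destroyed once it is no longer needed. */
        if (pthread_mutex_destroy(&dynlock) != 0) {
                perror("pthread_mutex_destroy");
                exit(-1);
        }
}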
The pthread_mutex_lock() primitive “acquires” the specified lock, and the
pthread_mutex_unlock() “releases” the specified lock. Because these are “ex-
clusive” locking primitives, only one thread at a time may “hold” a given lock at a given
time. For example, if a pair of threads attempt to acquire the same lock concurrently,
one of the pair will be “granted” the lock first, and the other will wait until the first
thread releases the lock. A simple and reasonably useful programming model permits a
given data item to be accessed only while holding the corresponding lock [Hoa74].
Quick Quiz 4.9: What if I want several threads to hold the same lock at the same
time?
This exclusive-locking property is demonstrated using the code shown in Figure 4.6
(lock.c). Line 1 defines and initializes a POSIX lock named lock_a, while line 2
similarly defines and initializes a lock named lock_b. Line 3 defines and initializes a

1 pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
2 pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;
3 int x = 0;
4
5 void *lock_reader(void *arg)
6 {
7 int i;
8 int newx = -1;
9 int oldx = -1;
10 pthread_mutex_t *pmlp = (pthread_mutex_t *)arg;
11
12 if (pthread_mutex_lock(pmlp) != 0) {
13 perror("lock_reader:pthread_mutex_lock");
14 exit(-1);
15 }
16 for (i = 0; i < 100; i++) {
17 newx = READ_ONCE(x);
18 if (newx != oldx) {
19 printf("lock_reader(): x = %d\n", newx);
20 }
21 oldx = newx;
22 poll(NULL, 0, 1);
23 }
24 if (pthread_mutex_unlock(pmlp) != 0) {
25 perror("lock_reader:pthread_mutex_unlock");
26 exit(-1);
27 }
28 return NULL;
29 }
30
31 void *lock_writer(void *arg)
32 {
33 int i;
34 pthread_mutex_t *pmlp = (pthread_mutex_t *)arg;
35
36 if (pthread_mutex_lock(pmlp) != 0) {
37 perror("lock_writer:pthread_mutex_lock");
38 exit(-1);
39 }
40 for (i = 0; i < 3; i++) {
41 WRITE_ONCE(x, READ_ONCE(x) + 1);
42 poll(NULL, 0, 5);
43 }
44 if (pthread_mutex_unlock(pmlp) != 0) {
45 perror("lock_writer:pthread_mutex_unlock");
46 exit(-1);
47 }
48 return NULL;
49 }

Figure 4.6: Demonstration of Exclusive Locks



1 printf("Creating two threads using same lock:\n");


2 if (pthread_create(&tid1, NULL,
3 lock_reader, &lock_a) != 0) {
4 perror("pthread_create");
5 exit(-1);
6 }
7 if (pthread_create(&tid2, NULL,
8 lock_writer, &lock_a) != 0) {
9 perror("pthread_create");
10 exit(-1);
11 }
12 if (pthread_join(tid1, &vp) != 0) {
13 perror("pthread_join");
14 exit(-1);
15 }
16 if (pthread_join(tid2, &vp) != 0) {
17 perror("pthread_join");
18 exit(-1);
19 }

Figure 4.7: Demonstration of Same Exclusive Lock

shared variable x.
Lines 5-28 define a function lock_reader(), which repeatedly reads the shared
variable x while holding the lock specified by arg. Line 10 casts arg to a pointer to a
pthread_mutex_t, as required by the pthread_mutex_lock() and pthread_
mutex_unlock() primitives.
Quick Quiz 4.10: Why not simply make the argument to lock_reader() on
line 5 of Figure 4.6 be a pointer to a pthread_mutex_t?
Lines 12-15 acquire the specified pthread_mutex_t, checking for errors and
exiting the program if any occur. Lines 16-23 repeatedly check the value of x, printing
the new value each time that it changes. Line 22 sleeps for one millisecond, which
allows this demonstration to run nicely on a uniprocessor machine. Lines 24-27 release
the pthread_mutex_t, again checking for errors and exiting the program if any
occur. Finally, line 28 returns NULL, again to match the function type required by
pthread_create().
Quick Quiz 4.11: Writing four lines of code for each acquisition and release of a
pthread_mutex_t sure seems painful! Isn’t there a better way?
Lines 31-49 of Figure 4.6 show lock_writer(), which periodically updates
the shared variable x while holding the specified pthread_mutex_t. As with
lock_reader(), line 34 casts arg to a pointer to pthread_mutex_t, and
lines 36-39 acquire the specified lock. While holding the lock,
lines 40-43 increment the shared variable x, sleeping for five milliseconds between each
increment. Finally, lines 44-47 release the lock.
Figure 4.7 shows a code fragment that runs lock_reader() and lock_writer()
as threads using the same lock, namely, lock_a. Lines 2-6 create a thread running
lock_reader(), and then lines 7-11 create a thread running lock_writer().
Lines 12-19 wait for both threads to complete. The output of this code fragment is as
follows:
Creating two threads using same lock:
lock_reader(): x = 0

Because both threads are using the same lock, the lock_reader() thread cannot
see any of the intermediate values of x produced by lock_writer() while holding

1 printf("Creating two threads w/different locks:\n");


2 x = 0;
3 if (pthread_create(&tid1, NULL,
4 lock_reader, &lock_a) != 0) {
5 perror("pthread_create");
6 exit(-1);
7 }
8 if (pthread_create(&tid2, NULL,
9 lock_writer, &lock_b) != 0) {
10 perror("pthread_create");
11 exit(-1);
12 }
13 if (pthread_join(tid1, &vp) != 0) {
14 perror("pthread_join");
15 exit(-1);
16 }
17 if (pthread_join(tid2, &vp) != 0) {
18 perror("pthread_join");
19 exit(-1);
20 }

Figure 4.8: Demonstration of Different Exclusive Locks

the lock.
Quick Quiz 4.12: Is “x = 0” the only possible output from the code fragment shown
in Figure 4.7? If so, why? If not, what other output could appear, and why?
Figure 4.8 shows a similar code fragment, but this time using different locks: lock_
a for lock_reader() and lock_b for lock_writer(). The output of this code
fragment is as follows:
Creating two threads w/different locks:
lock_reader(): x = 0
lock_reader(): x = 1
lock_reader(): x = 2
lock_reader(): x = 3

Because the two threads are using different locks, they do not exclude each other,
and can run concurrently. The lock_reader() function can therefore see the inter-
mediate values of x stored by lock_writer().
Quick Quiz 4.13: Using different locks could cause quite a bit of confusion, what
with threads seeing each others’ intermediate states. So should well-written parallel
programs restrict themselves to using a single lock in order to avoid this kind of
confusion?
Quick Quiz 4.14: In the code shown in Figure 4.8, is lock_reader() guaran-
teed to see all the values produced by lock_writer()? Why or why not?
Quick Quiz 4.15: Wait a minute here!!! Figure 4.7 didn’t initialize shared variable
x, so why does it need to be initialized in Figure 4.8?
Although there is quite a bit more to POSIX exclusive locking, these primitives
provide a good start and are in fact sufficient in a great many situations. The next section
takes a brief look at POSIX reader-writer locking.

4.2.4 POSIX Reader-Writer Locking


The POSIX API provides a reader-writer lock, which is represented by a pthread_
rwlock_t. As with pthread_mutex_t, pthread_rwlock_t may be statically
initialized via PTHREAD_RWLOCK_INITIALIZER or dynamically initialized via
the pthread_rwlock_init() primitive. The pthread_rwlock_rdlock()
primitive read-acquires the specified pthread_rwlock_t, the pthread_rwlock_

1 pthread_rwlock_t rwl = PTHREAD_RWLOCK_INITIALIZER;
2 int holdtime = 0;
3 int thinktime = 0;
4 long long *readcounts;
5 int nreadersrunning = 0;
6
7 #define GOFLAG_INIT 0
8 #define GOFLAG_RUN 1
9 #define GOFLAG_STOP 2
10 char goflag = GOFLAG_INIT;
11
12 void *reader(void *arg)
13 {
14 int i;
15 long long loopcnt = 0;
16 long me = (long)arg;
17
18 __sync_fetch_and_add(&nreadersrunning, 1);
19 while (READ_ONCE(goflag) == GOFLAG_INIT) {
20 continue;
21 }
22 while (READ_ONCE(goflag) == GOFLAG_RUN) {
23 if (pthread_rwlock_rdlock(&rwl) != 0) {
24 perror("pthread_rwlock_rdlock");
25 exit(-1);
26 }
27 for (i = 1; i < holdtime; i++) {
28 barrier();
29 }
30 if (pthread_rwlock_unlock(&rwl) != 0) {
31 perror("pthread_rwlock_unlock");
32 exit(-1);
33 }
34 for (i = 1; i < thinktime; i++) {
35 barrier();
36 }
37 loopcnt++;
38 }
39 readcounts[me] = loopcnt;
40 return NULL;
41 }

Figure 4.9: Measuring Reader-Writer Lock Scalability

wrlock() primitive write-acquires it, and the pthread_rwlock_unlock() primitive
releases it. Only a single thread may write-hold a given pthread_rwlock_t at
any given time, but multiple threads may read-hold a given pthread_rwlock_t, at
least while there is no thread currently write-holding it.
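For example, a minimal sketch of protecting a read-mostly value might look like the following. This fragment is not taken from the book's code samples; the data and function names are hypothetical, and the error handling follows the style of the earlier figures:

pthread_rwlock_t cfglock = PTHREAD_RWLOCK_INITIALIZER;
int config_value;                       /* hypothetical read-mostly data */

int read_config(void)
{
        int v;

        if (pthread_rwlock_rdlock(&cfglock) != 0) {
                perror("pthread_rwlock_rdlock");
                exit(-1);
        }
        v = config_value;               /* many readers may execute here concurrently */
        if (pthread_rwlock_unlock(&cfglock) != 0) {
                perror("pthread_rwlock_unlock");
                exit(-1);
        }
        return v;
}

void update_config(int v)
{
        if (pthread_rwlock_wrlock(&cfglock) != 0) {
                perror("pthread_rwlock_wrlock");
                exit(-1);
        }
        config_value = v;               /* writers exclude readers and other writers */
        if (pthread_rwlock_unlock(&cfglock) != 0) {
                perror("pthread_rwlock_unlock");
                exit(-1);
        }
}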
As you might expect, reader-writer locks are designed for read-mostly situations. In
these situations, a reader-writer lock can provide greater scalability than can an exclusive
lock because the exclusive lock is by definition limited to a single thread holding the
lock at any given time, while the reader-writer lock permits an arbitrarily large number
of readers to concurrently hold the lock. However, in practice, we need to know how
much additional scalability is provided by reader-writer locks.
Figure 4.9 (rwlockscale.c) shows one way of measuring reader-writer lock
scalability. Line 1 shows the definition and initialization of the reader-writer lock, line 2
shows the holdtime argument controlling the time each thread holds the reader-writer
lock, line 3 shows the thinktime argument controlling the time between the release
of the reader-writer lock and the next acquisition, line 4 defines the readcounts array
into which each reader thread places the number of times it acquired the lock, and line 5
defines the nreadersrunning variable, which determines when all reader threads
have started running.
Lines 7-10 define goflag, which synchronizes the start and the end of the test.

[Figure 4.10 (graph): critical-section performance, normalized so that ideal scalability is 1.0 (y axis, 0 to 1.1), versus number of CPUs/threads (x axis, 0 to 140), with one trace per critical-section size from 1K to 100M instructions plus an ideal line.]

Figure 4.10: Reader-Writer Lock Scalability

This variable is initially set to GOFLAG_INIT, then set to GOFLAG_RUN after all the
reader threads have started, and finally set to GOFLAG_STOP to terminate the test run.
Lines 12-41 define reader(), which is the reader thread. Line 18 atomically
increments the nreadersrunning variable to indicate that this thread is now running,
and lines 19-21 wait for the test to start. The READ_ONCE() primitive forces the
compiler to fetch goflag on each pass through the loop—the compiler would otherwise
be within its rights to assume that the value of goflag would never change.
Quick Quiz 4.16: Instead of using READ_ONCE() everywhere, why not just
declare goflag as volatile on line 10 of Figure 4.9?
Quick Quiz 4.17: READ_ONCE() only affects the compiler, not the CPU. Don’t we
also need memory barriers to make sure that the change in goflag’s value propagates
to the CPU in a timely fashion in Figure 4.9?
Quick Quiz 4.18: Would it ever be necessary to use READ_ONCE() when access-
ing a per-thread variable, for example, a variable declared using the gcc __thread
storage class?
The loop spanning lines 22-38 carries out the performance test. Lines 23-26 acquire
the lock, lines 27-29 hold the lock for the specified duration (and the barrier()
directive prevents the compiler from optimizing the loop out of existence), lines 30-33
release the lock, and lines 34-36 wait for the specified duration before re-acquiring the
lock. Line 37 counts this lock acquisition.
Line 39 moves the lock-acquisition count to this thread’s element of the readcounts[]
array, and line 40 returns, terminating this thread.
Figure 4.10 shows the results of running this test on a 64-core Power-5 system
with two hardware threads per core for a total of 128 software-visible CPUs. The
thinktime parameter was zero for all these tests, and the holdtime parameter set
to values ranging from one thousand (“1K” on the graph) to 100 million (“100M” on
the graph). The actual value plotted is:

\frac{L_N}{N L_1} \qquad (4.1)

where N is the number of threads, L_N is the number of lock acquisitions by N threads,
and L_1 is the number of lock acquisitions by a single thread. Given ideal hardware and
software scalability, this value will always be 1.0.
As can be seen in the figure, reader-writer locking scalability is decidedly non-ideal,
especially for smaller sizes of critical sections. To see why read-acquisition can be so
slow, consider that all the acquiring threads must update the pthread_rwlock_t
data structure. Therefore, if all 128 executing threads attempt to read-acquire the reader-
writer lock concurrently, they must update this underlying pthread_rwlock_t one
at a time. One lucky thread might do so almost immediately, but the least-lucky thread
must wait for all the other 127 threads to do their updates. This situation will only get
worse as you add CPUs.
Quick Quiz 4.19: Isn’t comparing against single-CPU throughput a bit harsh?
Quick Quiz 4.20: But 1,000 instructions is not a particularly small size for a critical
section. What do I do if I need a much smaller critical section, for example, one
containing only a few tens of instructions?
Quick Quiz 4.21: In Figure 4.10, all of the traces other than the 100M trace deviate
gently from the ideal line. In contrast, the 100M trace breaks sharply from the ideal line
at 64 CPUs. In addition, the spacing between the 100M trace and the 10M trace is much
smaller than that between the 10M trace and the 1M trace. Why does the 100M trace
behave so much differently than the other traces?
Quick Quiz 4.22: Power-5 is several years old, and new hardware should be faster.
So why should anyone worry about reader-writer locks being slow?
Despite these limitations, reader-writer locking is quite useful in many cases, for ex-
ample when the readers must do high-latency file or network I/O. There are alternatives,
some of which will be presented in Chapters 5 and 9.

4.2.5 Atomic Operations (gcc Classic)


Given that Figure 4.10 shows that the overhead of reader-writer locking is most severe
for the smallest critical sections, it would be nice to have some other way to protect
the tiniest of critical sections. One such way is to use atomic operations. We have already
seen one atomic operation, in the form of the __sync_fetch_and_add() primitive
on line 18 of Figure 4.9. This primitive atomically adds the value of its second argument
to the value referenced by its first argument, returning the old value (which was ignored
in this case). If a pair of threads concurrently execute __sync_fetch_and_add()
on the same variable, the resulting value of the variable will include the result of both
additions.
The gcc compiler offers a number of additional atomic operations, including
__sync_fetch_and_sub(), __sync_fetch_and_or(), __sync_fetch_
and_and(), __sync_fetch_and_xor(), and __sync_fetch_and_nand(),
all of which return the old value. If you instead need the new value, you can in-
stead use the __sync_add_and_fetch(), __sync_sub_and_fetch(), __
sync_or_and_fetch(), __sync_and_and_fetch(), __sync_xor_and_
fetch(), and __sync_nand_and_fetch() primitives.
Quick Quiz 4.23: Is it really necessary to have both sets of primitives?
The classic compare-and-swap operation is provided by a pair of primitives, __
sync_bool_compare_and_swap() and __sync_val_compare_and_swap().
Both of these primitives atomically update a location to a new value, but only if its prior
value was equal to the specified old value. The first variant returns 1 if the operation

succeeded and 0 if it failed, for example, if the prior value was not equal to the spec-
ified old value. The second variant returns the prior value of the location, which, if
equal to the specified old value, indicates that the operation succeeded. Either of these
compare-and-swap operations is “universal” in the sense that any atomic operation on a
single location can be implemented in terms of compare-and-swap, though the earlier
operations are often more efficient where they apply. The compare-and-swap operation
is also capable of serving as the basis for a wider set of atomic operations, though
the more elaborate of these often suffer from complexity, scalability, and performance
problems [Her90].
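For example, a hedged sketch of building a fetch-and-add out of __sync_val_compare_and_swap() might look as follows, illustrating the sense in which compare-and-swap is universal. This fragment is not from the book's code samples, and the function name is hypothetical:

static long cas_fetch_and_add(long *p, long delta)
{
        long old = *p;          /* start from an unsynchronized snapshot */
        long seen;

        for (;;) {
                /* Attempt to install old + delta; returns the value actually found. */
                seen = __sync_val_compare_and_swap(p, old, old + delta);
                if (seen == old)
                        return old;     /* success: return the pre-add value */
                old = seen;             /* some other thread intervened, so retry */
        }
}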
The __sync_synchronize() primitive issues a “memory barrier”, which con-
strains both the compiler’s and the CPU’s ability to reorder operations, as discussed in
Section 14.2. In some cases, it is sufficient to constrain the compiler’s ability to reorder
operations, while allowing the CPU free rein, in which case the barrier() primitive
may be used, as it in fact was on line 28 of Figure 4.9. In some cases, it is only necessary
to ensure that the compiler avoids optimizing away a given memory read, in which case
the READ_ONCE() primitive may be used, as it was on line 17 of Figure 4.6. Similarly,
the WRITE_ONCE() primitive may be used to prevent the compiler from optimizing
away a given memory write. These last two primitives are not provided directly by gcc,
but may be implemented straightforwardly as follows:

#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))
#define READ_ONCE(x) ACCESS_ONCE(x)
#define WRITE_ONCE(x, val) ({ ACCESS_ONCE(x) = (val); })
#define barrier() __asm__ __volatile__("": : :"memory")

Quick Quiz 4.24: Given that these atomic operations will often be able to generate
single atomic instructions that are directly supported by the underlying instruction set,
shouldn’t they be the fastest possible way to get things done?

4.2.6 Atomic Operations (C11)


The C11 standard added atomic operations, including loads (atomic_load()),
stores (atomic_store()), memory barriers (atomic_thread_fence() and
atomic_signal_fence()), and read-modify-write atomics. The read-modify-
write atomics include atomic_fetch_add(), atomic_fetch_sub(), atomic_
fetch_and(), atomic_fetch_xor(), atomic_exchange(), atomic_compare_
exchange_strong(), and atomic_compare_exchange_weak(). These op-
erate in a manner similar to those described in Section 4.2.5, but with the addition of
memory-order arguments to _explicit variants of all of the operations. Without
memory-order arguments, all the atomic operations are fully ordered, and the argu-
ments permit weaker orderings. For example, “atomic_load_explicit(&a,
memory_order_relaxed)” is vaguely similar to the Linux kernel’s “READ_ONCE()”.1
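For example, a brief sketch of a C11 atomic counter might look as follows. This fragment is not from the book's code samples, and the counter and function names are hypothetical:

#include <stdatomic.h>

atomic_long c11_counter = ATOMIC_VAR_INIT(0);   /* C11 atomic type */

void c11_inc(void)
{
        atomic_fetch_add(&c11_counter, 1);      /* fully ordered by default */
}

long c11_read(void)
{
        /* Relaxed load: atomicity without ordering, roughly analogous to READ_ONCE(). */
        return atomic_load_explicit(&c11_counter, memory_order_relaxed);
}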
One restriction of the C11 atomics is that they apply only to special atomic types,
which can be restrictive. The gcc compiler therefore provides __atomic_load(), __
atomic_load_n(), __atomic_store(), __atomic_store_n(), etc. These
primitives offer the same semantics as their C11 counterparts, but may be used on plain
non-atomic objects.

1 Memory ordering is described in more detail in Section 14.2 and Appendix B.



4.2.7 Per-Thread Variables


Per-thread variables, also called thread-specific data, thread-local storage, and other
less-polite names, are used extremely heavily in concurrent code, as will be explored in
Chapters 5 and 8. POSIX supplies the pthread_key_create() function to create
a per-thread variable (and return the corresponding key), pthread_key_delete()
to delete the per-thread variable corresponding to key, pthread_setspecific()
to set the value of the current thread’s variable corresponding to the specified key, and
pthread_getspecific() to return that value.
A number of compilers (including gcc) provide a __thread specifier that may be
used in a variable definition to designate that variable as being per-thread. The name
of the variable may then be used normally to access the value of the current thread’s
instance of that variable. Of course, __thread is much easier to use than the POSIX
thread-specific data, and so __thread is usually preferred for code that is to be built
only with gcc or other compilers supporting __thread.
Fortunately, the C11 standard introduced a _Thread_local keyword that can be
used in place of __thread. In the fullness of time, this new keyword should combine
the ease of use of __thread with the portability of POSIX thread-specific data.
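For example, a minimal sketch might look like the following. The variable and function names are hypothetical and this fragment is not part of the book's code samples:

__thread long my_error_count;   /* each thread gets its own instance */

void record_error(void)
{
        my_error_count++;       /* no locking needed: only this thread accesses it */
}

/* With C11, "_Thread_local long my_error_count;" is the equivalent declaration. */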

4.3 Alternatives to POSIX Operations


Unfortunately, threading operations, locking primitives, and atomic operations were
in reasonably wide use long before the various standards committees got around to
them. As a result, there is considerable variation in how these operations are supported.
It is still quite common to find these operations implemented in assembly language,
either for historical reasons or to obtain better performance in specialized circumstances.
For example, the gcc __sync_ family of primitives all provide full memory-ordering
semantics, which in the past motivated many developers to create their own implemen-
tations for situations where the full memory ordering semantics are not required. The
following sections show some alternatives from the Linux kernel and some historical
primitives used by this book’s sample code.

4.3.1 Organization and Initialization


Although many environments do not require any special initialization code, the code
samples in this book start with a call to smp_init(), which initializes a mapping from
pthread_t to consecutive integers. The userspace RCU library similarly requires
a call to rcu_init(). Although these calls can be hidden in environments (such
as that of gcc) that support constructors, most of the RCU flavors supported by the
userspace RCU library also require that each thread invoke rcu_register_thread()
upon thread creation and rcu_unregister_thread() before thread exit.
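A rough sketch of the resulting boilerplate might look as follows. This fragment is not from the book's code samples; the necessary CodeSamples and liburcu includes are omitted, and the real work is elided:

void *my_thread(void *arg)
{
        rcu_register_thread();          /* required by most userspace RCU flavors */
        /* ... do the real work here ... */
        rcu_unregister_thread();        /* required before thread exit */
        return NULL;
}

int main(int argc, char *argv[])
{
        smp_init();     /* CodeSamples: map pthread_t to consecutive integers */
        rcu_init();     /* initialize the userspace RCU library */
        /* ... create threads, wait for them, and so on ... */
        return 0;
}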
In the case of the Linux kernel, it is a philosophical question as to whether the kernel
does not require calls to special initialization code or whether the kernel’s boot-time
code is in fact the required initialization code.

4.3.2 Thread Creation, Destruction, and Control


The Linux kernel uses struct task_struct pointers to track kthreads,
kthread_create() to create them, kthread_should_stop() to externally suggest that they

int smp_thread_id(void)
thread_id_t create_thread(void *(*func)(void *), void *arg)
for_each_thread(t)
for_each_running_thread(t)
void *wait_thread(thread_id_t tid)
void wait_all_threads(void)

Figure 4.11: Thread API

stop (which has no POSIX equivalent), kthread_stop() to wait for them to stop,
and schedule_timeout_interruptible() for a timed wait. There are quite
a few additional kthread-management APIs, but this provides a good start, as well as
good search terms.
The CodeSamples API focuses on “threads”, which are a locus of control.2 Each
such thread has an identifier of type thread_id_t, and no two threads running at a
given time will have the same identifier. Threads share everything except for per-thread
local state,3 which includes program counter and stack.
The thread API is shown in Figure 4.11, and members are described in the following
sections.

4.3.2.1 create_thread()
The create_thread() primitive creates a new thread, starting the new thread’s
execution at the function func specified by create_thread()’s first argument,
and passing it the argument specified by create_thread()’s second argument.
This newly created thread will terminate when it returns from the starting function
specified by func. The create_thread() primitive returns the thread_id_t
corresponding to the newly created child thread.
This primitive will abort the program if more than NR_THREADS threads are created,
counting the one implicitly created by running the program. NR_THREADS is a compile-
time constant that may be modified, though some systems may have an upper bound for
the allowable number of threads.

4.3.2.2 smp_thread_id()
Because the thread_id_t returned from create_thread() is system-dependent,
the smp_thread_id() primitive returns a thread index corresponding to the thread
making the request. This index is guaranteed to be less than the maximum number of
threads that have been in existence since the program started, and is therefore useful for
bitmasks, array indices, and the like.

4.3.2.3 for_each_thread()
The for_each_thread() macro loops through all threads that exist, including all
threads that would exist if created. This macro is useful for handling per-thread variables
as will be seen in Section 4.2.7.

2There are many other names for similar software constructs, including “process”, “task”, “fiber”,
“event”, and so on. Similar design principles apply to all of them.
3 How is that for a circular definition?

1 void *thread_test(void *arg)
2 {
3 int myarg = (int)arg;
4
5 printf("child thread %d: smp_thread_id() = %d\n",
6 myarg, smp_thread_id());
7 return NULL;
8 }

Figure 4.12: Example Child Thread

4.3.2.4 for_each_running_thread()

The for_each_running_thread() macro loops through only those threads that
currently exist. It is the caller’s responsibility to synchronize with thread creation and
deletion if required.

4.3.2.5 wait_thread()

The wait_thread() primitive waits for completion of the thread specified by the
thread_id_t passed to it. This in no way interferes with the execution of the
specified thread; instead, it merely waits for it. Note that wait_thread() returns the
value that was returned by the corresponding thread.

4.3.2.6 wait_all_threads()

The wait_all_threads() primitive waits for completion of all currently running
threads. It is the caller’s responsibility to synchronize with thread creation and deletion
if required. However, this primitive is normally used to clean up at the end of a run, so
such synchronization is normally not needed.

4.3.2.7 Example Usage

Figure 4.12 shows an example hello-world-like child thread. As noted earlier, each
thread is allocated its own stack, so each thread has its own private arg argument
and myarg variable. Each child simply prints its argument and its smp_thread_
id() before exiting. Note that the return statement on line 7 terminates the thread,
returning a NULL to whoever invokes wait_thread() on this thread.
The parent program is shown in Figure 4.13. It invokes smp_init() to initialize
the threading system on line 6, parses arguments on lines 7-14, and announces its
presence on line 15. It creates the specified number of child threads on lines 16-17, and
waits for them to complete on line 18. Note that wait_all_threads() discards
the threads’ return values, as in this case they are all NULL, which is not very interesting.
Quick Quiz 4.25: What happened to the Linux-kernel equivalents to fork() and
wait()?

4.3.3 Locking
A good starting subset of the Linux kernel’s locking API is shown in Figure 4.14, each
API element being described in the following sections. This book’s CodeSamples
locking API closely follows that of the Linux kernel.

1 int main(int argc, char *argv[])
2 {
3 int i;
4 int nkids = 1;
5
6 smp_init();
7 if (argc > 1) {
8 nkids = strtoul(argv[1], NULL, 0);
9 if (nkids > NR_THREADS) {
10 fprintf(stderr, "nkids=%d too big, max=%d\n",
11 nkids, NR_THREADS);
12 usage(argv[0]);
13 }
14 }
15 printf("Parent spawning %d threads.\n", nkids);
16 for (i = 0; i < nkids; i++)
17 create_thread(thread_test, (void *)i);
18 wait_all_threads();
19 printf("All threads completed.\n", nkids);
20 exit(0);
21 }

Figure 4.13: Example Parent Thread


void spin_lock_init(spinlock_t *sp);
void spin_lock(spinlock_t *sp);
int spin_trylock(spinlock_t *sp);
void spin_unlock(spinlock_t *sp);

Figure 4.14: Locking API

4.3.3.1 spin_lock_init()
The spin_lock_init() primitive initializes the specified spinlock_t variable,
and must be invoked before this variable is passed to any other spinlock primitive.

4.3.3.2 spin_lock()
The spin_lock() primitive acquires the specified spinlock, if necessary, waiting
until the spinlock becomes available. In some environments, such as pthreads, this
waiting will involve “spinning”, while in others, such as the Linux kernel, it will involve
blocking.
The key point is that only one thread may hold a spinlock at any given time.

4.3.3.3 spin_trylock()
The spin_trylock() primitive acquires the specified spinlock, but only if it is
immediately available. It returns true if it was able to acquire the spinlock and false
otherwise.
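For example, a hedged sketch of skipping optional work when the lock is busy might look like this. The fragment is not from the book's code samples, and do_optional_work() is a hypothetical helper:

void maybe_do_optional_work(spinlock_t *sp)
{
        if (!spin_trylock(sp))
                return;                 /* lock is busy: skip the optional work */
        do_optional_work();             /* hypothetical helper function */
        spin_unlock(sp);
}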

4.3.3.4 spin_unlock()
The spin_unlock() primitive releases the specified spinlock, allowing other threads
to acquire it.

4.3.3.5 Example Usage


A spinlock named mutex may be used to protect a variable counter as follows:

spin_lock(&mutex);
counter++;
spin_unlock(&mutex);

Quick Quiz 4.26: What problems could occur if the variable counter were
incremented without the protection of mutex?
However, the spin_lock() and spin_unlock() primitives do have perfor-
mance consequences, as will be seen in Section 4.3.6.

4.3.4 Atomic Operations


The Linux kernel provides a wide variety of atomic operations, but those defined on type
atomic_t provide a good start. Normal non-tearing reads and stores are provided by
atomic_read() and atomic_set(), respectively. Acquire load is provided by
smp_load_acquire() and release store by smp_store_release().
Non-value-returning fetch-and-add operations are provided by atomic_add(),
atomic_sub(), atomic_inc(), and atomic_dec(), among others. An atomic
decrement that returns a reached-zero indication is provided by both atomic_dec_
and_test() and atomic_sub_and_test(). An atomic add that returns the new
value is provided by atomic_add_return(). Both atomic_add_unless()
and atomic_inc_not_zero() provide conditional atomic operations, where noth-
ing happens unless the original value of the atomic variable is different from the value
specified (these are very handy for managing reference counters, for example).
An atomic exchange operation is provided by atomic_xchg(), and the celebrated
compare-and-swap (CAS) operation is provided by atomic_cmpxchg(). Both of
these return the old value. Many additional atomic RMW primitives are available in the
Linux kernel, see the Documentation/atomic_ops.txt file in the Linux-kernel
source tree.
This book’s CodeSamples API closely follows that of the Linux kernel.
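For example, a minimal sketch of a simple reference counter built from these primitives might look as follows. This fragment is not from the book's code samples, and cleanup() is a hypothetical teardown function:

atomic_t refcount = ATOMIC_INIT(1);     /* hypothetical reference count, initially one */

void ref_get(void)
{
        atomic_inc(&refcount);          /* acquire an additional reference */
}

void ref_put(void)
{
        /* atomic_dec_and_test() returns true when the count reaches zero. */
        if (atomic_dec_and_test(&refcount))
                cleanup();              /* hypothetical teardown function */
}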

4.3.5 Per-CPU Variables


The Linux kernel uses DEFINE_PER_CPU() to define a per-CPU variable, this_
cpu_ptr() to form a reference to this CPU’s instance of a given per-CPU variable,
per_cpu() to access a specified CPU’s instance of a given per-CPU variable, along
with many other special-purpose per-CPU operations.
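For example, a rough sketch might look like the following. This fragment is not taken from the Linux kernel sources; the variable is hypothetical, preemption is assumed to be disabled across the update, and the for_each_online_cpu() iterator is a standard kernel primitive not otherwise discussed here:

DEFINE_PER_CPU(unsigned long, hits);    /* hypothetical per-CPU counter */

static void count_hit(void)
{
        /* Increment this CPU's instance; caller assumed to have preemption disabled. */
        (*this_cpu_ptr(&hits))++;
}

static unsigned long read_hits(void)
{
        int cpu;
        unsigned long sum = 0;

        /* Sum each online CPU's instance of the counter. */
        for_each_online_cpu(cpu)
                sum += per_cpu(hits, cpu);
        return sum;
}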
Figure 4.15 shows this book’s per-thread-variable API, which is patterned after the
Linux kernel’s per-CPU-variable API. This API provides the per-thread equivalent of
global variables. Although this API is, strictly speaking, not necessary,4 it can provide a
good userspace analogy to Linux kernel code.
DEFINE_PER_THREAD(type, name)
DECLARE_PER_THREAD(type, name)
per_thread(name, thread)
__get_thread_var(name)
init_per_thread(name, v)

Figure 4.15: Per-Thread-Variable API

Quick Quiz 4.27: How could you work around the lack of a per-thread-variable
API on systems that do not provide it?
4 You could instead use __thread or _Thread_local.

4.3.5.1 DEFINE_PER_THREAD()
The DEFINE_PER_THREAD() primitive defines a per-thread variable. Unfortunately,
it is not possible to provide an initializer in the way permitted by the Linux kernel’s
DEFINE_PER_CPU() primitive, but there is an init_per_thread() primi-
tive that permits easy runtime initialization.

4.3.5.2 DECLARE_PER_THREAD()
The DECLARE_PER_THREAD() primitive is a declaration in the C sense, as opposed
to a definition. Thus, a DECLARE_PER_THREAD() primitive may be used to access a
per-thread variable defined in some other file.

4.3.5.3 per_thread()
The per_thread() primitive accesses the specified thread’s variable.

4.3.5.4 __get_thread_var()
The __get_thread_var() primitive accesses the current thread’s variable.

4.3.5.5 init_per_thread()
The init_per_thread() primitive sets all threads’ instances of the specified vari-
able to the specified value. The Linux kernel accomplishes this via normal C initializa-
tion, relying on clever use of linker scripts and code executed during the CPU-online
process.

4.3.5.6 Usage Example


Suppose that we have a counter that is incremented very frequently but read out quite
rarely. As will become clear in Section 4.3.6, it is helpful to implement such a counter
using a per-thread variable. Such a variable can be defined as follows:
DEFINE_PER_THREAD(int, counter);

The counter must be initialized as follows:


init_per_thread(counter, 0);

A thread can increment its instance of this counter as follows:


__get_thread_var(counter)++;

The value of the counter is then the sum of its instances. A snapshot of the value of
the counter can thus be collected as follows:
for_each_thread(i)
sum += per_thread(counter, i);

Again, it is possible to gain a similar effect using other mechanisms, but per-thread
variables combine convenience and high performance.

4.3.6 Performance
It is instructive to compare the performance of the locked increment shown in Sec-
tion 4.3.4 to that of per-CPU (or per-thread) variables (see Section 4.3.5), as well as to
conventional increment (as in “counter++”).
The difference in performance is quite large, to put it mildly. The purpose of this
book is to help you write SMP programs, perhaps with realtime response, while avoiding
such performance pitfalls. Chapter 5 starts this process by describing a few parallel
counting algorithms.

4.4 The Right Tool for the Job: How to Choose?


As a rough rule of thumb, use the simplest tool that will get the job done. If you
can, simply program sequentially. If that is insufficient, try using a shell script to
mediate parallelism. If the resulting shell-script fork()/exec() overhead (about
480 microseconds for a minimal C program on an Intel Core Duo laptop) is too large,
try using the C-language fork() and wait() primitives. If the overhead of these
primitives (about 80 microseconds for a minimal child process) is still too large, then
you might need to use the POSIX threading primitives, choosing the appropriate locking
and/or atomic-operation primitives. If the overhead of the POSIX threading primitives
(typically sub-microsecond) is too great, then the primitives introduced in Chapter 9 may
be required. Always remember that inter-process communication and message-passing
can be good alternatives to shared-memory multithreaded execution.
Quick Quiz 4.28: Wouldn’t the shell normally use vfork() rather than fork()?

Of course, the actual overheads will depend not only on your hardware, but most
critically on the manner in which you use the primitives. In particular, randomly hacking
multi-threaded code is a spectacularly bad idea, especially given that shared-memory
parallel systems use your own intelligence against you: The smarter you are, the deeper
a hole you will dig for yourself before you realize that you are in trouble [Pok16].
Therefore, it is necessary to make the right design choices as well as the correct choice
of individual primitives, as is discussed at length in subsequent chapters.
As easy as 1, 2, 3!

Unknown

Chapter 5

Counting

Counting is perhaps the simplest and most natural thing a computer can do. However,
counting efficiently and scalably on a large shared-memory multiprocessor can be quite
challenging. Furthermore, the simplicity of the underlying concept of counting allows
us to explore the fundamental issues of concurrency without the distractions of elaborate
data structures or complex synchronization primitives. Counting therefore provides an
excellent introduction to parallel programming.
This chapter covers a number of special cases for which there are simple, fast, and
scalable counting algorithms. But first, let us find out how much you already know
about concurrent counting.
Quick Quiz 5.1: Why on earth should efficient and scalable counting be hard? After
all, computers have special hardware for the sole purpose of doing counting, addition,
subtraction, and lots more besides, don’t they???
Quick Quiz 5.2: Network-packet counting problem. Suppose that you need
to collect statistics on the number of networking packets (or total number of bytes)
transmitted and/or received. Packets might be transmitted or received by any CPU on
the system. Suppose further that this large machine is capable of handling a million
packets per second, and that there is a systems-monitoring package that reads out the
count every five seconds. How would you implement this statistical counter?
Quick Quiz 5.3: Approximate structure-allocation limit problem. Suppose
that you need to maintain a count of the number of structures allocated in order to
fail any allocations once the number of structures in use exceeds a limit (say, 10,000).
Suppose further that these structures are short-lived, that the limit is rarely exceeded,
and that a “sloppy” approximate limit is acceptable.
Quick Quiz 5.4: Exact structure-allocation limit problem. Suppose that you
need to maintain a count of the number of structures allocated in order to fail any
allocations once the number of structures in use exceeds an exact limit (again, say
10,000). Suppose further that these structures are short-lived, and that the limit is rarely
exceeded, that there is almost always at least one structure in use, and suppose further
still that it is necessary to know exactly when this counter reaches zero, for example, in
order to free up some memory that is not required unless there is at least one structure
in use.
Quick Quiz 5.5: Removable I/O device access-count problem. Suppose that
you need to maintain a reference count on a heavily used removable mass-storage device,
so that you can tell the user when it is safe to remove the device. This device follows
the usual removal procedure where the user indicates a desire to remove the device, and


the system tells the user when it is safe to do so.


The remainder of this chapter will develop answers to these questions. Section 5.1
asks why counting on multicore systems isn’t trivial, and Section 5.2 looks into ways
of solving the network-packet counting problem. Section 5.3 investigates the approxi-
mate structure-allocation limit problem, while Section 5.4 takes on the exact structure-
allocation limit problem. Section 5.5 discusses how to use the various specialized
parallel counters introduced in the preceding sections. Finally, Section 5.6 concludes
the chapter with performance measurements.
Sections 5.1 and 5.2 contain introductory material, while the remaining sections are
more appropriate for advanced students.

5.1 Why Isn’t Concurrent Counting Trivial?


Let’s start with something simple, for example, the straightforward use of arithmetic
shown in Figure 5.1 (count_nonatomic.c). Here, we have a counter on line 1, we
increment it on line 5, and we read out its value on line 10. What could be simpler?
This approach has the additional advantage of being blazingly fast if you are doing
lots of reading and almost no incrementing, and on small systems, the performance is
excellent.
There is just one large fly in the ointment: this approach can lose counts. On my
dual-core laptop, a short run invoked inc_count() 100,014,000 times, but the final
value of the counter was only 52,909,118. Although approximate values do have their
place in computing, accuracies far greater than 50% are almost always necessary.
Quick Quiz 5.6: But doesn’t the ++ operator produce an x86 add-to-memory
instruction? And won’t the CPU cache cause this to be atomic?
Quick Quiz 5.7: The 8-figure accuracy on the number of failures indicates that you
really did test this. Why would it be necessary to test such a trivial program, especially
when the bug is easily seen by inspection?
The straightforward way to count accurately is to use atomic operations, as shown in
Figure 5.2 (count_atomic.c). Line 1 defines an atomic variable, line 5 atomically
increments it, and line 10 reads it out. Because this is atomic, it keeps perfect count.
However, it is slower: on an Intel Core Duo laptop, it is about six times slower than
non-atomic increment when a single thread is incrementing, and more than ten times
slower if two threads are incrementing.1

1 long counter = 0;
2
3 void inc_count(void)
4 {
5 counter++;
6 }
7
8 long read_count(void)
9 {
10 return counter;
11 }

Figure 5.1: Just Count!

1 Interestingly enough, a pair of threads non-atomically incrementing a counter will cause the counter to
increase more quickly than a pair of threads atomically incrementing the counter. Of course, if your only goal
is to make the counter increase quickly, an easier approach is to simply assign a large value to the counter.
Nevertheless, there is likely to be a role for algorithms that use carefully relaxed notions of correctness in
order to gain greater performance and scalability [And91, ACMS03, Ung11].

1 atomic_t counter = ATOMIC_INIT(0);
2
3 void inc_count(void)
4 {
5 atomic_inc(&counter);
6 }
7
8 long read_count(void)
9 {
10 return atomic_read(&counter);
11 }

Figure 5.2: Just Count Atomically!


[Figure 5.3 (graph): time per increment in nanoseconds (y axis, 0 to 900) versus number of CPUs/threads (x axis, 1 to 8).]

Figure 5.3: Atomic Increment Scalability on Nehalem

This poor performance should not be a surprise, given the discussion in Chapter 3,
nor should it be a surprise that the performance of atomic increment gets slower as
the number of CPUs and threads increase, as shown in Figure 5.3. In this figure, the
horizontal dashed line resting on the x axis is the ideal performance that would be
achieved by a perfectly scalable algorithm: with such an algorithm, a given increment
would incur the same overhead that it would in a single-threaded program. Atomic
increment of a single global variable is clearly decidedly non-ideal, and gets worse as
you add CPUs.
Quick Quiz 5.8: Why doesn’t the dashed line on the x axis meet the diagonal line
at x = 1?
Quick Quiz 5.9: But atomic increment is still pretty fast. And incrementing a single
variable in a tight loop sounds pretty unrealistic to me, after all, most of the program’s
execution should be devoted to actually doing work, not accounting for the work it has
done! Why should I care about making this go faster?
For another perspective on global atomic increment, consider Figure 5.4. In order
for each CPU to get a chance to increment a given global variable, the cache line
containing that variable must circulate among all the CPUs, as shown by the red arrows.
Such circulation will take significant time, resulting in the poor performance seen in
Figure 5.3, which might be thought of as shown in Figure 5.5.
The following sections discuss high-performance counting, which avoids the delays inherent in such circulation.



[Figure 5.4 (diagram): eight CPUs, each with its own cache, connected by interconnects to the system interconnect and memory; red arrows show the cache line holding the global variable circulating among all of the CPUs.]

Figure 5.4: Data Flow For Global Atomic Increment

[Figure 5.5 (illustration): "One one thousand. Two one thousand. Three one thousand..."]

Figure 5.5: Waiting to Count



Quick Quiz 5.10: But why can’t CPU designers simply ship the addition operation
to the data, avoiding the need to circulate the cache line containing the global variable
being incremented?

5.2 Statistical Counters


This section covers the common special case of statistical counters, where the count is
updated extremely frequently and the value is read out rarely, if ever. These will be used
to solve the network-packet counting problem posed in Quick Quiz 5.2.

5.2.1 Design
Statistical counting is typically handled by providing a counter per thread (or CPU,
when running in the kernel), so that each thread updates its own counter. The aggregate
value of the counters is read out by simply summing up all of the threads’ counters,
relying on the commutative and associative properties of addition. This is an example

1 DEFINE_PER_THREAD(long, counter);
2
3 void inc_count(void)
4 {
5 __get_thread_var(counter)++;
6 }
7
8 long read_count(void)
9 {
10 int t;
11 long sum = 0;
12
13 for_each_thread(t)
14 sum += per_thread(counter, t);
15 return sum;
16 }

Figure 5.6: Array-Based Per-Thread Statistical Counters

of the Data Ownership pattern that will be introduced in Section 6.3.4.


Quick Quiz 5.11: But doesn’t the fact that C’s “integers” are limited in size compli-
cate things?

5.2.2 Array-Based Implementation


One way to provide per-thread variables is to allocate an array with one element per
thread (presumably cache aligned and padded to avoid false sharing).
Quick Quiz 5.12: An array??? But doesn’t that limit the number of threads?
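The listings in this chapter hide the array behind perfbook's DEFINE_PER_THREAD() and __get_thread_var() primitives. The following fragment is a rough sketch of how such an array-based scheme might be laid out by hand; the 64-byte cache-line size, the NR_THREADS bound, and the my_thread_idx variable are assumptions made purely for illustration, not the actual perfbook implementation.

/* Hypothetical hand-rolled array of per-thread counters, with each
 * element padded to an assumed 64-byte cache line so that threads
 * incrementing their own counters do not falsely share cache lines. */
#define CACHE_LINE_SIZE 64	/* assumed cache-line size */
#define NR_THREADS 128		/* assumed maximum number of threads */

struct padded_counter {
	long counter;
	char pad[CACHE_LINE_SIZE - sizeof(long)];
} __attribute__((aligned(CACHE_LINE_SIZE)));

static struct padded_counter counter_array[NR_THREADS];
static __thread int my_thread_idx;	/* assigned once at thread start */

static inline void inc_count(void)
{
	/* Only this thread writes its element, so no atomics are needed. */
	counter_array[my_thread_idx].counter++;
}

static inline long read_count(void)
{
	long sum = 0;

	for (int t = 0; t < NR_THREADS; t++)
		sum += counter_array[t].counter;	/* plain aligned loads */
	return sum;
}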
Such an array can be wrapped into per-thread primitives, as shown in Figure 5.6
(count_stat.c). Line 1 defines an array containing a set of per-thread counters of
type long named, creatively enough, counter.
Lines 3-6 show a function that increments the counters, using the __get_thread_
var() primitive to locate the currently running thread’s element of the counter
array. Because this element is modified only by the corresponding thread, non-atomic
increment suffices.
Lines 8-16 show a function that reads out the aggregate value of the counter, us-
ing the for_each_thread() primitive to iterate over the list of currently running
threads, and using the per_thread() primitive to fetch the specified thread’s counter.
Because the hardware can fetch and store a properly aligned long atomically, and
because gcc is kind enough to make use of this capability, normal loads suffice, and no
special atomic instructions are required.
Quick Quiz 5.13: What other choice does gcc have, anyway???
Quick Quiz 5.14: How does the per-thread counter variable in Figure 5.6 get
initialized?
Quick Quiz 5.15: How is the code in Figure 5.6 supposed to permit more than one
counter?
This approach scales linearly with increasing number of updater threads invoking
inc_count(). As is shown by the green arrows on each CPU in Figure 5.7, the
reason for this is that each CPU can make rapid progress incrementing its thread’s
variable, without any expensive cross-system communication. As such, this section
solves the network-packet counting problem presented at the beginning of this chapter.
Quick Quiz 5.16: The read operation takes time to sum up the per-thread values,
and during that time, the counter could well be changing. This means that the value
returned by read_count() in Figure 5.6 will not necessarily be exact. Assume
that the counter is being incremented at rate r counts per unit time, and that read_count()'s execution consumes ∆ units of time. What is the expected error in the return value?

[Figure 5.7: Data Flow For Per-Thread Increment. Each CPU updates its own counter in its local cache, as indicated by the green arrows, so no cache line need circulate among the CPUs.]
However, this excellent update-side scalability comes at great read-side expense for
large numbers of threads. The next section shows one way to reduce read-side expense
while still retaining the update-side scalability.

5.2.3 Eventually Consistent Implementation


One way to retain update-side scalability while greatly improving read-side performance
is to weaken consistency requirements. The counting algorithm in the previous section
is guaranteed to return a value between the value that an ideal counter would have taken
on near the beginning of read_count()’s execution and that near the end of read_
count()’s execution. Eventual consistency [Vog09] provides a weaker guarantee: in
absence of calls to inc_count(), calls to read_count() will eventually return
an accurate count.
We exploit eventual consistency by maintaining a global counter. However, updaters
only manipulate their per-thread counters. A separate thread is provided to transfer
counts from the per-thread counters to the global counter. Readers simply access the
value of the global counter. If updaters are active, the value used by the readers will be
out of date, however, once updates cease, the global counter will eventually converge on
the true value—hence this approach qualifies as eventually consistent.
The implementation is shown in Figure 5.8 (count_stat_eventual.c). Lines 1-
2 show the per-thread variable and the global variable that track the counter’s value,
and line 3 shows stopflag, which is used to coordinate termination (for the
case where we want to terminate the program with an accurate counter value). The
inc_count() function shown on lines 5-8 is similar to its counterpart in Figure 5.6.
The read_count() function shown on lines 10-13 simply returns the value of the
global_count variable.
However, the count_init() function on lines 34-42 creates the eventual()
thread shown on lines 15-32, which cycles through all the threads, summing the per-
thread local counter and storing the sum to the global_count variable. The
eventual() thread waits an arbitrarily chosen one millisecond between passes. The
count_cleanup() function on lines 44-50 coordinates termination.

1 DEFINE_PER_THREAD(unsigned long, counter);


2 unsigned long global_count;
3 int stopflag;
4
5 void inc_count(void)
6 {
7 ACCESS_ONCE(__get_thread_var(counter))++;
8 }
9
10 unsigned long read_count(void)
11 {
12 return ACCESS_ONCE(global_count);
13 }
14
15 void *eventual(void *arg)
16 {
17 int t;
18 int sum;
19
20 while (stopflag < 3) {
21 sum = 0;
22 for_each_thread(t)
23 sum += ACCESS_ONCE(per_thread(counter, t));
24 ACCESS_ONCE(global_count) = sum;
25 poll(NULL, 0, 1);
26 if (stopflag) {
27 smp_mb();
28 stopflag++;
29 }
30 }
31 return NULL;
32 }
33
34 void count_init(void)
35 {
36 thread_id_t tid;
37
38 if (pthread_create(&tid, NULL, eventual, NULL)) {
39 perror("count_init:pthread_create");
40 exit(-1);
41 }
42 }
43
44 void count_cleanup(void)
45 {
46 stopflag = 1;
47 while (stopflag < 3)
48 poll(NULL, 0, 1);
49 smp_mb();
50 }

Figure 5.8: Array-Based Per-Thread Eventually Consistent Counters



1 long __thread counter = 0;


2 long *counterp[NR_THREADS] = { NULL };
3 long finalcount = 0;
4 DEFINE_SPINLOCK(final_mutex);
5
6 void inc_count(void)
7 {
8 counter++;
9 }
10
11 long read_count(void)
12 {
13 int t;
14 long sum;
15
16 spin_lock(&final_mutex);
17 sum = finalcount;
18 for_each_thread(t)
19 if (counterp[t] != NULL)
20 sum += *counterp[t];
21 spin_unlock(&final_mutex);
22 return sum;
23 }
24
25 void count_register_thread(void)
26 {
27 int idx = smp_thread_id();
28
29 spin_lock(&final_mutex);
30 counterp[idx] = &counter;
31 spin_unlock(&final_mutex);
32 }
33
34 void count_unregister_thread(int nthreadsexpected)
35 {
36 int idx = smp_thread_id();
37
38 spin_lock(&final_mutex);
39 finalcount += counter;
40 counterp[idx] = NULL;
41 spin_unlock(&final_mutex);
42 }

Figure 5.9: Per-Thread Statistical Counters

This approach gives extremely fast counter read-out while still supporting linear
counter-update performance. However, this excellent read-side performance and update-
side scalability comes at the cost of the additional thread running eventual().

Quick Quiz 5.17: Why doesn’t inc_count() in Figure 5.8 need to use atomic
instructions? After all, we now have multiple threads accessing the per-thread counters!

Quick Quiz 5.18: Won’t the single global thread in the function eventual() of
Figure 5.8 be just as severe a bottleneck as a global lock would be?

Quick Quiz 5.19: Won’t the estimate returned by read_count() in Figure 5.8
become increasingly inaccurate as the number of threads rises?

Quick Quiz 5.20: Given that in the eventually-consistent algorithm shown in Figure 5.8 both reads and updates have extremely low overhead and are extremely
scalable, why would anyone bother with the implementation described in Section 5.2.2,
given its costly read-side code?

5.2.4 Per-Thread-Variable-Based Implementation


Fortunately, gcc provides an __thread storage class that provides per-thread storage.
This can be used as shown in Figure 5.9 (count_end.c) to implement a statistical
counter that not only scales, but that also incurs little or no performance penalty to
incrementers compared to simple non-atomic increment.
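For readers who have not used __thread before, here is a minimal, self-contained demonstration (not taken from count_end.c) that each thread gets its own instance of such a variable; the worker() function and the thread count of two are arbitrary choices for illustration.

#include <pthread.h>
#include <stdio.h>

static long __thread my_counter;	/* one private instance per thread */

static void *worker(void *arg)
{
	for (int i = 0; i < 1000; i++)
		my_counter++;		/* no other thread touches this instance */
	printf("thread %ld counted %ld\n", (long)arg, my_counter);
	return NULL;
}

int main(void)
{
	pthread_t tid[2];

	for (long t = 0; t < 2; t++)
		pthread_create(&tid[t], NULL, worker, (void *)t);
	for (long t = 0; t < 2; t++)
		pthread_join(tid[t], NULL);
	return 0;	/* each thread prints 1000, confirming separate instances */
}

Compiled with gcc -pthread, each thread prints 1000 rather than the two threads sharing a single accumulated total.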
Lines 1-4 define needed variables: counter is the per-thread counter variable, the
counterp[] array allows threads to access each others’ counters, finalcount ac-
cumulates the total as individual threads exit, and final_mutex coordinates between
threads accumulating the total value of the counter and exiting threads.
Quick Quiz 5.21: Why do we need an explicit array to find the other threads’
counters? Why doesn’t gcc provide a per_thread() interface, similar to the Linux
kernel’s per_cpu() primitive, to allow threads to more easily access each others’
per-thread variables?
The inc_count() function used by updaters is quite simple, as can be seen on
lines 6-9.
The read_count() function used by readers is a bit more complex. Line 16
acquires a lock to exclude exiting threads, and line 21 releases it. Line 17 initializes the
sum to the count accumulated by those threads that have already exited, and lines 18-20
sum the counts being accumulated by threads currently running. Finally, line 22 returns
the sum.
Quick Quiz 5.22: Doesn’t the check for NULL on line 19 of Figure 5.9 add extra
branch mispredictions? Why not have a variable set permanently to zero, and point
unused counter-pointers to that variable rather than setting them to NULL?
Quick Quiz 5.23: Why on earth do we need something as heavyweight as a lock
guarding the summation in the function read_count() in Figure 5.9?
Lines 25-32 show the count_register_thread() function, which must be
called by each thread before its first use of this counter. This function simply sets up
this thread’s element of the counterp[] array to point to its per-thread counter
variable.
Quick Quiz 5.24: Why on earth do we need to acquire the lock in count_
register_thread() in Figure 5.9? It is a single properly aligned machine-word
store to a location that no other thread is modifying, so it should be atomic anyway,
right?
Lines 34-42 show the count_unregister_thread() function, which must
be called prior to exit by each thread that previously called count_register_
thread(). Line 38 acquires the lock, and line 41 releases it, thus excluding any
calls to read_count() as well as other calls to count_unregister_thread().
Line 39 adds this thread’s counter to the global finalcount, and then line 40
NULLs out its counterp[] array entry. A subsequent call to read_count() will
see the exiting thread’s count in the global finalcount, and will skip the exiting
thread when sequencing through the counterp[] array, thus obtaining the correct
total.
This approach gives updaters almost exactly the same performance as a non-atomic
add, and also scales linearly. On the other hand, concurrent reads contend for a single
global lock, and therefore perform poorly and scale abysmally. However, this is not a
problem for statistical counters, where incrementing happens often and readout happens
almost never. Of course, this approach is considerably more complex than the array-
based scheme, due to the fact that a given thread’s per-thread variables vanish when that

thread exits.
Quick Quiz 5.25: Fine, but the Linux kernel doesn’t have to acquire a lock when
reading out the aggregate value of per-CPU counters. So why should user-space code
need to do this???

5.2.5 Discussion
These three implementations show that it is possible to obtain uniprocessor performance
for statistical counters, despite running on a parallel machine.
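As a usage illustration, and assuming the per-thread-variable-based implementation of Figure 5.9, a network-packet counting thread might be structured roughly as follows. The done flag, the nthreads variable, and process_one_packet() are hypothetical stand-ins for the surrounding application, not part of the book's CodeSamples.

/* API provided by Figure 5.9 (count_end.c). */
void count_register_thread(void);
void count_unregister_thread(int nthreadsexpected);
void inc_count(void);
long read_count(void);

/* Assumed application symbols, for illustration only. */
extern volatile int done;
extern int nthreads;
void process_one_packet(void);

void *packet_thread(void *arg)
{
	count_register_thread();		/* before the first inc_count() */
	while (!done) {
		process_one_packet();		/* assumed application work */
		inc_count();			/* cheap non-atomic per-thread increment */
	}
	count_unregister_thread(nthreads);	/* fold this thread's count into finalcount */
	return NULL;
}

long packets_so_far(void)			/* rare readout, e.g., an operator query */
{
	return read_count();			/* sums the per-thread counters under final_mutex */
}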
Quick Quiz 5.26: What fundamental difference is there between counting packets
and counting the total number of bytes in the packets, given that the packets vary in
size?
Quick Quiz 5.27: Given that the reader must sum all the threads’ counters, this
could take a long time given large numbers of threads. Is there any way that the
increment operation can remain fast and scalable while allowing readers to also enjoy
reasonable performance and scalability?
Given what has been presented in this section, you should now be able to answer the
Quick Quiz about statistical counters for networking near the beginning of this chapter.

5.3 Approximate Limit Counters


Another special case of counting involves limit-checking. For example, as noted in the
approximate structure-allocation limit problem in Quick Quiz 5.3, suppose that you need
to maintain a count of the number of structures allocated in order to fail any allocations
once the number of structures in use exceeds a limit, in this case, 10,000. Suppose
further that these structures are short-lived, that this limit is rarely exceeded, and that this
limit is approximate in that it is OK to exceed it sometimes by some bounded amount
(see Section 5.4 if you instead need the limit to be exact).
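The implementations in the following sections provide add_count() and sub_count() interfaces that return non-zero on success and zero when honoring the request would exceed the limit (or underflow zero). As a rough usage sketch, not taken from the book's CodeSamples, a structure allocator enforcing the 10,000-structure limit might wrap those interfaces as follows; struct foo, alloc_foo(), and free_foo() are hypothetical names.

/* Hypothetical allocator built on the limit-counter API of this section. */
#include <stdlib.h>

int add_count(unsigned long delta);	/* returns non-zero on success */
int sub_count(unsigned long delta);

struct foo { int data; };		/* assumed structure type */

struct foo *alloc_foo(void)
{
	struct foo *p;

	if (!add_count(1))		/* would this exceed the limit? */
		return NULL;		/* if so, fail the allocation */
	p = malloc(sizeof(*p));
	if (p == NULL)
		sub_count(1);		/* return the count on malloc() failure */
	return p;
}

void free_foo(struct foo *p)
{
	free(p);
	sub_count(1);			/* one fewer structure in use */
}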

5.3.1 Design
One possible design for limit counters is to divide the limit of 10,000 by the number
of threads, and give each thread a fixed pool of structures. For example, given 100
threads, each thread would manage its own pool of 100 structures. This approach is
simple, and in some cases works well, but it does not handle the common case where
a given structure is allocated by one thread and freed by another [MS93]. On the one
hand, if a given thread takes credit for any structures it frees, then the thread doing
most of the allocating runs out of structures, while the threads doing most of the freeing
have lots of credits that they cannot use. On the other hand, if freed structures are
credited to the CPU that allocated them, it will be necessary for CPUs to manipulate
each others’ counters, which will require expensive atomic instructions or other means
of communicating between threads.2
In short, for many important workloads, we cannot fully partition the counter.
Given that partitioning the counters was what brought the excellent update-side perfor-
mance for the three schemes discussed in Section 5.2, this might be grounds for some
pessimism. However, the eventually consistent algorithm presented in Section 5.2.3 pro-
vides an interesting hint. Recall that this algorithm kept two sets of books, a per-thread
2 That said, if each structure will always be freed by the same CPU (or thread) that allocated it, then this simple partitioning approach works extremely well.



counter variable for updaters and a global_count variable for readers, with an
eventual() thread that periodically updated global_count to be eventually con-
sistent with the values of the per-thread counter. The per-thread counter perfectly
partitioned the counter value, while global_count kept the full value.
For limit counters, we can use a variation on this theme, in that we partially partition
the counter. For example, each of four threads could have a per-thread counter, but
each could also have a per-thread maximum value (call it countermax).
But then what happens if a given thread needs to increment its counter, but
counter is equal to its countermax? The trick here is to move half of that thread’s
counter value to a globalcount, then increment counter. For example, if a
given thread’s counter and countermax variables were both equal to 10, we do
the following:
1. Acquire a global lock.
2. Add five to globalcount.
3. To balance out the addition, subtract five from this thread’s counter.
4. Release the global lock.
5. Increment this thread’s counter, resulting in a value of six.
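The following fragment is a minimal sketch of just these five steps, assuming an increment-by-one interface; it omits the check against globalcountmax and the countermax resizing that the full implementation in Figures 5.10-5.13 provides, and globalcount_lock is an illustrative stand-in for that implementation's gblcnt_mutex.

/* Illustrative sketch of the five-step hand-off described above. */
#include <pthread.h>

static unsigned long __thread counter;		/* this thread's private count */
static unsigned long __thread countermax;	/* bound on this thread's counter */
static unsigned long globalcount;		/* protected by globalcount_lock */
static pthread_mutex_t globalcount_lock = PTHREAD_MUTEX_INITIALIZER;

static void inc_count(void)
{
	if (counter < countermax) {			/* common case: stay local */
		counter++;
		return;
	}
	pthread_mutex_lock(&globalcount_lock);		/* step 1: acquire the global lock */
	globalcount += counter / 2;			/* step 2: move half to globalcount */
	counter -= counter / 2;				/* step 3: balance out the addition */
	pthread_mutex_unlock(&globalcount_lock);	/* step 4: release the global lock */
	counter++;					/* step 5: increment locally */
}

With counter and countermax both equal to ten, this sketch takes its slowpath: five counts move to globalcount, the local counter drops to five, and the final increment leaves it at six, matching the example above.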
Although this procedure still requires a global lock, that lock need only be ac-
quired once for every five increment operations, greatly reducing that lock’s level of
contention. We can reduce this contention as low as we wish by increasing the value
of countermax. However, the corresponding penalty for increasing the value of
countermax is reduced accuracy of globalcount. To see this, note that on a
four-CPU system, if countermax is equal to ten, globalcount will be in error by
at most 40 counts. In contrast, if countermax is increased to 100, globalcount
might be in error by as much as 400 counts.
This raises the question of just how much we care about globalcount’s de-
viation from the aggregate value of the counter, where this aggregate value is the
sum of globalcount and each thread’s counter variable. The answer to this
question depends on how far the aggregate value is from the counter’s limit (call it
globalcountmax). The larger the difference between these two values, the larger
countermax can be without risk of exceeding the globalcountmax limit. This
means that the value of a given thread’s countermax variable can be set based on this
difference. When far from the limit, the countermax per-thread variables are set to
large values to optimize for performance and scalability, while when close to the limit,
these same variables are set to small values to minimize the error in the checks against
the globalcountmax limit.
This design is an example of parallel fastpath, which is an important design pattern
in which the common case executes with no expensive instructions and no interactions
between threads, but where occasional use is also made of a more conservatively
designed (and higher overhead) global algorithm. This design pattern is covered in more
detail in Section 6.4.

5.3.2 Simple Limit Counter Implementation


Figure 5.10 shows both the per-thread and global variables used by this implemen-
tation. The per-thread counter and countermax variables are the correspond-
ing thread's local counter and the upper bound on that counter, respectively.

1 unsigned long __thread counter = 0;


2 unsigned long __thread countermax = 0;
3 unsigned long globalcountmax = 10000;
4 unsigned long globalcount = 0;
5 unsigned long globalreserve = 0;
6 unsigned long *counterp[NR_THREADS] = { NULL };
7 DEFINE_SPINLOCK(gblcnt_mutex);

Figure 5.10: Simple Limit Counter Variables

[Figure 5.11: Simple Limit Counter Variable Relationships. The globalcountmax limit spans globalcount plus globalreserve, where globalreserve is the sum of the per-thread countermax values and each thread's counter fits within its countermax.]

The globalcountmax variable on line 3 contains the upper bound for the aggregate
counter, and the globalcount variable on line 4 is the global counter. The sum of
globalcount and each thread’s counter gives the aggregate value of the overall
counter. The globalreserve variable on line 5 is the sum of all of the per-thread
countermax variables. The relationship among these variables is shown by Fig-
ure 5.11:

1. The sum of globalcount and globalreserve must be less than or equal


to globalcountmax.

2. The sum of all threads’ countermax values must be less than or equal to
globalreserve.

3. Each thread’s counter must be less than or equal to that thread’s countermax.

Each element of the counterp[] array references the corresponding thread’s


counter variable, and, finally, the gblcnt_mutex spinlock guards all of the global
variables, in other words, no thread is permitted to access or modify any of the global
variables unless it has acquired gblcnt_mutex.

1 int add_count(unsigned long delta)


2 {
3 if (countermax - counter >= delta) {
4 counter += delta;
5 return 1;
6 }
7 spin_lock(&gblcnt_mutex);
8 globalize_count();
9 if (globalcountmax -
10 globalcount - globalreserve < delta) {
11 spin_unlock(&gblcnt_mutex);
12 return 0;
13 }
14 globalcount += delta;
15 balance_count();
16 spin_unlock(&gblcnt_mutex);
17 return 1;
18 }
19
20 int sub_count(unsigned long delta)
21 {
22 if (counter >= delta) {
23 counter -= delta;
24 return 1;
25 }
26 spin_lock(&gblcnt_mutex);
27 globalize_count();
28 if (globalcount < delta) {
29 spin_unlock(&gblcnt_mutex);
30 return 0;
31 }
32 globalcount -= delta;
33 balance_count();
34 spin_unlock(&gblcnt_mutex);
35 return 1;
36 }
37
38 unsigned long read_count(void)
39 {
40 int t;
41 unsigned long sum;
42
43 spin_lock(&gblcnt_mutex);
44 sum = globalcount;
45 for_each_thread(t)
46 if (counterp[t] != NULL)
47 sum += *counterp[t];
48 spin_unlock(&gblcnt_mutex);
49 return sum;
50 }

Figure 5.12: Simple Limit Counter Add, Subtract, and Read



Figure 5.12 shows the add_count(), sub_count(), and read_count() functions (count_lim.c).
Quick Quiz 5.28: Why does Figure 5.12 provide add_count() and sub_
count() instead of the inc_count() and dec_count() interfaces shown in Sec-
tion 5.2?
Lines 1-18 show add_count(), which adds the specified value delta to the
counter. Line 3 checks to see if there is room for delta on this thread’s counter, and,
if so, line 4 adds it and line 5 returns success. This is the add_count() fastpath,
and it does no atomic operations, references only per-thread variables, and should not
incur any cache misses.
Quick Quiz 5.29: What is with the strange form of the condition on line 3 of
Figure 5.12? Why not the following more intuitive form of the fastpath?

3 if (counter + delta <= countermax) {
4 counter += delta;
5 return 1;
6 }

If the test on line 3 fails, we must access global variables, and thus must acquire
gblcnt_mutex on line 7, which we release on line 11 in the failure case or on line 16
in the success case. Line 8 invokes globalize_count(), shown in Figure 5.13,
which clears the thread-local variables, adjusting the global variables as needed, thus
simplifying global processing. (But don’t take my word for it, try coding it yourself!)
Lines 9 and 10 check to see if addition of delta can be accommodated, with the
meaning of the expression preceding the less-than sign shown in Figure 5.11 as the
difference in height of the two red (leftmost) bars. If the addition of delta cannot be
accommodated, then line 11 (as noted earlier) releases gblcnt_mutex and line 12
returns indicating failure.
Otherwise, we take the slowpath. Line 14 adds delta to globalcount, and then
line 15 invokes balance_count() (shown in Figure 5.13) in order to update both the
global and the per-thread variables. This call to balance_count() will usually set
this thread’s countermax to re-enable the fastpath. Line 16 then releases gblcnt_
mutex (again, as noted earlier), and, finally, line 17 returns indicating success.
Quick Quiz 5.30: Why does globalize_count() zero the per-thread variables,
only to later call balance_count() to refill them in Figure 5.12? Why not just leave
the per-thread variables non-zero?
Lines 20-36 show sub_count(), which subtracts the specified delta from the
counter. Line 22 checks to see if the per-thread counter can accommodate this subtrac-
tion, and, if so, line 23 does the subtraction and line 24 returns success. These lines
form sub_count()’s fastpath, and, as with add_count(), this fastpath executes
no costly operations.
If the fastpath cannot accommodate subtraction of delta, execution proceeds to
the slowpath on lines 26-35. Because the slowpath must access global state, line 26
acquires gblcnt_mutex, which is released either by line 29 (in case of failure) or
by line 34 (in case of success). Line 27 invokes globalize_count(), shown in
Figure 5.13, which again clears the thread-local variables, adjusting the global variables
as needed. Line 28 checks to see if the counter can accommodate subtracting delta,
and, if not, line 29 releases gblcnt_mutex (as noted earlier) and line 30 returns

1 static void globalize_count(void)


2 {
3 globalcount += counter;
4 counter = 0;
5 globalreserve -= countermax;
6 countermax = 0;
7 }
8
9 static void balance_count(void)
10 {
11 countermax = globalcountmax -
12 globalcount - globalreserve;
13 countermax /= num_online_threads();
14 globalreserve += countermax;
15 counter = countermax / 2;
16 if (counter > globalcount)
17 counter = globalcount;
18 globalcount -= counter;
19 }
20
21 void count_register_thread(void)
22 {
23 int idx = smp_thread_id();
24
25 spin_lock(&gblcnt_mutex);
26 counterp[idx] = &counter;
27 spin_unlock(&gblcnt_mutex);
28 }
29
30 void count_unregister_thread(int nthreadsexpected)
31 {
32 int idx = smp_thread_id();
33
34 spin_lock(&gblcnt_mutex);
35 globalize_count();
36 counterp[idx] = NULL;
37 spin_unlock(&gblcnt_mutex);
38 }

Figure 5.13: Simple Limit Counter Utility Functions

failure.
Quick Quiz 5.31: Given that globalreserve counted against us in add_
count(), why doesn’t it count for us in sub_count() in Figure 5.12?
Quick Quiz 5.32: Suppose that one thread invokes add_count() shown in
Figure 5.12, and then another thread invokes sub_count(). Won’t sub_count()
return failure even though the value of the counter is non-zero?
If, on the other hand, line 28 finds that the counter can accommodate subtracting
delta, we complete the slowpath. Line 32 does the subtraction and then line 33
invokes balance_count() (shown in Figure 5.13) in order to update both global
and per-thread variables (hopefully re-enabling the fastpath). Then line 34 releases
gblcnt_mutex, and line 35 returns success.
Quick Quiz 5.33: Why have both add_count() and sub_count() in Fig-
ure 5.12? Why not simply pass a negative number to add_count()?
Lines 38-50 show read_count(), which returns the aggregate value of the
counter. It acquires gblcnt_mutex on line 43 and releases it on line 48, excluding
global operations from add_count() and sub_count(), and, as we will see, also
excluding thread creation and exit. Line 44 initializes local variable sum to the value of
globalcount, and then the loop spanning lines 45-47 sums the per-thread counter
variables. Line 49 then returns the sum.
Figure 5.13 shows a number of utility functions used by the add_count(), sub_
count(), and read_count() primitives shown in Figure 5.12.

Lines 1-7 show globalize_count(), which zeros the current thread’s per-
thread counters, adjusting the global variables appropriately. It is important to note that
this function does not change the aggregate value of the counter, but instead changes how
the counter’s current value is represented. Line 3 adds the thread’s counter variable to
globalcount, and line 4 zeroes counter. Similarly, line 5 subtracts the per-thread
countermax from globalreserve, and line 6 zeroes countermax. It is helpful
to refer to Figure 5.11 when reading both this function and balance_count(),
which is next.
Lines 9-19 show balance_count(), which is roughly speaking the inverse of
globalize_count(). This function’s job is to set the current thread’s countermax
variable to the largest value that avoids the risk of the counter exceeding the globalcountmax
limit. Changing the current thread’s countermax variable of course requires corre-
sponding adjustments to counter, globalcount and globalreserve, as can
be seen by referring back to Figure 5.11. By doing this, balance_count() max-
imizes use of add_count()’s and sub_count()’s low-overhead fastpaths. As
with globalize_count(), balance_count() is not permitted to change the
aggregate value of the counter.
Lines 11-13 compute this thread’s share of that portion of globalcountmax that
is not already covered by either globalcount or globalreserve, and assign the
computed quantity to this thread’s countermax. Line 14 makes the corresponding ad-
justment to globalreserve. Line 15 sets this thread’s counter to the middle of the
range from zero to countermax. Line 16 checks to see whether globalcount can
in fact accommodate this value of counter, and, if not, line 17 decreases counter
accordingly. Finally, in either case, line 18 makes the corresponding adjustment to
globalcount.
Quick Quiz 5.34: Why set counter to countermax / 2 in line 15 of Fig-
ure 5.13? Wouldn’t it be simpler to just take countermax counts?
It is helpful to look at a schematic depicting how the relationship of the coun-
ters changes with the execution of first globalize_count() and then balance_
count(), as shown in Figure 5.14. Time advances from left to right, with the leftmost
configuration roughly that of Figure 5.11. The center configuration shows the rela-
tionship of these same counters after globalize_count() is executed by thread 0.
As can be seen from the figure, thread 0’s counter (“c 0” in the figure) is added
to globalcount, while the value of globalreserve is reduced by this same
amount. Both thread 0’s counter and its countermax (“cm 0” in the figure) are
reduced to zero. The other three threads’ counters are unchanged. Note that this
change did not affect the overall value of the counter, as indicated by the bottommost
dotted line connecting the leftmost and center configurations. In other words, the
sum of globalcount and the four threads’ counter variables is the same in both
configurations. Similarly, this change did not affect the sum of globalcount and
globalreserve, as indicated by the upper dotted line.
The rightmost configuration shows the relationship of these counters after balance_
count() is executed, again by thread 0. One-quarter of the remaining count, denoted
by the vertical line extending up from all three configurations, is added to thread 0’s
countermax and half of that to thread 0’s counter. The amount added to thread 0’s
counter is also subtracted from globalcount in order to avoid changing the
overall value of the counter (which is again the sum of globalcount and the four
threads’ counter variables), again as indicated by the lowermost of the two dotted
lines connecting the center and rightmost configurations.

[Figure 5.14: Schematic of Globalization and Balancing. Three side-by-side configurations (initial, after globalize_count() by thread 0, and after balance_count() by thread 0) show how globalcount, globalreserve, and each thread's counter ("c") and countermax ("cm") values shift.]

The globalreserve variable is also adjusted so that this variable remains equal to the sum of the four threads'
countermax variables. Because thread 0’s counter is less than its countermax,
thread 0 can once again increment the counter locally.
Quick Quiz 5.35: In Figure 5.14, even though a quarter of the remaining count up
to the limit is assigned to thread 0, only an eighth of the remaining count is consumed,
as indicated by the uppermost dotted line connecting the center and the rightmost
configurations. Why is that?
Lines 21-28 show count_register_thread(), which sets up state for newly
created threads. This function simply installs a pointer to the newly created thread’s
counter variable into the corresponding entry of the counterp[] array under the
protection of gblcnt_mutex.
Finally, lines 30-38 show count_unregister_thread(), which tears down
state for a soon-to-be-exiting thread. Line 34 acquires gblcnt_mutex and line 37
releases it. Line 35 invokes globalize_count() to clear out this thread’s counter
state, and line 36 clears this thread’s entry in the counterp[] array.

5.3.3 Simple Limit Counter Discussion


This type of counter is quite fast when aggregate values are near zero, with some over-
head due to the comparison and branch in both add_count()’s and sub_count()’s
fastpaths. However, the use of a per-thread countermax reserve means that add_
count() can fail even when the aggregate value of the counter is nowhere near
globalcountmax. Similarly, sub_count() can fail even when the aggregate
value of the counter is nowhere near zero.
In many cases, this is unacceptable.

1 unsigned long __thread counter = 0;


2 unsigned long __thread countermax = 0;
3 unsigned long globalcountmax = 10000;
4 unsigned long globalcount = 0;
5 unsigned long globalreserve = 0;
6 unsigned long *counterp[NR_THREADS] = { NULL };
7 DEFINE_SPINLOCK(gblcnt_mutex);
8 #define MAX_COUNTERMAX 100

Figure 5.15: Approximate Limit Counter Variables


1 static void balance_count(void)
2 {
3 countermax = globalcountmax -
4 globalcount - globalreserve;
5 countermax /= num_online_threads();
6 if (countermax > MAX_COUNTERMAX)
7 countermax = MAX_COUNTERMAX;
8 globalreserve += countermax;
9 counter = countermax / 2;
10 if (counter > globalcount)
11 counter = globalcount;
12 globalcount -= counter;
13 }

Figure 5.16: Approximate Limit Counter Balancing

Even if the globalcountmax is intended to be an approximate limit, there is usually a limit to exactly how much approximation can
be tolerated. One way to limit the degree of approximation is to impose an upper limit
on the value of the per-thread countermax instances. This task is undertaken in the
next section.

5.3.4 Approximate Limit Counter Implementation


Because this implementation (count_lim_app.c) is quite similar to that in the
previous section (Figures 5.10, 5.12, and 5.13), only the changes are shown here.
Figure 5.15 is identical to Figure 5.10, with the addition of MAX_COUNTERMAX, which
sets the maximum permissible value of the per-thread countermax variable.
Similarly, Figure 5.16 is identical to the balance_count() function in Fig-
ure 5.13, with the addition of lines 6 and 7, which enforce the MAX_COUNTERMAX
limit on the per-thread countermax variable.

5.3.5 Approximate Limit Counter Discussion


These changes greatly reduce the limit inaccuracy seen in the previous version, but
present another problem: any given value of MAX_COUNTERMAX will cause a workload-
dependent fraction of accesses to fall off the fastpath. As the number of threads increases,
non-fastpath execution will become both a performance and a scalability problem.
However, we will defer this problem and turn instead to counters with exact limits.

5.4 Exact Limit Counters


To solve the exact structure-allocation limit problem noted in Quick Quiz 5.4, we need a
limit counter that can tell exactly when its limits are exceeded. One way of implementing
such a limit counter is to cause threads that have reserved counts to give them up.

1 atomic_t __thread ctrandmax = ATOMIC_INIT(0);


2 unsigned long globalcountmax = 10000;
3 unsigned long globalcount = 0;
4 unsigned long globalreserve = 0;
5 atomic_t *counterp[NR_THREADS] = { NULL };
6 DEFINE_SPINLOCK(gblcnt_mutex);
7 #define CM_BITS (sizeof(atomic_t) * 4)
8 #define MAX_COUNTERMAX ((1 << CM_BITS) - 1)
9
10 static void
11 split_ctrandmax_int(int cami, int *c, int *cm)
12 {
13 *c = (cami >> CM_BITS) & MAX_COUNTERMAX;
14 *cm = cami & MAX_COUNTERMAX;
15 }
16
17 static void
18 split_ctrandmax(atomic_t *cam, int *old,
19 int *c, int *cm)
20 {
21 unsigned int cami = atomic_read(cam);
22
23 *old = cami;
24 split_ctrandmax_int(cami, c, cm);
25 }
26
27 static int merge_ctrandmax(int c, int cm)
28 {
29 unsigned int cami;
30
31 cami = (c << CM_BITS) | cm;
32 return ((int)cami);
33 }

Figure 5.17: Atomic Limit Counter Variables and Access Functions

One way to do this is to use atomic instructions. Of course, atomic instructions will slow
down the fastpath, but on the other hand, it would be silly not to at least give them a try.

5.4.1 Atomic Limit Counter Implementation


Unfortunately, if one thread is to safely remove counts from another thread, both threads
will need to atomically manipulate that thread’s counter and countermax variables.
The usual way to do this is to combine these two variables into a single variable, for
example, given a 32-bit variable, using the high-order 16 bits to represent counter
and the low-order 16 bits to represent countermax.
Quick Quiz 5.36: Why is it necessary to atomically manipulate the thread’s
counter and countermax variables as a unit? Wouldn’t it be good enough to
atomically manipulate them individually?
The variables and access functions for a simple atomic limit counter are shown in
Figure 5.17 (count_lim_atomic.c). The counter and countermax variables
in earlier algorithms are combined into the single variable ctrandmax shown on line 1,
with counter in the upper half and countermax in the lower half. This variable is
of type atomic_t, which has an underlying representation of int.
Lines 2-6 show the definitions for globalcountmax, globalcount, globalreserve,
counterp, and gblcnt_mutex, all of which take on roles similar to their coun-
terparts in Figure 5.15. Line 7 defines CM_BITS, which gives the number of bits in
each half of ctrandmax, and line 8 defines MAX_COUNTERMAX, which gives the
maximum value that may be held in either half of ctrandmax.
Quick Quiz 5.37: In what way does line 7 of Figure 5.17 violate the C standard?

Lines 10-15 show the split_ctrandmax_int() function, which, when given the underlying int from the atomic_t ctrandmax variable, splits it into its
counter (c) and countermax (cm) components. Line 13 isolates the most-significant
half of this int, placing the result as specified by argument c, and line 14 isolates the
least-significant half of this int, placing the result as specified by argument cm.
Lines 17-25 show the split_ctrandmax() function, which picks up the un-
derlying int from the specified variable on line 21, stores it as specified by the old
argument on line 23, and then invokes split_ctrandmax_int() to split it on
line 24.
Quick Quiz 5.38: Given that there is only one ctrandmax variable, why bother
passing in a pointer to it on line 18 of Figure 5.17?
Lines 27-33 show the merge_ctrandmax() function, which can be thought
of as the inverse of split_ctrandmax(). Line 31 merges the counter and
countermax values passed in c and cm, respectively, and returns the result.
Quick Quiz 5.39: Why does merge_ctrandmax() in Figure 5.17 return an
int rather than storing directly into an atomic_t?
Figure 5.18 shows the add_count() and sub_count() functions.
Lines 1-32 show add_count(), whose fastpath spans lines 8-15, with the remain-
der of the function being the slowpath. Lines 8-14 of the fastpath form a compare-and-
swap (CAS) loop, with the atomic_cmpxchg() primitives on lines 13-14 perform-
ing the actual CAS. Line 9 splits the current thread’s ctrandmax variable into its
counter (in c) and countermax (in cm) components, while placing the underlying
int into old. Line 10 checks whether the amount delta can be accommodated
locally (taking care to avoid integer overflow), and if not, line 11 transfers to the
slowpath. Otherwise, line 12 combines an updated counter value with the original
countermax value into new. The atomic_cmpxchg() primitive on lines 13-14
then atomically compares this thread’s ctrandmax variable to old, updating its value
to new if the comparison succeeds. If the comparison succeeds, line 15 returns success,
otherwise, execution continues in the loop at line 9.
Quick Quiz 5.40: Yecch! Why the ugly goto on line 11 of Figure 5.18? Haven’t
you heard of the break statement???
Quick Quiz 5.41: Why would the atomic_cmpxchg() primitive at lines 13-14
of Figure 5.18 ever fail? After all, we picked up its old value on line 9 and have not
changed it!
Lines 16-31 of Figure 5.18 show add_count()’s slowpath, which is protected
by gblcnt_mutex, which is acquired on line 17 and released on lines 24 and 30.
Line 18 invokes globalize_count(), which moves this thread’s state to the global
counters. Lines 19-20 check whether the delta value can be accommodated by the
current global state, and, if not, line 21 invokes flush_local_count() to flush
all threads’ local state to the global counters, and then lines 22-23 recheck whether
delta can be accommodated. If, after all that, the addition of delta still cannot
be accommodated, then line 24 releases gblcnt_mutex (as noted earlier), and then
line 25 returns failure.
Otherwise, line 28 adds delta to the global counter, line 29 spreads counts to the
local state if appropriate, line 30 releases gblcnt_mutex (again, as noted earlier),
and finally, line 31 returns success.
Lines 34-63 of Figure 5.18 show sub_count(), which is structured similarly to
add_count(), having a fastpath on lines 41-48 and a slowpath on lines 49-62. A
line-by-line analysis of this function is left as an exercise to the reader.
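For readers unfamiliar with the compare-and-swap retry idiom used on lines 8-14, the following stand-alone fragment expresses the same pattern with the GCC __atomic builtins rather than perfbook's atomic_cmpxchg(); the shared_word variable and the try_add_locally() function are made up purely to illustrate the idiom.

/* Generic CAS retry loop, shown only to illustrate the idiom. */
#include <stdbool.h>

static int shared_word;			/* assumed shared variable */

static bool try_add_locally(int delta, int max)
{
	int old = __atomic_load_n(&shared_word, __ATOMIC_RELAXED);
	int new;

	do {
		if (old + delta > max)
			return false;	/* caller must fall back to a slowpath */
		new = old + delta;
		/* On failure, "old" is refreshed with the current value. */
	} while (!__atomic_compare_exchange_n(&shared_word, &old, new, false,
					      __ATOMIC_RELAXED, __ATOMIC_RELAXED));
	return true;			/* fastpath update succeeded */
}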

1 int add_count(unsigned long delta)


2 {
3 int c;
4 int cm;
5 int old;
6 int new;
7
8 do {
9 split_ctrandmax(&ctrandmax, &old, &c, &cm);
10 if (delta > MAX_COUNTERMAX || c + delta > cm)
11 goto slowpath;
12 new = merge_ctrandmax(c + delta, cm);
13 } while (atomic_cmpxchg(&ctrandmax,
14 old, new) != old);
15 return 1;
16 slowpath:
17 spin_lock(&gblcnt_mutex);
18 globalize_count();
19 if (globalcountmax - globalcount -
20 globalreserve < delta) {
21 flush_local_count();
22 if (globalcountmax - globalcount -
23 globalreserve < delta) {
24 spin_unlock(&gblcnt_mutex);
25 return 0;
26 }
27 }
28 globalcount += delta;
29 balance_count();
30 spin_unlock(&gblcnt_mutex);
31 return 1;
32 }
33
34 int sub_count(unsigned long delta)
35 {
36 int c;
37 int cm;
38 int old;
39 int new;
40
41 do {
42 split_ctrandmax(&ctrandmax, &old, &c, &cm);
43 if (delta > c)
44 goto slowpath;
45 new = merge_ctrandmax(c - delta, cm);
46 } while (atomic_cmpxchg(&ctrandmax,
47 old, new) != old);
48 return 1;
49 slowpath:
50 spin_lock(&gblcnt_mutex);
51 globalize_count();
52 if (globalcount < delta) {
53 flush_local_count();
54 if (globalcount < delta) {
55 spin_unlock(&gblcnt_mutex);
56 return 0;
57 }
58 }
59 globalcount -= delta;
60 balance_count();
61 spin_unlock(&gblcnt_mutex);
62 return 1;
63 }

Figure 5.18: Atomic Limit Counter Add and Subtract



1 unsigned long read_count(void)


2 {
3 int c;
4 int cm;
5 int old;
6 int t;
7 unsigned long sum;
8
9 spin_lock(&gblcnt_mutex);
10 sum = globalcount;
11 for_each_thread(t)
12 if (counterp[t] != NULL) {
13 split_ctrandmax(counterp[t], &old, &c, &cm);
14 sum += c;
15 }
16 spin_unlock(&gblcnt_mutex);
17 return sum;
18 }

Figure 5.19: Atomic Limit Counter Read

Figure 5.19 shows read_count(). Line 9 acquires gblcnt_mutex and line 16 releases it. Line 10 initializes local variable sum to the value of globalcount, and
the loop spanning lines 11-15 adds the per-thread counters to this sum, isolating each
per-thread counter using split_ctrandmax on line 13. Finally, line 17 returns the
sum.
Figures 5.20 and 5.21 show the utility functions globalize_count(), flush_local_count(), balance_count(), count_register_thread(), and count_unregister_thread(). The code for globalize_count() is shown on lines 1-12 of Figure 5.20 and is similar to that of previous algorithms, with the addition of line 7,
which is now required to split out counter and countermax from ctrandmax.
The code for flush_local_count(), which moves all threads’ local counter
state to the global counter, is shown on lines 14-32. Line 22 checks to see if the
value of globalreserve permits any per-thread counts, and, if not, line 23 returns.
Otherwise, line 24 initializes local variable zero to a combined zeroed counter and
countermax. The loop spanning lines 25-31 sequences through each thread. Line 26
checks to see if the current thread has counter state, and, if so, lines 27-30 move that
state to the global counters. Line 27 atomically fetches the current thread’s state while
replacing it with zero. Line 28 splits this state into its counter (in local variable
c) and countermax (in local variable cm) components. Line 29 adds this thread’s
counter to globalcount, while line 30 subtracts this thread’s countermax from
globalreserve.
Quick Quiz 5.42: What stops a thread from simply refilling its ctrandmax vari-
able immediately after flush_local_count() on line 14 of Figure 5.20 empties
it?
Quick Quiz 5.43: What prevents concurrent execution of the fastpath of either
add_count() or sub_count() from interfering with the ctrandmax variable
while flush_local_count() is accessing it on line 27 of Figure 5.20?

Lines 1-22 of Figure 5.21 show the code for balance_count(), which refills
the calling thread’s local ctrandmax variable. This function is quite similar to that
of the preceding algorithms, with changes required to handle the merged ctrandmax
variable. Detailed analysis of the code is left as an exercise for the reader, as it is with
the count_register_thread() function starting on line 24 and the count_unregister_thread() function starting on line 33.

1 static void globalize_count(void)


2 {
3 int c;
4 int cm;
5 int old;
6
7 split_ctrandmax(&ctrandmax, &old, &c, &cm);
8 globalcount += c;
9 globalreserve -= cm;
10 old = merge_ctrandmax(0, 0);
11 atomic_set(&ctrandmax, old);
12 }
13
14 static void flush_local_count(void)
15 {
16 int c;
17 int cm;
18 int old;
19 int t;
20 int zero;
21
22 if (globalreserve == 0)
23 return;
24 zero = merge_ctrandmax(0, 0);
25 for_each_thread(t)
26 if (counterp[t] != NULL) {
27 old = atomic_xchg(counterp[t], zero);
28 split_ctrandmax_int(old, &c, &cm);
29 globalcount += c;
30 globalreserve -= cm;
31 }
32 }

Figure 5.20: Atomic Limit Counter Utility Functions 1



Quick Quiz 5.44: Given that the atomic_set() primitive does a simple store to
the specified atomic_t, how can line 21 of balance_count() in Figure 5.21 work
correctly in face of concurrent flush_local_count() updates to this variable?
The next section qualitatively evaluates this design.

5.4.2 Atomic Limit Counter Discussion


This is the first implementation that actually allows the counter to be run all the way
to either of its limits, but it does so at the expense of adding atomic operations to the
fastpaths, which slow down the fastpaths significantly on some systems. Although some
workloads might tolerate this slowdown, it is worthwhile looking for algorithms with
better read-side performance. One such algorithm uses a signal handler to steal counts
from other threads. Because signal handlers run in the context of the signaled thread,
atomic operations are not necessary, as shown in the next section.
Quick Quiz 5.45: But signal handlers can be migrated to some other CPU while
running. Doesn't this possibility require atomic instructions and memory barriers to reliably communicate between a thread and a signal handler that interrupts
that thread?

5.4.3 Signal-Theft Limit Counter Design


Even though per-thread state will now be manipulated only by the corresponding thread,
there will still need to be synchronization with the signal handlers. This synchronization
is provided by the state machine shown in Figure 5.22. The state machine starts out in

1 static void balance_count(void)


2 {
3 int c;
4 int cm;
5 int old;
6 unsigned long limit;
7
8 limit = globalcountmax - globalcount -
9 globalreserve;
10 limit /= num_online_threads();
11 if (limit > MAX_COUNTERMAX)
12 cm = MAX_COUNTERMAX;
13 else
14 cm = limit;
15 globalreserve += cm;
16 c = cm / 2;
17 if (c > globalcount)
18 c = globalcount;
19 globalcount -= c;
20 old = merge_ctrandmax(c, cm);
21 atomic_set(&ctrandmax, old);
22 }
23
24 void count_register_thread(void)
25 {
26 int idx = smp_thread_id();
27
28 spin_lock(&gblcnt_mutex);
29 counterp[idx] = &ctrandmax;
30 spin_unlock(&gblcnt_mutex);
31 }
32
33 void count_unregister_thread(int nthreadsexpected)
34 {
35 int idx = smp_thread_id();
36
37 spin_lock(&gblcnt_mutex);
38 globalize_count();
39 counterp[idx] = NULL;
40 spin_unlock(&gblcnt_mutex);
41 }

Figure 5.21: Atomic Limit Counter Utility Functions 2

the IDLE state, and when add_count() or sub_count() find that the combination
of the local thread’s count and the global count cannot accommodate the request, the
corresponding slowpath sets each thread’s theft state to REQ (unless that thread has
no count, in which case it transitions directly to READY). Only the slowpath, which
holds the gblcnt_mutex lock, is permitted to transition from the IDLE state, as
indicated by the green color.3 The slowpath then sends a signal to each thread, and the
corresponding signal handler checks that thread's theft and counting
variables. If the theft state is not REQ, then the signal handler is not permitted to
change the state, and therefore simply returns. Otherwise, if the counting variable is
set, indicating that the current thread’s fastpath is in progress, the signal handler sets the
theft state to ACK, otherwise to READY.
If the theft state is ACK, only the fastpath is permitted to change the theft
state, as indicated by the blue color. When the fastpath completes, it sets the theft
state to READY.
Once the slowpath sees a thread's theft state is READY, the slowpath is permitted to steal that thread's count. The slowpath then sets that thread's theft state to IDLE.

3 For those with black-and-white versions of this book, IDLE and READY are green, REQ is red, and ACK is blue.

[Figure 5.22: Signal-Theft State Machine. The states are IDLE, REQ, ACK, and READY, with transitions IDLE->REQ ("need flush"), IDLE->READY ("no count"), REQ->ACK ("counting"), REQ->READY ("!counting"), ACK->READY ("done counting"), and READY->IDLE ("flushed").]


1 #define THEFT_IDLE 0
2 #define THEFT_REQ 1
3 #define THEFT_ACK 2
4 #define THEFT_READY 3
5
6 int __thread theft = THEFT_IDLE;
7 int __thread counting = 0;
8 unsigned long __thread counter = 0;
9 unsigned long __thread countermax = 0;
10 unsigned long globalcountmax = 10000;
11 unsigned long globalcount = 0;
12 unsigned long globalreserve = 0;
13 unsigned long *counterp[NR_THREADS] = { NULL };
14 unsigned long *countermaxp[NR_THREADS] = { NULL };
15 int *theftp[NR_THREADS] = { NULL };
16 DEFINE_SPINLOCK(gblcnt_mutex);
17 #define MAX_COUNTERMAX 100

Figure 5.23: Signal-Theft Limit Counter Data

Quick Quiz 5.46: In Figure 5.22, why is the REQ theft state colored red?
Quick Quiz 5.47: In Figure 5.22, what is the point of having separate REQ and
ACK theft states? Why not simplify the state machine by collapsing them into a
single REQACK state? Then whichever of the signal handler or the fastpath gets there
first could set the state to READY.

5.4.4 Signal-Theft Limit Counter Implementation


Figure 5.23 (count_lim_sig.c) shows the data structures used by the signal-theft
based counter implementation. Lines 1-7 define the states and values for the per-thread
theft state machine described in the preceding section. Lines 8-17 are similar to earlier
implementations, with the addition of lines 14 and 15 to allow remote access to a
thread’s countermax and theft variables, respectively.
Figure 5.24 shows the functions responsible for migrating counts between per-thread
variables and the global variables. Lines 1-7 show globalize_count(), which is identical to earlier implementations.

1 static void globalize_count(void)


2 {
3 globalcount += counter;
4 counter = 0;
5 globalreserve -= countermax;
6 countermax = 0;
7 }
8
9 static void flush_local_count_sig(int unused)
10 {
11 if (ACCESS_ONCE(theft) != THEFT_REQ)
12 return;
13 smp_mb();
14 ACCESS_ONCE(theft) = THEFT_ACK;
15 if (!counting) {
16 ACCESS_ONCE(theft) = THEFT_READY;
17 }
18 smp_mb();
19 }
20
21 static void flush_local_count(void)
22 {
23 int t;
24 thread_id_t tid;
25
26 for_each_tid(t, tid)
27 if (theftp[t] != NULL) {
28 if (*countermaxp[t] == 0) {
29 ACCESS_ONCE(*theftp[t]) = THEFT_READY;
30 continue;
31 }
32 ACCESS_ONCE(*theftp[t]) = THEFT_REQ;
33 pthread_kill(tid, SIGUSR1);
34 }
35 for_each_tid(t, tid) {
36 if (theftp[t] == NULL)
37 continue;
38 while (ACCESS_ONCE(*theftp[t]) != THEFT_READY) {
39 poll(NULL, 0, 1);
40 if (ACCESS_ONCE(*theftp[t]) == THEFT_REQ)
41 pthread_kill(tid, SIGUSR1);
42 }
43 globalcount += *counterp[t];
44 *counterp[t] = 0;
45 globalreserve -= *countermaxp[t];
46 *countermaxp[t] = 0;
47 ACCESS_ONCE(*theftp[t]) = THEFT_IDLE;
48 }
49 }
50
51 static void balance_count(void)
52 {
53 countermax = globalcountmax -
54 globalcount - globalreserve;
55 countermax /= num_online_threads();
56 if (countermax > MAX_COUNTERMAX)
57 countermax = MAX_COUNTERMAX;
58 globalreserve += countermax;
59 counter = countermax / 2;
60 if (counter > globalcount)
61 counter = globalcount;
62 globalcount -= counter;
63 }

Figure 5.24: Signal-Theft Limit Counter Value-Migration Functions



Lines 9-19 show flush_local_count_sig(), which is the signal handler used in the theft process. Lines 11 and 12 check to see if the theft state is REQ, and, if not, return without change. Line 13 executes
a memory barrier to ensure that the sampling of the theft variable happens before any
change to that variable. Line 14 sets the theft state to ACK, and, if line 15 sees that
this thread’s fastpaths are not running, line 16 sets the theft state to READY.
Quick Quiz 5.48: In Figure 5.24 function flush_local_count_sig(), why
are there ACCESS_ONCE() wrappers around the uses of the theft per-thread vari-
able?
Lines 21-49 show flush_local_count(), which is called from the slowpath
to flush all threads’ local counts. The loop spanning lines 26-34 advances the theft
state for each thread that has local count, and also sends that thread a signal. Line 27
skips any non-existent threads. Otherwise, line 28 checks to see if the current thread
holds any local count, and, if not, line 29 sets the thread’s theft state to READY and
line 30 skips to the next thread. Otherwise, line 32 sets the thread’s theft state to
REQ and line 33 sends the thread a signal.
Quick Quiz 5.49: In Figure 5.24, why is it safe for line 28 to directly access the
other thread’s countermax variable?
Quick Quiz 5.50: In Figure 5.24, why doesn’t line 33 check for the current thread
sending itself a signal?
Quick Quiz 5.51: The code in Figure 5.24, works with gcc and POSIX. What
would be required to make it also conform to the ISO C standard?
The loop spanning lines 35-48 waits until each thread reaches READY state, then
steals that thread’s count. Lines 36-37 skip any non-existent threads, and the loop
spanning lines 38-42 wait until the current thread’s theft state becomes READY.
Line 39 blocks for a millisecond to avoid priority-inversion problems, and if line 40
determines that the thread’s signal has not yet arrived, line 41 resends the signal.
Execution reaches line 43 when the thread’s theft state becomes READY, so lines 43-
46 do the thieving. Line 47 then sets the thread’s theft state back to IDLE.
Quick Quiz 5.52: In Figure 5.24, why does line 41 resend the signal?
Lines 51-63 show balance_count(), which is similar to that of earlier exam-
ples.
Figure 5.25 shows the add_count() function. The fastpath spans lines 5-20, and
the slowpath lines 21-35. Line 5 sets the per-thread counting variable to 1 so that
any subsequent signal handlers interrupting this thread will set the theft state to ACK
rather than READY, allowing this fastpath to complete properly. Line 6 prevents the
compiler from reordering any of the fastpath body to precede the setting of counting.
Lines 7 and 8 check to see if the per-thread data can accommodate the add_count()
and if there is no theft in progress, and, if so, line 9 does the fastpath addition
and line 10 notes that the fastpath was taken.
In either case, line 12 prevents the compiler from reordering the fastpath body to
follow line 13, which permits any subsequent signal handlers to undertake theft. Line 14
again disables compiler reordering, and then line 15 checks to see if the signal handler
deferred the theft state-change to READY, and, if so, line 16 executes a memory
barrier to ensure that any CPU that sees line 17 setting state to READY also sees the
effects of line 9. If the fastpath addition at line 9 was executed, then line 20 returns
success.
Otherwise, we fall through to the slowpath starting at line 21. The structure of the
slowpath is similar to those of earlier examples, so its analysis is left as an exercise to the reader.

1 int add_count(unsigned long delta)


2 {
3 int fastpath = 0;
4
5 counting = 1;
6 barrier();
7 if (countermax - counter >= delta &&
8 ACCESS_ONCE(theft) <= THEFT_REQ) {
9 counter += delta;
10 fastpath = 1;
11 }
12 barrier();
13 counting = 0;
14 barrier();
15 if (ACCESS_ONCE(theft) == THEFT_ACK) {
16 smp_mb();
17 ACCESS_ONCE(theft) = THEFT_READY;
18 }
19 if (fastpath)
20 return 1;
21 spin_lock(&gblcnt_mutex);
22 globalize_count();
23 if (globalcountmax - globalcount -
24 globalreserve < delta) {
25 flush_local_count();
26 if (globalcountmax - globalcount -
27 globalreserve < delta) {
28 spin_unlock(&gblcnt_mutex);
29 return 0;
30 }
31 }
32 globalcount += delta;
33 balance_count();
34 spin_unlock(&gblcnt_mutex);
35 return 1;
36 }

Figure 5.25: Signal-Theft Limit Counter Add Function



38 int sub_count(unsigned long delta)


39 {
40 int fastpath = 0;
41
42 counting = 1;
43 barrier();
44 if (counter >= delta &&
45 ACCESS_ONCE(theft) <= THEFT_REQ) {
46 counter -= delta;
47 fastpath = 1;
48 }
49 barrier();
50 counting = 0;
51 barrier();
52 if (ACCESS_ONCE(theft) == THEFT_ACK) {
53 smp_mb();
54 ACCESS_ONCE(theft) = THEFT_READY;
55 }
56 if (fastpath)
57 return 1;
58 spin_lock(&gblcnt_mutex);
59 globalize_count();
60 if (globalcount < delta) {
61 flush_local_count();
62 if (globalcount < delta) {
63 spin_unlock(&gblcnt_mutex);
64 return 0;
65 }
66 }
67 globalcount -= delta;
68 balance_count();
69 spin_unlock(&gblcnt_mutex);
70 return 1;
71 }

Figure 5.26: Signal-Theft Limit Counter Subtract Function

1 unsigned long read_count(void)


2 {
3 int t;
4 unsigned long sum;
5
6 spin_lock(&gblcnt_mutex);
7 sum = globalcount;
8 for_each_thread(t)
9 if (counterp[t] != NULL)
10 sum += *counterp[t];
11 spin_unlock(&gblcnt_mutex);
12 return sum;
13 }

Figure 5.27: Signal-Theft Limit Counter Read Function

the reader. Similarly, the structure of sub_count() in Figure 5.26 is the same as that
of add_count(), so the analysis of sub_count() is also left as an exercise for the
reader, as is the analysis of read_count() in Figure 5.27.

Lines 1-12 of Figure 5.28 show count_init(), which sets up flush_local_count_sig()
as the signal handler for SIGUSR1, enabling the pthread_kill() calls in
flush_local_count() to invoke flush_local_count_sig(). The code for
thread registry and unregistry is similar to that of earlier examples, so its
analysis is left as an exercise for the reader.

1 void count_init(void)
2 {
3 struct sigaction sa;
4
5 sa.sa_handler = flush_local_count_sig;
6 sigemptyset(&sa.sa_mask);
7 sa.sa_flags = 0;
8 if (sigaction(SIGUSR1, &sa, NULL) != 0) {
9 perror("sigaction");
10 exit(-1);
11 }
12 }
13
14 void count_register_thread(void)
15 {
16 int idx = smp_thread_id();
17
18 spin_lock(&gblcnt_mutex);
19 counterp[idx] = &counter;
20 countermaxp[idx] = &countermax;
21 theftp[idx] = &theft;
22 spin_unlock(&gblcnt_mutex);
23 }
24
25 void count_unregister_thread(int nthreadsexpected)
26 {
27 int idx = smp_thread_id();
28
29 spin_lock(&gblcnt_mutex);
30 globalize_count();
31 counterp[idx] = NULL;
32 countermaxp[idx] = NULL;
33 theftp[idx] = NULL;
34 spin_unlock(&gblcnt_mutex);
35 }

Figure 5.28: Signal-Theft Limit Counter Initialization Functions



5.4.5 Signal-Theft Limit Counter Discussion

The signal-theft implementation runs more than twice as fast as the atomic implementa-
tion on my Intel Core Duo laptop. Is it always preferable?
The signal-theft implementation would be vastly preferable on Pentium-4 systems,
given their slow atomic instructions, but the old 80386-based Sequent Symmetry sys-
tems would do much better with the shorter path length of the atomic implementation.
However, this increased update-side performance comes at the price of higher read-side
overhead: Those POSIX signals are not free. If ultimate performance is of the essence,
you will need to measure both implementations on the system on which your application
is to be deployed.
Quick Quiz 5.53: Not only are POSIX signals slow, sending one to each thread
simply does not scale. What would you do if you had (say) 10,000 threads and needed
the read side to be fast?
This is but one reason why high-quality APIs are so important: they permit imple-
mentations to be changed as required by ever-changing hardware performance charac-
teristics.
Quick Quiz 5.54: What if you want an exact limit counter to be exact only for its
lower limit, but to allow the upper limit to be inexact?

5.5 Applying Specialized Parallel Counters

Although the exact limit counter implementations in Section 5.4 can be very useful, they
are not much help if the counter’s value remains near zero at all times, as it might when
counting the number of outstanding accesses to an I/O device. The high overhead of
such near-zero counting is especially painful given that we normally don’t care how
many references there are. As noted in the removable I/O device access-count problem
posed by Quick Quiz 5.5, the number of accesses is irrelevant except in those rare cases
when someone is actually trying to remove the device.
One simple solution to this problem is to add a large “bias” (for example, one
billion) to the counter in order to ensure that the value is far enough from zero that
the counter can operate efficiently. When someone wants to remove the device, this
bias is subtracted from the counter value. Counting the last few accesses will be quite
inefficient, but the important point is that the many prior accesses will have been counted
at full speed.
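The following is a minimal sketch, not taken from this book's CodeSamples, of how such
a bias might be applied on top of the add_count()/sub_count() API from Section 5.4.
The bias_init() name, the error handling, and the one-billion value are illustrative
assumptions; the mybias variable matches the one used by the removal code later in
this section.

#include <stdio.h>
#include <stdlib.h>

/* Limit-counter API from Section 5.4. */
extern int add_count(unsigned long delta);
extern int sub_count(unsigned long delta);

unsigned long mybias = 1000UL * 1000UL * 1000UL; /* example bias of one billion */

/* Run once at startup, after count_init() and thread registration: push the
 * counter far from zero so that later add_count(1)/sub_count(1) calls almost
 * always stay on their fastpaths. */
void bias_init(void)
{
  if (!add_count(mybias)) {
    fprintf(stderr, "bias exceeds counter capacity\n");
    exit(EXIT_FAILURE);
  }
}

The removal code later in this section then undoes this bias with sub_count(mybias)
before waiting for the count to drain to zero.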
Quick Quiz 5.55: What else had you better have done when using a biased counter?

Although a biased counter can be quite helpful, it is only a partial solution to the
removable I/O device access-count problem called out on page 55. When attempting to
remove a device, we must not only know the precise number of current I/O accesses, but
also prevent any future accesses from starting. One way to
accomplish this is to read-acquire a reader-writer lock when updating the counter, and to
write-acquire that same reader-writer lock when checking the counter. Code for doing
I/O might be as follows:

1 read_lock(&mylock);
2 if (removing) {
3 read_unlock(&mylock);
4 cancel_io();
5 } else {
6 add_count(1);
7 read_unlock(&mylock);
8 do_io();
9 sub_count(1);
10 }

Line 1 read-acquires the lock, and either line 3 or 7 releases it. Line 2 checks to
see if the device is being removed, and, if so, line 3 releases the lock and line 4 cancels
the I/O, or takes whatever action is appropriate given that the device is to be removed.
Otherwise, line 6 increments the access count, line 7 releases the lock, line 8 performs
the I/O, and line 9 decrements the access count.
Quick Quiz 5.56: This is ridiculous! We are read-acquiring a reader-writer lock to
update the counter? What are you playing at???
The code to remove the device might be as follows:
1 write_lock(&mylock);
2 removing = 1;
3 sub_count(mybias);
4 write_unlock(&mylock);
5 while (read_count() != 0) {
6 poll(NULL, 0, 1);
7 }
8 remove_device();

Line 1 write-acquires the lock and line 4 releases it. Line 2 notes that the device is
being removed, line 3 subtracts the bias so that the counter once again reflects only the
outstanding accesses, and the loop spanning lines 5-7 waits for any I/O operations to complete.
Finally, line 8 does any additional processing needed to prepare for device removal.
Quick Quiz 5.57: What other issues would need to be accounted for in a real
system?

5.6 Parallel Counting Discussion


This chapter has presented the reliability, performance, and scalability problems with
traditional counting primitives. The C-language ++ operator is not guaranteed to
function reliably in multithreaded code, and atomic operations on a single variable
neither perform nor scale well. This chapter therefore presented a number of counting
algorithms that perform and scale extremely well in certain special cases.
It is well worth reviewing the lessons from these counting algorithms. To that
end, Section 5.6.1 summarizes performance and scalability, Section 5.6.2 discusses the
need for specialization, and finally, Section 5.6.3 enumerates lessons learned and calls
attention to later chapters that will expand on these lessons.

5.6.1 Parallel Counting Performance


Table 5.1 shows the performance of the four parallel statistical counting algorithms.
All four algorithms provide near-perfect linear scalability for updates. The per-thread-
variable implementation (count_end.c) is significantly faster on updates than the

                                                Reads
 Algorithm               Section   Updates    1 Core     32 Cores
 count_stat.c            5.2.2     11.5 ns    408 ns     409 ns
 count_stat_eventual.c   5.2.3     11.6 ns    1 ns       1 ns
 count_end.c             5.2.4     6.3 ns     389 ns     51,200 ns
 count_end_rcu.c         13.3.1    5.7 ns     354 ns     501 ns

Table 5.1: Statistical Counter Performance on Power-6

                                                         Reads
 Algorithm            Section   Exact?   Updates    1 Core     64 Cores
 count_lim.c          5.3.2     N        3.6 ns     375 ns     50,700 ns
 count_lim_app.c      5.3.4     N        11.7 ns    369 ns     51,000 ns
 count_lim_atomic.c   5.4.1     Y        51.4 ns    427 ns     49,400 ns
 count_lim_sig.c      5.4.4     Y        10.2 ns    370 ns     54,000 ns

Table 5.2: Limit Counter Performance on Power-6

array-based implementation (count_stat.c), but is slower at reads on large numbers
of cores, and suffers severe lock contention when there are many parallel readers. This
contention can be addressed using the deferred-processing techniques introduced in
Chapter 9, as shown on the count_end_rcu.c row of Table 5.1. Deferred processing
also shines on the count_stat_eventual.c row, courtesy of eventual consistency.
Quick Quiz 5.58: On the count_stat.c row of Table 5.1, we see that the read-
side scales linearly with the number of threads. How is that possible given that the more
threads there are, the more per-thread counters must be summed up?
Quick Quiz 5.59: Even on the last row of Table 5.1, the read-side performance of
these statistical counter implementations is pretty horrible. So why bother with them?
Table 5.2 shows the performance of the parallel limit-counting algorithms. Exact
enforcement of the limits incurs a substantial performance penalty, although on this
4.7GHz Power-6 system that penalty can be reduced by substituting signals for atomic
operations. All of these implementations suffer from read-side lock contention in the
face of concurrent readers.
Quick Quiz 5.60: Given the performance data shown in Table 5.2, we should
always prefer signals over atomic operations, right?
Quick Quiz 5.61: Can advanced techniques be applied to address the lock con-
tention for readers seen in Table 5.2?
In short, this chapter has demonstrated a number of counting algorithms that perform
and scale extremely well in a number of special cases. But must our parallel counting
be confined to special cases? Wouldn’t it be better to have a general algorithm that
operated efficiently in all cases? The next section looks at these questions.

5.6.2 Parallel Counting Specializations


The fact that these algorithms only work well in their respective special cases might
be considered a major problem with parallel programming in general. After all, the
C-language ++ operator works just fine in single-threaded code, and not just for special
cases, but in general, right?
This line of reasoning does contain a grain of truth, but is in essence misguided.
The problem is not parallelism as such, but rather scalability. To understand this, first

consider the C-language ++ operator. The fact is that it does not work in general, only
for a restricted range of numbers. If you need to deal with 1,000-digit decimal numbers,
the C-language ++ operator will not work for you.
Quick Quiz 5.62: The ++ operator works just fine for 1,000-digit numbers! Haven’t
you heard of operator overloading???
This problem is not specific to arithmetic. Suppose you need to store and query
data. Should you use an ASCII file? XML? A relational database? A linked list? A
dense array? A B-tree? A radix tree? Or one of the plethora of other data structures and
environments that permit data to be stored and queried? It depends on what you need
to do, how fast you need it done, and how large your data set is—even on sequential
systems.
Similarly, if you need to count, your solution will depend on the magnitude of the
numbers you need to work with, how many CPUs need to be manipulating a given number
concurrently, how the number is to be used, and what level of performance and scalability
you will need.
Nor is this problem specific to software. The design for a bridge meant to allow
people to walk across a small brook might be as simple as a single wooden plank. But
you would probably not use a plank to span the kilometers-wide mouth of the Columbia
River, nor would such a design be advisable for bridges carrying concrete trucks. In
short, just as bridge design must change with increasing span and load, so must software
design change as the number of CPUs increases. That said, it would be good to automate
this process, so that the software adapts to changes in hardware configuration and in
workload. There has in fact been some research into this sort of automation [AHS+ 03,
SAH+ 03], and the Linux kernel does some boot-time reconfiguration, including limited
binary rewriting. This sort of adaptation will become increasingly important as the
number of CPUs on mainstream systems continues to increase.
In short, as discussed in Chapter 3, the laws of physics constrain parallel software
just as surely as they constrain mechanical artifacts such as bridges. These constraints
force specialization, though in the case of software it might be possible to automate the
choice of specialization to fit the hardware and workload in question.
Of course, even generalized counting is quite specialized. We need to do a great
number of other things with computers. The next section relates what we have learned
from counters to topics taken up later in this book.

5.6.3 Parallel Counting Lessons


The opening paragraph of this chapter promised that our study of counting would
provide an excellent introduction to parallel programming. This section makes explicit
connections between the lessons from this chapter and the material presented in a
number of later chapters.
The examples in this chapter have shown that an important scalability and perfor-
mance tool is partitioning. The counters might be fully partitioned, as in the statistical
counters discussed in Section 5.2, or partially partitioned as in the limit counters dis-
cussed in Sections 5.3 and 5.4. Partitioning will be considered in far greater depth
in Chapter 6, and partial parallelization in particular in Section 6.4, where it is called
parallel fastpath.
Quick Quiz 5.63: But if we are going to have to partition everything, why bother
with shared-memory multithreading? Why not just partition the problem completely
and run as multiple processes, each in its own address space?

The partially partitioned counting algorithms used locking to guard the global data,
and locking is the subject of Chapter 7. In contrast, the partitioned data tended to be fully
under the control of the corresponding thread, so that no synchronization whatsoever
was required. This data ownership will be introduced in Section 6.3.4 and discussed in
more detail in Chapter 8.
Because integer addition and subtraction are extremely cheap operations compared
to typical synchronization operations, achieving reasonable scalability requires synchro-
nization operations be used sparingly. One way of achieving this is to batch the addition
and subtraction operations, so that a great many of these cheap operations are handled
by a single synchronization operation. Batching optimizations of one sort or another are
used by each of the counting algorithms listed in Tables 5.1 and 5.2.
Finally, the eventually consistent statistical counter discussed in Section 5.2.3
showed how deferring activity (in that case, updating the global counter) can pro-
vide substantial performance and scalability benefits. This approach allows common
case code to use much cheaper synchronization operations than would otherwise be
possible. Chapter 9 will examine a number of additional ways that deferral can improve
performance, scalability, and even real-time response.
Summarizing the summary:

1. Partitioning promotes performance and scalability.

2. Partial partitioning, that is, partitioning applied only to common code paths, works
almost as well.

3. Partial partitioning can be applied to code (as in Section 5.2’s statistical counters’
partitioned updates and non-partitioned reads), but also across time (as in Sec-
tion 5.3’s and Section 5.4’s limit counters running fast when far from the limit,
but slowly when close to the limit).

4. Partitioning across time often batches updates locally in order to reduce the num-
ber of expensive global operations, thereby decreasing synchronization overhead,
in turn improving performance and scalability. All the algorithms shown in
Tables 5.1 and 5.2 make heavy use of batching.

5. Read-only code paths should remain read-only: Spurious synchronization writes


to shared memory kill performance and scalability, as seen in the count_end.c
row of Table 5.1.

6. Judicious use of delay promotes performance and scalability, as seen in Sec-


tion 5.2.3.

7. Parallel performance and scalability is usually a balancing act: Beyond a certain


point, optimizing some code paths will degrade others. The count_stat.c
and count_end_rcu.c rows of Table 5.1 illustrate this point.

8. Different levels of performance and scalability will affect algorithm and data-
structure design, as do a large number of other factors. Figure 5.3 illustrates this
point: Atomic increment might be completely acceptable for a two-CPU system,
but be completely inadequate for an eight-CPU system.

Summarizing still further, we have the “big three” methods of increasing perfor-
mance and scalability, namely (1) partitioning over CPUs or threads, (2) batching

[Figure 5.29: Optimization and the Four Parallel-Programming Tasks: the Work
Partitioning, Resource Partitioning and Replication, Parallel Access Control, and
Interacting With Hardware bubbles, annotated with the batch, partition, and weaken
optimizations]

so that more work can be done by each expensive synchronization operation, and
(3) weakening synchronization operations where feasible. As a rough rule of thumb, you
should apply these methods in this order, as was noted earlier in the discussion of Fig-
ure 2.6 on page 19. The partitioning optimization applies to the “Resource Partitioning
and Replication” bubble, the batching optimization to the “Work Partitioning” bubble,
and the weakening optimization to the “Parallel Access Control” bubble, as shown in
Figure 5.29. Of course, if you are using special-purpose hardware such as digital signal
processors (DSPs), field-programmable gate arrays (FPGAs), or general-purpose graph-
ical processing units (GPGPUs), you may need to pay close attention to the “Interacting
With Hardware” bubble throughout the design process. For example, the structure of a
GPGPU’s hardware threads and memory connectivity might richly reward very careful
partitioning and batching design decisions.
In short, as noted at the beginning of this chapter, the simplicity of counting has
allowed us to explore many fundamental concurrency issues without the distraction of
complex synchronization primitives or elaborate data structures. Such synchronization
primitives and data structures are covered in later chapters.
Divide and rule.

Philip II of Macedon

Chapter 6

Partitioning and
Synchronization Design

This chapter describes how to design software to take advantage of the multiple CPUs
that are increasingly appearing in commodity systems. It does this by presenting a
number of idioms, or “design patterns” [Ale79, GHJV95, SSRB00] that can help you
balance performance, scalability, and response time. As noted in earlier chapters, the
most important decision you will make when creating parallel software is how to carry
out the partitioning. Correctly partitioned problems lead to simple, scalable, and high-
performance solutions, while poorly partitioned problems result in slow and complex
solutions. This chapter will help you design partitioning into your code, with some
discussion of batching and weakening as well. The word “design” is very important:
You should partition first, batch second, weaken third, and code fourth. Changing this
order often leads to poor performance and scalability along with great frustration.
To this end, Section 6.1 presents partitioning exercises, Section 6.2 reviews partition-
ability design criteria, Section 6.3 discusses selecting an appropriate synchronization
granularity, Section 6.4 gives an overview of important parallel-fastpath designs that
provide speed and scalability in the common case with a simpler but less-scalable
fallback “slow path” for unusual situations, and finally Section 6.5 takes a brief look
beyond partitioning.

6.1 Partitioning Exercises


This section uses a pair of exercises (the classic Dining Philosophers problem and a
double-ended queue) to demonstrate the value of partitioning.

6.1.1 Dining Philosophers Problem


Figure 6.1 shows a diagram of the classic Dining Philosophers problem [Dij71]. This
problem features five philosophers who do nothing but think and eat a “very difficult
kind of spaghetti” which requires two forks to eat. A given philosopher is permitted to
use only the forks to his or her immediate right and left, and once a philosopher picks


[Figure 6.1: Dining Philosophers Problem: five philosophers, P1 through P5, seated
around a table with one fork between each adjacent pair]

Figure 6.2: Partial Starvation Is Also Bad

up a fork, he or she will not put it down until sated.1


The object is to construct an algorithm that, quite literally, prevents starvation. One
starvation scenario would be if all of the philosophers picked up their leftmost forks
simultaneously. Because none of them would put down their fork until after they ate, and
because none of them may pick up their second fork until at least one has finished eating,
they all starve. Please note that it is not sufficient to allow at least one philosopher to
eat. As Figure 6.2 shows, starvation of even a few of the philosophers is to be avoided.
Dijkstra’s solution used a global semaphore, which works fine assuming negligible
communications delays, an assumption that became invalid in the late 1980s or early
1990s.2 Therefore, recent solutions number the forks as shown in Figure 6.3. Each
philosopher picks up the lowest-numbered fork next to his or her plate, then picks up

1 Readers who have difficulty imagining a food that requires two forks are invited to instead think in

terms of chopsticks.
2 It is all too easy to denigrate Dijkstra from the viewpoint of the year 2012, more than 40 years after the

fact. If you still feel the need to denigrate Dijkstra, my advice is to publish something, wait 40 years, and then
see how your words stood the test of time.

[Figure 6.3: Dining Philosophers Problem, Textbook Solution: the same five
philosophers, with the five forks numbered 1 through 5]

the highest-numbered fork. The philosopher sitting in the uppermost position in the
diagram thus picks up the leftmost fork first, then the rightmost fork, while the rest of the
philosophers instead pick up their rightmost fork first. Because two of the philosophers
will attempt to pick up fork 1 first, and because only one of those two philosophers will
succeed, there will be five forks available to four philosophers. At least one of these
four will be guaranteed to have two forks, and thus be able to proceed eating.
This general technique of numbering resources and acquiring them in numerical
order is heavily used as a deadlock-prevention technique. However, it is easy to imagine
a sequence of events that will result in only one philosopher eating at a time even though
all are hungry:

1. P2 picks up fork 1, preventing P1 from taking a fork.

2. P3 picks up fork 2.

3. P4 picks up fork 3.

4. P5 picks up fork 4.

5. P5 picks up fork 5 and eats.

6. P5 puts down forks 4 and 5.

7. P4 picks up fork 4 and eats.

In short, this algorithm can result in only one philosopher eating at a given time,
even when all five philosophers are hungry, despite the fact that there are more than
enough forks for two philosophers to eat concurrently.
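As a rough sketch of this textbook solution (an illustration rather than code from this
book's CodeSamples), each fork might be represented by a pthread mutex, with each
philosopher always acquiring the lower-numbered of its two forks first. The think() and
eat() stand-ins and the initialization strategy are assumptions:

#include <pthread.h>
#include <poll.h>

#define NR_PHIL 5

/* fork_mutex[i] represents fork i; each mutex is initialized with
 * pthread_mutex_init() before the philosopher threads are created (not shown). */
static pthread_mutex_t fork_mutex[NR_PHIL];

static void think(int me) { (void)me; poll(NULL, 0, 1); } /* stand-in for thinking */
static void eat(int me) { (void)me; poll(NULL, 0, 1); }   /* stand-in for eating */

static void *philosopher(void *arg)
{
  int me = (int)(long)arg;
  int left = me;                  /* the two forks adjacent to philosopher "me" */
  int right = (me + 1) % NR_PHIL;
  int lo = left < right ? left : right;
  int hi = left < right ? right : left;

  for (;;) {
    think(me);
    pthread_mutex_lock(&fork_mutex[lo]);  /* lowest-numbered fork first... */
    pthread_mutex_lock(&fork_mutex[hi]);  /* ...then the highest, so no cycle can form */
    eat(me);
    pthread_mutex_unlock(&fork_mutex[hi]);
    pthread_mutex_unlock(&fork_mutex[lo]);
  }
  return NULL;
}

In this labeling, philosopher NR_PHIL - 1 is the odd one out, reaching for fork 0
before fork NR_PHIL - 1, which is what breaks the deadlock cycle.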
Please think about ways of partitioning the Dining Philosophers Problem before
reading further.




[Figure 6.4: Dining Philosophers Problem, Partitioned: four philosophers, P1 through
P4, with the forks bundled into two pairs]

One approach is shown in Figure 6.4, which includes four philosophers rather than
five to better illustrate the partition technique. Here the upper and rightmost philosophers
share a pair of forks, while the lower and leftmost philosophers share another pair of
forks. If all philosophers are simultaneously hungry, at least two will always be able to
eat concurrently. In addition, as shown in the figure, the forks can now be bundled so
that the pair are picked up and put down simultaneously, simplifying the acquisition and
release algorithms.
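A correspondingly minimal sketch of this partitioned approach (again an illustration,
with the philosopher-to-pair mapping an assumption based on Figure 6.4) simply bundles
each pair of forks under a single lock:

#include <pthread.h>
#include <poll.h>

/* One lock per bundled pair of forks: philosophers 0 and 1 share pair 0, while
 * philosophers 2 and 3 share pair 1.  Each mutex is initialized with
 * pthread_mutex_init() before use (not shown). */
static pthread_mutex_t fork_pair[2];

static void *partitioned_philosopher(void *arg)
{
  int me = (int)(long)arg;  /* 0 <= me <= 3 */
  int pair = me / 2;        /* assumed mapping of philosopher to fork pair */

  for (;;) {
    poll(NULL, 0, 1);                      /* think */
    pthread_mutex_lock(&fork_pair[pair]);  /* pick up both forks at once */
    poll(NULL, 0, 1);                      /* eat */
    pthread_mutex_unlock(&fork_pair[pair]);
  }
  return NULL;
}

Because the two pairs never interact, no lock ordering is needed, and at least two
philosophers can always eat concurrently.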
Quick Quiz 6.1: Is there a better solution to the Dining Philosophers Problem?
This is an example of “horizontal parallelism” [Inm85] or “data parallelism”, so
named because there is no dependency among the pairs of philosophers. In a horizontally
parallel data-processing system, a given item of data would be processed by only one of
a replicated set of software components.
Quick Quiz 6.2: And in just what sense can this “horizontal parallelism” be said to
be “horizontal”?

6.1.2 Double-Ended Queue


A double-ended queue is a data structure containing a list of elements that may be
inserted or removed from either end [Knu73]. It has been claimed that a lock-based
implementation permitting concurrent operations on both ends of the double-ended
queue is difficult [Gro07]. This section shows how a partitioning design strategy can
result in a reasonably simple implementation, looking at three general approaches in the
following sections.

6.1.2.1 Left- and Right-Hand Locks


One seemingly straightforward approach would be to use a doubly linked list with a
left-hand lock for left-hand-end enqueue and dequeue operations along with a right-hand
lock for right-hand-end operations, as shown in Figure 6.5. However, the problem with
this approach is that the two locks’ domains must overlap when there are fewer than

[Figure 6.5: Double-Ended Queue With Left- and Right-Hand Locks: queues of zero
through four elements, showing how the left- and right-hand lock domains overlap when
the queue is short]

[Figure 6.6: Compound Double-Ended Queue: two double-ended queues, DEQ L and DEQ R,
each protected by its own lock]

four elements on the list. This overlap is due to the fact that removing any given element
affects not only that element, but also its left- and right-hand neighbors. These domains
are indicated by color in the figure, with blue with downward stripes indicating the
domain of the left-hand lock, red with upward stripes indicating the domain of the
right-hand lock, and purple (with no stripes) indicating overlapping domains. Although
it is possible to create an algorithm that works this way, the fact that it has no fewer than
five special cases should raise a big red flag, especially given that concurrent activity at
the other end of the list can shift the queue from one special case to another at any time.
It is far better to consider other designs.

6.1.2.2 Compound Double-Ended Queue


One way of forcing non-overlapping lock domains is shown in Figure 6.6. Two separate
double-ended queues are run in tandem, each protected by its own lock. This means
that elements must occasionally be shuttled from one of the double-ended queues to
the other, in which case both locks must be held. A simple lock hierarchy may be used
to avoid deadlock, for example, always acquiring the left-hand lock before acquiring
the right-hand lock. This will be much simpler than applying two locks to the same
double-ended queue, as we can unconditionally left-enqueue elements to the left-hand
queue and right-enqueue elements to the right-hand queue. The main complication
arises when dequeuing from an empty queue, in which case it is necessary to:

[Figure 6.7: Hashed Double-Ended Queue: four hash-chain queues DEQ 0 through DEQ 3,
each with its own lock, plus left-hand and right-hand indexes, each protected by its
own lock]

1. If holding the right-hand lock, release it and acquire the left-hand lock.

2. Acquire the right-hand lock.

3. Rebalance the elements across the two queues.

4. Remove the required element if there is one.

5. Release both locks.

Quick Quiz 6.3: In this compound double-ended queue implementation, what


should be done if the queue has become non-empty while releasing and reacquiring the
lock?
The resulting code (locktdeq.c) is quite straightforward. The rebalancing opera-
tion might well shuttle a given element back and forth between the two queues, wasting
time and possibly requiring workload-dependent heuristics to obtain optimal perfor-
mance. Although this might well be the best approach in some cases, it is interesting to
try for an algorithm with greater determinism.

6.1.2.3 Hashed Double-Ended Queue


One of the simplest and most effective ways to deterministically partition a data structure
is to hash it. It is possible to trivially hash a double-ended queue by assigning each
element a sequence number based on its position in the list, so that the first element left-
enqueued into an empty queue is numbered zero and the first element right-enqueued
into an empty queue is numbered one. A series of elements left-enqueued into an
otherwise-idle queue would be assigned decreasing numbers (−1, −2, −3, . . .), while
a series of elements right-enqueued into an otherwise-idle queue would be assigned
increasing numbers (2, 3, 4, . . .). A key point is that it is not necessary to actually
represent a given element’s number, as this number will be implied by its position in the
queue.
Given this approach, we assign one lock to guard the left-hand index, one to guard
the right-hand index, and one lock for each hash chain. Figure 6.7 shows the resulting
data structure given four hash chains. Note that the lock domains do not overlap, and
that deadlock is avoided by acquiring the index locks before the chain locks, and by
never acquiring more than one lock of each type (index or chain) at a time.
Each hash chain is itself a double-ended queue, and in this example, each holds
every fourth element. The uppermost portion of Figure 6.8 shows the state after a
single element (“R1 ”) has been right-enqueued, with the right-hand index having been

[Figure 6.8: Hashed Double-Ended Queue After Insertions: the state after a single
right-enqueue of R1, after three further right-enqueues (Enq 3R), and after three
left-enqueues plus one additional right-enqueue (Enq 3L1R)]

incremented to reference hash chain 2. The middle portion of this same figure shows
the state after three more elements have been right-enqueued. As you can see, the
indexes are back to their initial states (see Figure 6.7); however, each hash chain is
now non-empty. The lower portion of this figure shows the state after three additional
elements have been left-enqueued and an additional element has been right-enqueued.
From the last state shown in Figure 6.8, a left-dequeue operation would return
element “L−2 ” and leave the left-hand index referencing hash chain 2, which would
then contain only a single element (“R2 ”). In this state, a left-enqueue running concur-
rently with a right-enqueue would result in lock contention, but the probability of such
contention can be reduced to arbitrarily low levels by using a larger hash table.
Figure 6.9 shows how 16 elements would be organized in a four-hash-bucket parallel
double-ended queue. Each underlying single-lock double-ended queue holds a one-
quarter slice of the full parallel double-ended queue.

[Figure 6.9: Hashed Double-Ended Queue With 16 Elements: elements L−8 through R7
distributed across the four hash buckets, each bucket holding a one-quarter slice of
the queue]


1 struct pdeq {
2 spinlock_t llock;
3 int lidx;
4 spinlock_t rlock;
5 int ridx;
6 struct deq bkt[DEQ_N_BKTS];
7 };

Figure 6.10: Lock-Based Parallel Double-Ended Queue Data Structure

Figure 6.10 shows the corresponding C-language data structure, assuming an existing
struct deq that provides a trivially locked double-ended-queue implementation.
This data structure contains the left-hand lock on line 2, the left-hand index on line 3,
the right-hand lock on line 4 (which is cache-aligned in the actual implementation),
the right-hand index on line 5, and, finally, the hashed array of simple lock-based
double-ended queues on line 6. A high-performance implementation would of course
use padding or special alignment directives to avoid false sharing.
Figure 6.11 (lockhdeq.c) shows the implementation of the enqueue and de-
queue functions.3 Discussion will focus on the left-hand operations, as the right-hand
operations are trivially derived from them.
Lines 1-13 show pdeq_pop_l(), which left-dequeues and returns an element if
possible, returning NULL otherwise. Line 6 acquires the left-hand spinlock, and line 7
computes the index to be dequeued from. Line 8 dequeues the element, and, if line 9
finds the result to be non-NULL, line 10 records the new left-hand index. Either way,
line 11 releases the lock, and, finally, line 12 returns the element if there was one, or
NULL otherwise.
Lines 29-38 show pdeq_push_l(), which left-enqueues the specified element.
Line 33 acquires the left-hand lock, and line 34 picks up the left-hand index. Line 35 left-
enqueues the specified element onto the double-ended queue indexed by the left-hand
index. Line 36 then updates the left-hand index and line 37 releases the lock.
As noted earlier, the right-hand operations are completely analogous to their left-
handed counterparts, so their analysis is left as an exercise for the reader.
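The moveleft() and moveright() helpers are not shown in Figure 6.11. A plausible
implementation (an assumption on my part, so the actual lockhdeq.c may differ) simply
steps an index around the ring of hash chains, here assuming that DEQ_N_BKTS is a power
of two:

/* Hypothetical index helpers for the hashed double-ended queue.  Because
 * left-enqueued elements are assigned decreasing numbers, moving "left"
 * decrements the index and moving "right" increments it, both modulo the
 * number of hash chains. */
static int moveleft(int idx)
{
  return (idx - 1) & (DEQ_N_BKTS - 1);
}

static int moveright(int idx)
{
  return (idx + 1) & (DEQ_N_BKTS - 1);
}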
Quick Quiz 6.4: Is the hashed double-ended queue a good solution? Why or why
not?

6.1.2.4 Compound Double-Ended Queue Revisited


This section revisits the compound double-ended queue, using a trivial rebalancing
scheme that moves all the elements from the non-empty queue to the now-empty queue.
Quick Quiz 6.5: Move all the elements to the queue that became empty? In what

3 One could easily create a polymorphic implementation in any number of languages, but doing so is left

as an exercise for the reader.



1 struct cds_list_head *pdeq_pop_l(struct pdeq *d)


2 {
3 struct cds_list_head *e;
4 int i;
5
6 spin_lock(&d->llock);
7 i = moveright(d->lidx);
8 e = deq_pop_l(&d->bkt[i]);
9 if (e != NULL)
10 d->lidx = i;
11 spin_unlock(&d->llock);
12 return e;
13 }
14
15 struct cds_list_head *pdeq_pop_r(struct pdeq *d)
16 {
17 struct cds_list_head *e;
18 int i;
19
20 spin_lock(&d->rlock);
21 i = moveleft(d->ridx);
22 e = deq_pop_r(&d->bkt[i]);
23 if (e != NULL)
24 d->ridx = i;
25 spin_unlock(&d->rlock);
26 return e;
27 }
28
29 void pdeq_push_l(struct cds_list_head *e, struct pdeq *d)
30 {
31 int i;
32
33 spin_lock(&d->llock);
34 i = d->lidx;
35 deq_push_l(e, &d->bkt[i]);
36 d->lidx = moveleft(d->lidx);
37 spin_unlock(&d->llock);
38 }
39
40 void pdeq_push_r(struct cds_list_head *e, struct pdeq *d)
41 {
42 int i;
43
44 spin_lock(&d->rlock);
45 i = d->ridx;
46 deq_push_r(e, &d->bkt[i]);
47 d->ridx = moveright(d->ridx);
48 spin_unlock(&d->rlock);
49 }

Figure 6.11: Lock-Based Parallel Double-Ended Queue Implementation



possible universe is this brain-dead solution in any way optimal???


In contrast to the hashed implementation presented in the previous section, the
compound implementation will build on a sequential implementation of a double-ended
queue that uses neither locks nor atomic operations.
Figure 6.12 shows the implementation. Unlike the hashed implementation, this
compound implementation is asymmetric, so that we must consider the pdeq_pop_
l() and pdeq_pop_r() implementations separately.
Quick Quiz 6.6: Why can’t the compound parallel double-ended queue implemen-
tation be symmetric?
The pdeq_pop_l() implementation is shown on lines 1-16 of the figure. Line 5
acquires the left-hand lock, which line 14 releases. Line 6 attempts to left-dequeue an
element from the left-hand underlying double-ended queue, and, if successful, skips
lines 8-13 to simply return this element. Otherwise, line 8 acquires the right-hand
lock, line 9 left-dequeues an element from the right-hand queue, and line 10 moves any
remaining elements on the right-hand queue to the left-hand queue, line 11 initializes
the right-hand queue, and line 12 releases the right-hand lock. The element, if any, that
was dequeued on line 9 will be returned.
The pdeq_pop_r() implementation is shown on lines 18-38 of the figure. As
before, line 22 acquires the right-hand lock (and line 36 releases it), and line 23 attempts
to right-dequeue an element from the right-hand queue, and, if successful, skips lines 24-
35 to simply return this element. However, if line 24 determines that there was no
element to dequeue, line 25 releases the right-hand lock and lines 26-27 acquire both
locks in the proper order. Line 28 then attempts to right-dequeue an element from the
right-hand list again, and if line 29 determines that this second attempt has failed, line 30
right-dequeues an element from the left-hand queue (if there is one available), line 31
moves any remaining elements from the left-hand queue to the right-hand queue, and
line 32 initializes the left-hand queue. Either way, line 34 releases the left-hand lock.
Quick Quiz 6.7: Why is it necessary to retry the right-dequeue operation on line 28
of Figure 6.12?
Quick Quiz 6.8: Surely the left-hand lock must sometimes be available!!! So why
is it necessary that line 25 of Figure 6.12 unconditionally release the right-hand lock?
The pdeq_push_l() implementation is shown on lines 40-47 of Figure 6.12.
Line 44 acquires the left-hand spinlock, line 45 left-enqueues the element onto the
left-hand queue, and finally line 46 releases the lock. The pdeq_push_r()
implementation (shown on lines 49-56) is quite similar.

6.1.2.5 Double-Ended Queue Discussion

The compound implementation is somewhat more complex than the hashed variant
presented in Section 6.1.2.3, but is still reasonably simple. Of course, a more intelligent
rebalancing scheme could be arbitrarily complex, but the simple scheme shown here
has been shown to perform well compared to software alternatives [DCW+ 11] and even
compared to algorithms using hardware assist [DLM+ 10]. Nevertheless, the best we
can hope for from such a scheme is 2x scalability, as at most two threads can be holding
the dequeue’s locks concurrently. This limitation also applies to algorithms based on
non-blocking synchronization, such as the compare-and-swap-based dequeue algorithm

1 struct cds_list_head *pdeq_pop_l(struct pdeq *d)


2 {
3 struct cds_list_head *e;
4
5 spin_lock(&d->llock);
6 e = deq_pop_l(&d->ldeq);
7 if (e == NULL) {
8 spin_lock(&d->rlock);
9 e = deq_pop_l(&d->rdeq);
10 cds_list_splice(&d->rdeq.chain, &d->ldeq.chain);
11 CDS_INIT_LIST_HEAD(&d->rdeq.chain);
12 spin_unlock(&d->rlock);
13 }
14 spin_unlock(&d->llock);
15 return e;
16 }
17
18 struct cds_list_head *pdeq_pop_r(struct pdeq *d)
19 {
20 struct cds_list_head *e;
21
22 spin_lock(&d->rlock);
23 e = deq_pop_r(&d->rdeq);
24 if (e == NULL) {
25 spin_unlock(&d->rlock);
26 spin_lock(&d->llock);
27 spin_lock(&d->rlock);
28 e = deq_pop_r(&d->rdeq);
29 if (e == NULL) {
30 e = deq_pop_r(&d->ldeq);
31 cds_list_splice(&d->ldeq.chain, &d->rdeq.chain);
32 CDS_INIT_LIST_HEAD(&d->ldeq.chain);
33 }
34 spin_unlock(&d->llock);
35 }
36 spin_unlock(&d->rlock);
37 return e;
38 }
39
40 void pdeq_push_l(struct cds_list_head *e, struct pdeq *d)
41 {
42 int i;
43
44 spin_lock(&d->llock);
45 deq_push_l(e, &d->ldeq);
46 spin_unlock(&d->llock);
47 }
48
49 void pdeq_push_r(struct cds_list_head *e, struct pdeq *d)
50 {
51 int i;
52
53 spin_lock(&d->rlock);
54 deq_push_r(e, &d->rdeq);
55 spin_unlock(&d->rlock);
56 }

Figure 6.12: Compound Parallel Double-Ended Queue Implementation



of Michael [Mic03].4
Quick Quiz 6.9: Why are there not one but two solutions to the double-ended queue
problem?
In fact, as noted by Dice et al. [DLM+ 10], an unsynchronized single-threaded
double-ended queue significantly outperforms any of the parallel implementations they
studied. Therefore, the key point is that there can be significant overhead enqueuing to
or dequeuing from a shared queue, regardless of implementation. This should come as
no surprise given the material in Chapter 3, given the strict FIFO nature of these queues.
Furthermore, these strict FIFO queues are strictly FIFO only with respect to lin-
earization points [HW90]5 that are not visible to the caller; in fact, in these examples,
the linearization points are buried in the lock-based critical sections. These queues
are not strictly FIFO with respect to (say) the times at which the individual operations
started [HKLP12]. This indicates that the strict FIFO property is not all that valuable in
concurrent programs, and in fact, Kirsch et al. present less-strict queues that provide
improved performance and scalability [KLP12].6 All that said, if you are pushing all
the data used by your concurrent program through a single queue, you really need to
rethink your overall design.

6.1.3 Partitioning Example Discussion


The optimal solution to the dining philosophers problem given in the answer to the
Quick Quiz in Section 6.1.1 is an excellent example of “horizontal parallelism” or “data
parallelism”. The synchronization overhead in this case is nearly (or even exactly)
zero. In contrast, the double-ended queue implementations are examples of “vertical
parallelism” or “pipelining”, given that data moves from one thread to another. The
tighter coordination required for pipelining in turn requires larger units of work to obtain
a given level of efficiency.
Quick Quiz 6.10: The tandem double-ended queue runs about twice as fast as
the hashed double-ended queue, even when I increase the size of the hash table to an
insanely large number. Why is that?
Quick Quiz 6.11: Is there a significantly better way of handling concurrency for
double-ended queues?
These two examples show just how powerful partitioning can be in devising parallel
algorithms. Section 6.3.5 looks briefly at a third example, matrix multiply. However, all
three of these examples beg for more and better design criteria for parallel programs, a
topic taken up in the next section.

6.2 Design Criteria


One way to obtain the best performance and scalability is to simply hack away until
you converge on the best possible parallel program. Unfortunately, if your program is
4 This paper is interesting in that it showed that special double-compare-and-swap (DCAS) instructions are

not needed for lock-free implementations of double-ended queues. Instead, the common compare-and-swap
(e.g., x86 cmpxchg) suffices.
5 In short, a linearization point is a single point within a given function where that function can be said

to have taken effect. In this lock-based implementation, the linearization points can be said to be anywhere
within the critical section that does the work.
6 Nir Shavit produced relaxed stacks for roughly the same reasons [Sha11]. This situation leads some to

believe that the linearization points are useful to theorists rather than developers, and leads others to wonder
to what extent the designers of such data structures and algorithms were considering the needs of their users.

other than microscopically tiny, the space of possible parallel programs is so huge that
convergence is not guaranteed in the lifetime of the universe. Besides, what exactly is
the “best possible parallel program”? After all, Section 2.2 called out no fewer than
three parallel-programming goals of performance, productivity, and generality, and
the best possible performance will likely come at a cost in terms of productivity and
generality. We clearly need to be able to make higher-level choices at design time in
order to arrive at an acceptably good parallel program before that program becomes
obsolete.
However, more detailed design criteria are required to actually produce a real-world
design, a task taken up in this section. This being the real world, these criteria often
conflict to a greater or lesser degree, requiring that the designer carefully balance the
resulting tradeoffs.
As such, these criteria may be thought of as the “forces” acting on the design, with
particularly good tradeoffs between these forces being called “design patterns” [Ale79,
GHJV95].
The design criteria for attaining the three parallel-programming goals are speedup,
contention, overhead, read-to-write ratio, and complexity:

Speedup: As noted in Section 2.2, increased performance is the major reason to go to


all of the time and trouble required to parallelize a program. Speedup is defined to be the
ratio of the time required to run a sequential version of the program to the time
required to run a parallel version.
Contention: If more CPUs are applied to a parallel program than can be kept busy
by that program, the excess CPUs are prevented from doing useful work by
contention. This may be lock contention, memory contention, or a host of other
performance killers.
Work-to-Synchronization Ratio: A uniprocessor, single-threaded, non-preemptible,
and non-interruptible7 version of a given parallel program would not need any
synchronization primitives. Therefore, any time consumed by these primitives
(including communication cache misses as well as message latency, locking
primitives, atomic instructions, and memory barriers) is overhead that does not
contribute directly to the useful work that the program is intended to accomplish.
Note that the important measure is the relationship between the synchroniza-
tion overhead and the overhead of the code in the critical section, with larger
critical sections able to tolerate greater synchronization overhead. The work-to-
synchronization ratio is related to the notion of synchronization efficiency.
Read-to-Write Ratio: A data structure that is rarely updated may often be replicated
rather than partitioned, and furthermore may be protected with asymmetric syn-
chronization primitives that reduce readers’ synchronization overhead at the
expense of that of writers, thereby reducing overall synchronization overhead.
Corresponding optimizations are possible for frequently updated data structures,
as discussed in Chapter 5.
Complexity: A parallel program is more complex than an equivalent sequential pro-
gram because the parallel program has a much larger state space than does the
sequential program, although these larger state spaces can in some cases be easily
understood given sufficient regularity and structure. A parallel programmer must
7 Either by masking interrupts or by being oblivious to them.

consider synchronization primitives, messaging, locking design, critical-section


identification, and deadlock in the context of this larger state space.
This greater complexity often translates to higher development and maintenance
costs. Therefore, budgetary constraints can limit the number and types of modifi-
cations made to an existing program, since a given degree of speedup is worth
only so much time and trouble. Worse yet, added complexity can actually reduce
performance and scalability.
Therefore, beyond a certain point, there may be potential sequential optimizations
that are cheaper and more effective than parallelization. As noted in Section 2.2.1,
parallelization is but one performance optimization of many, and is furthermore
an optimization that applies most readily to CPU-based bottlenecks.

These criteria will act together to enforce a maximum speedup. The first three criteria are
deeply interrelated, so the remainder of this section analyzes these interrelationships.8
Note that these criteria may also appear as part of the requirements specification.
For example, speedup may act as a relative desideratum (“the faster, the better”) or as
an absolute requirement of the workload (“the system must support at least 1,000,000
web hits per second”). Classic design pattern languages describe relative desiderata as
forces and absolute requirements as context.
An understanding of the relationships between these design criteria can be very
helpful when identifying appropriate design tradeoffs for a parallel program.
1. The less time a program spends in critical sections, the greater the potential
speedup. This is a consequence of Amdahl’s Law [Amd67] and of the fact that
only one CPU may execute within a given critical section at a given time.
More specifically, the fraction of time that the program spends in a given exclusive
critical section must be much less than the reciprocal of the number of CPUs for
the actual speedup to approach the number of CPUs. For example, a program
running on 10 CPUs must spend much less than one tenth of its time in the
most-restrictive critical section if it is to scale at all well (see the worked example
following this list).
2. Contention effects will consume the excess CPU and/or wallclock time should
the actual speedup be less than the number of available CPUs. The larger the
gap between the number of CPUs and the actual speedup, the less efficiently the
CPUs will be used. Similarly, the greater the desired efficiency, the smaller the
achievable speedup.

3. If the available synchronization primitives have high overhead compared to the


critical sections that they guard, the best way to improve speedup is to reduce
the number of times that the primitives are invoked (perhaps by batching critical
sections, using data ownership, using asymmetric primitives (see Chapter 9), or
by moving toward a more coarse-grained design such as code locking).

4. If the critical sections have high overhead compared to the primitives guarding
them, the best way to improve speedup is to increase parallelism by moving to
reader/writer locking, data locking, asymmetric primitives, or data ownership.

8 A real-world parallel system will be subject to many additional design criteria, such as data-structure

layout, memory size, memory-hierarchy latencies, bandwidth limitations, and I/O issues.

[Figure 6.13: Design Patterns and Lock Granularity: sequential program, code locking,
data locking, and data ownership, connected by partition/batch and own/disown
transitions]

5. If the critical sections have high overhead compared to the primitives guarding
them and the data structure being guarded is read much more often than modi-
fied, the best way to increase parallelism is to move to reader/writer locking or
asymmetric primitives.

6. Many changes that improve SMP performance, for example, reducing lock con-
tention, also improve real-time latencies [McK05c].
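As the worked example promised in item 1 above (a back-of-the-envelope application of
Amdahl's Law, not a model of any particular system), suppose that a fraction f of a
program's execution is serialized in its most-restrictive critical section. The speedup
attainable on n CPUs is then bounded by

\[
  S(n) \le \frac{1}{f + \frac{1 - f}{n}}
\]

so that, for example, f = 0.1 and n = 10 gives S <= 1/(0.1 + 0.09), or roughly 5.3,
barely half of the ideal tenfold speedup. This is why the serial fraction must be kept
well below 1/n if the program is to scale.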
Quick Quiz 6.12: Don’t all these problems with critical sections mean that we
should just always use non-blocking synchronization [Her90], which don’t have critical
sections?

6.3 Synchronization Granularity


Figure 6.13 gives a pictorial view of different levels of synchronization granularity, each
of which is described in one of the following sections. These sections focus primarily
on locking, but similar granularity issues arise with all forms of synchronization.

6.3.1 Sequential Program


If the program runs fast enough on a single processor, and has no interactions with
other processes, threads, or interrupt handlers, you should remove the synchronization
primitives and spare yourself their overhead and complexity. Some years back, there
were those who would argue that Moore’s Law would eventually force all programs
into this category. However, as can be seen in Figure 6.14, the exponential increase in
single-threaded performance halted in about 2003. Therefore, increasing performance
will increasingly require parallelism.9 The debate as to whether this new trend will
9 This plot shows clock frequencies for newer CPUs theoretically capable of retiring one or more
instructions per clock, and MIPS for older CPUs requiring multiple clocks to execute even the simplest
instruction. The reason for taking this approach is that the newer CPUs' ability to retire multiple
instructions per clock is typically limited by memory-system performance.

[Figure 6.14: MIPS/Clock-Frequency Trend for Intel CPUs: log-scale plot of CPU clock
frequency / MIPS versus year, 1975-2015]

result in single chips with thousands of CPUs will not be settled soon, but given that
Paul is typing this sentence on a dual-core laptop, the age of SMP does seem to be upon
us. It is also important to note that Ethernet bandwidth is continuing to grow, as shown
in Figure 6.15. This growth will motivate multithreaded servers in order to handle the
communications load.
Please note that this does not mean that you should code each and every program in
a multi-threaded manner. Again, if a program runs quickly enough on a single processor,
spare yourself the overhead and complexity of SMP synchronization primitives. The
simplicity of the hash-table lookup code in Figure 6.16 underscores this point.10 A key
point is that speedups due to parallelism are normally limited to the number of CPUs.
In contrast, speedups due to sequential optimizations, for example, careful choice of
data structure, can be arbitrarily large.
On the other hand, if you are not in this happy situation, read on!

6.3.2 Code Locking


Code locking is quite simple due to the fact that it uses only global locks.11 It is
especially easy to retrofit an existing program to use code locking in order to run it on a
multiprocessor. If the program has only a single shared resource, code locking will even
give optimal performance. However, many of the larger and more complex programs
require much of the execution to occur in critical sections, which in turn causes code
locking to sharply limit their scalability.
Therefore, you should use code locking on programs that spend only a small fraction
of their execution time in critical sections or from which only modest scaling is required.

10 The examples in this section are taken from Hart et al. [HMB06], adapted for clarity by gathering

related code from multiple files.


11 If your program instead has locks in data structures, or, in the case of Java, uses classes with synchronized

instances, you are instead using “data locking”, described in Section 6.3.3.

[Figure 6.15: Ethernet Bandwidth vs. Intel x86 CPU Performance: log-scale relative
performance versus year, 1970-2015, for Ethernet bandwidth and x86 CPU performance]

1 struct hash_table
2 {
3 long nbuckets;
4 struct node **buckets;
5 };
6
7 typedef struct node {
8 unsigned long key;
9 struct node *next;
10 } node_t;
11
12 int hash_search(struct hash_table *h, long key)
13 {
14 struct node *cur;
15
16 cur = h->buckets[key % h->nbuckets];
17 while (cur != NULL) {
18 if (cur->key >= key) {
19 return (cur->key == key);
20 }
21 cur = cur->next;
22 }
23 return 0;
24 }

Figure 6.16: Sequential-Program Hash Table Search



In these cases, code locking will provide a relatively simple program that is very similar
to its sequential counterpart, as can be seen in Figure 6.17. However, note that the
simple return of the comparison in hash_search() in Figure 6.16 has now become
three statements due to the need to release the lock before returning.
1 spinlock_t hash_lock;
2
3 struct hash_table
4 {
5 long nbuckets;
6 struct node **buckets;
7 };
8
9 typedef struct node {
10 unsigned long key;
11 struct node *next;
12 } node_t;
13
14 int hash_search(struct hash_table *h, long key)
15 {
16 struct node *cur;
17 int retval;
18
19 spin_lock(&hash_lock);
20 cur = h->buckets[key % h->nbuckets];
21 while (cur != NULL) {
22 if (cur->key >= key) {
23 retval = (cur->key == key);
24 spin_unlock(&hash_lock);
25 return retval;
26 }
27 cur = cur->next;
28 }
29 spin_unlock(&hash_lock);
30 return 0;
31 }

Figure 6.17: Code-Locking Hash Table Search

Unfortunately, code locking is particularly prone to “lock contention”, where mul-


tiple CPUs need to acquire the lock concurrently. SMP programmers who have taken
care of groups of small children (or groups of older people who are acting like children)
will immediately recognize the danger of having only one of something, as illustrated in
Figure 6.18.
One solution to this problem, named “data locking”, is described in the next section.

6.3.3 Data Locking


Many data structures may be partitioned, with each partition of the data structure having
its own lock. Then the critical sections for each part of the data structure can execute
in parallel, although only one instance of the critical section for a given part could
be executing at a given time. You should use data locking when contention must be
reduced, and where synchronization overhead is not limiting speedups. Data locking
reduces contention by distributing the instances of the overly-large critical section across
multiple data structures, for example, maintaining per-hash-bucket critical sections in
a hash table, as shown in Figure 6.19. The increased scalability again results in a
slight increase in complexity in the form of an additional data structure, the struct
bucket.
In contrast with the contentious situation shown in Figure 6.18, data locking helps
promote harmony, as illustrated by Figure 6.20—and in parallel programs, this almost
Figure 6.18: Lock Contention

always translates into increased performance and scalability. For this reason, data
locking was heavily used by Sequent in both its DYNIX and DYNIX/ptx operating
systems [BK85, Inm85, Gar90, Dov90, MD92, MG92, MS93].
However, as those who have taken care of small children can again attest, even
providing enough to go around is no guarantee of tranquillity. The analogous situation
can arise in SMP programs. For example, the Linux kernel maintains a cache of files
and directories (called “dcache”). Each entry in this cache has its own lock, but the
entries corresponding to the root directory and its direct descendants are much more
likely to be traversed than are more obscure entries. This can result in many CPUs
contending for the locks of these popular entries, resulting in a situation not unlike that
shown in Figure 6.21.
In many cases, algorithms can be designed to reduce the instance of data skew, and
in some cases eliminate it entirely (as appears to be possible with the Linux kernel’s
dcache [MSS04]). Data locking is often used for partitionable data structures such as
hash tables, as well as in situations where multiple entities are each represented by an
instance of a given data structure. The task list in version 2.6.17 of the Linux kernel is
an example of the latter, each task structure having its own proc_lock.
A key challenge with data locking on dynamically allocated structures is ensuring
that the structure remains in existence while the lock is being acquired. The code in
Figure 6.19 finesses this challenge by placing the locks in the statically allocated hash
buckets, which are never freed. However, this trick would not work if the hash table
were resizeable, so that the locks were now dynamically allocated. In this case, there
would need to be some means to prevent the hash bucket from being freed during the
time that its lock was being acquired.
Quick Quiz 6.13: What are some ways of preventing a structure from being freed
while its lock is being acquired?

1 struct hash_table
2 {
3 long nbuckets;
4 struct bucket **buckets;
5 };
6
7 struct bucket {
8 spinlock_t bucket_lock;
9 node_t *list_head;
10 };
11
12 typedef struct node {
13 unsigned long key;
14 struct node *next;
15 } node_t;
16
17 int hash_search(struct hash_table *h, long key)
18 {
19 struct bucket *bp;
20 struct node *cur;
21 int retval;
22
23 bp = h->buckets[key % h->nbuckets];
24 spin_lock(&bp->bucket_lock);
25 cur = bp->list_head;
26 while (cur != NULL) {
27 if (cur->key >= key) {
28 retval = (cur->key == key);
29 spin_unlock(&bp->bucket_lock);
30 return retval;
31 }
32 cur = cur->next;
33 }
34 spin_unlock(&bp->bucket_lock);
35 return 0;
36 }

Figure 6.19: Data-Locking Hash Table Search

6.3.4 Data Ownership


Data ownership partitions a given data structure over the threads or CPUs, so that
each thread/CPU accesses its subset of the data structure without any synchronization
overhead whatsoever. However, if one thread wishes to access some other thread’s data,
the first thread is unable to do so directly. Instead, the first thread must communicate
with the second thread, so that the second thread performs the operation on behalf of
the first, or, alternatively, migrates the data to the first thread.
Data ownership might seem arcane, but it is used very frequently:
1. Any variables accessible by only one CPU or thread (such as auto variables in
C and C++) are owned by that CPU or thread.
2. An instance of a user interface owns the corresponding user's context. It is
very common for applications interacting with parallel database engines to be
written as if they were entirely sequential programs. Such applications own the
user interface and the user's current action. Explicit parallelism is thus confined
to the database engine itself.
3. Parametric simulations are often trivially parallelized by granting each thread
ownership of a particular region of the parameter space. There are also computing
frameworks designed for this type of problem [UoC08].
If there is significant sharing, communication between the threads or CPUs can
result in significant complexity and overhead. Furthermore, if the most-heavily used data
Figure 6.20: Data Locking

happens to be that owned by a single CPU, that CPU will be a “hot spot”, sometimes
with results resembling that shown in Figure 6.21. However, in situations where no
sharing is required, data ownership achieves ideal performance, and with code that can
be as simple as the sequential-program case shown in Figure 6.16. Such situations
are often referred to as “embarrassingly parallel”, and, in the best case, resemble the
situation previously shown in Figure 6.20.
Another important instance of data ownership occurs when the data is read-only, in
which case, all threads can “own” it via replication.
Data ownership will be presented in more detail in Chapter 8.
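As a minimal sketch of data ownership, consider per-thread counters: each thread increments a variable that only it ever touches, so the fastpath needs no locks or atomic operations at all. This example is not taken from the book's CodeSamples; the names are invented, and pthreads is used purely for illustration.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NOPS 1000000

/* Each thread owns its own counter: no locks, no atomic operations. */
static __thread unsigned long my_count;

/* Results are handed back through this array; each slot is written only by
 * its owning thread and read only after pthread_join(). */
static unsigned long results[NTHREADS];

static void *worker(void *arg)
{
  long id = (long)arg;
  long i;

  for (i = 0; i < NOPS; i++)
    my_count++;            /* Owned data: the fastpath is a plain increment. */
  results[id] = my_count;  /* Publish; visibility is provided by pthread_join(). */
  return NULL;
}

int main(void)
{
  pthread_t tid[NTHREADS];
  unsigned long sum = 0;
  long i;

  for (i = 0; i < NTHREADS; i++)
    pthread_create(&tid[i], NULL, worker, (void *)i);
  for (i = 0; i < NTHREADS; i++) {
    pthread_join(tid[i], NULL);
    sum += results[i];
  }
  printf("total = %lu\n", sum);
  return 0;
}

Gathering the per-thread results after the threads exit is itself a (trivial) example of migrating the data to the thread that needs it.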

6.3.5 Locking Granularity and Performance


This section looks at locking granularity and performance from a mathematical synchronization-
efficiency viewpoint. Readers who are uninspired by mathematics might choose to skip
this section.
The approach is to use a crude queueing model for the efficiency of synchronization
mechanisms that operate on a single shared global variable, based on an M/M/1 queue.
M/M/1 queuing models are based on an exponentially distributed “inter-arrival rate”
λ and an exponentially distributed “service rate” µ. The inter-arrival rate λ can be
thought of as the average number of synchronization operations per second that the
system would process if the synchronization were free, in other words, λ is an inverse
measure of the overhead of each non-synchronization unit of work. For example, if each
unit of work was a transaction, and if each transaction took one millisecond to process,
excluding synchronization overhead, then λ would be 1,000 transactions per second.
The service rate µ is defined similarly, but for the average number of synchronization
operations per second that the system would process if the overhead of each transaction
was zero, and ignoring the fact that CPUs must wait on each other to complete
their synchronization operations, in other words, µ can be roughly thought of as the
synchronization overhead in absence of contention. For example, suppose that each
synchronization operation involves an atomic increment instruction, and that a computer
system is able to do an atomic increment every 25 nanoseconds on each CPU to a private
variable.12 The value of µ is therefore about 40,000,000 atomic increments per second.

Figure 6.21: Data Locking and Skew
Of course, the value of λ increases with increasing numbers of CPUs, as each CPU
is capable of processing transactions independently (again, ignoring synchronization):

λ = nλ0 (6.1)
where n is the number of CPUs and λ0 is the transaction-processing capability of a
single CPU. Note that the expected time for a single CPU to execute a single transaction
is 1/λ0 .
Because the CPUs have to “wait in line” behind each other to get their chance to
increment the single shared variable, we can use the M/M/1 queueing-model expression
for the expected total waiting time:
T = 1 / (µ − λ)    (6.2)

Substituting the above value of λ:

T = 1 / (µ − nλ0)    (6.3)
Now, the efficiency is just the ratio of the time required to process a transaction
in absence of synchronization (1/λ0 ) to the time required including synchronization
(T + 1/λ0 ):

12 Of course, if there are 8 CPUs all incrementing the same shared variable, then each CPU must wait
at least 175 nanoseconds for each of the other CPUs to do its increment before consuming an additional 25
nanoseconds doing its own increment. In actual fact, the wait will be longer due to the need to move the
variable from one CPU to another.
Figure 6.22: Synchronization Efficiency (efficiency versus number of CPUs/threads, for overhead ratios f = 10, 25, 50, 75, and 100)

e = (1/λ0) / (T + 1/λ0)    (6.4)
Substituting the above value for T and simplifying:
e = (µ/λ0 − n) / (µ/λ0 − (n − 1))    (6.5)

But the value of µ/λ0 is just the ratio of the time required to process the transaction
(absent synchronization overhead) to that of the synchronization overhead itself (absent
contention). If we call this ratio f, we have:

e = (f − n) / (f − (n − 1))    (6.6)
Figure 6.22 plots the synchronization efficiency e as a function of the number of
CPUs/threads n for a few values of the overhead ratio f . For example, again using the
25-nanosecond atomic increment, the f = 10 line corresponds to each CPU attempting
an atomic increment every 250 nanoseconds, and the f = 100 line corresponds to each
CPU attempting an atomic increment every 2.5 microseconds, which in turn corresponds
to several thousand instructions. Given that each trace drops off sharply with increasing
numbers of CPUs or threads, we can conclude that synchronization mechanisms based
on atomic manipulation of a single global shared variable will not scale well if used
heavily on current commodity hardware. This is a mathematical depiction of the forces
leading to the parallel counting algorithms that were discussed in Chapter 5.
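Readers who would like to reproduce the general shape of Figure 6.22 can tabulate Equation 6.6 directly. The following short program is an illustrative sketch only (it is not part of the book's CodeSamples) that prints the efficiency for a few values of f:

#include <stdio.h>

/* Synchronization efficiency from Equation 6.6: e = (f - n) / (f - (n - 1)). */
static double sync_eff(double f, double n)
{
  return (f - n) / (f - (n - 1.0));
}

int main(void)
{
  double f_vals[] = { 10, 25, 50, 75, 100 };
  int n, i;

  printf("n");
  for (i = 0; i < 5; i++)
    printf("\tf=%g", f_vals[i]);
  printf("\n");
  for (n = 1; n <= 100; n += 9) {
    printf("%d", n);
    for (i = 0; i < 5; i++)  /* Clamp at zero where the model breaks down (n >= f). */
      printf("\t%.3f", n < f_vals[i] ? sync_eff(f_vals[i], n) : 0.0);
    printf("\n");
  }
  return 0;
}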
The concept of efficiency is useful even in cases having little or no formal synchro-
nization. Consider for example a matrix multiply, in which the columns of one matrix
are multiplied (via “dot product”) by the rows of another, resulting in an entry in a
third matrix. Because none of these operations conflict, it is possible to partition the
columns of the first matrix among a group of threads, with each thread computing the
corresponding columns of the result matrix. The threads can therefore operate entirely
independently, with no synchronization overhead whatsoever, as is done in matmul.c.
One might therefore expect a parallel matrix multiply to have a perfect efficiency of 1.0.
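The following sketch shows the essential structure of such a partitioned multiply. It is a simplification for illustration only, not the actual matmul.c: the names, the matrix size, and the pthreads harness are all invented here, and each thread owns a contiguous band of result columns so that no synchronization is needed.

#include <pthread.h>

#define N 256
#define NTHREADS 4

static double a[N][N], b[N][N], c[N][N];

struct band {
  int col_lo;
  int col_hi;
};

/* Compute the result columns owned by this thread; no other thread writes them. */
static void *mult_band(void *arg)
{
  struct band *bp = arg;
  int i, j, k;

  for (j = bp->col_lo; j < bp->col_hi; j++)
    for (i = 0; i < N; i++) {
      double dot = 0.0;

      for (k = 0; k < N; k++)
        dot += a[i][k] * b[k][j];
      c[i][j] = dot;
    }
  return NULL;
}

int main(void)
{
  pthread_t tid[NTHREADS];
  struct band band[NTHREADS];
  int i, j, t;

  for (i = 0; i < N; i++)           /* Arbitrary test data. */
    for (j = 0; j < N; j++) {
      a[i][j] = i + j;
      b[i][j] = i - j;
    }
  for (t = 0; t < NTHREADS; t++) {  /* Partition the result columns among threads. */
    band[t].col_lo = t * N / NTHREADS;
    band[t].col_hi = (t + 1) * N / NTHREADS;
    pthread_create(&tid[t], NULL, mult_band, &band[t]);
  }
  for (t = 0; t < NTHREADS; t++)
    pthread_join(tid[t], NULL);
  return 0;
}

As the next figure shows, even this embarrassingly parallel kernel falls short of perfect efficiency in practice.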
Figure 6.23: Matrix Multiply Efficiency (efficiency versus number of CPUs/threads for 64-by-64 through 1024-by-1024 matrices)

However, Figure 6.23 tells a different story, especially for a 64-by-64 matrix multiply,
which never gets above an efficiency of about 0.7, even when running single-threaded.
The 512-by-512 matrix multiply’s efficiency is measurably less than 1.0 on as few as 10
threads, and even the 1024-by-1024 matrix multiply deviates noticeably from perfection
at a few tens of threads. Nevertheless, this figure clearly demonstrates the performance
and scalability benefits of batching: If you must incur synchronization overhead, you
may as well get your money’s worth.
Quick Quiz 6.14: How can a single-threaded 64-by-64 matrix multiply possibly
have an efficiency of less than 1.0? Shouldn't all of the traces in Figure 6.23 have
efficiency of exactly 1.0 when running on only one thread?
Given these inefficiencies, it is worthwhile to look into more-scalable approaches
such as the data locking described in Section 6.3.3 or the parallel-fastpath approach
discussed in the next section.
Quick Quiz 6.15: How are data-parallel techniques going to help with matrix
multiply? It is already data parallel!!!

6.4 Parallel Fastpath


Fine-grained (and therefore usually higher-performance) designs are typically more
complex than are coarser-grained designs. In many cases, most of the overhead is
incurred by a small fraction of the code [Knu73]. So why not focus effort on that small
fraction?
This is the idea behind the parallel-fastpath design pattern, to aggressively parallelize
the common-case code path without incurring the complexity that would be required to
aggressively parallelize the entire algorithm. You must understand not only the specific
algorithm you wish to parallelize, but also the workload that the algorithm will be
subjected to. Great creativity and design effort is often required to construct a parallel
fastpath.
Parallel fastpath combines different patterns (one for the fastpath, one elsewhere)
and is therefore a template pattern. The following instances of parallel fastpath occur
often enough to warrant their own patterns, as depicted in Figure 6.24:
Figure 6.24: Parallel-Fastpath Design Patterns (Reader/Writer Locking, RCU, Hierarchical Locking, and Allocator Caches, each an instance of Parallel Fastpath)

1. Reader/Writer Locking (described below in Section 6.4.1).

2. Read-copy update (RCU), which may be used as a high-performance replacement
for reader/writer locking, is introduced in Section 9.5, and will not be discussed
further in this chapter.

3. Hierarchical Locking ([McK96a]), which is touched upon in Section 6.4.2.

4. Resource Allocator Caches ([McK96a, MS93]). See Section 6.4.3 for more detail.

6.4.1 Reader/Writer Locking


If synchronization overhead is negligible (for example, if the program uses coarse-
grained parallelism with large critical sections), and if only a small fraction of the
critical sections modify data, then allowing multiple readers to proceed in parallel can
greatly increase scalability. Writers exclude both readers and each other. There are
many implementations of reader-writer locking, including the POSIX implementation
described in Section 4.2.4. Figure 6.25 shows how the hash search might be implemented
using reader-writer locking.
Reader/writer locking is a simple instance of asymmetric locking. Snaman [ST87]
describes a more ornate six-mode asymmetric locking design used in several clus-
tered systems. Locking in general and reader-writer locking in particular is described
extensively in Chapter 7.

6.4.2 Hierarchical Locking


The idea behind hierarchical locking is to have a coarse-grained lock that is held only
long enough to work out which fine-grained lock to acquire. Figure 6.26 shows how our
hash-table search might be adapted to do hierarchical locking, but also shows the great
weakness of this approach: we have paid the overhead of acquiring a second lock, but
we only hold it for a short time. In this case, the data-locking approach would be
simpler and would likely perform better.

1 rwlock_t hash_lock;
2
3 struct hash_table
4 {
5 long nbuckets;
6 struct node **buckets;
7 };
8
9 typedef struct node {
10 unsigned long key;
11 struct node *next;
12 } node_t;
13
14 int hash_search(struct hash_table *h, long key)
15 {
16 struct node *cur;
17 int retval;
18
19 read_lock(&hash_lock);
20 cur = h->buckets[key % h->nbuckets];
21 while (cur != NULL) {
22 if (cur->key >= key) {
23 retval = (cur->key == key);
24 read_unlock(&hash_lock);
25 return retval;
26 }
27 cur = cur->next;
28 }
29 read_unlock(&hash_lock);
30 return 0;
31 }

Figure 6.25: Reader-Writer-Locking Hash Table Search

Quick Quiz 6.16: In what situation would hierarchical locking work well?

6.4.3 Resource Allocator Caches


This section presents a simplified schematic of a parallel fixed-block-size memory
allocator. More detailed descriptions may be found in the literature [MG92, MS93,
BA01, MSK01] or in the Linux kernel [Tor03].

6.4.3.1 Parallel Resource Allocation Problem

The basic problem facing a parallel memory allocator is the tension between the need to
provide extremely fast memory allocation and freeing in the common case and the need
to efficiently distribute memory in face of unfavorable allocation and freeing patterns.
To see this tension, consider a straightforward application of data ownership to this
problem—simply carve up memory so that each CPU owns its share. For example,
suppose that a system with two CPUs has two gigabytes of memory (such as the one that
I am typing on right now). We could simply assign each CPU one gigabyte of memory,
and allow each CPU to access its own private chunk of memory, without the need for
locking and its complexities and overheads. Unfortunately, this simple scheme breaks
down if an algorithm happens to have CPU 0 allocate all of the memory and CPU 1
free it, as would happen in a simple producer-consumer workload.
The other extreme, code locking, suffers from excessive lock contention and over-
head [MS93].

1 struct hash_table
2 {
3 long nbuckets;
4 struct bucket **buckets;
5 };
6
7 struct bucket {
8 spinlock_t bucket_lock;
9 node_t *list_head;
10 };
11
12 typedef struct node {
13 spinlock_t node_lock;
14 unsigned long key;
15 struct node *next;
16 } node_t;
17
18 int hash_search(struct hash_table *h, long key)
19 {
20 struct bucket *bp;
21 struct node *cur;
22 int retval;
23
24 bp = h->buckets[key % h->nbuckets];
25 spin_lock(&bp->bucket_lock);
26 cur = bp->list_head;
27 while (cur != NULL) {
28 if (cur->key >= key) {
29 spin_lock(&cur->node_lock);
30 spin_unlock(&bp->bucket_lock);
31 retval = (cur->key == key);
32 spin_unlock(&cur->node_lock);
33 return retval;
34 }
35 cur = cur->next;
36 }
37 spin_unlock(&bp->bucket_lock);
38 return 0;
39 }

Figure 6.26: Hierarchical-Locking Hash Table Search

6.4.3.2 Parallel Fastpath for Resource Allocation


The commonly used solution uses parallel fastpath with each CPU owning a modest
cache of blocks, and with a large code-locked shared pool for additional blocks. To
prevent any given CPU from monopolizing the memory blocks, we place a limit on the
number of blocks that can be in each CPU’s cache. In a two-CPU system, the flow of
memory blocks will be as shown in Figure 6.27: when a given CPU is trying to free a
block when its pool is full, it sends blocks to the global pool, and, similarly, when that
CPU is trying to allocate a block when its pool is empty, it retrieves blocks from the
global pool.

6.4.3.3 Data Structures


The actual data structures for a “toy” implementation of allocator caches are shown
in Figure 6.28. The “Global Pool” of Figure 6.27 is implemented by globalmem of
type struct globalmempool, and the two CPU pools by the per-CPU variable
percpumem of type struct percpumempool. Both of these data structures have
arrays of pointers to blocks in their pool fields, which are filled from index zero up-
wards. Thus, if globalmem.pool[3] is NULL, then the remainder of the array from
index 4 up must also be NULL. The cur fields contain the index of the highest-numbered
full element of the pool array, or −1 if all elements are empty. All elements from
Figure 6.27: Allocator Cache Schematic (per-CPU pools owned by CPU 0 and CPU 1 exchange blocks with a code-locked Global Pool on overflow and when empty; allocate/free requests are served from the per-CPU pools)


1 #define TARGET_POOL_SIZE 3
2 #define GLOBAL_POOL_SIZE 40
3
4 struct globalmempool {
5 spinlock_t mutex;
6 int cur;
7 struct memblock *pool[GLOBAL_POOL_SIZE];
8 } globalmem;
9
10 struct percpumempool {
11 int cur;
12 struct memblock *pool[2 * TARGET_POOL_SIZE];
13 };
14
15 DEFINE_PER_THREAD(struct percpumempool, percpumem);

Figure 6.28: Allocator-Cache Data Structures

globalmem.pool[0] through globalmem.pool[globalmem.cur] must be
full, and all the rest must be empty.13
The operation of the pool data structures is illustrated by Figure 6.29, with the six
boxes representing the array of pointers making up the pool field, and the number
preceding them representing the cur field. The shaded boxes represent non-NULL
pointers, while the empty boxes represent NULL pointers. An important, though po-
tentially confusing, invariant of this data structure is that the cur field is always one
smaller than the number of non-NULL pointers.
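Where this invariant matters, a debug-only check such as the following hypothetical helper makes it executable. It assumes the data structures of Figure 6.28 (only an opaque struct memblock is needed here) and should be called with whatever protection the pool in question requires, for example while holding globalmem.mutex for the global pool:

#include <assert.h>
#include <stddef.h>

struct memblock;  /* Opaque here; defined elsewhere in the allocator. */

/* Verify that elements 0 through cur are non-NULL and that all later
 * elements are NULL, so that cur is always one smaller than the number
 * of non-NULL pointers. */
static void pool_check(struct memblock **pool, int cur, int npool)
{
  int i;

  for (i = 0; i < npool; i++) {
    if (i <= cur)
      assert(pool[i] != NULL);
    else
      assert(pool[i] == NULL);
  }
}

A call such as pool_check(globalmem.pool, globalmem.cur, GLOBAL_POOL_SIZE) then catches violations while single-stepping or stress-testing the toy allocator.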

6.4.3.4 Allocation Function


The allocation function memblock_alloc() may be seen in Figure 6.30. Line 7
picks up the current thread's per-thread pool, and line 8 checks to see if it is empty.
If so, lines 9-16 attempt to refill it from the global pool under the spinlock acquired

13 Both pool sizes (TARGET_POOL_SIZE and GLOBAL_POOL_SIZE) are unrealistically small, but
this small size makes it easier to single-step the program in order to get a feel for its operation.
Figure 6.29: Allocator Pool Schematic (pool states ranging from empty, with cur equal to −1, up through full)


1 struct memblock *memblock_alloc(void)
2 {
3 int i;
4 struct memblock *p;
5 struct percpumempool *pcpp;
6
7 pcpp = &__get_thread_var(percpumem);
8 if (pcpp->cur < 0) {
9 spin_lock(&globalmem.mutex);
10 for (i = 0; i < TARGET_POOL_SIZE &&
11 globalmem.cur >= 0; i++) {
12 pcpp->pool[i] = globalmem.pool[globalmem.cur];
13 globalmem.pool[globalmem.cur--] = NULL;
14 }
15 pcpp->cur = i - 1;
16 spin_unlock(&globalmem.mutex);
17 }
18 if (pcpp->cur >= 0) {
19 p = pcpp->pool[pcpp->cur];
20 pcpp->pool[pcpp->cur--] = NULL;
21 return p;
22 }
23 return NULL;
24 }

Figure 6.30: Allocator-Cache Allocator Function

on line 9 and released on line 16. Lines 10-14 move blocks from the global to the
per-thread pool until either the local pool reaches its target size (half full) or the global
pool is exhausted, and line 15 sets the per-thread pool’s count to the proper value.
In either case, line 18 checks for the per-thread pool still being empty, and if not,
lines 19-21 remove a block and return it. Otherwise, line 23 tells the sad tale of memory
exhaustion.

6.4.3.5 Free Function


Figure 6.31 shows the memory-block free function. Line 6 gets a pointer to this thread’s
pool, and line 7 checks to see if this per-thread pool is full.
If so, lines 8-15 empty half of the per-thread pool into the global pool, with lines 8
6.4. PARALLEL FASTPATH 121

1 void memblock_free(struct memblock *p)


2 {
3 int i;
4 struct percpumempool *pcpp;
5
6 pcpp = &__get_thread_var(percpumem);
7 if (pcpp->cur >= 2 * TARGET_POOL_SIZE - 1) {
8 spin_lock(&globalmem.mutex);
9 for (i = pcpp->cur; i >= TARGET_POOL_SIZE; i--) {
10 globalmem.pool[++globalmem.cur] = pcpp->pool[i];
11 pcpp->pool[i] = NULL;
12 }
13 pcpp->cur = i;
14 spin_unlock(&globalmem.mutex);
15 }
16 pcpp->pool[++pcpp->cur] = p;
17 }

Figure 6.31: Allocator-Cache Free Function

and 14 acquiring and releasing the spinlock. Lines 9-12 implement the loop moving
blocks from the local to the global pool, and line 13 sets the per-thread pool’s count to
the proper value.
In either case, line 16 then places the newly freed block into the per-thread pool.

6.4.3.6 Performance
Rough performance results14 are shown in Figure 6.32, running on a dual-core Intel
x86 running at 1GHz (4300 bogomips per CPU) with at most six blocks allowed in
each CPU’s cache. In this micro-benchmark, each thread repeatedly allocates a group
of blocks and then frees all the blocks in that group, with the number of blocks in the
group being the “allocation run length” displayed on the x-axis. The y-axis shows the
number of successful allocation/free pairs per microsecond—failed allocations are not
counted. The “X”s are from a two-thread run, while the “+”s are from a single-threaded
run.
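The heart of each such micro-benchmark thread might look like the following hedged sketch, which assumes the memblock_alloc() and memblock_free() functions of Figures 6.30 and 6.31; MAX_RUN_LENGTH is a hypothetical bound, and the actual test harness is in smpalloc.c.

#define MAX_RUN_LENGTH 64  /* Hypothetical upper bound on the run length. */

/* One pass of the micro-benchmark: allocate a run of runlen blocks, then
 * free whatever was successfully allocated. Returns the number of
 * successful allocation/free pairs; failed allocations are not counted. */
static long alloc_free_pass(int runlen)
{
  struct memblock *blocks[MAX_RUN_LENGTH];
  long npairs = 0;
  int i, n = 0;

  for (i = 0; i < runlen && i < MAX_RUN_LENGTH; i++) {
    blocks[n] = memblock_alloc();
    if (blocks[n] != NULL)
      n++;               /* Count only successful allocations. */
  }
  for (i = 0; i < n; i++) {
    memblock_free(blocks[i]);
    npairs++;            /* One allocation/free pair completed. */
  }
  return npairs;
}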
Note that run lengths up to six scale linearly and give excellent performance, while
run lengths greater than six show poor performance and almost always also show nega-
tive scaling. It is therefore quite important to size TARGET_POOL_SIZE sufficiently
large, which fortunately is usually quite easy to do in actual practice [MSK01], espe-
cially given today’s large memories. For example, in most systems, it is quite reasonable
to set TARGET_POOL_SIZE to 100, in which case allocations and frees are guaranteed
to be confined to per-thread pools at least 99% of the time.
As can be seen from the figure, the situations where the common-case data-ownership
applies (run lengths up to six) provide greatly improved performance compared to the
cases where locks must be acquired. Avoiding synchronization in the common case will
be a recurring theme through this book.
Quick Quiz 6.17: In Figure 6.32, there is a pattern of performance rising with
increasing run length in groups of three samples, for example, for run lengths 10, 11,
and 12. Why?
Quick Quiz 6.18: Allocation failures were observed in the two-thread tests at run
lengths of 19 and greater. Given the global-pool size of 40 and the per-thread target
pool size s of three, number of threads n equal to two, and assuming that the per-thread
pools are initially empty with none of the memory in use, what is the smallest allocation
run length m at which failures can occur? (Recall that each thread repeatedly allocates
m blocks of memory, and then frees the m blocks of memory.) Alternatively, given n
threads each with pool size s, and where each thread repeatedly first allocates m blocks
of memory and then frees those m blocks, how large must the global pool size be? Note:
Obtaining the correct answer will require you to examine the smpalloc.c source
code, and very likely single-step it as well. You have been warned!

14 This data was not collected in a statistically meaningful way, and therefore should be viewed with great
skepticism and suspicion. Good data-collection and -reduction practice is discussed in Chapter 11. That said,
repeated runs gave similar results, and these results match more careful evaluations of similar algorithms.

Figure 6.32: Allocator Cache Performance (allocations/frees per microsecond versus allocation run length; "X"s from a two-thread run, "+"s from a single-threaded run)

6.4.3.7 Real-World Design

The toy parallel resource allocator was quite simple, but real-world designs expand on
this approach in a number of ways.
First, real-world allocators are required to handle a wide range of allocation sizes,
as opposed to the single size shown in this toy example. One popular way to do this is
to offer a fixed set of sizes, spaced so as to balance external and internal fragmentation,
such as in the late-1980s BSD memory allocator [MK88]. Doing this would mean that
the “globalmem” variable would need to be replicated on a per-size basis, and that the
associated lock would similarly be replicated, resulting in data locking rather than the
toy program’s code locking.
Second, production-quality systems must be able to repurpose memory, meaning
that they must be able to coalesce blocks into larger structures, such as pages [MS93].
This coalescing will also need to be protected by a lock, which again could be replicated
on a per-size basis.
Third, coalesced memory must be returned to the underlying memory system, and
pages of memory must also be allocated from the underlying memory system. The
locking required at this level will depend on that of the underlying memory system, but
could well be code locking. Code locking can often be tolerated at this level, because
this level is so infrequently reached in well-designed systems [MSK01].
Despite this real-world design’s greater complexity, the underlying idea is the same—
repeated application of parallel fastpath, as shown in Table 6.1.
Level               Locking          Purpose
Per-thread pool     Data ownership   High-speed allocation
Global block pool   Data locking     Distributing blocks among threads
Coalescing          Data locking     Combining blocks into pages
System memory       Code locking     Memory from/to system

Table 6.1: Schematic of Real-World Parallel Allocator

6.5 Beyond Partitioning


This chapter has discussed how data partitioning can be used to design simple linearly
scalable parallel programs. Section 6.3.4 hinted at the possibilities of data replication,
which will be used to great effect in Section 9.5.
The main goal of applying partitioning and replication is to achieve linear speedups,
in other words, to ensure that the total amount of work required does not increase
significantly as the number of CPUs or threads increases. A problem that can be
solved via partitioning and/or replication, resulting in linear speedups, is embarrassingly
parallel. But can we do better?
To answer this question, let us examine the solution of labyrinths and mazes. Of
course, labyrinths and mazes have been objects of fascination for millennia [Wik12],
so it should come as no surprise that they are generated and solved using computers,
including biological computers [Ada11], GPGPUs [Eri08], and even discrete hard-
ware [KFC11]. Parallel solution of mazes is sometimes used as a class project in
universities [ETH11, Uni10] and as a vehicle to demonstrate the benefits of parallel-
programming frameworks [Fos10].
Common advice is to use a parallel work-queue algorithm (PWQ) [ETH11, Fos10].
This section evaluates this advice by comparing PWQ against a sequential algorithm
(SEQ) and also against an alternative parallel algorithm, in all cases solving randomly
generated square mazes. Section 6.5.1 discusses PWQ, Section 6.5.2 discusses an
alternative parallel algorithm, Section 6.5.3 analyzes its anomalous performance, Sec-
tion 6.5.4 derives an improved sequential algorithm from the alternative parallel algo-
rithm, Section 6.5.5 makes further performance comparisons, and finally Section 6.5.6
presents future directions and concluding remarks.

6.5.1 Work-Queue Parallel Maze Solver


PWQ is based on SEQ, which is shown in Figure 6.33 (maze_seq.c). The maze
is represented by a 2D array of cells and a linear-array-based work queue named
->visited.
Line 7 visits the initial cell, and each iteration of the loop spanning lines 8-21
traverses passages headed by one cell. The loop spanning lines 9-13 scans the ->
visited[] array for a visited cell with an unvisited neighbor, and the loop spanning
lines 14-19 traverses one fork of the submaze headed by that neighbor. Line 20 initializes
for the next pass through the outer loop.
The pseudocode for maze_try_visit_cell() is shown on lines 1-12 of Fig-
ure 6.34 (maze.c). Line 4 checks to see if cells c and n are adjacent and connected,
while line 5 checks to see if cell n has not yet been visited. The celladdr() function
returns the address of the specified cell. If either check fails, line 6 returns failure. Line 7
indicates the next cell, line 8 records this cell in the next slot of the ->visited[]
array, line 9 indicates that this slot is now full, and line 10 marks this cell as visited and
also records the distance from the maze start. Line 11 then returns success.

1 int maze_solve(maze *mp, cell sc, cell ec)


2 {
3 cell c = sc;
4 cell n;
5 int vi = 0;
6
7 maze_try_visit_cell(mp, c, c, &n, 1);
8 for (;;) {
9 while (!maze_find_any_next_cell(mp, c, &n)) {
10 if (++vi >= mp->vi)
11 return 0;
12 c = mp->visited[vi].c;
13 }
14 do {
15 if (n == ec) {
16 return 1;
17 }
18 c = n;
19 } while (maze_find_any_next_cell(mp, c, &n));
20 c = mp->visited[vi].c;
21 }
22 }

Figure 6.33: SEQ Pseudocode

1 int maze_try_visit_cell(struct maze *mp, cell c, cell t,


2 cell *n, int d)
3 {
4 if (!maze_cells_connected(mp, c, t) ||
5 (*celladdr(mp, t) & VISITED))
6 return 0;
7 *n = t;
8 mp->visited[mp->vi] = t;
9 mp->vi++;
10 *celladdr(mp, t) |= VISITED | d;
11 return 1;
12 }
13
14 int maze_find_any_next_cell(struct maze *mp, cell c,
15 cell *n)
16 {
17 int d = (*celladdr(mp, c) & DISTANCE) + 1;
18
19 if (maze_try_visit_cell(mp, c, prevcol(c), n, d))
20 return 1;
21 if (maze_try_visit_cell(mp, c, nextcol(c), n, d))
22 return 1;
23 if (maze_try_visit_cell(mp, c, prevrow(c), n, d))
24 return 1;
25 if (maze_try_visit_cell(mp, c, nextrow(c), n, d))
26 return 1;
27 return 0;
28 }

Figure 6.34: SEQ Helper Pseudocode


Figure 6.35: Cell-Number Solution Tracking (a fragment of the maze with cell distances 1 2 3 / 2 3 4 / 3 4 5 counted from the upper-left starting cell)

Figure 6.36: CDF of Solution Times For SEQ and PWQ (cumulative probability versus solution time in milliseconds)

The pseudocode for maze_find_any_next_cell() is shown on lines 14-28


of Figure 6.34 (maze.c). Line 17 picks up the current cell’s distance plus 1, while
lines 19, 21, 23, and 25 check the cell in each direction, and lines 20, 22, 24, and 26 return
true if the corresponding cell is a candidate next cell. The prevcol(), nextcol(),
prevrow(), and nextrow() each do the specified array-index-conversion operation.
If none of the cells is a candidate, line 27 returns false.
The path is recorded in the maze by counting the number of cells from the starting
point, as shown in Figure 6.35, where the starting cell is in the upper left and the
ending cell is in the lower right. Starting at the ending cell and following consecutively
decreasing cell numbers traverses the solution.
The parallel work-queue solver is a straightforward parallelization of the algorithm
shown in Figures 6.33 and 6.34. Line 10 of Figure 6.33 must use fetch-and-add, and
the local variable vi must be shared among the various threads. Lines 5 and 10 of
Figure 6.34 must be combined into a CAS loop, with CAS failure indicating a loop
in the maze. Lines 8-9 of this figure must use fetch-and-add to arbitrate concurrent
attempts to record cells in the ->visited[] array.
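The following stand-alone toy program (not the actual PWQ code; the names are invented) demonstrates the kind of arbitration involved: each thread claims unique slots in a shared array using GCC's __sync_fetch_and_add() builtin, so that concurrent recorders never overwrite each other's entries.

#include <pthread.h>
#include <stdio.h>

#define NRECORDS 1000
#define NTHREADS 4

static int records[NRECORDS];
static int next_slot;  /* Shared index into records[]. */

/* Each thread atomically claims the next slot, so no two threads ever
 * write the same entry, just as PWQ must arbitrate ->visited[]. */
static void *recorder(void *arg)
{
  long val = (long)arg;
  int i;

  for (i = 0; i < NRECORDS / NTHREADS; i++) {
    int slot = __sync_fetch_and_add(&next_slot, 1);

    records[slot] = (int)val;
  }
  return NULL;
}

int main(void)
{
  pthread_t tid[NTHREADS];
  long t;

  for (t = 0; t < NTHREADS; t++)
    pthread_create(&tid[t], NULL, recorder, (void *)t);
  for (t = 0; t < NTHREADS; t++)
    pthread_join(tid[t], NULL);
  printf("slots used: %d\n", next_slot);
  return 0;
}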
This approach does provide significant speedups on a dual-CPU Lenovo™ W500
running at 2.53GHz, as shown in Figure 6.36, which shows the cumulative distribution
functions (CDFs) for the solution times of the two algorithms, based on the solution of
500 different square 500-by-500 randomly generated mazes. The substantial overlap of
the projection of the CDFs onto the x-axis will be addressed in Section 6.5.3.
Interestingly enough, the sequential solution-path tracking works unchanged for
the parallel algorithm. However, this uncovers a significant weakness in the parallel

1 int maze_solve_child(maze *mp, cell *visited, cell sc)


2 {
3 cell c;
4 cell n;
5 int vi = 0;
6
7 myvisited = visited; myvi = &vi;
8 c = visited[vi];
9 do {
10 while (!maze_find_any_next_cell(mp, c, &n)) {
11 if (visited[++vi].row < 0)
12 return 0;
13 if (ACCESS_ONCE(mp->done))
14 return 1;
15 c = visited[vi];
16 }
17 do {
18 if (ACCESS_ONCE(mp->done))
19 return 1;
20 c = n;
21 } while (maze_find_any_next_cell(mp, c, &n));
22 c = visited[vi];
23 } while (!ACCESS_ONCE(mp->done));
24 return 1;
25 }

Figure 6.37: Partitioned Parallel Solver Pseudocode

algorithm: At most one thread may be making progress along the solution path at any
given time. This weakness is addressed in the next section.

6.5.2 Alternative Parallel Maze Solver


Youthful maze solvers are often urged to start at both ends, and this advice has been
repeated more recently in the context of automated maze solving [Uni10]. This advice
amounts to partitioning, which has been a powerful parallelization strategy in the
context of parallel programming for both operating-system kernels [BK85, Inm85] and
applications [Pat10]. This section applies this strategy, using two child threads that start
at opposite ends of the solution path, and takes a brief look at the performance and
scalability consequences.
The partitioned parallel algorithm (PART), shown in Figure 6.37 (maze_part.c),
is similar to SEQ, but has a few important differences. First, each child thread has
its own visited array, passed in by the parent as shown on line 1, which must be
initialized to all [−1, −1]. Line 7 stores a pointer to this array into the per-thread
variable myvisited to allow access by helper functions, and similarly stores a pointer
to the local visit index. Second, the parent visits the first cell on each child’s behalf,
which the child retrieves on line 8. Third, the maze is solved as soon as one child
locates a cell that has been visited by the other child. When maze_try_visit_
cell() detects this, it sets a ->done field in the maze structure. Fourth, each child
must therefore periodically check the ->done field, as shown on lines 13, 18, and 23.
The ACCESS_ONCE() primitive must disable any compiler optimizations that might
combine consecutive loads or that might reload the value. A C++11 volatile relaxed
load suffices [Bec11]. Finally, the maze_find_any_next_cell() function must
use compare-and-swap to mark a cell as visited, however no constraints on ordering are
required beyond those provided by thread creation and join.
The pseudocode for maze_find_any_next_cell() is identical to that shown
in Figure 6.34, but the pseudocode for maze_try_visit_cell() differs, and is

1 int maze_try_visit_cell(struct maze *mp, int c, int t,


2 int *n, int d)
3 {
4 cell_t t;
5 cell_t *tp;
6 int vi;
7
8 if (!maze_cells_connected(mp, c, t))
9 return 0;
10 tp = celladdr(mp, t);
11 do {
12 t = ACCESS_ONCE(*tp);
13 if (t & VISITED) {
14 if ((t & TID) != mytid)
15 mp->done = 1;
16 return 0;
17 }
18 } while (!CAS(tp, t, t | VISITED | mytid | d));
19 *n = t;
20 vi = (*myvi)++;
21 myvisited[vi] = t;
22 return 1;
23 }

Figure 6.38: Partitioned Parallel Helper Pseudocode


Figure 6.39: CDF of Solution Times For SEQ, PWQ, and PART (cumulative probability versus solution time in milliseconds)

shown in Figure 6.38. Lines 8-9 check to see if the cells are connected, returning failure
if not. The loop spanning lines 11-18 attempts to mark the new cell visited. Line 13
checks to see if it has already been visited, in which case line 16 returns failure, but only
after line 14 checks to see if we have encountered the other thread, in which case line 15
indicates that the solution has been located. Line 19 updates to the new cell, lines 20
and 21 update this thread’s visited array, and line 22 returns success.
Performance testing revealed a surprising anomaly, shown in Figure 6.39. The
median solution time for PART (17 milliseconds) is more than four times faster than
that of SEQ (79 milliseconds), despite running on only two threads. The next section
analyzes this anomaly.

6.5.3 Performance Comparison I


The first reaction to a performance anomaly is to check for bugs. Although the algorithms
were in fact finding valid solutions, the plot of CDFs in Figure 6.39 assumes independent
data points. This is not the case: The performance tests randomly generate a maze, and
Figure 6.40: CDF of SEQ/PWQ and SEQ/PART Solution-Time Ratios (cumulative probability versus speedup relative to SEQ, log scale)

Figure 6.41: Reason for Small Visit Percentages

then run all solvers on that maze. It therefore makes sense to plot the CDF of the ratios
of solution times for each generated maze, as shown in Figure 6.40, greatly reducing
the CDFs’ overlap. This plot reveals that for some mazes, PART is more than forty
times faster than SEQ. In contrast, PWQ is never more than about two times faster than
SEQ. A forty-times speedup on two threads demands explanation. After all, this is
not merely embarrassingly parallel, where partitionability means that adding threads
does not increase the overall computational cost. It is instead humiliatingly parallel:
Adding threads significantly reduces the overall computational cost, resulting in large
algorithmic superlinear speedups.
Further investigation showed that PART sometimes visited fewer than 2% of the
maze’s cells, while SEQ and PWQ never visited fewer than about 9%. The reason for
this difference is shown by Figure 6.41. If the thread traversing the solution from the
upper left reaches the circle, the other thread cannot reach the upper-right portion of
the maze. Similarly, if the other thread reaches the square, the first thread cannot reach
the lower-left portion of the maze. Therefore, PART will likely visit a small fraction of
the non-solution-path cells. In short, the superlinear speedups are due to threads getting
in each others’ way. This is a sharp contrast with decades of experience with parallel
programming, where workers have struggled to keep threads out of each others’ way.
Figure 6.42 confirms a strong correlation between cells visited and solution time
for all three methods. The slope of PART’s scatterplot is smaller than that of SEQ,
indicating that PART’s pair of threads visits a given fraction of the maze faster than can
SEQ’s single thread. PART’s scatterplot is also weighted toward small visit percentages,
Figure 6.42: Correlation Between Visit Percentage and Solution Time (scatterplot of solution time in milliseconds versus percent of maze cells visited, for SEQ, PWQ, and PART)

Figure 6.43: PWQ Potential Contention Points

confirming that PART does less total work, hence the observed humiliating parallelism.
The fraction of cells visited by PWQ is similar to that of SEQ. In addition, PWQ’s
solution time is greater than that of PART, even for equal visit fractions. The reason for
this is shown in Figure 6.43, which has a red circle on each cell with more than two
neighbors. Each such cell can result in contention in PWQ, because one thread can enter
but two threads can exit, which hurts performance, as noted earlier in this chapter. In
contrast, PART can incur such contention but once, namely when the solution is located.
Of course, SEQ never contends.
Although PART’s speedup is impressive, we should not neglect sequential optimiza-
tions. Figure 6.44 shows that SEQ, when compiled with -O3, is about twice as fast as
unoptimized PWQ, approaching the performance of unoptimized PART. Compiling all
three algorithms with -O3 gives results similar to (albeit faster than) those shown in
Figure 6.40, except that PWQ provides almost no speedup compared to SEQ, in keeping
with Amdahl’s Law [Amd67]. However, if the goal is to double performance compared
to unoptimized SEQ, as opposed to achieving optimality, compiler optimizations are
quite attractive.
Cache alignment and padding often improves performance by reducing false sharing.
However, for these maze-solution algorithms, aligning and padding the maze-cell array
degrades performance by up to 42% for 1000x1000 mazes. Cache locality is more
important than avoiding false sharing, especially for large mazes. For smaller 20-by-
20 or 50-by-50 mazes, aligning and padding can produce up to a 40% performance
improvement for PART, but for these small sizes, SEQ performs better anyway because
Figure 6.44: Effect of Compiler Optimization (-O3) (CDF of speedup relative to SEQ for SEQ -O3, PWQ, and PART)

Figure 6.45: Partitioned Coroutines (CDF of speedup relative to SEQ -O3 for COPART, PWQ, and PART)

there is insufficient time for PART to make up for the overhead of thread creation and
destruction.
In short, the partitioned parallel maze solver is an interesting example of an algo-
rithmic superlinear speedup. If “algorithmic superlinear speedup” causes cognitive
dissonance, please proceed to the next section.

6.5.4 Alternative Sequential Maze Solver

The presence of algorithmic superlinear speedups suggests simulating parallelism via
co-routines, for example, manually switching context between threads on each pass
through the main do-while loop in Figure 6.37. This context switching is straightforward
because the context consists only of the variables c and vi: Of the numerous ways to
achieve the effect, this is a good tradeoff between context-switch overhead and visit
percentage. As can be seen in Figure 6.45, this coroutine algorithm (COPART) is quite
effective, with the performance on one thread being within about 30% of PART on two
threads (maze_2seq.c).
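The following toy program (entirely invented for illustration; it is not maze_2seq.c) shows the alternation pattern in miniature: two logical searchers scan an array from opposite ends, and a single thread switches between their two small contexts on each pass through the loop, much as COPART alternates between its two solvers' c and vi.

#include <stdio.h>

/* The entire per-searcher context: a position and a direction. */
struct ctx {
  int pos;
  int step;
};

int main(void)
{
  int data[16], target = 11, i, which = 0;
  struct ctx ctx[2] = { { 0, 1 }, { 15, -1 } };

  for (i = 0; i < 16; i++)
    data[i] = i;
  for (;;) {
    struct ctx *cp = &ctx[which];

    if (data[cp->pos] == target) {
      printf("found by context %d at index %d\n", which, cp->pos);
      break;
    }
    cp->pos += cp->step;  /* Advance this context by one step. */
    which = !which;       /* "Context switch" to the other searcher. */
  }
  return 0;
}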
Figure 6.46: Varying Maze Size vs. SEQ (speedup relative to SEQ -O3 versus maze size, for PART and PWQ)

Figure 6.47: Varying Maze Size vs. COPART (speedup relative to COPART -O3 versus maze size, for PART and PWQ)

6.5.5 Performance Comparison II


Figures 6.46 and 6.47 show the effects of varying maze size, comparing both PWQ
and PART running on two threads against either SEQ or COPART, respectively, with
90%-confidence error bars. PART shows superlinear scalability against SEQ and modest
scalability against COPART for 100-by-100 and larger mazes. PART exceeds theoretical
energy-efficiency breakeven against COPART at roughly the 200-by-200 maze size,
given that power consumption rises as roughly the square of the frequency for high
frequencies [Mud00], so that 1.4x scaling on two threads consumes the same energy as
a single thread at equal solution speeds. In contrast, PWQ shows poor scalability against
both SEQ and COPART unless unoptimized: Figures 6.46 and 6.47 were generated
using -O3.
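A back-of-the-envelope check of that breakeven figure, using the stated quadratic model (and ignoring static power and voltage-scaling details):

P(f) ∝ f²  ⇒  P(two threads at f) ≈ 2f² = (√2 f)² ≈ P(one thread at 1.4f)

Matching that faster single thread's solution speed therefore requires a two-thread speedup of about √2 ≈ 1.4, which is the energy-efficiency breakeven quoted above.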
Figure 6.48 shows the performance of PWQ and PART relative to COPART. For
PART runs with more than two threads, the additional threads were started evenly
spaced along the diagonal connecting the starting and ending cells. Simplified link-state
routing [BG87] was used to detect early termination on PART runs with more than
two threads (the solution is flagged when a thread is connected to both beginning and
end). PWQ performs quite poorly, but PART hits breakeven at two threads and again
at five threads, achieving modest speedups beyond five threads. Theoretical energy-
efficiency breakeven is within the 90% confidence interval for seven and eight threads.

Figure 6.48: Mean Speedup vs. Number of Threads, 1000x1000 Maze (mean speedup relative to COPART -O3 versus number of threads, for PART and PWQ)
The reasons for the peak at two threads are (1) the lower complexity of termination
detection in the two-thread case and (2) the fact that there is a lower probability of the
third and subsequent threads making useful forward progress: Only the first two threads
are guaranteed to start on the solution line. This disappointing performance compared
to results in Figure 6.47 is due to the less-tightly integrated hardware available in the
larger and older Xeon® system running at 2.66GHz.

6.5.6 Future Directions and Conclusions


Much future work remains. First, this section applied only one technique used by
human maze solvers. Others include following walls to exclude portions of the maze
and choosing internal starting points based on the locations of previously traversed
paths. Second, different choices of starting and ending points might favor different
algorithms. Third, although placement of the PART algorithm’s first two threads is
straightforward, there are any number of placement schemes for the remaining threads.
Optimal placement might well depend on the starting and ending points. Fourth, study
of unsolvable mazes and cyclic mazes is likely to produce interesting results. Fifth, the
lightweight C++11 atomic operations might improve performance. Sixth, it would be
interesting to compare the speedups for three-dimensional mazes (or of even higher-order
mazes). Finally, for mazes, humiliating parallelism indicated a more-efficient sequential
implementation using coroutines. Do humiliatingly parallel algorithms always lead to
more-efficient sequential implementations, or are there inherently humiliatingly parallel
algorithms for which coroutine context-switch overhead overwhelms the speedups?
This section demonstrated and analyzed parallelization of maze-solution algorithms.
A conventional work-queue-based algorithm did well only when compiler optimizations
were disabled, suggesting that some prior results obtained using high-level/overhead
languages will be invalidated by advances in optimization.
This section gave a clear example where approaching parallelism as a first-class
optimization technique rather than as a derivative of a sequential algorithm paves
the way for an improved sequential algorithm. High-level design-time application of
parallelism is likely to be a fruitful field of study. This section took the problem of
solving mazes from mildly scalable to humiliatingly parallel and back again. It is hoped

that this experience will motivate work on parallelism as a first-class design-time whole-
application optimization technique, rather than as a grossly suboptimal after-the-fact
micro-optimization to be retrofitted into existing programs.

6.6 Partitioning, Parallelism, and Optimization


Most important, although this chapter has demonstrated that applying parallelism
at the design level gives excellent results, this final section shows that this is not
enough.
strategy is even more important than parallel design. Yes, for this particular type of
maze, intelligently applying parallelism identified a superior search strategy, but this
sort of luck is no substitute for a clear focus on search strategy itself.
As noted back in Section 2.2, parallelism is but one potential optimization of many.
A successful design needs to focus on the most important optimization. Much though I
might wish to claim otherwise, that optimization might or might not be parallelism.
However, for the many cases where parallelism is the right optimization, the next
section covers that synchronization workhorse, locking.
Locking is the worst general-purpose
synchronization mechanism except for all those other
mechanisms that have been tried from time to time.

With apologies to the memory of Winston Churchill and to whoever he was quoting

Chapter 7

Locking

In recent concurrency research, the role of villain is often played by locking. In many
papers and presentations, locking stands accused of promoting deadlocks, convoying,
starvation, unfairness, data races, and all manner of other concurrency sins. Interestingly
enough, the role of workhorse in production-quality shared-memory parallel software is
played by, you guessed it, locking. This chapter will look into this dichotomy between
villain and hero, as fancifully depicted in Figures 7.1 and 7.2.
There are a number of reasons behind this Jekyll-and-Hyde dichotomy:

1. Many of locking’s sins have pragmatic design solutions that work well in most
cases, for example:

(a) Use of lock hierarchies to avoid deadlock.


(b) Deadlock-detection tools, for example, the Linux kernel’s lockdep facil-
ity [Cor06a].
(c) Locking-friendly data structures, such as arrays, hash tables, and radix trees,
which will be covered in Chapter 10.

2. Some of locking’s sins are problems only at high levels of contention, levels
reached only by poorly designed programs.

3. Some of locking’s sins are avoided by using other synchronization mechanisms


in concert with locking. These other mechanisms include statistical counters (see
Chapter 5), reference counters (see Section 9.2), hazard pointers (see Section 9.3),
sequence-locking readers (see Section 9.4), RCU (see Section 9.5), and simple
non-blocking data structures (see Section 14.3).

4. Until quite recently, almost all large shared-memory parallel programs were
developed in secret, so that it was difficult for most researchers to learn of these
pragmatic solutions.

5. Locking works extremely well for some software artifacts and extremely poorly
for others. Developers who have worked on artifacts for which locking works
well can be expected to have a much more positive opinion of locking than those
who have worked on artifacts for which locking works poorly, as will be discussed
in Section 7.5.


Figure 7.1: Locking: Villain or Slob?

Figure 7.2: Locking: Workhorse or Hero?

6. All good stories need a villain, and locking has a long and honorable history
serving as a research-paper whipping boy.

Quick Quiz 7.1: Just how can serving as a whipping boy be considered to be in any
way honorable???
This chapter will give an overview of a number of ways to avoid locking’s more
serious sins.

7.1 Staying Alive


Given that locking stands accused of deadlock and starvation, one important concern
for shared-memory parallel developers is simply staying alive. The following sections
therefore cover deadlock, livelock, starvation, unfairness, and inefficiency.

Figure 7.3: Deadlock Cycle (Threads A, B, and C holding and waiting on Locks 1 through 4)

7.1.1 Deadlock
Deadlock occurs when each of a group of threads is holding at least one lock while at
the same time waiting on a lock held by a member of that same group.
Without some sort of external intervention, deadlock is forever. No thread can
acquire the lock it is waiting on until that lock is released by the thread holding it, but
the thread holding it cannot release it until the holding thread acquires the lock that it is
waiting on.
We can create a directed-graph representation of a deadlock scenario with nodes for
threads and locks, as shown in Figure 7.3. An arrow from a lock to a thread indicates
that the thread holds the lock, for example, Thread B holds Locks 2 and 4. An arrow
from a thread to a lock indicates that the thread is waiting on the lock, for example,
Thread B is waiting on Lock 3.
A deadlock scenario will always contain at least one deadlock cycle. In Figure 7.3,
this cycle is Thread B, Lock 3, Thread C, Lock 4, and back to Thread B.
Quick Quiz 7.2: But the definition of deadlock only said that each thread was
holding at least one lock and waiting on another lock that was held by some thread.
How do you know that there is a cycle?
Although there are some software environments such as database systems that can
repair an existing deadlock, this approach requires either that one of the threads be
killed or that a lock be forcibly stolen from one of the threads. This killing and forcible
stealing can be appropriate for transactions, but is often problematic for kernel and
application-level use of locking: dealing with the resulting partially updated structures
can be extremely complex, hazardous, and error-prone.
Kernels and applications therefore work to avoid deadlocks rather than to recover
from them. There are a number of deadlock-avoidance strategies, including locking
hierarchies (Section 7.1.1.1), local locking hierarchies (Section 7.1.1.2), layered locking
hierarchies (Section 7.1.1.3), strategies for dealing with APIs containing pointers to
locks (Section 7.1.1.4), conditional locking (Section 7.1.1.5), acquiring all needed locks
first (Section 7.1.1.6), single-lock-at-a-time designs (Section 7.1.1.7), and strategies for
signal/interrupt handlers (Section 7.1.1.8). Although there is no deadlock-avoidance

strategy that works perfectly for all situations, there is a good selection of deadlock-
avoidance tools to choose from.

7.1.1.1 Locking Hierarchies

Locking hierarchies order the locks and prohibit acquiring locks out of order. In
Figure 7.3, we might order the locks numerically, so that a thread was forbidden from
acquiring a given lock if it already held a lock with the same or a higher number.
Thread B has violated this hierarchy because it is attempting to acquire Lock 3 while
holding Lock 4, which permitted the deadlock to occur.
Again, to apply a locking hierarchy, order the locks and prohibit out-of-order
lock acquisition. In a large program, it is wise to use tools to enforce your locking
hierarchy [Cor06a].
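For example, when a thread must hold two locks of the same type at once, one simple way to impose a consistent order (a sketch, not taken from the book's code) is to acquire them in address order:

#include <pthread.h>
#include <stdint.h>

/* Acquire two distinct locks in a globally consistent order (here, by
 * address), so that no two threads can each hold one of the pair while
 * waiting for the other. Assumes a != b. */
static void acquire_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
  if ((uintptr_t)a < (uintptr_t)b) {
    pthread_mutex_lock(a);
    pthread_mutex_lock(b);
  } else {
    pthread_mutex_lock(b);
    pthread_mutex_lock(a);
  }
}

Address order is just one possible hierarchy; any rule works as long as every thread follows the same one.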

7.1.1.2 Local Locking Hierarchies

However, the global nature of locking hierarchies makes them difficult to apply to library
functions. After all, the program using a given library function has not even been written
yet, so how can the poor library-function implementor possibly hope to adhere to the
yet-to-be-written program’s locking hierarchy?
One special case that is fortunately the common case is when the library function
does not invoke any of the caller’s code. In this case, the caller’s locks will never be
acquired while holding any of the library’s locks, so that there cannot be a deadlock
cycle containing locks from both the library and the caller.
Quick Quiz 7.3: Are there any exceptions to this rule, so that there really could be
a deadlock cycle containing locks from both the library and the caller, even given that
the library code never invokes any of the caller’s functions?
But suppose that a library function does invoke the caller’s code. For example,
the qsort() function invokes a caller-provided comparison function. A concurrent
implementation of qsort() likely uses locking, which might result in deadlock in
the perhaps-unlikely case where the comparison function is a complicated function
involving also locking. How can the library function avoid deadlock?
The golden rule in this case is “Release all locks before invoking unknown code.”
To follow this rule, the qsort() function must release all locks before invoking the
comparison function.
Quick Quiz 7.4: But if qsort() releases all its locks before invoking the compar-
ison function, how can it protect against races with other qsort() threads?
To see the benefits of local locking hierarchies, compare Figures 7.4 and 7.5. In
both figures, application functions foo() and bar() invoke qsort() while holding
Locks A and B, respectively. Because this is a parallel implementation of qsort(), it
acquires Lock C. Function foo() passes function cmp() to qsort(), and cmp()
acquires Lock B. Function bar() passes a simple integer-comparison function (not
shown) to qsort(), and this simple function does not acquire any locks.
Now, if qsort() holds Lock C while calling cmp() in violation of the golden
release-all-locks rule above, as shown in Figure 7.4, deadlock can occur. To see this,
suppose that one thread invokes foo() while a second thread concurrently invokes
bar(). The first thread will acquire Lock A and the second thread will acquire Lock B.
If the first thread’s call to qsort() acquires Lock C, then it will be unable to acquire
Lock B when it calls cmp(). But the first thread holds Lock C, so the second thread’s

Figure 7.4: Without Local Locking Hierarchy for qsort() [diagram: application-level foo() (Lock A), bar() (Lock B), and cmp() (Lock B); library-level qsort() (Lock C)]

Figure 7.5: Local Locking Hierarchy for qsort() [diagram: same components as Figure 7.4, but with qsort()’s Lock C kept local to the library]

call to qsort() will be unable to acquire it, and thus unable to release Lock B,
resulting in deadlock.
In contrast, if qsort() releases Lock C before invoking the comparison function (which is unknown code from qsort()’s perspective), then deadlock is avoided, as shown in Figure 7.5.
If each module releases all locks before invoking unknown code, then deadlock is
avoided if each module separately avoids deadlock. This rule therefore greatly simplifies
deadlock analysis and greatly improves modularity.
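To make the golden rule concrete, the following sketch shows a toy library routine that iterates over a small locked container while invoking a caller-supplied callback. The intset structure and intset_foreach() function are hypothetical, not part of any real API and not the qsort() implementation itself, but they illustrate the pattern: each element is copied out and the library's lock released before the unknown callback runs.

#include <pthread.h>

struct intset {
	pthread_mutex_t lock;	/* Library-internal lock. */
	int v[16];
	int n;
};

/* Apply cb() to each element, dropping the internal lock before each
 * invocation so that any locks cb() acquires cannot form a deadlock
 * cycle with the library's lock. */
void intset_foreach(struct intset *sp, void (*cb)(int))
{
	int i, tmp;

	for (i = 0; ; i++) {
		pthread_mutex_lock(&sp->lock);
		if (i >= sp->n) {
			pthread_mutex_unlock(&sp->lock);
			break;
		}
		tmp = sp->v[i];
		pthread_mutex_unlock(&sp->lock);	/* Release ...              */
		cb(tmp);				/* ... before unknown code. */
	}
}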

7.1.1.3 Layered Locking Hierarchies


Unfortunately, it might not be possible for qsort() to release all of its locks before
invoking the comparison function. In this case, we cannot construct a local locking
hierarchy by releasing all locks before invoking unknown code. However, we can instead

Figure 7.6: Layered Locking Hierarchy for qsort() [diagram: application-level foo() (Lock A) and bar() (Lock B); library-level qsort() (Lock C); cmp() in a lower layer with Lock D]

construct a layered locking hierarchy, as shown in Figure 7.6. Here, the cmp() function uses a new Lock D that is acquired after all of Locks A, B, and C, avoiding deadlock. We therefore have three layers to the global deadlock hierarchy, the first containing Locks A and B, the second containing Lock C, and the third containing Lock D.
Please note that it is not typically possible to mechanically change cmp() to use
the new Lock D. Quite the opposite: It is often necessary to make profound design-level
modifications. Nevertheless, the effort required for such modifications is normally a
small price to pay in order to avoid deadlock.
For another example where releasing all locks before invoking unknown code is
impractical, imagine an iterator over a linked list, as shown in Figure 7.7 (locked_
list.c). The list_start() function acquires a lock on the list and returns the
first element (if there is one), and list_next() either returns a pointer to the next
element in the list or releases the lock and returns NULL if the end of the list has been
reached.
Figure 7.8 shows how this list iterator may be used. Lines 1-4 define the list_
ints element containing a single integer, and lines 6-17 show how to iterate over the
list. Line 11 locks the list and fetches a pointer to the first element, line 13 provides a
pointer to our enclosing list_ints structure, line 14 prints the corresponding integer,
and line 15 moves to the next element. This is quite simple, and hides all of the locking.
That is, the locking remains hidden as long as the code processing each list element
does not itself acquire a lock that is held across some other call to list_start() or
list_next(), which results in deadlock. We can avoid the deadlock by layering the
locking hierarchy to take the list-iterator locking into account.
This layered approach can be extended to an arbitrarily large number of layers, but

1 struct locked_list {
2 spinlock_t s;
3 struct list_head h;
4 };
5
6 struct list_head *list_start(struct locked_list *lp)
7 {
8 spin_lock(&lp->s);
9 return list_next(lp, &lp->h);
10 }
11
12 struct list_head *list_next(struct locked_list *lp,
13 struct list_head *np)
14 {
15 struct list_head *ret;
16
17 ret = np->next;
18 if (ret == &lp->h) {
19 spin_unlock(&lp->s);
20 ret = NULL;
21 }
22 return ret;
23 }

Figure 7.7: Concurrent List Iterator

1 struct list_ints {
2 struct list_head n;
3 int a;
4 };
5
6 void list_print(struct locked_list *lp)
7 {
8 struct list_head *np;
9 struct list_ints *ip;
10
11 np = list_start(lp);
12 while (np != NULL) {
13 ip = list_entry(np, struct list_ints, n);
14 printf("\t%d\n", ip->a);
15 np = list_next(lp, np);
16 }
17 }

Figure 7.8: Concurrent List Iterator Usage



1 spin_lock(&lock2);
2 layer_2_processing(pkt);
3 nextlayer = layer_1(pkt);
4 spin_lock(&nextlayer->lock1);
5 layer_1_processing(pkt);
6 spin_unlock(&lock2);
7 spin_unlock(&nextlayer->lock1);

Figure 7.9: Protocol Layering and Deadlock

each added layer increases the complexity of the locking design. Such increases in
complexity are particularly inconvenient for some types of object-oriented designs, in
which control passes back and forth among a large group of objects in an undisciplined
manner.1 This mismatch between the habits of object-oriented design and the need to
avoid deadlock is an important reason why parallel programming is perceived by some
to be so difficult.
Some alternatives to highly layered locking hierarchies are covered in Chapter 9.

7.1.1.4 Locking Hierarchies and Pointers to Locks


Although there are some exceptions, an external API containing a pointer to a lock
is very often a misdesigned API. Handing an internal lock to some other software
component is after all the antithesis of information hiding, which is in turn a key design
principle.
Quick Quiz 7.5: Name one common exception where it is perfectly reasonable to
pass a pointer to a lock into a function.
One exception is functions that hand off some entity, where the caller’s lock must
be held until the handoff is complete, but where the lock must be released before the
function returns. One example of such a function is the POSIX pthread_cond_
wait() function, where passing a pointer to a pthread_mutex_t prevents hangs
due to lost wakeups.
Quick Quiz 7.6: Doesn’t the fact that pthread_cond_wait() first releases the
mutex and then re-acquires it eliminate the possibility of deadlock?
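For reference, the usual pthread_cond_wait() pattern looks something like the following sketch, with hypothetical variable names. The pointer to the mutex lets the wait primitive atomically release the lock while blocking and re-acquire it before returning, which is exactly the handoff that prevents lost wakeups.

#include <pthread.h>

static pthread_mutex_t mylock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t mycond = PTHREAD_COND_INITIALIZER;
static int ready;

void wait_for_ready(void)
{
	pthread_mutex_lock(&mylock);
	while (!ready)				/* Re-check after each wakeup. */
		pthread_cond_wait(&mycond, &mylock);
	pthread_mutex_unlock(&mylock);
}

void announce_ready(void)
{
	pthread_mutex_lock(&mylock);
	ready = 1;
	pthread_cond_signal(&mycond);
	pthread_mutex_unlock(&mylock);
}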
In short, if you find yourself exporting an API with a pointer to a lock as an argument
or the return value, do yourself a favor and carefully reconsider your API design. It
might well be the right thing to do, but experience indicates that this is unlikely.

7.1.1.5 Conditional Locking


But suppose that there is no reasonable locking hierarchy. This can happen in real life,
for example, in layered network protocol stacks where packets flow in both directions.
In the networking case, it might be necessary to hold the locks from both layers when
passing a packet from one layer to another. Given that packets travel both up and down
the protocol stack, this is an excellent recipe for deadlock, as illustrated in Figure 7.9.
Here, a packet moving down the stack towards the wire must acquire the next layer’s
lock out of order. Given that packets moving up the stack away from the wire are
acquiring the locks in order, the lock acquisition in line 4 of the figure can result in
deadlock.
One way to avoid deadlocks in this case is to impose a locking hierarchy, but when
it is necessary to acquire a lock out of order, acquire it conditionally, as shown in

1 One name for this is “object-oriented spaghetti code.”



1 retry:
2 spin_lock(&lock2);
3 layer_2_processing(pkt);
4 nextlayer = layer_1(pkt);
5 if (!spin_trylock(&nextlayer->lock1)) {
6 spin_unlock(&lock2);
7 spin_lock(&nextlayer->lock1);
8 spin_lock(&lock2);
9 if (layer_1(pkt) != nextlayer) {
10 spin_unlock(&nextlayer->lock1);
11 spin_unlock(&lock2);
12 goto retry;
13 }
14 }
15 layer_1_processing(pkt);
16 spin_unlock(&lock2);
17 spin_unlock(&nextlayer->lock1);

Figure 7.10: Avoiding Deadlock Via Conditional Locking

Figure 7.10. Instead of unconditionally acquiring the layer-1 lock, line 5 conditionally
acquires the lock using the spin_trylock() primitive. This primitive acquires the
lock immediately if the lock is available (returning non-zero), and otherwise returns
zero without acquiring the lock.
If spin_trylock() is successful, line 15 does the needed layer-1 processing. Otherwise, line 6 releases lock2, and lines 7 and 8 acquire the two locks in the correct order.
Unfortunately, there might be multiple networking devices on the system (e.g., Ethernet
and WiFi), so that the layer_1() function must make a routing decision. This
decision might change at any time, especially if the system is mobile.2 Therefore, line 9
must recheck the decision, and if it has changed, must release the locks and start over.
Quick Quiz 7.7: Can the transformation from Figure 7.9 to Figure 7.10 be applied
universally?
Quick Quiz 7.8: But the complexity in Figure 7.10 is well worthwhile given that it
avoids deadlock, right?

7.1.1.6 Acquire Needed Locks First


In an important special case of conditional locking, all needed locks are acquired before
any processing is carried out. In this case, processing need not be idempotent: if it turns
out to be impossible to acquire a given lock without first releasing one that was already
acquired, just release all the locks and try again. Only once all needed locks are held
will any processing be carried out.
However, this procedure can result in livelock, which will be discussed in Sec-
tion 7.1.2.
Quick Quiz 7.9: When using the “acquire needed locks first” approach described
in Section 7.1.1.6, how can livelock be avoided?
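A minimal sketch of this approach for two pthread_mutex_t locks follows; the lock and function names are illustrative only. Note that no shared data is touched until both locks are held, so nothing needs to be undone when a trylock fails, and note also that this retry loop is exactly the sort of code that can livelock.

#include <pthread.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

/* Acquire both locks before carrying out any processing. */
void update_both(int *a, int *b)
{
	for (;;) {
		pthread_mutex_lock(&lock_a);
		if (pthread_mutex_trylock(&lock_b) == 0)
			break;			/* Both locks now held.    */
		pthread_mutex_unlock(&lock_a);	/* Release all and retry.  */
	}
	(*a)++;					/* Processing happens only */
	(*b)++;					/* with all locks held.    */
	pthread_mutex_unlock(&lock_b);
	pthread_mutex_unlock(&lock_a);
}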
A related approach, two-phase locking [BHG87], has seen long production use in
transactional database systems. In the first phase of a two-phase locking transaction,
locks are acquired but not released. Once all needed locks have been acquired, the trans-
action enters the second phase, where locks are released, but not acquired. This locking
approach allows databases to provide serializability guarantees for their transactions,
in other words, to guarantee that all values seen and produced by the transactions are
consistent with some global ordering of all the transactions. Many such systems rely

2 And, in contrast to the 1900s, mobility is the common case.



on the ability to abort transactions, although this can be simplified by avoiding making
any changes to shared data until all needed locks are acquired. Livelock and deadlock
are issues in such systems, but practical solutions may be found in any of a number of
database textbooks.

7.1.1.7 Single-Lock-at-a-Time Designs

In some cases, it is possible to avoid nesting locks, thus avoiding deadlock. For example,
if a problem is perfectly partitionable, a single lock may be assigned to each partition.
Then a thread working on a given partition need only acquire the one corresponding
lock. Because no thread ever holds more than one lock at a time, deadlock is impossible.
However, there must be some mechanism to ensure that the needed data structures
remain in existence during the time that neither lock is held. One such mechanism is
discussed in Section 7.4 and several others are presented in Chapter 9.

7.1.1.8 Signal/Interrupt Handlers

Deadlocks involving signal handlers are often quickly dismissed by noting that it is
not legal to invoke pthread_mutex_lock() from within a signal handler [Ope97].
However, it is possible (though almost always unwise) to hand-craft locking primitives
that can be invoked from signal handlers. Besides which, almost all operating-system
kernels permit locks to be acquired from within interrupt handlers, which are the kernel
analog to signal handlers.
The trick is to block signals (or disable interrupts, as the case may be) when acquiring
any lock that might be acquired within an interrupt handler. Furthermore, if holding
such a lock, it is illegal to attempt to acquire any lock that is ever acquired outside of a
signal handler without blocking signals.
Quick Quiz 7.10: Why is it illegal to acquire a Lock A that is acquired outside of a
signal handler without blocking signals while holding a Lock B that is acquired within a
signal handler?
If a lock is acquired by the handlers for several signals, then each and every one of
these signals must be blocked whenever that lock is acquired, even when that lock is
acquired within a signal handler.
Quick Quiz 7.11: How can you legally block signals within a signal handler?
Unfortunately, blocking and unblocking signals can be expensive in some operating
systems, notably including Linux, so performance concerns often mean that locks
acquired in signal handlers are only acquired in signal handlers, and that lockless
synchronization mechanisms are used to communicate between application code and
signal handlers.
Or that signal handlers are avoided completely except for handling fatal errors.
Quick Quiz 7.12: If acquiring locks in signal handlers is such a bad idea, why even
discuss ways of making it safe?
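The signal-blocking discipline itself can be sketched in userspace as follows. This is illustrative only: recall that pthread_mutex_lock() may not legally be invoked from a signal handler, so in a real program the lock shared with the handler would have to be a hand-crafted primitive, but the pthread_sigmask() dance around the critical section is the same either way.

#include <pthread.h>
#include <signal.h>

static pthread_mutex_t shared_lock = PTHREAD_MUTEX_INITIALIZER;

/* Block SIGUSR1 while holding a lock that the SIGUSR1 handler also
 * acquires, so that the handler cannot run (and self-deadlock) while
 * this thread holds the lock. */
void locked_update(int *counter)
{
	sigset_t mask, omask;

	sigemptyset(&mask);
	sigaddset(&mask, SIGUSR1);
	pthread_sigmask(SIG_BLOCK, &mask, &omask);
	pthread_mutex_lock(&shared_lock);
	(*counter)++;
	pthread_mutex_unlock(&shared_lock);
	pthread_sigmask(SIG_SETMASK, &omask, NULL);	/* Restore old mask. */
}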

7.1.1.9 Discussion

There are a large number of deadlock-avoidance strategies available to the shared-memory parallel programmer, but there are sequential programs for which none of them
is a good fit. This is one of the reasons that expert programmers have more than one

1 void thread1(void)
2 {
3 retry:
4 spin_lock(&lock1);
5 do_one_thing();
6 if (!spin_trylock(&lock2)) {
7 spin_unlock(&lock1);
8 goto retry;
9 }
10 do_another_thing();
11 spin_unlock(&lock2);
12 spin_unlock(&lock1);
13 }
14
15 void thread2(void)
16 {
17 retry:
18 spin_lock(&lock2);
19 do_a_third_thing();
20 if (!spin_trylock(&lock1)) {
21 spin_unlock(&lock2);
22 goto retry;
23 }
24 do_a_fourth_thing();
25 spin_unlock(&lock1);
26 spin_unlock(&lock2);
27 }

Figure 7.11: Abusing Conditional Locking

tool in their toolbox: locking is a powerful concurrency tool, but there are jobs better
addressed with other tools.
Quick Quiz 7.13: Given an object-oriented application that passes control freely
among a group of objects such that there is no straightforward locking hierarchy,3
layered or otherwise, how can this application be parallelized?
Nevertheless, the strategies described in this section have proven quite useful in
many settings.

7.1.2 Livelock and Starvation


Although conditional locking can be an effective deadlock-avoidance mechanism, it
can be abused. Consider for example the beautifully symmetric example shown in
Figure 7.11. This example’s beauty hides an ugly livelock. To see this, consider the
following sequence of events:

1. Thread 1 acquires lock1 on line 4, then invokes do_one_thing().

2. Thread 2 acquires lock2 on line 18, then invokes do_a_third_thing().


3. Thread 1 attempts to acquire lock2 on line 6, but fails because Thread 2 holds
it.
4. Thread 2 attempts to acquire lock1 on line 20, but fails because Thread 1 holds
it.
5. Thread 1 releases lock1 on line 7, then jumps to retry at line 3.
6. Thread 2 releases lock2 on line 21, and jumps to retry at line 17.

3 Also known as “object-oriented spaghetti code.”



1 void thread1(void)
2 {
3 unsigned int wait = 1;
4 retry:
5 spin_lock(&lock1);
6 do_one_thing();
7 if (!spin_trylock(&lock2)) {
8 spin_unlock(&lock1);
9 sleep(wait);
10 wait = wait << 1;
11 goto retry;
12 }
13 do_another_thing();
14 spin_unlock(&lock2);
15 spin_unlock(&lock1);
16 }
17
18 void thread2(void)
19 {
20 unsigned int wait = 1;
21 retry:
22 spin_lock(&lock2);
23 do_a_third_thing();
24 if (!spin_trylock(&lock1)) {
25 spin_unlock(&lock2);
26 sleep(wait);
27 wait = wait << 1;
28 goto retry;
29 }
30 do_a_fourth_thing();
31 spin_unlock(&lock1);
32 spin_unlock(&lock2);
33 }

Figure 7.12: Conditional Locking and Exponential Backoff

7. The livelock dance repeats from the beginning.

Quick Quiz 7.14: How can the livelock shown in Figure 7.11 be avoided?
Livelock can be thought of as an extreme form of starvation where a group of threads
starve, rather than just one of them.4
Livelock and starvation are serious issues in software transactional memory imple-
mentations, and so the concept of contention manager has been introduced to encapsu-
late these issues. In the case of locking, simple exponential backoff can often address
livelock and starvation. The idea is to introduce exponentially increasing delays before
each retry, as shown in Figure 7.12.
Quick Quiz 7.15: What problems can you spot in the code in Figure 7.12?
However, for better results, the backoff should be bounded, and even better high-
contention results have been obtained via queued locking [And90], which is discussed
more in Section 7.3.2. Of course, best of all is to use a good parallel design so that lock
contention remains low.

7.1.3 Unfairness
Unfairness can be thought of as a less-severe form of starvation, where a subset of
threads contending for a given lock are granted the lion’s share of the acquisitions. This
can happen on machines with shared caches or NUMA characteristics, for example, as

4 Try not to get too hung up on the exact definitions of terms like livelock, starvation, and unfairness. Anything that causes a group of threads to fail to make adequate forward progress is a problem that needs to be fixed, regardless of what name you choose for it.

Figure 7.13: System Architecture and Lock Unfairness [diagram: CPUs 0-7, each with its own cache; each pair of CPUs shares an interconnect, the interconnects attach to a system interconnect along with two memories, and the entire assembly fits within the speed-of-light round-trip distance in vacuum for a 1.8 GHz clock period (8 cm)]

shown in Figure 7.13. If CPU 0 releases a lock that all the other CPUs are attempting to
acquire, the interconnect shared between CPUs 0 and 1 means that CPU 1 will have an
advantage over CPUs 2-7. Therefore, CPU 1 will likely acquire the lock. If CPU 1 holds the lock long enough that CPU 0 is requesting the lock by the time CPU 1 releases it, and vice versa, the lock can shuttle between CPUs 0 and 1, bypassing CPUs 2-7.
Quick Quiz 7.16: Wouldn’t it be better just to use a good parallel design so that
lock contention was low enough to avoid unfairness?

7.1.4 Inefficiency
Locks are implemented using atomic instructions and memory barriers, and often involve
cache misses. As we saw in Chapter 3, these instructions are quite expensive, roughly
two orders of magnitude greater overhead than simple instructions. This can be a serious
problem for locking: If you protect a single instruction with a lock, you will increase the
overhead by a factor of one hundred. Even assuming perfect scalability, one hundred
CPUs would be required to keep up with a single CPU executing the same code without
locking.
This situation underscores the synchronization-granularity tradeoff discussed in
Section 6.3, especially Figure 6.22: Too coarse a granularity will limit scalability, while
too fine a granularity will result in excessive synchronization overhead.
That said, once a lock is held, the data protected by that lock can be accessed by
the lock holder without interference. Acquiring a lock might be expensive, but once
held, the CPU’s caches are an effective performance booster, at least for large critical
sections.
Quick Quiz 7.17: How might the lock holder be interfered with?

7.2 Types of Locks


There are a surprising number of types of locks, more than this short chapter can
possibly do justice to. The following sections discuss exclusive locks (Section 7.2.1),
reader-writer locks (Section 7.2.2), multi-role locks (Section 7.2.3), and scoped locking
(Section 7.2.4).

7.2.1 Exclusive Locks


Exclusive locks are what they say they are: only one thread may hold the lock at a time.
The holder of such a lock thus has exclusive access to all data protected by that lock,
hence the name.
Of course, this all assumes that this lock is held across all accesses to data purportedly
protected by the lock. Although there are some tools that can help, the ultimate
responsibility for ensuring that the lock is acquired in all necessary code paths rests with
the developer.
Quick Quiz 7.18: Does it ever make sense to have an exclusive lock acquisition
immediately followed by a release of that same lock, that is, an empty critical section?

7.2.2 Reader-Writer Locks


Reader-writer locks [CHP71] permit any number of readers to hold the lock concurrently
on the one hand or a single writer to hold the lock on the other. In theory, then, reader-
writer locks should allow excellent scalability for data that is read often and written
rarely. In practice, the scalability will depend on the reader-writer lock implementation.
The classic reader-writer lock implementation involves a set of counters and flags
that are manipulated atomically. This type of implementation suffers from the same
problem as does exclusive locking for short critical sections: The overhead of acquiring
and releasing the lock is about two orders of magnitude greater than the overhead of
a simple instruction. Of course, if the critical section is long enough, the overhead of
acquiring and releasing the lock becomes negligible. However, because only one thread
at a time can be manipulating the lock, the required critical-section size increases with
the number of CPUs.
It is possible to design a reader-writer lock that is much more favorable to readers
through use of per-thread exclusive locks [HW92]. To read, a thread acquires only its
own lock. To write, a thread acquires all locks. In the absence of writers, each reader
incurs only atomic-instruction and memory-barrier overhead, with no cache misses,
which is quite good for a locking primitive. Unfortunately, writers must incur cache
misses as well as atomic-instruction and memory-barrier overhead—multiplied by the
number of threads.
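A minimal sketch of this per-thread approach appears below, using an array of pthread_mutex_t rather than the spinlocks a production implementation would likely use, and omitting the per-lock cache-line padding that would be needed to keep readers from interfering with one another. The function names are illustrative only.

#include <pthread.h>

#define MAX_THREADS 128

static pthread_mutex_t rlock[MAX_THREADS];

void per_thread_rwlock_init(void)
{
	int i;

	for (i = 0; i < MAX_THREADS; i++)
		pthread_mutex_init(&rlock[i], NULL);
}

void read_lock(int tid)		/* Readers touch only their own lock. */
{
	pthread_mutex_lock(&rlock[tid]);
}

void read_unlock(int tid)
{
	pthread_mutex_unlock(&rlock[tid]);
}

void write_lock(void)		/* Writers acquire all locks, in index order. */
{
	int i;

	for (i = 0; i < MAX_THREADS; i++)
		pthread_mutex_lock(&rlock[i]);
}

void write_unlock(void)
{
	int i;

	for (i = 0; i < MAX_THREADS; i++)
		pthread_mutex_unlock(&rlock[i]);
}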
In short, reader-writer locks can be quite useful in a number of situations, but each
type of implementation does have its drawbacks. The canonical use case for reader-
writer locking involves very long read-side critical sections, preferably measured in
hundreds of microseconds or even milliseconds.

7.2.3 Beyond Reader-Writer Locks


Reader-writer locks and exclusive locks differ in their admission policy: exclusive
locks allow at most one holder, while reader-writer locks permit an arbitrary number
of read-holders (but only one write-holder). There is a very large number of possible
admission policies, one of which is that of the VAX/VMS distributed lock manager
(DLM) [ST87], which is shown in Table 7.1. Blank cells indicate compatible modes,
while cells containing “X” indicate incompatible modes.

                         NL    CR    CW    PR    PW    EX
Null (Not Held)  (NL)
Concurrent Read  (CR)                                  X
Concurrent Write (CW)                      X     X     X
Protected Read   (PR)                X           X     X
Protected Write  (PW)                X     X     X     X
Exclusive        (EX)          X     X     X     X     X

Table 7.1: VAX/VMS Distributed Lock Manager Policy

The VAX/VMS DLM uses six modes. For purposes of comparison, exclusive locks
use two modes (not held and held), while reader-writer locks use three modes (not held,
read held, and write held).
The first mode is null, or not held. This mode is compatible with all other modes,
which is to be expected: If a thread is not holding a lock, it should not prevent any other
thread from acquiring that lock.
The second mode is concurrent read, which is compatible with every other mode ex-
cept for exclusive. The concurrent-read mode might be used to accumulate approximate
statistics on a data structure, while permitting updates to proceed concurrently.
The third mode is concurrent write, which is compatible with null, concurrent read,
and concurrent write. The concurrent-write mode might be used to update approximate
statistics, while still permitting reads and concurrent updates to proceed concurrently.
The fourth mode is protected read, which is compatible with null, concurrent read,
and protected read. The protected-read mode might be used to obtain a consistent
snapshot of the data structure, while permitting reads but not updates to proceed concur-
rently.
The fifth mode is protected write, which is compatible with null and concurrent
read. The protected-write mode might be used to carry out updates to a data structure
that could interfere with protected readers but which could be tolerated by concurrent
readers.
The sixth and final mode is exclusive, which is compatible only with null. The
exclusive mode is used when it is necessary to exclude all other accesses.
It is interesting to note that exclusive locks and reader-writer locks can be emulated
by the VAX/VMS DLM. Exclusive locks would use only the null and exclusive modes,
while reader-writer locks might use the null, protected-read, and protected-write modes.
Quick Quiz 7.19: Is there any other way for the VAX/VMS DLM to emulate a
reader-writer lock?
Although the VAX/VMS DLM policy has seen widespread production use for dis-
tributed databases, it does not appear to be used much in shared-memory applications.
One possible reason for this is that the greater communication overheads of distributed
databases can hide the greater overhead of the VAX/VMS DLM’s more-complex admis-
sion policy.
Nevertheless, the VAX/VMS DLM is an interesting illustration of just how flexible
the concepts behind locking can be. It also serves as a very simple introduction to the
locking schemes used by modern DBMSes, which can have more than thirty locking
modes, compared to VAX/VMS’s six.

7.2.4 Scoped Locking


The locking primitives discussed thus far require explicit acquisition and release prim-
itives, for example, spin_lock() and spin_unlock(), respectively. Another
approach is to use the object-oriented “resource acquisition is initialization” (RAII)
pattern [ES90].5 This pattern is often applied to auto variables in languages like C++,
where the corresponding constructor is invoked upon entry to the object’s scope, and
the corresponding destructor is invoked upon exit from that scope. This can be applied
to locking by having the constructor acquire the lock and the destructor free it.
This approach can be quite useful, in fact in 1990 I was convinced that it was the
only type of locking that was needed.6 One very nice property of RAII locking is that
you don’t need to carefully release the lock on each and every code path that exits that
scope, a property that can eliminate a troublesome set of bugs.
However, RAII locking also has a dark side. RAII makes it quite difficult to
encapsulate lock acquisition and release, for example, in iterators. In many iterator
implementations, you would like to acquire the lock in the iterator’s “start” function
and release it in the iterator’s “stop” function. RAII locking instead requires that the
lock acquisition and release take place in the same level of scoping, making such
encapsulation difficult or even impossible.
RAII locking also prohibits overlapping critical sections, due to the fact that scopes
must nest. This prohibition makes it difficult or impossible to express a number of
useful constructs, for example, locking trees that mediate between multiple concurrent
attempts to assert an event. Of an arbitrarily large group of concurrent attempts, only
one need succeed, and the best strategy for the remaining attempts is for them to fail as
quickly and painlessly as possible. Otherwise, lock contention becomes pathological on
large systems (where “large” is many hundreds of CPUs).
Example data structures (taken from the Linux kernel’s implementation of RCU) are
shown in Figure 7.14. Here, each CPU is assigned a leaf rcu_node structure, and each
rcu_node structure has a pointer to its parent (named, oddly enough, ->parent), up
to the root rcu_node structure, which has a NULL ->parent pointer. The number
of child rcu_node structures per parent can vary, but is typically 32 or 64. Each
rcu_node structure also contains a lock named ->fqslock.
The general approach is a tournament, where a given CPU conditionally acquires its leaf rcu_node structure’s ->fqslock, and, if successful, attempts to acquire that of the parent, then releases that of the child. In addition, at each level, the CPU checks
a global gp_flags variable, and if this variable indicates that some other CPU has
asserted the event, the first CPU drops out of the competition. This acquire-then-release
sequence continues until either the gp_flags variable indicates that someone else
won the tournament, one of the attempts to acquire an ->fqslock fails, or the root
rcu_node structure’s ->fqslock has been acquired.
Simplified code to implement this is shown in Figure 7.15. The purpose of this
function is to mediate between CPUs who have concurrently detected a need to invoke
the do_force_quiescent_state() function. At any given time, it only makes
5 Though more clearly expressed at http://www.stroustrup.com/bs_faq2.html#finally.
6 My later work with parallelism at Sequent Computer Systems very quickly disabused me of this misguided notion.

Figure 7.14: Locking Hierarchy [diagram: a root rcu_node structure with leaf rcu_node structures 0 through N beneath it; CPU 0, CPU 1, ..., CPU m attach to Leaf 0, and CPU m * (N − 1), CPU m * (N − 1) + 1, ..., CPU m * N − 1 attach to Leaf N]

sense for one instance of do_force_quiescent_state() to be active, so if there
are multiple concurrent callers, we need at most one of them to actually invoke do_
force_quiescent_state(), and we need the rest to (as quickly and painlessly
as possible) give up and leave.
To this end, each pass through the loop spanning lines 7-15 attempts to advance
up one level in the rcu_node hierarchy. If the gp_flags variable is already set
(line 8) or if the attempt to acquire the current rcu_node structure’s ->fqslock
is unsuccessful (line 9), then local variable ret is set to 1. If line 10 sees that local
variable rnp_old is non-NULL, meaning that we hold rnp_old’s ->fqslock,
line 11 releases this lock (but only after the attempt has been made to acquire the parent
rcu_node structure’s ->fqslock). If line 12 sees that either line 8 or 9 saw a reason
to give up, line 13 returns to the caller. Otherwise, we must have acquired the current
rcu_node structure’s ->fqslock, so line 14 saves a pointer to this structure in local
variable rnp_old in preparation for the next pass through the loop.
If control reaches line 16, we have won the tournament, and now hold the root rcu_node structure’s ->fqslock. If line 16 still sees that the global variable gp_flags
is zero, line 17 sets gp_flags to one, line 18 invokes do_force_quiescent_
state(), and line 19 resets gp_flags back to zero. Either way, line 21 releases the
root rcu_node structure’s ->fqslock.
Quick Quiz 7.20: The code in Figure 7.15 is ridiculously complicated! Why not
conditionally acquire a single global lock?
Quick Quiz 7.21: Wait a minute! If we “win” the tournament on line 16 of Fig-
ure 7.15, we get to do all the work of do_force_quiescent_state(). Exactly
how is that a win, really?
This function illustrates the not-uncommon pattern of hierarchical locking. This
pattern is quite difficult to implement using RAII locking, just like the iterator encapsu-
lation noted earlier, and so the lock/unlock primitives will be needed for the foreseeable
future.

1 void force_quiescent_state(struct rcu_node *rnp_leaf)
2 {
3 int ret;
4 struct rcu_node *rnp = rnp_leaf;
5 struct rcu_node *rnp_old = NULL;
6
7 for (; rnp != NULL; rnp = rnp->parent) {
8 ret = (ACCESS_ONCE(gp_flags)) ||
9 !raw_spin_trylock(&rnp->fqslock);
10 if (rnp_old != NULL)
11 raw_spin_unlock(&rnp_old->fqslock);
12 if (ret)
13 return;
14 rnp_old = rnp;
15 }
16 if (!ACCESS_ONCE(gp_flags)) {
17 ACCESS_ONCE(gp_flags) = 1;
18 do_force_quiescent_state();
19 ACCESS_ONCE(gp_flags) = 0;
20 }
21 raw_spin_unlock(&rnp_old->fqslock);
22 }

Figure 7.15: Conditional Locking to Reduce Contention


1 typedef int xchglock_t;
2 #define DEFINE_XCHG_LOCK(n) xchglock_t n = 0
3
4 void xchg_lock(xchglock_t *xp)
5 {
6 while (xchg(xp, 1) == 1) {
7 while (*xp == 1)
8 continue;
9 }
10 }
11
12 void xchg_unlock(xchglock_t *xp)
13 {
14 (void)xchg(xp, 0);
15 }

Figure 7.16: Sample Lock Based on Atomic Exchange

7.3 Locking Implementation Issues


Developers are almost always best-served by using whatever locking primitives are
provided by the system, for example, the POSIX pthread mutex locks [Ope97, But97].
Nevertheless, studying sample implementations can be helpful, as can considering the
challenges posed by extreme workloads and environments.

7.3.1 Sample Exclusive-Locking Implementation Based on Atomic Exchange

This section reviews the implementation shown in Figure 7.16. The data structure for
this lock is just an int, as shown on line 1, but could be any integral type. The initial
value of this lock is zero, meaning “unlocked”, as shown on line 2.
Quick Quiz 7.22: Why not rely on the C language’s default initialization of zero
instead of using the explicit initializer shown on line 2 of Figure 7.16?
Lock acquisition is carried out by the xchg_lock() function shown on lines 4-9.
This function uses a nested loop, with the outer loop repeatedly atomically exchanging
the value of the lock with the value one (meaning “locked”). If the old value was already
the value one (in other words, someone else already holds the lock), then the inner loop
(lines 7-8) spins until the lock is available, at which point the outer loop makes another
attempt to acquire the lock.
Quick Quiz 7.23: Why bother with the inner loop on lines 7-8 of Figure 7.16? Why
not simply repeatedly do the atomic exchange operation on line 6?
Lock release is carried out by the xchg_unlock() function shown on lines 12-15.
Line 14 atomically exchanges the value zero (“unlocked”) into the lock, thus marking it
as having been released.
Quick Quiz 7.24: Why not simply store zero into the lock word on line 14 of
Figure 7.16?
This lock is a simple example of a test-and-set lock [SR84], but very similar mecha-
nisms have been used extensively as pure spinlocks in production.

7.3.2 Other Exclusive-Locking Implementations


There are a great many other possible implementations of locking based on atomic
instructions, many of which are reviewed by Mellor-Crummey and Scott [MCS91].
These implementations represent different points in a multi-dimensional design trade-
off [McK96b]. For example, the atomic-exchange-based test-and-set lock presented in
the previous section works well when contention is low and has the advantage of small
memory footprint. It avoids giving the lock to threads that cannot use it, but as a result
can suffer from unfairness or even starvation at high contention levels.
In contrast, the ticket lock [MCS91], which is used in the Linux kernel, avoids unfairness
at high contention levels, but as a consequence of its first-in-first-out discipline can
grant the lock to a thread that is currently unable to use it, for example, due to being
preempted, interrupted, or otherwise out of action. However, it is important to avoid
getting too worried about the possibility of preemption and interruption, given that this
preemption and interruption might just as well happen just after the lock was acquired.7
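For concreteness, a bare-bones ticket lock might be sketched as follows using the GCC/Clang __atomic builtins. This is not the Linux kernel's implementation, just an illustration of the FIFO hand-off: each arriving thread takes the next ticket and spins until the "now serving" counter reaches that ticket.

struct ticketlock {
	unsigned int next_ticket;
	unsigned int now_serving;
};
/* Initialize with: struct ticketlock tl = { 0, 0 }; */

void ticket_lock(struct ticketlock *tl)
{
	unsigned int t;

	t = __atomic_fetch_add(&tl->next_ticket, 1, __ATOMIC_RELAXED);
	while (__atomic_load_n(&tl->now_serving, __ATOMIC_ACQUIRE) != t)
		continue;	/* Spin until it is our turn. */
}

void ticket_unlock(struct ticketlock *tl)
{
	/* Only the lock holder updates now_serving, so a plain read of
	 * the old value suffices here. */
	__atomic_store_n(&tl->now_serving, tl->now_serving + 1,
			 __ATOMIC_RELEASE);
}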
All locking implementations where waiters spin on a single memory location,
including both test-and-set locks and ticket locks, suffer from performance problems at
high contention levels. The problem is that the thread releasing the lock must update the
value of the corresponding memory location. At low contention, this is not a problem:
The corresponding cache line is very likely still local to and writeable by the thread
holding the lock. In contrast, at high levels of contention, each thread attempting to
acquire the lock will have a read-only copy of the cache line, and the lock holder will
need to invalidate all such copies before it can carry out the update that releases the lock.
In general, the more CPUs and threads there are, the greater the overhead incurred when
releasing the lock under conditions of high contention.
This negative scalability has motivated a number of different queued-lock implemen-
tations [And90, GT90, MCS91, WKS94, Cra93, MLH94, TS93]. Queued locks avoid
high cache-invalidation overhead by assigning each thread a queue element. These
queue elements are linked together into a queue that governs the order that the lock
will be granted to the waiting threads. The key point is that each thread spins on its
own queue element, so that the lock holder need only invalidate the first element from
the next thread’s CPU’s cache. This arrangement greatly reduces the overhead of lock
handoff at high levels of contention.

7 Besides, the best way of handling high lock contention is to avoid it in the first place! However, there are some situations where high lock contention is the lesser of the available evils, and in any case, studying schemes that deal with high levels of contention is good mental exercise.

More recent queued-lock implementations also take the system’s architecture into
account, preferentially granting locks locally, while also taking steps to avoid starva-
tion [SSVM02, RH03, RH02, JMRR02, MCM02]. Many of these can be thought of as
analogous to the elevator algorithms traditionally used in scheduling disk I/O.
Unfortunately, the same scheduling logic that improves the efficiency of queued
locks at high contention also increases their overhead at low contention. Beng-Hong Lim
and Anant Agarwal therefore combined a simple test-and-set lock with a queued lock,
using the test-and-set lock at low levels of contention and switching to the queued lock at
high levels of contention [LA94], thus getting low overhead at low levels of contention
and getting fairness and high throughput at high levels of contention. Browning et
al. took a similar approach, but avoided the use of a separate flag, so that the test-and-
set fast path uses the same sequence of instructions that would be used in a simple
test-and-set lock [BMMM05]. This approach has been used in production.
Another issue that arises at high levels of contention is when the lock holder is
delayed, especially when the delay is due to preemption, which can result in priority
inversion, where a low-priority thread holds a lock, but is preempted by a medium
priority CPU-bound thread, which results in a high-priority process blocking while
attempting to acquire the lock. The result is that the CPU-bound medium-priority
process is preventing the high-priority process from running. One solution is priority
inheritance [LR80], which has been widely used for real-time computing [SRL90a,
Cor06b], despite some lingering controversy over this practice [Yod04a, Loc02].
Another way to avoid priority inversion is to prevent preemption while a lock is
held. Because preventing preemption while locks are held also improves throughput,
most proprietary UNIX kernels offer some form of scheduler-conscious synchronization
mechanism [KWS97], largely due to the efforts of a certain sizable database vendor.
These mechanisms usually take the form of a hint that preemption would be inappro-
priate. These hints frequently take the form of a bit set in a particular machine register,
which enables extremely low per-lock-acquisition overhead for these mechanisms. In
contrast, Linux avoids these hints, instead getting similar results from a mechanism
called futexes [FRK02, Mol06, Ros06, Dre11].
Interestingly enough, atomic instructions are not strictly needed to implement
locks [Dij65, Lam74]. An excellent exposition of the issues surrounding locking imple-
mentations based on simple loads and stores may be found in Herlihy’s and Shavit’s
textbook [HS08]. The main point echoed here is that such implementations currently
have little practical application, although a careful study of them can be both entertaining
and enlightening. Nevertheless, with one exception described below, such study is left
as an exercise for the reader.
Gamsa et al. [GKAS99, Section 5.3] describe a token-based mechanism in which a
token circulates among the CPUs. When the token reaches a given CPU, it has exclusive
access to anything protected by that token. There are any number of schemes that may
be used to implement the token-based mechanism, for example:

1. Maintain a per-CPU flag, which is initially zero for all but one CPU. When a
CPU’s flag is non-zero, it holds the token. When it finishes with the token, it
zeroes its flag and sets the flag of the next CPU to one (or to any other non-zero
value).

2. Maintain a per-CPU counter, which is initially set to the corresponding CPU’s
number, which we assume to range from zero to N − 1, where N is the number
of CPUs in the system. When a CPU’s counter is greater than that of the next

1 int delete(int key)
2 {
3 int b;
4 struct element *p;
5
6 b = hashfunction(key);
7 p = hashtable[b];
8 if (p == NULL || p->key != key)
9 return 0;
10 spin_lock(&p->lock);
11 hashtable[b] = NULL;
12 spin_unlock(&p->lock);
13 kfree(p);
14 return 1;
15 }

Figure 7.17: Per-Element Locking Without Existence Guarantees

CPU (taking counter wrap into account), the first CPU holds the token. When it
is finished with the token, it sets the next CPU’s counter to a value one greater
than its own counter.

Quick Quiz 7.25: How can you tell if one counter is greater than another, while
accounting for counter wrap?
Quick Quiz 7.26: Which is better, the counter approach or the flag approach?
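The first (flag-based) scheme in the list above might be sketched as follows, again using the __atomic builtins; the names are illustrative, and a real implementation would also place each flag in its own cache line.

#define NR_CPUS 4

/* Exactly one flag is nonzero at any time; the CPU whose flag is
 * nonzero holds the token.  CPU 0 starts with the token. */
static int token_flag[NR_CPUS] = { 1, 0, 0, 0 };

void token_wait(int cpu)	/* Spin until this CPU holds the token. */
{
	while (!__atomic_load_n(&token_flag[cpu], __ATOMIC_ACQUIRE))
		continue;
}

void token_pass(int cpu)	/* Hand the token to the next CPU. */
{
	__atomic_store_n(&token_flag[cpu], 0, __ATOMIC_RELAXED);
	__atomic_store_n(&token_flag[(cpu + 1) % NR_CPUS], 1,
			 __ATOMIC_RELEASE);
}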
This lock is unusual in that a given CPU cannot necessarily acquire it immediately,
even if no other CPU is using it at the moment. Instead, the CPU must wait until the
token comes around to it. This is useful in cases where CPUs need periodic access
to the critical section, but can tolerate variances in token-circulation rate. Gamsa et
al. [GKAS99] used it to implement a variant of read-copy update (see Section 9.5), but
it could also be used to protect periodic per-CPU operations such as flushing per-CPU
caches used by memory allocators [MS93], garbage-collecting per-CPU data structures,
or flushing per-CPU data to shared storage (or to mass storage, for that matter).
As increasing numbers of people gain familiarity with parallel hardware and paral-
lelize increasing amounts of code, we can expect more special-purpose locking primi-
tives to appear. Nevertheless, you should carefully consider this important safety tip:
Use the standard synchronization primitives whenever humanly possible. The big ad-
vantage of the standard synchronization primitives over roll-your-own efforts is that the
standard primitives are typically much less bug-prone.8

7.4 Lock-Based Existence Guarantees


A key challenge in parallel programming is to provide existence guarantees [GKAS99],
so that attempts to access a given object can rely on that object being in existence
throughout a given access attempt. In some cases, existence guarantees are implicit:

1. Global variables and static local variables in the base module will exist as long as
the application is running.

2. Global variables and static local variables in a loaded module will exist as long as
that module remains loaded.
8 And yes, I have done at least my share of roll-your-own synchronization primitives. However, you will
notice that my hair is much greyer than it was before I started doing that sort of work. Coincidence? Maybe.
But are you really willing to risk your own hair turning prematurely grey?

1 int delete(int key)
2 {
3 int b;
4 struct element *p;
5 spinlock_t *sp;
6
7 b = hashfunction(key);
8 sp = &locktable[b];
9 spin_lock(sp);
10 p = hashtable[b];
11 if (p == NULL || p->key != key) {
12 spin_unlock(sp);
13 return 0;
14 }
15 hashtable[b] = NULL;
16 spin_unlock(sp);
17 kfree(p);
18 return 1;
19 }

Figure 7.18: Per-Element Locking With Lock-Based Existence Guarantees

3. A module will remain loaded as long as at least one of its functions has an active
instance.

4. A given function instance’s on-stack variables will exist until that instance returns.

5. If you are executing within a given function or have been called (directly or
indirectly) from that function, then the given function has an active instance.

These implicit existence guarantees are straightforward, though bugs involving implicit existence guarantees really can happen.
Quick Quiz 7.27: How can relying on implicit existence guarantees result in a bug?

But the more interesting—and troublesome—guarantee involves heap memory: A dynamically allocated data structure will exist until it is freed. The problem to be solved
is to synchronize the freeing of the structure with concurrent accesses to that same
structure. One way to do this is with explicit guarantees, such as locking. If a given
structure may only be freed while holding a given lock, then holding that lock guarantees
that structure’s existence.
But this guarantee depends on the existence of the lock itself. One straightforward
way to guarantee the lock’s existence is to place the lock in a global variable, but global
locking has the disadvantage of limiting scalability. One way of providing scalability
that improves as the size of the data structure increases is to place a lock in each element
of the structure. Unfortunately, putting the lock that is to protect a data element in the
data element itself is subject to subtle race conditions, as shown in Figure 7.17.
Quick Quiz 7.28: What if the element we need to delete is not the first element of
the list on line 8 of Figure 7.17?
Quick Quiz 7.29: What race condition can occur in Figure 7.17?
One way to fix this example is to use a hashed set of global locks, so that each
hash bucket has its own lock, as shown in Figure 7.18. This approach allows acquiring
the proper lock (on line 9) before gaining a pointer to the data element (on line 10).
Although this approach works quite well for elements contained in a single partitionable
data structure such as the hash table shown in the figure, it can be problematic if a
given data element can be a member of multiple hash tables or given more-complex
data structures such as trees or graphs. These problems can be solved, in fact, such
solutions form the basis of lock-based software transactional memory implementations [ST95, DSS06]. However, Chapter 9 describes simpler—and faster—ways of
providing existence guarantees.

7.5 Locking: Hero or Villain?


As is often the case in real life, locking can be either hero or villain, depending on
how it is used and on the problem at hand. In my experience, those writing whole
applications are happy with locking, those writing parallel libraries are less happy, and
those parallelizing existing sequential libraries are extremely unhappy. The following
sections discuss some reasons for these differences in viewpoints.

7.5.1 Locking For Applications: Hero!


When writing an entire application (or entire kernel), developers have full control of the
design, including the synchronization design. Assuming that the design makes good
use of partitioning, as discussed in Chapter 6, locking can be an extremely effective
synchronization mechanism, as demonstrated by the heavy use of locking in production-
quality parallel software.
Nevertheless, although such software usually bases most of its synchronization
design on locking, such software also almost always makes use of other synchronization
mechanisms, including special counting algorithms (Chapter 5), data ownership (Chap-
ter 8), reference counting (Section 9.2), sequence locking (Section 9.4), and read-copy
update (Section 9.5). In addition, practitioners use tools for deadlock detection [Cor06a],
lock acquisition/release balancing [Cor04b], cache-miss analysis [The11], hardware-
counter-based profiling [EGMdB11, The12], and many more besides.
Given careful design, use of a good combination of synchronization mechanisms,
and good tooling, locking works quite well for applications and kernels.

7.5.2 Locking For Parallel Libraries: Just Another Tool


Unlike applications and kernels, the designer of a library cannot know the locking
design of the code that the library will be interacting with. In fact, that code might not
be written for years to come. Library designers therefore have less control and must
exercise more care when laying out their synchronization design.
Deadlock is of course of particular concern, and the techniques discussed in Sec-
tion 7.1.1 need to be applied. One popular deadlock-avoidance strategy is therefore
to ensure that the library’s locks are independent subtrees of the enclosing program’s
locking hierarchy. However, this can be harder than it looks.
One complication was discussed in Section 7.1.1.2, namely when library functions
call into application code, with qsort()’s comparison-function argument being a case
in point. Another complication is the interaction with signal handlers. If an application
signal handler is invoked from a signal received within the library function, deadlock
can ensue just as surely as if the library function had called the signal handler directly.
A final complication occurs for those library functions that can be used between a
fork()/exec() pair, for example, due to use of the system() function. In this
case, if your library function was holding a lock at the time of the fork(), then the
child process will begin life with that lock held. Because the thread that will release the
lock is running in the parent but not the child, if the child calls your library function,
deadlock will ensue.
The following strategies may be used to avoid deadlock problems in these cases:

1. Don’t use either callbacks or signals.


2. Don’t acquire locks from within callbacks or signal handlers.
3. Let the caller control synchronization.
4. Parameterize the library API to delegate locking to the caller.
5. Explicitly avoid callback deadlocks.
6. Explicitly avoid signal-handler deadlocks.

Each of these strategies is discussed in one of the following sections.

7.5.2.1 Use Neither Callbacks Nor Signals


If a library function avoids callbacks and the application as a whole avoids signals,
then any locks acquired by that library function will be leaves of the locking-hierarchy
tree. This arrangement avoids deadlock, as discussed in Section 7.1.1.1. Although
this strategy works extremely well where it applies, there are some applications that
must use signal handlers, and there are some library functions (such as the qsort()
function discussed in Section 7.1.1.2) that require callbacks.
The strategy described in the next section can often be used in these cases.

7.5.2.2 Avoid Locking in Callbacks and Signal Handlers


If neither callbacks nor signal handlers acquire locks, then they cannot be involved
in deadlock cycles, which allows straightforward locking hierarchies to once again
consider library functions to be leaves on the locking-hierarchy tree. This strategy
works very well for most uses of qsort, whose callbacks usually simply compare the
two values passed in to them. This strategy also works wonderfully for many signal
handlers, especially given that acquiring locks from within signal handlers is generally
frowned upon [Gro01],9 but can fail if the application needs to manipulate complex data
structures from a signal handler.
Here are some ways to avoid acquiring locks in signal handlers even if complex data
structures must be manipulated:

1. Use simple data structures based on non-blocking synchronization, as will be discussed in Section 14.3.1.
2. If the data structures are too complex for reasonable use of non-blocking syn-
chronization, create a queue that allows non-blocking enqueue operations. In
the signal handler, instead of manipulating the complex data structure, add an
element to the queue describing the required change. A separate thread can then
remove elements from the queue and carry out the required changes using normal
locking. There are a number of readily available implementations of concurrent
queues [KLP12, Des09, MS96].
9 But the standard’s words do not stop clever coders from creating their own home-brew locking primitives from atomic operations.



This strategy should be enforced with occasional manual or (preferably) automated inspections of callbacks and signal handlers. When carrying out these inspections, be
wary of clever coders who might have (unwisely) created home-brew locks from atomic
operations.

7.5.2.3 Caller Controls Synchronization

Let the caller control synchronization. This works extremely well when the library
functions are operating on independent caller-visible instances of a data structure, each
of which may be synchronized separately. For example, if the library functions operate
on a search tree, and if the application needs a large number of independent search trees,
then the application can associate a lock with each tree. The application then acquires
and releases locks as needed, so that the library need not be aware of parallelism at all.
Instead, the application controls the parallelism, so that locking can work very well, as
was discussed in Section 7.5.1.
However, this strategy fails if the library implements a data structure that requires
internal concurrency, for example, a hash table or a parallel sort. In this case, the library
absolutely must control its own synchronization.

7.5.2.4 Parameterize Library Synchronization

The idea here is to add arguments to the library’s API to specify which locks to acquire,
how to acquire and release them, or both. This strategy allows the application to take on
the global task of avoiding deadlock by specifying which locks to acquire (by passing in
pointers to the locks in question) and how to acquire them (by passing in pointers to lock
acquisition and release functions), but also allows a given library function to control its
own concurrency by deciding where the locks should be acquired and released.
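One hypothetical shape for such an API is sketched below: the caller supplies the lock together with acquisition and release functions, so the application keeps control of the global locking hierarchy (and of any needed signal blocking), while the library still decides exactly where in its processing the lock must be held. The sync_ops and lib_increment() names are illustrative only.

#include <pthread.h>

struct sync_ops {
	void (*acquire)(void *lock);
	void (*release)(void *lock);
	void *lock;
};

/* Toy library function: update a counter under caller-supplied
 * synchronization. */
void lib_increment(int *counter, struct sync_ops *sync)
{
	sync->acquire(sync->lock);
	(*counter)++;
	sync->release(sync->lock);
}

/* Caller-side glue that plugs in an ordinary pthread_mutex_t. */
static void my_acquire(void *l) { pthread_mutex_lock(l); }
static void my_release(void *l) { pthread_mutex_unlock(l); }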
In particular, this strategy allows the lock acquisition and release functions to block
signals as needed without the library code needing to be concerned with which signals
need to be blocked by which locks. The separation of concerns used by this strategy can
be quite effective, but in some cases the strategies laid out in the following sections can
work better.
That said, passing explicit pointers to locks to external APIs must be very carefully
considered, as discussed in Section 7.1.1.4. Although this practice is sometimes the
right thing to do, you should do yourself a favor by looking into alternative designs first.

7.5.2.5 Explicitly Avoid Callback Deadlocks

The basic rule behind this strategy was discussed in Section 7.1.1.2: “Release all locks
before invoking unknown code.” This is usually the best approach because it allows
the application to ignore the library’s locking hierarchy: the library remains a leaf or
isolated subtree of the application’s overall locking hierarchy.
In cases where it is not possible to release all locks before invoking unknown code,
the layered locking hierarchies described in Section 7.1.1.3 can work well. For example,
if the unknown code is a signal handler, this implies that the library function block
signals across all lock acquisitions, which can be complex and slow. Therefore, in
cases where signal handlers (probably unwisely) acquire locks, the strategies in the next
section may prove helpful.

7.5.2.6 Explicitly Avoid Signal-Handler Deadlocks


Signal-handler deadlocks can be explicitly avoided as follows:

1. If the application invokes the library function from within a signal handler, then
that signal must be blocked every time that the library function is invoked from
outside of a signal handler.

2. If the application invokes the library function while holding a lock acquired within
a given signal handler, then that signal must be blocked every time that the library
function is called outside of a signal handler.

These rules can be enforced by using tools similar to the Linux kernel’s lockdep
lock dependency checker [Cor06a]. One of the great strengths of lockdep is that it is
not fooled by human intuition [Ros11].
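For example, rule 1 might be followed by blocking the relevant signal across the library
function’s lock acquisitions, roughly as in the following sketch, in which liblock and the
choice of SIGUSR1 are assumptions made purely for illustration:

#include <pthread.h>
#include <signal.h>

static pthread_mutex_t liblock = PTHREAD_MUTEX_INITIALIZER;

void lib_func(void)  /* invoked from outside a signal handler */
{
  sigset_t mask, omask;

  sigemptyset(&mask);
  sigaddset(&mask, SIGUSR1);                   /* the signal whose handler calls the library */
  pthread_sigmask(SIG_BLOCK, &mask, &omask);   /* rule 1: block it around the acquisition */
  pthread_mutex_lock(&liblock);
  /* ... the library's real work goes here ... */
  pthread_mutex_unlock(&liblock);
  pthread_sigmask(SIG_SETMASK, &omask, NULL);  /* restore the caller's signal mask */
}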

7.5.2.7 Library Functions Used Between fork() and exec()


As noted earlier, if a thread executing a library function is holding a lock at the time
that some other thread invokes fork(), the fact that the parent’s memory is copied to
create the child means that this lock will be born held in the child’s context. The thread
that will release this lock is running in the parent, but not in the child, which means that
the child’s copy of this lock will never be released. Therefore, any attempt on the part
of the child to invoke that same library function will result in deadlock.
One approach to this problem would be to have the library function check to see if
the owner of the lock is still running and, if not, “break” the lock by re-initializing
and then acquiring it. However, this approach has a couple of vulnerabilities:

1. The data structures protected by that lock are likely to be in some intermedi-
ate state, so that naively breaking the lock might result in arbitrary memory
corruption.

2. If the child creates additional threads, two threads might break the lock concur-
rently, with the result that both threads believe they own the lock. This could
again result in arbitrary memory corruption.

The pthread_atfork() function is provided to help deal with these situations. The idea is
to register a triplet of functions, one to be called by the parent before the fork(), one
to be called by the parent after the fork(), and one to be called by the child after the
fork(). Appropriate cleanups can then be carried out at these three points.
Be warned, however, that coding pthread_atfork() handlers is quite subtle in general.
The cases where pthread_atfork() works best are those where the data structure in question
can simply be re-initialized by the child.
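The following is a minimal sketch of such a triplet, assuming a library-private lock named
libmutex and an initialization function named lib_init(), neither of which is drawn from
any real library:

#include <pthread.h>

static pthread_mutex_t libmutex = PTHREAD_MUTEX_INITIALIZER;

static void atfork_prepare(void) { pthread_mutex_lock(&libmutex); }    /* parent, before fork() */
static void atfork_parent(void)  { pthread_mutex_unlock(&libmutex); }  /* parent, after fork() */

static void atfork_child(void)   /* child, after fork() */
{
  /* The child inherits the lock in the held state, so release it.  In the
   * simple cases called out above, the child would also re-initialize the
   * data that the lock protects. */
  pthread_mutex_unlock(&libmutex);
}

void lib_init(void)
{
  pthread_atfork(atfork_prepare, atfork_parent, atfork_child);
}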

7.5.2.8 Parallel Libraries: Discussion


Regardless of the strategy used, the description of the library’s API must include a clear
description of that strategy and how the caller should interact with that strategy. In short,
constructing parallel libraries using locking is possible, but not as easy as constructing a
parallel application.

7.5.3 Locking For Parallelizing Sequential Libraries: Villain!


With the advent of readily available low-cost multicore systems, a common task is
parallelizing an existing library that was designed with only single-threaded use in mind.
This all-too-common disregard for parallelism can result in a library API that is severely
flawed from a parallel-programming viewpoint. Candidate flaws include:

1. Implicit prohibition of partitioning.

2. Callback functions requiring locking.

3. Object-oriented spaghetti code.

These flaws and the consequences for locking are discussed in the following sections.

7.5.3.1 Partitioning Prohibited


Suppose that you were writing a single-threaded hash-table implementation. It is easy
and fast to maintain an exact count of the total number of items in the hash table, and
also easy and fast to return this exact count on each addition and deletion operation. So
why not?
One reason is that exact counters do not perform or scale well on multicore systems,
as was seen in Chapter 5. As a result, the parallelized implementation of the hash table
will not perform or scale well.
So what can be done about this? One approach is to return an approximate count,
using one of the algorithms from Chapter 5. Another approach is to drop the element
count altogether.
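The approximate-count approach might look something like the following sketch, which is
in the spirit of Chapter 5’s statistical counters; the names NR_THREADS,
count_register_thread(), and count_approx() are assumptions made for this example rather
than part of any hash-table API, and a production version would also use ACCESS_ONCE()-style
accesses, as do the figures in Chapter 5:

#define NR_THREADS 64

static unsigned long __thread *my_count;       /* points into countarray[] */
static unsigned long countarray[NR_THREADS];

void count_register_thread(int id)             /* called once per thread at startup */
{
  my_count = &countarray[id];
}

static inline void count_add(long delta)       /* invoked on each addition/deletion */
{
  *my_count += delta;                          /* no locks, no atomics: fully partitioned */
}

unsigned long count_approx(void)               /* slightly stale, but fast and scalable */
{
  unsigned long sum = 0;

  for (int i = 0; i < NR_THREADS; i++)
    sum += countarray[i];                      /* may race with updates: approximate */
  return sum;
}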
Either way, it will be necessary to inspect uses of the hash table to see why the
addition and deletion operations need the exact count. Here are a few possibilities:

1. Determining when to resize the hash table. In this case, an approximate count
should work quite well. It might also be useful to trigger the resizing operation
from the length of the longest chain, which can be computed and maintained in a
nicely partitioned per-chain manner.

2. Producing an estimate of the time required to traverse the entire hash table. An
approximate count works well in this case, also.

3. For diagnostic purposes, for example, to check for items being lost when trans-
ferring them to and from the hash table. This clearly requires an exact count.
However, given that this usage is diagnostic in nature, it might suffice to maintain
the lengths of the hash chains, then to infrequently sum them up while locking
out addition and deletion operations.

It turns out that there is now a strong theoretical basis for some of the constraints that
performance and scalability place on a parallel library’s APIs [AGH+11a, AGH+11b,
McK11b]. Anyone designing a parallel library needs to pay close attention to those
constraints.
Although it is all too easy to blame locking for what are really problems due to a
concurrency-unfriendly API, doing so is not helpful. On the other hand, one has little
choice but to sympathize with the hapless developer who made this choice in (say)
1985. It would have been a rare and courageous developer to anticipate the need for
parallelism at that time, and it would have required an even more rare combination of
brilliance and luck to actually arrive at a good parallel-friendly API.
Times change, and code must change with them. That said, there might be a huge
number of users of a popular library, in which case an incompatible change to the API
would be quite foolish. Adding a parallel-friendly API to complement the existing
heavily used sequential-only API is probably the best course of action in this situation.
Nevertheless, human nature being what it is, we can expect our hapless developer
to be more likely to complain about locking than about his or her own poor (though
understandable) API design choices.

7.5.3.2 Deadlock-Prone Callbacks

Sections 7.1.1.2, 7.1.1.3, and 7.5.2 described how undisciplined use of callbacks can
result in locking woes. These sections also described how to design your library function
to avoid these problems, but it is unrealistic to expect a 1990s programmer with no
experience in parallel programming to have followed such a design. Therefore, someone
attempting to parallelize an existing callback-heavy single-threaded library will likely
have many opportunities to curse locking’s villainy.
If there are a very large number of uses of a callback-heavy library, it may be wise to
again add a parallel-friendly API to the library in order to allow existing users to convert
their code incrementally. Alternatively, some advocate use of transactional memory in
these cases. While the jury is still out on transactional memory, Section 17.2 discusses
its strengths and weaknesses. It is important to note that hardware transactional memory
(discussed in Section 17.3) cannot help here unless the hardware transactional memory
implementation provides forward-progress guarantees, which few do. Other alternatives
that appear to be quite practical (if less heavily hyped) include the methods discussed in
Sections 7.1.1.5 and 7.1.1.6, as well as those that will be discussed in Chapters 8 and 9.

7.5.3.3 Object-Oriented Spaghetti Code

Object-oriented programming went mainstream sometime in the 1980s or 1990s, and
as a result there is a huge amount of object-oriented code in production, much of it
single-threaded. Although object orientation can be a valuable software technique,
undisciplined use of objects can easily result in object-oriented spaghetti code. In object-
oriented spaghetti code, control flits from object to object in an essentially random
manner, making the code hard to understand, and making it even harder, and perhaps
impossible, to accommodate within a locking hierarchy.
Although many might argue that such code should be cleaned up in any case, such
things are much easier to say than to do. If you are tasked with parallelizing such a beast,
you can reduce the number of opportunities to curse locking by using the techniques
described in Sections 7.1.1.5 and 7.1.1.6, as well as those that will be discussed in
Chapters 8 and 9. This situation appears to be the use case that inspired transactional
memory, so it might be worth a try as well. That said, the choice of synchronization
mechanism should be made in light of the hardware habits discussed in Chapter 3. After
all, if the overhead of the synchronization mechanism is orders of magnitude more than
that of the operations being protected, the results are not going to be pretty.
And that leads to a question well worth asking in these situations: Should the code
remain sequential? For example, perhaps parallelism should be introduced at the process
level rather than the thread level. In general, if a task is proving extremely hard, it
is worth some time spent thinking about not only alternative ways to accomplish that
particular task, but also alternative tasks that might better solve the problem at hand.

7.6 Summary
Locking is perhaps the most widely used and most generally useful synchronization
tool. However, it works best when designed into an application or library from the
beginning. Given the large quantity of pre-existing single-threaded code that might
need to one day run in parallel, locking should therefore not be the only tool in your
parallel-programming toolbox. The next few chapters will discuss other tools, and how
they can best be used in concert with locking and with each other.
It is mine, I tell you. My own. My precious. Yes, my
precious.

Gollum in “The Fellowship of the Ring”,


J.R.R. Tolkien

Chapter 8

Data Ownership

One of the simplest ways to avoid the synchronization overhead that comes with locking
is to parcel the data out among the threads (or, in the case of kernels, CPUs) so that a
given piece of data is accessed and modified by only one of the threads. Interestingly
enough, data ownership covers each of the “big three” parallel design techniques: It
partitions over threads (or CPUs, as the case may be), it batches all local operations, and
its elimination of synchronization operations is weakening carried to its logical extreme.
It should therefore be no surprise that data ownership is used extremely heavily, in fact,
it is one usage pattern that even novices use almost instinctively. In fact, it is used so
heavily that this chapter will not introduce any new examples, but will instead reference
examples from previous chapters.
Quick Quiz 8.1: What form of data ownership is extremely difficult to avoid when
creating shared-memory parallel programs (for example, using pthreads) in C or C++?

There are a number of approaches to data ownership. Section 8.1 presents the
logical extreme in data ownership, where each thread has its own private address space.
Section 8.2 looks at the opposite extreme, where the data is shared, but different threads
own different access rights to the data. Section 8.3 describes function shipping, which
is a way of allowing other threads to have indirect access to data owned by a particular
thread. Section 8.4 describes how designated threads can be assigned ownership of a
specified function and the related data. Section 8.5 discusses improving performance
by transforming algorithms with shared data to instead use data ownership. Finally,
Section 8.6 lists a few software environments that feature data ownership as a first-class
citizen.

8.1 Multiple Processes


Section 4.1 introduced the following example:
1 compute_it 1 > compute_it.1.out &
2 compute_it 2 > compute_it.2.out &
3 wait
4 cat compute_it.1.out
5 cat compute_it.2.out

This example runs two instances of the compute_it program in parallel, as
separate processes that do not share memory. Therefore, all data in a given process
is owned by that process, so that almost the entirety of data in the above example
is owned. This approach almost entirely eliminates synchronization overhead. The
resulting combination of extreme simplicity and optimal performance is obviously quite
attractive.
Quick Quiz 8.2: What synchronization remains in the example shown in Sec-
tion 8.1?
Quick Quiz 8.3: Is there any shared data in the example shown in Section 8.1?
This same pattern can be written in C as well as in sh, as illustrated by Figures 4.2
and 4.3.
The next section discusses use of data ownership in shared-memory parallel pro-
grams.

8.2 Partial Data Ownership and pthreads


Chapter 5 makes heavy use of data ownership, but adds a twist. Threads are not allowed
to modify data owned by other threads, but they are permitted to read it. In short, the
use of shared memory allows more nuanced notions of ownership and access rights.
For example, consider the per-thread statistical counter implementation shown in
Figure 5.9 on page 62. Here, inc_count() updates only the corresponding thread’s
instance of counter, while read_count() accesses, but does not modify, all
threads’ instances of counter.
Quick Quiz 8.4: Does it ever make sense to have partial data ownership where each
thread reads only its own instance of a per-thread variable, but writes to other threads’
instances?
Pure data ownership is also both common and useful, for example, the per-thread
memory-allocator caches discussed in Section 6.4.3 starting on page 117. In this
algorithm, each thread’s cache is completely private to that thread.

8.3 Function Shipping


The previous section described a weak form of data ownership where threads reached
out to other threads’ data. This can be thought of as bringing the data to the functions
that need it. An alternative approach is to send the functions to the data.
Such an approach is illustrated in Section 5.4.3 beginning on page 77, in particular
the flush_local_count_sig() and flush_local_count() functions in
Figure 5.24 on page 80.
The flush_local_count_sig() function is a signal handler that acts as the
shipped function. The pthread_kill() function in flush_local_count()
sends the signal—shipping the function—and then waits until the shipped function
executes. This shipped function has the not-unusual added complication of needing to
interact with any concurrently executing add_count() or sub_count() functions
(see Figure 5.25 on page 82 and Figure 5.26 on page 83).
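Stripped of the counter-specific details, the pattern might be sketched as follows; the
names are hypothetical, this is not the code from Figure 5.24, and the handshake by which
the sender waits for the shipped function to complete is elided:

#include <pthread.h>
#include <signal.h>

static void shipped_function(int sig)      /* runs in the target thread's context */
{
  /* ... operate on the target thread's data, then acknowledge ... */
}

void ship_function_init(void)
{
  signal(SIGUSR1, shipped_function);       /* register the handler once at startup */
}

void ship_function_to(pthread_t target)
{
  pthread_kill(target, SIGUSR1);           /* "ship" the function to the target thread */
  /* ... wait here for the target thread's acknowledgment ... */
}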
Quick Quiz 8.5: What mechanisms other than POSIX signals may be used for
function shipping?

8.4 Designated Thread


The earlier sections describe ways of allowing each thread to keep its own copy or its
own portion of the data. In contrast, this section describes a functional-decomposition
approach, where a special designated thread owns the rights to the data that is required
to do its job. The eventually consistent counter implementation described in Sec-
tion 5.2.3 provides an example. This implementation has a designated thread that runs
the eventual() function shown on lines 15-32 of Figure 5.8. This eventual()
thread periodically pulls the per-thread counts into the global counter, so that accesses
to the global counter will, as the name says, eventually converge on the actual value.
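In outline, such a designated thread might look like the following sketch, which is in the
spirit of eventual() but is not the code from Figure 5.8; NR_THREADS, counterp[], and
global_count are assumptions made for illustration:

#include <poll.h>

#define NR_THREADS 64

unsigned long global_count;             /* written only by the designated thread */
unsigned long *counterp[NR_THREADS];    /* per-thread counters, read (not written) here */

void *designated_thread(void *arg)
{
  for (;;) {
    unsigned long sum = 0;

    for (int t = 0; t < NR_THREADS; t++)
      if (counterp[t])
        sum += *counterp[t];            /* tolerate slightly stale per-thread values */
    global_count = sum;                 /* eventually consistent global value */
    poll(NULL, 0, 1);                   /* nap for roughly one millisecond */
  }
  return NULL;
}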
Quick Quiz 8.6: But none of the data in the eventual() function shown on
lines 15-32 of Figure 5.8 is actually owned by the eventual() thread! In just what
way is this data ownership???

8.5 Privatization
One way of improving the performance and scalability of a shared-memory parallel
program is to transform it so as to convert shared data to private data that is owned by a
particular thread.
An excellent example of this is shown in the answer to one of the Quick Quizzes in
Section 6.1.1, which uses privatization to produce a solution to the Dining Philosophers
problem with much better performance and scalability than that of the standard textbook
solution. The original problem has five philosophers sitting around the table with one
fork between each adjacent pair of philosophers, which permits at most two philosophers
to eat concurrently.
We can trivially privatize this problem by providing an additional five forks, so
that each philosopher has his or her own private pair of forks. This allows all five
philosophers to eat concurrently, and also offers a considerable reduction in the spread
of certain types of disease.
In other cases, privatization imposes costs. For example, consider the simple
limit counter shown in Figure 5.12 on page 67. This is an example of an algorithm
where threads can read each others’ data, but are only permitted to update their own
data. A quick review of the algorithm shows that the only cross-thread accesses are
in the summation loop in read_count(). If this loop is eliminated, we move to
the more-efficient pure data ownership, but at the cost of a less-accurate result from
read_count().
Quick Quiz 8.7: Is it possible to obtain greater accuracy while still maintaining full
privacy of the per-thread data?
In short, privatization is a powerful tool in the parallel programmer’s toolbox, but it
must nevertheless be used with care. Just like every other synchronization primitive, it
has the potential to increase complexity while decreasing performance and scalability.

8.6 Other Uses of Data Ownership


Data ownership works best when the data can be partitioned so that there is little or no
need for cross thread access or update. Fortunately, this situation is reasonably common,
and in a wide variety of parallel-programming environments.
Examples of data ownership include:

1. All message-passing environments, such as MPI [MPI08] and BOINC [UoC08].


2. Map-reduce [Jac08].
3. Client-server systems, including RPC, web services, and pretty much any system
with a back-end database server.

4. Shared-nothing database systems.


5. Fork-join systems with separate per-process address spaces.
6. Process-based parallelism, such as the Erlang language.
7. Private variables, for example, C-language on-stack auto variables, in threaded
environments.

Data ownership is perhaps the most underappreciated synchronization mechanism
in existence. When used properly, it delivers unrivaled simplicity, performance, and
scalability. Perhaps its simplicity costs it the respect that it deserves. Hopefully a greater
appreciation for the subtlety and power of data ownership will lead to a greater level of
respect, to say nothing of leading to greater performance and scalability coupled with
reduced complexity.
All things come to those who wait.

Violet Fane

Chapter 9

Deferred Processing

The strategy of deferring work goes back before the dawn of recorded history. It
has occasionally been derided as procrastination or even as sheer laziness. However,
in the last few decades workers have recognized this strategy’s value in simplifying
and streamlining parallel algorithms [KL80, Mas92]. Believe it or not, “laziness”
in parallel programming often outperforms and out-scales industriousness! These
performance and scalability benefits stem from the fact that deferring work often enables
weakening of synchronization primitives, thereby reducing synchronization overhead.
General approaches of work deferral include reference counting (Section 9.2), hazard
pointers (Section 9.3), sequence locking (Section 9.4), and RCU (Section 9.5). Finally,
Section 9.6 describes how to choose among the work-deferral schemes covered in this
chapter and Section 9.7 discusses the role of updates. But first we will introduce an
example algorithm that will be used to compare and contrast these approaches.

9.1 Running Example


This chapter will use a simplified packet-routing algorithm to demonstrate the value of
these approaches and to allow them to be compared. Routing algorithms are used in
operating-system kernels to deliver each outgoing TCP/IP packet to the appropriate
network interface. This particular algorithm is a simplified version of the classic 1980s
packet-train-optimized algorithm used in BSD UNIX [Jac88], consisting of a simple
linked list.1 Modern routing algorithms use more complex data structures; however, as
in Chapter 5, a simple algorithm will help highlight issues specific to parallelism in an
easy-to-understand setting.
We further simplify the algorithm by reducing the search key from a quadruple
consisting of source and destination IP addresses and ports all the way down to a simple
integer. The value looked up and returned will also be a simple integer, so that the data
structure is as shown in Figure 9.1, which directs packets with address 42 to interface 1,
address 56 to interface 3, and address 17 to interface 7. Assuming that the external
packet network is stable, this list will be searched frequently and updated rarely. In
Chapter 3 we learned that the best ways to evade inconvenient laws of physics, such as
the finite speed of light and the atomic nature of matter, are either to partition the data or to rely
on read-mostly sharing. In this chapter, we will use this Pre-BSD packet routing list to
evaluate a number of read-mostly synchronization techniques.
1 In other words, this is not OpenBSD, NetBSD, or even FreeBSD, but none other than Pre-BSD.


[Figure: a linked list headed by route_list containing three elements:
(->addr=42, ->iface=1), (->addr=56, ->iface=3), and (->addr=17, ->iface=7).]

Figure 9.1: Pre-BSD Packet Routing List

Figure 9.2 shows a simple single-threaded implementation corresponding to Fig-
ure 9.1. Lines 1-5 define a route_entry structure and line 6 defines the route_
list header. Lines 8-21 define route_lookup(), which sequentially searches
route_list, returning the corresponding ->iface, or ULONG_MAX if there is
no such route entry. Lines 23-35 define route_add(), which allocates a route_
entry structure, initializes it, and adds it to the list, returning -ENOMEM in case
of memory-allocation failure. Finally, lines 37-50 define route_del(), which re-
moves and frees the specified route_entry structure if it exists, or returns -ENOENT
otherwise.
This single-threaded implementation serves as a prototype for the various concur-
rent implementations in this chapter, and also as an estimate of ideal scalability and
performance.

9.2 Reference Counting


Reference counting tracks the number of references to a given object in order to prevent
that object from being prematurely freed. As such, it has a long and honorable history
of use dating back to at least the early 1960s [Wei63].2 Reference counting is thus an
excellent candidate for a concurrent implementation of Pre-BSD routing.
To that end, Figure 9.3 shows data structures and the route_lookup() func-
tion and Figure 9.4 shows the route_add() and route_del() functions (all at
route_refcnt.c). Since these algorithms are quite similar to the sequential algo-
rithm shown in Figure 9.2, only the differences will be discussed.
Starting with Figure 9.3, line 2 adds the actual reference counter, line 6 adds a
->re_freed use-after-free check field, line 9 adds the routelock that will be
used to synchronize concurrent updates, and lines 11-15 add re_free(), which
sets ->re_freed, enabling route_lookup() to check for use-after-free bugs. In
route_lookup() itself, lines 29-31 release the reference count of the prior element
and free it if the count becomes zero, and lines 35-43 acquire a reference on the new
element, with lines 36 and 37 performing the use-after-free check.
Quick Quiz 9.1: Why bother with a use-after-free check?
In Figure 9.4, lines 12, 16, 25, 33, and 40 introduce locking to synchronize con-
current updates. Line 14 initializes the ->re_freed use-after-free-check field, and

2 Weizenbaum discusses reference counting as if it was already well-known, so it likely dates back to
the 1950s and perhaps even to the 1940s. And perhaps even further. People repairing and maintaining large
machines have long used a mechanical reference-counting technique, where each worker had a padlock.

1 struct route_entry {
2 struct cds_list_head re_next;
3 unsigned long addr;
4 unsigned long iface;
5 };
6 CDS_LIST_HEAD(route_list);
7
8 unsigned long route_lookup(unsigned long addr)
9 {
10 struct route_entry *rep;
11 unsigned long ret;
12
13 cds_list_for_each_entry(rep,
14 &route_list, re_next) {
15 if (rep->addr == addr) {
16 ret = rep->iface;
17 return ret;
18 }
19 }
20 return ULONG_MAX;
21 }
22
23 int route_add(unsigned long addr,
24 unsigned long interface)
25 {
26 struct route_entry *rep;
27
28 rep = malloc(sizeof(*rep));
29 if (!rep)
30 return -ENOMEM;
31 rep->addr = addr;
32 rep->iface = interface;
33 cds_list_add(&rep->re_next, &route_list);
34 return 0;
35 }
36
37 int route_del(unsigned long addr)
38 {
39 struct route_entry *rep;
40
41 cds_list_for_each_entry(rep,
42 &route_list, re_next) {
43 if (rep->addr == addr) {
44 cds_list_del(&rep->re_next);
45 free(rep);
46 return 0;
47 }
48 }
49 return -ENOENT;
50 }

Figure 9.2: Sequential Pre-BSD Routing Table

finally lines 34-35 invoke re_free() if the new value of the reference count is zero.
Quick Quiz 9.2: Why doesn’t route_del() in Figure 9.4 use reference counts
to protect the traversal to the element to be freed?
Figure 9.5 shows the performance and scalability of reference counting on a read-
only workload with a ten-element list running on a single-socket four-core hyperthreaded
2.5GHz x86 system. The “ideal” trace was generated by running the sequential code
shown in Figure 9.2, which works only because this is a read-only workload. The
reference-counting performance is abysmal and its scalability even more so, with the
“refcnt” trace dropping down onto the x-axis. This should be no surprise in view of
Chapter 3: The reference-count acquisitions and releases have added frequent shared-
memory writes to an otherwise read-only workload, thus incurring severe retribution
from the laws of physics. As well it should, given that all the wishful thinking in the
world is not going to increase the speed of light or decrease the size of the atoms used
in modern digital electronics.

1 struct route_entry { /* BUGGY!!! */
2 atomic_t re_refcnt;
3 struct route_entry *re_next;
4 unsigned long addr;
5 unsigned long iface;
6 int re_freed;
7 };
8 struct route_entry route_list;
9 DEFINE_SPINLOCK(routelock);
10
11 static void re_free(struct route_entry *rep)
12 {
13 ACCESS_ONCE(rep->re_freed) = 1;
14 free(rep);
15 }
16
17 unsigned long route_lookup(unsigned long addr)
18 {
19 int old;
20 int new;
21 struct route_entry *rep;
22 struct route_entry **repp;
23 unsigned long ret;
24
25 retry:
26 repp = &route_list.re_next;
27 rep = NULL;
28 do {
29 if (rep &&
30 atomic_dec_and_test(&rep->re_refcnt))
31 re_free(rep);
32 rep = ACCESS_ONCE(*repp);
33 if (rep == NULL)
34 return ULONG_MAX;
35 do {
36 if (ACCESS_ONCE(rep->re_freed))
37 abort();
38 old = atomic_read(&rep->re_refcnt);
39 if (old <= 0)
40 goto retry;
41 new = old + 1;
42 } while (atomic_cmpxchg(&rep->re_refcnt,
43 old, new) != old);
44 repp = &rep->re_next;
45 } while (rep->addr != addr);
46 ret = rep->iface;
47 if (atomic_dec_and_test(&rep->re_refcnt))
48 re_free(rep);
49 return ret;
50 }

Figure 9.3: Reference-Counted Pre-BSD Routing Table Lookup (BUGGY!!!)



1 int route_add(unsigned long addr, /* BUGGY!!! */
2 unsigned long interface)
3 {
4 struct route_entry *rep;
5
6 rep = malloc(sizeof(*rep));
7 if (!rep)
8 return -ENOMEM;
9 atomic_set(&rep->re_refcnt, 1);
10 rep->addr = addr;
11 rep->iface = interface;
12 spin_lock(&routelock);
13 rep->re_next = route_list.re_next;
14 rep->re_freed = 0;
15 route_list.re_next = rep;
16 spin_unlock(&routelock);
17 return 0;
18 }
19
20 int route_del(unsigned long addr)
21 {
22 struct route_entry *rep;
23 struct route_entry **repp;
24
25 spin_lock(&routelock);
26 repp = &route_list.re_next;
27 for (;;) {
28 rep = *repp;
29 if (rep == NULL)
30 break;
31 if (rep->addr == addr) {
32 *repp = rep->re_next;
33 spin_unlock(&routelock);
34 if (atomic_dec_and_test(&rep->re_refcnt))
35 re_free(rep);
36 return 0;
37 }
38 repp = &rep->re_next;
39 }
40 spin_unlock(&routelock);
41 return -ENOENT;
42 }

Figure 9.4: Reference-Counted Pre-BSD Routing Table Add/Delete (BUGGY!!!)



[Graph: “Lookups per Millisecond” (0 to 450,000) versus “Number of CPUs (Threads)” (1 to 8),
with traces “ideal” and “refcnt”; the “refcnt” trace lies close to the x-axis.]

Figure 9.5: Pre-BSD Routing Table Protected by Reference Counting



Quick Quiz 9.3: Why the stairsteps in the “ideal” line in Figure 9.5? Shouldn’t it
be a straight line?
Quick Quiz 9.4: Why, in these modern times, does Figure 9.5 only go up to 8
CPUs???
But it gets worse.
Running multiple updater threads repeatedly invoking route_add() and route_
del() will quickly encounter the abort() statement on line 37 of Figure 9.3, which
indicates a use-after-free bug. This in turn means that the reference counts are not only
profoundly degrading scalability and performance, but also failing to provide the needed
protection.
One sequence of events leading to the use-after-free bug is as follows, given the list
shown in Figure 9.1:

1. Thread A looks up address 42, reaching line 33 of route_lookup() in Fig-
ure 9.3. In other words, Thread A has a pointer to the first element, but has not
yet acquired a reference to it.

2. Thread B invokes route_del() in Figure 9.4 to delete the route entry for
address 42. It completes successfully, and because this entry’s ->re_refcnt
field was equal to the value one, it invokes re_free() to set the ->re_freed
field and to free the entry.

3. Thread A continues execution of route_lookup(). Its rep pointer is non-
NULL, but line 36 sees that its ->re_freed field is non-zero, so line 37 invokes
abort().

The problem is that the reference count is located in the object to be protected, but
that means that there is no protection during the instant in time when the reference
count itself is being acquired! This is the reference-counting counterpart of a locking
issue noted by Gamsa et al. [GKAS99]. One could imagine using a global lock or
reference count to protect the per-route-entry reference-count acquisition, but this
would result in severe contention issues. Although algorithms exist that allow safe
reference-count acquisition in a concurrent environment [Val95], they are not only

1 int hp_store(void **p, void **hp)
2 {
3 void *tmp;
4
5 tmp = ACCESS_ONCE(*p);
6 ACCESS_ONCE(*hp) = tmp;
7 smp_mb();
8 if (tmp != ACCESS_ONCE(*p) ||
9 tmp == HAZPTR_POISON) {
10 ACCESS_ONCE(*hp) = NULL;
11 return 0;
12 }
13 return 1;
14 }
15
16 void hp_erase(void **hp)
17 {
18 smp_mb();
19 ACCESS_ONCE(*hp) = NULL;
20 hp_free(hp);
21 }

Figure 9.6: Hazard-Pointer Storage and Erasure

extremely complex and error-prone [MS95], but also provide terrible performance and
scalability [HMBW07].
In short, concurrency has most definitely reduced the usefulness of reference count-
ing!
Quick Quiz 9.5: If concurrency has “most definitely reduced the usefulness of
reference counting”, why are there so many reference counters in the Linux kernel?
That said, sometimes it is necessary to look at a problem in an entirely different way
in order to successfully solve it. The next section describes what could be thought of as
an inside-out reference count that provides decent performance and scalability.

9.3 Hazard Pointers


One way of avoiding problems with concurrent reference counting is to implement the
reference counters inside out, that is, rather than incrementing an integer stored in the
data element, instead store a pointer to that data element in per-CPU (or per-thread)
lists. Each element of these lists is called a hazard pointer [Mic04].3 The value of a
given data element’s “virtual reference counter” can then be obtained by counting the
number of hazard pointers referencing that element. Therefore, if that element has been
rendered inaccessible to readers, and there are no longer any hazard pointers referencing
it, that element may safely be freed.
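The reclamation side might be sketched as follows. The names hp_slots[], hp_retire(), and
hp_scan() are illustrative rather than being the route_hazptr.c API, the sketch assumes
that retired objects have already been made unreachable to new readers, and it omits the
memory ordering that a production-quality implementation would require:

#include <stdlib.h>

#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))  /* as used elsewhere in this book */
#define NR_HP 256

void *hp_slots[NR_HP];                  /* all threads' published hazard pointers */

struct retired {
  struct retired *next;
  void *obj;
};
static __thread struct retired *retired_list;  /* this thread's deferred frees */

void hp_retire(void *obj)               /* analogous in spirit to hazptr_free_later() */
{
  struct retired *r = malloc(sizeof(*r));

  r->obj = obj;
  r->next = retired_list;
  retired_list = r;
}

void hp_scan(void)                      /* free whatever no hazard pointer still references */
{
  struct retired *r = retired_list;
  struct retired *keep = NULL;
  struct retired *next;

  retired_list = NULL;
  for (; r != NULL; r = next) {
    int hazarded = 0;

    next = r->next;
    for (int i = 0; i < NR_HP; i++)
      if (ACCESS_ONCE(hp_slots[i]) == r->obj)
        hazarded = 1;
    if (hazarded) {                     /* still referenced: retry on a later scan */
      r->next = keep;
      keep = r;
    } else {
      free(r->obj);                     /* its "virtual reference count" is zero */
      free(r);
    }
  }
  retired_list = keep;
}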
Of course, this means that hazard-pointer acquisition must be carried out quite care-
fully in order to avoid destructive races with concurrent deletion. One implementation
is shown in Figure 9.6, which shows hp_store() on lines 1-14 and hp_erase()
on lines 16-21. The smp_mb() primitive will be described in detail in Section 14.2,
but may be ignored for the purposes of this brief overview.
The hp_store() function records a hazard pointer at hp for the data element
whose pointer is referenced by p, while checking for concurrent modifications. If a
concurrent modification occurred, hp_store() refuses to record a hazard pointer,
and returns zero to indicate that the caller must restart its traversal from the beginning.

3 Also independently invented by others [HLM02].



1 struct route_entry {
2 struct hazptr_head hh;
3 struct route_entry *re_next;
4 unsigned long addr;
5 unsigned long iface;
6 int re_freed;
7 };
8 struct route_entry route_list;
9 DEFINE_SPINLOCK(routelock);
10 hazard_pointer __thread *my_hazptr;
11
12 unsigned long route_lookup(unsigned long addr)
13 {
14 int offset = 0;
15 struct route_entry *rep;
16 struct route_entry **repp;
17
18 retry:
19 repp = &route_list.re_next;
20 do {
21 rep = ACCESS_ONCE(*repp);
22 if (rep == NULL)
23 return ULONG_MAX;
24 if (rep == (struct route_entry *)HAZPTR_POISON)
25 goto retry;
26 my_hazptr[offset].p = &rep->hh;
27 offset = !offset;
28 smp_mb();
29 if (ACCESS_ONCE(*repp) != rep)
30 goto retry;
31 repp = &rep->re_next;
32 } while (rep->addr != addr);
33 if (ACCESS_ONCE(rep->re_freed))
34 abort();
35 return rep->iface;
36 }

Figure 9.7: Hazard-Pointer Pre-BSD Routing Table Lookup

Otherwise, hp_store() returns one to indicate that it successfully recorded a hazard
pointer for the data element.
Quick Quiz 9.6: Why does hp_store() in Figure 9.6 take a double indirection
to the data element? Why not void * instead of void **?
Quick Quiz 9.7: Why does hp_store()’s caller need to restart its traversal from
the beginning in case of failure? Isn’t that inefficient for large data structures?
Quick Quiz 9.8: Given that papers on hazard pointers use the bottom bits of each
pointer to mark deleted elements, what is up with HAZPTR_POISON?
Because algorithms using hazard pointers might be restarted at any step of their
traversal through the data structure, such algorithms must typically take care to avoid
making any changes to the data structure until after they have acquired all relevant
hazard pointers.
Quick Quiz 9.9: But don’t these restrictions on hazard pointers also apply to other
forms of reference counting?
These restrictions result in great benefits to readers, courtesy of the fact that the
hazard pointers are stored local to each CPU or thread, which in turn allows traversals
of the data structures themselves to be carried out in a completely read-only fashion.
Referring back to Figure 5.29 on page 90, hazard pointers enable the CPU caches to
do resource replication, which in turn allows weakening of the parallel-access-control
mechanism, thus boosting performance and scalability. Performance comparisons with
other mechanisms may be found in Chapter 10 and in other publications [HMBW07,
McK13, Mic04].

1 int route_add(unsigned long addr,
2 unsigned long interface)
3 {
4 struct route_entry *rep;
5
6 rep = malloc(sizeof(*rep));
7 if (!rep)
8 return -ENOMEM;
9 rep->addr = addr;
10 rep->iface = interface;
11 rep->re_freed = 0;
12 spin_lock(&routelock);
13 rep->re_next = route_list.re_next;
14 route_list.re_next = rep;
15 spin_unlock(&routelock);
16 return 0;
17 }
18
19 int route_del(unsigned long addr)
20 {
21 struct route_entry *rep;
22 struct route_entry **repp;
23
24 spin_lock(&routelock);
25 repp = &route_list.re_next;
26 for (;;) {
27 rep = *repp;
28 if (rep == NULL)
29 break;
30 if (rep->addr == addr) {
31 *repp = rep->re_next;
32 rep->re_next =
33 (struct route_entry *)HAZPTR_POISON;
34 spin_unlock(&routelock);
35 hazptr_free_later(&rep->hh);
36 return 0;
37 }
38 repp = &rep->re_next;
39 }
40 spin_unlock(&routelock);
41 return -ENOENT;
42 }

Figure 9.8: Hazard-Pointer Pre-BSD Routing Table Add/Delete



[Graph: “Lookups per Millisecond” (0 to 450,000) versus “Number of CPUs (Threads)” (1 to 8),
with traces “ideal”, “hazptr”, and “refcnt”.]

Figure 9.9: Pre-BSD Routing Table Protected by Hazard Pointers

The Pre-BSD routing example can use hazard pointers as shown in Figure 9.7
for data structures and route_lookup(), and in Figure 9.8 for route_add()
and route_del() (route_hazptr.c). As with reference counting, the hazard-
pointers implementation is quite similar to the sequential algorithm shown in Figure 9.2
on page 171, so only differences will be discussed.
Starting with Figure 9.7, line 2 shows the ->hh field used to queue objects pending
hazard-pointer free, line 6 shows the ->re_freed field used to detect use-after-free
bugs, and lines 24-30 attempt to acquire a hazard pointer, branching to line 18’s retry
label on failure.
In Figure 9.8, line 11 initializes ->re_freed, lines 32 and 33 poison the ->re_
next field of the newly removed object, and line 35 passes that object to the hazard
pointers’ hazptr_free_later() function, which will free that object once it is
safe to do so. The spinlocks work the same as in Figure 9.4.
Figure 9.9 shows the hazard-pointers-protected Pre-BSD routing algorithm’s per-
formance on the same read-only workload as for Figure 9.5. Although hazard pointers
scale much better than does reference counting, hazard pointers still require readers
to write to shared memory (albeit with much improved locality of reference), and
also require a full memory barrier and retry check for each object traversed. Therefore,
the performance of hazard pointers falls far short of ideal. On the other hand, hazard pointers
do operate correctly for workloads involving concurrent updates.
Quick Quiz 9.10: The paper “Structured Deferral: Synchronization via Procrasti-
nation” [McK13] shows that hazard pointers have near-ideal performance. Whatever
happened in Figure 9.9???
The next section attempts to improve on hazard pointers by using sequence locks,
which avoid both read-side writes and per-object memory barriers.

9.4 Sequence Locks


Sequence locks are used in the Linux kernel for read-mostly data that must be seen in
a consistent state by readers. However, unlike reader-writer locking, readers do not
exclude writers. Instead, like hazard pointers, sequence locks force readers to retry
an operation if they detect activity from a concurrent writer. As can be seen from

[Cartoon: a reader exclaims “Ah, I finally got done reading!”, only to be told
“No, you didn’t! Start over!”]

Figure 9.10: Reader And Uncooperative Sequence Lock

Figure 9.10, it is important to design code using sequence locks so that readers very
rarely need to retry.
Quick Quiz 9.11: Why isn’t this sequence-lock discussion in Chapter 7, you know,
the one on locking?
The key component of sequence locking is the sequence number, which has an even
value in the absence of updaters and an odd value if there is an update in progress.
Readers can then snapshot the value before and after each access. If either snapshot has
an odd value, or if the two snapshots differ, there has been a concurrent update, and the
reader must discard the results of the access and then retry it. Readers therefore use
the read_seqbegin() and read_seqretry() functions shown in Figure 9.11
when accessing data protected by a sequence lock. Writers must increment the value
before and after each update, and only one writer is permitted at a given time. Writers
therefore use the write_seqlock() and write_sequnlock() functions shown
in Figure 9.12 when updating data protected by a sequence lock.
As a result, sequence-lock-protected data can have an arbitrarily large number of
concurrent readers, but only one writer at a time. Sequence locking is used in the Linux
kernel to protect calibration quantities used for timekeeping. It is also used in pathname
traversal to detect concurrent rename operations.
A simple implementation of sequence locks is shown in Figure 9.13 (seqlock.h).
The seqlock_t data structure is shown on lines 1-4, and contains the sequence
number along with a lock to serialize writers. Lines 6-10 show seqlock_init(),

1 do {
2 seq = read_seqbegin(&test_seqlock);
3 /* read-side access. */
4 } while (read_seqretry(&test_seqlock, seq));

Figure 9.11: Sequence-Locking Reader


1 write_seqlock(&test_seqlock);
2 /* Update */
3 write_sequnlock(&test_seqlock);

Figure 9.12: Sequence-Locking Writer



1 typedef struct {
2 unsigned long seq;
3 spinlock_t lock;
4 } seqlock_t;
5
6 static void seqlock_init(seqlock_t *slp)
7 {
8 slp->seq = 0;
9 spin_lock_init(&slp->lock);
10 }
11
12 static unsigned long read_seqbegin(seqlock_t *slp)
13 {
14 unsigned long s;
15
16 s = ACCESS_ONCE(slp->seq);
17 smp_mb();
18 return s & ~0x1UL;
19 }
20
21 static int read_seqretry(seqlock_t *slp,
22 unsigned long oldseq)
23 {
24 unsigned long s;
25
26 smp_mb();
27 s = ACCESS_ONCE(slp->seq);
28 return s != oldseq;
29 }
30
31 static void write_seqlock(seqlock_t *slp)
32 {
33 spin_lock(&slp->lock);
34 ++slp->seq;
35 smp_mb();
36 }
37
38 static void write_sequnlock(seqlock_t *slp)
39 {
40 smp_mb();
41 ++slp->seq;
42 spin_unlock(&slp->lock);
43 }

Figure 9.13: Sequence-Locking Implementation

which, as the name indicates, initializes a seqlock_t.


Lines 12-19 show read_seqbegin(), which begins a sequence-lock read-side
critical section. Line 16 takes a snapshot of the sequence counter, and line 17 orders
this snapshot operation before the caller’s critical section. Finally, line 18 returns the
value of the snapshot (with the least-significant bit cleared), which the caller will pass
to a later call to read_seqretry().
Quick Quiz 9.12: Why not have read_seqbegin() in Figure 9.13 check for
the low-order bit being set, and retry internally, rather than allowing a doomed read to
start?
Lines 21-29 show read_seqretry(), which returns true if there was at least one
writer since the time of the corresponding call to read_seqbegin(), in which case
the reader must retry. Line 26 orders the caller’s prior critical section before line 27’s
fetch of the new snapshot of the sequence counter. Finally, line 28 checks whether the
sequence counter has changed, in other words, whether there has been a writer, and
returns true if so.
Quick Quiz 9.13: Why is the smp_mb() on line 26 of Figure 9.13 needed?
Quick Quiz 9.14: Can’t weaker memory barriers be used in the code in Figure 9.13?

1 struct route_entry {
2 struct route_entry *re_next;
3 unsigned long addr;
4 unsigned long iface;
5 int re_freed;
6 };
7 struct route_entry route_list;
8 DEFINE_SEQ_LOCK(sl);
9
10 unsigned long route_lookup(unsigned long addr)
11 {
12 struct route_entry *rep;
13 struct route_entry **repp;
14 unsigned long ret;
15 unsigned long s;
16
17 retry:
18 s = read_seqbegin(&sl);
19 repp = &route_list.re_next;
20 do {
21 rep = ACCESS_ONCE(*repp);
22 if (rep == NULL) {
23 if (read_seqretry(&sl, s))
24 goto retry;
25 return ULONG_MAX;
26 }
27 repp = &rep->re_next;
28 } while (rep->addr != addr);
29 if (ACCESS_ONCE(rep->re_freed))
30 abort();
31 ret = rep->iface;
32 if (read_seqretry(&sl, s))
33 goto retry;
34 return ret;
35 }

Figure 9.14: Sequence-Locked Pre-BSD Routing Table Lookup (BUGGY!!!)

Quick Quiz 9.15: What prevents sequence-locking updaters from starving readers?

Lines 31-36 show write_seqlock(), which simply acquires the lock, incre-
ments the sequence number, and executes a memory barrier to ensure that this in-
crement is ordered before the caller’s critical section. Lines 38-43 show write_
sequnlock(), which executes a memory barrier to ensure that the caller’s critical
section is ordered before the increment of the sequence number on line 41, then releases
the lock.
Quick Quiz 9.16: What if something else serializes writers, so that the lock is not
needed?
Quick Quiz 9.17: Why isn’t seq on line 2 of Figure 9.13 unsigned rather than
unsigned long? After all, if unsigned is good enough for the Linux kernel,
shouldn’t it be good enough for everyone?
So what happens when sequence locking is applied to the Pre-BSD routing table?
Figure 9.14 shows the data structures and route_lookup(), and Figure 9.15 shows
route_add() and route_del() (route_seqlock.c). This implementation
is once again similar to its counterparts in earlier sections, so only the differences will
be highlighted.
In Figure 9.14, line 5 adds ->re_freed, which is checked on lines 29 and 30.
Line 8 adds a sequence lock, which is used by route_lookup() on lines 18, 23,
and 32, with lines 24 and 33 branching back to the retry label on line 17. The effect
is to retry any lookup that runs concurrently with an update.

1 int route_add(unsigned long addr,
2 unsigned long interface)
3 {
4 struct route_entry *rep;
5
6 rep = malloc(sizeof(*rep));
7 if (!rep)
8 return -ENOMEM;
9 rep->addr = addr;
10 rep->iface = interface;
11 rep->re_freed = 0;
12 write_seqlock(&sl);
13 rep->re_next = route_list.re_next;
14 route_list.re_next = rep;
15 write_sequnlock(&sl);
16 return 0;
17 }
18
19 int route_del(unsigned long addr)
20 {
21 struct route_entry *rep;
22 struct route_entry **repp;
23
24 write_seqlock(&sl);
25 repp = &route_list.re_next;
26 for (;;) {
27 rep = *repp;
28 if (rep == NULL)
29 break;
30 if (rep->addr == addr) {
31 *repp = rep->re_next;
32 write_sequnlock(&sl);
33 smp_mb();
34 rep->re_freed = 1;
35 free(rep);
36 return 0;
37 }
38 repp = &rep->re_next;
39 }
40 write_sequnlock(&sl);
41 return -ENOENT;
42 }

Figure 9.15: Sequence-Locked Pre-BSD Routing Table Add/Delete (BUGGY!!!)



[Graph: “Lookups per Millisecond” (0 to 450,000) versus “Number of CPUs (Threads)” (1 to 8),
with traces “ideal”, “seqlock”, “hazptr”, and “refcnt”.]

Figure 9.16: Pre-BSD Routing Table Protected by Sequence Locking

In Figure 9.15, lines 12, 15, 24, 32, and 40 acquire and release the sequence lock, while
lines 11 and 34 handle ->re_freed. This implementation is therefore quite
straightforward.
It also performs better on the read-only workload, as can be seen in Figure 9.16,
though its performance is still far from ideal.
Unfortunately, it also suffers use-after-free failures. The problem is that the reader
might encounter a segmentation violation due to accessing an already-freed structure
before it comes to the read_seqretry().
Quick Quiz 9.18: Can this bug be fixed? In other words, can you use sequence locks
as the only synchronization mechanism protecting a linked list supporting concurrent
addition, deletion, and lookup?
Both the read-side and write-side critical sections of a sequence lock can be thought
of as transactions, and sequence locking therefore can be thought of as a limited form
of transactional memory, which will be discussed in Section 17.2. The limitations of
sequence locking are: (1) Sequence locking restricts updates and (2) sequence locking
does not permit traversal of pointers to objects that might be freed by updaters. These
limitations are of course overcome by transactional memory, but can also be overcome
by combining other synchronization primitives with sequence locking.
Sequence locks allow writers to defer readers, but not vice versa. This can result
in unfairness and even starvation in writer-heavy workloads. On the other hand, in the
absence of writers, sequence-lock readers are reasonably fast and scale linearly. It is only
human to want the best of both worlds: fast readers without the possibility of read-side
failure, let alone starvation. In addition, it would also be nice to overcome sequence
locking’s limitations with pointers. The following section presents a synchronization
mechanism with exactly these properties.

9.5 Read-Copy Update (RCU)


All of the mechanisms discussed in the preceding sections used one of a number of
approaches to defer specific actions until they may be carried out safely. The reference
counters discussed in Section 9.2 use explicit counters to defer actions that could disturb
readers, which results in read-side contention and thus poor scalability. The hazard
pointers covered by Section 9.3 use implicit counters in the guise of per-thread lists of
pointers. This avoids read-side contention, but requires full memory barriers in read-side
primitives. The sequence lock presented in Section 9.4 also avoids read-side contention,
but does not protect pointer traversals and, like hazard pointers, requires full memory
barriers in read-side primitives. These schemes’ shortcomings raise the question of
whether it is possible to do better.
This section introduces read-copy update (RCU), which provides an API that allows
delays to be identified in the source code, rather than as expensive updates to shared data.
The remainder of this section examines RCU from a number of different perspectives.
Section 9.5.1 provides the classic introduction to RCU, Section 9.5.2 covers fundamental
RCU concepts, Section 9.5.3 introduces some common uses of RCU, Section 9.5.4
presents the Linux-kernel API, Section 9.5.5 covers a sequence of “toy” implementations
of user-level RCU, and finally Section 9.5.6 provides some RCU exercises.

9.5.1 Introduction to RCU


The approaches discussed in the preceding sections have provided some scalability but
decidedly non-ideal performance for the Pre-BSD routing table. It would be nice if
the overhead of Pre-BSD lookups was the same as that of a single-threaded lookup,
so that the parallel lookups would execute the same sequence of assembly language
instructions as would a single-threaded lookup. Although this is a nice goal, it does
raise some serious implementability questions. But let’s see what happens if we try,
treating insertion and deletion separately.
A classic approach for insertion is shown in Figure 9.17. The first row shows the
default state, with gptr equal to NULL. In the second row, we have allocated a structure
which is uninitialized, as indicated by the question marks. In the third row, we have
initialized the structure. Next, we assign gptr to reference this new element.4 On
modern general-purpose systems, this assignment is atomic in the sense that concurrent
readers will see either a NULL pointer or a pointer to the new structure p, but not some
mash-up containing bits from both values. Each reader is therefore guaranteed to either
get the default value of NULL or to get the newly installed non-default values, but either
way each reader will see a consistent result. Even better, readers need not use any
expensive synchronization primitives, so this approach is quite suitable for real-time
use.5
But sooner or later, it will be necessary to remove data that is being referenced by
concurrent readers. Let us move to a more complex example where we are removing
an element from a linked list, as shown in Figure 9.18. This list initially contains
elements A, B, and C, and we need to remove element B. First, we use list_del() to
carry out the removal,6 at which point all new readers will see element B as having been
deleted from the list. However, there might be old readers still referencing this element.
Once all these old readers have finished, we can safely free element B, resulting in the
situation shown at the bottom of the figure.
But how can we tell when the readers are finished?
It is tempting to consider a reference-counting scheme, but Figure 5.3 in Chapter 5

4 On many computer systems, simple assignment is insufficient due to interference from both the compiler
and the CPU. These issues will be covered in Section 9.5.2.


5 Again, on many computer systems, additional work is required to prevent interference from the compiler,
and, on DEC Alpha systems, the CPU as well. This will be covered in Section 9.5.2.
6 And yet again, this approximates reality, which will be expanded on in Section 9.5.2.

[Figure: four states of the global pointer gptr: (1) gptr is initially NULL; (2) after
kmalloc(), gptr is still NULL and the new element’s ->addr and ->iface fields are
uninitialized; (3) after initialization, the element holds ->addr=42 and ->iface=1 but
is not yet reachable; (4) after “gptr = p” (almost), gptr references the new element.]

Figure 9.17: Insertion With Concurrent Readers

shows that this can also result in long delays, just as can the locking and sequence-
locking approaches that we already rejected.
Let’s consider the logical extreme where the readers do absolutely nothing to
announce their presence. This approach clearly allows optimal performance for readers
(after all, free is a very good price), but leaves open the question of how the updater can
possibly determine when all the old readers are done. We clearly need some additional
constraints if we are to provide a reasonable answer to this question.
One constraint that fits well with some operating-system kernels is to consider the
case where threads are not subject to preemption. In such non-preemptible environments,
each thread runs until it explicitly and voluntarily blocks. This means that an infinite
loop without blocking will render a CPU useless for any other purpose from the start of
the infinite loop onwards.7 Non-preemptibility also requires that threads be prohibited
from blocking while holding spinlocks. Without this prohibition, all CPUs might be
consumed by threads spinning attempting to acquire a spinlock held by a blocked thread.
The spinning threads will not relinquish their CPUs until they acquire the lock, but
the thread holding the lock cannot possibly release it until one of the spinning threads
relinquishes a CPU. This is a classic deadlock situation.
Let us impose this same constraint on reader threads traversing the linked list:

7 In contrast, an infinite loop in a preemptible environment might be preempted. This infinite loop might
still waste considerable CPU time, but the CPU in question would nevertheless be able to do other work.

[Figure: four states of a list initially containing elements A, B, and C, each state
annotated with possible readers: (1) the initial single-version list; (2) after list_del()
(almost), new readers see only A and C while old readers might still reference B, so that
there are two versions; (3) after waiting for readers, no reader references B and a single
version remains; (4) after free(), the list contains only A and C.]

Figure 9.18: Deletion From Linked List With Concurrent Readers

such threads are not allowed to block until after completing their traversal. Returning
to the second row of Figure 9.18, where the updater has just completed executing
list_del(), imagine that CPU 0 executes a context switch. Because readers are
not permitted to block while traversing the linked list, we are guaranteed that all prior
readers that might have been running on CPU 0 will have completed. Extending this
line of reasoning to the other CPUs, once each CPU has been observed executing a
context switch, we are guaranteed that all prior readers have completed, and that there
are no longer any reader threads referencing element B. The updater can then safely
free element B, resulting in the state shown at the bottom of Figure 9.18.
This approach is termed quiescent state based reclamation (QSBR) [HMB06]. A
QSBR schematic is shown in Figure 9.19, with time advancing from the top of the figure
to the bottom.
Although production-quality implementations of this approach can be quite complex,
a toy implementation is exceedingly simple:
1 for_each_online_cpu(cpu)
2 run_on(cpu);

The for_each_online_cpu() primitive iterates over all CPUs, and the run_
on() function causes the current thread to execute on the specified CPU, which forces
the destination CPU to execute a context switch. Therefore, once the for_each_
online_cpu() has completed, each CPU has executed a context switch, which in
turn guarantees that all pre-existing reader threads have completed.
Please note that this approach is not production quality. Correct handling of a

[Figure: schematic with time advancing down the page and columns for CPU 1, CPU 2,
and CPU 3: CPU 1 executes list_del(), after which the grace period extends until each
CPU has been observed executing a context switch, so that all pre-existing readers have
completed before free() is invoked.]

Figure 9.19: RCU QSBR: Waiting for Pre-Existing Readers

number of corner cases and the need for a number of powerful optimizations mean that
production-quality implementations have significant additional complexity. In addition,
RCU implementations for preemptible environments require that readers actually do
something. However, this simple non-preemptible approach is conceptually complete,
and forms a good initial basis for understanding the RCU fundamentals covered in the
following section.

9.5.2 RCU Fundamentals


Read-copy update (RCU) is a synchronization mechanism that was added to the Linux
kernel in October of 2002. RCU achieves scalability improvements by allowing reads
to occur concurrently with updates. In contrast with conventional locking primitives
that ensure mutual exclusion among concurrent threads regardless of whether they be
readers or updaters, or with reader-writer locks that allow concurrent reads but not in the
presence of updates, RCU supports concurrency between a single updater and multiple
readers. RCU ensures that reads are coherent by maintaining multiple versions of objects
and ensuring that they are not freed up until all pre-existing read-side critical sections
complete. RCU defines and uses efficient and scalable mechanisms for publishing and
reading new versions of an object, and also for deferring the collection of old versions.
These mechanisms distribute the work among read and update paths in such a way as
to make read paths extremely fast, using replication and weakening optimizations in a
manner similar to hazard pointers, but without the need for read-side retries. In some
cases (non-preemptible kernels), RCU’s read-side primitives have zero overhead.
Quick Quiz 9.19: But doesn’t Section 9.4’s seqlock also permit readers and updaters
to get work done concurrently?
This leads to the question “What exactly is RCU?”, and perhaps also to the question
“How can RCU possibly work?” (or, not infrequently, the assertion that RCU cannot

1 struct foo {
2 int a;
3 int b;
4 int c;
5 };
6 struct foo *gp = NULL;
7
8 /* . . . */
9
10 p = kmalloc(sizeof(*p), GFP_KERNEL);
11 p->a = 1;
12 p->b = 2;
13 p->c = 3;
14 gp = p;

Figure 9.20: Data Structure Publication (Unsafe)

possibly work). This section addresses these questions from a fundamental viewpoint; Sections 9.5.3 and 9.5.4 then look at RCU from the usage and API viewpoints, respectively.
RCU is made up of three fundamental mechanisms, the first being used for insertion,
the second being used for deletion, and the third being used to allow readers to tolerate
concurrent insertions and deletions. Section 9.5.2.1 describes the publish-subscribe
mechanism used for insertion, Section 9.5.2.2 describes how waiting for pre-existing RCU readers enables deletion, and Section 9.5.2.3 discusses how maintaining multiple
versions of recently updated objects permits concurrent insertions and deletions. Finally,
Section 9.5.2.4 summarizes RCU fundamentals.

9.5.2.1 Publish-Subscribe Mechanism


One key attribute of RCU is the ability to safely scan data, even though that data is
being modified concurrently. To provide this ability for concurrent insertion, RCU uses
what can be thought of as a publish-subscribe mechanism. For example, consider an
initially NULL global pointer gp that is to be modified to point to a newly allocated and
initialized data structure. The code fragment shown in Figure 9.20 (with the addition of
appropriate locking) might be used for this purpose.
Unfortunately, there is nothing forcing the compiler and CPU to execute the last four
assignment statements in order. If the assignment to gp happens before the initialization
of p fields, then concurrent readers could see the uninitialized values. Memory barriers
are required to keep things ordered, but memory barriers are notoriously difficult to use.
We therefore encapsulate them into a primitive rcu_assign_pointer() that has
publication semantics. The last four lines would then be as follows:
1 p->a = 1;
2 p->b = 2;
3 p->c = 3;
4 rcu_assign_pointer(gp, p);

The rcu_assign_pointer() would publish the new structure, forcing both


the compiler and the CPU to execute the assignment to gp after the assignments to the
fields referenced by p.
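Conceptually, rcu_assign_pointer() is nothing more than an assignment with publication (release) ordering. The following minimal sketch conveys the idea; it is not necessarily the Linux kernel's actual definition, which has varied over time and includes additional debugging checks.

/* Sketch only: order the initialization of the pointed-to fields
 * before the store that publishes the pointer. */
#define my_rcu_assign_pointer(p, v) \
	do { \
		smp_wmb();		/* initialization before publication */ \
		ACCESS_ONCE(p) = (v); \
	} while (0)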
However, it is not sufficient to only enforce ordering at the updater, as the reader must
enforce proper ordering as well. Consider for example the following code fragment:
1 p = gp;
2 if (p != NULL) {
3 do_something_with(p->a, p->b, p->c);
4 }

[Figure 9.21: Linux Circular Linked List. A list header followed by elements A, B, and C, each with next and prev pointers.]

[Figure 9.22: Linux Linked List Abbreviated. The same list showing only the elements A, B, and C.]

Although this code fragment might well seem immune to misordering, unfortunately,
the DEC Alpha CPU [McK05a, McK05b] and value-speculation compiler optimizations
can, believe it or not, cause the values of p->a, p->b, and p->c to be fetched before
the value of p. This is perhaps easiest to see in the case of value-speculation compiler optimizations, where the compiler guesses the value of p, fetches p->a, p->b, and p->c, and then fetches the actual value of p in order to check whether its guess was correct.
This sort of optimization is quite aggressive, perhaps insanely so, but does actually
occur in the context of profile-driven optimization.
Clearly, we need to prevent this sort of skullduggery on the part of both the compiler
and the CPU. The rcu_dereference() primitive uses whatever memory-barrier
instructions and compiler directives are required for this purpose:8
1 rcu_read_lock();
2 p = rcu_dereference(gp);
3 if (p != NULL) {
4 do_something_with(p->a, p->b, p->c);
5 }
6 rcu_read_unlock();

The rcu_dereference() primitive can thus be thought of as subscribing to a


given value of the specified pointer, guaranteeing that subsequent dereference opera-
tions will see any initialization that occurred before the corresponding rcu_assign_
pointer() operation that published that pointer. The rcu_read_lock() and
rcu_read_unlock() calls are absolutely required: they define the extent of the RCU read-side critical section. Their purpose is explained in Section 9.5.2.2; however, they never spin or block, nor do they prevent the list_add_rcu() from executing
concurrently. In fact, in non-CONFIG_PREEMPT kernels, they generate absolutely no
code.
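To see how the read-side primitives can be so cheap, consider the following toy sketch for a non-preemptible environment. These definitions are illustrations only, not the Linux kernel's actual ones: the read-side markers need only keep the compiler from moving code across them, and the subscription operation need only defeat compiler value speculation and, on DEC Alpha, CPU reordering.

#define toy_rcu_read_lock()	barrier()	/* compiler constraint, no instructions */
#define toy_rcu_read_unlock()	barrier()	/* compiler constraint, no instructions */

#define toy_rcu_dereference(p) \
	({ \
		typeof(p) _p1 = ACCESS_ONCE(p); \
		smp_read_barrier_depends();	/* no-op except on DEC Alpha */ \
		_p1; \
	})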
Although rcu_assign_pointer() and rcu_dereference() can in the-
ory be used to construct any conceivable RCU-protected data structure, in practice it is
often better to use higher-level constructs. Therefore, the rcu_assign_pointer()
and rcu_dereference() primitives have been embedded in special RCU vari-
ants of Linux’s list-manipulation API. Linux has two variants of doubly linked list,
8 In the Linux kernel, rcu_dereference() is implemented via a volatile cast, and, on DEC Alpha,

a memory barrier instruction. In the C11 and C++11 standards, memory_order_consume is intended
to provide longer-term support for rcu_dereference(), but no compilers implement this natively yet.
(They instead strengthen memory_order_consume to memory_order_acquire, thus emitting a
needless memory-barrier instruction on weakly ordered systems.)

1 struct foo {
2 struct list_head list;
3 int a;
4 int b;
5 int c;
6 };
7 LIST_HEAD(head);
8
9 /* . . . */
10
11 p = kmalloc(sizeof(*p), GFP_KERNEL);
12 p->a = 1;
13 p->b = 2;
14 p->c = 3;
15 list_add_rcu(&p->list, &head);

Figure 9.23: RCU Data Structure Publication

[Figure 9.24: Linux Linear Linked List. An hlist header containing a single first pointer, followed by elements A, B, and C.]

the circular struct list_head and the linear struct hlist_head/struct


hlist_node pair. The former is laid out as shown in Figure 9.21, where the green
(leftmost) boxes represent the list header and the blue (rightmost three) boxes represent
the elements in the list. This notation is cumbersome, and will therefore be abbreviated
as shown in Figure 9.22, which shows only the non-header (blue) elements.
Adapting the pointer-publish example for the linked list results in the code shown in
Figure 9.23.
Line 15 must be protected by some synchronization mechanism (most commonly
some sort of lock) to prevent multiple list_add_rcu() instances from executing
concurrently. However, such synchronization does not prevent this list_add_rcu() instance from executing concurrently with RCU readers.
Subscribing to an RCU-protected list is straightforward:
1 rcu_read_lock();
2 list_for_each_entry_rcu(p, head, list) {
3 do_something_with(p->a, p->b, p->c);
4 }
5 rcu_read_unlock();

The list_add_rcu() primitive publishes an entry, inserting it at the head of the


specified list, guaranteeing that the corresponding list_for_each_entry_rcu()
invocation will properly subscribe to this same entry.
Quick Quiz 9.20: What prevents the list_for_each_entry_rcu() from
getting a segfault if it happens to execute at exactly the same time as the list_add_
rcu()?
Linux’s other doubly linked list, the hlist, is a linear list, which means that it needs
only one pointer for the header rather than the two required for the circular list, as
shown in Figure 9.24. Thus, use of hlist can halve the memory consumption for the
hash-bucket arrays of large hash tables. As before, this notation is cumbersome, so
hlists will be abbreviated in the same way lists are, as shown in Figure 9.22.
Publishing a new element to an RCU-protected hlist is quite similar to doing so for

1 struct foo {
2 struct hlist_node list;
3 int a;
4 int b;
5 int c;
6 };
7 HLIST_HEAD(head);
8
9 /* . . . */
10
11 p = kmalloc(sizeof(*p), GFP_KERNEL);
12 p->a = 1;
13 p->b = 2;
14 p->c = 3;
15 hlist_add_head_rcu(&p->list, &head);

Figure 9.25: RCU hlist Publication

Category   Publish                        Retract                          Subscribe

Pointers   rcu_assign_pointer()           rcu_assign_pointer(..., NULL)    rcu_dereference()

Lists      list_add_rcu()                 list_del_rcu()                   list_for_each_entry_rcu()
           list_add_tail_rcu()
           list_replace_rcu()

Hlists     hlist_add_after_rcu()          hlist_del_rcu()                  hlist_for_each_entry_rcu()
           hlist_add_before_rcu()
           hlist_add_head_rcu()
           hlist_replace_rcu()

Table 9.1: RCU Publish and Subscribe Primitives

the circular list, as shown in Figure 9.25.


As before, line 15 must be protected by some sort of synchronization mechanism,
for example, a lock.
Subscribing to an RCU-protected hlist is also similar to the circular list:
1 rcu_read_lock();
2 hlist_for_each_entry_rcu(p, head, list) {
3 do_something_with(p->a, p->b, p->c);
4 }
5 rcu_read_unlock();

The set of RCU publish and subscribe primitives are shown in Table 9.1, along with
additional primitives to “unpublish”, or retract.
Note that the list_replace_rcu(), list_del_rcu(), hlist_replace_
rcu(), and hlist_del_rcu() APIs add a complication. When is it safe to free
up the data element that was replaced or removed? In particular, how can we possibly
know when all the readers have released their references to that data element?
These questions are addressed in the following section.

9.5.2.2 Wait For Pre-Existing RCU Readers to Complete


In its most basic form, RCU is a way of waiting for things to finish. Of course, there
are a great many other ways of waiting for things to finish, including reference counts,
reader-writer locks, events, and so on. The great advantage of RCU is that it can wait
for each of (say) 20,000 different things without having to explicitly track each and
every one of them, and without having to worry about the performance degradation,
scalability limitations, complex deadlock scenarios, and memory-leak hazards that are
inherent in schemes using explicit tracking.
In RCU’s case, the things waited on are called “RCU read-side critical sections”.
An RCU read-side critical section starts with an rcu_read_lock() primitive, and

[Figure 9.26: Readers and RCU Grace Period. Time runs left to right through removal and then reclamation; the grace period extends as needed until every reader that began before the removal has completed.]

ends with a corresponding rcu_read_unlock() primitive. RCU read-side critical


sections can be nested, and may contain pretty much any code, as long as that code does
not explicitly block or sleep (although a special form of RCU called SRCU [McK06]
does permit general sleeping in SRCU read-side critical sections). If you abide by these
conventions, you can use RCU to wait for any desired piece of code to complete.
RCU accomplishes this feat by indirectly determining when these other things have
finished [McK07f, McK07a].
In particular, as shown in Figure 9.26, RCU is a way of waiting for pre-existing RCU
read-side critical sections to completely finish, including memory operations executed
by those critical sections. However, note that RCU read-side critical sections that begin
after the beginning of a given grace period can and will extend beyond the end of that
grace period.
The following pseudocode shows the basic form of algorithms that use RCU to wait
for readers:

1. Make a change, for example, replace an element in a linked list.

2. Wait for all pre-existing RCU read-side critical sections to completely finish (for
example, by using the synchronize_rcu() primitive or its asynchronous
counterpart, call_rcu(), which invokes a specified function at the end of a
future grace period). The key observation here is that subsequent RCU read-side
critical sections have no way to gain a reference to the newly removed element.

3. Clean up, for example, free the element that was replaced above.

The code fragment shown in Figure 9.27, adapted from those in Section 9.5.2.1,
demonstrates this process, with field a being the search key.
Lines 19, 20, and 21 implement the three steps called out above. Lines 16-19 give RCU (“read-copy update”) its name: while permitting concurrent reads, line 16 copies and lines 17-19 do an update.
As discussed in Section 9.5.1, the synchronize_rcu() primitive can be quite
simple (see Section 9.5.5 for additional “toy” RCU implementations). However,
production-quality implementations must deal with difficult corner cases and also incor-
porate powerful optimizations, both of which result in significant complexity. Although
it is good to know that there is a simple conceptual implementation of synchronize_
rcu(), other questions remain. For example, what exactly do RCU readers see when

1 struct foo {
2 struct list_head list;
3 int a;
4 int b;
5 int c;
6 };
7 LIST_HEAD(head);
8
9 /* . . . */
10
11 p = search(head, key);
12 if (p == NULL) {
13 /* Take appropriate action, unlock, & return. */
14 }
15 q = kmalloc(sizeof(*p), GFP_KERNEL);
16 *q = *p;
17 q->b = 2;
18 q->c = 3;
19 list_replace_rcu(&p->list, &q->list);
20 synchronize_rcu();
21 kfree(p);

Figure 9.27: Canonical RCU Replacement Example

traversing a concurrently updated list? This question is addressed in the following


section.

9.5.2.3 Maintain Multiple Versions of Recently Updated Objects


This section demonstrates how RCU maintains multiple versions of lists to accommodate
synchronization-free readers. Two examples are presented showing how an element
that might be referenced by a given reader must remain intact while that reader remains
in its RCU read-side critical section. The first example demonstrates deletion of a list
element, and the second example demonstrates replacement of an element.

Example 1: Maintaining Multiple Versions During Deletion We can now revisit


the deletion example from Section 9.5.1, but now with the benefit of a firm understanding
of the fundamental concepts underlying RCU. To begin this new version of the deletion
example, we will modify lines 11-21 in Figure 9.27 to read as follows:
1 p = search(head, key);
2 if (p != NULL) {
3 list_del_rcu(&p->list);
4 synchronize_rcu();
5 kfree(p);
6 }

This code will update the list as shown in Figure 9.28. The triples in each element
represent the values of fields a, b, and c, respectively. The red-shaded elements indicate
that RCU readers might be holding references to them, so in the initial state at the
top of the diagram, all elements are shaded red. Please note that we have omitted the
backwards pointers and the link from the tail of the list to the head for clarity.
After the list_del_rcu() on line 3 has completed, the 5,6,7 element has
been removed from the list, as shown in the second row of Figure 9.28. Since readers do
not synchronize directly with updaters, readers might be concurrently scanning this list.
These concurrent readers might or might not see the newly removed element, depending
on timing. However, readers that were delayed (e.g., due to interrupts, ECC memory
errors, or, in CONFIG_PREEMPT_RT kernels, preemption) just after fetching a pointer
to the newly removed element might see the old version of the list for quite some time

[Figure 9.28: RCU Deletion From Linked List. The list 1,2,3; 5,6,7; 11,4,8 is shown in its initial state, after list_del_rcu(), after synchronize_rcu(), and finally after kfree(), at which point the list contains only 1,2,3 and 11,4,8.]

after the removal. Therefore, we now have two versions of the list, one with element
5,6,7 and one without. The 5,6,7 element in the second row of the figure is now
shaded yellow, indicating that old readers might still be referencing it, but that new
readers cannot obtain a reference to it.
Please note that readers are not permitted to maintain references to element 5,6,7
after exiting from their RCU read-side critical sections. Therefore, once the synchronize_
rcu() on line 4 completes, so that all pre-existing readers are guaranteed to have
completed, there can be no more readers referencing this element, as indicated by its
green shading on the third row of Figure 9.28. We are thus back to a single version of
the list.
At this point, the 5,6,7 element may safely be freed, as shown on the final row
of Figure 9.28. At this point, we have completed the deletion of element 5,6,7. The
following example covers replacement.

Example 2: Maintaining Multiple Versions During Replacement To start the re-


placement example, here are the last few lines of the example shown in Figure 9.27:
1 q = kmalloc(sizeof(*p), GFP_KERNEL);
2 *q = *p;
3 q->b = 2;
4 q->c = 3;
5 list_replace_rcu(&p->list, &q->list);
6 synchronize_rcu();
7 kfree(p);

The initial state of the list, including the pointer p, is the same as for the deletion
example, as shown on the first row of Figure 9.29.
As before, the triples in each element represent the values of fields a, b, and c,

[Figure 9.29: RCU Replacement in Linked List. The list 1,2,3; 5,6,7; 11,4,8 is shown after each step: Allocate (new element ?,?,?), Copy (5,6,7), Update (5,2,3), list_replace_rcu(), synchronize_rcu(), and kfree(), ending with the list 1,2,3; 5,2,3; 11,4,8.]



respectively. The red-shaded elements might be referenced by readers, and because


readers do not synchronize directly with updaters, readers might run concurrently with
this entire replacement process. Please note that we again omit the backwards pointers
and the link from the tail of the list to the head for clarity.
The following text describes how to replace the 5,6,7 element with 5,2,3 in
such a way that any given reader sees one of these two values.
Line 1 kmalloc()s a replacement element, resulting in the state shown in the second row of Figure 9.29. At this point, no reader can hold a reference to
the newly allocated element (as indicated by its green shading), and it is uninitialized
(as indicated by the question marks).
Line 2 copies the old element to the new one, resulting in the state as shown in the
third row of Figure 9.29. The newly allocated element still cannot be referenced by
readers, but it is now initialized.
Line 3 updates q->b to the value “2”, and line 4 updates q->c to the value “3”, as
shown on the fourth row of Figure 9.29.
Now, line 5 does the replacement, so that the new element is finally visible to
readers, and hence is shaded red, as shown on the fifth row of Figure 9.29. At this point,
as shown below, we have two versions of the list. Pre-existing readers might see the
5,6,7 element (which is therefore now shaded yellow), but new readers will instead
see the 5,2,3 element. But any given reader is guaranteed to see some well-defined
list.
After the synchronize_rcu() on line 6 returns, a grace period will have
elapsed, and so all reads that started before the list_replace_rcu() will have
completed. In particular, any readers that might have been holding references to the
5,6,7 element are guaranteed to have exited their RCU read-side critical sections, and
are thus prohibited from continuing to hold a reference. Therefore, there can no longer
be any readers holding references to the old element, as indicated by its green shading in
the sixth row of Figure 9.29. As far as the readers are concerned, we are back to having
a single version of the list, but with the new element in place of the old.
After the kfree() on line 7 completes, the list will appear as shown on the final
row of Figure 9.29.
Despite the fact that RCU was named after the replacement case, the vast majority
of RCU usage within the Linux kernel relies on the simple deletion case shown in
Section 9.5.2.3.

Discussion These examples assumed that a mutex was held across the entire update
operation, which would mean that there could be at most two versions of the list active
at a given time.
Quick Quiz 9.21: How would you modify the deletion example to permit more
than two versions of the list to be active?
Quick Quiz 9.22: How many RCU versions of a given list can be active at any
given time?
This sequence of events shows how RCU updates use multiple versions to safely
carry out changes in the presence of concurrent readers. Of course, some algorithms cannot
gracefully handle multiple versions. There are techniques for adapting such algorithms
to RCU [McK04], but these are beyond the scope of this section.

9.5.2.4 Summary of RCU Fundamentals


This section has described the three fundamental components of RCU-based algorithms:

Mechanism RCU Replaces Section


Reader-writer locking Section 9.5.3.2
Restricted reference-counting mechanism Section 9.5.3.3
Bulk reference-counting mechanism Section 9.5.3.4
Poor man’s garbage collector Section 9.5.3.5
Existence Guarantees Section 9.5.3.6
Type-Safe Memory Section 9.5.3.7
Wait for things to finish Section 9.5.3.8

Table 9.2: RCU Usage

1. a publish-subscribe mechanism for adding new data,

2. a way of waiting for pre-existing RCU readers to finish, and

3. a discipline of maintaining multiple versions to permit change without harming


or unduly delaying concurrent RCU readers.

Quick Quiz 9.23: How can RCU updaters possibly delay RCU readers, given
that the rcu_read_lock() and rcu_read_unlock() primitives neither spin
nor block?
These three RCU components allow data to be updated in face of concurrent readers,
and can be combined in different ways to implement a surprising variety of different
types of RCU-based algorithms, some of which are described in the following section.

9.5.3 RCU Usage


This section answers the question “What is RCU?” from the viewpoint of the uses to
which RCU can be put. Because RCU is most frequently used to replace some existing
mechanism, we look at it primarily in terms of its relationship to such mechanisms, as
listed in Table 9.2. Following the sections listed in this table, Section 9.5.3.9 provides a
summary.

9.5.3.1 RCU for Pre-BSD Routing


Figures 9.30 and 9.31 show code for an RCU-protected Pre-BSD routing table (route_
rcu.c). The former shows data structures and route_lookup(), and the latter
shows route_add() and route_del().
In Figure 9.30, line 2 adds the ->rh field used by RCU reclamation, line 6 adds the
->re_freed use-after-free-check field, lines 16, 17, 23, and 27 add RCU read-side
protection, and lines 21 and 22 add the use-after-free check. In Figure 9.31, lines 12,
14, 31, 36, and 41 add update-side locking, lines 13 and 35 add RCU update-side
protection, line 37 causes route_cb() to be invoked after a grace period elapses, and
lines 18-25 define route_cb(). This is minimal added code for a working concurrent
implementation.
Figure 9.32 shows the performance on the read-only workload. RCU scales quite
well, and offers nearly ideal performance. However, this data was generated using
the RCU_SIGNAL flavor of userspace RCU [Des09, MDJ13c], for which rcu_read_
lock() and rcu_read_unlock() generate a small amount of code. What happens
for the QSBR flavor of RCU, which generates no code at all for rcu_read_lock()
and rcu_read_unlock()? (See Section 9.5.1, and especially Figure 9.19, for a
discussion of RCU QSBR.)

1 struct route_entry {
2 struct rcu_head rh;
3 struct cds_list_head re_next;
4 unsigned long addr;
5 unsigned long iface;
6 int re_freed;
7 };
8 CDS_LIST_HEAD(route_list);
9 DEFINE_SPINLOCK(routelock);
10
11 unsigned long route_lookup(unsigned long addr)
12 {
13 struct route_entry *rep;
14 unsigned long ret;
15
16 rcu_read_lock();
17 cds_list_for_each_entry_rcu(rep, &route_list,
18 re_next) {
19 if (rep->addr == addr) {
20 ret = rep->iface;
21 if (ACCESS_ONCE(rep->re_freed))
22 abort();
23 rcu_read_unlock();
24 return ret;
25 }
26 }
27 rcu_read_unlock();
28 return ULONG_MAX;
29 }

Figure 9.30: RCU Pre-BSD Routing Table Lookup

The answer to this question is shown in Figure 9.33, which shows the RCU QSBR results as the trace between the RCU and the ideal traces. RCU QSBR’s performance and scalability are very nearly those of an ideal synchronization-free workload, as desired.
Quick Quiz 9.24: Why doesn’t RCU QSBR give exactly ideal results?
Quick Quiz 9.25: Given RCU QSBR’s read-side performance, why bother with
any other flavor of userspace RCU?

9.5.3.2 RCU is a Reader-Writer Lock Replacement


Perhaps the most common use of RCU within the Linux kernel is as a replacement
for reader-writer locking in read-intensive situations. Nevertheless, this use of RCU was not immediately apparent to me at the outset; in fact, I chose to implement a lightweight reader-writer lock [HW92]9 before implementing a general-purpose RCU
implementation back in the early 1990s. Each and every one of the uses I envisioned for
the lightweight reader-writer lock was instead implemented using RCU. In fact, it was
more than three years before the lightweight reader-writer lock saw its first use. Boy,
did I feel foolish!
The key similarity between RCU and reader-writer locking is that both have read-
side critical sections that can execute in parallel. In fact, in some cases, it is possible to
mechanically substitute RCU API members for the corresponding reader-writer lock
API members. But first, why bother?
Advantages of RCU include performance, deadlock immunity, and realtime latency.
There are, of course, limitations to RCU, including the fact that readers and updaters run
concurrently, that low-priority RCU readers can block high-priority threads waiting for a
grace period to elapse, and that grace-period latencies can extend for many milliseconds.
These advantages and limitations are discussed in the following sections.
9 Similar to brlock in the 2.4 Linux kernel and to lglock in more recent Linux kernels.

1 int route_add(unsigned long addr,


2 unsigned long interface)
3 {
4 struct route_entry *rep;
5
6 rep = malloc(sizeof(*rep));
7 if (!rep)
8 return -ENOMEM;
9 rep->addr = addr;
10 rep->iface = interface;
11 rep->re_freed = 0;
12 spin_lock(&routelock);
13 cds_list_add_rcu(&rep->re_next, &route_list);
14 spin_unlock(&routelock);
15 return 0;
16 }
17
18 static void route_cb(struct rcu_head *rhp)
19 {
20 struct route_entry *rep;
21
22 rep = container_of(rhp, struct route_entry, rh);
23 ACCESS_ONCE(rep->re_freed) = 1;
24 free(rep);
25 }
26
27 int route_del(unsigned long addr)
28 {
29 struct route_entry *rep;
30
31 spin_lock(&routelock);
32 cds_list_for_each_entry(rep, &route_list,
33 re_next) {
34 if (rep->addr == addr) {
35 cds_list_del_rcu(&rep->re_next);
36 spin_unlock(&routelock);
37 call_rcu(&rep->rh, route_cb);
38 return 0;
39 }
40 }
41 spin_unlock(&routelock);
42 return -ENOENT;
43 }

Figure 9.31: RCU Pre-BSD Routing Table Add/Delete



[Figure 9.32: Pre-BSD Routing Table Protected by RCU. Lookups per millisecond versus number of CPUs (threads) for the ideal, RCU, seqlock, hazptr, and refcnt implementations.]

[Figure 9.33: Pre-BSD Routing Table Protected by RCU QSBR. Lookups per millisecond versus number of CPUs (threads); the RCU QSBR trace lies between the RCU and ideal traces.]

Performance The read-side performance advantages of RCU over reader-writer lock-


ing are shown in Figure 9.34.
Quick Quiz 9.26: WTF? How the heck do you expect me to believe that RCU
has a 100-femtosecond overhead when the clock period at 3GHz is more than 300
picoseconds?
Note that reader-writer locking is orders of magnitude slower than RCU on a single
CPU, and is almost two additional orders of magnitude slower on 16 CPUs. In contrast,
RCU scales quite well. In both cases, the error bars span a single standard deviation in
either direction.
A more moderate view may be obtained from a CONFIG_PREEMPT kernel, though
RCU still beats reader-writer locking by between one and three orders of magnitude,
as shown in Figure 9.35. Note the high variability of reader-writer locking at larger
numbers of CPUs. The error bars span a single standard deviation in either direction.
Of course, the low performance of reader-writer locking in Figure 9.35 is exaggerated
by the unrealistic zero-length critical sections. The performance advantages of RCU
become less significant as the overhead of the critical section increases, as shown in
Figure 9.36 for a 16-CPU system, in which the y-axis represents the sum of the overhead

[Figure 9.34: Performance Advantage of RCU Over Reader-Writer Locking. Overhead in nanoseconds versus number of CPUs for rwlock and rcu.]

[Figure 9.35: Performance Advantage of Preemptible RCU Over Reader-Writer Locking. Overhead in nanoseconds versus number of CPUs for rwlock and rcu.]

of the read-side primitives and that of the critical section.


Quick Quiz 9.27: Why do both the variability and overhead of rwlock decrease as the critical-section overhead increases?
However, this observation must be tempered by the fact that a number of system
calls (and thus any RCU read-side critical sections that they contain) can complete
within a few microseconds.
In addition, as is discussed in the next section, RCU read-side primitives are almost
entirely deadlock-immune.

Deadlock Immunity Although RCU offers significant performance advantages for


read-mostly workloads, one of the primary reasons for creating RCU in the first place
was in fact its immunity to read-side deadlocks. This immunity stems from the fact that
RCU read-side primitives do not block, spin, or even do backwards branches, so that
their execution time is deterministic. It is therefore impossible for them to participate in
a deadlock cycle.
Quick Quiz 9.28: Is there an exception to this deadlock immunity, and if so, what

[Figure 9.36: Comparison of RCU to Reader-Writer Locking as Function of Critical-Section Duration. Overhead in nanoseconds versus critical-section duration in microseconds for rwlock and rcu on a 16-CPU system.]

sequence of events could lead to deadlock?


An interesting consequence of RCU’s read-side deadlock immunity is that it is
possible to unconditionally upgrade an RCU reader to an RCU updater. Attempting
to do such an upgrade with reader-writer locking results in deadlock. A sample code
fragment that does an RCU read-to-update upgrade follows:
1 rcu_read_lock();
2 list_for_each_entry_rcu(p, &head, list_field) {
3 do_something_with(p);
4 if (need_update(p)) {
5 spin_lock(&my_lock);
6 do_update(p);
7 spin_unlock(&my_lock);
8 }
9 }
10 rcu_read_unlock();

Note that do_update() is executed under the protection of the lock and under
RCU read-side protection.
Another interesting consequence of RCU’s deadlock immunity is its immunity to a
large class of priority inversion problems. For example, low-priority RCU readers cannot
prevent a high-priority RCU updater from acquiring the update-side lock. Similarly, a
low-priority RCU updater cannot prevent high-priority RCU readers from entering an
RCU read-side critical section.
Quick Quiz 9.29: Immunity to both deadlock and priority inversion??? Sounds too
good to be true. Why should I believe that this is even possible?

Realtime Latency Because RCU read-side primitives neither spin nor block, they
offer excellent realtime latencies. In addition, as noted earlier, this means that they are
immune to priority inversion involving the RCU read-side primitives and locks.
However, RCU is susceptible to more subtle priority-inversion scenarios, for exam-
ple, a high-priority process blocked waiting for an RCU grace period to elapse can be
blocked by low-priority RCU readers in -rt kernels. This can be solved by using RCU
priority boosting [McK07c, GMTW08].

[Figure 9.37: Response Time of RCU vs. Reader-Writer Locking. A timeline in which, once the update is received, rwlock readers spin until the rwlock writer completes, whereas RCU readers run concurrently with the RCU updater; only the rightmost (green-shaded) readers are guaranteed to see the new value.]

RCU Readers and Updaters Run Concurrently Because RCU readers neither spin nor block, and because updaters are not subject to any sort of rollback or abort semantics,
RCU readers and updaters must necessarily run concurrently. This means that RCU
readers might access stale data, and might even see inconsistencies, either of which can
render conversion from reader-writer locking to RCU non-trivial.
However, in a surprisingly large number of situations, inconsistencies and stale data
are not problems. The classic example is the networking routing table. Because routing
updates can take considerable time to reach a given system (seconds or even minutes),
the update arrives, the system will have been sending packets the wrong way for quite some time. It is usually not a problem to continue sending packets the wrong way for a few additional milliseconds. Furthermore, because RCU updaters can make
changes without waiting for RCU readers to finish, the RCU readers might well see the
change more quickly than would batch-fair reader-writer-locking readers, as shown in
Figure 9.37.
Once the update is received, the rwlock writer cannot proceed until the last reader
completes, and subsequent readers cannot proceed until the writer completes. However,
these subsequent readers are guaranteed to see the new value, as indicated by the green
shading of the rightmost boxes. In contrast, RCU readers and updaters do not block
each other, which permits the RCU readers to see the updated values sooner. Of course,
because their execution overlaps that of the RCU updater, all of the RCU readers might
well see updated values, including the three readers that started before the update.
Nevertheless only the green-shaded rightmost RCU readers are guaranteed to see the
updated values.
Reader-writer locking and RCU simply provide different guarantees. With reader-
writer locking, any reader that begins after the writer begins is guaranteed to see new
values, and any reader that attempts to begin while the writer is spinning might or
might not see new values, depending on the reader/writer preference of the rwlock
implementation in question. In contrast, with RCU, any reader that begins after the
updater completes is guaranteed to see new values, and any reader that completes after
the updater begins might or might not see new values, depending on timing.
The key point here is that, although reader-writer locking does indeed guarantee
consistency within the confines of the computer system, there are situations where this
consistency comes at the price of increased inconsistency with the outside world. In

other words, reader-writer locking obtains internal consistency at the price of silently
stale data with respect to the outside world.
Nevertheless, there are situations where inconsistency and stale data within the
confines of the system cannot be tolerated. Fortunately, there are a number of approaches
that avoid inconsistency and stale data [McK04, ACMS03], and some methods based
on reference counting are discussed in Section 9.2.

Low-Priority RCU Readers Can Block High-Priority Reclaimers In Realtime


RCU [GMTW08], SRCU [McK06], or QRCU [McK07e] (see Section 12.1.4), a pre-
empted reader will prevent a grace period from completing, even if a high-priority task
is blocked waiting for that grace period to complete. Realtime RCU can avoid this
problem by substituting call_rcu() for synchronize_rcu() or by using RCU
priority boosting [McK07c, GMTW08], which is still in experimental status as of early
2008. It might become necessary to augment SRCU and QRCU with priority boosting,
but not before a clear real-world need is demonstrated.

RCU Grace Periods Extend for Many Milliseconds With the exception of QRCU
and several of the “toy” RCU implementations described in Section 9.5.5, RCU grace
periods extend for multiple milliseconds. Although there are a number of techniques to
render such long delays harmless, including use of the asynchronous interfaces where
available (call_rcu() and call_rcu_bh()), this situation is a major reason for
the rule of thumb that RCU be used in read-mostly situations.
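As a sketch of the asynchronous approach, an updater that would otherwise invoke synchronize_rcu() followed by kfree() can instead embed an rcu_head in the structure and pass a callback to call_rcu(), which returns immediately and invokes the callback after a later grace period. The struct foo and its helper functions here are hypothetical.

struct foo {
	struct rcu_head rh;	/* used by call_rcu() */
	int a;
};

static void foo_reclaim(struct rcu_head *rhp)
{
	struct foo *fp = container_of(rhp, struct foo, rh);

	kfree(fp);	/* runs only after a grace period has elapsed */
}

void foo_defer_free(struct foo *p)
{
	/* Instead of: synchronize_rcu(); kfree(p); */
	call_rcu(&p->rh, foo_reclaim);	/* returns without waiting */
}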

Comparison of Reader-Writer Locking and RCU Code In the best case, the con-
version from reader-writer locking to RCU is quite simple, as shown in Figures 9.38,
9.39, and 9.40, all taken from Wikipedia [MPA+ 06].
Reader-writer locking:
1 struct el {
2 struct list_head lp;
3 long key;
4 spinlock_t mutex;
5 int data;
6 /* Other data fields */
7 };
8 DEFINE_RWLOCK(listmutex);
9 LIST_HEAD(head);

RCU:
1 struct el {
2 struct list_head lp;
3 long key;
4 spinlock_t mutex;
5 int data;
6 /* Other data fields */
7 };
8 DEFINE_SPINLOCK(listmutex);
9 LIST_HEAD(head);

Figure 9.38: Converting Reader-Writer Locking to RCU: Data

Reader-writer locking:
1 int search(long key, int *result)
2 {
3 struct el *p;
4
5 read_lock(&listmutex);
6 list_for_each_entry(p, &head, lp) {
7 if (p->key == key) {
8 *result = p->data;
9 read_unlock(&listmutex);
10 return 1;
11 }
12 }
13 read_unlock(&listmutex);
14 return 0;
15 }

RCU:
1 int search(long key, int *result)
2 {
3 struct el *p;
4
5 rcu_read_lock();
6 list_for_each_entry_rcu(p, &head, lp) {
7 if (p->key == key) {
8 *result = p->data;
9 rcu_read_unlock();
10 return 1;
11 }
12 }
13 rcu_read_unlock();
14 return 0;
15 }

Figure 9.39: Converting Reader-Writer Locking to RCU: Search



Reader-writer locking:
1 int delete(long key)
2 {
3 struct el *p;
4
5 write_lock(&listmutex);
6 list_for_each_entry(p, &head, lp) {
7 if (p->key == key) {
8 list_del(&p->lp);
9 write_unlock(&listmutex);
10 kfree(p);
11 return 1;
12 }
13 }
14 write_unlock(&listmutex);
15 return 0;
16 }

RCU:
1 int delete(long key)
2 {
3 struct el *p;
4
5 spin_lock(&listmutex);
6 list_for_each_entry(p, &head, lp) {
7 if (p->key == key) {
8 list_del_rcu(&p->lp);
9 spin_unlock(&listmutex);
10 synchronize_rcu();
11 kfree(p);
12 return 1;
13 }
14 }
15 spin_unlock(&listmutex);
16 return 0;
17 }

Figure 9.40: Converting Reader-Writer Locking to RCU: Deletion

More-elaborate cases of replacing reader-writer locking with RCU are beyond the
scope of this document.

9.5.3.3 RCU is a Restricted Reference-Counting Mechanism


Because grace periods are not allowed to complete while there is an RCU read-side
critical section in progress, the RCU read-side primitives may be used as a restricted
reference-counting mechanism. For example, consider the following code fragment:
1 rcu_read_lock(); /* acquire reference. */
2 p = rcu_dereference(head);
3 /* do something with p. */
4 rcu_read_unlock(); /* release reference. */

The rcu_read_lock() primitive can be thought of as acquiring a reference


to p, because a grace period starting after the rcu_dereference() assigns to p
cannot possibly end until after we reach the matching rcu_read_unlock(). This
reference-counting scheme is restricted in that we are not allowed to block in RCU
read-side critical sections, nor are we permitted to hand off an RCU read-side critical
section from one task to another.
Regardless of these restrictions, the following code can safely delete p:
1 spin_lock(&mylock);
2 p = head;
3 rcu_assign_pointer(head, NULL);
4 spin_unlock(&mylock);
5 /* Wait for all references to be released. */
6 synchronize_rcu();
7 kfree(p);

The assignment to head prevents any future references to p from being acquired,
and the synchronize_rcu() waits for any previously acquired references to be
released.
Quick Quiz 9.30: But wait! This is exactly the same code that might be used when
thinking of RCU as a replacement for reader-writer locking! What gives?
Of course, RCU can also be combined with traditional reference counting, as
discussed in Section 13.2.
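One common way of combining the two, sketched here with a hypothetical struct obj containing an atomic_t refcnt field and a hypothetical RCU-protected global pointer global_obj, is to acquire a long-lived reference from within an RCU read-side critical section, using atomic_inc_not_zero() to avoid resurrecting an object whose last reference has already been dropped.

struct obj {
	atomic_t refcnt;
	/* ... */
};
static struct obj *global_obj;	/* RCU-protected pointer (hypothetical) */

struct obj *obj_get(void)
{
	struct obj *p;

	rcu_read_lock();
	p = rcu_dereference(global_obj);
	if (p && !atomic_inc_not_zero(&p->refcnt))
		p = NULL;	/* object already being freed: treat as not found */
	rcu_read_unlock();
	return p;	/* if non-NULL, the reference outlives the critical section */
}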
But why bother? Again, part of the answer is performance, as shown in Figure 9.41,
again showing data taken on a 16-CPU 3GHz Intel x86 system.
Quick Quiz 9.31: Why the dip in refcnt overhead near 6 CPUs?

[Figure 9.41: Performance of RCU vs. Reference Counting. Overhead in nanoseconds versus number of CPUs for refcnt and rcu.]


[Figure 9.42: Response Time of RCU vs. Reference Counting. Overhead in nanoseconds versus critical-section duration in microseconds for refcnt and rcu.]

And, as with reader-writer locking, the performance advantages of RCU are most pronounced for short-duration critical sections, as shown in Figure 9.42 for a 16-CPU
system. In addition, as with reader-writer locking, many system calls (and thus any
RCU read-side critical sections that they contain) complete in a few microseconds.
However, the restrictions that go with RCU can be quite onerous. For example, in
many cases, the prohibition against sleeping while in an RCU read-side critical section
would defeat the entire purpose. The next section looks at ways of addressing this
problem, while also reducing the complexity of traditional reference counting, at least
in some cases.

9.5.3.4 RCU is a Bulk Reference-Counting Mechanism


As noted in the preceding section, traditional reference counters are usually associated
with a specific data structure, or perhaps a specific group of data structures. However,
maintaining a single global reference counter for a large variety of data structures
typically results in bouncing the cache line containing the reference count. Such cache-
line bouncing can severely degrade performance.

In contrast, RCU’s light-weight read-side primitives permit extremely frequent read-


side usage with negligible performance degradation, permitting RCU to be used as a
“bulk reference-counting” mechanism with little or no performance penalty. Situations
where a reference must be held by a single task across a section of code that blocks
may be accommodated with Sleepable RCU (SRCU) [McK06]. This fails to cover
the not-uncommon situation where a reference is “passed” from one task to another,
for example, when a reference is acquired when starting an I/O and released in the
corresponding completion interrupt handler. (In principle, this could be handled by the
SRCU implementation, but in practice, it is not yet clear whether this is a good tradeoff.)
Of course, SRCU brings restrictions of its own, namely that the return value from
srcu_read_lock() be passed into the corresponding srcu_read_unlock(),
and that no SRCU primitives be invoked from hardware interrupt handlers or from
non-maskable interrupt (NMI) handlers. The jury is still out as to how much of a
problem is presented by these restrictions, and as to how they can best be handled.
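A sketch of the resulting SRCU usage pattern follows. Here my_srcu is a hypothetical SRCU domain (assumed to have been set up with init_srcu_struct() at initialization time), and unpublish_element(), free_element(), and do_something_that_might_sleep() are hypothetical placeholders; the essential point is that the index returned by srcu_read_lock() must be passed to the matching srcu_read_unlock().

static struct srcu_struct my_srcu;	/* hypothetical SRCU domain */

void srcu_reader(void)
{
	int idx;

	idx = srcu_read_lock(&my_srcu);
	do_something_that_might_sleep();	/* sleeping is legal here */
	srcu_read_unlock(&my_srcu, idx);	/* must receive the same idx */
}

void srcu_updater(void)
{
	unpublish_element();		/* remove readers' path to the element */
	synchronize_srcu(&my_srcu);	/* wait for pre-existing SRCU readers */
	free_element();			/* now safe to reclaim */
}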

9.5.3.5 RCU is a Poor Man’s Garbage Collector


A not-uncommon exclamation made by people first learning about RCU is “RCU is sort
of like a garbage collector!” This exclamation has a large grain of truth, but it can also
be misleading.
Perhaps the best way to think of the relationship between RCU and automatic
garbage collectors (GCs) is that RCU resembles a GC in that the timing of collection is
automatically determined, but that RCU differs from a GC in that: (1) the programmer
must manually indicate when a given data structure is eligible to be collected, and (2) the
programmer must manually mark the RCU read-side critical sections where references
might legitimately be held.
Despite these differences, the resemblance does go quite deep, and has appeared in
at least one theoretical analysis of RCU. Furthermore, the first RCU-like mechanism I
am aware of used a garbage collector to handle the grace periods. Nevertheless, a better
way of thinking of RCU is described in the following section.

9.5.3.6 RCU is a Way of Providing Existence Guarantees


Gamsa et al. [GKAS99] discuss existence guarantees and describe how a mechanism
resembling RCU can be used to provide these existence guarantees (see section 5 on page
7 of the PDF), and Section 7.4 discusses how to guarantee existence via locking, along
with the ensuing disadvantages of doing so. The effect is that if any RCU-protected
data element is accessed within an RCU read-side critical section, that data element is
guaranteed to remain in existence for the duration of that RCU read-side critical section.
Figure 9.43 demonstrates how RCU-based existence guarantees can enable per-
element locking via a function that deletes an element from a hash table. Line 6
computes a hash function, and line 7 enters an RCU read-side critical section. If line 9
finds that the corresponding bucket of the hash table is empty or that the element present
is not the one we wish to delete, then line 10 exits the RCU read-side critical section
and line 11 indicates failure.
Quick Quiz 9.32: What if the element we need to delete is not the first element of
the list on line 9 of Figure 9.43?
Otherwise, line 13 acquires the update-side spinlock, and line 14 then checks that
the element is still the one that we want. If so, line 15 leaves the RCU read-side critical
section, line 16 removes it from the table, line 17 releases the lock, line 18 waits for

1 int delete(int key)


2 {
3 struct element *p;
4 int b;
5
6 b = hashfunction(key);
7 rcu_read_lock();
8 p = rcu_dereference(hashtable[b]);
9 if (p == NULL || p->key != key) {
10 rcu_read_unlock();
11 return 0;
12 }
13 spin_lock(&p->lock);
14 if (hashtable[b] == p && p->key == key) {
15 rcu_read_unlock();
16 rcu_assign_pointer(hashtable[b], NULL);
17 spin_unlock(&p->lock);
18 synchronize_rcu();
19 kfree(p);
20 return 1;
21 }
22 spin_unlock(&p->lock);
23 rcu_read_unlock();
24 return 0;
25 }

Figure 9.43: Existence Guarantees Enable Per-Element Locking

all pre-existing RCU read-side critical sections to complete, line 19 frees the newly
removed element, and line 20 indicates success. If the element is no longer the one we
want, line 22 releases the lock, line 23 leaves the RCU read-side critical section, and
line 24 indicates failure to delete the specified key.
Quick Quiz 9.33: Why is it OK to exit the RCU read-side critical section on line 15
of Figure 9.43 before releasing the lock on line 17?
Quick Quiz 9.34: Why not exit the RCU read-side critical section on line 23 of
Figure 9.43 before releasing the lock on line 22?
Alert readers will recognize this as only a slight variation on the original “RCU
is a way of waiting for things to finish” theme, which is addressed in Section 9.5.3.8.
They might also note the deadlock-immunity advantages over the lock-based existence
guarantees discussed in Section 7.4.

9.5.3.7 RCU is a Way of Providing Type-Safe Memory


A number of lockless algorithms do not require that a given data element keep the same
identity through a given RCU read-side critical section referencing it—but only if that
data element retains the same type. In other words, these lockless algorithms can tolerate
a given data element being freed and reallocated as the same type of structure while they
are referencing it, but must prohibit a change in type. This guarantee, called “type-safe
memory” in academic literature [GC96], is weaker than the existence guarantees in the
previous section, and is therefore quite a bit harder to work with. Type-safe memory
algorithms in the Linux kernel make use of slab caches, specially marking these caches
with SLAB_DESTROY_BY_RCU so that RCU is used when returning a freed-up slab
to system memory. This use of RCU guarantees that any in-use element of such a slab
will remain in that slab, thus retaining its type, for the duration of any pre-existing RCU
read-side critical sections.
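As a sketch, such a cache might be created as shown below; the cache name and element type are hypothetical. Objects freed to this cache may be reused immediately for another object of the same type, but the underlying slab of memory is not returned to the system until a grace period has elapsed.

struct conn {
	spinlock_t lock;
	int state;
	/* ... */
};

static struct kmem_cache *conn_cache;

void conn_cache_init(void)
{
	/* Freed elements may be reused as struct conn right away, but the
	 * slab itself is RCU-freed, preserving type safety for readers. */
	conn_cache = kmem_cache_create("conn_cache", sizeof(struct conn),
				       0, SLAB_DESTROY_BY_RCU, NULL);
}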
Quick Quiz 9.35: But what if there is an arbitrarily long series of RCU read-side
critical sections in multiple threads, so that at any point in time there is at least one
thread in the system executing in an RCU read-side critical section? Wouldn’t that

prevent any data from a SLAB_DESTROY_BY_RCU slab ever being returned to the
system, possibly resulting in OOM events?
These algorithms typically use a validation step that checks to make sure that the
newly referenced data structure really is the one that was requested [LS86, Section 2.5].
These validation checks require that portions of the data structure remain untouched by
the free-reallocate process. Such validation checks are usually very hard to get right,
and can hide subtle and difficult bugs.
Therefore, although type-safety-based lockless algorithms can be extremely helpful
in a very few difficult situations, you should instead use existence guarantees where
possible. Simpler is after all almost always better!

9.5.3.8 RCU is a Way of Waiting for Things to Finish


As noted in Section 9.5.2 an important component of RCU is a way of waiting for RCU
readers to finish. One of RCU’s great strengths is that it allows you to wait for each
of thousands of different things to finish without having to explicitly track each and
every one of them, and without having to worry about the performance degradation,
scalability limitations, complex deadlock scenarios, and memory-leak hazards that are
inherent in schemes that use explicit tracking.
In this section, we will show how synchronize_sched()’s read-side counter-
parts (which include anything that disables preemption, along with hardware operations
and primitives that disable interrupts) permit you to implement interactions with non-
maskable interrupt (NMI) handlers that would be quite difficult if using locking. This
approach has been called “Pure RCU” [McK04], and it is used in a number of places in
the Linux kernel.
The basic form of such “Pure RCU” designs is as follows:

1. Make a change, for example, to the way that the OS reacts to an NMI.

2. Wait for all pre-existing read-side critical sections to completely finish (for ex-
ample, by using the synchronize_sched() primitive). The key observation
here is that subsequent RCU read-side critical sections are guaranteed to see
whatever change was made.

3. Clean up, for example, return status indicating that the change was successfully
made.

The remainder of this section presents example code adapted from the Linux ker-
nel. In this example, the timer_stop function uses synchronize_sched() to
ensure that all in-flight NMI notifications have completed before freeing the associated resources. A simplified version of this code is shown in Figure 9.44.
Lines 1-4 define a profile_buffer structure, containing a size and an indefinite
array of entries. Line 5 defines a pointer to a profile buffer, which is presumably
initialized elsewhere to point to a dynamically allocated region of memory.
Lines 7-16 define the nmi_profile() function, which is called from within an
NMI handler. As such, it cannot be preempted, nor can it be interrupted by a normal interrupt handler; however, it is still subject to delays due to cache misses, ECC errors, and cycle stealing by other hardware threads within the same core. Line 9 gets a local
pointer to the profile buffer using the rcu_dereference() primitive to ensure
memory ordering on DEC Alpha, and lines 11 and 12 exit from this function if there is
no profile buffer currently allocated, while lines 13 and 14 exit from this function if the

1 struct profile_buffer {
2 long size;
3 atomic_t entry[0];
4 };
5 static struct profile_buffer *buf = NULL;
6
7 void nmi_profile(unsigned long pcvalue)
8 {
9 struct profile_buffer *p = rcu_dereference(buf);
10
11 if (p == NULL)
12 return;
13 if (pcvalue >= p->size)
14 return;
15 atomic_inc(&p->entry[pcvalue]);
16 }
17
18 void nmi_stop(void)
19 {
20 struct profile_buffer *p = buf;
21
22 if (p == NULL)
23 return;
24 rcu_assign_pointer(buf, NULL);
25 synchronize_sched();
26 kfree(p);
27 }

Figure 9.44: Using RCU to Wait for NMIs to Finish

pcvalue argument is out of range. Otherwise, line 15 increments the profile-buffer


entry indexed by the pcvalue argument. Note that storing the size with the buffer
guarantees that the range check matches the buffer, even if a large buffer is suddenly
replaced by a smaller one.
Lines 18-27 define the nmi_stop() function, where the caller is responsible for
mutual exclusion (for example, holding the correct lock). Line 20 fetches a pointer to
the profile buffer, and lines 22 and 23 exit the function if there is no buffer. Otherwise,
line 24 NULLs out the profile-buffer pointer (using the rcu_assign_pointer()
primitive to maintain memory ordering on weakly ordered machines), and line 25 waits
for an RCU Sched grace period to elapse, in particular, waiting for all non-preemptible
regions of code, including NMI handlers, to complete. Once execution continues at
line 26, we are guaranteed that any instance of nmi_profile() that obtained a
pointer to the old buffer has returned. It is therefore safe to free the buffer, in this case
using the kfree() primitive.
Quick Quiz 9.36: Suppose that the nmi_profile() function was preemptible.
What would need to change to make this example work correctly?
In short, RCU makes it easy to dynamically switch among profile buffers (you just
try doing this efficiently with atomic operations, or at all with locking!). However, RCU
is normally used at a higher level of abstraction, as was shown in the previous sections.

9.5.3.9 RCU Usage Summary


At its core, RCU is nothing more nor less than an API that provides:
1. a publish-subscribe mechanism for adding new data,
2. a way of waiting for pre-existing RCU readers to finish, and
3. a discipline of maintaining multiple versions to permit change without harming
or unduly delaying concurrent RCU readers.

That said, it is possible to build higher-level constructs on top of RCU, including


the reader-writer-locking, reference-counting, and existence-guarantee constructs listed
in the earlier sections. Furthermore, I have no doubt that the Linux community will
continue to find interesting new uses for RCU, as well as for any of a number of other
synchronization primitives.

Read-Mostly, Stale & Inconsistent Data OK
  (RCU Works Great!!!)

Read-Mostly, Need Consistent Data
  (RCU Works Well)

Read-Write, Need Consistent Data
  (RCU Might Be OK...)

Update-Mostly, Need Consistent Data
  (RCU is Very Unlikely to be the Right Tool For The Job, But it Can:
  (1) Provide Existence Guarantees For Update-Friendly Mechanisms
  (2) Provide Wait-Free Read-Side Primitives for Real-Time Use)

Figure 9.45: RCU Areas of Applicability

In the meantime, Figure 9.45 shows some rough rules of thumb on where RCU is
most helpful.
As shown in the blue box at the top of the figure, RCU works best if you have
read-mostly data where stale and inconsistent data is permissible (but see below for
more information on stale and inconsistent data). The canonical example of this case
in the Linux kernel is routing tables. Because it may have taken many seconds or even minutes for the routing updates to propagate across the Internet, the system has been
sending packets the wrong way for quite some time. Having some small probability of
continuing to send some of them the wrong way for a few more milliseconds is almost
never a problem.
If you have a read-mostly workload where consistent data is required, RCU works
well, as shown by the green “read-mostly, need consistent data” box. One example
of this case is the Linux kernel’s mapping from user-level System-V semaphore IDs
to the corresponding in-kernel data structures. Semaphores tend to be used far more
frequently than they are created and destroyed, so this mapping is read-mostly. However,
it would be erroneous to perform a semaphore operation on a semaphore that has
already been deleted. This need for consistency is handled by using the lock in the
in-kernel semaphore data structure, along with a “deleted” flag that is set when deleting
a semaphore. If a user ID maps to an in-kernel data structure with the “deleted” flag set,
the data structure is ignored, so that the user ID is flagged as invalid.
Although this requires that the readers acquire a lock for the data structure repre-
senting the semaphore itself, it allows them to dispense with locking for the mapping
data structure. The readers therefore locklessly traverse the tree used to map from ID to
data structure, which in turn greatly improves performance, scalability, and real-time
response.
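This pattern might be sketched as follows; the sem_obj structure, sem_find(), and
do_sem_op() are hypothetical stand-ins for the actual System-V semaphore code, but
the shape of the solution is the same: find the object locklessly under RCU, then acquire
the object's own lock and check the “deleted” flag before relying on the object:

  struct sem_obj {
          spinlock_t lock;
          int deleted;                     /* Set under ->lock before the RCU-deferred free. */
          int id;
          /* ... semaphore state ... */
  };

  int sem_op(int id)
  {
          struct sem_obj *sem;
          int ret = -EINVAL;

          rcu_read_lock();
          sem = sem_find(id);              /* Hypothetical lockless ID-to-object lookup. */
          if (sem) {
                  spin_lock(&sem->lock);
                  if (!sem->deleted)       /* Consistent: deletion is excluded by ->lock. */
                          ret = do_sem_op(sem);
                  spin_unlock(&sem->lock);
          }
          rcu_read_unlock();
          return ret;
  }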
As indicated by the yellow “read-write” box, RCU can also be useful for read-write
workloads where consistent data is required, although usually in conjunction with a
number of other synchronization primitives. For example, the directory-entry cache in
recent Linux kernels uses RCU in conjunction with sequence locks, per-CPU locks, and
per-data-structure locks to allow lockless traversal of pathnames in the common case.

Although RCU can be very beneficial in this read-write case, such use is often more
complex than that of the read-mostly cases.
Finally, as indicated by the red box at the bottom of the figure, update-mostly
workloads requiring consistent data are rarely good places to use RCU, though there are
some exceptions [DMS+ 12]. In addition, as noted in Section 9.5.3.7, within the Linux
kernel, the SLAB_DESTROY_BY_RCU slab-allocator flag provides type-safe memory
to RCU readers, which can greatly simplify non-blocking synchronization and other
lockless algorithms.
In short, RCU is an API that includes a publish-subscribe mechanism for adding
new data, a way of waiting for pre-existing RCU readers to finish, and a discipline of
maintaining multiple versions to allow updates to avoid harming or unduly delaying
concurrent RCU readers. This RCU API is best suited for read-mostly situations,
especially if stale and inconsistent data can be tolerated by the application.

9.5.4 RCU Linux-Kernel API


This section looks at RCU from the viewpoint of its Linux-kernel API. Section 9.5.4.1
presents RCU’s wait-to-finish APIs, Section 9.5.4.2 presents RCU’s publish-subscribe
and version-maintenance APIs, Section 9.5.4.3 describes where these APIs may be used,
and, finally, Section 9.5.4.4 presents concluding remarks.

9.5.4.1 RCU has a Family of Wait-to-Finish APIs


The most straightforward answer to “what is RCU” is that RCU is an API used in the
Linux kernel, as summarized by Table 9.3, which shows the wait-for-RCU-readers
portions of the non-sleepable and sleepable APIs, and by Table 9.4, which shows the
publish-subscribe portions of the API.
If you are new to RCU, you might consider focusing on just one of the columns
in Table 9.3, each of which summarizes one member of the Linux kernel’s RCU API
family. For example, if you are primarily interested in understanding how RCU is
used in the Linux kernel, “RCU Classic” would be the place to start, as it is used most
frequently. On the other hand, if you want to understand RCU for its own sake, “SRCU”
has the simplest API. You can always come back for the other columns later.
If you are already familiar with RCU, these tables can serve as a useful reference.
Quick Quiz 9.37: Why do some of the cells in Table 9.3 have exclamation marks
(“!”)?
The “RCU Classic” column corresponds to the original RCU implementation, in
which RCU read-side critical sections are delimited by rcu_read_lock() and rcu_
read_unlock(), which may be nested. The corresponding synchronous update-
side primitives, synchronize_rcu(), along with its synonym synchronize_
net(), wait for any currently executing RCU read-side critical sections to complete.
The length of this wait is known as a “grace period”. The asynchronous update-side
primitive, call_rcu(), invokes a specified function with a specified argument after
a subsequent grace period. For example, call_rcu(p,f); will result in the “RCU
callback” f(p) being invoked after a subsequent grace period. There are situations,
such as when unloading a Linux-kernel module that uses call_rcu(), when it is
necessary to wait for all outstanding RCU callbacks to complete [McK07d]. The
rcu_barrier() primitive does this job. Note that the more recent hierarchical
RCU [McK08a] implementation also adheres to “RCU Classic” semantics.
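For example, an updater might remove an element, use call_rcu() to defer freeing
it, and then use rcu_barrier() at module-unload time to flush any callbacks that are
still pending. The following kernel-style sketch assumes a hypothetical struct foo and
leaves the actual unlinking to the enclosing code:

  struct foo {
          struct rcu_head rcu;             /* Storage used by call_rcu() for the callback. */
          int key;
  };

  static void foo_reclaim(struct rcu_head *rhp)
  {
          struct foo *fp = container_of(rhp, struct foo, rcu);

          kfree(fp);                       /* Runs only after a subsequent grace period. */
  }

  static void foo_remove(struct foo *fp)
  {
          /* ... unlink fp from the enclosing RCU-protected structure ... */
          call_rcu(&fp->rcu, foo_reclaim); /* Asynchronous: returns immediately. */
  }

  static void __exit foo_exit(void)
  {
          /* ... remove all remaining elements ... */
          rcu_barrier();                   /* Wait for all outstanding foo_reclaim() invocations. */
  }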
RCU Classic:
    Purpose: Original.
    Availability: 2.5.43.
    Read-side primitives: rcu_read_lock() !, rcu_read_unlock() !
    Update-side primitives (synchronous): synchronize_rcu(), synchronize_net()
    Update-side primitives (asynchronous/callback): call_rcu() !
    Update-side primitives (wait for callbacks): rcu_barrier()
    Type-safe memory: SLAB_DESTROY_BY_RCU
    Read-side constraints: No blocking.
    Read-side overhead: Preempt disable/enable (free on non-PREEMPT).
    Asynchronous update-side overhead: sub-microsecond.
    Grace-period latency: 10s of milliseconds.
    Non-PREEMPT_RT implementation: RCU Classic.
    PREEMPT_RT implementation: Preemptible RCU.

RCU BH:
    Purpose: Prevent DDoS attacks.
    Availability: 2.6.9.
    Read-side primitives: rcu_read_lock_bh(), rcu_read_unlock_bh()
    Update-side primitives (synchronous): synchronize_rcu_bh()
    Update-side primitives (asynchronous/callback): call_rcu_bh()
    Update-side primitives (wait for callbacks): rcu_barrier_bh()
    Read-side constraints: No bottom-half (BH) enabling.
    Read-side overhead: BH disable/enable.
    Asynchronous update-side overhead: sub-microsecond.
    Grace-period latency: 10s of milliseconds.
    Non-PREEMPT_RT implementation: RCU BH.
    PREEMPT_RT implementation: Realtime RCU.

RCU Sched:
    Purpose: Wait for preempt-disable regions, hardirqs, & NMIs.
    Availability: 2.6.12.
    Read-side primitives: preempt_disable(), preempt_enable() (and friends)
    Update-side primitives (synchronous): synchronize_sched()
    Update-side primitives (asynchronous/callback): call_rcu_sched()
    Update-side primitives (wait for callbacks): rcu_barrier_sched()
    Read-side constraints: No blocking.
    Read-side overhead: Preempt disable/enable (free on non-PREEMPT).
    Asynchronous update-side overhead: sub-microsecond.
    Grace-period latency: 10s of milliseconds.
    Non-PREEMPT_RT implementation: RCU Classic.
    PREEMPT_RT implementation: Forced schedule on all CPUs.

Realtime RCU:
    Purpose: Realtime response.
    Availability: 2.6.26.
    Read-side primitives: rcu_read_lock(), rcu_read_unlock()
    Update-side primitives (synchronous): synchronize_rcu(), synchronize_net()
    Update-side primitives (asynchronous/callback): call_rcu()
    Update-side primitives (wait for callbacks): rcu_barrier()
    Type-safe memory: SLAB_DESTROY_BY_RCU
    Read-side constraints: Only preemption and lock acquisition.
    Read-side overhead: Simple instructions, irq disable/enable.
    Asynchronous update-side overhead: sub-microsecond.
    Grace-period latency: 10s of milliseconds.
    Non-PREEMPT_RT implementation: Preemptible RCU.
    PREEMPT_RT implementation: Realtime RCU.

SRCU:
    Purpose: Sleeping readers.
    Availability: 2.6.19.
    Read-side primitives: srcu_read_lock(), srcu_read_unlock()
    Update-side primitives (synchronous): synchronize_srcu()
    Update-side primitives (asynchronous/callback): call_srcu()
    Update-side primitives (wait for callbacks): N/A
    Read-side constraints: No synchronize_srcu() with same srcu_struct.
    Read-side overhead: Simple instructions, preempt disable/enable, memory barriers.
    Asynchronous update-side overhead: N/A.
    Grace-period latency: 10s of milliseconds.
    Non-PREEMPT_RT implementation: SRCU.
    PREEMPT_RT implementation: SRCU.

Table 9.3: RCU Wait-to-Finish APIs

Finally, RCU may be used to provide type-safe memory [GC96], as described in
Section 9.5.3.7. In the context of RCU, type-safe memory guarantees that a given data
element will not change type during any RCU read-side critical section that accesses
it. To make use of RCU-based type-safe memory, pass SLAB_DESTROY_BY_RCU
to kmem_cache_create(). It is important to note that SLAB_DESTROY_BY_
RCU will in no way prevent kmem_cache_alloc() from immediately reallocating
memory that was just now freed via kmem_cache_free()! In fact, the SLAB_
DESTROY_BY_RCU-protected data structure just returned by rcu_dereference
might be freed and reallocated an arbitrarily large number of times, even when under
the protection of rcu_read_lock(). Instead, SLAB_DESTROY_BY_RCU operates
by preventing kmem_cache_free() from returning a completely freed-up slab of
data structures to the system until after an RCU grace period elapses. In short, although
the data element might be freed and reallocated arbitrarily often, at least its type will
remain the same.
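A typical usage sketch follows; the connection-tracking names are hypothetical, but
the key point carries over to any use of SLAB_DESTROY_BY_RCU: after its lockless
lookup, the reader must acquire the object's lock (or take some similar step) and then
revalidate the object, because the memory might have been freed and reused for a
different object of the same type in the meantime:

  static struct kmem_cache *conn_cache;    /* Hypothetical connection cache. */

  struct conn {
          spinlock_t lock;
          int id;                          /* Readers revalidate this after locking. */
  };

  static int __init conn_cache_init(void)
  {
          conn_cache = kmem_cache_create("conn", sizeof(struct conn),
                                         0, SLAB_DESTROY_BY_RCU, NULL);
          return conn_cache ? 0 : -ENOMEM;
  }

  int conn_send(int id, void *msg)
  {
          struct conn *cp;
          int ret = -ENOENT;

          rcu_read_lock();                 /* Guarantees cp remains a struct conn... */
          cp = conn_find(id);              /* Hypothetical lockless lookup. */
          if (cp) {
                  spin_lock(&cp->lock);
                  if (cp->id == id)        /* ...but not that it is still *this* connection. */
                          ret = conn_do_send(cp, msg);    /* Hypothetical. */
                  spin_unlock(&cp->lock);
          }
          rcu_read_unlock();
          return ret;
  }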
Quick Quiz 9.38: How do you prevent a huge number of RCU read-side critical
sections from indefinitely blocking a synchronize_rcu() invocation?
Quick Quiz 9.39: The synchronize_rcu() API waits for all pre-existing
interrupt handlers to complete, right?
In the “RCU BH” column, rcu_read_lock_bh() and rcu_read_unlock_
bh() delimit RCU read-side critical sections, synchronize_rcu_bh() waits for
a grace period, and call_rcu_bh() invokes the specified function and argument
after a later grace period.
Quick Quiz 9.40: What happens if you mix and match? For example, suppose
you use rcu_read_lock() and rcu_read_unlock() to delimit RCU read-side
critical sections, but then use call_rcu_bh() to post an RCU callback?
Quick Quiz 9.41: Hardware interrupt handlers can be thought of as being under the
protection of an implicit rcu_read_lock_bh(), right?
In the “RCU Sched” column, anything that disables preemption acts as an RCU
read-side critical section, and synchronize_sched() waits for the corresponding
RCU grace period. This RCU API family was added in the 2.6.12 kernel, which split the
old synchronize_kernel() API into the current synchronize_rcu() (for
RCU Classic) and synchronize_sched() (for RCU Sched). Note that RCU Sched
did not originally have an asynchronous call_rcu_sched() interface, but one was
added in 2.6.26. In accordance with the quasi-minimalist philosophy of the Linux
community, APIs are added on an as-needed basis.
Quick Quiz 9.42: What happens if you mix and match RCU Classic and RCU
Sched?
Quick Quiz 9.43: In general, you cannot rely on synchronize_sched() to
wait for all pre-existing interrupt handlers, right?
The “Realtime RCU” column has the same API as does RCU Classic, the only differ-
ence being that RCU read-side critical sections may be preempted and may block while
acquiring spinlocks. The design of Realtime RCU is described elsewhere [McK07a].
The “SRCU” column in Table 9.3 displays a specialized RCU API that permits
general sleeping in RCU read-side critical sections [McK06]. Of course, use of
synchronize_srcu() in an SRCU read-side critical section can result in self-
deadlock, so should be avoided. SRCU differs from earlier RCU implementations in
that the caller allocates an srcu_struct for each distinct SRCU usage. This approach
prevents SRCU read-side critical sections from blocking unrelated synchronize_
srcu() invocations. In addition, in this variant of RCU, srcu_read_lock()
returns a value that must be passed into the corresponding srcu_read_unlock().


Quick Quiz 9.44: Why should you be careful with call_srcu()?
Quick Quiz 9.45: Under what conditions can synchronize_srcu() be safely
used within an SRCU read-side critical section?
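A minimal sketch of SRCU usage follows; my_srcu and the reader and updater functions
are hypothetical, but the SRCU calls themselves are those listed in Table 9.3:

  static struct srcu_struct my_srcu;       /* One srcu_struct per distinct SRCU usage. */
  /* init_srcu_struct(&my_srcu) must run before first use. */

  void my_reader(void)
  {
          int idx;

          idx = srcu_read_lock(&my_srcu);
          /* SRCU read-side critical section: sleeping is permitted here. */
          srcu_read_unlock(&my_srcu, idx); /* Pass back srcu_read_lock()'s return value. */
  }

  void my_updater(void)
  {
          /* ... remove an element from the structure protected by my_srcu ... */
          synchronize_srcu(&my_srcu);      /* Waits only for readers of my_srcu. */
          /* ... now safe to free the removed element ... */
  }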
The Linux kernel currently has a surprising number of RCU APIs and implementa-
tions. There is some hope of reducing this number, evidenced by the fact that a given
build of the Linux kernel currently has at most four implementations behind three APIs
(given that RCU Classic and Realtime RCU share the same API). However, careful
inspection and analysis will be required, just as would be required in order to eliminate
one of the many locking APIs.
The various RCU APIs are distinguished by the forward-progress guarantees that
their RCU read-side critical sections must provide, and also by their scope, as follows:

1. RCU BH: read-side critical sections must guarantee forward progress against
everything except for NMI and interrupt handlers, but not including software-
interrupt (softirq) handlers. RCU BH is global in scope.
2. RCU Sched: read-side critical sections must guarantee forward progress against
everything except for NMI and irq handlers, including softirq handlers. RCU
Sched is global in scope.
3. RCU (both classic and real-time): read-side critical sections must guarantee
forward progress against everything except for NMI handlers, irq handlers,
softirq handlers, and (in the real-time case) higher-priority real-time tasks.
RCU is global in scope.
4. SRCU: read-side critical sections need not guarantee forward progress unless
some other task is waiting for the corresponding grace period to complete, in
which case these read-side critical sections should complete in no more than a
few seconds (and preferably much more quickly).10 SRCU’s scope is defined by
the use of the corresponding srcu_struct.

In other words, SRCU compensates for its extremely weak forward-progress
guarantees by permitting the developer to restrict its scope.

9.5.4.2 RCU has Publish-Subscribe and Version-Maintenance APIs


Fortunately, the RCU publish-subscribe and version-maintenance primitives shown
in the following table apply to all of the variants of RCU discussed above. This
commonality can in some cases allow more code to be shared, which certainly reduces
the API proliferation that would otherwise occur. The original purpose of the RCU
publish-subscribe APIs was to bury memory barriers into these APIs, so that Linux
kernel programmers could use RCU without needing to become expert on the memory-
ordering models of each of the 20+ CPU families that Linux supports [Spr01].
The first pair of categories operate on Linux's struct list_head lists, which
are circular, doubly-linked lists. The list_for_each_entry_rcu() primitive
traverses an RCU-protected list in a type-safe manner, while also enforcing memory
ordering for situations where a new list element is inserted into the list concurrently with
traversal. On non-Alpha platforms, this primitive incurs little or no performance penalty
10 Thanks to James Bottomley for urging me to this formulation, as opposed to simply saying that there
are no forward-progress guarantees.



Category           Primitives                  Availability  Overhead

List traversal     list_for_each_entry_rcu()   2.5.59   Simple instructions (memory barrier on Alpha)
List update        list_add_rcu()              2.5.44   Memory barrier
                   list_add_tail_rcu()         2.5.44   Memory barrier
                   list_del_rcu()              2.5.44   Simple instructions
                   list_replace_rcu()          2.6.9    Memory barrier
                   list_splice_init_rcu()      2.6.21   Grace-period latency
Hlist traversal    hlist_for_each_entry_rcu()  2.6.8    Simple instructions (memory barrier on Alpha)
Hlist update       hlist_add_after_rcu()       2.6.14   Memory barrier
                   hlist_add_before_rcu()      2.6.14   Memory barrier
                   hlist_add_head_rcu()        2.5.64   Memory barrier
                   hlist_del_rcu()             2.5.64   Simple instructions
                   hlist_replace_rcu()         2.6.15   Memory barrier
Pointer traversal  rcu_dereference()           2.6.9    Simple instructions (memory barrier on Alpha)
Pointer update     rcu_assign_pointer()        2.6.10   Memory barrier

Table 9.4: RCU Publish-Subscribe and Version Maintenance APIs

compared to list_for_each_entry(). The list_add_rcu(), list_add_
tail_rcu(), and list_replace_rcu() primitives are analogous to their non-
RCU counterparts, but incur the overhead of an additional memory barrier on weakly-
ordered machines. The list_del_rcu() primitive is also analogous to its non-RCU
counterpart, but oddly enough is very slightly faster due to the fact that it poisons only
the prev pointer rather than both the prev and next pointers as list_del() must
do. Finally, the list_splice_init_rcu() primitive is similar to its non-RCU
counterpart, but incurs a full grace-period latency. The purpose of this grace period
is to allow RCU readers to finish their traversal of the source list before completely
disconnecting it from the list header—failure to do this could prevent such readers from
ever terminating their traversal.
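Returning to the hypothetical routing-table sketch from Section 9.5.3.9, an updater
might use these list primitives as follows, with route_lock being an assumed
update-side lock:

  static DEFINE_SPINLOCK(route_lock);      /* Hypothetical update-side lock. */

  void route_add(struct route_entry *rep)
  {
          spin_lock(&route_lock);
          list_add_rcu(&rep->list, &route_table);  /* Publish rep to concurrent readers. */
          spin_unlock(&route_lock);
  }

  void route_del(struct route_entry *rep)
  {
          spin_lock(&route_lock);
          list_del_rcu(&rep->list);        /* Pre-existing readers may still reference rep. */
          spin_unlock(&route_lock);
          synchronize_rcu();               /* Wait for those pre-existing readers to finish... */
          kfree(rep);                      /* ...after which the element may safely be freed. */
  }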
Quick Quiz 9.46: Why doesn’t list_del_rcu() poison both the next and
prev pointers?
The second pair of categories operate on Linux’s struct hlist_head, which
is a linear linked list. One advantage of struct hlist_head over struct
list_head is that the former requires only a single-pointer list header, which can
save significant memory in large hash tables. The struct hlist_head primitives
in the table relate to their non-RCU counterparts in much the same way as do the
struct list_head primitives.
The final pair of categories operate directly on pointers, and are useful for creating
RCU-protected non-list data structures, such as RCU-protected arrays and trees. The
rcu_assign_pointer() primitive ensures that any prior initialization remains
ordered before the assignment to the pointer on weakly ordered machines. Similarly,
the rcu_dereference() primitive ensures that subsequent code dereferencing
the pointer will see the effects of initialization code prior to the corresponding rcu_
assign_pointer() on Alpha CPUs. On non-Alpha CPUs, rcu_dereference()
documents which pointer dereferences are protected by RCU.
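For non-list data, the pattern might be sketched as follows; cur_config and its fields
are hypothetical, and updater-side mutual exclusion is assumed:

  struct config {
          int a;
          int b;
  };
  static struct config *cur_config;        /* RCU-protected pointer (hypothetical). */

  int config_get_a(void)                   /* Reader. */
  {
          struct config *cp;
          int a = -1;

          rcu_read_lock();
          cp = rcu_dereference(cur_config);        /* Subscribe: ordered after publication. */
          if (cp)
                  a = cp->a;
          rcu_read_unlock();
          return a;
  }

  void config_update(int a, int b)         /* Updater: callers are assumed to serialize. */
  {
          struct config *newp = kmalloc(sizeof(*newp), GFP_KERNEL);
          struct config *oldp = cur_config;

          if (!newp)
                  return;
          newp->a = a;
          newp->b = b;
          rcu_assign_pointer(cur_config, newp);    /* Publish: orders the initialization above. */
          synchronize_rcu();               /* Wait for readers of the old version... */
          kfree(oldp);                     /* ...then free it (kfree(NULL) is a no-op). */
  }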
Quick Quiz 9.47: Normally, any pointer subject to rcu_dereference() must
always be updated using rcu_assign_pointer(). What is an exception to this
rule?

Figure 9.46: RCU API Usage Constraints (a nested-box diagram: the innermost box is
NMI context, enclosed by IRQ context, enclosed in turn by process context, with each
RCU API drawn in the innermost environment from which it may be invoked)

Quick Quiz 9.48: Are there any downsides to the fact that these traversal and update
primitives can be used with any of the RCU API family members?

9.5.4.3 Where Can RCU’s APIs Be Used?

Figure 9.46 shows which APIs may be used in which in-kernel environments. The
RCU read-side primitives may be used in any environment, including NMI, the RCU
mutation and asynchronous grace-period primitives may be used in any environment
other than NMI, and, finally, the RCU synchronous grace-period primitives may be used
only in process context. The RCU list-traversal primitives include list_for_each_
entry_rcu(), hlist_for_each_entry_rcu(), etc. Similarly, the RCU list-
mutation primitives include list_add_rcu(), hlist_del_rcu(), etc.
Note that primitives from other families of RCU may be substituted, for example,
srcu_read_lock() may be used in any context in which rcu_read_lock()
may be used.

9.5.4.4 So, What is RCU Really?

At its core, RCU is nothing more nor less than an API that supports publication and
subscription for insertions, waiting for all RCU readers to complete, and maintenance
of multiple versions. That said, it is possible to build higher-level constructs on top of
RCU, including the reader-writer-locking, reference-counting, and existence-guarantee
constructs listed in Section 9.5.3. Furthermore, I have no doubt that the Linux com-
munity will continue to find interesting new uses for RCU, just as they do for any of a
number of synchronization primitives throughout the kernel.
Of course, a more-complete view of RCU would also include all of the things you
can do with these APIs.
However, for many people, a complete view of RCU must include sample RCU
implementations. The next section therefore presents a series of “toy” RCU implemen-
tations of increasing complexity and capability.

9.5.5 “Toy” RCU Implementations


The toy RCU implementations in this section are designed not for high performance,
practicality, or any kind of production use,11 but rather for clarity. Nevertheless, you will
need a thorough understanding of Chapters 2, 3, 4, and 6, as well as the previous portions
of Chapter 9 for even these toy RCU implementations to be easily understandable.
This section provides a series of RCU implementations in order of increasing
sophistication, from the viewpoint of solving the existence-guarantee problem. Sec-
tion 9.5.5.1 presents a rudimentary RCU implementation based on simple locking, while
Sections 9.5.5.2 through 9.5.5.9 present a series of simple RCU implementations based
on locking, reference counters, and free-running counters. Finally, Section 9.5.5.10
provides a summary and a list of desirable RCU properties.

9.5.5.1 Lock-Based RCU


Perhaps the simplest RCU implementation leverages locking, as shown in Figure 9.47
(rcu_lock.h and rcu_lock.c). In this implementation, rcu_read_lock()
acquires a global spinlock, rcu_read_unlock() releases it, and synchronize_
rcu() acquires it then immediately releases it.
Because synchronize_rcu() does not return until it has acquired (and released)
the lock, it cannot return until all prior RCU read-side critical sections have completed,
thus faithfully implementing RCU semantics. Of course, only one RCU reader may
be in its read-side critical section at a time, which almost entirely defeats the purpose
of RCU. In addition, the lock operations in rcu_read_lock() and rcu_read_
unlock() are extremely heavyweight, with read-side overhead ranging from about
100 nanoseconds on a single Power5 CPU up to more than 17 microseconds on a
64-CPU system. Worse yet, these same lock operations permit rcu_read_lock()
to participate in deadlock cycles. Furthermore, in the absence of recursive locks, RCU
read-side critical sections cannot be nested, and, finally, although concurrent RCU
updates could in principle be satisfied by a common grace period, this implementation
serializes grace periods, preventing grace-period sharing.
Quick Quiz 9.49: Why wouldn’t any deadlock in the RCU implementation in
Figure 9.47 also be a deadlock in any other RCU implementation?
Quick Quiz 9.50: Why not simply use reader-writer locks in the RCU implementa-
tion in Figure 9.47 in order to allow RCU readers to proceed in parallel?

1 static void rcu_read_lock(void)
2 {
3 spin_lock(&rcu_gp_lock);
4 }
5
6 static void rcu_read_unlock(void)
7 {
8 spin_unlock(&rcu_gp_lock);
9 }
10
11 void synchronize_rcu(void)
12 {
13 spin_lock(&rcu_gp_lock);
14 spin_unlock(&rcu_gp_lock);
15 }

Figure 9.47: Lock-Based RCU Implementation

11 However, production-quality user-level RCU implementations are available [Des09].



1 static void rcu_read_lock(void)
2 {
3 spin_lock(&__get_thread_var(rcu_gp_lock));
4 }
5
6 static void rcu_read_unlock(void)
7 {
8 spin_unlock(&__get_thread_var(rcu_gp_lock));
9 }
10
11 void synchronize_rcu(void)
12 {
13 int t;
14
15 for_each_running_thread(t) {
16 spin_lock(&per_thread(rcu_gp_lock, t));
17 spin_unlock(&per_thread(rcu_gp_lock, t));
18 }
19 }

Figure 9.48: Per-Thread Lock-Based RCU Implementation

It is hard to imagine this implementation being useful in a production setting, though
it does have the virtue of being implementable in almost any user-level application.
Furthermore, similar implementations having one lock per CPU or using reader-writer
locks have been used in production in the 2.4 Linux kernel.
A modified version of this one-lock-per-CPU approach, but instead using one lock
per thread, is described in the next section.

9.5.5.2 Per-Thread Lock-Based RCU

Figure 9.48 (rcu_lock_percpu.h and rcu_lock_percpu.c) shows an imple-
mentation based on one lock per thread. The rcu_read_lock() and rcu_read_
unlock() functions acquire and release, respectively, the current thread’s lock. The
synchronize_rcu() function acquires and releases each thread’s lock in turn.
Therefore, all RCU read-side critical sections running when synchronize_rcu()
starts must have completed before synchronize_rcu() can return.
This implementation does have the virtue of permitting concurrent RCU readers, and
does avoid the deadlock condition that can arise with a single global lock. Furthermore,
the read-side overhead, though high at roughly 140 nanoseconds, does not increase
with the number of CPUs. However, the update-side overhead
ranges from about 600 nanoseconds on a single Power5 CPU up to more than 100
microseconds on 64 CPUs.
Quick Quiz 9.51: Wouldn’t it be cleaner to acquire all the locks, and then release
them all in the loop from lines 15-18 of Figure 9.48? After all, with this change, there
would be a point in time when there were no readers, simplifying things greatly.
Quick Quiz 9.52: Is the implementation shown in Figure 9.48 free from deadlocks?
Why or why not?
Quick Quiz 9.53: Isn’t one advantage of the RCU algorithm shown in Figure 9.48
that it uses only primitives that are widely available, for example, in POSIX pthreads?
This approach could be useful in some situations, given that a similar approach was
used in the Linux 2.4 kernel [MM00].
The counter-based RCU implementation described next overcomes some of the
shortcomings of the lock-based implementation.

1 atomic_t rcu_refcnt;
2
3 static void rcu_read_lock(void)
4 {
5 atomic_inc(&rcu_refcnt);
6 smp_mb();
7 }
8
9 static void rcu_read_unlock(void)
10 {
11 smp_mb();
12 atomic_dec(&rcu_refcnt);
13 }
14
15 void synchronize_rcu(void)
16 {
17 smp_mb();
18 while (atomic_read(&rcu_refcnt) != 0) {
19 poll(NULL, 0, 10);
20 }
21 smp_mb();
22 }

Figure 9.49: RCU Implementation Using Single Global Reference Counter

9.5.5.3 Simple Counter-Based RCU


A slightly more sophisticated RCU implementation is shown in Figure 9.49 (rcu_
rcg.h and rcu_rcg.c). This implementation makes use of a global reference
counter rcu_refcnt defined on line 1. The rcu_read_lock() primitive atomi-
cally increments this counter, then executes a memory barrier to ensure that the RCU
read-side critical section is ordered after the atomic increment. Similarly, rcu_read_
unlock() executes a memory barrier to confine the RCU read-side critical section,
then atomically decrements the counter. The synchronize_rcu() primitive spins
waiting for the reference counter to reach zero, surrounded by memory barriers. The
poll() on line 19 merely provides pure delay, and from a pure RCU-semantics point
of view could be omitted. Again, once synchronize_rcu() returns, all prior RCU
read-side critical sections are guaranteed to have completed.
In happy contrast to the lock-based implementation shown in Section 9.5.5.1, this
implementation allows parallel execution of RCU read-side critical sections. In happy
contrast to the per-thread lock-based implementation shown in Section 9.5.5.2, it also
allows them to be nested. In addition, the rcu_read_lock() primitive cannot
possibly participate in deadlock cycles, as it never spins nor blocks.
Quick Quiz 9.54: But what if you hold a lock across a call to synchronize_
rcu(), and then acquire that same lock within an RCU read-side critical section?

However, this implementation still has some serious shortcomings. First, the
atomic operations in rcu_read_lock() and rcu_read_unlock() are still quite
heavyweight, with read-side overhead ranging from about 100 nanoseconds on a single
Power5 CPU up to almost 40 microseconds on a 64-CPU system. This means that
the RCU read-side critical sections have to be extremely long in order to get any real
read-side parallelism. On the other hand, in the absence of readers, grace periods elapse
in about 40 nanoseconds, many orders of magnitude faster than production-quality
implementations in the Linux kernel.
Quick Quiz 9.55: How can the grace period possibly elapse in 40 nanoseconds
when synchronize_rcu() contains a 10-millisecond delay?
Second, if there are many concurrent rcu_read_lock() and rcu_read_

1 DEFINE_SPINLOCK(rcu_gp_lock);
2 atomic_t rcu_refcnt[2];
3 atomic_t rcu_idx;
4 DEFINE_PER_THREAD(int, rcu_nesting);
5 DEFINE_PER_THREAD(int, rcu_read_idx);

Figure 9.50: RCU Global Reference-Count Pair Data


1 static void rcu_read_lock(void)
2 {
3 int i;
4 int n;
5
6 n = __get_thread_var(rcu_nesting);
7 if (n == 0) {
8 i = atomic_read(&rcu_idx);
9 __get_thread_var(rcu_read_idx) = i;
10 atomic_inc(&rcu_refcnt[i]);
11 }
12 __get_thread_var(rcu_nesting) = n + 1;
13 smp_mb();
14 }
15
16 static void rcu_read_unlock(void)
17 {
18 int i;
19 int n;
20
21 smp_mb();
22 n = __get_thread_var(rcu_nesting);
23 if (n == 1) {
24 i = __get_thread_var(rcu_read_idx);
25 atomic_dec(&rcu_refcnt[i]);
26 }
27 __get_thread_var(rcu_nesting) = n - 1;
28 }

Figure 9.51: RCU Read-Side Using Global Reference-Count Pair

unlock() operations, there will be extreme memory contention on rcu_refcnt,
resulting in expensive cache misses. Both of these first two shortcomings largely defeat
a major purpose of RCU, namely to provide low-overhead read-side synchronization
primitives.
Finally, a large number of RCU readers with long read-side critical sections could
prevent synchronize_rcu() from ever completing, as the global counter might
never reach zero. This could result in starvation of RCU updates, which is of course
unacceptable in production settings.
Quick Quiz 9.56: Why not simply make rcu_read_lock() wait when a con-
current synchronize_rcu() has been waiting too long in the RCU implementation
in Figure 9.49? Wouldn’t that prevent synchronize_rcu() from starving?
Therefore, it is still hard to imagine this implementation being useful in a production
setting, though it has a bit more potential than the lock-based mechanism, for example,
as an RCU implementation suitable for a high-stress debugging environment. The next
section describes a variation on the reference-counting scheme that is more favorable to
writers.

9.5.5.4 Starvation-Free Counter-Based RCU


Figure 9.51 (rcu_rcgp.h) shows the read-side primitives of an RCU implementation
that uses a pair of reference counters (rcu_refcnt[]), along with a global index
that selects one counter out of the pair (rcu_idx), a per-thread nesting counter rcu_
nesting, a per-thread snapshot of the global index (rcu_read_idx), and a global
lock (rcu_gp_lock), which are themselves shown in Figure 9.50.

Design It is the two-element rcu_refcnt[] array that provides the freedom from
starvation. The key point is that synchronize_rcu() is only required to wait for
pre-existing readers. If a new reader starts after a given instance of synchronize_
rcu() has already begun execution, then that instance of synchronize_rcu()
need not wait on that new reader. At any given time, when a given reader enters its RCU
read-side critical section via rcu_read_lock(), it increments the element of the
rcu_refcnt[] array indicated by the rcu_idx variable. When that same reader
exits its RCU read-side critical section via rcu_read_unlock(), it decrements
whichever element it incremented, ignoring any possible subsequent changes to the
rcu_idx value.
This arrangement means that synchronize_rcu() can avoid starvation by
complementing the value of rcu_idx, as in rcu_idx = !rcu_idx. Suppose that
the old value of rcu_idx was zero, so that the new value is one. New readers that arrive
after the complement operation will increment rcu_refcnt[1], while the old readers that
previously incremented rcu_refcnt[0] will decrement rcu_refcnt[0] when they exit
their RCU read-side critical sections. This means that the value of rcu_refcnt[0] will
no longer be incremented, and thus will be monotonically decreasing.12 This means that
all that synchronize_rcu() need do is wait for the value of rcu_refcnt[0] to
reach zero.
With the background, we are ready to look at the implementation of the actual
primitives.

Implementation The rcu_read_lock() primitive atomically increments the mem-
ber of the rcu_refcnt[] pair indexed by rcu_idx, and keeps a snapshot of this
index in the per-thread variable rcu_read_idx. The rcu_read_unlock() prim-
itive then atomically decrements whichever counter of the pair that the corresponding
rcu_read_lock() incremented. However, because only one value of rcu_idx is
remembered per thread, additional measures must be taken to permit nesting. These
additional measures use the per-thread rcu_nesting variable to track nesting.
To make all this work, line 6 of rcu_read_lock() in Figure 9.51 picks up
the current thread’s instance of rcu_nesting, and if line 7 finds that this is the
outermost rcu_read_lock(), then lines 8-10 pick up the current value of rcu_
idx, save it in this thread’s instance of rcu_read_idx, and atomically increment the
selected element of rcu_refcnt. Regardless of the value of rcu_nesting, line 12
increments it. Line 13 executes a memory barrier to ensure that the RCU read-side
critical section does not bleed out before the rcu_read_lock() code.
Similarly, the rcu_read_unlock() function executes a memory barrier at
line 21 to ensure that the RCU read-side critical section does not bleed out after the rcu_
read_unlock() code. Line 22 picks up this thread’s instance of rcu_nesting,
and if line 23 finds that this is the outermost rcu_read_unlock(), then lines 24 and
25 pick up this thread’s instance of rcu_read_idx (saved by the outermost rcu_
read_lock()) and atomically decrements the selected element of rcu_refcnt.
Regardless of the nesting level, line 27 decrements this thread’s instance of rcu_
nesting.
12 There is a race condition that this “monotonically decreasing” statement ignores. This race condition
will be dealt with by the code for synchronize_rcu(). In the meantime, I suggest suspending disbelief.

1 void synchronize_rcu(void)
2 {
3 int i;
4
5 smp_mb();
6 spin_lock(&rcu_gp_lock);
7 i = atomic_read(&rcu_idx);
8 atomic_set(&rcu_idx, !i);
9 smp_mb();
10 while (atomic_read(&rcu_refcnt[i]) != 0) {
11 poll(NULL, 0, 10);
12 }
13 smp_mb();
14 atomic_set(&rcu_idx, i);
15 smp_mb();
16 while (atomic_read(&rcu_refcnt[!i]) != 0) {
17 poll(NULL, 0, 10);
18 }
19 spin_unlock(&rcu_gp_lock);
20 smp_mb();
21 }

Figure 9.52: RCU Update Using Global Reference-Count Pair

Figure 9.52 (rcu_rcpg.c) shows the corresponding synchronize_rcu()
implementation. Lines 6 and 19 acquire and release rcu_gp_lock in order to prevent
more than one concurrent instance of synchronize_rcu(). Lines 7-8 pick up the
value of rcu_idx and complement it, respectively, so that subsequent instances of
rcu_read_lock() will use a different element of rcu_refcnt[] than did preceding
instances. Lines 10-12 then wait for the prior element of rcu_refcnt[] to reach zero, with
the memory barrier on line 9 ensuring that the check of rcu_refcnt is not reordered to
precede the complementing of rcu_idx. Lines 13-18 repeat this process, and line 20
ensures that any subsequent reclamation operations are not reordered to precede the
checking of rcu_refcnt.
Quick Quiz 9.57: Why the memory barrier on line 5 of synchronize_rcu()
in Figure 9.52 given that there is a spin-lock acquisition immediately after?
Quick Quiz 9.58: Why is the counter flipped twice in Figure 9.52? Shouldn’t a
single flip-and-wait cycle be sufficient?
This implementation avoids the update-starvation issues that could occur in the
single-counter implementation shown in Figure 9.49.

Discussion There are still some serious shortcomings. First, the atomic operations
in rcu_read_lock() and rcu_read_unlock() are still quite heavyweight. In
fact, they are more complex than those of the single-counter variant shown in Figure 9.49,
with the read-side primitives consuming about 150 nanoseconds on a single Power5 CPU
and almost 40 microseconds on a 64-CPU system. The update-side synchronize_
rcu() primitive is more costly as well, ranging from about 200 nanoseconds on a
single Power5 CPU to more than 40 microseconds on a 64-CPU system. This means
that the RCU read-side critical sections have to be extremely long in order to get any
real read-side parallelism.
Second, if there are many concurrent rcu_read_lock() and rcu_read_
unlock() operations, there will be extreme memory contention on the rcu_refcnt
elements, resulting in expensive cache misses. This further extends the RCU read-side
critical-section duration required to provide parallel read-side access. These first two
shortcomings defeat the purpose of RCU in most situations.
Third, the need to flip rcu_idx twice imposes substantial overhead on updates,

1 DEFINE_SPINLOCK(rcu_gp_lock);
2 DEFINE_PER_THREAD(int [2], rcu_refcnt);
3 atomic_t rcu_idx;
4 DEFINE_PER_THREAD(int, rcu_nesting);
5 DEFINE_PER_THREAD(int, rcu_read_idx);

Figure 9.53: RCU Per-Thread Reference-Count Pair Data

1 static void rcu_read_lock(void)
2 {
3 int i;
4 int n;
5
6 n = __get_thread_var(rcu_nesting);
7 if (n == 0) {
8 i = atomic_read(&rcu_idx);
9 __get_thread_var(rcu_read_idx) = i;
10 __get_thread_var(rcu_refcnt)[i]++;
11 }
12 __get_thread_var(rcu_nesting) = n + 1;
13 smp_mb();
14 }
15
16 static void rcu_read_unlock(void)
17 {
18 int i;
19 int n;
20
21 smp_mb();
22 n = __get_thread_var(rcu_nesting);
23 if (n == 1) {
24 i = __get_thread_var(rcu_read_idx);
25 __get_thread_var(rcu_refcnt)[i]--;
26 }
27 __get_thread_var(rcu_nesting) = n - 1;
28 }

Figure 9.54: RCU Read-Side Using Per-Thread Reference-Count Pair

especially if there are large numbers of threads.


Finally, despite the fact that concurrent RCU updates could in principle be satisfied
by a common grace period, this implementation serializes grace periods, preventing
grace-period sharing.
Quick Quiz 9.59: Given that atomic increment and decrement are so expensive,
why not just use non-atomic increment on line 10 and a non-atomic decrement on line 25
of Figure 9.51?
Despite these shortcomings, one could imagine this variant of RCU being used on
small tightly coupled multiprocessors, perhaps as a memory-conserving implementation
that maintains API compatibility with more complex implementations. However, it
would not likely scale well beyond a few CPUs.
The next section describes yet another variation on the reference-counting scheme
that provides greatly improved read-side performance and scalability.

9.5.5.5 Scalable Counter-Based RCU

Figure 9.54 (rcu_rcpl.h) shows the read-side primitives of an RCU implementation
that uses per-thread pairs of reference counters. This implementation is quite similar
to that shown in Figure 9.51, the only difference being that rcu_refcnt is now a
per-thread array (as shown in Figure 9.53). As with the algorithm in the previous section,
use of this two-element array prevents readers from starving updaters. One benefit of

1 static void flip_counter_and_wait(int i)
2 {
3 int t;
4
5 atomic_set(&rcu_idx, !i);
6 smp_mb();
7 for_each_thread(t) {
8 while (per_thread(rcu_refcnt, t)[i] != 0) {
9 poll(NULL, 0, 10);
10 }
11 }
12 smp_mb();
13 }
14
15 void synchronize_rcu(void)
16 {
17 int i;
18
19 smp_mb();
20 spin_lock(&rcu_gp_lock);
21 i = atomic_read(&rcu_idx);
22 flip_counter_and_wait(i);
23 flip_counter_and_wait(!i);
24 spin_unlock(&rcu_gp_lock);
25 smp_mb();
26 }

Figure 9.55: RCU Update Using Per-Thread Reference-Count Pair

per-thread rcu_refcnt[] array is that the rcu_read_lock() and rcu_read_
unlock() primitives no longer perform atomic operations.
Quick Quiz 9.60: Come off it! We can see the atomic_read() primitive in
rcu_read_lock()!!! So why are you trying to pretend that rcu_read_lock()
contains no atomic operations???
Figure 9.55 (rcu_rcpl.c) shows the implementation of synchronize_rcu(),
along with a helper function named flip_counter_and_wait(). The synchronize_
rcu() function resembles that shown in Figure 9.52, except that the repeated counter
flip is replaced by a pair of calls on lines 22 and 23 to the new helper function.
The new flip_counter_and_wait() function updates the rcu_idx vari-
able on line 5, executes a memory barrier on line 6, then lines 7-11 spin on each thread’s
prior rcu_refcnt element, waiting for it to go to zero. Once all such elements have
gone to zero, it executes another memory barrier on line 12 and returns.
This RCU implementation imposes important new requirements on its software
environment, namely, (1) that it be possible to declare per-thread variables, (2) that
these per-thread variables be accessible from other threads, and (3) that it is possible to
enumerate all threads. These requirements can be met in almost all software environ-
ments, but often result in fixed upper bounds on the number of threads. More-complex
implementations might avoid such bounds, for example, by using expandable hash
tables. Such implementations might dynamically track threads, for example, by adding
them on their first call to rcu_read_lock().
Quick Quiz 9.61: Great, if we have N threads, we can have 2N ten-millisecond
waits (one set per flip_counter_and_wait() invocation), and even that assumes
that we wait only once for each thread. Don't we need the grace period to complete
much more quickly?
This implementation still has several shortcomings. First, the need to flip rcu_idx
twice imposes substantial overhead on updates, especially if there are large numbers of
threads.
Second, synchronize_rcu() must now examine a number of variables that

1 DEFINE_SPINLOCK(rcu_gp_lock);
2 DEFINE_PER_THREAD(int [2], rcu_refcnt);
3 long rcu_idx;
4 DEFINE_PER_THREAD(int, rcu_nesting);
5 DEFINE_PER_THREAD(int, rcu_read_idx);

Figure 9.56: RCU Read-Side Using Per-Thread Reference-Count Pair and Shared
Update Data
1 static void rcu_read_lock(void)
2 {
3 int i;
4 int n;
5
6 n = __get_thread_var(rcu_nesting);
7 if (n == 0) {
8 i = ACCESS_ONCE(rcu_idx) & 0x1;
9 __get_thread_var(rcu_read_idx) = i;
10 __get_thread_var(rcu_refcnt)[i]++;
11 }
12 __get_thread_var(rcu_nesting) = n + 1;
13 smp_mb();
14 }
15
16 static void rcu_read_unlock(void)
17 {
18 int i;
19 int n;
20
21 smp_mb();
22 n = __get_thread_var(rcu_nesting);
23 if (n == 1) {
24 i = __get_thread_var(rcu_read_idx);
25 __get_thread_var(rcu_refcnt)[i]--;
26 }
27 __get_thread_var(rcu_nesting) = n - 1;
28 }

Figure 9.57: RCU Read-Side Using Per-Thread Reference-Count Pair and Shared
Update

increases linearly with the number of threads, imposing substantial overhead on applica-
tions with large numbers of threads.
Third, as before, although concurrent RCU updates could in principle be satisfied
by a common grace period, this implementation serializes grace periods, preventing
grace-period sharing.
Finally, as noted in the text, the need for per-thread variables and for enumerating
threads may be problematic in some software environments.
That said, the read-side primitives scale very nicely, requiring about 115 nanoseconds
regardless of whether running on a single-CPU or a 64-CPU Power5 system. As noted
above, the synchronize_rcu() primitive does not scale, ranging in overhead from
almost a microsecond on a single Power5 CPU up to almost 200 microseconds on a
64-CPU system. This implementation could conceivably form the basis for a production-
quality user-level RCU implementation.
The next section describes an algorithm permitting more efficient concurrent RCU
updates.

9.5.5.6 Scalable Counter-Based RCU With Shared Grace Periods


Figure 9.57 (rcu_rcpls.h) shows the read-side primitives for an RCU implementa-
tion using per-thread reference count pairs, as before, but permitting updates to share

1 static void flip_counter_and_wait(int ctr)
2 {
3 int i;
4 int t;
5
6 ACCESS_ONCE(rcu_idx) = ctr + 1;
7 i = ctr & 0x1;
8 smp_mb();
9 for_each_thread(t) {
10 while (per_thread(rcu_refcnt, t)[i] != 0) {
11 poll(NULL, 0, 10);
12 }
13 }
14 smp_mb();
15 }
16
17 void synchronize_rcu(void)
18 {
19 int ctr;
20 int oldctr;
21
22 smp_mb();
23 oldctr = ACCESS_ONCE(rcu_idx);
24 smp_mb();
25 spin_lock(&rcu_gp_lock);
26 ctr = ACCESS_ONCE(rcu_idx);
27 if (ctr - oldctr >= 3) {
28 spin_unlock(&rcu_gp_lock);
29 smp_mb();
30 return;
31 }
32 flip_counter_and_wait(ctr);
33 if (ctr - oldctr < 2)
34 flip_counter_and_wait(ctr + 1);
35 spin_unlock(&rcu_gp_lock);
36 smp_mb();
37 }

Figure 9.58: RCU Shared Update Using Per-Thread Reference-Count Pair

grace periods. The main difference from the earlier implementation shown in Fig-
ure 9.54 is that rcu_idx is now a long that counts freely, so that line 8 of Figure 9.57
must mask it down to its bottom bit. We also switched from using atomic_read() and
atomic_set() to using ACCESS_ONCE(). The data is also quite similar, as shown
in Figure 9.56, with rcu_idx now being a long instead of an atomic_t.
Figure 9.58 (rcu_rcpls.c) shows the implementation of synchronize_rcu()
and its helper function flip_counter_and_wait(). These are similar to those in
Figure 9.55. The differences in flip_counter_and_wait() include:
1. Line 6 uses ACCESS_ONCE() instead of atomic_set(), and increments
rather than complementing.
2. A new line 7 masks the counter down to its bottom bit.
The changes to synchronize_rcu() are more pervasive:
1. There is a new oldctr local variable that captures the pre-lock-acquisition value
of rcu_idx on line 23.
2. Line 26 uses ACCESS_ONCE() instead of atomic_read().
3. Lines 27-30 check to see if at least three counter flips were performed by other
threads while the lock was being acquired, and, if so, releases the lock, does
a memory barrier, and returns. In this case, there were two full waits for the
counters to go to zero, so those other threads already did all the required work.

1 DEFINE_SPINLOCK(rcu_gp_lock);
2 long rcu_gp_ctr = 0;
3 DEFINE_PER_THREAD(long, rcu_reader_gp);
4 DEFINE_PER_THREAD(long, rcu_reader_gp_snap);

Figure 9.59: Data for Free-Running Counter Using RCU

4. At lines 33-34, flip_counter_and_wait() is only invoked a second time
if there were fewer than two counter flips while the lock was being acquired. On
the other hand, if there were two counter flips, some other thread did one full wait
for all the counters to go to zero, so only one more is required.

With this approach, if an arbitrarily large number of threads invoke synchronize_
rcu() concurrently, with one CPU for each thread, there will be a total of only three
waits for counters to go to zero.
Despite the improvements, this implementation of RCU still has a few shortcomings.
First, as before, the need to flip rcu_idx twice imposes substantial overhead on
updates, especially if there are large numbers of threads.
Second, each updater still acquires rcu_gp_lock, even if there is no work to be
done. This can result in a severe scalability limitation if there are large numbers of
concurrent updates. There are ways of avoiding this, as was done in a production-quality
real-time implementation of RCU for the Linux kernel [McK07a].
Third, this implementation requires per-thread variables and the ability to enumerate
threads, which again can be problematic in some software environments.
Finally, on 32-bit machines, a given update thread might be preempted long enough
for the rcu_idx counter to overflow. This could cause such a thread to force an
unnecessary pair of counter flips. However, even if each grace period took only one
microsecond, the offending thread would need to be preempted for more than an hour,
in which case an extra pair of counter flips is likely the least of your worries.
As with the implementation described in Section 9.5.5.5, the read-side primitives
scale extremely well, incurring roughly 115 nanoseconds of overhead regardless of the
number of CPUs. The synchronize_rcu() primitive is still expensive, ranging
from about one microsecond up to about 16 microseconds. This is nevertheless much
cheaper than the roughly 200 microseconds incurred by the implementation in Sec-
tion 9.5.5.5. So, despite its shortcomings, one could imagine this RCU implementation
being used in production in real-life applications.
Quick Quiz 9.62: All of these toy RCU implementations have either atomic op-
erations in rcu_read_lock() and rcu_read_unlock(), or synchronize_
rcu() overhead that increases linearly with the number of threads. Under what
circumstances could an RCU implementation enjoy light-weight implementations for
all three of these primitives, all having deterministic (O(1)) overheads and latencies?
Referring back to Figure 9.57, we see that there is one global-variable access and
no fewer than four accesses to thread-local variables. Given the relatively high cost
of thread-local accesses on systems implementing POSIX threads, it is tempting to
collapse the three thread-local variables into a single structure, permitting rcu_read_
lock() and rcu_read_unlock() to access their thread-local data with a single
thread-local-storage access. However, an even better approach would be to reduce the
number of thread-local accesses to one, as is done in the next section.

1 static void rcu_read_lock(void)
2 {
3 __get_thread_var(rcu_reader_gp) =
4 ACCESS_ONCE(rcu_gp_ctr) + 1;
5 smp_mb();
6 }
7
8 static void rcu_read_unlock(void)
9 {
10 smp_mb();
11 __get_thread_var(rcu_reader_gp) =
12 ACCESS_ONCE(rcu_gp_ctr);
13 }
14
15 void synchronize_rcu(void)
16 {
17 int t;
18
19 smp_mb();
20 spin_lock(&rcu_gp_lock);
21 ACCESS_ONCE(rcu_gp_ctr) += 2;
22 smp_mb();
23 for_each_thread(t) {
24 while ((per_thread(rcu_reader_gp, t) & 0x1) &&
25 ((per_thread(rcu_reader_gp, t) -
26 ACCESS_ONCE(rcu_gp_ctr)) < 0)) {
27 poll(NULL, 0, 10);
28 }
29 }
30 spin_unlock(&rcu_gp_lock);
31 smp_mb();
32 }

Figure 9.60: Free-Running Counter Using RCU

9.5.5.7 RCU Based on Free-Running Counter

Figure 9.60 (rcu.h and rcu.c) shows an RCU implementation based on a single
global free-running counter that takes on only even-numbered values, with data shown in
Figure 9.59. The resulting rcu_read_lock() implementation is extremely straight-
forward. Lines 3 and 4 simply add one to the value of the global free-running rcu_gp_ctr
variable and store the resulting odd-numbered value into the rcu_reader_gp per-thread
variable. Line 5 executes a memory barrier to prevent the content of the subsequent
RCU read-side critical section from “leaking out”.
The rcu_read_unlock() implementation is similar. Line 10 executes a mem-
ory barrier, again to prevent the prior RCU read-side critical section from “leaking out”.
Lines 11 and 12 then copy the rcu_gp_ctr global variable to the rcu_reader_gp
per-thread variable, leaving this per-thread variable with an even-numbered value so
that a concurrent instance of synchronize_rcu() will know to ignore it.
Quick Quiz 9.63: If any even value is sufficient to tell synchronize_rcu() to
ignore a given task, why don't lines 11 and 12 of Figure 9.60 simply assign zero to
rcu_reader_gp?
Thus, synchronize_rcu() could wait for all of the per-thread rcu_reader_
gp variables to take on even-numbered values. However, it is possible to do much better
than that because synchronize_rcu() need only wait on pre-existing RCU read-
side critical sections. Line 19 executes a memory barrier to prevent prior manipulations
of RCU-protected data structures from being reordered (by either the CPU or the
compiler) to follow the increment on line 21. Line 20 acquires the rcu_gp_lock
(and line 30 releases it) in order to prevent multiple synchronize_rcu() instances
from running concurrently. Line 21 then increments the global rcu_gp_ctr variable

1 DEFINE_SPINLOCK(rcu_gp_lock);
2 #define RCU_GP_CTR_SHIFT 7
3 #define RCU_GP_CTR_BOTTOM_BIT (1 << RCU_GP_CTR_SHIFT)
4 #define RCU_GP_CTR_NEST_MASK (RCU_GP_CTR_BOTTOM_BIT - 1)
5 long rcu_gp_ctr = 0;
6 DEFINE_PER_THREAD(long, rcu_reader_gp);

Figure 9.61: Data for Nestable RCU Using a Free-Running Counter

by two, so that all pre-existing RCU read-side critical sections will have corresponding
per-thread rcu_reader_gp variables with values less than that of rcu_gp_ctr,
modulo the machine’s word size. Recall also that threads with even-numbered values
of rcu_reader_gp are not in an RCU read-side critical section, so that lines 23-29
scan the rcu_reader_gp values until they all are either even (line 24) or are greater
than the global rcu_gp_ctr (lines 25-26). Line 27 blocks for a short period of time
to wait for a pre-existing RCU read-side critical section, but this can be replaced with a
spin-loop if grace-period latency is of the essence. Finally, the memory barrier at line 31
ensures that any subsequent destruction will not be reordered into the preceding loop.
Quick Quiz 9.64: Why are the memory barriers on lines 19 and 31 of Figure 9.60
needed? Aren’t the memory barriers inherent in the locking primitives on lines 20
and 30 sufficient?
This approach achieves much better read-side performance, incurring roughly
63 nanoseconds of overhead regardless of the number of Power5 CPUs. Updates
incur more overhead, ranging from about 500 nanoseconds on a single Power5 CPU to
more than 100 microseconds on 64 such CPUs.
Quick Quiz 9.65: Couldn’t the update-side batching optimization described in
Section 9.5.5.6 be applied to the implementation shown in Figure 9.60?
This implementation suffers from some serious shortcomings in addition to the high
update-side overhead noted earlier. First, it is no longer permissible to nest RCU read-
side critical sections, a topic that is taken up in the next section. Second, if a reader is
preempted at line 3 of Figure 9.60 after fetching from rcu_gp_ctr but before storing
to rcu_reader_gp, and if the rcu_gp_ctr counter then runs through more than
half but less than all of its possible values, then synchronize_rcu() will ignore
the subsequent RCU read-side critical section. Third and finally, this implementation
requires that the enclosing software environment be able to enumerate threads and
maintain per-thread variables.
Quick Quiz 9.66: Is the possibility of readers being preempted in lines 3-4 of
Figure 9.60 a real problem, in other words, is there a real sequence of events that could
lead to failure? If not, why not? If so, what is the sequence of events, and how can the
failure be addressed?

9.5.5.8 Nestable RCU Based on Free-Running Counter


Figure 9.62 (rcu_nest.h and rcu_nest.c) shows an RCU implementation based
on a single global free-running counter, but that permits nesting of RCU read-side
critical sections. This nestability is accomplished by reserving the low-order bits of the
global rcu_gp_ctr to count nesting, using the definitions shown in Figure 9.61. This
is a generalization of the scheme in Section 9.5.5.7, which can be thought of as having a
single low-order bit reserved for counting nesting depth. Two C-preprocessor macros are
used to arrange this, RCU_GP_CTR_NEST_MASK and RCU_GP_CTR_BOTTOM_BIT.
These are related: RCU_GP_CTR_NEST_MASK=RCU_GP_CTR_BOTTOM_BIT-1.

1 static void rcu_read_lock(void)
2 {
3 long tmp;
4 long *rrgp;
5
6 rrgp = &__get_thread_var(rcu_reader_gp);
7 tmp = *rrgp;
8 if ((tmp & RCU_GP_CTR_NEST_MASK) == 0)
9 tmp = ACCESS_ONCE(rcu_gp_ctr);
10 tmp++;
11 *rrgp = tmp;
12 smp_mb();
13 }
14
15 static void rcu_read_unlock(void)
16 {
17 long tmp;
18
19 smp_mb();
20 __get_thread_var(rcu_reader_gp)--;
21 }
22
23 void synchronize_rcu(void)
24 {
25 int t;
26
27 smp_mb();
28 spin_lock(&rcu_gp_lock);
29 ACCESS_ONCE(rcu_gp_ctr) +=
30 RCU_GP_CTR_BOTTOM_BIT;
31 smp_mb();
32 for_each_thread(t) {
33 while (rcu_gp_ongoing(t) &&
34 ((per_thread(rcu_reader_gp, t) -
35 rcu_gp_ctr) < 0)) {
36 poll(NULL, 0, 10);
37 }
38 }
39 spin_unlock(&rcu_gp_lock);
40 smp_mb();
41 }

Figure 9.62: Nestable RCU Using a Free-Running Counter

The RCU_GP_CTR_BOTTOM_BIT macro contains a single bit that is positioned just
above the bits reserved for counting nesting, and the RCU_GP_CTR_NEST_MASK has
all one bits covering the region of rcu_gp_ctr used to count nesting. Obviously,
these two C-preprocessor macros must reserve enough of the low-order bits of the
counter to permit the maximum required nesting of RCU read-side critical sections, and
this implementation reserves seven bits, for a maximum RCU read-side critical-section
nesting depth of 127, which should be well in excess of that needed by most applications.
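For example (a worked illustration only, using the definitions in Figure 9.61),
RCU_GP_CTR_BOTTOM_BIT is 1<<7, or 0x80, so RCU_GP_CTR_NEST_MASK is 0x7f:

  long snap    = 0x183;                            /* Hypothetical rcu_reader_gp snapshot. */
  long nesting = snap & RCU_GP_CTR_NEST_MASK;      /* == 3: three nested read-side critical sections. */
  long gp_bits = snap & ~RCU_GP_CTR_NEST_MASK;     /* == 0x180: the grace-period-counter portion. */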
The resulting rcu_read_lock() implementation is still reasonably straightfor-
ward. Line 6 places a pointer to this thread’s instance of rcu_reader_gp into the
local variable rrgp, minimizing the number of expensive calls to the pthreads thread-
local-state API. Line 7 records the current value of rcu_reader_gp into another
local variable tmp, and line 8 checks to see if the low-order bits are zero, which would
indicate that this is the outermost rcu_read_lock(). If so, line 9 places the global
rcu_gp_ctr into tmp because the current value previously fetched by line 7 is likely
to be obsolete. In either case, line 10 increments the nesting depth, which you will recall
is stored in the seven low-order bits of the counter. Line 11 stores the updated counter
back into this thread’s instance of rcu_reader_gp, and, finally, line 12 executes a
memory barrier to prevent the RCU read-side critical section from bleeding out into the
code preceding the call to rcu_read_lock().

1 DEFINE_SPINLOCK(rcu_gp_lock);
2 long rcu_gp_ctr = 0;
3 DEFINE_PER_THREAD(long, rcu_reader_qs_gp);

Figure 9.63: Data for Quiescent-State-Based RCU

In other words, this implementation of rcu_read_lock() picks up a copy of the
global rcu_gp_ctr unless the current invocation of rcu_read_lock() is nested
within an RCU read-side critical section, in which case it instead fetches the contents of
the current thread’s instance of rcu_reader_gp. Either way, it increments whatever
value it fetched in order to record an additional nesting level, and stores the result in the
current thread’s instance of rcu_reader_gp.
Interestingly enough, despite their rcu_read_lock() differences, the implemen-
tation of rcu_read_unlock() is broadly similar to that shown in Section 9.5.5.7.
Line 19 executes a memory barrier in order to prevent the RCU read-side critical section
from bleeding out into code following the call to rcu_read_unlock(), and line 20
decrements this thread’s instance of rcu_reader_gp, which has the effect of decre-
menting the nesting count contained in rcu_reader_gp’s low-order bits. Debugging
versions of this primitive would check (before decrementing!) that these low-order bits
were non-zero.
The implementation of synchronize_rcu() is quite similar to that shown in
Section 9.5.5.7. There are two differences. The first is that lines 29 and 30 add RCU_
GP_CTR_BOTTOM_BIT to the global rcu_gp_ctr instead of adding the constant
“2”, and the second is that the comparison on line 33 has been abstracted out to a separate
function, where it checks the bit indicated by RCU_GP_CTR_BOTTOM_BIT instead
of unconditionally checking the low-order bit.
This approach achieves read-side performance almost equal to that shown in Sec-
tion 9.5.5.7, incurring roughly 65 nanoseconds of overhead regardless of the number of
Power5 CPUs. Updates again incur more overhead, ranging from about 600 nanoseconds
on a single Power5 CPU to more than 100 microseconds on 64 such CPUs.
Quick Quiz 9.67: Why not simply maintain a separate per-thread nesting-level
variable, as was done in the previous section, rather than having all this complicated bit
manipulation?
This implementation suffers from the same shortcomings as does that of Sec-
tion 9.5.5.7, except that nesting of RCU read-side critical sections is now permitted. In
addition, on 32-bit systems, this approach shortens the time required to overflow the
global rcu_gp_ctr variable. The following section shows one way to greatly increase
the time required for overflow to occur, while greatly reducing read-side overhead.
Quick Quiz 9.68: Given the algorithm shown in Figure 9.62, how could you double
the time required to overflow the global rcu_gp_ctr?
Quick Quiz 9.69: Again, given the algorithm shown in Figure 9.62, is counter
overflow fatal? Why or why not? If it is fatal, what can be done to fix it?

9.5.5.9 RCU Based on Quiescent States


Figure 9.64 (rcu_qs.h) shows the read-side primitives used to construct a user-level
implementation of RCU based on quiescent states, with the data shown in Figure 9.63.
As can be seen from lines 1-7 in the figure, the rcu_read_lock() and rcu_
read_unlock() primitives do nothing, and can in fact be expected to be inlined
and optimized away, as they are in server builds of the Linux kernel. This is due

1 static void rcu_read_lock(void)
2 {
3 }
4
5 static void rcu_read_unlock(void)
6 {
7 }
8
9 static void rcu_quiescent_state(void)
10 {
11 smp_mb();
12 __get_thread_var(rcu_reader_qs_gp) =
13 ACCESS_ONCE(rcu_gp_ctr) + 1;
14 smp_mb();
15 }
16
17 static void rcu_thread_offline(void)
18 {
19 smp_mb();
20 __get_thread_var(rcu_reader_qs_gp) =
21 ACCESS_ONCE(rcu_gp_ctr);
22 smp_mb();
23 }
24
25 static void rcu_thread_online(void)
26 {
27 rcu_quiescent_state();
28 }

Figure 9.64: Quiescent-State-Based RCU Read Side

to the fact that quiescent-state-based RCU implementations approximate the extents
of RCU read-side critical sections using the aforementioned quiescent states. Each
of these quiescent states contains a call to rcu_quiescent_state(), which is
shown from lines 9-15 in the figure. Threads entering extended quiescent states (for
example, when blocking) may instead call rcu_thread_offline() (lines 17-23)
when entering an extended quiescent state and then call rcu_thread_online()
(lines 25-28) when leaving it. As such, rcu_thread_online() is analogous to
rcu_read_lock() and rcu_thread_offline() is analogous to rcu_read_
unlock(). In addition, rcu_quiescent_state() can be thought of as a rcu_
thread_online() immediately followed by a rcu_thread_offline().13 It
is illegal to invoke rcu_quiescent_state(), rcu_thread_offline(), or
rcu_thread_online() from an RCU read-side critical section.
In rcu_quiescent_state(), line 11 executes a memory barrier to prevent
any code prior to the quiescent state (including possible RCU read-side critical sections)
from being reordered into the quiescent state. Lines 12-13 pick up a copy of the global
rcu_gp_ctr, using ACCESS_ONCE() to ensure that the compiler does not employ
any optimizations that would result in rcu_gp_ctr being fetched more than once, and
then adds one to the value fetched and stores it into the per-thread rcu_reader_qs_
gp variable, so that any concurrent instance of synchronize_rcu() will see an
odd-numbered value, thus becoming aware that a new RCU read-side critical section has
started. Instances of synchronize_rcu() that are waiting on older RCU read-side
critical sections will thus know to ignore this new one. Finally, line 14 executes a
memory barrier, which prevents subsequent code (including a possible RCU read-side

13 Although the code in the figure is consistent with rcu_quiescent_state() being the same
as rcu_thread_online() immediately followed by rcu_thread_offline(), this relationship is
obscured by performance optimizations.

1 void synchronize_rcu(void)
2 {
3 int t;
4
5 smp_mb();
6 spin_lock(&rcu_gp_lock);
7 rcu_gp_ctr += 2;
8 smp_mb();
9 for_each_thread(t) {
10 while (rcu_gp_ongoing(t) &&
11 ((per_thread(rcu_reader_qs_gp, t) -
12 rcu_gp_ctr) < 0)) {
13 poll(NULL, 0, 10);
14 }
15 }
16 spin_unlock(&rcu_gp_lock);
17 smp_mb();
18 }

Figure 9.65: RCU Update Side Using Quiescent States

critical section) from being reordered with lines 12-13.


Quick Quiz 9.70: Doesn’t the additional memory barrier shown on line 14 of
Figure 9.64 greatly increase the overhead of rcu_quiescent_state()?
Some applications might use RCU only occasionally, but use it very heavily when
they do use it. Such applications might choose to use rcu_thread_online()
when starting to use RCU and rcu_thread_offline() when no longer using
RCU. The time between a call to rcu_thread_offline() and a subsequent call to
rcu_thread_online() is an extended quiescent state, so that RCU will not expect
explicit quiescent states to be registered during this time.
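For example, a reader thread using this implementation might be structured as in the
following sketch, in which do_a_unit_of_work(), no_work_available(), and
wait_for_more_work() are placeholders for application code:

  void *qsbr_reader(void *arg)
  {
    rcu_thread_online();              /* this thread now uses RCU */
    for (;;) {
      do_a_unit_of_work();            /* may contain RCU read-side critical sections */
      rcu_quiescent_state();          /* no RCU references held at this point */
      if (no_work_available()) {
        rcu_thread_offline();         /* enter extended quiescent state */
        wait_for_more_work();         /* may block indefinitely */
        rcu_thread_online();          /* leave extended quiescent state */
      }
    }
    return NULL;                      /* not reached in this sketch */
  }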
The rcu_thread_offline() function simply sets the per-thread rcu_reader_
qs_gp variable to the current value of rcu_gp_ctr, which has an even-numbered
value. Any concurrent instances of synchronize_rcu() will thus know to ignore
this thread.
Quick Quiz 9.71: Why are the two memory barriers on lines 19 and 22 of Fig-
ure 9.64 needed?
The rcu_thread_online() function simply invokes rcu_quiescent_state(),
thus marking the end of the extended quiescent state.
Figure 9.65 (rcu_qs.c) shows the implementation of synchronize_rcu(),
which is quite similar to that of the preceding sections.
This implementation has blazingly fast read-side primitives, with an rcu_read_
lock()-rcu_read_unlock() round trip incurring an overhead of roughly 50 pi-
coseconds. The synchronize_rcu() overhead ranges from about 600 nanoseconds
on a single-CPU Power5 system up to more than 100 microseconds on a 64-CPU system.
Quick Quiz 9.72: To be sure, the clock frequencies of Power systems in 2008 were
quite high, but even a 5GHz clock frequency is insufficient to allow loops to be executed
in 50 picoseconds! What is going on here?
However, this implementation requires that each thread either invoke rcu_quiescent_
state() periodically or invoke rcu_thread_offline() for extended quies-
cent states. The need to invoke these functions periodically can make this implementa-
tion difficult to use in some situations, such as for certain types of library functions.
Quick Quiz 9.73: Why would the fact that the code is in a library make any
difference for how easy it is to use the RCU implementation shown in Figures 9.64 and
9.65?
Quick Quiz 9.74: But what if you hold a lock across a call to synchronize_
rcu(), and then acquire that same lock within an RCU read-side critical section? This
should be a deadlock, but how can a primitive that generates absolutely no code possibly
participate in a deadlock cycle?
In addition, this implementation does not permit concurrent calls to synchronize_
rcu() to share grace periods. That said, one could easily imagine a production-quality
RCU implementation based on this version of RCU.

9.5.5.10 Summary of Toy RCU Implementations


If you made it this far, congratulations! You should now have a much clearer under-
standing not only of RCU itself, but also of the requirements of enclosing software
environments and applications. Those wishing an even deeper understanding are invited
to read descriptions of production-quality RCU implementations [DMS+12, McK07a,
McK08a, McK09a].
The preceding sections listed some desirable properties of the various RCU primi-
tives. The following list is provided for easy reference for those wishing to create a new
RCU implementation.
1. There must be read-side primitives (such as rcu_read_lock() and rcu_
read_unlock()) and grace-period primitives (such as synchronize_rcu()
and call_rcu()), such that any RCU read-side critical section in existence at
the start of a grace period has completed by the end of the grace period.
2. RCU read-side primitives should have minimal overhead. In particular, expensive
operations such as cache misses, atomic instructions, memory barriers, and
branches should be avoided.
3. RCU read-side primitives should have O(1) computational complexity to enable
real-time use. (This implies that readers run concurrently with updaters.)
4. RCU read-side primitives should be usable in all contexts (in the Linux kernel,
they are permitted everywhere except in the idle loop). An important special
case is that RCU read-side primitives be usable within an RCU read-side critical
section, in other words, that it be possible to nest RCU read-side critical sections.
5. RCU read-side primitives should be unconditional, with no failure returns. This
property is extremely important, as failure checking increases complexity and
complicates testing and validation.
6. Any operation other than a quiescent state (and thus a grace period) should
be permitted in an RCU read-side critical section. In particular, irrevocable
operations such as I/O should be permitted.
7. It should be possible to update an RCU-protected data structure while executing
within an RCU read-side critical section.
8. Both RCU read-side and update-side primitives should be independent of memory
allocator design and implementation, in other words, the same RCU implementa-
tion should be able to protect a given data structure regardless of how the data
elements are allocated and freed.
9. RCU grace periods should not be blocked by threads that halt outside of RCU read-
side critical sections. (But note that most quiescent-state-based implementations
violate this desideratum.)

Property | Reference Counting | Hazard Pointers | Sequence Locks | RCU
Existence Guarantees | Complex | Yes | No | Yes
Updates and Readers Progress Concurrently | Yes | Yes | No | Yes
Contention Among Readers | High | None | None | None
Reader Per-Critical-Section Overhead | N/A | N/A | Two smp_mb() | Ranges from none to two smp_mb()
Reader Per-Object Traversal Overhead | Read-modify-write atomic operations, memory-barrier instructions, and cache misses | smp_mb() | None, but unsafe | None (volatile accesses)
Reader Forward Progress Guarantee | Lock free | Lock free | Blocking | Bounded wait free
Reader Reference Acquisition | Can fail (conditional) | Can fail (conditional) | Unsafe | Cannot fail (unconditional)
Memory Footprint | Bounded | Bounded | Bounded | Unbounded
Reclamation Forward Progress | Lock free | Lock free | N/A | Blocking
Automatic Reclamation | Yes | No | N/A | No
Lines of Code | 94 | 79 | 79 | 73

Table 9.5: Which Deferred Technique to Choose?

Quick Quiz 9.75: Given that grace periods are prohibited within RCU read-side
critical sections, how can an RCU data structure possibly be updated while in an RCU
read-side critical section?

9.5.6 RCU Exercises

This section is organized as a series of Quick Quizzes that invite you to apply RCU
to a number of examples earlier in this book. The answer to each Quick Quiz gives
some hints, and also contains a pointer to a later section where the solution is explained at
length. The rcu_read_lock(), rcu_read_unlock(), rcu_dereference(),
rcu_assign_pointer(), and synchronize_rcu() primitives should suffice
for most of these exercises.
Quick Quiz 9.76: The statistical-counter implementation shown in Figure 5.9
(count_end.c) used a global lock to guard the summation in read_count(),
which resulted in poor performance and negative scalability. How could you use RCU
to provide read_count() with excellent performance and good scalability? (Keep in
mind that read_count()’s scalability will necessarily be limited by its need to scan
all threads’ counters.)
Quick Quiz 9.77: Section 5.5 showed a fanciful pair of code fragments that dealt
with counting I/O accesses to removable devices. These code fragments suffered from
high overhead on the fastpath (starting an I/O) due to the need to acquire a reader-writer
lock. How would you use RCU to provide excellent performance and scalability? (Keep
in mind that the performance of the common-case first code fragment that does I/O
accesses is much more important than that of the device-removal code fragment.)

9.6 Which to Choose?


Table 9.5 provides some rough rules of thumb that can help you choose among the four
deferred-processing techniques presented in this chapter.
As shown in the “Existence Guarantees” row, if you need existence guarantees
for linked data elements, you must use reference counting, hazard pointers, or RCU.
Sequence locks do not provide existence guarantees, instead providing detection of
updates, retrying any read-side critical sections that do encounter an update.
Of course, as shown in the “Updates and Readers Progress Concurrently” row, this
detection of updates implies that sequence locking does not permit updaters and readers
to make forward progress concurrently. After all, preventing such forward progress is
the whole point of using sequence locking in the first place! This situation points the
way to using sequence locking in conjunction with reference counting, hazard pointers,
or RCU in order to provide both existence guarantees and update detection. In fact, the
Linux kernel combines RCU and sequence locking in this manner during pathname
lookup.
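Schematically, a reader combining the two techniques might look like the following
sketch, which assumes the read_seqbegin()/read_seqretry() primitives presented
earlier in this chapter, a seqlock_t named updates_seqlock, and a placeholder
lookup_and_use() whose work is safe to redo on retry:

  unsigned long seq;

  do {
    seq = read_seqbegin(&updates_seqlock);  /* snapshot sequence number */
    rcu_read_lock();                        /* existence guarantee for the traversal */
    lookup_and_use(key);                    /* placeholder: must tolerate being redone */
    rcu_read_unlock();
  } while (read_seqretry(&updates_seqlock, seq)); /* retry if an update intervened */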
The “Contention Among Readers”, “Reader Per-Critical-Section Overhead”, and
“Reader Per-Object Traversal Overhead” rows give a rough sense of the read-side
overhead of these techniques. The overhead of reference counting can be quite large,
with contention among readers along with a fully ordered read-modify-write atomic
operation required for each and every object traversed. Hazard pointers incur the
overhead of a memory barrier for each data element traversed, and sequence locks
incur the overhead of a pair of memory barriers for each attempt to execute the critical
section. The overhead of RCU implementations varies from nothing to that of a pair
of memory barriers for each read-side critical section, thus providing RCU with the
best performance, particularly for read-side critical sections that traverse many data
elements.
The “Reader Forward Progress Guarantee” row shows that only RCU has a bounded
wait-free forward-progress guarantee, which means that it can carry out a finite traversal
by executing a bounded number of instructions.
The “Reader Reference Acquisition” row indicates that only RCU is capable of
unconditionally acquiring references. The entry for sequence locks is “Unsafe” because,
again, sequence locks detect updates rather than acquiring references. Reference count-
ing and hazard pointers both require that traversals be restarted from the beginning if
a given acquisition fails. To see this, consider a linked list containing objects A, B, C,
and D, in that order, and the following series of events:
1. A reader acquires a reference to object B.
2. An updater removes object B, but refrains from freeing it because the reader holds
a reference. The list now contains objects A, C, and D, and object B’s ->next
pointer is set to HAZPTR_POISON.
3. The updater removes object C, so that the list now contains objects A and D.
Because there is no reference to object C, it is immediately freed.
4. The reader tries to advance to the successor of the object following the now-
removed object B, but the poisoned ->next pointer prevents this, which is a
good thing because object B’s ->next pointer would otherwise point to the
freelist.
5. The reader must therefore restart its traversal from the head of the list.

Thus, when failing to acquire a reference, a hazard-pointer or reference-counter
traversal must restart that traversal from the beginning. In the case of nested linked data
structures, for example, a tree containing linked lists, the traversal must be restarted
from the outermost data structure. This situation gives RCU a significant ease-of-use
advantage.
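Schematically, then, a hazard-pointer or reference-count traversal often takes the
shape sketched below, where try_acquire_reference() and release_references()
stand in for whatever conditional-acquisition and release primitives the implementation
provides:

  retry:
    for (p = list_head; p != NULL; p = p->next) {
      if (!try_acquire_reference(p)) {  /* element removed out from under us? */
        release_references();          /* drop anything acquired so far... */
        goto retry;                     /* ...and restart from the beginning */
      }
      do_something_with(p);
    }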
However, RCU’s ease-of-use advantage does not come for free, as can be seen in the
“Memory Footprint” row. RCU’s support of unconditional reference acquisition means
that it must avoid freeing any object reachable by a given RCU reader until that reader
completes. RCU therefore has an unbounded memory footprint, at least unless updates
are throttled. In contrast, reference counting and hazard pointers need to retain only
those data elements actually referenced by concurrent readers.
This tension between memory footprint and acquisition failures is sometimes re-
solved within the Linux kernel by combining use of RCU and reference counters. RCU
is used for short-lived references, which means that RCU read-side critical sections can
be short. These short RCU read-side critical sections in turn mean that the corresponding
RCU grace periods can also be short, which limits the memory footprint. For the few
data elements that need longer-lived references, reference counting is used. This means
that the complexity of reference-acquisition failure only needs to be dealt with for those
few data elements: The bulk of the reference acquisitions are unconditional, courtesy
of RCU. See Section 13.2 for more information on combining reference counting with
other synchronization mechanisms.
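This combination frequently takes a form similar to the following sketch, in which
lookup_foo() stands in for an RCU-protected search and try_get_ref() for an
increment-unless-zero reference-acquisition primitive:

  struct foo *acquire_long_lived_foo(int key)
  {
    struct foo *p;

    rcu_read_lock();                   /* short RCU read-side critical section */
    p = lookup_foo(key);               /* placeholder RCU-protected lookup */
    if (p != NULL && !try_get_ref(p))  /* object already on its way out? */
      p = NULL;                        /* then act as if it was not found */
    rcu_read_unlock();
    return p;                          /* caller eventually releases the reference */
  }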
The “Reclamation Forward Progress” row shows that hazard pointers can pro-
vide non-blocking updates [Mic04, HLM02]. Reference counting might or might not,
depending on the implementation. However, sequence locking cannot provide non-
blocking updates, courtesy of its update-side lock. RCU updaters must wait on readers,
which also rules out fully non-blocking updates. However, there are situations in which
the only blocking operation is a wait to free memory, which results in a situation that,
for many purposes, is as good as non-blocking [DMS+12].
As shown in the “Automatic Reclamation” row, only reference counting can auto-
mate freeing of memory, and even then only for non-cyclic data structures.
Finally, the “Lines of Code” row shows the size of the Pre-BSD Routing Table
implementations, giving a rough idea of relative ease of use. That said, it is important to
note that the reference-counting and sequence-locking implementations are buggy, and
that a correct reference-counting implementation is considerably more complex [Val95,
MS95]. For its part, a correct sequence-locking implementation requires the addition of
some other synchronization mechanism, for example, hazard pointers or RCU, so that
sequence locking detects concurrent updates and the other mechanism provides safe
reference acquisition.
As more experience is gained using these techniques, both separately and in combi-
nation, the rules of thumb laid out in this section will need to be refined. However, this
section does reflect the current state of the art.

9.7 What About Updates?


The deferred-processing techniques called out in this chapter are most directly applicable
to read-mostly situations, which begs the question “But what about updates?” After all,
increasing the performance and scalability of readers is all well and good, but it is only
natural to also want great performance and scalability for writers.

We have already seen one situation featuring high performance and scalability
for writers, namely the counting algorithms surveyed in Chapter 5. These algorithms
featured partially partitioned data structures so that updates can operate locally, while the
more-expensive reads must sum across the entire data structure. Silas Boyd-Wickizer
has generalized this notion to produce OpLog, which he has applied to Linux-kernel
pathname lookup, VM reverse mappings, and the stat() system call [BW14].
Another approach, called “Disruptor”, is designed for applications that process
high-volume streams of input data. The approach is to rely on single-producer-single-
consumer FIFO queues, minimizing the need for synchronization [Sut13]. For Java
applications, Disruptor also has the virtue of minimizing use of the garbage collector.
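Although Disruptor itself is a Java library, the heart of a single-producer/single-consumer
FIFO can be sketched in C as follows; the memory barriers are deliberately conservative,
and a production-quality version would use lighter-weight ordering:

  #define SPSC_SIZE 1024                /* must be a power of two */

  struct spsc_queue {
    void *ring[SPSC_SIZE];
    unsigned long head;                 /* advanced only by the consumer */
    unsigned long tail;                 /* advanced only by the producer */
  };

  static int spsc_enqueue(struct spsc_queue *q, void *item)  /* producer only */
  {
    if (q->tail - ACCESS_ONCE(q->head) >= SPSC_SIZE)
      return 0;                         /* queue full */
    q->ring[q->tail & (SPSC_SIZE - 1)] = item;
    smp_mb();                           /* order item store before tail update */
    ACCESS_ONCE(q->tail) = q->tail + 1;
    return 1;
  }

  static void *spsc_dequeue(struct spsc_queue *q)            /* consumer only */
  {
    void *item;

    if (ACCESS_ONCE(q->tail) == q->head)
      return NULL;                      /* queue empty */
    smp_mb();                           /* order tail check before item load */
    item = q->ring[q->head & (SPSC_SIZE - 1)];
    smp_mb();                           /* order item load before head update */
    ACCESS_ONCE(q->head) = q->head + 1;
    return item;
  }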
And of course, where feasible, fully partitioned or “sharded” systems provide
excellent performance and scalability, as noted in Chapter 6.
The next chapter will look at updates in the context of several types of data struc-
tures.
Bad programmers worry about the code. Good
programmers worry about data structures and their
relationships.

Linus Torvalds

Chapter 10

Data Structures

Efficient access to data is critically important, so much so that discussions of algorithms include
the time complexity of the related data structures [CLRS01]. However, for parallel programs,
measures of time complexity must also include concurrency effects. These effects can
be overwhelmingly large, as shown in Chapter 3, which means that concurrent data
structure designs must focus as much on concurrency as they do on sequential time
complexity. In other words, an important part of the data-structure relationships that
good parallel programmers must worry about is that portion related to concurrency.
Section 10.1 presents a motivating application that will be used to evaluate the data
structures presented in this chapter.
As discussed in Chapter 6, an excellent way to achieve high scalability is par-
titioning. This points the way to partitionable data structures, a topic taken up by
Section 10.2. Chapter 9 described how deferring some actions can greatly improve both
performance and scalability. Section 9.5 in particular showed how to tap the awesome
power of procrastination in pursuit of performance and scalability, a topic taken up by
Section 10.3.
Not all data structures are partitionable. Section 10.4 looks at a mildly non-
partitionable example data structure. This section shows how to split it into read-mostly
and partitionable portions, enabling a fast and scalable implementation.
Because this chapter cannot delve into the details of every concurrent data structure
that has ever been used, Section 10.5 provides a brief survey of the most common and
important ones. Although the best performance and scalability result from design rather than
from after-the-fact micro-optimization, it is nevertheless the case that micro-optimization has
an important place in achieving the absolute best possible performance and scalability.
This topic is therefore taken up in Section 10.6.
Finally, Section 10.7 presents a summary of this chapter.

10.1 Motivating Application


We will use the Schrödinger’s Zoo application to evaluate performance [McK13].
Schrödinger has a zoo containing a large number of animals, and he would like to
track them using an in-memory database with each animal in the zoo represented by a
data item in this database. Each animal has a unique name that is used as a key, with a
variety of data tracked for each animal.


Births, captures, and purchases result in insertions, while deaths, releases, and sales
result in deletions. Because Schrödinger’s zoo contains a large quantity of short-lived
animals, including mice and insects, the database must be able to support a high update
rate.
Those interested in Schrödinger’s animals can query them, however, Schrödinger
has noted extremely high rates of queries for his cat, so much so that he suspects that
his mice might be using the database to check up on their nemesis. This means that
Schrödinger’s application must be able to support a high rate of queries to a single data
element.
Please keep this application in mind as various data structures are presented.

10.2 Partitionable Data Structures


There are a huge number of data structures in use today, so much so that there are
multiple textbooks covering them. This small section focuses on a single data structure,
namely the hash table. This focused approach allows a much deeper investigation of
how concurrency interacts with data structures, and also focuses on a data structure that
is heavily used in practice. Section 10.2.1 gives an overview of the design, and Section 10.2.2
presents the implementation. Finally, Section 10.2.3 discusses the resulting performance
and scalability.

10.2.1 Hash-Table Design


Chapter 6 emphasized the need to apply partitioning in order to attain respectable
performance and scalability, so partitionability must be a first-class criterion when
selecting data structures. This criterion is well satisfied by that workhorse of parallelism,
the hash table. Hash tables are conceptually simple, consisting of an array of hash
buckets. A hash function maps from a given element’s key to the hash bucket that this
element will be stored in. Each hash bucket therefore heads up a linked list of elements,
called a hash chain. When properly configured, these hash chains will be quite short,
permitting a hash table to access the element with a given key extremely efficiently.
Quick Quiz 10.1: But there are many types of hash tables, of which the chained
hash tables described here are but one type. Why the focus on chained hash tables?
In addition, each bucket can be given its own lock, so that elements in different
buckets of the hash table may be added, deleted, and looked up completely independently.
A large hash table containing a large number of elements therefore offers excellent
scalability.

10.2.2 Hash-Table Implementation


Figure 10.1 (hash_bkt.c) shows a set of data structures used in a simple fixed-sized
hash table using chaining and per-hash-bucket locking, and Figure 10.2 diagrams how
they fit together. The hashtab structure (lines 11-14 in Figure 10.1) contains four
ht_bucket structures (lines 6-9 in Figure 10.1), with the ->ht_nbuckets field
controlling the number of buckets. Each such bucket contains a list header ->htb_
head and a lock ->htb_lock. The list headers chain ht_elem structures (lines 1-4
in Figure 10.1) through their ->hte_next fields, and each ht_elem structure also

1 struct ht_elem {
2 struct cds_list_head hte_next;
3 unsigned long hte_hash;
4 };
5
6 struct ht_bucket {
7 struct cds_list_head htb_head;
8 spinlock_t htb_lock;
9 };
10
11 struct hashtab {
12 unsigned long ht_nbuckets;
13 struct ht_bucket ht_bkt[0];
14 };

Figure 10.1: Hash-Table Data Structures

[Diagram: a hashtab structure with ->ht_nbuckets = 4 and buckets ->ht_bkt[0] through ->ht_bkt[3], each containing ->htb_head and ->htb_lock; bucket 0’s list chains two ht_elem structures (each with ->hte_next and ->hte_hash) and bucket 2’s list chains one.]
Figure 10.2: Hash-Table Data-Structure Diagram

caches the corresponding element’s hash value in the ->hte_hash field. The ht_
elem structure would be included in the larger structure being placed in the hash table,
and this larger structure might contain a complex key.
The diagram shown in Figure 10.2 has bucket 0 with two elements and bucket 2
with one.
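For example, an element of Schrödinger’s zoo might embed ht_elem as shown in the
following sketch; the zoo_animal structure and zoo_cmp() function are illustrative
additions rather than part of hash_bkt.c, and container_of() is assumed to be
available as in the Linux kernel:

  struct zoo_animal {
    struct ht_elem ze_e;                /* hash-table linkage */
    char ze_name[32];                   /* key: the animal's name */
    /* ... other per-animal data ... */
  };

  static int zoo_cmp(struct ht_elem *htep, void *key)
  {
    struct zoo_animal *zap = container_of(htep, struct zoo_animal, ze_e);

    return strcmp(zap->ze_name, (char *)key) == 0;
  }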
Figure 10.3 shows mapping and locking functions. Lines 1 and 2 show the macro
HASH2BKT(), which maps from a hash value to the corresponding ht_bucket
structure. This macro uses a simple modulus: if more aggressive hashing is required, the

1 #define HASH2BKT(htp, h) \
2 (&(htp)->ht_bkt[h % (htp)->ht_nbuckets])
3
4 static void hashtab_lock(struct hashtab *htp,
5 unsigned long hash)
6 {
7 spin_lock(&HASH2BKT(htp, hash)->htb_lock);
8 }
9
10 static void hashtab_unlock(struct hashtab *htp,
11 unsigned long hash)
12 {
13 spin_unlock(&HASH2BKT(htp, hash)->htb_lock);
14 }

Figure 10.3: Hash-Table Mapping and Locking



1 struct ht_elem *
2 hashtab_lookup(struct hashtab *htp,
3 unsigned long hash,
4 void *key,
5 int (*cmp)(struct ht_elem *htep,
6 void *key))
7 {
8 struct ht_bucket *htb;
9 struct ht_elem *htep;
10
11 htb = HASH2BKT(htp, hash);
12 cds_list_for_each_entry(htep,
13 &htb->htb_head,
14 hte_next) {
15 if (htep->hte_hash != hash)
16 continue;
17 if (cmp(htep, key))
18 return htep;
19 }
20 return NULL;
21 }

Figure 10.4: Hash-Table Lookup


1 void
2 hashtab_add(struct hashtab *htp,
3 unsigned long hash,
4 struct ht_elem *htep)
5 {
6 htep->hte_hash = hash;
7 cds_list_add(&htep->hte_next,
8 &HASH2BKT(htp, hash)->htb_head);
9 }
10
11 void hashtab_del(struct ht_elem *htep)
12 {
13 cds_list_del_init(&htep->hte_next);
14 }

Figure 10.5: Hash-Table Modification

caller needs to implement it when mapping from key to hash value. The remaining two
functions acquire and release the ->htb_lock corresponding to the specified hash
value.
Figure 10.4 shows hashtab_lookup(), which returns a pointer to the element
with the specified hash and key if it exists, or NULL otherwise. This function takes both
a hash value and a pointer to the key because this allows users of this function to use
arbitrary keys and arbitrary hash functions, with the key-comparison function passed
in via cmp(), in a manner similar to qsort(). Line 11 maps from the hash value
to a pointer to the corresponding hash bucket. Each pass through the loop spanning
lines 12-19 examines one element of the bucket’s hash chain. Line 15 checks to see if
the hash values match, and if not, line 16 proceeds to the next element. Line 17 checks
to see if the actual key matches, and if so, line 18 returns a pointer to the matching
element. If no element matches, line 20 returns NULL.
Quick Quiz 10.2: But isn’t the double comparison on lines 15-18 in Figure 10.4
inefficient in the case where the key fits into an unsigned long?
Figure 10.5 shows the hashtab_add() and hashtab_del() functions that
add and delete elements from the hash table, respectively.
The hashtab_add() function simply sets the element’s hash value on line 6, then
adds it to the corresponding bucket on lines 7 and 8. The hashtab_del() function
simply removes the specified element from whatever hash chain it is on, courtesy of the

1 struct hashtab *
2 hashtab_alloc(unsigned long nbuckets)
3 {
4 struct hashtab *htp;
5 int i;
6
7 htp = malloc(sizeof(*htp) +
8 nbuckets *
9 sizeof(struct ht_bucket));
10 if (htp == NULL)
11 return NULL;
12 htp->ht_nbuckets = nbuckets;
13 for (i = 0; i < nbuckets; i++) {
14 CDS_INIT_LIST_HEAD(&htp->ht_bkt[i].htb_head);
15 spin_lock_init(&htp->ht_bkt[i].htb_lock);
16 }
17 return htp;
18 }
19
20 void hashtab_free(struct hashtab *htp)
21 {
22 free(htp);
23 }

Figure 10.6: Hash-Table Allocation and Free

[Figure: plot of total lookups per millisecond versus number of CPUs (threads), 1 through 8, together with an “ideal” line for comparison.]

Figure 10.7: Read-Only Hash-Table Performance For Schrödinger’s Zoo

doubly linked nature of the hash-chain lists. Before calling either of these two functions,
the caller is required to ensure that no other thread is accessing or modifying this same
bucket, for example, by invoking hashtab_lock() beforehand.
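Putting these pieces together, a caller might insert an animal as in the following
sketch, which reuses the illustrative zoo_animal and zoo_cmp() from earlier in this
section along with a hypothetical hash_name() function that maps a name to a hash
value:

  void zoo_add_animal(struct hashtab *htp, struct zoo_animal *zap)
  {
    unsigned long hash = hash_name(zap->ze_name);  /* hypothetical hash function */

    hashtab_lock(htp, hash);                       /* exclude other users of this bucket */
    if (hashtab_lookup(htp, hash, zap->ze_name, zoo_cmp) == NULL)
      hashtab_add(htp, hash, &zap->ze_e);          /* insert only if not already present */
    hashtab_unlock(htp, hash);
  }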

Figure 10.6 shows hashtab_alloc() and hashtab_free(), which do hash-
table allocation and freeing, respectively. Allocation begins on lines 7-9 with allocation
of the underlying memory. If line 10 detects that memory has been exhausted, line 11
returns NULL to the caller. Otherwise, line 12 initializes the number of buckets, and
the loop spanning lines 13-16 initializes the buckets themselves, including the chain
list header on line 14 and the lock on line 15. Finally, line 17 returns a pointer to
the newly allocated hash table. The hashtab_free() function on lines 20-23 is
straightforward.

[Figure: plot of total lookups per millisecond versus number of CPUs (threads), from 1 up to 60.]

Figure 10.8: Read-Only Hash-Table Performance For Schrödinger’s Zoo, 60 CPUs

10.2.3 Hash-Table Performance


The performance results for an eight-CPU 2GHz Intel® Xeon® system using a bucket-
locked hash table with 1024 buckets are shown in Figure 10.7. The performance does
scale nearly linearly, but is not much more than half of the ideal performance level, even
at only eight CPUs. Part of this shortfall is due to the fact that the lock acquisitions and
releases incur no cache misses on a single CPU, but do incur misses on two or more
CPUs.
And things only get worse with larger numbers of CPUs, as can be seen in Figure 10.8.
We do not need an additional line to show ideal performance: The performance for
nine CPUs and beyond is worse than abysmal. This clearly underscores the dangers of
extrapolating performance from a modest number of CPUs.
Of course, one possible reason for the collapse in performance might be that more
hash buckets are needed. After all, we did not pad each hash bucket to a full cache
line, so there are a number of hash buckets per cache line. It is possible that the
resulting cache-thrashing comes into play at nine CPUs. This is of course easy to test
by increasing the number of hash buckets.
Quick Quiz 10.3: Instead of simply increasing the number of hash buckets, wouldn’t
it be better to cache-align the existing hash buckets?
However, as can be seen in Figure 10.9, although increasing the number of buckets
does increase performance somewhat, scalability is still abysmal. In particular, we still
see a sharp dropoff at nine CPUs and beyond. Furthermore, going from 8192 buckets to
16,384 buckets produced almost no increase in performance. Clearly something else is
going on.
The problem is that this is a multi-socket system, with CPUs 0-7 and 32-39 mapped
to the first socket as shown in Table 10.1. Test runs