
Parallel Programming in MPI and OpenMP

The Art of HPC, volume 2

Victor Eijkhout

2nd edition 2022, formatted March 9, 2023


Book and slides download: https://tinyurl.com/vle335course
Public repository: https://github.com/VictorEijkhout/TheArtOfHPC_vol2_parallelprogramming
HTML version: https://theartofhpc.com/pcse/

This book is published under the CC-BY 4.0 license.


The term ‘parallel computing’ means different things depending on the application area. In this book we
focus on parallel computing – and more specifically parallel programming; we will not discuss a lot of
theory – in the context of scientific computing.
Two of the most common software systems for parallel programming in scientific computing are MPI and
OpenMP. They target different types of parallelism, and use very different constructs. Thus, by covering
both of them in one book we can offer a treatment of parallelism that spans a large range of possible
applications.
Finally, we also discuss the PETSc (Portable, Extensible Toolkit for Scientific Computation) library, which offers an
abstraction level higher than MPI or OpenMP, geared specifically towards parallel linear algebra, and very
specifically the sort of linear algebra computations arising from Partial Differential Equation modeling.
The main languages in scientific computing are C/C++ and Fortran. We will discuss both MPI and OpenMP
with many examples in these two languages. For MPI and the PETSc library we will also discuss the Python
interfaces.
Comments: This book is in a perpetual state of revision and refinement. Please send
comments of any kind to [email protected].



Contents

I MPI 15

1 Getting started with MPI 17


1.1 Distributed memory and message passing . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.2 History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.3 Basic model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.4 Making and running an MPI program . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.5 Language bindings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.6 Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.7 Full source code of examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

2 MPI topic: Functional parallelism 26


2.1 The SPMD model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.2 Starting and running MPI processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.3 Processor identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.4 Functional parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.5 Distributed computing and distributed data . . . . . . . . . . . . . . . . . . . . . . . 38
2.6 Review questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.7 Full source code of examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

3 MPI topic: Collectives 41


3.1 Working with global information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.2 Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.3 Rooted collectives: broadcast, reduce . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.4 Scan operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.5 Rooted collectives: gather and scatter . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.6 All-to-all . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.7 Reduce-scatter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.8 Barrier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.9 Variable-size-input collectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.10 MPI Operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.11 Nonblocking collectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
3.12 Performance of collectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
3.13 Collectives and synchronization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
3.14 Performance considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91


3.15 Review questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93


3.16 Full source code of examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

4 MPI topic: Point-to-point 97


4.1 Blocking point-to-point operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.2 Nonblocking point-to-point operations . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
4.3 The Status object and wildcards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
4.4 More about point-to-point communication . . . . . . . . . . . . . . . . . . . . . . . . 133
4.5 Review questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
4.6 Full source code of examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140

5 MPI topic: Communication modes 141


5.1 Persistent communication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
5.2 Partitioned communication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
5.3 Synchronous and asynchronous communication . . . . . . . . . . . . . . . . . . . . 148
5.4 Local and nonlocal operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
5.5 Buffered communication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
5.6 Full source code of examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153

6 MPI topic: Data types 154


6.1 The MPI_Datatype data type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
6.2 Predefined data types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
6.3 Derived datatypes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
6.4 Big data types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
6.5 Type maps and type matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
6.6 Type extent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
6.7 Reconstructing types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
6.8 Packing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
6.9 Review questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
6.10 Full source code of examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200

7 MPI topic: Communicators 201


7.1 Basic communicators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
7.2 Duplicating communicators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
7.3 Sub-communicators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
7.4 Splitting a communicator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
7.5 Communicators and groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
7.6 Intercommunicators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
7.7 Review questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
7.8 Full source code of examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218

8 MPI topic: Process management 219


8.1 Process spawning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
8.2 Socket-style communications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223


8.3 Sessions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226


8.4 Functionality available outside init/finalize . . . . . . . . . . . . . . . . . . . . . . . 230
8.5 Full source code of examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231

9 MPI topic: One-sided communication 232


9.1 Windows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
9.2 Active target synchronization: epochs . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
9.3 Put, get, accumulate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
9.4 Passive target synchronization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
9.5 More about window memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
9.6 Assertions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
9.7 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
9.8 Review questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
9.9 Full source code of examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262

10 MPI topic: File I/O 263


10.1 File handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
10.2 File reading and writing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
10.3 Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
10.4 Constants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
10.5 Error handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
10.6 Review questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
10.7 Full source code of examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274

11 MPI topic: Topologies 275


11.1 Cartesian grid topology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
11.2 Distributed graph topology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
11.3 Full source code of examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288

12 MPI topic: Shared memory 289


12.1 Recognizing shared memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
12.2 Shared memory for windows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
12.3 Full source code of examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295

13 MPI topic: Hybrid computing 296


13.1 MPI support for threading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
13.2 Full source code of examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299

14 MPI topic: Tools interface 300


14.1 Initializing the tools interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300
14.2 Control variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300
14.3 Performance variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302
14.4 Categories of variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
14.5 Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305


14.6 Full source code of examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 306

15 MPI leftover topics 307


15.1 Contextual information, attributes, etc. . . . . . . . . . . . . . . . . . . . . . . . . . . 307
15.2 Error handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
15.3 Fortran issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317
15.4 Progress . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318
15.5 Fault tolerance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
15.6 Performance, tools, and profiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
15.7 Determinism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
15.8 Subtleties with processor synchronization . . . . . . . . . . . . . . . . . . . . . . . . . 324
15.9 Shell interaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
15.10 Leftover topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
15.11 Literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328
15.12 Full source code of examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329

16 MPI Examples 330


16.1 Bandwidth and halfbandwidth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 330
16.2 Full source code of examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333

II OpenMP 335

17 Getting started with OpenMP 337


17.1 The OpenMP model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
17.2 Compiling and running an OpenMP program . . . . . . . . . . . . . . . . . . . . . . 339
17.3 Your first OpenMP program . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 340
17.4 Thread data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 342
17.5 Creating parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
17.6 Full source code of examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 346

18 OpenMP topic: Parallel regions 347


18.1 Creating parallelism with parallel regions . . . . . . . . . . . . . . . . . . . . . . . . 347
18.2 Nested parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
18.3 Cancel parallel construct . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351
18.4 Review questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351
18.5 Full source code of examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352

19 OpenMP topic: Loop parallelism 353


19.1 Loop parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353
19.2 An example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
19.3 Loop schedules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 361
19.4 Reductions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364
19.5 Nested loops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364


19.6 Ordered iterations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365


19.7 nowait . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366
19.8 While loops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366
19.9 Review questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 367
19.10 Full source code of examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368

20 OpenMP topic: Reductions 369


20.1 Reductions: why, what, how? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
20.2 Built-in reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 372
20.3 Initial value for reductions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
20.4 User-defined reductions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
20.5 Scan / prefix operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377
20.6 Reductions and floating-point math . . . . . . . . . . . . . . . . . . . . . . . . . . . . 378
20.7 Full source code of examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 379

21 OpenMP topic: Work sharing 380


21.1 Work sharing constructs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 380
21.2 Sections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 380
21.3 Single/master . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381
21.4 Fortran array syntax parallelization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 382
21.5 Full source code of examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 384

22 OpenMP topic: Controlling thread data 385


22.1 Shared data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 385
22.2 Private data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 385
22.3 Data in dynamic scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
22.4 Temporary variables in a loop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 388
22.5 Default . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 388
22.6 First and last private . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389
22.7 Array data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 390
22.8 Persistent data through threadprivate . . . . . . . . . . . . . . . . . . . . . . . . . . 391
22.9 Allocators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 393
22.10 Full source code of examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395

23 OpenMP topic: Synchronization 396


23.1 Barrier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 396
23.2 Mutual exclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397
23.3 Locks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399
23.4 Relaxed memory model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 402
23.5 Example: Fibonacci computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 402
23.6 Full source code of examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405

24 OpenMP topic: Tasks 406


24.1 Task generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 406


24.2 Task data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 408


24.3 Task synchronization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 408
24.4 Task dependencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 409
24.5 Task reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 411
24.6 More . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 411
24.7 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412
24.8 Full source code of examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414

25 OpenMP topic: Affinity 415


25.1 OpenMP thread affinity control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415
25.2 First-touch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 418
25.3 Affinity control outside OpenMP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 421
25.4 Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 421
25.5 Full source code of examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 424

26 OpenMP topic: SIMD processing 427


26.1 Full source code of examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 431

27 OpenMP topic: Offloading 432


27.1 Data on the device . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 432
27.2 Execution on the device . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433
27.3 Full source code of examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 434

28 OpenMP remaining topics 435


28.1 Runtime functions, environment variables, internal control variables . . . . . . . 435
28.2 Timing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 436
28.3 Thread safety . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 437
28.4 Performance and tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 437
28.5 Accelerators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 438
28.6 Tools interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 438
28.7 OpenMP standards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 439
28.8 Memory model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 440
28.9 Full source code of examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 442

29 OpenMP Review 443


29.1 Concepts review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443
29.2 Review questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 444
29.3 Full source code of examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 452

30 OpenMP Examples 453


30.1 N-body problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 453
30.2 Tree traversal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 456
30.3 Depth-first search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 457
30.4 Filtering array elements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 462


30.5 Thread synchronization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 464


30.6 Full source code of examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 466

III PETSc 467

31 PETSc basics 468


31.1 What is PETSc and why? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 468
31.2 Basics of running a PETSc program . . . . . . . . . . . . . . . . . . . . . . . . . . . . 470
31.3 PETSc installation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 473
31.4 External packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 474
31.5 Full source code of examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 476

32 PETSc objects 477


32.1 Distributed objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 477
32.2 Scalars . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 478
32.3 Vec: Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 480
32.4 Mat: Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 490
32.5 Index sets and Vector Scatters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 500
32.6 AO: Application Orderings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 502
32.7 Partitionings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 502
32.8 Full source code of examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 503

33 Grid support 504


33.1 Grid definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 504
33.2 Constructing a vector on a grid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 509
33.3 Vectors of a distributed array . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 510
33.4 Matrices of a distributed array . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 510
33.5 Full source code of examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 512

34 Finite Elements support 513


34.1 General Data Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 513
34.2 Full source code of examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 516

35 PETSc solvers 517


35.1 KSP: linear system solvers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 517
35.2 Direct solvers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 526
35.3 Control through command line options . . . . . . . . . . . . . . . . . . . . . . . . . . 527
35.4 Full source code of examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 528

36 PETSc nonlinear solvers 529


36.1 Nonlinear systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 529
36.2 Time-stepping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 531
36.3 Full source code of examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 532


37 PETSc GPU support 533


37.1 Installation with GPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 533
37.2 Setup for GPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 533
37.3 Distributed objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 533
37.4 Other . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 534
37.5 Full source code of examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 535

38 PETSc tools 536


38.1 Error checking and debugging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 536
38.2 Program output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 538
38.3 Commandline options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 542
38.4 Timing and profiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 544
38.5 Memory management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 545
38.6 Full source code of examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 546

39 PETSc topics 547


39.1 Communicators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 547
39.2 Full source code of examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 548

IV Other programming models 549

40 Co-array Fortran 550


40.1 History and design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 550
40.2 Compiling and running . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 550
40.3 Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 550
40.4 Full source code of examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 553

41 Kokkos 554
41.1 Parallel code execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 554
41.2 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 556
41.3 Execution and memory spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 557
41.4 Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 558
41.5 Stuff . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 559
41.6 Full source code of examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 560

42 Sycl, OneAPI, DPC++ 561


42.1 Logistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 561
42.2 Platforms and devices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 562
42.3 Queues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 562
42.4 Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 563
42.5 Parallel operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 564
42.6 Memory access . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 568
42.7 Parallel output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 571


42.8 DPCPP extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 571


42.9 Intel devcloud notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 571
42.10 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 571
42.11 Full source code of examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 573

43 Python multiprocessing 574


43.1 Software and hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 574
43.2 Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 574
43.3 Pools and mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 575
43.4 Shared data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 576
43.5 Full source code of examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 578

V The Rest 579

44 Exploring computer architecture 580


44.1 Tools for discovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 580
44.2 Full source code of examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 581

45 Hybrid computing 582


45.1 Concurrency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 583
45.2 Affinity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 583
45.3 What does the hardware look like? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 584
45.4 Affinity control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 586
45.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 586
45.6 Processes and cores and affinity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 587
45.7 Practical specification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 588
45.8 Full source code of examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 590

46 Support libraries 591


46.1 SimGrid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 591
46.2 Other . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 591
46.3 Full source code of examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 592

VI Class projects 593

47 A Style Guide to Project Submissions 594


47.1 Structure of your writeup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 594
47.2 The parallel part . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 595
47.3 Helpful remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 596

48 Warmup Exercises 597


48.1 Hello world . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 597
48.2 Collectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 597


48.3 Linear arrays of processors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 598

49 Mandelbrot set 600


49.1 Invocation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 601
49.2 Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 601
49.3 Bulk task scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 603
49.4 Collective task scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 604
49.5 Asynchronous task scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 604
49.6 One-sided solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 605

50 Data parallel grids 606


50.1 Description of the problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 606
50.2 Code basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 606

51 N-body problems 609


51.1 Solution methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 609
51.2 Shared memory approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 609
51.3 Distributed memory approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 609

VII Didactics 611

52 Teaching guide 612


52.1 Full source code of examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 613

53 Teaching from mental models 614


53.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 614
53.2 Implied mental models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 615
53.3 Teaching MPI, the usual way . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 617
53.4 Teaching MPI, our proposal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 618
53.5 ‘Parallel computer games’ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 623
53.6 Further course summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 624
53.7 Prospect for an online course . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 625
53.8 Evaluation and discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 626
53.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 626
53.10 Full source code of examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 627

VIII Bibliography, index, and list of acronyms 629

54 Bibliography 630

55 List of acronyms 632

56 General Index 633


57 Lists of notes 643


57.1 MPI-4 notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 643
57.2 Fortran notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 644
57.3 C++ notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 645
57.4 The MPL C++ interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 645
57.5 Python notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 647

58 Index of MPI commands and keywords 649


58.1 From the standard document . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 657
58.2 MPI for Python . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 662

59 Index of OpenMP keywords 663

60 Index of PETSc keywords 666

61 Index of KOKKOS keywords 671

62 Index of SYCL keywords 672




PART I

MPI
This section of the book teaches MPI (‘Message Passing Interface’), the dominant model for distributed
memory programming in science and engineering. It will instill the following competencies.
Basic level:
• The student will understand the SPMD model and how it is realized in MPI (chapter 2).
• The student will know the basic collective calls, both with and without a root process, and can
use them in applications (chapter 3).
• The student knows the basic blocking and non-blocking point-to-point calls, and how to use
them (chapter 4).
Intermediate level:
• The student knows about derived datatypes and can use them in communication routines
(chapter 6).
• The student knows about intra-communicators, and some basic calls for creating
subcommunicators (chapter 7); also Cartesian process topologies (section 11.1).
• The student understands the basic design of MPI I/O calls and can use them in basic applications
(chapter 10).
• The student understands graph process topologies and neighborhood collectives
(section 11.2).
Advanced level:
• The student understands one-sided communication routines, including window creation
routines, and synchronization mechanisms (chapter 9).
• The student understands MPI shared memory, its advantages, and how it is based on windows
(chapter 12).
• The student understands MPI process spawning mechanisms and inter-communicators
(chapter 8).



Chapter 1

Getting started with MPI

In this chapter you will learn the use of the main tool for distributed memory programming: the Message
Passing Interface (MPI) library. The MPI library has about 250 routines, many of which you may never
need. Since this is a textbook, not a reference manual, we will focus on the important concepts and give the
important routines for each concept. What you learn here should be enough for most common purposes.
You are advised to keep a reference document handy, in case there is a specialized routine, or to look up
subtleties about the routines you use.

1.1 Distributed memory and message passing


In its simplest form, a distributed memory machine is a collection of single computers hooked up with
network cables. In fact, this has a name: a Beowulf cluster. As you recognize from that setup, each processor
can run an independent program, and has its own memory without direct access to other processors’
memory. MPI is the magic that makes multiple instantiations of the same executable run so that they
know about each other and can exchange data through the network.
One of the reasons that MPI is so successful as a tool for high performance on clusters is that it is very
explicit: the programmer controls many details of the data motion between the processors. Consequently,
a capable programmer can write very efficient code with MPI. Unfortunately, that programmer will have to
spell things out in considerable detail. For this reason, people sometimes call MPI ‘the assembly language
of parallel programming’. If that sounds scary, be assured that things are not that bad. You can get started
fairly quickly with MPI, using just the basics, and coming to the more sophisticated tools only when
necessary.
Another reason that MPI was a big hit with programmers is that it does not ask you to learn a new
language: it is a library that can be interfaced to C/C++ or Fortran; there are even bindings to Python and
Java (not described in this course). This does not mean, however, that it is simple to ‘add parallelism’ to an
existing sequential program. An MPI version of a serial program takes considerable rewriting; certainly
more than shared memory parallelism through OpenMP, discussed later in this book.
MPI is also easy to install: there are free implementations that you can download and install on any
computer that has a Unix-like operating system, even if that is not a parallel machine. However, if you
are working on a supercomputer cluster, likely there will already be an MPI installation, tuned to that
machine’s network.


1.2 History
Before the MPI standard was developed in 1993-4, there were many libraries for distributed memory
computing, often proprietary to a vendor platform. MPI standardized the inter-process communication
mechanisms. Other features, such as the process management of PVM or parallel I/O, were omitted. Later
versions of the standard have included many of these features.
Since MPI was designed by a large number of academic and commercial participants, it quickly became a
standard. A few packages from the pre-MPI era, such as Charm++ [17], are still in use since they support
mechanisms that do not exist in MPI.

1.3 Basic model


Here we sketch the two most common scenarios for using MPI. In the first, the user is working on an
interactive machine, which has network access to a number of hosts, typically a network of workstations;
see figure 1.1. The user types the command mpiexec (see footnote 1 below) and supplies

Figure 1.1: Interactive MPI setup

• The number of hosts involved,


• their names, possibly in a hostfile,
• and other parameters, such as whether to include the interactive host; followed by
• the name of the program and its parameters.
The mpiexec program then makes an ssh connection to each of the hosts, giving them sufficient
information that they can find each other. All the output of the processors is piped through the mpiexec
program, and appears on the interactive console.
In the second scenario (figure 1.2) the user prepares a batch job script with commands, and these will
be run when the batch scheduler gives a number of hosts to the job. Now the batch script contains the
mpiexec command, and the hostfile is dynamically generated when the job starts. Since the job now runs
at a time when the user may not be logged in, any screen output goes into an output file.

1. A command variant is mpirun; your local cluster may have a different mechanism.


Figure 1.2: Batch MPI setup

You see that in both scenarios the parallel program is started by the mpiexec command using a Single
Program Multiple Data (SPMD) mode of execution: all hosts execute the same program. It is possible for
different hosts to execute different programs, but we will not consider that in this book.
There can be options and environment variables that are specific to some MPI installations, or to the
network.
• mpich and its derivatives such as Intel MPI or Cray MPI have mpiexec options:
https://www.mpich.org/static/docs/v3.1/www1/mpiexec.html

1.4 Making and running an MPI program


MPI is a library, called from programs in ordinary programming languages such as C/C++ or Fortran. To
compile such a program you use your regular compiler:
gcc -c my_mpi_prog.c -I/path/to/mpi/include
gcc -o my_mpi_prog my_mpi_prog.o -L/path/to/mpi/lib -lmpich
However, MPI libraries may have different names between different architectures, making it hard to have
a portable makefile. Therefore, MPI typically has shell scripts around your compiler call, called mpicc,
mpicxx, mpif90 for C/C++/Fortran respectively.
mpicc -c my_mpi_prog.c
mpicc -o my_mpi_prog my_mpi_prog.o
If you want to know what mpicc does, there is usually an option that prints out its definition. On a Mac
with the clang compiler:
$ mpicc -show
clang -fPIC -fstack-protector -fno-stack-check -Qunused-arguments -g3 -O0 -Wno-implicit-functio

Remark 1 In OpenMPI, these commands are binary executables by default, but you can turn them into shell
scripts by passing the --enable-script-wrapper-compilers option at configure time.


MPI programs can be run on many different architectures. Obviously it is your ambition (or at least your
dream) to run your code on a cluster with a hundred thousand processors and a fast network. But maybe
you only have a small cluster with plain ethernet. Or maybe you’re sitting in a plane, with just your laptop.
An MPI program can be run in all these circumstances – within the limits of your available memory of
course.
The way this works is that you do not start your executable directly, but you use a program, typically
called mpiexec or something similar, which makes a connection to all available processors and starts a
run of your executable there. So if you have a thousand nodes in your cluster, mpiexec can start your
program once on each, and if you only have your laptop it can start a few instances there. In the latter
case you will of course not get great performance, but at least you can test your code for correctness.
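For instance, with the executable my_mpi_prog from the compilation example above, starting four processes on the local machine could look as follows. (This is just a sketch: the launcher name and its options depend on your installation; see also section 1.3.)
mpiexec -n 4 ./my_mpi_prog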
Python note 1: Running mpi4py programs. Load the TACC-provided python:
module load python
and run it as:
ibrun python-mpi yourprogram.py

1.5 Language bindings


1.5.1 C
The MPI library is written in C. However, the standard is careful to distinguish between MPI routines
and their C bindings. In fact, as of MPI-4, for a number of routines there are two bindings, depending
on whether you want 4-byte integers or larger. See section 6.4, in particular 6.4.1.
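As a quick illustration (a minimal sketch, assuming an MPI-4 installation that provides the large-count bindings; details are in section 6.4): the classic binding of MPI_Send takes an int count, while the MPI-4 variant MPI_Send_c takes an MPI_Count. The helper function below is purely illustrative.
#include <mpi.h>
#include <limits.h>

// illustrative helper: send a buffer that may hold more than 2^31-1 elements
void send_doubles( double *buf, MPI_Count n, int target, MPI_Comm comm ) {
  if ( n <= (MPI_Count)INT_MAX )
    MPI_Send( buf, (int)n, MPI_DOUBLE, target, 0, comm );   // classic binding: int count
  else
    MPI_Send_c( buf, n, MPI_DOUBLE, target, 0, comm );      // MPI-4 binding: MPI_Count count
}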

1.5.2 C++, including MPL


C++ bindings were defined in the standard at one point, but they were declared deprecated, and have been
officially removed in the MPI-3 standard. Thus, MPI can be used from C++ by including
#include <mpi.h>
and using the C API.
The Boost library has its own MPI interface, but it seems to be no longer under active development. A recent
effort at idiomatic C++ support is Message Passing Layer (MPL) https://github.com/rabauke/mpl.
This book has an index of MPL notes and commands: section 57.4.
MPL note 1: Notes format. MPL is a C++ header-only library. Notes on MPI usage from MPL will be indi-
cated like this.

1.5.3 Fortran
Fortran note 1: Formatting of Fortran notes. Fortran-specific notes will be indicated with a note like this.
Traditionally, Fortran bindings for MPI look very much like the C ones, except that each routine has a final
error return parameter. You will find that a lot of MPI code in Fortran conforms to this.


However, in the MPI 3 standard it is recommended that an MPI implementation providing a Fortran in-
terface provide a module named mpi_f08 that can be used in a Fortran program. This incorporates the
following improvements:
• This defines MPI routines to have an optional final parameter for the error.
• There are some visible implications of using the mpi_f08 module, mostly related to the fact
that some of the ‘MPI datatypes’ such as MPI_Comm, which were declared as Integer previously,
are now a Fortran Type. See the following sections for details: Communicator 7.1, Datatype 6.1,
Info 15.1.1, Op 3.10.2, Request 4.2.1, Status 4.3, Window 9.1.
• The mpi_f08 module solves a problem with previous Fortran90 bindings: Fortran90 is a strongly
typed language, so it is not possible to pass arguments of arbitrary type by address, as C/C++ do
with the void* type for send and receive buffers. This was solved by having separate routines
for each datatype, and providing an Interface block in the MPI module. If you manage to
request a version that does not exist, the compiler will display a message like
There is no matching specific subroutine for this generic subroutine call [MPI_Send]
For details see http://mpi-forum.org/docs/mpi-3.1/mpi31-report/node409.htm.

1.5.4 Python
Python note 2: Python notes. Python-specific notes will be indicated with a note like this.
The mpi4py package [6, 5] of python bindings is not defined by the MPI standards committee. Instead, it
is the work of an individual, Lisandro Dalcin.
In a way, the Python interface is the most elegant. It uses Object-Oriented (OO) techniques such as
methods on objects, and many default arguments.
Notable about the Python bindings is that many communication routines exist in two variants:
• a version that can send arbitrary Python objects. These routines have lowercase names such as
bcast; and
• a version that sends numpy objects; these routines have names such as Bcast. Their syntax can
be slightly different.
The first version looks more ‘pythonic’, is easier to write, and can do things like sending python
objects, but it is also decidedly less efficient since data is packed and unpacked with pickle. As a common
sense guideline, use the numpy interface in the performance-critical parts of your code, and the pythonic
interface only for complicated actions in a setup phase.
Codes with mpi4py can be interfaced to other languages through Swig or conversion routines.
Data in numpy can be specified as a simple object, or [data, (count,displ), datatype].

1.5.5 How to read routine signatures


Throughout the MPI part of this book we will give the reference syntax of the routines. This typically
comprises:
• The semantics: routine name and list of parameters and what they mean.
• C syntax: the routine definition as it appears in the mpi.h file.


• Fortran syntax: routine definition with parameters, giving in/out specification.


• Python syntax: routine name, indicating to what class it applies, and parameters, indicating
which ones are optional.
These ‘routine signatures’ look like code but they are not! Here is how you translate them.

1.5.5.1 C
The typical C routine specification in MPI looks like:
int MPI_Comm_size(MPI_Comm comm,int *nprocs)

This means that


• The routine returns an int error code. Strictly speaking you should test it against MPI_SUCCESS (for
all error codes, see section 15.2.1):
MPI_Comm comm = MPI_COMM_WORLD;
int nprocs;
int errorcode;
errorcode = MPI_Comm_size( comm, &nprocs );
if (errorcode!=MPI_SUCCESS) {
printf("Routine MPI_Comm_size failed! code=%d\n",
errorcode);
return 1;
}

However, the error codes are hardly ever useful, and there is not much your program can do to
recover from an error. Most people call the routine as
MPI_Comm_size( /* parameter ... */ );

For more on error handling, see section 15.2.


• The first argument is of type MPI_Comm. This is not a C built-in datatype, but it behaves like one.
There are many of these MPI_something datatypes in MPI. So you can write:
MPI_Comm my_comm = MPI_COMM_WORLD; // using a predefined value
MPI_Comm_size( my_comm, /* remaining parameters */ );

• Finally, there is a ‘star’ parameter. This means that the routine wants an address, rather than a
value. You would typically write:
MPI_Comm my_comm = MPI_COMM_WORLD; // using a predefined value
int nprocs;
MPI_Comm_size( my_comm, &nprocs );

Seeing a ‘star’ parameter usually means one of two things: the routine has an array argument, or the
routine internally sets the value of a variable. The latter is the case here.

1.5.5.2 Fortran
The Fortran specification looks like:


MPI_Comm_size(comm, size, ierror)


Type(MPI_Comm), Intent(In) :: comm
Integer, Intent(Out) :: size
Integer, Optional, Intent(Out) :: ierror

or for the Fortran90 legacy mode:


MPI_Comm_size(comm, size, ierror)
INTEGER, INTENT(IN) :: comm
INTEGER, INTENT(OUT) :: size
INTEGER, OPTIONAL, INTENT(OUT) :: ierror

The syntax of using this routine is close to this specification: you write
Type(MPI_Comm) :: comm = MPI_COMM_WORLD
! legacy: Integer :: comm = MPI_COMM_WORLD
Integer :: size,ierr
CALL MPI_Comm_size( comm, size ) ! without the optional ierr

• Most Fortran routines have the same parameters as the corresponding C routine, except that
they all have the error code as final parameter, instead of as a function result. As with C, you
can ignore the value of that parameter. Just don’t forget it.
• The types of the parameters are given in the specification.
• Where C routines have MPI_Comm and MPI_Request and such parameters, Fortran has INTEGER
parameters, or sometimes arrays of integers.

1.5.5.3 Python
The Python interface to MPI uses classes and objects. Thus, a specification like:
MPI.Comm.Send(self, buf, int dest, int tag=0)

should be parsed as follows.


• First of all, you need the MPI class:
from mpi4py import MPI

• Next, you need a Comm object. Often you will use the predefined communicator
comm = MPI.COMM_WORLD

• The keyword self indicates that the actual routine Send is a method of the Comm object, so you
call:
comm.Send( .... )

• Parameters that are listed by themselves, such as buf, are positional. Parameters that are listed
with a type, such as int dest, are keyword parameters. Keyword parameters that have a value
specified, such as int tag=0 are optional, with the default value indicated. Thus, the typical
call for this routine is:
comm.Send(sendbuf,dest=other)


specifying the send buffer as positional parameter, the destination as keyword parameter, and
using the default value for the optional tag.
Some python routines are ‘class methods’, and their specification lacks the self keyword. For instance:
MPI.Request.Waitall(type cls, requests, statuses=None)

would be used as
MPI.Request.Waitall(requests)

1.6 Review
Review 1.1. What determines the parallelism of an MPI job?
1. The size of the cluster you run on.
2. The number of cores per cluster node.
3. The parameters of the MPI starter (mpiexec, ibrun,…)
Review 1.2. T/F: the number of cores of your laptop is the limit of how many MPI processes
you can start up.
Review 1.3. Do the following languages have an object-oriented interface to MPI? In what
sense?
1. C
2. C++
3. Fortran2008
4. Python


1.7 Full source code of examples


Please see the repository:
https://github.com/VictorEijkhout/TheArtOfHPC_vol2_parallelprogramming/tree/main.
Examples are sorted first by programming system, then by language.

Chapter 2

MPI topic: Functional parallelism

2.1 The SPMD model


MPI programs conform largely to the Single Program Multiple Data (SPMD) model, where each processor
runs the same executable. This running executable we call a process.
When MPI was first written, 20 years ago, it was clear what a processor was: it was what was in a computer
on someone’s desk, or in a rack. If this computer was part of a networked cluster, you called it a node. So
if you ran an MPI program, each node would have one MPI process; figure 2.1. You could of course run
more than one process, using the time slicing of the Operating System (OS), but that would give you no
extra performance.

Figure 2.1: Cluster structure as of the mid 1990s
These days the situation is more complicated. You can still talk about a node in a cluster, but now a node
can contain more than one processor chip (sometimes called a socket), and each processor chip probably
has multiple cores. Figure 2.2 shows how you could explore this using a mix of MPI between the nodes,
and a shared memory programming system on the nodes.
However, since each core can act like an independent processor, you can also have multiple MPI processes
per node. To MPI, the cores look like the old completely separate processors. This is the ‘pure MPI’ model


Figure 2.2: Hybrid cluster structure

of figure 2.3, which we will use in most of this part of the book. (Hybrid computing will be discussed in
chapter 45.)

Figure 2.3: MPI-only cluster structure

This is somewhat confusing: the old processors needed MPI programming, because they were physically
separated. The cores on a modern processor, on the other hand, share the same memory, and even some
caches. In its basic mode MPI seems to ignore all of this: each core receives an MPI process and the
programmer writes the same send/receive call no matter where the other process is located. In fact, you
can’t immediately see whether two cores are on the same node or different nodes. Of course, on the
implementation level MPI uses a different communication mechanism depending on whether cores are


on the same socket or on different nodes, so you don’t have to worry about lack of efficiency.

Remark 2 In some rare cases you may want to run in an Multiple Program Multiple Data (MPMD) mode,
rather than SPMD. This can be achieved either on the OS level (see section 15.9.4), using options of the mpiexec
mechanism, or you can use MPI’s built-in process management; chapter 8. Like I said, this concerns only rare
cases.

2.2 Starting and running MPI processes


The SPMD model may be initially confusing. Even though there is only a single source, compiled into
a single executable, the parallel run comprises a number of independently started MPI processes (see
section 1.3 for the mechanism).
The following exercises are designed to give you an intuition for this one-source-many-processes setup.
In the first exercise you will see that the mechanism for starting MPI programs starts up independent
copies. There is nothing in the source that says ‘and now you become parallel’.

Figure 2.4: Running a hello world program in parallel

The following exercise demonstrates this point.


Exercise 2.1. Write a ‘hello world’ program, without any MPI in it, and run it in parallel
with mpiexec or your local equivalent. Explain the output.
(There is a skeleton for this exercise under the name hello.)
This exercise is illustrated in figure 2.4.


Figure 2.1 MPI_Init


Name Param name Explanation C type F type inout
MPI_Init (
)

Figure 2.2 MPI_Finalize


Name Param name Explanation C type F type inout
MPI_Finalize (
)

2.2.1 Headers
If you use MPI commands in a program file, be sure to include the proper header file, mpi.h for C/C++.
#include "mpi.h" // for C
The internals of these files can differ between MPI installations, so you cannot compile one file against
one MPI's mpi.h and another file against a different MPI's header, even with the same compiler on the
same machine.
Fortran note 2: MPI module. For MPI use from Fortran, use an MPI module.
use mpi ! pre 3.0
use mpi_f08 ! 3.0 standard
New language developments, such as large counts (section 6.4.2), will only be included in the
mpi_f08 module, not in the earlier mechanisms.
The header file mpif.h is deprecated as of MPI-4.1: it may be supported by installations, but
doing so is strongly discouraged.
Python note 3: Import mpi module. It’s easiest to
from mpi4py import MPI

MPL note 2: Header file. To compile MPL programs, add a line


#include <mpl/mpl.hpp>

to your file.

2.2.2 Initialization / finalization


Every (useful) MPI program has to start with MPI initialization through a call to MPI_Init (figure 2.1), and
have MPI_Finalize (figure 2.2) to finish the use of MPI in your program. The init call is different between
the various languages.
In C, you can pass argc and argv, the arguments of a C language main program:


int main(int argc,char **argv) {
  MPI_Init(&argc,&argv);

  ....
  MPI_Finalize();
  return 0;
}

(It is allowed to pass NULL for these arguments.)


Fortran (before 2008) lacks this commandline argument handling, so MPI_Init lacks those arguments.
After MPI_Finalize no MPI routines (with a few exceptions such as MPI_Finalized) can be called. In partic-
ular, it is not allowed to call MPI_Init again. If you want to do that, use the sessions model; section 8.3.
Python note 4: Initialize/finalize. In many cases, no initialize and finalize calls are needed: the statement
## mpi.py
from mpi4py import MPI
performs the initialization. Likewise, the finalize happens at the end of the program.
However, for special cases, there is an mpi4py.rc object that can be set in between importing
mpi4py and importing mpi4py.MPI:
import mpi4py
mpi4py.rc.initialize = False
mpi4py.rc.finalize = False
from mpi4py import MPI
MPI.Init()
# stuff
MPI.Finalize()

MPL note 3: Init, finalize. There is no initialization or finalize call.


MPL implementation note: Initialization is done at the first mpl::environment method
call, such as comm_world.
This may look a bit like declaring ‘this is the parallel part of a program’, but that’s not true: again, the
whole code is executed multiple times in parallel.
Exercise 2.2. Add the commands MPI_Init and MPI_Finalize to your code. Put three
different print statements in your code: one before the init, one between init and
finalize, and one after the finalize. Again explain the output.
Run your program on a large scale, using a batch job. Where does the output go?
Experiment with
MY_MPIRUN_OPTIONS="-prepend-rank" ibrun yourprogram

Remark 3 For hybrid MPI-plus-threads programming there is also a call MPI_Init_thread. For that, see sec-
tion 13.1.

2.2.2.1 Aborting an MPI run


Apart from MPI_Finalize, which signals a successful conclusion of the MPI run, an abnormal end to a run
can be forced by MPI_Abort (figure 2.3). This stops execution on all processes associated with the communicator,


Figure 2.3 MPI_Abort


Name Param name Explanation C type F type inout
MPI_Abort (
comm communicator of MPI MPI_Comm TYPE IN
processes to abort (MPI_Comm)
errorcode error code to return to int INTEGER IN
invoking environment
)
MPL:

void mpl::communicator::abort ( int ) const


Python:

MPI.Comm.Abort(self, int errorcode=0)

Figure 2.4 MPI_Initialized


Name Param name Explanation C type F type inout
MPI_Initialized (
flag Flag is true if MPI_INIT int* LOGICAL OUT
has been called and false
otherwise
)

but many implementations will terminate all processes. The value parameter is returned to the
environment.
Code:
// return.c
MPI_Abort(MPI_COMM_WORLD,17);

Output:
mpicc -o return return.o
mpirun -n 1 ./return ; \
    echo "MPI program return code $?"
application called MPI_Abort(MPI_COMM_WORLD, 17) - process 0
MPI program return code 17

2.2.2.2 Testing the initialized/finalized status


The commandline arguments argc and argv are only guaranteed to be passed to process zero, so the best
way to pass commandline information is by a broadcast (section 3.3.3).
There are a few commands, such as MPI_Get_processor_name, that are allowed before MPI_Init.
If MPI is used in a library, MPI can have already been initialized in a main program. For this reason, one
can test where MPI_Init has been called with MPI_Initialized (figure 2.4).
You can test whether MPI_Finalize has been called with MPI_Finalized (figure 2.5).
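As an illustration of the library scenario above, here is a minimal sketch (my own, not one of the book's
example codes): initialize MPI only if the main program has not already done so.
int mpi_is_up=0;
MPI_Initialized(&mpi_is_up);
if (!mpi_is_up)
  MPI_Init(NULL,NULL);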


Figure 2.5 MPI_Finalized


Name Param name Explanation C type F type inout
MPI_Finalized (
flag true if MPI was finalized int* LOGICAL OUT
)

2.2.2.3 Information about the run

Once MPI has been initialized, the MPI_INFO_ENV object contains a number of key/value pairs describing
run-specific information; see section 15.1.1.1.

2.2.2.4 Commandline arguments

The MPI_Init routine takes a reference to argc and argv for the following reason: the MPI_Init call
filters out the arguments to mpirun or mpiexec, thereby lowering the value of argc and eliminating some
of the argv arguments.

On the other hand, the commandline arguments that are meant for mpiexec wind up in the MPI_INFO_ENV
object as a set of key/value pairs; see section 15.1.1.
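As a hedged sketch, the ‘command’ key (one of the keys the MPI standard reserves for the MPI_INFO_ENV
object) could be queried like this; the variable names are my own:
char command[MPI_MAX_INFO_VAL+1]; int flag;
MPI_Info_get( MPI_INFO_ENV, "command", MPI_MAX_INFO_VAL, command, &flag );
if (flag)
  printf("This run was started as: %s\n", command);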

2.3 Processor identification

Since all processes in an MPI job are instantiations of the same executable, you’d think that they all execute
the exact same instructions, which would not be terribly useful. You will now learn how to distinguish
processes from each other, so that together they can start doing useful work.

2.3.1 Processor name

In the following exercise you will print out the hostname of each MPI process with MPI_Get_processor_name
(figure 2.6) as a first way of distinguishing between processes. This routine has a character buffer argu-
ment, which needs to be allocated by you. The length of the buffer is also passed, and on return that
parameter has the actually used length. The maximum needed length is MPI_MAX_PROCESSOR_NAME.


Figure 2.6 MPI_Get_processor_name


Name Param name Explanation C type F type inout
MPI_Get_processor_name (
name A unique specifier for char* CHARACTER OUT
the actual (as opposed to
virtual) node.
resultlen Length (in printable int* INTEGER OUT
characters) of the result
returned in name
)
Python:

MPI.Get_processor_name()

Code:
// procname.c
int name_length = MPI_MAX_PROCESSOR_NAME;
char proc_name[name_length];
MPI_Get_processor_name(proc_name,&name_length);
printf("Process %d/%d is running on node <<%s>>\n",
       procid,nprocs,proc_name);

Output:
make[3]: `procname' is up to date.
TACC: Starting up job 4328411
TACC: Starting parallel tasks...
This process is running on node <<c205-036.frontera.tacc.utexas.edu>>
This process is running on node <<c205-036.frontera.tacc.utexas.edu>>
This process is running on node <<c205-036.frontera.tacc.utexas.edu>>
This process is running on node <<c205-036.frontera.tacc.utexas.edu>>
This process is running on node <<c205-036.frontera.tacc.utexas.edu>>
This process is running on node <<c205-035.frontera.tacc.utexas.edu>>
This process is running on node <<c205-035.frontera.tacc.utexas.edu>>
This process is running on node <<c205-035.frontera.tacc.utexas.edu>>
This process is running on node <<c205-035.frontera.tacc.utexas.edu>>
This process is running on node <<c205-035.frontera.tacc.utexas.edu>>
TACC: Shutdown complete. Exiting.

(Underallocating the buffer will not lead to a runtime error.)


Fortran note 3: Processor name. Allocate a Character variable with the appropriate length. The returned
value of the length parameter can assist in printing the result:
Code:
!! procname.F90
Character(len=MPI_MAX_PROCESSOR_NAME) :: proc_name
Integer :: len
len = MPI_MAX_PROCESSOR_NAME
call MPI_Get_processor_name(proc_name,len)
print *,"Proc",procid,"runs on ",proc_name(1:len),"."

Output:
Proc 1 runs on c202-010.frontera.tacc.utexas.edu.
Proc 3 runs on c202-011.frontera.tacc.utexas.edu.
Proc 0 runs on c202-010.frontera.tacc.utexas.edu.
Proc 2 runs on c202-011.frontera.tacc.utexas.edu.

Exercise 2.3. Use the command MPI_Get_processor_name. Confirm that you are able to run a
program that uses two different nodes.
MPL note 4: Processor name. The processor_name call is an environment method returning a std::string:
std::string mpl::environment::processor_name ();

2.3.2 Communicators
First we need to introduce the concept of communicator, which is an abstract description of a group of
processes. For now you only need to know about the existence of the MPI_Comm data type, and that there
is a pre-defined communicator MPI_COMM_WORLD which describes all the processes involved in your parallel
run.
In a procedural language such as C, a communicator is a variable that is passed to most routines:
#include <mpi.h>
MPI_Comm comm = MPI_COMM_WORLD;
MPI_Send( /* stuff */ comm );

Fortran note 4: Communicator type. In Fortran, pre-2008 a communicator was an opaque handle, stored in
an Integer. With Fortran 2008, communicators are derived types:
use mpi_f08
Type(MPI_Comm) :: comm = MPI_COMM_WORLD
call MPI_Send( ... comm )

Python note 5: Communicator objects. In object-oriented languages, a communicator is an object, and


rather than passing it to routines, the routines are often methods of the communicator object:
from mpi4py import MPI
comm = MPI.COMM_WORLD
comm.Send( buffer, target )

MPL note 5: World communicator. The naive way of declaring a communicator would be:
// commrank.cxx
mpl::communicator comm_world =
mpl::environment::comm_world();


calling the predefined environment method comm_world.


However, if the variable will always correspond to the world communicator, it is better to make
it const and declare it to be a reference:
const mpl::communicator &comm_world =
mpl::environment::comm_world();

MPL note 6: Communicator copying. The communicator class has its copy operator deleted; however, copy
initialization exists:
// commcompare.cxx
const mpl::communicator &comm =
mpl::environment::comm_world();
cout << "same: " << boolalpha << (comm==comm) << endl;

mpl::communicator copy =
mpl::environment::comm_world();
cout << "copy: " << boolalpha << (comm==copy) << endl;

mpl::communicator init = comm;


cout << "init: " << boolalpha << (init==comm) << endl;

(This outputs true/false/false respectively.)


MPL implementation note: The copy initializer performs an MPI_Comm_dup.
MPL note 7: Communicator passing. Pass communicators by reference to avoid communicator duplica-
tion:
// commpass.cxx
// BAD! this does a MPI_Comm_dup.
void comm_val( const mpl::communicator comm );

// correct!
void comm_ref( const mpl::communicator &comm );

You will learn much more about communicators in chapter 7.

2.3.3 Process and communicator properties: rank and size


To distinguish between processes in a communicator, MPI provides two calls
1. MPI_Comm_size (figure 2.7) reports how many processes there are in all; and
2. MPI_Comm_rank (figure 2.8) states what the number of the process is that calls this routine.
If every process executes the MPI_Comm_size call, they all get the same result, namely the total number of
processes in your run. On the other hand, if every process executes MPI_Comm_rank, they all get a different
result, namely their own unique number, an integer in the range from zero to the number of processes
minus 1. See figure 2.5. In other words, each process can find out ‘I am process 5 out of a total of 20’.
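In code, the two calls look as follows; this is a minimal sketch, where comm, nprocs, procno are the
variable conventions used in this book's examples:
MPI_Comm comm = MPI_COMM_WORLD;
int nprocs, procno;
MPI_Comm_size( comm, &nprocs );
MPI_Comm_rank( comm, &procno );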
Exercise 2.4. Write a program where each process prints out a message reporting its
number, and how many processes there are:


Figure 2.7 MPI_Comm_size


Name Param name Explanation C type F type inout
MPI_Comm_size (
comm communicator MPI_Comm TYPE IN
(MPI_Comm)
size number of processes in the int* INTEGER OUT
group of comm
)
MPL:

int mpl::communicator::size ( ) const


Python:

MPI.Comm.Get_size(self)

Figure 2.8 MPI_Comm_rank


Name Param name Explanation C type F type inout
MPI_Comm_rank (
comm communicator MPI_Comm TYPE IN
(MPI_Comm)
rank rank of the calling int* INTEGER OUT
process in group of comm
)
MPL:

int mpl::communicator::rank ( ) const


Python:

MPI.Comm.Get_rank(self)

Hello from process 2 out of 5!


Write a second version of this program, where each process opens a unique file and
writes to it. On some clusters this may not be advisable if you have large numbers of
processors, since it can overload the file system.
(There is a skeleton for this exercise under the name commrank.)
Exercise 2.5. Write a program where only the process with number zero reports on how
many processes there are in total.
In object-oriented approaches to MPI, that is, mpi4py and MPL, the MPI_Comm_rank and MPI_Comm_size rou-
tines are methods of the communicator class:

Python note 6: Communicator rank and size. Rank and size are methods of the communicator object. Note
that their names are slightly different from the MPI standard names.
comm = MPI.COMM_WORLD
procid = comm.Get_rank()
nprocs = comm.Get_size()


Figure 2.5: Parallel program that prints process rank

MPL note 8: Rank and size. The rank of a process (by mpl::communicator::rank) and the size of a commu-
nicator (by mpl::communicator::size) are both methods of the communicator class:
const mpl::communicator &comm_world =
mpl::environment::comm_world();
int procid = comm_world.rank();
int nprocs = comm_world.size();

2.4 Functional parallelism


Now that processes can distinguish themselves from each other, they can decide to engage in different
activities. In an extreme case you could have a code that looks like
// climate simulation:
if (procid==0)
earth_model();
else if (procid==1)
sea_model();
else
air_model();

Practice is a little more complicated than this. But we will start exploring this notion of processes deciding
on their activity based on their process number.


Being able to tell processes apart is already enough to write some applications, without knowing any
other MPI. We will look at a simple parallel search algorithm: based on its rank, a processor can find
its section of a search space. For instance, in Monte Carlo codes a large number of random samples is
generated and some computation performed on each. (This particular example requires each MPI process
to run an independent random number generator, which is not entirely trivial.)
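A simple, if statistically naive, way to give each MPI process its own random stream is to seed the C
generator with the process rank; this sketch is my own illustration, not the book's solution, and it
assumes procno holds the rank:
// seed the generator differently on each process
srand( 1000037 * (procno+1) );
float myrandom = (float) rand() / (float) RAND_MAX;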
Exercise 2.6. Is the number 𝑁 = 2, 000, 000, 111 prime? Let each process test a disjoint set
of integers, and print out any factor they find. You don’t have to test all
integers < 𝑁 : any factor is at most √𝑁 ≈ 45, 200.
(Hint: i%0 probably gives a runtime error.)
Can you find more than one solution?
(There is a skeleton for this exercise under the name prime.)

Remark 4 Normally, we expect parallel algorithms to be faster than sequential. Now consider the above
exercise. Suppose the number we are testing is divisible by some small prime number, but every process has a
large block of numbers to test. In that case the sequential algorithm would have been faster than the parallel
one. Food for thought.

As another example, in Boolean satisfiability problems a number of points in a search space needs to be
evaluated. Knowing a process’s rank is enough to let it generate its own portion of the search space. The
computation of the Mandelbrot set can also be considered as a case of functional parallelism. However,
the image that is constructed is data that needs to be kept on one processor, which breaks the symmetry
of the parallel run.
Of course, at the end of a functionally parallel run you need to summarize the results, for instance printing
out some total. The mechanisms for that you will learn next.

2.5 Distributed computing and distributed data


One reason for using MPI is that sometimes you need to work on a single object, say a vector or a matrix,
with a data size larger than can fit in the memory of a single processor. With distributed memory, each
processor then gets a part of the whole data structure and only works on that.
So let’s say we have a large array, and we want to distribute the data over the processors. That means
that, with p processes and n elements per processor, we have a total of n ⋅ p elements.
We sometimes say that data is the local part of a distributed array with a total size of n ⋅ p elements.
However, this array only exists conceptually: each processor has an array with lowest index zero, and
you have to translate that yourself to an index in the global array. In other words, you have to write
your code in such a way that it acts like you’re working with a large array that is distributed over the
processors, while actually manipulating only the local arrays on the processors.
Your typical code then looks like
int myfirst = .....;
for (int ilocal=0; ilocal<nlocal; ilocal++) {


Figure 2.6: Local parts of a distributed array

int iglobal = myfirst+ilocal;


array[ilocal] = f(iglobal);
}

Exercise 2.7. Allocate on each process an array:


int my_ints[10];

and fill it so that process 0 has the integers 0 ⋯ 9, process 1 has 10 ⋯ 19, et cetera.
It may be hard to print the output in a non-messy way.
If the array size is not perfectly divisible by the number of processors, we have to come up with a division
that is uneven, but not too much. You could for instance, write
int Nglobal, // is something large
Nlocal = Nglobal/ntids,
excess = Nglobal%ntids;
if (mytid==ntids-1)
Nlocal += excess;

Exercise 2.8. Argue that this strategy is not optimal. Can you come up with a better
distribution? Load balancing is further discussed in HPC book, section-2.10.

2.6 Review questions


For all true/false questions, if you answer that a statement is false, give a one-line explanation.
Exercise 2.9. True or false: mpicc is a compiler.
Exercise 2.10. T/F?
1. In C, the result of MPI_Comm_rank is a number from zero to
number-of-processes-minus-one, inclusive.
2. In Fortran, the result of MPI_Comm_rank is a number from one to
number-of-processes, inclusive.
Exercise 2.11. What is the function of a hostfile?


2.7 Full source code of examples


Please see the repository: https://github.com/VictorEijkhout/TheArtOfHPC_vol2_
parallelprogramming/tree/main. Examples are sorted first by programming system, then by
language.



Chapter 3

MPI topic: Collectives

A certain class of MPI routines are called ‘collective’, or more correctly: ‘collective on a communicator’.
This means that if one process in that communicator calls that routine, all processes in that communicator need to call it.
In this chapter we will discuss collective routines that are about combining the data on all processes in
that communicator, but there are also operations such as opening a shared file that are collective, which
will be discussed in a later chapter.

3.1 Working with global information


If all processes have individual data, for instance the result of a local computation, you may want to
bring that information together, for instance to find the maximal computed value or the sum of all values.
Conversely, sometimes one processor has information that needs to be shared with all. For this sort of
operation, MPI has collectives.
There are various cases, illustrated in figure 3.1, which you can (sort of) motivate by considering some
classroom activities:
• The teacher tells the class when the exam will be. This is a broadcast: the same item of infor-
mation goes to everyone.
• After the exam, the teacher performs a gather operation to collect the individual exams.
• On the other hand, when the teacher computes the average grade, each student has an individual
number, but these are now combined to compute a single number. This is a reduction.
• Now the teacher has a list of grades and gives each student their grade. This is a scatter operation,
where one process has multiple data items, and gives a different one to all the other processes.
This story is a little different from what happens with MPI processes, because these are more symmetric;
the process doing the reducing and broadcasting is no different from the others. Any process can function
as the root process in such a collective.
Exercise 3.1. How would you realize the following scenarios with MPI collectives?
1. Let each process compute a random number. You want to print the maximum
of these numbers to your screen.
2. Each process computes a random number again. Now you want to scale these
numbers by their maximum.


Figure 3.1: The four most common collectives

3. Let each process compute a random number. You want to print on what
processor the maximum value is computed.
Think about time and space complexity of your suggestions.

3.1.1 Practical use of collectives


Collectives are quite common in scientific applications. For instance, if one process reads data from disc
or the commandline, it can use a broadcast or a gather to get the information to other processes. Likewise,
at the end of a program run, a gather or reduction can be used to collect summary information about the
program run.
However, a more common scenario is that the result of a collective is needed on all processes.
Consider the computation of the standard deviation:

σ = √( 1/(N−1) ∑_{i=1}^N (x_i − μ)² )    where    μ = ( ∑_{i=1}^N x_i ) / N

and assume that every process stores just one 𝑥𝑖 value.
1. The calculation of the average 𝜇 is a reduction, since all the distributed values need to be added.
2. Now every process needs to compute 𝑥𝑖 − 𝜇 for its value 𝑥𝑖 , so 𝜇 is needed everywhere. You can
compute this by doing a reduction followed by a broadcast, but it is better to use a so-called
allreduce operation, which does the reduction and leaves the result on all processors.


3. The calculation of ∑_i (x_i − μ)² is another sum of distributed data, so we need another reduction
operation. Depending on whether each process needs to know σ, we can again use an allreduce. A
sketch of this computation is given below.
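This sketch assumes that every process stores exactly one value x_i (so N equals nprocs); the variable
names are mine, and sqrt requires math.h:
double xi = 1.;    /* stand-in for this process's local value */
double mu, sumsq, sigma;
MPI_Allreduce( &xi, &mu, 1, MPI_DOUBLE, MPI_SUM, comm );
mu = mu/nprocs;                      // the mean, now available on every process
double diff2 = (xi-mu)*(xi-mu);
MPI_Allreduce( &diff2, &sumsq, 1, MPI_DOUBLE, MPI_SUM, comm );
sigma = sqrt( sumsq/(nprocs-1) );    // the standard deviation, also everywhere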

3.1.2 Synchronization
Collectives are operations that involve all processes in a communicator. A collective is a single call, and
it blocks on all processors, meaning that a process calling a collective cannot proceed until the other
processes have similarly called the collective.
That does not mean that all processors exit the call at the same time: because of implementational details
and network latency they need not be synchronized in their execution. However, semantically we can say
that a process can not finish a collective until every other process has at least started the collective.
In addition to these collective operations, there are operations that are said to be ‘collective on their
communicator’, but which do not involve data movement. Collective then means that all processors must
call this routine; not to do so is an error that will manifest itself in ‘hanging’ code. One such example is
MPI_File_open.

3.1.3 Collectives in MPI


We will now explain the MPI collectives in the following order.
Allreduce We use the allreduce as an introduction to the concepts behind collectives; section 3.2. As
explained above, this routine serves many practical scenarios.
Broadcast and reduce We then introduce the concept of a root in the reduce (section 3.3.1) and broadcast
(section 3.3.3) collectives.
• Sometimes you want a reduction with partial results, where each processor computes the sum
(or other operation) on the values of lower-numbered processors. For this, you use a scan col-
lective (section 3.4).
Gather and scatter The gather/scatter collectives deal with more than a single data item; section 3.5.
There are more collectives or variants on the above.
• If every processor needs to broadcast to every other, you use an all-to-all operation (section 3.6).
• The reduce-scatter is a lesser known combination of collectives; section 3.7.
• A barrier is an operation that makes all processes wait until every process has reached the
barrier (section 3.8).
• If you want to gather or scatter information, but the contribution of each processor is of a
different size, there are ‘variable’ collectives; they have a v in the name (section 3.9).
Finally, there are some advanced topics in collectives.
• User-defined reduction operators; section 3.10.2.
• Nonblocking collectives; section 3.11.
• We briefly discuss performance aspects of collectives in section 3.12.
• We discuss synchronization aspects in section 3.13.


Figure 3.1 MPI_Allreduce


Name Param name Explanation C type F type inout
MPI_Allreduce (
MPI_Allreduce_c (
sendbuf starting address of send const void* TYPE(*), IN
buffer DIMENSION(..)
recvbuf starting address of void* TYPE(*), OUT
receive buffer DIMENSION(..)
int
count number of elements in send [ INTEGER IN
MPI_Count
buffer
datatype datatype of elements of MPI_Datatype TYPE IN
send buffer (MPI_Datatype)
op operation MPI_Op TYPE(MPI_Op) IN
comm communicator MPI_Comm TYPE IN
(MPI_Comm)
)
MPL:

template<typename T , typename F >


void mpl::communicator::allreduce
( F,const T &, T & ) const;
( F,const T *, T *,
const contiguous_layout< T > & ) const;
( F,T & ) const;
( F,T *, const contiguous_layout< T > & ) const;
F : reduction function
T : type
Python:

MPI.Comm.Allreduce(self, sendbuf, recvbuf, Op op=SUM)

3.2 Reduction
3.2.1 Reduce to all

Above we saw a couple of scenarios where a quantity is reduced, with all proceses getting the result. The
MPI call for this is MPI_Allreduce (figure 3.1).

Example: we give each process a random number, and sum these numbers together. The result should be
approximately 1/2 times the number of processes.
// allreduce.c
float myrandom,sumrandom;
myrandom = (float) rand()/(float)RAND_MAX;
// add the random variables together
MPI_Allreduce(&myrandom,&sumrandom,
1,MPI_FLOAT,MPI_SUM,comm);
// the result should be approx nprocs/2:
if (procno==nprocs-1)
printf("Result %6.9f compared to .5\n",sumrandom/nprocs);


Or:
MPI_Count buffersize = 1000;
double *indata,*outdata;
indata = (double*) malloc( buffersize*sizeof(double) );
outdata = (double*) malloc( buffersize*sizeof(double) );
MPI_Allreduce_c(indata,outdata,buffersize,
MPI_DOUBLE,MPI_SUM,MPI_COMM_WORLD);

3.2.1.1 Buffer description


This is the first example in this course that involves MPI data buffers: the MPI_Allreduce call contains two
buffer arguments. In most MPI calls (with the one-sided ones as big exception) a buffer is described by
three parameters:
1. a pointer to the data,
2. the number of items in the buffer, and
3. the datatype of the items in the buffer.
Each of these needs some elucidation.
1. The buffer specification depends on the programming language. Details are in section 3.2.4.
2. The count was a 4-byte integer in the MPI standard up to and including MPI-3. In the MPI-4 standard
the MPI_Count data type became allowed. See section 6.4 for details.
3. Datatypes can be predefined, as in the above example, or user-defined. See chapter 6 for details.

Remark 5 Routines with both a send and receive buffer should not alias these. Instead, see the discussion of
MPI_IN_PLACE; section 3.3.2.

3.2.1.2 Examples and exercises


Exercise 3.2. Let each process compute a random number, and compute the sum of these
numbers using the MPI_Allreduce routine.

ξ = ∑_i x_i

Each process then scales its value by this sum.

𝑥𝑖′ ← 𝑥𝑖 /𝜉

Compute the sum of the scaled numbers

ξ′ = ∑_i x_i′

and check that it is 1.


(There is a skeleton for this exercise under the name randommax.)


Exercise 3.3. Implement a (very simple-minded) Fourier transform: if 𝑓 is a function on the


interval [0, 1], then the 𝑛-th Fourier coefficient is
f̂_n = ∫_0^1 f(t) e^{−2πx} dx

which we approximate by

f̂_n = ∑_{i=0}^{N−1} f(ih) e^{−inπ/N}

• Make one distributed array for the 𝑒 −𝑖𝑛ℎ coefficients,


• make one distributed array for the 𝑓 (𝑖ℎ) values
• calculate a couple of coefficients
Exercise 3.4. In the previous exercise you worked with a distributed array, computing a
local quantity and combining that into a global quantity. Why is it not a good idea to
gather the whole distributed array on a single processor, and do all the computation
locally?
MPL note 9: Reduction operator. The usual reduction operators are given as templated operators:
float
xrank = static_cast<float>( comm_world.rank() ),
xreduce;
// separate recv buffer
comm_world.allreduce(mpl::plus<float>(), xrank,xreduce);
// in place
comm_world.allreduce(mpl::plus<float>(), xrank);

Note the parentheses after the operator. Also note that the operator comes first, not last.
Available: max, min, plus, multiplies, logical_and, logical_or, logical_xor, bit_and, bit_or,
bit_xor.

MPL implementation note: The reduction operator has to be compatible with T(T,T).
For more about operators, see section 3.10.

3.2.2 Inner product as allreduce


One of the more common applications of the reduction operation is the inner product computation. Typ-
ically, you have two vectors 𝑥, 𝑦 that have the same distribution, that is, where all processes store equal
parts of 𝑥 and 𝑦. The computation is then
local_inprod = 0;
for (i=0; i<localsize; i++)
local_inprod += x[i]*y[i];
MPI_Allreduce( &local_inprod, &global_inprod, 1,MPI_DOUBLE ... )


Exercise 3.5. The Gram-Schmidt method is a simple way to orthogonalize two vectors:

𝑢 ← 𝑢 − (𝑢 𝑡 𝑣)/(𝑢 𝑡 𝑢)

Implement this, and check that the result is indeed orthogonal.


Suggestion: fill 𝑣 with the values sin 2𝑛ℎ𝜋 where 𝑛 = 2𝜋/𝑁 , and 𝑢 with
sin 2𝑛ℎ𝜋 + sin 4𝑛ℎ𝜋. What does 𝑢 become after orthogonalization?

3.2.3 Reduction operations


Several MPI_Op values are pre-defined. For the list, see section 3.10.1.
For use in reductions and scans it is possible to define your own operator.
MPI_Op_create( MPI_User_function *func, int commute, MPI_Op *op);

For more details, see section 3.10.2.
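As a hedged illustration of the mechanism, here is a sketch of an operator that sums absolute values;
this operator, its name, and the variables myvalue, abs_total, comm are my own choices, not an example
from the standard:
// reduction function: combine 'invec' into 'inoutvec', element by element
void abs_add( void *invec, void *inoutvec, int *len, MPI_Datatype *type ) {
  double *in = (double*)invec, *inout = (double*)inoutvec;
  for (int i=0; i<*len; i++)
    inout[i] = fabs(in[i]) + fabs(inout[i]);   // needs math.h
}
/* ... */
MPI_Op abs_sum_op;
MPI_Op_create( abs_add, /* commute= */ 1, &abs_sum_op );
MPI_Allreduce( &myvalue, &abs_total, 1, MPI_DOUBLE, abs_sum_op, comm );
MPI_Op_free( &abs_sum_op );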

3.2.4 Data buffers


Collectives are the first example you see of MPI routines that involve transfer of user data. Here, and in
every other case, you see that the data description involves:
• A buffer. This can be a scalar or an array.
• A datatype. This describes whether the buffer contains integers, single/double floating point
numbers, or more complicated types, to be discussed later.
• A count. This states how many of the elements of the given datatype are in the send buffer, or
will be received into the receive buffer.
These three together describe what MPI needs to send through the network.
In the various languages such a buffer/count/datatype triplet is specified in different ways.
First of all, in C the buffer is always an opaque handle, that is, a void* parameter to which you supply an
address. This means that an MPI call can take two forms.
For scalars we need to use the ampersand operator to take the address:
float x,y;
MPI_Allreduce( &x,&y,1,MPI_FLOAT, ... );

But for arrays we use the fact that arrays and addresses are more-or-less equivalent in:
float xx[2],yy[2];
MPI_Allreduce( xx,yy,2,MPI_FLOAT, ... );

You could cast the buffers and write:


MPI_Allreduce( (void*)&x,(void*)&y,1,MPI_FLOAT, ... );
MPI_Allreduce( (void*)xx,(void*)yy,2,MPI_FLOAT, ... );


but that is not necessary. The compiler will not complain if you leave out the cast.
C++ note 1: Buffer treatment. Treatment of scalars in C++ is the same as in C. However, for arrays you
have the choice between C-style arrays, and std::vector or std::array. For the latter there are
two ways of dealing with buffers:
vector<float> xx(25);
MPI_Send( xx.data(),25,MPI_FLOAT, .... );
MPI_Send( &xx[0],25,MPI_FLOAT, .... );

Fortran note 5: MPI send/recv buffers. In Fortran parameters are always passed by reference, so the buffer
is treated the same way:
Real*4 :: x
Real*4,dimension(2) :: xx
call MPI_Allreduce( x,1,MPI_REAL4, ... )
call MPI_Allreduce( xx,2,MPI_REAL4, ... )

In discussing OO languages, we first note that the official C++ Application Programmer Interface (API)
has been removed from the standard.
Specification of the buffer/count/datatype triplet is not needed explicitly in OO languages.
Python note 7: Buffers from numpy. Most MPI routines in Python have both a variant that can send arbi-
trary Python data, and one that is based on numpy arrays. The former looks the most ‘pythonic’,
and is very flexible, but is usually demonstrably inefficient.
## allreduce.py
random_number = random.randint(1,random_bound)
# native mode send
max_random = comm.allreduce(random_number,op=MPI.MAX)

In the numpy variant, all buffers are numpy objects, which carry information about their type
and size. For scalar reductions this means we have to create an array for the receive buffer, even
though only one element is used.
myrandom = np.empty(1,dtype=int)
myrandom[0] = random_number
allrandom = np.empty(nprocs,dtype=int)
# numpy mode send
comm.Allreduce(myrandom,allrandom[:1],op=MPI.MAX)

Python note 8: Buffers from subarrays. In many examples you will pass a whole Numpy array as send/re-
ceive buffer. Should you want to use a buffer that corresponds to a subset of an array, you can use
the following notation:
MPI_Whatever( buffer[...,5] # more stuff

for passing the buffer that starts at location 5 of the array.


For even more complicated effects, use numpy.frombuffer:


Code:
## bcastcolumn.py
datatype = np.intc
elementsize = datatype().itemsize
typechar = datatype().dtype.char
buffer = np.zeros( [nprocs,nprocs], dtype=datatype)
buffer[:,:] = -1
for proc in range(nprocs):
    if procid==proc:
        buffer[proc,:] = proc
    comm.Bcast\
        ( [ np.frombuffer\
            ( buffer.data,
              dtype=datatype,
              offset=(proc*nprocs+proc)*elementsize ),
            nprocs-proc, typechar ],
          root=proc )

Output:
int size: 4
[[ 0  0  0  0  0  0]
 [-1  1  1  1  1  1]
 [-1 -1  2  2  2  2]
 [-1 -1 -1  3  3  3]
 [-1 -1 -1 -1  4  4]
 [ 5  5  5  5  5  5]]

MPL note 10: Scalar buffers. Buffer type handling is done through polymorphism and templating: no
explicit indication of types.
Scalars are handled as such:
float x,y;
comm.bcast( 0,x ); // note: root first
comm.allreduce( mpl::plus<float>(), x,y ); // op first

where the reduction function needs to be compatible with the type of the buffer.
MPL note 11: Vector buffers. If your buffer is a std::vector you need to take the .data() component of it:
vector<float> xx(2),yy(2);
comm.allreduce( mpl::plus<float>(),
xx.data(), yy.data(), mpl::contiguous_layout<float>(2) );

The contiguous_layout is a ‘derived type’; this will be discussed in more detail elsewhere (see
note 43 and later). For now, interpret it as a way of indicating the count/type part of a buffer
specification.
MPL note 12: Iterator buffers. MPL point-to-point routines have a way of specifying the buffer(s) through
a begin and end iterator.
// sendrange.cxx
vector<double> v(15);
comm_world.send(v.begin(), v.end(), 1); // send to rank 1
comm_world.recv(v.begin(), v.end(), 0); // receive from rank 0

Not available for collectives.


MPL note 13: Send vs recv buffer. There is a separate variant for non-root usage of rooted collectives:
// scangather.cxx
if (procno==0) {


comm_world.reduce
( mpl::plus<int>(),0,
my_number_of_elements,total_number_of_elements );
} else {
comm_world.reduce
( mpl::plus<int>(),0,my_number_of_elements );
}

3.3 Rooted collectives: broadcast, reduce


In some scenarios there is a certain process that has a privileged status.
• One process can generate or read in the initial data for a program run. This then needs to be
communicated to all other processes.
• At the end of a program run, often one process needs to output some summary information.
This process is called the root process, and we will now consider routines that have a root.

3.3.1 Reduce to a root


In the broadcast operation a single data item was communicated to all processes. A reduction operation
with MPI_Reduce (figure 3.2) goes the other way: each process has a data item, and these are all brought
together into a single item.
Here are the essential elements of a reduction operation:
MPI_Reduce( senddata, recvdata..., operator,
root, comm );

• There is the original data, and the data resulting from the reduction. It is a design decision of
MPI that it will not by default overwrite the original data. The send data and receive data are of
the same size and type: if every processor has one real number, the reduced result is again one
real number.
• It is possible to indicate explicitly that a single buffer is used, and thereby the original data
overwritten; see section 3.3.2 for this ‘in place’ mode.
• There is a reduction operator. Popular choices are MPI_SUM, MPI_PROD and MPI_MAX, but compli-
cated operators such as finding the location of the maximum value exist. (For the full list, see
section 3.10.1.) You can also define your own operators; section 3.10.2.
• There is a root process that receives the result of the reduction. Since the nonroot processes do
not receive the reduced data, they can actually leave the receive buffer undefined.
// reduce.c
float myrandom = (float) rand()/(float)RAND_MAX,
result;
int target_proc = nprocs-1;
// add all the random variables together
MPI_Reduce(&myrandom,&result,1,MPI_FLOAT,MPI_SUM,
target_proc,comm);
// the result should be approx nprocs/2:


Figure 3.2 MPI_Reduce


Name Param name Explanation C type F type inout
MPI_Reduce (
MPI_Reduce_c (
sendbuf address of send buffer const void* TYPE(*), IN
DIMENSION(..)
recvbuf address of receive buffer void* TYPE(*), OUT
DIMENSION(..)
int
count number of elements in send [ INTEGER IN
MPI_Count
buffer
datatype datatype of elements of MPI_Datatype TYPE IN
send buffer (MPI_Datatype)
op reduce operation MPI_Op TYPE(MPI_Op) IN
root rank of root process int INTEGER IN
comm communicator MPI_Comm TYPE IN
(MPI_Comm)
)
MPL:

void mpl::communicator::reduce
// root, in place
( F f,int root_rank,T & sendrecvdata ) const
( F f,int root_rank,T * sendrecvdata,const contiguous_layout< T > & l ) const
// non-root
( F f,int root_rank,const T & senddata ) const
( F f,int root_rank,
const T * senddata,const contiguous_layout< T > & l ) const
// general
( F f,int root_rank,const T & senddata,T & recvdata ) const
( F f,int root_rank,
const T * senddata,T * recvdata,const contiguous_layout< T > & l ) const

Python:

comm.Reduce(self, sendbuf, recvbuf, Op op=SUM, int root=0)


native:
comm.reduce(self, sendobj=None, recvobj=None, op=SUM, int root=0)

if (procno==target_proc)
printf("Result %6.3f compared to nprocs/2=%5.2f\n",
result,nprocs/2.);

Exercise 3.6. Write a program where each process computes a random number, and
process 0 finds and prints the maximum generated value. Let each process print its
value, just to check the correctness of your program.
Collective operations can also take an array argument, instead of just a scalar. In that case, the operation
is applied pointwise to each location in the array.
Exercise 3.7. Create on each process an array of length 2 integers, and put the values 1, 2 in
it on each process. Do a sum reduction on that array. Can you predict what the


result should be? Code it. Was your prediction right?

3.3.2 Reduce in place


By default MPI will not overwrite the original data with the reduction result, but you can tell it to do so
using the MPI_IN_PLACE specifier:
// allreduceinplace.c
for (int irand=0; irand<nrandoms; irand++)
myrandoms[irand] = (float) rand()/(float)RAND_MAX;
// add all the random variables together
MPI_Allreduce(MPI_IN_PLACE,myrandoms,
nrandoms,MPI_FLOAT,MPI_SUM,comm);

Now every process only has a receive buffer, so this has the advantage of saving half the memory. Each
process puts its input values in the receive buffer, and these are overwritten by the reduced result.
The above example used MPI_IN_PLACE in MPI_Allreduce; in MPI_Reduce it’s a little tricky. The reasoning is as
follows:
• In MPI_Reduce every process has a buffer to contribute, but only the root needs a receive buffer.
Therefore, MPI_IN_PLACE takes the place of the receive buffer on any processor except for the
root …
• … while the root, which needs a receive buffer, has MPI_IN_PLACE takes the place of the send
buffer. In order to contribute its value, the root needs to put this in the receive buffer.
Here is one way you could write the in-place version of MPI_Reduce:
if (procno==root)
MPI_Reduce(MPI_IN_PLACE,myrandoms,
nrandoms,MPI_FLOAT,MPI_SUM,root,comm);
else
MPI_Reduce(myrandoms,MPI_IN_PLACE,
nrandoms,MPI_FLOAT,MPI_SUM,root,comm);

However, as a point of style, having different versions of a collective in different branches of a condition
is infelicitous. The following may be preferable:
float *sendbuf,*recvbuf;
if (procno==root) {
sendbuf = MPI_IN_PLACE; recvbuf = myrandoms;
} else {
sendbuf = myrandoms; recvbuf = MPI_IN_PLACE;
}
MPI_Reduce(sendbuf,recvbuf,
nrandoms,MPI_FLOAT,MPI_SUM,root,comm);

In Fortran you can not do these address calculations. You can use the solution with a conditional:
!! reduceinplace.F90
call random_number(mynumber)
target_proc = ntids-1;
! add all the random variables together


if (mytid.eq.target_proc) then
result = mytid
call MPI_Reduce(MPI_IN_PLACE,result,1,MPI_REAL,MPI_SUM,&
target_proc,comm)
else
mynumber = mytid
call MPI_Reduce(mynumber,result,1,MPI_REAL,MPI_SUM,&
target_proc,comm)
end if

but you can also solve this with pointers:


!! reduceinplaceptr.F90
in_place_val = MPI_IN_PLACE
if (mytid.eq.target_proc) then
! set pointers
result_ptr => result
mynumber_ptr => in_place_val
! target sets value in receive buffer
result_ptr = mytid
else
! set pointers
mynumber_ptr => mynumber
result_ptr => in_place_val
! non-targets set value in send buffer
mynumber_ptr = mytid
end if
call MPI_Reduce(mynumber_ptr,result_ptr,1,MPI_REAL,MPI_SUM,&
target_proc,comm,err)

Python note 9: In-place collectives. The value MPI.IN_PLACE can be used for the send buffer:
## allreduceinplace.py
myrandom = np.empty(1,dtype=int)
myrandom[0] = random_number

comm.Allreduce(MPI.IN_PLACE,myrandom,op=MPI.MAX)

MPL note 14: Reduce in place. The in-place variant is activated by specifying only one instead of two buffer
arguments.
float
xrank = static_cast<float>( comm_world.rank() ),
xreduce;
// separate recv buffer
comm_world.allreduce(mpl::plus<float>(), xrank,xreduce);
// in place
comm_world.allreduce(mpl::plus<float>(), xrank);

Reducing a buffer requires specification of a contiguous_layout:


// collectbuffer.cxx
float
xrank = static_cast<float>( comm_world.rank() );


Figure 3.3 MPI_Bcast


Name Param name Explanation C type F type inout
MPI_Bcast (
MPI_Bcast_c (
buffer starting address of buffer void* TYPE(*), INOUT
DIMENSION(..)
int
count number of entries in [ INTEGER IN
MPI_Count
buffer
datatype datatype of buffer MPI_Datatype TYPE IN
(MPI_Datatype)
root rank of broadcast root int INTEGER IN
comm communicator MPI_Comm TYPE IN
(MPI_Comm)
)
MPL:

template<typename T >
void mpl::communicator::bcast
( int root, T & data ) const
( int root, T * data, const layout< T > & l ) const
Python:

MPI.Comm.Bcast(self, buf, int root=0)

vector<float> rank2p2p1{ 2*xrank,2*xrank+1 },reduce2p2p1{0,0};


mpl::contiguous_layout<float> two_floats(rank2p2p1.size());
comm_world.allreduce
(mpl::plus<float>(), rank2p2p1.data(),reduce2p2p1.data(),two_floats);
if ( iprint )
cout << "Got: " << reduce2p2p1.at(0) << ","
<< reduce2p2p1.at(1) << endl;

Note that the buffers are of type T *, so it is necessary to take the data() of any std::vector and
such.

3.3.3 Broadcast
A broadcast models the scenario where one process, the ‘root’ process, owns some data, and it communi-
cates it to all other processes.
The broadcast routine MPI_Bcast (figure 3.3) has the following structure:
MPI_Bcast( data..., root , comm);

Here:
• There is only one buffer, the send buffer. Before the call, the root process has data in this buffer;
the other processes allocate a same sized buffer, but for them the contents are irrelevant.
• The root is the process that is sending its data. Typically, it will be the root of a broadcast tree.


Example: in general we can not assume that all processes get the commandline arguments, so we broadcast
them from process 0.
// init.c
if (procno==0) {
if ( argc==1 || // the program is called without parameter
( argc>1 && !strcmp(argv[1],"-h") ) // user asked for help
) {
printf("\nUsage: init [0-9]+\n");
MPI_Abort(comm,1);
}
input_argument = atoi(argv[1]);
}
MPI_Bcast(&input_argument,1,MPI_INT,0,comm);

Python note 10: Sending objects. In python it is both possible to send objects, and to send more C-like
buffers. The two possibilities correspond (see section 1.5.4) to different routine names; the
buffers have to be created as numpy objects.
We illustrate both the general Python and numpy variants. In the former variant the result is
given as a function return; in the numpy variant the send buffer is reused.
## bcast.py
# first native
if procid==root:
buffer = [ 5.0 ] * dsize
else:
buffer = [ 0.0 ] * dsize
buffer = comm.bcast(obj=buffer,root=root)
if not reduce( lambda x,y:x and y,
[ buffer[i]==5.0 for i in range(len(buffer)) ] ):
print( "Something wrong on proc %d: native buffer <<%s>>" \
% (procid,str(buffer)) )

# then with NumPy


buffer = np.arange(dsize, dtype=np.float64)
if procid==root:
for i in range(dsize):
buffer[i] = 5.0
comm.Bcast( buffer,root=root )
if not all( buffer==5.0 ):
print( "Something wrong on proc %d: numpy buffer <<%s>>" \
% (procid,str(buffer)) )
else:
if procid==root:
print("Success.")

MPL note 15: Broadcast. The broadcast call comes in two variants, with scalar argument and general lay-
out:
template<typename T >
void mpl::communicator::bcast
( int root_rank, T &data ) const;


void mpl::communicator::bcast
( int root_rank, T *data, const layout< T > &l ) const;

Note that the root argument comes first.


For the following exercise, study figure 3.2.
Exercise 3.8. The Gauss-Jordan algorithm for solving a linear system with a matrix 𝐴 (or
computing its inverse) runs as follows:
for pivot k = 1, … , n
    let the vector of scalings ℓ_i^(k) = A_ik / A_kk
    for row r ≠ k
        for column c = 1, … , n
            A_rc ← A_rc − ℓ_r^(k) A_kc

where we ignore the update of the righthand side, or the formation of the inverse.
Let a matrix be distributed with each process storing one column. Implement the
Gauss-Jordan algorithm as a series of broadcasts: in iteration k process k computes
and broadcasts the scaling vector {ℓ_i^(k)}_i. Replicate the right-hand side on all
processors.
(There is a skeleton for this exercise under the name jordan.)
Exercise 3.9. Add partial pivoting to your implementation of Gauss-Jordan elimination.
Change your implementation to let each processor store multiple columns, but still
do one broadcast per column. Is there a way to have only one broadcast per
processor?

3.4 Scan operations


The MPI_Scan operation also performs a reduction, but it keeps the partial results. That is, if processor 𝑖
contains a number 𝑥𝑖 , and ⊕ is an operator, then the scan operation leaves 𝑥0 ⊕ ⋯ ⊕ 𝑥𝑖 on processor 𝑖. This
type of operation is often called a prefix operation; see HPC book, section-27.
The MPI_Scan (figure 3.4) routine is an inclusive scan operation, meaning that it includes the data on the
process itself; MPI_Exscan (see section 3.4.1) is exclusive, and does not include the data on the calling
process.

process:    0           1            2                  ⋯    p−1
data:       x_0         x_1          x_2                ⋯    x_{p−1}
inclusive:  x_0         x_0 ⊕ x_1    x_0 ⊕ x_1 ⊕ x_2    ⋯    ⊕_{i=0}^{p−1} x_i
exclusive:  unchanged   x_0          x_0 ⊕ x_1          ⋯    ⊕_{i=0}^{p−2} x_i

// scan.c
// add all the random variables together
MPI_Scan(&myrandom,&result,1,MPI_FLOAT,MPI_SUM,comm);


[Figure 3.2, a step-by-step tableau of Gauss-Jordan elimination on a small example, is not reproduced here.]

Figure 3.2: Gauss-Jordan elimination example


Figure 3.4 MPI_Scan


Name Param name Explanation C type F type inout
MPI_Scan (
MPI_Scan_c (
sendbuf starting address of send const void* TYPE(*), IN
buffer DIMENSION(..)
recvbuf starting address of void* TYPE(*), OUT
receive buffer DIMENSION(..)
int
count number of elements in [ INTEGER IN
MPI_Count
input buffer
datatype datatype of elements of MPI_Datatype TYPE IN
input buffer (MPI_Datatype)
op operation MPI_Op TYPE(MPI_Op) IN
comm communicator MPI_Comm TYPE IN
(MPI_Comm)
)
MPL:

template<typename T , typename F >


void mpl::communicator::scan
( F,const T &, T & ) const;
( F,const T *, T *,
const contiguous_layout< T > & ) const;
( F,T & ) const;
( F,T *, const contiguous_layout< T > & ) const;
F : reduction function
T : type
Python:

res = Intracomm.scan( sendobj=None,recvobj=None,op=MPI.SUM)

// the result should be approaching nprocs/2:


if (procno==nprocs-1)
printf("Result %6.3f compared to nprocs/2=%5.2f\n",
result,nprocs/2.);

In python mode the result is a function return value; with numpy the result is passed as the second
parameter.
## scan.py
mycontrib = 10+random.randint(1,nprocs)
myfirst = 0
mypartial = comm.scan(mycontrib)
sbuf = np.empty(1,dtype=int)
rbuf = np.empty(1,dtype=int)
sbuf[0] = mycontrib
comm.Scan(sbuf,rbuf)

You can use any of the given reduction operators (for the list, see section 3.10.1), or a user-defined one.
In the latter case, the MPI_Op operations do not return an error code.


Figure 3.5 MPI_Exscan


Name Param name Explanation C type F type inout
MPI_Exscan (
MPI_Exscan_c (
sendbuf starting address of send const void* TYPE(*), IN
buffer DIMENSION(..)
recvbuf starting address of void* TYPE(*), OUT
receive buffer DIMENSION(..)
int
count number of elements in [ INTEGER IN
MPI_Count
input buffer
datatype datatype of elements of MPI_Datatype TYPE IN
input buffer (MPI_Datatype)
op operation MPI_Op TYPE(MPI_Op) IN
comm intra-communicator MPI_Comm TYPE IN
(MPI_Comm)
)
MPL:

template<typename T , typename F >


void mpl::communicator::exscan
( F,const T &, T & ) const;
( F,const T *, T *,
const contiguous_layout< T > & ) const;
( F,T & ) const;
( F,T *, const contiguous_layout< T > & ) const;
F : reduction function
T : type
Python:

res = Intracomm.exscan( sendobj=None,recvobj=None,op=MPI.SUM)

MPL note 16: Scan operations. As in the C/F interfaces, MPL interfaces to the scan routines have the same
calling sequences as the ‘Allreduce’ routine.

3.4.1 Exclusive scan


Often, the more useful variant is the exclusive scan MPI_Exscan (figure 3.5) with the same signature.
The result of the exclusive scan is undefined on processor 0 (None in python), and on processor 1 it is a
copy of the send value of processor 0. In particular, the MPI_Op need not be called on these two processors.
Exercise 3.10. The exclusive definition, which computes x_0 ⊕ ⋯ ⊕ x_{i−1} on processor i, can be
derived from the inclusive operation for operations such as MPI_SUM or MPI_PROD. Are
there operators where that is not the case?

3.4.2 Use of scan operations


The MPI_Scan operation is often useful with indexing data. Suppose that every processor 𝑝 has a local vector
where the number of elements 𝑛𝑝 is dynamically determined. In order to translate the local numbering


0 … 𝑛𝑝 − 1 to a global numbering one does a scan with the number of local elements as input. The output
is then the global number of the first local variable.
As an example, setting Fast Fourier Transform (FFT) coefficients requires this translation. If the local sizes
are all equal, determining the global index of the first element is an easy calculation. For the irregular
case, we first do a scan:
// fft.c
MPI_Allreduce( &localsize,&globalsize,1,MPI_INT,MPI_SUM, comm );
globalsize += 1;
int myfirst=0;
MPI_Exscan( &localsize,&myfirst,1,MPI_INT,MPI_SUM, comm );
for (int i=0; i<localsize; i++)
  vector[i] = sin( pi*freq* (i+1+myfirst) / globalsize );

Figure 3.3: Local arrays that together form a consecutive range


Exercise 3.11.
• Let each process compute a random value 𝑛local , and allocate an array of that
length. Define
𝑁 = ∑ 𝑛local
• Fill the array with consecutive integers, so that all local arrays, laid end-to-end,
contain the numbers 0 ⋯ 𝑁 − 1. (See figure 3.3.)
(There is a skeleton for this exercise under the name scangather.)
Exercise 3.12. Did you use MPI_Scan or MPI_Exscan for the previous exercise? How would
you describe the result of the other scan operation, given the same input?
It is possible to do a segmented scan. Let 𝑥𝑖 be a series of numbers that we want to sum to 𝑋𝑖 as follows.
Let 𝑦𝑖 be a series of booleans such that

X_i = \begin{cases} x_i & \text{if } y_i = 0 \\ X_{i-1} + x_i & \text{if } y_i = 1 \end{cases}

(This is the basis for the implementation of the sparse matrix vector product as prefix operation; see HPC
book, section-27.2.) This means that 𝑋𝑖 sums the segments between locations where 𝑦𝑖 = 0 and the first
subsequent place where 𝑦𝑖 = 1. To implement this, you need a user-defined operator

\begin{pmatrix} X \\ x \\ y \end{pmatrix}
  = \begin{pmatrix} X_1 \\ x_1 \\ y_1 \end{pmatrix}
    \bigoplus
    \begin{pmatrix} X_2 \\ x_2 \\ y_2 \end{pmatrix}
  \colon\quad
  \begin{cases} X = x_1 + x_2 & \text{if } y_2 = 1 \\ X = x_2 & \text{if } y_2 = 0 \end{cases}

This operator is not commutative, and it needs to be declared as such with MPI_Op_create; see section 3.10.2.
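A sketch of how such an operator could be set up is given below. The struct layout, type name, and function name are assumptions, and the combination logic follows the definition just given; a derived datatype describing the struct is also needed before the operator can be used in an actual scan.

// sketch of a user-defined segmented-sum operator
typedef struct { double X; double x; int y; } triple;

void segmented_sum( void *in_,void *inout_,int *len,MPI_Datatype *type ) {
  triple *t1 = (triple*)in_, *t2 = (triple*)inout_; // t2 also receives the result
  for (int i=0; i<*len; i++) {
    if (t2[i].y==1)
      t2[i].X = t1[i].x + t2[i].x;  // the segment continues across the boundary
    else
      t2[i].X = t2[i].x;            // the segment restarts here
  }
}

MPI_Op segsum_op;
// second argument 0: declare the operator as noncommutative
MPI_Op_create( segmented_sum,0,&segsum_op );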


3.5 Rooted collectives: gather and scatter

Figure 3.4: Gather collects all data onto a root

In the MPI_Scatter operation, the root spreads information to all other processes. The difference with a
broadcast is that it involves individual information from/to every process. Thus, the gather operation
typically has an array of items, one coming from each sending process, and scatter has an array, with an
individual item for each receiving process; see figure 3.5.

Figure 3.5: A scatter operation


These gather and scatter collectives have a different parameter list from the broadcast/reduce. The
broadcast/reduce involves the same amount of data on each process, so it was enough to have a single
datatype/size specification; for one buffer in the broadcast, and for both buffers in the reduce call. In the
gather/scatter calls you have
• a large buffer on the root, with a datatype and size specification, and
• a smaller buffer on each process, with its own type and size specification.
In the gather and scatter calls, each processor has 𝑛 elements of individual data. There is also a root
processor that has an array of length 𝑛𝑝, where 𝑝 is the number of processors. The gather call collects all
this data from the processors to the root; the scatter call assumes that the information is initially on the
root and it is spread to the individual processors.
Here is a small example:


// gather.c
// we assume that each process has a value "localsize"
// the root process collects these values
if (procno==root)
  localsizes = (int*) malloc( nprocs*sizeof(int) );
// everyone contributes their info
MPI_Gather(&localsize,1,MPI_INT,
           localsizes,1,MPI_INT,root,comm);

This will also be the basis of a more elaborate example in section 3.9.
Exercise 3.13. Let each process compute a random number. You want to print the
maximum value and on what processor it is computed. What collective(s) do you
use? Write a short program.
The MPI_Scatter operation is in some sense the inverse of the gather: the root process has an array of
length 𝑛𝑝 where 𝑝 is the number of processors and 𝑛 the number of elements each processor will receive.
int MPI_Scatter
(void* sendbuf, int sendcount, MPI_Datatype sendtype,
void* recvbuf, int recvcount, MPI_Datatype recvtype,
int root, MPI_Comm comm)

A few things to note about these routines:
• The signature for MPI_Gather (figure 3.6) has two ‘count’ parameters, one for the length of the
individual send buffers, and one for the receive buffer. However, confusingly, the second parameter
(which is only relevant on the root) does not indicate the total amount of information
coming in, but rather the size of each contribution. Thus, the two count parameters will usually
be the same (at least on the root); they can differ if you use different MPI_Datatype values for the
sending and receiving processors.
• While every process has a sendbuffer in the gather, and a receive buffer in the scatter call, only
the root process needs the long array in which to gather, or from which to scatter. However,
because in SPMD mode all processes need to use the same routine, a parameter for this long
array is always present. Nonroot processes can use a null pointer here.
• More elegantly, the MPI_IN_PLACE option can be used for buffers that are not applicable, such as
the receive buffer on a sending process. See section 3.3.2.
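A sketch of the last point for the gather call: on the root the send buffer is replaced by MPI_IN_PLACE, and the root's own contribution is assumed to already sit in its slot of the gather buffer (the buffer names here are assumptions):

// sketch: in-place gather of one integer per process onto the root
int *gatherbuf = NULL;
if (procno==root) {
  gatherbuf = (int*) malloc( nprocs*sizeof(int) );
  gatherbuf[root] = localsize;                // root's contribution already in place
  MPI_Gather( MPI_IN_PLACE,1,MPI_INT,         // send count/type are ignored here
              gatherbuf,1,MPI_INT, root,comm );
} else {
  MPI_Gather( &localsize,1,MPI_INT,
              NULL,0,MPI_INT,                 // receive parameters ignored on nonroot
              root,comm );
}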
MPL note 17: Gather/scatter. Gathering (by communicator::gather) or scattering (by communicator::scatter)
a single scalar takes a scalar argument and a raw array:
vector<float> v;
float x;
comm_world.scatter(0, v.data(), x);

If more than a single scalar is gathered, or scattered into, it becomes necessary to specify a
layout:
vector<float> vrecv(2),vsend(2*nprocs);
mpl::contiguous_layout<float> twonums(2);


Figure 3.6 MPI_Gather


Name Param name Explanation C type F type inout
MPI_Gather (
MPI_Gather_c (
sendbuf starting address of send const void* TYPE(*), IN
buffer DIMENSION(..)
sendcount   number of elements in send buffer             int / MPI_Count     INTEGER             IN
sendtype datatype of send buffer MPI_Datatype TYPE IN
elements (MPI_Datatype)
recvbuf address of receive buffer void* TYPE(*), OUT
DIMENSION(..)
recvcount   number of elements for any single receive     int / MPI_Count     INTEGER             IN
recvtype datatype of recv buffer MPI_Datatype TYPE IN
elements (MPI_Datatype)
root rank of receiving process int INTEGER IN
comm communicator MPI_Comm TYPE IN
(MPI_Comm)
)
MPL:

void mpl::communicator::gather
( int root_rank, const T & senddata ) const
( int root_rank, const T & senddata, T * recvdata ) const
( int root_rank, const T * senddata, const layout< T > & sendl ) const
( int root_rank, const T * senddata, const layout< T > & sendl,
T * recvdata, const layout< T > & recvl ) const
Python:

MPI.Comm.Gather
(self, sendbuf, recvbuf, int root=0)

comm_world.scatter
(0, vsend.data(),twonums, vrecv.data(),twonums );

MPL note 18: Gather on nonroot. Logically speaking, on every nonroot process, the gather call only has a
send buffer. MPL supports this by having two variants that only specify the send data.
if (procno==0) {
vector<int> size_buffer(nprocs);
comm_world.gather
(
0,my_number_of_elements,size_buffer.data()
);
} else {
/*
* If you are not the root, do versions with only send buffers
*/
comm_world.gather


( 0,my_number_of_elements );

3.5.1 Examples
In some applications, each process computes a row or column of a matrix, but for some calculation (such
as the determinant) it is more efficient to have the whole matrix on one process. You should of course
only do this if this matrix is essentially smaller than the full problem, such as an interface system or the
last coarsening level in multigrid.

Figure 3.6: Gather a distributed matrix onto one process

Figure 3.6 pictures this. Note that conceptually we are gathering a two-dimensional object, but the buffer
is of course one-dimensional. You will later see how this can be done more elegantly with the ‘subarray’
datatype; section 6.3.4.
Another thing you can do with a distributed matrix is to transpose it.
// itransposeblock.c
for (int iproc=0; iproc<nprocs; iproc++) {
MPI_Scatter( regular,1,MPI_DOUBLE,
&(transpose[iproc]),1,MPI_DOUBLE,
iproc,comm);
}

In this example, each process scatters its column. This needs to be done only once, yet the scatter happens
in a loop. The trick here is that a process only originates the scatter when it is the root, which happens
only once. Why do we need a loop? That is because each element of a process’ row originates from a
different scatter operation.
Exercise 3.14. Can you rewrite this code so that it uses a gather rather than a scatter? Does
that change anything essential about structure of the code?
Exercise 3.15. Take the code from exercise 3.11 and extend it to gather all local buffers onto
rank zero. Since the local arrays are of differing lengths, this requires MPI_Gatherv.
How do you construct the lengths and displacements arrays?
(There is a skeleton for this exercise under the name scangather.)


Figure 3.7 MPI_Allgather


Name Param name Explanation C type F type inout
MPI_Allgather (
MPI_Allgather_c (
sendbuf starting address of send const void* TYPE(*), IN
buffer DIMENSION(..)
sendcount   number of elements in send buffer             int / MPI_Count     INTEGER             IN
sendtype datatype of send buffer MPI_Datatype TYPE IN
elements (MPI_Datatype)
recvbuf address of receive buffer void* TYPE(*), OUT
DIMENSION(..)
recvcount   number of elements received from any process  int / MPI_Count     INTEGER             IN
recvtype datatype of receive buffer MPI_Datatype TYPE IN
elements (MPI_Datatype)
comm communicator MPI_Comm TYPE IN
(MPI_Comm)
)

3.5.2 Allgather

Figure 3.7: All gather collects all data onto every process

The MPI_Allgather (figure 3.7) routine does the same gather onto every process: each process winds up
with the totality of all data; figure 3.7.
This routine can be used in the simplest implementation of the dense matrix-vector product to give each
processor the full input; see HPC book, section-6.2.3.
The MPI_IN_PLACE keyword can be used with an all-gather:
1. Use MPI_IN_PLACE for the send buffer;
2. send count and datatype are ignored by MPI;
3. each process needs to put its ‘send content’ in the correct location of the gather buffer.
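A sketch of these steps, gathering one integer from every process (the buffer name is an assumption):

// sketch: in-place allgather of one integer per process
int *allsizes = (int*) malloc( nprocs*sizeof(int) );
allsizes[procno] = localsize;          // own contribution goes in this process' slot
MPI_Allgather( MPI_IN_PLACE,1,MPI_INT, // send count/type are ignored with MPI_IN_PLACE
               allsizes,1,MPI_INT, comm );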
Some cases look like an all-gather but can be implemented more efficiently. Let’s consider as an example


Figure 3.8 MPI_Alltoall


Name Param name Explanation C type F type inout
MPI_Alltoall (
MPI_Alltoall_c (
sendbuf starting address of send const void* TYPE(*), IN
buffer DIMENSION(..)
sendcount   number of elements sent to each process       int / MPI_Count     INTEGER             IN
sendtype datatype of send buffer MPI_Datatype TYPE IN
elements (MPI_Datatype)
recvbuf address of receive buffer void* TYPE(*), OUT
DIMENSION(..)
recvcount   number of elements received from any process  int / MPI_Count     INTEGER             IN
recvtype datatype of receive buffer MPI_Datatype TYPE IN
elements (MPI_Datatype)
comm communicator MPI_Comm TYPE IN
(MPI_Comm)
)

the set difference of two distributed vectors. That is, you have two distributed vectors, and you want
to create a new vector, again distributed, that contains those elements of the one that do not appear in
the other. You could implement this by gathering the second vector on each processor, but this may be
prohibitive in memory usage.
Exercise 3.16. Can you think of another algorithm for taking the set difference of two
distributed vectors? Hint: look up the bucket brigade algorithm; section 4.1.5. What is the
time and space complexity of this algorithm? Can you think of other advantages
besides a reduction in workspace?

3.6 All-to-all
The all-to-all operation MPI_Alltoall (figure 3.8) can be seen as a collection of simultaneous broadcasts
or simultaneous gathers. The parameter specification is much like an allgather, with a separate send and
receive buffer, and no root specified. As with the gather call, the receive count corresponds to an individual
receive, not the total amount.
Unlike the gather call, the send buffer now obeys the same principle: with a send count of 1, the buffer
has a length of the number of processes.
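As an illustration, here is a sketch in which every process sends one distinct integer to every other process (the buffer names and the particular values are assumptions):

// sketch: each process sends one integer to every process, including itself
int *sendbuf = (int*) malloc( nprocs*sizeof(int) ),
    *recvbuf = (int*) malloc( nprocs*sizeof(int) );
for (int p=0; p<nprocs; p++)
  sendbuf[p] = procno*100 + p;     // the element destined for process p
MPI_Alltoall( sendbuf,1,MPI_INT, recvbuf,1,MPI_INT, comm );
// recvbuf[q] now holds the element that process q destined for this process: q*100+procno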

3.6.1 All-to-all as data transpose


The all-to-all operation can be considered as a data transpose. For instance, assume that each process
knows how much data to send to every other process. If you draw a connectivity matrix of size 𝑃 × 𝑃,


Figure 3.8: All-to-all transposes data

denoting who-sends-to-who, then the send information can be put in rows:

∀𝑖 ∶ 𝐶[𝑖, 𝑗] > 0 if process 𝑖 sends to process 𝑗.

Conversely, the columns then denote the receive information:

∀𝑗 ∶ 𝐶[𝑖, 𝑗] > 0 if process 𝑗 receives from process 𝑖.

The typical application for such data transposition is in the FFT algorithm, where it can take tens of
percents of the running time on large clusters.

We will consider another application of data transposition, namely radix sort, but we will do that in a
couple of steps. First of all:
Exercise 3.17. In the initial stage of a radix sort, each process considers how many
elements to send to every other process. Use MPI_Alltoall to derive from this how
many elements they will receive from every other process.

3.6.2 All-to-all-v

The major part of the radix sort algorithm consist of every process sending some of its elements to each
of the other processes. The routine MPI_Alltoallv (figure 3.9) is used for this pattern:
• Every process scatters its data to all others,
• but the amount of data is different per process.
Exercise 3.18. The actual data shuffle of a radix sort can be done with MPI_Alltoallv. Finish
the code of exercise 3.17.


Figure 3.9 MPI_Alltoallv


Name Param name Explanation C type F type inout
MPI_Alltoallv (
MPI_Alltoallv_c (
sendbuf starting address of send const void* TYPE(*), IN
buffer DIMENSION(..)
sendcounts  non-negative integer array (of length group size) specifying the number of elements to send to each rank    const int[] / MPI_Count[]    INTEGER(*)    IN
sdispls     integer array (of length group size); entry j specifies the displacement (relative to sendbuf) from which to take the outgoing data destined for process j    const int[] / MPI_Aint[]    INTEGER(*)    IN
sendtype datatype of send buffer MPI_Datatype TYPE IN
elements (MPI_Datatype)
recvbuf address of receive buffer void* TYPE(*), OUT
DIMENSION(..)
recvcounts  non-negative integer array (of length group size) specifying the number of elements that can be received from each rank    const int[] / MPI_Count[]    INTEGER(*)    IN
rdispls     integer array (of length group size); entry i specifies the displacement (relative to recvbuf) at which to place the incoming data from process i    const int[] / MPI_Aint[]    INTEGER(*)    IN
recvtype datatype of receive buffer MPI_Datatype TYPE IN
elements (MPI_Datatype)
comm communicator MPI_Comm TYPE IN
(MPI_Comm)
)


3.7 Reduce-scatter
There are several MPI collectives that are functionally equivalent to a combination of others. You have
already seen MPI_Allreduce which is equivalent to a reduction followed by a broadcast. Often such com-
binations can be more efficient than using the individual calls; see HPC book, section-6.1.
Here is another example: MPI_Reduce_scatter is equivalent to a reduction on an array of data (meaning a
pointwise reduction on each array location) followed by a scatter of this array to the individual processes.

Figure 3.9: Reduce scatter

We will discuss this routine, or rather its variant MPI_Reduce_scatter_block (figure 3.10), using an impor-
tant example: the sparse matrix-vector product (see HPC book, section-6.5.1 for background information).
Each process contains one or more matrix rows, so by looking at indices the process can decide what other
processes it needs to receive data from, that is, each process knows how many messages it will receive,
and from which processes. The problem is for a process to find out what other processes it needs to send
data to.
Let’s set up the data:
// reducescatter.c
int
  // data that we know:
  *i_recv_from_proc = (int*) malloc(nprocs*sizeof(int)),
  *procs_to_recv_from, nprocs_to_recv_from=0,
  // data we are going to determine:
  *procs_to_send_to,nprocs_to_send_to;

Each process creates an array of ones and zeros, describing who it needs data from. Ideally, we only need
the array procs_to_recv_from but initially we need the (possibly much larger) array i_recv_from_proc.
Next, the MPI_Reduce_scatter_block call then computes, on each process, how many messages it needs to
send.
MPI_Reduce_scatter_block
(i_recv_from_proc,&nprocs_to_send_to,1,MPI_INT,
MPI_SUM,comm);


We do not yet have the information to which processes to send. For that, each process sends a zero-size
message to each of its senders. Conversely, it then does a receive with MPI_ANY_SOURCE to discover who is
requesting data from it. The crucial point of the MPI_Reduce_scatter_block call is that, without it, a process
would not know how many of these zero-size messages to expect.
/*
* Send a zero-size msg to everyone that you receive from,
* just to let them know that they need to send to you.
*/
MPI_Request send_requests[nprocs_to_recv_from];
for (int iproc=0; iproc<nprocs_to_recv_from; iproc++) {
int proc=procs_to_recv_from[iproc];
double send_buffer=0.;
MPI_Isend(&send_buffer,0,MPI_DOUBLE, /*to:*/ proc,0,comm,
&(send_requests[iproc]));
}

/*
* Do as many receives as you know are coming in;
* use wildcards since you don't know where they are coming from.
* The source is a process you need to send to.
*/
procs_to_send_to = (int*)malloc( nprocs_to_send_to * sizeof(int) );
for (int iproc=0; iproc<nprocs_to_send_to; iproc++) {
double recv_buffer;
MPI_Status status;
MPI_Recv(&recv_buffer,0,MPI_DOUBLE,MPI_ANY_SOURCE,MPI_ANY_TAG,comm,
&status);
procs_to_send_to[iproc] = status.MPI_SOURCE;
}
MPI_Waitall(nprocs_to_recv_from,send_requests,MPI_STATUSES_IGNORE);

The MPI_Reduce_scatter (figure 3.10) call is more general: instead of indicating the mere presence of a
message between two processes, by having individual receive counts one can, for instance, indicate the
size of the messages.
We can look at reduce-scatter as a limited form of the all-to-all data transposition discussed above (section 3.6.1).
Suppose that the matrix 𝐶 contains only 0/1, indicating whether or not a message is sent,
rather than the actual amounts. If a receiving process only needs to know how many messages to receive,
rather than where they come from, it is enough to know the column sum, rather than the full column (see
figure 3.9).
Another application of the reduce-scatter mechanism is in the dense matrix-vector product, if a two-
dimensional data distribution is used.
Exercise 3.19. Implement the dense matrix-vector product, where the matrix is distributed
by columns: each process stores 𝐴∗𝑗 for a disjoint set of 𝑗-values. At the start and
end of the algorithm each process should store a distinct part of the input and
output vectors. Argue that this can be done naively with an MPI_Reduce operation,
but more efficiently using MPI_Reduce_scatter.
For discussion, see HPC book, section-6.2.2.


Figure 3.10 MPI_Reduce_scatter


Name Param name Explanation C type F type inout
MPI_Reduce_scatter (
MPI_Reduce_scatter_c (
sendbuf starting address of send const void* TYPE(*), IN
buffer DIMENSION(..)
recvbuf starting address of void* TYPE(*), OUT
receive buffer DIMENSION(..)
recvcounts  non-negative integer array (of length group size) specifying the number of elements of the result distributed to each process    const int[] / MPI_Count[]    INTEGER(*)    IN
datatype datatype of elements of MPI_Datatype TYPE IN
send and receive buffers (MPI_Datatype)
op operation MPI_Op TYPE(MPI_Op) IN
comm communicator MPI_Comm TYPE IN
(MPI_Comm)
)

3.7.1 Examples
An important application of this is establishing an irregular communication pattern. Assume that each
process knows which other processes it wants to communicate with; the problem is to let the other pro-
cesses know about this. The solution is to use MPI_Reduce_scatter to find out how many processes want
to communicate with you
MPI_Reduce_scatter_block
(i_recv_from_proc,&nprocs_to_send_to,1,MPI_INT,
MPI_SUM,comm);

and then wait for precisely that many messages with a source value of MPI_ANY_SOURCE.
/*
* Send a zero-size msg to everyone that you receive from,
* just to let them know that they need to send to you.
*/
MPI_Request send_requests[nprocs_to_recv_from];
for (int iproc=0; iproc<nprocs_to_recv_from; iproc++) {
int proc=procs_to_recv_from[iproc];
double send_buffer=0.;
MPI_Isend(&send_buffer,0,MPI_DOUBLE, /*to:*/ proc,0,comm,
&(send_requests[iproc]));
}

/*
* Do as many receives as you know are coming in;
* use wildcards since you don't know where they are coming from.
* The source is a process you need to send to.


Figure 3.11 MPI_Barrier


Name Param name Explanation C type F type inout
MPI_Barrier (
comm communicator MPI_Comm TYPE IN
(MPI_Comm)
)

*/
procs_to_send_to = (int*)malloc( nprocs_to_send_to * sizeof(int) );
for (int iproc=0; iproc<nprocs_to_send_to; iproc++) {
double recv_buffer;
MPI_Status status;
MPI_Recv(&recv_buffer,0,MPI_DOUBLE,MPI_ANY_SOURCE,MPI_ANY_TAG,comm,
&status);
procs_to_send_to[iproc] = status.MPI_SOURCE;
}
MPI_Waitall(nprocs_to_recv_from,send_requests,MPI_STATUSES_IGNORE);

Use of MPI_Reduce_scatter to implement the two-dimensional matrix-vector product. Set up separate row
and column communicators with MPI_Comm_split, use MPI_Reduce_scatter to combine local products.
MPI_Allgather(&my_x,1,MPI_DOUBLE,
local_x,1,MPI_DOUBLE,environ.col_comm);
MPI_Reduce_scatter(local_y,&my_y,&ione,MPI_DOUBLE,
MPI_SUM,environ.row_comm);

3.8 Barrier
A barrier call, MPI_Barrier (figure 3.11) is a routine that blocks all processes until they have all reached
the barrier call. Thus it achieves time synchronization of the processes.
For all its simplicity, this call is of very limited usefulness. It is almost never necessary
to synchronize processes through a barrier: for most purposes it does not matter if processors are out of
sync. Conversely, collectives (except the new nonblocking ones; section 3.11) themselves introduce a barrier
of sorts.
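One of the few legitimate uses is timing: to measure a collective fairly you may want all processes to start at approximately the same moment. A sketch (the buffer and its length are assumed to exist):

// sketch: align processes before timing a collective
MPI_Barrier(comm);
double t = MPI_Wtime();
MPI_Bcast( buffer,buflen,MPI_DOUBLE, 0,comm );
t = MPI_Wtime()-t;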

3.9 Variable-size-input collectives


In the gather and scatter call above each processor received or sent an identical number of items. In many
cases this is appropriate, but sometimes each processor wants or contributes an individual number of
items.
Let’s take the gather calls as an example. Assume that each processor does a local computation that
produces a number of data elements, and this number is different for each processor (or at least not the


same for all). In the regular MPI_Gather call the root processor had a buffer of size 𝑛𝑃, where 𝑛 is the
number of elements produced on each processor, and 𝑃 the number of processors. The contribution from
processor 𝑝 would go into locations 𝑝𝑛, … , (𝑝 + 1)𝑛 − 1.

For the variable case, we first need to compute the total required buffer size. This can be done through a
simple MPI_Reduce with MPI_SUM as reduction operator: the buffer size is ∑𝑝 𝑛𝑝 where 𝑛𝑝 is the number of
elements on processor 𝑝. But you can also postpone this calculation for a minute.

The next question is where the contributions of the processors will go into this buffer. For the contribution
from processor 𝑝 that is locations ∑_{𝑞<𝑝} 𝑛_𝑞 , … , (∑_{𝑞≤𝑝} 𝑛_𝑞) − 1. To compute this, the root processor needs to have all the
𝑛_𝑝 numbers, and it can collect them with an MPI_Gather call.

We now have all the ingredients. All the processors specify a send buffer just as with MPI_Gather. However,
the receive buffer specification on the root is more complicated. It now consists of:

outbuffer, array-of-outcounts, array-of-displacements, outtype

and you have just seen how to construct that information.

For example, in an MPI_Gatherv (figure 3.12) call each process has an individual number of items to con-
tribute. To gather this, the root process needs to find these individual amounts with an MPI_Gather call,
and locally construct the offsets array. Note how the offsets array has size ntids+1: the final offset value
is automatically the total size of all incoming data. See the example below.

There are various calls where processors can have buffers of differing sizes.

• In MPI_Scatterv (figure 3.13) the root process has a different amount of data for each recipient.
• In MPI_Gatherv, conversely, each process contributes a different sized send buffer to the re-
ceived result; MPI_Allgatherv (figure 3.14) does the same, but leaves its result on all processes;
MPI_Alltoallv does a different variable-sized gather on each process.

3.9.1 Example of Gatherv

We use MPI_Gatherv to do an irregular gather onto a root. We first need an MPI_Gather to determine offsets.


Figure 3.12 MPI_Gatherv


Name Param name Explanation C type F type inout
MPI_Gatherv (
MPI_Gatherv_c (
sendbuf starting address of send const void* TYPE(*), IN
buffer DIMENSION(..)
sendcount   number of elements in send buffer             int / MPI_Count     INTEGER             IN
sendtype datatype of send buffer MPI_Datatype TYPE IN
elements (MPI_Datatype)
recvbuf address of receive buffer void* TYPE(*), OUT
DIMENSION(..)
recvcounts  non-negative integer array (of length group size) containing the number of elements that are received from each process    const int[] / MPI_Count[]    INTEGER(*)    IN
displs      integer array (of length group size); entry i specifies the displacement relative to recvbuf at which to place the incoming data from process i    const int[] / MPI_Aint[]    INTEGER(*)    IN
recvtype datatype of recv buffer MPI_Datatype TYPE IN
elements (MPI_Datatype)
root rank of receiving process int INTEGER IN
comm communicator MPI_Comm TYPE IN
(MPI_Comm)
)
MPL:

template<typename T>
void gatherv
(int root_rank, const T *senddata, const layout<T> &sendl,
T *recvdata, const layouts<T> &recvls, const displacements &recvdispls) const
(int root_rank, const T *senddata, const layout<T> &sendl,
T *recvdata, const layouts<T> &recvls) const
(int root_rank, const T *senddata, const layout<T> &sendl ) const
Python:

Gatherv(self, sendbuf, [recvbuf,counts], int root=0)


Figure 3.13 MPI_Scatterv


Name Param name Explanation C type F type inout
MPI_Scatterv (
MPI_Scatterv_c (
sendbuf address of send buffer const void* TYPE(*), IN
DIMENSION(..)
sendcounts  non-negative integer array (of length group size) specifying the number of elements to send to each rank    const int[] / MPI_Count[]    INTEGER(*)    IN
displs      integer array (of length group size); entry i specifies the displacement (relative to sendbuf) from which to take the outgoing data to process i    const int[] / MPI_Aint[]    INTEGER(*)    IN
sendtype datatype of send buffer MPI_Datatype TYPE IN
elements (MPI_Datatype)
recvbuf address of receive buffer void* TYPE(*), OUT
DIMENSION(..)
recvcount   number of elements in receive buffer          int / MPI_Count     INTEGER             IN
recvtype datatype of receive buffer MPI_Datatype TYPE IN
elements (MPI_Datatype)
root rank of sending process int INTEGER IN
comm communicator MPI_Comm TYPE IN
(MPI_Comm)
)

Code:
// gatherv.c
// we assume that each process has an array "localdata"
// of size "localsize"

// the root process decides how much data will be coming:
// allocate arrays to contain size and offset information
if (procno==root) {
  localsizes = (int*) malloc( nprocs*sizeof(int) );
  offsets = (int*) malloc( nprocs*sizeof(int) );
}
// everyone contributes their local size info
MPI_Gather(&localsize,1,MPI_INT,
           localsizes,1,MPI_INT,root,comm);
// the root constructs the offsets array
if (procno==root) {
  int total_data = 0;
  for (int i=0; i<nprocs; i++) {
    offsets[i] = total_data;
    total_data += localsizes[i];
  }
  alldata = (int*) malloc( total_data*sizeof(int) );
}
// everyone contributes their data
MPI_Gatherv(localdata,localsize,MPI_INT,
            alldata,localsizes,offsets,MPI_INT,root,comm);

Output:
make[3]: `gatherv' is up to date.
TACC: Starting up job 4328411
TACC: Starting parallel tasks...
Local sizes: 13, 12, 13, 14, 11, 12, 14, 6, 12, 8,
Collected:
0:1,1,1,1,1,1,1,1,1,1,1,1,1;
1:2,2,2,2,2,2,2,2,2,2,2,2;
2:3,3,3,3,3,3,3,3,3,3,3,3,3;
3:4,4,4,4,4,4,4,4,4,4,4,4,4,4;
4:5,5,5,5,5,5,5,5,5,5,5;
5:6,6,6,6,6,6,6,6,6,6,6,6;
6:7,7,7,7,7,7,7,7,7,7,7,7,7,7;
7:8,8,8,8,8,8;
8:9,9,9,9,9,9,9,9,9,9,9,9;
9:10,10,10,10,10,10,10,10;
TACC: Shutdown complete. Exiting.

Figure 3.14 MPI_Allgatherv


Name Param name Explanation C type F type inout
MPI_Allgatherv (
MPI_Allgatherv_c (
sendbuf starting address of send const void* TYPE(*), IN
buffer DIMENSION(..)
sendcount   number of elements in send buffer             int / MPI_Count     INTEGER             IN
sendtype datatype of send buffer MPI_Datatype TYPE IN
elements (MPI_Datatype)
recvbuf address of receive buffer void* TYPE(*), OUT
DIMENSION(..)
recvcounts  non-negative integer array (of length group size) containing the number of elements that are received from each process    const int[] / MPI_Count[]    INTEGER(*)    IN
displs      integer array (of length group size); entry i specifies the displacement (relative to recvbuf) at which to place the incoming data from process i    const int[] / MPI_Aint[]    INTEGER(*)    IN
recvtype datatype of receive buffer MPI_Datatype TYPE IN
elements (MPI_Datatype)
comm communicator MPI_Comm TYPE IN
(MPI_Comm)
)
Python:

MPI.Comm.Allgatherv(self, sendbuf, recvbuf)


where recvbuf = "[ array, counts, displs, type]"

## gatherv.py
# implicitly using root=0
globalsize = comm.reduce(localsize)
if procid==0:
print("Global size=%d" % globalsize)
collecteddata = np.empty(globalsize,dtype=int)
counts = comm.gather(localsize)
comm.Gatherv(localdata, [collecteddata, counts])

3.9.2 Example of Allgatherv


Prior to the actual gatherv call, we need to construct the count and displacement arrays. The easiest way
is to use a reduction.


// allgatherv.c
MPI_Allgather
( &my_count,1,MPI_INT,
recv_counts,1,MPI_INT, comm );
int accumulate = 0;
for (int i=0; i<nprocs; i++) {
recv_displs[i] = accumulate; accumulate += recv_counts[i]; }
int *global_array = (int*) malloc(accumulate*sizeof(int));
MPI_Allgatherv
( my_array,procno+1,MPI_INT,
global_array,recv_counts,recv_displs,MPI_INT, comm );

In python the receive buffer has to contain the counts and displacements arrays.
## allgatherv.py
mycount = procid+1
my_array = np.empty(mycount,dtype=np.float64)

my_count = np.empty(1,dtype=int)
my_count[0] = mycount
comm.Allgather( my_count,recv_counts )

accumulate = 0
for p in range(nprocs):
recv_displs[p] = accumulate; accumulate += recv_counts[p]
global_array = np.empty(accumulate,dtype=np.float64)
comm.Allgatherv( my_array, [global_array,recv_counts,recv_displs,MPI.DOUBLE] )

3.9.3 Variable all-to-all

The variable all-to-all routine MPI_Alltoallv is discussed in section 3.6.2.

3.10 MPI Operators


MPI operators, that is, objects of type MPI_Op, are used in reduction and scan operations. Most common operators,
such as the sum or the maximum, have been built into the MPI library; see section 3.10.1. It is also possible to
define new operators; see section 3.10.2.

3.10.1 Pre-defined operators

The following is the list of pre-defined MPI_Op values.


MPI type meaning applies to


MPI_MAX maximum integer, floating point
MPI_MIN minimum
MPI_SUM sum integer, floating point, complex, multilanguage types
MPI_REPLACE overwrite
MPI_NO_OP no change
MPI_PROD product
MPI_LAND logical and C integer, logical
MPI_LOR logical or
MPI_LXOR logical xor
MPI_BAND bitwise and integer, byte, multilanguage types
MPI_BOR bitwise or
MPI_BXOR bitwise xor
MPI_MAXLOC max value and location MPI_DOUBLE_INT and such
MPI_MINLOC min value and location

3.10.1.1 Minloc and maxloc


The MPI_MAXLOC and MPI_MINLOC operations yield both the maximum and the rank on which it occurs. Their
result is a struct of the data over which the reduction happens, and an int.
In C, the types to use in the reduction call are: MPI_FLOAT_INT, MPI_LONG_INT, MPI_DOUBLE_INT, MPI_SHORT_INT,
MPI_2INT, MPI_LONG_DOUBLE_INT. Likewise, the input needs to consist of such structures: the input should be
an array of such struct types, where the int is the rank of the number.
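A sketch of a maxloc reduction in C, using the MPI_DOUBLE_INT pair type (the variable names are assumptions):

// sketch: find the global maximum and the rank where it occurs
struct { double value; int rank; } my_pair, max_pair;
my_pair.value = myvalue;   // the local value
my_pair.rank  = procno;    // the 'location' entry: this process' rank
MPI_Allreduce( &my_pair,&max_pair,1,MPI_DOUBLE_INT,MPI_MAXLOC,comm );
// max_pair.value is the maximum, max_pair.rank the rank on which it occurs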
These types may have some unusual size properties:
Code:
// longint.c
MPI_Type_size( MPI_LONG_INT,&s );
printf("MPI_LONG_INT size=%d\n",s);
MPI_Aint ss;
MPI_Type_extent( MPI_LONG_INT,&ss );
printf("MPI_LONG_INT extent=%ld\n",ss);

Output:
MPI_LONG_INT size=12
MPI_LONG_INT extent=16

Fortran note 6: Min/maxloc types. The original Fortran interface to MPI was designed around Fortran77
features, so it is not using Fortran derived types (Type keyword). Instead, all integer indices
are stored in whatever the type is that is being reduced. The available result types are then
MPI_2REAL, MPI_2DOUBLE_PRECISION, MPI_2INTEGER.

Likewise, the input needs to be arrays of such type. Consider this example:
Real*8,dimension(2,N) :: input,output
call MPI_Reduce( input,output, N, MPI_2DOUBLE_PRECISION, &
MPI_MAXLOC, root, comm )

MPL note 19: Operators. Arithmetic: plus, multiplies, max, min.


Figure 3.15 MPI_Op_create


Name Param name Explanation C type F type inout
MPI_Op_create (
MPI_Op_create_c (
user_fn     user defined function      MPI_User_function* / MPI_User_function_c*     PROCEDURE(MPI_User_function)     IN
commute true if commutative; false int LOGICAL IN
otherwise.
op operation MPI_Op* TYPE(MPI_Op) OUT
)
Python:

MPI.Op.create(cls,function,bool commute=False)

Logic: logical_and, logical_or, logical_xor.


Bitwise: bit_and, bit_or, bit_xor.

3.10.2 User-defined operators


In addition to predefined operators, MPI has the possibility of user-defined operators to use in a reduction
or scan operation.
The routine for this is MPI_Op_create (figure 3.15), which takes a user function and turns it into an object
of type MPI_Op, which can then be used in any reduction:
MPI_Op rwz;
MPI_Op_create(reduce_without_zero,1,&rwz);
MPI_Allreduce(data+procno,&positive_minimum,1,MPI_INT,rwz,comm);

Python note 11: Define reduction operator. In python, Op.Create is a class method for the MPI class.
rwz = MPI.Op.Create(reduceWithoutZero)
positive_minimum = np.zeros(1,dtype=intc)
comm.Allreduce(data[procid],positive_minimum,rwz);

The user function needs to have the following signature:


typedef void MPI_User_function
( void *invec, void *inoutvec, int *len,
MPI_Datatype *datatype);

FUNCTION USER_FUNCTION( INVEC(*), INOUTVEC(*), LEN, TYPE)


<type> INVEC(LEN), INOUTVEC(LEN)
INTEGER LEN, TYPE

For example, consider an operator for finding the smallest nonzero number in an array of nonnegative
integers.
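A possible implementation is sketched below; the function name matches the reduce_without_zero used in the MPI_Op_create call above, and the logic mirrors the Python version below (for simplicity a count of one is assumed):

void reduce_without_zero( void *in,void *inout,int *len,MPI_Datatype *datatype ) {
  // n is the newly contributed value, r the already reduced value
  int n = *(int*)in, r = *(int*)inout, m;
  if (n==0)            // new value is zero: keep the old one
    m = r;
  else if (r==0)       // old value is zero: take the new one
    m = n;
  else if (n<r)        // both nonzero: take the smaller
    m = n;
  else
    m = r;
  *(int*)inout = m;
}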


Python note 12: Reduction function. The python equivalent of such a function receives bare buffers as
arguments. Therefore, it is best to turn them first into NumPy arrays using np.frombuffer:
## reductpositive.py
def reduceWithoutZero(in_buf, inout_buf, datatype):
typecode = MPI._typecode(datatype)
assert typecode is not None ## check MPI datatype is built-in
dtype = np.dtype(typecode)

in_array = np.frombuffer(in_buf, dtype)


inout_array = np.frombuffer(inout_buf, dtype)

n = in_array[0]; r = inout_array[0]
if n==0:
m = r
elif r==0:
m = n
elif n<r:
m = n
else:
m = r
inout_array[0] = m

The assert statement accounts for the fact that this mapping of MPI datatype to NumPy dtype
only works for built-in MPI datatypes.
MPL note 20: User-defined operators. A user-defined operator can be a templated class with an operator().
Example:
// reduceuser.cxx
template<typename T>
class lcm {
public:
  T operator()(T a, T b) {
    T zero=T();
    T t((a/gcd(a, b))*b);
    if (t<zero)
      return -t;
    return t;
  }
};

comm_world.reduce(lcm<int>(), 0, v, result);

(The templated class can be a lambda expression)


MPL note 21: Lambda operator. You can also do the reduction by lambda:
comm_world.reduce
( [] (int i,int j) -> int
{ return i+j; },
0,data );

The function has an array length argument len, to allow for pointwise reduction on a whole array at
once. The inoutvec array contains partially reduced results, and is typically overwritten by the function.


Figure 3.16 MPI_Op_commutative


Name Param name Explanation C type F type inout
MPI_Op_commutative (
op operation MPI_Op TYPE(MPI_Op) IN
commute true if op is commutative, int* LOGICAL OUT
false otherwise
)

There are some restrictions on the user function:


• It may not call MPI functions, except for MPI_Abort.
• It must be associative; it can be optionally commutative, which fact is passed to the
MPI_Op_create call.
Exercise 3.20. Write the reduction function to implement the one-norm of a vector:

‖𝑥‖1 ≡ ∑𝑖 |𝑥𝑖 |.

(There is a skeleton for this exercise under the name onenorm.)


The operator can be destroyed with a corresponding MPI_Op_free.
int MPI_Op_free(MPI_Op *op)

This sets the operator to MPI_OP_NULL. This is not necessary in OO languages, where the destructor takes
care of it.
You can query the commutativity of an operator with MPI_Op_commutative (figure 3.16).

3.10.3 Local reduction


The application of an MPI_Op can be performed with the routine MPI_Reduce_local (figure 3.17). Using this
routine and some send/receive scheme you can build your own global reductions. Note that this routine
does not take a communicator because it is purely local.
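A sketch of its use, folding a received partial result into a local accumulator (the buffer names and the surrounding send/receive scheme are assumptions):

// sketch: combine a received partial result into the local accumulator
double partial[2], accumulator[2] = {0.,0.};
MPI_Recv( partial,2,MPI_DOUBLE, MPI_ANY_SOURCE,0,comm,MPI_STATUS_IGNORE );
MPI_Reduce_local( partial,accumulator,2,MPI_DOUBLE,MPI_SUM );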

3.11 Nonblocking collectives


Above you have seen how the ‘Isend’ and ‘Irecv’ routines can overlap communication with computation.
This is not possible with the collectives you have seen so far: they act like blocking sends or receives.
However, there are also nonblocking collectives, introduced in MPI-3.
Such operations can be used to increase efficiency. For instance, computing

𝑦 ← 𝐴𝑥 + (𝑥 𝑡 𝑥)𝑦

involves a matrix-vector product, which is dominated by computation in the sparse matrix case, and an
inner product which is typically dominated by the communication cost. You would code this as


Figure 3.17 MPI_Reduce_local


Name Param name Explanation C type F type inout
MPI_Reduce_local (
MPI_Reduce_local_c (
inbuf input buffer const void* TYPE(*), IN
DIMENSION(..)
inoutbuf combined input and output void* TYPE(*), INOUT
buffer DIMENSION(..)
count       number of elements in inbuf and inoutbuf buffers    int / MPI_Count     INTEGER             IN
datatype datatype of elements of MPI_Datatype TYPE IN
inbuf and inoutbuf buffers (MPI_Datatype)
op operation MPI_Op TYPE(MPI_Op) IN
)

MPI_Iallreduce( .... x ..., &request);
// compute the matrix vector product
MPI_Wait(&request,MPI_STATUS_IGNORE);
// do the addition

This can also be used for 3D FFT operations [14]. Occasionally, a nonblocking collective can be used for
nonobvious purposes, such as the MPI_Ibarrier in [15].
These have roughly the same calling sequence as their blocking counterparts, except that they output an
MPI_Request. You can then use an MPI_Wait call to make sure the collective has completed.
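A sketch of the overlap described above for the inner product term (the local product routine, the temporary t, and the array lengths are assumptions):

// sketch: overlap the inner-product reduction with the local part of the product
double local_xx = 0., global_xx;
for (int i=0; i<nlocal; i++)
  local_xx += x[i]*x[i];
MPI_Request request;
MPI_Iallreduce( &local_xx,&global_xx,1,MPI_DOUBLE,MPI_SUM,comm,&request );
local_matrix_vector_product( A,x,t );        // communication-free local work: t := A x
MPI_Wait( &request,MPI_STATUS_IGNORE );      // now global_xx = x^t x is available
for (int i=0; i<nlocal; i++)
  y[i] = t[i] + global_xx*y[i];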

Nonblocking collectives offer a number of performance advantages:


• Do two reductions (on the same communicator) with different operators simultaneously:

𝛼 ← 𝑥𝑡 𝑦
𝛽 ← ‖𝑧‖∞

which translates to:


MPI_Allreduce( &local_xy, &global_xy, 1,MPI_DOUBLE,MPI_SUM,comm);
MPI_Allreduce( &local_xinf,&global_xin,1,MPI_DOUBLE,MPI_MAX,comm);

• do collectives on overlapping communicators simultaneously;


• overlap a nonblocking collective with a blocking one.
Exercise 3.21. Revisit exercise 7.1. Let only the first row and first column have certain data,
which they broadcast through columns and rows respectively. Each process is now
involved in two simultaneous collectives. Implement this with nonblocking
broadcasts, and time the difference between a blocking and a nonblocking solution.
(There is a skeleton for this exercise under the name procgridnonblock.)

Remark 6 Blocking and nonblocking don’t match: either all processes call the nonblocking or all call the
blocking one. Thus the following code is incorrect:


Figure 3.18 MPI_Iallgather


Name Param name Explanation C type F type inout
MPI_Iallgather (
MPI_Iallgather_c (
sendbuf starting address of send const void* TYPE(*), IN
buffer DIMENSION(..)
sendcount   number of elements in send buffer             int / MPI_Count     INTEGER             IN
sendtype datatype of send buffer MPI_Datatype TYPE IN
elements (MPI_Datatype)
recvbuf address of receive buffer void* TYPE(*), OUT
DIMENSION(..)
recvcount   number of elements received from any process  int / MPI_Count     INTEGER             IN
recvtype datatype of receive buffer MPI_Datatype TYPE IN
elements (MPI_Datatype)
comm communicator MPI_Comm TYPE IN
(MPI_Comm)
request communication request MPI_Request* TYPE OUT
(MPI_Request)
)

if (rank==root)
MPI_Reduce( &x /* ... */ root,comm );
else
MPI_Ireduce( &x /* ... */ );

This is unlike the point-to-point behavior of nonblocking calls: you can catch a message with MPI_Irecv that
was sent with MPI_Send.

Remark 7 Unlike sends and receives, collectives have no identifying tag. With blocking collectives that does
not lead to ambiguity problems. With nonblocking collectives it means that all processes need to issue them
in identical order.

List of nonblocking collectives:


• MPI_Igather,MPI_Igatherv, MPI_Iallgather (figure 3.18),MPI_Iallgatherv,
• MPI_Iscatter, MPI_Iscatterv,
• MPI_Ireduce, MPI_Iallreduce (figure 3.19), MPI_Ireduce_scatter, MPI_Ireduce_scatter_block.
• MPI_Ialltoall,MPI_Ialltoallv, MPI_Ialltoallw,
• MPI_Ibarrier; section 3.11.2,
• MPI_Ibcast,
• MPI_Iexscan, MPI_Iscan,
MPL note 22: Nonblocking collectives. Nonblocking collectives have the same argument list as the corre-
sponding blocking variant, except that instead of a void result, they return an irequest. (See 29)


Figure 3.19 MPI_Iallreduce


Name Param name Explanation C type F type inout
MPI_Iallreduce (
MPI_Iallreduce_c (
sendbuf starting address of send const void* TYPE(*), IN
buffer DIMENSION(..)
recvbuf starting address of void* TYPE(*), OUT
receive buffer DIMENSION(..)
count       number of elements in send buffer             int / MPI_Count     INTEGER             IN
datatype datatype of elements of MPI_Datatype TYPE IN
send buffer (MPI_Datatype)
op operation MPI_Op TYPE(MPI_Op) IN
comm communicator MPI_Comm TYPE IN
(MPI_Comm)
request communication request MPI_Request* TYPE OUT
(MPI_Request)
)

// ireducescalar.cxx
float x{1.},sum;
auto reduce_request =
comm_world.ireduce(mpl::plus<float>(), 0, x, sum);
reduce_request.wait();
if (comm_world.rank()==0) {
std::cout << "sum = " << sum << '\n';
}

3.11.1 Examples
3.11.1.1 Array transpose
To illustrate the overlapping of multiple nonblocking collectives, consider transposing a data matrix. Ini-
tially, each process has one row of the matrix; after transposition each process has a column. Since each
row needs to be distributed to all processes, algorithmically this corresponds to a series of scatter calls,
one originating from each process.
// itransposeblock.c
for (int iproc=0; iproc<nprocs; iproc++) {
MPI_Scatter( regular,1,MPI_DOUBLE,
&(transpose[iproc]),1,MPI_DOUBLE,
iproc,comm);
}

Introducing the nonblocking MPI_Iscatter call, this becomes:


MPI_Request scatter_requests[nprocs];
for (int iproc=0; iproc<nprocs; iproc++) {
MPI_Iscatter( regular,1,MPI_DOUBLE,


&(transpose[iproc]),1,MPI_DOUBLE,
iproc,comm,scatter_requests+iproc);
}
MPI_Waitall(nprocs,scatter_requests,MPI_STATUSES_IGNORE);

Exercise 3.22. Can you implement the same algorithm with MPI_Igather?

3.11.1.2 Stencils

Figure 3.10: Illustration of five-point stencil gather

The ever-popular five-point stencil evaluation does not look like a collective operation, and indeed, it is
usually evaluated with (nonblocking) send/recv operations. However, if we create a subcommunicator on
each subdomain that contains precisely that domain and its neighbors (see figure 3.10), we can formulate
the communication pattern as a gather on each of these. With ordinary collectives this can not be
formulated in a deadlock-free manner, but nonblocking collectives make this feasible.
We will see an even more elegant formulation of this operation in section 11.2.

3.11.2 Nonblocking barrier


Probably the most surprising nonblocking collective is the nonblocking barrier MPI_Ibarrier (figure 3.20).
The way to understand this is to think of a barrier not in terms of temporal synchronization, but state
agreement: reaching a barrier is a sign that a process has attained a certain state, and leaving a barrier
means that all processes are in the same state. The ordinary barrier is then a blocking wait for agreement,
while with a nonblocking barrier:
• Posting the barrier means that a process has reached a certain state; and
• the request being fulfilled means that all processes have reached the barrier.


Figure 3.20 MPI_Ibarrier


Name Param name Explanation C type F type inout
MPI_Ibarrier (
comm communicator MPI_Comm TYPE IN
(MPI_Comm)
request communication request MPI_Request* TYPE OUT
(MPI_Request)
)

One scenario would be local refinement, where some processes decide to refine their subdomain, which
fact they need to communicate to their neighbors. The problem here is that most processes are not among
these neighbors, so they should not post a receive of any type. Instead, any refining process sends to its
neighbors, and every process posts a barrier.
// ibarrierprobe.c
if (i_do_send) {
/*
* Pick a random process to send to,
* not yourself.
*/
int receiver = rand()%nprocs;
MPI_Ssend(&data,1,MPI_FLOAT,receiver,0,comm);
}
/*
* Everyone posts the non-blocking barrier
* and gets a request to test/wait for
*/
MPI_Request barrier_request;
MPI_Ibarrier(comm,&barrier_request);

Now every process alternately probes for messages and tests for completion of the barrier. Probing is done
through the nonblocking MPI_Iprobe call, while testing completion of the barrier is done through MPI_Test.
for ( ; ; step++) {
int barrier_done_flag=0;
MPI_Test(&barrier_request,&barrier_done_flag,
MPI_STATUS_IGNORE);
//stop if you're done!
if (barrier_done_flag) {
break;
} else {
// if you're not done with the barrier:
int flag; MPI_Status status;
MPI_Iprobe
( MPI_ANY_SOURCE,MPI_ANY_TAG,
comm, &flag, &status );
if (flag) {
// absorb message!

We can use a nonblocking barrier to good effect, utilizing the idle time that would result from a blocking


barrier. In the following code fragment processes test for completion of the barrier, and failing to detect
such completion, perform some local work.
// findbarrier.c
MPI_Request final_barrier;
MPI_Ibarrier(comm,&final_barrier);

int global_finish=mysleep;
do {
int all_done_flag=0;
MPI_Test(&final_barrier,&all_done_flag,MPI_STATUS_IGNORE);
if (all_done_flag) {
break;
} else {
int flag; MPI_Status status;
// force progress
MPI_Iprobe
( MPI_ANY_SOURCE,MPI_ANY_TAG,
comm, &flag, MPI_STATUS_IGNORE );
printf("[%d] going to work for another second\n",procid);
sleep(1);
global_finish++;
}
} while (1);

3.12 Performance of collectives


It is easy to visualize a broadcast as in figure 3.11: the root sends all of its data directly to
every other process. While this describes the semantics of the operation, in practice the implementation
works quite differently.

Figure 3.11: A simple broadcast
The time that a message takes can simply be modeled as

𝛼 + 𝛽𝑛,

where 𝛼 is the latency, a one time delay from establishing the communication between two processes,
and 𝛽 is the time-per-byte, or the inverse of the bandwidth, and 𝑛 the number of bytes sent.
Under the assumption that a processor can only send one message at a time, the broadcast in figure 3.11
would take a time proportional to the number of processors.


Exercise 3.23. What is the total time required for a broadcast involving 𝑝 processes? Give 𝛼
and 𝛽 terms separately.
One way to ameliorate that is to structure the broadcast in a tree-like fashion. This is depicted in figure 3.12.

Figure 3.12: A tree-based broadcast
Exercise 3.24. How does the communication time now depend on the number of
processors? Give the 𝛼 and 𝛽 terms separately.
What would be a lower bound on the 𝛼, 𝛽 terms?
The theory of the complexity of collectives is described in more detail in HPC book, section-6.1; see
also [3].

3.13 Collectives and synchronization


Collectives, other than a barrier, have a synchronizing effect between processors. For instance, in
MPI_Bcast( ....data... root);
MPI_Send(....);

the send operations on all processors will occur after the root executes the broadcast. Conversely, in a
reduce operation the root may have to wait for other processors. This is illustrated in figure 3.13, which
gives a TAU trace of a reduction operation on two nodes, with two six-core sockets (processors) each. We
see that¹:
• In each socket, the reduction is a linear accumulation;
• on each node, cores zero and six then combine their result;
• after which the final accumulation is done through the network.
We also see that the two nodes are not perfectly in sync, which is normal for MPI applications. As a result,
core 0 on the first node will sit idle until it receives the partial result from core 12, which is on the second
node.
While collectives synchronize in a loose sense, it is not possible to make any statements about events
before and after the collectives between processors:

1. This uses mvapich version 1.6; in version 1.9 the implementation of an on-node reduction has changed to simulate shared
memory.


Figure 3.13: Trace of a reduction operation between two dual-socket 12-core nodes

...event 1...
MPI_Bcast(....);
...event 2....

Consider a specific scenario:


switch(rank) {
case 0:
MPI_Bcast(buf1, count, type, 0, comm);
MPI_Send(buf2, count, type, 1, tag, comm);
break;
case 1:
MPI_Recv(buf2, count, type, MPI_ANY_SOURCE, tag, comm, &status);
MPI_Bcast(buf1, count, type, 0, comm);
MPI_Recv(buf2, count, type, MPI_ANY_SOURCE, tag, comm, &status);
break;
case 2:
MPI_Send(buf2, count, type, 1, tag, comm);
MPI_Bcast(buf1, count, type, 0, comm);
break;
}

Note the MPI_ANY_SOURCE parameter in the receive calls on processor 1. One obvious execution of this would
be:


Figure 3.14: Possible temporal orderings of send and collective calls: the most logical execution; another ordering that is also allowed; and what the latter looks like from a distance, with one of the messages seemingly going ‘back in time’.


1. The send from 2 is caught by processor 1;


2. Everyone executes the broadcast;
3. The send from 0 is caught by processor 1.
However, it is equally possible to have this execution:
1. Processor 0 starts its broadcast, then executes the send;
2. Processor 1’s receive catches the data from 0, then it executes its part of the broadcast;
3. Processor 1 catches the data sent by 2, and finally processor 2 does its part of the broadcast.
This is illustrated in figure 3.14.

3.14 Performance considerations


In this section we will consider how collectives can be implemented in multiple ways, and the performance
implications of such decisions. You can test the algorithms described here using SimGrid (Tutorials
book, section-20).

3.14.1 Scalability
We are motivated to write parallel software from two considerations. First of all, if we have a certain
problem to solve which normally takes time 𝑇 , then we hope that with 𝑝 processors it will take time 𝑇 /𝑝.
If this is true, we call our parallelization scheme scalable in time. In practice, we often accept small extra
terms: as you will see below, parallelization often adds a term log2 𝑝 to the running time.
Exercise 3.25. Discuss scalability of the following algorithms:
• You have an array of floating point numbers. You need to compute the sine of
each element.
• You have a two-dimensional array, denoting the interval [−2, 2]². You want to make
a picture of the Mandelbrot set, so you need to compute the color of each point.
• The primality test of exercise 2.6.
There is also the notion that a parallel algorithm can be scalable in space: more processors gives you more
memory so that you can run a larger problem.
Exercise 3.26. Discuss space scalability in the context of modern processor design.

3.14.2 Complexity and scalability of collectives


3.14.2.1 Broadcast
Naive broadcast Write a broadcast operation where the root does an MPI_Send to each other process.
What is the expected performance of this in terms of 𝛼, 𝛽?
Run some tests and confirm.


Simple ring Let the root only send to the next process, and that one send to its neighbor. This scheme
is known as a bucket brigade; see also section 4.1.5.
What is the expected performance of this in terms of 𝛼, 𝛽?
Run some tests and confirm.

Figure 3.15: A pipelined bucket brigade

Pipelined ring In a ring broadcast, each process needs to receive the whole message before it can
pass it on. We can increase the efficiency by breaking up the message and sending it in multiple parts.
(See figure 3.15.) This will be advantageous for messages that are long enough that the bandwidth cost
dominates the latency.
Assume a send buffer of length more than 1. Divide the send buffer into a number of chunks. The root
sends the chunks successively to the next process, and each process sends on whatever chunks it receives.
What is the expected performance of this in terms of 𝛼, 𝛽? Why is this better than the simple ring?
Run some tests and confirm.

Recursive doubling Collectives such as broadcast can be implemented through recursive doubling,
where the root sends to another process, then the root and the other process send to two more, those
four send to four more, et cetera. However, in an actual physical architecture this scheme can be realized
in multiple ways that have drastically different performance.
First consider the implementation where process 0 is the root, and it starts by sending to process 1; then
they send to 2 and 3; these four send to 4–7, et cetera. If the architecture is a linear array of processors,


this will lead to contention: multiple messages wanting to go through the same wire. (This is also related
to the concept of bisection bandwidth.)
In the following analyses we will assume wormhole routing: a message sets up a path through the network,
reserving the necessary wires, and performing a send in time independent of the distance through the
network. That is, the send time for any message can be modeled as

𝑇 (𝑛) = 𝛼 + 𝛽𝑛

regardless of source and destination, as long as the necessary connections are available.
Exercise 3.27. Analyze the running time of a recursive doubling broadcast as just
described, with wormhole routing.
Implement this broadcast in terms of blocking MPI send and receive calls. If you
have SimGrid available, run tests with a number of parameters.
The alternative, that avoids contention, is to let each doubling stage divide the network into separate
halves. That is, process 0 sends to 𝑃/2, after which these two repeat the algorithm in the two halves of
the network, sending to 𝑃/4 and 3𝑃/4 respectively.
Exercise 3.28. Analyze this variant of recursive doubling. Code it and measure runtimes on
SimGrid.
Exercise 3.29. Revisit exercise 3.27 and replace the blocking calls by nonblocking
MPI_Isend / MPI_Irecv calls.
Make sure to test that the data is correctly propagated.
MPI implementations often have multiple algorithms, which they dynamically switch between. Sometimes
you can determine the choice yourself through environment variables.
TACC note. For Intel MPI, see https://software.intel.com/en-us/mpi-developer-reference-linux-i-mpi-adjust-family-environment-variables.

3.15 Review questions


For all true/false questions, if you answer that a statement is false, give a one-line explanation.
Review 3.30. How would you realize the following scenarios with MPI collectives?
• Let each process compute a random number. You want to print the maximum
of these numbers to your screen.
• Each process computes a random number again. Now you want to scale these
numbers by their maximum.
• Let each process compute a random number. You want to print on what
processor the maximum value is computed.
Review 3.31. MPI collectives can be sorted in at least the following categories
1. rooted vs rootless
2. using uniform buffer lengths vs variable length buffers
3. blocking vs nonblocking.


Give examples of each type.


Review 3.32. True or false: collective routines are all about communicating user data
between the processes.
Review 3.33. True or false: an MPI_Scatter call puts the same data on each process.
Review 3.34. True or false: using the option MPI_IN_PLACE you only need space for a send
buffer in MPI_Reduce.
Review 3.35. True or false: using the option MPI_IN_PLACE you only need space for a send
buffer in MPI_Gather.
Review 3.36. Given a distributed array, with every processor storing
double x[N]; // N can vary per processor

give the approximate MPI-based code that computes the maximum value in the
array, and leaves the result on every processor.
Review 3.37.
double data[Nglobal];
int myfirst = /* something */, mylast = /* something */;
for (int i=myfirst; i<mylast; i++) {
if (i>0 && i<N-1) {
process_point( data,i,Nglobal );
}
}
void process_point( double *data,int i,int N ) {
data[i-1] = g(i-1); data[i] = g(i); data[i+1] = g(i+1);
data[i] = f(data[i-1],data[i],data[i+1]);
}

Is this scalable in time? Is this scalable in space? What is the missing MPI call?
Review 3.38.
double data[Nlocal+2]; // include left and right neighbor
int myfirst = /* something */, mylast = myfirst+Nlocal;
for (int i=0; i<Nlocal; i++) {
if (i>0 && i<N-1) {
process_point( data,i,Nlocal );
}
}
void process_point( double *data,int i0,int n ) {
int i = i0+1;
data[i-1] = g(i-1); data[i] = g(i); data[i+1] = g(i+1);
data[i] = f(data[i-1],data[i],data[i+1]);
}

Is this scalable in time? Is this scalable in space? What is the missing MPI call?
Review 3.39. With data as in the previous question, give the code for normalizing the
array, that is, scaling each element so that ‖𝑥‖₂ = 1.
Review 3.40. Just like MPI_Allreduce is equivalent to MPI_Reduce followed by MPI_Bcast,
MPI_Reduce_scatter is equivalent to at least one of the following combinations. Select
those that are equivalent, and discuss differences in time or space complexity:

1. MPI_Reduce followed by MPI_Scatter;
2. MPI_Gather followed by MPI_Scatter;
3. MPI_Allreduce followed by MPI_Scatter;
4. MPI_Allreduce followed by a local operation (which?);
5. MPI_Allgather followed by a local operation (which?).
Review 3.41. Think of at least two algorithms for doing a broadcast. Compare them with
regards to asymptotic behavior.


3.16 Full source code of examples


Please see the repository: https://github.com/VictorEijkhout/TheArtOfHPC_vol2_
parallelprogramming/tree/main. Examples are sorted first by programming system, then by
language.



Chapter 4

MPI topic: Point-to-point

4.1 Blocking point-to-point operations


Suppose you have an array of numbers 𝑥𝑖 ∶ 𝑖 = 0, … , 𝑁 and you want to compute

𝑦𝑖 = (𝑥𝑖−1 + 𝑥𝑖 + 𝑥𝑖+1 )/3 ∶ 𝑖 = 1, … , 𝑁 − 1.

As seen in figure 2.6, we give each processor a contiguous subset of the 𝑥𝑖 s and 𝑦𝑖 s. Let’s define 𝑖𝑝 as the
first index of 𝑦 that is computed by processor 𝑝. (What is the last index computed by processor 𝑝? How
many indices are computed on that processor?)
We often talk about the owner computes model of parallel computing: each processor ‘owns’ certain data
items, and it computes their value. The values used for this computation need of course not be local, and
this is where the need for communication arises.
Let’s investigate how processor 𝑝 goes about computing 𝑦𝑖 for the 𝑖-values it owns. Let’s assume that
process 𝑝 also stores the values 𝑥𝑖 for these same indices. Now, for many values 𝑖 it can evaluate the com-
putation
𝑦𝑖 = (𝑥𝑖−1 + 𝑥𝑖 + 𝑥𝑖+1 )/3
locally (figure 4.1).
However, there is a problem with computing 𝑦 in the first index 𝑖𝑝 on processor 𝑝:

𝑦𝑖𝑝 = (𝑥𝑖𝑝 −1 + 𝑥𝑖𝑝 + 𝑥𝑖𝑝 +1 )/3

The point to the left, 𝑥𝑖𝑝 −1 , is not stored on process 𝑝 (it is stored on 𝑝−1), so it is not immediately available
for use by process 𝑝. (figure 4.2). There is a similar story with the last index that 𝑝 tries to compute: that
involves a value that is only present on 𝑝 + 1.
You see that there is a need for processor-to-processor, or technically point-to-point, information ex-
change. MPI realizes this through matched send and receive calls:
• One process does a send to a specific other process;
• the other process does a specific receive from that source.
We will now discuss the send and receive routines in detail.


Figure 4.1: Three point averaging in parallel, case of interior points

Figure 4.2: Three point averaging in parallel, case of edge points

4.1.1 Example: ping-pong


A simple scenario for information exchange between just two processes is the ping-pong: process A sends
data to process B, which sends data back to A. This is not an operation that is particularly relevant to
applications, although it is often used as a benchmark. Here we discuss it to explain basic ideas.
This means that process A executes the code
MPI_Send( /* to: */ B ..... );
MPI_Recv( /* from: */ B ... );

while process B executes


MPI_Recv( /* from: */ A ... );
MPI_Send( /* to: */ A ..... );

Since we are programming in SPMD mode, this means our program looks like:
if ( /* I am process A */ ) {
MPI_Send( /* to: */ B ..... );
MPI_Recv( /* from: */ B ... );
} else if ( /* I am process B */ ) {
MPI_Recv( /* from: */ A ... );
MPI_Send( /* to: */ A ..... );
}

Remark 8 The structure of the send and receive calls shows the symmetric nature of MPI: every target process
is reached with the same send call, no matter whether it’s running on the same multicore chip as the sender, or


Figure 4.1 MPI_Send


Param name   Explanation                            C type            F type                    inout
MPI_Send ( / MPI_Send_c (
   buf        initial address of send buffer         const void*       TYPE(*), DIMENSION(..)    IN
   count      number of elements in send buffer      int / MPI_Count   INTEGER                   IN
   datatype   datatype of each send buffer element   MPI_Datatype      TYPE(MPI_Datatype)        IN
   dest       rank of destination                    int               INTEGER                   IN
   tag        message tag                            int               INTEGER                   IN
   comm       communicator                           MPI_Comm          TYPE(MPI_Comm)            IN
)
(A C type listed as ‘int / MPI_Count’ is int for the regular routine and MPI_Count for the _c large-count variant.)
MPL:

template<typename T >
void mpl::communicator::send
( const T scalar&,int dest,tag = tag(0) ) const
( const T *buffer,const layout< T > &,int dest,tag = tag(0) ) const
( iterT begin,iterT end,int dest,tag = tag(0) ) const
T : scalar type
begin : begin iterator
end : end iterator
Python:

Python native:
MPI.Comm.send(self, obj, int dest, int tag=0)
Python numpy:
MPI.Comm.Send(self, buf, int dest, int tag=0)

on a computational node halfway across the machine room, taking several network hops to reach. Of course,
any self-respecting MPI implementation optimizes for the case where sender and receiver have access to the
same shared memory. This means that a send/recv pair is realized as a copy operation from the sender buffer
to the receiver buffer, rather than a network transfer.

4.1.2 Send call


The blocking send command is MPI_Send (figure 4.1). Example:
// sendandrecv.c
double send_data = 1.;
MPI_Send
( /* send buffer/count/type: */ &send_data,1,MPI_DOUBLE,
/* to: */ receiver, /* tag: */ 0,
/* communicator: */ comm);

The send call has the following elements.


Buffer The send buffer is described by a trio of buffer/count/datatype. See section 3.2.4 for discussion.

Target The message target is an explicit process rank to send to. This rank is a number from zero up
to the result of MPI_Comm_size. It is allowed for a process to send to itself, but this may lead to a runtime
deadlock; see section 4.1.4 for discussion. The value MPI_PROC_NULL is allowed: using that as a target causes
no message to be sent or received.

Tag Next, a message can have a tag. Many applications have each sender send only one message at a
time to a given receiver. For the case where there are multiple simultaneous messages between the same
sender / receiver pair, the tag can be used to disambiguate between the messages.
Often, a tag value of zero is safe to use. Indeed, OO interfaces to MPI typically have the tag as an optional
parameter with value zero. If you do use tag values, you can use the key MPI_TAG_UB to query what the
maximum value is that can be used; see section 15.1.2.
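For instance, a minimal sketch of such a query, assuming the communicator MPI_COMM_WORLD (the variable name tag_upper_bound is only illustrative), could look like:
// sketch: query the predefined attribute MPI_TAG_UB on a communicator
void *attr_val; int flag;
MPI_Comm_get_attr(MPI_COMM_WORLD,MPI_TAG_UB,&attr_val,&flag);
if (flag) {
  int tag_upper_bound = *(int*)attr_val;
  printf("Maximum tag value: %d\n",tag_upper_bound);
}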

Communicator Finally, in common with the vast majority of MPI calls, there is a communicator ar-
gument that provides a context for the send transaction. In order to match a send and receive operation,
they need to be in the same communicator.
MPL note 23: Buffer type safety.
• Data type is templated: derived by the compiler.
• Count > 1 is declared in the datatype.
MPL note 24: Blocking send and receive. MPL uses a default value for the tag, and it can deduce the type
of the buffer. Sending a scalar becomes:
// sendscalar.cxx
if (comm_world.rank()==0) {
double pi=3.14;
comm_world.send(pi, 1); // send to rank 1
cout << "sent: " << pi << '\n';
} else if (comm_world.rank()==1) {
double pi=0;
comm_world.recv(pi, 0); // receive from rank 0
cout << "got : " << pi << '\n';
}

(See also note 10.)


MPL note 25: Sending arrays. MPL can send static arrays without further layout specification:
// sendarray.cxx
double v[2][2][2];
comm_world.send(v, 1); // send to rank 1
comm_world.recv(v, 0); // receive from rank 0

Sending vectors uses a general mechanism:


// sendbuffer.cxx
std::vector<double> v(8);
mpl::contiguous_layout<double> v_layout(v.size());
comm_world.send(v.data(), v_layout, 1); // send to rank 1
comm_world.recv(v.data(), v_layout, 0); // receive from rank 0

(See also note 11.)


MPL note 26: Iterator layout. Noncontiguous iterable objects can be sent with an iterator_layout:
std::list<int> v(20, 0);
mpl::iterator_layout<int> l(v.begin(), v.end());
comm_world.recv(&(*v.begin()), l, 0);

(See also note 12.)

4.1.3 Receive call


The basic blocking receive command is MPI_Recv (figure 4.2).
An example:
double recv_data;
MPI_Recv
( /* recv buffer/count/type: */ &recv_data,1,MPI_DOUBLE,
/* from: */ sender, /* tag: */ 0,
/* communicator: */ comm,
/* recv status: */ MPI_STATUS_IGNORE);

This is similar in structure to the send call, with some exceptions.

Buffer The receive buffer has the same buffer/count/data parameters as the send call. However, the
count argument here indicates the size of the buffer, rather than the actual length of a message. This sets
an upper bound on the length of the incoming message.
• For receiving messages with unknown length, use MPI_Probe; section 4.4.1.
• A message longer than the buffer size will give an overflow error, either returning an error, or
ending your program; see section 15.2.2.
The length of the received message can be determined from the status object; see section 4.3 for more
detail.
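As a minimal sketch of the over-allocation strategy (the buffer size MAXLEN and the variables sender and comm are assumptions of this illustration):
#define MAXLEN 1000  /* illustrative upper bound on the message length */
double buffer[MAXLEN]; MPI_Status status; int actual_count;
MPI_Recv(buffer,MAXLEN,MPI_DOUBLE,sender,0,comm,&status);
MPI_Get_count(&status,MPI_DOUBLE,&actual_count);
/* only the first 'actual_count' entries of 'buffer' contain received data */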

Source Mirroring the target argument of the MPI_Send call, MPI_Recv has a message source argument. This
can be either a specific rank, or it can be the MPI_ANY_SOURCE wildcard. In the latter case, the actual source
can be determined after the message has been received; see section 4.3. A source value of MPI_PROC_NULL
is also allowed, which makes the receive succeed immediately with no data received.
MPL note 27: Any source. The constant mpl::any_source equals MPI_ANY_SOURCE (by constexpr).


Figure 4.2 MPI_Recv


Param name   Explanation                               C type            F type                    inout
MPI_Recv ( / MPI_Recv_c (
   buf        initial address of receive buffer         void*             TYPE(*), DIMENSION(..)    OUT
   count      number of elements in receive buffer      int / MPI_Count   INTEGER                   IN
   datatype   datatype of each receive buffer element   MPI_Datatype      TYPE(MPI_Datatype)        IN
   source     rank of source or MPI_ANY_SOURCE          int               INTEGER                   IN
   tag        message tag or MPI_ANY_TAG                int               INTEGER                   IN
   comm       communicator                              MPI_Comm          TYPE(MPI_Comm)            IN
   status     status object                             MPI_Status*       TYPE(MPI_Status)          OUT
)
MPL:

template<typename T >
status mpl::communicator::recv
( T &,int,tag = tag(0) ) const inline
( T *,const layout< T > &,int,tag = tag(0) ) const
( iterT begin,iterT end,int source, tag t = tag(0) ) const
Python:

Comm.Recv(self, buf, int source=ANY_SOURCE, int tag=ANY_TAG, Status status=None)
Python native:
recvbuf = Comm.recv(self, buf=None, int source=ANY_SOURCE, int tag=ANY_TAG,
Status status=None)

Tag Similar to the message source, the message tag of a receive call can be a specific value or a wildcard,
in this case MPI_ANY_TAG.
Python note 13: Message tags. Python calls sensibly use a default tag=0, but you can specify your own tag
value. On the receive call, the tag wildcard is MPI.ANY_TAG.

Communicator The communicator argument almost goes without remarking.

Status The MPI_Recv command has one parameter that the send call lacks: the MPI_Status object, de-
scribing the message status. This gives information about the message received, for instance if you used
wildcards for source or tag. See section 4.3 for more about the status object.

Remark 9 If you’re not interested in the status, as is the case in many examples in this book, you can specify
the constant MPI_STATUS_IGNORE. Note that the signature of MPI_Recv lists the status parameter as ‘output’; this
‘direction’ of the parameter of course only applies if you do not specify this constant.

Exercise 4.1. Implement the ping-pong program. Add a timer using MPI_Wtime. For the
status argument of the receive call, use MPI_STATUS_IGNORE.
• Run multiple ping-pongs (say a thousand) and put the timer around the loop.
The first run may take longer; try to discard it.
• Run your code with the two communicating processes first on the same node,
then on different nodes. Do you see a difference?
• Then modify the program to use longer messages. How does the timing
increase with message size?
For bonus points, can you do a regression to determine 𝛼, 𝛽?
(There is a skeleton for this exercise under the name pingpong.)
Exercise 4.2. Take your pingpong program and modify it to let half the processors be
source and the other half the targets. Does the pingpong time increase? Does the
observed behavior depend on how you choose the two sets?

4.1.4 Problems with blocking communication


You may be tempted to think that the send call puts the data somewhere in the network, and the sending
code can progress after this call, as in figure 4.3, left. But this ideal scenario is not realistic: it assumes that

Figure 4.3: Illustration of an ideal (left) and actual (right) send-receive interaction

somewhere in the network there is buffer capacity for all messages that are in transit. This is not the case:
data resides on the sender, and the sending call blocks until the receiver has received all of it. (There is an
exception for small messages, as explained in the next section.)
The use of MPI_Send and MPI_Recv is known as blocking communication: when your code reaches a send or
receive call, it blocks until the call is successfully completed. Technically, blocking operations are called
non-local since their execution depends on factors that are not local to the process. See section 5.4.

4.1.4.1 Deadlock
Suppose two processes need to exchange data, and consider the following pseudo-code, which purports to
exchange data between processes 0 and 1:


other = 1-mytid; /* if I am 0, other is 1; and vice versa */
receive(source=other);
send(target=other);

Imagine that the two processes execute this code. They both issue the receive call… and then can’t go on,
because they are both waiting for the other to issue the send call corresponding to their receive call. This
is known as deadlock.

4.1.4.2 Eager vs rendezvous protocol


Messages can be sent using (at least) two different protocols:
1. Rendezvous protocol, and
2. Eager protocol.
The rendezvous protocol is the most general. Sending a message takes several steps:
1. the sender sends a header, typically containing the message envelope: metadata describing the
message;
2. the receiver returns a ‘ready-to-send’ message;
3. the sender sends the actual data.
The purpose of this is to prepare the receiver buffer space for large messages. However, it implies that
the sender has to wait for some return message from the receiver, making the behavior that of a synchronous
send.
For the eager protocol, consider the example:
other = 1-mytid; /* if I am 0, other is 1; and vice versa */
send(target=other);
receive(source=other);

With a synchronous protocol you should get deadlock, since the send calls will be waiting for the receive
operation to be posted.
In practice, however, this code will often work. The reason is that MPI implementations sometimes send
small messages regardless of whether the receive has been posted. This is known as an eager send, and it
relies on the availability of some amount of available buffer space. The size under which this behavior is
used is sometimes referred to as the eager limit.
To illustrate eager and blocking behavior in MPI_Send, consider an example where we send gradually larger
messages. From the screen output you can see what the largest message was that fell under the eager limit;
after that the code hangs because of a deadlock.
// sendblock.c
other = 1-procno;
/* loop over increasingly large messages */
for (int size=1; size<2000000000; size*=10) {
sendbuf = (int*) malloc(size*sizeof(int));
recvbuf = (int*) malloc(size*sizeof(int));
if (!sendbuf || !recvbuf) {
printf("Out of memory\n"); MPI_Abort(comm,1);

}
MPI_Send(sendbuf,size,MPI_INT,other,0,comm);
MPI_Recv(recvbuf,size,MPI_INT,other,0,comm,&status);
/* If control reaches this point, the send call
did not block. If the send call blocks,
we do not reach this point, and the program will hang.
*/
if (procno==0)
printf("Send did not block for size %d\n",size);
free(sendbuf); free(recvbuf);
}

!! sendblock.F90
other = 1-mytid
size = 1
do
allocate(sendbuf(size)); allocate(recvbuf(size))
print *,size
call MPI_Send(sendbuf,size,MPI_INTEGER,other,0,comm,err)
call MPI_Recv(recvbuf,size,MPI_INTEGER,other,0,comm,status,err)
if (mytid==0) then
print *,"MPI_Send did not block for size",size
end if
deallocate(sendbuf); deallocate(recvbuf)
size = size*10
if (size>2000000000) goto 20
end do
20 continue

## sendblock.py
size = 1
while size<2000000000:
sendbuf = np.empty(size, dtype=int)
recvbuf = np.empty(size, dtype=int)
comm.Send(sendbuf,dest=other)
comm.Recv(recvbuf,source=other)
if procid<other:
print("Send did not block for",size)
size *= 10

If you want a code to exhibit the same blocking behavior for all message sizes, you force the send call
to be blocking by using MPI_Ssend, which has the same calling sequence as MPI_Send, but which does not
allow eager sends.
// ssendblock.c
other = 1-procno;
sendbuf = (int*) malloc(sizeof(int));
recvbuf = (int*) malloc(sizeof(int));
size = 1;
MPI_Ssend(sendbuf,size,MPI_INT,other,0,comm);
MPI_Recv(recvbuf,size,MPI_INT,other,0,comm,&status);
printf("This statement is not reached\n");


Formally you can describe deadlock as follows. Draw up a graph where every process is a node, and draw
a directed arc from process A to B if A is waiting for B. There is deadlock if this directed graph has a loop.
The solution to the deadlock in the above example is to first do the send from 0 to 1, and then from 1 to 0
(or the other way around). So the code would look like:
if ( /* I am processor 0 */ ) {
send(target=other);
receive(source=other);
} else {
receive(source=other);
send(target=other);
}

Eager sends also influence nonblocking sends. The wait call after a nonblocking send will return
immediately, regardless of any receive call, if the message is under the eager limit:
Code:
// eageri.c
printf("Sending %lu elements\n",n);
MPI_Request request;
MPI_Isend(buffer,n,MPI_DOUBLE,processB,0,comm,&request);
MPI_Wait(&request,MPI_STATUS_IGNORE);
printf(".. concluded\n");

Output:
Setting eager limit to 5000 bytes
TACC: Starting up job 4049189
TACC: Starting parallel tasks...
Sending 1 elements
.. concluded
Sending 10 elements
.. concluded
Sending 100 elements
.. concluded
Sending 1000 elements
^C[[email protected]] Sending Ctrl-C to processes as requested

The eager limit is implementation-specific. For instance, for Intel MPI there is a variable
I_MPI_EAGER_THRESHOLD (old versions) or I_MPI_SHM_EAGER_THRESHOLD; for mvapich2 it is
MV2_IBA_EAGER_THRESHOLD, and for OpenMPI the --mca options btl_openib_eager_limit and
btl_openib_rndv_eager_limit.

4.1.4.3 Serialization
There is a second, even more subtle problem with blocking communication. Consider the scenario where
every processor needs to pass data to its successor, that is, the processor with the next higher rank. The
basic idea would be to first send to your successor, then receive from your predecessor. Since the last
processor does not have a successor it skips the send, and likewise the first processor skips the receive.
The pseudo-code looks like:
successor = mytid+1; predecessor = mytid-1;
if ( /* I am not the last processor */ )
send(target=successor);


if ( /* I am not the first processor */ )
receive(source=predecessor)

Exercise 4.3. (Classroom exercise) Each student holds a piece of paper in the right hand
– keep your left hand behind your back – and we want to execute:
1. Give the paper to your right neighbor;
2. Accept the paper from your left neighbor.
Including boundary conditions for first and last process, that becomes the following
program:
1. If you are not the rightmost student, turn to the right and give the paper to
your right neighbor.
2. If you are not the leftmost student, turn to your left and accept the paper from
your left neighbor.
This code does not deadlock. All processors but the last one block on the send call, but the last processor
executes the receive call. Thus, the processor before the last one can do its send, and subsequently continue
to its receive, which enables another send, et cetera.
In one way this code does what you intended to do: it will terminate (instead of hanging forever on a
deadlock) and exchange data the right way. However, the execution now suffers from unexpected serial-
ization: only one processor is active at any time, so what should have been a parallel operation becomes

Figure 4.4: Trace of a simple send-recv code

a sequential one. This is illustrated in figure 4.4.


Exercise 4.4. Implement the above algorithm using MPI_Send and MPI_Recv calls. Run the
code, and use TAU to reproduce the trace output of figure 4.4. If you don’t have
TAU, can you show this serialization behavior using timings, for instance running it
on an increasing number of processes?
(There is a skeleton for this exercise under the name rightsend.)


It is possible to orchestrate your processes to get an efficient and deadlock-free execution, but doing so is
a bit cumbersome.
Exercise 4.5. The above solution treated every processor equally. Can you come up with a
solution that uses blocking sends and receives, but does not suffer from the
serialization behavior?
There are better solutions which we will explore in the next section.

4.1.5 Bucket brigade


The problem with the previous exercise was that an operation that was conceptually parallel became serial
in execution. On the other hand, sometimes the operation is actually serial in nature. One example is the
bucket brigade operation, where a piece of data is successively passed down a sequence of processors.
Exercise 4.6. Take the code of exercise 4.4 and modify it so that the data from process zero
gets propagated to every process. Specifically, compute all partial sums ∑_{𝑖=0}^{𝑝} 𝑖²:

𝑥0 = 1 on process zero
𝑥𝑝 = 𝑥𝑝−1 + (𝑝 + 1)² on process 𝑝

Use MPI_Send and MPI_Recv; make sure to get the order right.
Food for thought: all quantities involved here are integers. Is it a good idea to use
the integer datatype here?
(There is a skeleton for this exercise under the name bucketblock.)

Remark 10 There is an MPI_Scan routine (section 3.4) that performs the same computation, but computa-
tionally more efficiently. Thus, this exercise only serves to illustrate the principle.
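For comparison, a minimal sketch of the MPI_Scan version of this computation (assuming procno and comm as in the other examples; the exercise itself should of course be done with send and receive calls):
// each process contributes one term; the inclusive scan forms the running sum
int myterm = (procno==0) ? 1 : (procno+1)*(procno+1);
int partial_sum;
MPI_Scan(&myterm,&partial_sum,1,MPI_INT,MPI_SUM,comm);
// partial_sum now equals x_p of the recurrence above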

4.1.6 Pairwise exchange


Above you saw that with blocking sends the precise ordering of the send and receive calls is crucial. Use
the wrong ordering and you get either deadlock, or something that is not efficient at all in parallel. MPI has
a way out of this problem that is sufficient for many purposes: the combined send/recv call MPI_Sendrecv
(figure 4.3).
The sendrecv call works great if every process is paired with precisely one sender and one receiver. You
would then write
sendrecv( ....from... ...to... );

with the right choice of source and destination. For instance, to send data to your right neighbor:
MPI_Comm_rank(comm,&procno);
MPI_Sendrecv( ....
/* from: */ procno-1
... ...
/* to: */ procno+1
... );


Figure 4.3 MPI_Sendrecv


Param name    Explanation                            C type            F type                    inout
MPI_Sendrecv ( / MPI_Sendrecv_c (
   sendbuf     initial address of send buffer         const void*       TYPE(*), DIMENSION(..)    IN
   sendcount   number of elements in send buffer      int / MPI_Count   INTEGER                   IN
   sendtype    type of elements in send buffer        MPI_Datatype      TYPE(MPI_Datatype)        IN
   dest        rank of destination                    int               INTEGER                   IN
   sendtag     send tag                               int               INTEGER                   IN
   recvbuf     initial address of receive buffer      void*             TYPE(*), DIMENSION(..)    OUT
   recvcount   number of elements in receive buffer   int / MPI_Count   INTEGER                   IN
   recvtype    type of elements in receive buffer     MPI_Datatype      TYPE(MPI_Datatype)        IN
   source      rank of source or MPI_ANY_SOURCE       int               INTEGER                   IN
   recvtag     receive tag or MPI_ANY_TAG             int               INTEGER                   IN
   comm        communicator                           MPI_Comm          TYPE(MPI_Comm)            IN
   status      status object                          MPI_Status*       TYPE(MPI_Status)          OUT
)
MPL:

template<typename T >
status mpl::communicator::sendrecv
( const T & senddata, int dest, tag sendtag,
T & recvdata, int source, tag recvtag
) const
( const T * senddata, const layout< T > & sendl, int dest, tag sendtag,
T * recvdata, const layout< T > & recvl, int source, tag recvtag
) const
( iterT1 begin1, iterT1 end1, int dest, tag sendtag,
iterT2 begin2, iterT2 end2, int source, tag recvtag
) const
Python:

Sendrecv(self,
sendbuf, int dest, int sendtag=0,
recvbuf=None, int source=ANY_SOURCE, int recvtag=ANY_TAG,
Status status=None)


This scheme is correct for all processes but the first and last. In order to use the sendrecv call on these
processes, we use MPI_PROC_NULL for the non-existing processes that the endpoints communicate with.
MPI_Comm_rank( .... &procno );
if ( /* I am not the first processor */ )
predecessor = procno-1;
else
predecessor = MPI_PROC_NULL;
if ( /* I am not the last processor */ )
successor = procno+1;
else
successor = MPI_PROC_NULL;
sendrecv(from=predecessor,to=successor);

where the sendrecv call is executed by all processors.


All processors but the last one send to their neighbor; the target value of MPI_PROC_NULL for the last pro-
cessor means a ‘send to the null processor’: no actual send is done.
Likewise, receiving from MPI_PROC_NULL succeeds without altering the receive buffer. The corresponding
MPI_Status object has source MPI_PROC_NULL, tag MPI_ANY_TAG, and count zero.
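In actual MPI calls, a minimal sketch of this right-shift could look as follows (the variable names mydata and leftdata are illustrative):
int predecessor = (procno==0)        ? MPI_PROC_NULL : procno-1;
int successor   = (procno==nprocs-1) ? MPI_PROC_NULL : procno+1;
double mydata=1., leftdata=0.;
MPI_Sendrecv( /* send: */ &mydata,1,MPI_DOUBLE,successor,0,
              /* recv: */ &leftdata,1,MPI_DOUBLE,predecessor,0,
              comm,MPI_STATUS_IGNORE );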

Remark 11 The MPI_Sendrecv can inter-operate with the normal send and receive calls, both blocking and
non-blocking. Thus it would also be possible to replace the MPI_Sendrecv calls at the end points by simple sends
or receives.

MPL note 28: Send-recv call. The send-recv call in MPL has the same possibilities for specifying the send
and receive buffer as the separate send and recv calls: scalar, layout, iterator. However, out of the
nine conceivably possible routine signatures, only those versions are available where the send and
receive buffers are specified the same way. Also, the send and receive tags need to be specified;
they do not have default values.
// sendrecv.cxx
mpl::tag t0(0);
comm_world.sendrecv
( mydata,sendto,t0,
leftdata,recvfrom,t0 );

Exercise 4.7. Revisit exercise 4.3 and solve it using MPI_Sendrecv.


If you have TAU installed, make a trace. Does it look different from the serialized
send/recv code? If you don’t have TAU, run your code with different numbers of
processes and show that the runtime is essentially constant.
This call makes it easy to exchange data between two processors: both specify the other as both target and
source. However, there need not be any such relation between target and source: it is possible to receive
from a predecessor in some ordering, and send to a successor in that ordering; see figure 4.5.
For the above three-point combination scheme you need to move data both left and right, so you need two
MPI_Sendrecv calls; see figure 4.6.


Figure 4.5: An MPI Sendrecv call

Figure 4.6: Two steps of send/recv to do a three-point combination



Exercise 4.8. Implement the above three-point combination scheme using MPI_Sendrecv;
every processor only has a single number to send to its neighbor.
(There is a skeleton for this exercise under the name sendrecv.)
Hints for this exercise:
• Each process does one send and one receive; if a process needs to skip one or the other, you can
specify MPI_PROC_NULL as the other process in the send or receive specification. In that case the
corresponding action is not taken.
• As with the simple send/recv calls, processes have to match up: if process 𝑝 specifies 𝑝 ′ as the
destination of the send part of the call, 𝑝 ′ needs to specify 𝑝 as the source of the recv part.
The following exercise lets you implement a sorting algorithm with the send-receive call.¹
Exercise 4.9. A very simple sorting algorithm is swap sort or odd-even transposition sort:
pairs of processors compare data, and if necessary exchange. The elementary step is
called a compare-and-swap: in a pair of processors each sends their data to the other;
one keeps the minimum values, and the other the maximum. For simplicity, in this
exercise we give each processor just a single number.
The transposition sort algorithm is split in even and odd stages, where in the even
stage processors 2𝑖 and 2𝑖 + 1 compare and swap data, and in the odd stage

1. There is an MPI_Compare_and_swap call. Do not use that.


Figure 4.7: Odd-even transposition sort on 4 elements.

processors 2𝑖 + 1 and 2𝑖 + 2 compare and swap. You need to repeat this 𝑃/2 times,
where 𝑃 is the number of processors; see figure 4.7.
Implement this algorithm using MPI_Sendrecv. (Use MPI_PROC_NULL for the edge cases
if needed.) Use a gather call to print the global state of the distributed array at the
beginning and end of the sorting process.

Figure 4.8: Odd-even transposition sort on 4 processes, holding 2 elements each.

Remark 12 It is not possible to use MPI_IN_PLACE for the buffers, as in section 3.3.2. Instead, the routine
MPI_Sendrecv_replace (figure 4.4) has only one buffer, used as both send and receive buffer. Of course, this
requires the send and receive messages to fit in that one buffer.
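A minimal sketch of such a single-buffer exchange (the variable exchange_partner is an assumption of this illustration) might be:
double value = (double)procno;
MPI_Sendrecv_replace
  ( &value,1,MPI_DOUBLE,
    /* dest,sendtag:   */ exchange_partner,0,
    /* source,recvtag: */ exchange_partner,0,
    comm,MPI_STATUS_IGNORE );
// 'value' now holds the partner's number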

Exercise 4.10. Extend this exercise to the case where each process holds an equal number of
elements, more than 1. Consider figure 4.8 for inspiration. Is it coincidence that the
algorithm takes the same number of steps as in the single scalar case?
The following material is for the recently released MPI-4 standard and may not be supported yet.


Figure 4.4 MPI_Sendrecv_replace


Param name   Explanation                                     C type            F type                    inout
MPI_Sendrecv_replace ( / MPI_Sendrecv_replace_c (
   buf        initial address of send and receive buffer      void*             TYPE(*), DIMENSION(..)    INOUT
   count      number of elements in send and receive buffer   int / MPI_Count   INTEGER                   IN
   datatype   type of elements in send and receive buffer     MPI_Datatype      TYPE(MPI_Datatype)        IN
   dest       rank of destination                             int               INTEGER                   IN
   sendtag    send message tag                                int               INTEGER                   IN
   source     rank of source or MPI_ANY_SOURCE                int               INTEGER                   IN
   recvtag    receive message tag or MPI_ANY_TAG              int               INTEGER                   IN
   comm       communicator                                    MPI_Comm          TYPE(MPI_Comm)            IN
   status     status object                                   MPI_Status*       TYPE(MPI_Status)          OUT
)

There are non-blocking and persistent versions of MPI_Sendrecv: MPI_Isendrecv, MPI_Sendrecv_init,
MPI_Isendrecv_replace, and MPI_Sendrecv_replace_init.
End of MPI-4 material

4.2 Nonblocking point-to-point operations


The structure of communication is often a reflection of the structure of the operation. With some regular
applications we also get a regular communication pattern. Consider again the above operation:

𝑦𝑖 = 𝑥𝑖−1 + 𝑥𝑖 + 𝑥𝑖+1 ∶ 𝑖 = 1, … , 𝑁 − 1

Doing this in parallel induces communication, as pictured in figure 4.1.


We note:
• The data is one-dimensional, and we have a linear ordering of the processors.
• The operation involves neighboring data points, and we communicate with neighboring pro-
cessors.
Above you saw how you can use information exchange between pairs of processors
• using MPI_Send and MPI_Recv, if you are careful; or
• using MPI_Sendrecv, as long as there is indeed some sort of pairing of processors.
However, there are circumstances where it is not possible, not efficient, or simply not convenient, to have
such a deterministic setup of the send and receive calls. Figure 4.9 illustrates such a case, where processors


Figure 4.9: Processors with unbalanced send/receive patterns

are organized in a general graph pattern. Here, the numbers of sends and receives of a processor need
not match.
In such cases, one wants a possibility to state ‘these are the expected incoming messages’, without having
to wait for them in sequence. Likewise, one wants to declare the outgoing messages without having to do
them in any particular sequence. Imposing any sequence on the sends and receives is likely to run into
the serialization behavior observed above, or at least be inefficient.

4.2.1 Nonblocking send and receive calls


In the previous section you saw that blocking communication makes programming tricky if you want
to avoid deadlock and performance problems. The main advantage of these routines is that you have full
control about where the data is: if the send call returns the data has been successfully received, and the
send buffer can be used for other purposes or de-allocated.

Figure 4.10: Nonblocking send

By contrast, the nonblocking calls MPI_Isend (figure 4.5) and MPI_Irecv (figure 4.6) (where the ‘I’ stands
for ‘immediate’ or ‘incomplete’ ) do not wait for their counterpart: in effect they tell the runtime system
‘here is some data and please send it as follows’ or ‘here is some buffer space, and expect such-and-such
data to come’. This is illustrated in figure 4.10.
// isendandirecv.c


Figure 4.5 MPI_Isend


Param name   Explanation                            C type            F type                    inout
MPI_Isend ( / MPI_Isend_c (
   buf        initial address of send buffer         const void*       TYPE(*), DIMENSION(..)    IN
   count      number of elements in send buffer      int / MPI_Count   INTEGER                   IN
   datatype   datatype of each send buffer element   MPI_Datatype      TYPE(MPI_Datatype)        IN
   dest       rank of destination                    int               INTEGER                   IN
   tag        message tag                            int               INTEGER                   IN
   comm       communicator                           MPI_Comm          TYPE(MPI_Comm)            IN
   request    communication request                  MPI_Request*      TYPE(MPI_Request)         OUT
)
MPL:

template<typename T >
irequest mpl::communicator::isend
( const T & data, int dest, tag t = tag(0) ) const;
( const T * data, const layout< T > & l, int dest, tag t = tag(0) ) const;
( iterT begin, iterT end, int dest, tag t = tag(0) ) const;
Python:

request = MPI.Comm.Isend(self, buf, int dest, int tag=0)

double send_data = 1.;


MPI_Request request;
MPI_Isend
( /* send buffer/count/type: */ &send_data,1,MPI_DOUBLE,
/* to: */ receiver, /* tag: */ 0,
/* communicator: */ comm,
/* request: */ &request);
MPI_Wait(&request,MPI_STATUS_IGNORE);

double recv_data;
MPI_Request request;
MPI_Irecv
( /* recv buffer/count/type: */ &recv_data,1,MPI_DOUBLE,
/* from: */ sender, /* tag: */ 0,
/* communicator: */ comm,
/* request: */ &request);
MPI_Wait(&request,MPI_STATUS_IGNORE);

Issuing the MPI_Isend / MPI_Irecv call is sometimes referred to as posting a send/receive.


Figure 4.6 MPI_Irecv


Param name   Explanation                               C type            F type                    inout
MPI_Irecv ( / MPI_Irecv_c (
   buf        initial address of receive buffer         void*             TYPE(*), DIMENSION(..)    OUT
   count      number of elements in receive buffer      int / MPI_Count   INTEGER                   IN
   datatype   datatype of each receive buffer element   MPI_Datatype      TYPE(MPI_Datatype)        IN
   source     rank of source or MPI_ANY_SOURCE          int               INTEGER                   IN
   tag        message tag or MPI_ANY_TAG                int               INTEGER                   IN
   comm       communicator                              MPI_Comm          TYPE(MPI_Comm)            IN
   request    communication request                     MPI_Request*      TYPE(MPI_Request)         OUT
)
MPL:

template<typename T >
irequest mpl::communicator::irecv
( const T & data, int src, tag t = tag(0) ) const;
( const T * data, const layout< T > & l, int src, tag t = tag(0) ) const;
( iterT begin, iterT end, int src, tag t = tag(0) ) const;
Python:

recvbuf = Comm.irecv(self, buf=None, int source=ANY_SOURCE, int tag=ANY_TAG, Request request=None)
Python numpy:
Comm.Irecv(self, buf, int source=ANY_SOURCE, int tag=ANY_TAG,
Request status=None)

4.2.2 Request completion: wait calls


From the definition of MPI_Isend / MPI_Irecv, you have seen that a nonblocking routine yields an MPI_Request
object. This request can then be used to query whether the operation has concluded. You may also notice
that the MPI_Irecv routine does not yield an MPI_Status object. This makes sense: the status object describes
the actually received data, and at the completion of the MPI_Irecv call there is no received data yet.
Waiting for the request is done with a number of routines. We first consider MPI_Wait (figure 4.7). It takes
the request as input, and gives an MPI_Status as output. If you don’t need the status object, you can pass
MPI_STATUS_IGNORE.
// hangwait.c
if (procno==sender) {
for (int p=0; p<nprocs-1; p++) {
double send = 1.;
MPI_Send( &send,1,MPI_DOUBLE,p,0,comm);
}


Figure 4.7 MPI_Wait


Param name   Explanation     C type         F type               inout
MPI_Wait (
   request    request         MPI_Request*   TYPE(MPI_Request)    INOUT
   status     status object   MPI_Status*    TYPE(MPI_Status)     OUT
)
Python:

MPI.Request.Wait(type cls, request, status=None)

} else {
double recv=0.;
MPI_Request request;
MPI_Irecv( &recv,1,MPI_DOUBLE,sender,0,comm,&request);
do_some_work();
MPI_Wait(&request,MPI_STATUS_IGNORE);
}

(Note that this example uses a mix of blocking and non-blocking operations: a blocking send is paired
with a non-blocking receive.)
The request is passed by reference, so that the wait routine can free it:
• The wait call deallocates the request object, and
• sets the value of the variable to MPI_REQUEST_NULL.
(See section 4.2.4 for details.)
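A minimal sketch of this behavior, reusing recv_data, sender, and comm from the earlier receive example:
MPI_Request request;
MPI_Irecv( &recv_data,1,MPI_DOUBLE,sender,0,comm,&request );
/* ... */
MPI_Wait( &request,MPI_STATUS_IGNORE );
// after the wait the request object has been freed,
// and the variable now equals MPI_REQUEST_NULL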
MPL note 29: Requests from nonblocking calls. Nonblocking routines have an irequest as function result.
Note: not a parameter passed by reference, as in the C interface. The various wait calls are
methods of the irequest class.
double recv_data;
mpl::irequest recv_request =
comm_world.irecv( recv_data,sender );
recv_request.wait();

You can not default-construct the request variable:


// DOES NOT COMPILE:
mpl::irequest recv_request;
recv_request = comm.irecv( ... );

This means that the normal sequence of first declaring, and then filling in, the request variable
is not possible.
MPL implementation note: The wait call always returns a status object; not assigning
it means that the destructor is called on it.
Now we discuss in some detail the various wait calls. These are blocking; for the nonblocking versions
see section 4.2.3.


Figure 4.8 MPI_Waitall


Param name           Explanation               C type                  F type                      inout
MPI_Waitall (
   count              list length               int                     INTEGER                     IN
   array_of_requests  array of requests         MPI_Request[] (count)   TYPE(MPI_Request) (count)   INOUT
   array_of_statuses  array of status objects   MPI_Status[] (*)        TYPE(MPI_Status) (*)        OUT
)
Python:

MPI.Request.Waitall(type cls, requests, statuses=None)

4.2.2.1 Wait for one request


MPI_Wait waits for a single request. If you are indeed waiting for a single nonblocking communication
to complete, this is the right routine. If you are waiting for multiple requests you could call this routine
in a loop.
for (p=0; p<nrequests ; p++) // Not efficient!
MPI_Wait(&request[p],&(status[p]));

However, this would be inefficient if the first request is fulfilled much later than the others: your waiting
process would have lots of idle time. In that case, use one of the following routines.

4.2.2.2 Wait for all requests


MPI_Waitall(figure 4.8) allows you to wait for a number of requests, and it does not matter in what se-
quence they are satisfied. Using this routine is easier to code than the loop above, and it could be more
efficient.
// irecvloop.c
MPI_Request *requests =
  (MPI_Request*) malloc( 2*nprocs*sizeof(MPI_Request) );
recv_buffer = (int*) malloc( nprocs*sizeof(int) );
send_buffer = (int*) malloc( nprocs*sizeof(int) );
for (int p=0; p<nprocs; p++) {
  int
    left_p = (p-1+nprocs) % nprocs,
    right_p = (p+1) % nprocs;
  send_buffer[p] = nprocs-p;
  MPI_Isend(send_buffer+p,1,MPI_INT, right_p,0,comm, requests+2*p);
  MPI_Irecv(recv_buffer+p,1,MPI_INT, left_p,0,comm, requests+2*p+1);
}
/* your useful code here */
MPI_Waitall(2*nprocs,requests,MPI_STATUSES_IGNORE);

The output argument is an array of MPI_Status objects. If you don’t need the status objects, you can pass
MPI_STATUSES_IGNORE.


As an illustration, we realize exercise 4.4, and its trace in figure 4.4, with non-blocking execution and
MPI_Waitall. Figure 4.11 shows the trace of this variant of the code.

Figure 4.11: A trace of a nonblocking send between neighboring processors


Exercise 4.11. Revisit exercise 4.6 and consider replacing the blocking calls by nonblocking
ones. How far apart can you put the MPI_Isend / MPI_Irecv calls and the
corresponding MPI_Waits?
(There is a skeleton for this exercise under the name bucketpipenonblock.)
Exercise 4.12. Create two distributed arrays of positive integers. Take the set difference of
the two: the first array needs to be transformed to remove from it those numbers
that are in the second array.
How could you solve this with an MPI_Allgather call? Why is it not a good idea to do
so? Solve this exercise instead with a circular bucket brigade algorithm.
(There is a skeleton for this exercise under the name setdiff.)
Python note 14: Handling a single request. Non-blocking routines such as MPI_Isend return a request ob-
ject. The MPI_Wait is a class method, not a method of the request object:
## irecvsingle.py
sendbuffer = np.empty( nprocs, dtype=int )
recvbuffer = np.empty( nprocs, dtype=int )

left_p = (procid-1) % nprocs


right_p = (procid+1) % nprocs
send_request = comm.Isend\
( sendbuffer[procid:procid+1],dest=left_p)
recv_request = comm.Irecv\
( sendbuffer[procid:procid+1],source=right_p)
MPI.Request.Wait(send_request)
MPI.Request.Wait(recv_request)

Python note 15: Request arrays. An array of requests (for the waitall/some/any calls) is an ordinary Python


Figure 4.9 MPI_Waitany


Param name           Explanation                                    C type                  F type                      inout
MPI_Waitany (
   count              list length                                    int                     INTEGER                     IN
   array_of_requests  array of requests                              MPI_Request[] (count)   TYPE(MPI_Request) (count)   INOUT
   index              index of handle for operation that completed   int*                    INTEGER                     OUT
   status             status object                                  MPI_Status*             TYPE(MPI_Status)            OUT
)
Python:

MPI.Request.Waitany( requests,status=None )
class method, returns index

list:
## irecvloop.py
requests = []
sendbuffer = np.empty( nprocs, dtype=int )
recvbuffer = np.empty( nprocs, dtype=int )

for p in range(nprocs):
left_p = (p-1) % nprocs
right_p = (p+1) % nprocs
requests.append( comm.Isend\
( sendbuffer[p:p+1],dest=left_p) )
requests.append( comm.Irecv\
( recvbuffer[p:p+1],source=right_p) )
MPI.Request.Waitall(requests)

The MPI_Waitall method is again a class method.

4.2.2.3 Wait for any requests


The ‘waitall’ routine is good if you need all nonblocking communications to be finished before you can
proceed with the rest of the program. However, sometimes it is possible to take action as each request is
satisfied. In that case you could use MPI_Waitany (figure 4.9) and write:
for (p=0; p<nrequests; p++) {
  MPI_Irecv(buffer+p, /* ... */, requests+p);
}
for (p=0; p<nrequests; p++) {
  MPI_Waitany(nrequests,requests,&index,&status);
  // operate on buffer[index]
}

Note that this routine takes a single status argument, passed by reference, and not an array of statuses!


Fortran note 7: Index of requests. The index parameter is the index in the array of requests, which is a
Fortran array, so it uses 1-based indexing.
!! irecvsource.F90
if (mytid==ntids-1) then
do p=1,ntids-1
print *,"post"
call MPI_Irecv(recv_buffer(p),1,MPI_INTEGER,p-1,0,comm,&
requests(p),err)
end do
do p=1,ntids-1
call MPI_Waitany(ntids-1,requests,index,MPI_STATUS_IGNORE,err)
write(*,'("Message from",i3,":",i5)') index,recv_buffer(index)
end do

!! waitnull.F90
Type(MPI_Request),dimension(:),allocatable :: requests
allocate(requests(ntids-1))
call MPI_Waitany(ntids-1,requests,index,MPI_STATUS_IGNORE)
if ( .not. requests(index)==MPI_REQUEST_NULL) then
print *,"This request should be null:",index


MPL note 30: Request pools. Instead of an array of requests, use an irequest_pool object, which acts like a
vector of requests, meaning that you can push onto it.
// irecvsource.cxx
mpl::irequest_pool recv_requests;
for (int p=0; p<nprocs-1; p++) {
recv_requests.push( comm_world.irecv( recv_buffer[p], p ) );
}

You can not declare a pool of a fixed size and assign elements.
MPL note 31: Wait any. The irequest_pool class has methods waitany, waitall, testany, testall, waitsome,
testsome.

The ‘any’ methods return a std::pair<bool,size_t>, where false means index==MPI_UNDEFINED,
that is, there are no more requests to be satisfied.
auto [success,index] = recv_requests.waitany();
if (success) {
auto recv_status = recv_requests.get_status(index);

Same for testany, then false means no requests test true.


MPL note 32: Request handling.


auto [success,index] = recv_requests.waitany();
if (success) {
  auto recv_status = recv_requests.get_status(index);

4.2.2.4 Polling with MPI_Waitany


The MPI_Waitany routine can be used to implement polling: occasionally check for incoming messages
while other work is going on.
Code:
// irecvsource.c
if (procno==nprocs-1) {
  int *recv_buffer;
  MPI_Request *request; MPI_Status status;
  recv_buffer = (int*) malloc((nprocs-1)*sizeof(int));
  request = (MPI_Request*) malloc((nprocs-1)*sizeof(MPI_Request));
  for (int p=0; p<nprocs-1; p++) {
    ierr = MPI_Irecv(recv_buffer+p,1,MPI_INT, p,0,comm, request+p); CHK(ierr);
  }
  for (int p=0; p<nprocs-1; p++) {
    int index,sender;
    MPI_Waitany(nprocs-1,request,&index,&status);
    if (index!=status.MPI_SOURCE)
      printf("Mismatch index %d vs source %d\n", index,status.MPI_SOURCE);
    printf("Message from %d: %d\n", index,recv_buffer[index]);
  }
} else {
  ierr = MPI_Send(&procno,1,MPI_INT, nprocs-1,0,comm);
}

Output:
make[3]: `irecvsource' is up to date.
process 1 waits 6s before sending
process 2 waits 3s before sending
process 0 waits 13s before sending
process 3 waits 8s before sending
process 5 waits 1s before sending
process 6 waits 14s before sending
process 4 waits 12s before sending
Message from 5: 5
Message from 2: 2
Message from 1: 1
Message from 3: 3
Message from 4: 4
Message from 0: 0
Message from 6: 6

## irecvsource.py
if procid==nprocs-1:
receive_buffer = np.empty(nprocs-1,dtype=int)
requests = [ None ] * (nprocs-1)
for sender in range(nprocs-1):
requests[sender] = comm.Irecv(receive_buffer[sender:sender+1],source=sender)
# alternatively: requests = [ comm.Irecv(s) for s in .... ]
status = MPI.Status()
for sender in range(nprocs-1):
ind = MPI.Request.Waitany(requests,status=status)
if ind!=status.Get_source():
print("sender mismatch: %d vs %d" % (ind,status.Get_source()))
print("received from",ind)
else:
mywait = random.randint(1,2*nprocs)
print("[%d] wait for %d seconds" % (procid,mywait))
time.sleep(mywait)


mydata = np.empty(1,dtype=int)
mydata[0] = procid
comm.Send([mydata,MPI.INT],dest=nprocs-1)

Each process except for the root does a blocking send; the root posts MPI_Irecv from all other processors,
then loops with MPI_Waitany until all requests have come in. Use MPI_SOURCE to test the index parameter of
the wait call.
Note the MPI_STATUS_IGNORE parameter: we know everything about the incoming message, so we do not
need to query a status object. Contrast this with the example in section 4.3.1.

4.2.2.5 Wait for some requests


Finally, MPI_Waitsome is very much like MPI_Waitany, except that it returns multiple numbers, if multiple
requests are satisfied. Now the status argument is an array of MPI_Status objects.

4.2.2.6 Receive status of the wait calls


The MPI_Wait... routines have the MPI_Status objects as output. If you are not interested in the status in-
formation, you can use the values MPI_STATUS_IGNORE for MPI_Wait and MPI_Waitany, or MPI_STATUSES_IGNORE
for MPI_Waitall, MPI_Waitsome, MPI_Testall, MPI_Testsome.

Remark 13 The routines that can return multiple statuses, can return the error condition MPI_ERR_IN_STATUS,
indicating that one of the statuses was in error. See section 4.3.3.

Exercise 4.13.
(There is a skeleton for this exercise under the name isendirecv.) Now use
nonblocking send/receive routines to implement the three-point averaging operation
𝑦𝑖 = (𝑥𝑖−1 + 𝑥𝑖 + 𝑥𝑖+1 )/3 ∶ 𝑖 = 1, … , 𝑁 − 1
on a distributed array. There are two approaches to the first and last process:
1. you can use MPI_PROC_NULL for the ‘missing’ communications;
2. you can skip these communications altogether, but now you have to count the
requests carefully.

4.2.2.7 Latency hiding / overlapping communication and computation


There is a second motivation for the Isend/Irecv calls: if your hardware supports it, the communication
can happen while your program continues to do useful work:
// start nonblocking communication
MPI_Isend( ... ); MPI_Irecv( ... );
// do work that does not depend on incoming data
....
// wait for the Isend/Irecv calls to finish
MPI_Wait( ... );
// now do the work that absolutely needs the incoming data
....


This is known as overlapping computation and communication, or latency hiding. See also asynchronous
progress; section 15.4.
Unfortunately, a lot of this communication involves activity in user space, so the solution would have
been to let it be handled by a separate thread. Until recently, processors were not efficient at doing such
multi-threading, so true overlap stayed a promise for the future. Some network cards have support for
this overlap, but it requires a nontrivial combination of hardware, firmware, and MPI implementation.
Exercise 4.14.
(There is a skeleton for this exercise under the name isendirecvarray.) Take your
code of exercise 4.13 and modify it to use latency hiding. Operations that can be
performed without needing data from neighbors should be performed in between
the MPI_Isend / MPI_Irecv calls and the corresponding MPI_Wait calls.

Remark 14 You have now seen various send types: blocking, nonblocking, synchronous. Can a receiver see
what kind of message was sent? Are different receive routines needed? The answer is that, on the receiving
end, there is nothing to distinguish a nonblocking or synchronous message. The MPI_Recv call can match any
of the send routines you have seen so far, and conversely a message sent with MPI_Send can be received by
MPI_Irecv.

4.2.2.8 Buffer issues in nonblocking communication


While the use of nonblocking routines prevents deadlock, it introduces problems of its own.
• With a blocking send call, you could repeatedly fill the send buffer and send it off.
double *buffer;
for ( ... p ... ) {
  buffer = // fill in the data
  MPI_Send( buffer, ... /* to: */ p );
}

• On the other hand, when a nonblocking send call returns, the actual send may not have been
executed, so the send buffer may not be safe to overwrite. Similarly, when the recv call returns,
you do not know for sure that the expected data is in it. Only after the corresponding wait call
are you use that the buffer has been sent, or has received its contents.
• To send multiple messages with nonblocking calls you therefore have to allocate multiple
buffers.
double **buffers;
for ( ... p ... ) {
  buffers[p] = // fill in the data
  MPI_Isend( buffers[p], ... /* to: */ p, ... );
}
MPI_Waitall( /* the requests */ );

// irecvloop.c
MPI_Request *requests =
  (MPI_Request*) malloc( 2*nprocs*sizeof(MPI_Request) );
recv_buffer = (int*) malloc( nprocs*sizeof(int) );
send_buffer = (int*) malloc( nprocs*sizeof(int) );
for (int p=0; p<nprocs; p++) {
  int
    left_p = (p-1+nprocs) % nprocs,
    right_p = (p+1) % nprocs;
  send_buffer[p] = nprocs-p;
  MPI_Isend(send_buffer+p,1,MPI_INT, right_p,0,comm, requests+2*p);
  MPI_Irecv(recv_buffer+p,1,MPI_INT, left_p,0,comm, requests+2*p+1);
}
/* your useful code here */
MPI_Waitall(2*nprocs,requests,MPI_STATUSES_IGNORE);

4.2.3 Wait and test calls


The MPI_Wait... routines are blocking. Thus, they are a good solution if the receiving process can not do
anything until the data (or at least some data) is actually received. The MPI_Test... calls are themselves
nonblocking: they test for whether one or more requests have been fulfilled, but otherwise immediately
return. It is also a local operation: it does not force progress.
The MPI_Test call can be used in the manager-worker model: the manager process creates tasks, and sends
them to whichever worker process has finished its work. (This uses a receive from MPI_ANY_SOURCE, and a
subsequent test on the MPI_SOURCE field of the receive status.) While waiting for the workers, the manager
can do useful work too, which requires a periodic check on incoming message.
Pseudo-code:
while ( not done ) {
// create new inputs for a while
....
// see if anyone has finished
MPI_Test( .... &index, &flag );
if ( flag ) {
// receive processed data and send new
}
}

If the test is true, the request is deallocated and set to MPI_REQUEST_NULL, or, in the case of an active persistent
request (section 5.1), set to inactive.
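A slightly more concrete sketch of such a polling loop, here written with MPI_Testany over an array of outstanding worker requests (the names nworkers, work_requests, and all_work_done are assumptions of this illustration):
int index,flag; MPI_Status status;
while ( !all_work_done ) {
  // ... generate new work locally ...
  MPI_Testany(nworkers,work_requests,&index,&flag,&status);
  if (flag && index!=MPI_UNDEFINED) {
    // worker status.MPI_SOURCE has reported back:
    // process its result, and post a new receive if more work remains
  }
}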
Analogous to MPI_Wait, MPI_Waitany, MPI_Waitall, MPI_Waitsome, there are MPI_Test (figure 4.10),
MPI_Testany, MPI_Testall, MPI_Testsome.
Exercise 4.15. Read section HPC book, section-6.5 and give pseudo-code for the distributed
sparse matrix-vector product using the above idiom for using MPI_Test... calls.
Discuss the advantages and disadvantages of this approach. The answer is not going
to be black and white: discuss when you expect which approach to be preferable.

4.2.4 More about requests


Every nonblocking call allocates an MPI_Request object. Unlike MPI_Status, an MPI_Request variable is not
actually an object, but instead it is an (opaque) pointer. This means that when you call, for instance,
MPI_Irecv, MPI will allocate an actual request object, and return its address in the MPI_Request variable.


Figure 4.10 MPI_Test


Name Param name Explanation C type F type inout
MPI_Test (
request communication request MPI_Request* TYPE INOUT
(MPI_Request)
flag true if operation int* LOGICAL OUT
completed
status status object MPI_Status* TYPE OUT
(MPI_Status)
)
Python:

request.Test()

Figure 4.11 MPI_Request_free


Name Param name Explanation C type F type inout
MPI_Request_free (
request communication request MPI_Request* TYPE INOUT
(MPI_Request)
)

Correspondingly, calls to MPI_Wait or MPI_Test free this object, setting the handle to MPI_REQUEST_NULL.
(There is an exception for persistent communications where the request is only set to ‘inactive’; sec-
tion 5.1.) Thus, it is wise to issue wait calls even if you know that the operation has succeeded. For in-
stance, if all receive calls are concluded, you know that the corresponding send calls are finished and there
is no strict need to wait for their requests. However, omitting the wait calls would lead to a memory leak.
Another way around this is to call MPI_Request_free (figure 4.11), which sets the request variable to
MPI_REQUEST_NULL, and marks the object for deallocation after completion of the operation. Conceivably,
one could issue a nonblocking call, and immediately call MPI_Request_free, dispensing with any wait call.
However, this makes it hard to know when the operation is concluded and when the buffer is safe to
reuse [25].
You can inspect the status of a request without freeing the request object with MPI_Request_get_status
(figure 4.12).
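For instance (a sketch; the buffer, count, and source variables are assumed to be set up elsewhere), you can poll a pending receive without retiring its request, and only free the request later with a wait call:

// sketch: inspect a request without freeing it
MPI_Request request;
MPI_Status status;
int flag;
MPI_Irecv( buffer,count,MPI_DOUBLE, source,0,comm, &request );
MPI_Request_get_status( request,&flag,&status ); // the request object stays allocated
if (flag)
  printf("message from %d has arrived\n",status.MPI_SOURCE);
MPI_Wait( &request,MPI_STATUS_IGNORE );          // this call does free the request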

4.3 The Status object and wildcards


In section 4.1.1 you saw that MPI_Recv has a ‘status’ argument of type MPI_Status that MPI_Send lacks.
(The various MPI_Wait... routines also have a status argument; see section 4.2.1.) Often you specify
MPI_STATUS_IGNORE for this argument: commonly you know what data is coming in and where it is coming
from.
However, in some circumstances the recipient may not know all details of a message when you make the
receive call, so MPI has a way of querying the status of the message:


Figure 4.12 MPI_Request_get_status


Name Param name Explanation C type F type inout
MPI_Request_get_status (
request request MPI_Request TYPE IN
(MPI_Request)
flag boolean flag, same as from int* LOGICAL OUT
MPI_TEST
status status object if flag is MPI_Status* TYPE OUT
true (MPI_Status)
)

• If you are expecting multiple incoming messages, it may be most efficient to deal with them in
the order in which they arrive. So, instead of waiting for a specific message, you would specify
MPI_ANY_SOURCE or MPI_ANY_TAG in the description of the receive message. Now you have to be
able to ask ‘who did this message come from, and what is in it’.
• Maybe you know the sender of a message, but the amount of data is unknown. In that case you
can overallocate your receive buffer, and after the message is received ask how big it was, or
you can ‘probe’ an incoming message and allocate enough data when you find out how much
data is being sent.
To do this, the receive call has a MPI_Status parameter. The MPI_Status object is a structure (in C a struct,
in F90 an array, in F2008 a derived type) with freely accessible members:
• MPI_SOURCE gives the source of the message; see section 4.3.1.
• MPI_TAG gives the tag with which the message was received; see section 4.3.2.
• MPI_ERROR gives the error status of the receive call; see section 4.3.3.
• The number of items in the message can be deduced from the status object, not as a structure
member, but through a function call to MPI_Get_count; see section 4.3.4.
Fortran note 8: Status object in f08. The mpi_f08 module turns many handles (such as communicators)
from Fortran Integers into Types. Retrieving the integer from the type is usually done through
the %val member, but for the status object this is more difficult. The routines MPI_Status_f2f08
and MPI_Status_f082f convert between these. (Remarkably, these routines are even available
in C, where they operate on MPI_Fint, MPI_F08_status arguments.)
Python note 16: Status object. The status object is explicitly created before being passed to the receive
routine. It has the usual query method for the message count:
## pingpongbig.py
status = MPI.Status()
comm.Recv( rdata,source=0,status=status)
count = status.Get_count(MPI.DOUBLE)

(The count function without argument returns a result in bytes.)


However, unlike in C/F where the fields of the status object are directly accessible, Python has
query methods for these too:
status.Get_source()

status.Get_tag()
status.Get_elements()
status.Get_error()
status.Is_cancelled()

Should you need them, there are even Set variants of these; see https://mpi4py.readthedocs.io/en/stable/reference/mpi4py.MPI.Status.html.
MPL note 33: Status object. The mpl::status_t object is created by the receive (or wait) call:
mpl::contiguous_layout<double> target_layout(count);
mpl::status_t recv_status =
comm_world.recv(target.data(),target_layout, the_other);
recv_count = recv_status.get_count<double>();

4.3.1 Source
In some applications it makes sense that a message can come from one of a number of processes. In
this case, it is possible to specify MPI_ANY_SOURCE as the source. To find out the source where the message
actually came from, you would use the MPI_SOURCE field of the status object that is delivered by MPI_Recv
or the MPI_Wait... call after an MPI_Irecv.
MPI_Recv(recv_buffer+p,1,MPI_INT, MPI_ANY_SOURCE,0,comm,
&status);
sender = status.MPI_SOURCE;

There are various scenarios where receiving from ‘any source’ makes sense. One is that of the manager-
worker model. The manager task would first send data to the worker tasks, then issue a blocking wait
for the data of whichever process finishes first.
In Fortran2008 style, the source is a member of the Status type.
!! anysource.F90
Type(MPI_Status) :: status
allocate(recv_buffer(ntids-1))
do p=0,ntids-2
call MPI_Recv(recv_buffer(p+1),1,MPI_INTEGER,&
MPI_ANY_SOURCE,0,comm,status)
sender = status%MPI_SOURCE

In Fortran90 style, the source is an index in the Status array.


!! anysource.F90
integer :: status(MPI_STATUS_SIZE)
allocate(recv_buffer(ntids-1))
do p=0,ntids-2
call MPI_Recv(recv_buffer(p+1),1,MPI_INTEGER,&
MPI_ANY_SOURCE,0,comm,status,err)
sender = status(MPI_SOURCE)

MPL note 34: Status source querying. The status object can be queried:
int source = recv_status.source();


4.3.2 Tag
In some circumstances, a tag wildcard can be used on the receive operation: MPI_ANY_TAG. The actual tag
of a message can be retrieved as the MPI_TAG member in the status structure.
There are not many cases where this is needed.
• Messages from a single source, even when sent with nonblocking calls, are non-overtaking. This means
that messages can be distinguished by their order.
• Messages from multiple sources can be distinguished by the source field.
• Retrieving the message tag might be needed if information is encoded in it.
• The non-overtaking argument does not apply in the case of hybrid computing: two threads may
send messages that do not have an MPI-imposed order. See the example in section 45.1.
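As a sketch of tag retrieval (the tag values WORKTAG and QUITTAG are invented for illustration): receive with the wildcard, then read the MPI_TAG field of the status to decide what the message means.

// sketch: distinguish messages by the tag that was actually received
MPI_Status status;
MPI_Recv( &data,1,MPI_INT, source,MPI_ANY_TAG,comm, &status );
if (status.MPI_TAG==WORKTAG)        // hypothetical tag value
  process_work_item(data);
else if (status.MPI_TAG==QUITTAG)   // hypothetical tag value
  done = 1;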
MPL note 35: Tag types. Tags are int or an enum type:
template<typename T >
tag_t (T t);
tag_t (int t);

Example:
// inttag.cxx
enum class tag : int { ping=1, pong=2 };
int pinger = 0, ponger = world.size()-1;
if (world.rank()==pinger) {
world.send(x, 1, tag::ping);
world.recv(x, 1, tag::pong);
} else if (world.rank()==ponger) {
world.recv(x, 0, tag::ping);
world.send(x, 0, tag::pong);
}

MPL note 36: Message tag. MPL differs from other APIs in its treatment of tags: a tag is not directly an
integer, but an object of class tag.
// sendrecv.cxx
mpl::tag t0(0);
comm_world.sendrecv
( mydata,sendto,t0,
leftdata,recvfrom,t0 );

The tag class has a couple of methods such as mpl::tag::any() (for the MPI_ANY_TAG wildcard in
receive calls) and mpl::tag::up() (maximal tag, found from the MPI_TAG_UB attribute).

4.3.3 Error
For functions that return a single status, any error is returned as the function result. For a function re-
turning multiple statuses, such as MPI_Waitall, the presence of an error in one of the receives is indicated
by a result of MPI_ERR_IN_STATUS. Any errors during the receive operation can be found as the MPI_ERROR
member of the status structure.
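A sketch of how such an error could be detected (this assumes the communicator's error handler has been set to MPI_ERRORS_RETURN, so that errors are reported back rather than aborting, and that the requests come from earlier nonblocking calls):

// sketch: find out which request failed after a waitall
MPI_Status statuses[NREQUESTS];
int err = MPI_Waitall( NREQUESTS,requests,statuses );
if (err==MPI_ERR_IN_STATUS) {
  for (int i=0; i<NREQUESTS; i++)
    if (statuses[i].MPI_ERROR!=MPI_SUCCESS)
      printf("request %d failed with error %d\n",i,statuses[i].MPI_ERROR);
}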


Figure 4.13 MPI_Get_count


Name Param name Explanation C type F type inout
MPI_Get_count (
MPI_Get_count_c (
status return status of receive const TYPE IN
operation MPI_Status* (MPI_Status)
datatype datatype of each receive MPI_Datatype TYPE IN
buffer entry (MPI_Datatype)
int∗
count number of received entries [ INTEGER OUT
MPI_Count∗
)
MPL:

template<typename T>
int mpl::status::get_count () const

template<typename T>
int mpl::status::get_count (const layout<T> &l) const
Python:

status.Get_count( Datatype datatype=BYTE )

4.3.4 Count

If the amount of data received is not known a priori, the count of elements received can be found by
MPI_Get_count (figure 4.13):

// count.c
if (procid==0) {
int sendcount = (rand()>.5) ? N : N-1;
MPI_Send( buffer,sendcount,MPI_FLOAT,target,0,comm );
} else if (procid==target) {
MPI_Status status;
int recvcount;
MPI_Recv( buffer,N,MPI_FLOAT,0,0, comm, &status );
MPI_Get_count(&status,MPI_FLOAT,&recvcount);
printf("Received %d elements\n",recvcount);
}


Figure 4.14 MPI_Get_elements


Name Param name Explanation C type F type inout
MPI_Get_elements (
MPI_Get_elements_c (
status return status of receive const TYPE IN
operation MPI_Status* (MPI_Status)
datatype datatype used by receive MPI_Datatype TYPE IN
operation (MPI_Datatype)
int∗
count number of received basic [ INTEGER OUT
MPI_Count∗
elements
)

Code:
!! count.F90
if (procid==0) then
   sendcount = N
   call random_number(fraction)
   if (fraction>.5) then
      print *,"One less" ; sendcount = N-1
   end if
   call MPI_Send( buffer,sendcount,MPI_REAL,target,0,comm )
else if (procid==target) then
   call MPI_Recv( buffer,N,MPI_REAL,0,0, comm, status )
   call MPI_Get_count(status,MPI_FLOAT,recvcount)
   print *,"Received",recvcount,"elements"
end if

Output:
make[3]: `count' is up to date.
TACC: Starting up job 4051425
TACC: Setting up parallel environment for MVAPICH2+mpispawn.
TACC: Starting parallel tasks...
One less
Received 9 elements
TACC: Shutdown complete. Exiting.

This may be necessary since the count argument to MPI_Recv is the buffer size, not an indication of the
actually received number of data items.

Remarks.
• Unlike the source and tag, the message count is not directly a member of the status structure.
• The ‘count’ returned is the number of elements of the specified datatype. If this is a derived
type (section 6.3) this is not the same as the number of predefined datatype elements. For that,
use MPI_Get_elements (figure 4.14) or MPI_Get_elements_x, which return the number of basic
elements; see the sketch below.
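For example (a sketch; the buffer, source, and communicator are assumed to be set up elsewhere): if a ‘pair’ type consists of two doubles, and the sender sends five separate doubles, the count in pairs is undefined, while the number of basic elements is five.

// sketch: count versus elements for a derived type
MPI_Datatype pairtype; // two doubles taken together
MPI_Type_contiguous( 2,MPI_DOUBLE,&pairtype );
MPI_Type_commit( &pairtype );
MPI_Status status;
MPI_Recv( buffer,3,pairtype, source,0,comm, &status ); // suppose the sender sent 5 MPI_DOUBLEs
int count,elements;
MPI_Get_count( &status,pairtype,&count );       // MPI_UNDEFINED: not a whole number of pairs
MPI_Get_elements( &status,pairtype,&elements ); // 5: the number of basic elements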
MPL note 37: Receive count. The get_count function is a method of the status object. The argument type
is handled through templating:
// recvstatus.cxx
double pi=0;
auto s = comm_world.recv(pi, 0); // receive from rank 0
int c = s.get_count<double>();
std::cout << "got : " << c << " scalar(s): " << pi << '\n';


4.3.5 Example: receiving from any source


Consider an example where the last process receives from every other process. We could implement this
as a loop
for (int p=0; p<nprocs-1; p++)
MPI_Recv( /* from source= */ p );

but this may incur idle time if the messages arrive out of order.
Instead, we use the MPI_ANY_SOURCE specifier to give a wildcard behavior to the receive call: using this
value for the ‘source’ argument means that we accept messages from any source within the communicator,
and messages are only matched by tag value. (Note that size and type of the receive buffer are not used
for message matching!)
We then retrieve the actual source from the MPI_Status object through the MPI_SOURCE field.
// anysource.c
if (procno==nprocs-1) {
/*
* The last process receives from every other process
*/
int *recv_buffer;
recv_buffer = (int*) malloc((nprocs-1)*sizeof(int));

/*
* Messages can come in in any order, so use MPI_ANY_SOURCE
*/
MPI_Status status;
for (int p=0; p<nprocs-1; p++) {
err = MPI_Recv(recv_buffer+p,1,MPI_INT, MPI_ANY_SOURCE,0,comm,
&status); CHK(err);
int sender = status.MPI_SOURCE;
printf("Message from sender=%d: %d\n",
sender,recv_buffer[p]);
}
free(recv_buffer);
} else {
/*
* Each rank waits an unpredictable amount of time,
* then sends to the last process in line.
*/
float randomfraction = (rand() / (double)RAND_MAX);
int randomwait = (int) ( nprocs * randomfraction );
printf("process %d waits for %e/%d=%d\n",
procno,randomfraction,nprocs,randomwait);
sleep(randomwait);
err = MPI_Send(&randomwait,1,MPI_INT, nprocs-1,0,comm); CHK(err);
}

## anysource.py
rstatus = MPI.Status()
comm.Recv(rbuf,source=MPI.ANY_SOURCE,status=rstatus)
print("Message came from %d" % rstatus.Get_source())


The manager-worker model is a design pattern that offers an opportunity for inspecting the MPI_SOURCE
field of the MPI_Status object describing the data that was received. In the example above, all worker
processes model their work by waiting a random amount of time, and the manager process accepts messages from any source.

In chapter 49 you can do a programming project with this model.

4.4 More about point-to-point communication


4.4.1 Message probing
MPI receive calls specify a receive buffer, and its size has to be enough for any data sent. In case you really
have no idea how much data is being sent, and you don’t want to overallocate the receive buffer, you can
use a ‘probe’ call.
The routines MPI_Probe (figure 4.15) and MPI_Iprobe (figure 4.16) (for which see also section 15.4) detect the
presence of a matching message, but do not copy its data. Instead, when probing tells you that there is a message, you can use
MPI_Get_count (section 4.3.4) to determine its size, allocate a large enough receive buffer, and do a regular
receive to have the data copied.


Figure 4.15 MPI_Probe


Name Param name Explanation C type F type inout
MPI_Probe (
source rank of source or int INTEGER IN
MPI_ANY_SOURCE
tag message tag or MPI_ANY_TAG int INTEGER IN
comm communicator MPI_Comm TYPE IN
(MPI_Comm)
status status object MPI_Status* TYPE OUT
(MPI_Status)
)

Figure 4.16 MPI_Iprobe


Name Param name Explanation C type F type inout
MPI_Iprobe (
source rank of source or int INTEGER IN
MPI_ANY_SOURCE
tag message tag or MPI_ANY_TAG int INTEGER IN
comm communicator MPI_Comm TYPE IN
(MPI_Comm)
flag true if there is a int* LOGICAL OUT
matching message that can
be received
status status object MPI_Status* TYPE OUT
(MPI_Status)
)

// probe.c
if (procno==receiver) {
MPI_Status status;
MPI_Probe(sender,0,comm,&status);
int count;
MPI_Get_count(&status,MPI_FLOAT,&count);
float recv_buffer[count];
MPI_Recv(recv_buffer,count,MPI_FLOAT, sender,0,comm,MPI_STATUS_IGNORE);
} else if (procno==sender) {
float buffer[buffer_size];
ierr = MPI_Send(buffer,buffer_size,MPI_FLOAT, receiver,0,comm); CHK(ierr);
}

There is a problem with the MPI_Probe call in a multithreaded environment: the following scenario can
happen.
1. A thread determines by probing that a certain message has come in.
2. It issues a blocking receive call for that message…
3. But in between the probe and the receive call another thread has already received the message.
4. … Leaving the first thread in a blocked state with no message to receive.
This is solved by MPI_Mprobe (figure 4.17), which after a successful probe removes the message from the


Figure 4.17 MPI_Mprobe


Name Param name Explanation C type F type inout
MPI_Mprobe (
source rank of source or int INTEGER IN
MPI_ANY_SOURCE
tag message tag or MPI_ANY_TAG int INTEGER IN
comm communicator MPI_Comm TYPE IN
(MPI_Comm)
message returned message MPI_Message* TYPE OUT
(MPI_Message)
status status object MPI_Status* TYPE OUT
(MPI_Status)
)

Figure 4.18 MPI_Mrecv


Name Param name Explanation C type F type inout
MPI_Mrecv (
MPI_Mrecv_c (
buf initial address of receive void* TYPE(*), OUT
buffer DIMENSION(..)
int
count number of elements in [ INTEGER IN
MPI_Count
receive buffer
datatype datatype of each receive MPI_Datatype TYPE IN
buffer element (MPI_Datatype)
message message MPI_Message* TYPE INOUT
(MPI_Message)
status status object MPI_Status* TYPE OUT
(MPI_Status)
)

matching queue: the list of messages that can be matched by a receive call. The thread that matched the
probe now issues an MPI_Mrecv (figure 4.18) call on that message through an object of type MPI_Message.
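A sketch of this sequence on the receiving thread (the sending rank and the tag value are assumptions here):

// sketch: matched probe followed by matched receive
MPI_Message message;
MPI_Status status;
MPI_Mprobe( sender,0,comm, &message,&status );
int count;
MPI_Get_count( &status,MPI_FLOAT,&count );
float *recv_buffer = (float*) malloc( count*sizeof(float) );
MPI_Mrecv( recv_buffer,count,MPI_FLOAT, &message,MPI_STATUS_IGNORE );

Since the probe already removed the message from the matching queue, no other thread can intercept it between the MPI_Mprobe and the MPI_Mrecv.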

4.4.2 Errors

MPI routines return MPI_SUCCESS upon successful completion. The following error codes can be returned
(see section 15.2.1 for details) for completion with error by both send and receive operations: MPI_ERR_COMM,
MPI_ERR_COUNT, MPI_ERR_TYPE, MPI_ERR_TAG, MPI_ERR_RANK.

4.4.3 Message envelope

Apart from its bare data, each message has a message envelope. This has enough information to distinguish
messages from each other: the source, destination, tag, communicator.
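As a sketch (the tag values are invented for illustration): two messages between the same pair of processes, with the same count and datatype, are still told apart by their envelopes, in this case by the tag.

// sketch: matching is done on the envelope, not on the order of posting
if (procno==0) {
  MPI_Send( &x,1,MPI_DOUBLE, 1,/* tag: */ 37,comm );
  MPI_Send( &y,1,MPI_DOUBLE, 1,/* tag: */ 38,comm );
} else if (procno==1) {
  MPI_Request reqs[2];
  MPI_Irecv( &yy,1,MPI_DOUBLE, 0,38,comm,&reqs[0] );
  MPI_Irecv( &xx,1,MPI_DOUBLE, 0,37,comm,&reqs[1] );
  MPI_Waitall( 2,reqs,MPI_STATUSES_IGNORE );
}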


4.5 Review questions


For all true/false questions, if you answer that a statement is false, give a one-line explanation.
Review 4.16. Describe a deadlock scenario involving three processors.
Review 4.17. True or false: a message sent with MPI_Isend from one processor can be
received with an MPI_Recv call on another processor.
Review 4.18. True or false: a message sent with MPI_Send from one processor can be
received with an MPI_Irecv on another processor.
Review 4.19. Why does the MPI_Irecv call not have an MPI_Status argument?
Review 4.20. Suppose you are testing ping-pong timings. Why is it generally not a good
idea to use processes 0 and 1 for the source and target processor? Can you come up
with a better guess?
Review 4.21. What is the relation between the concepts of ‘origin’, ‘target’, ‘fence’, and
‘window’ in one-sided communication.
Review 4.22. What are the three routines for one-sided data transfer?
Review 4.23. In the following fragments assume that all buffers have been allocated with
sufficient size. For each fragment note whether it deadlocks or not. Discuss
performance issues.
for (int p=0; p<nprocs; p++)
if (p!=procid)
MPI_Send(sbuffer,buflen,MPI_INT,p,0,comm);
for (int p=0; p<nprocs; p++)
if (p!=procid)
MPI_Recv(rbuffer,buflen,MPI_INT,p,0,comm,MPI_STATUS_IGNORE);

for (int p=0; p<nprocs; p++)
if (p!=procid)
MPI_Recv(rbuffer,buflen,MPI_INT,p,0,comm,MPI_STATUS_IGNORE);
for (int p=0; p<nprocs; p++)
if (p!=procid)
MPI_Send(sbuffer,buflen,MPI_INT,p,0,comm);

int ireq = 0;
for (int p=0; p<nprocs; p++)
if (p!=procid)
MPI_Isend(sbuffers[p],buflen,MPI_INT,p,0,comm,&(requests[ireq++]));
for (int p=0; p<nprocs; p++)
if (p!=procid)
MPI_Recv(rbuffer,buflen,MPI_INT,p,0,comm,MPI_STATUS_IGNORE);
MPI_Waitall(nprocs-1,requests,MPI_STATUSES_IGNORE);

int ireq = 0;
for (int p=0; p<nprocs; p++)
if (p!=procid)
MPI_Irecv(rbuffers[p],buflen,MPI_INT,p,0,comm,&(requests[ireq++]));
for (int p=0; p<nprocs; p++)

if (p!=procid)
MPI_Send(sbuffer,buflen,MPI_INT,p,0,comm);
MPI_Waitall(nprocs-1,requests,MPI_STATUSES_IGNORE);

int ireq = 0;
for (int p=0; p<nprocs; p++)
if (p!=procid)
MPI_Irecv(rbuffers[p],buflen,MPI_INT,p,0,comm,&(requests[ireq++]));
MPI_Waitall(nprocs-1,requests,MPI_STATUSES_IGNORE);
for (int p=0; p<nprocs; p++)
if (p!=procid)
MPI_Send(sbuffer,buflen,MPI_INT,p,0,comm);


Fortran codes:
do p=0,nprocs-1
if (p/=procid) then
call MPI_Send(sbuffer,buflen,MPI_INT,p,0,comm,ierr)
end if
end do
do p=0,nprocs-1
if (p/=procid) then
call MPI_Recv(rbuffer,buflen,MPI_INT,p,0,comm,MPI_STATUS_IGNORE,ierr)
end if
end do

do p=0,nprocs-1
if (p/=procid) then
call MPI_Recv(rbuffer,buflen,MPI_INT,p,0,comm,MPI_STATUS_IGNORE,ierr)
end if
end do
do p=0,nprocs-1
if (p/=procid) then
call MPI_Send(sbuffer,buflen,MPI_INT,p,0,comm,ierr)
end if
end do

ireq = 0
do p=0,nprocs-1
if (p/=procid) then
call MPI_Isend(sbuffers(1,p+1),buflen,MPI_INT,p,0,comm,&
requests(ireq+1),ierr)
ireq = ireq+1
end if
end do
do p=0,nprocs-1
if (p/=procid) then
call MPI_Recv(rbuffer,buflen,MPI_INT,p,0,comm,MPI_STATUS_IGNORE,ierr)
end if
end do
call MPI_Waitall(nprocs-1,requests,MPI_STATUSES_IGNORE,ierr)

ireq = 0
do p=0,nprocs-1
if (p/=procid) then
call MPI_Irecv(rbuffers(1,p+1),buflen,MPI_INT,p,0,comm,&
requests(ireq+1),ierr)
ireq = ireq+1
end if
end do
do p=0,nprocs-1
if (p/=procid) then
call MPI_Send(sbuffer,buflen,MPI_INT,p,0,comm,ierr)
end if
end do
call MPI_Waitall(nprocs-1,requests,MPI_STATUSES_IGNORE,ierr)


!! block5.F90
ireq = 0
do p=0,nprocs-1
if (p/=procid) then
call MPI_Irecv(rbuffers(1,p+1),buflen,MPI_INT,p,0,comm,&
requests(ireq+1),ierr)
ireq = ireq+1
end if
end do
call MPI_Waitall(nprocs-1,requests,MPI_STATUSES_IGNORE,ierr)
do p=0,nprocs-1
if (p/=procid) then
call MPI_Send(sbuffer,buflen,MPI_INT,p,0,comm,ierr)
end if
end do

Review 4.24. Consider a ring-wise communication where


int
next = (mytid+1) % ntids,
prev = (mytid+ntids-1) % ntids;

and each process sends to next, and receives from prev.


The normal solution for preventing deadlock is to use both MPI_Isend and MPI_Irecv.
The send and receive complete at the wait call. But does it matter in what sequence
you do the wait calls?

// ring3.c
MPI_Request req1,req2;
MPI_Irecv(&y,1,MPI_DOUBLE,prev,0,comm,&req1);
MPI_Isend(&x,1,MPI_DOUBLE,next,0,comm,&req2);
MPI_Wait(&req1,MPI_STATUS_IGNORE);
MPI_Wait(&req2,MPI_STATUS_IGNORE);

// ring4.c
MPI_Request req1,req2;
MPI_Irecv(&y,1,MPI_DOUBLE,prev,0,comm,&req1);
MPI_Isend(&x,1,MPI_DOUBLE,next,0,comm,&req2);
MPI_Wait(&req2,MPI_STATUS_IGNORE);
MPI_Wait(&req1,MPI_STATUS_IGNORE);

Can we have one nonblocking and one blocking call? Do these scenarios block?

// ring1.c
MPI_Request req;
MPI_Issend(&x,1,MPI_DOUBLE,next,0,comm,&req);
MPI_Recv(&y,1,MPI_DOUBLE,prev,0,comm,
         MPI_STATUS_IGNORE);
MPI_Wait(&req,MPI_STATUS_IGNORE);

// ring2.c
MPI_Request req;
MPI_Irecv(&y,1,MPI_DOUBLE,prev,0,comm,&req);
MPI_Ssend(&x,1,MPI_DOUBLE,next,0,comm);
MPI_Wait(&req,MPI_STATUS_IGNORE);


4.6 Full source code of examples


Please see the repository: https://github.com/VictorEijkhout/TheArtOfHPC_vol2_
parallelprogramming/tree/main. Examples are sorted first by programming system, then by
language.



Chapter 5

MPI topic: Communication modes

5.1 Persistent communication


You can imagine that setting up a communication carries some overhead, and if the same communication
structure is repeated many times, this overhead may be avoided.
Persistent communication is a mechanism for dealing with a repeating communication transaction, where
the parameters of the transaction, such as sender, receiver, tag, root, and buffer address/type/size, stay
the same. Only the contents of the buffers involved may change between the transactions.
1. For nonblocking communications MPI_Ixxx (both point-to-point and collective) there is a
persistent variant MPI_Xxx_init with the same calling sequence. The ‘init’ call produces an
MPI_Request output parameter, which can be used to test for completion of the communication.
2. The ‘init’ routine does not start the actual communication: that is done in MPI_Start, or
MPI_Startall for multiple requests.
3. Any of the MPI ‘wait’ calls can then be used to conclude the communication.
4. The communication can then be restarted with another ‘start’ call.
5. The wait call does not release the request object, since it can be used for repeat occurrences of
this transaction. The request object is only freed with MPI_Request_free.
MPI_Send_init( /* ... */ &request);
while ( /* ... */ ) {
  MPI_Start( &request );
  MPI_Wait( &request, &status );
}
MPI_Request_free( &request );

MPL note 38: Persistent requests. MPL returns a prequest from persistent ‘init’ routines, rather than an
irequest (MPL note 29):

template<typename T >
prequest send_init (const T &data, int dest, tag t=tag(0)) const;

Likewise, there is a prequest_pool instead of an irequest_pool (note 30).


Figure 5.1 MPI_Send_init


Name Param name Explanation C type F type inout
MPI_Send_init (
MPI_Send_init_c (
buf initial address of send const void* TYPE(*), IN
buffer DIMENSION(..)
int
count number of elements sent [ INTEGER IN
MPI_Count
datatype type of each element MPI_Datatype TYPE IN
(MPI_Datatype)
dest rank of destination int INTEGER IN
tag message tag int INTEGER IN
comm communicator MPI_Comm TYPE IN
(MPI_Comm)
request communication request MPI_Request* TYPE OUT
(MPI_Request)
)
Python:

MPI.Comm.Send_init(self, buf, int dest, int tag=0)

Figure 5.2 MPI_Startall


Name Param name Explanation C type F type inout
MPI_Startall (
count list length int INTEGER IN
array_of_requests array of requests MPI_Request[] TYPE INOUT
(MPI_Request)
(count)
)
Python:

MPI.Prequest.Startall(type cls, requests)

5.1.1 Persistent point-to-point communication

The main persistent point-to-point routines are MPI_Send_init (figure 5.1), which has the same calling
sequence as MPI_Isend, and MPI_Recv_init, which has the same calling sequence as MPI_Irecv.

In the following example a ping-pong is implemented with persistent communication. Since we use per-
sistent operations for both send and receive on the ‘ping’ process, we use MPI_Startall (figure 5.2) to start
both at the same time, and MPI_Waitall to test their completion. (There is MPI_Start for starting a single
persistent transfer.)


Code:
// persist.c
if (procno==src) {
  /*
   * Send ping, receive pong
   */
  MPI_Send_init
    (send,s,MPI_DOUBLE,tgt,0,comm,
     requests+0);
  MPI_Recv_init
    (recv,s,MPI_DOUBLE,tgt,0,comm,
     requests+1);
  for (int n=0; n<NEXPERIMENTS; n++) {
    fill_buffer(send,s,n);
    MPI_Startall(2,requests);
    MPI_Waitall(2,requests,
                MPI_STATUSES_IGNORE);
    int r = chck_buffer(send,s,n);
    if (!r) printf("buffer problem %d\n",s);
  }
} else if (procno==tgt) {
  /*
   * Receive ping, send pong
   */
  MPI_Send_init
    (recv,s,MPI_DOUBLE,src,0,comm,
     requests+0);
  MPI_Recv_init
    (recv,s,MPI_DOUBLE,src,0,comm,
     requests+1);
  for (int n=0; n<NEXPERIMENTS; n++) {
    // receive
    MPI_Start(requests+1);
    MPI_Wait(requests+1,MPI_STATUS_IGNORE);
    // send
    MPI_Start(requests+0);
    MPI_Wait(requests+0,MPI_STATUS_IGNORE);
  }
}
MPI_Request_free(requests+0);
MPI_Request_free(requests+1);

Output:
make[3]: `persist' is up to date.
TACC: Starting up job 4328411
TACC: Starting parallel tasks...
Pingpong size=1: t=1.2123e-04
Pingpong size=10: t=4.2826e-06
Pingpong size=100: t=7.1507e-06
Pingpong size=1000: t=1.2084e-05
Pingpong size=10000: t=3.7668e-05
Pingpong size=100000: t=3.4415e-04
Persistent size=1: t=3.8177e-06
Persistent size=10: t=3.2410e-06
Persistent size=100: t=4.0468e-06
Persistent size=1000: t=1.1525e-05
Persistent size=10000: t=4.1672e-05
Persistent size=100000: t=2.8648e-04
TACC: Shutdown complete. Exiting.

(Ask yourself: why does the sender use MPI_Startall and MPI_Waitall, but the receiver uses MPI_Start and
MPI_Wait twice?)

## persist.py
requests = [ None ] * 2
sendbuf = np.ones(size,dtype=int)
recvbuf = np.ones(size,dtype=int)
if procid==src:
print("Size:",size)


Figure 5.3 MPI_Allreduce_init


Name Param name Explanation C type F type inout
MPI_Allreduce_init (
MPI_Allreduce_init_c (
sendbuf starting address of send const void* TYPE(*), IN
buffer DIMENSION(..)
recvbuf starting address of void* TYPE(*), OUT
receive buffer DIMENSION(..)
int
count number of elements in send [ INTEGER IN
MPI_Count
buffer
datatype datatype of elements of MPI_Datatype TYPE IN
send buffer (MPI_Datatype)
op operation MPI_Op TYPE(MPI_Op) IN
comm communicator MPI_Comm TYPE IN
(MPI_Comm)
info info argument MPI_Info TYPE IN
(MPI_Info)
request communication request MPI_Request* TYPE OUT
(MPI_Request)
)

times[isize] = MPI.Wtime()
for n in range(nexperiments):
requests[0] = comm.Isend(sendbuf[0:size],dest=tgt)
requests[1] = comm.Irecv(recvbuf[0:size],source=tgt)
MPI.Request.Waitall(requests)
sendbuf[0] = sendbuf[0]+1
times[isize] = MPI.Wtime()-times[isize]
elif procid==tgt:
for n in range(nexperiments):
comm.Recv(recvbuf[0:size],source=src)
comm.Send(recvbuf[0:size],dest=src)

As with ordinary send commands, there are persistent variants of the other send modes:
• MPI_Bsend_init for buffered communication, section 5.5;
• MPI_Ssend_init for synchronous communication, section 5.3.1;
• MPI_Rsend_init for ready sends, section 15.8.

5.1.2 Persistent collectives


The following material is for the recently released MPI-4 standard and may not be supported yet.
For each collective call, there is a persistent variant. As with persistent point-to-point calls (section 5.1.1),
these have largely the same calling sequence as the nonpersistent variants, except for:
• an MPI_Info parameter that can be used to pass system-dependent hints; and
• an added final MPI_Request parameter.
(See for instance MPI_Allreduce_init (figure 5.3).) This request (or an array of requests from multiple calls)
can then be used by MPI_Start (or MPI_Startall) to initiate the actual communication.


// powerpersist1.c
double localnorm,globalnorm=1.;
MPI_Request reduce_request;
MPI_Allreduce_init
( &localnorm,&globalnorm,1,MPI_DOUBLE,MPI_SUM,
comm,MPI_INFO_NULL,&reduce_request);
for (int it=0; ; it++) {
/*
* Matrix vector product
*/
matmult(indata,outdata,buffersize);

// start computing norm of output vector


localnorm = local_l2_norm(outdata,buffersize);
double old_globalnorm = globalnorm;
MPI_Start( &reduce_request );

// end computing norm of output vector


MPI_Wait( &reduce_request,MPI_STATUS_IGNORE );
globalnorm = sqrt(globalnorm);
// now `globalnorm' is the L2 norm of `outdata'
scale(outdata,indata,buffersize,1./globalnorm);
}
MPI_Request_free( &reduce_request );

Some points.
• Metadata arrays, such as of counts and datatypes, must not be altered until the MPI_Request_free
call.
• The initialization call is nonlocal (for this particular case of persistent collectives), so it can block
until all processes have performed it.
• Multiple persistent collectives can be initialized, in which case they satisfy the same restrictions
as ordinary collectives, in particular on ordering. Thus, the following code is incorrect:
// WRONG
if (procid==0) {
MPI_Reduce_init( /* ... */ &req1);
MPI_Bcast_init( /* ... */ &req2);
} else {
MPI_Bcast_init( /* ... */ &req2);
MPI_Reduce_init( /* ... */ &req1);
}

However, after initialization the start calls can be in arbitrary order, and in different order among
the processes.
Available persistent collectives are: MPI_Barrier_init MPI_Bcast_init MPI_Reduce_init MPI_Allreduce_init
MPI_Reduce_scatter_init MPI_Reduce_scatter_block_init MPI_Gather_init MPI_Gatherv_init
MPI_Allgather_init MPI_Allgatherv_init MPI_Scatter_init MPI_Scatterv_init MPI_Alltoall_init
MPI_Alltoallv_init MPI_Alltoallw_init MPI_Scan_init MPI_Exscan_init
End of MPI-4 material


Figure 5.4 MPI_Psend_init


Name Param name Explanation C type F type inout
MPI_Psend_init (
buf initial address of send const void* TYPE(*), IN
buffer DIMENSION(..)
partitions number of partitions int INTEGER IN
count number of elements sent MPI_Count INTEGER IN
per partition (KIND=MPI_COUNT_KIND)
datatype type of each element MPI_Datatype TYPE IN
(MPI_Datatype)
dest rank of destination int INTEGER IN
tag message tag int INTEGER IN
comm communicator MPI_Comm TYPE IN
(MPI_Comm)
info info argument MPI_Info TYPE IN
(MPI_Info)
request communication request MPI_Request* TYPE OUT
(MPI_Request)
)

5.1.3 Persistent neighbor communications


The following material is for the recently released MPI-4 standard and may not be supported yet.
There are persistent version of the neighborhood collectives; section 11.2.2.
MPI_Neighbor_allgather_init, MPI_Neighbor_allgatherv_init, MPI_Neighbor_alltoall_init,
MPI_Neighbor_alltoallv_init, MPI_Neighbor_alltoallw_init,
End of MPI-4 material

5.2 Partitioned communication


The following material is for the recently released MPI-4 standard and may not be supported yet.
Partitioned communication is a variant of persistent communication, in the sense that we use the init /
start / wait sequence. The difference is that now a message can be constructed bit by bit.
• The normal MPI_Send_init is replaced by MPI_Psend_init (figure 5.4). Note the presence of an
MPI_Info argument, as in persistent collectives, but unlike in persistent sends and receives.
• After this, the MPI_Start does not actually start the transfer; instead:
• Each partition of the message is separately declared as ready-to-be-sent with MPI_Pready.
• An MPI_Wait call completes the operation, indicating that all partitions have been sent.
A common scenario for this is in multi-threaded environments, where each thread can construct its own
part of a message. Having partitioned messages means that partially constructed message buffers can be
sent off without having to wait for all threads to finish.
Indicating that parts of a message are ready for sending is done by one of the following calls:
• MPI_Pready (figure 5.5) for a single partition;


Figure 5.5 MPI_Pready


Name Param name Explanation C type F type inout
MPI_Pready (
partition partition to mark ready int INTEGER IN
for transfer
request partitioned communication MPI_Request TYPE INOUT
request (MPI_Request)
)

• MPI_Pready_range for a range of partitions; and


• MPI_Pready_list for an explicitly enumerated list of partitions.
The MPI_Psend_init call yields an MPI_Request object that can be used to test for completion (see sections
4.2.2 and 4.2.3) of the full operation.
MPI_Request send_request;
MPI_Psend_init
(sendbuffer,nparts,SIZE,MPI_DOUBLE,tgt,0,
comm,MPI_INFO_NULL,&send_request);
for (int it=0; it<ITERATIONS; it++) {
MPI_Start(&send_request);
for (int ip=0; ip<nparts; ip++) {
fill_buffer(sendbuffer,partitions[ip],partitions[ip+1],ip);
MPI_Pready(ip,send_request);
}
MPI_Wait(&send_request,MPI_STATUS_IGNORE);
}
MPI_Request_free(&send_request);

The receiving side is largely the mirror image of the sending side:
double *recvbuffer = (double*)malloc(bufsize*sizeof(double));
MPI_Request recv_request;
MPI_Precv_init
(recvbuffer,nparts,SIZE,MPI_DOUBLE,src,0,
comm,MPI_INFO_NULL,&recv_request);
for (int it=0; it<ITERATIONS; it++) {
MPI_Start(&recv_request); int r=1,flag;
for (int ip=0; ip<nparts; ip++) // cycle this many times
for (int ap=0; ap<nparts; ap++) { // check specific part
MPI_Parrived(recv_request,ap,&flag);
if (flag) {
r *= chck_buffer
(recvbuffer,partitions[ap],partitions[ap+1],ap);
break; }
}
MPI_Wait(&recv_request,MPI_STATUS_IGNORE);
}
MPI_Request_free(&recv_request);

• a partitioned send can only be matched with a partitioned receive, so we start with an
MPI_Precv_init.


Figure 5.6 MPI_Parrived


Name Param name Explanation C type F type inout
MPI_Parrived (
request partitioned communication MPI_Request TYPE IN
request (MPI_Request)
partition partition to be tested int INTEGER IN
flag true if operation int* LOGICAL OUT
completed on the specified
partition, false if not
)

• Arrival of a partition can be tested with MPI_Parrived (figure 5.6).


• A call to MPI_Wait completes the operation, indicating that all partitions have arrived.
Again, the MPI_Request object from the receive-init call can be used to test for completion of the full receive
operation.
End of MPI-4 material

5.3 Synchronous and asynchronous communication


It is easiest to think of blocking as a form of synchronization with the other process, but that is not quite
true. Synchronization is a concept in itself, and we talk about synchronous communication if there is actual
coordination going on with the other process, and asynchronous communication if there is not. Blocking
then only refers to the program waiting until the user data is safe to reuse; in the synchronous case a
blocking call means that the data is indeed transferred, in the asynchronous case it only means that the
data has been transferred to some system buffer. The four possible cases are illustrated in figure 5.1.

5.3.1 Synchronous send operations


MPI has a number of routines for synchronous communication, such as MPI_Ssend. Driving home the
point that nonblocking and asynchronous are different concepts, there is a routine MPI_Issend, which
is synchronous but nonblocking. These routines have the same calling sequence as their not-explicitly
synchronous variants, and only differ in their semantics.
See section 4.1.4.2 for examples.
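A sketch of the nonblocking synchronous send (buffer, count, and the receiving rank are assumed to be set up elsewhere): the wait call can only return once the matching receive has been started.

// sketch: synchronous but nonblocking
MPI_Request request;
MPI_Issend( buffer,count,MPI_DOUBLE, receiver,0,comm, &request );
// ... do work that does not touch `buffer' ...
MPI_Wait( &request,MPI_STATUS_IGNORE ); // completes only after the receive is matched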

5.4 Local and nonlocal operations


The MPI standard does not dictate whether communication is buffered. If a message is buffered, a send
call can complete, even if no corresponding receive has been posted yet. See section 4.1.4.2. Thus, in the
standard communication mode, a send operation is nonlocal: its completion may depend on whether the
corresponding receive has been posted. A local operation is one that is not nonlocal.


Figure 5.1: Blocking and synchronicity

On the other hand, buffered communication (routines MPI_Bsend, MPI_Ibsend, MPI_Bsend_init; section 5.5) is
local: the presence of an explicit buffer means that a send operation can complete no matter whether the
receive has been posted.
The synchronous send (routines MPI_Ssend, MPI_Issend, MPI_Ssend_init; section 15.8) is again nonlocal (even
in the nonblocking variant) since it will only complete when the receive call has completed.
Finally, the ready mode send (MPI_Rsend, MPI_Irsend) is nonlocal in the sense that its only correct use is
when the corresponding receive has been issued.

5.5 Buffered communication


By now you have probably got the notion that managing buffer space in MPI is important: data has to
be somewhere, either in user-allocated arrays or in system buffers. Using buffered communication is yet
another way of managing buffer space.
1. You allocate your own buffer space, and you attach it to your process. This buffer is not a send
buffer: it is a replacement for buffer space used inside the MPI library or on the network card;
figure 5.2. If high-bandwidth memory is available, you could create your buffer there.
2. You use the MPI_Bsend (figure 5.7) (or its nonblocking variant MPI_Ibsend) call for sending, using other-
wise normal send and receive buffers;
3. You detach the buffer when you’re done with the buffered sends.
One advantage of buffered sends is that they are nonblocking: since there is a guaranteed buffer long
enough to contain the message, it is not necessary to wait for the receiving process.


Figure 5.2: User communication routed through an attached buffer

Figure 5.7 MPI_Bsend


Name Param name Explanation C type F type inout
MPI_Bsend (
MPI_Bsend_c (
buf initial address of send const void* TYPE(*), IN
buffer DIMENSION(..)
int
count number of elements in send [ INTEGER IN
MPI_Count
buffer
datatype datatype of each send MPI_Datatype TYPE IN
buffer element (MPI_Datatype)
dest rank of destination int INTEGER IN
tag message tag int INTEGER IN
comm communicator MPI_Comm TYPE IN
(MPI_Comm)
)

We illustrate the use of buffered sends:

// bufring.c
int bsize = BUFLEN*sizeof(float);
float
*sbuf = (float*) malloc( bsize ),
*rbuf = (float*) malloc( bsize );
MPI_Pack_size( BUFLEN,MPI_FLOAT,comm,&bsize);
bsize += MPI_BSEND_OVERHEAD;
float
*buffer = (float*) malloc( bsize );

MPI_Buffer_attach( buffer,bsize );
err = MPI_Bsend(sbuf,BUFLEN,MPI_FLOAT,next,0,comm);
MPI_Recv (rbuf,BUFLEN,MPI_FLOAT,prev,0,comm,MPI_STATUS_IGNORE);
MPI_Buffer_detach( &buffer,&bsize );


Figure 5.8 MPI_Buffer_attach


Name Param name Explanation C type F type inout
MPI_Buffer_attach (
MPI_Buffer_attach_c (
buffer initial buffer address void* TYPE(*), IN
DIMENSION(..)
int
size buffer size, in bytes [ INTEGER IN
MPI_Count
)

Figure 5.9 MPI_Buffer_detach


Name Param name Explanation C type F type inout
MPI_Buffer_detach (
MPI_Buffer_detach_c (
buffer_addr initial buffer address void* TYPE(C_PTR) OUT
int∗
size buffer size, in bytes [ INTEGER OUT
MPI_Count∗
)

5.5.1 Buffer treatment


There can be only one buffer per process, attached with MPI_Buffer_attach (figure 5.8). Its size should be
enough for all MPI_Bsend calls that are simultaneously outstanding. You can compute the needed size of
the buffer with MPI_Pack_size; see section 6.8. Additionally, a term of MPI_BSEND_OVERHEAD is needed. See
the above code fragment.
The buffer is detached with MPI_Buffer_detach (figure 5.9). This returns the address and size of the buffer;
the call blocks until all buffered messages have been delivered.
Note that both MPI_Buffer_attach and MPI_Buffer_detach have a void* argument for the buffer, but
• in the attach routine this is the address of the buffer,
• while in the detach routine it is the address of the buffer pointer.
This is done so that the detach routine can zero the buffer pointer.
While the buffered send is nonblocking like an MPI_Isend, there is no corresponding wait call. You can
force delivery by
MPI_Buffer_detach( &b, &n );
MPI_Buffer_attach( b, n );

MPL note 39: Buffered send. Creating and attaching a buffer is done through bsend_buffer and a support
routine bsend_size helps in calculating the buffer size:
// bufring.cxx
vector<float> sbuf(BUFLEN), rbuf(BUFLEN);
int size{ comm_world.bsend_size<float>(mpl::contiguous_layout<float>(BUFLEN)) };
mpl::bsend_buffer buff(size);
comm_world.bsend(sbuf.data(),mpl::contiguous_layout<float>(BUFLEN), next);


Figure 5.10 MPI_Bsend_init


Name Param name Explanation C type F type inout
MPI_Bsend_init (
MPI_Bsend_init_c (
buf initial address of send const void* TYPE(*), IN
buffer DIMENSION(..)
int
count number of elements sent [ INTEGER IN
MPI_Count
datatype type of each element MPI_Datatype TYPE IN
(MPI_Datatype)
dest rank of destination int INTEGER IN
tag message tag int INTEGER IN
comm communicator MPI_Comm TYPE IN
(MPI_Comm)
request communication request MPI_Request* TYPE OUT
(MPI_Request)
)

Constant: mpl::bsend_overhead is constexpr’d to the MPI constant MPI_BSEND_OVERHEAD.


MPL note 40: Buffer attach and detach. There is a separate attach routine, but normally this is called by the
constructor of the bsend_buffer. Likewise, the detach routine is called in the buffer destructor.
void mpl::environment::buffer_attach (void *buff, int size);
std::pair< void *, int > mpl::environment::buffer_detach ();

5.5.2 Buffered send calls


The possible error codes are
• MPI_SUCCESS the routine completed successfully.
• MPI_ERR_BUFFER The buffer pointer is invalid; this typically means that you have supplied a null
pointer.
• MPI_ERR_INTERN An internal error in MPI has been detected.
The asynchronous version is MPI_Ibsend, the persistent (see section 5.1) call is MPI_Bsend_init (figure 5.10).
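A sketch of checking these codes explicitly (this assumes the communicator's error handler has been set to MPI_ERRORS_RETURN, so that errors are reported back instead of aborting, and that the send buffer and receiver rank are set up elsewhere):

// sketch: explicit error checking on a buffered send
int err = MPI_Bsend( sendbuf,count,MPI_DOUBLE, receiver,0,comm );
if (err==MPI_ERR_BUFFER)
  printf("no buffer attached, or not enough buffer space\n");
else if (err!=MPI_SUCCESS)
  printf("buffered send failed with code %d\n",err);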


5.6 Full source code of examples


Please see the repository: https://github.com/VictorEijkhout/TheArtOfHPC_vol2_
parallelprogramming/tree/main. Examples are sorted first by programming system, then by
language.



Chapter 6

MPI topic: Data types

In the examples you have seen so far, every time data was sent it was as a contiguous buffer with elements
of a single type. In practice you may want to send noncontiguous or heterogeneous data.
• As an example of noncontiguous data, communicating the real parts of an array of complex
numbers means specifying every other number.
• Heterogeneous data is needed when communicating a C structure or Fortran type with more
than one type of element.
The datatypes you have dealt with so far are known as predefined datatypes; the datatypes you create to
deal with other data are known as derived datatypes.

6.1 The MPI_Datatype data type


Datatypes such as MPI_INT are values of the type MPI_Datatype. This type is handled differently in different
languages.
In C you can declare variables as
MPI_Datatype mytype;

Fortran note 9: Derived types for handles. In Fortran before 2008, datatype variables are stored in Integer
variables. With the Fortran2008 standard, datatypes are Fortran derived types:
!! vector.F90
Type(MPI_Datatype) :: newvectortype

Implementationwise speaking, these types have exactly one member, MPI_VAL, which is the same
integer as was used for that datatype in the earlier Fortran version.
Python note 17: Data types. There is a class
mpi4py.MPI.Datatype

with predefined values such as


mpi4py.MPI.Datatype.DOUBLE


which are themselves objects with methods for creating derived types; see section 6.3.1.
MPL note 41: Datatype handling. MPL mostly handles datatypes through subclasses of the layout class.
Layouts and MPL routines are templated over the data type.
// sendlong.cxx
mpl::contiguous_layout<long long> v_layout(v.size());
comm.send(v.data(), v_layout, 1); // send to rank 1

Also works with complex of float and double.


The data types whose internal representation MPL can infer are enumeration types, C arrays
of constant size, and the template classes std::array, std::pair and std::tuple of the C++
Standard Template Library. The only limitation is that the C array and the mentioned template
classes must hold data elements of types that can be sent or received by MPL.
MPL note 42: Native MPI data types. Should you need the MPI_Datatype object contained in an MPL layout,
there is an access function native_handle.

6.2 Predefined data types


MPI has a number of predefined data types of various kinds
• First of all there are the types corresponding to the simple data types of the host languages. The
names are made to resemble the types of C and Fortran, for instance MPI_FLOAT and MPI_DOUBLE
corresponding to float and double in C, versus MPI_REAL and MPI_DOUBLE_PRECISION correspond-
ing to Real and Double precision in Fortran.
• The types MPI_PACKED and MPI_BYTE do not correspond to language types.
• The type MPI_Aint (and the Fortran kind MPI_ADDRESS_KIND) is used in Remote Memory Access
(RMA) windows; see section 9.3.1.
• The type MPI_Offset (and the corresponding Fortran MPI_OFFSET_KIND kind) is used to define
MPI_Offset quantities, used in file I/O; section 10.2.2.
• The type MPI_Count describes buffers; see section 6.4.
• The type MPI_CHAR corresponds to a character, which is not the same as a C char: it can be more
than one byte. Also, MPI converts between native character representations when communi-
cating between different architectures.

6.2.1 C/C++
Here we illustrate for C/C++ the correspondence between a type used to declare a variable, and how this
type appears in MPI communication routines:
long int i;
MPI_Send(&i,1,MPI_LONG,target,tag,comm);

See table 6.1.


• There is some, but not complete, support for C99 types; see table 6.2.


C type MPI type


char MPI_CHAR
unsigned char MPI_UNSIGNED_CHAR
signed char MPI_SIGNED_CHAR
short MPI_SHORT
unsigned short MPI_UNSIGNED_SHORT
int MPI_INT
unsigned int MPI_UNSIGNED
long int MPI_LONG
unsigned long int MPI_UNSIGNED_LONG
long long int MPI_LONG_LONG_INT
float MPI_FLOAT
double MPI_DOUBLE
long double MPI_LONG_DOUBLE
unsigned char MPI_BYTE
(does not correspond to a C type) MPI_PACKED

Table 6.1: Predefined datatypes in C

C99 type MPI type

_Bool MPI_C_BOOL
float _Complex MPI_C_COMPLEX / MPI_C_FLOAT_COMPLEX
double _Complex MPI_C_DOUBLE_COMPLEX
long double _Complex MPI_C_LONG_DOUBLE_COMPLEX

Table 6.2: C99 synonym types.

• There is support for C11 fixed width integer types; see table 6.3.
• The MPI_LONG_INT type is not an integer type, but rather a long and an int packed together; see
section 3.10.1.1.
• See section 6.2.4 for MPI_Aint and more about byte counting.

6.2.2 Fortran

Table 6.4 lists standard Fortran types and common extensions. Not all the types in the right table
need be supported; for instance MPI_INTEGER16 may not exist, in which case it will be equivalent to
MPI_DATATYPE_NULL.

The default integer type MPI_INTEGER is equivalent to INTEGER(KIND=MPI_INTEGER_KIND).


C11 type MPI type


int8_t MPI_INT8_T
int16_t MPI_INT16_T
int32_t MPI_INT32_T
int64_t MPI_INT64_T

uint8_t MPI_UINT8_T
uint16_t MPI_UINT16_T
uint32_t MPI_UINT32_T
uint64_t MPI_UINT64_T

Table 6.3: C11 fixed width integer types.

Standard Fortran types                Common extensions

MPI_CHARACTER  Character(Len=1)       MPI_INTEGER1
MPI_INTEGER                           MPI_INTEGER2
MPI_REAL                              MPI_INTEGER4
MPI_DOUBLE_PRECISION                  MPI_INTEGER8
MPI_COMPLEX                           MPI_INTEGER16
MPI_LOGICAL                           MPI_REAL2
MPI_BYTE                              MPI_REAL4
MPI_PACKED                            MPI_REAL8
                                      MPI_DOUBLE_COMPLEX  Complex(Kind=Kind(0.d0))

Table 6.4: Standard Fortran types (left) and common extensions (right)

6.2.2.1 Fortran90 kind-defined types


If your Fortran90 code uses KIND to define scalar types with specified precision, you need to use the fol-
lowing routines to make MPI equivalences of Fortran scalar types:
• MPI_Type_create_f90_integer (figure 6.1)
• MPI_Type_create_f90_real (figure 6.2)
• MPI_Type_create_f90_complex (figure 6.3).
Example of an integer kind:
INTEGER ( KIND = SELECTED_INT_KIND(15) ) , &
DIMENSION(100) :: array
INTEGER :: root , error
Type(MPI_Datatype) :: integertype

CALL MPI_Type_create_f90_integer( 15 , integertype , error )


CALL MPI_Bcast ( array , 100 , &
integertype , root , MPI_COMM_WORLD , error )
! error parameter optional in f08, both routines.


Figure 6.1 MPI_Type_create_f90_integer


Name Param name Explanation C type F type inout
MPI_Type_create_f90_integer (
r decimal exponent range, int INTEGER IN
i.e., number of decimal
digits
newtype the requested MPI datatype MPI_Datatype* TYPE OUT
(MPI_Datatype)
)

Figure 6.2 MPI_Type_create_f90_real


Name Param name Explanation C type F type inout
MPI_Type_create_f90_real (
p precision, in decimal int INTEGER IN
digits
r decimal exponent range int INTEGER IN
newtype the requested MPI datatype MPI_Datatype* TYPE OUT
(MPI_Datatype)
)

Code:
!! kindsend.F90
Integer,parameter :: digits=16
Integer,parameter :: ip = Selected_Int_Kind(digits)
Integer (kind=ip) :: data
Type(MPI_Datatype) :: mpi_ip
Call MPI_Type_create_f90_integer(digits,mpi_ip)
if (rank==0) then
   print *,"Fortran type has range",range(data)
   call MPI_Send( data,1,mpi_ip, 1,0,comm )
else if (rank==1) then
   call MPI_Recv( data,1,mpi_ip, 0,0,comm, MPI_STATUS_IGNORE )

Output:
Fortran type has range 18
Sending: 729000000000000
Received: 729000000000000

Example of a real kind:


REAL ( KIND = SELECTED_REAL_KIND(15 ,300) ) , &
DIMENSION(100) :: array
CALL MPI_Type_create_f90_real( 15 , 300 , realtype , error )

Example of a complex kind:


COMPLEX ( KIND = SELECTED_REAL_KIND(15 ,300) ) , &
DIMENSION(100) :: array
CALL MPI_Type_create_f90_complex( 15 , 300 , complextype , error )

Remark 15 The MPI types thus created are predefined data types, so there is no need to commit or free them.


Figure 6.3 MPI_Type_create_f90_complex


Name Param name Explanation C type F type inout
MPI_Type_create_f90_complex (
p precision, in decimal int INTEGER IN
digits
r decimal exponent range int INTEGER IN
newtype the requested MPI datatype MPI_Datatype* TYPE OUT
(MPI_Datatype)
)

6.2.3 Python
Python note 18: Predefined data types. This section 6.2.3 discusses the use of predefined datatypes in Python.
In python, all buffer data comes from Numpy.

mpi4py type NumPy type


MPI.INT np.intc / np.int32
MPI.LONG np.int64
MPI.FLOAT np.float32
MPI.DOUBLE np.float64

In this table we see that Numpy has three integer types, one corresponding to C ints, and two with the
number of bits explicitly indicated. There used to be a np.int type, but this is deprecated as of Numpy
1.20.
Examples:
Code:
## inttype.py
sizeofint = np.dtype('int32').itemsize
print("Size of numpy int32: {}".format(sizeofint))
sizeofint = np.dtype('intc').itemsize
print("Size of C int: {}".format(sizeofint))

Output:
Size of numpy int32: 4
Size of C int: 4

## allgatherv.py
mycount = procid+1
my_array = np.empty(mycount,dtype=np.float64)

6.2.3.1 Type correspondences MPI / Python


Above we saw that the number of bytes of a Numpy type can be deduced from
sizeofint = np.dtype('intc').itemsize

It is possible to derive the Numpy type corresponding to an MPI type:


## typesize.py
datatype = MPI.FLOAT
typecode = MPI._typecode(datatype)
assert typecode is not None # check MPI datatype is built-in
dtype = np.dtype(typecode)

6.2.4 Byte addressing types


So far we have mostly been talking about datatypes in the context of sending them. The MPI_Aint type is
not so much for sending, as it is for describing the size of objects, such as the size of an MPI_Win object
(section 9.1) or byte displacements in MPI_Type_create_hindexed.
Addresses have type MPI_Aint. The start of the address range is given in MPI_BOTTOM. See also the MPI_Sizeof
(section 6.2.5) and MPI_Get_address (section 6.3.6) routines.
Variables of type MPI_Aint can be sent as MPI_AINT:
MPI_Aint address;
MPI_Send( address,1,MPI_AINT, ... );

See section 9.5.2 for an example.


In order to prevent overflow errors in byte calculations there are support routines MPI_Aint_add
MPI_Aint MPI_Aint_add(MPI_Aint base, MPI_Aint disp)

and similarly MPI_Aint_diff.
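A sketch of typical usage (the structure is hypothetical): compute the byte displacement of a member relative to the start of a struct without doing raw pointer arithmetic.

// sketch: byte displacement of a struct member
struct object { int tag; double value; } obj;
MPI_Aint base,member,displacement;
MPI_Get_address( &obj, &base );
MPI_Get_address( &obj.value, &member );
displacement = MPI_Aint_diff( member,base ); // offset of `value' in bytes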


Fortran note 10: Byte counting types in Fortran. The equivalent of MPI_Aint in Fortran is an integer of kind
MPI_ADDRESS_KIND:

integer(kind=MPI_ADDRESS_KIND) :: winsize

Using this integer kind to compute the size of a window also requires being able to query the
size of the datatype in that window. See section 6.2.5 for details.
Example usage in MPI_Win_create:
call MPI_Sizeof(windowdata,window_element_size,ierr)
window_size = window_element_size*500
call MPI_Win_create( windowdata,window_size,window_element_size,... )

Python note 19: Size of numpy types. Here is a good way for finding the size of numpy datatypes in bytes:
## putfence.py
intsize = np.dtype('int').itemsize
window_data = np.zeros(2,dtype=int)
win = MPI.Win.Create(window_data,intsize,comm=comm)


Figure 6.4 MPI_Type_match_size


Name Param name Explanation C type F type inout
MPI_Type_match_size (
typeclass generic type specifier int INTEGER IN
size size, in bytes, of int INTEGER IN
representation
datatype datatype with correct MPI_Datatype* TYPE OUT
type, size (MPI_Datatype)
)

6.2.5 Matching language type to MPI type

In some circumstances you may want to find the MPI type that corresponds to a type in your programming
language.

• In C++ functions and classes can be templated, meaning that the type is not fully known:

template<typename T>
class something {
public:
  void dosend(T input) {
    MPI_Send( &input,1,/* ????? */ );
  };
};

(Note that in MPL this is hardly ever needed because MPI calls are templated there.)
• Petsc installations use a generic identifier PetscScalar (or PetscReal) with a
configuration-dependent realization.
• The size of a datatype is not always statically known, for instance if the Fortran KIND keyword
is used.

Here are some MPI mechanisms that address this problem.

6.2.5.1 Type matching in C

Datatypes in C can be translated to MPI types with MPI_Type_match_size (figure 6.4) where the typeclass
argument is one of MPI_TYPECLASS_REAL, MPI_TYPECLASS_INTEGER, MPI_TYPECLASS_COMPLEX.


Figure 6.5 MPI_Type_size / MPI_Type_size_c

Param name   Explanation                      C type              F type               inout
datatype     datatype to get information on   MPI_Datatype        TYPE(MPI_Datatype)   IN
size         datatype size                    int* / MPI_Count*   INTEGER              OUT

Figure 6.6 MPI_Sizeof


Name Param name Explanation C type F type inout
MPI_Sizeof (
)

Code: Output:
// typematch.c mpiexec -n 1 ./typematch
float x5; float: size=4, mpi size=4
double x10; double: size=8, mpi size=8
int s5,s10;
MPI_Datatype mpi_x5,mpi_x10;

MPI_Type_match_size
(MPI_TYPECLASS_REAL,sizeof(x5),&mpi_x5);
MPI_Type_match_size
(MPI_TYPECLASS_REAL,sizeof(x10),&mpi_x10);
MPI_Type_size(mpi_x5,&s5);
printf("float: size=%d, mpi size=%d\n",
sizeof(x5),s5);
MPI_Type_size(mpi_x10,&s10);
printf("double: size=%d, mpi size=%d\n",
sizeof(x10),s10);

The space that MPI takes for a structure type can be queried in a variety of ways. First of all MPI_Type_size
(figure 6.5) counts the datatype size as the number of bytes occupied by the data in a type. That means
that in an MPI vector datatype it does not count the gaps.
// typesize.c
MPI_Type_vector(count,bs,stride,MPI_DOUBLE,&newtype);
MPI_Type_commit(&newtype);
MPI_Type_size(newtype,&size);
ASSERT( size==(count*bs)*sizeof(double) );

6.2.5.2 Type matching in Fortran


In Fortran, the size of the datatype in the language can be obtained with MPI_Sizeof (figure 6.6) (note
the nonoptional error parameter!). This routine is deprecated in MPI-4: use of storage_size (which
reports the number of bits) and/or c_sizeof (from the iso_c_binding module, which reports bytes) is
recommended.
!! matchkind.F90
call MPI_Sizeof(x10,s10,ierr)
call MPI_Type_match_size(MPI_TYPECLASS_REAL,s10,mpi_x10)
call MPI_Type_size(mpi_x10,s10)
print *,"10 positions supported, MPI type size is",s10

Petsc has its own translation mechanism; see section 32.2.

6.3 Derived datatypes


MPI allows you to create your own data types, somewhat (but not completely…) analogous to defining
structures in a programming language. MPI data types are mostly of use if you want to send multiple
items in one message.
There are two problems with using only predefined datatypes as you have seen so far.
• MPI communication routines can only send multiples of a single data type: it is not possible to
send items of different types, even if they are contiguous in memory. It would be possible to use
the MPI_BYTE data type, but this is not advisable.
• It is also ordinarily not possible to send items of one type if they are not contiguous in memory.
You could of course send a contiguous memory area that contains the items you want to send,
but that is wasteful of bandwidth, and of memory space on the receiving side.
With MPI data types you can solve these problems in several ways.
• You can create a new contiguous data type consisting of an array of elements of another data
type. There is no essential difference between sending one element of such a type and multiple
elements of the component type.
• You can create a vector data type consisting of regularly spaced blocks of elements of a compo-
nent type. This is a first solution to the problem of sending noncontiguous data.
• For not regularly spaced data, there is the indexed data type, where you specify an array of index
locations for blocks of elements of a component type. The blocks can each be of a different size.
• The struct data type can accommodate multiple data types.
And you can combine these mechanisms to get irregularly spaced heterogeneous data, et cetera.

6.3.1 Basic calls


The typical sequence of calls for creating a new datatype is as follows:
• You need a variable for the datatype; this is of type MPI_Datatype.
• There is a create call, followed by a ‘commit’ call where MPI performs internal bookkeeping
  and optimizations; we will discuss this in great detail below.
• After the type has been committed, the datatype is used, possibly multiple times;
• When the datatype is no longer needed, it must be freed to prevent memory leaks; see
  section 6.3.1.2.
In code:
MPI_Datatype newtype;
MPI_Type_something( < oldtype specifications >, &newtype );
MPI_Type_commit( &newtype );
/* code that uses your new type */
MPI_Type_free( &newtype );

In Fortran2008:
Type(MPI_Datatype) :: newvectortype
call MPI_Type_something( <oldtype specification>, &
newvectortype)
call MPI_Type_commit(newvectortype)
!! code that uses your type
call MPI_Type_free(newvectortype)

Python note 20: Derived type handling. The various type creation routines are methods of the datatype
classes, after which commit and free are methods on the new type.
## vector.py
source = np.empty(stride*count,dtype=np.float64)
target = np.empty(count,dtype=np.float64)
if procid==sender:
newvectortype = MPI.DOUBLE.Create_vector(count,1,stride)
newvectortype.Commit()
comm.Send([source,1,newvectortype],dest=the_other)
newvectortype.Free()
elif procid==receiver:
comm.Recv([target,count,MPI.DOUBLE],source=the_other)

MPL note 43: Derived type handling. In MPL type creation routines are in the main namespace, templated
over the datatypes.
// vector.cxx
vector<double>
source(stride*count);
if (procno==sender) {
mpl::strided_vector_layout<double>
newvectortype(count,1,stride);
comm_world.send
(source.data(),newvectortype,the_other);
}

The commit call is part of the type creation, and freeing is done in the destructor.
MPL note 44: Layouts.
namespace mpl {
  template <typename T> class layout;                 // base class
  template <typename T> class null_layout;            // MPI_DATATYPE_NULL
  template <typename T> class empty_layout;           // empty message
  template <typename T> class contiguous_layout;      // MPI_Type_contiguous
  template <typename T> class vector_layout;          // MPI_Type_contiguous
  template <typename T> class strided_vector_layout;  // MPI_Type_vector
  template <typename T> class indexed_layout;         // MPI_Type_indexed
  template <typename T> class hindexed_layout;        // MPI_Type_create_hindexed
  template <typename T> class indexed_block_layout;   // MPI_Type_create_indexed_block
  template <typename T> class hindexed_block_layout;  // MPI_Type_create_hindexed_block
  template <typename T> class iterator_layout;        // MPI_Type_create_hindexed_block
  template <typename T> class subarray_layout;        // MPI_Type_create_subarray
  class heterogeneous_layout;                         // MPI_Type_create_struct
}

Figure 6.7 MPI_Type_commit

Param name   Explanation                  C type          F type               inout
datatype     datatype that is committed   MPI_Datatype*   TYPE(MPI_Datatype)   INOUT

MPL: Done as part of the type create call.

6.3.1.1 Create calls


The MPI_Datatype variable gets its value by a call to one of the following routines:
• MPI_Type_contiguous for contiguous blocks of data; section 6.3.2;
• MPI_Type_vector for regularly strided data; section 6.3.3;
• MPI_Type_create_subarray for subsets out higher dimensional block; section 6.3.4;
• MPI_Type_create_struct for heterogeneous irregular data; section 6.3.7;
• MPI_Type_indexed and MPI_Type_create_hindexed for irregularly strided data; section 6.3.6.
These calls take an existing type, whether predefined or also derived, and produce a new type.
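For instance, a derived type can itself serve as a building block of another derived type. The following is a minimal sketch, not one of this book's example codes; ‘height’ and ‘width’ stand for array dimensions defined elsewhere.
MPI_Datatype pixel,pixel_column;
MPI_Type_contiguous( 3,MPI_DOUBLE,&pixel );
// the stride of the vector is counted in units of the 'pixel' extent
MPI_Type_vector( height,1,width,pixel,&pixel_column );
MPI_Type_commit( &pixel_column );
/* ... communication using pixel_column ... */
MPI_Type_free( &pixel_column );
MPI_Type_free( &pixel );
Note that only the type used in communication needs to be committed; intermediate building blocks such as the ‘pixel’ type here do not.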

6.3.1.2 Commit and free


It is necessary to call MPI_Type_commit (figure 6.7) on a new data type, which makes MPI do the indexing
calculations for the data type.
When you no longer need the data type, you call MPI_Type_free (figure 6.8). (This is typically not needed
in OO APIs.) This has the following effects:
• The definition of the datatype identifier will be changed to MPI_DATATYPE_NULL.
• Any communication using this data type, that was already started, will be completed succesfully.
• Datatypes that are defined in terms of this data type will still be usable.


Figure 6.8 MPI_Type_free


Name Param name Explanation C type F type inout
MPI_Type_free (
datatype datatype that is freed MPI_Datatype* TYPE INOUT
(MPI_Datatype)
)
MPL:

Done in the destructor.

Figure 6.9 MPI_Type_contiguous / MPI_Type_contiguous_c

Param name   Explanation         C type            F type               inout
count        replication count   int / MPI_Count   INTEGER              IN
oldtype      old datatype        MPI_Datatype      TYPE(MPI_Datatype)   IN
newtype      new datatype        MPI_Datatype*     TYPE(MPI_Datatype)   OUT

Python:

Create_contiguous(self, int count)

6.3.2 Contiguous type


The simplest derived type is the ‘contiguous’ type, constructed with MPI_Type_contiguous (figure 6.9).
A contiguous type describes an array of items of a predefined or earlier defined type. There is no differ-
ence between sending one item of a contiguous type and multiple items of the constituent type. This is
illustrated in figure 6.1.
// contiguous.c
MPI_Datatype newvectortype;
if (procno==sender) {
MPI_Type_contiguous(count,MPI_DOUBLE,&newvectortype);
MPI_Type_commit(&newvectortype);
MPI_Send(source,1,newvectortype,receiver,0,comm);
MPI_Type_free(&newvectortype);
} else if (procno==receiver) {
MPI_Status recv_status;
int recv_count;
MPI_Recv(target,count,MPI_DOUBLE,sender,0,comm,
&recv_status);
MPI_Get_count(&recv_status,MPI_DOUBLE,&recv_count);
ASSERT(count==recv_count);
}


Figure 6.1: A contiguous datatype is built up out of elements of a constituent type

!! contiguous.F90
integer :: newvectortype
if (mytid==sender) then
call MPI_Type_contiguous(count,MPI_DOUBLE_PRECISION,newvectortype)
call MPI_Type_commit(newvectortype)
call MPI_Send(source,1,newvectortype,receiver,0,comm)
call MPI_Type_free(newvectortype)
else if (mytid==receiver) then
call MPI_Recv(target,count,MPI_DOUBLE_PRECISION,sender,0,comm,&
recv_status)
call MPI_Get_count(recv_status,MPI_DOUBLE_PRECISION,recv_count)
!ASSERT(count==recv_count);
end if

## contiguous.py
source = np.empty(count,dtype=np.float64)
target = np.empty(count,dtype=np.float64)
if procid==sender:
newcontiguoustype = MPI.DOUBLE.Create_contiguous(count)
newcontiguoustype.Commit()
comm.Send([source,1,newcontiguoustype],dest=the_other)
newcontiguoustype.Free()
elif procid==receiver:
comm.Recv([target,count,MPI.DOUBLE],source=the_other)

MPL note 45: Contiguous type. The MPL interface makes extensive use of contiguous_layout, as it is the
main way to declare a nonscalar buffer; see note 11.
MPL note 46: Contiguous composing. Contiguous layouts can only use predefined types or other contigu-
ous layouts as their ‘old’ type. To make a contiguous type for other layouts, use vector_layout:
// contiguous.cxx
mpl::contiguous_layout<int> type1(7);
mpl::vector_layout<int> type2(8,type1);

(Contrast this with strided_vector_layout; note 47.)

6.3.3 Vector type


The simplest noncontiguous datatype is the ‘vector’ type, constructed with MPI_Type_vector (figure 6.10).
A vector type describes a series of blocks, all of equal size, spaced with a constant stride. This is illustrated
in figure 6.2.

Figure 6.2: A vector datatype is built up out of strided blocks of elements of a constituent type

Figure 6.10 MPI_Type_vector / MPI_Type_vector_c

Param name    Explanation                                  C type            F type               inout
count         number of blocks                             int / MPI_Count   INTEGER              IN
blocklength   number of elements in each block             int / MPI_Count   INTEGER              IN
stride        number of elements between start of          int / MPI_Count   INTEGER              IN
              each block
oldtype       old datatype                                 MPI_Datatype      TYPE(MPI_Datatype)   IN
newtype       new datatype                                 MPI_Datatype*     TYPE(MPI_Datatype)   OUT

Python:

MPI.Datatype.Create_vector(self, int count, int blocklength, int stride)
The vector datatype gives the first nontrivial illustration that datatypes can be different on the sender and
receiver. If the sender sends b blocks of length l each, the receiver can receive them as bl contiguous
elements, either as a contiguous datatype, or as a contiguous buffer of a predefined type; see figure 6.3.
The receiver has no knowledge of the stride of the datatype on the sender.
In this example a vector type is created only on the sender, in order to send a strided subset of an array;
the receiver receives the data as a contiguous block.

Figure 6.3: Sending a vector datatype and receiving it as predefined or contiguous

// vector.c
source = (double*) malloc(stride*count*sizeof(double));
target = (double*) malloc(count*sizeof(double));
MPI_Datatype newvectortype;
if (procno==sender) {
MPI_Type_vector(count,1,stride,MPI_DOUBLE,&newvectortype);
MPI_Type_commit(&newvectortype);
MPI_Send(source,1,newvectortype,the_other,0,comm);
MPI_Type_free(&newvectortype);
} else if (procno==receiver) {
MPI_Status recv_status;
int recv_count;
MPI_Recv(target,count,MPI_DOUBLE,the_other,0,comm,
&recv_status);
MPI_Get_count(&recv_status,MPI_DOUBLE,&recv_count);
ASSERT(recv_count==count);
}

We illustrate Fortran2008:
if (mytid==sender) then
call MPI_Type_vector(count,1,stride,MPI_DOUBLE_PRECISION,&
newvectortype)
call MPI_Type_commit(newvectortype)
call MPI_Send(source,1,newvectortype,receiver,0,comm)
call MPI_Type_free(newvectortype)
if ( .not. newvectortype==MPI_DATATYPE_NULL) then
print *,"Trouble freeing datatype"
else
print *,"Datatype successfully freed"
end if
else if (mytid==receiver) then
call MPI_Recv(target,count,MPI_DOUBLE_PRECISION,sender,0,comm,&
recv_status)
call MPI_Get_count(recv_status,MPI_DOUBLE_PRECISION,recv_count)
end if

In legacy mode Fortran90, code stays the same except that the type is declared as Integer:


!! vector.F90
integer :: newvectortype
integer :: recv_count
call MPI_Type_vector(count,1,stride,MPI_DOUBLE_PRECISION,&
newvectortype,err)
call MPI_Type_commit(newvectortype,err)

Python note 21: Vector type. The vector creation routine is a method of the datatype class. For the general
discussion, see section 6.3.1.
## vector.py
source = np.empty(stride*count,dtype=np.float64)
target = np.empty(count,dtype=np.float64)
if procid==sender:
newvectortype = MPI.DOUBLE.Create_vector(count,1,stride)
newvectortype.Commit()
comm.Send([source,1,newvectortype],dest=the_other)
newvectortype.Free()
elif procid==receiver:
comm.Recv([target,count,MPI.DOUBLE],source=the_other)

MPL note 47: Vector type. MPL has the strided_vector_layout class as equivalent of the vector type:
// vector.cxx
vector<double>
source(stride*count);
if (procno==sender) {
mpl::strided_vector_layout<double>
newvectortype(count,1,stride);
comm_world.send
(source.data(),newvectortype,the_other);
}

(See note 46 for nonstrided vectors.)

6.3.3.1 Two-dimensional arrays


Figure 6.4 indicates one source of irregular data: with a matrix on column-major storage, a column is stored
in contiguous memory. However, a row of such a matrix is not contiguous; its elements are separated
by a stride equal to the column length.

Figure 6.4: Memory layout of a row and column of a matrix in column-major storage

Exercise 6.1. How would you describe the memory layout of a submatrix, if the whole
    matrix has size 𝑀 × 𝑁 and the submatrix 𝑚 × 𝑛?
As an example of this datatype, consider the example of transposing a matrix, for instance to convert
between C and Fortran arrays. Suppose that a processor has a matrix stored in C (row-major) layout, and
it needs to send a column to another processor. If the matrix is declared as
int M,N; double mat[M][N];

then a column has 𝑀 blocks of one element, spaced 𝑁 locations apart. In other words:
MPI_Datatype MPI_column;
MPI_Type_vector(
/* count= */ M, /* blocklength= */ 1, /* stride= */ N,
MPI_DOUBLE, &MPI_column );

Sending the first column is easy:


MPI_Send( mat, 1,MPI_column, ... );

The second column is just a little trickier: you now need to pick out elements with the same stride, but
starting at A[0][1].
MPI_Send( &(mat[0][1]), 1,MPI_column, ... );

You can make this marginally more efficient (and harder to read) by replacing the index expression by
mat+1.
Exercise 6.2. Suppose you have a matrix of size 4𝑁 × 4𝑁 , and you want to send the
elements A[4*i][4*j] with 𝑖, 𝑗 = 0, … , 𝑁 − 1. How would you send these elements
with a single transfer?
Exercise 6.3. Allocate a matrix on processor zero, using Fortran column-major storage.
Using 𝑃 sendrecv calls, distribute the rows of this matrix among the processors.
Python note 22: Sending from the middle of a matrix. In C and Fortran it’s easy to apply a derived type to
data in the middle of an array, for instance to extract an arbitrary column out of a C matrix,
or row out of a Fortran matrix. While Python has no trouble describing sections from an array,
usually it copies these instead of taking the address. Therefore, it is necessary to convert the
matrix to a buffer and compute an explicit offset in bytes:
## rowcol.py
rowsize = 4; colsize = 5
coltype = MPI.INT.Create_vector(4, 1, 5)
coltype.Commit()
columntosend = 2
comm.Send\
( [np.frombuffer(matrix.data, intc,
offset=columntosend*np.dtype('intc').itemsize),
1,coltype],
receiver)

Figure 6.5: Send strided data from process zero to all others

Exercise 6.4. Let processor 0 have an array 𝑥 of length 10𝑃, where 𝑃 is the number of
processors. Elements 0, 𝑃, 2𝑃, … , 9𝑃 should go to processor zero, 1, 𝑃 + 1, 2𝑃 + 1, … to
processor 1, et cetera.
• Code this as a sequence of send/recv calls, using a vector datatype for the send,
and a contiguous buffer for the receive.
• For simplicity, skip the send to/from zero. What is the most elegant solution if
you want to include that case?
• For testing, define the array as 𝑥[𝑖] = 𝑖.
(There is a skeleton for this exercise under the name stridesend.)
Exercise 6.5. Write code to compare the time it takes to send a strided subset from an array:
copy the elements by hand to a smaller buffer, or use a vector data type. What do
you find? You may need to test on fairly large arrays.

6.3.4 Subarray type


The vector datatype can be used for blocks in an array of dimension more than 2 by using it recur-
sively. However, this gets tedious. Instead, there is an explicit subarray type MPI_Type_create_subarray
(figure 6.11). This describes the dimensionality and extent of the array, and the starting point (the ‘upper
left corner’) and extent of the subarray.
MPL note 48: Subarray layout. The templated subarray_layout class is constructed from a vector of triplets
of global size / subblock size / first coordinate.
mpl::subarray_layout<int>(
  { {ny, ny_l, ny_0}, {nx, nx_l, nx_0} }
);

Figure 6.11 MPI_Type_create_subarray / MPI_Type_create_subarray_c

Param name          Explanation                              C type                      F type               inout
ndims               number of array dimensions               int                         INTEGER              IN
array_of_sizes      number of elements of type oldtype in    const int[] / MPI_Count[]   INTEGER(ndims)       IN
                    each dimension of the full array
array_of_subsizes   number of elements of type oldtype in    const int[] / MPI_Count[]   INTEGER(ndims)       IN
                    each dimension of the subarray
array_of_starts     starting coordinates of the subarray     const int[] / MPI_Count[]   INTEGER(ndims)       IN
                    in each dimension
order               array storage order flag                 int                         INTEGER              IN
oldtype             old datatype                             MPI_Datatype                TYPE(MPI_Datatype)   IN
newtype             new datatype                             MPI_Datatype*               TYPE(MPI_Datatype)   OUT

Python:

MPI.Datatype.Create_subarray
(self, sizes, subsizes, starts, int order=ORDER_C)

Exercise 6.6. Assume that your number of processors is 𝑃 = 𝑄³, and that each process has
an array of identical size. Use MPI_Type_create_subarray to gather all data onto a root
process. Use a sequence of send and receive calls; MPI_Gather does not work here.
(There is a skeleton for this exercise under the name cubegather.)
Fortran note 11: Subarrays. Subarrays are naturally supported in Fortran through array sections.
!! section.F90
integer,parameter :: siz=20
real,dimension(siz,siz) :: matrix = [ ((j+(i-1)*siz,i=1,siz),j=1,siz) ]
real,dimension(2,2) :: submatrix
if (procno==0) then
call MPI_Send(matrix(1:2,1:2),4,MPI_REAL,1,0,comm)
else if (procno==1) then
call MPI_Recv(submatrix,4,MPI_REAL,0,0,comm,MPI_STATUS_IGNORE)
if (submatrix(2,2)==22) then
print *,"Yay"
else
print *,"nay...."
end if
end if


However, there is a subtlety with non-blocking operations: for a non-contiguous buffer a tem-
porary is created, which is released after the MPI call. This is correct for blocking sends, but for
non-blocking the temporary has to stay around till the wait call.
!! sectionisend.F90
integer :: siz
real,dimension(:,:),allocatable :: matrix
real,dimension(2,2) :: submatrix
siz = 20
allocate( matrix(siz,siz) )
matrix = reshape( [ ((j+(i-1)*siz,i=1,siz),j=1,siz) ], (/siz,siz/) )
call MPI_Isend(matrix(1:2,1:2),4,MPI_REAL,1,0,comm,request)
call MPI_Wait(request,MPI_STATUS_IGNORE)
deallocate(matrix)

In MPI-3 the variable MPI_SUBARRAYS_SUPPORTED indicates support for this mechanism:


if ( .not. MPI_SUBARRAYS_SUPPORTED ) then
print *,"This code will not work"
call MPI_Abort(comm,0)
end if

The possibilities for the order parameter are MPI_ORDER_C and MPI_ORDER_FORTRAN. However, this has noth-
ing to do with the order of traversal of elements; it determines how the bounds of the subarray are inter-
preted. As an example, we fill a 4 × 4 array in C order with the numbers 0 ⋯ 15, and send the [0, 1] × [0 ⋯ 4]
slice two ways, first C order, then Fortran order:
// row2col.c
#define SIZE 4
int
sizes[2], subsizes[2], starts[2];
sizes[0] = SIZE; sizes[1] = SIZE;
subsizes[0] = SIZE/2; subsizes[1] = SIZE;
starts[0] = starts[1] = 0;
MPI_Type_create_subarray
(2,sizes,subsizes,starts,
MPI_ORDER_C,MPI_DOUBLE,&rowtype);
MPI_Type_create_subarray
(2,sizes,subsizes,starts,
MPI_ORDER_FORTRAN,MPI_DOUBLE,&coltype);

The receiver receives the following, formatted to bring out where the numbers originate:
Received C order:
0.000 1.000 2.000 3.000
4.000 5.000 6.000 7.000
Received F order:
0.000 1.000
4.000 5.000
8.000 9.000
12.000 13.000


Figure 6.12 MPI_Type_indexed / MPI_Type_indexed_c

Param name               Explanation                           C type                      F type               inout
count                    number of blocks---also number of     int / MPI_Count             INTEGER              IN
                         entries in array_of_displacements
                         and array_of_blocklengths
array_of_blocklengths    number of elements per block          const int[] / MPI_Count[]   INTEGER(count)       IN
array_of_displacements   displacement for each block, in       const int[] / MPI_Count[]   INTEGER(count)       IN
                         multiples of oldtype
oldtype                  old datatype                          MPI_Datatype                TYPE(MPI_Datatype)   IN
newtype                  new datatype                          MPI_Datatype*               TYPE(MPI_Datatype)   OUT

Python:

MPI.Datatype.Create_indexed(self, blocklengths,displacements )

6.3.5 Distributed array type


The distributed array type, constructed with MPI_Type_create_darray, describes the local part of a block
or cyclic distribution of a multi-dimensional array over a process grid. Each dimension can independently
be distributed as MPI_DISTRIBUTE_BLOCK, MPI_DISTRIBUTE_CYCLIC, or MPI_DISTRIBUTE_NONE.

With the cyclic distribution, the amount of cyclicity can be indicated by setting dargs[id] to a certain
number.
With the block distribution, blocks can be set explicitly in dargs[id], but MPI_DISTRIBUTE_DFLT_DARG causes
an even distribution to be found.
Ordering can be MPI_ORDER_C or MPI_ORDER_FORTRAN.
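As a minimal sketch (not one of this book's example codes): assume nprocs and procno are the communicator size and rank, the processes form a 𝑝 × 𝑞 grid, and the global array of doubles is square of side gsize; then each process can describe its own block as follows.
int gsizes[2]   = {gsize,gsize};
int distribs[2] = {MPI_DISTRIBUTE_BLOCK,MPI_DISTRIBUTE_BLOCK};
int dargs[2]    = {MPI_DISTRIBUTE_DFLT_DARG,MPI_DISTRIBUTE_DFLT_DARG};
int psizes[2]   = {p,q};
MPI_Datatype blockblock;
MPI_Type_create_darray
  (nprocs,procno,2, gsizes,distribs,dargs,psizes,
   MPI_ORDER_C,MPI_DOUBLE,&blockblock);
MPI_Type_commit(&blockblock);
// 'blockblock' now describes this process's part of the global array,
// for instance for use in MPI file I/O.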

6.3.6 Indexed type


The indexed datatype, constructed with MPI_Type_indexed (figure 6.12), can send arbitrarily located ele-
ments from an array of a single datatype. You need to supply an array of index locations, plus an array of
blocklengths with a separate blocklength for each index. The total number of elements sent is the sum of
the blocklengths.
Figure 6.6: The elements of an MPI Indexed datatype

The following example picks items that are on prime number-indexed locations.
// indexed.c
displacements = (int*) malloc(count*sizeof(int));
blocklengths = (int*) malloc(count*sizeof(int));
source = (int*) malloc(totalcount*sizeof(int));
target = (int*) malloc(targetbuffersize*sizeof(int));
MPI_Datatype newvectortype;
if (procno==sender) {
MPI_Type_indexed(count,blocklengths,displacements,MPI_INT,&newvectortype);
MPI_Type_commit(&newvectortype);
MPI_Send(source,1,newvectortype,the_other,0,comm);
MPI_Type_free(&newvectortype);
} else if (procno==receiver) {
MPI_Status recv_status;
int recv_count;
MPI_Recv(target,targetbuffersize,MPI_INT,the_other,0,comm,
&recv_status);
MPI_Get_count(&recv_status,MPI_INT,&recv_count);
ASSERT(recv_count==count);
}

For Fortran we show the legacy syntax for once:


!! indexed.F90
integer :: newvectortype;
ALLOCATE(indices(count))
ALLOCATE(blocklengths(count))
ALLOCATE(source(totalcount))
ALLOCATE(targt(count))
if (mytid==sender) then
call MPI_Type_indexed(count,blocklengths,indices,MPI_INT,&
newvectortype,err)
call MPI_Type_commit(newvectortype,err)
call MPI_Send(source,1,newvectortype,receiver,0,comm,err)
call MPI_Type_free(newvectortype,err)
else if (mytid==receiver) then
call MPI_Recv(targt,count,MPI_INT,sender,0,comm,&
recv_status,err)
call MPI_Get_count(recv_status,MPI_INT,recv_count,err)
! ASSERT(recv_count==count);
end if

## indexed.py
displacements = np.empty(count,dtype=int)
blocklengths = np.empty(count,dtype=int)
source = np.empty(totalcount,dtype=np.float64)

target = np.empty(count,dtype=np.float64)
if procid==sender:
newindextype = MPI.DOUBLE.Create_indexed(blocklengths,displacements)
newindextype.Commit()
comm.Send([source,1,newindextype],dest=the_other)
newindextype.Free()
elif procid==receiver:
comm.Recv([target,count,MPI.DOUBLE],source=the_other)

MPL note 49: Indexed type. In MPL, the indexed_layout is based on a vector of 2-tuples denoting block
length / block location.
// indexed.cxx
const int count = 5;
mpl::contiguous_layout<int>
fiveints(count);
mpl::indexed_layout<int>
indexed_where{ { {1,2}, {1,3}, {1,5}, {1,7}, {1,11} } };

if (procno==sender) {
comm_world.send( source_buffer.data(),indexed_where, receiver );
} else if (procno==receiver) {
auto recv_status =
comm_world.recv( target_buffer.data(),fiveints, sender );
int recv_count = recv_status.get_count<int>();
assert(recv_count==count);
}

MPL note 50: Layouts for gatherv. The size/displacement arrays for MPI_Gatherv / MPI_Alltoallv are han-
dled through a layouts object, which is basically a vector of layout objects.
mpl::layouts<int> receive_layout;
for ( int iproc=0,loc=0; iproc<nprocs; iproc++ ) {
auto siz = size_buffer.at(iproc);
receive_layout.push_back
( mpl::indexed_layout<int>( {{ siz,loc }} ) );
loc += siz;
}

MPL note 51: Indexed block type. For the case where all block lengths are the same, use
indexed_block_layout:
// indexedblock.cxx
mpl::indexed_block_layout<int>
indexed_where( 1, {2,3,5,7,11} );
comm_world.send( source_buffer.data(),indexed_where, receiver );

You can also use MPI_Type_create_hindexed, which describes blocks of a single old type, but with index locations
in bytes, rather than in multiples of the old type.
int MPI_Type_create_hindexed
(int count, int blocklens[], MPI_Aint indices[],
MPI_Datatype old_type,MPI_Datatype *newtype)


Figure 6.13 MPI_Type_create_hindexed_block / MPI_Type_create_hindexed_block_c

Param name               Explanation                           C type                     F type                           inout
count                    number of blocks---also number of     int / MPI_Count            INTEGER                          IN
                         entries in array_of_displacements
blocklength              number of elements in each block      int / MPI_Count            INTEGER                          IN
array_of_displacements   byte displacement of each block       const MPI_Aint[] /         INTEGER(KIND=MPI_ADDRESS_KIND)   IN
                                                               MPI_Count[]                (count)
oldtype                  old datatype                          MPI_Datatype               TYPE(MPI_Datatype)               IN
newtype                  new datatype                          MPI_Datatype*              TYPE(MPI_Datatype)               OUT

Figure 6.14 MPI_Get_address


Name Param name Explanation C type F type inout
MPI_Get_address (
location location in caller memory const void* TYPE(*), IN
DIMENSION(..)
address address of location MPI_Aint* INTEGER OUT
(KIND=MPI_ADDRESS_KIND)
)

A slightly simpler version, MPI_Type_create_hindexed_block (figure 6.13) assumes constant block length.
There is an important difference between the hindexed routines and the above MPI_Type_indexed: the latter
describes offsets from a base location, while the hindexed routines can describe absolute memory addresses. You can use this
to send for instance the elements of a linked list. You would traverse the list, recording the addresses of
the elements with MPI_Get_address (figure 6.14). (The routine MPI_Address is deprecated.)
In C++ you can use this to send an std::vector, that is, a vector object from the C++ standard library, if
the component type is a pointer.
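As an illustration of the linked-list case, here is a minimal sketch, not one of this book's example codes; the list structure and the names head, receiver, comm are made up for the occasion.
struct node { int value; struct node *next; };
#define MAXNODES 100
int blens[MAXNODES]; MPI_Aint displs[MAXNODES];
int nelements = 0;
for ( struct node *p=head; p!=NULL && nelements<MAXNODES; p=p->next ) {
  MPI_Get_address( &p->value,&displs[nelements] );
  blens[nelements] = 1; nelements++;
}
MPI_Datatype listtype;
MPI_Type_create_hindexed( nelements,blens,displs,MPI_INT,&listtype );
MPI_Type_commit( &listtype );
// the displacements are absolute addresses, so the send buffer is MPI_BOTTOM:
MPI_Send( MPI_BOTTOM,1,listtype,receiver,0,comm );
MPI_Type_free( &listtype );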

6.3.7 Struct type


The structure type, created with MPI_Type_create_struct (figure 6.15), can contain multiple data types.
(The routine MPI_Type_struct is deprecated with MPI-3.) The specification contains a ‘count’ parameter
that specifies how many blocks there are in a single structure. For instance,
struct {
  int i;
  float x,y;
} point;

has two blocks, one of a single integer, and one of two floats. This is illustrated in figure 6.7.

Figure 6.7: The elements of an MPI Struct datatype

Figure 6.15 MPI_Type_create_struct / MPI_Type_create_struct_c

Param name               Explanation                           C type                     F type                           inout
count                    number of blocks---also number of     int / MPI_Count            INTEGER                          IN
                         entries in arrays array_of_types,
                         array_of_displacements, and
                         array_of_blocklengths
array_of_blocklengths    number of elements in each block      const int[] /              INTEGER(count)                   IN
                                                               MPI_Count[]
array_of_displacements   byte displacement of each block       const MPI_Aint[] /         INTEGER(KIND=MPI_ADDRESS_KIND)   IN
                                                               MPI_Count[]                (count)
array_of_types           type of elements in each block        const MPI_Datatype[]       TYPE(MPI_Datatype)(count)        IN
newtype                  new datatype                          MPI_Datatype*              TYPE(MPI_Datatype)               OUT
count The number of blocks in this datatype. The blocklengths, displacements, types arguments
have to be at least of this length.
blocklengths array containing the lengths of the blocks of each datatype.
displacements array describing the relative location of the blocks of each datatype.
types array containing the datatypes; each block in the new type is of a single datatype; there can be
multiple blocks consisting of the same type.
In this example, unlike the previous ones, both sender and receiver create the structure type. With struc-
tures it is no longer possible to send as a derived type and receive as an array of a simple type. (It would
be possible to send as one structure type and receive as another, as long as they have the same datatype
signature.)


// struct.c
struct object {
char c;
double x[2];
int i;
};
MPI_Datatype newstructuretype;
int structlen = 3;
int blocklengths[structlen]; MPI_Datatype types[structlen];
MPI_Aint displacements[structlen];

/*
* where are the components relative to the structure?
*/
MPI_Aint current_displacement=0;

// one character
blocklengths[0] = 1; types[0] = MPI_CHAR;
displacements[0] = (size_t)&(myobject.c) - (size_t)&myobject;

// two doubles
blocklengths[1] = 2; types[1] = MPI_DOUBLE;
displacements[1] = (size_t)&(myobject.x) - (size_t)&myobject;

// one int
blocklengths[2] = 1; types[2] = MPI_INT;
displacements[2] = (size_t)&(myobject.i) - (size_t)&myobject;

MPI_Type_create_struct(structlen,blocklengths,displacements,types,&newstructuretype);
MPI_Type_commit(&newstructuretype);
if (procno==sender) {
MPI_Send(&myobject,1,newstructuretype,the_other,0,comm);
} else if (procno==receiver) {
MPI_Recv(&myobject,1,newstructuretype,the_other,0,comm,MPI_STATUS_IGNORE);
}
MPI_Type_free(&newstructuretype);

Note the displacement calculations in this example, which involve some not so elegant pointer arith-
metic. The following Fortran code uses MPI_Get_address, which is more elegant, and in fact the only way
address calculations can be done in Fortran.
!! struct.F90
Type object
character :: c
real*8,dimension(2) :: x
integer :: i
end type object
type(object) :: myobject
integer,parameter :: structlen = 3
type(MPI_Datatype) :: newstructuretype
integer,dimension(structlen) :: blocklengths
type(MPI_Datatype),dimension(structlen) :: types;
integer(kind=MPI_ADDRESS_KIND),dimension(structlen) :: displacements
integer(kind=MPI_ADDRESS_KIND) :: base_displacement, next_displacement


if (procno==sender) then
myobject%c = 'x'
myobject%x(1) = 2.7; myobject%x(2) = 1.5
myobject%i = 37

!! component 1: one character


blocklengths(1) = 1; types(1) = MPI_CHARACTER
call MPI_Get_address(myobject,base_displacement)
call MPI_Get_address(myobject%c,next_displacement)
displacements(1) = next_displacement-base_displacement

!! component 2: two doubles


blocklengths(2) = 2; types(2) = MPI_DOUBLE_PRECISION
call MPI_Get_address(myobject%x,next_displacement)
displacements(2) = next_displacement-base_displacement

!! component 3: one int


blocklengths(3) = 1; types(3) = MPI_INTEGER
call MPI_Get_address(myobject%i,next_displacement)
displacements(3) = next_displacement-base_displacement

if (procno==sender) then
call MPI_Send(myobject,1,newstructuretype,receiver,0,comm)
else if (procno==receiver) then
call MPI_Recv(myobject,1,newstructuretype,sender,0,comm,MPI_STATUS_IGNORE)
end if
call MPI_Type_free(newstructuretype)

It would have been incorrect to write


displacement[0] = 0;
displacement[1] = displacement[0] + sizeof(char);

since you do not know the way the compiler lays out the structure in memory. (Homework question: what does the language standard say about this?)
If you want to send more than one structure, you have to worry more about padding in the structure. You
can solve this by adding an extra type MPI_UB for the ‘upper bound’ on the structure (note that MPI_UB is
deprecated; resizing the extent, as in section 6.6.2, is the modern solution):
displacements[3] = sizeof(myobject); types[3] = MPI_UB;
MPI_Type_create_struct(struclen+1,.....);

MPL note 52: Struct type scalar. One could describe the MPI struct type as a collection of displacements, to
be applied to any set of items that conforms to the specifications. An MPL heterogeneous_layout
on the other hand, incorporates the actual data. Thus you could write
// structscalar.cxx
char c; double x; int i;
if (procno==sender) {
c = 'x'; x = 2.4; i = 37; }
mpl::heterogeneous_layout object( c,x,i );
if (procno==sender)

comm_world.send( mpl::absolute,object,receiver );
else if (procno==receiver)
comm_world.recv( mpl::absolute,object,sender );

Here, the absolute indicates the lack of an implicit buffer: the layout is absolute rather than a
relative description.

MPL note 53: Struct type general. More complicated data than scalars takes more work:
// struct.cxx
char c; vector<double> x(2); int i;
if (procno==sender) {
c = 'x'; x[0] = 2.7; x[1] = 1.5; i = 37; }
mpl::heterogeneous_layout object
( c,
mpl::make_absolute(x.data(),mpl::vector_layout<double>(2)),
i );
if (procno==sender) {
comm_world.send( mpl::absolute,object,receiver );
} else if (procno==receiver) {
comm_world.recv( mpl::absolute,object,sender );
}

Note the make_absolute in addition to absolute mentioned above.

6.4 Big data types


The size parameter in MPI send and receive calls is of type integer, meaning that it's maximally (platform-
dependent, but typically:) 2³¹ − 1. These days computers are big enough that this is a limitation. As of
the MPI-4 standard, this has been solved by allowing a larger count parameter of type MPI_Count. The
implementation of this depends somewhat on the language.
The following material is for the recently released MPI-4 standard and may not be supported yet.

6.4.1 C

For every routine, such as MPI_Send with an integer count, there is a corresponding MPI_Send_c with a
count of type MPI_Count.
MPI_Count buffersize = 1000;
double *indata,*outdata;
indata = (double*) malloc( buffersize*sizeof(double) );
outdata = (double*) malloc( buffersize*sizeof(double) );
MPI_Allreduce_c(indata,outdata,buffersize,
MPI_DOUBLE,MPI_SUM,MPI_COMM_WORLD);


Code: Output:
// pingpongbig.c make[3]: `pingpongbig' is up to date.
assert( sizeof(MPI_Count)>4 ); Ping-pong between ranks 0--1, repeated 10
for ( int power=3; power<=10; power++) { ↪times
MPI_Count length=pow(10,power); MPI Count has 8 bytes
buffer = (double*)malloc( Size: 10^3, (repeats=10000)
↪length*sizeof(double) ); Time 1.399211e-05 for size 10^3: 1.1435
MPI_Ssend_c ↪Gb/sec
(buffer,length,MPI_DOUBLE, Size: 10^4, (repeats=10000)
processB,0,comm); Time 4.077882e-05 for size 10^4: 3.9236
MPI_Recv_c ↪Gb/sec
(buffer,length,MPI_DOUBLE, Size: 10^5, (repeats=1000)
processB,0,comm,MPI_STATUS_IGNORE); Time 1.532863e-04 for size 10^5: 10.4380
↪Gb/sec
Size: 10^6, (repeats=1000)
Time 1.418844e-03 for size 10^6: 11.2768
↪Gb/sec
Size: 10^7, (repeats=100)
Time 1.443470e-02 for size 10^7: 11.0844
↪Gb/sec
Size: 10^8, (repeats=100)
Time 1.540918e-01 for size 10^8: 10.3834
↪Gb/sec
Size: 10^9, (repeats=10)
Time 1.813220e+00 for size 10^9: 8.8241
↪Gb/sec
Size: 10^10, (repeats=10)
Time 1.846741e+01 for size 10^10: 8.6639
↪Gb/sec

6.4.2 Fortran
The count parameter can be declared to be
use mpi_f08
Integer(kind=MPI_COUNT_KIND) :: count

Since Fortran has polymorphism, the same routine names can be used.

The legit way of coding: … but you can see what’s under the hood:
!! typecheckkind.F90 !! typecheck8.F90
integer(8) :: source,n=1 integer(8) :: source,n=1
call MPI_Init() call MPI_Init()
call MPI_Send(source,n,MPI_INTEGER8, & call MPI_Send(source,n,MPI_INTEGER8, &
1,0,MPI_COMM_WORLD) 1,0,MPI_COMM_WORLD)

Routines using this type are not available unless using the mpi_f08 module.
End of MPI-4 material


!! pingpongbig.F90
integer :: power,countbytes
Integer(KIND=MPI_COUNT_KIND) :: length
call MPI_Sizeof(length,countbytes,ierr)
if (procno==0) &
print *,"Bytes in count:",countbytes
length = 10**power
allocate( senddata(length),recvdata(length) )
call MPI_Send(senddata,length,MPI_DOUBLE_PRECISION, &
processB,0, comm)
call MPI_Recv(recvdata,length,MPI_DOUBLE_PRECISION, &
processB,0, comm,MPI_STATUS_IGNORE)

6.4.3 Count datatype


The MPI_Count datatype is defined as being large enough to accommodate values of
• the ordinary 4-byte integer type;
• the MPI_Aint type, section 6.2.4;
• the MPI_Offset type, section 10.2.2.
The size_t type in C/C++ is defined as big enough to contain the output of sizeof, that is, being big
enough to measure any object.
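The following is a minimal sketch (not one of this book's example codes) that prints the sizes of these types on the current platform; the output is of course platform-dependent.
#include <stdio.h>
#include <mpi.h>
int main(int argc,char **argv) {
  MPI_Init(&argc,&argv);
  // all sizes in bytes
  printf("int: %zu, MPI_Aint: %zu, MPI_Offset: %zu, MPI_Count: %zu, size_t: %zu\n",
         sizeof(int),sizeof(MPI_Aint),sizeof(MPI_Offset),
         sizeof(MPI_Count),sizeof(size_t));
  MPI_Finalize();
  return 0;
}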

6.4.4 MPI 3 temporary solution


Large messages were already possible by using derived types: to send a big data type of 10⁴⁰ elements you
would
• create a contiguous type with 10²⁰ elements, and
• send 10²⁰ elements of that type.
This often works, but it’s not perfect. For instance, the routine MPI_Get_elements returns the total number
of basic elements sent (as opposed to MPI_Get_count which would return the number of elements of the
derived type). Since its output argument is of integer type, it can’t store the right value.
The MPI-3 standard has addressed this through the introduction of an MPI_Count datatype, and new rou-
tines with an _x extension, that return that type of count. (In view of the ‘embiggened’ routines, this
solution is no longer needed, and will probably be deprecated in later standards.)
Let us consider an example.
Allocating a buffer of more than 4Gbyte is not hard:
// vectorx.c
float *source=NULL,*target=NULL;
int mediumsize = 1<<30;
int nblocks = 8;
size_t datasize = (size_t)mediumsize * nblocks * sizeof(float);
if (procno==sender) {
source = (float*) malloc(datasize);


We use the trick with sending elements of a derived type:


MPI_Datatype blocktype;
MPI_Type_contiguous(mediumsize,MPI_FLOAT,&blocktype);
MPI_Type_commit(&blocktype);
if (procno==sender) {
MPI_Send(source,nblocks,blocktype,receiver,0,comm);

We use the same trick for the receive call, but now we catch the status parameter which will later tell us
how many elements of the basic type were sent:
} else if (procno==receiver) {
MPI_Status recv_status;
MPI_Recv(target,nblocks,blocktype,sender,0,comm,
&recv_status);

When we query how many of the basic elements are in the buffer (remember that in the receive call the
buffer length is an upper bound on the number of elements received) we need a counter that is larger
than an integer. MPI has introduced a type MPI_Count for this, and new routines such as MPI_Get_elements_x
(figure 4.14) that return a count of this type:
MPI_Count recv_count;
MPI_Get_elements_x(&recv_status,MPI_FLOAT,&recv_count);

Remark 16 Computing a big number to allocate is not entirely simple.


// getx.c
int gig = 1<<30;
int nblocks = 8;
size_t big1 = gig * nblocks * sizeof(double);
size_t big2 = (size_t)1 * gig * nblocks * sizeof(double);
size_t big3 = (size_t) gig * nblocks * sizeof(double);
size_t big4 = gig * nblocks * (size_t) ( sizeof(double) );
size_t big5 = sizeof(double) * gig * nblocks;
;

gives as output:
size of size_t = 8
0 68719476736 68719476736 0 68719476736
Clearly, not only do operations go left-to-right, but casting is done that way too: the computed subexpressions
are only cast to size_t if one operand is.

Above, we did not actually create a datatype that was bigger than 2G, but if you do so, you can query its
extent by MPI_Type_get_extent_x (figure 6.17) and MPI_Type_get_true_extent_x (figure 6.17).
Python note 23: Big data. Since python has unlimited size integers there is no explicit need for the
‘x’ variants of routines. Internally, MPI.Status.Get_elements is implemented in terms of
MPI_Get_elements_x.


Figure 6.16 MPI_Type_get_extent / MPI_Type_get_extent_c

Param name   Explanation                      C type                   F type                           inout
datatype     datatype to get information on   MPI_Datatype             TYPE(MPI_Datatype)               IN
lb           lower bound of datatype          MPI_Aint* / MPI_Count*   INTEGER(KIND=MPI_ADDRESS_KIND)   OUT
extent       extent of datatype               MPI_Aint* / MPI_Count*   INTEGER(KIND=MPI_ADDRESS_KIND)   OUT

6.5 Type maps and type matching


With derived types, you saw that it was not necessary for the type of the sender and receiver to match.
However, when the send buffer is constructed, and the receive buffer unpacked, it is necessary for the
successive types in that buffer to match.
The types in the send and receive buffers also need to match the datatypes of the underlying architecture,
with two exceptions. The MPI_PACKED and MPI_BYTE types can match any underlying type. However, this
still does not mean that it is a good idea to use these types on only sender or receiver, and a specific type
on the other.

6.6 Type extent


See section 6.2.5 about the related issue of type sizes.

6.6.1 Extent and true extent


The datatype extent, measured with MPI_Type_get_extent (figure 6.16), is strictly the distance from the first
to the last data item of the type, that is, counting the gaps in the type. It is measured in bytes, so the
output parameters are of type MPI_Aint.
In the following example (see also figure 6.8) we measure the extent of a vector type. Note that the extent
is not the stride times the number of blocks, because that would count a ‘trailing gap’.
MPI_Aint lb,asize;
MPI_Type_vector(count,bs,stride,MPI_DOUBLE,&newtype);
MPI_Type_commit(&newtype);
MPI_Type_get_extent(newtype,&lb,&asize);
ASSERT( lb==0 );
ASSERT( asize==((count-1)*stride+bs)*sizeof(double) );
MPI_Type_free(&newtype);

Similarly, using MPI_Type_get_extent counts the gaps in a struct induced by alignment issues.


Figure 6.8: Extent of a vector datatype

size_t size_of_struct = sizeof(struct object);


MPI_Aint typesize,typelb;
MPI_Type_get_extent(newstructuretype,&typelb,&typesize);
assert( typesize==size_of_struct );

See section 6.3.7 for the code defining the structure type.

Remark 17 Routine MPI_Type_get_extent replaces deprecated functions MPI_Type_extent, MPI_Type_lb,


MPI_Type_ub.

Figure 6.9: True lower bound and extent of a subarray data type

The subarray datatype need not start at the first element of the buffer, so the extent is an overstatement
of how much data is involved. In fact, the lower bound is zero, and the extent equals the size of the block
from which the subarray is taken. The routine MPI_Type_get_true_extent (figure 6.17) returns the lower
bound, indicating where the data starts, and the extent from that point. This is illustrated in figure 6.9.


Figure 6.17 MPI_Type_get_true_extent / MPI_Type_get_true_extent_c

Param name    Explanation                      C type                   F type                           inout
datatype      datatype to get information on   MPI_Datatype             TYPE(MPI_Datatype)               IN
true_lb       true lower bound of datatype     MPI_Aint* / MPI_Count*   INTEGER(KIND=MPI_ADDRESS_KIND)   OUT
true_extent   true extent of datatype          MPI_Aint* / MPI_Count*   INTEGER(KIND=MPI_ADDRESS_KIND)   OUT

Code: Output:
// trueextent.c In basic array of 192 bytes
int sender = 0, receiver = 1, the_other = 1-procno; find sub array of 48 bytes
int sizes[2] = {4,6},subsizes[2] = {2,3},starts[2] = {1,2}; Found lb=64, extent=72
MPI_Datatype subarraytype; Computing lb=64 extent=72
MPI_Type_create_subarray Non-true lb=0, extent=192,
(2,sizes,subsizes,starts, ↪computed=192
MPI_ORDER_C,MPI_DOUBLE,&subarraytype); Finished
MPI_Type_commit(&subarraytype); received: 8.500 9.500
↪10.500 14.500 15.500
MPI_Aint true_lb,true_extent,extent; ↪16.500
MPI_Type_get_true_extent 1,2
(subarraytype,&true_lb,&true_extent); 1,3
MPI_Aint 1,4
comp_lb = sizeof(double) * 2,2
( starts[0]*sizes[1]+starts[1] ), 2,3
comp_extent = sizeof(double) * 2,4
( sizes[1]-starts[1] // first row
+ starts[1]+subsizes[1] // last row
+ ( subsizes[0]>1 ? subsizes[0]-2 : 0 )*sizes[1]
↪);
ASSERT(true_lb==comp_lb);
ASSERT(true_extent==comp_extent);
MPI_Send(source,1,subarraytype,the_other,0,comm);
MPI_Type_free(&subarraytype);

There are also ‘big data’ routines MPI_Type_get_extent_x and MPI_Type_get_true_extent_x that have an MPI_Count
as output.

The following material is for the recently released MPI-4 standard and may not be supported yet.

The C routines MPI_Type_get_extent_c and MPI_Type_get_true_extent_c also output an MPI_Count.


End of MPI-4 material


Figure 6.18 MPI_Type_create_resized / MPI_Type_create_resized_c

Param name   Explanation                   C type                 F type                           inout
oldtype      input datatype                MPI_Datatype           TYPE(MPI_Datatype)               IN
lb           new lower bound of datatype   MPI_Aint / MPI_Count   INTEGER(KIND=MPI_ADDRESS_KIND)   IN
extent       new extent of datatype        MPI_Aint / MPI_Count   INTEGER(KIND=MPI_ADDRESS_KIND)   IN
newtype      output datatype               MPI_Datatype*          TYPE(MPI_Datatype)               OUT

6.6.2 Extent resizing


A type is partly characterized by its lower bound and extent, or equivalently lower bound and upper bound.
Somewhat miraculously, you can actually change these to achieve special effects. This is needed for:
• Some cases of gather/scatter operations; see the example in section 6.6.2.2.
• When the count of derived items in a buffer is more than one. See the example in section 6.6.2.1.
The technicality on which the solution hinges is that you can ‘resize’ a type with MPI_Type_create_resized
(figure 6.18) to give it a different extent, while not affecting how much data there actually is in it.
We can describe the space taken by a data type (the ‘true extent’) and the ‘extent’ as follows. If the send
count is more than 1, or if you scatter some data type:
1. A pointer is set at the first data item; then
2. For each instance of the datatype to be sent:
(a) data is sent as described by the data type; and
(b) the pointer is advanced by the extent of the data type
(Receiving and gathering behave similarly, but with data going in the opposite direction.)

6.6.2.1 Example 1
In the examples of derived types so far we always used a send count of 1. What happens if you use a larger
count?
Consider a vector type, with a send count of 2.
MPI_Type_vector( count,bs,stride,oldtype,&one_n_type );
MPI_Type_contiguous( 2,&one_n_type,&two_n_type );

Contrast this with a twice-as-large vector type:

MPI_Type_vector( 2*count,bs,stride,oldtype,&two_n_type );

Figure 6.10: Contiguous type of two vectors, before and after resizing the extent.

The difference is pictured in figure 6.10, where the top illustrates the result of using a send count of 2,
and the bottom the desired effect.
To show how this problem can be solved by resizing the extent, let’s look at a specific example, and
consider sending more than one derived type, from a buffer containing consecutive integers:
// vectorpadsend.c
for (int i=0; i<max_elements; i++) sendbuffer[i] = i;
MPI_Type_vector(count,blocklength,stride,MPI_INT,&stridetype);
MPI_Type_commit(&stridetype);
MPI_Send( sendbuffer,ntypes,stridetype, receiver,0, comm );

We receive into a contiguous buffer:


MPI_Recv( recvbuffer,max_elements,MPI_INT, sender,0, comm,&status );
int count; MPI_Get_count(&status,MPI_INT,&count);
printf("Receive %d elements:",count);
for (int i=0; i<count; i++) printf(" %d",recvbuffer[i]);
printf("\n");

giving an output of:


Receive 6 elements: 0 2 4 5 7 9

Next, we resize the type to add the gap at the end. This is illustrated in figure 6.10.
Resizing the type looks like:
MPI_Type_get_extent(stridetype,&l,&e);
printf("Stride type l=%ld e=%ld\n",l,e);
e += ( stride-blocklength) * sizeof(int);
MPI_Type_create_resized(stridetype,l,e,&paddedtype);
MPI_Type_get_extent(paddedtype,&l,&e);
printf("Padded type l=%ld e=%ld\n",l,e);
MPI_Type_commit(&paddedtype);
MPI_Send( sendbuffer,ntypes,paddedtype, receiver,0, comm );

and the corresponding output, including querying the extents, is:

Strided type l=0 e=20
Padded type l=0 e=24
Receive 6 elements: 0 2 4 6 8 10

Figure 6.11: Placement of gathered strided types.
About the extent routines we make two observations:
1. the lower bound and extent parameters are of type MPI_Aint, meaning that they measure in
bytes; and
2. the lower bound is typically left unchanged: we give no examples in this book where it is
changed.

6.6.2.2 Example 2
For another example, let’s revisit exercise 6.4 (and figure 6.5) where each process makes a buffer of integers
that will be interleaved in a gather call: Strided data was sent in individual transactions. Would it be
possible to address all these interleaved packets in one gather or scatter call?
The problem here is that MPI uses the extent of the send type in a scatter, or the receive type in a gather:
if that type is 20 bytes big from its first to its last element, then data will be read out 20 bytes apart in a
scatter, or written 20 bytes apart in a gather. This ignores the ‘gaps’ in the type! (See exercise 6.4.)
int *mydata = (int*) malloc( localsize*sizeof(int) );
for (int i=0; i<localsize; i++)
mydata[i] = i*nprocs+procno;
MPI_Gather( mydata,localsize,MPI_INT,
/* rest to be determined */ );

An ordinary gather call will of course not interleave, but put the data end-to-end:
MPI_Gather( mydata,localsize,MPI_INT,
gathered,localsize,MPI_INT, // abutting
root,comm );

gather 4 elements from 3 procs:


0 3 6 9 1 4 7 10 2 5 8 11


Using a strided type still puts data end-to-end, but now there are unwritten gaps in the gather buffer:
MPI_Gather( mydata,localsize,MPI_INT,
gathered,1,stridetype, // abut with gaps
root,comm );

This is illustrated in figure 6.11. A sample printout of the result would be:
0 1879048192 1100361260 3 3 0 6 0 0 9 1 198654

Figure 6.12: Interleaved gather from data with resized extent

The trick is to use MPI_Type_create_resized to make the extent of the type only one int long:
// interleavegather.c
MPI_Datatype interleavetype;
MPI_Type_create_resized(stridetype,0,sizeof(int),&interleavetype);
MPI_Type_commit(&interleavetype);
MPI_Gather( mydata,localsize,MPI_INT,
gathered,1,interleavetype, // shrunk extent
root,comm );

Now data is written with the same stride, but at starting points equal to the shrunk extent:
0 1 2 3 4 5 6 7 8 9 10 11
This is illustrated in figure 6.12.
Fortran note 12: Extent as Aint. The lowerbound and extent parameters are of type
Integer(kind=MPI_Address_kind):
!! stridescatter.F90

integer(kind=MPI_Address_kind) :: l,e
call MPI_Type_get_extent(scattertype,l,e)
e = c_sizeof(i)
call MPI_Type_create_resized(scattertype,l,e,interleavetype)
call MPI_Type_commit(interleavetype)

Exercise 6.7. Rewrite exercise 6.4 to use a gather, rather than individual messages.
MPL note 54: Extent resizing. Resizing a datatype does not give a new type, but does the resize ‘in place’:
void layout::resize(ssize_t lb, ssize_t extent);

6.6.2.3 Example: dynamic vectors

Does it bother you (a little) that in the vector type you have to specify explicitly how many blocks there
are? It would be nice if you could create a ‘block with padding’ and then send however many of those.

Well, you can introduce that padding by resizing a type, making it a little larger.
// stridestretch.c
MPI_Datatype oneblock;
MPI_Type_vector(1,1,stride,MPI_DOUBLE,&oneblock);
MPI_Type_commit(&oneblock);
MPI_Aint block_lb,block_x;
MPI_Type_get_extent(oneblock,&block_lb,&block_x);
printf("One block has extent: %ld\n",block_x);

MPI_Datatype paddedblock;
MPI_Type_create_resized(oneblock,0,stride*sizeof(double),&paddedblock);
MPI_Type_commit(&paddedblock);
MPI_Type_get_extent(paddedblock,&block_lb,&block_x);
printf("Padded block has extent: %ld\n",block_x);

// now send a bunch of these padded blocks


MPI_Send(source,count,paddedblock,the_other,0,comm);

There is a second solution to this problem, using a structure type. This does not use resizing, but rather
indicates a displacement that reaches to the end of the structure. We do this by putting a type MPI_UB at
this displacement:
int blens[2]; MPI_Aint displs[2];
MPI_Datatype types[2], paddedblock;
blens[0] = 1; blens[1] = 1;
displs[0] = 0; displs[1] = 2 * sizeof(double);
types[0] = MPI_DOUBLE; types[1] = MPI_UB;
MPI_Type_struct(2, blens, displs, types, &paddedblock);
MPI_Type_commit(&paddedblock);
MPI_Status recv_status;
MPI_Recv(target,count,paddedblock,the_other,0,comm,&recv_status);


Figure 6.13: Transposing a 1D partitioned array

6.6.2.4 Example: transpose


Transposing data is an important part of such operations as the FFT. We develop this in steps. Refer to
figure 6.13.
The source data can be described as a vector type defined as:
• there are 𝑏 blocks,
• of blocksize 𝑏,
• spaced apart by the global 𝑖-size of the array.
// transposeblock.c
MPI_Datatype sourceblock;
MPI_Type_vector( blocksize_j,blocksize_i,isize,MPI_INT,&sourceblock);
MPI_Type_commit( &sourceblock);

The target type is harder to describe. First we note that each contiguous block from the source type can
be described as a vector type with:
• 𝑏 blocks,
• of size 1 each,
• strided by the global 𝑗-size of the matrix.
MPI_Datatype targetcolumn;
MPI_Type_vector( blocksize_i,1,jsize, MPI_INT,&targetcolumn);
MPI_Type_commit( &targetcolumn );

For the full type at the receiving process we now need to pack 𝑏 of these lines together.
Exercise 6.8. Finish the code.
• What is the extent of the targetcolumn type?
• What is the spacing of the first elements of the blocks? How do you therefore
resize the targetcolumn type?

6.7 Reconstructing types


It is possible to find from a datatype how it was constructed. This uses the routines MPI_Type_get_envelope
and MPI_Type_get_contents. The first routine returns the combiner (with values such as

MPI_COMBINER_VECTOR) and the number of parameters; the second routine is then used to retrieve the
actual parameters.
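As a minimal sketch (not one of this book's example codes), assume that newvectortype was created and committed earlier as a vector type, as in section 6.3.3; then its construction parameters can be recovered as follows.
int nints,naddrs,ntypes,combiner;
MPI_Type_get_envelope( newvectortype,&nints,&naddrs,&ntypes,&combiner );
if (combiner==MPI_COMBINER_VECTOR) {
  int ints[3]; MPI_Aint addrs[1]; MPI_Datatype types[1];
  MPI_Type_get_contents( newvectortype,nints,naddrs,ntypes,
                         ints,addrs,types );
  // for a vector type the integer parameters are count, blocklength, stride
  printf("vector: count=%d block=%d stride=%d\n",ints[0],ints[1],ints[2]);
}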

6.8 Packing
One of the reasons for derived datatypes is dealing with noncontiguous data. In older communication
libraries this could only be done by packing data from its original containers into a buffer, and likewise
unpacking it at the receiver into its destination data structures.
MPI offers this packing facility, partly for compatibility with such libraries, but also for reasons of flexibility. Unlike derived datatypes, which transfer data atomically, the packing routines add data to the buffer sequentially, and unpacking retrieves it sequentially.
This means that one could pack an integer describing how many floating point numbers are in the rest
of the packed message. Correspondingly, the unpack routine could then investigate the first integer and
based on it unpack the right number of floating point numbers.
MPI offers the following:
• The MPI_Pack command adds data to a send buffer;
• the MPI_Unpack command retrieves data from a receive buffer;
• the buffer is sent with a datatype of MPI_PACKED.
With MPI_Pack (figure 6.19) data elements can be added to a buffer one at a time. The position parameter
is updated each time by the packing routine.
Conversely, MPI_Unpack (figure 6.20) retrieves one element from the buffer at a time. You need to specify


Figure 6.20 MPI_Unpack / MPI_Unpack_c

Param name  Explanation                         C type (MPI_Unpack | MPI_Unpack_c)  F type                  inout
inbuf       input buffer start                  const void*                         TYPE(*), DIMENSION(..)  IN
insize      size of input buffer, in bytes      int | MPI_Count                     INTEGER                 IN
position    current position in bytes           int* | MPI_Count*                   INTEGER                 INOUT
outbuf      output buffer start                 void*                               TYPE(*), DIMENSION(..)  OUT
outcount    number of items to be unpacked      int | MPI_Count                     INTEGER                 IN
datatype    datatype of each output data item   MPI_Datatype                        TYPE(MPI_Datatype)      IN
comm        communicator for packed message     MPI_Comm                            TYPE(MPI_Comm)          IN

the MPI datatype.

A packed buffer is sent or received with a datatype of MPI_PACKED. The sending routine uses the position
parameter to specify how much data is sent, but the receiving routine does not know this value a priori,
so has to specify an upper bound.


Figure 6.21 MPI_Pack_size / MPI_Pack_size_c

Param name  Explanation                                     C type (MPI_Pack_size | MPI_Pack_size_c)  F type              inout
incount     count argument to packing call                  int | MPI_Count                           INTEGER             IN
datatype    datatype argument to packing call               MPI_Datatype                              TYPE(MPI_Datatype)  IN
comm        communicator argument to packing call           MPI_Comm                                  TYPE(MPI_Comm)      IN
size        upper bound on size of packed message, in bytes int* | MPI_Count*                         INTEGER             OUT

Code:
   if (procno==sender) {
     position = 0;
     MPI_Pack(&nsends,1,MPI_INT,
              buffer,buflen,&position,comm);
     for (int i=0; i<nsends; i++) {
       double value = rand()/(double)RAND_MAX;
       printf("[%d] pack %e\n",procno,value);
       MPI_Pack(&value,1,MPI_DOUBLE,
                buffer,buflen,&position,comm);
     }
     MPI_Pack(&nsends,1,MPI_INT,
              buffer,buflen,&position,comm);
     MPI_Send(buffer,position,MPI_PACKED,other,0,comm);
   } else if (procno==receiver) {
     int irecv_value;
     double xrecv_value;
     MPI_Recv(buffer,buflen,MPI_PACKED,other,0,
              comm,MPI_STATUS_IGNORE);
     position = 0;
     MPI_Unpack(buffer,buflen,&position,
                &nsends,1,MPI_INT,comm);
     for (int i=0; i<nsends; i++) {
       MPI_Unpack(buffer,buflen,
                  &position,&xrecv_value,1,MPI_DOUBLE,comm);
       printf("[%d] unpack %e\n",procno,xrecv_value);
     }
     MPI_Unpack(buffer,buflen,&position,
                &irecv_value,1,MPI_INT,comm);
     ASSERT(irecv_value==nsends);
   }

Output:
   [0] pack 8.401877e-01
   [0] pack 3.943829e-01
   [0] pack 7.830992e-01
   [0] pack 7.984400e-01
   [0] pack 9.116474e-01
   [0] pack 1.975514e-01

You can precompute the size of the required buffer with MPI_Pack_size (figure 6.21).


Code:
   // pack.c
   for (int i=1; i<=4; i++) {
     MPI_Pack_size(i,MPI_CHAR,comm,&s);
     printf("%d chars: %d\n",i,s);
   }
   for (int i=1; i<=4; i++) {
     MPI_Pack_size(i,MPI_UNSIGNED_SHORT,comm,&s);
     printf("%d unsigned shorts: %d\n",i,s);
   }
   for (int i=1; i<=4; i++) {
     MPI_Pack_size(i,MPI_INT,comm,&s);
     printf("%d ints: %d\n",i,s);
   }

Output:
   1 chars: 1
   2 chars: 2
   3 chars: 3
   4 chars: 4
   1 unsigned shorts: 2
   2 unsigned shorts: 4
   3 unsigned shorts: 6
   4 unsigned shorts: 8
   1 ints: 4
   2 ints: 8
   3 ints: 12
   4 ints: 16

Exercise 6.9. Suppose you have a ‘structure of arrays’


struct aos {
int length;
double *reals;
double *imags;
};

with dynamically created arrays. Write code to send and receive this structure.


6.9 Review questions


For all true/false questions, if you answer that a statement is false, give a one-line explanation.
1. Give two examples of MPI derived datatypes. What parameters are used to describe them?
2. Give a practical example where the sender uses a different type to send than the receiver uses
in the corresponding receive call. Name the types involved.
3. Fortran only. True or false?
(a) Array indices can be different between the send and receive buffer arrays.
(b) It is allowed to send an array section.
(c) You need to Reshape a multi-dimensional array to linear shape before you can send it.
(d) An allocatable array, when dimensioned and allocated, is treated by MPI as if it were a
normal static array, when used as send buffer.
(e) An allocatable array is allocated if you use it as the receive buffer: it is filled with the
incoming data.
4. Fortran only: how do you handle the case where you want to use an allocatable array as receive
buffer, but it has not been allocated yet, and you do not know the size of the incoming data?


6.10 Full source code of examples


Please see the repository: https://github.com/VictorEijkhout/TheArtOfHPC_vol2_
parallelprogramming/tree/main. Examples are sorted first by programming system, then by
language.



Chapter 7

MPI topic: Communicators

A communicator is an object describing a group of processes. In many applications all processes work
together closely coupled, and the only communicator you need is MPI_COMM_WORLD, the group describing all
processes that your job starts with.
In this chapter you will see ways to make new groups of MPI processes: subgroups of the original
world communicator. Chapter 8 discusses dynamic process management, which, while not extending
MPI_COMM_WORLD does extend the set of available processes. That chapter also discusses the ‘sessions
model’, which is another way to constructing communicators.

7.1 Basic communicators


There are three predefined communicators:
• MPI_COMM_WORLD comprises all processes that were started together by mpiexec (or some related
program).
• MPI_COMM_SELF is the communicator that contains only the current process.
• MPI_COMM_NULL is the invalid communicator. This value results
– when a communicator is freed; see section 7.3;
– as error return value from routines that construct communicators;
– for processes outside a created Cartesian communicator (section 11.1.1);
– on non-spawned processes when querying their parent (section 7.6.3).
These values are constants, though not necessarily compile-time constants. Thus, they can not be used in
switch statements, array declarations, or constexpr evaluations.
If you don’t want to write MPI_COMM_WORLD repeatedly, you can assign that value to a variable of type
MPI_Comm.

Examples:
// C:
#include <mpi.h>
MPI_Comm comm = MPI_COMM_WORLD;


Figure 7.1 MPI_Comm_dup


Name Param name Explanation C type F type inout
MPI_Comm_dup (
comm communicator MPI_Comm TYPE IN
(MPI_Comm)
newcomm copy of comm MPI_Comm* TYPE OUT
(MPI_Comm)
)
MPL:

Done as part of the copy assignment operator.

!! Fortran 2008 interface


use mpi_f08
Type(MPI_Comm) :: comm = MPI_COMM_WORLD

!! Fortran legacy interface


#include <mpif.h>
Integer :: comm = MPI_COMM_WORLD

Python note 24: Communicator types.


comm = MPI.COMM_WORLD

MPL note 55: Predefined communicators. The environment namespace has the equivalents of MPI_COMM_WORLD
and MPI_COMM_SELF:
const communicator& mpl::environment::comm_world();
const communicator& mpl::environment::comm_self();

There doesn’t seem to be an equivalent of MPI_COMM_NULL.


MPL note 56: Raw communicator handles. Should you need the MPI_Comm object contained in an MPL
communicator, there is an access function native_handle.

You can name your communicators with MPI_Comm_set_name, which could improve the quality of error
messages when they arise.
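
For instance, a library could label its private communicator (a minimal sketch; the variable and the name are made up):

   MPI_Comm library_comm;
   MPI_Comm_dup(MPI_COMM_WORLD,&library_comm);
   MPI_Comm_set_name(library_comm,"library-internal");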

7.2 Duplicating communicators


With MPI_Comm_dup (figure 7.1) you can make an exact duplicate of a communicator (see section 7.2.2 for
an application). There is a nonblocking variant MPI_Comm_idup (figure 7.2).
These calls do not propagate info hints (sections 15.1.1 and 15.1.1.2); to achieve this, use
MPI_Comm_dup_with_info and MPI_Comm_idup_with_info; section 15.1.1.2.

Python note 25: Communicator duplication. Duplicate communicators are created as output of the dupli-
cation routine:


Figure 7.2 MPI_Comm_idup


Name Param name Explanation C type F type inout
MPI_Comm_idup (
comm communicator MPI_Comm TYPE IN
(MPI_Comm)
newcomm copy of comm MPI_Comm* TYPE OUT
(MPI_Comm)
request communication request MPI_Request* TYPE OUT
(MPI_Request)
)

newcomm = comm.Dup()

MPL note 57: Communicator duplication. Communicators can be duplicated but only during initialization.
Copy assignment has been deleted. Thus:

// LEGAL:
mpl::communicator init = comm;
// WRONG:
mpl::communicator init;
init = comm;

7.2.1 Communicator comparing

You may wonder what ‘an exact copy’ means precisely. For this, think of a communicator as a context
label that you can attach to, among others, operations such as sends and receives. And it’s that label that
counts, not what processes are in the communicator. A send and a receive ‘belong together’ if they have
the same communicator context. Conversely, a send in one communicator can not be matched to a receive
in a duplicate communicator, made by MPI_Comm_dup.

Testing whether two communicators are really the same is then more than testing if they comprise the
same processes. The call MPI_Comm_compare returns MPI_IDENT if two communicator values are the same,
and not if one is derived from the other by duplication:


Code:
   // commcompare.c
   int result;
   MPI_Comm copy = comm;
   MPI_Comm_compare(comm,copy,&result);
   printf("assign: comm==copy: %d \n",
          result==MPI_IDENT);
   printf("        congruent: %d \n",
          result==MPI_CONGRUENT);
   printf("        not equal: %d \n",
          result==MPI_UNEQUAL);

   MPI_Comm_dup(comm,&copy);
   MPI_Comm_compare(comm,copy,&result);
   printf("duplicate: comm==copy: %d \n",
          result==MPI_IDENT);
   printf("        congruent: %d \n",
          result==MPI_CONGRUENT);
   printf("        not equal: %d \n",
          result==MPI_UNEQUAL);

Output:
   assign: comm==copy: 1
           congruent: 0
           not equal: 0
   duplicate: comm==copy: 0
           congruent: 1
           not equal: 0

Communicators that are not actually identical can
• consist of the same processes, in the same order, giving MPI_CONGRUENT;
• merely consist of the same processes, but not in the same order, giving MPI_SIMILAR;
• be entirely different, giving MPI_UNEQUAL.
Comparing against MPI_COMM_NULL is not allowed.
MPL note 58: Communicator comparing.
Code:
   const mpl::communicator &comm =
     mpl::environment::comm_world();
   MPI_Comm
     world_extract = comm.native_handle(),
     world_given = MPI_COMM_WORLD;
   int result;
   MPI_Comm_compare(world_extract,world_given,&result);
   cout << "Compare raw comms: " << "\n"
        << "identical: " << (result==MPI_IDENT) << "\n"
        << "congruent: " << (result==MPI_CONGRUENT) << "\n"
        << "unequal  : " << (result==MPI_UNEQUAL) << "\n";

Output:
   Compare raw comms:
   identical: true
   congruent: false
   unequal : false

7.2.2 Communicator duplication for library use


Duplicating a communicator may seem pointless, but it is actually very useful for the design of software
libraries. Imagine that you have a code


MPI_Isend(...); MPI_Irecv(...);
// library call
MPI_Waitall(...);

and suppose that the library has receive calls. Now it is possible that the receive in the library inadvertently
catches the message that was sent in the outer environment.
Let us consider an example. First of all, here is code where the library stores the communicator of the
calling program:
// commdupwrong.cxx
class library {
private:
MPI_Comm comm;
int procno,nprocs,other;
MPI_Request request[2];
public:
library(MPI_Comm incomm) {
comm = incomm;
MPI_Comm_rank(comm,&procno);
other = 1-procno;
};
int communication_start();
int communication_end();
};

This models a main program that does a simple message exchange, and it makes two calls to library
routines. Unbeknown to the user, the library also issues send and receive calls, and they turn out to
interfere.
Here:
• the main program does a send;
• the library call communication_start does a send and a receive; because the receive can match either
send, it is paired with the first one;
• the main program does a receive, which will be paired with the send of the library call;
• both the main program and the library do a wait call, and in both cases all requests are successfully
fulfilled, just not the way you intended.
To prevent this confusion, the library should duplicate the outer communicator with MPI_Comm_dup and
send all messages with respect to its duplicate. Now messages from the user code can never reach the
library software, since they are on different communicators.
// commdupright.cxx
class library {
private:
MPI_Comm comm;
int procno,nprocs,other;
MPI_Request request[2];
public:
library(MPI_Comm incomm) {
MPI_Comm_dup(incomm,&comm);
MPI_Comm_rank(comm,&procno);


other = 1-procno;
};
~library() {
MPI_Comm_free(&comm);
}
int communication_start();
int communication_end();
};

Note how the preceding example performs the MPI_Comm_free call in a C++ destructor.
## commdup.py
class Library():
def __init__(self,comm):
# wrong: self.comm = comm
self.comm = comm.Dup()
self.other = self.comm.Get_size()-self.comm.Get_rank()-1
self.requests = [ None ] * 2
def __del__(self):
if self.comm.Get_rank()==0: print(".. freeing communicator")
self.comm.Free()
def communication_start(self):
sendbuf = np.empty(1,dtype=int); sendbuf[0] = 37
recvbuf = np.empty(1,dtype=int)
self.requests[0] = self.comm.Isend( sendbuf, dest=other,tag=2 )
self.requests[1] = self.comm.Irecv( recvbuf, source=other )
def communication_end(self):
MPI.Request.Waitall(self.requests)

mylibrary = Library(comm)
my_requests[0] = comm.Isend( sendbuffer,dest=other,tag=1 )
mylibrary.communication_start()
my_requests[1] = comm.Irecv( recvbuffer,source=other )
MPI.Request.Waitall(my_requests,my_status)
mylibrary.communication_end()

7.3 Sub-communicators
In many scenarios you divide a large job over all the available processors. However, your job may have
two or more parts that can be considered as jobs by themselves. In that case it makes sense to divide your
processors into subgroups accordingly.
Suppose that you are running a simulation where inputs are generated, a computation is performed on
them, and the results of this computation are analyzed or rendered graphically. You could then consider
dividing your processors in three groups corresponding to generation, computation, rendering. As long
as you only do sends and receives, this division works fine. However, if one group of processes needs to
perform a collective operation, you don’t want the other groups involved in this. Thus, you really want
the three groups to be distinct from each other: you want them to be in separate communicators.


In order to make such subsets of processes, MPI has the mechanism of taking a subset of MPI_COMM_WORLD
(or other communicator) and turning that subset into a new communicator.
Now you understand why the MPI collective calls had an argument for the communicator. A collective
involves all processes of that communicator. If only the world communicator existed, no such argument
would be needed, but by making a communicator that contains a subset of all available processes, you
can do a collective on that subset.
The usage is as follows:
• You create a new communicator with routines such as MPI_Comm_dup (section 7.2), MPI_Comm_split
(section 7.4), MPI_Comm_create (section 7.5), MPI_Intercomm_create (section 7.6), MPI_Comm_spawn
(section 8.1);
• you use that communicator for a while;
• and you call MPI_Comm_free when you are done with it; this also sets the communicator variable
to MPI_COMM_NULL. A similar routine, MPI_Comm_disconnect waits for all pending communication
to finish. Both are collective.
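
In outline, that life cycle looks as follows (a minimal sketch; the color and key values are placeholders):

   MPI_Comm subcomm;
   MPI_Comm_split(MPI_COMM_WORLD,color,key,&subcomm);
   /* ... collectives and point-to-point calls on subcomm ... */
   MPI_Comm_free(&subcomm);  // subcomm is now MPI_COMM_NULL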

7.3.1 Scenario: distributed linear algebra


For scalability reasons (see HPC book, section-6.2.3), matrices should often be distributed in a 2D manner,
that is, each process receives a subblock that is not a block of full columns or rows. This means that the
processors themselves are, at least logically, organized in a 2D grid. Operations then involve reductions
or broadcasts inside rows or columns. For this, a row or column of processors needs to be in a subcom-
municator.

7.3.2 Scenario: climate model


A climate simulation code has several components, for instance corresponding to land, air, ocean, and
ice. You can imagine that each needs a different set of equations and algorithms to simulate. You can then
divide your processes, where each subset simulates one component of the climate, occasionally commu-
nicating with the other components.

7.3.3 Scenario: quicksort


The popular quicksort algorithm works by splitting the data into two subsets that each can be sorted
individually. If you want to sort in parallel, you could implement this by making two subcommunicators,
and sorting the data on these, creating recursively more subcommunicators.

7.3.4 Shared memory


There is an important application of communicator splitting in the context of one-sided communication,
grouping processes by whether they access the same shared memory area; see section 12.1.


Figure 7.3 MPI_Comm_split


Name Param name Explanation C type F type inout
MPI_Comm_split (
comm communicator MPI_Comm TYPE IN
(MPI_Comm)
color control of subset int INTEGER IN
assignment
key control of rank assignment int INTEGER IN
newcomm new communicator MPI_Comm* TYPE OUT
(MPI_Comm)
)

7.3.5 Process spawning


Finally, newly created communicators do not always need to be a subset of the initial MPI_COMM_WORLD. MPI can dynamically spawn new processes (see chapter 8), which start in an MPI_COMM_WORLD of their own. Additionally, another communicator will be created that spans the old and new worlds, so that you can communicate with the new processes.

7.4 Splitting a communicator


Above we saw several scenarios where it makes sense to divide MPI_COMM_WORLD into disjoint subcommu-
nicators. The command MPI_Comm_split (figure 7.3) uses a ‘color’ to define these subcommunicators: all
processes in the old communicator with the same color wind up in a new communicator together. The
old communicator still exists, so processes now have two different contexts in which to communicate.
The ranking of processes in the new communicator is determined by a ‘key’ value: in a subcommunicator
the process with lowest key is given the lowest rank, et cetera. Most of the time, there is no reason to
use a relative ranking that is different from the global ranking, so the MPI_Comm_rank value of the global
communicator is a good choice. Any ties between identical key values are broken by using the rank from
the original communicator. Thus, specifying zero as the key will also retain the original process ordering.

Figure 7.1: Row and column broadcasts in subcommunicators


Here is one example of communicator splitting. Suppose your processors are in a two-dimensional grid:
MPI_Comm_rank( MPI_COMM_WORLD, &mytid );
proc_i = mytid % proc_column_length;
proc_j = mytid / proc_column_length;

You can now create a communicator per column:


MPI_Comm column_comm;
MPI_Comm_split( MPI_COMM_WORLD, proc_j, mytid, &column_comm );

and do a broadcast in that column:


MPI_Bcast( data, /* stuff */, column_comm );

Because of the SPMD nature of the program, you are now doing in parallel a broadcast in every processor
column. Such operations often appear in dense linear algebra.
Exercise 7.1. Organize your processes in a grid, and make subcommunicators for the rows
and columns. For this compute the row and column number of each process.
In the row and column communicator, compute the rank. For instance, on a 2 × 3
processor grid you should find:
Global ranks: Ranks in row: Ranks in column:
0 1 2 0 1 2 0 0 0
3 4 5 0 1 2 1 1 1
Check that the rank in the row communicator is the column number, and the other
way around.
Run your code on different number of processes, for instance a number of rows and
columns that is a power of 2, or that is a prime number.
(There is a skeleton for this exercise under the name procgrid.)
Python note 26: Comm split key is optional. In Python, the ‘key’ argument is optional:
Code:
   ## commsplit.py
   mydata = procid

   # communicator modulo 2
   color = procid%2
   mod2comm = comm.Split(color)
   procid2 = mod2comm.Get_rank()

   # communicator modulo 4 recursively
   color = procid2 % 2
   mod4comm = mod2comm.Split(color)
   procid4 = mod4comm.Get_rank()

Output:
   Proc 0 -> 0 -> 0
   Proc 2 -> 1 -> 0
   Proc 6 -> 3 -> 1
   Proc 4 -> 2 -> 1
   Proc 3 -> 1 -> 0
   Proc 7 -> 3 -> 1
   Proc 1 -> 0 -> 0
   Proc 5 -> 2 -> 1

MPL note 59: Communicator splitting. In MPL, splitting a communicator is done as one of the overloads
of the communicator constructor;
// commsplit.cxx
// create sub communicator modulo 2
int color2 = procno % 2;


mpl::communicator comm2( mpl::communicator::split, comm_world, color2 );


auto procno2 = comm2.rank();

// create sub communicator modulo 4 recursively


int color4 = procno2 % 2;
mpl::communicator comm4( mpl::communicator::split, comm2, color4 );
auto procno4 = comm4.rank();

MPL implementation note: The communicator::split identifier is an object of class communicator::split_tag, itself an otherwise empty subclass of communicator:
class split_tag {};
static constexpr split_tag split{};

There is also a routine MPI_Comm_split_type which uses a type rather than a key to split the communicator.
We will see this in action in section 12.1.
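As a preview (a sketch only, not one of this chapter's example codes), splitting by shared-memory node looks like:

   MPI_Comm node_comm;
   MPI_Comm_split_type(MPI_COMM_WORLD,MPI_COMM_TYPE_SHARED,
                       /* key: */ 0,MPI_INFO_NULL,&node_comm);
   // processes on the same shared-memory node now share node_comm
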
As another example of communicator splitting, consider the recursive algorithm for matrix transposition.
Processors are organized in a square grid. The matrix is divided in 2 × 2 block form.
Exercise 7.2. Implement a recursive algorithm for matrix transposition:

• Swap blocks (1, 2) and (2, 1); then


• Divide the processors into four subcommunicators, and apply this algorithm
recursively on each;
• If the communicator has only one process, transpose the matrix in place.
(assume one element per process)

7.5 Communicators and groups


You saw in section 7.4 that it is possible to derive communicators that have a subset of the processes of
another communicator. There is a more general mechanism, using MPI_Group objects.
Using groups, it takes the following steps to create a new communicator:
1. Access the MPI_Group of a communicator object using MPI_Comm_group (figure 7.4).
2. Use various routines, discussed next, to form a new group.
Note: you would form that group even on the processes that will not become part of the new
communicator.


Figure 7.4 MPI_Comm_group


Name Param name Explanation C type F type inout
MPI_Comm_group (
comm communicator MPI_Comm TYPE IN
(MPI_Comm)
group group corresponding to MPI_Group* TYPE OUT
comm (MPI_Group)
)

Figure 7.5 MPI_Comm_create


Name Param name Explanation C type F type inout
MPI_Comm_create (
comm communicator MPI_Comm TYPE IN
(MPI_Comm)
group group, which is a subset MPI_Group TYPE IN
of the group of comm (MPI_Group)
newcomm new communicator MPI_Comm* TYPE OUT
(MPI_Comm)
)

3. Make a new communicator object from the group with using MPI_Comm_create (figure 7.5), col-
lective on the old communicator.
4. On the ranks that were not in the subgroup, the resulting communicator value will be
MPI_COMM_NULL.
There is also a routine MPI_Comm_create_group that only needs to be called on the group that constitutes
the new communicator.

7.5.1 Process groups


Groups are manipulated with MPI_Group_incl (figure 7.6), MPI_Group_excl (figure 7.7), MPI_Group_difference
and a few more.
MPI_Comm_group (comm, group)
MPI_Comm_create (MPI_Comm comm,MPI_Group group, MPI_Comm newcomm)

MPI_Group_union(group1, group2, newgroup)


MPI_Group_intersection(group1, group2, newgroup)
MPI_Group_difference(group1, group2, newgroup)

MPI_Group_size(group, size)
MPI_Group_rank(group, rank)

Certain MPI types, MPI_Win and MPI_File, are created on a communicator. While you can not directly
extract that communicator from the object, you can get the group with MPI_Win_get_group and
MPI_File_get_group.


Figure 7.6 MPI_Group_incl


Name Param name Explanation C type F type inout
MPI_Group_incl (
group group MPI_Group TYPE IN
(MPI_Group)
n number of elements in int INTEGER IN
array ranks (and size of
newgroup)
ranks ranks of processes in const int[] INTEGER(n) IN
group to appear in
newgroup
newgroup new group derived from MPI_Group* TYPE OUT
above, in the order (MPI_Group)
defined by ranks
)

Figure 7.7 MPI_Group_excl


Name Param name Explanation C type F type inout
MPI_Group_excl (
group group MPI_Group TYPE IN
(MPI_Group)
n number of elements in int INTEGER IN
array ranks
ranks array of integer ranks of const int[] INTEGER(n) IN
processes in group not to
appear in newgroup
newgroup new group derived from MPI_Group* TYPE OUT
above, preserving the (MPI_Group)
order defined by group
)

There is a pre-defined empty group MPI_GROUP_EMPTY, which can be used as an input to group construc-
tion routines, or appear as the result of such operations as a zero intersection. This is not the same as
MPI_GROUP_NULL, which is the output of invalid operations on groups, or the result of MPI_Group_free.

MPL note 60: Raw group handles. Should you need the MPI_Group object contained in an MPL group, there is an access function native_handle.

7.5.2 Examples
Suppose you want to split the world communicator into one manager process, with the remaining pro-
cesses workers.
// portapp.c
MPI_Comm comm_work;
{
MPI_Group world_group,work_group;
MPI_Comm_group( comm_world,&world_group );


int manager[] = {0};


MPI_Group_excl( world_group,1,manager,&work_group );
MPI_Comm_create( comm_world,work_group,&comm_work );
MPI_Group_free( &world_group ); MPI_Group_free( &work_group );
}

Exercise 7.3. Write a code that does a scaling study: your code needs to contain a loop over
increasingly sized subsets of MPI_COMM_WORLD.
for (int subsize=1; subsize<=worldsize; subsize++) {
MPI_Comm subcomm;
// form `subcomm' to be of size `subsize'
MPI_Allreduce( /* stuff */ subcomm );
}

Carefully address which processes do the various communicator and group calls; in
particular do MPI_Comm_free and MPI_Group_free on the right processes.

7.6 Intercommunicators
In several scenarios it may be desirable to have a way to communicate between communicators. For
instance, an application can have clearly functionally separated modules (preprocessor, simulation, post-
processor) that need to stream data pairwise. In another example, dynamically spawned processes (sec-
tion 8.1) get their own value of MPI_COMM_WORLD, but still need to communicate with the process(es) that
spawned them. In this section we will discuss the inter-communicator mechanism that serves such use
cases.
Communicating between disjoint communicators can of course be done by having a communicator that
overlaps them, but this would be complicated: since the ‘inter’ communication happens in the overlap
communicator, you have to translate its ordering into those of the two worker communicators. It would
be easier to express messages directly in terms of those communicators, and this is what happens in an
inter-communicator.
A call to MPI_Intercomm_create (figure 7.8) involves the following communicators:
• Two local communicators, which in this context are known as intra-communicators: one process
in each will act as the local leader, connected to the remote leader;
• The peer communicator, often MPI_COMM_WORLD, that contains the local communicators;
• An inter-communicator that allows the leaders of the subcommunicators to communicate with
the other subcommunicator.
Even though the intercommunicator connects only two processes, it is collective on the peer communicator.

7.6.1 Intercommunicator point-to-point


The local leaders can now communicate with each other.
• The sender specifies as target the local number of the other leader in the other
sub-communicator;


Figure 7.2: Illustration of ranks in an intercommunicator setup

Figure 7.8 MPI_Intercomm_create


Name Param name Explanation C type F type inout
MPI_Intercomm_create (
local_comm local intra-communicator MPI_Comm TYPE IN
(MPI_Comm)
local_leader rank of local group leader int INTEGER IN
in local_comm
peer_comm ``peer'' communicator; MPI_Comm TYPE IN
significant only at the (MPI_Comm)
local_leader
remote_leader rank of remote group int INTEGER IN
leader in peer_comm;
significant only at the
local_leader
tag tag int INTEGER IN
newintercomm new inter-communicator MPI_Comm* TYPE OUT
(MPI_Comm)
)

• Likewise, the receiver specifies as source the local number of the sender in its
sub-communicator.
In one way, this design makes sense: processors are referred to in their natural, local, numbering. On the
other hand, it means that each group needs to know how the local ordering of the other group is arranged.
Using a complicated key value makes this difficult.
if (i_am_local_leader) {
if (color==0) {
interdata = 1.2;
int inter_target = local_number_of_other_leader;
printf("[%d] sending interdata %e to %d\n",
procno,interdata,inter_target);
MPI_Send(&interdata,1,MPI_DOUBLE,inter_target,0,intercomm);


} else {
MPI_Status status;
MPI_Recv(&interdata,1,MPI_DOUBLE,MPI_ANY_SOURCE,MPI_ANY_TAG,intercomm,&status);
int inter_source = status.MPI_SOURCE;
printf("[%d] received interdata %e from %d\n",
procno,interdata,inter_source);
if (inter_source!=local_number_of_other_leader)
fprintf(stderr,
"Got inter communication from unexpected %d; s/b %d\n",
inter_source,local_number_of_other_leader);
}
}

7.6.2 Intercommunicator collectives


The intercommunicator can be used in collectives such as a broadcast.
• In the sending group, the root process passes MPI_ROOT as ‘root’ value; all others use
MPI_PROC_NULL.
• In the receiving group, all processes use a ‘root’ value that is the rank of the root process in the
root group. Note: this is not the global rank!
Gather and scatter behave similarly; the allgather is different: all send buffers of group A are concatenated in rank order, and placed on all processes of group B.
Intercommunicators can be used if two groups of processes work asynchronously with respect to each other; another application is fault tolerance (section 15.5).
if (color==0) { // sending group: the local leader sends
if (i_am_local_leader)
root = MPI_ROOT;
else
root = MPI_PROC_NULL;
} else { // receiving group: everyone indicates leader of other group
root = local_number_of_other_leader;
}
if (DEBUG) fprintf(stderr,"[%d] using root value %d\n",procno,root);
MPI_Bcast(&bcast_data,1,MPI_INT,root,intercomm);

7.6.3 Intercommunicator querying


Some of the operations you have seen before for intra-communicators behave differently with intercommunicators:
• MPI_Comm_size returns the size of the local group, not the size of the intercommunicator.
• MPI_Comm_rank returns the rank in the local group.
• MPI_Comm_group returns the local group.
Spawned processes can find their parent communicator with MPI_Comm_get_parent (figure 7.9) (see exam-
ples in section 8.1). On other processes this returns MPI_COMM_NULL.
Test whether a communicator is intra or inter: MPI_Comm_test_inter (figure 7.10).
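For instance (a small sketch):

   int is_inter;
   MPI_Comm_test_inter(comm,&is_inter);
   if (is_inter)
     printf("this is an inter-communicator\n");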


Figure 7.9 MPI_Comm_get_parent


Name Param name Explanation C type F type inout
MPI_Comm_get_parent (
parent the parent communicator MPI_Comm* TYPE OUT
(MPI_Comm)
)

Figure 7.10 MPI_Comm_test_inter


Name Param name Explanation C type F type inout
MPI_Comm_test_inter (
comm communicator MPI_Comm TYPE IN
(MPI_Comm)
flag true if comm is an int* LOGICAL OUT
inter-communicator
)

MPI_Comm_compare works for intercommunicators.


Processes connected through an intercommunicator can query the size of the ‘other’ communicator with
MPI_Comm_remote_size (figure 7.11). The actual group can be obtained with MPI_Comm_remote_group (fig-
ure 7.12).
Virtual topologies (chapter 11) cannot be created with an intercommunicator. To set up virtual topologies,
first transform the intercommunicator to an intracommunicator with the function MPI_Intercomm_merge
(figure 7.13).
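A sketch of the merge, assuming intercomm and color come from the setup earlier in this section:

   MPI_Comm intracomm;
   int high = (color==1);  // one group passes 1, the other 0
   MPI_Intercomm_merge(intercomm,high,&intracomm);
   // ranks of the 'high' group come after those of the 'low' group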


Figure 7.11 MPI_Comm_remote_size


Name Param name Explanation C type F type inout
MPI_Comm_remote_size (
comm inter-communicator MPI_Comm TYPE IN
(MPI_Comm)
size number of processes in the int* INTEGER OUT
remote group of comm
)
Python:

Intercomm.Get_remote_size(self)

Figure 7.12 MPI_Comm_remote_group


Name Param name Explanation C type F type inout
MPI_Comm_remote_group (
comm inter-communicator MPI_Comm TYPE IN
(MPI_Comm)
group remote group corresponding MPI_Group* TYPE OUT
to comm (MPI_Group)
)
Python:

Intercomm.Get_remote_group(self)

7.7 Review questions


For all true/false questions, if you answer that a statement is false, give a one-line explanation.
1. True or false: in each communicator, processes are numbered consecutively from zero.
2. If a process is in two communicators, it has the same rank in both.
3. Any communicator that is not MPI_COMM_WORLD is a strict subset of it.
4. The subcommunicators derived by MPI_Comm_split are disjoint.
5. If two processes have ranks 𝑝 < 𝑞 in some communicator, and they are in the same subcommu-
nicator, then their ranks 𝑝 ′ , 𝑞 ′ in the subcommunicator also obey 𝑝 ′ < 𝑞 ′ .


Figure 7.13 MPI_Intercomm_merge


Name Param name Explanation C type F type inout
MPI_Intercomm_merge (
intercomm inter-communicator MPI_Comm TYPE IN
(MPI_Comm)
high ordering of the local and int LOGICAL IN
remote groups in the new
intra-communicator
newintracomm new intra-communicator MPI_Comm* TYPE OUT
(MPI_Comm)
)

7.8 Full source code of examples


Please see the repository: https://github.com/VictorEijkhout/TheArtOfHPC_vol2_
parallelprogramming/tree/main. Examples are sorted first by programming system, then by
language.



Chapter 8

MPI topic: Process management

In this course we have up to now only considered the SPMD model of running MPI programs. In some
rare cases you may want to run in an MPMD mode, rather than SPMD. This can be achieved either on the
OS level, using options of the mpiexec mechanism, or you can use MPI’s built-in process management.
Read on if you’re interested in the latter.

8.1 Process spawning


The first version of MPI did not contain any process management routines, even though the earlier PVM
project did have that functionality. Process management was later added with MPI-2.
Unlike what you might think, newly added processes do not become part of MPI_COMM_WORLD; rather, they
get their own communicator, and an inter-communicator (section 7.6) is established between this new
group and the existing one. The first routine is MPI_Comm_spawn (figure 8.1), which tries to fire up multiple
copies of a single named executable. Errors in starting up these codes are returned in an array of integers,
or if you’re feeling sure of yourself, specify MPI_ERRCODES_IGNORE.
It is not immediately clear whether there is opportunity for spawning new executables; after all,
MPI_COMM_WORLD contains all your available processors. You can probably tell your job starter to
reserve space for a few extra processes, but that is installation-dependent (see below). However,
there is a standard mechanism for querying whether such space has been reserved. The attribute
MPI_UNIVERSE_SIZE, retrieved with MPI_Comm_get_attr (section 15.1.2), will tell you the total number of hosts available.
If this option is not supported, you can determine yourself how many processes you want to spawn.
However, if you exceed the hardware resources, your multi-tasking operating system (which is some
variant of Unix for almost everyone) will use time-slicing to start the spawned processes, but you will not
gain any performance.

8.1.1 Commandline arguments


The argv argument contains the commandline arguments passed to the spawned process.
• This array needs to be null-terminated, so that its length can be determined.


Figure 8.1 MPI_Comm_spawn


Name Param name Explanation C type F type inout
MPI_Comm_spawn (
command name of program to be const char* CHARACTER IN
spawned
argv arguments to command char*[] CHARACTER(*) IN
maxprocs maximum number of int INTEGER IN
processes to start
info a set of key-value pairs MPI_Info TYPE IN
telling the runtime system (MPI_Info)
where and how to start the
processes
root rank of process in which int INTEGER IN
previous arguments are
examined
comm intra-communicator MPI_Comm TYPE IN
containing group of (MPI_Comm)
spawning processes
intercomm inter-communicator between MPI_Comm* TYPE OUT
original group and the (MPI_Comm)
newly spawned group
array_of_errcodes one code per process int[] INTEGER(*) OUT
)
Python:

MPI.Intracomm.Spawn(self,
command, args=None, int maxprocs=1, Info info=INFO_NULL,
int root=0, errcodes=None)
returns an intracommunicator

• If the spawned process takes no commandline arguments, a value of MPI_ARGV_NULL can be used,
in both C and Fortran. In C this is the same as NULL.
• Unlike the argv argument of a main program, the argv argument passed in the spawn call does
not contain the name of the executable.
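For illustration, spawning a hypothetical executable with two commandline arguments could look like this (a sketch; the executable name, the flags, and the variables nworkers and intercomm are made up or assumed to be declared elsewhere):

   char *worker_args[3] = { "-n", "100", NULL };  // null-terminated; no executable name
   MPI_Comm_spawn("./worker",worker_args,nworkers,MPI_INFO_NULL,
                  0,MPI_COMM_WORLD,&intercomm,MPI_ERRCODES_IGNORE);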

8.1.2 Example: work manager


Here is an example of a work manager. First we query how much space we have for new processes, using
the flag to see if this option is supported:
int universe_size, *universe_size_attr,uflag;
MPI_Comm_get_attr
(comm_world,MPI_UNIVERSE_SIZE,
&universe_size_attr,&uflag);
if (uflag) {
universe_size = *universe_size_attr;
} else {
printf("This MPI does not support UNIVERSE_SIZE.\nUsing world size");
universe_size = world_n;


}
int work_n = universe_size - world_n;
if (world_p==0) {
printf("A universe of size %d leaves room for %d workers\n",
universe_size,work_n);
printf(".. spawning from %s\n",procname);
}

(See section 15.1.2 for that dereference behavior.)


Then we actually spawn the processes:
const char *workerprogram = "./spawnapp";
MPI_Comm_spawn(workerprogram,MPI_ARGV_NULL,
work_n,MPI_INFO_NULL,
0,comm_world,&comm_inter,NULL);

## spawnmanager.py
try :
universe_size = comm.Get_attr(MPI.UNIVERSE_SIZE)
if universe_size is None:
print("Universe query returned None")
universe_size = nprocs + 4
else:
print("World has {} ranks in a universe of {}"\
.format(nprocs,universe_size))
except :
print("Exception querying universe size")
universe_size = nprocs + 4
nworkers = universe_size - nprocs

itercomm = comm.Spawn("./spawn_worker.py", maxprocs=nworkers)

A process can detect whether it was a spawning or a spawned process by using MPI_Comm_get_parent: the
resulting intercommunicator is MPI_COMM_NULL on the parent processes.
// spawnapp.c
MPI_Comm comm_parent;
MPI_Comm_get_parent(&comm_parent);
int is_child = (comm_parent!=MPI_COMM_NULL);
if (is_child) {
int nworkers,workerno;
MPI_Comm_size(MPI_COMM_WORLD,&nworkers);
MPI_Comm_rank(MPI_COMM_WORLD,&workerno);
printf("I detect I am worker %d/%d running on %s\n",
workerno,nworkers,procname);

The spawned program looks very much like a regular MPI program, with its own initialization and finalize
calls.
// spawnworker.c
MPI_Comm_size(MPI_COMM_WORLD,&nworkers);
MPI_Comm_rank(MPI_COMM_WORLD,&workerno);
MPI_Comm_get_parent(&parent);


## spawnworker.py
parentcomm = comm.Get_parent()
nparents = parentcomm.Get_remote_size()

Spawned processes wind up with a value of MPI_COMM_WORLD of their own, but managers and workers can
find each other regardless. The spawn routine returns the intercommunicator to the parent; the children
can find it through MPI_Comm_get_parent (section 7.6.3). The number of spawning processes can be found
through MPI_Comm_remote_size on the parent communicator.
Running spawnapp with usize=12, wsize=4
%%
%% manager output
%%
A universe of size 12 leaves room for 8 workers
.. spawning from c209-026.frontera.tacc.utexas.edu
%%
%% worker output
%%
Worker deduces 8 workers and 4 parents
I detect I am worker 0/8 running on c209-027.frontera.tacc.utexas.edu
I detect I am worker 1/8 running on c209-027.frontera.tacc.utexas.edu
I detect I am worker 2/8 running on c209-027.frontera.tacc.utexas.edu
I detect I am worker 3/8 running on c209-027.frontera.tacc.utexas.edu
I detect I am worker 4/8 running on c209-028.frontera.tacc.utexas.edu
I detect I am worker 5/8 running on c209-028.frontera.tacc.utexas.edu
I detect I am worker 6/8 running on c209-028.frontera.tacc.utexas.edu
I detect I am worker 7/8 running on c209-028.frontera.tacc.utexas.edu

8.1.3 MPI startup with universe


You could start up a single copy of this program with
mpiexec -n 1 spawnmanager
but with a hostfile that has more than one host.
TACC note. Intel MPI requires you to pass an option -usize to mpiexec indicating the size of the
comm universe. With the TACC jobs starter ibrun do the following:
export FI_MLX_ENABLE_SPAWN=yes
# specific
MY_MPIRUN_OPTIONS="-usize 8" ibrun -np 4 spawnmanager
# more generic
MY_MPIRUN_OPTIONS="-usize ${SLURM_NPROCS}" ibrun -np 4 spawnmanager
# using mpiexec:
mpiexec -np 2 -usize ${SLURM_NPROCS} spawnmanager


Figure 8.2 MPI_Open_port


Name Param name Explanation C type F type inout
MPI_Open_port (
info implementation-specific MPI_Info TYPE IN
information on how to (MPI_Info)
establish an address
port_name newly established port char* CHARACTER OUT
)

8.1.4 MPMD
Instead of spawning a single executable, you can spawn multiple with MPI_Comm_spawn_multiple. In that
case a process can retrieve with the attribute MPI_APPNUM which of the executables it is; section 15.1.2.
Commandline arguments are handled similarly to MPI_Comm_spawn (section 8.1.1), except that there is now an array of arrays of strings. If none of the executables take commandline arguments, the value MPI_ARGVS_NULL can be passed. If only certain executables take no arguments, for them an array of length 1 needs to be passed, containing only the null-terminator.
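A sketch of such a call, with made-up executable names and process counts, and ignoring error codes:

   char *commands[2]  = { "./manager", "./worker" };
   int   maxprocs[2]  = { 1, 4 };
   MPI_Info infos[2]  = { MPI_INFO_NULL, MPI_INFO_NULL };
   MPI_Comm intercomm;
   MPI_Comm_spawn_multiple
     (2,commands,MPI_ARGVS_NULL,maxprocs,infos,
      0,MPI_COMM_WORLD,&intercomm,MPI_ERRCODES_IGNORE);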

8.2 Socket-style communications


It is possible to establish connections with running MPI programs that have their own world communi-
cator.
• The server process establishes a port with MPI_Open_port, and calls MPI_Comm_accept to accept
connections to its port.
• The client process specifies that port in an MPI_Comm_connect call. This establishes the connection.

8.2.1 Server calls


The server calls MPI_Open_port (figure 8.2), yielding a port name. Port names are generated by the system
and copied into a character buffer of length at most MPI_MAX_PORT_NAME.
The server then needs to call MPI_Comm_accept (figure 8.3) prior to the client doing a connect call. This
is collective over the calling communicator. It returns an intercommunicator (section 7.6) that allows
communication with the client.
MPI_Comm intercomm;
char myport[MPI_MAX_PORT_NAME];
MPI_Open_port( MPI_INFO_NULL,myport );
int portlen = strlen(myport);
MPI_Send( myport,portlen+1,MPI_CHAR,1,0,comm_world );
printf("Host sent port <<%s>>\n",myport);
MPI_Comm_accept( myport,MPI_INFO_NULL,0,comm_self,&intercomm );
printf("host accepted connection\n");

The port can be closed with MPI_Close_port.


Figure 8.3 MPI_Comm_accept


Name Param name Explanation C type F type inout
MPI_Comm_accept (
port_name port name const char* CHARACTER IN
info implementation-dependent MPI_Info TYPE IN
information (MPI_Info)
root rank in comm of root node int INTEGER IN
comm intra-communicator over MPI_Comm TYPE IN
which call is collective (MPI_Comm)
newcomm inter-communicator with MPI_Comm* TYPE OUT
client as remote group (MPI_Comm)
)

Figure 8.4 MPI_Comm_connect


Name Param name Explanation C type F type inout
MPI_Comm_connect (
port_name network address const char* CHARACTER IN
info implementation-dependent MPI_Info TYPE IN
information (MPI_Info)
root rank in comm of root node int INTEGER IN
comm intra-communicator over MPI_Comm TYPE IN
which call is collective (MPI_Comm)
newcomm inter-communicator with MPI_Comm* TYPE OUT
server as remote group (MPI_Comm)
)

8.2.2 Client calls


After the server has generated a port name, the client needs to connect to it with MPI_Comm_connect (fig-
ure 8.4), again specifying the port through a character buffer. The connect call is collective over its com-
municator.
char myport[MPI_MAX_PORT_NAME];
if (work_p==0) {
MPI_Recv( myport,MPI_MAX_PORT_NAME,MPI_CHAR,
MPI_ANY_SOURCE,0, comm_world,MPI_STATUS_IGNORE );
printf("Worker received port <<%s>>\n",myport);
}
MPI_Bcast( myport,MPI_MAX_PORT_NAME,MPI_CHAR,0,comm_work );

/*
* The workers collective connect over the inter communicator
*/
MPI_Comm intercomm;
MPI_Comm_connect( myport,MPI_INFO_NULL,0,comm_work,&intercomm );
if (work_p==0) {
int manage_n;
MPI_Comm_remote_size(intercomm,&manage_n);
printf("%d workers connected to %d managers\n",work_n,manage_n);


Figure 8.5 MPI_Publish_name


Name Param name Explanation C type F type inout
MPI_Publish_name (
service_name a service name to const char* CHARACTER IN
associate with the port
info implementation-specific MPI_Info TYPE IN
information (MPI_Info)
port_name a port name const char* CHARACTER IN
)

If the named port does not exist (or has been closed), MPI_Comm_connect raises an error of class MPI_ERR_PORT.
The client can sever the connection with MPI_Comm_disconnect.
Running the above code on 5 processes gives:
# exchange port name:
Host sent port <<tag#0$OFA#000010e1:0001cde9:0001cdee$rdma_port#1024$rdma_host#10:16:225:0:1:205:199:25
Worker received port <<tag#0$OFA#000010e1:0001cde9:0001cdee$rdma_port#1024$rdma_host#10:16:225:0:1:205:

# Comm accept/connect
host accepted connection
4 workers connected to 1 managers

# Send/recv over the intercommunicator


Manager sent 4 items over intercomm
Worker zero received data

8.2.3 Published service names


More elegantly than the port mechanism above, it is possible to publish a named service, with
MPI_Publish_name (figure 8.5), which can then be discovered by other processes.
// publishapp.c
MPI_Comm intercomm;
char myport[MPI_MAX_PORT_NAME];
MPI_Open_port( MPI_INFO_NULL,myport );
MPI_Publish_name( service_name, MPI_INFO_NULL, myport );
MPI_Comm_accept( myport,MPI_INFO_NULL,0,comm_self,&intercomm );

Worker processes look up the service and connect, obtaining an intercommunicator, by


char myport[MPI_MAX_PORT_NAME];
MPI_Lookup_name( service_name,MPI_INFO_NULL,myport );
MPI_Comm intercomm;
MPI_Comm_connect( myport,MPI_INFO_NULL,0,comm_work,&intercomm );

For this it is necessary to have a name server running.


Figure 8.6 MPI_Unpublish_name


Name Param name Explanation C type F type inout
MPI_Unpublish_name (
service_name a service name const char* CHARACTER IN
info implementation-specific MPI_Info TYPE IN
information (MPI_Info)
port_name a port name const char* CHARACTER IN
)

Figure 8.7 MPI_Comm_join


Name Param name Explanation C type F type inout
MPI_Comm_join (
fd socket file descriptor int INTEGER IN
intercomm new inter-communicator MPI_Comm* TYPE OUT
(MPI_Comm)
)

Intel note. Start the hydra name server and use the corresponding mpi starter:
hydra_nameserver &
MPIEXEC=mpiexec.hydra
There is an environment variable, but that doesn’t seem to be needed.
export I_MPI_HYDRA_NAMESERVER=`hostname`:8008
It is also possible to specify the name server as an argument to the job starter.
At the end of a run, the service should be unpublished with MPI_Unpublish_name (figure 8.6). Unpublishing
a nonexisting or already unpublished service gives an error code of MPI_ERR_SERVICE.
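Continuing the publishapp example above, the teardown on the server side could look like this sketch:

   MPI_Unpublish_name( service_name, MPI_INFO_NULL, myport );
   MPI_Close_port( myport );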
MPI provides no guarantee of fairness in servicing connection attempts. That is, connection attempts are
not necessarily satisfied in the order in which they were initiated, and competition from other connection
attempts may prevent a particular connection attempt from being satisfied.

8.2.4 Unix sockets


It is also possible to create an intercommunicator from a Unix socket with MPI_Comm_join (figure 8.7).
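In a sketch (the file descriptor fd is assumed to come from ordinary socket code; both processes call this on their end of the connection):

   MPI_Comm sockcomm;
   MPI_Comm_join(fd,&sockcomm);
   // sockcomm is an inter-communicator connecting exactly these two processes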

8.3 Sessions
The most common way of initializing MPI, with MPI_Init (or MPI_Init_thread) and MPI_Finalize, is known
as the world model which can be described as:
1. There is a single call to MPI_Init or MPI_Init_thread;
2. There is a single call to MPI_Finalize;
3. With very few exceptions, all MPI calls appear in between the initialize and finalize calls.


Figure 8.8 MPI_Session_init


Name Param name Explanation C type F type inout
MPI_Session_init (
info info object to specify MPI_Info TYPE IN
thread support level (MPI_Info)
and MPI implementation
specific resources
errhandler error handler to invoke MPI_Errhandler TYPE IN
in the event that an error (MPI_Errhandler)
is encountered during this
function call
session new session MPI_Session* TYPE OUT
(MPI_Session)
)

This model suffers from some disadvantages:


1. There is no error handling during MPI_Init.
2. MPI can not be finalized and restarted;
3. If multiple libraries are active, they can not initialize or finalize MPI, but have to base themselves
on subcommunicators; section 7.2.2.
4. There is no threadsafe way of initializing MPI: a library can’t safely do
MPI_Initialized(&flag);
if (!flag) MPI_Init(0,0);

if it is running in a multi-threaded environment.


The following material is for the recently released MPI-4 standard and may not be supported yet.
In addition to the world model, where all MPI is bracketed by MPI_Init (or MPI_Init_thread) and MPI_Finalize,
there is the session model, where entities such as libraries can start/end their MPI session independently.
The two models can be used in the same program, but there are limitations on how they can mix.

8.3.1 Short description of the session model


In the session model, each session starts and finalizes MPI independently, giving each a separate
MPI_COMM_WORLD. The world model then becomes a separate way of starting MPI. You can create a
communicator using the world model in addition to starting multiple sessions, each on their own set of
processes, possibly identical or overlapping. You can also create sessions without having an MPI_COMM_WORLD
created by the world model.
You can not mix in a single call objects from different sessions, from a session and from the world model,
or from a session and from MPI_Comm_get_parent or MPI_Comm_join.

8.3.2 Session creation


An MPI session is initialized and finalized with MPI_Session_init (figure 8.8) and MPI_Session_finalize,
somewhat similar to MPI_Init and MPI_Finalize.


MPI_Session the_session;
MPI_Session_init
( session_request_info,MPI_ERRORS_ARE_FATAL,
&the_session );
MPI_Session_finalize( &the_session );

This call is thread-safe, in view of the above reasoning.

8.3.2.1 Session info

The MPI_Info object that is passed to MPI_Session_init can be null, or it can be used to request a threading
level:
// session.c
MPI_Info session_request_info = MPI_INFO_NULL;
MPI_Info_create(&session_request_info);
char thread_key[] = "mpi_thread_support_level";
MPI_Info_set(session_request_info,
thread_key,"MPI_THREAD_MULTIPLE");

Other info keys can be implementation-dependent, but the key thread_support is pre-defined.

Info keys can be retrieved again with MPI_Session_get_info:


MPI_Info session_actual_info;
MPI_Session_get_info( the_session,&session_actual_info );
char thread_level[100]; int info_len = 100, flag;
MPI_Info_get_string( session_actual_info,
thread_key,&info_len,thread_level,&flag );

8.3.2.2 Session error handler

The error handler argument accepts a pre-defined error handler (section 15.2.2) or one created by
MPI_Session_create_errhandler.
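A sketch of installing a custom handler (the handler body and names are illustrative, not taken from the example codes):

   void session_error( MPI_Session *session,int *code,... ) {
     char msg[MPI_MAX_ERROR_STRING]; int len;
     MPI_Error_string(*code,msg,&len);
     fprintf(stderr,"session error: %s\n",msg);
   }
   /* ... */
   MPI_Errhandler session_handler;
   MPI_Session_create_errhandler( session_error,&session_handler );
   MPI_Session_init( session_request_info,session_handler,&the_session );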

8.3.3 Process sets and communicators

A session has a number of process sets. Process sets are indicated with a Uniform Resource Identifier (URI),
where the URIs mpi://WORLD and mpi://SELF are always defined.

You query the ‘psets’ with MPI_Session_get_num_psets and MPI_Session_get_nth_pset:


Code:
   int npsets;
   MPI_Session_get_num_psets
     ( the_session,MPI_INFO_NULL,&npsets );
   if (mainproc)
     printf("Number of process sets: %d\n",npsets);
   for (int ipset=0; ipset<npsets; ipset++) {
     int len_pset; char name_pset[MPI_MAX_PSET_NAME_LEN];
     MPI_Session_get_nth_pset
       ( the_session,MPI_INFO_NULL,
         ipset,&len_pset,name_pset );
     if (mainproc)
       printf("Process set %2d: <<%s>>\n",
              ipset,name_pset);

Output:
   mpiexec -n 2 ./session
   Could not obtain thread level, flag=0
   Number of process sets: 2
   Process set 0: <<mpi://WORLD>>
   Process set 1: <<mpi://SELF>>
   Found WORLD as pset 0
   World has 2 processes

The following partial code creates a communicator equivalent to MPI_COMM_WORLD in the session model:
MPI_Group world_group = MPI_GROUP_NULL;
MPI_Comm world_comm = MPI_COMM_NULL;
MPI_Group_from_session_pset
( the_session,world_name,&world_group );
MPI_Comm_create_from_group
( world_group,"victor-code-session.c",
MPI_INFO_NULL,MPI_ERRORS_ARE_FATAL,
&world_comm );
MPI_Group_free( &world_group );
int procid = -1, nprocs = 0;
MPI_Comm_size(world_comm,&nprocs);
MPI_Comm_rank(world_comm,&procid);

However, comparing communicators (with MPI_Comm_compare) from the session and world model, or from
different sessions, is undefined behavior.
Get the info object (section 15.1.1) from a process set: MPI_Session_get_pset_info. This info object always
has the key mpi_size.
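For instance (a small sketch; the buffer size is arbitrary):

   MPI_Info pset_info; char psize[16]; int plen=16, flag;
   MPI_Session_get_pset_info( the_session,"mpi://WORLD",&pset_info );
   MPI_Info_get_string( pset_info,"mpi_size",&plen,psize,&flag );
   // psize now contains the number of processes in this pset, as a string
   MPI_Info_free( &pset_info );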

8.3.4 Example
As an example of the use of sessions, we declare a library class, where each library object starts and ends
its own session:
// sessionlib.cxx
class Library {
private:
MPI_Comm world_comm; MPI_Session session;
public:
Library() {
MPI_Info info = MPI_INFO_NULL;
MPI_Session_init
( MPI_INFO_NULL,MPI_ERRORS_ARE_FATAL,&session );
char world_name[] = "mpi://WORLD";
MPI_Group world_group;


MPI_Group_from_session_pset
( session,world_name,&world_group );
MPI_Comm_create_from_group
( world_group,"world-session",
MPI_INFO_NULL,MPI_ERRORS_ARE_FATAL,
&world_comm );
MPI_Group_free( &world_group );
};
~Library() { MPI_Session_finalize(&session); };

Now we create a main program, using the world model, which activates two libraries, passing data to
them by parameter:
int main(int argc,char **argv) {

Library lib1,lib2;
MPI_Init(0,0);
MPI_Comm world = MPI_COMM_WORLD;
int procno,nprocs;
MPI_Comm_rank(world,&procno);
MPI_Comm_size(world,&nprocs);
auto sum1 = lib1.compute(procno);
auto sum2 = lib2.compute(procno+1);

Note that no MPI calls go between the main program and either of the libraries, or between the two
libraries, but that seems to make sense in this scenario.
End of MPI-4 material

8.4 Functionality available outside init/finalize


MPI_Initialized, MPI_Finalized, MPI_Get_version, MPI_Get_library_version, MPI_Info_create,
MPI_Info_create_env, MPI_Info_set, MPI_Info_delete, MPI_Info_get, MPI_Info_get_valuelen,
MPI_Info_get_nkeys, MPI_Info_get_nthkey, MPI_Info_dup, MPI_Info_free, MPI_Info_f2c, MPI_Info_c2f,
MPI_Session_create_errhandler, MPI_Session_call_errhandler, MPI_Errhandler_free, MPI_Errhandler_f2c,
MPI_Errhandler_c2f, MPI_Error_string, MPI_Error_class.

Also all routines starting with MPI_Txxx.


8.5 Full source code of examples


Please see the repository: https://github.com/VictorEijkhout/TheArtOfHPC_vol2_
parallelprogramming/tree/main. Examples are sorted first by programming system, then by
language.



Chapter 9

MPI topic: One-sided communication

Above, you saw point-to-point operations of the two-sided type: they require the co-operation of a sender
and receiver. This co-operation could be loose: you can post a receive with MPI_ANY_SOURCE as sender, but
there had to be both a send and receive call. This two-sidedness can be limiting. Consider code where the
receiving process is a dynamic function of the data:
x = f();
p = hash(x);
MPI_Send( x, /* to: */ p );

The problem is now: how does p know to post a receive, and how does everyone else know not to?
In this section, you will see one-sided communication routines where a process can do a ‘put’ or ‘get’ op-
eration, writing data to or reading it from another processor, without that other processor’s involvement.
In one-sided MPI operations, known as Remote Memory Access (RMA) operations in the standard, or
as Remote Direct Memory Access (RDMA) in other literature, there are still two processes involved: the
origin, which is the process that originates the transfer, whether this is a ‘put’ or a ‘get’, and the target
whose memory is being accessed. Unlike with two-sided operations, the target does not perform an action
that is the counterpart of the action on the origin.
That does not mean that the origin can access arbitrary data on the target at arbitrary times. First of all,
one-sided communication in MPI is limited to accessing only a specifically declared memory area on the
target: the target declares an area of memory that is accessible to other processes. This is known as a
window. Windows limit how origin processes can access the target’s memory: you can only ‘get’ data
from a window or ‘put’ it into a window; all the other memory is not reachable from other processes. On
the origin there is no such limitation; any data can function as the source of a ‘put’ or the recipient of a
‘get’ operation.
The alternative to having windows is to use distributed shared memory or virtual shared memory: memory
is distributed but acts as if it were shared. The so-called Partitioned Global Address Space (PGAS) languages
such as Unified Parallel C (UPC) use this model.
Within one-sided communication, MPI has two modes: active RMA and passive RMA. In active RMA, or
active target synchronization, the target sets boundaries on the time period (the ‘epoch’) during which its
window can be accessed. The main advantage of this mode is that the origin program can perform many
small transfers, which are aggregated behind the scenes. This would be appropriate for applications that
are structured in a Bulk Synchronous Parallel (BSP) mode with supersteps. Active RMA acts much like
asynchronous transfer with a concluding MPI_Waitall.
In passive RMA, or passive target synchronization, the target process puts no limitation on when its window
can be accessed. (PGAS languages such as UPC are based on this model: data is simply read or written at
will.) While intuitively it is attractive to be able to write to and read from a target at arbitrary time, there
are problems. For instance, it requires a remote agent on the target, which may interfere with execution
of the main thread, or conversely it may not be activated at the optimal time. Passive RMA is also very
hard to debug and can lead to race conditions.

9.1 Windows

Figure 9.1: Collective definition of a window for one-sided data access

In one-sided communication, each processor can make an area of memory, called a window, available to
one-sided transfers. This is stored in a variable of type MPI_Win. A process can put an arbitrary item from
its own memory (not limited to any window) to the window of another process, or get something from
the other process’ window in its own memory.
A window can be characterized as follows:
• The window is defined on a communicator, so the create call is collective; see figure 9.1.
• The window size can be set individually on each process. A zero size is allowed, but since win-
dow creation is collective, it is not possible to skip the create call.
• You can set a ‘displacement unit’ for the window: this is a number of bytes that will be used as
the indexing unit. For example if you use sizeof(double) as the displacement unit, an MPI_Put
to location 8 will go to the 8th double. That’s easier than having to specify the 64th byte.
• The window is the target of data in a put operation, or the source of data in a get operation; see
figure 9.2.
• There can be memory associated with a window, so it needs to be freed explicitly with
MPI_Win_free.

The typical calls involved are:


Figure 9.1 MPI_Win_create

int MPI_Win_create(
    void *base,       /* IN:  initial address of window */
    MPI_Aint size,    /* IN:  size of window in bytes */
    int disp_unit,    /* IN:  local unit size for displacements, in bytes */
    MPI_Info info,    /* IN:  info argument */
    MPI_Comm comm,    /* IN:  intra-communicator */
    MPI_Win *win)     /* OUT: window object */
In the large-count variant MPI_Win_create_c the disp_unit argument has type MPI_Aint.

Python:

MPI.Win.Create
(memory, int disp_unit=1,
 Info info=INFO_NULL, Intracomm comm=COMM_SELF)

MPI_Info info = MPI_INFO_NULL;
MPI_Win window;
double *memory;
MPI_Win_allocate( /* size info */, info, comm, &memory, &window );
// do put and get calls
MPI_Win_free( &window );

Figure 9.2: Put and get between process memory and windows

9.1.1 Window creation and freeing


The memory for a window is at first sight ordinary data in user space. There are multiple ways you can
associate data with a window:
1. You can pass a user buffer to MPI_Win_create (figure 9.1). This buffer can be an ordinary array,
or it can be created with MPI_Alloc_mem. (In the former case, it may not be possible to lock the
window; section 9.4.)


Figure 9.2 MPI_Win_allocate

int MPI_Win_allocate(
    MPI_Aint size,    /* IN:  size of window in bytes */
    int disp_unit,    /* IN:  local unit size for displacements, in bytes */
    MPI_Info info,    /* IN:  info argument */
    MPI_Comm comm,    /* IN:  intra-communicator */
    void *baseptr,    /* OUT: initial address of window */
    MPI_Win *win)     /* OUT: window object returned by call */
In the large-count variant MPI_Win_allocate_c the disp_unit argument has type MPI_Aint.

2. You can let MPI do the allocation, so that MPI can perform various optimizations regarding
placement of the memory. The user code then receives the pointer to the data from MPI. This
can again be done in two ways:
• Use MPI_Win_allocate (figure 9.2) to create the data and the window in one call.
• If a communicator is on a shared memory (see section 12.1) you can create a window
in that shared memory with MPI_Win_allocate_shared. This will be useful for MPI shared
memory; see chapter 12.
3. Finally, you can create a window with MPI_Win_create_dynamic which postpones the allocation;
see section 9.5.2.
First of all, MPI_Win_create creates a window from a pointer to memory. The data array must not be
PARAMETER or static const.
The size parameter is measured in bytes. In C this can be done with the sizeof operator;
// putfencealloc.c
MPI_Win the_window;
int *window_data;
MPI_Win_allocate(2*sizeof(int),sizeof(int),
MPI_INFO_NULL,comm,
&window_data,&the_window);

for doing this calculation in Fortran, see section 15.3.1.


Python note 27: Displacement byte computations. For computing the displacement in bytes, here is a good
way for finding the size of numpy datatypes:
## putfence.py
intsize = np.dtype('int').itemsize
window_data = np.zeros(2,dtype=int)
win = MPI.Win.Create(window_data,intsize,comm=comm)


Figure 9.3 MPI_Alloc_mem

int MPI_Alloc_mem(
    MPI_Aint size,    /* IN:  size of memory segment in bytes */
    MPI_Info info,    /* IN:  info argument */
    void *baseptr)    /* OUT: pointer to beginning of memory segment allocated */

Next, one can obtain the memory from MPI by using MPI_Win_allocate, which has the data pointer as
output. Note the void* in the C signature; it is still necessary to pass a pointer to a pointer:
double *window_data;
MPI_Win_allocate( ... &window_data ... );

The routine MPI_Alloc_mem (figure 9.3) performs only the allocation part of MPI_Win_allocate, after which
you need to MPI_Win_create.
• An error of MPI_ERR_NO_MEM indicates that no memory could be allocated.
The following material is for the recently released MPI-4 standard and may not be supported yet.
• Allocated memory can be aligned by specifying an MPI_Info key of
mpi_minimum_memory_alignment.
End of MPI-4 material
This memory is freed with MPI_Free_mem:
// getfence.c
int *number_buffer = NULL;
MPI_Alloc_mem
( /* size: */ 2*sizeof(int),
MPI_INFO_NULL,&number_buffer);
MPI_Win_create
( number_buffer,2*sizeof(int),sizeof(int),
MPI_INFO_NULL,comm,&the_window);
MPI_Win_free(&the_window);
MPI_Free_mem(number_buffer);

(Note the lack of an ampersand in the free call!)


These calls reduce to malloc and free if there is no special memory area; SGI is an example where such
memory does exist.
A window is freed with a call to the collective MPI_Win_free (figure 9.4), which sets the window han-
dle to MPI_WIN_NULL. This call must only be done if all RMA operations are concluded, by MPI_Win_fence,
MPI_Win_wait, MPI_Win_complete, MPI_Win_unlock, depending on the case. If the window memory was allo-
cated internally by MPI through a call to MPI_Win_allocate or MPI_Win_allocate_shared, it is freed. User
memory used for the window can be freed after the MPI_Win_free call.
There will be more discussion of window memory in section 9.5.1.


Figure 9.4 MPI_Win_free

int MPI_Win_free(
    MPI_Win *win)     /* INOUT: window object */

Python note 28: Window buffers. Unlike in C, the python window allocate call does not return a pointer to
the buffer memory, but an MPI.memory object. Should you need the bare memory, there are the
following options:
• Window objects expose the Python buffer interface. So you can do Pythonic things like
mview = memoryview(win)
array = numpy.frombuffer(win, dtype='i4')

• If you really want the raw base pointer (as an integer), you can do any of these:
base, size, disp_unit = win.atts
base = win.Get_attr(MPI.WIN_BASE)

• You can use mpi4py’s builtin memoryview/buffer-like type, but I do not recommend it; it is much
better to use NumPy as above:
mem = win.tomemory() # type(mem) is MPI.memory, similar to memoryview, but quite
↪limited in functionality
base = mem.address
size = mem.nbytes

9.1.2 Address arithmetic


Working with windows involves a certain amount of arithmetic on addresses, meaning MPI_Aint. See
MPI_Aint_add and MPI_Aint_diff in section 6.2.4.
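A minimal sketch (hypothetical names) of such address arithmetic:
double array[10];
MPI_Aint addr0, addr5, offset;
MPI_Get_address( &array[0], &addr0 );
MPI_Get_address( &array[5], &addr5 );
offset = MPI_Aint_diff( addr5, addr0 );          // 5*sizeof(double) bytes
MPI_Aint again = MPI_Aint_add( addr0, offset );  // equals addr5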

9.2 Active target synchronization: epochs


One-sided communication has an obvious complication over two-sided: if you do a put call instead of a
send, how does the recipient know that the data is there? This process of letting the target know the state
of affairs is called ‘synchronization’, and there are various mechanisms for it. First of all we will consider
active target synchronization. Here the target knows when the transfer may happen (the communication
epoch), but does not do any data-related calls.
In this section we look at the first mechanism, which is to use a fence operation: MPI_Win_fence (figure 9.5).
This operation is collective on the communicator of the window. (Another, more sophisticated mechanism
for active target synchronization is discussed in section 9.2.2.)
The interval between two fences is known as an epoch. Roughly speaking, in an epoch you can make
one-sided communication calls, and after the concluding fence all these communications are concluded.


Figure 9.5 MPI_Win_fence

int MPI_Win_fence(
    int assert,       /* IN: program assertion */
    MPI_Win win)      /* IN: window object */

Python:

win.Fence(self, int assertion=0)

MPI_Win_fence(0,win);
MPI_Get( /* operands */, win);
MPI_Win_fence(0, win);
// the `got' data is available

In between the two fences the window is exposed, and while it is you should not access it locally. If you
absolutely need to access it locally, you can use an RMA operation for that. Also, there can be only one
remote process that does a put; multiple accumulate accesses are allowed.
Fences are, together with other window calls, collective operations. That means they imply some amount
of synchronization between processes. Consider:
MPI_Win_fence( ... win ... ); // start an epoch
if (mytid==0) {
  // do lots of work
}
MPI_Win_fence( ... win ... ); // end the epoch

and assume that all processes execute the first fence more or less at the same time. The zero process does
work before it can do the second fence call, but all other processes can call it immediately. However, they
can not finish that second fence call until all one-sided communication is finished, which means they wait
for the zero process.
As a further restriction, you can not mix MPI_Get with MPI_Put or MPI_Accumulate calls in a single epoch.
Hence, we can characterize an epoch as an access epoch on the origin, and as an exposure epoch on the
target.

9.2.1 Fence assertions


You can give various hints to the system about this epoch versus the ones before and after through the
assert parameter.
• MPI_MODE_NOSTORE This value can be specified or not per process.
• MPI_MODE_NOPUT This value can be specified or not per process.
• MPI_MODE_NOPRECEDE This value has to be specified or not the same on all processes.
• MPI_MODE_NOSUCCEED This value has to be specified or not the same on all processes.
Example:
MPI_Win_fence((MPI_MODE_NOPUT | MPI_MODE_NOPRECEDE), win);
MPI_Get( /* operands */, win);
MPI_Win_fence(MPI_MODE_NOSUCCEED, win);


Figure 9.3: A trace of a one-sided communication epoch where process zero only originates a one-sided
transfer

Assertions are an integer parameter: you can combine assertions by adding them or using logical-or. The
value zero is always correct. For further information, see section 9.6.

9.2.2 Non-global target synchronization


The ‘fence’ mechanism (section 9.2) uses a global synchronization on the communicator of the window,
giving a program a BSP-like character. As such it is good for applications where the processes are largely
synchronized, but it may lead to performance inefficiencies if processors are not in step with each other.
On the other hand, global synchronization may have hardware support, making this less restrictive than it may at first
seem.
There is a mechanism that is more fine-grained, by using synchronization only on a processor group. This
takes four different calls, two for starting and two for ending the epoch, separately for target and origin.
You start and complete an exposure epoch with MPI_Win_post / MPI_Win_wait:
int MPI_Win_post(MPI_Group group, int assert, MPI_Win win)
int MPI_Win_wait(MPI_Win win)

In other words, this turns your window into the target for a remote access. There is a non-blocking version
MPI_Win_test of MPI_Win_wait.
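A sketch of polling with MPI_Win_test (hypothetical variable names; the window and group are as in the example below):
MPI_Win_post(origin_group,0,the_window);
int exposure_done = 0;
while (!exposure_done) {
  MPI_Win_test(the_window,&exposure_done);
  // ... do local work that does not touch the window ...
}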

You start and complete an access epoch with MPI_Win_start / MPI_Win_complete:


int MPI_Win_start(MPI_Group group, int assert, MPI_Win win)
int MPI_Win_complete(MPI_Win win)

In other words, these calls border the access to a remote window, with the current processor being the
origin of the remote access.


Figure 9.4: Window locking calls in fine-grained active target synchronization

In the following snippet a single processor puts data on one other process. Note that they both have their own
definition of the group, and that the receiving process only does the post and wait calls.
// postwaitwin.c
MPI_Comm_group(comm,&all_group);
if (procno==origin) {
MPI_Group_incl(all_group,1,&target,&two_group);
// access
MPI_Win_start(two_group,0,the_window);
MPI_Put( /* data on origin: */ &my_number, 1,MPI_INT,
/* data on target: */ target,0, 1,MPI_INT,
the_window);
MPI_Win_complete(the_window);
}

if (procno==target) {
MPI_Group_incl(all_group,1,&origin,&two_group);
// exposure
MPI_Win_post(two_group,0,the_window);
MPI_Win_wait(the_window);
}

Both pairs of operations declare a group of processors; see section 7.5.1 for how to get such a group from
a communicator. On an origin processor you would specify a group that includes the targets you will
interact with, on a target processor you specify a group that includes the possible origins.

9.3 Put, get, accumulate


We will now look at the first three routines for doing one-sided operations: the Put, Get, and Accumulate
call. (We will look at so-called ‘atomic’ operations in section 9.3.7.) These calls are somewhat similar to a


Figure 9.6 MPI_Put

int MPI_Put(
    const void *origin_addr,      /* IN: initial address of origin buffer */
    int origin_count,             /* IN: number of entries in origin buffer */
    MPI_Datatype origin_datatype, /* IN: datatype of each entry in origin buffer */
    int target_rank,              /* IN: rank of target */
    MPI_Aint target_disp,         /* IN: displacement from start of window to target buffer */
    int target_count,             /* IN: number of entries in target buffer */
    MPI_Datatype target_datatype, /* IN: datatype of each entry in target buffer */
    MPI_Win win)                  /* IN: window object used for communication */
In the large-count variant MPI_Put_c the count arguments have type MPI_Count.

Python:

win.Put(self, origin, int target_rank, target=None)

Send, Receive and Reduce, except that of course only one process makes a call. Since one process does all
the work, its calling sequence contains both a description of the data on the origin (the calling process)
and the target (the affected other process).
As in the two-sided case, MPI_PROC_NULL can be used as a target rank.
The Accumulate routine has an MPI_Op argument that can be any of the usual operators, but no user-
defined ones (see section 3.10.1).

9.3.1 Put
The MPI_Put (figure 9.6) call can be considered as a one-sided send. As such, it needs to specify
• the target rank
• the data to be sent from the origin, and
• the location where it is to be written on the target.
The description of the data on the origin is the usual trio of buffer/count/datatype. However, the descrip-
tion of the data on the target is more complicated. It has a count and a datatype, but additionally it has
a displacement with respect to the start of the window on the target. This displacement can be given
in bytes, so its type is MPI_Aint, but strictly speaking it is a multiple of the displacement unit that was
specified in the window definition.
Specifically, data is written starting at
window_base + target_disp × disp_unit.


Here is a single put operation. Note that the window create and window fence calls are collective, so they
have to be performed on all processors of the communicator that was used in the create call.
// putfence.c
MPI_Win the_window;
MPI_Win_create
(&window_data,2*sizeof(int),sizeof(int),
MPI_INFO_NULL,comm,&the_window);
MPI_Win_fence(0,the_window);
if (procno==0) {
MPI_Put
( /* data on origin: */ &my_number, 1,MPI_INT,
/* data on target: */ other,1, 1,MPI_INT,
the_window);
}
MPI_Win_fence(0,the_window);
MPI_Win_free(&the_window);

Fortran note 13: Displacement unit. The disp_unit variable is declared as an integer of ‘kind’
MPI_ADDRESS_KIND:

!! putfence.F90
integer(kind=MPI_ADDRESS_KIND) :: target_displacement
target_displacement = 1
call MPI_Put( my_number, 1,MPI_INTEGER, &
other,target_displacement, &
1,MPI_INTEGER, &
the_window)

Prior to Fortran2008, specifying a literal constant, such as 0, could lead to bizarre runtime errors;
the solution was to specify a zero-valued variable of the right type. With the mpi_f08 module
this is no longer allowed. Instead you get an error such as
error #6285: There is no matching specific subroutine for this generic subroutine call. [MPI
Python note 29: MPI one-sided transfer routines. MPI_Put (and Get and Accumulate) accept at minimum the
origin buffer and the target rank. The displacement is by default zero.
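A minimal sketch, using the variable names of the surrounding Python examples:
## hypothetical sketch: the displacement defaults to zero
putdata = np.empty( 1,dtype=np.float64 ); putdata[0] = 3.14
win.Fence()
if procid==0:
    win.Put( putdata,other )             # write at displacement 0
    # win.Put( putdata,other,target=1 )  # or give an explicit displacement
win.Fence()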
Exercise 9.1. Revisit exercise 4.3 and solve it using MPI_Put.
(There is a skeleton for this exercise under the name rightput.)
Exercise 9.2. Write code where:
• process 0 computes a random number 𝑟
• if 𝑟 < .5, zero writes in the window on 1;
• if 𝑟 ≥ .5, zero writes in the window on 2.
(There is a skeleton for this exercise under the name randomput.)


Figure 9.7 MPI_Get

int MPI_Get(
    void *origin_addr,            /* OUT: initial address of origin buffer */
    int origin_count,             /* IN:  number of entries in origin buffer */
    MPI_Datatype origin_datatype, /* IN:  datatype of each entry in origin buffer */
    int target_rank,              /* IN:  rank of target */
    MPI_Aint target_disp,         /* IN:  displacement from window start to the beginning of the target buffer */
    int target_count,             /* IN:  number of entries in target buffer */
    MPI_Datatype target_datatype, /* IN:  datatype of each entry in target buffer */
    MPI_Win win)                  /* IN:  window object used for communication */
In the large-count variant MPI_Get_c the count arguments have type MPI_Count.

Python:

win.Get(self, origin, int target_rank, target=None)

9.3.2 Get
The MPI_Get (figure 9.7) call is very similar.
Example:
MPI_Win_fence(0,the_window);
if (procno==0) {
MPI_Get( /* data on origin: */ &my_number, 1,MPI_INT,
/* data on target: */ other,1, 1,MPI_INT,
the_window);
}
MPI_Win_fence(0,the_window);

We make a null window on processes that do not participate.


## getfence.py
if procid==0 or procid==nprocs-1:
    win_mem = np.empty( 1,dtype=np.float64 )
    win = MPI.Win.Create( win_mem,comm=comm )
else:
    win = MPI.Win.Create( None,comm=comm )

# put data on another process
win.Fence()
if procid==0 or procid==nprocs-1:
    putdata = np.empty( 1,dtype=np.float64 )
    putdata[0] = mydata
    print("[%d] putting %e" % (procid,mydata))
    win.Put( putdata,other )
win.Fence()

9.3.3 Put and get example: halo update

As an example, let’s look at halo update. The array A is updated using the local values and the halo that
comes from bordering processors, either through Put or Get operations.

In a first version we separate computation and communication. Each iteration has two fences. Between
the two fences in the loop body we do the MPI_Put operation; between the second fence and the first one of
the next iteration there is only computation, so we add the MPI_MODE_NOPRECEDE and MPI_MODE_NOSUCCEED
assertions. The MPI_MODE_NOSTORE assertion states that the local window was not updated: the Put operation
only works on remote windows.
for ( .... ) {
update(A);
MPI_Win_fence(MPI_MODE_NOPRECEDE, win);
for(i=0; i < toneighbors; i++)
MPI_Put( ... );
MPI_Win_fence((MPI_MODE_NOSTORE | MPI_MODE_NOSUCCEED), win);
}

For much more about assertions, see section 9.6 below.

Next, we split the update in the core part, which can be done purely from local values, and the boundary,
which needs local and halo values. Update of the core can overlap the communication of the halo.
for ( .... ) {
update_boundary(A);
MPI_Win_fence((MPI_MODE_NOPUT | MPI_MODE_NOPRECEDE), win);
for(i=0; i < fromneighbors; i++)
MPI_Get( ... );
update_core(A);
MPI_Win_fence(MPI_MODE_NOSUCCEED, win);
}

The MPI_MODE_NOPRECEDE and MPI_MODE_NOSUCCEED assertions still hold, but the Get operation implies that
instead of MPI_MODE_NOSTORE in the second fence, we use MPI_MODE_NOPUT in the first.


Figure 9.8 MPI_Accumulate

int MPI_Accumulate(
    const void *origin_addr,      /* IN: initial address of buffer */
    int origin_count,             /* IN: number of entries in buffer */
    MPI_Datatype origin_datatype, /* IN: datatype of each entry */
    int target_rank,              /* IN: rank of target */
    MPI_Aint target_disp,         /* IN: displacement from start of window to beginning of target buffer */
    int target_count,             /* IN: number of entries in target buffer */
    MPI_Datatype target_datatype, /* IN: datatype of each entry in target buffer */
    MPI_Op op,                    /* IN: reduce operation */
    MPI_Win win)                  /* IN: window object */
In the large-count variant MPI_Accumulate_c the count arguments have type MPI_Count.

Python:

MPI.Win.Accumulate(self, origin, int target_rank, target=None, Op op=SUM)

9.3.4 Accumulate
A third one-sided routine is MPI_Accumulate (figure 9.8) which does a reduction operation on the results
that are being put.
Accumulate is an atomic reduction with remote result. This means that multiple accumulates to a single
target in the same epoch give the correct result. As with MPI_Reduce, the order in which the operands are
accumulated is undefined.
The same predefined operators are available, but no user-defined ones. There is one extra operator:
MPI_REPLACE, which has the effect that only the last result to arrive is retained.
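As a minimal sketch (assuming the window setup of the putfence.c example above), every process can atomically add its rank into location zero on process zero:
MPI_Win_fence(0,the_window);
int contribution = procno;
MPI_Accumulate
  ( /* origin: */ &contribution,1,MPI_INT,
    /* target: */ 0, /* displacement: */ 0, 1,MPI_INT,
    MPI_SUM,the_window);
MPI_Win_fence(0,the_window);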

Exercise 9.3. Implement an ‘all-gather’ operation using one-sided communication: each
processor stores a single number, and you want each processor to build up an array
that contains the values from all processors. Note that you do not need a special case
for a processor collecting its own value: doing ‘communication’ between a processor
and itself is perfectly legal.
For the next exercise, refer to figure 9.5.
Exercise 9.4.
Implement a shared counter:
• One process maintains a counter;
• Iterate: all others at random moments update this counter.
• When the counter is no longer positive, everyone stops iterating.
The problem here is data synchronization: does everyone see the counter the same
way?

Figure 9.5: Pool of work descriptors with shared stack pointers

9.3.5 Ordering and coherence of RMA operations


There are few guarantees about what happens inside one epoch.
• No ordering of Get and Put/Accumulate operations: if you do both, there is no guarantee
whether the Get will find the value before or after the update.
• No ordering of multiple Puts. It is safer to do an Accumulate.
The following operations are well-defined inside one epoch:
• Instead of multiple Put operations, use Accumulate with MPI_REPLACE.
• MPI_Get_accumulate with MPI_NO_OP is safe; see the sketch after this list.
• Multiple Accumulate operations from one origin are done in program order by default. To allow
reordering, for instance to have all reads happen after all writes, use the info parameter when
the window is created; section 9.5.3.
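As an illustration of the atomic read with MPI_NO_OP just mentioned, a minimal sketch (assuming element zero of the window on process zero holds the value of interest):
int dummy=0, current_value;
MPI_Get_accumulate
  ( /* origin, ignored for MPI_NO_OP: */ &dummy,1,MPI_INT,
    /* result: */ &current_value,1,MPI_INT,
    /* target: */ 0, /* displacement: */ 0, 1,MPI_INT,
    MPI_NO_OP,the_window);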

9.3.6 Request-based operations


Analogous to MPI_Isend there are request-based one-sided operations: MPI_Rput (figure 9.9) and similarly
MPI_Rget and MPI_Raccumulate and MPI_Rget_accumulate. These only apply to passive target synchroniza-
tion. Any MPI_Win_flush... call also terminates these transfers.
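A sketch (hypothetical variable names; passive target synchronization with locks is discussed in section 9.4):
MPI_Request get_req;
int remote_value;
MPI_Win_lock(MPI_LOCK_SHARED,/* target: */ 1,0,the_window);
MPI_Rget
  ( &remote_value,1,MPI_INT,
    /* target: */ 1, /* displacement: */ 0, 1,MPI_INT,
    the_window,&get_req);
// ... unrelated work ...
MPI_Wait(&get_req,MPI_STATUS_IGNORE); // local buffer is now valid
MPI_Win_unlock(1,the_window);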

9.3.7 Atomic operations


One-sided calls are said to emulate shared memory in MPI, but the put and get calls are not enough for
certain scenarios with shared data. Consider the scenario where:


Figure 9.9 MPI_Rput

int MPI_Rput(
    const void *origin_addr,      /* IN:  initial address of origin buffer */
    int origin_count,             /* IN:  number of entries in origin buffer */
    MPI_Datatype origin_datatype, /* IN:  datatype of each entry in origin buffer */
    int target_rank,              /* IN:  rank of target */
    MPI_Aint target_disp,         /* IN:  displacement from start of window to target buffer */
    int target_count,             /* IN:  number of entries in target buffer */
    MPI_Datatype target_datatype, /* IN:  datatype of each entry in target buffer */
    MPI_Win win,                  /* IN:  window object used for communication */
    MPI_Request *request)         /* OUT: RMA request */
In the large-count variant MPI_Rput_c the count arguments have type MPI_Count.

• One process stores a table of work descriptors, and a pointer to the first unprocessed descriptor;
• Each process reads the pointer, reads the corresponding descriptor, and increments the pointer;
and
• A process that has read a descriptor then executes the corresponding task.
The problem is that reading and updating the pointer is not an atomic operation, so it is possible that
multiple processes get hold of the same value; conversely, multiple updates of the pointer may lead to
work descriptors being skipped. These different overall behaviors, depending on precise timing of lower
level events, are called a race condition.
In MPI-3 some atomic routines have been added. Both MPI_Fetch_and_op (figure 9.10) and
MPI_Get_accumulate (figure 9.11) atomically retrieve data from the window indicated, and apply an
operator, combining the data on the target with the data on the origin. Unlike Put and Get, it is safe to
have multiple atomic operations in the same epoch.
Both routines perform the same operations: return data before the operation, then atomically update data
on the target, but MPI_Get_accumulate is more flexible in data type handling. The more simple routine,
MPI_Fetch_and_op, which operates on only a single element, allows for faster implementations, in particular
through hardware support.
Use of MPI_NO_OP as the MPI_Op turns these routines into an atomic Get. Similarly, using MPI_REPLACE turns
them into an atomic Put.
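For instance, a minimal sketch of an atomic put of one element, inside a fence epoch, assuming the_window and counter_process from the surrounding examples:
MPI_Win_fence(0,the_window);
int newval = 42, oldval;
MPI_Fetch_and_op
  ( &newval,&oldval,MPI_INT,
    counter_process,/* displacement: */ 0,
    MPI_REPLACE,the_window);
MPI_Win_fence(0,the_window);
// oldval holds the value that was overwritten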
Exercise 9.5. Redo exercise 9.4 using MPI_Fetch_and_op. The problem is again to make sure
all processes have the same view of the shared counter.


Figure 9.10 MPI_Fetch_and_op

int MPI_Fetch_and_op(
    const void *origin_addr,   /* IN:  initial address of buffer */
    void *result_addr,         /* OUT: initial address of result buffer */
    MPI_Datatype datatype,     /* IN:  datatype of the entry in origin, result, and target buffers */
    int target_rank,           /* IN:  rank of target */
    MPI_Aint target_disp,      /* IN:  displacement from start of window to beginning of target buffer */
    MPI_Op op,                 /* IN:  reduce operation */
    MPI_Win win)               /* IN:  window object */

Figure 9.11 MPI_Get_accumulate

int MPI_Get_accumulate(
    const void *origin_addr,      /* IN:  initial address of buffer */
    int origin_count,             /* IN:  number of entries in origin buffer */
    MPI_Datatype origin_datatype, /* IN:  datatype of each entry in origin buffer */
    void *result_addr,            /* OUT: initial address of result buffer */
    int result_count,             /* IN:  number of entries in result buffer */
    MPI_Datatype result_datatype, /* IN:  datatype of each entry in result buffer */
    int target_rank,              /* IN:  rank of target */
    MPI_Aint target_disp,         /* IN:  displacement from start of window to beginning of target buffer */
    int target_count,             /* IN:  number of entries in target buffer */
    MPI_Datatype target_datatype, /* IN:  datatype of each entry in target buffer */
    MPI_Op op,                    /* IN:  reduce operation */
    MPI_Win win)                  /* IN:  window object */
In the large-count variant MPI_Get_accumulate_c the count arguments have type MPI_Count.


Does it work to make the fetch-and-op conditional? Is there a way to do it
unconditionally? What should the ‘break’ test be, seeing that multiple processes can
update the counter at the same time?
Example. A root process has a table of data; the other processes do atomic gets and update of that data
using passive target synchronization through MPI_Win_lock.
// passive.cxx
if (procno==repository) {
// Repository processor creates a table of inputs
// and associates that with the window
}
if (procno!=repository) {
float contribution=(float)procno,table_element;
int loc=0;
MPI_Win_lock(MPI_LOCK_EXCLUSIVE,repository,0,the_window);
// read the table element by getting the result from adding zero
MPI_Fetch_and_op
(&contribution,&table_element,MPI_FLOAT,
repository,loc,MPI_SUM,the_window);
MPI_Win_unlock(repository,the_window);
}

## passive.py
if procid==repository:
    # repository process creates a table of inputs
    # and associates it with the window
    win_mem = np.empty( ninputs,dtype=np.float32 )
    win = MPI.Win.Create( win_mem,comm=comm )
else:
    # everyone else has an empty window
    win = MPI.Win.Create( None,comm=comm )
if procid!=repository:
    contribution = np.empty( 1,dtype=np.float32 )
    contribution[0] = 1.*procid
    table_element = np.empty( 1,dtype=np.float32 )
    win.Lock( repository,lock_type=MPI.LOCK_EXCLUSIVE )
    win.Fetch_and_op( contribution,table_element,repository,0,MPI.SUM)
    win.Unlock( repository )

Finally, MPI_Compare_and_swap (figure 9.12) writes the origin data to the target, but only if the target data
equals some comparison value; in either case the original target value is returned in the result buffer.
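A minimal sketch (hypothetical): process zero’s window holds a flag that starts at zero, and each process tries to be the one that sets it to one:
int newval=1, compare=0, oldval;
MPI_Win_lock(MPI_LOCK_SHARED,/* target: */ 0,0,the_window);
MPI_Compare_and_swap
  ( &newval,&compare,&oldval,MPI_INT,
    /* target: */ 0, /* displacement: */ 0,
    the_window);
MPI_Win_unlock(0,the_window);
int i_set_the_flag = (oldval==compare);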

9.3.7.1 A case study in atomic operations


Let us consider an example where a process, identified by counter_process, has a table of work descriptors,
and all processes, including the counter process, take items from it to work on. To avoid duplicate work,
the counter process has a counter that indicates the highest numbered available item. The part of this
application that we simulate is this:
1. a process reads the counter, to find an available work item; and
2. subsequently decrements the counter by one.


Figure 9.12 MPI_Compare_and_swap

int MPI_Compare_and_swap(
    const void *origin_addr,   /* IN:  initial address of buffer */
    const void *compare_addr,  /* IN:  initial address of compare buffer */
    void *result_addr,         /* OUT: initial address of result buffer */
    MPI_Datatype datatype,     /* IN:  datatype of the element in all buffers */
    int target_rank,           /* IN:  rank of target */
    MPI_Aint target_disp,      /* IN:  displacement from start of window to beginning of target buffer */
    MPI_Win win)               /* IN:  window object */

We initialize the window content, under the separate memory model:


// countdownop.c
MPI_Win_fence(0,the_window);
if (procno==counter_process)
MPI_Put(&counter_init,1,MPI_INT,
counter_process,0,1,MPI_INT,
the_window);
MPI_Win_fence(0,the_window);

We start by considering the naive approach, where we execute the above scheme literally with MPI_Get
and MPI_Put:
// countdownput.c
MPI_Win_fence(0,the_window);
int counter_value;
MPI_Get( &counter_value,1,MPI_INT,
counter_process,0,1,MPI_INT,
the_window);
MPI_Win_fence(0,the_window);
if (i_am_available) {
int decrement = -1;
counter_value += decrement;
MPI_Put
( &counter_value, 1,MPI_INT,
counter_process,0,1,MPI_INT,
the_window);
}
MPI_Win_fence(0,the_window);

This scheme is correct if only one process has a true value for i_am_available: that process ‘owns’ the current
counter value, and it correctly updates the counter through the MPI_Put operation. However, if more than
one process is available, they get duplicate counter values, and the update is also incorrect. If we run this
program, we see that the counter did not get decremented by the total number of ‘put’ calls.

Exercise 9.6. Supposing only one process is available, what is the function of the middle of
the three fences? Can it be omitted?

We can fix the decrement of the counter by using MPI_Accumulate for the counter update, since it is atomic:
multiple updates in the same epoch all get processed.

// countdownacc.c
MPI_Win_fence(0,the_window);
int counter_value;
MPI_Get( &counter_value,1,MPI_INT,
counter_process,0,1,MPI_INT,
the_window);
MPI_Win_fence(0,the_window);
if (i_am_available) {
int decrement = -1;
MPI_Accumulate
( &decrement, 1,MPI_INT,
counter_process,0,1,MPI_INT,
MPI_SUM,
the_window);
}
MPI_Win_fence(0,the_window);

This scheme still suffers from the problem that processes will obtain duplicate counter values. The true
solution is to combine the ‘get’ and ‘put’ operations into one atomic action; in this case MPI_Fetch_and_op:

MPI_Win_fence(0,the_window);
int counter_value;
if (i_am_available) {
  int decrement = -1;
  total_decrement++;
  MPI_Fetch_and_op
    ( /* operate with data from origin: */ &decrement,
      /* retrieve data from target: */    &counter_value,
      MPI_INT, counter_process, 0, MPI_SUM,
      the_window);
}
MPI_Win_fence(0,the_window);
if (i_am_available) {
  my_counter_values[n_my_counter_values++] = counter_value;
}

Now, if there are multiple accesses, each retrieves the counter value and updates