
GPU Parallel Program

Development Using CUDA


Chapman & Hall/CRC
Computational Science Series
SERIES EDITOR

Horst Simon
Deputy Director
Lawrence Berkeley National Laboratory
Berkeley, California, U.S.A.

PUBLISHED TITLES

COMBINATORIAL SCIENTIFIC COMPUTING
Edited by Uwe Naumann and Olaf Schenk
CONTEMPORARY HIGH PERFORMANCE COMPUTING: FROM PETASCALE
TOWARD EXASCALE
Edited by Jeffrey S. Vetter
CONTEMPORARY HIGH PERFORMANCE COMPUTING: FROM PETASCALE
TOWARD EXASCALE, VOLUME TWO
Edited by Jeffrey S. Vetter
DATA-INTENSIVE SCIENCE
Edited by Terence Critchlow and Kerstin Kleese van Dam
ELEMENTS OF PARALLEL COMPUTING
Eric Aubanel
THE END OF ERROR: UNUM COMPUTING
John L. Gustafson
EXASCALE SCIENTIFIC APPLICATIONS: SCALABILITY AND
PERFORMANCE PORTABILITY
Edited by Tjerk P. Straatsma, Katerina B. Antypas, and Timothy J. Williams
FROM ACTION SYSTEMS TO DISTRIBUTED SYSTEMS: THE REFINEMENT APPROACH
Edited by Luigia Petre and Emil Sekerinski
FUNDAMENTALS OF MULTICORE SOFTWARE DEVELOPMENT
Edited by Victor Pankratius, Ali-Reza Adl-Tabatabai, and Walter Tichy
FUNDAMENTALS OF PARALLEL MULTICORE ARCHITECTURE
Yan Solihin
THE GREEN COMPUTING BOOK: TACKLING ENERGY EFFICIENCY AT LARGE SCALE
Edited by Wu-chun Feng
GRID COMPUTING: TECHNIQUES AND APPLICATIONS
Barry Wilkinson
GPU PARALLEL PROGRAM DEVELOPMENT USING CUDA
Tolga Soyata
HIGH PERFORMANCE COMPUTING: PROGRAMMING AND APPLICATIONS
John Levesque with Gene Wagenbreth
HIGH PERFORMANCE PARALLEL I/O
Prabhat and Quincey Koziol
HIGH PERFORMANCE VISUALIZATION:
ENABLING EXTREME-SCALE SCIENTIFIC INSIGHT
Edited by E. Wes Bethel, Hank Childs, and Charles Hansen
INDUSTRIAL APPLICATIONS OF HIGH-PERFORMANCE COMPUTING:
BEST GLOBAL PRACTICES
Edited by Anwar Osseyran and Merle Giles
INTRODUCTION TO COMPUTATIONAL MODELING USING C AND
OPEN-SOURCE TOOLS
José M Garrido
INTRODUCTION TO CONCURRENCY IN PROGRAMMING LANGUAGES
Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen
INTRODUCTION TO ELEMENTARY COMPUTATIONAL MODELING: ESSENTIAL
CONCEPTS, PRINCIPLES, AND PROBLEM SOLVING
José M. Garrido
INTRODUCTION TO HIGH PERFORMANCE COMPUTING FOR SCIENTISTS
AND ENGINEERS
Georg Hager and Gerhard Wellein
INTRODUCTION TO MODELING AND SIMULATION WITH MATLAB® AND PYTHON
Steven I. Gordon and Brian Guilfoos
INTRODUCTION TO REVERSIBLE COMPUTING
Kalyan S. Perumalla
INTRODUCTION TO SCHEDULING
Yves Robert and Frédéric Vivien
INTRODUCTION TO THE SIMULATION OF DYNAMICS USING SIMULINK®
Michael A. Gray
PEER-TO-PEER COMPUTING: APPLICATIONS, ARCHITECTURE, PROTOCOLS,
AND CHALLENGES
Yu-Kwong Ricky Kwok
PERFORMANCE TUNING OF SCIENTIFIC APPLICATIONS
Edited by David Bailey, Robert Lucas, and Samuel Williams
PETASCALE COMPUTING: ALGORITHMS AND APPLICATIONS
Edited by David A. Bader
PROCESS ALGEBRA FOR PARALLEL AND DISTRIBUTED PROCESSING
Edited by Michael Alexander and William Gardner
PROGRAMMING FOR HYBRID MULTI/MANY-CORE MPP SYSTEMS
John Levesque and Aaron Vose
SCIENTIFIC DATA MANAGEMENT: CHALLENGES, TECHNOLOGY, AND DEPLOYMENT
Edited by Arie Shoshani and Doron Rotem
SOFTWARE ENGINEERING FOR SCIENCE
Edited by Jeffrey C. Carver, Neil P. Chue Hong, and George K. Thiruvathukal
GPU Parallel Program
Development Using CUDA

Tolga Soyata
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2018 by Taylor & Francis Group, LLC


CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works

Printed on acid-free paper

International Standard Book Number-13: 978-1-4987-5075-2 (Hardback)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts
have been made to publish reliable data and information, but the author and publisher cannot assume
responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize
to copyright holders if permission to publish in this form has not been obtained. If any copyright material
has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, trans-
mitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter
invented, including photocopying, microfilming, and recording, or in any information storage or retrieval
system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com
(http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive,
Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and regis-
tration for a variety of users. For organizations that have been granted a photocopy license by the CCC,
a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are
used only for identification and explanation without intent to infringe.
——————————————————————————————————————————————–
Library of Congress Cataloging-in-Publication Data
——————————————————————————————————————————————–
Names: Soyata, Tolga, 1967- author.
Title: GPU parallel program development using CUDA
/ by Tolga Soyata.
Description: Boca Raton, Florida : CRC Press, [2018] | Includes bibliographical
references and index.
Identifiers: LCCN 2017043292 | ISBN 9781498750752 (hardback) |
ISBN 9781315368290 (e-book)
Subjects: LCSH: Parallel programming (Computer science) | CUDA (Computer architecture) |
Graphics processing units–Programming.
Classification: LCC QA76.642.S67 2018 | DDC 005.2/75–dc23
LC record available at https://lccn.loc.gov/2017043292
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com

and the CRC Press Web site at


http://www.crcpress.com
To my wife Eileen
and my step-children Katherine, Andrew, and Eric.
Contents

List of Figures xxiii


List of Tables xxix
Preface xxxiii
About the Author xxxv

Part I Understanding CPU Parallelism


Chapter 1  Introduction to CPU Parallel Programming 3

1.1 EVOLUTION OF PARALLEL PROGRAMMING 3


1.2 MORE CORES, MORE PARALLELISM 4
1.3 CORES VERSUS THREADS 5
1.3.1 More Threads or More Cores to Parallelize? 5
1.3.2 Influence of Core Resource Sharing 7
1.3.3 Influence of Memory Resource Sharing 7
1.4 OUR FIRST SERIAL PROGRAM 8
1.4.1 Understanding Data Transfer Speeds 8
1.4.2 The main() Function in imflip.c 10
1.4.3 Flipping Rows Vertically: FlipImageV() 11
1.4.4 Flipping Columns Horizontally: FlipImageH() 12
1.5 WRITING, COMPILING, RUNNING OUR PROGRAMS 13
1.5.1 Choosing an Editor and a Compiler 13
1.5.2 Developing in Windows 7, 8, and Windows 10
Platforms 13
1.5.3 Developing in a Mac Platform 15
1.5.4 Developing in a Unix Platform 15
1.6 CRASH COURSE ON UNIX 15
1.6.1 Unix Directory-Related Commands 15
1.6.2 Unix File-Related Commands 16
1.7 DEBUGGING YOUR PROGRAMS 19
1.7.1 gdb 20
1.7.2 Old School Debugging 21
1.7.3 valgrind 22


1.8 PERFORMANCE OF OUR FIRST SERIAL PROGRAM 23


1.8.1 Can We Estimate the Execution Time? 24
1.8.2 What Does the OS Do When Our Code Is
Executing? 24
1.8.3 How Do We Parallelize It? 25
1.8.4 Thinking About the Resources 25

Chapter 2  Developing Our First Parallel CPU Program 27

2.1 OUR FIRST PARALLEL PROGRAM 27


2.1.1 The main() Function in imflipP.c 28
2.1.2 Timing the Execution 29
2.1.3 Split Code Listing for main() in imflipP.c 29
2.1.4 Thread Initialization 32
2.1.5 Thread Creation 32
2.1.6 Thread Launch/Execution 34
2.1.7 Thread Termination (Join) 35
2.1.8 Thread Task and Data Splitting 35
2.2 WORKING WITH BITMAP (BMP) FILES 37
2.2.1 BMP is a Non-Lossy/Uncompressed File
Format 37
2.2.2 BMP Image File Format 38
2.2.3 Header File ImageStuff.h 39
2.2.4 Image Manipulation Routines in ImageStuff.c 40
2.3 TASK EXECUTION BY THREADS 42
2.3.1 Launching a Thread 43
2.3.2 Multithreaded Vertical Flip: MTFlipV() 45
2.3.3 Comparing FlipImageV() and MTFlipV() 48
2.3.4 Multithreaded Horizontal Flip: MTFlipH() 50
2.4 TESTING/TIMING THE MULTITHREADED CODE 51

Chapter 3  Improving Our First Parallel CPU Program 53

3.1 EFFECT OF THE “PROGRAMMER” ON PERFORMANCE 53


3.2 EFFECT OF THE “CPU” ON PERFORMANCE 54
3.2.1 In-Order versus Out-Of-Order Cores 55
3.2.2 Thin versus Thick Threads 57
3.3 PERFORMANCE OF IMFLIPP 57
3.4 EFFECT OF THE “OS” ON PERFORMANCE 58
3.4.1 Thread Creation 59
3.4.2 Thread Launch and Execution 59
3.4.3 Thread Status 60

3.4.4 Mapping Software Threads to Hardware Threads 61


3.4.5 Program Performance versus Launched Pthreads 62
3.5 IMPROVING IMFLIPP 63
3.5.1 Analyzing Memory Access Patterns in MTFlipH() 64
3.5.2 Multithreaded Memory Access of MTFlipH() 64
3.5.3 DRAM Access Rules of Thumb 66
3.6 IMFLIPPM: OBEYING DRAM RULES OF THUMB 67
3.6.1 Chaotic Memory Access Patterns of imflipP 67
3.6.2 Improving Memory Access Patterns of imflipP 68
3.6.3 MTFlipHM(): The Memory Friendly MTFlipH() 69
3.6.4 MTFlipVM(): The Memory Friendly MTFlipV() 71
3.7 PERFORMANCE OF IMFLIPPM.C 72
3.7.1 Comparing Performances of imflipP.c and imflipPM.c 72
3.7.2 Speed Improvement: MTFlipV() versus MTFlipVM() 73
3.7.3 Speed Improvement: MTFlipH() versus MTFlipHM() 73
3.7.4 Understanding the Speedup: MTFlipH() versus MTFlipHM() 73
3.8 PROCESS MEMORY MAP 74
3.9 INTEL MIC ARCHITECTURE: XEON PHI 76
3.10 WHAT ABOUT THE GPU? 77
3.11 CHAPTER SUMMARY 78

Chapter 4  Understanding the Cores and Memory 79

4.1 ONCE UPON A TIME ... INTEL ... 79


4.2 CPU AND MEMORY MANUFACTURERS 80
4.3 DYNAMIC (DRAM) VERSUS STATIC (SRAM) MEMORY 81
4.3.1 Static Random Access Memory (SRAM) 81
4.3.2 Dynamic Random Access Memory (DRAM) 81
4.3.3 DRAM Interface Standards 81
4.3.4 Influence of DRAM on our Program Performance 82
4.3.5 Influence of SRAM (Cache) on our Program
Performance 83
4.4 IMAGE ROTATION PROGRAM: IMROTATE.C 83
4.4.1 Description of the imrotate.c 84
4.4.2 imrotate.c: Parametric Restrictions and
Simplifications 84
4.4.3 imrotate.c: Theory of Operation 85
4.5 PERFORMANCE OF IMROTATE 89
4.5.1 Qualitative Analysis of Threading Efficiency 89
4.5.2 Quantitative Analysis: Defining Threading
Efficiency 89

4.6 THE ARCHITECTURE OF THE COMPUTER 91


4.6.1 The Cores, L1$ and L2$ 91
4.6.2 Internal Core Resources 92
4.6.3 The Shared L3 Cache Memory (L3$) 94
4.6.4 The Memory Controller 94
4.6.5 The Main Memory 95
4.6.6 Queue, Uncore, and I/O 96
4.7 IMROTATEMC: MAKING IMROTATE MORE EFFICIENT 97
4.7.1 Rotate2(): How Bad is Square Root and FP Division? 99
4.7.2 Rotate3() and Rotate4(): How Bad Is sin() and cos()? 100
4.7.3 Rotate5(): How Bad Is Integer Division/Multiplication? 102
4.7.4 Rotate6(): Consolidating Computations 102
4.7.5 Rotate7(): Consolidating More Computations 104
4.7.6 Overall Performance of imrotateMC 104
4.8 CHAPTER SUMMARY 106

Chapter 5  Thread Management and Synchronization 107

5.1 EDGE DETECTION PROGRAM: IMEDGE.C 107


5.1.1 Description of the imedge.c 108
5.1.2 imedge.c: Parametric Restrictions and
Simplifications 108
5.1.3 imedge.c: Theory of Operation 109
5.2 IMEDGE.C : IMPLEMENTATION 111
5.2.1 Initialization and Time-Stamping 112
5.2.2 Initialization Functions for Different Image
Representations 113
5.2.3 Launching and Terminating Threads 114
5.2.4 Gaussian Filter 115
5.2.5 Sobel 116
5.2.6 Threshold 117
5.3 PERFORMANCE OF IMEDGE 118
5.4 IMEDGEMC: MAKING IMEDGE MORE EFFICIENT 118
5.4.1 Using Precomputation to Reduce Bandwidth 119
5.4.2 Storing the Precomputed Pixel Values 120
5.4.3 Precomputing Pixel Values 121
5.4.4 Reading the Image and Precomputing Pixel
Values 122
5.4.5 PrGaussianFilter 123
5.4.6 PrSobel 124
5.4.7 PrThreshold 125
5.5 PERFORMANCE OF IMEDGEMC 126

5.6 IMEDGEMCT: SYNCHRONIZING THREADS EFFICIENTLY 127


5.6.1 Barrier Synchronization 128
5.6.2 MUTEX Structure for Data Sharing 129
5.7 IMEDGEMCT: IMPLEMENTATION 130
5.7.1 Using a MUTEX: Read Image, Precompute 132
5.7.2 Precomputing One Row at a Time 133
5.8 PERFORMANCE OF IMEDGEMCT 134

Part II GPU Programming Using CUDA


Chapter 6  Introduction to GPU Parallelism and CUDA 137

6.1 ONCE UPON A TIME ... NVIDIA ... 137


6.1.1 The Birth of the GPU 137
6.1.2 Early GPU Architectures 138
6.1.3 The Birth of the GPGPU 140
6.1.4 Nvidia, ATI Technologies, and Intel 141
6.2 COMPUTE-UNIFIED DEVICE ARCHITECTURE (CUDA) 143
6.2.1 CUDA, OpenCL, and Other GPU Languages 143
6.2.2 Device Side versus Host Side Code 143
6.3 UNDERSTANDING GPU PARALLELISM 144
6.3.1 How Does the GPU Achieve High Performance? 145
6.3.2 CPU versus GPU Architectural Differences 146
6.4 CUDA VERSION OF THE IMAGE FLIPPER: IMFLIPG.CU 147
6.4.1 imflipG.cu: Read the Image into a CPU-Side Array 149
6.4.2 Initialize and Query the GPUs 151
6.4.3 GPU-Side Time-Stamping 153
6.4.4 GPU-Side Memory Allocation 155
6.4.5 GPU Drivers and Nvidia Runtime Engine 155
6.4.6 CPU→GPU Data Transfer 156
6.4.7 Error Reporting Using Wrapper Functions 157
6.4.8 GPU Kernel Execution 157
6.4.9 Finish Executing the GPU Kernel 160
6.4.10 Transfer GPU Results Back to the CPU 161
6.4.11 Complete Time-Stamping 161
6.4.12 Report the Results and Cleanup 162
6.4.13 Reading and Writing the BMP File 163
6.4.14 Vflip(): The GPU Kernel for Vertical Flipping 164
6.4.15 What Is My Thread ID, Block ID, and Block Dimension? 166
6.4.16 Hflip(): The GPU Kernel for Horizontal Flipping 169

6.4.17 Hardware Parameters: threadIdx.x, blockIdx.x, blockDim.x 169
6.4.18 PixCopy(): The GPU Kernel for Copying an Image 169
6.4.19 CUDA Keywords 170
6.5 CUDA PROGRAM DEVELOPMENT IN WINDOWS 170
6.5.1 Installing MS Visual Studio 2015 and CUDA Toolkit 8.0 171
6.5.2 Creating Project imflipG.cu in Visual Studio 2015 172
6.5.3 Compiling Project imflipG.cu in Visual Studio 2015 174
6.5.4 Running Our First CUDA Application: imflipG.exe 177
6.5.5 Ensuring Your Program’s Correctness 178
6.6 CUDA PROGRAM DEVELOPMENT ON A MAC PLATFORM 179
6.6.1 Installing XCode on Your Mac 179
6.6.2 Installing the CUDA Driver and CUDA Toolkit 180
6.6.3 Compiling and Running CUDA Applications on a Mac 180
6.7 CUDA PROGRAM DEVELOPMENT IN A UNIX PLATFORM 181
6.7.1 Installing Eclipse and CUDA Toolkit 181
6.7.2 ssh into a Cluster 182
6.7.3 Compiling and Executing Your CUDA Code 182

Chapter 7  CUDA Host/Device Programming Model 185

7.1 DESIGNING YOUR PROGRAM’S PARALLELISM 185


7.1.1 Conceptually Parallelizing a Task 186
7.1.2 What Is a Good Block Size for Vflip()? 187
7.1.3 imflipG.cu: Interpreting the Program Output 187
7.1.4 imflipG.cu: Performance Impact of Block and
Image Size 188
7.2 KERNEL LAUNCH COMPONENTS 189
7.2.1 Grids 189
7.2.2 Blocks 190
7.2.3 Threads 191
7.2.4 Warps and Lanes 192
7.3 IMFLIPG.CU: UNDERSTANDING THE KERNEL DETAILS 193
7.3.1 Launching Kernels in main() and Passing Arguments
to Them 193
7.3.2 Thread Execution Steps 194
7.3.3 Vflip() Kernel Details 195
7.3.4 Comparing Vflip() and MTFlipV() 196
7.3.5 Hflip() Kernel Details 197
7.3.6 PixCopy() Kernel Details 197
7.4 DEPENDENCE OF PCI EXPRESS SPEED ON THE CPU 199

7.5 PERFORMANCE IMPACT OF PCI EXPRESS BUS 200


7.5.1 Data Transfer Time, Speed, Latency, Throughput, and
Bandwidth 200
7.5.2 PCIe Throughput Achieved with imflipG.cu 201
7.6 PERFORMANCE IMPACT OF GLOBAL MEMORY BUS 204
7.7 PERFORMANCE IMPACT OF COMPUTE CAPABILITY 206
7.7.1 Fermi, Kepler, Maxwell, Pascal, and Volta Families 207
7.7.2 Relative Bandwidth Achieved in Different Families 207
7.7.3 imflipG2.cu: Compute Capability 2.0 Version of imflipG.cu 208
7.7.4 imflipG2.cu: Changes in main() 210
7.7.5 The PxCC20() Kernel 211
7.7.6 The VfCC20() Kernel 212
7.8 PERFORMANCE OF IMFLIPG2.CU 214
7.9 OLD-SCHOOL CUDA DEBUGGING 214
7.9.1 Common CUDA Bugs 216
7.9.2 return Debugging 218
7.9.3 Comment-Based Debugging 220
7.9.4 printf() Debugging 220
7.10 BIOLOGICAL REASONS FOR SOFTWARE BUGS 221
7.10.1 How Is Our Brain Involved in Writing/Debugging Code? 222
7.10.2 Do We Write Buggy Code When We Are Tired? 222
7.10.2.1 Attention 223
7.10.2.2 Physical Tiredness 223
7.10.2.3 Tiredness Due to Heavy Physical Activity 223
7.10.2.4 Tiredness Due to Needing Sleep 223
7.10.2.5 Mental Tiredness 224

Chapter 8  Understanding GPU Hardware Architecture 225

8.1 GPU HARDWARE ARCHITECTURE 226


8.2 GPU HARDWARE COMPONENTS 226
8.2.1 SM: Streaming Multiprocessor 226
8.2.2 GPU Cores 227
8.2.3 Giga-Thread Scheduler 227
8.2.4 Memory Controllers 229
8.2.5 Shared Cache Memory (L2$) 229
8.2.6 Host Interface 229
8.3 NVIDIA GPU ARCHITECTURES 230
8.3.1 Fermi Architecture 231
8.3.2 GT, GTX, and Compute Accelerators 231
8.3.3 Kepler Architecture 232

8.3.4 Maxwell Architecture 232


8.3.5 Pascal Architecture and NVLink 233
8.4 CUDA EDGE DETECTION: IMEDGEG.CU 233
8.4.1 Variables to Store the Image in CPU, GPU Memory 233
8.4.1.1 TheImage and CopyImage 233
8.4.1.2 GPUImg 234
8.4.1.3 GPUBWImg 234
8.4.1.4 GPUGaussImg 234
8.4.1.5 GPUGradient and GPUTheta 234
8.4.1.6 GPUResultImg 235
8.4.2 Allocating Memory for the GPU Variables 235
8.4.3 Calling the Kernels and Time-Stamping Their Execution 238
8.4.4 Computing the Kernel Performance 239
8.4.5 Computing the Amount of Kernel Data Movement 239
8.4.6 Reporting the Kernel Performance 242
8.5 IMEDGEG: KERNELS 242
8.5.1 BWKernel() 242
8.5.2 GaussKernel() 244
8.5.3 SobelKernel() 246
8.5.4 ThresholdKernel() 249
8.6 PERFORMANCE OF IMEDGEG.CU 249
8.6.1 imedgeG.cu: PCIe Bus Utilization 250
8.6.2 imedgeG.cu: Runtime Results 250
8.6.3 imedgeG.cu: Kernel Performance Comparison 252
8.7 GPU CODE: COMPILE TIME 253
8.7.1 Designing CUDA Code 253
8.7.2 Compiling CUDA Code 255
8.7.3 GPU Assembly: PTX, CUBIN 255
8.8 GPU CODE: LAUNCH 255
8.8.1 OS Involvement and CUDA DLL File 255
8.8.2 GPU Graphics Driver 256
8.8.3 CPU←→GPU Memory Transfers 256
8.9 GPU CODE: EXECUTION (RUN TIME) 257
8.9.1 Getting the Data 257
8.9.2 Getting the Code and Parameters 257
8.9.3 Launching Grids of Blocks 258
8.9.4 Giga Thread Scheduler (GTS) 258
8.9.5 Scheduling Blocks 259
8.9.6 Executing Blocks 260
8.9.7 Transparent Scalability 261

Chapter 9  Understanding GPU Cores 263

9.1 GPU ARCHITECTURE FAMILIES 263


9.1.1 Fermi Architecture 263
9.1.2 Fermi SM Structure 264
9.1.3 Kepler Architecture 266
9.1.4 Kepler SMX Structure 267
9.1.5 Maxwell Architecture 268
9.1.6 Maxwell SMM Structure 268
9.1.7 Pascal GP100 Architecture 270
9.1.8 Pascal GP100 SM Structure 271
9.1.9 Family Comparison: Peak GFLOPS and Peak DGFLOPS 272
9.1.10 GPU Boost 273
9.1.11 GPU Power Consumption 274
9.1.12 Computer Power Supply 274
9.2 STREAMING MULTIPROCESSOR (SM) BUILDING BLOCKS 275
9.2.1 GPU Cores 275
9.2.2 Double Precision Units (DPU) 276
9.2.3 Special Function Units (SFU) 276
9.2.4 Register File (RF) 276
9.2.5 Load/Store Queues (LDST) 277
9.2.6 L1$ and Texture Cache 277
9.2.7 Shared Memory 278
9.2.8 Constant Cache 278
9.2.9 Instruction Cache 278
9.2.10 Instruction Buffer 278
9.2.11 Warp Schedulers 278
9.2.12 Dispatch Units 279
9.3 PARALLEL THREAD EXECUTION (PTX) DATA TYPES 279
9.3.1 INT8 : 8-bit Integer 280
9.3.2 INT16 : 16-bit Integer 280
9.3.3 24-bit Integer 280
9.3.4 INT32 : 32-bit Integer 281
9.3.5 Predicate Registers (32-bit) 281
9.3.6 INT64 : 64-bit Integer 282
9.3.7 128-bit Integer 282
9.3.8 FP32: Single Precision Floating Point (float) 282
9.3.9 FP64: Double Precision Floating Point (double) 283
9.3.10 FP16: Half Precision Floating Point (half) 284
9.3.11 What is a FLOP? 284

9.3.12 Fused Multiply-Accumulate (FMA) versus Multiply-Add (MAD) 285
9.3.13 Quad and Octo Precision Floating Point 285
9.3.14 Pascal GP104 Engine SM Structure 285
9.4 IMFLIPGC.CU: CORE-FRIENDLY IMFLIPG 286
9.4.1 Hflip2(): Precomputing Kernel Parameters 288
9.4.2 Vflip2(): Precomputing Kernel Parameters 290
9.4.3 Computing Image Coordinates by a Thread 290
9.4.4 Block ID versus Image Row Mapping 291
9.4.5 Hflip3(): Using a 2D Launch Grid 292
9.4.6 Vflip3(): Using a 2D Launch Grid 293
9.4.7 Hflip4(): Computing Two Consecutive Pixels 294
9.4.8 Vflip4(): Computing Two Consecutive Pixels 295
9.4.9 Hflip5(): Computing Four Consecutive Pixels 296
9.4.10 Vflip5(): Computing Four Consecutive Pixels 297
9.4.11 PixCopy2(), PixCopy3(): Copying 2,4 Consecutive Pixels at
a Time 298
9.5 IMEDGEGC.CU: CORE-FRIENDLY IMEDGEG 299
9.5.1 BWKernel2(): Using Precomputed Values and 2D Blocks 299
9.5.2 GaussKernel2(): Using Precomputed Values and 2D Blocks 300

Chapter 10  Understanding GPU Memory 303

10.1 GLOBAL MEMORY 303


10.2 L2 CACHE 304
10.3 TEXTURE/L1 CACHE 304
10.4 SHARED MEMORY 305
10.4.1 Split versus Dedicated Shared Memory 305
10.4.2 Memory Resources Available Per Core 306
10.4.3 Using Shared Memory as Software Cache 306
10.4.4 Allocating Shared Memory in an SM 307
10.5 INSTRUCTION CACHE 307
10.6 CONSTANT MEMORY 307
10.7 IMFLIPGCM.CU: CORE AND MEMORY FRIENDLY IMFLIPG 308
10.7.1 Hflip6(),Vflip6(): Using Shared Memory as Buffer 308
10.7.2 Hflip7(): Consecutive Swap Operations in Shared Memory 310
10.7.3 Hflip8(): Using Registers to Swap Four Pixels 312
10.7.4 Vflip7(): Copying 4 Bytes (int) at a Time 314
10.7.5 Aligned versus Unaligned Data Access in Memory 314
10.7.6 Vflip8(): Copying 8 Bytes at a Time 315
10.7.7 Vflip9(): Using Only Global Memory, 8 Bytes at a Time 316
10.7.8 PixCopy4(), PixCopy5(): Copying One versus 4 Bytes Using Shared Memory 317
10.7.9 PixCopy6(), PixCopy7(): Copying One/Two Integers Using
Global Memory 318
10.8 IMEDGEGCM.CU: CORE- & MEMORY-FRIENDLY IMEDGEG 319
10.8.1 BWKernel3(): Using Byte Manipulation to Extract RGB 319
10.8.2 GaussKernel3(): Using Constant Memory 321
10.8.3 Ways to Handle Constant Values 321
10.8.4 GaussKernel4(): Buffering Neighbors of 1 Pixel in Shared
Memory 323
10.8.5 GaussKernel5(): Buffering Neighbors of 4 Pixels in Shared
Memory 325
10.8.6 GaussKernel6(): Reading 5 Vertical Pixels into Shared
Memory 327
10.8.7 GaussKernel7(): Eliminating the Need to Account for Edge
Pixels 329
10.8.8 GaussKernel8(): Computing 8 Vertical Pixels 331
10.9 CUDA OCCUPANCY CALCULATOR 333
10.9.1 Choosing the Optimum Threads/Block 334
10.9.2 SM-Level Resource Limitations 335
10.9.3 What is “Occupancy”? 336
10.9.4 CUDA Occupancy Calculator: Resource
Computation 336
10.9.5 Case Study: GaussKernel7() 340
10.9.6 Case Study: GaussKernel8() 343

Chapter 11  CUDA Streams 345

11.1 WHAT IS PIPELINING? 347


11.1.1 Execution Overlapping 347
11.1.2 Exposed versus Coalesced Runtime 348
11.2 MEMORY ALLOCATION 349
11.2.1 Physical versus Virtual Memory 349
11.2.2 Physical to Virtual Address Translation 350
11.2.3 Pinned Memory 350
11.2.4 Allocating Pinned Memory with cudaMallocHost() 351
11.3 FAST CPU←→GPU DATA TRANSFERS 351
11.3.1 Synchronous Data Transfers 351
11.3.2 Asynchronous Data Transfers 351
11.4 CUDA STREAMS 352
11.4.1 CPU→GPU Transfer, Kernel Exec, GPU→CPU Transfer 352
11.4.2 Implementing Streaming in CUDA 353

11.4.3 Copy Engine 353


11.4.4 Kernel Execution Engine 353
11.4.5 Concurrent Upstream and Downstream PCIe
Transfers 354
11.4.6 Creating CUDA Streams 355
11.4.7 Destroying CUDA Streams 355
11.4.8 Synchronizing CUDA Streams 355
11.5 IMGSTR.CU: STREAMING IMAGE PROCESSING 356
11.5.1 Reading the Image into Pinned Memory 356
11.5.2 Synchronous versus Single Stream 358
11.5.3 Multiple Streams 359
11.5.4 Data Dependence Across Multiple Streams 361
11.5.4.1 Horizontal Flip: No Data Dependence 362
11.5.4.2 Edge Detection: Data Dependence 363
11.5.4.3 Preprocessing Overlapping Rows Synchronously 363
11.5.4.4 Asynchronous Processing the Non-Overlapping
Rows 364
11.6 STREAMING HORIZONTAL FLIP KERNEL 366
11.7 IMGSTR.CU: STREAMING EDGE DETECTION 367
11.8 PERFORMANCE COMPARISON: IMGSTR.CU 371
11.8.1 Synchronous versus Asynchronous Results 371
11.8.2 Randomness in the Results 372
11.8.3 Optimum Queuing 372
11.8.4 Best Case Streaming Results 373
11.8.5 Worst Case Streaming Results 374
11.9 NVIDIA VISUAL PROFILER: NVVP 375
11.9.1 Installing nvvp and nvprof 375
11.9.2 Using nvvp 376
11.9.3 Using nvprof 377
11.9.4 imGStr Synchronous and Single-Stream Results 377
11.9.5 imGStr 2- and 4-Stream Results 378

Part III More To Know


Chapter 12  CUDA Libraries 383
Mohamadhadi Habibzadeh, Omid Rajabi Shishvan, and Tolga Soyata
12.1 cuBLAS 383
12.1.1 BLAS Levels 383
12.1.2 cuBLAS Datatypes 384
12.1.3 Installing cuBLAS 385
12.1.4 Variable Declaration and Initialization 385

12.1.5 Device Memory Allocation 386


12.1.6 Creating Context 386
12.1.7 Transferring Data to the Device 386
12.1.8 Calling cuBLAS Functions 387
12.1.9 Transfer Data Back to the Host 388
12.1.10 Deallocating Memory 388
12.1.11 Example cuBLAS Program: Matrix Scalar 388
12.2 CUFFT 390
12.2.1 cuFFT Library Characteristics 390
12.2.2 A Sample Complex-to-Complex Transform 390
12.2.3 A Sample Real-to-Complex Transform 391
12.3 NVIDIA PERFORMANCE PRIMITIVES (NPP) 392
12.4 THRUST LIBRARY 393

Chapter 13  Introduction to OpenCL 397


Chase Conklin and Tolga Soyata
13.1 WHAT IS OpenCL? 397
13.1.1 Multiplatform 397
13.1.2 Queue-Based 397
13.2 IMAGE FLIP KERNEL IN OPENCL 398
13.3 RUNNING OUR KERNEL 399
13.3.1 Selecting a Device 400
13.3.2 Running the Kernel 401
13.3.2.1 Creating a Compute Context 401
13.3.2.2 Creating a Command Queue 401
13.3.2.3 Loading Kernel File 402
13.3.2.4 Setting Up Kernel Invocation 403
13.3.3 Runtimes of Our OpenCL Program 405
13.4 EDGE DETECTION IN OpenCL 406

Chapter 14  Other GPU Programming Languages 413


Sam Miller, Andrew Boggio-Dandry, and Tolga Soyata
14.1 GPU PROGRAMMING WITH PYTHON 413
14.1.1 PyOpenCL Version of imflip 414
14.1.2 PyOpenCL Element-Wise Kernel 418
14.2 OPENGL 420
14.3 OPENGL ES: OPENGL FOR EMBEDDED SYSTEMS 420
14.4 VULKAN 421
14.5 MICROSOFT’S HIGH-LEVEL SHADING LANGUAGE (HLSL) 421
14.5.1 Shading 421
14.5.2 Microsoft HLSL 422

14.6 APPLE’S METAL API 422


14.7 APPLE’S SWIFT PROGRAMMING LANGUAGE 423
14.8 OPENCV 423
14.8.1 Installing OpenCV and Face Recognition 423
14.8.2 Mobile-Cloudlet-Cloud Real-Time Face Recognition 423
14.8.3 Acceleration as a Service (AXaas) 423

Chapter 15  Deep Learning Using CUDA 425


Omid Rajabi Shishvan and Tolga Soyata
15.1 ARTIFICIAL NEURAL NETWORKS (ANNS) 425
15.1.1 Neurons 425
15.1.2 Activation Functions 425
15.2 FULLY CONNECTED NEURAL NETWORKS 425
15.3 DEEP NETWORKS/CONVOLUTIONAL NEURAL NETWORKS 427
15.4 TRAINING A NETWORK 428
15.5 CUDNN LIBRARY FOR DEEP LEARNING 428
15.5.1 Creating a Layer 429
15.5.2 Creating a Network 430
15.5.3 Forward Propagation 431
15.5.4 Backpropagation 431
15.5.5 Using cuBLAS in the Network 431
15.6 KERAS 432

Bibliography 435

Index 439
List of Figures

1.1 Harvesting each coconut requires two consecutive 30-second tasks (threads). Thread 1: get a coconut. Thread 2: crack (process) that coconut using the hammer. 4
1.2 Simultaneously executing Thread 1 (“1”) and Thread 2 (“2”). Accessing shared resources will cause a thread to wait (“-”). 6
1.3 Serial (single-threaded) program imflip.c flips a 640×480 dog picture (left) horizontally (middle) or vertically (right). 8
1.4 Running gdb to catch a segmentation fault. 20
1.5 Running valgrind to catch a memory access error. 23
2.1 Windows Task Manager, showing 1499 threads, however, there is 0% CPU utilization. 33
3.1 The life cycle of a thread. From the creation to its termination, a thread is cycled through many different statuses, assigned by the OS. 60
3.2 Memory access patterns of MTFlipH() in Code 2.8. A total of 3200 pixels’ RGB values (9600 Bytes) are flipped for each row. 65
3.3 The memory map of a process when only a single thread is running within the process (left) or multiple threads are running in it (right). 75
4.1 Inside a computer containing an i7-5930K CPU [10] (CPU5 in Table 3.1), and 64 GB of DDR4 memory. This PC has a GTX Titan Z GPU that will be used to test a lot of the programs in Part II. 80
4.2 The imrotate.c program rotates a picture by a specified angle. Original dog (top left), rotated +10◦ (top right), +45◦ (bottom left), and −75◦ (bottom right) clockwise. Scaling is done to avoid cropping of the original image area. 84
4.3 The architecture of one core of the i7-5930K CPU (the PC in Figure 4.1). This core is capable of executing two threads (hyper-threading, as defined by Intel). These two threads share most of the core resources, but have their own register files. 92
4.4 Architecture of the i7-5930K CPU (6C/12T). This CPU connects to the GPUs through an external PCI express bus and memory through the memory bus. 94
5.1 The imedge.c program is used to detect edges in the original image astronaut.bmp (top left). Intermediate processing steps are: GaussianFilter() (top right), Sobel() (bottom left), and finally Threshold() (bottom right). 108
5.2 Example barrier synchronization for 4 threads. Serial runtime is 7281 ms and the 4-threaded runtime is 2246 ms. The speedup of 3.24× is close to the best-expected 4×, but not equal due to the imbalance of each thread’s runtime. 128
5.3 Using a MUTEX data structure to access shared variables. 129

6.1 Turning the dog picture into a 3D wire frame. Triangles are used to represent
the object, rather than pixels. This representation allows us to map a texture
to each triangle. When the object moves, so does each triangle, along with
their associated textures. To increase the resolution of this kind of an object
representation, we can divide triangles into smaller triangles in a process
called tesselation. 139
6.2 Steps to move triangulated 3D objects. Triangles contain two attributes:
their location and their texture. Objects are moved by performing mathe-
matical operations only on their coordinates. A final texture mapping places
the texture back on the moved object coordinates, while a 3D-to-2D transfor-
mation allows the resulting image to be displayed on a regular 2D computer
monitor. 140
6.3 Three farmer teams compete in Analogy 6.1: (1) Arnold competes alone
with his 2× bigger tractor and “the strongest farmer” reputation, (2) Fred
and Jim compete together in a much smaller tractor than Arnold. (3) Tolga,
along with 32 boy and girl scouts, compete together using a bus. Who wins? 145
6.4 Nvidia Runtime Engine is built into your GPU drivers, shown in your Win-
dows 10 Pro SysTray. When you click the Nvidia symbol, you can open the
Nvidia control panel to see the driver version as well as the parameters of
your GPU(s). 156
6.5 Creating a Visual Studio 2015 CUDA project named imflipG.cu. Assume
that the code will be in a directory named Z:\code\imflipG in this example. 172
6.6 Visual Studio 2015 source files are in the Z:\code\imflipG\imflipG direc-
tory. In this specific example, we will remove the default file, kernel.cu, that
VS 2015 creates. After this, we will add an existing file, imflipG.cu, to the
project. 173
6.7 The default CPU platform is x86. We will change it to x64. We will also
remove the GPU debugging option. 174
6.8 The default Compute Capability is 2.0. This is too old. We will change it to
Compute Capability 3.0, which is done by editing Code Generation under
Device and changing it to compute 30, sm 30. 175
6.9 Compiling imflipG.cu to get the executable file imflipG.exe in the
Z:\code\imflipG\x64\Debug directory. 176
6.10 Running imflipG.exe from a CMD command line window. 177
6.11 The /usr/local directory in Unix contains your CUDA directories. 181
6.12 Creating a new CUDA project using the Eclipse IDE in Unix. 183

7.1 The PCIe bus connects for the host (CPU) and the device(s) (GPUs).
The host and each device have their own I/O controllers to allow transfers
through the PCIe bus, while both the host and the device have their own
memory, with a dedicated bus to it; in the GPU this memory is called global
memory. 205

8.1 Analogy 8.1 for executing a massively parallel program using a significant
number of GPU cores, which receive their instructions and data from differ-
ent sources. Melissa (Memory controller ) is solely responsible for bringing
the coconuts from the jungle and dumping them into the big barrel (L2$).
Larry (L2$ controller ) is responsible for distributing these coconuts into the
smaller barrels (L1$) of Laura, Linda, Lilly, and Libby; eventually, these four
folks distribute the coconuts (data) to the scouts (GPU cores). On the right
side, Gina (Giga-Thread Scheduler ) has the big list of tasks (list of blocks to
be executed ); she assigns each block to a school bus (SM or streaming mul-
tiprocessor ). Inside the bus, one person — Tolga, Tony, Tom, and Tim —
is responsible to assign them to the scouts (instruction schedulers). 228
8.2 The internal architecture of the GTX550Ti GPU. A total of 192 GPU cores
are organized into six streaming multiprocessor (SM) groups of 32 GPU
cores. A single L2$ is shared among all 192 cores, while each SM has its
own L1$. A dedicated memory controller is responsible for bringing data in
and out of the GDDR5 global memory and dumping it into the shared L2$,
while a dedicated host interface is responsible for shuttling data (and code)
between the CPU and GPU over the PCIe bus. 230
8.3 A sample output of the imedgeG.cu program executed on the astronaut.bmp
image using a GTX Titan Z GPU. Kernel execution times and the amount
of data movement for each kernel is clearly shown. 242

9.1 GF110 Fermi architecture with 16 SMs, where each SM houses 32 cores, 16
LD/ST units, and 4 Special Function Units (SFUs). The highest end Fermi
GPU contains 512 cores (e.g., GTX 580). 264
9.2 GF110 Fermi SM structure. Each SM has a 128 KB register file that contains
32,768 (32 K) registers, where each register is 32-bits. This register file feeds
operands to the 32 cores and 4 Special Function Units (SFU). 16 Load/Store
(LD/ST) units are used to queue memory load/store requests. A 64 KB total
cache memory is used for L1$ and shared memory. 265
9.3 GK110 Kepler architecture with 15 SMXs, where each SMX houses 192
cores, 48 double precision units (DPU), 32 LD/ST units, and 32 Special
Function Units (SFU). The highest end Kepler GPU contains 2880 cores
(e.g., GTX Titan Black); its “double” version GTX Titan Z contains 5760
cores. 266
9.4 GK110 Kepler SMX structure. A 256 KB (64 K-register) register file feeds
192 cores, 64 Double-Precision Units (DPU), 32 Load/Store units, and 32
SFUs. Four warp schedulers can schedule four warps, which are dispatched
as 8 half-warps. Read-only cache is used to hold constants. 267
9.5 GM200 Maxwell architecture with 24 SMMs, housed inside 6 larger GPC
units; each SMM houses 128 cores, 32 LD/ST units, and 32 Special Function
Units (SFU), does not contain double-precision units (DPUs). The highest
end Maxwell GPU contains 3072 cores (e.g., GTX Titan X). 268
9.6 GM200 Maxwell SMM structure consists of 4 identical sub-structures with
32 cores, 8 LD/ST units, 8 SFUs, and 16 K registers. Two of these sub-
structures share an L1$, while four of them share a 96 KB shared memory. 269
9.7 GP100 Pascal architecture with 60 SMs, housed inside 6 larger GPC units, each containing 10 SMs. The highest end Pascal GPU contains 3840 cores (e.g., P100 compute accelerator). NVLink and High Bandwidth Memory (HBM2) allow significantly faster memory bandwidths as compared to previous generations. 270
9.8 GP100 Pascal SM structure consists of two identical sub-structures that
contain 32 cores, 16 DPUs, 8 LD/ST units, 8 SFUs, and 32 K registers.
They share an instruction cache, however, they have their own instruction
buffer. 271
9.9 IEEE 754-2008 floating point standard and the supported floating point data
types by CUDA. half data type is supported in Compute Capability 5.3 and
above, while float has seen support from the first day of the introduction
of CUDA. Support for double types started in Compute Capability 1.3. 284

10.1 CUDA Occupancy Calculator: Choosing the Compute Capability, max. shared memory size, registers/kernel, and kernel shared memory usage. In this specific case, the occupancy is 24 warps per SM (out of a total of 64), translating to an occupancy of 24 ÷ 64 = 38%. 337
10.2 Analyzing the occupancy of a case with (1) registers/thread=16, (2) shared
memory/kernel=8192 (8 KB), and (3) threads/block=128 (4 warps). CUDA
Occupancy Calculator plots the occupancy when each kernel contains more
registers (top) and as we launch more blocks (bottom), each requiring an
additional 8 KB. With 8 KB/block, the limitation is 24 warps/SM; however,
it would go up to 32 warps/block, if each block only required 6 KB of shared
memory (6144 Bytes), as shown in the shared memory plot (below). 338
10.3 Analyzing the occupancy of a case with (1) registers/thread=16, (2) shared
memory/kernel=8192 (8 KB), and (3) threads/block=128 (4 warps). CUDA
Occupancy Calculator plots the occupancy when we launch our blocks with
more threads/block (top) and provides a summary of which one of the three
resources will be exposed to the limitation before the others (bottom). In this
specific case, the limited amount of shared memory (48 KB) limits the total
number of blocks we can launch to 6. Alternatively, the number of registers
or the maximum number of blocks per SM does not become a limitation. 339
10.4 Analyzing the GaussKernel7(), which uses (1) registers/thread ≈ 16, (2)
shared memory/kernel=40,960 (40 KB), and (3) threads/block=256. It is
clear that the shared memory limitation does not allow us to launch more
than a single block with 256 threads (8 warps). If you could reduce the
shared memory down to 24 KB by redesigning your kernel, you could launch
at least 2 blocks (16 warps, as shown in the plot) and double the occupancy. 341
10.5 Analyzing the GaussKernel7() with (1) registers/thread=16, (2) shared
memory/kernel=40,960, and (3) threads/block=256. 342
10.6 Analyzing the GaussKernel8() with (1) registers/thread=16, (2) shared
memory/kernel=24,576, and (3) threads/block=256. 343
10.7 Analyzing the GaussKernel8() with (1) registers/thread=16, (2) shared
memory/kernel=24,576, and (3) threads/block=256. 344

11.1 Nvidia visual profiler. 376


11.2 Nvidia profiler, command line version. 377
11.3 Nvidia NVVP results with no streaming and using a single stream, on the
K80 GPU. 378
11.4 Nvidia NVVP results with 2 and 4 streams, on the K80 GPU. 379

14.1 imflip.py kernel runtimes on different devices. 417

15.1 Generalized architecture of a fully connected artificial neural network with n inputs, k hidden layers, and m outputs. 426
15.2 Inner structure of a neuron used in ANNs. ωij are the weights by which
inputs to the neuron (x1 , x2 , ..., xn ) are multiplied before they are summed.
“Bias” is a value by which this sum is augmented, and f () is the activation
function, which is used to introduce a non-linear component to the output. 426
List of Tables

1.1 A list of common gdb commands and functionality. 21

2.1 Serial and multithreaded execution time of imflipP.c, both for vertical flip
and horizontal flip, on an i7-960 (4C/8T) CPU. 51

3.1 Different CPUs used in testing the imflipP.c program. 55


3.2 imflipP.c execution times (ms) for the CPUs listed in Table 3.1. 58
3.3 Rules of thumb for achieving good DRAM performance. 67
3.4 imflipPM.c execution times (ms) for the CPUs listed in Table 3.1. 72
3.5 Comparing imflipP.c execution times (H, V type flips in Table 3.2) to
imflipPM.c execution times (I, W type flips in Table 3.4). 73
3.6 Comparing imflipP.c execution times (H, V type flips in Table 3.2) to
imflipPM.c execution times (I, W type flips in Table 3.4) for Xeon Phi 5110P. 77

4.1 imrotate.c execution times for the CPUs in Table 3.1 (+45◦ rotation). 89
4.2 imrotate.c threading efficiency (η) and parallelization overhead (1 − η) for
CPU3, CPU5. The last column reports the speedup achieved by using CPU5
that has more cores/threads, although there is no speedup up to 6 launched
SW threads. 90
4.3 imrotateMC.c execution times for the CPUs in Table 3.1. 105

5.1 Array variables and their types, used during edge detection. 111
5.2 imedge.c execution times for the W3690 CPU (6C/12T). 118
5.3 imedgeMC.c execution times for the W3690 CPU (6C/12T) in ms for a vary-
ing number of threads (above). For comparison, execution times of imedge.c
are repeated from Table 5.2 (below). 126
5.4 imedgeMCT.c execution times (in ms) for the W3690 CPU (6C/12T), using
the Astronaut.bmp image file (top) and Xeon Phi 5110P (60C/240T) using
the dogL.bmp file (bottom). 134

6.1 CUDA keyword and symbols that we learned in this chapter. 170

7.1 Vflip() kernel execution times (ms) for different size images on a GTX TITAN
Z GPU. 188
7.2 Variables available to a kernel upon launch. 190
7.3 Specifications of different computers used in testing the imflipG.cu program,
along with the execution results, compiled using Compute Capability 3.0. 202


7.4 Introduction date and peak bandwidth of different bus types. 203
7.5 Introduction date and peak throughput of different CPU and GPU memory
types. 206
7.6 Results of the imflipG2.cu program, which uses the VfCC20() and PxCC20()
kernels and works in Compute Capability 2.0. 215

8.1 Nvidia microarchitecture families and their hardware features. 233


8.2 Kernels used in imedgeG.cu, along with their source array name and type. 235
8.3 PCIe bandwidth results of imedgeG.cu on six different computer
configurations 249
8.4 imedgeG.cu kernel runtime results; red numbers are the best option for the
number of threads and blue are fairly close to the best option (see ebook
for color version). 251
8.5 Summarized imedgeG.cu kernel runtime results; runtime is reported for 256
threads/block for every case. 252

9.1 Nvidia microarchitecture families and their peak computational power for
single precision (GFLOPS) and double-precision floating point (DGFLOPS). 273
9.2 Comparison of kernel performances between (Hflip() and Hflip2()) as well as
(Vflip() and Vflip2()). 289
9.3 Kernel performances: Hflip(),· · · ,Hflip3(), and Vflip(),· · · ,Vflip3(). 293
9.4 Kernel performances: Hflip(),· · · ,Hflip4(), and Vflip(),· · · ,Vflip4(). 295
9.5 Kernel performances: Hflip(),· · · ,Hflip5(), and Vflip(),· · · ,Vflip5(). 297
9.6 Kernel performances: PixCopy(), PixCopy2(), and PixCopy3(). 298
9.7 Kernel performances: BWKernel() and BWKernel2(). 300
9.8 Kernel performances: GaussKernel() and GaussKernel2(). 301

10.1 Nvidia microarchitecture families and the size of global memory, L1$, L2$
and shared memory in each one of them. 305
10.2 Kernel performances: Hflip() vs. Hflip6() and Vflip() vs. Vflip6(). 309
10.3 Kernel performances: Hflip(), Hflip6(), and Hflip7() using mars.bmp. 311
10.4 Kernel performances: Hflip6(), Hflip7(), Hflip8() using mars.bmp. 313
10.5 Kernel performances: Vflip(), Vflip6(), Vflip7(), and Vflip8(). 315
10.6 Kernel performances: Vflip(), Vflip6(), Vflip7(), Vflip8(), and Vflip9(). 316
10.7 Kernel performances: PixCopy(), PixCopy2(), . . . , PixCopy5(). 317
10.8 Kernel performances: PixCopy(), PixCopy4(), . . . , PixCopy7(). 318
10.9 Kernel performances: BWKernel(), BWKernel2(), and BWKernel3(). 320
10.10 Kernel performances: GaussKernel(), GaussKernel2(), GaussKernel3() 322
10.11 Kernel performances: GaussKernel(), . . . , GaussKernel4(). 324
10.12 Kernel performances: GaussKernel1(), . . . , GaussKernel5(). 325
10.13 Kernel performances: GaussKernel3(), . . . , GaussKernel6(). 327
10.14 Kernel performances: GaussKernel3(), . . . , GaussKernel7(). 330
10.15 Kernel performances: GaussKernel3(), . . . , GaussKernel8(). 331
10.16 Kernel performances for GaussKernel6(), GaussKernel7(), and GaussKernel8() for Box IV, under varying threads/block choices. 334

11.1 Runtime for edge detection and horizontal flip for astronaut.bmp (in ms). 346
11.2 Execution timeline for the second team in Analogy 11.1. 347
11.3 Streaming performance results (in ms) for imGStr, on the astronaut.bmp
image. 371

13.1 Comparable terms for CUDA and OpenCL. 399


13.2 Runtimes for imflip, in ms. 405
13.3 Runtimes for imedge, in ms. 411

15.1 Common activation functions used in neurons to introduce a non-linear component to the final output. 427
Preface

I am from the days when computer engineers and scientists had to write assembly language
on IBM mainframes to develop high-performance programs. Programs were written on
punch cards and compilation was a one-day process; you dropped off your punch-code
written program and picked up the results the next day. If there was an error, you did
it again. In those days, a good programmer had to understand the underlying machine
hardware to produce good code. I get a little nervous when I see computer science students
being taught only at a high abstraction level and languages like Ruby. Although abstraction
is a beautiful thing to develop things without getting bogged down with unnecessary details,
it is a bad thing when you are trying to develop super high performance code.
Since the introduction of the first CPU, computer architects added incredible features
into CPU hardware to “forgive” bad programming skills; while you had to order the sequence
of machine code instructions by hand two decades ago, CPUs do that in hardware for you
today (e.g., out of order processing). A similar trend is clearly visible in the GPU world.
Most of the techniques that were taught as performance improvement techniques in GPU
programming five years ago (e.g., thread divergence, shared memory bank conflicts, and
reduced usage of atomics) are becoming less relevant with the improved GPU architectures
because GPU architects are adding hardware features that are improving these previous
inefficiencies so much that it won’t even matter if a programmer is sloppy about it within
another 5–10 years. However, this is just a guess. What GPU architects can do depends on
their (i) transistor budget, as well as (ii) their customers’ demands. When I say transistor
budget, I am referring to how many transistors the GPU manufacturers can cram into an
Integrated Circuit (IC), aka a “chip.” When I say customer demands, I mean that even if
they can implement a feature, the applications that their customers are using might not
benefit from it, which will mean a wasted transistor budget.
From the standpoint of writing a book, I took all of these facts to heart and decided
that the best way to teach GPU programming is to show the differences among different
families of GPUs (e.g., Fermi, Kepler, Maxwell, and Pascal) and point out the trend, which
lets the reader be prepared about the upcoming advances in the next generation GPUs,
and the next, and the next . . . I put a lot of emphasis on concepts that will stay relevant
for a long period of time, rather than concepts that are platform-specific. That being said,
GPU programming is all about performance and you can get a lot higher performance if
you know the exact architecture you are running on, i.e., if you write platform-dependent
code. So, providing platform-dependent explanations is as valuable as generalized GPU
concepts. I engineered this book in such a way that the later the chapter, the more
platform-specific it gets.
I believe that the most unique feature of this book is the fact that it starts explaining
parallelism by using CPU multi-threading in Part I. GPU massive parallelism (which differs
from CPU parallelism) is introduced in Part II. Due to the way the CPU parallelism is
explained in Part I, there is a smooth transition into understanding GPU parallelism in
Part II. I devised this methodology within the past six years of teaching GPU programming;
I realized that the concept of massive parallelism was not clear to students who had never
taken a parallel programming class. The concept of “parallelizing a task” is a lot easier to
understand on a CPU architecture than on a GPU.
The book is organized as follows. In Part I (Chapters 1 through 5), a few simple programs
are used to demonstrate the concept of dividing a large task into multiple parallel sub-tasks
and map them to CPU threads. Multiple ways of parallelizing the same task are analyzed
and their pros/cons are studied in terms of both core and memory operation. In Part II
(Chapters 6 through 11) of the book, the same programs are parallelized on multiple Nvidia
GPU platforms (Fermi, Kepler, Maxwell, and Pascal) and the same performance analysis
is repeated. Because the core and memory structures of CPUs and GPUs are different, the
results differ in interesting—and sometimes counterintuitive—ways; these differences are
pointed out and detailed discussions are provided as to what would make a GPU program
work faster. The end goal is to make the programmer aware of all of the good ideas—as
well as the bad ideas—so he or she can apply the good ideas and avoid the bad ideas to his
or her programs.
Although Part I and Part II totally cover what is needed to write successful CUDA
programs, there is always more to know. In Part III of the book, pointers are provided
for readers who want to expand their horizons. Part III is not meant to be a complete
reference to the topics that are covered in it; rather, it is meant to provide the initial
introduction, from which the reader can build a momentum toward understanding the entire
topic. Included topics in this part are an introduction to some of the popular CUDA libraries,
such as cuBLAS, cuFFT, Nvidia Performance Primitives, and Thrust (Chapter 12), an
introduction to the OpenCL programming language (Chapter 13), and an overview of GPU
programming using other programming languages and API libraries such as Python, Metal,
Swift, OpenGL, OpenGL ES, OpenCV, and Microsoft HLSL (Chapter 14), and finally the
deep learning library cuDNN (Chapter 15).
To download the code go to: https://www.crcpress.com/GPU-Parallel-Program-Development-Using-CUDA/Soyata/p/book/9781498750752.

Tolga Soyata
About the Author

Tolga Soyata received his BS degree from Istanbul Technical


University, Department of Electronics and Communications
Engineering in 1988. He came to the United States to pursue
his graduate studies in 1990; he received his MS degree from
Johns Hopkins University, Department of Electrical and Com-
puter Engineering (ECE), Baltimore, MD in 1992 and PhD
degree from University of Rochester, Department of ECE in
2000. Between 2000 and 2015, he owned an IT outsourcing and
copier sales/service company. While operating his company,
he came back to academia, joining University of Rochester
(UR) ECE as a research scientist. Later he became an Assistant Professor - Research and continued serving as a research
faculty member at UR ECE until 2016. During his tenure at
UR ECE, he supervised three PhD students. Two of them re-
ceived their PhD degrees under his supervision and one stayed at UR ECE when he joined
State University of New York - Albany (SUNY Albany) as an Associate Professor of ECE
in 2016. Soyata’s teaching portfolio includes VLSI, circuits, and parallel programming using
FPGA and GPUs. His research interests are in the field of cyber physical systems, digital
health, and high-performance medical mobile-cloud computing systems.
His entry into teaching GPU programming dates back to 2009, when he contacted Nvidia
to certify University of Rochester (UR) as a CUDA Teaching Center (CTC). Upon Nvidia’s
certification of UR as a CTC, he became the primary contact (PI). Later Nvidia also
certified UR as a CUDA Research Center (CRC), with him as the PI. He served as the
PI for these programs at UR until he left to join SUNY Albany in 2016. These programs
were later named GPU Education Center and GPU Research Center by Nvidia. While at
UR, he taught GPU programming and advanced GPU project development for five years,
which were cross-listed between the ECE and CS departments. He has been teaching similar
courses at SUNY Albany since he joined the department in 2016. This book is a product
of the experiences he gained in the past seven years, while teaching GPU courses at two
different institutions.

PART I
Understanding CPU Parallelism

CHAPTER 1

Introduction to CPU Parallel Programming

This book is a self-sufficient GPU and CUDA programming textbook. I can imagine the
surprise of somebody who purchased a GPU programming book and found that the first
chapter is named “Introduction to CPU Parallel Programming.” The idea is that this book
expects its readers to be proficient in a low-level programming language, like C, but not
in CPU parallel programming. To make this book a self-sufficient GPU programming
resource for somebody who meets this criterion, no prior CPU parallel programming
experience can be expected from the readers; yet it is not difficult to gain sufficient CPU
parallel programming skills within a few weeks with an introduction such as Part I of this book.
No worries, in these few weeks of learning CPU parallel programming, no time will be
wasted toward our eventual goal of learning GPU programming, since almost every concept
that I introduce here in the CPU world will be applicable to the GPU world. If you are
skeptical, here is one example for you: The thread ID, or, as we will call it tid, is the identifier
of an executing thread in a multi-threaded program, whether it is a CPU or GPU thread.
All of the CPU parallel programs we write will use the tid concept, which will make the
programs directly transportable to the GPU environment. Don’t worry if the term thread
is not familiar to you; half of this book is about threads, as they are the backbone of how
CPUs and GPUs execute multiple tasks simultaneously.
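
To make the tid idea concrete, here is a minimal CPU-side sketch (my own illustration, not one of the book's code listings such as imflip.c): four Pthreads are launched, each receives its own tid, and each uses that tid to decide which elements of an array it is responsible for. In CUDA, the very same role is played by a tid computed from built-in variables such as threadIdx.x, blockIdx.x, and blockDim.x, which we will meet in Part II.

// Minimal sketch (not from the book): each of 4 CPU threads receives a tid
// and uses it to claim its own share of the work, in the same spirit as the
// multithreaded programs developed in Chapter 2.
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4
#define N 16

int data[N];

void *worker(void *arg)
{
    long tid = (long)arg;                            // this thread's ID
    for (int i = (int)tid; i < N; i += NUM_THREADS)  // tid selects an interleaved slice
        data[i] = i * i;
    printf("thread %ld finished\n", tid);
    return NULL;
}

int main(void)
{
    pthread_t th[NUM_THREADS];
    for (long t = 0; t < NUM_THREADS; t++)
        pthread_create(&th[t], NULL, worker, (void *)t);  // pass the tid as the argument
    for (long t = 0; t < NUM_THREADS; t++)
        pthread_join(th[t], NULL);
    // In a CUDA kernel, the equivalent identifier would be computed as
    //   tid = blockIdx.x * blockDim.x + threadIdx.x;
    return 0;
}

On a Unix system this would be compiled with something like gcc tid_sketch.c -o tid_sketch -lpthread (the file name is only a placeholder).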

1.1 EVOLUTION OF PARALLEL PROGRAMMING


A natural question that comes to one’s mind is: why even bother with parallel programming?
In the 1970s, 1980s, even part of the 1990s, we were perfectly happy with single-threaded
programming, or, as one might call it, serial programming. You wrote a program to accom-
plish one task. When done, it gave you an answer. Task is done ... Everybody was happy
... Although the task was done, if you were, say, doing a particle simulation that required
millions, or billions of computations per second, or any other image processing computation
that works on thousands of pixels, you wanted your program to work much faster, which
meant that you needed a faster CPU.
Up until the year 2004, the CPU makers IBM, Intel, and AMD gave you a faster processor
by making it work at a higher speed, 16 MHz, 20 MHz, 66 MHz, 100 MHz, and eventually
200, 333, 466 MHz ... It looked like they could keep increasing the CPU speeds and provide
higher performance every year. But, in 2004, it was obvious that continuously increasing
the CPU speeds couldn’t go on forever due to technological limitations. Something else was
needed to continuously deliver higher performances. The answer of the CPU makers was
to put two CPUs inside one CPU, even if each CPU worked at a lower speed than a single
one would. For example, two CPUs (cores, as they called them) working at 200 MHz could



FIGURE 1.1 Harvesting each coconut requires two consecutive 30-second tasks
(threads). Thread 1: get a coconut. Thread 2: crack (process) that coconut using
the hammer.

do more computations per second cumulatively, as compared to a single core working at
300 MHz (i.e., 2 × 200 > 300, intuitively).
Even if the story of “multiple cores within a single CPU” sounded like a dream come true,
it meant that the programmers would now have to learn the parallel programming methods
to take advantage of both of these cores. If a CPU could execute two programs at the
same time, this automatically implied that a programmer had to write those two programs.
But, could this translate to twice the program speed? If not, then our 2 × 200 > 300
thinking is flawed. What if there wasn’t enough work for one core, so that only a single
core was truly busy, while the other one was doing nothing? Then, we are better off with a
single core at 300 MHz. Numerous similar questions highlighted the biggest problem with
introducing multiple cores, which is the programming that can allow utilizing those cores
efficiently.

1.2 MORE CORES, MORE PARALLELISM


Programmers couldn’t simply ignore the additional cores that the CPU makers introduced
every year. By 2015, INTEL had an 8-core desktop processor, i7-5960X [11], and 10-core
workstation processors such as Xeon E7-8870 [14] in the market. Obviously, this multiple-
core frenzy continued and will continue in the foreseeable future. Parallel programming
turned from an exotic programming model in early 2000 to the only acceptable programming
model as of 2015. The story doesn’t stop at desktop computers either. On the mobile
processor side, iPhones and Android phones all have two or four cores. Expect to see an
ever-increasing number of cores in the mobile arena in the coming years.
So, what is a thread? To answer this, let’s take a look at the 8-core INTEL CPU
i7-5960X [11] one more time. The INTEL archive says that this is indeed an 8C/16T CPU.
In other words, it has 8 cores, but can execute 16 threads. You also hear parallel program-
ming being incorrectly referred to as multi-core programming. The correct terminology
is multi-threaded programming. This is because when the CPU makers started designing
multi-core architectures, they quickly realized that it wasn’t difficult to add the capability
to execute two tasks within one core by sharing some of the core resources, such as cache
memory.

ANALOGY 1.1: Cores versus Threads.


Figure 1.1 shows two brothers, Fred and Jim, who are farmers that own two tractors.
They drive from their farmhouse to where the coconut trees are every day. They
harvest the coconuts and bring them back to their farmhouse. To harvest (process)
the coconuts, they use the hammer inside their tractor. The harvesting process requires
two separate consecutive tasks, each taking 30 seconds: Task 1 go from the tractor to
the tree, bringing one coconut at a time, and Task 2 crack (process) them by using
the hammer, and store them in the tractor. Fred alone can process one coconut per
minute, and Jim can also process one coconut per minute. Combined, they can process
two coconuts per minute.
One day, Fred’s tractor breaks down. He leaves the tractor with the repair shop,
forgetting that the coconut cracker is inside his tractor. It is too late by the time he
gets to the farmhouse. But, they still have work to do. With only Jim’s tractor, and
a single coconut cracker inside it, can they still process two coconuts per minute?

1.3 CORES VERSUS THREADS


Let’s look at our Analogy 1.1, which is depicted in Figure 1.1. If harvesting a coconut
requires the completion of two consecutive tasks (we will call them threads): Thread 1
picking a coconut from the tree and bringing it back to the tractor in 30 seconds, and
Thread 2 cracking (i.e., processing) that coconut using the hammer inside the tractor within
30 seconds, then each coconut can be harvested in 60 seconds (one coconut per minute). If
Jim and Fred each have their own tractors, they can simply harvest twice as many coconuts
(two coconuts per minute), since during the harvesting of each coconut, they can share the
road from the tractor to the tree, and they have their own hammer.
In this analogy, each tractor is a core, and harvesting one coconut is the program
execution using one data element. Coconuts are data elements, and each person (Jim,
Fred) is an executing thread, using the coconut cracker. The coconut cracker is the execution
unit, like the ALU within the core. This program consists of two dependent threads: You
cannot execute Thread 2 before you execute Thread 1. The number of coconuts harvested
is equivalent to the program performance. The higher the performance, the more money
Jim and Fred make selling coconuts. The coconut tree is the memory, from which you get
data elements (coconuts), so the process of getting a coconut during Thread 1 is analogous
to reading data elements from the memory.

1.3.1 More Threads or More Cores to Parallelize?


Now, let’s see what happens if Fred’s tractor breaks down. They used to be able to harvest
two coconuts per minute, but, now, they only have one tractor and only one coconut cracker.
They drive to the trees and park the tractor. To harvest each coconut, they have to execute
Thread 1 (Th1) and Th2 consecutively. They both get out of the tractor and walk to the
trees in 30 seconds, thereby completing Th1. They bring back the coconut they picked, and
now, they have to crack their coconut. However, they cannot execute Th2 simultaneously,
since there is only one coconut cracker. Fred has to wait for Jim to crack the coconut,
and he cracks his after Fred is done using the cracker. This takes 30+30 more seconds,
and they finish harvesting two coconuts in 90 seconds total. Although not as good as two

FIGURE 1.2 Simultaneously executing Thread 1 (“1”) and Thread 2 (“2”) in scenarios
(a) through (e). Accessing shared resources will cause a thread to wait (“-”).

coconuts per minute, they still have a performance improvement from 1 to 1.5 coconuts per
minute.
After harvesting a few coconuts, Jim asks himself the question: “Why do I have to wait
for Fred to crack the coconut? When he is cracking the coconut, I can immediately walk
to the tree, and get the next coconut. Since Th1 and Th2 take exactly the same amount
of time, we never have to be in a situation where they are waiting for the cracker to be
free. Exactly when Fred is back from picking the next coconut, I will be done cracking
my coconut and we will both be 100% busy.” This genius idea brings them back to the
2 coconuts/minute speed without even needing an extra tractor. The big deal was that
Jim re-engineered the program, which is the sequence of the threads to execute, so the
threads are never caught in a situation where they are waiting for the shared resources inside
the core, like the cracker inside the tractor. As we will see very shortly, a shared resource
inside a core is an ALU, FPU, cache memory, and more ... For now, don’t worry about
these.
The two scenarios I described in this analogy are having two cores (2C), each executing a
single thread (1T) versus having a single core (1C) that is capable of executing two threads
(2T). In the CPU world, they are called 2C/2T versus 1C/2T. In other words, there are two
ways to give a program the capability to execute two simultaneous threads: 2C/2T (2 cores,
which are capable of executing a single thread each—just like two separate tractors for
Jim and Fred) or 1C/2T (a single core, capable of executing two threads—just like a single
tractor shared by Jim and Fred). Although, from the programmer’s standpoint, both of them
mean the ability to execute two threads, they are very different options from the hardware
standpoint, and they require the programmer to be highly aware of the implications of the
threads that share resources. Otherwise, the performance advantages of the extra threads
could vanish. Just to remind again: our almighty INTEL i7-5960X [11] CPU is an 8C/16T,
which has eight cores, each capable of executing two threads.
Three options are shown in Figure 1.2: (a) is the 2C/2T option with two separate cores.
(b) is the 1C/2T option with bad programming, yielding only 1.5 coconuts per minute, and
(c) is the sequence-corrected version, where the access to the cracker is never simultaneous,
yielding 2 coconuts per minute.

1.3.2 Influence of Core Resource Sharing


Being so proud of his discovery which brought their speed back to 2 coconuts per minute,
Jim wants to continue inventing ways to use a single tractor to do more work. One day,
he goes to Fred and says “I bought this new automatic coconut cracker which cracks a
coconut in 10 seconds.” Extremely happy with this discovery, they hit the road and park
the tractor next to the trees. This time they know that they have to do some planning,
before harvesting ...
Fred asks: “If our Th1 takes 30 seconds, and Th2 takes 10 seconds, and the only task for
which we are sharing resources is Th2 (cracker), how should we harvest the coconuts? ” The
answer is very clear to them: The only thing that matters is the sequence of execution of the
threads (i.e., the design of the program), so that they are never caught in a situation where
they are executing Th2 together and needing the only cracker they have (i.e., shared core
resources). To rephrase, their program consists of two dependent threads: Th1 is 30 seconds,
and does not require shared (memory) resources, since two people can walk to the trees
simultaneously, and Th2 is 10 seconds and cannot be executed simultaneously, since it
requires the shared (core) resource: the cracker. Since each coconut requires a 30+10=40
seconds of total execution time, the best they can hope for is 40 seconds to harvest two
coconuts, shown in Figure 1.2d. This would happen if everybody executed Th1 and Th2
sequentially, without waiting for any shared resource. So, their speed will be an average of
3 coconuts per minute (i.e., average 20 seconds per coconut).

1.3.3 Influence of Memory Resource Sharing


After harvesting 3 coconuts per minute using the new cracker, Jim and Fred come back
the next day and see something terrible. The road from the tractor to the tree could only
be used by one person today, since a heavy rain last night blocked half of the road. So,
they plan again ... Now, they have two threads that each require resources that cannot be
shared. Th1 (30 seconds — denoted as 30 s) can only be executed by one person, and Th2
(10 s) can only be executed by one person. Now what?
After contemplating multiple options, they realize that the limiting factor in their speed
is Th1; the best they can hope for is harvesting each coconut in 30 s. Back when Th1 could be
executed by both of them simultaneously (shared memory access), each person could execute
10+30 s sequentially and both of them could continue without ever waiting for a shared resource. But now,
there is no way to sequence the threads to do so. The best they can hope for is to execute
10+30 s and wait for 20 s during which both need to access the memory. Their speed is back
to 2 coconuts per minute on average, as depicted in Figure 1.2e.
This heavy rain reduced their speed back to 2 coconuts per minute. Th2 no longer
matters, since somebody could easily crack a coconut while the other is on the road to
pick a coconut. Fred comes up with the idea that they should bring the second (slower)
hammer from the farmhouse to help. However, this would absolutely not help anything in
this case, since the limiting factor to harvesting is Th1. This concept of a limiting factor
by a resource is called resource contention. This example shows what happens when our
access to memory is the limiting factor for our program speed. It simply doesn’t matter
how fast we can process the data (i.e., core execution speed). We will be limited by the
speed at which we can get the data. Even if Fred had a cracker that could crack a coconut
in 1 second, they would still be limited to 2 coconuts per minute if there is resource
contention during memory access. In this book, we will start making a distinction between
two different programs: ones that are core intensive, which do not depend so much on the

FIGURE 1.3 Serial (single-threaded) program imflip.c flips a 640 × 480 dog picture
(left) horizontally (middle) or vertically (right).

memory access speed, and ones that are memory intensive, which are highly sensitive to
the memory access speed, as I have just shown.

1.4 OUR FIRST SERIAL PROGRAM


Now that we understand parallel programming in the coconut world, it is time to apply this
knowledge to real computer programming. I will start by introducing our first serial (i.e.,
single-threaded) program, and we will parallelize it next. Our first serial program imflip.c
takes the dog picture in Figure 1.3 (left) and flips it horizontally (middle) or vertically
(right). For simplicity in explaining the program, we will use Bitmap (BMP) images and
will write the result in BMP format too. This is an extremely easy to understand image
format and will allow us to focus on the program itself. Do not worry about the details in
this chapter. They will become clear soon. For now, just focus on high-level functionality.
The imflip.c source file can be compiled and executed from a Unix prompt as follows:
gcc imflip.c ImageStuff.c -o imflip
./imflip dogL.bmp dogh.bmp V
“H” is specified at the command line to flip the image horizontally (Figure 1.3 middle),
while “V” specifies a vertical flip (Figure 1.3 right). You will get an output that looks like
this (numbers might be different, based on the speed of your computer):
Input BMP File name : dogL.bmp (3200×2400)
Output BMP File name : dogh.bmp (3200×2400)
Total execution time : 81.0233 ms (10.550 ns per pixel)
The CPU that this program is running on is so fast that I had to artificially expand the
original 640×480 image dog.bmp to the 3200×2400 dogL.bmp, so that it could run for an amount of
time that can be measured; dogL.bmp is 5× bigger in each dimension, thereby making it 25×
bigger than dog.bmp. To time the program, we have to record the clock at the beginning
of image flipping and at the end.

1.4.1 Understanding Data Transfer Speeds


It is very important to understand that the process of reading the image from the disk
(whether it is an SSD or hard drive) should be excluded from the execution time. In other
words, we read the image from the disk, and make sure that it is in memory (in our array),

and then measure only the amount of time that we spend flipping the image. Due to the
drastic differences in the data transfer speeds of different hardware components, we need to
analyze the amount of time spent in the disk, memory, and CPU separately.
In many of the parallel programs we will develop in this book, our focus is CPU time
and memory access time, because we can influence them; disk access time (which we will
call I/O time) typically saturates even with a single thread, thereby seeing almost no benefit
from multithreaded programming. Also, make a mental note that this slow I/O speed will
haunt us when we start GPU programming; since I/O is the slowest part of a computer
and the data from the CPU to the GPU is transferred through the PCI express bus, which
is a part of the I/O subsystem, we will have a challenge in feeding data to the GPU fast
enough. But, nobody said that GPU programming was easy! To give you an idea about the
magnitudes of transfer speeds of different hardware components, let me now itemize them:
• A typical network interface card (NIC) has a transfer speed of 1 Gbps (Giga-bits-per-
second or billion-bits-per-second). These cards are called “Gigabit network cards” or
“Gig NICs” colloquially. Note that 1 Gbps is only the amount of “raw data,” which
includes a significant amount of error correction coding and other synchronization
signals. The amount of actual data that is transferred is less than half of that. Since
my goal is to give the reader a rough idea for comparison, this detail is not that
important for us.
• A typical hard disk drive (HDD) can barely reach transfer speeds of 1–2 Gbps, even if
connected to a SATA3 cable that has a peak 6 Gbps transfer speed. The mechanical
read-write nature of HDDs simply does not allow them to access the data that fast.
The transfer speed isn’t even the worst problem with an HDD, but the seek time is;
it takes the mechanical head of an HDD some time to locate the data on the spinning
metal cylinder, therefore forcing it to wait until the rotating head reaches the position
where the data resides. This could take milli-seconds (ms) if the data is distributed
in an irregular fashion (i.e., fragmented). Therefore, HDDs could have transfer speeds
that are far less than the peak speed of the SATA3 cable that they are connected to.
• A flash disk that is hooked up to a USB 2.0 port has a peak transfer speed of 480 Mbps
(Mega-bits-per-second or million-bits-per-second). However, the USB 3.0 standard has
a faster 5 Gbps transfer speed. The newer USB 3.1 can reach around 10 Gbps transfer
rates. Since flash disks are built using flash memory, there is no seek time, as they are
directly accessible by simply providing the data address.
• A typical solid state disk (SSD) can be read from a SATA3 cable at speeds close
to 4–5 Gbps. Therefore, an SSD is really the only device that can saturate a SATA3
cable, i.e., deliver data at its intended peak rate of 6 Gbps.
• Once the data is transferred from I/O (SDD, HDD, or flash disk) into the memory
of the CPU, transfer speeds are drastically higher. Core i7 family all the way up to
the sixth generation (i7-6xxx) and the higher-end Xeon CPUs use DDR2, DDR3,
and DDR4 memory technologies and have memory-to-CPU transfer speeds of 20–60
GBps (Giga-Bytes-per-second). Notice that this speed is Giga-Bytes; a Byte is 8 bits,
thereby translating to memory transfer speeds of 160–480 Gbps (Giga-bits-per-second)
just to compare readily to the other slower devices.
• As we will see in Part II and beyond, transfer speeds of GPU internal memory sub-
systems can reach 100–1000 GBps. The new Pascal series GPUs, for example, have
an internal memory transfer rate, which is close to the latter number. This translates

to 8000 Gbps, which is an order-of-magnitude faster than the CPU internal memory
and three orders-of-magnitude faster than a flash disk, and almost four orders-of-
magnitude faster than an HDD.
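To put these numbers in perspective, here is a quick back-of-the-envelope sketch; the code is
mine (not part of imflip.c), and the rates are rounded versions of the figures listed above. It
estimates how long our roughly 23 MB dogL.bmp would take to move over each of these links:

#include <stdio.h>

int main(void)
{
   double MBytes = 23.0;                         // approximate size of dogL.bmp
   double Gbits  = MBytes * 8.0 / 1000.0;        // ~0.184 Gigabits
   const char *link[] = { "Gigabit NIC", "HDD", "SATA3 peak", "DDR memory" };
   double      Gbps[] = { 1.0,            2.0,   6.0,          480.0 };
   for(int i=0; i<4; i++)
      printf("%-12s : %8.2f ms\n", link[i], 1000.0 * Gbits / Gbps[i]);
   return 0;
}

At 1 Gbps the transfer takes roughly 180 ms, i.e., more than twice the ≈82 ms we will spend
flipping the image, whereas at memory speeds it drops well below 1 ms; this is exactly why
we exclude I/O from our timing.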

CODE 1.1: imflip.c main() {...}


The main() function of the imflip.c reads three command line parameters to determine
the input and output BMP images and also the flip direction (horizontal or vertical).
Operation is repeated multiple times (REPS) to improve the accuracy of the timing.

#define REPS 129
...
int main(int argc, char** argv)
{
   double timer;  unsigned int a;  clock_t start, stop;

   if(argc != 4){
      printf("\n\nUsage: imflip [input][output][h/v]");
      printf("\n\nExample: imflip square.bmp square_h.bmp h\n\n");
      return 0;
   }
   unsigned char** data = ReadBMP(argv[1]);

   start = clock();                 // Start timing the code without the I/O part
   switch (argv[3][0]){
      case 'v' :
      case 'V' : for(a=0; a<REPS; a++) data = FlipImageV(data);   break;
      case 'h' :
      case 'H' : for(a=0; a<REPS; a++) data = FlipImageH(data);   break;
      default  : printf("\nINVALID OPTION\n");   return 0;
   }
   stop = clock();
   timer = 1000*((double)(stop-start))/(double)CLOCKS_PER_SEC/(double)REPS;

   // merge with header and write to file
   WriteBMP(data, argv[2]);

   // free() the allocated memory for the image
   for(int i = 0; i < ip.Vpixels; i++) { free(data[i]); }
   free(data);

   printf("\n\nTotal execution time: %9.4f ms", timer);
   printf(" (%7.3f ns/pixel)\n", 1000000*timer/(double)(ip.Hpixels*ip.Vpixels));
}

1.4.2 The main() Function in imflip.c


Our program, shown in Code 1.1, reads a few command line parameters and flips an input
image either vertically or horizontally, as specified by the command line. The command line
arguments are placed by C into the argv array.
The clock() function reports time in ms intervals; this crude reporting resolution is
improved by repeating the same operations an odd number of times (e.g., 129), specified in
the “#define REPS 129” line. This number can be changed based on your system.
The ReadBMP() function reads the source image from the disk and WriteBMP() writes
the processed (i.e., flipped) image back to the disk. The amount of time spent reading and
writing the image from/to the disk is defined as I/O time and we will exclude it from the
processing time. This is why I placed the “start = clock()” and “stop = clock()” lines

between the actual code that flips the image that is in memory and intentionally excluded
the I/O time.
Before reporting the elapsed time, the imflip.c program de-allocates all of the memory
that was allocated by ReadBMP() using a bunch of free() functions to avoid memory leaks.

CODE 1.2: imflip.c ... FlipImageV() {...}


To flip the rows vertically, each pixel is read and replaced with the corresponding
pixel of the mirroring row.

#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include "ImageStuff.h"

#define REPS 129

struct ImgProp ip;

unsigned char** FlipImageV(unsigned char** img)
{
   struct Pixel pix;                              // temp swap pixel
   int row, col;

   for(col=0; col<ip.Hbytes; col+=3){             // go through the columns
      row = 0;
      while(row<ip.Vpixels/2){                    // go through the rows
         pix.B = img[row][col];
         pix.G = img[row][col+1];
         pix.R = img[row][col+2];

         img[row][col]   = img[ip.Vpixels-(row+1)][col];
         img[row][col+1] = img[ip.Vpixels-(row+1)][col+1];
         img[row][col+2] = img[ip.Vpixels-(row+1)][col+2];

         img[ip.Vpixels-(row+1)][col]   = pix.B;
         img[ip.Vpixels-(row+1)][col+1] = pix.G;
         img[ip.Vpixels-(row+1)][col+2] = pix.R;

         row++;
      }
   }
   return img;
}

1.4.3 Flipping Rows Vertically: FlipImageV()


The FlipImageV() in Code 1.2 goes through every column and swaps each vertical pixel
with its mirroring vertical pixel for that column. The Bitmap (BMP) image functions are
placed in another file named ImageStuff.c and ImageStuff.h is the associated header file.
They will be explained in detail in the next chapter. Each pixel of the image is stored
as type “struct Pixel,” which contains the R, G, and B color components of that pixel
in unsigned char type; since unsigned char takes up one byte, each image pixel requires
3 bytes to store.

The ReadBMP() places the image width and height in two variables ip.Hpixels and
ip.Vpixels, respectively. The number of bytes that we need to store each row of the image
is placed in ip.Hbytes. The FlipImageV() function has two loops: the outer loop goes through
all ip.Hbytes bytes of each row (i.e., the columns), and for every column, the inner loop
swaps the corresponding vertically mirrored pixels one at a time.

CODE 1.3: imflip.c FlipImageH() {...}


To flip the columns horizontally, each pixel is read and replaced with the corresponding
pixel of the mirroring column.

unsigned char** FlipImageH(unsigned char** img)
{
   struct Pixel pix;                              // temp swap pixel
   int row, col;

   // horizontal flip
   for(row=0; row<ip.Vpixels; row++){             // go through the rows
      col = 0;
      while(col<(ip.Hpixels*3)/2){                // go through the columns
         pix.B = img[row][col];
         pix.G = img[row][col+1];
         pix.R = img[row][col+2];

         img[row][col]   = img[row][ip.Hpixels*3-(col+3)];
         img[row][col+1] = img[row][ip.Hpixels*3-(col+2)];
         img[row][col+2] = img[row][ip.Hpixels*3-(col+1)];

         img[row][ip.Hpixels*3-(col+3)] = pix.B;
         img[row][ip.Hpixels*3-(col+2)] = pix.G;
         img[row][ip.Hpixels*3-(col+1)] = pix.R;

         col+=3;
      }
   }
   return img;
}

1.4.4 Flipping Columns Horizontally: FlipImageH()


The FlipImageH() function of imflip.c flips the image horizontally, as shown in Code 1.3.
This function is almost identical to its vertical counterpart, except that the roles of the loops
are swapped: the outer loop goes through the rows and the inner loop goes through the
columns. Each swap uses the temporary pixel variable pix, which is of type “struct Pixel.”
Since each pixel is stored as 3 consecutive bytes (its blue, green, and red color components),
accessing consecutive pixels requires reading 3 bytes at a time. This will be detailed in the
next section. For now, all we need to know is that the following lines

pix.B = img[row][col];
pix.G = img[row][col+1];
pix.R = img[row][col+2];

are simply reading one pixel at vertical position row and horizontal byte position col; the blue
color component of the pixel is at address img[row][col], the green component is at address
img[row][col + 1], and the red component is at img[row][col + 2]. The pointer to the
beginning address of the image, img, was passed to the FlipImageH() function by main()
after the ReadBMP() allocated space for it, as we will see in the next chapter.
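Since FlipImageV() and FlipImageH() lean on a few declarations from ImageStuff.h that we
have not looked at yet, here is a rough sketch of what those declarations presumably contain,
inferred purely from how they are used above; the actual definitions appear in the next
chapter and may differ in detail:

// Inferred sketch only; see ImageStuff.h in the next chapter for the real definitions.
struct Pixel
{
   unsigned char B;         // blue  component (1 byte)
   unsigned char G;         // green component (1 byte)
   unsigned char R;         // red   component (1 byte)
};

struct ImgProp
{
   int Hpixels;             // image width in pixels
   int Vpixels;             // image height in pixels
   int Hbytes;              // bytes used to store one row of the image
};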

1.5 WRITING, COMPILING, RUNNING OUR PROGRAMS


In this section, we will learn how to develop our program in one of these platforms: Windows,
Mac, or a Unix box running one of the Unix clones like Fedora, CentOS, or Ubuntu. Readers
are expected to pick one of these platforms as their favorite and will then be able to follow
the rest of Part I using that platform, both for serial and parallel program development.

1.5.1 Choosing an Editor and a Compiler


To develop a program, you need to be able to write, compile, and execute that program.
I am using the plain-and-simple C programming language in this book instead of C++,
since this is all we need to demonstrate CPU parallelism, or GPU programming. I do not
want unnecessary distracting complexity to take us away from the CPU and GPU and
parallelization concepts.
To write a C program, the easiest thing to do is to use an editor such as Notepad++ [17].
It is a free download and it works in every platform. It also colors the keywords in the C
programming language. There are, of course, more sophisticated integrated development
environments (IDEs) such as Microsoft Visual Studio. But, we will favor simplicity in Part I.
The results in Part I will be displayed in a Unix command line. This will work even if you
have Windows 7, as you will see in a second. To compile a C program, we will be using the
g++ compiler in Part I which also works in every platform. I will be providing a Makefile
which will allow us to compile our programs with the appropriate command line arguments.
To execute the compiled binary code, you simply have to run it within that same platform
that you compiled it in.

1.5.2 Developing in Windows 7, 8, and Windows 10 Platforms


The freely available download that allows you to emulate Unix in Windows is Cygwin64 [5].
Simply put, Cygwin64 is “Unix in Windows.” Be careful to get the 64-bit version of Cygwin
(called Cygwin64), since it has the newest packages. Your PC must be capable of 64-bit
x86. You can skip this section if you have a Mac or a Unix box. If you have a Windows
PC (preferably Windows x Professional), then your best bet is to install Cygwin64, which
has a built-in g++ compiler. To install Cygwin64 [5], go to http://www.cygwin.com and
choose the 64-bit installation. This process takes hours if your Internet connection is slow.
So, I strongly suggest that you download everything into a temporary directory, and start
installation from that local directory. If you do a direct Internet install, you are highly likely
to get interrupted and start over. DO NOT install the 32-bit version of Cygwin, since it
is heavily outdated. None of the code in this book will work properly without Cygwin64.
Also, the newest GPU programs we are running will require 64-bit execution.
In Cygwin64, you will have two different types of shells: The first type is a plain simple
command line (text) shell, which is called “Cygwin64 Terminal.” The second type is an
“xterm” which means “X Windows terminal,” and is capable of displaying graphics. For
maximum compatibility across every type of computer, I will use the first kind: a strictly

text terminal, “Cygwin64 Terminal,” which is a Unix bash shell. Using a text-only shell
has a few implications:

1. Since every single program we are developing operates on an image, we need a way
to display our images outside the terminal. In Cygwin64, since each directory you are
browsing on the Cygwin64 terminal corresponds to an actual Windows directory, all
you have to do is to find that Windows directory and display the input and output
images using a simple program like mspaint or an Internet Explorer browser. Both
programs will allow you to resize the monster 3200×2400 image to any size you want
and display it comfortably.
2. Cygwin commands like ls, mkdir, and cd all operate on an actual Windows directory. The
cygwin64-Windows directory mapping is:
~/Cyg64dir ←→ C:\cygwin64\home\Tolga\Cyg64dir
where Tolga is my login, and, hence, my Cygwin64 root directory name. Every cyg-
win64 user’s home directory will be under the same C:\cygwin64\home directory. In
many cases, there will be only one user, which is your name.
3. We need to run Notepad++ outside the Cygwin64 terminal (i.e., from Windows) by
drag-and-dropping the C source files inside Notepad++ to edit them. Once edited,
we compile them in the Cygwin64 terminal, and display them outside the terminal.
4. There is another way to run Notepad++ and display the images in Cygwin64, without
going to Windows. Type the following command lines:

cygstart notepad++ imflip.c


gcc imflip.c ImageStuff.c -o imflip
./imflip dogL.bmp dogh.bmp V
cygstart dogh.bmp

Command line cygstart notepad++ imflip.c is as if you double-clicked the Notepad++


icon and ran it to edit your file named imflip.c. The second line will compile the imflip.c
program and the third line will run it and display the execution time, etc. The last line
will run the default Windows program to display the image. This cygstart command
in Cygwin64 is basically equivalent to “double click in Windows.” The result of this
last command line is as if you double-clicked on an image dogh.bmp in Windows,
which would tell Windows to open a photo viewer. You can change this default viewer
by changing “file associations” in Windows Explorer.

One thing might look mysterious to you: Why did I precede our program’s name with ./
and didn’t do the same thing for cygstart? Type this:
echo $PATH
and you will not have ./ in the current PATH environment variable after an initial Cygwin64
install. Therefore, Cygwin64 won’t know to search the current directory for any command
you type. If you already have the ./ in your PATH, you do not have to worry about this.
If you don’t, you can add it to your PATH within your .bash_profile file, and Cygwin64 will
start recognizing it. This file is in your home directory and the line to add is:
export PATH=$PATH:./

Since the cygstart command was in one of the paths that existed in your PATH environment
variable, you didn’t need to precede it with any directory name such as ./ which implies
current directory.

1.5.3 Developing in a Mac Platform


As we discussed in Section 1.5.2, all you need to know to execute the programs in Part I
and display the results is how to display a BMP image within the directory that the image
ends up in. This is true for a Mac computer too. A Mac has a built-in Unix terminal, or
a downloadable “iterm,” so it doesn’t need something like Cygwin64. In other words, a
Mac is a Unix computer at heart. The only minor discrepancies you might see from what
I am explaining will surface if you use an IDE like Xcode in Mac. If you use Notepad++,
everything should work exactly the same way I described. However, if you see yourself
developing a lot of parallel programs, Xcode is great and it is a free download when you
create a developer account on Apple.com. It is worth the effort. To display images, Mac has
its own programs, so just double-click on the BMP images to display them. Mac will also
have a corresponding directory for each terminal directory. So, find the directory where you
are developing the application, and double click on the BMP images from the desktop.

1.5.4 Developing in a Unix Platform


If you have a GUI-based Unix box running Ubuntu, Fedora, or CentOS, they will all have
a command line terminal. I am using the term “box” to mean a generic or a brand-name
computer with an INTEL or AMD CPU. A Unix box will have either an xterm, or a
strictly text-based terminal, like bash. Either one will work to compile and run the programs
described here. You can, then, figure out where the directory is where you are running
the program, and double-click on the BMP images to display them. Instead of drag-and-
dropping into a program, if you simply double-click on them, you are asking the OS to
run the default program to display them. You can change this program through system
settings.

1.6 CRASH COURSE ON UNIX


Almost half of my students needed a starter course on Unix every year in the past five
years of my GPU teaching. So, I will provide it in this section. If you are comfortable
with Unix, you can skip this section. Only the key concepts and commands will be pro-
vided here which should be sufficient for you to follow everything in this book. A more
thorough Unix education will come with practice and possibly a book dedicated strictly
to Unix.

1.6.1 Unix Directory-Related Commands


Unix directory structure starts with your home directory, which has a special tilde character
(~) to represent it. Any directory that you create is created somewhere under your “~/”
home directory. For example, if you create a directory called cuda in your home directory,
the Unix way to represent this directory is ~/cuda using the tilde notation. You should
organize your files to be in a neat order, and create directories under directories to make
them hierarchical. For example, the examples we have in this book could be placed under
a cuda directory and each chapter’s examples could be placed under sub-directories such
as ch1, ch2, ch3, ... They would have directory names ~/cuda/ch1 etc.

Some commonly used directory creation/removal commands in Unix are:

ls # show me the listing of the current (home) directory


mkdir cuda # make a directory named ’cuda’ right here
cd cuda # go into (change directory to) ’cuda’
ls # list the contents of this directory
mkdir ch1 # make a sub-directory named ’ch1’
mkdir ch2 # make another sub-directory named ’ch2’
mkdir ch33 # make a third. Oops, I mis-typed. I meant ’ch3’.
rmdir ch33 # Too late. Let’s remove the wrong directory ’ch33’
mkdir ch3 # Now, make the correct third sub-directory ’ch3’
mkdir ch4 # Oops, I meant to create a directory named ch5, not ch4
mv ch4 ch5 # move the directory to ch5 (effectively renames ch4)
ls -al # List detailed contents with the -al switch
ls .. # List contents of the above directory
pwd # print-working-directory. Where am I in the hierarchy ?
cd .. # Two special directories: . is ’here’ .. is ’one above’
pwd # Where am I again ? Did I go ’one above’ with cd .. ?
ls -al # detailed listing again. I should be at ’cuda’
cd # go to my home directory
rm -r dirname # removes a dir. and all of its subdirectories even if not empty
cat /proc/cpuinfo # get information about the CPU you have in your computer

The ls -al command enables you to see the sizes and permissions of a directory and the
files contained in it (i.e., a detailed listing) no matter which directory you are in. You will
also see two directories that Unix created for you automatically, with the special names
. (meaning this directory) and .. (meaning the directory one above), in relation to where
you are. So, for example, the command ./imflip ... is telling Unix to run imflip from this directory.
If you use the pwd command to find out where you are, you will get a directory that doesn’t
start with a tilde, but rather looks something like /home/Tolga/cuda. Why? Because pwd
reports where you are in relation to the Unix root, which is /, rather than relative to your
home directory /home/Tolga/ (the ~/ short notation). While the cd command will take you to
your home directory, the cd / command will take you to the Unix root, where you will see the
directory named home. You can drill down into home/Tolga with the command cd home/Tolga
and end up at your home directory, but clearly the short notation cd is much more convenient.
rmdir command removes a directory as long as it is empty. However, if it has files or
other directories in it (i.e., subdirectories), you will get an error indicating that the di-
rectory is not empty and cannot be removed. If you want to remove a directory that has
files in it, use the file deletion command rm with the switch “-r” that implies “recur-
sive.” What rm -r dirname means is: remove every file from the directory dirname along
with all of its subdirectories. There is possibly no need to emphasize how dangerous this
command is. Once you issue this command, the directory is gone, so are the entire con-
tents inside it, not to mention all of the subdirectories. So, use this command with extreme
caution.
mv command works for files and also directories. For example, mv dir1 dir2 “moves” a
directory dir1 into dir2. This effectively renames the directory dir1 as dir2 and the old
directory dir1 is gone. When you ls, you will only see the new directory dir2.

1.6.2 Unix File-Related Commands


Once the directories (aka folders) are created, you will need to create/delete files in them.
These files are your programs, the input files that your program needs, and the files that your

program generates. For example, to run the imflip.c serial program we saw in Section 1.4,
which flips a picture, you need the program itself, you need to compile it, and when this
program generates an output BMP picture, you need to be able to see that picture. You
also need to bring (copy) the picture into this directory. There is also a Makefile that I
created for you which helps the compilation. Here are some common Unix commands for
file manipulation:

clear # clear the screen


ls # let’s see the files. dogL.bmp is the dog picture
cat Makefile # see the contents of the text file named Makefile
more Makefile # great for displaying contents of a multi-page file
cat > mytest.c # the quickest way to create a file mytest.c. End with ^D
make imflip # run the entry in the Makefile to compile imflip.c
ls -al # let’s see the files sizes, etc. to investigate
ls -al imflip # Show me the details of the executable file imflip
cp imflip if1 # make a copy of this executable file, named if1
man cp # display a manual for the unix command ’’cp’’
imflip # without command line arguments, we get a warning
imflip dogL.bmp dogH.bmp h # run imflip with correct parameters
cat Makefile | grep imflip # look for the text "imflip" inside the Makefile
ls -al | grep imflip # pipe the listing to grep to search for "imflip"
ls imf* # list every file whose name starts with "imf"
rm imf*.exe # remove every file starting with imf and ending with .exe
diff f1 f2 # compare files f1 and f2. Display differences
touch imflip # set the last-access date of imflip to ’’now’’
rm imflip # or imflip.exe in Windows. This removes a file.
mv f1 f2 # move a file from an old name f1 to a new one f2.
mv f1 ../f2 # move file f1 to the above directory and rename as f2.
mv f1 ../ # move file f1 to the above directory. Keep the same name.
mv ../f1 f2 # move file f1 from the above dir. to here. rename as f2
history # show the history of my commands

• # (hash) is the comment symbol and anything after it is ignored.


• clear command erases the clutter on your terminal screen.
• cat Makefile displays the contents of Makefile from the command line, without having
to use another external program like Notepad++.
• more Makefile displays the contents of Makefile and also allows you to scroll through
the pages one by one. This is great for multipage files.
• cat > filename is the fastest way to create a text file named filename. This makes
Unix go into text-entry mode. Text-entry mode takes everything you are typing and
sends it to the file that comes after the > you type (e.g., mytest.c). End the text-
input mode by typing CTRL-D (the CTRL and D keys together, which is the EOT
character, ASCII code 4, indicating end of transmission). This method for text entry
is great if you do not even want to wait to use an editor like Notepad++. It is perfect
for programs that are just a few lines, although nothing prevents you from typing an
entire program using this method!
• | is the “pipe” command that channels (i.e., pipes) the output of one Unix command
or a program into another. This allows the user to run two separate commands using
only a single command line. The second command accepts the output of the first one
as its input. Piping can be done more than once, but it is uncommon.

• cat Makefile | grep imflip pipes the output of the cat command into another command
grep that looks and lists the lines containing the keyword imflip. grep is excellent for
searching some text strings inside text files. The output of any Unix command could
be piped into grep.
• ls -al | grep imflip pipes the output of the ls command into the grep imflip. This is
effectively looking for the string imflip in the directory listing. This is very useful in
determining file names that contain a certain string.
• make imflip finds the rule imflip : file1 file2 file3 ... inside the Makefile and remakes
imflip if any file in the list has been modified.
• cp imflip if1 copies the executable file imflip that you just created under a different
filename if1, so you do not lose it.
• man cp displays a help file for the cp command. This is great to display detailed
information about any Unix command.
• ls -al can be used to show the permissions and file sizes of source files and input/output
files. For example, it is perfect to check whether the sizes of the input and output BMP
files dogL.bmp and dogH.bmp are identical. If they are not, this is an early indication
of a bug!
• ls imf* lists every file whose name starts with “imf.” This is great for listing files that
you know have this “imf” prefix, like the ones we are creating in this book named
imflip, imflipP, ... The star (*) is a wildcard that means “anything.” Of course, you can
get fancier with the *, like ls imf*12, which means files starting with “imf” and ending
with “12.” Another example is ls imf*12*, which means files starting with “imf” and
having “12” in the middle of the file name.
• diff file1 file2 displays the differences between two text files. This is great to determine
if a file has changed. It can also be used for binary files.
• imflip or imflip dog... runs the program if . / is in your $PATH. Otherwise, you have
to use ./imflip dog...
• touch imflip updates the “last access time” of the file imflip.
• rm imflip deletes the imflip executable file.
• mv command, just like renaming directories, can also be used to rename files as well
as truly move them. mv file1 file2 renames file1 as file2 and keeps it in the same
directory. If you want to move files from one directory to another, precede the filename
with the directory name and it will move to that directory. You can also move a file
without renaming them. Most Unix commands allow such versatility. For example, the
cp command can be used exactly the way mv is used to copy files from one directory
to another.
• history lists the commands you used since you opened the terminal.

The Unix commands to compile our first serial program imflip.c and turn it into the
executable imflip (or, imflip.exe in Windows) will produce an output that looks something

like the listing below. Only the important commands that the user entered are shown on
the left and the Unix output is shown on a right-indentation:

ls
ImageStuff.c ImageStuff.h Makefile dogL.bmp imflip.c
cat Makefile
imflip : imflip.c ImageStuff.c ImageStuff.h
g++ imflip.c ImageStuff.c -o imflip
make imflip
ls
ImageStuff.c ImageStuff.h Makefile dogL.bmp imflip.c imflip
imflip
Usage : imflip [input][output][v/h]
imflip dogL.bmp dogH.bmp h
Input BMP File Name : dogL.bmp (3200x2400)
Output BMP File Name : dogH.bmp (3200x2400)

Total Execution time : 83.0775 ms (10.817 ns/pixel)


ls -al
...
-rwxr-x 1 Tolga 23020054 Jul 18 15:01 dogL.bmp
-rwxr-x 1 Tolga 23020054 Jul 18 15:08 dogH.bmp
...
rm imflip
history

In this listing, each file’s permissions are shown as -rwxr-x, etc. Your output might
be slightly different depending on the computer or organization you are running these
commands at. The Unix command chmod is used to change these permissions to make
them read-only, etc.
The Unix make tool allows us to automate routinely executed commands and makes it
easy to compile a file, among other tasks. In our case, “make imflip” asks Unix to look inside
the Makefile and execute the line “g++ imflip.c ImageStuff.c -o imflip,” which will invoke the
g++ compiler, compile the imflip.c and ImageStuff.c source files, and produce an
executable file named imflip. In our Makefile, the first line is showing file dependencies: It
instructs make to make the executable file imflip only if one of the listed source files, imflip.c,
ImageStuff.c, or ImageStuff.h have changed. To force a compile, you can use the touch Unix
command.
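As a concrete illustration, a Makefile along these lines might look like the sketch below; the
first rule mirrors the one we saw with cat Makefile, while the clean target is an extra
convenience of my own, not something the book’s Makefile necessarily contains:

imflip : imflip.c ImageStuff.c ImageStuff.h
	g++ imflip.c ImageStuff.c -o imflip

clean :
	rm -f imflip

Typing make imflip rebuilds the executable only when one of the three listed files is newer
than imflip; typing make clean deletes the executable so that the next make is forced to
recompile. Remember that the two indented recipe lines must start with a TAB character,
not spaces.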

1.7 DEBUGGING YOUR PROGRAMS


Debugging code is something that you will have to do eventually. At some point, you will
write code that is supposed to work but instead throws a segmentation fault or some other
random error that you have never seen before. This process can be extremely frustrat-
ing and oftentimes is a result of a simple typo or logic error that is nearly impossible to
find. Other bugs in your code may not even show up during runtime, so you won’t see
their effects at first glance. These are the worst kinds of bugs, because the compiler
does not find them and they are not obvious during runtime; memory leaks are a typical
example. A good practice during your code development is to
regularly run debugger tools like gdb and valgrind to potentially determine where your seg-
mentation fault happened. To run your codes in a debugger, you need to compile it with a

FIGURE 1.4 Running gdb to catch a segmentation fault.

debug flag, typically “-g”. This tells the compiler to include debug symbols, which includes
things like line numbers, to tell you where your code went wrong. An example is shown
below:
$ gcc imflip.c ImageStuff.c -o imflip -g

1.7.1 gdb
For the sake of showing what happens when you mess up your code, I’ve inserted a memory
free() into imflip.c before the code is done using the data. This will knowingly cause a
segmentation error in the code as shown below:
$ gcc imflip.c ImageStuff.c -o imflip -g
$ ./imflip dogL.bmp flipped.bmp V
Segmentation fault (core dumped)
Since imflip was compiled with debug symbols, gdb, the GNU debugger, can be run to try
to figure out where the segmentation fault is happening. The output from gdb is given in
Figure 1.4. gdb is first called by running
$ gdb ./imflip
Once gdb is running, the program arguments are set by the command:
set args dogL.bmp flipped.bmp V

TABLE 1.1 A list of common gdb commands and functionality.


Task                                      Command    Example Use
Start GDB                                 gdb        gdb ./imflip
Set the program arguments (once in gdb)   set args   set args input.bmp output.bmp H
Run the debugger                          run        run
List commands                             help       help
Add a breakpoint at a line                break      break 13
Break at a function                       break      break FlipImageV
Display where error happened              where      where

After this, the program is run using the simple run command. gdb then proceeds to spit out
a bunch of errors saying that your code is all messed up. The where command helps give a
little more information as to where the code went wrong. Initially, gdb thought the error was
in the WriteBMP() function within ImageStuff.c at line 73, but the where command narrows
it down to line 98 in imflip.c. Further inspection in imflip.c code reveals that a free(data)
command was called before writing data to a BMP image with the WriteBMP() function.
This is just a simple example, but gdb can also set breakpoints, watch specific variables,
and use a host of other options. A sample of common commands is listed in
Table 1.1.
Most integrated development environments (IDEs) have a built-in debugging module
that makes debugging quite easy. Typically, the back-end is still gdb or some proprietary
debugging engine. Regardless of whether you have an IDE to use, gdb is still available
from the command line and contains just as much functionality as your IDE, if not more
(depending on your IDE of choice).

1.7.2 Old School Debugging


This is possibly the correct term for the type of debugging that programmers used in
the old days —1940s, 1950s, 1960s, 1970s— and continue to use to date. I see no reason
why old school debugging should go away in the foreseeable future. After all, a “real”
debugger that we use to debug our code —such as gdb— is nothing more than an automated
implementation of old school debugging. I will provide a much more detailed description of
the old school debugging concepts within the context of GPU programming in Section 7.9.
All of what I will describe in Section 7.9 is applicable to the CPU world. So, you can either
hang tight (continue reading this chapter) or peek at Section 7.9 now.
In every debugger the idea is to insert breakpoints into the code, in an attempt to
print/display some sort of values related to the system status at that point. This status
could be variable values or the status of a peripheral, you name it. The execution can be
either halted at the breakpoint or continue, while printing multiple statuses along the way.
Lights: In the very early days when the Machine Code programmers were writing their
programs bit-by-bit by flicking the associated switches, the breakpoints were possibly a
single light to display a single bit’s value; today, FPGA programmers use 8 LEDs to display
the value of an 8-bit Verilog variable (Note: Verilog is a hardware description language).
However, this requires an incredibly experienced programmer to infer the system status
from just a few bits.
printf : In a regular C program, it is extremely common for the programmers to insert a
bunch of printf() commands to display the value of one or more variables at certain locations

of the code. These are nothing more than manual breakpoints. If you feel that the bug in
your code is fairly easy to discover, there is no reason to go through the lengthy gdb process,
as I described in Section 1.7.1. Stick a bunch of printf()’s inside the code and they will tell
you what is going on. A printf() can display a lot of information about multiple variables,
clearly being a much more powerful tool than the few lights.
assert: An assert statement does not do anything unless a condition — that you spec-
ified — is violated as opposed to printf(), which always prints something. For example, if
your code had the following lines:

ImgPtr=malloc(...);
assert(ImgPtr != NULL);

In this case, you are simply trying to make sure that the pointer that is given to us is not
NULL, which red flags a huge problem with memory allocation. While assert() would do
nothing under normal circumstances, it would issue an error like the one shown below if the
condition is violated:
Assertion violation: file mycode.c, line 36: ImgPtr != NULL
Comment lines: Surprisingly enough, there is something easier than sticking a bunch
of printf()’s in your code. Although C doesn’t care about “lines,” it is fairly common
for C programmers to write their code in a line-by-line fashion, much
like Python. This is why Python received some criticism for making the line-by-line style
the actual syntax of the language, rather than an option as in C. In the commenting-driven
debugging, you simply comment out a line that you are suspicious of, recompile, re-execute
to see if the problem went away, although the result is definitely no longer correct. This
is perfect in situations where you are getting core dumps, etc. In the example below, your
program will give you a Divide By 0 error if the user enters 0 for speed. You insert the
printf() there to give you an idea about where it might be crashing, but an assert() is much
better, because assert() would do nothing under normal circumstances avoiding the clutter
on your screen during debugging.

scanf("%d", &speed);
printf("DEBUG: user entered speed=%d\n",speed);
assert(speed != 0);
distance=100; time=distance/speed;

Comments are super practical, because you can insert them in the middle of the code in
case there are multiple C statements in your code as shown below:

scanf("%d", &speed);
distance=100; // time=distance/speed;

1.7.3 valgrind
Another tool to debug that can be extremely useful is a framework called valgrind. Once
the code has been compiled with debug symbols on, valgrind is simple to run. It has a host
of options to run with, similar to GDB, but the basic usage is quite easy. The output from
valgrind on the same imflip code with a memory error is shown in Figure 1.5. It catches
quite a few more errors and even locates the proper line of the error on line 96 in imflip.c
where the improper free() command is located.

FIGURE 1.5 Running valgrind to catch a memory access error.

valgrind also excels at finding memory errors that won’t show up during run time. Typ-
ically memory leaks are harder to find with simple print statements or a debugger like gdb.
For example, if imflip did not free any of the memory at the end, a memory leak would be
present, and valgrind would pick up on this. valgrind also has a module called cachegrind that
helps simulate how your code interacts with the CPU’s cache memory system. cachegrind is
called with the --tool=cachegrind command line option. Further options and documentation
can be found at http://valgrind.org.
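For reference, a few typical invocations are shown below; --leak-check=full and
--tool=cachegrind are standard valgrind options, and the file names are simply our running
example:

valgrind ./imflip dogL.bmp flipped.bmp V                     # basic memory error check
valgrind --leak-check=full ./imflip dogL.bmp flipped.bmp V   # also detail every leak
valgrind --tool=cachegrind ./imflip dogL.bmp flipped.bmp V   # simulate cache behavior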

1.8 PERFORMANCE OF OUR FIRST SERIAL PROGRAM


Let’s get a good grasp of the performance of our first serial program, imflip.c. Since many
random events are always executing inside your operating system (OS), it is a good idea
to run the same program many times to make sure that we are getting consistent results.
So, let’s run the imflip.c a few times through the command prompt. When we do that, we
get results like 81.022 ms, 82.7132 ms, 81.9845 ms, ... We can call it roughly 82 ms. This is
good enough. This corresponds to 10.724 ns/pixel, since there are 3200×2400 pixels in this
dogL.bmp expanded dog image.
To be able to measure the performance of the program a little bit more accurately, I put
a repetition for() loop in the code that executes the same code many times (e.g., 129) and
divides the execution time by that same amount, 129. This will allow us to take 129 times
longer than usual, thereby allowing an inaccurate UNIX system timer to be used to achieve
much more accurate timing information. Most machines cannot provide a hardware clock
with better than 10 ms accuracy, and sometimes it is even worse. If a program is taking only, say, 50 ms to execute,
you will get a highly inaccurate performance reading (with a 10-ms-accuracy clock) even
if you repeat the measurement many times as described above. However, if you repeat the
same program 129 times and measure the time at the very beginning and at the very end of
129 repetitions, and divide it by 129, your 10 ms becomes effectively a 10/129 ms accuracy,
which is sufficient for our purposes. Notice that this must be an odd number, otherwise, the
resulting dog picture won’t be flipped!
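Distilled down to its essence, the repeat-and-divide timing pattern used in imflip.c looks like
the sketch below (the variable names are mine; the real code is in Code 1.1):

#include <stdio.h>
#include <time.h>
#define REPS 129                          // odd, so the image ends up flipped

int main(void)
{
   clock_t start = clock();
   for(int r=0; r<REPS; r++){
      // ... the operation being timed goes here ...
   }
   clock_t stop = clock();
   double ms = 1000.0*(double)(stop-start)/(double)CLOCKS_PER_SEC/(double)REPS;
   printf("Average time per repetition: %9.4f ms\n", ms);
   return 0;
}

With a 10 ms clock granularity, the error in the averaged result is at most 10/129 ≈ 0.078 ms
per repetition, which is the effective accuracy mentioned above.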

1.8.1 Can We Estimate the Execution Time?


With this modification, the results in the 81–82 ms range, shown above, were obtained for the
imflip.c program that is horizontally flipping the dog picture. Since we are curious about
what happens when we run the same program on the smaller dog.bmp dog picture, which
is only a 901 KB bitmap image, let’s run exactly the same program, except on the original
640×480 dog.bmp file. We get a run time of 3.2636 ms which is 10.624 ns/pixel. In other
words, when the picture shrunk 25×, the run time almost perfectly reduced by that much.
One weird thing though: we get exactly the same execution time every time. Although this
shows that we were able to measure the execution time with incredible accuracy, please
hold off on celebrating this discovery, since we will encounter such complexities that it will
rock our world!
Indeed, the first weirdness has already shown its face. Can you answer this question: although
we are getting almost identical per-pixel execution times, why is the execution time not changing
at all for the smaller image? It is identical up to 4 decimal digits, whereas the execution time
for the bigger dog image varies within 1–2%. Although this might look like a statistical anomaly,
it is not! It has a very clear explanation. Just to prevent you from losing sleep, I will give you
the answer now rather than make you wait until the next chapter: when we processed the 22 MB
image, dogL.bmp, as opposed to the original 901 KB version, dog.bmp, what changed? The answer
is that during the processing of dogL.bmp, the CPU cannot keep the entire image in its last level
L3 cache memory (L3$), which is 8 MB. This means that, to access the image, it was continuously
emptying and refilling its L3$ during execution. Alternatively, when working on the 901 KB little
dog image dog.bmp, it takes only one pass of processing to completely soak the data into the L3$,
and the CPU owns that data during all 129 loops of execution. Note that I will be using the
notation L3$, which is pronounced "el-three-cash," to denote the L3 cache.
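A quick back-of-the-envelope check makes this cache argument concrete (24-bit BMP pixels occupy 3 bytes each):

dogL.bmp working set ≈ 3200 × 2400 × 3 B ≈ 22 MB  >  8 MB, so it cannot stay resident in the L3$
dog.bmp  working set ≈ 640 × 480 × 3 B ≈ 0.9 MB  <  8 MB, so it fits in the L3$ after one pass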

1.8.2 What Does the OS Do When Our Code Is Executing?


The reason for the higher variability in the execution time for the big dog image
is that there is a lot more uncertainty in accessing main memory than in accessing data inside a core.
Since imflip.c is a serial program, we really need a “1T” to execute our program. On the
other hand, our CPU has luxurious resources such as 4C/8T. This means that once we
start running our program which consists of an extremely active thread, the OS almost
instantly realizes that giving us one completely dedicated CPU thread (or even a core) is
in everybody’s best interest, so our application can fully utilize this resource. All said and
done, this is the OS’s job: to allocate resources intelligently. Whether it is Windows or a
Unix clone, all of today’s OS code is very smart in understanding these patterns in program
execution. If a program is passionately asking for a single thread, but nothing more, unless
you are running many other programs, the best course of action for the OS is to give you
a nearly VIP access to a single thread (even, possibly a complete core).
However, the story is completely different for the main memory, which is accessed by
every active thread in the OS. Think of it as 1M! There is no 2M! So, the OS must share
it among every thread. Memory is where all OS data is, where every thread's data is, and
where all of your coconuts (oops, sorry, I meant your image data elements) are. So, the OS
has to figure out not only how your imflip.c can access the image data in main memory,
but also how it can access it itself. Another important job of the OS is to ensure fairness.
If one thread is starved of data while another is feasting, the OS is not doing its job. It has
to be fair to everyone, including itself. When you have so many more moving parts in main
memory access, you can see why there is a lot more uncertainty in determining what the main
memory access time should be. Conversely, when processing the small image, we have an almost
completely dedicated core, and we are running the entire program inside the core itself,
without needing to go to main memory. And we are not sharing that core with anyone.
So, there is nearly zero uncertainty in determining the execution time. If these concepts are
slightly blurry, do not worry about them; there will be an entire chapter dedicated to the
CPU architecture. Here is the meaning of the C/T (cores/threads) notation:

• The C/T (cores/threads) notation denotes the following:
     e.g., 4C/8T means 4 cores, 8 threads;
     4C means that the processor has 4 cores;
     8T means that each core houses 2 threads.
• So, the 4C/8T processor can execute 8 threads simultaneously;
     however, each thread pair has to share internal core resources.

1.8.3 How Do We Parallelize It?


There is so much detail to be aware of even when running the serial version of our code that
instead of shoving the parallel version of the code into this tiny section, I would rather make
it an entire chapter of its own. Indeed, the next few chapters will be completely dedicated
to the parallel version of the code and a deep analysis of its performance implications. For
now, let’s get warmed up to the parallel version by thinking in terms of coconuts! Answer
these questions:
In Analogy 1.1, what would happen if there were two tractors with two farmers in each
one of the tractors? In this case, you have four threads executing ... Because there are two
— physically separate — tractors, everything is comfortable inside the tractors (i.e., cores).
However, now, instead of two, four people will be sharing the road from the tractors to
the coconuts (i.e., multiple threads accessing the main memory). Keep going ... What if 8
farmers went coconut harvesting? How about 16? In other words, even in the 8C/16T
scenario, where you have 8 cores and 16 threads, you have 8 tractors, so every farmer gets
a tractor seat and each pair of farmers shares a tractor (and its coconut crackers). But what
about main memory access? The more farmers you have harvesting, the more they will end
up waiting for that road to the coconuts to become available. In CPU terms, the memory
bandwidth will sooner or later saturate. Indeed, in the next chapter, I will show you a program
where this happens. This means that before we move on to parallelizing our program, we have
to do some thinking about which resources our threads will access during execution.

1.8.4 Thinking About the Resources


Even if you had the answers to the questions above, there is yet another question: would
the magic of parallelism work the same way in every possible resource scenario? In other
words, is the concept of parallelism separate from the resources it is applied to? To exemplify,
would the 2C/4T scenario always give us the same performance improvement regardless of
the memory bandwidth, or, if the memory bandwidth is really bad, would the added
performance gain from the extra core disappear? For now, just think about these questions;
the entire Part I of the book will be spent on answering them. So, don't stress over them
at this point.

Ok, this is good enough brain warm up ... Let’s write our first parallel program.
CHAPTER 2

Developing Our First


Parallel CPU Program

This chapter is dedicated to understanding our very first CPU parallel program, imflipP.c.
Notice the 'P' at the end of the file name that indicates parallel. For the CPU parallel
programs, the development platform makes no difference. In this chapter, I will slowly start
introducing the concepts that are the backbone of a parallel program, and these concepts
will be readily applicable to GPU programming when we start developing GPU programs
in Part II. As you might have noticed, I never say GPU Parallel Programming, but, rather,
GPU Programming. This is much like there is no reason to say a car with wheels; it suffices
to say a car. In other words, there is no GPU serial programming, which would mean using
one GPU thread out of the available 100,000s! So, GPU programming by definition implies
GPU parallel programming.

2.1 OUR FIRST PARALLEL PROGRAM


It is time to write our first parallel program imflipP.c, which is the parallel version of our
serial imflip.c program, introduced in Chapter 1.4. To parallelize imflip.c, we will simply
have the main() function create multiple threads and let them do a portion of the work and
exit. If, for example, in the simplest case, we are trying to run the two-threaded version of
our program, the main() will create two threads, let them each do half the work, join the
threads, and exit. In this scenario, main() is nothing but the organizer of events. It is not
doing actual heavy-lifting.
To do what we just described, main() needs to be able to create, terminate, and organize
threads and assign tasks to threads. The functions that allow it to perform such tasks are a
part of the Pthreads library. Pthreads only work in a POSIX-compliant operating system.
Ironically, Windows is not POSIX compliant! However, the Cygwin64 allows Pthreads code
to run in Windows by performing some sort of API-by-API translation between POSIX and
Windows. This is why everything we describe here will work in Windows, and hence the
reason for me to use Cygwin64 in case you have a Windows PC. Here are a few functions
that we will use from the Pthreads library:
1. pthread_create() allows you to create a thread.

2. pthread_join() allows you to join any given thread into the thread that originally created
it. Think of the "join" process as "uncreating" threads, or like the top thread
"swallowing" the thread it just created.
3. pthread_attr_init() allows you to initialize attributes for threads.


4. pthread_attr_setdetachstate() allows you to set attributes for the threads you just
initialized.
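Before dissecting imflipP.c itself, here is a minimal "hello" sketch, not the book's code, showing how these four calls fit together; Hello() and NUM_THREADS are placeholder names invented for this sketch:

#include <stdio.h>
#include <pthread.h>

#define NUM_THREADS 2                      /* placeholder thread count for this sketch */

void *Hello(void *arg)                     /* stand-in for the real thread function */
{
   long tid = *((int *) arg);              /* each thread receives its own ID */
   printf("Thread %ld reporting for duty\n", tid);
   pthread_exit(NULL);
}

int main(void)
{
   pthread_t      ThHandle[NUM_THREADS];
   pthread_attr_t ThAttr;
   int            ThParam[NUM_THREADS];
   int            i;

   pthread_attr_init(&ThAttr);                                      /* function 3 */
   pthread_attr_setdetachstate(&ThAttr, PTHREAD_CREATE_JOINABLE);   /* function 4 */
   for(i=0; i<NUM_THREADS; i++){
      ThParam[i] = i;
      if(pthread_create(&ThHandle[i], &ThAttr, Hello,
                        (void *)&ThParam[i]) != 0){                 /* function 1 */
         printf("Thread creation failed\n");
         return 1;
      }
   }
   pthread_attr_destroy(&ThAttr);
   for(i=0; i<NUM_THREADS; i++){ pthread_join(ThHandle[i], NULL); } /* function 2 */
   return 0;
}

On Linux, a Mac, or Cygwin64, this sketch would be compiled with something like gcc hello.c -o hello -pthread.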

2.1.1 The main() Function in imflipP.c


Our serial program imflip.c, shown in Code 1.1, read a few command line parameters and
flipped an input image either vertically or horizontally, as specified by the user’s command
line entry. The same flip operation was repeated an odd number of times (e.g., 129) to
improve the accuracy of the system time read by clock().
Code 2.1 and Code 2.2 show the same main() function in imflipP.c with one exception:
Code 2.1 shows main() {..., which means the “first part” of main(), which is further empha-
sized by the ... at the end of this listing. This part is for command line parsing and other
routine work. In Code 2.2, an opposite main() ...} notation is used along with the ... in the
beginning of the listing, indicating the “second part” of the main() function, which is for
launching threads and assigning tasks to threads.
To improve readability, I might repeat some of the code in both parts, such as the time-
stamping with gettimeofday() and the image reading with our own function ReadBMP()
that will be detailed very soon. This will allow the readers to clearly follow the beginning
and connection points within the two separate parts. As you might have noticed already,
whenever a function is entirely listed, the “func() {...}” notation is used. When a function
and some surrounding code is listed, “... func() {...}” notation will be used, denoting “some
common code ... followed by a complete listing of func().”
Here is the part of main() which parses command arguments, given in the argv[] array (a
total of argc of them). It issues errors if the user enters an unaccepted number of arguments.
It saves the flip direction that the user requested in a variable called Flip for further use.
The global variable NumThreads is also determined based on user input and is used later in
the functions that actually perform the flip.

int main(int argc, char** argv)


{
...
switch (argc){
case 3: NumThreads=1; Flip='V'; break;
case 4: NumThreads=1; Flip=toupper(argv[3][0]); break;
case 5: NumThreads=atoi(argv[4]); Flip=toupper(argv[3][0]); break;
default:printf("Usage: imflipP input output [v/h] [threads]");
printf("Example: imflipP infile.bmp out.bmp h 8\n\n");
return 0;
}
if((Flip != 'V') && (Flip != 'H')) {
printf("Invalid option '%c' ... Exiting...\n",Flip);
exit(EXIT_FAILURE);
}
if((NumThreads<1) || (NumThreads>MAXTHREADS)){
printf("Threads must be in [1..%u]... Exiting...\n",MAXTHREADS);
exit(EXIT_FAILURE);
}else{
...

2.1.2 Timing the Execution


When we have more than one thread executing, we want to be able to quantify the speed-up.
We used the clock() function in our serial code, which was included in the time.h header file
and was only millisecond-accurate.
The gettimeofday() function we will use in imflipP.c will get us down to µs-accuracy.
gettimeofday() requires us to #include the sys/time.h header file and provides the time
in two parts of a struct: one for seconds through the .tv_sec member and one for the
micro-seconds through the .tv_usec member. Both of these members are int types and are
combined to produce a double time value before being displayed.
An important note here is that the accuracy of the timing does not depend on the C
function itself, but, rather, the hardware. If your computer’s OS or hardware cannot provide
a µs-accurate timestamp, gettimeofday() will provide only as accurate of a result as it can
obtain from the OS (which itself gets its value from a hardware clock unit). For example,
Cygwin64 does not achieve µs-accuracy even with the gettimeofday() function due to its
reliance on the underlying Windows APIs.

#include <sys/time.h>
...
struct timeval t;
double StartTime, EndTime;
double TimeElapsed;
...
gettimeofday(&t, NULL);
StartTime = (double)t.tv_sec*1000000.0 + ((double)t.tv_usec);
// work is done here : thread creation, task/data assignment, join
...
gettimeofday(&t, NULL);
EndTime = (double)t.tv_sec*1000000.0 + ((double)t.tv_usec);
TimeElapsed=(EndTime-StartTime)/1000.00;
TimeElapsed/=(double)REPS;
...
printf("\n\nTotal execution time: %9.4f ms ...",TimeElapsed,...

2.1.3 Split Code Listing for main() in imflipP.c


I am intentionally staying away from providing one long listing for the main() function in
a single code fragment. This is because, as you can see from this first example, Code 2.1
and Code 2.2 provide listings for entirely different functionality: Code 2.1 provides a listing
for flat-out “boring” functionality for getting command line arguments, parsing them, and
warning the user. On the other hand, Code 2.2 is where the “cool action” of creating and
joining threads happens. Most of the time, I will arrange my code to allow such partitioning
and try to put a lot more emphasis on the important part of the code.

CODE 2.1: imflipP.c ... main() {...


The first part of the main() function of the imflipP.c reads and parses command line
options. Issues errors if needed. The BMP image is read into a memory array and the
timer is started. This part determines whether the multithreaded code will even run.

#define MAXTHREADS 128


...
int main(int argc, char** argv)
{
char Flip;
int a,i,ThErr;
struct timeval t;
double StartTime, EndTime;
double TimeElapsed;

switch (argc){
case 3: NumThreads=1; Flip='V'; break;
case 4: NumThreads=1; Flip=toupper(argv[3][0]); break;
case 5: NumThreads=atoi(argv[4]); Flip=toupper(argv[3][0]); break;
default:printf("Usage: imflipP input output [v/h] [threads]");
printf("Example: imflipP infile.bmp out.bmp h 8\n\n");
return 0;
}
if((Flip != 'V') && (Flip != 'H')) {
printf("Invalid option '%c' ... Exiting...\n",Flip);
exit(EXIT_FAILURE);
}
if((NumThreads<1) || (NumThreads>MAXTHREADS)){
printf("Threads must be in [1..%u]... Exiting...\n",MAXTHREADS);
exit(EXIT_FAILURE);
}else{
if(NumThreads != 1){
printf("\nExecuting %u threads...\n",NumThreads);
MTFlipFunc = (Flip=='V') ? MTFlipV:MTFlipH;
}else{
printf("\nExecuting the serial version ...\n");
FlipFunc = (Flip=='V') ? FlipImageV:FlipImageH;
}
}
TheImage = ReadBMP(argv[1]);

gettimeofday(&t, NULL);
StartTime = (double)t.tv_sec*1000000.0 + ((double)t.tv_usec);
...
}

CODE 2.2: imflipP.c ... main() ...}


The second part of the main() function in imflipP.c creates multiple threads and assigns
tasks to them. Each thread executes its assigned task and returns. When every thread
is done, main() joins (i.e., terminates) the threads and reports the elapsed time.

#define REPS 129


...
int main(int argc, char** argv)
{
...
gettimeofday(&t, NULL);
StartTime = (double)t.tv_sec*1000000.0 + ((double)t.tv_usec);
if(NumThreads >1){
pthread_attr_init(&ThAttr);
pthread_attr_setdetachstate(&ThAttr, PTHREAD_CREATE_JOINABLE);
for(a=0; a<REPS; a++){
for(i=0; i<NumThreads; i++){
ThParam[i] = i;
ThErr = pthread_create(&ThHandle[i], &ThAttr,
MTFlipFunc, (void *)&ThParam[i]);
if(ThErr != 0){
printf("Create Error %d. Exiting abruptly...\n",ThErr);
exit(EXIT_FAILURE);
}
}
pthread_attr_destroy(&ThAttr);
for(i=0; i<NumThreads; i++){ pthread_join(ThHandle[i], NULL); }
}
}else{
for(a=0; a<REPS; a++){ (*FlipFunc)(TheImage); }
}
gettimeofday(&t, NULL);
EndTime = (double)t.tv_sec*1000000.0 + ((double)t.tv_usec);
TimeElapsed=(EndTime-StartTime)/1000.00;
TimeElapsed/=(double)REPS;
// merge with header and write to file
WriteBMP(TheImage, argv[2]);
// free() the allocated memory for the image
for(i = 0; i < ip.Vpixels; i++) { free(TheImage[i]); }
free(TheImage);
printf("\n\nTotal execution time: %9.4f ms (%s flip)",TimeElapsed,
Flip=='V'?"Vertical":"Horizontal");
printf(" (%6.3f ns/pixel)\n",
1000000*TimeElapsed/(double)(ip.Hpixels*ip.Vpixels));
return (EXIT_SUCCESS);
}

2.1.4 Thread Initialization


Here is the part of the code that initializes the threads and runs the multithreaded code
multiple times. To initialize threads, we tell the OS, through the APIs pthread_attr_init() and
pthread_attr_setdetachstate(), that we are getting ready to launch a bunch of threads that will
be joined later ... The loop that repeats the same code 129 times simply “slows down” the
time! Instead of measuring how long it takes to execute once, if you execute 129 times and
divide the total time elapsed by 129, nothing changes, except you are a lot less susceptible
to the inaccuracy of the Unix time measurement APIs.

#include <pthread.h>
...
#define REPS 129
#define MAXTHREADS 128
...
long NumThreads; // Total # of parallel threads
int ThParam[MAXTHREADS]; // Thread parameters ...
pthread_t ThHandle[MAXTHREADS]; // Thread handles
pthread_attr_t ThAttr; // Pthread attributes
...
pthread_attr_init(&ThAttr);
pthread_attr_setdetachstate(&ThAttr, PTHREAD_CREATE_JOINABLE);
for(a=0; a<REPS; a++){
...
}

2.1.5 Thread Creation


Here is where the good stuff happens: look at the code below. Each thread is created
by using the API function pthread_create() and starts executing the moment it is created.
What is this thread going to do? That is what the third argument tells the thread to execute:
MTFlipFunc. It is as if we called a function named MTFlipFunc() which starts executing on
its own, i.e., in parallel with us. Our main() just created a child named MTFlipFunc() that
starts executing immediately, in parallel. The question is, if main() is creating 2, 4, or
8 of these threads, how does each thread know who (s)he is? That's the fourth argument,
which, after some pointer manipulation, boils down to ThParam[i].

for(i=0; i<NumThreads; i++){


ThParam[i] = i;
ThErr = pthread_create(&ThHandle[i], &ThAttr,
MTFlipFunc, (void *)&ThParam[i]);
if(ThErr != 0){
printf("Create Error %d. Exiting abruptly...\n",ThErr);
exit(EXIT_FAILURE);
}
}

FIGURE 2.1 Windows Task Manager, showing 1499 threads; however, there is 0% CPU utilization.

The OS cares about the first and second arguments: the second argument, &ThAttr, is
the same for all threads and contains the thread attributes. The first argument receives the
"handle" of each thread and is very important to the OS, which uses it to keep track of the
threads. If the OS cannot create a thread for any reason, pthread_create() returns a nonzero
error code (saved in ThErr), and this is our clue that we can no longer create a thread.
This is a show-stopper, so our program issues a runtime error and exits.
Here is the interesting question: If main() creates two threads, is our program a
dual-threaded program? As we will see shortly, when main() creates two threads using
pthread_create(), the best we can expect is a 2x program speed-up. What about the main()
itself? It turns out main() itself is most definitely a thread too. So, there are 3 threads
involved in a program where main() created two child threads. The reason we only expect
a 2x speed-up is the fact that, while main() is only doing trivial work, the other two threads
are doing heavy work.
To quantify this: the main() function creates threads, assigns tasks to them, and joins
them, which constitutes, say, 1% of the activity, while the other 99% of the activity is
caused by the other two threads doing the actual heavy work (about 49.5% each). That
being the case, the amount of time that the third thread takes, running the main() function,
is negligible. Figure 2.1 shows my PC’s Windows Task Manager, which indicates 1499 active
threads. However, the CPU load is negligible (almost 0%). These 1499 are the threads that
the Windows OS created to listen to network packets, keyboard strokes, other interrupts,
etc. If, for example, the OS realizes that a network packet has arrived, it wakes up the
responsible thread, which processes that packet in a very short period of time and then
goes back to sleep, although it remains active. Remember: the CPU is drastically faster
than the rate at which network packets arrive.

2.1.6 Thread Launch/Execution


Figure 2.1 shows that although the OS has 1499 threads that are very sleepy, the threads
that the main() function creates have a totally different personality: the moment main()
creates the two child threads, the thread count rises to 1501. However, these two new
guys are adrenaline-driven crazy guys! They run flat-out for 82 ms, during which the
Windows Task Manager shows two of the virtual CPUs peaking at 100%; after those 82 ms,
the two threads get swallowed by main() through pthread_join(). At that point, you are back down
to 1499 threads. The main() function itself doesn't die until it reaches the very last line,
where it reports the time and exits, at which point we are down to 1498 threads. So, if you
looked at the Windows Task Manager during the 129 repetitions of the code, where two
threads are on an adrenaline rush — from thread launch to thread join, you would see the
2 of the 8 CPUs peak at 100%. The CPU in my computer has 4 cores, 8 threads (4C/8T).
The Windows OS sees this CPU as “8 virtual CPUs,” hence the reason to see 8 “CPUs” in
the Task Manager. When you have a Mac or a Unix machine, the situation will be similar.
To summarize what happens when our 2-threaded code runs, remember the command line
you typed to run your code:
imflipP dogL.bmp dogH.bmp H 2
This command line instructs the OS to load and execute the binary code imflipP. Execution
involves creating the very first thread, assigning function main() to it, and passing arguments
argc and argv[] to that thread. This sounds very similar to what we did to create our child
threads.
When the OS finishes loading the executable binary imflipP, it passes the control to
main() as if it called a function main() and passed the variables argc and argv[] array to it.
The main() starts running ... Somewhere during the execution, main() asks the OS to create
two more threads ...

...pthread_create(&ThHandle[0], ...,MTFlipFunc, (void *)&ThParam[0]);


...pthread_create(&ThHandle[1], ...,MTFlipFunc, (void *)&ThParam[1]);

and the OS sets up the memory and stack areas and allocates 2 of its available virtual CPUs
to these two super-active threads. After successful creation, the threads must be launched;
pthread_create() also implies launching the thread it has just created. Launching a
thread effectively corresponds to calling the following functions:

(*MTFlipFunc)(ThParam[0]);
(*MTFlipFunc)(ThParam[1]);

which will turn into either the horizontal or the vertical flip function, determined at
runtime based on user input. If the user chose 'H' as the flip option, the launch is
effectively equivalent to this:

...
MTFlipFunc=MTFlipH;
...
(*MTFlipH)(ThParam[0]);
(*MTFlipH)(ThParam[1]);

2.1.7 Thread Termination (Join)


After launch, a tornado hits the CPU! The two threads use the two virtual CPUs super
actively, and they eventually return() one by one, which allows main() to execute
pthread_join() on each thread, one by one:

pthread_join(ThHandle[0], NULL);
pthread_join(ThHandle[1], NULL);

After the first pthread_join(), we are down to 1500 threads. The first child thread got swallowed
by main(). After the second pthread_join(), we are down to 1499 threads. The second
child thread got gobbled up too. This stops the tornado! And, a few ms later, main()
reports the time and exits. As we will see in Code 2.5, ImageStuff.c contains code to
dynamically allocate the memory area that stores the image read from the disk. The malloc()
function is used for dynamic (i.e., at run time) memory allocation. Before exiting main(),
all of this memory is deallocated using free(), as shown below.

...
// free() the allocated memory for the image
for(i = 0; i < ip.Vpixels; i++) { free(TheImage[i]); }
free(TheImage);
printf("\n\nTotal execution time: %9.4f ms (%s flip)",TimeElapsed,
Flip=='V'?"Vertical":"Horizontal");
printf(" (%6.3f ns/pixel)\n",
1000000*TimeElapsed/(double)(ip.Hpixels*ip.Vpixels));
return (EXIT_SUCCESS);
}

When main() exits, the parent OS thread swallows the child thread running main(). These
threads are an interesting life form: they are like some sort of bacteria that create and
swallow each other!

2.1.8 Thread Task and Data Splitting


Ok, the operation of these bacterial lifeforms, called threads, is pretty clear now. What about
data? The entire purpose of creating more than one thread was to execute things faster.
This, by definition, means that the more threads we create, the more task splitting we have
to do, and the more data splitting we have to do. To understand this, look at Analogy 2.1.

ANALOGY 2.1: Task and Data Splitting in Multithreading.


Cocotown is a major producer of coconuts, harvesting 1800 trees annually. Trees are
numbered from 0 to 1799. Every year, the entire town helps coconut harvesting. Due
to the increasing number of participants, the town devised the following strategy to
speed up harvesting:
Farmers who are willing to help show up at the harvesting place and get a one-page
instruction manual. This page has a top and a bottom part. The top part is the same
for every farmer and says: "crack the shell, peel the skin, ..., and do this for the
following trees:"
When 2 farmers show up, the bottom part says: Process only trees numbered
[0...899] for the first farmer, and [900...1799] for the second farmer. But if 5 farmers
showed up, their bottom part of the instructions would have the following numbers:
[0..359] for the first farmer, [360..719] for the second farmer, ..., and [1440..1799] for
the last farmer.

In Analogy 2.1, harvesting a portion of the coconut trees is the task of each thread
and it is the same for every farmer, regardless of how many farmers show up. The farmers
are threads that are executing. Each farmer must be given a unique ID to know which
part of the trees he or she must harvest. This unique ID is analogous to a Thread ID, or,
tid. The number of trees is 1800, which is all of the data elements to process. The most
interesting thing is that the task (i.e., the top part of the instructions) can be separated
from the data (i.e., the bottom part of the instructions). While the task is the same for
every farmer, the data is completely different. So, in a sense, there is only a single task,
applied to different data elements, determined by tid.
It is clear that the task can be completely predetermined at compile-time, which means,
during the preparation of the instructions. However, it looks like the data portion must
be determined at runtime, i.e., when everybody shows up and we know exactly how many
farmers we have. The key question is whether the data portion can also be determined at
compile-time. In other words, the mayor of the town can write only one set of instructions
and make, say, 60 photocopies (i.e., the maximum number of farmers that is ever expected)
and never have to prepare anything else when the farmers show up. If 2 farmers show up, the
mayor hands out 2 instructions and assigns tid = 0 and tid = 1 to them. If 5 farmers show
up, he hands out 5 instructions and assigns them tid = 0, tid = 1, tid = 2, tid = 3, tid = 4.
More generally, the only thing that must be determined at runtime is the tid assignment,
i.e., tid = 0 ... tid = N − 1. Everything else is determined at compile time, including the
parameterized task. Is this possible? It turns out, it is most definitely possible. In the end,
for N farmers, we clearly know what the data splitting will look like: Each farmer will get
1800/N coconut trees to harvest, and farmer number tid will have to harvest trees in the
range
[ (1800/N) × tid  ...  (1800/N) × (tid + 1) − 1 ]                                    (2.1)
To validate this, let us calculate the data split for tid = [0...4] (5 farmers). They end
up being [360 × tid ... 360 × tid + 359] for a given tid. Therefore, for 5 farmers, the data
split ends up being [0...359], [360...719], [720...1079], [1080...1439], and [1440...1799]. This
is precisely what we wanted. This means that for an N -threaded program, such as flipping
a picture horizontally, we really need to write a single function that will get assigned to a
launched thread at runtime. All we need to do is to let the launched thread know what its
tid is ... The thread, then, will be able to figure out exactly what portion of the data to
process at run time using an equation similar to Equation 2.1.
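A minimal sketch of this split, assuming (as in the analogy) that the 1800 trees divide evenly among the N farmers/threads, prints exactly the ranges computed above:

#include <stdio.h>

#define TREES 1800

int main(void)
{
   int N = 5;                                   /* number of farmers (threads) */
   int tid;
   for(tid=0; tid<N; tid++){
      int ts = (TREES / N) * tid;               /* first tree for this tid */
      int te = (TREES / N) * (tid + 1) - 1;     /* last tree for this tid  */
      printf("tid=%d harvests trees [%d...%d]\n", tid, ts, te);
   }
   return 0;
}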
One important note here is that none of the data elements have dependencies, i.e., they
can be processed independently and in parallel. Therefore, we expect that when we launch
N threads, the entire task (i.e., 1800 coconut trees) can be processed N times faster. In
other words, if 1800 coconut trees took 1800 hours to harvest, when 5 farmers show up, we
expect it to take 360 hours. As we will see shortly, this perfection proves difficult to
achieve. There is an inherent overhead in parallelizing a task, called the parallelization
overhead. Because of this overhead, the 5 farmers might take, say, 400 hours to complete the
job. The details will depend on the hardware and on the way we wrote the function for each
thread. We will be paying a lot of attention to this issue in the coming chapters.
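Using the numbers in this example, the effect of the parallelization overhead can be quantified as follows:

ideal speedup  = 1800 hours / 360 hours = 5.00×
actual speedup = 1800 hours / 400 hours = 4.50×   (roughly 10% of the ideal gain is lost to overhead)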

2.2 WORKING WITH BITMAP (BMP) FILES


Now that we understand how the task and data splitting must happen in a multithreaded
program, let us apply this knowledge to our first parallel program imflipP.c. Before we do
this, we need to understand the format of a bitmap (BMP) image file and how to read/write
these files.

2.2.1 BMP is a Non-Lossy/Uncompressed File Format


A Bitmap (BMP) file is an uncompressed image file. This means that by knowing the size
of the image, you can easily determine the size of the file that the image is stored in. For
example, a 24-bit-per-pixel BMP image occupies 3 bytes per pixel (i.e., one byte per R, G,
and B). This format also needs 54 additional bytes to store “header” information. I will
provide and explain the exact formula in Section 2.2.2, but, for now, let’s focus on the
concept of compression.

ANALOGY 2.2: Data Compression.


Cocotown record keeping division wanted to store the picture of the 1800 coconut
trees they harvested at the beginning and end of year 2015. The office clerk stored the
following information in a file named 1800trees.txt:
On January 1st of 2015, there were 1800 identical trees, arranged in a 40 wide
and 45 long rectangle. I took the picture of a single tree and saved it in a file named
OneTree.BMP. Take this picture, make a 40x45 tile out of it. I only noticed a single
tree that was different at location (30,35) and stored its picture in another picture file
DifferentTree.BMP. The other 1799 are identical.
On December 31, 2015, the trees looked different because they grew. I took the
picture of one tree and saved it in GrownTree.BMP. Although they grew, 1798 of them
were still identical on Dec. 31, 2015, while two of them were different. Make a 40x45
tile out of GrownTree.BMP and replace the two different trees, at locations (32,36) and
(32,38) by the files Grown3236.BMP and Grown3238.BMP.

If you look at Analogy 2.2, the clerk was able to get all of the information that is
necessary to draw an entire picture of the 40x45 tree farm by providing only a single
tree picture (OneTree.BMP) and the picture of the one tree that looked a little different from
the other 1799 (DifferentTree.BMP). Assume that each of these pictures occupies 1 KB of

storage. Including the text file that the clerk provided, this information fits in roughly 3 KB
on January 1, 2015. If we were to make a BMP file out of the entire 40x45 tree farm, we
would need 1 KB for all 1800 trees, occupying 1800 KB. Repetitious (i.e., redundant) data
allowed the clerk to substantially reduce the size of the file we need to deliver the same
information.
This concept is called data compression and can be applied to any kind of data that has
redundancies in it. This is why an uncompressed image format like BMP will be drastically
larger in size in comparison to a compressed image format like JPEG (or JPG) that com-
presses the information before it stores it; the techniques used in compression require the
knowledge of frequency domain analysis, however, the abstract idea is simple and is exactly
what is conceptualized in Analogy 2.2.
A BMP file stores “raw” image pixels without compressing them; because compression
is not performed, no additional processing is necessary before each pixel is stored in a BMP
file. This contrasts with a JPEG file, which first applies a frequency-domain transformation
like Cosine Transform. Another interesting artifact of a JPEG file is that only 90–99%
of the actual image information might be there; this concept of losing part of the image
information — though not noticeable to the eye — means that a JPEG file is a lossy image
storage format, whereas no information is lost in a BMP file because each pixel is stored
without any transformation. Considering that a 20 MB BMP file could be stored as a 1 MB
JPG file if we could tolerate a 1% loss in image data, this trade-off is perfectly acceptable
to almost any user. This is why almost every smartphone stores images in JPG format to
avoid quickly filling your storage space.

2.2.2 BMP Image File Format


Although BMP files support grayscale and various color depths (e.g., 8-bit or 24-bit), I will
only use the 24-bit RGB files in our programs. These files have a 54-byte header followed by
the RGB colors of each pixel. Unlike JPEG files, BMP files are not compressed, so each pixel
takes up 3 bytes and it is possible to determine the exact size of the BMP files according
to this formula:
Hbytes = (Hpixels × 3 + 3) ∧ (11...1100)₂
24-bit RGB BMP file size = 54 + Vpixels × Hbytes                                      (2.2)
where Vpixels and Hpixels are the height and width of the image (e.g., Vpixels = 480
and Hpixels = 640 for the 640×480 dog.bmp image file). Per Equation 2.2, dog.bmp
occupies 54 + 3 × 640 × 480 = 921,654 bytes, whereas the 3200×2400 dogL.bmp image has
a file size of 23,040,054 bytes (≈22 MB).
The conversion from Hpixels to Hbytes should be as straightforward as
Hbytes = 3 × Hpixels;
however, the result must be rounded up to the next integer that is divisible by 4, because
the BMP format requires each image row to occupy a multiple of 4 bytes. This is achieved
in the top line of Equation 2.2 by adding 3 and wiping out the two least significant bits of
the resulting number (i.e., by ANDing the last 2 bits with 00).
Here are some example BMP size computations:
• A 24-bit 1024×1024 BMP file would need 3,145,782 bytes of storage (54+1024×1024×3).
• A 24-bit 321×127 BMP would need 122,482 bytes (54 + (321×3+1)×127).
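As a quick sanity check of Equation 2.2, here is a small helper, not part of the book's ImageStuff.c; the function name BMPFileSize() is made up for this sketch. It reproduces the numbers above:

#include <stdio.h>

/* Evaluates Equation 2.2: pad each row to a multiple of 4 bytes, then add the 54-byte header */
unsigned long BMPFileSize(int Hpixels, int Vpixels)
{
   unsigned long Hbytes = ((unsigned long)(Hpixels*3 + 3)) & (~3UL);
   return 54 + (unsigned long)Vpixels * Hbytes;
}

int main(void)
{
   printf("%lu\n", BMPFileSize(640, 480));     /* 921654   (dog.bmp)  */
   printf("%lu\n", BMPFileSize(3200, 2400));   /* 23040054 (dogL.bmp) */
   printf("%lu\n", BMPFileSize(1024, 1024));   /* 3145782             */
   printf("%lu\n", BMPFileSize(321, 127));     /* 122482              */
   return 0;
}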

CODE 2.3: ImageStuff.h


The header file ImageStuff.h contains the descriptions for the two BMP image manip-
ulation files as well as struct definitions for our images.

struct ImgProp{
int Hpixels;
int Vpixels;
unsigned char HeaderInfo[54];
unsigned long int Hbytes;
};
struct Pixel{
unsigned char R;
unsigned char G;
unsigned char B;
};

unsigned char** ReadBMP(char* );


void WriteBMP(unsigned char** , char*);

extern struct ImgProp ip;

2.2.3 Header File ImageStuff.h


Since I will be using exactly the same image format throughout Part I, I put all of the
BMP image manipulation functions and the associated header file outside our actual code. The
ImageStuff.h header file, shown in Code 2.3, contains the headers of the functions and the
struct definitions related to images, and it needs to be included in all of our programs.
Instead of the ImageStuff.h file, you could use more professional-grade image helper packages
like ImageMagick. However, because ImageStuff.h is, in a sense, "open source," I strongly
suggest that readers understand this file before starting to use another package like
ImageMagick or OpenCV. This will allow you to get a good grasp of the low-level con-
cepts related to images. We will switch over to other easier-to-use packages in Part II of
the book.
In ImageStuff.h, a struct is defined for an image that contains the previously mentioned
Hpixels and Vpixels of that image. The header of the image that is being processed is
saved into the HeaderInfo[54] to be restored when writing back the image after the flip
operation, etc. Hbytes contains the number of bytes each row occupies in memory, rounded
up to the nearest integer that is divisible by 4. For example, if a BMP image has 640
horizontal pixels, Hbytes= 3×640 = 1920. However, for an image that has, say, 201 horizontal
pixels, Hbytes= 3 ×201 = 603 → 604. So, each row will take up 604 bytes and there will be
one wasted byte on each row.
The ImageStuff.h file also contains the headers for the BMP image read and write func-
tions ReadBMP() and WriteBMP(), that are provided in the ImageStuff.c file. The actual C
variable ip contains the properties of our primary image, which is the dog picture in many
of our examples. Since this variable is defined inside the actual program, imflipP.c in this
chapter, it must be declared within ImageStuff.h as an extern struct, so that the ReadBMP()
and WriteBMP() functions can reference it without a problem.

CODE 2.4: ImageStuff.c WriteBMP() {...}


WriteBMP() writes a BMP image back to disk after the processing. Variable img is a
pointer to the image data to be written.

void WriteBMP(unsigned char** img, char* filename)


{
FILE* f = fopen(filename, "wb");
if(f == NULL){
printf("\n\nFILE CREATION ERROR: %s\n\n",filename);
exit(1);
}
unsigned long int x,y;
char temp;
//write header
for(x=0; x<54; x++) { fputc(ip.HeaderInfo[x],f); }
//write image data one byte at a time
for(x=0; x<ip.Vpixels; x++){
for(y=0; y<ip.Hbytes; y++){
temp=img[x][y];
fputc(temp,f);
}
}
printf("\n Output BMP File name: %20s (%u x %u)",filename,ip.Hpixels,ip.Vpixels);
fclose(f);
}

2.2.4 Image Manipulation Routines in ImageStuff.c


The file ImageStuff.c contains two functions, responsible for reading and writing BMP files.
Encapsulating these two functions and the surrounding variable definitions, etc. into the
ImageStuff.c and ImageStuff.h files allows us to worry about the details of these files only once
here for the entire Part I of this book. Whether we are developing CPU or GPU programs,
we will read the BMP images using these functions. Even when developing GPU programs,
the image will be first read into the CPU area and later transferred into the GPU memory
as will be detailed in Part II of the book.
Code 2.4 shows the WriteBMP() function that writes the processed BMP file back to disk.
This function takes a pointer named img to the image data being written, as well as the
output file name that is contained in a string named filename. The
header of the output BMP file is 54 bytes and was saved during the BMP read operation.
The ReadBMP() function, shown in Code 2.5, allocates memory for the image one row at
a time, using the malloc() function. Each pixel is treated as a struct of three bytes, containing
the RGB values of that pixel, during processing. However, when reading an image from
disk, we can simply read each row Hbytes at a time and write it into a consecutive array of
Hbytes unsigned char values and not worry about individual pixels.

CODE 2.5: ImageStuff.c ... ReadBMP() {...}


ReadBMP() reads a BMP image and allocates memory for it. Required image param-
eters, such as Hpixels and Vpixels, are extracted from the BMP file and written into
the struct. Hbytes is computed using Equation 2.2.

#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include "ImageStuff.h"

unsigned char** ReadBMP(char* filename)


{
int i;
FILE* f = fopen(filename, "rb");
if(f == NULL){ printf("\n\n%s NOT FOUND\n\n",filename); exit(1); }

unsigned char HeaderInfo[54];


fread(HeaderInfo, sizeof(unsigned char), 54, f); // read 54b header
// extract image height and width from header
int width = *(int*)&HeaderInfo[18];
int height = *(int*)&HeaderInfo[22];
//copy header for re-use
for(i=0; i<54; i++){ ip.HeaderInfo[i]=HeaderInfo[i]; }

ip.Vpixels = height;
ip.Hpixels = width;
int RowBytes = (width*3 + 3) & (~3);
ip.Hbytes = RowBytes;
printf("\n Input BMP File name: %20s (%u x %u)",filename,ip.Hpixels,ip.Vpixels);
unsigned char tmp;
unsigned char **TheImage = (unsigned char **)malloc(height *
sizeof(unsigned char*));
for(i=0; i<height; i++) {
TheImage[i] = (unsigned char *)malloc(RowBytes * sizeof(unsigned char));
}
for(i = 0; i < height; i++) {
fread(TheImage[i], sizeof(unsigned char), RowBytes, f);
}
fclose(f);
return TheImage; // remember to free() it in caller!
}

ReadBMP() extracts Hpixels and Vpixels values from the BMP header, i.e., the first 54
bytes of the BMP file, and calculates Hbytes from Equation 2.2. It dynamically allocates
sufficient memory for the image using the malloc() function, which will be released using
the free() function at the end of main(). The image is read from a user-specified file name
that is passed onto ReadBMP() within the string filename. This BMP file header is saved in
HeaderInfo[] to use when we need to write the processed file back to the disk.
Both the ReadBMP() and WriteBMP() functions use the C library function fopen() with
either the "rb" or "wb" option, which means read or write a binary file, respectively. If the
file cannot be opened by the OS, the return value of fopen() is NULL and an error is issued to the user.

This would happen due to a wrong file name or an existing lock on the file. fopen() allocates
a file handle and a read/write buffer area for the new file and returns it to the caller.
Depending on the fopen() parameters, it also places a lock on the file to prevent multiple
programs from corrupting the file due to a simultaneous access. Each byte is read/writ-
ten from/to the file one byte at a time (i.e., the C variable type unsigned char) by using
this buffer. Function fclose() de-allocates this buffer and removes the lock (if any) from
the file.

2.3 TASK EXECUTION BY THREADS


Now that we know how to time our CPU code and how to read/write BMP images in
detail, let us flip an image using multiple threads. The responsibilities of each party in a
multithreaded program are as follows:

• main() is responsible for creating the threads and assigning a unique tid to each one
at runtime (e.g., ThParam[i] shown below).
• main() invokes a function for each thread (function pointer MTFlipFunc).
• main() must also pass other necessary values to the thread, if any (also passed in
ThParam[i] below).

for(i=0; i<NumThreads; i++){


ThParam[i] = i;
ThErr = pthread_create(&ThHandle[i], &ThAttr, MTFlipFunc,
(void *)&ThParam[i]);

• main() is also responsible for letting the OS know what type of thread it is creating (i.e.,
thread attributes, passed in &ThAttr). In the end, main() is nothing but another thread
itself, and it is speaking on behalf of the other child threads it will create in a moment.
• The OS is responsible for deciding whether a thread can be created. Threads are
nothing but resources that must be managed by the OS. If a thread can be created,
the OS is responsible for assigning a handle to that thread (ThHandle[i]). If not,
pthread_create() returns a nonzero error code (ThErr).
• If the OS cannot create a thread, main() is also responsible for either exiting or taking
some other action.

if(ThErr != 0){
printf("\nThread Creation Error %d. Exiting abruptly...\n",ThErr);
exit(EXIT_FAILURE);
}

• The responsibility of each thread is to receive its tid and perform its task MTFlipFunc()
only on the data portion that it is required to process. We will spend multiple pages
on this.
• The final responsibility of main() is to wait for the threads to be done, and join them.
This instructs the OS to de-allocate the thread resources.

pthread_attr_destroy(&ThAttr);
for(i=0; i<NumThreads; i++){
pthread_join(ThHandle[i], NULL);
}

2.3.1 Launching a Thread


Let us look at the way function pointers are used to launch threads. The pthread_create()
function expects a function pointer as its third parameter, which is MTFlipFunc. Where
did this pointer come from? To determine this, let us list all of the lines in imflipP.c
that participate in "computing" the variable MTFlipFunc. They are listed in Code 2.6. Our
aim is to give sufficient flexibility to main(), so we can launch a thread with any function
we want. Code 2.6 lists four different functions:

void FlipImageV(unsigned char** img)


void FlipImageH(unsigned char** img)
void *MTFlipV(void* tid)
void *MTFlipH(void* tid)

The first two functions are exactly what we had in Chapter 1.1. These are the serial
functions that flip an image in the vertical or horizontal direction. We just introduced their
multithreaded versions (the last two above), which will do exactly what the serial versions did,
only faster, by using multiple threads (hopefully)! Note that the multithreaded versions
will need the tid as we described before, whereas the serial versions don’t ...
Now, our goal is to understand how we pass the function pointer and data to each
launched thread. The serial versions of the function are slightly modified to eliminate the
return value (i.e., void), so they are consistent with the multithreaded versions that also do
not return a value. All four of these functions simply modify the image that is pointed to
by the pointer TheImage. It turns out, we do not really have to pass the function pointer to
the thread. Instead, we have to call the function that is pointed to by the function pointer.
This process is called thread launch.
The way we pass the data and launch a thread differs based on whether we are launching
the serial or multithreaded version of the function. I designed imflipP.c to be able to run
the older serial versions of the code as well as the new multithreaded versions, based on the
user command-line parameters. Since the input variables of the two families of functions
are slightly different, it was easier to define two separate function pointers, FlipFunc and
MTFlipFunc, that were responsible for launching the serial and multithreaded version of the
functions. I maintained two function pointers shown below:

void (*FlipFunc)(unsigned char** img); // Serial flip function ptr


void* (*MTFlipFunc)(void *arg); // Multi-threaded flip func ptr

Let us clarify the difference between creating and launching a thread, both of which are
implied in pthread_create(). Creating a thread involves a request/grant mechanism between
the parent thread main() and the OS. If the OS says No, nothing else can happen. So, it
is the OS that actually creates the thread and sets up a memory area, a handle, a virtual
CPU, and a stack area for it and gives a nonzero thread handle to the parent thread, thereby
granting the parent permission to launch (aka run) another thread in parallel.

CODE 2.6: imflipP.c Thread Function Pointers


Code that determines the function pointer that is passed onto the launched thread
as a parameter. This is how the thread will know what to execute.

...
void (*FlipFunc)(unsigned char** img); // Serial flip function ptr
void* (*MTFlipFunc)(void *arg); // Multi-threaded flip func ptr
...
void FlipImageV(unsigned char** img)
{
...

void FlipImageH(unsigned char** img)


{
...

void *MTFlipV(void* tid)


{
...

void *MTFlipH(void* tid)


{
...

int main(int argc, char** argv)


{
char Flip;
...
... if(NumThreads != 1){ // multi-threaded version
printf("\nExecuting the multi-threaded version..."...);
MTFlipFunc = (Flip=='V') ? MTFlipV:MTFlipH;
}else{ // serial version
printf("\nExecuting the serial version ...\n");
FlipFunc = (Flip=='V') ? FlipImageV:FlipImageH;
...
if(NumThreads >1){ // multi-threaded version
...
for(i=0; i<NumThreads; i++){
ThParam[i] = i;
ThErr=pthread_create( , , MTFlipFunc,(void *)&ThParam[i]);
...
}else{ // if running the serial version
...
(*FlipFunc)(TheImage);
...

Notice that, although the parent now has the license to launch, nothing is happening yet. Launch-
ing a thread is effectively a parallel function call. In other words, main() knows that another
child thread is running after the launch, and can communicate with it if it needs to.

The main() function may never communicate (e.g., pass data back and forth) with its
child thread(s), as exemplified in Code 2.2, since it doesn’t need to. Child threads modify
the required memory areas and return. The tasks assigned to child threads, in this spe-
cific case, leave only a single responsibility to main(): Wait until the child threads are
done and terminate (join) them. Therefore, the only thing that main() cares about is that it
has the handle of the new thread and can determine when that thread has finished
execution (i.e., returned) by using pthread_join(). So, effectively, pthread_join(x) means:
wait until the thread with handle x is done. When that thread is done, well, it
means that it executed a return and finished its job. There is no reason to keep it
around.
When the thread (with handle x) joins main(), the OS gets rid of all of the memory,
virtual CPU, and stack areas it allocated to that thread, and this thread disappears. How-
ever, main() is still alive and well ... until it reaches the last code line and returns (last line
in Code 2.2) ... When main() executes a return, the OS de-allocates all of the resources it
allocated for main() (i.e., the imflipP program). The program has just completed its execu-
tion. You then get a prompt back in Unix, waiting for your next Unix command, since the
execution of imflipP has just been completed.

2.3.2 Multithreaded Vertical Flip: MTFlipV()


Now that we know how a multithreaded program should work, it is time to look at the
task that each thread executes in the multithreaded version of the program. This is like our
Analogy 2.1, where one farmer had to harvest all 1800 trees (from 0 to 1799) if he or she
was alone, whereas if two farmers came to harvest, they could split it into two segments
[0...899] and [900...1799] and the split assigns a shrinking range of coconut tree ranges (data
ranges) as the number of farmers increases. The magic formula was our Equation 2.1, which
specified these split ranges based on only a single parameter called tid. Therefore, assigning
the same task to every single thread (i.e., writing a function that each thread will execute
at runtime) and assigning a unique tid to each thread at runtime proved to be perfectly
sufficient to write a multithreaded program.
If we remember the serial version of our vertical flip code, displayed in Code 1.2, it
went through each column one by one, and it swapped each column’s pixel with its vertical
mirror. As an example, in our 640×480 image named dog.bmp, row 0 (the very first row)
had horizontal pixels [0][0...639] and its vertical mirror row 479 (the last row) had pixels
[479][0...639]. So, to vertically flip the image, our serial function FlipImageV() had to swap
each row’s pixels one by one as follows. The ←→ symbol denotes a swap.

Row [0]: [0][0]←→[479][0] , [0][1]←→[479][1] ... [0][639]←→[479][639]


Row [1]: [1][0]←→[478][0] , [1][1]←→[478][1] ... [1][639]←→[478][639]
Row [2]: [2][0]←→[477][0] , [2][1]←→[477][1] ... [2][639]←→[477][639]
Row [3]: [3][0]←→[476][0] , [3][1]←→[476][1] ... [3][639]←→[476][639]
... ... ... ... ...
Row [239]: [239][0]←→[240][0] , [239][1]←→[240][1] ... [239][639]←→[240][639]

Refreshing our memory with Code 1.2, the FlipImageV() function that swaps the pixels looked
something like this. Note: the return type is modified to void to be consistent with the
multithreaded versions of the same program. Otherwise, the rest of the code below looks
exactly like Code 1.2.

void FlipImageV(unsigned char** img)


{
struct Pixel pix; //temp swap pixel
int row, col;

//vertical flip
for(col=0; col<ip.Hbytes; col+=3){
row = 0;
while(row<ip.Vpixels/2){
pix.B = img[row][col];
...
row++;
}
}
}

The question now is: how to modify this FlipImageV() function to allow multithreading? The
multithreaded version of the function, MTFlipV(), will receive one parameter named tid as
we emphasized before. The image it will work on is a global variable TheImage, so it doesn’t
need to be passed as an additional input. Since our friend pthread_create() expects us to
give it a function pointer, we will define MTFlipV() as follows:

void *MTFlipV(void* tid)


{
...
}

During the course of this book, we will be encountering other types of functions that are
not so amenable to being parallelized. A function that doesn’t easily parallelize is commonly
referred to as a function that doesn't thread well. There should be no question in any
reader's mind at this point that, if a function doesn't thread well, it is not expected to be
GPU-friendly. Here, in this section, I am also making the point that such a function is
possibly not CPU-multithreading-friendly either.
So, what do we do when a task is “born to be serial”? You clearly do not run this task
on a GPU. You keep it on the CPU ... keep it serial ... run it fast. Most modern CPUs,
such as the i7-5960x [11] I mentioned in Section 1.1, have a feature called Turbo Boost that
allows the CPU to achieve very high performance on a single thread when running a serial
(single-threaded) code. They achieve this by clocking one of the cores at, say, 4 GHz, while
other cores are at, say, 3 GHz, thereby significantly boosting the performance of single-
threaded code. This allows the CPU to achieve a good performance for both modern and
old-fashioned serial code ...

CODE 2.7: imflipP.c ... MTFlipV() {...}


Multithreaded version of the FlipImageV() in Code 1.2 that expects a tid to be pro-
vided. The only difference between this and the serial version is the portion of the
data area it processes, rather than the entire data.

...
long NumThreads; // Total # threads working in parallel
unsigned char** TheImage; // This is the main image
struct ImgProp ip;
...
void *MTFlipV(void* tid)
{
struct Pixel pix; //temp swap pixel
int row, col;

long ts = *((int *) tid); // My thread ID is stored here


ts *= ip.Hbytes/NumThreads; // start index
long te = ts+ip.Hbytes/NumThreads-1; // end index

for(col=ts; col<=te; col+=3)


{
row=0;
while(row<ip.Vpixels/2)
{
pix.B = TheImage[row][col];
pix.G = TheImage[row][col+1];
pix.R = TheImage[row][col+2];

TheImage[row][col] = TheImage[ip.Vpixels-(row+1)][col];
TheImage[row][col+1] = TheImage[ip.Vpixels-(row+1)][col+1];
TheImage[row][col+2] = TheImage[ip.Vpixels-(row+1)][col+2];

TheImage[ip.Vpixels-(row+1)][col] = pix.B;
TheImage[ip.Vpixels-(row+1)][col+1] = pix.G;
TheImage[ip.Vpixels-(row+1)][col+2] = pix.R;

row++;
}
}
pthread_exit(NULL);
}

The entire code listing for MTFlipV() is shown in Code 2.7. Comparing this to the serial
version of the function, shown in Code 1.2, there aren’t really a lot of differences other than
the concept of tid, which acts as the data partitioning agent. Please note that this code is
an overly simple multithreaded code. Normally, what each thread does completely depends
on the logic of the programmer. For our purposes, though, this simple example is perfect
to demonstrate the basic ideas. Additionally, the FlipImageV() function is a well-mannered
function that is very amenable to multithreading.

2.3.3 Comparing FlipImageV() and MTFlipV()


Here are the major differences between the serial version of the vertical-flipper function
FlipImageV() and its parallel version MTFlipV():

• FlipImageV() is an ordinary function that we call directly, whereas MTFlipV() is written to match the void *(*)(void *) signature and is accessed through a function pointer. This makes it easy for us to hand that pointer to pthread create() when launching the thread.

void (*FlipFunc)(unsigned char** img); // Serial flip func ptr
void* (*MTFlipFunc)(void *arg); // Multi-th flip func ptr
...
void FlipImageV(unsigned char** img)
{
...
}

void *MTFlipV(void* tid)
{
...
}

• FlipImageV() is designed to process the entire image, while its parallel counterpart
MTFlipV() is designed to process only a portion of the image defined by an equation
similar to Equation 2.1. Therefore, MTFlipV() needs the variable tid passed to it so that it knows which portion of the data it is responsible for. This is done when launching the thread using pthread create().
• Besides the option of using the MTFlipFunc function pointer in launching threads using
pthread create(), nothing prevents us from simply calling the function ourselves by
using the MTFlipFunc function pointer (and its serial version FlipFunc). To call the
functions that these pointers are pointing to, the following notation has to be used:

FlipFunc = FlipImageV;
MTFlipFunc = MTFlipV;
...
(*FlipFunc)(TheImage); // call the serial version
(*MTFlipFunc)((void *)&ThParam[0]); // call the multithreaded version

• Each image row occupies ip.Hbytes bytes. For example, for the 640 × 480 image
dog.bmp, ip.Hbytes= 1920 bytes according to Equation 2.2. The serial function
FlipImageV() clearly has to loop through every byte in the range [0...1919]. However,
the multithreaded version MTFlipV() partitions these horizontal 1920 bytes based on
tid. If 4 threads are launched, the byte (and pixel) range that has to be processed for
each thread is as follows (a small stand-alone sketch that recomputes these ranges appears right after this list):
tid = 0 : Pixels [0...159] Hbytes [0...479]
tid = 1 : Pixels [160...319] Hbytes [480...959]
tid = 2 : Pixels [320...479] Hbytes [960...1439]
tid = 3 : Pixels [480...639] Hbytes [1440...1919]

• The multithreaded function’s first task is to calculate which data range it has to process.
If every thread does this, all 4 of these pixel ranges shown above can be processed in
parallel. Here is how each thread calculates its own range:

void *MTFlipV(void* tid)
{
struct Pixel pix; //temp swap pixel
int row, col;

long ts = *((int *) tid); // My thread ID is stored here


ts *= ip.Hbytes/NumThreads; // start index
long te = ts+ip.Hbytes/NumThreads-1; // end index

for(col=ts; col<=te; col+=3)
{
row=0;
...

The thread, as its very first task, is calculating its ts and te values (thread start and
thread end ). These are the Hbytes ranges, similar to the ones shown above and the
split is based on Equation 2.1. Since each pixel occupies 3 bytes (one byte for each
of the RGB colors), the function is adding 3 to the col variable in the for loop. The
FlipImageV() function doesn’t have to do such a computation since it is expected to
process everything, i.e., Hbytes range 0...1919.
• The image to process is passed via a local variable img in the serial FlipImageV() to
be compatible with the version introduced in Chapter 1, whereas a global variable
(TheImage) is used in MTFlipV() for reasons that will be clear in the coming chapters.
• The multithreaded function executes pthread exit() to let main() know that it is done.
This is when the pthread join() function advances to the next line for the thread that
finished.
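Here is the small stand-alone sketch promised above; it simply recomputes the per-thread Hbytes ranges using the same ts/te arithmetic (the values 1920 and 4 are the dog.bmp numbers quoted above, and this helper program is not part of imflipP.c):

#include <stdio.h>

int main(void)
{
    const long Hbytes = 1920, NumThreads = 4;   // dog.bmp row size, 4 threads

    for(long tid=0; tid<NumThreads; tid++){
        long ts = tid * (Hbytes / NumThreads);  // start index, as in MTFlipV()
        long te = ts + Hbytes/NumThreads - 1;   // end index
        printf("tid = %ld : Hbytes [%ld...%ld]\n", tid, ts, te);
    }
    return 0;
}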

As an interesting note, if we launched only a single thread using pthread create(), we are technically running a multithreaded program where the tid range is tid= [0...0]. This thread would still calculate its data range, only to find that it has to process the entire range anyway. In the imflipP.c program, FlipImageV() is referred to as the serial version; launching the multithreaded version with a single thread is also allowed and is referred to as the 1-threaded version.
By comparing serial Code 1.2 and its parallel version Code 2.7, it is easy to see that,
as long as the function is written carefully at the beginning, it is easy to parallelize it with
minor modifications. This will be a very useful concept to remember when we are writing
GPU versions of certain serial CPU code. Since GPU code by definition means parallel code,
this concept might allow us to port CPU code to the GPU world with minimal effort. This
will prove to be the case sometimes. Not always!

CODE 2.8: imflipP.c ... MTFlipH() {...}


Multithreaded version of the FlipImageH() function in Code 1.3.

...
long NumThreads; // Total # threads working in parallel
unsigned char** TheImage; // This is the main image
struct ImgProp ip;
...
void *MTFlipH(void* tid)
{
struct Pixel pix; //temp swap pixel
int row, col;

long ts = *((int *) tid); // My thread ID is stored here


ts *= ip.Vpixels/NumThreads; // start index
long te = ts+ip.Vpixels/NumThreads-1; // end index

for(row=ts; row<=te; row++)
{
col=0;
while(col<ip.Hpixels*3/2)
{
pix.B = TheImage[row][col];
pix.G = TheImage[row][col+1];
pix.R = TheImage[row][col+2];

TheImage[row][col] = TheImage[row][ip.Hpixels*3-(col+3)];
TheImage[row][col+1] = TheImage[row][ip.Hpixels*3-(col+2)];
TheImage[row][col+2] = TheImage[row][ip.Hpixels*3-(col+1)];

TheImage[row][ip.Hpixels*3-(col+3)] = pix.B;
TheImage[row][ip.Hpixels*3-(col+2)] = pix.G;
TheImage[row][ip.Hpixels*3-(col+1)] = pix.R;

col+=3;
}
}
pthread_exit(NULL);
}

2.3.4 Multithreaded Horizontal Flip: MTFlipH()


The serial function FlipImageH(), shown in Code 1.3, is parallelized and its multithreaded
version, MTFlipH() is shown in Code 2.8. Very much like its vertical version, the horizontal
version of the multithreaded code looks at tid to determine which data partition it has to
process. For an example 640×480 image (480 rows) using 4 threads, these pixels are flipped:

tid= 0 : Rows [0...119] tid= 1 : Rows [120...239]


tid= 2 : Rows [240...359] tid= 3 : Rows [360...479]

TABLE 2.1 Serial and multithreaded execution time of imflipP.c, both for vertical flip
and horizontal flip, on an i7-960 (4C/8T) CPU.
#Threads Command line Run time (ms)
Serial imflipP dogL.bmp dogV.bmp v 131
2 imflipP dogL.bmp dogV2.bmp v 2 70
3 imflipP dogL.bmp dogV3.bmp v 3 46
4 imflipP dogL.bmp dogV4.bmp v 4 67
5 imflipP dogL.bmp dogV5.bmp v 5 55
6 imflipP dogL.bmp dogV6.bmp v 6 51
8 imflipP dogL.bmp dogV8.bmp v 8 52
9 imflipP dogL.bmp dogV9.bmp v 9 47
10 imflipP dogL.bmp dogV10.bmp v 10 51
12 imflipP dogL.bmp dogV10.bmp v 12 44
Serial imflipP dogL.bmp dogH.bmp h 81
2 imflipP dogL.bmp dogH2.bmp h 2 41
3 imflipP dogL.bmp dogH3.bmp h 3 28
4 imflipP dogL.bmp dogH4.bmp h 4 41
5 imflipP dogL.bmp dogH5.bmp h 5 33
6 imflipP dogL.bmp dogH6.bmp h 6 28
8 imflipP dogL.bmp dogH8.bmp h 8 32
9 imflipP dogL.bmp dogH9.bmp h 9 30
10 imflipP dogL.bmp dogH10.bmp h 10 33
12 imflipP dogL.bmp dogH7.bmp h 12 29

For each row that a thread is responsible for, each pixel’s 3-byte RGB values are swapped
with its horizontal mirror. This swap starts at col= [0...2] which holds the RGB values of
pixel 0, and continues until the last RGB (3 byte) value has been swapped. For a 640×480
image, since Hbytes= 1920, and there is no wasted byte, the last pixel (i.e., pixel 639) is at
col= [1917...1919].

2.4 TESTING/TIMING THE MULTITHREADED CODE


Now that we know how the imflipP program works in detail, it is time to test it. The program
command line syntax is determined via the parsing portion of main() as shown in Code 2.1.
To run imflipP, the general command line syntax is:
imflipP InputfileName OutputfileName [v/h/V/H] [1-128]
where InputFileName and OutputFileName are the BMP files to read and write, respectively.
The optional command line argument [v/h/V/H] is to specify the flip direction (’V’ is the
default). The next optional argument is the number of threads and can be specified between
1 and MAXTHREADS (128), with a default value of 1 (serial).
Table 2.1 shows the run times of the same program using 1 through 12 threads on an Intel i7-960 CPU that has 4C/8T (4 cores, 8 threads). Results all the way up to 12 threads are reported, not because we expect an improvement beyond 8 threads, but as a sanity check. These kinds of checks are useful for quickly discovering a potentially hidden bug. The functionality is also confirmed by looking at the output picture, checking the file size, and comparing the output files with the Unix diff command, which reports any difference between two binary files.
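For example, assuming the serial run wrote dogV.bmp and the 4-thread run wrote dogV4.bmp (the output file names used in Table 2.1), the check is simply:

diff dogV.bmp dogV4.bmp

No output from diff means the two BMP files are byte-for-byte identical.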

So, what do these results tell us? First of all, in both the vertical and horizontal flip case, it
is clear that using more than a single thread helps. So, our efforts to parallelize the program
weren’t for nothing. However, the troubling news is that, beyond 3 threads, there seems to
be no performance improvement at all, in both the vertical and horizontal case. For ≥ 4
threads, you can simply regard the data as noise!
What Table 2.1 clearly shows is that multithreading helps up to 3 threads. Of course,
this is not a generalized statement. This statement strictly applies to my i7-960 test CPU
(4C/8T) and the code I have shown in Code 2.7 and Code 2.8, which are the heart of the
imflipP.c code. By this time, you should have a thousand questions in your mind. Here are
some of them:
• Would the results be different with a less powerful CPU like a 2C/2T?

• How about a more powerful CPU, such as a 6C/12T?


• In Table 2.1, considering the fact that we tested it on a 4C/8T CPU, shouldn’t we
have gotten better results up to 8 threads? Or, at least 6 or something? Why does
the performance collapse beyond 3 threads?
• What if we processed a smaller 640 × 480 image, such as dog.bmp, instead of the
giant 3200×2400 image dogL.bmp? Would the inflection point for performance be at a
different thread count?
• Or, for a smaller image, would there even be an inflection point?
• Why is the horizontal flip faster, considering that vertical is also processing the same
number of pixels; 3200×2400?
• ...
The list goes on ... Don’t lose sleep over Table 2.1. For this chapter, we have achieved our
goal. We know how to write multithreaded programs now, and we are getting some speed-
up. This is more than enough so early in our parallel programming journey. I can guarantee
that you will ask another 1000 questions to yourself about why this program isn’t as fast
as we hoped. I can also guarantee that you will NOT ask some of the key questions that
actually contribute to this lackluster performance. The answers to these questions deserve
an entire chapter, and this is what I will do. In Chapter 3, I will answer all of the above
questions and more of the ones you are not asking. For now, I invite you to think about
what you might NOT be asking ...
CHAPTER 3

Improving Our First Parallel CPU Program

We parallelized our first serial program imflip.c and developed its parallel version imflipP.c in Chapter 2. The parallel version achieved a reasonable speed-up using
pthreads, as shown in Table 2.1. Using multiple threads reduced the execution time from
131 ms (serial version) down to 70 ms, and 46 ms when we launched two and three threads,
respectively, on an i7-960 CPU with 4C/8T. Introducing more threads (i.e., ≥ 4) didn’t
help. In this chapter, we want to understand the factors that contributed to the perfor-
mance numbers that were reported in Table 2.1. We might not be able to improve them,
but we have to be able to explain why we cannot improve them. We do not want to achieve
good performance by luck!

3.1 EFFECT OF THE “PROGRAMMER” ON PERFORMANCE


Understanding the hardware and the compiler helps a programmer write good code. Over
many years, CPU architects and compiler designers kept improving their CPU architecture
and the optimization capabilities of compilers. A lot of these efforts helped ease the burden
on the software developer, so, he or she can write code without worrying too much about low-
level hardware details. However, as we will see in this chapter, understanding the underlying
hardware and utilizing the hardware efficiently might allow a programmer to develop 10x
higher performance code in some cases.
This statement is true not only for CPUs; the potential performance improvement on GPUs is even greater when the hardware is utilized efficiently, since a lot of the impressive GPU performance comes from shifting the performance-improvement responsibility from the hardware to the programmer. This chapter explains the interaction among
all parties that contribute to the performance of a program. They are the programmer,
compiler, OS, and hardware (and, to a certain degree, the user ).
• Programmer is the ultimate intelligence and should understand the capabilities of
the other pieces. No software or hardware can match what the programmer can do,
since the programmer brings in the most valuable asset to the game: the logic. Good
programming logic requires a complete understanding of all pieces of the puzzle.

• Compiler is a large piece of code that is packed with routine functionality for two
things: (1) compilation, and (2) optimization. (1) is the compiler’s job and (2) is
the additional work the compiler has to do at compile time to potentially optimize
the inefficient code that the programmer wrote. So, the compiler is the “organizer”
at compile time. At compile time, time is frozen, meaning that the compiler could contemplate many alternative scenarios that can happen at run time and produce
the best code for run time. When we run the program, the clock starts ticking. One
thing the compiler cannot know is the data, which could completely change the flow
of the program. The data can only be known at runtime, when the OS and CPU are
in action.
• The Operating System (OS) is the software that is the “boss” or the “manager” of the
hardware at run time. Its job is to allocate and map the hardware resources efficiently
at run time. Hardware resources include the virtual CPUs (i.e., threads), memory,
hard disk, flash drives (via Universal Serial Bus [USB] ports), network cards, keyboard,
monitor, GPU (to a certain degree), and more. A good OS knows its resources and
how to map them very well. Why is this important? Because the resources themselves
(e.g., CPU) have no idea what to do. They simply follow orders. The OS is the general
and the threads are the soldiers.
• Hardware is the CPU+memory+peripherals. The OS takes the binary code that the compiler produced and assigns it to virtual cores at run time, and the virtual cores execute it as fast as possible. The OS also facilitates the data movement between
the CPU and the memory, disk, keyboard, network card, etc.
• User is the final piece of the puzzle: Understanding the user is also important in
writing good code. The user of a program isn’t a programmer, yet the programmer
has to appeal to the user and has to communicate with him or her. This is not an
easy task!
In this book, the major focus will be on hardware, particularly the CPU and memory (and,
later in Part II, GPU and memory). Understanding the hardware holds the key to developing
high-performance code, whether for CPU or GPU. In this chapter, we will discover the truth
about whether it is possible to speed up our first parallel program, imflipP.c. If we can, how?
The only problem is: we don’t know which part of the hardware we can use more efficiently
for performance improvement. So, we will look at everything.

3.2 EFFECT OF THE “CPU” ON PERFORMANCE


In Section 2.3.3, I explained the sequence of events that takes place when we launch multi-
threaded code. In Section 2.4, I also listed numerous questions you might be asking your-
self to explain Table 2.1. Let us answer the very first and the most obvious family of
questions:
• How would the results differ on different CPUs?
• Is it the speed of the CPU, or the number of cores, number of threads?
• Anything else related to the CPU? Like cache memory?
Perhaps the most fun way to answer this question is to go ahead and run the same program
on many different CPUs. Once the results are in place, we can try to make sense out of
them. The more variety of CPUs we test this code on, the better idea we might get. I will
go ahead and run this program on 6 different CPUs shown in Table 3.1.
Table 3.1 lists important CPU parameters, such as the number of cores and threads
(C/T), per-core L1$ and L2$ cache memory sizes, denoted as L1$/C and L2$/C, and the
shared L3$ cache memory size (shared by all 4, 6, or 8 cores). Table 3.1 also lists the

TABLE 3.1 Different CPUs used in testing the imflipP.c program.


Feature CPU1 CPU2 CPU3 CPU4 CPU5 CPU6
Name i5-4200M i7-960 i7-4770K i7-3820 i7-5930K E5-2650
C/T 2C/4T 4C/8T 4C/8T 4C/8T 6C/12T 8C/16T
Speed:GHz 2.5-3.1 3.2-3.46 3.5-3.9 3.6-3.8 3.5-3.7 2.0-2.8
L1$/C 64KB 64KB 64KB 64KB 64KB 64KB
L2$/C 256KB 256KB 256KB 256KB 256KB 256KB
shared L3$ 3MB 8MB 8MB 10MB 15MB 20MB
Memory 8GB 12GB 32GB 32GB 64GB 16GB
BW:GB/s 25.6 25.6 25.6 51.2 68 51.2

amount of memory and the memory bandwidth (BW), in Giga-Bytes-per-second (GB/s), of the computer in each column. In this section, we focus on the CPU’s role on
performance, but the role of the memory in determining the performance will be explained
thoroughly in this book. We will also look at the operation of memory in this chapter. For
now, just in case the performance numbers had anything to do with the memory instead of
the CPU, they are also listed in Table 3.1.
In this section, we have no intention to make an intelligent assessment on how the
performance results could differ among these 6 CPUs. We simply want to watch the CPU
horse race and have fun! As we see the numbers, we will develop theories about what could
be the most responsible party in determining the overall performance of our program. In
other words, we are looking at things from long distance at this point. We will dive deep
into details later, but the experimental data we gather first will help us develop methods
to improve program performance later.

3.2.1 In-Order versus Out-Of-Order Cores


Aside from how many cores a CPU has, there is another consideration that relates to the
cores; almost every CPU manufacturer started manufacturing in order cores and upgraded
their design to out of order in their more advanced family of offerings. Abbreviations
inO and OoO will be used going forward. For example, MIPS R2000 was an inO CPU,
while the more advanced R10000 was OoO. Similarly, Intel 8086, 80286, 80386, and the
newer Atom CPUs are inO, whereas Core i3, i5, and i7, as well as every Xeon is OoO.
The difference between inO and OoO is the way the CPU is capable of executing a given
set of instructions; while an inO CPU can only execute the instructions in precisely the
order that is listed in the binary code, an OoO CPU executes them in the order of operand
availability. In other words, an OoO CPU can find a lot more work to do in the later list
of instructions, whereas an inO CPU simply sits idle if the next instruction in the order of
all given instructions does not have data available, perhaps because the memory controller
did not yet read the necessary data from the memory.
This allows an OoO CPU to execute instructions a lot faster due to the ability to avoid
getting stuck when the next instruction cannot be immediately executed until the operands
are available. However, this luxury comes at a price: an inO CPU takes up a lot less chip
area, therefore allowing the manufacturer to fit a lot more inO cores in the same integrated
circuit chip. Because of this very reason, each inO core might actually be clocked a little
faster, since they are simpler. Time for an analogy ...

ANALOGY 3.1: In order versus Out of Order Execution.


Cocotown had a competition in which two teams of farmer families had to make
coconut pudding. These were the instructions provided to them: (1) crack the coconut
with the automated cracker machine, (2) grind the coconuts that come out of the
cracker machine using the grinder machine, (3) boil milk, and (4) put cocoa in milk
and boil more, (5) put the ground coconuts in the cocoa-mixed milk and boil more.
Each step took 10 minutes. Team 1 finished their pudding in 50 minutes, while
Team 2 shocked everyone by finishing in 30 minutes. Their secret became obvious
after the competition: they started cracking the coconut (step 1) and boiling the milk
(step 3) at the same time. These two tasks did not depend on each other and could
be started at the same time. Within 10 minutes, both of them were done and the
coconuts could be placed in the grinder (step 2), while, in parallel, cocoa is mixed
with the milk and boiled (step 4). So, in 20 minutes, they were done with steps 1–4.
Unfortunately, step 5 had to wait for steps 1–4 to be done, making their total
execution time 30 minutes. So, the secret of Team 2 was to execute the tasks out of
order, rather than in the order they were specified. In other words, they could start
executing any step if it did not depend on the results of a previous step.

Analogy 3.1 emphasizes the performance advantage of OoO execution; an OoO core
can execute independent dependence-chains (i.e., chains of CPU instructions that do not
have dependent results) in parallel, without having to wait for the next instruction to finish,
achieving a healthy speed-up. But, there are other trade-offs when a CPU is designed using
one of the two paradigms. One wonders which one is a better design idea: (1) more cores
that are inO, or (2) fewer cores that are OoO? What if we took the idea to the extreme
and placed, say, 60 cores in a CPU that are all inO? Would this work faster than a CPU
that has 8 OoO cores? The answer is not as easy as just picking one of them.
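To make the idea of independent dependence chains concrete, here is a small hypothetical C function (not from the book's code); an OoO core can overlap Chain A and Chain B because neither chain consumes the other's results until the final sum, whereas an inO core stalls whenever the next listed instruction is still waiting for its operands:

long two_chains(long x, long y)
{
    long a1 = x * 3;     // Chain A: each line depends on the previous one
    long a2 = a1 + 7;
    long a3 = a2 * a2;

    long b1 = y - 5;     // Chain B: completely independent of Chain A
    long b2 = b1 * 2;
    long b3 = b2 + b1;

    return a3 + b3;      // the only point where the two chains join
}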
Here are the facts related to inO versus OoO CPUs:
• Since both design ideas are valid, there is a real inO CPU design like this, called
Xeon Phi, manufactured by Intel. One model, Xeon Phi 5110P, has 60 inO cores and
4 threads in each core, making it capable of executing 240 threads. It is considered a
Many Integrated Core (MIC) rather than a CPU; each core works at a very low speed
like 1 GHz, but it gets its computational advantage from the sheer number of cores
and threads. Since inO cores consume much less power, a 60C/240T Xeon Phi power
consumption is only slightly higher than a comparable 6C/12T Core i7 CPU. I will
be providing execution times on Xeon 5110P shortly.
• An inO CPU would only benefit a restricted set of applications; not every application
can take advantage of so many cores or threads. In most applications, we get diminish-
ing returns beyond a certain number of cores or threads. Generally, image and signal
processing applications are perfect for inO CPUs or MICs. Scientific high performance
processing applications are also typically a good candidate for inO CPUs.
• Another advantage of inO cores is their low power consumption. Since each core is
much simpler, it does not consume as much power as a comparable OoO core. This
is why most of today’s netbooks incorporate Intel Atom CPUs, which have inO cores.
An Atom CPU consumes only 2–10 Watts. The Xeon Phi MIC is basically 60 Atom
cores, with 4 threads/core, stuffed into a chip.

• If having so many cores and threads can benefit even just a small set of applications,
why not take this idea even farther and put thousands of cores in a compute unit
that can execute even more than 4 threads per core? It turns out, even this idea is
valid. Such a processor, which could execute something like hundreds of thousands of
threads in thousands of cores, is called a GPU, which is what this book is all about!

3.2.2 Thin versus Thick Threads


When a multithreaded program is being executed, such as imflipP.c, a core can be assigned
more than one thread to execute at run time. For example, in a 4C/8T CPU, what is the
difference between having two threads in two separate cores versus both threads in the same
core? The answer is: when two threads are sharing a core, they are sharing all of the core
resources, such as the cache memory, integer, and floating point units.
In the event that two threads that need a large amount of cache memory are assigned to
the same core, they will both be wasting time evicting data in and out of the cache memory,
thereby not benefiting from multithreading. Assume that one thread needed a huge amount
of cache memory access and another one needed only the integer unit, with near-zero cache memory access. These two threads would be good candidates to place in the same core during execution, because they do not compete for the same resources.
On the other hand, if a thread was designed by the programmer to require minimal
core resources, it could benefit significantly from multithreading. These kinds of threads
are called thin threads, whereas the ones that require excessive amounts of core resources
are called thick threads. It is the responsibility of the programmer to design the threads
carefully to avoid significant core resource usage; if every thread is thick, the benefit from
the additional threads will be minimal. This is why OS designers, such as Microsoft, design
their threads to be thin, to avoid interfering with the performance of the multithreaded
applications. In the end, the OS is the ultimate multithreaded application.
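As a purely hypothetical sketch (neither function is part of imflipP.c), the contrast between a thin thread and a thick thread might look like this; the thin worker stays inside registers, while the thick worker streams through a large buffer and competes for the cache of whichever core it shares:

// A "thin" thread: tiny working set, mostly register/integer work.
void *ThinWorker(void *arg)
{
    (void)arg;                                           // unused
    long sum = 0;
    for(long i=0; i<100000000L; i++) sum += i & 0xFF;    // no memory traffic to speak of
    return (void *)sum;
}

// A "thick" thread: streams through a multi-megabyte buffer, so its
// cache footprint crowds out any thread sharing the same core.
void *ThickWorker(void *arg)
{
    unsigned char *big = (unsigned char *)arg;           // assume an 8 MB buffer
    long sum = 0;
    for(long i=0; i<8L*1024*1024; i++) sum += big[i];
    return (void *)sum;
}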

3.3 PERFORMANCE OF IMFLIPP


Table 3.2 lists the execution times (in ms) of imflipP.c for the CPUs listed in Table 3.1. The
total number of threads is only reported for values that make a difference. CPU2 results are
a repeat of Table 2.1. The pattern for every CPU seems to be very similar: the performance
gets better up to a certain number of threads and hits a brick wall! Launching more threads
doesn’t help beyond that inflection point and this point depends on the CPU.
Table 3.2 raises quite a few questions, such as:
• What does it mean to launch 9 threads on a 4C/8T CPU that is known to be able to
execute a maximum of 8 threads (.../8T)?

• Maybe the right way to phrase this question is: is there a difference between launching
and executing a thread?
• When we design a program to be “8-threaded,” what are we assuming about runtime?
Are we assuming that all 8 threads are executing?
• Remember from Section 2.1.5: there were 1499 threads launched on the computer, yet
the CPU utilization was 0%. So, not every thread is executing in parallel. Otherwise,
CPU utilization would be hitting the roof. If a thread is not executing, what is it
doing? Who is managing these threads at runtime?

TABLE 3.2 imflipP.c execution times (ms) for the CPUs listed in Table 3.1.
#Threads CPU1 CPU2 CPU3 CPU4 CPU5 CPU6
Serial V 109 131 159 117 181 185
2 V 93 70 50 58 104 95
3 V 78 46 33 43 75 64
4 V 78 67 49 59 54 49
5 V 93 55 40 52 35 57
6 V 78 51 35 55 35 48
8 V 78 52 37 53 26 37
9 V 47 34 52 25 49
10 V 40 23 45
12 V 35 28 38
Serial H 62 81 50 60 66 73
2 H 31 41 25 36 57 38
3 H 46 28 16 29 39 25
4 H 46 41 25 41 23 19
5 H 33 20 34 13 28
6 H 28 18 31 17 24
8 H 32 20 23 13 18
9 H 30 19 21 12 24
10 H 20 11 22
12 H 18 14 19

• Probably, the answer for why ≥ 4 threads is not helping our performance in Table 3.2
is hidden in these questions.
• Yet another question is whether the thick versus thin threads could change this answer.
• Another obvious question: all of the CPUs in Table 3.1 are OoO; would the results look fundamentally different on inO cores, such as Xeon Phi?

3.4 EFFECT OF THE “OS” ON PERFORMANCE


There are numerous other questions we can raise which all ask the same thing: what happens
to a thread at runtime? In other words, we know that the OS is responsible for managing
the creation/joining of the virtual CPUs (threads), but it is time to understand the details.
To go back to our list of who is responsible for what:
• Programmer determines what a thread does by writing a function for each thread.
This determination is made at compile time, when no runtime information is avail-
able. The function is written in a language that is much higher level than the machine
code: the only language the CPU understands. In the old days, programmers wrote
machine code, which made program development possibly 100x more difficult. Now
we have high level languages and compilers, so we can shift the burden to the compiler
in a big way. The final product of the programmer is a program, which is a set of tasks
to execute in a given order, packed with what-if scenarios. The purpose of the what-if
scenarios is to respond well to runtime events.

• Compiler compiles the thread creation routines to machine code (CPU language),
at compile time. The final product of the compiler is an executable instruction list, or the binary executable. Note that the compiler has minimal information (or idea)
about what is going to happen at runtime when it is compiling from the programming
language to machine code.
• The OS is responsible for the runtime. Why do we need such an intermediary?
It is because a multitude of different things can happen when executing the binary
that the compiler produced. BAD THINGS could happen such as: (1) the disk could
be full, (2) memory could be full, (3) the user could enter a response that causes
the program to crash, (4) the program could request an extreme number of threads,
for which there are no available thread handles. Alternatively, even if nothing goes
wrong, somebody has to be responsible for RESOURCE EFFICIENCY, i.e., run-
ning programs efficiently by taking care of things like (1) who gets which virtual CPU,
(2) when a program asks to create memory, should it get it, if so, what is the pointer,
(3) if a program wants to create a child thread, do we have enough resources to create
it? If so, what thread handle to give, (4) accessing disk resources, (5) network re-
sources, (6) any other resource you can imagine. Resources are managed at runtime
and there is no way to know them precisely at compile time.
• Hardware executes the machine code. The machine code for the CPU to execute
is assigned to the CPU at runtime by the OS. Similarly, the memory is read and
transferred mostly by the peripherals that are under the OS’s control (e.g., Direct
Memory Access – DMA controller).
• User enjoys the program which produces excellent results if the program was written
well and everything goes right at runtime.

3.4.1 Thread Creation


The OS knows what resources it has, since most of them are statically determined once the
computer turns on. The number of virtual CPUs is one of them, and the one we are most
interested in understanding. If the OS determines that it is running on a processor with 8
virtual CPUs (as we would find on a 4C/8T machine), it assigns these virtual CPUs names,
such as vCPU0, vCPU1, vCPU2, vCPU3, ..., vCPU7. So, in this case, the OS has 8 virtual
CPU resources and is responsible for managing them.
When the program launches a thread using pthread create(), the OS assigns a thread
handle to that thread; say, 1763. So, the program sees ThHandle[1]= 1763 at runtime. The
program is interpreting this as “tid= 1 was assigned the handle ThHandle[1]= 1763.” The
program only cares about tid= 1 and the OS only cares about 1763 inside its handle list.
The program must save this handle (1763), though, since this handle is its only way to
tell the OS which thread it is talking about ... tid= 1 or ThHandle[] are nothing more than
program variables and have no significance to the internal workings of the OS.

3.4.2 Thread Launch and Execution


When the OS gives the parent thread ThHandle[1]= 1763 at runtime, the parent thread
understands that it has the authorization to execute some function using this child thread.
It launches the code with the function name that is built into pthread create(). What this
is telling the OS is that, in addition to creating the thread, now the parent wants to launch
this thread. While creating a thread required a thread handle, launching a thread requires a
virtual CPU to be assigned (i.e., find somebody to do the work). In other words, the parent
thread is saying: find a virtual CPU, and run this code on it.

[Figure 3.1 is a state diagram: a thread is created by pthread create() and becomes Runnable (queued), the OS scheduler dispatches it to Running, a quantum end (context switch) returns it to Runnable, an event wait moves it to Stopped and event arrival moves it back to Runnable, and it becomes Terminated around pthread join().]
FIGURE 3.1 The life cycle of a thread. From the creation to its termination, a thread is cycled through many different statuses, assigned by the OS.

The OS, then, tries to find a virtual CPU resource that is available to execute this code.
The parent thread does not care which virtual CPU the OS chooses, since this is a resource
management issue that the OS is responsible for. The OS maps the thread handle it just
assigned to an available virtual CPU at runtime (e.g., handle 1763 → vCPU4), assuming
that virtual CPU 4 (vCPU4) is available right at the time that the pthread create() is
called.

3.4.3 Thread Status


Remember from Section 2.1.5 that there were 1499 threads launched on the computer, yet
the CPU utilization was 0%. So, not every thread is executing in parallel. If a thread is not
executing, then what is it doing? In summary, if a CPU has 8 virtual CPUs (as in a 4C/8T
processor), no more than 8 threads could have the running status. This is the status that
a thread possesses when it is executing; rather than merely waiting in the OS’s queue as runnable, this thread is now on the CPU, executing (i.e., running). Aside from running,
a thread could have the status of runnable, stopped, or in case it is done with its job,
terminated, as depicted in Figure 3.1.
When our application calls pthread create() to launch a thread, the OS immediately
determines one thing: do I have a sufficient amount of resources to assign a handle and
create this thread? If the answer is “Yes,” the thread handle is assigned to that thread,
and all of the necessary memory, stack areas are created. Right at that time, the status of
the thread is recorded within that handle as runnable. This means that the thread can
run, but it is not running yet. It basically goes into a queue of runnable threads and waits
for its turn to come. At some point in time, the OS decides to start executing this thread.

For that, two things must happen: (1) a virtual CPU (vCPU) is found for that thread to
execute on, (2) the status of the thread now changes to running.
The Runnable=⇒Running status change is handled by a part of the OS called the
dispatcher. The OS treats each one of the available CPU threads as a virtual CPU (vCPU);
so, for example, a 8C/16T CPU has 16 vCPUs. The thread that was waiting in the queue
starts running on a vCPU. A sophisticated OS will pay attention to where to place the
threads to optimize the performance. This placement, called the core affinity can actually
be manually modified by the user to override a potentially suboptimal placement by the OS.
The OS allows each thread to run for a certain period of time (called quantum) before it
switches to another thread that has been waiting in the Runnable status. This is necessary
to avoid starvation, i.e., a thread being stuck forever in the Runnable status. When a thread
is switched from Running=⇒Runnable, all of its register information – and more – has
to be saved somewhere; this information is called the context of the thread. Accordingly, the Running=⇒Runnable status change is called a context switch. A context switch takes
a certain amount of time to complete and has performance implications, although it is an
unavoidable reality.
During execution (in the Running status), a thread might call a function, say scanf(), to
read a keyboard input. Reading a keyboard is much slower than any other CPU operation;
so, there is no reason why the OS should keep our thread in the Running status
while waiting for the keyboard input, which would starve other threads of core time. In
this case, the OS cannot switch this thread to Runnable either, since the Runnable
queue is dedicated to the threads that can be immediately switched over to the Running
status when the time is right. A thread that is waiting for a keyboard input could wait
for an indefinite amount of time; it could happen immediately or it could happen within
10 minutes, in case the user has left to get coffee! So, there is another status to capture this situation; it is called Stopped.
A thread undergoes a Running=⇒Stopped status switch when it requests a resource
that is not going to be available for a period of time, or it has to wait for an event to hap-
pen for an indeterminate amount of time. When the requested resource (or data) becomes
available, the thread undergoes a Stopped=⇒Runnable status switch and is placed in
the queue of Runnable threads that are waiting for their time to start executing again.
It would make no sense for the OS to switch this thread to Running either, as this would
mean a chaotic unscheduling of another peacefully executing thread, i.e., kicking it out of
the core! So, to do things in a calm and orderly fashion, the OS places the Stopped thread
back in the Runnable queue and decides when to allow it to execute again later. It might,
however, assign a different priority to threads that should be dispatched ahead of others
for whatever reason.
Finally, when a thread completes its execution, for example by calling the pthread exit() function, the OS makes the Running=⇒Terminated status switch and the thread is permanently
out of the Runnable queue. Once its memory areas etc. are cleaned up, the handle for that
thread is destroyed and it is available later for another pthread create().
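To tie the states in Figure 3.1 to actual calls, here is a minimal hypothetical program (not from imflipP.c) whose single child thread visits most of the statuses: it is created as Runnable, dispatched to Running, sits in Stopped while scanf() waits for the keyboard, and ends up Terminated:

#include <pthread.h>
#include <stdio.h>

void *KeyboardWaiter(void *arg)
{
    (void)arg;                  // unused
    int key;
    printf("Type a number and press Enter: ");
    scanf("%d", &key);          // Running -> Stopped until the input arrives
    printf("Got %d\n", key);    // back to Runnable, then Running again
    pthread_exit(NULL);         // Running -> Terminated
}

int main(void)
{
    pthread_t th;
    pthread_create(&th, NULL, KeyboardWaiter, NULL);   // child is created (Runnable)
    pthread_join(th, NULL);                            // wait for Terminated
    return 0;
}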

3.4.4 Mapping Software Threads to Hardware Threads


Section 3.4.3 answers our question about the 1499 threads in Figure 2.1: we know that out of
the 1499 threads that we see in Figure 2.1, at least 1491 threads must be either Runnable
or Stopped on a 4C/8T CPU, since no more than 8 threads could be in the Running
status. Think of 1499 as the number of tasks to do, but there are only 8 people to do it!
The OS simply does not have physical resources to “do” (i.e., execute) more than 8 things at any given point in time. It picks one of the 1499 tasks, and assigns one of the people to
do it. If another task becomes more urgent for that one person (for example, if a network
packet arrives requiring immediate attention), the OS switches that person to doing that
more urgent task and suspends what he or she was currently doing.
We are curious about how these status switches affect our application’s performance. In
the case of 1499 threads in Figure 2.1, it is highly likely that something like 1495 threads are
Stopped or Runnable, waiting for you to hit some key on the keyboard or a network packet
to arrive, and only four threads are Running, probably your multithreaded application
code. Here is an analogy:

ANALOGY 3.2: Thread Status.


You look through your window and see 1499 pieces of paper on the curb with written
tasks on them. You also see 8 people outside, all sitting on their chair until the manager
gives them a task to execute. At some point, the manager tells person #1 to grab paper
#1256. Then, person #1 starts doing whatever is written on paper #1256. All of a
sudden, the manager tells person #1 to bring that paper #1256 back, stop executing
task #1256 and grab paper #867 and start doing what is written on paper #867 ...
Since the execution of task #1256 is not yet complete, all of the notes that person
#1 took during the execution of task #1256 must be written somewhere, on the
manager’s notebook, for person #1 to remember it later. Indeed, this task might not
even be completed by the same person. If the task on paper #867 is finished, it could
be crumpled up and thrown into the waste basket. That task is complete.

In Analogy 3.2, sitting on the chair corresponds to the thread status Runnable and doing
what is written on the paper corresponds to the thread status Running and the people
are the virtual CPUs. The manager, who is allowed to switch the status of people, is the
OS, whereas his or her notebook is where the thread contexts are saved to be used later
during a context switch. Crumpling up a paper (task) is equivalent to switching it to the
Terminated status.
The number of the launched threads could fluctuate from 1499 to 1505 down to 1365,
etc., but the number of available virtual CPUs cannot change (e.g., 8 in this example),
since they are a “physical” entity. A good way to define the 1499 quantity is software
threads, i.e., the threads that the OS creates. The available number of physical threads
(virtual CPUs) is the hardware threads, i.e., the maximum number of threads that the
CPU manufacturer designed the CPU to be able to execute. It is a little bit confusing
that both of them are called “threads,” since the software threads are nothing but a data
structure containing information about the task that the thread will perform as well as the
thread handle, memory areas, etc., whereas the hardware threads are the physical hardware
component of the CPU that is executing machine code (i.e., the compiled version of the
task). The job of the OS is to find an available hardware thread for each one of the software
threads it is managing. The OS is responsible for managing the virtual CPUs, which are
hardware resources, much like the available memory.
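If you want to see how many hardware threads (virtual CPUs) your own machine exposes, a small POSIX sketch such as the following works on Linux and similar systems (this query is not something imflipP.c performs; it is shown purely for illustration):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    // Number of virtual CPUs currently online; a 4C/8T machine typically reports 8.
    long vcpus = sysconf(_SC_NPROCESSORS_ONLN);
    printf("This machine exposes %ld virtual CPUs\n", vcpus);
    return 0;
}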

3.4.5 Program Performance versus Launched Pthreads


The maximum number of software threads is only limited by the internal OS parame-
ters, whereas the number of hardware threads is set in stone at the time the CPU is designed. When you launch a program that executes two heavily active threads, the OS
will do its best to bring them into the Running status as soon as possible. Possibly one
more thread, belonging to the OS’s thread scheduler, is also very active, bringing the number of highly active threads to 3.
So, how does this help in explaining the results in Table 3.2? Although the exact answer
depends on the model of the CPU, there are some highly distinguishable patterns that can
be explained with what we just learned. Let us pick one CPU2 as an example. While CPU2
should be able to execute 8 threads in parallel (it is a 4C/8T), the performance falls off a
cliff beyond 3 launched threads. Why? Let’s try to guess this by fact-checking:
• Remember our Analogy 1.1, where two farmers were sharing a tractor. By timing the
tasks perfectly, together, they could get 2x more work done. This is the hope behind
4C/8T getting an 8T performance, otherwise, you really have only 4 physical cores
(i.e., tractors).
• If this best case scenario happened here in our Code 2.1 and Code 2.2, we should
expect the performance improvement to continue to 8 threads, or at least, something
like 6 or 7. This is not what we see in Table 3.2!
• So, what if one of the tasks required one of the farmers to use the hammer and other
resources in the tractor in a chaotic way? The other wouldn’t be able to do anything
useful since they would be continuously bumping into each other and keep falling, etc;
the performance wouldn’t even be close to 2x (i.e., 1+1 = 2)! It would be more like
0.9x! As far as efficiency is concerned, 1+1 = 0.9 sounds pretty bad! In other words,
if both threads are “thick threads,” they are not meant to work simultaneously with
another thread inside the same core ... I mean, efficiently ... This must be somehow
the case in Code 2.1 and Code 2.2, since we are not getting anything out of the dual
threads inside each core ...
• What about memory? We will see an entire architectural organization of the cores
and memory in Chapter 4. But, for now, it suffices to say that, no matter how many
cores/threads you have in a CPU, you only have a single main memory for all of the
threads to share. So, if one thread was a memory-unfriendly thread, it would mess
up everybody’s memory accesses. This is another possibility in explaining why the
performance hits a brick wall at ≥ 4 threads.
• Let’s say that we explained the problems with why we are not able to use the double-
threads in each core (called hyper-threading by Intel), but why does the performance
stop improving at 3 threads, not 4? The performance from 3 to 4 threads is lower,
which is counterintuitive. Are these threads not even able to use all of the cores?
A similar pattern is visible in almost every CPU, although the exact thread count
depends on the maximum available threads and varies from CPU to CPU.

3.5 IMPROVING IMFLIPP


Instead of answering all of these questions, let’s see if we can improve the program without
knowing all of these answers, but simply guessing what the problem could be. After all, we
have enough intuition to be able to make educated guesses. In this section, we will analyze
the code and will try to identify the parts of the code that could be causing inefficiencies
and will suggest a fix. After implementing this fix, we will see how it performed and will
explain why it worked (or didn’t).

Where is the best place to start? If you want to improve a computer program’s perfor-
mance, the best place to start is the innermost loops. Let’s start with the MTFlipH() function
shown in Code 2.8. This function is taking a pixel value and moving it to another memory
area one byte at a time. The MTFlipV() function shown in Code 2.7 is very similar. For each
pixel, both functions move R, G, and B values one byte at a time. What is wrong with this
picture? A lot! When we go through the details of the CPU and memory architecture in
Chapter 4, you will be amazed with how horribly inefficient Code 2.7 and Code 2.8 are. But,
for now, we just want to find obvious fixes and apply them and observe the improvements
quantitatively. We will not comment on them until we learn more about the memory/core
architecture in Chapter 4.

3.5.1 Analyzing Memory Access Patterns in MTFlipH()


The MTFlipH() function is clearly a “memory-intensive” function. For each pixel, there is
really no “computation” done. Instead, one byte is moved from one memory location to
another. When I say “computation,” I mean something like making each pixel value darker
by reducing the RGB values, or turning the image into a B&W image by recalculating a
new value for each pixel, etc. None of these computations are being performed here. The
innermost loop of MTFlipH() looks something like this:

...
for(row=ts; row<=te; row++) {
col=0;
while(col<ip.Hpixels*3/2){
// example: Swap pixel[42][0] , pixel[42][3199]
pix.B = TheImage[row][col];
pix.G = TheImage[row][col+1];
pix.R = TheImage[row][col+2];
TheImage[row][col] = TheImage[row][ip.Hpixels*3-(col+3)];
...

So, to improve this program, we should carefully analyze the memory access patterns.
Figure 3.2 shows the memory access patterns of the MTFlipH() function during the processing
of the 22 MB image dogL.bmp. This dog picture consists of 2400 rows and 3200 columns. For
example, when flipping Row 42 horizontally (no specific reason for choosing this number),
here is the swap pattern for pixels (also shown in Figure 3.2):
[42][0]←→[42][3199], [42][1]←→[42][3198] ... [42][1598]←→[42][1601], [42][1599]←→[42][1600]

3.5.2 Multithreaded Memory Access of MTFlipH()


Not only does the byte-by-byte memory access of MTFlipH() sound bad, but remember also that this function is running in a multithreaded environment. First, let us look at what the memory access pattern looks like if we launched only a single thread: this single
thread would want to flip all 2400 rows by itself, starting at row 0, and continuing with
row 1, 2, ... 2399. During this loop, when it is flipping row 42 as an example again, which
“bytes” does MTFlipH() really swap? Let’s take the very first pixel swap as an example. It
involves the following operations:

Swap pixel[42][0] with pixel[42][3199], which corresponds to


Swap bytes[0..2] of row 42 with bytes [9597..9599] of row 42, consecutively.

[Figure 3.2 depicts dogL.bmp, a 3200×2400 image (~22 MB) stored row by row in main memory; each row holds 3200 pixels = 9600 B. Flipping Row 42 swaps RGB(42,0)↔RGB(42,3199), RGB(42,1)↔RGB(42,3198), RGB(42,2)↔RGB(42,3197), and so on toward the middle of the row.]
FIGURE 3.2 Memory access patterns of MTFlipH() in Code 2.8. A total of 3200 pixels’ RGB values (9600 Bytes) are flipped for each row.

In Figure 3.2, notice that each pixel corresponds to 3 consecutive bytes holding that pixel’s
RGB values. During just this one pixel swap, the function MTFlipH() requests 6 memory
accesses, 3 to read the bytes [0..2] and 3 to write them into the flipped pixel location
held at bytes [9597..9599]. This means that, to merely flip one row, our MTFlipH() function
requests 3200×6 = 19, 200 memory accesses, with mixed read and writes. Now, let’s see what
happens when, say, 4 threads are launched. Each thread is trying to finish the following
tasks, consisting of flipping 600 rows.

tid= 0 : Flip Row[0] , Flip Row[1] ... Flip Row [598] , Flip Row [599]
tid= 1 : Flip Row[600] , Flip Row[601] ... Flip Row [1198] , Flip Row [1199]
tid= 2 : Flip Row[1200] , Flip Row[1201] ... Flip Row [1798] , Flip Row [1799]
tid= 3 : Flip Row[1800] , Flip Row[1801] ... Flip Row [2398] , Flip Row [2399]

Notice that each one of the 4 threads issues memory accesses just as frequently as the single thread does. If each thread was designed improperly, causing chaotic memory access
requests, 4 of them together will have 4x the mess! Let’s look at the very early part of the
execution when main() launches all 4 threads and assigns them the MTFlipH() function to
execute. If we assume that all 4 threads started executing at exactly the same time, this is
what all 4 threads are trying to do simultaneously for the first few bytes:

tid= 0 : Flip Row[0] : mem(00000000..00000002)←→mem(00009597..00009599) , ...


tid= 1 : Flip Row[600] : mem(05760000..05760002)←→mem(05769597..05769599) , ...
tid= 2 : Flip Row[1200] : mem(11520000..11520002)←→mem(11529597..11529599) , ...
tid= 3 : Flip Row[1800] : mem(17280000..17280002)←→mem(17289597..17289599) , ...

Although there will be slight variations in the progression of each thread, it doesn’t
change the story. What do you see when you look at these memory access patterns? The
very first thread, tid= 0 is trying to read the pixel [0][0], whose value is at memory addresses
mem(00000000..00000002). This is the very beginning of tid= 0’s task that requires it to
swap the entire row 0 first before it moves on to row 1.
While tid= 0 is waiting for its 3 bytes to come in from the memory, at precisely the same
time, tid= 1 is trying to read pixel[600][0] that is the very first pixel of row 600, located at
memory addresses mem(05760000..05760002), i.e., 5.5 MB (Megabytes) away from the very
first request. Hold on, tid= 2 is not standing still. It is trying to do its own job that starts
by swapping the entire row 1200. The first pixel to be read is pixel[1200][0], located at the 3
consecutive bytes with memory addresses mem(11520000..11520002), i.e., 11 MB away from
the 3 bytes that tid= 0 is trying to read. Similarly, tid= 3 is trying to read the 3 bytes that
are 16.5 MB away from the very first 3 bytes ... Remember that the total image was 22 MB
and the processing of it was divided into 4 threads, each responsible for a 5.5 MB chunk
(i.e., 600 rows).
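Those addresses are easy to sanity-check with a small stand-alone computation (hypothetical helper code, assuming the image is laid out contiguously row by row at 3 bytes per pixel):

#include <stdio.h>

int main(void)
{
    const long Hpixels=3200, Vpixels=2400, RowBytes=Hpixels*3;   // 9600 B per row
    const long NumThreads=4, RowsPerThread=Vpixels/NumThreads;   // 600 rows each

    for(long tid=0; tid<NumThreads; tid++){
        long firstRow  = tid * RowsPerThread;
        long firstByte = firstRow * RowBytes;   // offset of pixel[firstRow][0]
        printf("tid=%ld starts at row %ld, byte offset %ld (about %.1f MB)\n",
               tid, firstRow, firstByte, firstByte/(1024.0*1024.0));
    }
    return 0;
}

The printed offsets (0; 5,760,000; 11,520,000; 17,280,000) match the mem(...) addresses listed above and confirm that the four threads start their reads roughly 5.5 MB apart.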
When we learn the detailed inner-workings of a DRAM (Dynamic Random Access Mem-
ory) in Chapter 4, we will understand why this kind of a memory access pattern is nothing
but a disaster, but, for now, we will find a very simple fix for this problem. For the folks
who are craving to get into the GPU world, let me make a comment here that the DRAM
in the CPU and the GPU are almost identical operationally. So, anything we learn here will
be readily applicable to the GPU memory with some exceptions resulting from the massive
parallelism of the GPUs. An identical “disaster memory access” example will be provided
for the GPUs and you will be able to immediately guess what the problem is by relying on
what you learned in the CPU world.

3.5.3 DRAM Access Rules of Thumb


While it will take a good portion of Chapter 4 to understand why these kinds of choppy
memory accesses are bad for DRAM performance, fixing the problem is surprisingly simple
and intuitive. All you need to know is the rules of thumb in Table 3.3 that are a good guide
to achieving good DRAM performance. Let’s look at this table and make sense out of it.
These rules are based on the architecture of the DRAM that is designed to allow the data
to be shared by every CPU core, and they all say the same thing one way or another:
When accessing DRAM, access big consecutive chunks like
1 KB, 4 KB, etc., rather than small onesy-twosy bytes ...

While this is an excellent guide to improving the performance of our first parallel program
imflipP.c, let’s first check to see if we were obeying these rules in the first place. Here is the
summary of the MTFlipH() function’s memory access patterns (Code 2.8):
• Granularity rule is clearly violated, since we are trying to access one byte at a time.

• Locality rule wouldn’t be violated if there was only a single thread. However, multiple
simultaneous (and distant) accesses by different threads cause violations.
• L1, L2, L3 caching do not help us at all since there isn’t a good “data reuse” scenario.
This is because we never need any data element more than once.
With almost every rule violated, it is no wonder that the performance of imflipP.c is
miserable. Unless we obey the access rules of DRAM, we are just creating massive inefficient
memory access patterns that cripple the overall performance.

TABLE 3.3 Rules of thumb for achieving good DRAM performance.

Rule            Ideal Values      Description
Granularity     8 B ... 64 B      The size of each read/write. “Too small” values are
                                  extremely inefficient (e.g., one byte at a time).
Locality        1 KB ... 4 KB     If consecutive accesses are too far from each other,
                                  they force the row buffer to be flushed
                                  (i.e., they trigger a new DRAM row-read).
L1, L2 Caching  64 KB ... 256 KB  If the total number of bytes read/written repeatedly
                                  by a single thread is confined to a small region like
                                  this, the data can be L1- or L2-cached, thereby
                                  dramatically improving re-access speed for that thread.
L3 Caching      8 MB ... 20 MB    If the total number of bytes read/written repeatedly
                                  by all threads is confined to a small region like this,
                                  the data can be L3-cached, thereby dramatically
                                  improving re-access speed for every core.

3.6 IMFLIPPM: OBEYING DRAM RULES OF THUMB


It is time to improve imflipP.c by making it obey the rules of thumb in Table 3.3. The
improved program is imflipPM.c (“M” for “memory-friendly”).

3.6.1 Chaotic Memory Access Patterns of imflipP


Again, let’s start by analyzing MTFlipH(), one of the un-memory-friendly functions of
imflipP.c. When we are reading bytes and replacing them with other bytes, we are accessing
DRAM for every single byte of the pixel individually as follows:

for(row=ts; row<=te; row++) {
col=0;
while(col<ip.Hpixels*3/2){
pix.B = TheImage[row][col];
pix.G = TheImage[row][col+1];
pix.R = TheImage[row][col+2];
TheImage[row][col] = TheImage[row][ip.Hpixels*3-(col+3)];
TheImage[row][col+1] = TheImage[row][ip.Hpixels*3-(col+2)];
TheImage[row][col+2] = TheImage[row][ip.Hpixels*3-(col+1)];
TheImage[row][ip.Hpixels*3-(col+3)] = pix.B;
TheImage[row][ip.Hpixels*3-(col+2)] = pix.G;
TheImage[row][ip.Hpixels*3-(col+1)] = pix.R;
col+=3;
}
...

The key observation is that, since the dog picture is in the main memory (i.e., DRAM),
every single pixel-read triggers an access to DRAM. According to Table 3.3, we know that
DRAM doesn’t like to be bothered frequently.

3.6.2 Improving Memory Access Patterns of imflipP


What if we read the entire row of the image (all 3200 pixels, totaling 9600 bytes) into a
temporary area (somewhere other than DRAM) and then processed it in that confined area
without ever bothering the DRAM during the processing of that row? We will call this area
a Buffer. Since the buffer is small enough, it will be cached inside L1$ and will allow us
to take advantage of L1 caching. Well, at least we are now using the cache memory and
obeying the cache-friendliness rules in Table 3.3.

unsigned char Buffer[16384]; // This is the buffer to use to get the entire row
...
for(row=ts; row<=te; row++) {
// bulk copy from DRAM to cache
memcpy((void *) Buffer, (void *) TheImage[row], (size_t) ip.Hbytes);
col=0;
while(col<ip.Hpixels*3/2){
pix.B = Buffer[col];
pix.G = Buffer[col+1];
pix.R = Buffer[col+2];
Buffer[col] = Buffer[ip.Hpixels*3-(col+3)];
Buffer[col+1] = Buffer[ip.Hpixels*3-(col+2)];
Buffer[col+2] = Buffer[ip.Hpixels*3-(col+1)];
Buffer[ip.Hpixels*3-(col+3)] = pix.B;
Buffer[ip.Hpixels*3-(col+2)] = pix.G;
Buffer[ip.Hpixels*3-(col+1)] = pix.R;
col+=3;
}
// bulk copy back from cache to DRAM
memcpy((void *) TheImage[row], (void *) Buffer, (size_t) ip.Hbytes);
...

When we transfer the 9600 B from the main memory into the Buffer, we are relying on
the efficiency of the memcpy() function, which is provided as part of the standard C library.
During the execution of memcpy(), 9600 bytes are transferred from the main memory into
the memory area that we name Buffer. This access is super efficient, since it only involves
a single continuous memory transfer that obeys every rule in Table 3.3.
Let’s not kid ourselves: Buffer is also in the main memory; however, there is a huge
difference in the way we will use these 9600 bytes. Since we will access them continuously,
they will be cached and will no longer bother DRAM. This is what will allow the accesses
to the Buffer memory area to be significantly more efficient, thereby obeying most of the
rules in Table 3.3. Let us now re-engineer the code to use the Buffer.

CODE 3.1: imflipPM.c MTFlipHM() {...}


Memory-friendly version of MTFlipH() (Code 2.8) that obeys the rules in Table 3.3.

void *MTFlipHM(void* tid)


{
struct Pixel pix; //temp swap pixel
int row, col;
unsigned char Buffer[16384]; // This is the buffer to use to get the entire row

long ts = *((int *) tid); // My thread ID is stored here


ts *= ip.Vpixels/NumThreads; // start index
long te = ts+ip.Vpixels/NumThreads-1; // end index

for(row=ts; row<=te; row++){


// bulk copy from DRAM to cache
memcpy((void *) Buffer, (void *) TheImage[row], (size_t) ip.Hbytes);
col=0;
while(col<ip.Hpixels*3/2){
pix.B = Buffer[col];
pix.G = Buffer[col+1];
pix.R = Buffer[col+2];
Buffer[col] = Buffer[ip.Hpixels*3-(col+3)];
Buffer[col+1] = Buffer[ip.Hpixels*3-(col+2)];
Buffer[col+2] = Buffer[ip.Hpixels*3-(col+1)];
Buffer[ip.Hpixels*3-(col+3)] = pix.B;
Buffer[ip.Hpixels*3-(col+2)] = pix.G;
Buffer[ip.Hpixels*3-(col+1)] = pix.R;
col+=3;
}
// bulk copy back from cache to DRAM
memcpy((void *) TheImage[row], (void *) Buffer, (size_t) ip.Hbytes);
}
pthread_exit(NULL);
}

3.6.3 MTFlipHM(): The Memory Friendly MTFlipH()


The memory-friendly version of the MTFlipH() function in Code 2.8 is the MTFlipHM()
function shown in Code 3.1. They are virtually identical with one distinctive difference:
To flip each row’s pixels, MTFlipHM() accesses the DRAM only once to read a large chunk
of data using the memcpy() function (entire row of the image, e.g., 9600 B for dogL.bmp).
A 16 KB buffer memory array is defined as a local array variable and the entire contents of
the row are copied into the buffer before even getting into the innermost loop that starts
swapping the pixels. We could have very well defined a 9600 B buffer since this is all we
need for our image, but a larger buffer allows scalability to larger images.
Although the innermost loops are identical inside both functions, notice that only the
Buffer[] array is accessed inside the while loop of MTFlipHM(). We know that the OS assigns
a stack area for all local variables; we will get into this in great detail in Chapter 5.
But, for now, the most important observation is the definition of a localized storage area
in the 16 KB range that will allow MTFlipHM() to be compliant with the L1 caching rule of
thumb, shown in Table 3.3.
Here are some lines from Code 3.1, highlighting the buffering in MTFlipHM(). Note that
the global array TheImage[] is in DRAM, since it was read into DRAM by the ReadBMP()
function (see Code 2.5). This is the variable that should obey strict DRAM rules in Table 3.3.
I guess we cannot do better than accessing it once to read 9600 B of data and copying this
data into our local memory area. This makes it 100% DRAM-friendly.

unsigned char Buffer[16384]; // This is the buffer to use to get the entire row
...
for(...){
// bulk copy from DRAM to cache
memcpy((void *) Buffer, (void *) TheImage[row], (size_t) ip.Hbytes);
...
while(...){
... =Buffer[...]
... =Buffer[...]
... =Buffer[...]
Buffer[...]=Buffer[...]
...
Buffer[...]=...
Buffer[...]=...
...
}
// bulk copy back from cache to DRAM
memcpy((void *) TheImage[row], (void *) Buffer, (size_t) ip.Hbytes);
...

The big question is: Why is the local variable Buffer[] fair game? We modified the
innermost loop and made it access the Buffer[] array as terribly as we were accessing
TheImage[] before. What is so different with the Buffer[] array? Also, another nagging
question is the claim that the contents of the Buffer[] array will be “cached.” Where did
that come from? There is no indication in the code that says “put these 9600 bytes into the
cache.” How are we so sure that it does go into cache? The answer is actually surprisingly
simple and has everything to do with the design of the CPU architecture.
A CPU caching algorithm predicts which values inside DRAM (the “bad area”) should
be temporarily brought into the cache (the “good area”). These guesses do not have to
be 100% accurate, since if the guess is bad, it could always be corrected later. The result
is an efficiency penalty rather than a crash or something. Bringing “recently used DRAM
contents” into the cache memory is called caching. The CPU could get lazy and bring
everything into the cache, but this is not possible since there are only small amounts of
cache memory available. In the i7 family processors, L1 cache is 32 KB for data elements
and L2 cache is 256 KB. L1 is faster to access than L2. Caching helps for three major reasons:
• Access Patterns: Cache memory is SRAM (static random access memory), not
DRAM like the main memory. The rules governing SRAM access patterns are a lot
less strict as compared to the DRAM efficiency rules listed in Table 3.3.
• Speed: Since SRAM is much faster than DRAM, accessing cache is substantially
faster once something is cached.
• Isolation: Each core has its own cache memory (L1$ and L2$). So, if each thread
were accessing up to 256 KB of data frequently, this data would be very efficiently
cached in that core’s cache and would not bother the DRAM.

CODE 3.2: imflipPM.c MTFlipVM() {...}


Memory-friendly version of MTFlipV() (Code 2.7) that obeys the rules in Table 3.3.

void *MTFlipVM(void* tid)


{
struct Pixel pix; //temp swap pixel
int row, row2, col; // need another index pointer ...
unsigned char Buffer[16384]; // This is the buffer to get the first row
unsigned char Buffer2[16384]; // This is the buffer to get the second row

long ts = *((int *) tid); // My thread ID is stored here


ts *= ip.Vpixels/NumThreads/2; // start index
long te = ts+(ip.Vpixels/NumThreads/2)-1; // end index

for(row=ts; row<=te; row++){


memcpy((void *) Buffer, (void *) TheImage[row], (size_t) ip.Hbytes);
row2=ip.Vpixels-(row+1);
memcpy((void *) Buffer2, (void *) TheImage[row2], (size_t) ip.Hbytes);
// swap row with row2
memcpy((void *) TheImage[row], (void *) Buffer2, (size_t) ip.Hbytes);
memcpy((void *) TheImage[row2], (void *) Buffer, (size_t) ip.Hbytes);
}
pthread_exit(NULL);
}

We will get into the details of how the CPU cores and CPU main memory work together
in Chapter 4. However, we have learned enough so far about the concept of buffering to
improve our code. Note that caching is extremely important for CPUs and even more so
for GPUs. So, understanding the buffering concept that causes data to be cached is extremely
important. There is no way to tell the CPU to cache something explicitly, although some
theoretical research has investigated this topic. It is done completely automatically by the
CPU. However, the programmer can influence the caching dramatically by the memory
access patterns of the code. We experienced first-hand what happens when the memory
access patterns are chaotic like the ones shown in Code 2.7 and Code 2.8. The CPU caching
algorithms simply cannot correct for these chaotic patterns, since their simplistic caching/
eviction algorithms throw in the towel. The compiler cannot correct for these either, since
it literally requires the compiler to read the programmer’s mind in many cases! The only
thing that can help the performance is the logic of the programmer.
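To see this influence first-hand, here is a minimal, self-contained sketch (my own illustration, not part of imflipP.c or imflipPM.c; the 64 MB array size and the 4 KB stride are arbitrary choices). It reads the same amount of data twice: once sequentially, which keeps consecutive accesses within cached lines and the open DRAM row, and once with a page-sized stride, which defeats both the caches and the row buffer. On most machines the strided loop is noticeably slower, even though both loops perform exactly the same number of additions.

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define N (64*1024*1024)     // 64 MB of bytes: much larger than any L3$
#define STRIDE 4096          // jump a whole 4 KB page on every access

double TimeMs(void)
{
    struct timeval t;
    gettimeofday(&t, NULL);
    return (double)t.tv_sec*1000.0 + (double)t.tv_usec/1000.0;
}

int main(void)
{
    unsigned char *a = (unsigned char *)malloc(N);
    long i, j, sum = 0;
    double t0, t1;

    for(i=0; i<N; i++) a[i] = (unsigned char)i;    // touch once so the pages exist

    t0 = TimeMs();
    for(i=0; i<N; i++) sum += a[i];                // sequential: cache/row-buffer friendly
    t1 = TimeMs();
    printf("sequential: %7.1f ms (sum=%ld)\n", t1-t0, sum);

    t0 = TimeMs();
    for(j=0; j<STRIDE; j++)                        // strided: consecutive accesses land
        for(i=j; i<N; i+=STRIDE) sum += a[i];      // 4 KB apart, in different lines/rows
    t1 = TimeMs();
    printf("strided   : %7.1f ms (sum=%ld)\n", t1-t0, sum);

    free(a);
    return 0;
}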

3.6.4 MTFlipVM(): The Memory Friendly MTFlipV()


Now, let’s take a look at the redesigned MTFlipVM() function in Code 3.2. We see some
major differences between this code and its inefficient version, the MTFlipV() function, in
Code 2.7. Here are the differences between MTFlipVM() and MTFlipV():
• There are two buffers used in the improved version: 16 KB each.
• In the outermost loop, the first buffer is used to read an entire start image row, while
the second buffer is used to read an entire end image row. They are swapped.
• The innermost loop is eliminated and replaced with bulk memory transfers using the
buffers, although the outermost loop is identical.

TABLE 3.4 imflipPM.c execution times (ms) for the CPUs listed in Table 3.1.
#Threads CPU1 CPU2 CPU3 CPU4 CPU5 CPU6
Serial W 4.116 5.49 3.35 4.11 5.24 3.87
2 W 3.3861 3.32 2.76 2.43 3.51 2.41
3 W 3.0233 2.90 2.66 1.96 2.78 2.52
4 W 3.1442 3.48 2.81 2.21 1.57 1.95
5 W 3.1442 3.27 2.71 2.17 1.47 2.07
6 W 3.05 2.73 2.04 1.69 2.00
8 W 3.02 2.75 2.03 1.45 2.09
9 W 2.74 1.45 2.26
10 W 2.74 1.98 1.45 1.93
12 W 2.75 1.33 1.91
Serial I 35.8 49.4 29.0 34.6 45.3 42.6
2 I 23.7 25.2 14.7 17.6 34.5 21.4
3 I 21.2 17.4 9.8 12.3 19.5 14.3
4 I 22.7 20.1 14.6 17.6 12.5 10.9
5 I 22.3 17.1 11.8 14.3 8.8 15.8
6 I 21.8 15.8 10.5 11.8 10.5 13.2
8 I 18.4 10.4 12.1 8.3 10.0
9 I 9.8 7.5 13.5
10 I 16.6 9.5 11.6 6.9 12.3
12 I 9.2 8.6 11.2

3.7 PERFORMANCE OF IMFLIPPM.C


To run the improved program imflipPM.c, the following command line is used:
imflipPM InputfileName OutputfileName [v/V/h/H/w/W/i/I] [1-128]
where the newly added command line options W and I are to allow using the memory-
friendly MTFlipVM() and MTFlipHM() functions, respectively. Upper or lower case does not
matter, hence the option listing W/w and I/i. The older V and H options still work and
allow access to the memory-unfriendly functions MTFlipV() and MTFlipH(), respectively. This
helps us to compare the two families of functions by running only a single program.
Execution times of the improved program imflipPM.c are shown in Table 3.4. When
we compare these results to their “memory unfriendly” counterpart imflipP.c (listed in
Table 3.2), we see major improvement in performance all across the board. This shouldn’t
be surprising to the readers, since I wouldn’t drag you through an entire chapter to show
marginal improvements! That wouldn’t make for happy readers!
In addition to being substantial, the improvements also differ substantially based on
whether it is for the vertical or horizontal flip. Therefore, instead of making generic com-
ments, let’s dig deep into the results by picking an example CPU and listing the memory-
friendly and memory-unfriendly results side by side. Since almost every CPU is showing
identical improvement patterns, there is no harm in commenting on a representative CPU.
The best pick is CPU5 due to the richness of the results and the possibility of extending
the analysis beyond just a few cores.

3.7.1 Comparing Performances of imflipP.c and imflipPM.c


Table 3.5 lists the imflipP.c and imflipPM.c results only for CPU5. A new column is added to
compare the improvement in speed (i.e., “Speedup”) between the memory-friendly functions

TABLE 3.5 Comparing imflipP.c execution times (H, V type flips in Table 3.2) to
imflipPM.c execution times (I, W type flips in Table 3.4).
#Threads CPU5 CPU5 Speedup CPU5 CPU5 Speedup
V W V→W H I H→I
Serial 181 5.24 34× 66 45.3 1.5×
2 104 3.51 30× 57 34.5 1.7×
3 75 2.78 27× 39 19.5 2×
4 54 1.57 34× 23 12.5 1.8×
5 35 1.47 24× 13 8.8 1.5×
6 35 1.69 20× 17 10.5 1.6×
8 26 1.45 18× 13 8.3 1.6×
9 25 1.45 17× 12 7.5 1.6×
10 23 1.45 16× 11 6.9 1.6×
12 28 1.33 21× 14 8.6 1.6×

MTFlipVM() and MTFlipHM() and the unfriendly ones MTFlipV() and MTFlipH(). It is hard
to make a generic comment such as “major improvement all across the board,” since this is
not really what we are seeing here. The improvements in the horizontal-family and vertical-
family functions are so different that we need to comment on them separately.

3.7.2 Speed Improvement: MTFlipV() versus MTFlipVM()


First, let’s look at the vertical flip function MTFlipVM(). There are some important observa-
tions when going from the MTFlipV() function (“V” column) in Table 3.5 to the MTFlipVM()
function (“W” column):
• Speed improvement changes when we launch a different number of threads
• Launching more threads reduces the speedup gap (34× down to 16×)
• Speedup does continue even for a number of threads that the CPU cannot physically
support (e.g., 9, 10).

3.7.3 Speed Improvement: MTFlipH() versus MTFlipHM()


Next, let’s look at the horizontal flip function MTFlipHM(). Here are the observations when
going from the MTFlipH() function (“H” column) in Table 3.5 to the MTFlipHM() function
(“I” column):
• Speed improvement changes much less as compared to the vertical family.
• Launching more threads changes the speedup a little bit, but the exact trend is hard
to quantify.
• It is almost possible to quantify the speedup as a “fixed 1.6×,” with some minor
fluctuations.

3.7.4 Understanding the Speedup: MTFlipH() versus MTFlipHM()


Table 3.5 will take a little bit of time to digest. We will need to go through the entirety of
Chapter 4 to appreciate what is going on. However, we can make some guesses in this
chapter. To be able to make educated guesses, let’s look at the facts. First, let’s explain
why there would be a difference in the vertical versus horizontal family of flips, although,
in the end, both of the functions are flipping exactly the same number of pixels:
Comparing the MTFlipH() in Code 2.8 to its memory-friendly version MTFlipHM() in
Code 3.1, we see that the only difference is the local buffering, and the rest of the code is
identical. In other words, if there is any speedup between these two functions, it is strictly
due to buffering. So, it is fair to say that
► Local buffering allowed us to utilize cache memory,
which resulted in a 1.6× speedup.
► This number fluctuates minimally with more threads.

On the other hand, comparing the MTFlipV() in Code 2.7 to its memory-friendly ver-
sion MTFlipVM() in Code 3.2, we see that we turned the function from a core-intensive
function to a memory-intensive function. While MTFlipV() is picking at the data one byte
at a time, and keeping the core’s internal resources completely busy, MTFlipVM() uses
the memcpy() bulk memory copy function and does everything through the bulk mem-
ory transfer, possibly bypassing the cores’ involvement almost completely. The magical memcpy()
function is extremely efficient at copying data from DRAM when you are grabbing a big
chunk of data, as we are here. This is also consistent with our DRAM efficiency rules in
Table 3.3.
If this is all true, why is the speedup saturating? In other words, why are we
getting a lower speedup when we launch more threads? It looks like the execution time stops
improving beyond a certain point, and the speedup settles at ≈ 1.5×, no matter what the number
of threads is. This can actually be explained intuitively as follows (a back-of-the-envelope
bandwidth estimate follows the box below):
► When a program is highly memory-intensive, its performance
will be strictly determined by the memory bandwidth.
► We seem to have saturated the memory bandwidth at ≈ 4 threads.
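Here is the promised back-of-the-envelope estimate (my own sketch, not part of the book’s test harness). It assumes MTFlipVM() reads and writes every byte of dogL.bmp exactly once, i.e., roughly 2 × 3200 × 2400 × 3 bytes of DRAM traffic per flip, and divides that by one of the measured execution times in Table 3.5; caching and write-allocate effects are ignored, so treat the result only as a rough estimate.

#include <stdio.h>

int main(void)
{
    // Image geometry of dogL.bmp (3200 x 2400 pixels, 3 bytes per pixel)
    double Hpixels = 3200.0, Vpixels = 2400.0, BytesPerPixel = 3.0;
    double TimeMs  = 1.45;   // measured MTFlipVM() time on CPU5 with 8 threads (Table 3.5)

    // Assumption: every byte of the image is read once and written once
    double BytesMoved = 2.0 * Hpixels * Vpixels * BytesPerPixel;
    double GBps = (BytesMoved / (TimeMs / 1000.0)) / 1e9;

    printf("Approximate DRAM traffic : %.1f MB\n", BytesMoved / 1e6);
    printf("Approximate bandwidth    : %.1f GB/s\n", GBps);
    // Roughly 32 GB/s: about half of CPU5's 68 GB/s theoretical peak. Since real,
    // sustained bandwidth is typically well below peak, this is consistent with the
    // idea that the flip has become memory-bandwidth-bound.
    return 0;
}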

3.8 PROCESS MEMORY MAP


What happens when you launch the following program at the command prompt?
imflipPM dogL.bmp Output.bmp V 4

First of all, we are asking the executable program imflipPM (or, imflipPM.exe in Windows)
to be launched. To launch this program (i.e., start executing it), the OS creates a process with
a Process ID assigned to it. When this program is executing, it will need three different
memory areas (a small sketch after this list shows where each kind of variable ends up):

• A stack area to store the function call return addresses and arguments that are
passed/returned onto/from the function calls. This area grows from top to bottom
(from high address to the low addresses), since this is how each microprocessor uses
a stack.
• A heap area to store the dynamically allocated memory contents using the malloc()
function. This memory area grows in the opposite direction of the stack to al-
low the OS to use every possible byte of memory without bumping into the stack
contents.

• A code area to store the program code and the constants that were declared within
the program. This area is not modified during execution; the constants are stored
here because they, too, are never modified.
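Here is the small sketch promised above (purely illustrative; the variable names are made up, and the printed addresses will differ from run to run because of address-space randomization). It prints the addresses of a stack variable, a heap block obtained with malloc(), a string constant, and a function, so you can see roughly where the stack, heap, and code + data areas sit in your own process:

#include <stdio.h>
#include <stdlib.h>

const char *Message = "a string constant (lives with the code/data area)";

void SomeFunction(void) { }                    // its address lies in the code area

int main(void)
{
    int   StackVar = 0;                        // lives on main()'s stack
    char *HeapVar  = malloc(1024);             // lives on the heap

    printf("stack variable : %p\n", (void *)&StackVar);
    printf("heap block     : %p\n", (void *)HeapVar);
    printf("string constant: %p\n", (void *)Message);
    printf("function code  : %p\n", (void *)SomeFunction);

    free(HeapVar);
    return 0;
}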
The memory map of the process that the OS created will look like Figure 3.3: First, since
the program is launched with only a single thread that is running main(), the memory map
looks like Figure 3.3 (left). As the four pthreads are launched using pthread_create(),
the memory map will look like Figure 3.3 (right). The stack of each thread is saved even
if the OS decides to switch out of that thread to allow another one to run (i.e., context
switching). The context of the thread is saved in the same memory area. Furthermore, the
code is on the bottom memory area and the shared heap among all threads is just above
the code. This is all the threads need to resume their operation when they are scheduled to
run again, following a context switch.
The size of the stack and the heap are not known to the OS when launching imflipPM
for the first time. There are default settings that can be modified: Unix and Mac OS allow
specifying them at the command prompt using switches, while Windows allows changing
them by right-clicking the application and modifying its properties. Since the programmer
has the best idea of how much stack and heap a program needs, generous stack and heap
areas should be allocated to applications to avoid a core dump, which occurs when an
invalid memory address access happens because these memory areas clash with
each other.
Let’s look at our favorite Figure 2.1 again; it shows 1499 launched threads and 98
processes. What this means is that many of the processes that the OS launched internally are
multithreaded, very possibly all of them, resembling the memory map shown in Figure 3.3;
[Figure 3.3 diagram: memory allocated for the process by the OS. Left (single-threaded program): one stack, a heap, and the program code + data. Right (multithreaded program): a stack region plus other per-thread info for each thread (Thread 1, Thread 2, Thread 3, ...), a shared heap, and the program code + data shared by all threads.]

FIGURE 3.3The memory map of a process when only a single thread is running within
the process (left) or multiple threads are running in it (right).

each process seems to have launched an average of 15 threads, all of which must have a very
low activity ratio. We saw what happens when even 5 or 6 threads are super active for a short
period of time; in Figure 2.1, if all 1499 threads had a high activity ratio like the threads
that we wrote so far, your CPU would possibly choke and you would not even be able to
move your mouse on your computer.
There is another thing to keep in mind when it comes to the 1499 threads: the OS
writers must design their threads to be as thin as possible to keep the OS from interfering
with application performance. In other words, if any of the OS threads are creating a lot
of disturbance when changing their status from Runnable to Running, they will over-
tax some of the core resources and will not allow their hyper-thread to work efficiently
when one of your threads is scheduled alongside an OS thread. Of course, not every task
can be performed so thinly, and what I just described about the OS has its limits. The
other side of the coin is that the application designer should pay attention to making the
application threads thin. We will only go into the details of this a little bit, since this is not
a CPU parallel programming book, but rather, a GPU one. Still, whenever I get a chance to
point out how a CPU thread could have been made a little thinner, I will do so throughout
the book.

3.9 INTEL MIC ARCHITECTURE: XEON PHI


One interesting parallel computing platform that is close to GPUs is Many Integrated Core
(MIC), an architecture introduced by Intel to compete with Nvidia and AMD GPU archi-
tectures. MIC devices, sold under the Xeon Phi model name, incorporate quite a few x86-compatible
in-order (inO) cores that are capable of running more than the two-threads-per-core that is standard
in Intel’s out-of-order (OoO) Core i7 architectures. For example, the Xeon Phi 5110P unit that I will
provide test results for incorporates 60 cores and 4 threads/core. Therefore, it is capable of
executing 240 simultaneous threads.
As opposed to the cores inside a Core i7 CPU that work close to 4 GHz, each Xeon Phi
core works at only 1.053 GHz, nearly 4× slower. To overcome this disadvantage, the Xeon Phi is
architected with 30 MB of cache memory and 16 memory channels, rather than the 4 found in
modern Core i7 processors, and provides a 320 GBps memory bandwidth, which
is around 5−10× higher than the main memory bandwidth of a Core i7. Additionally, it
has a 512-bit vector engine that is capable of performing 8 double-precision floating point
operations each cycle. Therefore, it is capable of a very high TFLOPS (tera floating-point
operations per second) processing rate. Rather than classifying the Xeon Phi as a CPU, it is more
appropriate to classify it as a throughput engine, which is designed to process a lot of data
(particularly scientific data) at a very high rate.
The Xeon Phi unit can be used in one of two ways:
• It could be used “almost” as a GPU with the OpenCL language. In this mode of
operation, the Xeon Phi will be treated as a device that is outside the CPU; a de-
vice is connected to the CPU through an I/O bus, which will be PCI Express in
our case.
• It could be used as “almost” a CPU with its own compiler, icc. After you compile for it,
you remote-connect to mic0 (which is a connection to the lightweight OS within the Xeon
Phi) and run the code on mic0. In this mode of operation, the Xeon Phi is again a device
with its own OS, so data must be transferred from the CPU to the Xeon Phi’s work
area. This transfer is done using familiar Unix commands; scp (secure copy) is used
to transfer the data from the host into the Xeon Phi.

TABLE 3.6 Comparing imflipP.c execution times (H, V type flips in Table 3.2) to
imflipPM.c execution times (I, W type flips in Table 3.4) for Xeon Phi 5110P.
Xeon Phi Speedup Speedup
#Threads V W V→W H I H→I
Serial 673 60.9 11× 358 150 2.4×
2 330 30.8 10.7× 179 75 2.4×
4 183 16.4 11.1× 90 38 2.35×
8 110 11.1 9.9× 52 22 2.35×
16 54 11.9 4.6× 27 15 1.8×
32 38 16.1 2.4× 22 18 1.18×
64 39 29.0 1.3× 28.6 29.4 0.98×
128 68 56.7 1.2× 48 53 0.91×
256 133 114 1.15× 90 130 0.69×
512 224 234 0.95× 205 234 0.87×

Here is the command line to compile imflipPM.c to execute on Xeon Phi to get the
performance numbers in Table 3.6:

$ icc -mmic -pthread imflipPM.c ImageStuff.c -o imflipPM


$ scp imflipPM dogL.bmp mic0:~
$ ssh mic0
$ ./imflipPM dogL.bmp flipped.bmp H 60
Executing the multi-threaded version with 60 threads ...
Output BMP File name: flipped.bmp (3200 x 2400)
Total execution time: 27.4374 ms. (0.4573 ms per thread).
Flip Type = ’Horizontal’ (H) (3.573 ns/pixel)

Performance results for Xeon Phi 5110P, executing the imflipPM.c program are shown
in Table 3.6. While there is a healthy improvement from multiple threads up to 16 or 32
threads, the performance improvement limit is reached at 32 threads. Sixty-four threads
provides no additional performance improvement. The primary reason for this is that the
threads in our imflipPM.c program are so thick that they cannot take advantage of the
multiple threads in each core.

3.10 WHAT ABOUT THE GPU?


Now that we understand how the CPU parallel programming story goes, what about the
GPU? I promised that everything we learned in the CPU world would be applicable to
the GPU world. Now, imagine that you had a CPU that allowed you to run 1000 or more
cores/threads. That’s the over-simplified story of the GPU. However, as you see, it is not
easy to keep increasing the number of threads, since your performance eventually stops
improving beyond a certain point. So, a GPU is not simply like a CPU with thousands of
cores. Major architectural improvements had to be made inside the GPU to eliminate all of
the core and memory bottlenecks we just discussed in this chapter. And, even after that, a
lot of responsibility falls on the GPU programmer to make sure that his or her program is
not encountering these bottlenecks.

The entire Part I of this book is dedicated to understanding how to “think parallel”;
in fact, even this is not enough. You have to start thinking “massively parallel.” When we
had 2, 4, 8 threads to execute in the examples shown before, it was somehow easy to adjust
the sequence to make every thread do useful work. However, in the GPU world, you will
be dealing with thousands of threads. Teaching how to think in such an absurdly parallel
world should start by learning how to sequence two threads first! This is the reason why
the CPU environment was perfect to warm up to parallelism, and is the philosophy of this
book. By the time you finish Part I of this book, you will not only learn CPU parallelism,
but will be totally ready to take on the massive parallelism that the GPUs bring you in
Part II.
If you are still not convinced, let me mention this to you: GPUs actually support hun-
dreds of thousands of threads, not just thousands! Convinced yet? A corporation like IBM
with hundreds of thousands of employees can run as well as a corporation with 1 or 2 em-
ployees, and, yet, IBM is able to harvest the manpower of all of its employees. But, it takes
extreme discipline and a systematic approach. This is what the GPU programming is all
about. If you cannot wait until we reach Part II to learn GPU programming, you can peek
at it now; but, unless you understand the concepts introduced in Part I, something will
always be missing.
GPUs are here to give us sheer computational power. A GPU program that works 10×
faster than a comparable CPU program is better than one that works only 5× faster. If
somebody could come in and rewrite the same GPU program to be 20× faster, that person
is the king (or queen). There is no point in writing a GPU program unless you are targeting
speed. There are three things that matter in a GPU program: speed, speed, and speed ! So,
the goal of this book is to get you to be a GPU programmer that writes super-fast GPU
code. This doesn’t happen unless we systematically learn every important concept, so, in
the middle of a program, when we encounter some weird bottleneck, we can explain it and
remove the bottleneck. Otherwise, if you are going to write slow GPU code, you might as
well spend your time learning much better CPU multithreading techniques, since there is
no point in using a GPU unless your code gets every little bit of extra speed out of it. This
is the reason why we will time our code throughout the entire book and will find ways to
make our GPU code faster.

3.11 CHAPTER SUMMARY


We now have an idea about how to write efficient multithreaded code. Where do we go from
here? First, we need to be able to quantify everything we discussed in this chapter: What
is memory bandwidth? How do the cores really operate? How do the cores get data from
the memory? How do the threads share the data? Without answering these, we will only
be guessing why we got any speedup.
It is time to quantify everything and get a 100% precise understanding of the archi-
tecture. This is what we will do in Chapter 4. Again, everything we learn will be readily
applicable to the GPU world, although there will be distinct differences, which I will point
out as they come up.
CHAPTER 4

Understanding the Cores and Memory

When we say hardware architecture, what are we talking about?
The answer is: everything physical. What does “everything physical” include? CPU,
memory, I/O controller, PCI express bus, DMA controller, hard disk controller, hard disk(s),
SSDs, CPU chipsets, USB ports, network cards, CD-ROM controller, DVDRW, I can go on
for a while ... The big question is: which one of these do we care about as a programmer?
The answer is: CPU and memory, and more specifically, cores inside the CPU and
memory. Especially if you are writing high performance programs (like the ones in this
book), > 99% of your program performance will be determined by these two things. Since
the purpose of this book is high-performance programming, let’s dedicate this chapter to
understanding the cores and memory.

4.1 ONCE UPON A TIME ... INTEL ...


Once upon a time, in the early 1970s, a tiny Silicon Valley company named INTEL, em-
ploying about 150 people at the time, designed a programmable chip in the beginning days
of the concept of a “microprocessor,” or “CPU,” a digital device that can execute a program
stored in memory. Every CPU had the capability to address a certain amount of memory,
primarily determined at design time by the CPU designers, based on technological and
business-related constraints.
The program (aka, CPU instructions), and the data (that was fed into the program)
had to be stored somewhere. INTEL designed the 4004 processor to have 12 address bits,
capable of addressing 4096 Bytes (i.e., 2^12). Each piece of information was stored in 8-bit
units (a Byte). So, all together, the 4004 could execute a program that was, say, 1 KB and could
work on data that was in the remaining 3 KB. Or, maybe, some customers only needed to
store 1 KB program and 1 KB data: so, they would have to buy the memory chips to attach
to the 4004 somewhere else.
Although INTEL designed support chips, such as an I/O controller and memory con-
troller to allow their customers to interface the 4004 CPU to the memory chips they bought
somewhere else, they did not particularly focus on making memory chips themselves. This
might look a little counterintuitive at first, but, from a business standpoint, it makes a lot
of sense. At the time of the release of the 4004, one needed at least 6 or 7 other types of
interface chips to properly interface other important devices to the 4004 CPU. Out of those
6 or 7 different chip types, one was very special: the memory chips.


FIGURE 4.1 Inside a computer containing an i7-5930K CPU [10] (CPU5 in Table 3.1),
and 64 GB of DDR4 memory. This PC has a GTX Titan Z GPU that will be used
to test a lot of the programs in Part II.

4.2 CPU AND MEMORY MANUFACTURERS


Much like 30–40 years ago, INTEL still doesn’t make the memory chips as of the year
2017, at the time of the projected publication of this book. The players in the memory
manufacturing world are Kingston Technology, Samsung, and Crucial Technology, to name
a few. So, the trend in the 4004 days four decades ago never changed. Memory manufacturers
are different from CPU manufacturers, although a lot of the CPU manufacturers make their
own support chips (chipsets). Figure 4.1 shows the inside of the PC containing the CPU5 in
our Table 3.1. This CPU is an i7-5930K [10], made by INTEL. However, the memory chips
in this computer (top left of Figure 4.1) are made by a completely different manufacturer:
Kingston Technology Corp. No big surprise for the manufacturer of the GPU: Nvidia Corp!
The SSD (solid state disk) is manufactured by Crucial Technology, and the power supply by Corsair.
Ironically, the CPU liquid cooler in Figure 4.1 is not made by INTEL; it is made by the
same company that makes the chassis and the power supply (Corsair).
Support chips (which were later called chipsets) are made out of logic gates, AND,
OR, XOR gates, etc. Their primary building blocks are metal oxide semiconductor (MOS)
transistors. For example, the X99 chipset in Figure 4.1 was made by INTEL (not marked
in the figure). This chip also controls the PCI Express bus that connects the GPU
to the CPU (marked in Figure 4.1), as well as the SATA3 bus that connects to
the SSD.

4.3 DYNAMIC (DRAM) VERSUS STATIC (SRAM) MEMORY


While CPU manufacturers have all the interest in the world to design and manufacture
their chipsets, why are the memory chips so special? Why do the CPU manufacturers have
no interest in making memory chips? To answer this, first let’s look at different types of
memory.

4.3.1 Static Random Access Memory (SRAM)


This type of memory is still made with MOS transistors, something that is already the
building block of the CPUs and their chipsets. It is easy for the CPU manufacturers to
incorporate this type of memory into their CPU design, since it is made out of the same
material. About 10 years after the introduction of the first CPUs, CPU designers introduced
a type of SRAM that could be built right into the CPU. They called it cache memory that
was able to buffer a very small portion of the main memory. Since most computer programs
require access to very small portions of data in a repeated fashion, the effective speedup by
the introduction of the cache memory was significant, although only very small amounts of
this type of memory were possible to incorporate into the CPU.

4.3.2 Dynamic Random Access Memory (DRAM)


This type of memory was the only option to manufacture huge amounts of memory. For
example, as of today, only 8–30 MB of cache memory can be built into the CPU, while a
computer with 32 GB of main memory (DRAM) is mainstream (a stunning 1000× differ-
ence). To be able to build DRAMs, a completely different technology has to be employed:
While the building block of chipsets and CPUs is primarily MOS transistors, the building
block of DRAM is extremely small capacitors that store charge. This charge, with proper
interface circuitry, is interpreted as data. There are a few things that are very different with
DRAM:
• Since the charge is stored in extremely small capacitors, it drains (i.e., leaks) after a
certain amount of time (something like 50 ms).
• Due to this leakage, data has to be read and put back on the DRAM (i.e., refreshing).
• It makes no sense to allow byte-at-a-time access to data, considering the disadvantages
of refreshing. So, the data is accessed in big rows at a time (e.g., 4 KB at a time).
• There is a long delay in getting a row of data, although, once read, access to that row
is extremely fast.
• In addition to being row accessible, DRAMs have all sorts of other delays, such as row-
to-row delay, etc. Each one of these parameters is specified by the memory interface
standards that are defined by a consortium of companies.

4.3.3 DRAM Interface Standards


How can CPU manufacturers control the compatibility of the memory chips that some other
companies are making? The answer is: memory interface standards. From the first days of
the introduction of the memory chips decades ago, there was a need to define standards for
memory chips: SDRAM, DDR (double data rate), DDR2, DDR3, and finally, in 2015, the

DDR4 standard (contained inside the PC in Figure 4.1). These standards are determined by
a large consortium of chip manufacturers that define such standards and precise timing of
the memory chips. If the memory manufacturers design their memory to be 100% compliant
with these standards, and INTEL designs their CPU to be 100% compliant, there is no need
for both of these chips to be manufactured by the same company.
In the past four decades, the main memory was always made out of DRAM. A new
DRAM standard was released every 2–3 years to take advantage of the exciting develop-
ments in the DRAM manufacturing technology. As the CPUs improved, so did the DRAM.
However, not only did the improvements in CPU and DRAM technology follow different
patterns, but improvement also meant something different for CPUs and DRAMs:
• For CPU designs, improvement meant more work done per second.
• For DRAM designs, improvement meant more data read per second (bandwidth) as
well as more storage (capacity).
CPU manufacturers improved their CPUs using better architectural designs and by tak-
ing advantage of more MOS transistors that became available at each shrinking technology
node (130 nm, 90 nm, ... and 14 nm as of 2016). On the other hand, DRAM manufacturers
improved their memories by the ability to continuously pack more capacitors into the same
area, thereby resulting in more storage. Additionally, they were able to continuously improve
their bandwidths by the newer standards that became available (e.g., DDR3, DDR4).

4.3.4 Influence of DRAM on our Program Performance


The most important question for us as programmers is this: how does this separation of the
CPU versus DRAM manufacturing technology influence the performance of our programs?
The characteristics of the SRAM versus DRAM, listed in Section 4.3, will stay relatively
the same in the foreseeable future. While the CPU manufacturers will always try to increase
the amount of cache memory inside their CPU by using more SRAM-based cache memory,
DRAM manufacturers will always try to increase the bandwidth and storage area of the
DRAMs. However, they will be less concerned about access speeds to small amounts of
memory, since this should be remedied by the increasing amounts of cache memory inside
the CPUs. The access speeds of DRAM are as follows:
• DRAM is accessed one row at a time, each row being the smallest accessible amount
of DRAM memory. A row is approximately 2–8 KB in modern DRAMs.
• To access a row, a certain amount of time (latency) is required. However, once the
row is accessed (brought into the row buffer internally by the DRAM), this row is
practically free to access going forward.
• The latency in accessing DRAM is about 200–400 cycles for a CPU, while accessing
subsequent elements in the same row is just a few cycles.
Considering all of these DRAM facts, the clear message to a programmer is this:
► Write your programs in such a way that:
i) Accessing data from distant memory locations should be in large
chunks, since we know that this data will be in DRAM.
ii) Accessing small amounts should be highly localized and repetitive,
since we know that this data will be in SRAM (cache).
iii) Pay extreme attention to multiple threads, since there is only one
memory, but multiple threads: simultaneous threads might cause
bad DRAM access patterns even though, individually, each looks okay.

4.3.5 Influence of SRAM (Cache) on our Program Performance


Since there is only one main memory (DRAM), as long as we obey the rules set forth in
Table 3.3 and pay attention to the comments in Section 4.3.4, we should have fairly good
DRAM performance. But, cache memory, made out of SRAM, is a little different. First
of all, there are multiple types of cache memory, making their design hierarchical. For a
modern CPU like the one shown in the PC in Figure 4.1, there are three different types of
cache memory, built into the CPU:
• L1$ is 32 KB for data (L1 data cache, or L1D$), and 32 KB for instructions (L1
instruction cache, or L1I$). Total L1$ is 64 KB. Access to L1$ is extremely fast (4
cycle load-to-use latency). Each core has its own L1$.
• L2$ is 256 KB. There is no separation between data or instructions. Access to L2$ is
fairly fast (11–12 cycles). Each core has its own L2$.
• L3$ is 15 MB. Access to L3$ is faster than DRAM, but much slower than L2$ (≈ 22
cycles). L3$ is shared by all of the cores (6 in this CPU).
• The data that is brought into and evicted out of L1$, L2$, and L3$ is purely controlled
by the CPU and is completely out of the programmer’s control. However, by keeping
the data operations confined into small loops, the programmer can have a significant
impact on the effectiveness of the cache memory operation.
• Alternatively, by disobeying the cache efficiency rules, the programmer could nearly
render the cache memory useless.
• The best way for the programmer to take advantage of caching is to know the exact
sizes of each cache hierarchy and design the programs to stay within these ranges.
Considering all of these SRAM facts, the clear message to a programmer is this (a short
loop-blocking sketch follows this box):
► To take advantage of caching, write your programs so that:
i) each thread accesses 32 KB data regions repetitively;
ii) broader accesses are confined to 256 KB if possible;
iii) when considering all of the launched threads, their cumulative
data accesses are confined within L3$ (e.g., 15 MB);
iv) if you must exceed this, make sure that there is
heavy usage of L3$ before exceeding this region.
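As a concrete illustration of item (i), here is a minimal loop-blocking sketch (my own example, not taken from the book’s code base; the 32 KB tile size is an assumption based on the L1D$ size quoted above, and the per-byte XOR is just placeholder work). By running every pass over one 32 KB tile before moving to the next tile, the working set stays resident in L1D$ instead of streaming the whole array from DRAM once per pass.

#include <stddef.h>

#define TILE (32*1024)            // assumed L1D$ capacity per core (32 KB)

// Apply 'passes' rounds of a simple per-byte transformation to a large array.
// Finishing all passes on one 32 KB tile before advancing keeps that tile
// cached in L1D$, so only the first pass per tile has to touch DRAM.
void ProcessBlocked(unsigned char *data, size_t n, int passes)
{
    size_t start, i;
    int p;

    for(start=0; start<n; start+=TILE){
        size_t end = (start+TILE < n) ? start+TILE : n;
        for(p=0; p<passes; p++){                          // re-uses the cached tile
            for(i=start; i<end; i++){
                data[i] = (unsigned char)(data[i] ^ 0x55); // placeholder work
            }
        }
    }
}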

4.4 IMAGE ROTATION PROGRAM: IMROTATE.C


We covered a lot of information regarding the cache memory (SRAM) and main memory
(DRAM). It is time to put this information to good use. In this section, I will introduce a
program, imrotate.c, that rotates an image by a specified amount of degrees. This program
will put a lot of pressure on the core and memory components of the CPU, thereby allowing
us to understand their influence on the overall program performance. We will comment
on the performance bottlenecks caused by over-stressing either the cores or the memory
and will improve our program. The improved program, imrotateMC.c, the memory-friendly
(“M”) and core-friendly (“C”) version of imrotate.c, will implement the improvements in an
incremental fashion via multiple different steps, and each step will be explained in detail.

FIGURE 4.2 The imrotate.c program rotates a picture by a specified angle. Origi-
nal dog (top left), rotated +10° (top right), +45° (bottom left), and −75° (bottom
right) clockwise. Scaling is done to avoid cropping of the original image area.

4.4.1 Description of the imrotate.c


The purpose of imrotate.c is to create a program that is both memory-heavy and core-heavy.
imrotate.c will take an image as shown in Figure 4.2 (top left) and will rotate it clockwise by a
specified amount (in degrees). Example outputs of the program are shown in Figure 4.2 for a
+10° rotation (top right), a +45° rotation (bottom left), and a −75° rotation (bottom right).
These angles are all specified clockwise, thereby making the last −75° rotation effectively a
counterclockwise +75° rotation. To run the program, the following command line is used:
imrotate InputfileName OutputfileName [degrees] [1-128]
where degrees specifies the clockwise rotation amount and the next parameter [1–128] is
the number of threads to launch, similar to the previous programs.

4.4.2 imrotate.c: Parametric Restrictions and Simplifications


Some simplifications had to be made in the program to avoid unnecessary complications
and diversion of our attention to unrelated issues. These are
• For rectangular images where the width and height are not equal, part of the rotated
image will end up outside the original image area.

• To avoid this cropping, the resulting image is scaled, so that the resulting image always
fits in its original size.
• This scaling naturally implies empty areas in the resulting image (black pixels with
RGB=000 values are used to fill the blank pixels).
• The scaling is not automatic, i.e., the same exact amount of scaling is applied at the
beginning, thereby leaving more blank area for certain rotation amounts, as is clearly
evidenced in Figure 4.2.

4.4.3 imrotate.c: Theory of Operation


Rotation of each pixel with the original coordinates (x, y) is achieved by multiplying
these coordinates by a rotation matrix that yields the rotated coordinates (x′, y′) as
follows:

$$\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix} \times \begin{bmatrix} x \\ y \end{bmatrix} \tag{4.1}$$

where θ is the rotation angle (specified in radians, $\theta_{rad}$), with a simple conversion from the
user-specified degrees:

$$\theta = \theta_{rad} = \frac{2\pi}{360} \times \theta_{deg} \tag{4.2}$$

When a pixel’s destination coordinates (x′, y′) are determined, all three color components
(RGB) of that pixel are moved to that same (x′, y′) location. Scaling (more accurately,
prescaling) of the image is done by the following formula:

$$d = \sqrt{w^2 + h^2} \implies \begin{cases} \text{ScaleFactor} = \dfrac{w}{d}, & h > w \\[6pt] \text{ScaleFactor} = \dfrac{h}{d}, & h \le w \end{cases} \tag{4.3}$$
where the width (w) and height (h) are the previously introduced Hpixels and Vpixels
attributes, respectively. Based on Equation 4.3, the scale factor is determined by whichever
of these two values is greater, so that the rotated image is never cropped. An important note about this pro-
gram is that this rotation cannot be easily implemented without using additional im-
age memory. For that purpose, an additional image function called CreateBlankBMP()
(shown in Code 4.1) is introduced inside ImageStuff.c and its header is placed in
ImageStuff.h.
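Before looking at the code, here is a quick standalone numeric check of Equations 4.1 through 4.3 (a sketch of my own, not part of imrotate.c; the sample coordinate and the 45-degree angle are arbitrary). Note that Rotate() in Code 4.4 first transposes image coordinates into Cartesian coordinates, so the signs of its sine terms look slightly different from Equation 4.1.

#include <stdio.h>
#include <math.h>

int main(void)
{
    double w = 3200.0, h = 2400.0;               // dogL.bmp dimensions (Hpixels, Vpixels)
    double theta = 2*3.141592/360.0 * 45.0;      // Eq. 4.2: 45 degrees -> radians
    double x = 100.0, y = 50.0;                  // a sample coordinate (origin = image center)

    double xr =  cos(theta)*x + sin(theta)*y;    // Eq. 4.1, first row
    double yr = -sin(theta)*x + cos(theta)*y;    // Eq. 4.1, second row

    double d = sqrt(w*w + h*h);                  // Eq. 4.3: image diagonal
    double s = (h > w) ? (w/d) : (h/d);          // pick the smaller dimension over d

    printf("(%.1f, %.1f) rotates and scales to (%.2f, %.2f)\n", x, y, xr*s, yr*s);
    return 0;
}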

CODE 4.1: ImageStuff.c CreateBlankBMP() {...}


The CreateBlankBMP() function creates an image filled with zeroes (blank pixels).

unsigned char** CreateBlankBMP()


{
int i,j;

unsigned char** img=(unsigned char **)malloc(ip.Vpixels*sizeof(unsigned char*));


for(i=0; i<ip.Vpixels; i++){
img[i] = (unsigned char *)malloc(ip.Hbytes * sizeof(unsigned char));
memset((void *)img[i],0,(size_t)ip.Hbytes); // zero out every pixel
}
return img;
}

CODE 4.2: imrotate.c ... main() {...


First part of the main() function in imrotate.c converts user supplied rotation in
degrees to radians and calls the Rotate() function to rotate the image.

#include <pthread.h>
#include <stdint.h>
#include <ctype.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <sys/time.h>
#include "ImageStuff.h"
#define REPS 1
#define MAXTHREADS 128
long NumThreads; // Total number of threads
int ThParam[MAXTHREADS]; // Thread parameters ...
double RotAngle; // rotation angle
pthread_t ThHandle[MAXTHREADS]; // Thread handles
pthread_attr_t ThAttr; // Pthread attributes
void* (*RotateFunc)(void *arg); // Func. ptr to rotate img
unsigned char** TheImage; // This is the main image
unsigned char** CopyImage; // This is the copy image
struct ImgProp ip;
...
int main(int argc, char** argv)
{
int RotDegrees, a, i, ThErr;
struct timeval t;
double StartTime, EndTime, TimeElapsed;

switch (argc){
case 3 : NumThreads=1; RotDegrees=45; break;
case 4 : NumThreads=1; RotDegrees=atoi(argv[3]); break;
case 5 : NumThreads=atoi(argv[4]); RotDegrees = atoi(argv[3]); break;
default: printf("\n\nUsage: imrotate inputBMP outputBMP [RotAngle] [1-128]");
printf("\n\nExample: imrotate infilename.bmp outname.bmp 45 8\n\n");
printf("\n\nNothing executed ... Exiting ...\n\n");
exit(EXIT_FAILURE);
}
if((NumThreads<1) || (NumThreads>MAXTHREADS)){
printf("\nNumber of threads must be between 1 and %u... \n",MAXTHREADS);
printf("\n’1’ means Pthreads version with a single thread\n");
printf("\n\nNothing executed ... Exiting ...\n\n"); exit(EXIT_FAILURE);
}
if((RotDegrees<-360) || (RotDegrees>360)){
printf("\nRotation angle of %d degrees is invalid ...\n",RotDegrees);
printf("\nPlease enter an angle between -360 and +360 degrees ...\n");
printf("\n\nNothing executed ... Exiting ...\n\n"); exit(EXIT_FAILURE);
}
...

CODE 4.3: imrotate.c main() ...}


Second part of the main() function in imrotate.c creates a blank image and launches
multiple threads to rotate TheImage[] and place it in CopyImage[].

...
if((RotDegrees<-360) || (RotDegrees>360)){
...
}
printf("\nExecuting the Pthreads version with %u threads ...\n",NumThreads);
RotAngle=2*3.141592/360.000*(double) RotDegrees; // Convert the angle to radians
printf("\nRotating %d deg (%5.4f rad) ...\n",RotDegrees,RotAngle);
RotateFunc=Rotate;

TheImage = ReadBMP(argv[1]);
CopyImage = CreateBlankBMP();
gettimeofday(&t, NULL);
StartTime = (double)t.tv_sec*1000000.0 + ((double)t.tv_usec);
pthread_attr_init(&ThAttr);
pthread_attr_setdetachstate(&ThAttr, PTHREAD_CREATE_JOINABLE);
for(a=0; a<REPS; a++){
for(i=0; i<NumThreads; i++){
ThParam[i] = i;
ThErr = pthread_create(&ThHandle[i], &ThAttr, RotateFunc,
(void *)&ThParam[i]);
if(ThErr != 0){
printf("\nThread Creation Error %d. Exiting abruptly... \n",ThErr);
exit(EXIT_FAILURE);
}
}
pthread_attr_destroy(&ThAttr);
for(i=0; i<NumThreads; i++){ pthread_join(ThHandle[i], NULL); }
}
gettimeofday(&t, NULL);
EndTime = (double)t.tv_sec*1000000.0 + ((double)t.tv_usec);
TimeElapsed=(EndTime-StartTime)/1000.00;
TimeElapsed/=(double)REPS;
//merge with header and write to file
WriteBMP(CopyImage, argv[2]);

// free() the allocated area for the images


for(i = 0; i < ip.Vpixels; i++) { free(TheImage[i]); free(CopyImage[i]); }
free(TheImage); free(CopyImage);
printf("\n\nTotal execution time: %9.4f ms. ",TimeElapsed);
if(NumThreads>1) printf("(%9.4f ms per thread). ",
TimeElapsed/(double)NumThreads);
printf("\n (%6.3f ns/pixel)\n",
1000000*TimeElapsed/(double)(ip.Hpixels*ip.Vpixels));
return (EXIT_SUCCESS);
}

CODE 4.4: imrotate.c Rotate() {...}


The Rotate() function takes each pixel from the TheImage[] array, scales it, and applies
the rotation matrix in Equation 4.1 and writes the new pixel into the CopyImage[]
array.

void *Rotate(void* tid)


{
long tn;
int row,col,h,v,c, NewRow,NewCol;
double X, Y, newX, newY, ScaleFactor, Diagonal, H, V;
struct Pixel pix;

tn = *((int *) tid);
tn *= ip.Vpixels/NumThreads;

for(row=tn; row<tn+ip.Vpixels/NumThreads; row++)


{
col=0;
while(col<ip.Hpixels*3){
// transpose image coordinates to Cartesian coordinates
c=col/3; h=ip.Hpixels/2; v=ip.Vpixels/2; // integer div
X=(double)c-(double)h;
Y=(double)v-(double)row;

// image rotation matrix


newX=cos(RotAngle)*X-sin(RotAngle)*Y;
newY=sin(RotAngle)*X+cos(RotAngle)*Y;

// Scale to fit everything in the image box


H=(double)ip.Hpixels;
V=(double)ip.Vpixels;
Diagonal=sqrt(H*H+V*V);
ScaleFactor=(ip.Hpixels>ip.Vpixels) ? V/Diagonal : H/Diagonal;
newX=newX*ScaleFactor;
newY=newY*ScaleFactor;

// convert back from Cartesian to image coordinates


NewCol=((int) newX+h);
NewRow=v-(int)newY;
if((NewCol>=0) && (NewRow>=0) && (NewCol<ip.Hpixels)
&& (NewRow<ip.Vpixels)){
NewCol*=3;
CopyImage[NewRow][NewCol] = TheImage[row][col];
CopyImage[NewRow][NewCol+1] = TheImage[row][col+1];
CopyImage[NewRow][NewCol+2] = TheImage[row][col+2];
}
col+=3;
}
}
pthread_exit(NULL);
}

TABLE 4.1 imrotate.c execution times for the CPUs in Table 3.1 (+45◦ rotation).
# HW Threads 2C/4T 4C/8T 4C/8T 4C/8T 6C/12T 8C/16T
i5-4200M i7-960 i7-4770K i7-3820 i7-5930K E5-2650
CPU1 CPU2 CPU3 CPU4 CPU5 CPU6
# SW Threads
1 951 1365 782 1090 1027 845
2 530 696 389 546 548 423
3 514 462 261 368 365 282
4 499 399 253 322 272 227
5 499 422 216 295 231 248
6 387 283 338 214 213
8 374 237 297 188 163
9 237 177 199
10 341 228 285 163 201
12 217 158 171

4.5 PERFORMANCE OF IMROTATE


Execution times of imrotate.c are shown in Table 4.1 for the same CPUs tabulated in
Table 3.1.

4.5.1 Qualitative Analysis of Threading Efficiency


These results were obtained by rotating the image +45◦ clockwise. Aside from being much
slower than the corresponding horizontal/vertical flip programs, here are some observations
from Table 4.1:
• If we look at the CPU5 performance (the CPU inside the personal computer shown in
Figure 4.1), the performance improvement continues steadily, although the threading
efficiency goes down significantly with the increasing number of hardware threads
(noted as HW threads in Table 4.1).
• This is the first program that we are seeing where a 6C/12T CPU (CPU5) seems to be
taking advantage of all of the 12 hardware threads, although the threading efficiency
plummets for high software thread counts (noted as SW threads, i.e., software threads,
in Table 4.1).
• Although certain CPUs have a relative advantage initially (e.g., CPU3 @782 ms), they
lose this advantage with an increased number of SW threads, when compared against
CPUs with more HW threads (e.g., CPU5 is faster with 8 software threads @188 ms
vs. 237 ms for CPU3).
• For now, ignore CPU6, since the Amazon environment is a multiuser environment and
the explanation of its performance is not as straightforward.

4.5.2 Quantitative Analysis: Defining Threading Efficiency


It would be useful to define a metric called multithreading efficiency (or, in short, threading
efficiency) that quantifies how additional software threads are improving program perfor-
mance relatively. If we take CPU5 in Table 4.1 as an example and take the single-thread

TABLE 4.2 imrotate.c threading efficiency (η) and parallelization overhead (1−η) for
CPU3 and CPU5. The last column reports the speedup achieved by using CPU5, which
has more cores/threads, although there is no speedup until 6 SW threads are launched.
# SW CPU3: i7-4770K 4C/8T CPU5:i7-5930K 6C/12T Speedup
Thr Time η 1−η Time η 1−η CPU5→CPU3
1 782 100% 0% 1027 100% 0% 0.76×
2 389 100% 0% 548 94% 6% 0.71×
3 261 100% 0% 365 94% 6% 0.72×
4 253 77% 23% 272 95% 5% 0.93×
5 216 72% 28% 231 89% 11% 0.94×
6 283 46% 54% 214 80% 20% 1.32×
8 237 42% 58% 188 68% 32% 1.26×
9 237 41% 69% 177 65% 35% 1.34×
10 228 34% 66% 163 63% 37% 1.4×
12 217 30% 70% 158 54% 46% 1.37×

execution time of 1027 ms as our baseline (i.e., 100% efficiency), then, when we launch two
threads, we ideally expect half of that execution time (1027/2 = 513.5 ms). However, we see
548 ms. Not bad, but only ≈ 94% of the performance we hoped for.
In other words, launching the additional thread improved the execution time, but hurt
the efficiency of the CPU. Quantifying this efficiency metric (η) is fairly straightforward as
shown in Equation 4.4. One corollary of Equation 4.4 is that parallelization has an overhead
that can be defined as shown in Equation 4.5.

$$\eta = \text{Threading Efficiency} = \frac{\text{Single-Thread Execution Time}}{(\text{N-Thread Execution Time}) \times N} \tag{4.4}$$

$$\text{Parallelization Overhead} = 1 - \text{Threading Efficiency} = 1 - \eta \tag{4.5}$$
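As a quick check of Equations 4.4 and 4.5, here is a tiny helper (a sketch of my own, not part of imrotate.c), evaluated with the CPU5 numbers quoted above (1027 ms serial, 548 ms with 2 threads):

#include <stdio.h>

// Eq. 4.4: threading efficiency; Eq. 4.5 follows as 1 - efficiency
double ThreadingEfficiency(double singleThreadMs, double nThreadMs, int n)
{
    return singleThreadMs / (nThreadMs * (double)n);
}

int main(void)
{
    double eta = ThreadingEfficiency(1027.0, 548.0, 2);   // CPU5, 2 threads (Table 4.2)
    printf("efficiency = %.0f%%, overhead = %.0f%%\n", eta*100.0, (1.0-eta)*100.0);
    // prints: efficiency = 94%, overhead = 6%
    return 0;
}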


In our case, launching 2 threads on CPU5 incurred a 6% parallelization overhead
(i.e., 1 − 0.94 = 0.06). The threading efficiency of imrotate.c for CPU3 and CPU5 is shown
in Table 4.2. CPU3 has 4 cores, 8 hardware threads, and its peak main DRAM memory
bandwidth is 25.6 GB/s (Gigabytes per second) as per Table 3.1. On the other hand,
CPU5 has 6 cores, 12 hardware threads, and its peak DRAM main memory bandwidth is
68 GB/s. The increase in the memory bandwidth should allow more hardware threads to
bring data into the cores faster, thereby avoiding a memory bandwidth saturation when a
higher number of launched software threads demand data from the DRAM main memory
simultaneously. An observation of Table 4.2 shows exactly that.
In Table 4.2, while the decrease in threading efficiency (η) is evident for both CPUs as
we launch more software threads, the rate of fall-off is much less for CPU5. If we consider
89% “highly efficient,” then we observe from Table 4.2 that CPU3 is highly efficient for
3 software threads, while CPU5 is highly efficient for 5 software threads. Alternatively,
if we define 67% as “highly inefficient” (i.e., wasting one out of three threads), then
CPU3 becomes highly inefficient for > 5 threads, while CPU5 is highly inefficient for
> 8 threads.
To provide a direct performance comparison between CPU3 and CPU5, the last column
is added to Table 4.2. This column shows the “speedup” when running the same program
on CPU5 (what is supposed to be a higher performance CPU) as compared to CPU3. Up
to 5 launched software threads (i.e., before CPU3 starts becoming highly inefficient), we

see that CPU3 beats CPU5. However, beyond 5 launched threads, CPU5 beats CPU3 for
any number of threads. The reason has something to do with both the cores and memory
as we will see shortly in the next section. Our imrotate.c program is by nature designed
inefficiently, so, it is not taking advantage of the advanced architectural improvements that
are built into CPU5.
We will get into the architectural details shortly, but, for now, it suffices to comment
that just because a CPU is a later generation doesn’t mean that it will work faster for every
program. Newer-generation CPUs' architectural improvements are typically geared toward
programs that are written well. A program that causes choppy memory and core access
patterns, like imrotate.c, will not benefit from the beautiful architectural improvements of
newer generation CPUs. INTEL’s message to the programmers is clear:
• Newer generation CPUs and memories are always designed to work more
efficiently when rules, like the ones in Table 3.3, are obeyed.
• The influence of these rules will keep increasing in future generations.
• CPU says: If you're gonna be a bad program designer, I'll be a bad CPU!

4.6 THE ARCHITECTURE OF THE COMPUTER


In this section, we will understand what is going on inside the CPU and memory in detail.
Based on this understanding, we will improve our results of the imrotate.c program. As
mentioned previously, the function named Rotate() in Code 4.4 is what matters to this
program’s performance. So, we will only improve this function. To be able to do this, let’s
understand what happens, at runtime, when this function is being executed.

4.6.1 The Cores, L1$ and L2$


First, let’s look at the structure of a CPU core. I will pick i7-5930K as an example (CPU5
in Table 4.2). The internal structure of an i7-5930K core is shown in Figure 4.3. Each core
has a 64 KB L1 and 256 KB L2 cache memory. Let’s see what these cache memories do:
• L1$ is broken into a 32 KB instruction cache (L1I$) and a 32 KB data cache (L1D$).
Both of these cache memories are the fastest cache memories, as compared to L2$ and
L3$. It takes the processor only a few cycles (e.g., 4 cycles) to access them.
• L1I$ stores the most recently used CPU instructions to avoid continuous fetching of
something that has already been fetched; storing a replica of an instruction is called
caching it.
• L1D$ caches a copy of the data elements, for example the pixels that are read from
the memory.
• The L2$ is used to cache either instructions or data. It is the second fastest cache
memory. The processor accesses L2$ in something like 11 cycles. This is why whatever
is cached in L1$ always goes through L2$ first.
• Data or instructions that are brought from L3$ first land in the L2$ and then go
into L1$. If the cache memory controller decides that it no longer needs some data,
it evicts it (i.e., it un-caches it). However, since L2$ is bigger than L1$, a copy of the
evicted data/instruction is probably still in the L2$ and can be brought back.
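
On a Linux system, you can verify the cache sizes described above for your own CPU with a few
sysconf() calls. This is a minimal sketch of my own (not from the book's code); the
_SC_LEVEL*_CACHE_SIZE names are glibc extensions, so they may report 0 or −1 on other platforms.

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    // glibc extension: query the cache sizes of the CPU this process runs on
    printf("L1 I$ : %ld KB\n", sysconf(_SC_LEVEL1_ICACHE_SIZE) / 1024);
    printf("L1 D$ : %ld KB\n", sysconf(_SC_LEVEL1_DCACHE_SIZE) / 1024);
    printf("L2$   : %ld KB\n", sysconf(_SC_LEVEL2_CACHE_SIZE)  / 1024);
    printf("L3$   : %ld KB\n", sysconf(_SC_LEVEL3_CACHE_SIZE)  / 1024);
    return 0;
}

On the i7-5930K discussed in this chapter, this should report roughly 32 KB, 32 KB, 256 KB, and
15360 KB, respectively.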

[Figure 4.3 is a block diagram of one core: the shared execution units (ALUs, FP ADD,
FP MUL, MUL/DIV, LAGU, SAGU, MOB), the shared prefetch/decode logic, the register
files of the two hardware threads, and the L1 I$ (32 KB), L1 D$ (32 KB), L2$ (256 KB),
and L3$ (15 MB) caches.]
FIGURE 4.3 The architecture of one core of the i7-5930K CPU (the PC in Figure 4.1).
This core is capable of executing two threads (hyper-threading, as defined by Intel).
These two threads share most of the core resources, but have their own register files.

4.6.2 Internal Core Resources


An important observation in Figure 4.3 is that, although this core is capable of executing two
threads, these threads share 90% of the core resources, and have a very minimal amount of
dedicated hardware to themselves. Their dedicated hardware primarily consists of separate
register files. Each thread (denoted as Thread 1 and Thread 2 in Figure 4.3) needs dedicated
registers to save its results, since each thread executes different instructions (e.g., a different
function) and produces results that might be completely unrelated. For example, while one
thread is executing the Rotate() function of our code, the other thread could be executing
a function that is a part of the OS, and has nothing to do with our Rotate() function. An
example could be when the OS is processing a network packet that has just arrived while
we are running our Rotate() function. Figure 4.3 shows how efficient the core architecture
is. The two threads share the following components inside the core:
• The L2$ is shared by both threads to receive the instructions from the main memory
via L3$ (I will explain L3$ shortly).
• The L1 I$ is shared by both threads to receive the instructions from L2$ that belong
to either thread.

• The L1 D$ is shared by both threads to receive/send data from/to either thread.


• Execution units fall into two categories: ALUs (arithmetic logic units) are responsible
for integer operations and for logic operations such as OR, AND, and XOR. FPUs
(floating point units) are responsible for floating point (FP) operations such as FP
ADD and FP MUL (multiplication). Division (whether integer or FP) is significantly
more sophisticated than addition and multiplication, so there is a separate unit for
division. All of these execution units are shared by both threads.
• In each generation, more sophisticated computational units are available as shared ex-
ecution units. However, multiple units are incorporated for common operations that
might be executed by both threads, such as ALUs. Also, Figure 4.3 is overly simpli-
fied and the exact details of each generation might change. However, the ALU-FPU
functionality separation has never changed in the past 3–4 decades of CPU designs.
• The addresses that both threads generate must be calculated to write the data from
both threads back into the memory. For address computations, load and store address
generation units (LAGU and SAGU) are shared by both threads, as well as a unit
that properly orders the destination memory addresses (MOB).

• Instructions are prefetched and decoded only once and routed to the owner thread.
Therefore, prefetcher and decoder are shared by both threads.
• Acronyms are:
ALU = Arithmetic Logic Unit
FPU = Floating Point Unit
FPMUL = Floating Point Dedicated Multiplier
FPADD = FP Adder
MUL/DIV = Dedicated Multiplier/Divider
LAGU = Load Address Generation Unit
SAGU = Store Address Generation Unit
MOB = Memory Order Buffer
The most important message from Figure 4.3 is as follows:
• Our program performance will suffer if both threads inside a core are
requesting exactly the same shared core resources.
• For example, if both threads require heavy floating point operations,
they will pressure the FP resources.
• On the data side, if both threads are extremely memory intensive,
they will pressure L1 D$ or L2$, and eventually L3$ and main memory.

[Figure 4.4 is a block diagram of the full CPU: six cores and the shared 15 MB L3$ sit next
to the queue/uncore/I/O logic and the DDR4 memory controller; the memory controller
connects to the 64 GB DDR4 main memory over the memory bus, while the I/O side
connects through the X99 chipset and the PCI Express bus to the GPUs.]
FIGURE 4.4 Architecture of the i7-5930K CPU (6C/12T). This CPU connects to the
GPUs through an external PCI express bus and memory through the memory bus.

4.6.3 The Shared L3 Cache Memory (L3$)


The internal architecture of the i7-5930K CPU is shown in Figure 4.4. This CPU contains 6
cores, each capable of executing 2 threads as described in Section 4.6.2. While these “twin”
threads in each core share their internal resources peacefully (yeah right!), all 12 threads
share the 15 MB L3$. The impact of the L3$ is so high that INTEL dedicated 20% of the
CPU die area to L3$. Although this gives around 2.5 MB to each core on average, cores that
demand more L3$ could tap into a much larger portion of the L3$. The L3$ cache design
inside i7-5930K is termed Smart Cache by INTEL and is completely managed by the CPU.
Although the programmer (i.e., the program) has no say in how the cache memory
manages its data, he or she has almost complete control over the efficiency of this cache
through the data demand patterns of the program. These data demand patterns by the
threads, in turn, determine the DRAM access patterns, as I emphasized in the development
of Code 3.1 and Code 3.2.
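
To make this concrete, here is a minimal sketch of my own (not part of any program in this book)
showing how the program's data demand pattern alone decides whether the caches work for us or
against us: both functions below compute the same sum over the same 128 MB matrix, but the first
one walks memory consecutively, so every fetched cache line is fully used, while the second one
strides through DRAM and keeps evicting lines before they are reused.

#include <stdlib.h>

#define ROWS 4096
#define COLS 4096

double SumRowMajor(double *m)            // consecutive addresses: cache-friendly
{
    double s = 0.0;
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++)
            s += m[r * COLS + c];
    return s;
}

double SumColMajor(double *m)            // stride of COLS*8 bytes: cache-unfriendly
{
    double s = 0.0;
    for (int c = 0; c < COLS; c++)
        for (int r = 0; r < ROWS; r++)
            s += m[r * COLS + c];
    return s;
}

int main(void)
{
    double *m = (double *)calloc((size_t)ROWS * COLS, sizeof(double));
    if (m == NULL) return 1;
    double a = SumRowMajor(m);           // time these two calls (e.g., with gettimeofday())
    double b = SumColMajor(m);           // to see how much the access pattern matters
    free(m);
    return (a == b) ? 0 : 1;             // identical result, very different run time
}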

4.6.4 The Memory Controller


There is no direct access path from the cores to the main memory. Anything transferred
between DRAM and the cores must first go through the L3$. The operational timing of
DRAM (regardless of whether it is DDR2, DDR3, or DDR4) is so sophisticated that the
i7-5930K CPU architecture dedicates 18% of its chip area to a unit called the memory
controller. The memory controller buffers and aggregates data coming from L3$ to
eliminate the inefficiencies resulting from the block-transfer nature of DRAM.
Additionally, the memory controller is responsible for converting the data streams between
the L3$ and DRAM into the proper format that the destination needs (whether L3$ or
DRAM). For example, while DRAM works with rows of data, L3$ is an associative cache
memory that takes data one line at a time.

4.6.5 The Main Memory


The 64GB DDR4 DRAM memory that the PC in Figure 4.1 has is capable of providing
a peak 68 GB/s of data throughput. Such a high throughput will only be achieved if the
program is reading massive consecutive blocks of data. In all of the programs we saw so far,
how close did we get to this? Let’s search for it. The fastest program we wrote so far was
imflipPM.c that had the MTFlipVM() function as its innermost loop, shown in Code 3.2. The
execution results are shown in Table 3.4. CPU5 was able to flip the image in 1.33 ms. If
we analyze the innermost loop, in summary, the MTFlipVM() function is reading the entire
image (i.e., reading 22 MB) and writing the entire vertically flipped image back (another
22 MB). So, all said and done, this is a 44MB data transfer. Using 12 threads, imflipPM.c
was able to complete the entire operation in 1.33 ms.
If we moved 44 MB in 1.33 ms, what kind of a data rate does this translate to?

Bandwidth Utilization = Data Moved / Time Required = 44 MB / 1.33 ms ≈ 33 GB/s (CPU5)        (4.6)
Although this is only 49% of CPU5's peak memory bandwidth of 68 GB/s, it is
far closer to the peak than any other program we have seen. This computation also clarifies
the fact that increasing the number of threads is not helping nearly as much as it did with
other programs, since we are getting fairly close to saturating the Memory Bus, as depicted
in Figure 4.4. We know that we cannot ever exceed the 68 GB/s, since the memory controller
(and the main memory) simply cannot transfer data that fast.
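
If you want to estimate how close your own PC gets to its peak DRAM bandwidth, a rough measurement
in the spirit of Equation 4.6 is to time a large block copy. The sketch below is my own illustration
(the 512 MB buffer size and the use of memcpy() and gettimeofday() are arbitrary choices, and a
single-threaded memcpy() will usually not reach the peak either), but it gives a useful reference point.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

int main(void)
{
    const size_t SZ = 512UL * 1024 * 1024;            // 512 MB per buffer
    unsigned char *src = malloc(SZ), *dst = malloc(SZ);
    if (!src || !dst) { printf("malloc failed\n"); return 1; }
    memset(src, 1, SZ); memset(dst, 0, SZ);           // touch the pages before timing

    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    memcpy(dst, src, SZ);                             // reads SZ bytes and writes SZ bytes
    gettimeofday(&t1, NULL);

    double ms   = (t1.tv_sec - t0.tv_sec) * 1000.0 + (t1.tv_usec - t0.tv_usec) / 1000.0;
    double GBps = (2.0 * (double)SZ / 1.0e9) / (ms / 1000.0);   // 2x: bytes read + written
    printf("memcpy of %zu MB took %.1f ms  =>  ~%.1f GB/s\n", SZ >> 20, ms, GBps);
    free(src); free(dst);
    return 0;
}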
At this point, the next natural question to ask ourselves is the same comparison for
another CPU. Why not CPU3 again? Since it is the same exact function MTFlipVM() that is
running on both CPU3 and CPU5, we can expect CPU3 to be exposed to a similar memory
bandwidth saturation when running MTFlipVM(). We see from Table 3.4 that CPU3 is not
achieving an execution speed better than 2.66 ms, corresponding to a memory bandwidth of
16.5 GB/s. We know from Table 3.1 that CPU3 has a peak memory bandwidth of 25.6 GB/s.
So CPU3 actually hit 65% of its peak bandwidth with the same program. The difference
between CPU3 and CPU5 is that CPU3 is in a PC that has DDR3 memory: DDR3 is
friendlier to smaller data chunk transfers, whereas DDR4 memory in CPU5’s PC is designed
to transfer much larger chunks efficiently and provides much higher bandwidth if one does
so. This raises the question of whether CPU5 would do better if we increased the
size of the buffer and made each thread responsible for processing much larger chunks. The
answer is YES. I will provide an example in Chapter 5. The idea here is that you do not
compare a Yugo and Porsche by making both of them go 30 mph! You are simply insulting
the Porsche! The difference will not be seen until you go to 80 mph!
Code 3.2 barely reaches half of the peak bandwidth of CPU5. Reaching the full band-
width will take a lot more engineering. First of all, there is a perfect “row size” for each
DRAM (I called it the “chunk size” earlier, that corresponds to some physical feature of
the DRAM). Although Code 3.2 is doing a good job by transferring large chunks at a time,
we are not sure if the size of these chunks perfectly matches the optimum chunk size of
the DRAM. Additionally, since multiple threads are requesting data simultaneously, they
could be disrupting a good pattern that fewer threads would otherwise generate. It is very
easy to observe this from Table 3.4, where a higher number of threads sometimes results
in a lower performance. Figuring out the optimum size is left to the user as an exercise,
however, when we get to the GPU coding, we will be analyzing GPU DRAM details a
lot more closely. CPU DRAM and GPU DRAM are almost identical in operation with
some differences related to supplying data in parallel to a lot more threads that are in
the GPU.

4.6.6 Queue, Uncore, and I/O


Using Intel’s fancy term, cores are connected to the external world through the part of
the CPU called uncore. Uncore includes parts of the CPU that are not inside the cores,
but must be included inside the CPU to achieve high performance. Memory controller and
L3$ cache controller are two of the most important components of the uncore functionality.
PCI Express controller is inside the X99 chipset that is designed for this CPU, as shown in
Figure 4.4. Although the PCI Express bus (abbreviated PCIe) is managed by the chipset,
there is a part of the CPU that interfaces to the chipset.
This part of the CPU takes up about 22% of the die area inside the CPU and is re-
sponsible for queuing and efficiently transferring the data between the PCIe bus and the
L3$. When, for example, we are transferring data between the CPU and GPU, this part
of the CPU gets heavily used: The X99 chipset is responsible for communicating the PCIe
data between the GPUs and itself. Also, it is responsible for communicating the same data
between itself and the CPU through the “I/O” portion of the CPU. We can have a hard
disk controller, a network card, or a few GPUs hooked up to the I/O.
So, when I am describing the data transfers later in this book, I will pay attention to the
memory, cores, and I/O. The programs we are developing are going to be core-intensive,
memory-intensive, or I/O-intensive. By looking at Figure 4.4, you can see what they mean.
A core-intensive program will heavily use the core resources, shown in Figure 4.3, while
a memory-intensive program will use the memory controller heavily, on the right side of
Figure 4.4. Alternatively, an I/O-intensive program will use the I/O controller of the CPU,
shown on the left side of Figure 4.4.
These portions of the CPU can do their work in parallel, so there is nothing wrong
with a program being core, memory, and I/O intensive all at the same time; such a program
uses every part of the CPU. The only exception is that using all of these resources heavily might
slow them down, as compared to using only one of them heavily; for example, the data you
are transferring to the GPU from the main memory is going through the L3$, creating a
bottleneck determined by L3$. Part II will describe this in detail.

4.7 IMROTATEMC: MAKING IMROTATE MORE EFFICIENT


Let’s look at the execution times of the imrotate.c program in Table 4.1. The part of this
program that is solely responsible for this performance is the innermost function named
Rotate(). In this section, we will look at this function and will improve its performance. We
will modify main() to allow us to run a different version of the Rotate() function named
Rotate2(). As we keep improving this function, which has a direct influence on the performance
of our program, we will name the new versions Rotate3(), Rotate4(), Rotate5(), Rotate6(), and Rotate7().
To achieve this, a function pointer named RotateFunc is defined in our imrotateMC.c program as follows:

...
void* (*RotateFunc)(void *arg); // Func. ptr to rotate the image (multi-threaded)
...
void *Rotate(void* tid)
{
...
}
...
int main(int argc, char** argv)
{
...
RotateFunc=Rotate;
...
}

To keep improving this function, we will design different versions of this function and
will allow the user to select the desired function through the command line. To run the
imrotateMC.c program, the following command line is used:
imrotateMC InputfileName OutputfileName [degrees] [threads] [func]
where degrees specifies the clockwise rotation, [threads] specifies the number of threads to
launch as before, and the newly added [func] parameter (1–7) specifies which function to run
(i.e., 1 is to run Rotate(), 2 to run Rotate2(), etc.) The improved functions will be consistently
named Rotate2(), Rotate3(), etc. and the appropriate function pointer will be assigned to the
RotateFunc based on the command line argument selection as shown in Code 4.5. The name
of this new program contains "MC" for "memory and core friendly."

CODE 4.5: imrotateMC.c main() {...


The main() function in imrotateMC.c allows the user to specify which function to run,
from Rotate() to Rotate7(). Parts of the code that are identical to imrotate.c, listed in
Code 4.2 and 4.3, are not repeated.

int main(int argc, char** argv)


{
int RotDegrees, Function;
int a,i,ThErr;
struct timeval t;
double StartTime, EndTime;
double TimeElapsed;
char FuncName[50];

switch (argc){
case 3 : NumThreads=1; RotDegrees=45; Function=1; break;
case 4 : NumThreads=1; RotDegrees=at... Function=1; break;
case 5 : NumThreads=at... RotDegrees=at... Function=1; break;
case 6 : NumThreads=at... RotDegrees=at... Function=atoi(argv[5]); break;
default: printf("\nUsage: %s inputBMP outBMP [RotAngle] [1-128] [1-7]...");
printf("Example: %s infilename.bmp outname.bmp 125 4 3\n\n",argv[0]);
printf("Nothing executed ... Exiting ...\n\n");
exit(EXIT_FAILURE);
}
if((NumThreads<1) || (NumThreads>MAXTHREADS)){
...
}
if((RotDegrees<-360) || (RotDegrees>360)){
...
}
switch(Function){
case 1: strcpy(FuncName,"Rotate()"); RotateFunc=Rotate; break;
case 2: strcpy(FuncName,"Rotate2()"); RotateFunc=Rotate2; break;
case 3: strcpy(FuncName,"Rotate3()"); RotateFunc=Rotate3; break;
case 4: strcpy(FuncName,"Rotate4()"); RotateFunc=Rotate4; break;
case 5: strcpy(FuncName,"Rotate5()"); RotateFunc=Rotate5; break;
case 6: strcpy(FuncName,"Rotate6()"); RotateFunc=Rotate6; break;
case 7: strcpy(FuncName,"Rotate7()"); RotateFunc=Rotate7; break;
// case 8: strcpy(FuncName,"Rotate8()"); RotateFunc=Rotate8; break;
// case 9: strcpy(FuncName,"Rotate9()"); RotateFunc=Rotate9; break;
default: printf("Wrong function %d ... \n",Function);
printf("\n\nNothing executed ... Exiting ...\n\n");
exit(EXIT_FAILURE);
}
printf("\nLaunching %d Pthreads using function: %s\n",NumThreads,FuncName);
RotAngle=2*3.141592/360.000*(double) RotDegrees; // Convert the angle to radians
printf("\nRotating image by %d degrees ...\n",RotDegrees);
TheImage = ReadBMP(argv[1]);
...
}

CODE 4.6: imrotateMC.c Rotate2() {...}


In the Rotate2() function, the four lines that calculate H, V, Diagonal, and ScaleFactor
are moved outside the two for loops, since they only need to be calculated once.

void *Rotate2(void* tid)


{
int row,col,h,v,c, NewRow,NewCol;
double X, Y, newX, newY, ScaleFactor, Diagonal, H, V;
...
H=(double)ip.Hpixels; //MOVE UP HERE
V=(double)ip.Vpixels; //MOVE UP HERE
Diagonal=sqrt(H*H+V*V); //MOVE UP HERE
ScaleFactor=(ip.Hpixels>ip.Vpixels) ? V/Diagonal : H/Diagonal; //MOVE UP HERE
for(row=tn; row<tn+ip.Vpixels/NumThreads; row++){
col=0;
while(col<ip.Hpixels*3){
...
newY=sin(RotAngle)*X+cos(RotAngle)*Y;
// MOVE THESE 4 INSTRUCTIONS OUTSIDE BOTH LOOPS
// H=(double)ip.Hpixels; V=(double)ip.Vpixels;
// Diagonal=sqrt(H*H+V*V);
// ScaleFactor=(ip.Hpixels>ip.Vpixels) ? V/Diagonal : H/Diagonal;
newX=newX*ScaleFactor;
...
}

4.7.1 Rotate2(): How Bad is Square Root and FP Division?


Now, let’s look at the Rotate() function in Code 4.4. It calculates the new scaled X, Y
coordinates and saves them in variables newX and newY. For this, it first has to calculate the
ScaleFactor from Equation 4.3, and d (Diagonal). This calculation involves the code lines:

H=(double)ip.Hpixels;
V=(double)ip.Vpixels;
Diagonal=sqrt(H*H+V*V);
ScaleFactor=(ip.Hpixels>ip.Vpixels) ? V/Diagonal : H/Diagonal;

Moving these lines outside both loops will not change the functionality at all, since they
really only need to be calculated once. We are particularly interested in understanding what
parts of the CPU core in Figure 4.3 these computations use and how much speedup we will
get from this move. The revised Rotate2() function is shown in Code 4.6; lines identical to the
original Rotate() function are not repeated and are denoted as "..." to improve readability.
An entire list of execution times for every version of this function will be provided later
in this chapter in Table 4.3. For now, let’s quickly compare the single-threaded performance
of Rotate2() versus Rotate() on CPU5. To run the single-threaded version of Rotate2(), type:
imrotateMC dogL.bmp d.bmp 45 1 2

where 45, 1, and 2 are rotation, number of threads (single), and function ID (Rotate2()). We
reduced the single-threaded run time from 1027 ms down to 498 ms, a 2.06× improvement.
Now, let’s analyze the instructions on the 4 lines that we moved to see why we had a 2.06×
improvement. First two computations are integer-to-double precision floating point (FP)
cast operations when we are calculating H and V. They are simple enough, but would use
an FPU resource (shown in Figure 4.3). The next line in calculating Diagonal is absolutely
a resource hog, since Square Root is very compute-intensive. This harmless looking line
requires two FP multiplications (FP-MUL) to compute H×H and V ×V and one FP-ADD
to compute their sum. After this, the super-expensive Square Root operation is performed.
As if sqrt is not enough torture for the core, we see a floating point division next, that is as
bad as the square root, followed by an integer comparison! So, when the CPU core hits the
instructions that computes these 4 lines, core resources are chewed up and spit out! When
we move them outside both loops, it is no wonder that we get a 2.06× speedup.
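
To get a rough feel for how expensive these operations are on your own machine, a micro-benchmark
such as the following can be used. This is a minimal sketch of my own (not part of imrotateMC.c):
the iteration count, the use of gettimeofday(), and the volatile accumulator that keeps the compiler
from deleting the loops are all arbitrary choices, and the absolute numbers will vary with the CPU
and compiler flags (e.g., gcc -O1 fpbench.c -o fpbench -lm).

#include <stdio.h>
#include <math.h>
#include <sys/time.h>

static double NowMs(void)                       // time stamp in ms
{
    struct timeval t;
    gettimeofday(&t, NULL);
    return ((double)t.tv_sec * 1000000.0 + (double)t.tv_usec) / 1000.0;
}

int main(void)
{
    const long N = 50 * 1000 * 1000;
    volatile double sink = 0.0;                 // volatile: forces every iteration to happen
    double x = 1.000000001, t0;

    t0 = NowMs();
    for (long i = 0; i < N; i++) sink += x * (double)i;          // FP multiply + add
    printf("FP multiply : %7.1f ms\n", NowMs() - t0);

    t0 = NowMs();
    for (long i = 0; i < N; i++) sink += 1.0 / (x + (double)i);  // FP division
    printf("FP division : %7.1f ms\n", NowMs() - t0);

    t0 = NowMs();
    for (long i = 0; i < N; i++) sink += sqrt(x + (double)i);    // square root
    printf("sqrt()      : %7.1f ms\n", NowMs() - t0);

    printf("(checksum %g)\n", (double)sink);    // use the result so nothing is removed
    return 0;
}

On most CPUs, the division and sqrt() loops take noticeably longer than the multiply loop, which is
exactly why hoisting them out of the per-pixel loop pays off so handsomely.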

4.7.2 Rotate3() and Rotate4(): How Bad Is sin() and cos()?


Why stop here? What else can we precompute? Look at these lines in the code:

newX=cos(RotAngle)*X-sin(RotAngle)*Y;
newY=sin(RotAngle)*X+cos(RotAngle)*Y;

We are simply forcing the CPU to compute sin() and cos() for every pixel! There is
no need for that: the Rotate3() function defines a precomputed variable called CRA (the
precomputed cosine of RotAngle) and uses it in the innermost loop whenever it needs
cos(RotAngle). The revised Rotate3() function is shown in Code 4.7; its single-threaded
run time on CPU5 is reduced from 498 ms to 376 ms, a 1.32× improvement.
The Rotate4() function (shown in Code 4.8) does the same thing by precomputing
sin(RotAngle) as SRA. Rotate4()'s single-threaded run time is reduced from 376 ms to 235 ms,
another 1.6× improvement. The simplified code lines in the Rotate4() function in Code 4.8 are

newX=CRA*X-SRA*Y;
newY=SRA*X+CRA*Y;

When we compare these two lines to the two lines above, the summary is as follows:
• Rotate3() needs the calculation of sin(), cos().
• Rotate3() performs 4 double-precision FP multiplications.
• Rotate3() performs 2 double-precision FP addition/subtractions.

• Rotate4() needs all of the above, except the sin(), cos().



CODE 4.7: imrotateMC.c Rotate3() {...}


Rotate3() function precomputes cos(RotAngle) as CRA outside both loops.

void *Rotate3(void* tid)


{
int NewRow,NewCol;
double X, Y, newX, newY, ScaleFactor;
double Diagonal, H, V;
double CRA;
...
H=(double)ip.Hpixels;
V=(double)ip.Vpixels;
Diagonal=sqrt(H*H+V*V);
ScaleFactor=(ip.Hpixels>ip.Vpixels) ? V/Diagonal : H/Diagonal;
CRA=cos(RotAngle); /// MOVE UP HERE
for(row=tn; row<tn+ip.Vpixels/NumThreads; row++){
col=0;
while(col<ip.Hpixels*3){
...
newX=CRA*X-sin(RotAngle)*Y; // USE PRE-COMPUTED CRA
newY=sin(RotAngle)*X+CRA*Y; // USE PRE-COMPUTED CRA
// newX=cos(RotAngle)*X-sin(RotAngle)*Y; // CHANGE
// newY=sin(RotAngle)*X+cos(RotAngle)*Y; // CHANGE
newX=newX*ScaleFactor;
...

CODE 4.8: imrotateMC.c Rotate4() {...}


Rotate4() function precomputes sin(RotAngle) as SRA outside both loops.

void *Rotate4(void* tid)


{
...
double CRA, SRA;
...
ScaleFactor=(ip.Hpixels>ip.Vpixels) ? V/Diagonal : H/Diagonal;
CRA=cos(RotAngle);
SRA=sin(RotAngle); /// MOVE UP HERE
for(row=tn; row<tn+ip.Vpixels/NumThreads; row++){
col=0;
while(col<ip.Hpixels*3){
...
newX=CRA*X-SRA*Y; // USE PRE-COMPUTED SRA, CRA
newY=SRA*X+CRA*Y; // USE PRE-COMPUTED SRA, CRA
// newX=cos(RotAngle)*X-sin(RotAngle)*Y; // CHANGE
// newY=sin(RotAngle)*X+cos(RotAngle)*Y; // CHANGE
...

4.7.3 Rotate5(): How Bad Is Integer Division/Multiplication?


We have almost completely exhausted the FP operations that we can precompute. It is
time to look at integer operations now. We know that integer divisions are substantially
slower than integer multiplications. Since every CPU that is used in testing is a 64-bit CPU,
there isn’t too much of a performance difference between 32-bit and 64-bit multiplications.
However, divisions might be different depending on the CPU. It is time to put all of these
ideas to the test. The revised Rotate5() function is shown in Code 4.9.
For the Rotate5() function, our target is the following lines of code, containing integer
computations:

for(row=tn; row<tn+ip.Vpixels/NumThreads; row++){


col=0;
while(col<ip.Hpixels*3){ // USE THE PRE-COMPUTED hp3 VALUE
// transpose image coordinates to Cartesian coordinates
c=col/3; h=ip.Hpixels/2; v=ip.Vpixels/2; // integer div

We notice that we are calculating a value called ip.Hpixels*3 just to turn back around
and divide it by 3 a few lines below. Knowing that integer divisions are expensive, why
not do this in a way where we can eliminate the integer division altogether? To do this, we
observe that the variable c is doing nothing but mirror the same value as col/3. Since we
are starting the variable col at 0 before we get into the while() loop, why not start the
c variable at 0 also? Since we are incrementing the value of the col variable by 3 at the
end of the while loop, we can simply increment the c variable by one. This will create two
variables that completely track each other, with the relationship col = 3 × c without ever
having to use an integer division.
The Rotate5() function that implements this idea is shown in Code 4.9; it simply trades an
integer division c = col/3 for an integer increment operation c++. Additionally, to find the
half-way point in the picture, ip.Hpixels and ip.Vpixels must be divided by 2 as shown above.
Since this can also be precomputed, it is moved outside both loops in the implementation of
Rotate5(). All said and done, Rotate5() run
time is reduced from 235 ms to 210 ms, an improvement of 1.12×. Not bad ...

4.7.4 Rotate6(): Consolidating Computations


To design the Rotate6(), we will target these lines

X=(double)c-(double)h; Y=(double)v-(double)row;
// pixel rotation matrix
newX=CRA*X-SRA*Y; newY=SRA*X+CRA*Y;
newX=newX*ScaleFactor; newY=newY*ScaleFactor;

After all of our modifications, these lines ended up in this order. So, the natural question
to ask ourselves is: can we consolidate these computations of newX and newY and can we
precompute variables X or Y? An observation of the loop shows that, although the variable
X is stuck inside the innermost loop, we can move the computation of variable Y outside the
innermost loop, although we cannot move it outside both loops. But, considering that this
will save us the repetitive computation of Y many times (3200 times for a 3200 row image!),
the savings could be worth the effort. The revised Rotate6() is shown in Code 4.10 and
implements these ideas by using two additional variables named SRAYS and CRAYS. Rotate6()
runtime improved from 210 ms to 185 ms (1.14× better). Not bad ...

CODE 4.9: imrotateMC.c Rotate5() {...}


In Rotate5(), integer division, multiplications are taken outside both loops.

void *Rotate5(void* tid)


{
int hp3;
double CRA,SRA;
...
CRA=cos(RotAngle); SRA=sin(RotAngle);
h=ip.Hpixels/2; v=ip.Vpixels/2; // MOVE IT OUTSIDE BOTH LOOPS
hp3=ip.Hpixels*3; // PRE-COMPUTE ip.Hpixels*3
for(row=tn; row<tn+ip.Vpixels/NumThreads; row++){
col=0;
c=0; // HAD TO DEFINE THIS
while(col<hp3){ //INSTEAD OF col<ip.Hpixels*3
// c=col/3; h=ip.Hpixels/2; v=ip.Vpixels/2; // MOVE OUT OF THE LOOP
X=(double)c-(double)h;
Y=(double)v-(double)row;
// pixel rotation matrix
newX=CRA*X-SRA*Y; newY=SRA*X+CRA*Y;
newX=newX*ScaleFactor; newY=newY*ScaleFactor;
...
col+=3; c++;
...

CODE 4.10: imrotateMC.c Rotate6() {...}


In Rotate6(), FP operations are consolidated to reduce the FP operation count.

void *Rotate6(void* tid)


{
double CRA,SRA, CRAS, SRAS, SRAYS, CRAYS;
...
CRA=cos(RotAngle); SRA=sin(RotAngle);
CRAS=ScaleFactor*CRA; SRAS=ScaleFactor*SRA; // PRECOMPUTE ScaleFactor*SRA, CRA
...
for(row=tn; row<tn+ip.Vpixels/NumThreads; row++){
col=0; c=0;
Y=(double)v-(double)row; // MOVE UP HERE
SRAYS=SRAS*Y; CRAYS=CRAS*Y; // NEW PRE-COMPUTATIONS
while(col<hp3){
X=(double)c-(double)h;
// Y=(double)v-(double)row; // MOVE THIS OUT
// pixel rotation matrix
newX=CRAS*X-SRAYS; // CALC NewX with pre-computed values
newY=SRAS*X+CRAYS; // CALC NewY with pre-computed values
...

CODE 4.11: imrotateMC.c Rotate7() {...}


In Rotate7(), every computation is expanded to avoid any redundant computation.

void *Rotate7(void* tid)


{
long tn;
int row, col, h, v, c, hp3, NewRow, NewCol;
double cc, ss, k1, k2, X, Y, newX, newY, ScaleFactor;
double Diagonal, H, V, CRA,SRA, CRAS, SRAS, SRAYS, CRAYS;
struct Pixel pix;

tn = *((int *) tid); tn *= ip.Vpixels/NumThreads;


H=(double)ip.Hpixels; V=(double)ip.Vpixels; Diagonal=sqrt(H*H+V*V);
ScaleFactor=(ip.Hpixels>ip.Vpixels) ? V/Diagonal : H/Diagonal;
CRA=cos(RotAngle); CRAS=ScaleFactor*CRA;
SRA=sin(RotAngle); SRAS=ScaleFactor*SRA;
h=ip.Hpixels/2; v=ip.Vpixels/2; hp3=ip.Hpixels*3;
for(row=tn; row<tn+ip.Vpixels/NumThreads; row++){
col=0; cc=0.00; ss=0.00;
Y=(double)v-(double)row; SRAYS=SRAS*Y; CRAYS=CRAS*Y;
k1=CRAS*(double)h + SRAYS; k2=SRAS*(double)h - CRAYS;
while(col<hp3){
newX=cc-k1; newY=ss-k2;
NewCol=((int) newX+h); NewRow=v-(int)newY;
if((NewCol>=0) && (NewRow>=0) && (NewCol<ip.Hpixels) &&
(NewRow<ip.Vpixels)){
NewCol*=3;
CopyImage[NewRow][NewCol] = TheImage[row][col];
CopyImage[NewRow][NewCol+1] = TheImage[row][col+1];
CopyImage[NewRow][NewCol+2] = TheImage[row][col+2];
}
col+=3; cc += CRAS; ss += SRAS;
}
}
pthread_exit(NULL);
}

4.7.5 Rotate7(): Consolidating More Computations


Finally, our Rotate7() function looks at every possible computation to see if it can be done
with precomputed values. The Rotate7() function in Code 4.11 achieves a run time of 161 ms,
an improvement of 1.15× compared to Rotate6() that achieved 185 ms.

4.7.6 Overall Performance of imrotateMC


The execution results of all 7 Rotate functions are shown in Table 4.3. Taking CPU5 as
an example, all of the improvements resulted in a 6.4× speedup on a single thread. For 8
threads, which we know is a comfortable operating point for CPU5, the speedup is 7.8×. This
additional boost in speedup shows that, for core-intensive functions like Rotate(), improving
the core efficiency of the program helped when more threads are launched.

TABLE 4.3 imrotateMC.c execution times for the CPUs in Table 3.1.
#Th Func CPU1 CPU2 CPU3 CPU4 CPU5 CPU6
1 951 1365 782 1090 1027 845
2 530 696 389 546 548 423
3 514 462 261 368 365 282
4 Rotate() 499 399 253 322 272 227
6 387 283 338 214 213
8 374 237 297 188 163
10 341 228 285 163 201
1 468 580 364 441 498 659
2 280 301 182 222 267 330
3 249 197 123 148 194 220
4 Rotate2() 280 174 126 165 137 165
6 207 127 138 101 176
8 195 138 134 84 138
10 125 141 67
1 327 363 264 301 376 446
2 218 189 131 151 202 223
3 187 123 88 101 142 149
4 Rotate3() 202 93 106 108 101 112
6 123 97 106 75 116
8 117 101 110 59 89
10 106 92 47
1 202 227 161 182 235 240
2 140 124 80 91 135 120
3 109 80 54 61 92 80
4 Rotate4() 109 65 73 54 69 60
8 88 62 69 37 47
10 58 55 29
1 171 209 145 158 210 207
2 140 108 73 78 117 104
3 93 73 49 53 80 69
4 Rotate5() 93 61 69 72 61 52
6 72 51 62 44 53
8 81 56 60 36 40
10 59 48 29
1 156 180 125 128 185 176
2 124 92 63 64 109 88
3 93 78 43 45 78 59
4 Rotate6() 93 57 63 65 55 44
6 60 43 43 37 44
8 65 51 49 30 33
1 140 155 107 110 161 156
2 109 75 53 55 97 78
3 93 52 36 37 64 52
4 Rotate7() 62 70 53 56 46 39
6 61 36 38 36 39
8 56 40 42 24 29
10 43 45 21

4.8 CHAPTER SUMMARY


In this chapter, we looked at what happens inside the core and during the memory transfers
from the CPU to the main memory. We used this information to make one example program
faster. The rules we outlined are simple:
• Stay away from sophisticated core instructions, such as sin(), sqrt().
If you must use them, make sure to have a limited number of them.
• The ALU executes integer instructions, FPU executes floating point.
Try to have a mix of these instructions in an inner loop to use both units.
If you can use only one type, use integer.
• Avoid choppy memory accesses. Use big bulk transfers if possible.
Try to take heavy computations outside the inner loops.

Aside from these simple rules, Table 4.3 shows that if the threads we design are thick,
we will not be able to take advantage of the multiple threads that a core can execute. In our
code so far, even the improved Rotate7() function has thick threads. So, the performance
falls off sharply when either the number of launched threads gets close to the number of physical
cores, or we saturate the memory. This brings up the concept of “whichever constraint comes
first.” In other words, when we change our program to improve it, we might eliminate one
bottleneck (say, FPU inside the core), but create a totally different bottleneck (say, main
memory bandwidth saturation).
The “whichever constraint comes first” concept will be the key ingredient in designing
GPU code, since inside a GPU, there are multiple constraints that can saturate and the
programmer has to be fully aware of every single one of them. For example, the programmer
could saturate the total number of launched threads before the memory is saturated. The
constraint ecosystem inside a GPU will be very much like the one we will learn in the CPU.
More on that coming up very shortly ...
Before we jump right into the GPU world, we have one more thing to learn: thread
synchronization in Chapter 5. Then, we are ready to take on the GPU challenge starting
in Chapter 6.
CHAPTER 5

Thread Management and Synchronization

When we say hardware architecture, what are we talking about?
The answer is everything physical: CPU, memory, I/O controller, PCI express bus,
DMA controller, hard disk controller, hard disk(s), SSDs, CPU chipsets, USB ports, network
cards, CD-ROM controller, DVDRW, I can go on for a while ...
How about, when we say software architecture, what are we talking about?
The answer is all of the code that runs on the hardware. Code is everywhere: your hard
disk, SSDs, your USB controller, your CD-ROM, even your cheap keyboard, your operating
system, and, finally, the programs you write.
The big question is: which one of these do we care about as a high-performance program-
mer? The answer is: CPU cores and threads, as well as memory in the hardware architecture
and the operating system (OS) and your application code in the software architecture. Es-
pecially if you are writing high performance programs (like the ones in this book), a vast
percentage of your program performance will be determined by these things. The disk per-
formance generally does not play an important role in the overall performance because
most modern computers have plenty of RAM to cache large portions of the disk. The OS
tries to efficiently allocate the CPU cores and memory to maximize performance while your
application code requests these two resources from the OS and uses them – hopefully effi-
ciently. Because the purpose of this book is high-performance programming, let’s dedicate
this chapter to understanding the interplay between your own code and the OS when
it comes to allocating/using the cores and memory.

5.1 EDGE DETECTION PROGRAM: IMEDGE.C


In this section, I will introduce an image edge detection program, imedge.c, that detects the
edges in an image as shown in Figure 5.1. In addition to providing a good template example
that we will continuously improve, edge detection is actually one of the most common image
processing tasks and is an essential operation that the human retina performs. Just like the
programs I introduced in the preceding chapters, this program is also resource-intensive;
therefore, I will introduce its variant imedgeMC.c that is memory-friendly (“M”) and core-
friendly (“C”). One interesting variant I will add is imedgeMCT.c, which adds the thread-
friendliness (“T”) concept by utilizing a MUTEX structure to facilitate communication
among multiple threads.
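
As a preview of what that mutex-based coordination looks like, here is a generic pthreads sketch
of my own; it is not the actual imedgeMCT.c code, just the underlying idea: a mutex serializes
access to a shared counter (think of it as "the next row to be processed") so that multiple threads
can safely pull work from it.

#include <stdio.h>
#include <pthread.h>

#define NUMTHREADS 4
#define NUMROWS    1000                          // pretend the image has 1000 rows

pthread_mutex_t CounterLock = PTHREAD_MUTEX_INITIALIZER;
int NextRow = 0;                                 // shared "next row to process" counter

void *Worker(void *tid)
{
    int MyID = *((int *)tid), MyCount = 0;
    while (1) {
        pthread_mutex_lock(&CounterLock);        // only one thread at a time touches NextRow
        int row = NextRow++;
        pthread_mutex_unlock(&CounterLock);
        if (row >= NUMROWS) break;
        MyCount++;                               // ... process row 'row' here ...
    }
    printf("Thread %d processed %d rows\n", MyID, MyCount);
    pthread_exit(NULL);
}

int main(void)
{
    pthread_t ThHandle[NUMTHREADS];  int ThParam[NUMTHREADS];
    for (int i = 0; i < NUMTHREADS; i++) {
        ThParam[i] = i;
        pthread_create(&ThHandle[i], NULL, Worker, (void *)&ThParam[i]);
    }
    for (int i = 0; i < NUMTHREADS; i++) pthread_join(ThHandle[i], NULL);
    return 0;
}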


FIGURE 5.1 The imedge.c program is used to detect edges in the original im-
age astronaut.bmp (top left). Intermediate processing steps are: GaussianFilter()
(top right), Sobel() (bottom left), and finally Threshold() (bottom right).

5.1.1 Description of the imedge.c


The purpose of imedge.c is to create a program that is core-heavy, memory-heavy, and is
composed of multiple independent operations (functions):
• GaussianFilter(): The initial smoothing filter to reduce noise in the original image.
• Sobel(): The edge detection kernel we apply to amplify the edges.
• Threshold(): The operation that turns a grayscale image to a binary (black & white)
image (denoted B&W going forward), thereby “detecting” the edges.
imedge.c takes an image such as the one shown in Figure 5.1 (top left) and applies these
three distinct operations one after the other to finally yield an image that is composed of
only the edges (bottom right). To run the program, the following command line is used:
imedge InputfileName OutputfileName [1-128] [ThreshLo] [ThreshHi]
where [1–128] is the number of threads to launch as usual and the [ThreshLo], [ThreshHi]
pair determines the thresholds used during the Threshold() function.

5.1.2 imedge.c: Parametric Restrictions and Simplifications


Some simplifications had to be made in the program to avoid unnecessary complications
and diversion of our attention to unrelated issues. Numerous improvements are possible
to this program and the readers should definitely attempt to discover them. However, the
choices made in this program are geared toward improving the instructional value of this
chapter, rather than the ultimate performance of the imedge.c program. Specifically,
• A much more robust edge detection program can be designed by adding other post-
processing steps. They have not been incorporated into our program.
• The granularity of the processing is limited to a single row when running the “MCT”
version (imedgeMCT.c) at the end of this chapter. Although a finer granularity pro-
cessing is possible, it does not add any instructional value.
• To calculate some of the computed pixel values, the choice of the double variable type
is a little overkill; however, double type has been employed in the program because it
improves the instructional value of the code.
• Multiple arrays are used to store different processed versions of the image before
its final edge-detected version is computed: TheImage array stores the original image.
BWImage array stores the B&W version of this original. GaussImage array stores the
Gaussian filtered version. Gradient and Theta arrays store the Sobel-computed pixels.
The final edge-detected image is stored in CopyImage. Although these arrays could be
combined, separate arrays have been used to improve the clarity of the code.

5.1.3 imedge.c: Theory of Operation


The imedge.c edge detection program includes three distinct operations as described in Sec-
tion 5.1. It is definitely possible to combine these operations to end up with a more efficient
program, but from an instructional standpoint, it helps to see how we will speed up each
individual operation; because these three operations have different resource characteristics,
separating them will allow us to analyze each one individually. Later, when we design im-
proved versions of it – as we did in the previous chapters – we will improve each individual
operation in different ways. Now, let me describe each individual operation separately.
GaussianFilter(): You can definitely skip this operation at the expense of confusing some
noise with the edges. As shown in Figure 5.1, Gaussian filter blurs the original image, which
can be perceived as a useless operation that degrades image quality. However, the intended
positive consequence of blurring is the fact that we get rid of noise that might be present
in the image, which could be erroneously detected as edges. Also, remember that our final
image will be turned into a binary image consisting of only two values, which will correspond
to edge/no edge. A binary image only has a 1-bit depth for each pixel, which is substantially
lower quality than the original image that has a 24-bit depth, i.e., 8-bits for R, G, and B each.
Considering that this edge detection operation reduces the image bit-depth from 24-bits per
pixel down to 1-bit per pixel anyway, the depth reduction due to the blurring operation only
has a positive effect for us. By the same token, though, too much blurring might worsen the
ability of our entire process by eventually melting the actual edges into the picture, thereby
making them unrecognizable. Quantitatively speaking, GaussianFilter() turns the original
24-bit image into a blurred 8-bit-per-pixel B&W image according to the equation below:
BWImage = (TheImage_R + TheImage_G + TheImage_B) / 3                                          (5.1)

where TheImage is the original 24-bit/pixel image, composed of R, G, and B parts; averaging
the R, G, and B values turns this color image into an 8-bit/pixel B&W image, saved in the
variable named BWImage.

The Gaussian filtering step is composed of a convolution operation as follows:


 
                     | 2  4  5  4  2 |
                     | 4  9 12  9  4 |
Gauss = (1/159)  ×   | 5 12 15 12  5 |     =⇒    GaussImage = BWImage ∗ Gauss                 (5.2)
                     | 4  9 12  9  4 |
                     | 2  4  5  4  2 |

where Gauss is the filter kernel and ∗ is the convolution operation, which is one of the
most common operations in digital signal processing (DSP). Together, the operation in
Equation 5.2 convolves the B&W image we created in Equation 5.1 – which is contained in
variable BWImage – with the Gauss filter mask, contained in variable Gauss to result in the
blurred image, contained in variable GaussImage. Note that other filter kernels are available
that result in different levels of blurring; however, for our demonstration purposes, this filter
kernel is totally fine. At the end of this step, our image looks like Figure 5.1 (top right).
Sobel(): The goal of this step is to use a Sobel gradient operator on each pixel to deter-
mine the direction – and existence – of edges. From Equation 5.3, Sobel kernels are
   
        | −1  0  1 |               | −1 −2 −1 |
Gx =    | −2  0  2 | ,    Gy =     |  0  0  0 |                                               (5.3)
        | −1  0  1 |               |  1  2  1 |

where Gx and Gy are the Sobel kernels used to determine the edge gradient for each pixel in
the x and y directions. These kernels are convolved with the previously computed GaussImage
to compute a gradient image as follows:
 
GX = Im ∗ Gx ,    GY = Im ∗ Gy ,    G = √(GX² + GY²) ,    θ = tan⁻¹(GX/GY)                    (5.4)

where Im is the previously computed blurred image in GaussImage, ∗ is the convolution


operation, GX and GY are the edge gradients in the x and y directions. These temporary
gradients are not of interest to us, whereas the magnitude of the gradient is; the G variable
in Equation 5.4 is the magnitude of the gradient for each pixel, stored in variable Gradient.
Additionally, we would like to have some idea about what the direction of the edge was (θ)
at a specific pixel location; the variable Theta stores this information – also calculated from
Equation 5.4 – and will be used in the thresholding process. At the end of the Sobel step,
our image looks like Figure 5.1 (bottom left). This image is the value of the Gradient array,
which stores the magnitudes of the edges; it is a grayscale image.
Threshold(): The thresholding step takes the Gradient array and turns it into a 1-bit
edge/no edge image based on two threshold values: ThreshLo value is the threshold value
below which a pixel is definitely considered a “NO EDGE.” Alternatively, ThreshHi value is
the threshold value above which a pixel is definitely considered an “EDGE.” In these two
cases, using the Gradient variable suffices, as formulated in the equation below:
pixel at [x, y] =⇒  { Gradient[x, y] < ThreshLo  =⇒  CopyImage[x, y] = NOEDGE
                    { Gradient[x, y] > ThreshHi  =⇒  CopyImage[x, y] = EDGE                   (5.5)

where the CopyImage array stores the final binary image. In case the gradient value of a
pixel is in between these two thresholds, we use the second array, Theta, to determine the
direction of the edge. We classify the direction into one of four possible values: horizontal
(EW), vertical (NS), left diagonal (SW-NE), and right diagonal (SE-NW).

TABLE 5.1  Array variables and their types, used during edge detection.

 Function to perform       Starting array     Destination array     Destination type
 Convert image to B&W      TheImage           BWImage               unsigned char
 Gaussian filter           BWImage            GaussImage            double
 Sobel filter              GaussImage         Gradient              double
                                              Theta                 double
 Threshold                 Gradient           CopyImage             unsigned char
                           (Theta if needed)

The idea is that if the Theta of the edge is pointing to vertical (that is, up/down), we determine
whether this pixel is an edge or not by looking at the pixel above or below it. Similarly, for a
horizontal pixel, we look at its horizontal neighbors. This well-studied method is formulated below:

                     {  Θ < −(3/8)π   or   Θ > (3/8)π     =⇒  EW neighbor
                     {  Θ ≥ −(1/8)π   and  Θ ≤ (1/8)π     =⇒  NS neighbor
L < ∆[x, y] < H  =⇒  {  Θ > (1/8)π    and  Θ ≤ (3/8)π     =⇒  SW-NE neighbor                  (5.6)
                     {  Θ ≥ −(3/8)π   and  Θ < −(1/8)π    =⇒  SE-NW neighbor
where Θ = Θ[x, y] is the angle of edge [x, y] and L and H are the low, high thresholds. The
gradient is shown as ∆. The final result of the imedge.c program is shown in Figure 5.1
(bottom right). The #define values define how EDGE/NOEDGE will be colored; in this
program I assigned 0 (black) to EDGE and 255 (white) to NO EDGE. This makes the
edges print-friendly.

5.2 IMEDGE.C : IMPLEMENTATION


Table 5.1 provides the names and types of different arrays that are used during the ex-
ecution of imedge.c. The image is initially read into the TheImage array, which contains
the RGB values of each pixel in unsigned char type (Figure 5.1 top left). This image is
converted to its B&W version according to Equation 5.1 and saved in the BWImage array
(Figure 5.1 top right). Each pixel value in the BWImage array is of type double, although
a much lower resolution type would suffice. Gaussian filtering takes the BWImage array and
places its filtered (i.e., blurred) version in the GaussImage array. The Sobel step then takes
GaussImage and produces the Gradient and Theta arrays, which contain the gradient magnitudes
and edge angles, respectively. The Gradient array is shown in Figure 5.1 (bottom left), which
shows how edgy each pixel is. The final thresholding step takes these
two arrays to produce the final result of the program, CopyImage array, which contains the
binary edge/no edge image (Figure 5.1 bottom right).
Although a single-bit depth would suffice for the CopyImage, each pixel is saved in
CopyImage using RGB pixel values (EDGE, EDGE, EDGE) to denote an existing edge,
or (NOEDGE, NOEDGE, NOEDGE) to denote a non-existent edge. This makes it feasible
to use the same BMP-writing function to save the final binary image. The default
#define value of EDGE is 0, so each edge looks black, i.e., the RGB value (0,0,0). Simi-
larly, NOEDGE is 255, which looks white, i.e., RGB value (255,255,255). The edge image
in Figure 5.1 (bottom right) contains only these two RGB values. A 1-bit depth BMP file
format could also be used to conserve space when saving the final edge image, which is left
up to the reader as an exercise.

CODE 5.1: imedge.c ... main() {...


Time stamping code in main to identify partial execution times.

unsigned char** TheImage; // This is the main image


unsigned char** CopyImage; // This is the copy image (to store edges)
double **BWImage; // B&W of TheImage (each pixel=double)
double **GaussImage; // Gauss filtered version of the B&W image
double **Gradient, **Theta; // gradient and theta for each pixel
struct ImgProp ip;
...
double GetDoubleTime() // returns the time stamps in ms
{
struct timeval tnow;
gettimeofday(&tnow, NULL);
return ((double)tnow.tv_sec*1000000.0 + ((double)tnow.tv_usec))/1000.00;
}

double ReportTimeDelta(double PreviousTime, char *Message)


{
double Tnow,TimeDelta;
Tnow=GetDoubleTime();
TimeDelta=Tnow-PreviousTime;
printf("\n.....%-30s ... %7.0f ms\n",Message,TimeDelta);
return Tnow;
}

int main(int argc, char** argv)


{
int a,i,ThErr; double t1,t2,t3,t4,t5,t6,t7,t8;
...
printf("\nExecuting the Pthreads version with %li threads ...\n",NumThreads);
t1 = GetDoubleTime();
TheImage=ReadBMP(argv[1]); printf("\n");
t2 = ReportTimeDelta(t1,"ReadBMP complete"); // Start time without IO
CopyImage = CreateBlankBMP(NOEDGE); // Store edges in RGB
BWImage = CreateBWCopy(TheImage);
GaussImage = CreateBlankDouble();
Gradient = CreateBlankDouble();
Theta = CreateBlankDouble();
t3=ReportTimeDelta(t2, "Auxiliary images created");
...

5.2.1 Initialization and Time-Stamping


imedge.c time-stamps each of the three separate steps, as well as the initialization time needed
to create the aforementioned arrays, to allow a better assessment of program performance. Initialization
and time-stamping code is shown in Code 5.1. Each time stamp is obtained using the
GetDoubleTime() helper function; the time difference between two time stamps is computed
and reported using the ReportTimeDelta() function, which also uses a string to explain the
achieved function at that time stamp.

CODE 5.2: imageStuff.c Image Initialization Functions


Initialization of the different image values.

double** CreateBlankDouble()
{
int i;
double** img = (double **)malloc(ip.Vpixels * sizeof(double*));
for(i=0; i<ip.Vpixels; i++){
img[i] = (double *)malloc(ip.Hpixels*sizeof(double));
memset((void *)img[i],0,(size_t)ip.Hpixels*sizeof(double));
}
return img;
}

double** CreateBWCopy(unsigned char** img)


{
int i,j,k;
double** imgBW = (double **)malloc(ip.Vpixels * sizeof(double*));
for(i=0; i<ip.Vpixels; i++){
imgBW[i] = (double *)malloc(ip.Hpixels*sizeof(double));
for(j=0; j<ip.Hpixels; j++){ // convert each pixel to B&W = (R+G+B)/3
k=3*j;
imgBW[i][j]=((double)img[i][k]+(double)img[i][k+1]+(double)img[i][k+2])/3.0;
}
}
return imgBW;
}

unsigned char** CreateBlankBMP(unsigned char FILL)


{
int i,j;
unsigned char** img=(unsigned char **)malloc(ip.Vpixels*sizeof(unsigned char*));
for(i=0; i<ip.Vpixels; i++){
img[i] = (unsigned char *)malloc(ip.Hbytes * sizeof(unsigned char));
memset((void *)img[i],FILL,(size_t)ip.Hbytes); // fill every pixel byte with FILL
}
return img;
}

5.2.2 Initialization Functions for Different Image Representations


ReadBMP() function (Code 2.5) reads the source image into TheImage using an unsigned
char for each of the RGB values. The CreateBlankBMP() function (Code 5.2) creates a BMP
image whose R, G, B bytes are all initialized to a given FILL value; it is used to initialize the
CopyImage array (filled with NOEDGE here). CreateBWCopy() is used to initialize the BWImage array; it turns a 24-bit
image into its B&W version (using Equation 5.1), where each pixel has a double value.
CreateBlankDouble() function (Code 5.2) creates an image array, populated with 0.0 double
values; it is used to initialize the GaussImage, Gradient, and Theta arrays.

CODE 5.3: imedge.c main() ...}


Launching multiple threads for three separate functions.

int main(int argc, char** argv)


{
...
pthread_attr_init(&ThAttr);
pthread_attr_setdetachstate(&ThAttr, PTHREAD_CREATE_JOINABLE);
for(i=0; i<NumThreads; i++){
ThParam[i] = i;
ThErr = pthread_create(&ThHandle[i], &ThAttr, GaussianFilter, (void *)...
if(ThErr != 0){
printf("\nThread Creation Error %d. Exiting abruptly... \n",ThErr);
exit(EXIT_FAILURE);
}
}
for(i=0; i<NumThreads; i++){ pthread_join(ThHandle[i], NULL); }
t4=ReportTimeDelta(t3, "Gauss Image created");
for(i=0; i<NumThreads; i++){
ThParam[i] = i;
ThErr = pthread_create(&ThHandle[i], &ThAttr, Sobel, (void *)&ThParam[i]);
if(ThErr != 0){ ... }
}
for(i=0; i<NumThreads; i++){ pthread_join(ThHandle[i], NULL); }
t5=ReportTimeDelta(t4, "Gradient, Theta calculated");
for(i=0; i<NumThreads; i++){
ThParam[i] = i;
ThErr = pthread_create(&ThHandle[i], &ThAttr, Threshold, (void *)&ThParam[i]);
if(ThErr != 0){ ... }
}
pthread_attr_destroy(&ThAttr);
for(i=0; i<NumThreads; i++){ pthread_join(ThHandle[i], NULL); }
t6=ReportTimeDelta(t5, "Thresholding completed");
WriteBMP(CopyImage, argv[2]); printf("\n"); //merge with header. write to file
t7=ReportTimeDelta(t6, "WriteBMP completed");
for(i = 0; i < ip.Vpixels; i++) { // free() image memory and pointers
free(TheImage[i]); free(CopyImage[i]); free(BWImage[i]);
free(GaussImage[i]); free(Gradient[i]); free(Theta[i]);
}
free(TheImage); ... free(Theta);
t8=ReportTimeDelta(t2, "Program Runtime without IO"); return (EXIT_SUCCESS);
}

5.2.3 Launching and Terminating Threads


Code 5.3 shows the part of the main() that launches and terminates multiple threads for
three separate functions: GaussianFilter(), Sobel(), and Threshold(). Using the timestamping
variables t4, t5, and t6, the execution times for these three separate functions are individ-
ually determined. This will come in handy for different analyses we will conduct, because
these functions have varying core/memory resource demands.

CODE 5.4: imedge.c ... GaussianFilter() {...


This function converts the BWImage into its Gaussian-filtered version, GaussImage.

double Gauss[5][5] = { { 2, 4, 5, 4, 2 },
{ 4, 9, 12, 9, 4 },
{ 5, 12, 15, 12, 5 },
{ 4, 9, 12, 9, 4 },
{ 2, 4, 5, 4, 2 } };
// Calculate Gaussian filtered GaussFilter[][] array from BW image
void *GaussianFilter(void* tid)
{
long tn; // My thread number (ID) is stored here
int row,col,i,j;
double G; // temp to calculate the Gaussian filtered version
tn = *((int *) tid); // Calculate my Thread ID
tn *= ip.Vpixels/NumThreads;

for(row=tn; row<tn+ip.Vpixels/NumThreads; row++){


if((row<2) || (row>(ip.Vpixels-3))) continue;
col=2;
while(col<=(ip.Hpixels-3)){
G=0.0;
for(i=-2; i<=2; i++){
for(j=-2; j<=2; j++){
G+=BWImage[row+i][col+j]*Gauss[i+2][j+2];
}
}
GaussImage[row][col]=G/159.00;
col++;
}
}
pthread_exit(NULL);
}

5.2.4 Gaussian Filter


Code 5.4 shows the implementation of Gaussian filtering, which applies Equation 5.2 to
the BWImage array to generate the GaussImage array. The Gaussian filter kernel – shown in
Equation 5.2 – is defined as a 2D global double array Gauss[][] inside imedge.c. For effi-
ciency, each pixel’s filtered value is saved in the G variable inside the two inner for loops and
the final value of G is divided by 159 only once before being written into the GaussImage array.
Let us analyze the resource requirements of the GaussianFilter() function:
• Gauss array is only 25 double elements (200 Bytes), which can easily fit in the cores’ L1$
or L2$. Therefore, this array allows the cores/threads to use the cache very efficiently
to store the Gauss array.
• BWImage array is accessed 25 times for each pixel’s computation, and should take
advantage of the cache architecture efficiently due to this high reuse ratio.
• GaussImage is written once for each pixel and takes no advantage of the cache memory.

CODE 5.5: imedge.c ... Sobel() {...


This function converts the GaussImage into its Gradient and Theta.

double Gx[3][3] = { { -1, 0, 1 },


{ -2, 0, 2 },
{ -1, 0, 1 } };

double Gy[3][3] = { { -1, -2, -1 },


{ 0, 0, 0 },
{ 1, 2, 1 }};
...
// Function that calculates the Gradient and Theta for each pixel
// Takes the Gauss[][] array and creates the Gradient[][] and Theta[][] arrays
void *Sobel(void* tid)
{
int row,col,i,j; double GX,GY;
long tn = *((int *) tid); tn *= ip.Vpixels/NumThreads;

for(row=tn; row<tn+ip.Vpixels/NumThreads; row++){


if((row<1) || (row>(ip.Vpixels-2))) continue;
col=1;
while(col<=(ip.Hpixels-2)){
// calculate Gx and Gy
GX=0.0; GY=0.0;
for(i=-1; i<=1; i++){
for(j=-1; j<=1; j++){
GX+=GaussImage[row+i][col+j]*Gx[i+1][j+1];
GY+=GaussImage[row+i][col+j]*Gy[i+1][j+1];
}
}
Gradient[row][col]=sqrt(GX*GX+GY*GY);
Theta[row][col]=atan(GX/GY)*180.0/PI;
col++;
}
}
pthread_exit(NULL);
}

5.2.5 Sobel
Code 5.5 implements the gradient computation. It achieves this by applying Equation 5.3
to the GaussImage array to generate the two resulting arrays: Gradient array contains the
magnitude of the edge gradients, whereas the Theta array contains the angle of the edges.
Resource usage characteristics of the Sobel() function are
• Gx and Gy arrays are small and should be cached nicely.
• GaussImage array is accessed 18 times for each pixel and should also be cache-friendly.
• Gradient and Theta arrays are written once for each pixel and take no advantage of
the cache memory inside the cores.

CODE 5.6: imedge.c ... Threshold() {...


This function finds the edges and saves the resulting binary image in CopyImage.

void *Threshold(void* tid)


{
int row,col; unsigned char PIXVAL; double L,H,G,T;
long tn = *((int *) tid); tn *= ip.Vpixels/NumThreads;
for(row=tn; row<tn+ip.Vpixels/NumThreads; row++){
if((row<1) || (row>(ip.Vpixels-2))) continue;
col=1;
while(col<=(ip.Hpixels-2)){
L=(double)ThreshLo; H=(double)ThreshHi;
G=Gradient[row][col]; PIXVAL=NOEDGE;
if(G<=L) PIXVAL=NOEDGE; else if(G>=H){PIXVAL=EDGE;} else {
T=Theta[row][col];
if((T<-67.5) || (T>67.5)){ // Look at left and right
PIXVAL=((Gradient[row][col-1]>H) ||
(Gradient[row][col+1]>H)) ? EDGE:NOEDGE;
}else if((T>=-22.5) && (T<=22.5)){ // Look at top and bottom
PIXVAL=((Gradient[row-1][col]>H) ||
(Gradient[row+1][col]>H)) ? EDGE:NOEDGE;
}else if((T>22.5) && (T<=67.5)){ // Look at upper right, lower left
PIXVAL=((Gradient[row-1][col+1]>H) ||
(Gradient[row+1][col-1]>H)) ? EDGE:NOEDGE;
}else if((T>=-67.5) && (T<-22.5)){ // Look at upper left, lower right
PIXVAL=((Gradient[row-1][col-1]>H) ||
(Gradient[row+1][col+1]>H)) ? EDGE:NOEDGE;
}
}
CopyImage[row][col*3]=PIXVAL; CopyImage[row][col*3+1]=PIXVAL;
CopyImage[row][col*3+2]=PIXVAL; col++;
}
}
pthread_exit(NULL);
}

5.2.6 Threshold
Code 5.6 shows the implementation of the thresholding function that determines whether a
given pixel at location [x, y] should be classified as an EDGE or NOEDGE. If the gradient
value is lower than ThreshLo or higher than ThreshHi, the EDGE/NOEDGE determination
requires only Equation 5.5 and the Gradient array. Any gradient value between these two
values requires a more elaborate computation, based on Equation 5.6.
The nested if conditions make the resource-usage analysis of Threshold() more complicated:
• Gradient has a good reuse ratio; it should therefore be cached nicely.
• Theta array is accessed based on pixel – and edge – values. So, it is hard to determine
its cache-friendliness.
• CopyImage array is accessed only once per pixel, making it cache-unfriendly.

TABLE 5.2 imedge.c execution times (in ms) for the W3690 CPU (6C/12T).
#Th/Func 1 2 4 8 10 12
ReadBMP() 73 70 71 72 73 72
Create arrays 749 722 741 724 740 734
GaussianFilter() 5329 2643 1399 1002 954 880
Sobel() 18197 9127 4671 2874 2459 2184
Threshold() 499 260 147 132 95 92
WriteBMP() 70 70 66 60 61 62
Total without IO 24850 12829 7030 4798 4313 3957

5.3 PERFORMANCE OF IMEDGE


Table 5.2 shows the run time results for the imedge.c program. Here are the summarized
observations for each line:
• ReadBMP(): Clearly, the performance of reading the BMP image from disk does not
depend on the number of threads used because the bottleneck is the disk access speed.
• Create arrays: Because the arrays are initialized using the efficient memset() function
in Code 2.5, no notable improvement is observed when multiple threads are running.
The calls to the memset() function completely saturate the memory subsystem even
if a single thread is running.
• GaussianFilter(): This function seems to be taking advantage of however many threads
you throw at it! This is a perfect function to take advantage of multithreading, because
it has a balanced usage pattern between memory and core resources. In other words, it
is both core- and memory-intensive. Therefore, added threads have sufficient work to
utilize the memory subsystem and core resources at an increasing rate. However, the
diminishing returns phenomenon is evident when the number of threads is increased.
• Sobel(): The characteristics of this step are almost identical to the Gaussian filter
because the operations are very similar.
• Threshold(): This operation is clearly more core-intensive, so it is less balanced than
the previous two operations. Because of this, the diminishing returns start at much
lower thread counts; from 4 to 8 threads, there is nearly no improvement.
• WriteBMP(): Much like reading the file, writing gets no benefit from multiple threads,
since the operation is I/O-intensive.
To summarize Table 5.2, the I/O-intensive image-reading and image-writing functions, as well as the memory-saturating image-array-initialization function, cannot benefit from multiple threads; however, they consume less than 5% of the execution time. On the other hand,
the memory- and core-intensive filtering functions consume more than 95% of the execution
time and can greatly benefit from multiple threads. This makes it clear that to improve
the performance of imedge.c, we should be focusing on improving the performance of the
filtering functions, GaussianFilter(), Sobel(), and Threshold().
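For example, using the first and last columns of Table 5.2: going from 1 to 12 threads shrinks the total runtime from 24,850 ms to 3957 ms, an overall speedup of roughly 6.3× on this 6C/12T CPU. GaussianFilter() and Sobel() alone improve by about 6× (5329/880) and 8.3× (18197/2184), respectively, while ReadBMP() and WriteBMP() stay essentially flat.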

5.4 IMEDGEMC: MAKING IMEDGE MORE EFFICIENT


To improve the overall computation speed of imedge.c, let us look at Equation 5.1 closely;
to compute the B&W image, Equation 5.1 simply requires one to add the R, G, and B

components of each pixel and divide the resulting value by 3. Later, when the Gaussian filter
is being applied, as shown in Equation 5.2, this B&W value for each pixel gets multiplied
by the Gaussian kernel entries 2, 4, 5, 9, 12, and 15. This kernel – that is essentially
a 5×5 matrix – possesses favorable symmetries, containing only six different values (2, 4,
5, 9, 12, and 15).
A close look at Equation 5.2 shows that the constant 1/3 multiplier for the BWImage array variable, as well as the constant 1/159 multiplier for the Gauss array matrix, can be taken
outside the computation and be dealt with at the very end of the computation. That way
the computational burden is only experienced once for the entire formula, rather than for
each pixel. Therefore, to compute Equation 5.1 and Equation 5.2, one can get away with
multiplying each pixel value with simple integer numbers.
It gets better ... look at the Gauss kernel; the corner value is 2, which means that some
pixel at some point gets multiplied by 2. Since the other corner value is also 2, some pixel
four columns ahead (horizontally) also gets multiplied by 2. Same for four rows below and
four rows and four columns below. An extensive set of such symmetries brings about the
following idea to speed up the entire convolution operation:
For each given pixel with a B&W value of, say, X, why not precompute different multiples
of X and save them somewhere only once, right after we know what the pixel’s B&W value
is. These multiples are clearly 2X, 4X, 5X, 9X, 12X, and 15X. A careful observer will
come up with yet another optimization: instead of multiplying X by 2, why not simply add
X to itself and save the result into another place. Once we have this 2X value, why not
add that to itself to get 4X and continue to add 4X to X to get 5X, thereby completely
avoiding multiplications.
Since we saved each pixels’s B&W value as a double, each multiplication and addition
is a double type operation; so, saving the multiplications will clearly help reduce the core
computation intensity. Additionally, each pixel is accessed only once, rather than 25 times
during the convolution operation that Equation 5.2 prescribes.
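As a minimal sketch of this add-only idea (the variable names here are illustrative; the actual implementation appears in Code 5.9, where X corresponds to the BW field and X3 to the R+G+B sum):

float X3  = R + G + B;      // 3X, already available from the B&W conversion
float X   = X3 * 0.33333;   // the pixel's B&W value (the only scaling operation)
float X2  = X   + X;        // 2X
float X4  = X2  + X2;       // 4X
float X5  = X4  + X;        // 5X
float X9  = X5  + X4;       // 9X
float X12 = X9  + X3;       // 12X
float X15 = X12 + X3;       // 15X -- all six kernel multiples built from additions only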

5.4.1 Using Precomputation to Reduce Bandwidth


Based on the ideas we came up with, we need to store multiple values for each pixel.
Code 5.7 lists the updated imageStuff.h file that stores the R, G, and B values of each pixel
as an unsigned char, as well as the precomputed B&W pixel value as a float in BW. Instead
of double, a float value has more than sufficient precision; however, curious readers are
welcome to try it with double.
The precomputed values are stored in variables BW2, BW4, ..., BW15. Fields for the Gauss, Gauss2, Theta, and Gradient values are also part of the same struct and will be explained shortly. With precomputation, we are not only reducing memory bandwidth ("M"), but also core computational bandwidth ("C"). First, the computation of the B&W image can be folded into the precomputation step itself. This reduces memory accesses and core utilization substantially by avoiding the computation of the same multiplications over and over again.

CODE 5.7: imageStuff.h ...


Header file that includes the struct to store the precomputed pixels.

#define EDGE 0
#define NOEDGE 255
#define MAXTHREADS 128

struct ImgProp{
int Hpixels;
int Vpixels;
unsigned char HeaderInfo[54];
unsigned long int Hbytes;
};
struct Pixel{
unsigned char R;
unsigned char G;
unsigned char B;
};
struct PrPixel{
unsigned char R;
unsigned char G;
unsigned char B;
unsigned char x; // unused. to make it an even 4B
float BW;
float BW2,BW4,BW5,BW9,BW12,BW15;
float Gauss, Gauss2;
float Theta,Gradient;
};

double** CreateBWCopy(unsigned char** img);


double** CreateBlankDouble();
unsigned char** CreateBlankBMP(unsigned char FILL);
struct PrPixel** PrAMTReadBMP(char*);
struct PrPixel** PrReadBMP(char*);
unsigned char** ReadBMP(char*);
void WriteBMP(unsigned char** , char*);

extern struct ImgProp ip;


extern long NumThreads, PrThreads;
extern int ThreadCtr[];

5.4.2 Storing the Precomputed Pixel Values


Code 5.7 shows the new struct that contains the precomputed pixel values. RGB values
are stored in the usual unsigned char, occupying 3 bytes total, followed by a “space filler”
variable named x to round the storage up to 4 bytes, for efficient access by a 32-bit load
operation. The precomputed B&W value of the pixel is stored in variable BW, while multiples
of this value are stored in BW2, BW4, ... BW15. The Gaussian filter value and Sobel kernel
precomputations are also stored in the same struct as explained shortly.
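As a quick sanity check on this layout (a hypothetical snippet, not part of the book's code): assuming 4-byte floats and no compiler-inserted padding, the 4 bytes of R, G, B, and x plus 11 floats should add up to 48 bytes per pixel, which can be verified by printing the struct size.

#include <stdio.h>
#include "imageStuff.h"   // assumed to declare struct PrPixel as in Code 5.7

int main(void)
{
    // 4 x unsigned char + 11 x 4-byte float = 48 bytes; the filler byte x keeps
    // the RGB group 4 bytes wide so every float member stays 4-byte aligned.
    printf("sizeof(struct PrPixel) = %zu bytes\n", sizeof(struct PrPixel));
    return 0;
}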

CODE 5.8: imedgeMC.c ...main() ...}


Precomputing pixel values.

struct PrPixel **PrImage; // the pre-calculated image


...
int main(int argc, char** argv)
{
...
printf("\nExecuting the Pthreads version with %li threads ...\n",NumThreads);
t1 = GetDoubleTime();
PrImage=PrReadBMP(argv[1]); printf("\n");
t2 = ReportTimeDelta(t1,"PrReadBMP complete"); // Start time without IO
CopyImage = CreateBlankBMP(NOEDGE); // This will store the edges in RGB
t3=ReportTimeDelta(t2, "Auxiliary images created");
pthread_attr_init(&ThAttr);
pthread_attr_setdetachstate(&ThAttr, PTHREAD_CREATE_JOINABLE);
for(i=0; i<NumThreads; i++){
ThParam[i] = i;
ThErr = pthread_create(&ThHandle[i], &ThAttr, PrGaussianFilter, ...);
if(ThErr != 0){ ... }
}
for(i=0; i<NumThreads; i++){ pthread_join(ThHandle[i], NULL); }
t4=ReportTimeDelta(t3, "Gauss Image created");
for(i=0; i<NumThreads; i++){
ThParam[i] = i; ThErr = pth...(..., PrSobel, ...); if(ThErr != 0){ ... }
}
for(i=0; i<NumThreads; i++){ pthread_join(ThHandle[i], NULL); }
t5=ReportTimeDelta(t4, "Gradient, Theta calculated");
for(i=0; i<NumThreads; i++){
ThParam[i] = i; ThErr = pth...(..., PrThreshold, ...); if(ThErr != 0){ ... }
}
pthread_attr_destroy(&ThAttr);
for(i=0; i<NumThreads; i++){ pthread_join(ThHandle[i], NULL); }
t6=ReportTimeDelta(t5, "Thresholding completed");
WriteBMP(CopyImage, argv[2]); printf("\n"); //merge with header and write to file
t7=ReportTimeDelta(t6, "WriteBMP completed");
// free() the allocated area for image and pointers
for(i = 0; i < ip.Vpixels; i++) { free(CopyImage[i]); free(PrImage[i]); }
free(CopyImage); free(PrImage);
t8=ReportTimeDelta(t2, "Program Runtime without IO");
return (EXIT_SUCCESS);
}

5.4.3 Precomputing Pixel Values


In Code 5.8, we read the image and precompute its pixel values using the PrReadBMP() function. The CreateBlankBMP() function was already explained in Code 5.2. Aside from that, the only difference is the replacement of the GaussianFilter(), Sobel(), and Threshold() functions with the PrGaussianFilter(), PrSobel(), and PrThreshold() functions. The additional array PrImage is made up of the PrPixel structs that were described in Code 5.7.

CODE 5.9: imageStuff.c PrReadBMP() {...}


Reading the image from the disk and precomputing pixel values.

struct PrPixel** PrReadBMP(char* filename)


{
int i,j,k; unsigned char r, g, b; unsigned char Buffer[24576];
float R, G, B, BW, BW2, BW3, BW4, BW5, BW9, BW12, Z=0.0;
FILE* f = fopen(filename, "rb"); if(f == NULL){ printf(...); exit(1); }
unsigned char HeaderInfo[54];
fread(HeaderInfo, sizeof(unsigned char), 54, f); // read the 54-byte header
// extract image height and width from header
int width = *(int*)&HeaderInfo[18]; ip.Hpixels = width;
int height = *(int*)&HeaderInfo[22]; ip.Vpixels = height;
int RowBytes = (width*3 + 3) & (~3); ip.Hbytes = RowBytes;
//copy header for re-use
for(i=0; i<54; i++) { ip.HeaderInfo[i] = HeaderInfo[i]; }
printf("\n Input BMP File name: %20s (%u x %u)",filename,ip.Hpixels,ip.Vpixels);
// allocate memory to store the main image
struct PrPixel **PrIm=(struct PrPixel **)malloc(height*sizeof(struct PrPixel *));
for(i=0; i<height; i++) PrIm[i]=(struct...)malloc(width*sizeof(struct PrPixel));
for(i = 0; i < height; i++) { // read image, pre-calculate the PrIm array
fread(Buffer, sizeof(unsigned char), RowBytes, f);
for(j=0,k=0; j<width; j++, k+=3){
b=PrIm[i][j].B=Buffer[k]; B=(float)b;
g=PrIm[i][j].G=Buffer[k+1]; G=(float)g;
r=PrIm[i][j].R=Buffer[k+2]; R=(float)r;
BW3=R+G+B; PrIm[i][j].BW = BW = BW3*0.33333;
PrIm[i][j].BW2 = BW2 = BW+BW; PrIm[i][j].BW4 = BW4 = BW2+BW2;
PrIm[i][j].BW5 = BW5 = BW4+BW; PrIm[i][j].BW9 = BW9 = BW5+BW4;
PrIm[i][j].BW12 = BW12 = BW9+BW3; PrIm[i][j].BW15 = BW12+BW3;
PrIm[i][j].Gauss = PrIm[i][j].Gauss2 = Z;
PrIm[i][j].Theta = PrIm[i][j].Gradient = Z;
}
}
fclose(f); return PrIm; // return the pointer to the main image
}

5.4.4 Reading the Image and Precomputing Pixel Values


The PrReadBMP() function reads the image from the disk and initializes each pixel with its RGB value as well as its precomputed values, as shown in Code 5.9. One interesting aspect of this function is that it overlaps the disk read with precomputation. Therefore, PrReadBMP() is no longer an exclusively I/O-bound function; it performs core-intensive and memory-intensive operations while it is reading the data from the disk. The fact that the PrReadBMP() function is core-, memory-, and I/O-intensive hints at the possibility of speeding it up by multithreading it. Note that this function does not actually calculate Gauss, Gauss2, Gradient, or Theta; it strictly initializes them.

CODE 5.10: imedgeMC.c ...PrGaussianFilter() {...}


Performing the Gaussian filter using the precomputed pixel values.

#define ONEOVER159 0.00628931


...
// Function that takes the pre-calculated .BW, .BW2, .BW4, ...
// pixel values and computes the .Gauss value from them
void *PrGaussianFilter(void* tid)
{
long tn; int row,col,i,j;
tn = *((int *) tid); tn *= ip.Vpixels/NumThreads;
float G; // temp to calculate the Gaussian filtered version

for(row=tn; row<tn+ip.Vpixels/NumThreads; row++){


if((row<2) || (row>(ip.Vpixels-3))) continue;
col=2;
while(col<=(ip.Hpixels-3)){
G=PrImage[row][col].BW15;
G+=(PrImage[row-1][col].BW12 + PrImage[row+1][col].BW12);
G+=(PrImage[row][col-1].BW12 + PrImage[row][col+1].BW12);
G+=(PrImage[row-1][col-1].BW9 + PrImage[row-1][col+1].BW9);
G+=(PrImage[row+1][col-1].BW9 + PrImage[row+1][col+1].BW9);
G+=(PrImage[row][col-2].BW5 + PrImage[row][col+2].BW5);
G+=(PrImage[row-2][col].BW5 + PrImage[row+2][col].BW5);
G+=(PrImage[row-1][col-2].BW4 + PrImage[row+1][col-2].BW4);
G+=(PrImage[row-1][col+2].BW4 + PrImage[row+1][col+2].BW4);
G+=(PrImage[row-2][col-2].BW2 + PrImage[row+2][col-2].BW2);
G+=(PrImage[row-2][col+2].BW2 + PrImage[row+2][col+2].BW2);
G*=ONEOVER159;
PrImage[row][col].Gauss=G;
PrImage[row][col].Gauss2=G+G;
col++;
}
}
pthread_exit(NULL);
}

5.4.5 PrGaussianFilter
The PrGaussianFilter() function, shown in Code 5.10, has the same exact functionality as
Code 5.4, which is the version that does not use the precomputed values. The difference
between GaussianFilter() and PrGaussianFilter() is that the latter achieves the same computation result by adding the appropriate precomputed values for the corresponding pixels, rather than performing the actual computation.
The inner-loop simply computes the Gaussian-filtered pixel value from Equation 5.2 by
using the precomputed values that were stored in the struct in Code 5.7.

CODE 5.11: imedgeMC.c PrSobel() {...}


Performing the Sobel using the precomputed pixel values.

// Function that calculates the .Gradient and .Theta for each pixel.
// Uses the pre-computed .Gauss and .Gauss2 values
void *PrSobel(void* tid)
{
int row,col,i,j; float GX,GY; float RPI=180.0/PI;
long tn = *((int *) tid); tn *= ip.Vpixels/NumThreads;

for(row=tn; row<tn+ip.Vpixels/NumThreads; row++){


if((row<1) || (row>(ip.Vpixels-2))) continue;
col=1;
while(col<=(ip.Hpixels-2)){
// calculate Gx and Gy
GX = PrImage[row-1][col+1].Gauss + PrImage[row+1][col+1].Gauss;
GX += PrImage[row][col+1].Gauss2;
GX -= (PrImage[row-1][col-1].Gauss + PrImage[row+1][col-1].Gauss);
GX -= PrImage[row][col-1].Gauss2;
GY = PrImage[row+1][col-1].Gauss + PrImage[row+1][col+1].Gauss;
GY += PrImage[row+1][col].Gauss2;
GY -= (PrImage[row-1][col-1].Gauss + PrImage[row-1][col+1].Gauss);
GY -= PrImage[row-1][col].Gauss2;
PrImage[row][col].Gradient=sqrtf(GX*GX+GY*GY);
PrImage[row][col].Theta=atanf(GX/GY)*RPI;
col++;
}
}
pthread_exit(NULL);
}

5.4.6 PrSobel
The PrSobel() function, shown in Code 5.11, has the same exact functionality as Code 5.5,
which is the version that does not use the precomputed values. The difference between
Sobel() and PrSobel() is that the latter achieves the same computation result by adding the
appropriate precomputed values for the corresponding pixels, rather than performing the
actual computation.
The inner-loop simply computes the Sobel-filtered pixel value from Equation 5.3 by
using the precomputed values that were stored in the struct in Code 5.7.

CODE 5.12: imedgeMC.c ...PrThreshold() {...}


Performing the Threshold function using the precomputed pixel values.

// Function that takes the .Gradient and .Theta pre-computed values for


// each pixel and calculates the final value (EDGE/NOEDGE)
void *PrThreshold(void* tid)
{
int row,col,col3; unsigned char PIXVAL; float L,H,G,T;
long tn = *((int *) tid); tn *= ip.Vpixels/NumThreads;
for(row=tn; row<tn+ip.Vpixels/NumThreads; row++){
if((row<1) || (row>(ip.Vpixels-2))) continue;
col=1; col3=3;
while(col<=(ip.Hpixels-2)){
L=(float)ThreshLo; H=(float)ThreshHi;
G=PrImage[row][col].Gradient; PIXVAL=NOEDGE;
if(G<=L){ PIXVAL=NOEDGE; }else if(G>=H){ PIXVAL=EDGE; }else{ // noedge,edge
T=PrImage[row][col].Theta;
if((T<-67.5) || (T>67.5)){ // Look at left and right
PIXVAL=((PrImage[row][col-1].Gradient>H) ||
(PrImage[row][col+1].Gradient>H)) ? EDGE:NOEDGE;
}else if((T>=-22.5) && (T<=22.5)){ // Look at top and bottom
PIXVAL=((PrImage[row-1][col].Gradient>H) ||
(PrImage[row+1][col].Gradient>H)) ? EDGE:NOEDGE;
}else if((T>22.5) && (T<=67.5)){ // Look at upper right, lower left
PIXVAL=((PrImage[row-1][col+1].Gradient>H) ||
(PrImage[row+1][col-1].Gradient>H)) ? EDGE:NOEDGE;
}else if((T>=-67.5) && (T<-22.5)){ // Look at upper left, lower right
PIXVAL=((PrImage[row-1][col-1].Gradient>H) ||
(PrImage[row+1][col+1].Gradient>H)) ? EDGE:NOEDGE;
}
}
if(PIXVAL==EDGE){ // Each pixel was initialized to NOEDGE
CopyImage[row][col3]=PIXVAL; CopyImage[row][col3+1]=PIXVAL;
CopyImage[row][col3+2]=PIXVAL;
}
col++; col3+=3;
}
}
pthread_exit(NULL);
}

5.4.7 PrThreshold
The PrThreshold() function, shown in Code 5.12, has the same exact functionality as
Code 5.6, which is the version that does not use the precomputed values. The difference
between Threshold() and PrThreshold() is that the latter achieves the same computation
result by adding the appropriate precomputed values for the corresponding pixels, rather
than performing the actual computation.
The inner-loop simply computes the resulting binary (thresholded) pixel value from
Equation 5.5 by using the precomputed values that were stored in the struct in Code 5.7.

TABLE 5.3 imedgeMC.c execution times for the W3690 CPU (6C/12T) in ms for a
varying number of threads (above). For comparison, execution times of imedge.c
are repeated from Table 5.2 (below).
Function #threads =⇒ 1 2 4 8 10 12
PrReadBMP() 2836 2846 2833 2881 2823 2898
Create arrays 31 32 31 36 31 31
PrGaussianFilter() 2179 1143 570 526 539 606
PrSobel() 7475 3833 1879 1141 945 864
PrThreshold() 358 193 121 107 113 107
WriteBMP() 61 60 61 61 60 61
imedgeMC.c runtime no I/O 12940 8107 5495 4752 4511 4567
ReadBMP() 73 70 71 72 73 72
Create arrays 749 722 741 724 740 734
GaussianFilter() 5329 2643 1399 1002 954 880
Sobel() 18197 9127 4671 2874 2459 2184
Threshold() 499 260 147 132 95 92
WriteBMP() 70 70 66 60 61 62
imedge.c runtime no I/O 24850 12829 7030 4798 4313 3957
Speedup 1.92× 1.58× 1.28× 1.01× 0.96× 0.87×

5.5 PERFORMANCE OF IMEDGEMC


Table 5.3 shows the run time results for the imedgeMC.c program. To be able to compare this
performance to the previous version, imedge.c, Table 5.3 actually provides the performance
results of both programs, including a detailed breakdown.
So, what did we improve? Let us analyze the performance results:
• The ReadBMP() function needed only ≈ 70 ms to read the BMP image file, whereas the
PrReadBMP() took ≈ 2850 ms on average, due to the precomputation while reading the
image. In other words, we added a large overhead to perform the precomputation.
• Creating the arrays required ≈ 700 ms less in imedgeMC.c, since the B&W image
computation ended up being shifted to the precomputation phase in imedgeMC.c. In
other words, our effective precomputation overhead, i.e., the non-BW-computation
part, was only ≈ 2100 ms.
• Since each compute- and memory-intensive function uses the precomputed values in
imedgeMC.c, they achieved healthy speed-ups: PrGaussianFilter() seems to be ≈ 2×
faster than GaussianFilter() on average, while PrSobel() is ≈ 2.5× faster consistently.
PrThreshold(), on the other hand, got a much lower speed-up; this is not a big deal
since thresholding is a fairly small portion of the overall execution time anyway.
• The WriteBMP() function was not affected since it is strictly I/O-intensive.
Putting it all together, we achieved our most important goal of speeding up the compute-
intensive functions by using precomputed values. One interesting observation, though, is that increasing the number of threads helps imedgeMC.c a lot less. This is because the core/memory workload is much less balanced in imedgeMC.c; due to the precomputation, there are far fewer memory accesses and the weight is shifted toward the cores, leaving less work for the added threads and reducing their utility.

CODE 5.13: imedgeMCT.c ...main() {...}


Structure of main() that uses MUTEX and barrier synchronization.

int main(int argc, char** argv)


{
...
t1 = GetDoubleTime();
PrImage=PrAMTReadBMP(argv[1]); printf("\n");
t2 = ReportTimeDelta(t1,"PrAMTReadBMP complete"); // Start time without IO
CopyImage = CreateBlankBMP(NOEDGE); // This will store the edges in RGB
t3=ReportTimeDelta(t2, "Auxiliary images created");
pthread_attr_init(&ThAttr); pthread_attr_setdetachstate(&ThAttr, PTHREAD_CR...);
for(i=0; i<NumThreads; i++){ ... pthread_create(...PrGaussianFilter...); ... }
for(i=0; i<NumThreads; i++){ pthread_join(ThHandle[i], NULL); }
t4=ReportTimeDelta(t3, "Gauss Image created");
for(i=0; i<NumThreads; i++){ ... pthread_create(...PrSobel...); }
for(i=0; i<NumThreads; i++){ pthread_join(ThHandle[i], NULL); }
t5=ReportTimeDelta(t4, "Gradient, Theta calculated");
for(i=0; i<NumThreads; i++){ ... pthread_create(...PrThreshold...); }
pthread_attr_destroy(&ThAttr);
for(i=0; i<NumThreads; i++){ pthread_join(ThHandle[i], NULL); }
t6=ReportTimeDelta(t5, "Thresholding completed");
//merge with header and write to file
WriteBMP(CopyImage, argv[2]); printf("\n");
t7=ReportTimeDelta(t6, "WriteBMP completed");
// free() the allocated area for image and pointers
for(i = 0; i < ip.Vpixels; i++) { free(CopyImage[i]); free(PrImage[i]); }
free(CopyImage); free(PrImage);
t8=ReportTimeDelta(t2, "Prog ... IO"); printf("\n\n--- ... -----\n");
for(i=0; i<PrThreads; i++) {
printf("\ntid=%2li processed %4d rows\n",i,ThreadCtr[i]);
} printf("\n\n--- ... -----\n"); return (EXIT_SUCCESS);
}

5.6 IMEDGEMCT: SYNCHRONIZING THREADS EFFICIENTLY


Code 5.13 shows the final version of main() in imedgeMCT.c. This is the thread-friendly version ("T") of imedgeMC.c. The only change we notice is that the precomputation is now implemented using PrAMTReadBMP(), rather than the PrReadBMP() function that imedgeMC.c used in Code 5.8.
The PrAMTReadBMP() function will allow us to introduce two new concepts: (1) the
concept of barrier synchronization, and (2) usage of MUTEXes. Detailed descriptions of
these two concepts will follow shortly in Section 5.6.1 and Section 5.6.2, respectively. Note
that the techniques we introduce in PrAMTReadBMP() (AMT denoting "asymmetric multithreading") can be readily applied to the other functions that use multithreading, such
as PrGaussianFilter(), PrSobel(), and PrThreshold(); this is, however, left to the reader because
it does not have any additional instructional value. Our focus will be on understanding the
effect of these two concepts on the performance of the PrAMTReadBMP() function, which is
sufficient to apply them to any function you want later.

FIGURE 5.2 Example barrier synchronization for 4 threads. Serial runtime is 7281 ms
and the 4-threaded runtime is 2246 ms. The speedup of 3.24× is close to the best-
expected 4×, but not equal due to the imbalance of each thread’s runtime.

5.6.1 Barrier Synchronization


When N threads are executing the same function, each handling 1/N-th of the entire task, at what point do we consider the entire task finished? Figure 5.2 shows an example where the entire
task takes 7281 ms when executed in a serial fashion. The same task, when executed using
4 threads, takes 2246 ms, which means a 3.24× speedup (81% threading efficiency); from
everything we have seen previously, this 81% threading efficiency is not that bad. But, the
important question is: could we have done better?
To answer this, let’s dig deep into the details of this 81%. First of all, are we getting
so much below 100% because of the core-intensive or the memory-intensive nature of the
task? It turns out, there is another factor: the synchronization of threads; when you split
the entire task into 4 pieces, even if the amount of work that has to be done is equal among
all four pieces, there is no guarantee that they will finish at the same time on different hardware threads. There are just too many factors that come into play at runtime, one of
them being the involvement of the OS, as we saw in Section 3.4.4.
Ideally, each thread will complete its own sub-task in exactly 25% of the time that it takes to complete the entire task, i.e., 7281/4 ≈ 1820 ms, and the task will be completed in
1820 ms. However, in a realistic scenario that we see in Figure 5.2, although one of the
threads executes its portion in an amount of time that is fairly close to the ideal 1820 ms
interval (1835 ms), three of them are far from it (1981, 2016, and 2246 ms). In the end,
we cannot deem the task finished until the very last thread completes in 2246 ms, thereby
making the multithreaded execution time 2246 ms. Sadly, the three faster threads sit idle and wait for the slowest one to finish, reducing the efficiency.

FIGURE 5.3 Using a MUTEX data structure to access shared variables. Thread 1 locks MUTEX M, updates a shared variable (b = b + (int)c;), and unlocks M; Thread 2 does the same for f = a + (float)(b + c);. The shared variables float a, int b, and char c (and the result f) are all accessed through MUTEX M.

5.6.2 MUTEX Structure for Data Sharing


We will dedicate the imedgeMCT.c program to answering whether we could have done better by using better synchronization; could we have made it thread-friendly enough to deserve the "T" in imedgeMCT.c? The answer is clearly yes, but we need to explain a few more things before we can explain the implementation of imedgeMCT.c.
First of all, when multiple threads are updating a variable (read or write), there is the
danger of one thread overwriting that variable’s value before another thread gets a chance
to read the correct value. This can be prevented with a MUTEX structure that allows the
threads to update a variable’s value in a “thread-safe” manner, completely eliminating the
possibility for one thread to read an incorrect value. A MUTEX structure is shown in
Figure 5.3. To understand it, let us look at an analogy.

ANALOGY 5.1: Thread Synchronization.


CocoTown needed to harvest 1800 coconuts, for which they employed 4 farmers. The
town supervisor would ask each farmer to take 450 coconuts and come back when
they are harvested. Because each farmer harvested the coconuts at different speeds,
the task was completed by different farmers in 450, 475, 500, and 515 minutes. The
supervisor declared the entire harvesting task complete in 515 minutes.
Since the faster farmers had to wait for the slower ones to complete, they asked the
supervisor to try something different next year: what if the farmers took one coconut
at a time and when harvested, they came back and asked the supervisor for another
one? To track how many coconuts have been harvested so far, the supervisor put up
a chalk board and asked each farmer to update the coconut count on it. That year,
the overall task took 482 minutes during which farmers harvested 418, 440, 464, 478
coconuts, respectively; a healthy improvement from 515 minutes.
Farmers noticed something strange when they were updating the coconut count;
although rarely, sometimes they would read the coconut count exactly at the same
time (say, count=784) and politely wait for the other farmer to finish and write the
updated count (count=785) on the board. Clearly, this count was one less than what
it was supposed to be, since the count ended up being updated only once instead of
twice. They found the solution by introducing a red flag, which each farmer used to
let the other know that he was updating the counter. The other farmer waited until the flag was down, raised it himself, and only then read and updated the count.

The count updating problem mentioned in Analogy 5.1 is precisely what happens when
multiple threads attempt to update the same variable without using an additional structure
like a MUTEX, shown in Figure 5.3. It is a very common bug in multithreaded code. The
red flag solution is precisely what is used in multithreaded programming. The underlying
idea to prevent incorrect updating is very simple: instead of a thread recklessly updating a
variable (i.e., unsafely), a shared MUTEX variable is used.
If a variable is a MUTEX variable, it has to be updated according to the rules of updating
such a variable; each thread knows that it is updating a MUTEX variable and does not touch
the variable itself before it lets the other threads know that it is updating it. It does this by
locking the MUTEX, which is equivalent to bringing up the red flag in Analogy 5.1. Once it
locks the MUTEX, it is free to make any update it desires to the variable that is controlled
by this MUTEX. In other words, it either excludes itself or the other threads from updating
a MUTEX variable, hence the name mutually exclusive, or in short, MUTEX.
Before the terms get confused, let me make something clear: there is a difference between
a MUTEX itself and MUTEX variables. For example, in Figure 5.3, the name of the MUTEX is
M , while the variables that MUTEX M controls are f , a, b, and c. In such a setup, the
multithreaded functions are supposed to lock and unlock the MUTEX M itself. After a lock
has been obtained, they are free to update the four MUTEX-controlled variables, f , a, b, and
c. When done, they are supposed to unlock MUTEX M.
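As a minimal sketch of this discipline with the POSIX API (variable names follow Figure 5.3; error checking is omitted, and this is an illustration rather than code from the book):

pthread_mutex_t M = PTHREAD_MUTEX_INITIALIZER;   // the MUTEX itself
float a, f; int b; char c;                       // shared variables controlled by M

// inside any thread that touches the shared variables:
pthread_mutex_lock(&M);        // raise the "red flag"
b = b + (int)c;                // safe: no other thread can touch b or c right now
f = a + (float)(b + c);
pthread_mutex_unlock(&M);      // lower the flag so other threads may proceed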
Let me make one point very clear: although a MUTEX eliminates the incorrect-updating
problem, implementing a MUTEX requires hardware-level atomic operations. For example,
the CAS (compare and swap) instructions in the x86 Intel ISA achieve this. Luckily, a
programmer enjoys the readily available MUTEX implementation functions that are a part
of POSIX, and this is what we will use in our implementation of imedgeMCT.c.
There is absolutely no mechanism that checks whether a thread has updated a variable safely by locking/unlocking a controlling MUTEX; therefore, it is a common bug for a programmer to forget that he or she was supposed to lock/unlock a MUTEX for a variable
that is being shared. In this case, exactly the same problems that were mentioned in Anal-
ogy 5.1 will creep up, presenting an impossible-to-debug problem. It is also common for a
programmer to realize that a variable should have really been a MUTEX variable and declare
a MUTEX for it halfway into the program development. However, declaring a MUTEX does
not magically solve the incorrect-updating problem; correct locking/unlocking does. All it takes is one place where the variable is accessed without locking/unlocking its corresponding MUTEX; worse yet, most incorrect-update problems are infrequent and manifest themselves as weird intermittent problems, keeping many programmers up at night! So, good upfront planning is the best way to prevent these problems.

5.7 IMEDGEMCT: IMPLEMENTATION


Like I mentioned before, only the PrAMTReadBMP() function, inside imageStuff.c, will be
implemented using MUTEX structures, as shown in Code 5.14. We will borrow important
ideas from Analogy 5.1: instead of assigning a 1/N chunk of the entire task to N different
threads, we will assign a much smaller task to each thread, namely reading and precomputing
a single row of the image, and keep a counter to determine what the next row is.
PrAMTReadBMP() reads the image one row at a time and updates a MUTEX variable
named LastRowRead, controlled by the MUTEX named CtrMutex. It creates N different threads
that perform the precomputation using a function named AMTPreCalcRow(), shown in
Code 5.15.

CODE 5.14: imageStuff.c ...PrAMTReadBMP() {...}


Structure of the asymmetric multithreading and barrier synchronization.

pthread_mutex_t CtrMutex; // MUTEX


int NextRowToProcess, LastRowRead; // MUTEX variables
// This function reads one row at a time, assigns them to threads to precompute
struct PrPixel** PrAMTReadBMP(char* filename)
{
int i,j,k,ThErr; unsigned char Buffer[24576];
pthread_t ThHan[MAXTHREADS]; pthread_attr_t attr;
FILE* f = fopen(filename, "rb"); if(f == NULL){ ... }
unsigned char HeaderInfo[54];
fread(HeaderInfo, sizeof(unsigned char), 54, f); // read the 54-byte header
// extract image height and width from header, and copy header for re-use
int width = *(int*)&HeaderInfo[18]; ip.Hpixels = width;
int height = *(int*)&HeaderInfo[22]; ip.Vpixels = height;
int RowBytes = (width*3 + 3) & (~3); ip.Hbytes = RowBytes;
for(i=0; i<54; i++) { ip.HeaderInfo[i] = HeaderInfo[i]; }
printf("\n Input BMP File name: %20s (%u x %u)",filename,ip.Hpixels,ip.Vpixels);
// allocate memory to store the main image
PrIm = (struct PrPixel **)malloc(height * sizeof(struct PrPixel *));
for(i=0; i<height; i++) {
PrIm[i] = (struct PrPixel *)malloc(width * sizeof(struct PrPixel));
}
pthread_attr_init(&attr); pthread_attr_setdetachstate(&attr, PTHRE...JOINABLE);
pthread_mutex_init(&CtrMutex, NULL); // create a MUTEX named CtrMutex
pthread_mutex_lock(&CtrMutex); // MUTEX variable updates require lock,unlock
NextRowToProcess=0; // Set the asynchronous row counter to 0
LastRowRead=-1; // no rows read yet
for(i=0; i<PrThreads; i++) ThreadCtr[i]=0; // zero every thread counter
pthread_mutex_unlock(&CtrMutex);
// read the image from disk and pre-calculate the PRImage pixels
for(i = 0; i<height; i++) {
if(i==20){ // when sufficient # of rows are read, launch threads
// PrThreads is the number of pre-processing threads
for(j=0; j<PrThreads; j++){
ThErr=p..create(&ThHan[j], &attr, AMTPreCalcRow, (void *)&ThreadCtr[j]);
if(ThErr != 0){ ... }
}
}
fread(Buffer, sizeof(unsigned char), RowBytes, f); // Read one row
for(j=0,k=0; j<width; j++, k+=3){
PrIm[i][j].B=Buffer[k]; PrIm[i][j].G=Buffer[k+1]; PrIm[i][j].R=Buffer[k+2];
}
// Advance LastRowRead. While doing this, lock the CtrMutex, then unlock.
pthread_mutex_lock(&CtrMutex); LastRowRead=i; pthread_mutex_unlock(&CtrMutex);
}
for(i=0; i<PrThreads; i++){ pthread_join(ThHan[i], NULL); } // join threads
pthread_attr_destroy(&attr); pthread_mutex_destroy(&CtrMutex); fclose(f);
return PrIm; // return the pointer to the main image
}

5.7.1 Using a MUTEX: Read Image, Precompute


Looking at Code 5.14, PrAMTReadBMP() looks very similar to the multi-threaded functions
we have seen before, with the exception of the MUTEX variables:
• MUTEX variables are sacred! Any time you need to touch them, you need to lock the
responsible MUTEX and unlock it after completing the update. Here is an example:

pthread_mutex_lock(&CtrMutex); // MUTEX vars require lock,unlock


NextRowToProcess=0; // Set the asynchronous row counter to 0
LastRowRead=-1; // no rows read yet
for(i=0; i<PrThreads; i++) ThreadCtr[i]=0; // zero every thread counter
pthread_mutex_unlock(&CtrMutex);

• The indentation of the variables inside the lock/unlock make it easy to see the variables
that are being updated while the MUTEX is locked.
• We have to create and destroy each MUTEX using the following functions:

pthread_mutex_init(&CtrMutex, NULL); // create a MUTEX named CtrMutex


...
pthread_mutex_destroy(&CtrMutex); // destroy the MUTEX named CtrMutex

• For N threads that are being launched, there are N + 2 MUTEX variables, all con-
trolled by the same MUTEX named CtrMutex. These variables are: NextRowToProcess,
LastRowRead, and the array ThreadCtr[0]..ThreadCtr[N-1].

• PrAMTReadBMP() updates LastRowRead to indicate the last image row it read.


• NextRowToProcess tells AMTPreCalcRow() which row to preprocess next; while prepro-
cessing each row, each thread knows its own tid; for example ThreadCtr[5] is only
incremented by the thread that has tid = 5. Since we are expecting each thread to
preprocess a different number of rows, the counts in the ThreadCtr[] array will be
different, although we expect them to be relatively close to each other. In our Anal-
ogy 5.1, four farmers harvested 418, 440, 464, 478 coconuts each; that analogy is
equivalent to four different threads preprocessing 418, 440, 464, and 478 rows of the
image, thereby leaving the array values ThreadCtr[0] = 418, ThreadCtr[1] = 440, ThreadCtr[2] = 464, and ThreadCtr[3] = 478 when PrAMTReadBMP() joins them.
• PrAMTReadBMP() does not launch the threads until it has read a few rows (20 in Code 5.14). This avoids the threads being idle and continuously spinning (checking to see whether there are rows available to preprocess). This number is not critical; however, I am showing it since it has some value in pointing out the concept of idle threads.
• While the reading from the disk is going on, precomputation of the rows that have already been read continues in the launched threads. The function AMTPreCalcRow(), shown in Code 5.15, is responsible for precomputing each row.
• Notice that PrAMTReadBMP() is itself an active, I/O-intensive thread, in addition to the N threads it launches. So, the execution of PrAMTReadBMP() involves N + 1 threads on the computer. Clearly, there is no reason why PrAMTReadBMP() and the rest of the functions of the imedgeMCT program cannot use different numbers of threads; this is something that is left to the reader to implement.

CODE 5.15: imageStuff.c ...AMTPreCalcRow() {...}


Function that precomputes each row. The same CtrMutex MUTEX is used to share
multiple variables among the threads and PrAMTReadBMP().

pthread_mutex_t CtrMutex;
struct PrPixel **PrIm;
int NextRowToProcess, LastRowRead;
int ThreadCtr[MAXTHREADS]; // Counts # rows processed by each thread
void *AMTPreCalcRow(void* ThCtr)
{
unsigned char r, g, b; int i,j,Last;
float R, G, B, BW, BW2, BW3, BW4, BW5, BW9, BW12, Z=0.0;
do{ // get the next row number safely
pthread_mutex_lock(&CtrMutex);
Last=LastRowRead; i=NextRowToProcess;
if(Last>=i){
NextRowToProcess++; j = *((int *)ThCtr);
*((int *)ThCtr) = j+1; // One more row processed by this thread
}
pthread_mutex_unlock(&CtrMutex);
if(Last<i) continue;
if(i>=ip.Vpixels) break;
for(j=0; j<ip.Hpixels; j++){
r=PrIm[i][j].R; g=PrIm[i][j].G; b=PrIm[i][j].B;
R=(float)r; G=(float)g; B=(float)b; BW3=R+G+B;
PrIm[i][j].BW = BW = BW3*0.33333; PrIm[i][j].BW2 = BW2 = BW+BW;
PrIm[i][j].BW4 = BW4 = BW2+BW2; PrIm[i][j].BW5 = BW5 = BW4+BW;
PrIm[i][j].BW9 = BW9 = BW5+BW4; PrIm[i][j].BW12 = BW12 = BW9+BW3;
PrIm[i][j].BW15 = BW12+BW3; PrIm[i][j].Gauss=PrIm[i][j].Gauss2=Z;
PrIm[i][j].Theta= PrIm[i][j].Gradient = Z;
}
}while(i<ip.Vpixels);
pthread_exit(NULL);
}

5.7.2 Precomputing One Row at a Time


PrAMTReadBMP() launches N threads (requested by the user) and assigns the same
precomputation function AMTPreCalcRow() to all of them, shown in Code 5.15. While
AMTPreCalcRow() is similar to PrReadBMP() in Code 5.9, there are major differences:
• The granularity of AMTPreCalcRow() is much finer; it is expected to process only a
single row of the image before it updates some MUTEX variables. This contrasts with
an entire 1/N of the image that each thread has to process in PrReadBMP(). This is one
idea we are borrowing from the farmers in Analogy 5.1. Although the processing times
of each row may be different, their influence on overall execution time is negligible.
• AMTPreCalcRow() updates a MUTEX variable named NextRowToProcess, by properly
locking/unlocking the MUTEX that controls this variable, named CtrMutex. Clearly
NextRowToProcess is being updated by all N threads, necessitating the usage of a
MUTEX to avoid updating problems.

TABLE 5.4 imedgeMCT.c execution times (in ms) for the W3690 CPU (6C/12T),
using the Astronaut.bmp image file (top) and Xeon Phi 5110P (60C/240T) using
the dogL.bmp file (bottom).
Function #threads =⇒ 1 2 4 8 10 12
PrAMTReadBMP() 2267 1264 920 1014 1020 1078
Create arrays 33 31 31 33 32 33
PrGaussianFilter() 2223 1157 567 556 582 611
PrSobel() 7415 3727 1910 1124 948 842
PrThreshold() 341 195 119 107 99 104
WriteBMP() 61 62 60 63 61 63
imedgeMCT.c w/o IO 12640 6436 3607 2897 2742 2731
PrReadBMP() 2836 2846 2833 2881 2823 2898
Create arrays 31 32 31 36 31 31
PrGaussianFilter() 2179 1143 570 526 539 606
PrSobel() 7475 3833 1879 1141 945 864
PrThreshold() 358 193 121 107 113 107
WriteBMP() 61 60 61 61 60 61
imedgeMC.c w/o IO 12940 8107 5495 4752 4511 4567
Speedup (W3690) 1.02× 1.26× 1.52× 1.64× 1.64× 1.67×

Xeon #threads =⇒ 1 2 4 8 16 32 64 128


Xeon Phi 5110P no IO 3994 2178 1274 822 604 507 486 532

• Each instance of the AMTPreCalcRow() function makes no assumption on how many


rows it is supposed to process, since it could differ among different instances
of this function; therefore, the only terminating condition for each instance of
AMTPreCalcRow() is when the NextRowToProcess reaches the end of the image and
there is nothing more to process.
• AMTPreCalcRow() idles when the LastRowRead variable indicates that PrAMTReadBMP() has fallen behind due to a slow hard disk read; remember, though, that the 20 rows buffered up front in Code 5.14 should avoid this.

5.8 PERFORMANCE OF IMEDGEMCT


Table 5.4 shows the run time results for the imedgeMCT.c program. The top part of the table compares the runtimes of imedgeMCT.c and imedgeMC.c. Because we only redesigned the
PrReadBMP() function, its redesigned asymmetric multithreaded version PrAMTReadBMP()
benefits from multithreading and gets progressively faster as the number of threads in-
creases. The impact of this on the overall performance is clearly visible.
I also ran imedgeMCT.c on a Xeon Phi 5110P, which has 60 cores and 240 threads; because each thread is a thick thread in almost every function we are using, the additional threads did not benefit the Xeon Phi, saturating the performance as we got closer to 60 threads.
Remember from Section 3.9 that benefitting from the vast number of threads that exist in a Xeon Phi requires meticulous engineering; unless thick threads are intermixed with thin threads as described in Section 3.2.2, no additional benefit will be gained from the Xeon Phi when the number of threads increases beyond the core count.
PART II
GPU Programming Using CUDA

CHAPTER 6

Introduction to GPU Parallelism and CUDA

We have spent a considerable amount of time in understanding CPU parallelism and how to write CPU parallel programs. During the process, we have learned a
great deal about how simply bringing a bunch of cores together will not result in a magical
performance improvement of a program that was designed as a serial program to begin with.
This is the first chapter where we will start understanding the inner-workings of a GPU;
the good news is that we have such a deep understanding of the CPU that we can make
comparisons between a CPU and GPU along the way. While so many of the concepts will
be dead similar, some of the concepts will only have a place in the GPU world. It all starts
with the monsters ...

6.1 ONCE UPON A TIME ... NVIDIA ...


Yes, it all starts with the monsters. As many game players know, many games have monsters
or planes or tanks moving from here to there and interacting heavily during this movement,
whether it is crashing into each other or being shot by the game player to kill the monsters.
What do these actions have in common? (1) A plane moving in the sky, (2) a tank shooting,
or (3) a monster trying to grab you by moving his arms and his body. The answer —
from a mathematical standpoint — is that all of these objects are high resolution graphic
elements, composed of many pixels, and moving them (such as rotating them) requires heavy
floating point computations, as exemplified in Equation 4.1 during the implementation of
the imrotate.c program.

6.1.1 The Birth of the GPU


Computer games have existed as long as computers have. I was playing computer games in the late 1990s, and I had an Intel CPU in my computer; something like a 486. Intel
offered two flavors of the 486 CPU: 486SX and 486DX. 486SX CPUs did not have a built-in
floating point unit (FPU), whereas 486DX CPUs did. So, 486SX was really designed for
more general purpose computations, whereas 486DX made games work much faster. So, if
you were a gamer like me in the late 1990s and trying to play a game that had a lot of
these tanks, planes, etc. in it, hopefully you had a 486DX, because otherwise your games
would play so slow that you would not be able to enjoy them. Why? Because games require
heavy floating point operations and your 486SX CPU does not incorporate an FPU that is
capable of performing floating point operations fast enough. Your CPU would have to resort to using its ALU to emulate an FPU, which is a very slow process.


This story gets worse. Even if you had a 486DX CPU, the FPU inside your 486DX was
still not fast enough for most of the games. Any exciting game demanded a 20× (or even
50×) higher-than-achievable floating point computational power from its host CPU. Surely,
in every generation the CPU manufacturers kept improving their FPU performance, just to
witness a demand for FPU power that grew much faster than the improvements they could
provide. Eventually, starting with the Pentium generation, the FPU was an integral part
of a CPU, rather than an option, but this didn’t change the fact that significantly higher
FPU performance was needed for games. In an attempt to provide much higher scale FPU
performance, Intel went on a frenzy to introduce vector processing units inside their CPUs:
the first ones were called MMX, then SSE, then SSE2, and the ones in 2016 are SSE4.2.
These vector processing units were capable of processing many FPU operations in parallel
and their improvement has never stopped.
Although these vector processing units helped certain applications a lot — and they still
do – the demand for an ever-increasing amount of FPU power was insane! When Intel could
deliver a 2× performance improvement, game players demanded 10× more. When they
could eventually manage to deliver 10× more, they demanded 100× more. Game players
were just monsters that ate lots of FLOPS! And, they were always hungry! Now what? This
was the time when a paradigm shift had to happen. The late 1990s is when the manufacturers of many plug-in boards for PCs, such as sound cards or Ethernet controllers, came up with the idea of a card that could be used to accelerate floating point operations.
Furthermore, routine image coordinate conversions during the course of a game, such as
3D-to-2D conversions and handling of triangles, could be performed significantly faster by
dedicated hardware rather than wasting precious CPU time. Note that the actual unit
element of a monster in a game is a triangle, not a pixel. Using triangles allows the games
to associate a texture for the surface of any object, like the s