Module 2
VII Semester, 2022 Scheme
GPU programming, Programming hybrid systems, MIMD systems, GPUs, Performance
2.1 GPU Programming
GPUs are usually not “standalone” processors. They don’t ordinarily run an operating system, and they lack direct access to system services such as secondary storage. So programming a GPU also involves writing code for a “host” system that runs on a conventional CPU. The memory of the CPU host and the memory of the GPU are usually separate, so the code that runs on the host typically allocates and initializes storage on both the CPU and the GPU; it also starts the program on the GPU and is responsible for the output of the GPU program’s results. Thus GPU programming is really heterogeneous programming, since it involves programming two different types of processors. The GPU itself will have
one or more processors. Each of these processors is capable of running hundreds or thousands of
threads. In the systems we’ll be using, the processors share a large block of memory, but each individual
processor has a small block of much faster memory that can only be accessed by threads running on
that processor. These blocks of faster memory can be thought of as a programmer managed cache. The
threads running on a processor are typically divided into groups: the threads within a group use the SIMD
model, and two threads in different groups can run independently. The threads in a SIMD group may not
run in lockstep. That is, they may not all execute the same instruction at the same time. However, no
thread in the group will execute the next instruction until all the threads in the group have completed
executing the current instruction.
If the threads in a group are executing a branch, it may be necessary to idle some of the threads.
For example, suppose there are 32 threads in a SIMD group, and each thread has a private variable
rank_in_gp that ranges from 0 to 31. Suppose also that the threads are executing the following code:
// Thread-private variables
int rank_in_gp, my_x;
...
if (rank_in_gp < 16)
    my_x += 1;
else
    my_x -= 1;
Then the threads with rank_in_gp < 16 will execute the first assignment, while the threads with rank_in_gp ≥ 16 are idle. After the threads with rank_in_gp < 16 are done, the roles will be reversed: the threads with rank_in_gp < 16 will be idle, while the threads with rank_in_gp ≥ 16 execute the second assignment. Another issue in GPU
programming that’s different from CPU programming is how the threads are scheduled for execution. GPUs use a hardware scheduler (unlike CPUs, where threads and processes are scheduled by software), and this hardware scheduler incurs very little overhead. However, the scheduler won’t issue an instruction to a SIMD group until all the threads in the group are ready to execute it. In the preceding example, each thread must have loaded its rank_in_gp into a register before the test can be executed. So, to maximize use of the hardware, we usually create a large number of SIMD groups. When this is the case, groups that aren’t ready to execute (e.g., because they’re waiting for data from memory, or for the completion of a previous instruction) can be idled, and the scheduler can choose a SIMD group that is ready.
2.2 Input and Output in MIMD Systems
When multiple processes/threads can access stdout, stderr, or stdin, as you might guess, the
distribution of the input and the sequence of the output are usually nondeterministic. For output, the data
will probably appear in a different order each time the program is run, or, even worse, the output of one
process/thread may be broken up by the output of another process/thread. For input, the data read by
each process/thread may be different on each run, even if the same input is used.
To partially address these issues, we’ll make the following assumptions and follow these rules when our parallel programs need to do I/O (the MPI sketch after the list illustrates rules 1, 3, and 5):
1. In distributed-memory programs, only process 0 will access stdin. In shared-memory programs, only the master thread or thread 0 will access stdin.
2. In both distributed-memory and shared-memory programs, all the processes/threads can access
stdout and stderr.
3. However, because of the nondeterministic order of output to stdout, in most cases only a single
process/thread will be used for all output to stdout. The principal exception will be output for
debugging a program. In this situation, we’ll often have multiple processes/threads writing to
stdout.
4. Only a single process/thread will attempt to access any single file other than stdin, stdout, or
stderr. So, for example, each process/thread can open its own, private file for reading or writing,
but no two processes/threads will open the same file.
5. Debug output should always include the rank or ID of the process/thread that’s generating the
output.
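To make rules 1, 3, and 5 concrete, here is a minimal MPI sketch (assuming an MPI installation; the variable names are ours, not prescribed by the text): process 0 reads a value from stdin and broadcasts it, every process writes a rank-tagged debug line to stderr, and only process 0 writes the final output to stdout.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char* argv[]) {
    int my_rank, comm_sz;
    double x = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);

    /* Rule 1: only process 0 reads stdin */
    if (my_rank == 0)
        scanf("%lf", &x);
    MPI_Bcast(&x, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Rule 5: debug output is tagged with the rank of the process */
    fprintf(stderr, "Proc %d of %d: x = %f\n", my_rank, comm_sz, x);

    /* Rule 3: only one process writes the "real" output */
    if (my_rank == 0)
        printf("Result: %f\n", x);

    MPI_Finalize();
    return 0;
}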
2.3 Performance: Speedup and Efficiency in MIMD Systems
Usually the best our parallel program can do is to divide the work equally among the cores while at the
same time introducing no additional work for the cores. If we succeed in doing this, and we run our
program with p cores, one thread or process on each core, then our parallel program will run p times
faster than the serial program runs on a single core of the same design. If we call the serial run-time Tserial
and our parallel run-time Tparallel, then it’s usually the case that the best possible run-time of our parallel
program is Tparallel = Tserial/p. When this happens, we say that our parallel program has linear speedup.
In practice, we usually don’t get perfect linear speedup, because the use of multiple processes/threads
almost invariably introduces some overhead. For example, shared-memory programs will almost always
have critical sections, which will require that we use some mutual exclusion mechanism, such as a
mutex. The calls to the mutex functions are the overhead that’s not present in the serial program, and the
use of the mutex forces the parallel program to serialize execution of the critical section. Distributed-
memory programs will almost always need to transmit data across the network, which is usually much
slower than local memory access. Serial programs, on the other hand, won’t have these overheads. Thus,
it will be unusual for us to find that our parallel programs get linear speedup. Furthermore, it’s likely that
the overheads will increase as we increase the number of processes or threads.
So if we define the speedup of a parallel program to be S = Tserial/Tparallel, then linear speedup corresponds to S = p.
Furthermore, since as p increases we expect the parallel overhead to increase, we also expect S to
become a smaller and smaller fraction of the ideal, linear speedup p. Another way of saying this is that
S/p will probably get smaller and smaller as p increases. This value, S/p, is sometimes called the
efficiency of the parallel program. If we substitute the formula for S, we see that the efficiency is
E = S/p = (Tserial/Tparallel)/p = Tserial/(p × Tparallel).
That is, the efficiency can be thought of as the fraction of the parallel run-time that’s spent, on average, by each core working on solving the original problem. The remainder of the parallel run-time is the parallel overhead. This can be seen by simply multiplying the efficiency and the parallel run-time: E × Tparallel = Tserial/p. For example, suppose we have Tserial = 24 ms, p = 8, and Tparallel = 4 ms. Then E = 24/(8 × 4) = 0.75.
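These formulas are simple enough to check directly. The following small C program (not from the text; the variable names are ours) computes S and E for the numbers used in the example above.

#include <stdio.h>

int main(void) {
    double t_serial = 24.0;   /* Tserial in ms */
    double t_parallel = 4.0;  /* Tparallel in ms */
    int p = 8;

    double s = t_serial / t_parallel;   /* speedup    S = Tserial / Tparallel */
    double e = s / p;                   /* efficiency E = S / p               */

    printf("S = %.2f, E = %.2f\n", s, e);   /* prints S = 6.00, E = 0.75 */
    return 0;
}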
We’ve already seen that Tparallel, S, and E depend on p, the number of processes or threads. We
also need to keep in mind that Tparallel, S, E, and Tserial all depend on the problem size. When we increase
the problem size, the speedups and the efficiencies increase, while they decrease when we decrease the
problem size. This behavior is quite common, because in many parallel programs, as the problem size is
increased but the number of processes/threads is fixed, the parallel overhead grows much more slowly
than the time spent in solving the original problem. That is, if we think of Tserial and Toverhead as functions of the problem size, then Tserial grows much faster than Toverhead as the problem size is increased.
2.4 Amdahl’s law
Back in the 1960s, Gene Amdahl made an observation that’s become known as Amdahl’s law. It
says, roughly, that unless virtually all of a serial program is parallelized, the possible speedup is going to
be very limited—regardless of the number of cores available. Suppose, for example, that we’re able to
parallelize 90% of a serial program. Furthermore, suppose that the parallelization is “perfect,” that is,
regardless of the number of cores p we use, the speedup of this part of the program will be p. If the serial
run-time is Tserial = 20 seconds, then the run-time of the parallelized part will be 0.9 × Tserial/p = 18/p and the run-time of the “unparallelized” part will be 0.1 × Tserial = 2. The overall parallel run-time will be
Tparallel = 0.9 × Tserial/p + 0.1 × Tserial = 18/p + 2.
Now as p gets larger and larger, 0.9 × Tserial/p = 18/p gets closer and closer to 0, so the total parallel run-time can’t be smaller than 0.1 × Tserial = 2. That is, the denominator in S can’t be smaller than 0.1 × Tserial = 2. The speedup S must therefore satisfy the inequality
S = Tserial/(0.9 × Tserial/p + 0.1 × Tserial) ≤ Tserial/(0.1 × Tserial) = 20/2 = 10.
That is, S ≤ 10. This is saying that even though we’ve done a perfect job in parallelizing 90% of the
program, and even if we have, say, 1000 cores, we’ll never get a speedup better than 10.
More generally, if a fraction r of our serial program remains unparallelized, then Amdahl’s law
says we can’t get a speedup better than 1/r.
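To see the bound numerically, the short C sketch below (our own illustration, using the 90%-parallelizable example above) tabulates the speedup as p doubles; the values creep toward, but never exceed, 1/r = 10.

#include <stdio.h>

int main(void) {
    double t_serial = 20.0;  /* seconds */
    double f = 0.9;          /* fraction that is (perfectly) parallelized */

    for (int p = 1; p <= 1024; p *= 2) {
        double t_parallel = f * t_serial / p + (1.0 - f) * t_serial;
        printf("p = %4d  S = %5.2f\n", p, t_serial / t_parallel);
    }
    /* As p grows, S approaches 1/(1 - f) = 10 but never exceeds it. */
    return 0;
}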
There are several reasons not to be too worried by Amdahl’s law.
• First, it doesn’t take into consideration the problem size. For many problems, as we increase
the problem size, the “inherently serial” fraction of the program decreases in size; a more
mathematical version of this statement is known as Gustafson’s law.
• Second, there are thousands of programs used by scientists and engineers that routinely obtain huge speedups on large distributed-memory systems.
• Finally, is a small speedup so awful? In many cases, obtaining a speedup of 5 or 10 is more than adequate, especially if the effort involved in developing the parallel program wasn’t very large.
2.5 Scalability in MIMD systems
The word “scalable” has a wide variety of informal uses. Indeed, we’ve used it several times already.
Roughly speaking, a program is scalable if, by increasing the power of the system it’s run on (e.g.,
increasing the number of cores), we can obtain speedups over the program when it’s run on a less
powerful system (e.g., a system with fewer cores). However, in discussions of MIMD parallel program
performance, scalability has a somewhat more formal definition. Suppose we run a parallel program with
a fixed number of processes/threads and a fixed input size, and we obtain an efficiency E. Suppose we
now increase the number of processes/threads that are used by the program. If we can find a
corresponding rate of increase in the problem size so that the program always has efficiency E, then the
program is scalable.
As an example, suppose that Tserial = n, where the units of Tserial are microseconds and n is also the problem size. Also suppose that Tparallel = n/p + 1. Then
E = Tserial/(p × Tparallel) = n/(p × (n/p + 1)) = n/(n + p).
To see if the program is scalable, we increase the number of processes/threads by a factor of k, and we want to find the factor x by which we need to increase the problem size so that E is unchanged. The number of processes/threads will be kp, the problem size will be xn, and we want to solve the following equation for x:
xn/(xn + kp) = n/(n + p).
Well, if x = k, there will be a common factor of k in the denominator: xn + kp = kn + kp = k(n + p), and we can reduce the fraction to get
E = kn/(k(n + p)) = n/(n + p).
In other words, if we increase the problem size at the same rate that we increase the number of
processes/threads, then the efficiency will be unchanged, and our program is scalable.
There are a couple of cases that have special names. If, when we increase the number of
processes/threads, we can keep the efficiency fixed without increasing the problem size, the program is
said to be strongly scalable. If we can keep the efficiency fixed by increasing the problem size at the
same rate as we increase the number of processes/threads, then the program is said to be weakly
scalable. The program in our example would be weakly scalable.
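A quick numerical check of this example can be coded directly (a sketch; the values of n and p are chosen arbitrarily): scaling n and p by the same factor k leaves E = n/(n + p) unchanged, while holding n fixed makes E fall as the thread count grows.

#include <stdio.h>

int main(void) {
    double n = 1000.0;   /* problem size  (Tserial = n microseconds)  */
    double p = 10.0;     /* processes/threads (Tparallel = n/p + 1)   */

    for (int k = 1; k <= 8; k *= 2) {
        double e_weak   = (k * n) / (k * n + k * p);  /* scale n with p */
        double e_strong = n / (n + k * p);            /* keep n fixed   */
        printf("k = %d  E(scaled n) = %.4f  E(fixed n) = %.4f\n",
               k, e_weak, e_strong);
    }
    return 0;
}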
2.6 Taking timings of MIMD programs
During program development, we may take timings to determine if the program is behaving as we
intend. For example, in a distributed-memory program, we might be interested in finding out how much
time the processes are spending waiting for messages, because if this value is large, there is almost
certainly something wrong either with our design or our implementation. On the other hand, once we’ve
completed development of the program, we’re often interested in determining how good its performance
is. Perhaps surprisingly, the way we take these two timings is usually different. For the first kind of timing, we usually need very detailed information: how much time did the program spend in this part of the program? How much time did it spend in that part? For the second kind, we’re usually not interested in the time that elapses between the program’s start and the program’s finish; we’re usually interested only in some part of the program. For example, if we write a program that implements bubble sort, we’re probably only interested in the time it takes to sort the keys, not the time it takes to read them in and print them out. So we probably can’t use something like the Unix shell command time, which reports the time taken to run a program from start to finish.
We’re also usually not interested in “CPU time.” This is the time reported by the standard C function
clock. It’s the total time the program spends in code executed as part of the program. It would include
the time for code we’ve written; it would include the time we spend in library functions, such as pow or
sin; and it would include the time the operating system spends in functions we call, such as printf and
scanf. It would not include time the program was idle, and this could be a problem. For example, in a
distributed-memory program, a process that calls a receive function may have to wait for the sending
process to execute the matching send, and the operating system might put the receiving process to sleep
while it waits. This idle time wouldn’t be counted as CPU time, since no function that’s been called by the
process is active. However, it should count in our evaluation of the overall run-time, since it may be a real
cost in our program. If each time the program is run, the process has to wait, ignoring the time it spends
waiting would give a misleading picture of the actual run-time of the program.
Thus, when you see an article reporting the run-time of a parallel program, the reported time is usually
“wall clock” time. That is, the authors of the article report the time that has elapsed between the start and
finish of execution of the code that the user is interested in. If the user could see the execution of the
program, she would hit the start button on her stopwatch when it begins execution and hit the stop button
when it stops execution. Of course, she can’t see her code executing, but she can modify the source code
so that it looks something like this:
double start, finish;
...
start = Get_current_time();
/* Code that we want to time */
...
finish = Get_current_time();
printf("The elapsed time = %e seconds\n", finish - start);
The function Get_current_time() is a hypothetical function that’s supposed to return the number
of seconds that have elapsed since some fixed time in the past. It’s just a placeholder. The actual
function that is used will depend on the API. For example, MPI has a function MPI_Wtime that could be
used here, and the OpenMP API for shared-memory programming has a function omp_get_wtime. Both
functions return wall clock time instead of CPU time.
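For example, with OpenMP the hypothetical Get_current_time maps directly onto omp_get_wtime (a minimal sketch; the loop is only a stand-in for the code we actually want to time):

#include <stdio.h>
#include <omp.h>

int main(void) {
    double start, finish, sum = 0.0;

    start = omp_get_wtime();               /* wall clock time, in seconds */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 100000000; i++)
        sum += 1.0 / (i + 1.0);            /* stand-in for the timed code */
    finish = omp_get_wtime();

    /* omp_get_wtick() reports the timer's resolution in seconds */
    printf("The elapsed time = %e seconds (sum = %f)\n", finish - start, sum);
    return 0;
}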
The resolution is the unit of measurement on the timer. It’s the duration of the shortest event that
can have a nonzero time. Some timer functions have resolutions in milliseconds (10⁻³ seconds), and when instructions can take times that are less than a nanosecond (10⁻⁹ seconds), a program may have
to execute millions of instructions before the timer reports a nonzero time. Many APIs provide a function
that reports the resolution of the timer. Other APIs specify that a timer must have a given resolution. In
either case we, as the programmers, need to check these values. When we’re timing parallel programs,
we need to be a little more careful about how the timings are taken. In our example, the code that we
want to time is probably being executed by multiple processes or threads, and our original timing will
result in the output of p elapsed times:
private double start, finish;
...
start = Get_current_time();
/* Code that we want to time */
...
finish = Get_current_time();
printf("The elapsed time = %e seconds\n", finish - start);
However, what we’re usually interested in is a single time: the time that has elapsed from when
the first process/thread began execution of the code to the time the last process/thread finished
execution of the code. We often can’t obtain this exactly, since there may not be any correspondence
between the clock on one node and the clock on another node. We usually settle for a compromise that
looks something like this:
/* Private variables */
double my_start, my_finish, my_elapsed, global_elapsed;
...
Barrier();   /* approximately synchronize the processes/threads */
my_start = Get_current_time();
/* Code that we want to time */
...
my_finish = Get_current_time();
my_elapsed = my_finish - my_start;
/* Find the max across all processes/threads */
global_elapsed = Global_max(my_elapsed);
if (my_rank == 0)
    printf("The elapsed time = %e seconds\n", global_elapsed);
Here, we first execute a barrier function that approximately synchronizes all of the
processes/threads. We would like for all the processes/threads to return from the call simultaneously,
but such a function usually can only guarantee that all the processes/ threads have started the call when
the first process/thread returns. We then execute the code as before, and each process/thread finds the
time it took. Then all the processes/threads call a global maximum function, which returns the largest of
the elapsed times, and process/thread 0 prints it out.
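In MPI, this compromise can be written concretely: MPI_Barrier plays the role of the barrier, MPI_Wtime supplies wall clock time, and MPI_Reduce with the MPI_MAX operation computes the global maximum (a sketch with the timed work omitted):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char* argv[]) {
    int my_rank;
    double my_start, my_finish, my_elapsed, global_elapsed;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    MPI_Barrier(MPI_COMM_WORLD);       /* approximately synchronize the processes */
    my_start = MPI_Wtime();
    /* Code that we want to time */
    my_finish = MPI_Wtime();
    my_elapsed = my_finish - my_start;

    /* Find the max across all processes */
    MPI_Reduce(&my_elapsed, &global_elapsed, 1, MPI_DOUBLE,
               MPI_MAX, 0, MPI_COMM_WORLD);
    if (my_rank == 0)
        printf("The elapsed time = %e seconds\n", global_elapsed);

    MPI_Finalize();
    return 0;
}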
We also need to be aware of the variability in timings. When we run a program several
times, it’s extremely likely that the elapsed time will be different for each run. This will be true, even if
each time we run the program we use the same input and the same systems. It might seem that the best
way to deal with this would be to report either a mean or a median run-time. However, outside events (for example, other jobs or operating-system activity) can only slow our program down; it’s very unlikely that anything will make it run faster than its best possible run-time. So instead of reporting the mean or median time, we usually report the minimum time.
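In practice this just means timing the code several times and keeping the smallest elapsed time, as in the following sketch (using omp_get_wtime and a made-up loop as the timed work):

#include <stdio.h>
#include <float.h>
#include <omp.h>

#define NUM_RUNS 5

int main(void) {
    double min_elapsed = DBL_MAX, sum = 0.0;

    for (int run = 0; run < NUM_RUNS; run++) {
        double start = omp_get_wtime();
        for (int i = 0; i < 10000000; i++)   /* stand-in for the timed code */
            sum += 1.0 / (i + 1.0);
        double elapsed = omp_get_wtime() - start;
        if (elapsed < min_elapsed)
            min_elapsed = elapsed;           /* keep the smallest run-time */
    }
    printf("sum = %f, minimum elapsed time over %d runs = %e seconds\n",
           sum, NUM_RUNS, min_elapsed);
    return 0;
}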
Running more than one thread per core can cause dramatic increases in the variability of timings.
More importantly, if we run more than one thread per core, the system will have to take extra time to
schedule and deschedule the threads, and this will add to the overall run-time. Therefore, we rarely run more
than one thread per core.
Finally, as a practical matter, since our programs won’t be designed for high performance I/O,
we’ll usually not include I/O in our reported run-times.
2.7 GPU performance
If we run the inherently serial part of a GPU program on a conventional, serial processor, then
Amdahl’s law can be applied to GPU programs, and the resulting upper bound on the possible speedup
will be the same as the upper bound on the possible speedup for a MIMD program. That is, if a fraction r
of the original serial program isn’t parallelized, and this fraction is run on a conventional serial processor,
then the best possible speedup of the program running on the GPU and the serial processor will be less
than 1/r.
It should be noted that the same caveats that apply to Amdahl’s law on MIMD systems also apply
to Amdahl’s law on GPUs: It’s likely that the “inherently serial” fraction will depend on the problem size,
and if it gets smaller as the problem size increases, the bound on the best possible speedup will increase.
Also, many GPU programs obtain huge speedups, and, finally, a relatively small speedup may be perfectly
adequate.
The same basic ideas about timing that we discussed for MIMD programs also apply to GPU
programs. However, since a GPU program is ordinarily started and finished on a conventional CPU, as
long as we’re interested in the performance of the entirety of the program running on the GPU, we can
usually just use the timer on the CPU, starting it before the GPU part(s) of the program are started, and
stopping it after the GPU part(s) are done. There are more complicated scenarios—e.g., running a
program on multiple CPU-GPU pairs—that require more care, but we won’t be dealing with these types of
programs. If we only want to time a subset of the code running on the GPU, we’ll need to use a timer
defined by the API for the GPU.
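For the common case of timing the whole GPU portion from the host, a plain C sketch suffices; here run_gpu_computation is a hypothetical placeholder (not part of any particular GPU API) for the host-side call(s) that launch the GPU work and wait for it to finish.

#include <stdio.h>
#include <omp.h>

/* Hypothetical placeholder: in a real program this would launch the GPU
   kernel(s) and block until the GPU work is complete.                   */
void run_gpu_computation(void) {
    /* ... GPU API calls would go here ... */
}

int main(void) {
    double start = omp_get_wtime();   /* start the CPU wall clock timer    */
    run_gpu_computation();            /* the GPU part(s) of the program    */
    double finish = omp_get_wtime();  /* stop the timer after GPU is done  */

    printf("GPU section elapsed time = %e seconds\n", finish - start);
    return 0;
}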
2.8 Parallel program design
So we’ve got a serial program. How do we parallelize it? We know that in general we need to divide
the work among the processes/threads so that each process/thread gets roughly the same amount of
work and any parallel overhead is minimized. Ian Foster provides an outline of steps in his online book
Designing and Building Parallel Programs:
• Partitioning. Divide the computation to be performed and the data operated on by the
computation into small tasks. The focus here should be on identifying tasks that can be
executed in parallel.
• Communication. Determine what communication needs to be carried out among the tasks
identified in the previous step.
• Agglomeration or aggregation. Combine tasks and communications identified in the first step
into larger tasks. For example, if task A must be executed before task B can be executed, it
may make sense to aggregate them into a single composite task.
• Mapping. Assign the composite tasks identified in the previous step to processes/ threads.
This should be done so that communication is minimized, and each process/thread gets
roughly the same amount of work.
This is sometimes called Foster’s methodology.
2.9 How do we write parallel programs?
Most programs that have been written for conventional, single-core systems cannot exploit the
presence of multiple cores. We can run multiple instances of a program on a multicore system, but this
is often of little help. For example, being able to run multiple instances of our favorite game isn’t really
what we want—we want the program to run faster with more realistic graphics. To do this, we need to
either rewrite our serial programs so that they’re parallel, so that they can make use of multiple cores, or
write translation programs, that is, programs that will automatically convert serial programs into parallel
programs. The bad news is that researchers have had very limited success writing programs that convert
serial programs in languages such as C, C++, and Java into parallel programs. This isn’t terribly
surprising. While we can write programs that recognize common constructs in serial programs, and
automatically translate these constructs into efficient parallel constructs, the sequence of parallel
constructs may be terribly inefficient.
For example, we can view the multiplication of two n × n matrices as a sequence of dot products,
but parallelizing a matrix multiplication as a sequence of parallel dot products is likely to be fairly slow
on many systems. An efficient parallel implementation of a serial program may not be obtained by finding
efficient parallelizations of each of its steps. Rather, the best parallelization may be obtained by devising
an entirely new algorithm. As an example, suppose that we need to compute n values and add them
together. We know that this can be done with the following serial code:
sum = 0;
for (i = 0; i < n; i++) {
    x = Compute_next_value(...);
    sum += x;
}
Now suppose we also have p cores and p ≤ n. Then each core can form a partial sum of approximately
n/p values:
my_sum = 0;
my_first_i = ...;
my_last_i = ...;
for (my_i = my_first_i; my_i < my_last_i; my_i++) {
    my_x = Compute_next_value(...);
    my_sum += my_x;
}
Here the prefix my_ indicates that each core is using its own, private variables, and each core can
execute this block of code independently of the other cores. After each core completes execution of this
code, its variable my_sum will store the sum of the values computed by its calls to Compute_next_value.
For example, if there are eight cores, n = 24, and the 24 calls to Compute_next_value return the values 1, 4, 3, 9, 2, 8, 5, 1, 1, 6, 2, 7, 2, 5, 0, 4, 1, 8, 6, 5, 1, 2, 3, 9, then the values stored in my_sum might be:

Core     0    1    2    3    4    5    6    7
my_sum   8   19    7   15    7   13   12   14
Here we’re assuming the cores are identified by nonnegative integers in the range 0, 1, . . ., p −1, where p
is the number of cores. When the cores are done computing their values of my_sum, they can form a
global sum by sending their results to a designated “master” core, which can add their results:
if (I'm the master core) {
    sum = my_sum;
    for each core other than myself {
        receive value from core;
        sum += value;
    }
} else {
    send my_sum to the master;
}
In our example, if the master core is core 0, it would add the values 8 + 19 + 7 + 15 + 7 + 13 + 12 + 14 = 95.
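The pseudocode above translates almost line for line into MPI (a sketch; here each process’s my_sum is just a stand-in value rather than the result of the earlier computation):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char* argv[]) {
    int my_rank, comm_sz;
    double my_sum, sum, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);

    my_sum = my_rank + 1.0;   /* stand-in for the partial sum computed earlier */

    if (my_rank == 0) {       /* I'm the master core */
        sum = my_sum;
        for (int q = 1; q < comm_sz; q++) {
            MPI_Recv(&value, 1, MPI_DOUBLE, q, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            sum += value;     /* add in the value received from core q */
        }
        printf("Global sum = %f\n", sum);
    } else {
        MPI_Send(&my_sum, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}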
But you can probably see a better way to do this—especially if the number of cores is large.
Instead of making the master core do all the work of computing the final sum, we can pair the cores so
that while core 0 adds in the result of core 1, core 2 can add in the result of core 3, core 4 can add in the
result of core 5, and so on. Then we can repeat the process with only the even-ranked cores: 0 adds in
the result of 2, 4 adds in the result of 6, and so on. Now cores divisible by 4 repeat the process, and so
on. See Fig. 1.1. The circles contain the current value of each core’s sum, and the lines with arrows
indicate that one core is sending its sum to another core. The plus signs indicate that a core is receiving
a sum from another core and adding the received sum into its own sum.
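The tree-structured sum in Fig. 1.1 can be coded along the following lines (an MPI sketch; it assumes the number of processes is a power of two, as in the figure):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char* argv[]) {
    int my_rank, comm_sz;
    double my_sum, received;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);

    my_sum = my_rank + 1.0;   /* stand-in for each core's partial sum */

    /* Tree-structured sum: at each stage, cores whose rank is a multiple
       of 2*stride receive from rank + stride and add; the senders drop out. */
    for (int stride = 1; stride < comm_sz; stride *= 2) {
        if (my_rank % (2 * stride) == 0) {
            MPI_Recv(&received, 1, MPI_DOUBLE, my_rank + stride, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            my_sum += received;
        } else {
            MPI_Send(&my_sum, 1, MPI_DOUBLE, my_rank - stride, 0,
                     MPI_COMM_WORLD);
            break;   /* this core has passed its sum along; it is done */
        }
    }

    if (my_rank == 0)
        printf("Global sum = %f\n", my_sum);   /* rank 0 holds the total */

    MPI_Finalize();
    return 0;
}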
As an analogy, consider a professor and her teaching assistants (TAs) grading a set of exams, each with five questions. In the first approach, each grader grades a single question on every exam. In the second approach, the exams are divided among the graders, and each grader grades all five questions on her share of the exams.
In both approaches the “cores” are the professor and her TAs. The first approach might be
considered an example of task-parallelism. There are five tasks to be carried out: grading the first
question, grading the second question, and so on. Presumably, the graders will be looking for different
information in question 1, which is about Shakespeare, from the information in question 2, which is about
Milton, and so on. So, the professor and her TAs will be “executing different instructions.” On the other
hand, the second approach might be considered an example of data parallelism. The “data” are the
students’ papers, which are divided among the cores, and each core applies more or less the same
grading instructions to each paper.
The first part of our global sum example might be considered an example of data-parallelism: the data are the values computed by Compute_next_value, and each core carries out roughly the
same operations on its assigned elements: it computes the required values by calling
Compute_next_value and adds them together. The second part of the first global sum example might be
considered an example of task-parallelism. There are two tasks: receiving and adding the cores’ partial
sums, which is carried out by the master core; and giving the partial sum to the master core, which is
carried out by the other cores. When the cores can work independently, writing a parallel program is much
the same as writing a serial program.
Things get a great deal more complex when the cores need to coordinate their work. In the second
global sum example, although the tree structure in the diagram is very easy to understand, writing the
actual code is relatively complex.
Unfortunately, it’s much more common for the cores to need coordination. In both global sum
examples, the coordination involves communication: one or more cores send their current partial sums
to another core. A second type of coordination in the global sum examples is load balancing. In
the first part of the global sum, it’s clear that we want the amount of time taken by each core to be roughly
the same as the time taken by the other cores. If the cores are identical, and each call to
Compute_next_value requires the same amount of work, then we want each core to be assigned roughly
the same number of values as the other cores. If, for example, one core has to compute most of the
values, then the other cores will finish much sooner than the heavily loaded core, and their computational
power will be wasted.
A third type of coordination is synchronization. As an example, suppose that instead of
computing the values to be added, the values are read from stdin. Say, x is an array that is read in by the
master core:
if (I'm the master core)
    for (my_i = 0; my_i < n; my_i++)
        scanf("%lf", &x[my_i]);
In most systems the cores are not automatically synchronized. Rather, each core works at its own pace. In this case, the problem is that we don’t want the other cores to race ahead and start computing their partial sums before the master is done initializing x and making it available to the other cores. That is, the cores need to synchronize, for example by calling a barrier function, before they start executing the code that computes their partial sums from x. A shared-memory sketch of this is given below.
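Here is what that might look like in OpenMP (a sketch, with n fixed at 24 as in the earlier example): the master thread reads in x, the barrier keeps the other threads from racing ahead, and then each thread adds up its share of the values.

#include <stdio.h>
#include <omp.h>

#define N 24

int main(void) {
    double x[N], sum = 0.0;

    #pragma omp parallel reduction(+:sum)
    {
        /* Only the master thread reads in x ... */
        #pragma omp master
        for (int i = 0; i < N; i++)
            scanf("%lf", &x[i]);

        /* ... and the other threads wait here until it has finished */
        #pragma omp barrier

        /* Now every thread can safely add up its share of x */
        #pragma omp for
        for (int i = 0; i < N; i++)
            sum += x[i];
    }

    printf("The sum is %f\n", sum);
    return 0;
}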