Module 2
VII Semester, 2022 Scheme
GPU programming, Programming hybrid systems, MIMD systems, GPUs, Performance
2.1 GPU Programming
GPUs are usually not “standalone” processors. They don’t ordinarily run an operating system, and they lack direct access to system services such as secondary storage. So programming a GPU also involves writing code for a “host” system that runs on a conventional CPU. The memory of the CPU host and the memory of the GPU are usually separate, so the code that runs on the host typically allocates and initializes storage on both the CPU and the GPU; it also starts the program on the GPU and is responsible for the output of the GPU program’s results. Thus GPU programming is really heterogeneous programming, since it involves programming two different types of processors. The GPU itself will have
one or more processors. Each of these processors is capable of running hundreds or thousands of
threads. In the systems we’ll be using, the processors share a large block of memory, but each individual
processor has a small block of much faster memory that can only be accessed by threads running on
that processor. These blocks of faster memory can be thought of as a programmer managed cache. The
threads running on a processor are typically divided into groups: the threads within a group use the SIMD
model, and two threads in different groups can run independently. The threads in a SIMD group may not
run in lockstep. That is, they may not all execute the same instruction at the same time. However, no
thread in the group will execute the next instruction until all the threads in the group have completed
executing the current instruction.
If the threads in a group are executing a branch, it may be necessary to idle some of the threads.
For example, suppose there are 32 threads in a SIMD group, and each thread has a private variable
rank_in_gp that ranges from 0 to 31. Suppose also that the threads are executing the following code:
// Thread-private variables
int rank_in_gp, my_x;
...
if (rank_in_gp < 16)
    my_x += 1;
else
    my_x -= 1;
Then the threads with rank_in_gp < 16 will execute the first assignment, while the threads with rank_in_gp ≥ 16 are idle. After the threads with rank_in_gp < 16 are done, the roles will be reversed: the threads with rank_in_gp < 16 will be idle, while the threads with rank_in_gp ≥ 16 execute the second assignment. Another issue in GPU
programming that’s different from CPU programming is how the threads are scheduled for execution. GPUs use a hardware scheduler (unlike CPUs, where threads and processes are scheduled by software), and this hardware scheduler incurs very little overhead. However, the scheduler won’t issue an instruction to a SIMD group until all the threads in the group are ready to execute it. In the preceding example, each thread must have loaded its rank_in_gp into a register before the test can be executed. So, to maximize use of the hardware, we usually create a large number of SIMD groups. When this is the case, groups that aren’t ready to execute (e.g., because they’re waiting for data from memory, or for the completion of a previous instruction) can be idled, and the scheduler can choose a SIMD group that is ready.
2.2 Input and Output in MIMD Systems
When multiple processes/threads can access stdout, stderr, or stdin, as you might guess, the
distribution of the input and the sequence of the output are usually nondeterministic. For output, the data
will probably appear in a different order each time the program is run, or, even worse, the output of one
process/thread may be broken up by the output of another process/thread. For input, the data read by
each process/thread may be different on each run, even if the same input is used.
To partially address these issues, we’ll make the following assumptions and follow these rules when our parallel programs need to do I/O (the MPI sketch after the list illustrates rules 1, 3, and 5):
1. In distributed-memory programs, only process 0 will access stdin. In shared-memory programs, only the master thread or thread 0 will access stdin.
2. In both distributed-memory and shared-memory programs, all the processes/threads can access
stdout and stderr.
3. However, because of the nondeterministic order of output to stdout, in most cases only a single
process/thread will be used for all output to stdout. The principal exception will be output for
debugging a program. In this situation, we’ll often have multiple processes/threads writing to
stdout.
4. Only a single process/thread will attempt to access any single file other than stdin, stdout, or
stderr. So, for example, each process/thread can open its own, private file for reading or writing,
but no two processes/threads will open the same file.
5. Debug output should always include the rank or ID of the process/thread that’s generating the
output.
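To make rules 1, 3, and 5 concrete, here is a minimal MPI sketch (assuming an MPI installation; the variable names are ours, not prescribed by the text): process 0 reads a value from stdin and broadcasts it, every process writes a rank-tagged debug line to stderr, and only process 0 writes the final output to stdout.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char* argv[]) {
    int my_rank, comm_sz;
    double x = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);

    /* Rule 1: only process 0 reads stdin */
    if (my_rank == 0)
        scanf("%lf", &x);
    MPI_Bcast(&x, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Rule 5: debug output is tagged with the rank of the process */
    fprintf(stderr, "Proc %d of %d: x = %f\n", my_rank, comm_sz, x);

    /* Rule 3: only one process writes the "real" output */
    if (my_rank == 0)
        printf("Result: %f\n", x);

    MPI_Finalize();
    return 0;
}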
2.3 Performance: Speedup and Efficiency in MIMD Systems
Usually the best our parallel program can do is to divide the work equally among the cores while at the
same time introducing no additional work for the cores. If we succeed in doing this, and we run our
program with p cores, one thread or process on each core, then our parallel program will run p times
faster than the serial program runs on a single core of the same design. If we call the serial run-time Tserial
and our parallel run-time Tparallel, then it’s usually the case that the best possible run-time of our parallel
program is Tparallel = Tserial/p. When this happens, we say that our parallel program has linear speedup.
In practice, we usually don’t get perfect linear speedup, because the use of multiple processes/threads
almost invariably introduces some overhead. For example, shared-memory programs will almost always
have critical sections, which will require that we use some mutual exclusion mechanism, such as a
mutex. The calls to the mutex functions are the overhead that’s not present in the serial program, and the
use of the mutex forces the parallel program to serialize execution of the critical section. Distributed-
memory programs will almost always need to transmit data across the network, which is usually much
slower than local memory access. Serial programs, on the other hand, won’t have these overheads. Thus,
it will be unusual for us to find that our parallel programs get linear speedup. Furthermore, it’s likely that
the overheads will increase as we increase the number of processes or threads.
So if we define the speedup of a parallel program to be S = Tserial/Tparallel, then linear speedup corresponds to S = p.
Furthermore, since as p increases we expect the parallel overhead to increase, we also expect S to
become a smaller and smaller fraction of the ideal, linear speedup p. Another way of saying this is that
S/p will probably get smaller and smaller as p increases. This value, S/p, is sometimes called the
efficiency of the parallel program. If we substitute the formula for S, we see that the efficiency is
E = S/p = (Tserial/Tparallel)/p = Tserial/(p × Tparallel).
That is, the efficiency can be thought of as the fraction of the parallel run-time that’s spent, on average, by each core working on solving the original problem. The remainder of the parallel run-time is the parallel overhead. This can be seen by simply multiplying the efficiency and the parallel run-time: E × Tparallel = Tserial/p. For example, suppose we have Tserial = 24 ms, p = 8, and Tparallel = 4 ms. Then E = 24/(8 × 4) = 0.75.
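These formulas are simple enough to check directly. The following small C program (not from the text; the variable names are ours) computes S and E for the numbers used in the example above.

#include <stdio.h>

int main(void) {
    double t_serial = 24.0;   /* Tserial in ms */
    double t_parallel = 4.0;  /* Tparallel in ms */
    int p = 8;

    double s = t_serial / t_parallel;   /* speedup    S = Tserial / Tparallel */
    double e = s / p;                   /* efficiency E = S / p               */

    printf("S = %.2f, E = %.2f\n", s, e);   /* prints S = 6.00, E = 0.75 */
    return 0;
}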
We’ve already seen that Tparallel, S, and E depend on p, the number of processes or threads. We
also need to keep in mind that Tparallel, S, E, and Tserial all depend on the problem size. When we increase
the problem size, the speedups and the efficiencies increase, while they decrease when we decrease the
problem size. This behavior is quite common, because in many parallel programs, as the problem size is
increased but the number of processes/threads is fixed, the parallel overhead grows much more slowly
than the time spent in solving the original problem. That is, if we think of Tserial and Toverhead as functions of the problem size, then Tserial grows much faster than Toverhead as the problem size is increased.
2.4 Amdahl’s law
Back in the 1960s, Gene Amdahl made an observation that’s become known as Amdahl’s law. It
says, roughly, that unless virtually all of a serial program is parallelized, the possible speedup is going to
be very limited—regardless of the number of cores available. Suppose, for example, that we’re able to
parallelize 90% of a serial program. Furthermore, suppose that the parallelization is “perfect,” that is,
regardless of the number of cores p we use, the speedup of this part of the program will be p. If the serial
run-time is Tserial = 20 seconds, then the run-time of the parallelized part will be 0.9 × Tserial/p = 18/p and the run-time of the “unparallelized” part will be 0.1 × Tserial = 2. The overall parallel run-time will be
Tparallel = 0.9 × Tserial/p + 0.1 × Tserial = 18/p + 2.
Now as p gets larger and larger, 0.9 × Tserial/p = 18/p gets closer and closer to 0, so the total parallel run-time can’t be smaller than 0.1 × Tserial = 2. That is, the denominator in S can’t be smaller than 0.1 × Tserial = 2. The speedup S must therefore satisfy the inequality
S = Tserial/(0.9 × Tserial/p + 0.1 × Tserial) ≤ Tserial/(0.1 × Tserial) = 20/2 = 10.
That is, S ≤ 10. This is saying that even though we’ve done a perfect job in parallelizing 90% of the
program, and even if we have, say, 1000 cores, we’ll never get a speedup better than 10.
More generally, if a fraction r of our serial program remains unparallelized, then Amdahl’s law
says we can’t get a speedup better than 1/r.
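To see the bound numerically, the short C sketch below (our own illustration, using the 90%-parallelizable example above) tabulates the speedup as p doubles; the values creep toward, but never exceed, 1/r = 10.

#include <stdio.h>

int main(void) {
    double t_serial = 20.0;  /* seconds */
    double f = 0.9;          /* fraction that is (perfectly) parallelized */

    for (int p = 1; p <= 1024; p *= 2) {
        double t_parallel = f * t_serial / p + (1.0 - f) * t_serial;
        printf("p = %4d  S = %5.2f\n", p, t_serial / t_parallel);
    }
    /* As p grows, S approaches 1/(1 - f) = 10 but never exceeds it. */
    return 0;
}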
There are several reasons not to be too worried by Amdahl’s law.
• First, it doesn’t take into consideration the problem size. For many problems, as we increase
the problem size, the “inherently serial” fraction of the program decreases in size; a more
mathematical version of this statement is known as Gustafson’s law.
• Second, there are thousands of programs used by scientists and engineers that routinely obtain huge speedups on large distributed-memory systems.
• Finally, is a small speedup so awful? In many cases, obtaining a speedup of 5 or 10 is more than adequate, especially if the effort involved in developing the parallel program wasn’t very large.
2.5 Scalability in MIMD systems
The word “scalable” has a wide variety of informal uses. Indeed, we’ve used it several times already.
Roughly speaking, a program is scalable if, by increasing the power of the system it’s run on (e.g.,
increasing the number of cores), we can obtain speedups over the program when it’s run on a less
powerful system (e.g., a system with fewer cores). However, in discussions of MIMD parallel program
performance, scalability has a somewhat more formal definition. Suppose we run a parallel program with
a fixed number of processes/threads and a fixed input size, and we obtain an efficiency E. Suppose we
now increase the number of processes/threads that are used by the program. If we can find a
corresponding rate of increase in the problem size so that the program always has efficiency E, then the
program is scalable.
As an example, suppose that Tserial = n, where the units of Tserial are microseconds and n is also the problem size. Also suppose that Tparallel = n/p + 1. Then
E = Tserial/(p × Tparallel) = n/(p × (n/p + 1)) = n/(n + p).
To see if the program is scalable, we increase the number of processes/threads by a factor of k, and we want to find the factor x by which we need to increase the problem size so that E is unchanged. The number of processes/threads will be kp, the problem size will be xn, and we want to solve the following equation for x:
xn/(xn + kp) = n/(n + p).
Well, if x = k, there will be a common factor of k in the denominator: xn + kp = kn + kp = k(n + p), and we can reduce the fraction to get
E = kn/(k(n + p)) = n/(n + p).
In other words, if we increase the problem size at the same rate that we increase the number of
processes/threads, then the efficiency will be unchanged, and our program is scalable.
There are a couple of cases that have special names. If, when we increase the number of
processes/threads, we can keep the efficiency fixed without increasing the problem size, the program is
said to be strongly scalable. If we can keep the efficiency fixed by increasing the problem size at the
same rate as we increase the number of processes/threads, then the program is said to be weakly
scalable. The program in our example would be weakly scalable.
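A quick numerical check of this example can be coded directly (a sketch; the values of n and p are chosen arbitrarily): scaling n and p by the same factor k leaves E = n/(n + p) unchanged, while holding n fixed makes E fall as the thread count grows.

#include <stdio.h>

int main(void) {
    double n = 1000.0;   /* problem size  (Tserial = n microseconds)  */
    double p = 10.0;     /* processes/threads (Tparallel = n/p + 1)   */

    for (int k = 1; k <= 8; k *= 2) {
        double e_weak   = (k * n) / (k * n + k * p);  /* scale n with p */
        double e_strong = n / (n + k * p);            /* keep n fixed   */
        printf("k = %d  E(scaled n) = %.4f  E(fixed n) = %.4f\n",
               k, e_weak, e_strong);
    }
    return 0;
}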
2.6 Taking timings of MIMD programs
During program development, we may take timings to determine if the program is behaving as we
intend. For example, in a distributed-memory program, we might be interested in finding out how much
time the processes are spending waiting for messages, because if this value is large, there is almost
certainly something wrong either with our design or our implementation. On the other hand, once we’ve
completed development of the program, we’re often interested in determining how good its performance
is. Perhaps surprisingly, the way we take these two timings is usually different. For the first kind of timing, we usually need very detailed information: how much time did the program spend in this part of the program? How much time did it spend in that part? For the second kind, we’re usually not interested in the time that elapses between the program’s start and the program’s finish; we’re usually interested only in some part of the program. For example, if we write a program that implements bubble sort, we’re probably only interested in the time it takes to sort the keys, not the time it takes to read them in and print them out. So we probably can’t use something like the Unix shell command time, which reports the time taken to run a program from start to finish.
We’re also usually not interested in “CPU time.” This is the time reported by the standard C function
clock. It’s the total time the program spends in code executed as part of the program. It would include
the time for code we’ve written; it would include the time we spend in library functions, such as pow or
sin; and it would include the time the operating system spends in functions we call, such as printf and
scanf. It would not include time the program was idle, and this could be a problem. For example, in a
distributed-memory program, a process that calls a receive function may have to wait for the sending
process to execute the matching send, and the operating system might put the receiving process to sleep
while it waits. This idle time wouldn’t be counted as CPU time, since no function that’s been called by the
process is active. However, it should count in our evaluation of the overall run-time, since it may be a real
cost in our program. If each time the program is run, the process has to wait, ignoring the time it spends
waiting would give a misleading picture of the actual run-time of the program.
Thus, when you see an article reporting the run-time of a parallel program, the reported time is usually
“wall clock” time. That is, the authors of the article report the time that has elapsed between the start and
finish of execution of the code that the user is interested in. If the user could see the execution of the
program, she would hit the start button on her stopwatch when it begins execution and hit the stop button
when it stops execution. Of course, she can’t see her code executing, but she can modify the source code
so that it looks something like this:
double start, finish;
...
start = Get_current_time();
/* Code that we want to time */
...
finish = Get_current_time();
printf("The elapsed time = %e seconds\n", finish - start);
The function Get_current_time() is a hypothetical function that’s supposed to return the number
of seconds that have elapsed since some fixed time in the past. It’s just a placeholder. The actual
function that is used will depend on the API. For example, MPI has a function MPI_Wtime that could be
used here, and the OpenMP API for shared-memory programming has a function omp_get_wtime. Both
functions return wall clock time instead of CPU time.
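For example, with OpenMP the hypothetical Get_current_time maps directly onto omp_get_wtime (a minimal sketch; the loop is only a stand-in for the code we actually want to time):

#include <stdio.h>
#include <omp.h>

int main(void) {
    double start, finish, sum = 0.0;

    start = omp_get_wtime();               /* wall clock time, in seconds */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 100000000; i++)
        sum += 1.0 / (i + 1.0);            /* stand-in for the timed code */
    finish = omp_get_wtime();

    /* omp_get_wtick() reports the timer's resolution in seconds */
    printf("The elapsed time = %e seconds (sum = %f)\n", finish - start, sum);
    return 0;
}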
The resolution is the unit of measurement on the timer. It’s the duration of the shortest event that
can have a nonzero time. Some timer functions have resolutions in milliseconds (10⁻³ seconds), and when instructions can take times that are less than a nanosecond (10⁻⁹ seconds), a program may have
to execute millions of instructions before the timer reports a nonzero time. Many APIs provide a function
that reports the resolution of the timer. Other APIs specify that a timer must have a given resolution. In
either case we, as the programmers, need to check these values. When we’re timing parallel programs,
we need to be a little more careful about how the timings are taken. In our example, the code that we
want to time is probably being executed by multiple processes or threads, and our original timing will
result in the output of p elapsed times:
private double start, finish;
...
start = Get_current_time();
/* Code that we want to time */
...
finish = Get_current_time();
printf("The elapsed time = %e seconds\n", finish - start);
However, what we’re usually interested in is a single time: the time that has elapsed from when
the first process/thread began execution of the code to the time the last process/thread finished
execution of the code. We often can’t obtain this exactly, since there may not be any correspondence
between the clock on one node and the clock on another node. We usually settle for a compromise that
looks something like this:
/* Private variables */
double my_start, my_finish, my_elapsed, global_elapsed;
...
Barrier();   /* approximately synchronize the processes/threads */
my_start = Get_current_time();
/* Code that we want to time */
...
my_finish = Get_current_time();
my_elapsed = my_finish - my_start;
/* Find the max across all processes/threads */
global_elapsed = Global_max(my_elapsed);
if (my_rank == 0)
    printf("The elapsed time = %e seconds\n", global_elapsed);
Here, we first execute a barrier function that approximately synchronizes all of the
processes/threads. We would like for all the processes/threads to return from the call simultaneously,
but such a function usually can only guarantee that all the processes/ threads have started the call when
the first process/thread returns. We then execute the code as before, and each process/thread finds the
time it took. Then all the processes/threads call a global maximum function, which returns the largest of
the elapsed times, and process/thread 0 prints it out.
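In MPI, this compromise can be written concretely: MPI_Barrier plays the role of the barrier, MPI_Wtime supplies wall clock time, and MPI_Reduce with the MPI_MAX operation computes the global maximum (a sketch with the timed work omitted):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char* argv[]) {
    int my_rank;
    double my_start, my_finish, my_elapsed, global_elapsed;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    MPI_Barrier(MPI_COMM_WORLD);       /* approximately synchronize the processes */
    my_start = MPI_Wtime();
    /* Code that we want to time */
    my_finish = MPI_Wtime();
    my_elapsed = my_finish - my_start;

    /* Find the max across all processes */
    MPI_Reduce(&my_elapsed, &global_elapsed, 1, MPI_DOUBLE,
               MPI_MAX, 0, MPI_COMM_WORLD);
    if (my_rank == 0)
        printf("The elapsed time = %e seconds\n", global_elapsed);

    MPI_Finalize();
    return 0;
}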
We also need to be aware of the variability in timings. When we run a program several
times, it’s extremely likely that the elapsed time will be different for each run. This will be true, even if
each time we run the program we use the same input and the same systems. It might seem that the best
way to deal with this would be to report either a mean or a median run-time. However, outside events (for example, other jobs or operating-system activity) can only slow our program down; it’s very unlikely that anything will make it run faster than its best possible run-time. So instead of reporting the mean or median time, we usually report the minimum time.
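In practice this just means timing the code several times and keeping the smallest elapsed time, as in the following sketch (using omp_get_wtime and a made-up loop as the timed work):

#include <stdio.h>
#include <float.h>
#include <omp.h>

#define NUM_RUNS 5

int main(void) {
    double min_elapsed = DBL_MAX, sum = 0.0;

    for (int run = 0; run < NUM_RUNS; run++) {
        double start = omp_get_wtime();
        for (int i = 0; i < 10000000; i++)   /* stand-in for the timed code */
            sum += 1.0 / (i + 1.0);
        double elapsed = omp_get_wtime() - start;
        if (elapsed < min_elapsed)
            min_elapsed = elapsed;           /* keep the smallest run-time */
    }
    printf("sum = %f, minimum elapsed time over %d runs = %e seconds\n",
           sum, NUM_RUNS, min_elapsed);
    return 0;
}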
Running more than one thread per core can cause dramatic increases in the variability of timings.
More importantly, if we run more than one thread per core, the system will have to take extra time to
schedule and deschedule the threads, and this will add to the overall run-time. Therefore, we rarely run more
than one thread per core.
Finally, as a practical matter, since our programs won’t be designed for high performance I/O,
we’ll usually not include I/O in our reported run-times.
2.7 GPU performance
If we run the inherently serial part of a GPU program on a conventional, serial processor, then
Amdahl’s law can be applied to GPU programs, and the resulting upper bound on the possible speedup
will be the same as the upper bound on the possible speedup for a MIMD program. That is, if a fraction r
of the original serial program isn’t parallelized, and this fraction is run on a conventional serial processor,
then the best possible speedup of the program running on the GPU and the serial processor will be less
than 1/r.
It should be noted that the same caveats that apply to Amdahl’s law on MIMD systems also apply
to Amdahl’s law on GPUs: It’s likely that the “inherently serial” fraction will depend on the problem size,
and if it gets smaller as the problem size increases, the bound on the best possible speedup will increase.
Also, many GPU programs obtain huge speedups, and, finally, a relatively small speedup may be perfectly
adequate.
The same basic ideas about timing that we discussed for MIMD programs also apply to GPU
programs. However, since a GPU program is ordinarily started and finished on a conventional CPU, as
long as we’re interested in the performance of the entirety of the program running on the GPU, we can
usually just use the timer on the CPU, starting it before the GPU part(s) of the program are started, and
stopping it after the GPU part(s) are done. There are more complicated scenarios—e.g., running a
program on multiple CPU-GPU pairs—that require more care, but we won’t be dealing with these types of
programs. If we only want to time a subset of the code running on the GPU, we’ll need to use a timer
defined by the API for the GPU.
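For the common case of timing the whole GPU portion from the host, a plain C sketch suffices; here run_gpu_computation is a hypothetical placeholder (not part of any particular GPU API) for the host-side call(s) that launch the GPU work and wait for it to finish.

#include <stdio.h>
#include <omp.h>

/* Hypothetical placeholder: in a real program this would launch the GPU
   kernel(s) and block until the GPU work is complete.                   */
void run_gpu_computation(void) {
    /* ... GPU API calls would go here ... */
}

int main(void) {
    double start = omp_get_wtime();   /* start the CPU wall clock timer    */
    run_gpu_computation();            /* the GPU part(s) of the program    */
    double finish = omp_get_wtime();  /* stop the timer after GPU is done  */

    printf("GPU section elapsed time = %e seconds\n", finish - start);
    return 0;
}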
2.8 Parallel program design
So we’ve got a serial program. How do we parallelize it? We know that in general we need to divide
the work among the processes/threads so that each process/thread gets roughly the same amount of
work and any parallel overhead is minimized. Ian Foster provides an outline of steps in his online book
Designing and Building Parallel Programs:
• Partitioning. Divide the computation to be performed and the data operated on by the
computation into small tasks. The focus here should be on identifying tasks that can be
executed in parallel.
• Communication. Determine what communication needs to be carried out among the tasks
identified in the previous step.
• Agglomeration or aggregation. Combine tasks and communications identified in the first step
into larger tasks. For example, if task A must be executed before task B can be executed, it
may make sense to aggregate them into a single composite task.
• Mapping. Assign the composite tasks identified in the previous step to processes/ threads.
This should be done so that communication is minimized, and each process/thread gets
roughly the same amount of work.
This is sometimes called Foster’s methodology.
2.9 How do we write parallel programs?
Most programs that have been written for conventional, single-core systems cannot exploit the
presence of multiple cores. We can run multiple instances of a program on a multicore system, but this
is often of little help. For example, being able to run multiple instances of our favorite game isn’t really
what we want—we want the program to run faster with more realistic graphics. To do this, we need to
either rewrite our serial programs so that they’re parallel, so that they can make use of multiple cores, or
write translation programs, that is, programs that will automatically convert serial programs into parallel
programs. The bad news is that researchers have had very limited success writing programs that convert
serial programs in languages such as C, C++, and Java into parallel programs. This isn’t terribly
surprising. While we can write programs that recognize common constructs in serial programs, and
automatically translate these constructs into efficient parallel constructs, the sequence of parallel
constructs may be terribly inefficient.
For example, we can view the multiplication of two n × n matrices as a sequence of dot products,
but parallelizing a matrix multiplication as a sequence of parallel dot products is likely to be fairly slow
on many systems. An efficient parallel implementation of a serial program may not be obtained by finding
efficient parallelizations of each of its steps. Rather, the best parallelization may be obtained by devising
an entirely new algorithm. As an example, suppose that we need to compute n values and add them
together. We know that this can be done with the following serial code:
sum = 0;
for (i = 0; i < n; i++) {
    x = Compute_next_value(...);
    sum += x;
}
Now suppose we also have p cores and p ≤ n. Then each core can form a partial sum of approximately
n/p values:
my_sum = 0;
my_first_i = ...;
my_last_i = ...;
for (my_i = my_first_i; my_i < my_last_i; my_i++) {
    my_x = Compute_next_value(...);
    my_sum += my_x;
}
Here the prefix my_ indicates that each core is using its own, private variables, and each core can
execute this block of code independently of the other cores. After each core completes execution of this
code, its variable my_sum will store the sum of the values computed by its calls to Compute_next_value.
For example, if there are eight cores, n = 24, and the 24 calls to Compute_next_value return the values 1, 4, 3, 9, 2, 8, 5, 1, 1, 6, 2, 7, 2, 5, 0, 4, 1, 8, 6, 5, 1, 2, 3, 9, then the values stored in my_sum might be:

Core     0    1    2    3    4    5    6    7
my_sum   8   19    7   15    7   13   12   14
Here we’re assuming the cores are identified by nonnegative integers in the range 0, 1, . . ., p −1, where p
is the number of cores. When the cores are done computing their values of my_sum, they can form a
global sum by sending their results to a designated “master” core, which can add their results:
if (I'm the master core) {
    sum = my_sum;
    for each core other than myself {
        receive value from core;
        sum += value;
    }
} else {
    send my_sum to the master;
}
In our example, if the master core is core 0, it would add the values 8 + 19 + 7 + 15 + 7 + 13 + 12 + 14 = 95.
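The pseudocode above translates almost line for line into MPI (a sketch; here each process’s my_sum is just a stand-in value rather than the result of the earlier computation):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char* argv[]) {
    int my_rank, comm_sz;
    double my_sum, sum, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);

    my_sum = my_rank + 1.0;   /* stand-in for the partial sum computed earlier */

    if (my_rank == 0) {       /* I'm the master core */
        sum = my_sum;
        for (int q = 1; q < comm_sz; q++) {
            MPI_Recv(&value, 1, MPI_DOUBLE, q, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            sum += value;     /* add in the value received from core q */
        }
        printf("Global sum = %f\n", sum);
    } else {
        MPI_Send(&my_sum, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}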
But you can probably see a better way to do this—especially if the number of cores is large.
Instead of making the master core do all the work of computing the final sum, we can pair the cores so
that while core 0 adds in the result of core 1, core 2 can add in the result of core 3, core 4 can add in the
result of core 5, and so on. Then we can repeat the process with only the even-ranked cores: 0 adds in
the result of 2, 4 adds in the result of 6, and so on. Now cores divisible by 4 repeat the process, and so
on. See Fig. 1.1. The circles contain the current value of each core’s sum, and the lines with arrows
indicate that one core is sending its sum to another core. The plus signs indicate that a core is receiving
a sum from another core and adding the received sum into its own sum.
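The tree-structured sum in Fig. 1.1 can be coded along the following lines (an MPI sketch; it assumes the number of processes is a power of two, as in the figure):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char* argv[]) {
    int my_rank, comm_sz;
    double my_sum, received;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);

    my_sum = my_rank + 1.0;   /* stand-in for each core's partial sum */

    /* Tree-structured sum: at each stage, cores whose rank is a multiple
       of 2*stride receive from rank + stride and add; the senders drop out. */
    for (int stride = 1; stride < comm_sz; stride *= 2) {
        if (my_rank % (2 * stride) == 0) {
            MPI_Recv(&received, 1, MPI_DOUBLE, my_rank + stride, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            my_sum += received;
        } else {
            MPI_Send(&my_sum, 1, MPI_DOUBLE, my_rank - stride, 0,
                     MPI_COMM_WORLD);
            break;   /* this core has passed its sum along; it is done */
        }
    }

    if (my_rank == 0)
        printf("Global sum = %f\n", my_sum);   /* rank 0 holds the total */

    MPI_Finalize();
    return 0;
}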
As an analogy, consider a professor and her teaching assistants (TAs) grading a set of exams, each with five questions. In the first approach, each grader grades a single question on every exam. In the second approach, the exams are divided among the graders, and each grader grades all five questions on her share of the exams.
In both approaches the “cores” are the professor and her TAs. The first approach might be
considered an example of task-parallelism. There are five tasks to be carried out: grading the first
question, grading the second question, and so on. Presumably, the graders will be looking for different
information in question 1, which is about Shakespeare, from the information in question 2, which is about
Milton, and so on. So, the professor and her TAs will be “executing different instructions.” On the other
hand, the second approach might be considered an example of data parallelism. The “data” are the
students’ papers, which are divided among the cores, and each core applies more or less the same
grading instructions to each paper.
The first part of our global sum example might be considered an example of data-parallelism: the data are the values computed by Compute_next_value, and each core carries out roughly the
same operations on its assigned elements: it computes the required values by calling
Compute_next_value and adds them together. The second part of the first global sum example might be
considered an example of task-parallelism. There are two tasks: receiving and adding the cores’ partial
sums, which is carried out by the master core; and giving the partial sum to the master core, which is
carried out by the other cores. When the cores can work independently, writing a parallel program is much
the same as writing a serial program.
Things get a great deal more complex when the cores need to coordinate their work. In the second
global sum example, although the tree structure in the diagram is very easy to understand, writing the
actual code is relatively complex.
Unfortunately, it’s much more common for the cores to need coordination. In both global sum
examples, the coordination involves communication: one or more cores send their current partial sums
to another core. A second type of coordination in the global sum examples is load balancing. In
the first part of the global sum, it’s clear that we want the amount of time taken by each core to be roughly
the same as the time taken by the other cores. If the cores are identical, and each call to
Compute_next_value requires the same amount of work, then we want each core to be assigned roughly
the same number of values as the other cores. If, for example, one core has to compute most of the
values, then the other cores will finish much sooner than the heavily loaded core, and their computational
power will be wasted.
A third type of coordination is synchronization. As an example, suppose that instead of
computing the values to be added, the values are read from stdin. Say, x is an array that is read in by the
master core:
if (I'm the master core)
    for (my_i = 0; my_i < n; my_i++)
        scanf("%lf", &x[my_i]);
In most systems the cores are not automatically synchronized. Rather, each core works at its own pace. In this case, the problem is that we don’t want the other cores to race ahead and start computing their partial sums before the master is done initializing x and making it available to the other cores. That is, the cores need to synchronize, for example by calling a barrier function, before they start executing the code that computes their partial sums from x. A shared-memory sketch of this is given below.
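Here is what that might look like in OpenMP (a sketch, with n fixed at 24 as in the earlier example): the master thread reads in x, the barrier keeps the other threads from racing ahead, and then each thread adds up its share of the values.

#include <stdio.h>
#include <omp.h>

#define N 24

int main(void) {
    double x[N], sum = 0.0;

    #pragma omp parallel reduction(+:sum)
    {
        /* Only the master thread reads in x ... */
        #pragma omp master
        for (int i = 0; i < N; i++)
            scanf("%lf", &x[i]);

        /* ... and the other threads wait here until it has finished */
        #pragma omp barrier

        /* Now every thread can safely add up its share of x */
        #pragma omp for
        for (int i = 0; i < N; i++)
            sum += x[i];
    }

    printf("The sum is %f\n", sum);
    return 0;
}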