Module-02 Bcs702 (Parallel Computing) Search Creators
GPU programming
GPU programming is fundamentally heterogeneous programming, involving the coordination of two distinct
types of processors: the CPU "host" system and the GPU itself.
This is because GPUs are typically not standalone processors; they don't run their own operating systems or
directly access secondary storage.
1. The CPU Host's Role:
• The CPU host is responsible for the overall control and coordination.
• It allocates and initializes memory on both the CPU (host memory) and the GPU (device memory),
which are usually separate.
• It initiates the execution of programs on the GPU.
• It handles the output and retrieval of results from the GPU program.
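A minimal sketch of these host responsibilities, using the CUDA runtime API (the kernel name Vec_add, the problem size, and the launch configuration are assumptions made for illustration, not part of the original text):

#include <stdio.h>
#include <stdlib.h>

__global__ void Vec_add(const float* x, const float* y, float* z, int n) {
   int i = blockDim.x * blockIdx.x + threadIdx.x;
   if (i < n) z[i] = x[i] + y[i];
}

int main(void) {
   int n = 1 << 20;
   size_t size = n * sizeof(float);

   /* 1. Allocate and initialize host (CPU) memory */
   float* h_x = (float*) malloc(size);
   float* h_y = (float*) malloc(size);
   float* h_z = (float*) malloc(size);
   for (int i = 0; i < n; i++) { h_x[i] = 1.0f; h_y[i] = 2.0f; }

   /* 2. Allocate device (GPU) memory and copy the inputs to it */
   float *d_x, *d_y, *d_z;
   cudaMalloc(&d_x, size); cudaMalloc(&d_y, size); cudaMalloc(&d_z, size);
   cudaMemcpy(d_x, h_x, size, cudaMemcpyHostToDevice);
   cudaMemcpy(d_y, h_y, size, cudaMemcpyHostToDevice);

   /* 3. Initiate execution of the kernel on the GPU */
   int block = 256, grid = (n + block - 1) / block;
   Vec_add<<<grid, block>>>(d_x, d_y, d_z, n);

   /* 4. Retrieve the results from device memory and output them */
   cudaMemcpy(h_z, d_z, size, cudaMemcpyDeviceToHost);
   printf("z[0] = %f\n", h_z[0]);

   cudaFree(d_x); cudaFree(d_y); cudaFree(d_z);
   free(h_x); free(h_y); free(h_z);
   return 0;
}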
2. GPU Architecture:
• Threads running on a GPU processor are typically organized into groups, often called SIMD (Single
Instruction, Multiple Data) groups.
• Within a SIMD group: Threads generally operate under the SIMD model. While they may not
execute in strict "lockstep" (i.e., not all execute the exact same instruction at the precise same time), no
thread in the group will move to the next instruction until all threads in the group have completed the
current instruction.
A significant challenge in SIMD execution is branching (e.g., if-else statements) where threads within the
same group take different execution paths.
Example:
If there are 32 threads in a SIMD group, threads with rank_in_gp < 16 will execute the my_x += 1 branch,
while threads with rank_in_gp >= 16 will be idled. Once the first set of threads finishes, the roles are reversed:
the rank_in_gp < 16 threads are idled, and the rank_in_gp >= 16 threads execute my_x -= 1.
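A minimal CUDA-style sketch of the branch being described (the kernel wrapper, and the use of threadIdx.x % 32 to obtain a thread's rank within its 32-thread group, are assumptions for illustration):

__global__ void Branch_example(int* x, int n) {
   int i = blockDim.x * blockIdx.x + threadIdx.x;
   int rank_in_gp = threadIdx.x % 32;   /* rank within the 32-thread SIMD group */
   int my_x = (i < n) ? x[i] : 0;

   if (rank_in_gp < 16)
      my_x += 1;   /* runs while the threads with rank_in_gp >= 16 are idled */
   else
      my_x -= 1;   /* runs while the threads with rank_in_gp < 16 are idled */

   if (i < n) x[i] = my_x;
}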
Efficiency Impact: Idling half the threads for portions of execution is inefficient. Therefore, GPU
programmers must strive to minimize branching within SIMD groups.
• GPUs utilize a hardware scheduler (unlike CPUs, which typically use software schedulers). This
hardware scheduler operates with very low overhead.
• The scheduler's goal is to execute an instruction when all threads in a SIMD group are ready.
• To maximize hardware utilization, it's common practice to create a large number of SIMD groups.
This provides the scheduler with more options: if some SIMD groups are not ready (e.g., waiting for
data from memory or a previous instruction), they can be idled, and the scheduler can pick another
ready group to execute, thereby keeping the GPU busy.
Programming hybrid systems
When programming systems composed of clusters of multicore processors, a common approach for achieving high performance is to combine different programming models:
• Shared-memory API: Used for communication and data sharing within a single node (i.e., among the
cores of a single multicore processor).
• Distributed-memory API: Used for communication and data exchange between different nodes in the
cluster.
The primary motivation for employing a hybrid API is to achieve the highest possible levels of performance.
By utilizing the shared-memory model within a node, programs can exploit faster communication paths (e.g.,
directly accessing shared memory) compared to the overheads of inter-node communication.
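A minimal sketch of the hybrid approach, assuming MPI as the distributed-memory API and OpenMP as the shared-memory API (the variable names and the local-sum computation are illustrative only):

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char* argv[]) {
   int my_rank, comm_sz;
   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
   MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);

   long n = 1000000;
   double local_sum = 0.0;

   /* Shared-memory parallelism among the cores of this node */
   #pragma omp parallel for reduction(+: local_sum)
   for (long i = my_rank; i < n; i += comm_sz)
      local_sum += 1.0 / (i + 1);

   /* Distributed-memory communication between the nodes */
   double total = 0.0;
   MPI_Reduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

   if (my_rank == 0) printf("total = %f\n", total);
   MPI_Finalize();
   return 0;
}

Here each MPI process would run on one node and its OpenMP threads share that node's memory; only the final partial sums cross the network.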
Despite the performance benefits, hybrid API programming introduces significant complexity, making
program development much more difficult. This increased complexity arises from:
• Managing two distinct programming models: Developers must understand and effectively integrate
both shared-memory and distributed-memory paradigms.
• Data placement and movement: Carefully orchestrating data movement between shared memory
within a node and distributed memory across nodes becomes crucial for performance.
• Synchronization: Ensuring correct synchronization both within and across nodes adds layers of
complexity.
Alternative Approach:
• Given the challenges, such systems are often programmed using a single, unified distributed-memory
API for both inter-node and intra-node communication.
• While this might not always yield the absolute peak performance achievable with a finely-tuned hybrid
approach, it significantly reduces program development complexity by providing a consistent
communication model across the entire cluster.
MIMD systems
MIMD Systems: Navigating Input/Output Challenges
When discussing MIMD (Multiple Instruction, Multiple Data) systems, the topic of input/output (I/O) often
presents unique challenges compared to serial programming.
While parallel I/O is a complex field in itself, the vast majority of parallel programs for these systems perform
relatively little I/O, often relying on standard C functions like printf, fprintf, scanf, and fscanf.
However, even this limited use can lead to significant issues due to the inherent nature of parallel execution.
1. Standard C's Ambiguity for Processes: Standard C is a serial language, and its specifications do not
define the behavior of I/O functions when called by multiple independent processes. This means their
behavior in a multiprocess environment can vary significantly across different systems.
2. Nondeterministic Outcomes for Threads: While threads forked by a single process do share stdin,
stdout, and stderr, simultaneous access by multiple threads to these shared resources leads to
nondeterministic outcomes. It's impossible to predict the exact order or interleaving of operations.
3. Output to Console (printf, fprintf):
o Developer Expectation: Developers usually want all output to appear on the console of the single
system where the program was initiated.
o System Behavior: Most systems indeed coalesce output to a single console. However, there's no
guarantee. Some systems might restrict stdout or stderr access to only one process, or even none.
o Nondeterminism: When multiple processes/threads can access stdout, the sequence of output is
usually nondeterministic. Data from different processes/threads might interleave unexpectedly,
making it hard to read or debug.
4. Input from Console (scanf, fscanf):
o Ambiguity: It's less obvious how scanf should behave with multiple processes/threads. Should
input be divided? Should only one be allowed to read?
To manage these potential issues and ensure predictable (or at least understandable) I/O behavior in parallel
programs, the following conventions are often adopted:
o Both Distributed and Shared-Memory Programs: All processes/threads can access stdout and stderr; however, only a single process/thread (typically process 0 or the master thread) will read from stdin.
o However, for most cases: Only a single process/thread (e.g., process 0 or the master thread) will be
used for all intended output to stdout.
o Exception: Debugging: For debugging, multiple processes/threads will often write to stdout. This is
acceptable because the goal is to observe internal states, even if interleaved.
o Debug output should always include the rank or ID of the process/thread generating the output. This
is crucial for distinguishing messages and understanding the flow of execution from different parallel
entities, especially when output is interleaved.
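A minimal MPI sketch of the conventions above (the message text is illustrative):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char* argv[]) {
   int my_rank, comm_sz;
   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
   MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);

   /* Debug output: any process may print, but every message carries its rank */
   printf("Proc %d of %d: starting\n", my_rank, comm_sz);

   /* Intended program output: only process 0 writes to stdout */
   if (my_rank == 0)
      printf("Result: run completed on %d processes\n", comm_sz);

   MPI_Finalize();
   return 0;
}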
GPUs
In GPU programming, the CPU host code is generally responsible for all input/output (I/O) operations.
This simplifies I/O management significantly for a few key reasons:
Typically, only one process or thread runs on the CPU host to manage the GPU computations. This means that
the standard C I/O functions (like printf, scanf, fprintf, fscanf) behave exactly as they would in a regular serial
C program, without the complexities of nondeterministic behavior found in multi-process/multi-thread CPU
environments.
GPU Limitations:
GPUs are specialized for parallel computation and generally do not have direct access to standard I/O
streams or secondary storage (like hard drives). This offloads I/O responsibilities entirely to the host.
While the rule is to use the host for all I/O, there's a crucial exception for debugging GPU code:
• stdout for GPU Threads: In the systems commonly used for GPU programming, individual GPU threads
can write to stdout.
• Nondeterministic Output: Similar to MIMD programs running on CPUs, when multiple GPU threads
write to stdout concurrently, the order of the output is nondeterministic. This means the interleaved
output can vary from one run to the next.
• No Access to Other Streams/Storage: Crucially, GPU threads typically do not have access to stderr,
stdin, or secondary storage. This reinforces the host's role for all but very specific debugging output.
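A minimal CUDA sketch of such debugging output; the order in which the lines appear is nondeterministic and can change from run to run (the kernel name and launch configuration are assumptions):

#include <stdio.h>

__global__ void Hello(void) {
   /* Each GPU thread prints its own global index */
   printf("Hello from GPU thread %d\n", blockDim.x * blockIdx.x + threadIdx.x);
}

int main(void) {
   Hello<<<2, 4>>>();          /* 2 thread blocks of 4 threads each */
   cudaDeviceSynchronize();    /* wait so the device-side output is flushed */
   return 0;
}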
Speedup and efficiency
The primary goal of writing parallel programs is to achieve increased performance. When evaluating the performance of homogeneous MIMD systems (where all cores have the same architecture, unlike GPUs), we use metrics such as speedup and efficiency.
Ideally, a parallel program divides work equally among cores with no added overhead.
If a program runs on p cores (one process or thread per core), the best possible runtime would be Tparallel = Tserial/p, where Tserial is the runtime of the equivalent serial program on a single core of the same design.
When this ideal is achieved, we say the program has linear speedup.
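For reference, the standard definitions, written in the same notation as the text: speedup is S = Tserial / Tparallel, and efficiency is E = S / p = Tserial / (p · Tparallel). Linear speedup corresponds to S = p, i.e., E = 1.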
Practical Considerations: In practice, achieving linear speedup is rare due to inherent parallel overheads:
• Shared-memory programs: Often involve critical sections that require mutual exclusion mechanisms
(e.g., mutexes). Calls to mutex functions and serialization of critical sections introduce overhead not
present in the serial version.
• Distributed-memory programs: Require data transmission across networks, which is significantly
slower than local memory access.
These overheads are not present in serial programs and tend to increase with the number of processes or
threads (p).
For example, more threads mean more contention for critical sections, and more processes mean more network
communication. Consequently, the speedup S usually becomes a smaller fraction of the ideal p as p increases.
Tparallel, S, and E depend not only on p but also, often significantly, on the problem size.
Observation:
In many parallel programs, as the problem size increases (with p fixed), speedups and efficiencies generally
increase (as illustrated in Tables 2.4 and 2.5 and Figures 2.18 and 2.19). Conversely, they decrease when the
problem size is reduced.
Reason:
This common behavior occurs because, for many parallel algorithms, the parallel overhead (Toverhead) grows
much more slowly than the time spent solving the original problem (Tserial) as the problem size increases. That
is, Tserial grows faster with problem size than Toverhead.
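One way to see this, assuming the parallel runtime can be written as the ideal time plus an overhead term:
Tparallel = Tserial / p + Toverhead, so
E = Tserial / (p · Tparallel) = Tserial / (Tserial + p · Toverhead).
If Tserial grows faster with the problem size than Toverhead, the denominator becomes dominated by Tserial, so the efficiency (and with it the speedup) improves as the problem size is increased with p fixed.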
When reporting speedups and efficiencies, it's crucial to define how Tserial is measured:
Approach 1 (Fastest serial program on the fastest processor):
Some argue for using the runtime of the fastest known serial program (even if it uses a different algorithm), run on the fastest available processor. This provides a "best-case" baseline for how quickly the problem can be solved.
Approach 2 (Serial version of the parallel program on a single core of the parallel system):
In practice, most authors, including the text's authors, use the serial version of the same program that was
parallelized, run on a single core of the parallel system itself.
o Rationale: This approach is more practical and provides a better measure of the utilization of the
parallel cores for the specific parallelized algorithm, allowing efficiency to directly reflect how well
the parallel system is being used for that problem.
The second approach is generally preferred because it provides a more direct comparison of the parallel
implementation's effectiveness relative to its serial counterpart on the same hardware.
Amdahl’s law
Amdahl's Law: The Limits of Parallel Speedup
Amdahl's Law, an observation made by Gene Amdahl in the 1960s, states that the potential speedup of a
program due to parallelization is limited by the fraction of the program that remains inherently serial
(unparallelized), regardless of how many processing cores are available.
Illustrative Example:
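A typical worked example (the specific numbers are assumed for illustration): suppose a fraction r = 0.1 of a program is inherently serial and the remaining 90% is perfectly parallelized, with Tserial = 20 seconds. On p cores,
Tparallel = 0.9 · Tserial / p + 0.1 · Tserial = 18/p + 2 seconds,
so the speedup is S = 20 / (18/p + 2). With p = 16 this gives S ≈ 6.4, and no matter how large p becomes, S stays below 20 / 2 = 10. In general, if a fraction r of the serial program remains unparallelized, the speedup is bounded above by 1/r.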
Implications:
Amdahl's Law highlights a critical bottleneck: the inherently serial portion of a program severely limits the
achievable speedup, no matter how much parallel processing power is thrown at the parallelizable parts.
Even a small serial fraction can cap the potential performance gains.
Despite its seemingly daunting implications, there are several reasons why Amdahl's Law isn't always a cause
for despair in parallel programming:
• Amdahl's Law assumes a fixed problem size. However, for many problems, as the problem size
increases, the "inherently serial" fraction (r) often decreases proportionally to the total work.
• This means that larger problems can exhibit much better speedups on many cores, a concept formalized
by Gustafson's Law.
• In essence, as you scale up the problem, the parallelizable work often grows much faster than the serial
work.
Empirical Evidence:
• Scientists and engineers routinely achieve huge speedups on large distributed-memory systems for
real-world problems.
• This practical success demonstrates that for many relevant applications, the serial bottleneck is either
sufficiently small or grows slowly enough relative to the parallelizable work.
Adequate Speedup:
• A "small" speedup (e.g., 5x or 10x) might still be more than sufficient for many applications.
• If the effort involved in developing the parallel program is reasonable, such speedups can translate into
significant practical benefits, making the parallelization worthwhile even if theoretical linear speedup
isn't achieved.
The term "scalable" is used informally in many contexts to mean that a system or program can handle
increasing demands by growing its resources.
In the context of MIMD (Multiple Instruction, Multiple Data) parallel program performance, however,
"scalability" has a more formal and precise definition.
Taking timings of MIMD programs
Accurately measuring the performance of MIMD parallel programs requires careful consideration beyond simple start-to-finish timers. The methodology depends on the purpose of the timing and the nature of parallel execution.
There are generally two main reasons for timing parallel programs:
1. To understand whether the program is behaving as intended. This often requires very detailed information, such as time spent in specific code sections, communication overheads, or waiting times (e.g., processes waiting for messages in distributed-memory programs). Large waiting times can indicate design or implementation flaws.
2. To determine the program's overall performance. For this, a single, concise runtime value is usually reported.
We will focus on the second type of timing here.
o We are usually interested in the time taken by the core computational part of the program, not the total
execution time from start to finish (which includes setup, I/O, etc.).
o Therefore, standard shell commands like time (Unix) are often insufficient as they report total program
execution time.
o We are generally not interested in "CPU time" (time spent actively executing instructions or system
calls initiated by the program). The standard C clock() function reports CPU time.
o Problem with CPU time: In parallel programs, processes/threads might spend significant time idle
(e.g., waiting for other processes/threads to send data, or waiting for a mutex to be released). This idle
time is not counted as CPU time because no function called by the process is active. However, this idle
time is a real cost to the overall parallel execution and must be included in performance evaluation to
provide an accurate picture of the program's true runtime.
o Solution: Wall Clock Time: The reported time for parallel programs is almost always "wall clock
time" (also known as elapsed time or real time). This measures the total time that elapses from the
beginning to the end of the timed section, as if measured by a stopwatch.
o Definition: Timer resolution is the smallest unit of measurement the timer can report (e.g., milliseconds,
microseconds, nanoseconds).
o Issue: If the resolution is too coarse (e.g., milliseconds) and the timed code section executes very
quickly (e.g., less than a millisecond), the timer might report zero or an inaccurate value.
o Verification: Many APIs provide functions to query the timer's resolution, or they specify a minimum
required resolution. Programmers must check these values.
o When the code being timed is executed by multiple processes or threads, each will record its own
my_start and my_finish times, resulting in p individual elapsed times.
o However, what's usually desired is a single overall time: the duration from when the first
process/thread began the timed section to when the last process/thread finished it.
o Challenge: Due to potentially unsynchronized clocks across different nodes (in distributed memory),
obtaining this exact "global" start and end time is often difficult.
o Compromise Approach:
A Barrier() function ensures that all processes/threads reach a certain point before any can proceed past it.
While it doesn't guarantee simultaneous return, it provides a crucial synchronization point.
Global Maximum:
Each process/thread calculates its my_elapsed time. A global reduction operation (like Global_max or
MPI_Reduce with MPI_MAX) is then used to find the maximum of all these individual elapsed times.
This maximum represents the time it took for the slowest process/thread to complete its portion of the work,
effectively capturing the total time the parallel section was active. Process/thread 0 (or the master) typically
prints this value.
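A minimal MPI version of this compromise approach (MPI_Wtime returns wall clock time and MPI_Wtick reports the timer's resolution; the code to be timed is left as a placeholder):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char* argv[]) {
   int my_rank;
   double my_start, my_finish, my_elapsed, elapsed;

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

   MPI_Barrier(MPI_COMM_WORLD);     /* try to start all processes together */
   my_start = MPI_Wtime();          /* wall clock time, not CPU time */

   /* ... code to be timed ... */

   my_finish = MPI_Wtime();
   my_elapsed = my_finish - my_start;

   /* The overall time is the maximum of the individual elapsed times */
   MPI_Reduce(&my_elapsed, &elapsed, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

   if (my_rank == 0)
      printf("Elapsed time = %e seconds (timer resolution = %e)\n",
             elapsed, MPI_Wtick());

   MPI_Finalize();
   return 0;
}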
o Running a program multiple times will almost certainly yield slightly different elapsed times, even with
identical inputs and systems.
o While reporting the mean or median might seem intuitive, it's generally unlikely for external factors to
make a program run faster than its best possible time. External interference (OS activity, network jitter,
etc.) almost always makes it run slower.
o Therefore, it's common practice to report the minimum time observed across multiple runs. This
minimum time is considered the most representative of the program's optimal performance under the
given conditions.
o Running more than one thread per physical core can significantly increase timing variability and
introduce extra overhead due to system scheduling and descheduling.
o For this reason, it's generally recommended to run no more than one thread per physical core when
aiming for optimal performance and consistent timings.
o Since most parallel programs (unless specifically designed for high-performance I/O) do not optimize
for I/O operations, it's common practice to exclude I/O time (reading inputs, printing final results) from
the reported performance timings. The focus remains on the core computation time.
GPU performance
While the general goal of parallel programming is increased performance, evaluating GPU program performance differs from evaluating homogeneous MIMD systems because of the distinct architecture of GPUs.
It's common to report speedups of GPU programs over serial CPU programs or parallel MIMD
programs. This gives a straightforward indication of the performance gain achieved by using the GPU.
However, certain metrics and concepts, typically applied to MIMD systems, are less meaningful or directly
applicable to GPUs:
Efficiency:
In MIMD systems, efficiency is often calculated by comparing parallel runtime to a serial runtime on a single
core of the same type used by the parallel system.
Since GPU cores are fundamentally different (inherently parallel, highly specialized) from conventional CPUs,
this type of direct comparison to a "single GPU core serial execution" doesn't make sense.
Therefore, efficiency is ordinarily not used in discussions of GPU performance in the same formal way.
Linear Speedup:
Similarly, talking about "linear speedup" of a GPU program relative to a serial CPU program is not directly
applicable.
A CPU and a GPU operate on vastly different architectural principles, so a direct scalar comparison of core
counts for "linear" scaling is inappropriate.
Scalability in GPUs
• The formal definition of scalability (maintaining constant efficiency by increasing problem size
proportionally to processor count) from MIMD programs cannot be directly applied to GPUs because the
concept of "efficiency" (as previously defined) doesn't directly translate.
Amdahl's law and GPU programs
Applicability: If the inherently serial part of a GPU program (the portion that cannot be parallelized and runs on the CPU) is considered, then Amdahl's Law can indeed be applied to GPU programs.
The resulting upper bound on possible speedup will be the same as for MIMD programs: if a fraction r of the
original serial program remains unparallelized and runs on a conventional serial processor, the best possible
speedup for the combined CPU-GPU program will be less than 1/r.
Caveats: The same caveats that apply to Amdahl's Law in MIMD systems also hold true for GPUs:
o The "inherently serial" fraction (r) is likely dependent on the problem size. For larger problems, r
often decreases, leading to higher potential speedups.
o Despite the theoretical limits, many GPU programs achieve huge speedups in practice,
demonstrating their effectiveness for highly parallelizable workloads.
o Even a relatively small speedup (e.g., 5x or 10x) can be perfectly adequate for many applications,
making GPU acceleration worthwhile.
Taking timings of GPU programs
The basic principles of timing discussed for MIMD programs also apply to GPU programs: measure wall clock time for the relevant sections.
Since a GPU program is typically initiated and concluded on a conventional CPU, for overall performance
measurement of the GPU-accelerated parts, you can usually just use the CPU's timer.
Timing Subsets of GPU Code: If you only want to time a specific portion of the code running directly on the
GPU, you will need to use timer functions provided by the GPU's programming API (e.g., CUDA events
for NVIDIA GPUs).
These allow for precise measurement of kernels and data transfers on the GPU device itself.
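A minimal sketch of timing a kernel with CUDA events (the kernel name and launch configuration are assumptions):

#include <stdio.h>

__global__ void My_kernel(void) { /* ... work to be timed ... */ }

int main(void) {
   cudaEvent_t start, stop;
   float elapsed_ms = 0.0f;

   cudaEventCreate(&start);
   cudaEventCreate(&stop);

   cudaEventRecord(start, 0);       /* mark the start in the device's stream */
   My_kernel<<<128, 256>>>();
   cudaEventRecord(stop, 0);        /* mark the end */
   cudaEventSynchronize(stop);      /* wait until the stop event has completed */

   cudaEventElapsedTime(&elapsed_ms, start, stop);   /* elapsed time in milliseconds */
   printf("Kernel time = %f ms\n", elapsed_ms);

   cudaEventDestroy(start);
   cudaEventDestroy(stop);
   return 0;
}

Events recorded around cudaMemcpy calls can be used in the same way to time host-device data transfers.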