PARALLEL COMPUTING | BCS702 | Module 02 | Search Creators
Module – 02

GPU programming

GPU programming is fundamentally heterogeneous programming, involving the coordination of two distinct
types of processors: the CPU "host" system and the GPU itself.

This is because GPUs are typically not standalone processors; they don't run their own operating systems or
directly access secondary storage.

Here's a breakdown of the key aspects of GPU programming:

1. CPU Host's Role:

• The CPU host is responsible for the overall control and coordination.
• It allocates and initializes memory on both the CPU (host memory) and the GPU (device memory),
which are usually separate.
• It initiates the execution of programs on the GPU.
• It handles the output and retrieval of results from the GPU program.
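In outline, the host-side control flow usually looks like the sketch below. This is C-style pseudocode only: Allocate_device_memory, Copy_to_device, Launch_gpu_program, Copy_from_device, and the other helpers are hypothetical placeholder names, not functions of any particular GPU API.

/* Hypothetical outline of the host's role (not a real GPU API) */
float *host_x   = malloc(n * sizeof(float));                  /* host memory   */
float *device_x = Allocate_device_memory(n * sizeof(float));  /* device memory */

Initialize(host_x, n);                          /* host sets up the input data */
Copy_to_device(device_x, host_x, n * sizeof(float));
Launch_gpu_program(device_x, n);                /* host starts the GPU program */
Copy_from_device(host_x, device_x, n * sizeof(float));
Print_results(host_x, n);                       /* host handles all output     */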

2. GPU Architecture:

• A GPU contains one or more processors.
• Each of these processors is designed to run hundreds or thousands of threads concurrently.
• Processors typically share a large block of global memory.
• Critically, each individual processor also has a small block of much faster memory, often referred to
as a programmer-managed cache. This faster memory is only accessible by threads running on that
specific processor.

3. Thread Organization and Execution (SIMD Groups):

• Threads running on a GPU processor are typically organized into groups, often called SIMD (Single
Instruction, Multiple Data) groups.
• Within a SIMD group: Threads generally operate under the SIMD model. While they may not
execute in strict "lockstep" (i.e., not all execute the exact same instruction at the precise same time), no
thread in the group will move to the next instruction until all threads in the group have completed the
current instruction.

• Across different SIMD groups: Threads in different groups can run largely independently.

4. Branching and Idling Threads:

A significant challenge in SIMD execution is branching (e.g., if-else statements) where threads within the
same group take different execution paths.

Example:

// Thread private variables
int rank_in_gp, my_x;
// ...
if (rank_in_gp < 16)
   my_x += 1;
else
   my_x -= 1;

If there are 32 threads in a SIMD group, threads with rank_in_gp < 16 will execute the my_x += 1 branch,
while threads with rank_in_gp >= 16 will be idled. Once the first set of threads finishes, the roles are reversed:
the rank_in_gp < 16 threads are idled, and the rank_in_gp >= 16 threads execute my_x -= 1.

Efficiency Impact: Idling half the threads for portions of execution is inefficient. Therefore, GPU
programmers must strive to minimize branching within SIMD groups.
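One way to reduce such divergence, sketched below on the assumption that the compiler turns a conditional expression into non-branching (predicated or select) instructions, is to replace the branch with arithmetic so that every thread in the group executes the same instruction stream:

/* Branch-free rewrite of the example above (an illustrative sketch, not
   tied to any particular GPU API): all 32 threads run the same instructions. */
my_x += (rank_in_gp < 16) ? 1 : -1;
/* or, using the fact that a C comparison evaluates to 0 or 1: */
my_x += 1 - 2 * (rank_in_gp >= 16);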

5. Hardware Scheduling and Maximizing Resource Use:

• GPUs utilize a hardware scheduler (unlike CPUs, which typically use software schedulers). This
hardware scheduler operates with very low overhead.
• The scheduler's goal is to execute an instruction when all threads in a SIMD group are ready.
• To maximize hardware utilization, it's common practice to create a large number of SIMD groups.
This provides the scheduler with more options: if some SIMD groups are not ready (e.g., waiting for
data from memory or a previous instruction), they can be idled, and the scheduler can pick another
ready group to execute, thereby keeping the GPU busy.

Programming hybrid systems

When programming systems composed of clusters of multicore processors, a common approach for achieving
high performance is to combine different programming models:

• Shared-memory API: Used for communication and data sharing within a single node (i.e., among the
cores of a single multicore processor).
• Distributed-memory API: Used for communication and data exchange between different nodes in the
cluster.

This combination is often referred to as a "hybrid" API programming model.
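A minimal sketch of what such a hybrid program can look like, assuming MPI as the distributed-memory API and OpenMP as the shared-memory API (a common pairing, though the text does not prescribe specific libraries):

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char* argv[]) {
   int provided, rank;

   /* Request thread support, since OpenMP threads exist inside each MPI process */
   MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);

   /* Shared-memory parallelism among the cores of one node */
   #pragma omp parallel
   {
      printf("Process %d, thread %d\n", rank, omp_get_thread_num());
   }

   /* Distributed-memory communication between nodes would use MPI calls here */
   MPI_Finalize();
   return 0;
}

Compiled with, e.g., mpicc -fopenmp, each MPI process is typically placed on its own node, and its OpenMP threads run on that node's cores and share that node's memory.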

Why use a hybrid API?

The primary motivation for employing a hybrid API is to achieve the highest possible levels of performance.
By utilizing the shared-memory model within a node, programs can exploit faster communication paths (e.g.,
directly accessing shared memory) compared to the overheads of inter-node communication.

Challenges of Hybrid API Programming:

Despite the performance benefits, hybrid API programming introduces significant complexity, making
program development much more difficult. This increased complexity arises from:

• Managing two distinct programming models: Developers must understand and effectively integrate
both shared-memory and distributed-memory paradigms.
• Data placement and movement: Carefully orchestrating data movement between shared memory
within a node and distributed memory across nodes becomes crucial for performance.
• Synchronization: Ensuring correct synchronization both within and across nodes adds layers of
complexity.

Alternative Approach:

• Given the challenges, such systems are often programmed using a single, unified distributed-memory
API for both inter-node and intra-node communication.
• While this might not always yield the absolute peak performance achievable with a finely-tuned hybrid
approach, it significantly reduces program development complexity by providing a consistent
communication model across the entire cluster.

• In essence, the choice between a hybrid API and a purely distributed-memory API for clusters of
multicore processors is a trade-off between maximizing performance and managing development
complexity.
• Hybrid APIs are typically reserved for applications where squeezing out every last bit of performance
is paramount.

MIMD systems
MIMD Systems: Navigating Input/Output Challenges

When discussing MIMD (Multiple Instruction, Multiple Data) systems, the topic of input/output (I/O) often
presents unique challenges compared to serial programming.

While parallel I/O is a complex field in itself, the vast majority of parallel programs for these systems perform
relatively little I/O, often relying on standard C functions like printf, fprintf, scanf, and fscanf.

However, even this limited use can lead to significant issues due to the inherent nature of parallel execution.

I/O Challenges in MIMD Systems:

1. Standard C's Ambiguity for Processes: Standard C is a serial language, and its specifications do not
define the behavior of I/O functions when called by multiple independent processes. This means their
behavior in a multiprocess environment can vary significantly across different systems.
2. Nondeterministic Outcomes for Threads: While threads forked by a single process do share stdin,
stdout, and stderr, simultaneous access by multiple threads to these shared resources leads to
nondeterministic outcomes. It's impossible to predict the exact order or interleaving of operations.
3. Output to Console (printf, fprintf):
o Developer Expectation: Developers usually want all output to appear on the console of the single
system where the program was initiated.
o System Behavior: Most systems indeed coalesce output to a single console. However, there's no
guarantee. Some systems might restrict stdout or stderr access to only one process, or even none.
o Nondeterminism: When multiple processes/threads can access stdout, the sequence of output is
usually nondeterministic. Data from different processes/threads might interleave unexpectedly,
making it hard to read or debug.
4. Input from Console (scanf, fscanf):
o Ambiguity: It's less obvious how scanf should behave with multiple processes/threads. Should
input be divided? Should only one be allowed to read?

o System Behavior: The majority of systems allow at least one process (commonly process 0) to call
scanf, and most allow multiple threads to call scanf. However, some systems might not allow any
processes to call scanf.
o Nondeterminism: When multiple processes/threads read from stdin, the data read by each can vary
on different runs, even with identical input, due to race conditions.

Assumptions and Rules for I/O in Parallel Programs:

To manage these potential issues and ensure predictable (or at least understandable) I/O behavior in parallel
programs, the following conventions are often adopted:

Standard Input (stdin) Access:

o Distributed-Memory Programs: Only process 0 will access stdin.
o Shared-Memory Programs: Only the master thread or thread 0 will access stdin.
o Rationale: This centralizes input, avoiding nondeterminism and ensuring a single, clear source for
input data.

Standard Output (stdout) and Standard Error (stderr) Access:

o Both Distributed and Shared-Memory Programs: All processes/threads can access stdout and
stderr.
o However, for most cases: Only a single process/thread (e.g., process 0 or the master thread) will be
used for all intended output to stdout.
o Exception: Debugging: For debugging, multiple processes/threads will often write to stdout. This is
acceptable because the goal is to observe internal states, even if interleaved.

File I/O (other than stdin, stdout, stderr):

o Only a single process/thread will attempt to access any specific file.
o This means each process/thread can open its own private file for reading or writing, but no two
processes/threads will open the same file simultaneously.
o Rationale: This completely avoids race conditions and data corruption issues when dealing with file
streams.

Debug Output Best Practice:

o Debug output should always include the rank or ID of the process/thread generating the output. This
is crucial for distinguishing messages and understanding the flow of execution from different parallel
entities, especially when output is interleaved.
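A brief MPI sketch of these conventions for a distributed-memory program (the variable names are illustrative): process 0 reads from stdin and broadcasts the value, and all debug output is tagged with the process rank.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char* argv[]) {
   int my_rank, comm_sz, n = 0;

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
   MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);

   if (my_rank == 0) {                  /* only process 0 touches stdin */
      printf("Enter n: ");
      scanf("%d", &n);
   }
   MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);   /* share the input */

   /* Debug output: always include the rank of the process */
   printf("Proc %d of %d: n = %d\n", my_rank, comm_sz, n);

   MPI_Finalize();
   return 0;
}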

GPUs

GPUs and I/O: Host-Centric Approach with Debugging Exceptions

In GPU programming, the CPU host code is generally responsible for all input/output (I/O) operations.
This simplifies I/O management significantly for a few key reasons:

Single Host Process/Thread:

Typically, only one process or thread runs on the CPU host to manage the GPU computations. This means that
the standard C I/O functions (like printf, scanf, fprintf, fscanf) behave exactly as they would in a regular serial
C program, without the complexities of nondeterministic behavior found in multi-process/multi-thread CPU
environments.

GPU Limitations:

GPUs are specialized for parallel computation and generally do not have direct access to standard I/O
streams or secondary storage (like hard drives). This offloads I/O responsibilities entirely to the host.

Exception for Debugging:

While the rule is to use the host for all I/O, there's a crucial exception for debugging GPU code:

• stdout for GPU Threads: In the systems commonly used for GPU programming, individual GPU threads
can write to stdout.
• Nondeterministic Output: Similar to MIMD programs running on CPUs, when multiple GPU threads
write to stdout concurrently, the order of the output is nondeterministic. This means the interleaved
output can vary from one run to the next.
• No Access to Other Streams/Storage: Crucially, GPU threads typically do not have access to stderr,
stdin, or secondary storage. This reinforces the host's role for all but very specific debugging output.

Performance – Speedup and efficiency in MIMD systems

The primary goal of writing parallel programs is to achieve increased performance. When evaluating the
performance of homogeneous MIMD systems (where all cores have the same architecture, unlike GPUs), we
use metrics like speedup and efficiency.

Speedup and Efficiency in MIMD Systems

Ideally, a parallel program divides work equally among cores with no added overhead.

If a program runs on p cores (one process or thread per core), the best possible runtime, Tparallel, would be Tserial
/p, where Tserial is the runtime of the equivalent serial program on a single core of the same design.

When this ideal is achieved, we say the program has linear speedup.

Speedup (S) is formally defined as:

S = Tserial / Tparallel

so linear speedup corresponds to S = p.

Practical Considerations: In practice, achieving linear speedup is rare due to inherent parallel overheads:

• Shared-memory programs: Often involve critical sections that require mutual exclusion mechanisms
(e.g., mutexes). Calls to mutex functions and serialization of critical sections introduce overhead not
present in the serial version.
• Distributed-memory programs: Require data transmission across networks, which is significantly
slower than local memory access.

These overheads are not present in serial programs and tend to increase with the number of processes or
threads (p).

For example, more threads mean more contention for critical sections, and more processes mean more network
communication. Consequently, the speedup S usually becomes a smaller fraction of the ideal p as p increases.

Efficiency (E) is defined as:

E = S / p = Tserial / (p · Tparallel)

so linear speedup corresponds to E = 1.
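As a quick worked illustration of both definitions (the runtimes below are made up):

/* Hypothetical measured runtimes */
double t_serial   = 24.0;   /* seconds: serial program on one core          */
double t_parallel =  4.0;   /* seconds: parallel program on p = 8 cores     */
int    p          =  8;

double S = t_serial / t_parallel;   /* speedup    = 6.0                     */
double E = S / p;                   /* efficiency = 0.75, i.e. 75% of ideal */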

Dependence on Problem Size

Tparallel, S, and E depend not only on p but also, often significantly, on the problem size.

Observation:

In many parallel programs, as the problem size increases (with p fixed), speedups and efficiencies generally
increase (as illustrated in Tables 2.4 and 2.5 and Figures 2.18 and 2.19). Conversely, they decrease when the
problem size is reduced.


Reason:

This common behavior occurs because, for many parallel algorithms, Tparallel can be written as Tserial / p + Toverhead, and the parallel overhead Toverhead grows much more slowly with the problem size than Tserial, the time spent solving the original problem. As the problem grows, the overhead therefore becomes a smaller fraction of the total runtime, so speedup and efficiency improve.

Choosing Tserial for Performance Measurement

When reporting speedups and efficiencies, it's crucial to define how Tserial is measured:

Approach 1 (Fastest possible serial program on fastest processor):

Some argue for using the runtime of the absolute fastest serial program (even if it uses a different algorithm)
on the fastest available processor. This provides a "best-case" baseline for problem solving.

Approach 2 (Serial version of the parallel program on a single core of the parallel system):

In practice, most authors, including the text's authors, use the serial version of the same program that was
parallelized, run on a single core of the parallel system itself.

o Rationale: This approach is more practical and provides a better measure of the utilization of the
parallel cores for the specific parallelized algorithm, allowing efficiency to directly reflect how well
the parallel system is being used for that problem.

The second approach is generally preferred because it provides a more direct comparison of the parallel
implementation's effectiveness relative to its serial counterpart on the same hardware.

Amdahl’s law
Amdahl's Law: The Limits of Parallel Speedup

Amdahl's Law, an observation made by Gene Amdahl in the 1960s, states that the potential speedup of a
program due to parallelization is limited by the fraction of the program that remains inherently serial
(unparallelized), regardless of how many processing cores are available.

Illustrative Example:

Suppose, for instance, that 90% of a serial program can be parallelized perfectly while the remaining 10% is inherently serial, and that the serial runtime is Tserial = 20 seconds. The parallelized part then runs in 0.9 × Tserial / p = 18/p seconds on p cores, the serial part still takes 0.1 × Tserial = 2 seconds, and

Tparallel = 18/p + 2, so S = 20 / (18/p + 2).

As p increases, S approaches, but never reaches, 20/2 = 10. In general, if a fraction r of the original program is inherently serial, the speedup is bounded by S ≤ 1/r, no matter how many cores are used.


Implications:

Amdahl's Law highlights a critical bottleneck: the inherently serial portion of a program severely limits the
achievable speedup, no matter how much parallel processing power is thrown at the parallelizable parts.

Even a small serial fraction can cap the potential performance gains.
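As a small illustration of the bound (a C sketch, not from the text; the function name is made up), the achievable speedup for a serial fraction r on p cores is 1 / (r + (1 - r)/p), which can never exceed 1/r:

/* Amdahl bound: speedup when a fraction r of the work is inherently serial
   and the remaining (1 - r) is perfectly parallelized over p cores.        */
double amdahl_speedup(double r, int p) {
   return 1.0 / (r + (1.0 - r) / p);
}

/* Example: r = 0.1 gives amdahl_speedup(0.1, 16) = 6.4 and
   amdahl_speedup(0.1, 1024) ≈ 9.9, approaching the limit 1/r = 10.         */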

Why Not Be Too Worried by Amdahl's Law:

Despite its seemingly daunting implications, there are several reasons why Amdahl's Law isn't always a cause
for despair in parallel programming:

Problem Size (Gustafson's Law):

• Amdahl's Law assumes a fixed problem size. However, for many problems, as the problem size
increases, the inherently serial fraction r of the total work often decreases.
• This means that larger problems can exhibit much better speedups on many cores, a concept formalized
by Gustafson's Law.
• In essence, as you scale up the problem, the parallelizable work often grows much faster than the serial
work.

Empirical Evidence:

• Scientists and engineers routinely achieve huge speedups on large distributed-memory systems for
real-world problems.
• This practical success demonstrates that for many relevant applications, the serial bottleneck is either
sufficiently small or grows slowly enough relative to the parallelizable work.

Adequate Speedup:

• A "small" speedup (e.g., 5x or 10x) might still be more than sufficient for many applications.
• If the effort involved in developing the parallel program is reasonable, such speedups can translate into
significant practical benefits, making the parallelization worthwhile even if theoretical linear speedup
isn't achieved.

Scalability in MIMD systems

The term "scalable" is used informally in many contexts to mean that a system or program can handle
increasing demands by growing its resources.

In the context of MIMD (Multiple Instruction, Multiple Data) parallel program performance, however,
"scalability" has a more formal and precise definition.

Formal Definition of Scalability:

Suppose we run a parallel program with a fixed number of processes/threads and a fixed input size, and we obtain an efficiency E. Suppose we now increase the number of processes/threads. If we can find a corresponding rate of increase in the problem size so that the program always has efficiency E, then the program is said to be scalable.

Two special cases are usually distinguished:

• Strongly scalable: the efficiency can be kept fixed as the number of processes/threads increases, without increasing the problem size.
• Weakly scalable: the efficiency can be kept fixed only by increasing the problem size at the same rate as the number of processes/threads.
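A small worked sketch of this definition, under an assumed (hypothetical) cost model Tserial = n and Tparallel = n/p + 1:

/* Hypothetical cost model: Tserial = n, Tparallel = n/p + 1.
   Efficiency is then E = n / (n + p), so increasing BOTH n and p by the same
   factor leaves E unchanged: the program is weakly scalable under this model. */
double efficiency(double n, double p) {
   double t_serial   = n;
   double t_parallel = n / p + 1.0;
   return t_serial / (p * t_parallel);   /* simplifies to n / (n + p) */
}

/* efficiency(1000.0, 10.0) == efficiency(2000.0, 20.0) ≈ 0.990 */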


Taking timings of MIMD programs

Accurately measuring the performance of MIMD (Multiple Instruction, Multiple Data) parallel programs
requires careful consideration beyond simple start-to-finish timers.

The methodology depends on the purpose of the timing and the nature of parallel execution.

Reasons for Taking Timings

There are generally two main reasons for timing parallel programs:

During Program Development (Detailed Analysis):

To understand if the program is behaving as intended. This often requires very detailed information, such as
time spent in specific code sections, communication overheads, or waiting times (e.g., processes waiting for
messages in distributed-memory programs). Large waiting times can indicate design or implementation flaws.

After Development (Overall Performance Evaluation):

To determine the program's overall performance. For this, a single, concise runtime value is usually reported.
We will focus on this second type of timing here.

Key Considerations for Performance Timings

Timing Specific Code Sections, Not Entire Program:

o We are usually interested in the time taken by the core computational part of the program, not the total
execution time from start to finish (which includes setup, I/O, etc.).
o Therefore, standard shell commands like time (Unix) are often insufficient as they report total program
execution time.

Wall Clock Time vs. CPU Time:

o We are generally not interested in "CPU time" (time spent actively executing instructions or system
calls initiated by the program). The standard C clock() function reports CPU time.
o Problem with CPU time: In parallel programs, processes/threads might spend significant time idle
(e.g., waiting for other processes/threads to send data, or waiting for a mutex to be released). This idle
time is not counted as CPU time because no function called by the process is active. However, this idle
time is a real cost to the overall parallel execution and must be included in performance evaluation to
provide an accurate picture of the program's true runtime.
o Solution: Wall Clock Time: The reported time for parallel programs is almost always "wall clock
time" (also known as elapsed time or real time). This measures the total time that elapses from the
beginning to the end of the timed section, as if measured by a stopwatch.

A typical structure for measuring wall clock time is:

double start, finish;
// ...
start = Get_current_time();   // Hypothetical wall clock timer function
/* Code that we want to time */
// ...
finish = Get_current_time();
printf("The elapsed time = %e seconds\n", finish - start);

o Get_current_time() examples: Specific APIs provide such functions:
   ▪ MPI: MPI_Wtime()
   ▪ OpenMP: omp_get_wtime()
  Both return wall clock time.
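A concrete, compilable instance of the pattern above using the OpenMP timer (a sketch; compile with an OpenMP-enabled compiler, e.g., gcc -fopenmp). MPI_Wtime() is used in exactly the same way in MPI programs.

#include <stdio.h>
#include <omp.h>

int main(void) {
   double start, finish, sum = 0.0;

   start = omp_get_wtime();
   /* Code that we want to time */
   #pragma omp parallel for reduction(+: sum)
   for (long i = 1; i <= 100000000L; i++)
      sum += 1.0 / i;
   finish = omp_get_wtime();

   printf("sum = %f, elapsed time = %e seconds\n", sum, finish - start);
   return 0;
}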

Timer Resolution:

o Definition: Timer resolution is the smallest unit of measurement the timer can report (e.g., milliseconds,
microseconds, nanoseconds).
o Issue: If the resolution is too coarse (e.g., milliseconds) and the timed code section executes very
quickly (e.g., less than a millisecond), the timer might report zero or an inaccurate value.
o Verification: Many APIs provide functions to query the timer's resolution (e.g., MPI_Wtick() in MPI and omp_get_wtick() in OpenMP), or they specify a minimum required resolution. Programmers should check these values before trusting very short timings.

Timing Parallel Sections Carefully:

o When the code being timed is executed by multiple processes or threads, each will record its own
my_start and my_finish times, resulting in p individual elapsed times.
o However, what's usually desired is a single overall time: the duration from when the first
process/thread began the timed section to when the last process/thread finished it.
o Challenge: Due to potentially unsynchronized clocks across different nodes (in distributed memory),
obtaining this exact "global" start and end time is often difficult.
o Compromise Approach:

shared double global_elapsed;
private double my_start, my_finish, my_elapsed;
// ...
/* Synchronize all processes/threads */
Barrier();   // e.g., MPI_Barrier(), #pragma omp barrier
my_start = Get_current_time();
/* Code that we want to time */
// ...
my_finish = Get_current_time();
my_elapsed = my_finish - my_start;
/* Find the max across all processes/threads */
global_elapsed = Global_max(my_elapsed);   // e.g., MPI_Reduce(..., MPI_MAX, ...)
if (my_rank == 0) {   // or the master thread
   printf("The elapsed time = %e seconds\n", global_elapsed);
}

Barrier:

A Barrier() function ensures that all processes/threads reach a certain point before any can proceed past it.
While it doesn't guarantee simultaneous return, it provides a crucial synchronization point.

Global Maximum:

Each process/thread calculates its my_elapsed time. A global reduction operation (like Global_max or
MPI_Reduce with MPI_MAX) is then used to find the maximum of all these individual elapsed times.

This maximum represents the time it took for the slowest process/thread to complete its portion of the work,
effectively capturing the total time the parallel section was active. Process/thread 0 (or the master) typically
prints this value.
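For distributed-memory programs, a concrete MPI version of this compromise approach might look like the following sketch (assuming my_rank has already been obtained with MPI_Comm_rank):

double my_start, my_finish, my_elapsed, global_elapsed;

MPI_Barrier(MPI_COMM_WORLD);            /* synchronize before timing */
my_start = MPI_Wtime();
/* Code that we want to time */
my_finish = MPI_Wtime();
my_elapsed = my_finish - my_start;

/* The slowest process determines the elapsed time of the parallel section */
MPI_Reduce(&my_elapsed, &global_elapsed, 1, MPI_DOUBLE, MPI_MAX, 0,
           MPI_COMM_WORLD);
if (my_rank == 0)
   printf("The elapsed time = %e seconds\n", global_elapsed);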

Variability in Timings and Reporting Minimum Time:

o Running a program multiple times will almost certainly yield slightly different elapsed times, even with
identical inputs and systems.
o While reporting the mean or median might seem intuitive, it's generally unlikely for external factors to
make a program run faster than its best possible time. External interference (OS activity, network jitter,
etc.) almost always makes it run slower.
o Therefore, it's common practice to report the minimum time observed across multiple runs. This
minimum time is considered the most representative of the program's optimal performance under the
given conditions.

Threads per Core:

o Running more than one thread per physical core can significantly increase timing variability and
introduce extra overhead due to system scheduling and descheduling.
o For this reason, it's generally recommended to run no more than one thread per physical core when
aiming for optimal performance and consistent timings.

Excluding I/O from Reported Timings:

o Since most parallel programs (unless specifically designed for high-performance I/O) do not optimize
for I/O operations, it's common practice to exclude I/O time (reading inputs, printing final results) from
the reported performance timings. The focus remains on the core computation time.

GPU performance
GPU Performance Evaluation

While the general goal of parallel programming is increased performance, evaluating GPU program
performance differs from homogeneous MIMD systems due to the distinct architecture of GPUs.

Comparing GPU Performance

It's common to report speedups of GPU programs over serial CPU programs or parallel MIMD
programs. This gives a straightforward indication of the performance gain achieved by using the GPU.

However, certain metrics and concepts, typically applied to MIMD systems, are less meaningful or directly
applicable to GPUs:

Efficiency:

In MIMD systems, efficiency is often calculated by comparing parallel runtime to a serial runtime on a single
core of the same type used by the parallel system.

Since GPU cores are fundamentally different (inherently parallel, highly specialized) from conventional CPUs,
this type of direct comparison to a "single GPU core serial execution" doesn't make sense.

Therefore, efficiency is ordinarily not used in discussions of GPU performance in the same formal way.

Linear Speedup:

Similarly, talking about "linear speedup" of a GPU program relative to a serial CPU program is not directly
applicable.

A CPU and a GPU operate on vastly different architectural principles, so a direct scalar comparison of core
counts for "linear" scaling is inappropriate.

Scalability in GPUs

• The formal definition of scalability (maintaining constant efficiency by increasing problem size
proportionally to processor count) from MIMD programs cannot be directly applied to GPUs because the
concept of "efficiency" (as previously defined) doesn't directly translate.

• However, the informal usage of scalability is routinely applied to GPUs: a GPU program is considered
scalable if increasing the size or power of the GPU (e.g., more streaming multiprocessors, more memory
bandwidth) results in a proportional speedup for the program.
• This implies that the algorithm can effectively utilize more GPU resources when available.

Amdahl's Law and GPUs

Applicability: If the inherently serial part of a GPU program (the portion that cannot be parallelized and runs
on the CPU) is considered, then Amdahl's Law can indeed be applied to GPU programs.

Speedup Upper Bound:

The resulting upper bound on possible speedup will be the same as for MIMD programs: if a fraction r of the
original serial program remains unparallelized and runs on a conventional serial processor, the best possible
speedup for the combined CPU-GPU program will be less than 1/r.

Caveats: The same caveats that apply to Amdahl's Law in MIMD systems also hold true for GPUs:

o The "inherently serial" fraction (r) is likely dependent on the problem size. For larger problems, r
often decreases, leading to higher potential speedups.
o Despite the theoretical limits, many GPU programs achieve huge speedups in practice,
demonstrating their effectiveness for highly parallelizable workloads.
o Even a relatively small speedup (e.g., 5x or 10x) can be perfectly adequate for many applications,
making GPU acceleration worthwhile.

Taking Timings of GPU Programs

The basic principles of timing discussed for MIMD programs also apply to GPU programs: measuring wall
clock time for the relevant sections.

Overall GPU Program Performance:

Since a GPU program is typically initiated and concluded on a conventional CPU, for overall performance
measurement of the GPU-accelerated parts, you can usually just use the CPU's timer.

o Start the timer before initiating the GPU computation(s).
o Stop the timer after the GPU computation(s) have completed and results are returned to the host.
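A minimal host-side sketch of this, reusing the wall clock timing pattern from the MIMD discussion; Start_gpu_computation and Wait_for_gpu_and_copy_results are hypothetical placeholder functions, not part of any particular GPU API.

double start, finish;

start = Get_current_time();               /* CPU host wall clock timer,
                                             e.g. omp_get_wtime()            */
Start_gpu_computation();                  /* hypothetical: launch GPU work    */
Wait_for_gpu_and_copy_results();          /* hypothetical: block until results
                                             are back in host memory          */
finish = Get_current_time();

printf("GPU elapsed time = %e seconds\n", finish - start);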

More Complex Scenarios:

For more intricate setups (e.g., programs running on multiple CPU-GPU pairs), more sophisticated timing
mechanisms (potentially involving synchronization across CPU-GPU pairs) would be required.

However, for simpler cases, the CPU host timer is sufficient.

Timing Subsets of GPU Code: If you only want to time a specific portion of the code running directly on the
GPU, you will need to use timer functions provided by the GPU's programming API (e.g., CUDA events
for NVIDIA GPUs).

These allow for precise measurement of kernels and data transfers on the GPU device itself.
