1. GPU Programming Model
A Graphics Processing Unit (GPU) programming model is designed to manage and execute
computations on a highly parallel architecture. Unlike a CPU, which has a few powerful cores for
sequential tasks, a GPU has thousands of smaller, more efficient cores designed for handling multiple
tasks simultaneously. The most common programming model is CUDA (by NVIDIA), but the concepts
are similar for OpenCL (an open standard).
The model is based on a hierarchy of thread groups:
1. Grid: A grid is the highest-level grouping. It consists of all the threads launched to execute a
specific task, called a kernel. Think of the entire problem you want to solve.
2. Block: A grid is divided into multiple blocks. All threads within a block can cooperate by
sharing data through a fast, on-chip shared memory and can synchronize their execution.
Blocks are independent of each other and can be executed in any order or in parallel. Think
of a team working on a sub-problem.
3. Thread: A thread is the fundamental unit of execution. Each thread runs the same kernel
code but operates on different data, identified by its unique thread and block IDs. Think of an
individual worker on the team.
How it Works: The program is typically split into two parts:
• Host Code: Runs on the CPU. It handles serial parts of the program, manages memory on the
GPU, and launches kernels.
• Device Code (Kernel): Runs on the GPU. This is the parallel part of the program, executed by
thousands of threads.
The typical workflow is:
1. The CPU (host) allocates memory on the GPU (device).
2. The CPU copies the input data from its main memory (RAM) to the GPU's memory.
3. The CPU launches the kernel, specifying the number of blocks and threads per block (the grid
dimensions).
4. The thousands of GPU threads execute the kernel in parallel.
5. Once the kernel finishes, the CPU copies the results back from GPU memory to CPU memory.
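As a concrete, minimal sketch of this workflow in CUDA C (the kernel name square, the array size, and the launch configuration are illustrative assumptions; error checking is omitted):
#include <cuda_runtime.h>
#include <stdlib.h>

// Device code: each thread squares one element of the array
__global__ void square(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // unique global index
    if (i < n) data[i] = data[i] * data[i];
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h_data = (float *)malloc(bytes);              // host (CPU) buffer
    for (int i = 0; i < n; i++) h_data[i] = (float)i;

    float *d_data;
    cudaMalloc(&d_data, bytes);                          // 1. allocate GPU memory
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);   // 2. copy input to the GPU

    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    square<<<blocks, threadsPerBlock>>>(d_data, n);      // 3-4. launch kernel; threads run in parallel

    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);   // 5. copy results back

    cudaFree(d_data);
    free(h_data);
    return 0;
}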
This model is known as SIMT (Single Instruction, Multiple Thread). Groups of threads (called warps
in CUDA or wavefronts on AMD GPUs) execute the same instruction at the same time, but on different
pieces of data. This is extremely efficient for data-parallel problems like image processing, scientific
simulations, and machine learning.
2. GPU vs CPU MIMD
While both multi-core CPUs and GPUs can be broadly classified under the MIMD (Multiple
Instruction, Multiple Data) Flynn's taxonomy, they represent very different philosophies of parallel
architecture.
CPU: True MIMD (Task Parallelism) A modern multi-core CPU is a true MIMD machine. Each of its
cores is a powerful, independent processor.
• Large Caches: Each core has significant L1 and L2 cache to reduce memory latency.
• Complex Control Unit: Each core has sophisticated logic for branch prediction and out-of-
order execution to speed up a single instruction stream.
• Independent Execution: Each core can fetch, decode, and execute its own instructions from
completely different programs or threads. For example, Core 1 can run a web browser while
Core 2 runs a word processor. This is ideal for task parallelism, where you have different,
largely unrelated tasks to run.
GPU: SIMT, a subset of MIMD (Data Parallelism) A GPU's architecture is built for throughput, not
low latency. It is best described as SIMT (Single Instruction, Multiple Thread), which is a specialized
form of MIMD.
• Small Caches: Caches are smaller; the design hides memory latency by rapidly switching
between thousands of active threads, not by trying to avoid it.
• Simple Control Unit: Control logic is shared among a group of cores. Threads are bundled
into warps (e.g., 32 threads). All threads in a warp execute the same instruction at the same
time.
• Divergence Penalty: If threads within a warp need to take different paths (e.g., due to an if-
else statement), the warp executes both paths serially, with threads on the "inactive" path
being disabled. This is called branch divergence and can hurt performance.
• Dependent Execution: While different warps can execute different instructions (making it
MIMD), the fundamental parallelism comes from executing one instruction across lots of
data. This is ideal for data parallelism, where you apply the same operation to every element
of a large dataset.
Feature | CPU (True MIMD) | GPU (SIMT / MIMD)
Goal | Minimize latency | Maximize throughput
Core Design | Few, complex, powerful cores | Thousands of simple, efficient cores
Parallelism Type | Task parallelism (different tasks) | Data parallelism (same task on different data)
Control Logic | Independent per core | Shared among groups of cores (warps)
Memory | Large caches for fast access | Relies on massive parallelism to hide latency
Best For | General-purpose, sequential tasks, complex logic | Repetitive, parallelizable computations
In short, a CPU gives a few brilliant specialists full autonomy to work on complex, different problems.
A GPU gives a massive army of workers one simple instruction to carry out in unison on a huge
amount of material.
3. Amdahl's Law, Speedup, and Efficiency
These concepts are fundamental to understanding the limits of parallel computing.
Amdahl's Law
Amdahl's Law states that the maximum potential speedup of a program is limited by its sequential
part. Even if you have infinite processors, you can't make the part of the code that must run
sequentially any faster.
The formula for speedup is:
S(N) = 1 / ((1 - P) + P / N)
Where:
• S(N) is the theoretical speedup for N processors.
• P is the fraction of the program that is parallelizable (from 0 to 1).
• (1 - P) is the fraction of the program that is serial.
• N is the number of processors.
The key takeaway is that the serial fraction becomes the bottleneck. As N → ∞, the maximum speedup is limited to 1 / (1 - P). For example, if 10% of your code is serial (1 - P = 0.1), you can never get more than a 10x speedup, no matter how many processors you use.
Speedup
Speedup measures how much faster a parallel program runs compared to its sequential version.
S = T_1 / T_N
Where:
• T_1 is the execution time on a single processor.
• T_N is the execution time on N processors.
An ideal speedup is linear, meaning if you use N processors, the program runs N times faster. In reality, this is rarely achieved due to communication overhead and the serial portion of the code.
Efficiency
Efficiency measures how well the processors are being utilized. It's the observed speedup divided by the number of processors used.
E = S / N
An efficiency of 1 (or 100%) means perfect linear speedup. An efficiency of 0.5 (or 50%) means that, on average, the processors are being used effectively only half the time.
Numerical Example
Problem: Suppose a program takes 100 seconds to run on a single CPU. Profiling shows that 95% of
the execution time can be parallelized. What is the speedup and efficiency if we run it on an 8-core
CPU?
Solution:
1. Identify the variables:
o Sequential time T_1 = 100 seconds.
o Parallelizable fraction P = 0.95.
o Serial fraction 1 - P = 0.05.
o Number of processors N = 8.
2. Calculate Speedup using Amdahl's Law:
S(8) = 1 / (0.05 + 0.95/8) = 1 / (0.05 + 0.11875) = 1 / 0.16875 ≈ 5.93
The theoretical speedup is approximately 5.93x.
3. Calculate the new parallel time:
T_8 = T_1 / S(8) = 100 / 5.93 ≈ 16.9 seconds
4. Calculate Efficiency:
E = S(8) / N = 5.93 / 8 ≈ 0.741
The efficiency is about 74.1%. This means that due to the serial portion and overhead, we are not
getting the full 8x speedup; we are effectively using the power of ~5.9 cores out of the 8 available.
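The same arithmetic is easy to script for quick what-if analysis; a minimal C sketch (the values 0.95 and 8 simply mirror the example above):
#include <stdio.h>

int main(void) {
    double p = 0.95;   /* parallelizable fraction, from the example above */
    int n = 8;         /* number of processors */

    double speedup = 1.0 / ((1.0 - p) + p / n);   /* Amdahl's Law */
    double efficiency = speedup / n;

    printf("Speedup    = %.2fx\n", speedup);              /* prints about 5.93x */
    printf("Efficiency = %.1f%%\n", efficiency * 100.0);  /* prints about 74.1% */
    return 0;
}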
4. MPI Functions for Communication
MPI (Message Passing Interface) is a standard for communication between processes running in a
distributed memory system. Communication can be categorized into two main types.
Point-to-Point Communication
This involves a message being sent between two specific processes. It's like making a direct phone
call.
• int MPI_Send(void* data, int count, MPI_Datatype type, int dest, int tag, MPI_Comm comm)
o Sends a message from the calling process to a specified destination.
o data: Pointer to the data buffer to be sent.
o count: Number of elements in the buffer.
o type: Data type of the elements (e.g., MPI_INT, MPI_DOUBLE).
o dest: Rank of the destination process.
o tag: An integer to distinguish messages.
o comm: The communicator (group of processes), usually MPI_COMM_WORLD.
o Note: This is a blocking send; the function does not return until the send buffer can be safely reused.
• int MPI_Recv(void* data, int count, MPI_Datatype type, int source, int tag, MPI_Comm
comm, MPI_Status* status)
o Receives a message. It will wait (block) until a matching message arrives.
o data: Pointer to the buffer to store the received data.
o count: Maximum number of elements to receive.
o type: Data type of the elements.
o source: Rank of the source process (or MPI_ANY_SOURCE).
o tag: The expected message tag (or MPI_ANY_TAG).
o comm: The communicator.
o status: An object that provides information about the received message (actual
source, tag, and size).
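A minimal, self-contained sketch of blocking point-to-point communication between ranks 0 and 1 (the payload value and tag are arbitrary illustrative choices):
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int msg = 42;                                  /* arbitrary payload */
        MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int msg;
        MPI_Status status;
        MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("Rank 1 received %d from rank %d\n", msg, status.MPI_SOURCE);
    }

    MPI_Finalize();
    return 0;
}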
Collective Communication
This involves communication among all processes in a communicator group. It's like a group
announcement or a team meeting.
• int MPI_Bcast(void* data, int count, MPI_Datatype type, int root, MPI_Comm comm)
o Broadcast: Sends data from one process (the root) to all other processes in the
group. It's a one-to-all operation.
o The root process provides the data to be sent, and all other processes provide a
buffer to receive it.
• int MPI_Scatter(void* send_data, int send_count, MPI_Datatype send_type, void* recv_data,
int recv_count, MPI_Datatype recv_type, int root, MPI_Comm comm)
o Scatter: Distributes chunks of an array from a root process to all processes in the
group (including itself). For example, if the root has an array of 8 elements and there
are 4 processes, each process receives 2 elements. It's like dealing cards from a deck.
• int MPI_Gather(void* send_data, int send_count, MPI_Datatype send_type, void* recv_data,
int recv_count, MPI_Datatype recv_type, int root, MPI_Comm comm)
o Gather: The inverse of Scatter. It collects data from all processes in the group and
concatenates it into an array on the root process. It's like collecting reports from
every team member.
• int MPI_Reduce(void* send_data, void* recv_data, int count, MPI_Datatype type, MPI_Op
op, int root, MPI_Comm comm)
o Reduce: Similar to Gather, but it performs a reduction operation (e.g., sum, max,
min) on the data as it's being collected. For example, it can calculate the global sum
of a value held by each process. The final result is stored only on the root process.
o op: The operation to perform (e.g., MPI_SUM, MPI_MAX, MPI_PROD).
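A short sketch tying these collectives together: the root scatters an array, each process sums its chunk, and MPI_Reduce combines the partial sums on the root (the chunk size and data values are illustrative assumptions):
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int chunk = 2;                 /* elements per process (illustrative) */
    int *data = NULL;
    if (rank == 0) {                     /* only the root owns the full array */
        data = (int *)malloc(chunk * size * sizeof(int));
        for (int i = 0; i < chunk * size; i++) data[i] = i + 1;
    }

    int local[2];
    /* Deal 'chunk' elements to every process, like dealing cards from a deck */
    MPI_Scatter(data, chunk, MPI_INT, local, chunk, MPI_INT, 0, MPI_COMM_WORLD);

    int local_sum = 0;
    for (int i = 0; i < chunk; i++) local_sum += local[i];

    int global_sum = 0;
    /* Sum the partial results; only the root receives the final value */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("Global sum = %d\n", global_sum);
        free(data);
    }

    MPI_Finalize();
    return 0;
}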
5. Trapezoidal Rule for Numerical Integration
The Trapezoidal Rule is a numerical method used to find the approximate value of a definite integral,
which represents the area under a curve.
Explanation
The core idea is to divide the area under the curve of a function f(x) from x = a to x = b into a series of smaller trapezoids instead of rectangles (as in Riemann sums). The sum of the areas of these trapezoids gives an approximation of the total area. This is generally more accurate than using rectangles because the slanted top of the trapezoid follows the curve more closely.
The area of a single trapezoid with width h and parallel sides of height f(x_i) and f(x_{i+1}) is:
Area = (h / 2) * (f(x_i) + f(x_{i+1}))
To approximate the integral ∫[a, b] f(x) dx, we divide the interval [a, b] into n equal subintervals, each of width h = (b - a) / n. This creates n trapezoids.
The Composite Trapezoidal Rule formula is derived by summing the areas of all these trapezoids:
∫[a, b] f(x) dx ≈ h * [ f(x_0)/2 + f(x_1) + f(x_2) + ... + f(x_{n-1}) + f(x_n)/2 ]
Notice that the heights of all the interior points (x_1 to x_{n-1}) are counted twice because each is a side for two adjacent trapezoids. The first (x_0) and last (x_n) points are only counted once.
Pseudocode
Here is a pseudocode algorithm to implement the Trapezoidal Rule:
FUNCTION Trapezoidal_Integration(function f, limit_a, limit_b, num_trapezoids_n)
    // 1. Calculate the width of each trapezoid
    h = (limit_b - limit_a) / num_trapezoids_n
    // 2. Initialize the sum with the first and last terms
    //    These are the endpoints of the function
    sum = 0.5 * (f(limit_a) + f(limit_b))
    // 3. Loop through all the INTERIOR points
    FOR i FROM 1 TO num_trapezoids_n - 1
        x_i = limit_a + i * h
        sum = sum + f(x_i)
    END FOR
    // 4. Multiply the sum by the width h to get the final area
    integral_approximation = h * sum
    // 5. Return the result
    RETURN integral_approximation
END FUNCTION
This algorithm is efficient because it calculates the function value at each point only once and
correctly applies the formula's weighting (half for endpoints, full for interior points). The accuracy of
the approximation increases as num_trapezoids_n gets larger.
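For reference, a direct C translation of the pseudocode above; the integrand f(x) = x*x is only a placeholder, and any smooth function can be substituted:
#include <stdio.h>

static double f(double x) { return x * x; }   /* placeholder integrand */

double trapezoidal(double a, double b, int n) {
    double h = (b - a) / n;
    double sum = 0.5 * (f(a) + f(b));   /* endpoints weighted by 1/2 */
    for (int i = 1; i <= n - 1; i++)
        sum += f(a + i * h);            /* interior points weighted by 1 */
    return h * sum;
}

int main(void) {
    /* The exact integral of x^2 over [0, 1] is 1/3 */
    printf("%.6f\n", trapezoidal(0.0, 1.0, 1000));
    return 0;
}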
6. MPI I/O Operations and Challenges
In high-performance computing, I/O (Input/Output) is often a major bottleneck. When thousands of
parallel processes try to read or write data, coordinating access to the file system is a significant
challenge. MPI I/O (also known as Parallel I/O) is a part of the MPI standard designed to address this.
MPI I/O Operations
MPI I/O provides a parallel, high-performance interface to the file system. It treats a file as a shared
resource that all processes in a communicator can access.
1. Opening/Closing Files:
o MPI_File_open(): Collectively opens a file. All processes in the communicator must
call this function.
o MPI_File_close(): Collectively closes the file.
2. File Views:
o This is the most powerful concept in MPI I/O. A file view defines which parts of a file
are "visible" to a process. It allows a process to access non-contiguous data in the file
as if it were a single, contiguous block.
o MPI_File_set_view(): Sets the view for a process. This enables complex data layouts
(like accessing a sub-matrix from a large matrix stored on disk) to be handled
efficiently.
3. Data Access Routines: MPI I/O supports two modes of data access:
o Collective I/O (MPI_File_read_all, MPI_File_write_all): All processes in the
communicator must call these functions together. This is the preferred method
because it allows the MPI library to perform significant optimizations. The library can
merge many small, non-contiguous requests from different processes into a few
large, contiguous I/O operations, which is much more efficient for the underlying file
system.
o Non-collective I/O (MPI_File_read, MPI_File_write): Each process performs I/O
independently. This is simpler to program but is often much less performant, as it
can lead to file system contention.
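As a minimal illustration of collective I/O with a file view (the filename, block size, and layout are illustrative assumptions, not a prescribed pattern), each rank writes its own contiguous block of integers into one shared file:
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    enum { N = 100 };                     /* ints per rank (illustrative) */
    int buf[N];
    for (int i = 0; i < N; i++) buf[i] = rank * N + i;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* File view: rank r sees the file starting at its own byte offset */
    MPI_Offset disp = (MPI_Offset)rank * N * sizeof(int);
    MPI_File_set_view(fh, disp, MPI_INT, MPI_INT, "native", MPI_INFO_NULL);

    /* Collective write: all ranks call this together, so the library can
       merge the requests into a few large, contiguous operations */
    MPI_File_write_all(fh, buf, N, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}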
Challenges in Parallel I/O
1. File System Contention: When thousands of processes try to access a single file or a single
file server simultaneously, it creates a massive bottleneck. The storage system can become
overwhelmed, leading to poor performance. Collective I/O helps mitigate this by
coordinating access, but contention remains a primary challenge.
2. Small I/O Problem: Parallel applications often require each process to read or write many
small, non-contiguous chunks of data. File systems are optimized for large, sequential
transfers. A flood of small requests is extremely inefficient. MPI I/O's collective buffering and
data sieving are designed to solve this by grouping small requests into larger ones.
3. Managing Non-Contiguous Data: Scientific data is often multi-dimensional (e.g., a 3D grid). A
process might only "own" a small part of this grid, which is stored non-contiguously in a file.
Reading this data naively would require many separate seek and read operations. MPI file
views, using derived datatypes, are the solution, allowing a process to describe its non-
contiguous data layout and access it in a single operation.
4. Data Consistency and Coherency: Ensuring that all processes have a consistent view of the
file is crucial. If one process writes data, other processes must be able to read that updated
data correctly. MPI I/O provides different consistency semantics to manage this, but it adds
complexity for the programmer.
5. Metadata Overload: In some file systems, creating or accessing many small files can be slow
because of the overhead of managing metadata (filenames, permissions, locations). Writing
a single large, shared file is often better, but this leads back to the challenges of coordinated
access.
Expanded 20-Mark Answers
The six topics above are repeated below in expanded form, with detailed explanations, examples, formulas, conceptually described diagrams, and key points suitable for a 20-mark exam answer.
1. GPU Programming Model (20 Marks)
Introduction
A GPU (Graphics Processing Unit) is a highly parallel processor designed for handling multiple
operations simultaneously. The GPU programming model enables developers to use GPUs not only
for graphics rendering but also for general-purpose computing (GPGPU) such as machine learning,
simulations, and numerical computations.
Architecture Overview
• GPUs contain hundreds to thousands of smaller processing cores, grouped into Streaming
Multiprocessors (SMs).
• Each SM executes many threads in parallel using a model known as SIMT (Single Instruction,
Multiple Threads).
Key Components:
• Thread: Smallest unit of execution.
• Block: A group of threads that share memory and synchronize.
• Grid: Collection of blocks that execute the same kernel.
Memory Hierarchy
1. Registers – Private to each thread (fastest access).
2. Shared Memory – Shared by threads within a block.
3. Global Memory – Accessible to all threads (slower).
4. Constant & Texture Memory – Read-only cached memories for optimization.
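A small CUDA fragment showing where registers, shared memory, and global memory appear in practice; the kernel and block size are illustrative assumptions rather than part of a specific application:
#define BLOCK 256

// Each block stages a tile of global memory in fast on-chip shared memory,
// synchronizes, and then works on the tile (launch with BLOCK threads per block)
__global__ void scale_tile(const float *in, float *out, int n) {
    __shared__ float tile[BLOCK];                      // shared memory: one copy per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;     // 'i' lives in a register, private to the thread

    if (i < n) tile[threadIdx.x] = in[i];              // read from slow global memory once
    __syncthreads();                                   // wait until the whole tile is loaded

    if (i < n) out[i] = 2.0f * tile[threadIdx.x];      // compute from shared memory, write to global
}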
Programming Model (CUDA / OpenCL)
GPU programming models (like CUDA by NVIDIA or OpenCL) define how to launch and manage
kernels.
Example (CUDA):
__global__ void vectorAdd(int *a, int *b, int *c) {
    // Global index of this thread, built from its block and thread IDs;
    // a bounds check (i < n) is normally added when the array length
    // is not an exact multiple of the block size
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    c[i] = a[i] + b[i];
}
Launched using:
vectorAdd<<<gridSize, blockSize>>>(a, b, c);
Execution Model
• The GPU executes many threads in parallel, grouped into warps (usually 32 threads).
• Each warp executes the same instruction simultaneously (SIMT model).
• Divergence (if-else branches) causes some threads to be idle → reduced performance.
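A hypothetical kernel fragment that triggers branch divergence: even and odd lanes of the same warp take different branches, so the warp executes both paths one after the other with the inactive lanes masked off:
__global__ void divergent(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Threads within one warp disagree on this condition, so both branches
    // are executed serially, reducing effective throughput for this section
    if (threadIdx.x % 2 == 0)
        x[i] = x[i] * 2.0f;   // even lanes
    else
        x[i] = x[i] + 1.0f;   // odd lanes
}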
Advantages
• Extremely high throughput for data-parallel workloads.
• Efficient for repetitive mathematical operations.
• Cost-effective acceleration compared to CPU clusters.
Applications
• Deep learning model training (TensorFlow, PyTorch)
• Image and video processing
• Fluid dynamics, simulations, molecular modeling
• Real-time rendering in games and visualization
Conclusion
The GPU programming model provides a framework for efficiently utilizing thousands of lightweight
threads for parallel workloads. It transforms traditional graphics processors into general-purpose
high-performance computing units.
2. GPU vs CPU (MIMD) (20 Marks)
Introduction
Both CPUs and GPUs are parallel processing devices, but they differ significantly in architecture,
execution model, and optimization focus.
While CPUs are designed for MIMD (Multiple Instruction, Multiple Data) processing, GPUs use
SIMD/SIMT to achieve massive parallelism.
Architecture Comparison
Feature | CPU (MIMD) | GPU (SIMD/SIMT)
Core Count | Few (2–64) powerful cores | Thousands of simpler cores
Execution Model | MIMD – each core executes a different instruction on different data | SIMT/SIMD – many threads execute the same instruction on multiple data
Memory Hierarchy | Deep cache hierarchy (L1, L2, L3) | Large global memory, smaller cache
Thread Management | Heavyweight threads | Lightweight threads
Performance Focus | Low latency (faster single-task) | High throughput (many tasks at once)
Control Flow | Complex, branch-heavy tasks | Simple, uniform control flow
Best Suited For | OS tasks, compilers, databases | Graphics, AI, ML, simulations
Example
• CPU (MIMD):
Each processor executes different tasks — e.g.,
o P1: sorts an array
o P2: performs database search
o P3: compresses a file
• GPU (SIMD/SIMT):
All threads execute the same task on different data — e.g.,
o Thread 1: Add pixel 1
o Thread 2: Add pixel 2
o Thread 3: Add pixel 3
Illustration
CPU (MIMD): Different Instructions → Different Data
GPU (SIMD): Same Instruction → Many Data
Conclusion
CPUs are flexible, optimized for sequential workloads, and handle complex logic efficiently.
GPUs, on the other hand, are specialized for parallel workloads, providing far greater throughput for
data-intensive computations.
3. Amdahl’s Law, Speedup, and Efficiency (20 Marks)
Introduction
In parallel computing, the goal is to execute programs faster using multiple processors.
Amdahl’s Law helps in estimating the maximum improvement possible through parallelization.
Amdahl’s Law
S = 1 / ((1 - P) + P / N)
Where:
• S = overall speedup
• P = parallel portion
• (1 - P) = sequential portion
• N = number of processors
Interpretation
Even if 90% of the code is parallelizable, the 10% sequential portion limits the total speedup —
showing diminishing returns as N increases.
Speedup
S = T_serial / T_parallel
Shows how many times faster a parallel program runs compared to the serial version.
Efficiency
E = S / N
Represents how effectively each processor is used.
Numerical Example
Let P = 0.9 and N = 8.
S = 1 / ((1 - 0.9) + 0.9/8) = 1 / (0.1 + 0.1125) = 1 / 0.2125 ≈ 4.71
E = 4.71 / 8 ≈ 0.588 = 58.8%
→ Even with 8 processors, efficiency drops because of the sequential portion.
Graph (Conceptual)
Speedup vs Processors curve — initially steep, then flattens.
Conclusion
Amdahl’s Law demonstrates that the serial portion of a program limits total speedup. Efficient
parallel computing thus depends on minimizing sequential parts and balancing processor workloads.
4. MPI Functions for Point-to-Point and Collective Communication (20 Marks)
Introduction
MPI (Message Passing Interface) is the standard for communication in distributed-memory parallel
systems. It provides a rich set of functions for data exchange, synchronization, and coordination
among processes.
A. Point-to-Point Communication
Used when one process directly sends data to another.
Key Functions:
• MPI_Send(): Send data to another process
• MPI_Recv(): Receive data from another process
Example:
if (rank == 0) {
    int msg = 10;
    MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
} else if (rank == 1) {
    int msg;
    MPI_Status status;
    MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
}
Modes of Send:
• Standard (MPI_Send)
• Buffered (MPI_Bsend)
• Synchronous (MPI_Ssend)
• Ready (MPI_Rsend)
B. Collective Communication
Involves all processes in a communicator.
Function Description
MPI_Bcast() One-to-all broadcast
MPI_Scatter() Distribute chunks from one process to all
MPI_Gather() Collect data from all processes
MPI_Reduce() Apply an operation (sum, max, etc.) and return result to root
MPI_Allreduce() Reduce + broadcast result to all
Example:
MPI_Reduce(&local_sum, &total_sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
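If every process needs the final value rather than only the root, MPI_Allreduce (listed above) combines the reduction with a broadcast, e.g.:
MPI_Allreduce(&local_sum, &total_sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);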
Advantages:
• Efficient inter-process communication
• Portable and scalable across platforms
• Synchronization and load distribution made easy
Conclusion
MPI provides a powerful and flexible way to implement both point-to-point and collective
communication, enabling scalable parallel applications across distributed systems.
5. Trapezoidal Rule for Numerical Integration (20 Marks)
Introduction
The trapezoidal rule is a numerical method for estimating the area under a curve (integration). It
divides the area into trapezoids rather than rectangles for better accuracy.
Formula
∫[a, b] f(x) dx ≈ (h / 2) [ f(x_0) + 2 f(x_1) + 2 f(x_2) + ... + 2 f(x_{n-1}) + f(x_n) ], where h = (b - a) / n
Algorithm Steps
1. Divide [a, b] into n subintervals.
2. Compute the width h = (b - a) / n.
3. Evaluate f(x_i) for all i.
4. Apply the trapezoidal formula.
Pseudo-code:
Input: a, b, n
h = (b - a) / n
sum = f(a) + f(b)
for i = 1 to n-1:
    x = a + i * h
    sum += 2 * f(x)
I = (h / 2) * sum
print(I)
Parallel Implementation (MPI Idea):
Each process computes part of the summation and sends its partial result to the root process.
local_sum = compute_partial(a_local, b_local);
MPI_Reduce(&local_sum, &total_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
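Expanding the compute_partial idea into a complete sketch (the integrand, interval, and trapezoid count are illustrative assumptions, and n is assumed divisible by the number of processes):
#include <mpi.h>
#include <stdio.h>

static double f(double x) { return x * x; }   /* illustrative integrand */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double a = 0.0, b = 1.0;              /* illustrative interval */
    int n = 1000000;                      /* total trapezoids; assumed divisible by size */
    double h = (b - a) / n;

    /* Each rank integrates its own slice of [a, b] */
    int local_n = n / size;
    double local_a = a + rank * local_n * h;
    double local_b = local_a + local_n * h;

    double local_sum = 0.5 * (f(local_a) + f(local_b));
    for (int i = 1; i < local_n; i++)
        local_sum += f(local_a + i * h);
    local_sum *= h;

    double total_sum = 0.0;
    MPI_Reduce(&local_sum, &total_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("Integral = %.6f\n", total_sum);

    MPI_Finalize();
    return 0;
}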
Advantages
• Simple and stable
• Works well for smooth functions
• Easy to parallelize
Limitations
• Less accurate for functions with high curvature
• Requires uniform spacing
Applications
Used in numerical analysis, scientific computing, and engineering simulations for approximating
integrals.
6. MPI I/O Operations and Challenges (20 Marks)
Introduction
MPI I/O allows multiple processes to read and write files in parallel. It’s part of the MPI-2 standard
and essential for large-scale data-driven applications.
MPI I/O Functions
Function Description
MPI_File_open() Opens a file collectively
MPI_File_read() / MPI_File_write() Read/write operations
MPI_File_seek() Move file pointer
MPI_File_close() Close file
MPI_File_set_view() Define how file data is mapped for each process
Example:
MPI_File fh;
MPI_File_open(MPI_COMM_WORLD, "data.txt", MPI_MODE_CREATE|MPI_MODE_WRONLY,
MPI_INFO_NULL, &fh);
MPI_File_write(fh, buffer, count, MPI_INT, MPI_STATUS_IGNORE);
MPI_File_close(&fh);
Parallel File Access Models
1. Independent I/O: Each process performs its own I/O.
2. Collective I/O: All processes cooperate to perform I/O efficiently.
Advantages
• Allows simultaneous file access
• Reduces bottlenecks in large systems
• Improves I/O performance through striping
Challenges
• Synchronization: Ensuring correct order of file access
• Consistency: Avoiding overwrites or data corruption
• Performance tuning: File striping, buffering, caching
• Portability: Different file systems and cluster environments
Applications
• High-performance simulations
• Big data processing
• Machine learning training datasets
Conclusion
MPI I/O provides an efficient and scalable method for managing parallel data access. Despite
synchronization and consistency challenges, it’s essential for large-scale scientific and engineering
computations.