1. GPU Programming Model
A Graphics Processing Unit (GPU) programming model is designed to manage and execute
computations on a highly parallel architecture. Unlike a CPU, which has a few powerful cores for
sequential tasks, a GPU has thousands of smaller, more efficient cores designed for handling multiple
tasks simultaneously. The most common programming model is CUDA (by NVIDIA), but the concepts
are similar for OpenCL (an open standard).
The model is based on a hierarchy of thread groups:
1. Grid: A grid is the highest-level grouping. It consists of all the threads launched to execute a
specific task, called a kernel. Think of the entire problem you want to solve.
2. Block: A grid is divided into multiple blocks. All threads within a block can cooperate by
sharing data through a fast, on-chip shared memory and can synchronize their execution.
Blocks are independent of each other and can be executed in any order or in parallel. Think
of a team working on a sub-problem.
3. Thread: A thread is the fundamental unit of execution. Each thread runs the same kernel
code but operates on different data, identified by its unique thread and block IDs. Think of an
individual worker on the team.
How it Works: The program is typically split into two parts:
• Host Code: Runs on the CPU. It handles serial parts of the program, manages memory on the
GPU, and launches kernels.
• Device Code (Kernel): Runs on the GPU. This is the parallel part of the program, executed by
thousands of threads.
The typical workflow is:
1. The CPU (host) allocates memory on the GPU (device).
2. The CPU copies the input data from its main memory (RAM) to the GPU's memory.
3. The CPU launches the kernel, specifying the number of blocks and threads per block (the grid
dimensions).
4. The thousands of GPU threads execute the kernel in parallel.
5. Once the kernel finishes, the CPU copies the results back from GPU memory to CPU memory.
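As a concrete, minimal sketch of this workflow in CUDA C (the kernel name square, the array size, and the launch configuration are illustrative assumptions; error checking is omitted):
#include <cuda_runtime.h>
#include <stdlib.h>

// Device code: each thread squares one element of the array
__global__ void square(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // unique global index
    if (i < n) data[i] = data[i] * data[i];
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h_data = (float *)malloc(bytes);              // host (CPU) buffer
    for (int i = 0; i < n; i++) h_data[i] = (float)i;

    float *d_data;
    cudaMalloc(&d_data, bytes);                          // 1. allocate GPU memory
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);   // 2. copy input to the GPU

    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    square<<<blocks, threadsPerBlock>>>(d_data, n);      // 3-4. launch kernel; threads run in parallel

    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);   // 5. copy results back

    cudaFree(d_data);
    free(h_data);
    return 0;
}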
This model is known as SIMT (Single Instruction, Multiple Thread). Groups of threads (called warps
in CUDA or wavefronts on AMD GPUs) execute the same instruction at the same time, but on different
pieces of data. This is extremely efficient for data-parallel problems like image processing, scientific
simulations, and machine learning.
2. GPU vs CPU MIMD
While both multi-core CPUs and GPUs can be broadly classified under the MIMD (Multiple
Instruction, Multiple Data) Flynn's taxonomy, they represent very different philosophies of parallel
architecture.
CPU: True MIMD (Task Parallelism) A modern multi-core CPU is a true MIMD machine. Each of its
cores is a powerful, independent processor.
• Large Caches: Each core has significant L1 and L2 cache to reduce memory latency.
• Complex Control Unit: Each core has sophisticated logic for branch prediction and out-of-
order execution to speed up a single instruction stream.
• Independent Execution: Each core can fetch, decode, and execute its own instructions from
completely different programs or threads. For example, Core 1 can run a web browser while
Core 2 runs a word processor. This is ideal for task parallelism, where you have different,
largely unrelated tasks to run.
GPU: SIMT, a subset of MIMD (Data Parallelism) A GPU's architecture is built for throughput, not
low latency. It is best described as SIMT (Single Instruction, Multiple Thread), which is a specialized
form of MIMD.
• Small Caches: Caches are smaller; the design hides memory latency by rapidly switching
between thousands of active threads, not by trying to avoid it.
• Simple Control Unit: Control logic is shared among a group of cores. Threads are bundled
into warps (e.g., 32 threads). All threads in a warp execute the same instruction at the same
time.
• Divergence Penalty: If threads within a warp need to take different paths (e.g., due to an if-
else statement), the warp executes both paths serially, with threads on the "inactive" path
being disabled. This is called branch divergence and can hurt performance.
• Dependent Execution: While different warps can execute different instructions (making it
MIMD), the fundamental parallelism comes from executing one instruction across lots of
data. This is ideal for data parallelism, where you apply the same operation to every element
of a large dataset.
Feature | CPU (True MIMD) | GPU (SIMT / MIMD)
Goal | Minimize latency | Maximize throughput
Core Design | Few, complex, powerful cores | Thousands of simple, efficient cores
Parallelism Type | Task parallelism (different tasks) | Data parallelism (same task on different data)
Control Logic | Independent per core | Shared among groups of cores (warps)
Memory | Large caches for fast access | Relies on massive parallelism to hide latency
Best For | General-purpose, sequential tasks, complex logic | Repetitive, parallelizable computations
In short, a CPU gives a few brilliant specialists full autonomy to work on complex, different problems.
A GPU gives a massive army of workers one simple instruction to carry out in unison on a huge
amount of material.
3. Amdahl's Law, Speedup, and Efficiency
These concepts are fundamental to understanding the limits of parallel computing.
Amdahl's Law
Amdahl's Law states that the maximum potential speedup of a program is limited by its sequential
part. Even if you have infinite processors, you can't make the part of the code that must run
sequentially any faster.
The formula for speedup is:
S(N) = 1 / ((1 - P) + P / N)
Where:
• S(N) is the theoretical speedup for N processors.
• P is the fraction of the program that is parallelizable (from 0 to 1).
• (1 - P) is the fraction of the program that is serial.
• N is the number of processors.
The key takeaway is that the serial fraction becomes the bottleneck. As N → ∞, the maximum speedup is limited to 1 / (1 - P). For example, if 10% of your code is serial (1 - P = 0.1), you can never get more than a 10x speedup, no matter how many processors you use.
Speedup
Speedup measures how much faster a parallel program runs compared to its sequential version.
S = T_1 / T_N
Where:
• T_1 is the execution time on a single processor.
• T_N is the execution time on N processors.
An ideal speedup is linear, meaning if you use N processors, the program runs N times faster. In reality, this is rarely achieved due to communication overhead and the serial portion of the code.
Efficiency
Efficiency measures how well the processors are being utilized. It's the observed speedup divided by the number of processors used.
E = S / N
An efficiency of 1 (or 100%) means perfect linear speedup. An efficiency of 0.5 (or 50%) means that, on average, the processors are being used effectively only half the time.
Numerical Example
Problem: Suppose a program takes 100 seconds to run on a single CPU. Profiling shows that 95% of
the execution time can be parallelized. What is the speedup and efficiency if we run it on an 8-core
CPU?
Solution:
1. Identify the variables:
o Sequential time T_1 = 100 seconds.
o Parallelizable fraction P = 0.95.
o Serial fraction 1 - P = 0.05.
o Number of processors N = 8.
2. Calculate Speedup using Amdahl's Law:
S(8) = 1 / (0.05 + 0.95/8) = 1 / (0.05 + 0.11875) = 1 / 0.16875 ≈ 5.93
The theoretical speedup is approximately 5.93x.
3. Calculate the new parallel time:
T_8 = T_1 / S(8) = 100 / 5.93 ≈ 16.9 seconds
4. Calculate Efficiency:
E = S(8) / N = 5.93 / 8 ≈ 0.741
The efficiency is about 74.1%. This means that due to the serial portion and overhead, we are not
getting the full 8x speedup; we are effectively using the power of ~5.9 cores out of the 8 available.
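The same arithmetic is easy to script for quick what-if analysis; a minimal C sketch (the values 0.95 and 8 simply mirror the example above):
#include <stdio.h>

int main(void) {
    double p = 0.95;   /* parallelizable fraction, from the example above */
    int n = 8;         /* number of processors */

    double speedup = 1.0 / ((1.0 - p) + p / n);   /* Amdahl's Law */
    double efficiency = speedup / n;

    printf("Speedup    = %.2fx\n", speedup);              /* prints about 5.93x */
    printf("Efficiency = %.1f%%\n", efficiency * 100.0);  /* prints about 74.1% */
    return 0;
}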
4. MPI Functions for Communication
MPI (Message Passing Interface) is a standard for communication between processes running in a
distributed memory system. Communication can be categorized into two main types.
Point-to-Point Communication
This involves a message being sent between two specific processes. It's like making a direct phone
call.
• int MPI_Send(void* data, int count, MPI_Datatype type, int dest, int tag, MPI_Comm comm)
o Sends a message from the calling process to a specified destination.
o data: Pointer to the data buffer to be sent.
o count: Number of elements in the buffer.
o type: Data type of the elements (e.g., MPI_INT, MPI_DOUBLE).
o dest: Rank of the destination process.
o tag: An integer to distinguish messages.
o comm: The communicator (group of processes), usually MPI_COMM_WORLD.
o Note: This is a blocking send; the function does not return until the send buffer can be safely reused.
• int MPI_Recv(void* data, int count, MPI_Datatype type, int source, int tag, MPI_Comm
comm, MPI_Status* status)
o Receives a message. It will wait (block) until a matching message arrives.
o data: Pointer to the buffer to store the received data.
o count: Maximum number of elements to receive.
o type: Data type of the elements.
o source: Rank of the source process (or MPI_ANY_SOURCE).
o tag: The expected message tag (or MPI_ANY_TAG).
o comm: The communicator.
o status: An object that provides information about the received message (actual
source, tag, and size).
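A minimal, self-contained sketch of blocking point-to-point communication between ranks 0 and 1 (the payload value and tag are arbitrary illustrative choices):
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int msg = 42;                                  /* arbitrary payload */
        MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int msg;
        MPI_Status status;
        MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("Rank 1 received %d from rank %d\n", msg, status.MPI_SOURCE);
    }

    MPI_Finalize();
    return 0;
}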
Collective Communication
This involves communication among all processes in a communicator group. It's like a group
announcement or a team meeting.
• int MPI_Bcast(void* data, int count, MPI_Datatype type, int root, MPI_Comm comm)
o Broadcast: Sends data from one process (the root) to all other processes in the
group. It's a one-to-all operation.
o The root process provides the data to be sent, and all other processes provide a
buffer to receive it.
• int MPI_Scatter(void* send_data, int send_count, MPI_Datatype send_type, void* recv_data,
int recv_count, MPI_Datatype recv_type, int root, MPI_Comm comm)
o Scatter: Distributes chunks of an array from a root process to all processes in the
group (including itself). For example, if the root has an array of 8 elements and there
are 4 processes, each process receives 2 elements. It's like dealing cards from a deck.
• int MPI_Gather(void* send_data, int send_count, MPI_Datatype send_type, void* recv_data,
int recv_count, MPI_Datatype recv_type, int root, MPI_Comm comm)
o Gather: The inverse of Scatter. It collects data from all processes in the group and
concatenates it into an array on the root process. It's like collecting reports from
every team member.
• int MPI_Reduce(void* send_data, void* recv_data, int count, MPI_Datatype type, MPI_Op
op, int root, MPI_Comm comm)
o Reduce: Similar to Gather, but it performs a reduction operation (e.g., sum, max,
min) on the data as it's being collected. For example, it can calculate the global sum
of a value held by each process. The final result is stored only on the root process.
o op: The operation to perform (e.g., MPI_SUM, MPI_MAX, MPI_PROD).
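A short sketch tying these collectives together: the root scatters an array, each process sums its chunk, and MPI_Reduce combines the partial sums on the root (the chunk size and data values are illustrative assumptions):
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int chunk = 2;                 /* elements per process (illustrative) */
    int *data = NULL;
    if (rank == 0) {                     /* only the root owns the full array */
        data = (int *)malloc(chunk * size * sizeof(int));
        for (int i = 0; i < chunk * size; i++) data[i] = i + 1;
    }

    int local[2];
    /* Deal 'chunk' elements to every process, like dealing cards from a deck */
    MPI_Scatter(data, chunk, MPI_INT, local, chunk, MPI_INT, 0, MPI_COMM_WORLD);

    int local_sum = 0;
    for (int i = 0; i < chunk; i++) local_sum += local[i];

    int global_sum = 0;
    /* Sum the partial results; only the root receives the final value */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("Global sum = %d\n", global_sum);
        free(data);
    }

    MPI_Finalize();
    return 0;
}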
5. Trapezoidal Rule for Numerical Integration
The Trapezoidal Rule is a numerical method used to find the approximate value of a definite integral,
which represents the area under a curve.
Explanation
The core idea is to divide the area under the curve of a function f(x) from x = a to x = b into a series of smaller trapezoids instead of rectangles (as in Riemann sums). The sum of the areas of these trapezoids gives an approximation of the total area. This is generally more accurate than using rectangles because the slanted top of the trapezoid follows the curve more closely.
The area of a single trapezoid with width h and parallel sides of height f(x_i) and f(x_{i+1}) is:
Area = (h / 2) * (f(x_i) + f(x_{i+1}))
To approximate the integral ∫[a, b] f(x) dx, we divide the interval [a, b] into n equal subintervals, each of width h = (b - a) / n. This creates n trapezoids.
The Composite Trapezoidal Rule formula is derived by summing the areas of all these trapezoids:
∫[a, b] f(x) dx ≈ h * [ f(x_0)/2 + f(x_1) + f(x_2) + ... + f(x_{n-1}) + f(x_n)/2 ]
Notice that the heights of all the interior points (x_1 to x_{n-1}) are counted twice because each is a side for two adjacent trapezoids. The first (x_0) and last (x_n) points are only counted once.
Pseudocode
Here is a pseudocode algorithm to implement the Trapezoidal Rule:
FUNCTION Trapezoidal_Integration(function f, limit_a, limit_b, num_trapezoids_n)
    // 1. Calculate the width of each trapezoid
    h = (limit_b - limit_a) / num_trapezoids_n
    // 2. Initialize the sum with the first and last terms
    //    These are the endpoints of the function
    sum = 0.5 * (f(limit_a) + f(limit_b))
    // 3. Loop through all the INTERIOR points
    FOR i FROM 1 TO num_trapezoids_n - 1
        x_i = limit_a + i * h
        sum = sum + f(x_i)
    END FOR
    // 4. Multiply the sum by the width h to get the final area
    integral_approximation = h * sum
    // 5. Return the result
    RETURN integral_approximation
END FUNCTION
This algorithm is efficient because it calculates the function value at each point only once and
correctly applies the formula's weighting (half for endpoints, full for interior points). The accuracy of
the approximation increases as num_trapezoids_n gets larger.
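For reference, a direct C translation of the pseudocode above; the integrand f(x) = x*x is only a placeholder, and any smooth function can be substituted:
#include <stdio.h>

static double f(double x) { return x * x; }   /* placeholder integrand */

double trapezoidal(double a, double b, int n) {
    double h = (b - a) / n;
    double sum = 0.5 * (f(a) + f(b));   /* endpoints weighted by 1/2 */
    for (int i = 1; i <= n - 1; i++)
        sum += f(a + i * h);            /* interior points weighted by 1 */
    return h * sum;
}

int main(void) {
    /* The exact integral of x^2 over [0, 1] is 1/3 */
    printf("%.6f\n", trapezoidal(0.0, 1.0, 1000));
    return 0;
}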
6. MPI I/O Operations and Challenges
In high-performance computing, I/O (Input/Output) is often a major bottleneck. When thousands of
parallel processes try to read or write data, coordinating access to the file system is a significant
challenge. MPI I/O (also known as Parallel I/O) is a part of the MPI standard designed to address this.
MPI I/O Operations
MPI I/O provides a parallel, high-performance interface to the file system. It treats a file as a shared
resource that all processes in a communicator can access.
1. Opening/Closing Files:
o MPI_File_open(): Collectively opens a file. All processes in the communicator must
call this function.
o MPI_File_close(): Collectively closes the file.
2. File Views:
o This is the most powerful concept in MPI I/O. A file view defines which parts of a file
are "visible" to a process. It allows a process to access non-contiguous data in the file
as if it were a single, contiguous block.
o MPI_File_set_view(): Sets the view for a process. This enables complex data layouts
(like accessing a sub-matrix from a large matrix stored on disk) to be handled
efficiently.
3. Data Access Routines: MPI I/O supports two modes of data access:
o Collective I/O (MPI_File_read_all, MPI_File_write_all): All processes in the
communicator must call these functions together. This is the preferred method
because it allows the MPI library to perform significant optimizations. The library can
merge many small, non-contiguous requests from different processes into a few
large, contiguous I/O operations, which is much more efficient for the underlying file
system.
o Non-collective I/O (MPI_File_read, MPI_File_write): Each process performs I/O
independently. This is simpler to program but is often much less performant, as it
can lead to file system contention.
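As a minimal illustration of collective I/O with a file view (the filename, block size, and layout are illustrative assumptions, not a prescribed pattern), each rank writes its own contiguous block of integers into one shared file:
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    enum { N = 100 };                     /* ints per rank (illustrative) */
    int buf[N];
    for (int i = 0; i < N; i++) buf[i] = rank * N + i;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* File view: rank r sees the file starting at its own byte offset */
    MPI_Offset disp = (MPI_Offset)rank * N * sizeof(int);
    MPI_File_set_view(fh, disp, MPI_INT, MPI_INT, "native", MPI_INFO_NULL);

    /* Collective write: all ranks call this together, so the library can
       merge the requests into a few large, contiguous operations */
    MPI_File_write_all(fh, buf, N, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}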
Challenges in Parallel I/O
1. File System Contention: When thousands of processes try to access a single file or a single
file server simultaneously, it creates a massive bottleneck. The storage system can become
overwhelmed, leading to poor performance. Collective I/O helps mitigate this by
coordinating access, but contention remains a primary challenge.
2. Small I/O Problem: Parallel applications often require each process to read or write many
small, non-contiguous chunks of data. File systems are optimized for large, sequential
transfers. A flood of small requests is extremely inefficient. MPI I/O's collective buffering and
data sieving are designed to solve this by grouping small requests into larger ones.
3. Managing Non-Contiguous Data: Scientific data is often multi-dimensional (e.g., a 3D grid). A
process might only "own" a small part of this grid, which is stored non-contiguously in a file.
Reading this data naively would require many separate seek and read operations. MPI file
views, using derived datatypes, are the solution, allowing a process to describe its non-
contiguous data layout and access it in a single operation.
4. Data Consistency and Coherency: Ensuring that all processes have a consistent view of the
file is crucial. If one process writes data, other processes must be able to read that updated
data correctly. MPI I/O provides different consistency semantics to manage this, but it adds
complexity for the programmer.
5. Metadata Overload: In some file systems, creating or accessing many small files can be slow
because of the overhead of managing metadata (filenames, permissions, locations). Writing
a single large, shared file is often better, but this leads back to the challenges of coordinated
access.
Expanded 20-Mark Answers
The six topics above are repeated below in expanded form, with detailed explanations, examples, formulas, conceptually described diagrams, and key points suitable for a 20-mark exam answer.
1. GPU Programming Model (20 Marks)
Introduction
A GPU (Graphics Processing Unit) is a highly parallel processor designed for handling multiple
operations simultaneously. The GPU programming model enables developers to use GPUs not only
for graphics rendering but also for general-purpose computing (GPGPU) such as machine learning,
simulations, and numerical computations.
Architecture Overview
• GPUs contain hundreds to thousands of smaller processing cores, grouped into Streaming
Multiprocessors (SMs).
• Each SM executes many threads in parallel using a model known as SIMT (Single Instruction,
Multiple Threads).
Key Components:
• Thread: Smallest unit of execution.
• Block: A group of threads that share memory and synchronize.
• Grid: Collection of blocks that execute the same kernel.
Memory Hierarchy
1. Registers – Private to each thread (fastest access).
2. Shared Memory – Shared by threads within a block.
3. Global Memory – Accessible to all threads (slower).
4. Constant & Texture Memory – Read-only cached memories for optimization.
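A small CUDA fragment showing where registers, shared memory, and global memory appear in practice; the kernel and block size are illustrative assumptions rather than part of a specific application:
#define BLOCK 256

// Each block stages a tile of global memory in fast on-chip shared memory,
// synchronizes, and then works on the tile (launch with BLOCK threads per block)
__global__ void scale_tile(const float *in, float *out, int n) {
    __shared__ float tile[BLOCK];                      // shared memory: one copy per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;     // 'i' lives in a register, private to the thread

    if (i < n) tile[threadIdx.x] = in[i];              // read from slow global memory once
    __syncthreads();                                   // wait until the whole tile is loaded

    if (i < n) out[i] = 2.0f * tile[threadIdx.x];      // compute from shared memory, write to global
}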
Programming Model (CUDA / OpenCL)
GPU programming models (like CUDA by NVIDIA or OpenCL) define how to launch and manage
kernels.
Example (CUDA):
__global__ void vectorAdd(int *a, int *b, int *c) {
    // Global index of this thread, built from its block and thread IDs;
    // a bounds check (i < n) is normally added when the array length
    // is not an exact multiple of the block size
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    c[i] = a[i] + b[i];
}
Launched using:
vectorAdd<<<gridSize, blockSize>>>(a, b, c);
Execution Model
• The GPU executes many threads in parallel, grouped into warps (usually 32 threads).
• Each warp executes the same instruction simultaneously (SIMT model).
• Divergence (if-else branches) causes some threads to be idle → reduced performance.
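A hypothetical kernel fragment that triggers branch divergence: even and odd lanes of the same warp take different branches, so the warp executes both paths one after the other with the inactive lanes masked off:
__global__ void divergent(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Threads within one warp disagree on this condition, so both branches
    // are executed serially, reducing effective throughput for this section
    if (threadIdx.x % 2 == 0)
        x[i] = x[i] * 2.0f;   // even lanes
    else
        x[i] = x[i] + 1.0f;   // odd lanes
}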
Advantages
• Extremely high throughput for data-parallel workloads.
• Efficient for repetitive mathematical operations.
• Cost-effective acceleration compared to CPU clusters.
Applications
• Deep learning model training (TensorFlow, PyTorch)
• Image and video processing
• Fluid dynamics, simulations, molecular modeling
• Real-time rendering in games and visualization
Conclusion
The GPU programming model provides a framework for efficiently utilizing thousands of lightweight
threads for parallel workloads. It transforms traditional graphics processors into general-purpose
high-performance computing units.
2. GPU vs CPU (MIMD) (20 Marks)
Introduction
Both CPUs and GPUs are parallel processing devices, but they differ significantly in architecture,
execution model, and optimization focus.
While CPUs are designed for MIMD (Multiple Instruction, Multiple Data) processing, GPUs use
SIMD/SIMT to achieve massive parallelism.
Architecture Comparison
Feature | CPU (MIMD) | GPU (SIMD/SIMT)
Core Count | Few (2–64) powerful cores | Thousands of simpler cores
Execution Model | MIMD – each core executes a different instruction on different data | SIMT/SIMD – many threads execute the same instruction on multiple data
Memory Hierarchy | Deep cache hierarchy (L1, L2, L3) | Large global memory, smaller cache
Thread Management | Heavyweight threads | Lightweight threads
Performance Focus | Low latency (faster single-task) | High throughput (many tasks at once)
Control Flow | Complex, branch-heavy tasks | Simple, uniform control flow
Best Suited For | OS tasks, compilers, databases | Graphics, AI, ML, simulations
Example
• CPU (MIMD):
Each processor executes different tasks — e.g.,
o P1: sorts an array
o P2: performs database search
o P3: compresses a file
• GPU (SIMD/SIMT):
All threads execute the same task on different data — e.g.,
o Thread 1: Add pixel 1
o Thread 2: Add pixel 2
o Thread 3: Add pixel 3
Illustration
CPU (MIMD): Different Instructions → Different Data
GPU (SIMD): Same Instruction → Many Data
Conclusion
CPUs are flexible, optimized for sequential workloads, and handle complex logic efficiently.
GPUs, on the other hand, are specialized for parallel workloads, providing far greater throughput for
data-intensive computations.
3. Amdahl’s Law, Speedup, and Efficiency (20 Marks)
Introduction
In parallel computing, the goal is to execute programs faster using multiple processors.
Amdahl’s Law helps in estimating the maximum improvement possible through parallelization.
Amdahl’s Law
S = 1 / ((1 - P) + P / N)
Where:
• S = overall speedup
• P = parallel portion
• (1 - P) = sequential portion
• N = number of processors
Interpretation
Even if 90% of the code is parallelizable, the 10% sequential portion limits the total speedup —
showing diminishing returns as N increases.
Speedup
S = T_serial / T_parallel
Shows how many times faster a parallel program runs compared to the serial version.
Efficiency
E = S / N
Represents how effectively each processor is used.
Numerical Example
Let P = 0.9 and N = 8.
S = 1 / ((1 - 0.9) + 0.9/8) = 1 / (0.1 + 0.1125) = 1 / 0.2125 ≈ 4.71
E = 4.71 / 8 ≈ 0.588 = 58.8%
→ Even with 8 processors, efficiency drops because of the sequential portion.
Graph (Conceptual)
Speedup vs Processors curve — initially steep, then flattens.
Conclusion
Amdahl’s Law demonstrates that the serial portion of a program limits total speedup. Efficient
parallel computing thus depends on minimizing sequential parts and balancing processor workloads.
4. MPI Functions for Point-to-Point and Collective Communication (20 Marks)
Introduction
MPI (Message Passing Interface) is the standard for communication in distributed-memory parallel
systems. It provides a rich set of functions for data exchange, synchronization, and coordination
among processes.
A. Point-to-Point Communication
Used when one process directly sends data to another.
Key Functions:
• MPI_Send(): Send data to another process
• MPI_Recv(): Receive data from another process
Example:
if (rank == 0) {
    int msg = 10;
    MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
} else if (rank == 1) {
    int msg;
    MPI_Status status;
    MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
}
Modes of Send:
• Standard (MPI_Send)
• Buffered (MPI_Bsend)
• Synchronous (MPI_Ssend)
• Ready (MPI_Rsend)
B. Collective Communication
Involves all processes in a communicator.
Function Description
MPI_Bcast() One-to-all broadcast
MPI_Scatter() Distribute chunks from one process to all
MPI_Gather() Collect data from all processes
MPI_Reduce() Apply an operation (sum, max, etc.) and return result to root
MPI_Allreduce() Reduce + broadcast result to all
Example:
MPI_Reduce(&local_sum, &total_sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
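If every process needs the final value rather than only the root, MPI_Allreduce (listed above) combines the reduction with a broadcast, e.g.:
MPI_Allreduce(&local_sum, &total_sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);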
Advantages:
• Efficient inter-process communication
• Portable and scalable across platforms
• Synchronization and load distribution made easy
Conclusion
MPI provides a powerful and flexible way to implement both point-to-point and collective
communication, enabling scalable parallel applications across distributed systems.
5. Trapezoidal Rule for Numerical Integration (20 Marks)
Introduction
The trapezoidal rule is a numerical method for estimating the area under a curve (integration). It
divides the area into trapezoids rather than rectangles for better accuracy.
Formula
∫[a, b] f(x) dx ≈ (h / 2) [ f(x_0) + 2 f(x_1) + 2 f(x_2) + ... + 2 f(x_{n-1}) + f(x_n) ], where h = (b - a) / n
Algorithm Steps
1. Divide [a, b] into n subintervals.
2. Compute the width h = (b - a) / n.
3. Evaluate f(x_i) for all i.
4. Apply the trapezoidal formula.
Pseudo-code:
Input: a, b, n
h = (b - a) / n
sum = f(a) + f(b)
for i = 1 to n-1:
    x = a + i * h
    sum += 2 * f(x)
I = (h / 2) * sum
print(I)
Parallel Implementation (MPI Idea):
Each process computes part of the summation and sends its partial result to the root process.
local_sum = compute_partial(a_local, b_local);
MPI_Reduce(&local_sum, &total_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
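Expanding the compute_partial idea into a complete sketch (the integrand, interval, and trapezoid count are illustrative assumptions, and n is assumed divisible by the number of processes):
#include <mpi.h>
#include <stdio.h>

static double f(double x) { return x * x; }   /* illustrative integrand */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double a = 0.0, b = 1.0;              /* illustrative interval */
    int n = 1000000;                      /* total trapezoids; assumed divisible by size */
    double h = (b - a) / n;

    /* Each rank integrates its own slice of [a, b] */
    int local_n = n / size;
    double local_a = a + rank * local_n * h;
    double local_b = local_a + local_n * h;

    double local_sum = 0.5 * (f(local_a) + f(local_b));
    for (int i = 1; i < local_n; i++)
        local_sum += f(local_a + i * h);
    local_sum *= h;

    double total_sum = 0.0;
    MPI_Reduce(&local_sum, &total_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("Integral = %.6f\n", total_sum);

    MPI_Finalize();
    return 0;
}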
Advantages
• Simple and stable
• Works well for smooth functions
• Easy to parallelize
Limitations
• Less accurate for functions with high curvature
• Requires uniform spacing
Applications
Used in numerical analysis, scientific computing, and engineering simulations for approximating
integrals.
6. MPI I/O Operations and Challenges (20 Marks)
Introduction
MPI I/O allows multiple processes to read and write files in parallel. It’s part of the MPI-2 standard
and essential for large-scale data-driven applications.
MPI I/O Functions
Function Description
MPI_File_open() Opens a file collectively
MPI_File_read() / MPI_File_write() Read/write operations
MPI_File_seek() Move file pointer
MPI_File_close() Close file
MPI_File_set_view() Define how file data is mapped for each process
Example:
MPI_File fh;
MPI_File_open(MPI_COMM_WORLD, "data.txt", MPI_MODE_CREATE|MPI_MODE_WRONLY,
MPI_INFO_NULL, &fh);
MPI_File_write(fh, buffer, count, MPI_INT, MPI_STATUS_IGNORE);
MPI_File_close(&fh);
Parallel File Access Models
1. Independent I/O: Each process performs its own I/O.
2. Collective I/O: All processes cooperate to perform I/O efficiently.
Advantages
• Allows simultaneous file access
• Reduces bottlenecks in large systems
• Improves I/O performance through striping
Challenges
• Synchronization: Ensuring correct order of file access
• Consistency: Avoiding overwrites or data corruption
• Performance tuning: File striping, buffering, caching
• Portability: Different file systems and cluster environments
Applications
• High-performance simulations
• Big data processing
• Machine learning training datasets
Conclusion
MPI I/O provides an efficient and scalable method for managing parallel data access. Despite
synchronization and consistency challenges, it’s essential for large-scale scientific and engineering
computations.