
Atria Institute of Technology

Model Question Paper (I & II) Solutions: Parallel Computing (BCS702)


Module – 1
Model Question Paper I
Q.1 a. Define parallel computing. Explain Flynn’s Taxonomy for the classification of parallel
computers. (8 Marks)

Definition of Parallel Computing: Parallel computing (or parallel programming) is a computing paradigm
where multiple operations are carried out simultaneously to solve a single problem. It involves dividing a
large task into smaller subtasks that are then processed concurrently using multiple processors or cores to
achieve faster computation. For example, instead of one processor adding elements of a large array
sequentially, the array is split into chunks, and multiple cores add their respective parts at the same time.
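
A minimal OpenMP sketch of this idea (the array size and contents are arbitrary placeholders); each thread sums its own chunk of the array and the partial sums are combined by the reduction clause:

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double a[N];
    double sum = 0.0;

    for (int i = 0; i < N; i++) a[i] = 1.0;   /* fill the array with sample data */

    /* Each core adds its own chunk; the partial sums are combined into 'sum'. */
    #pragma omp parallel for reduction(+: sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %f\n", sum);
    return 0;
}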

Flynn’s Taxonomy: Flynn’s Taxonomy is a commonly used scheme to categorise parallel computers based
on the number of instruction streams and data streams they handle at once.

1. SISD (Single Instruction Stream, Single Data Stream): This represents traditional systems based on
the von Neumann architecture. It processes exactly one instruction and one data item at any given
time.
2. SIMD (Single Instruction, Multiple Data): These systems apply the same instruction to multiple data
items simultaneously. They are highly effective for data-parallel tasks where the same operation is
repeated across a large dataset.
3. MIMD (Multiple Instruction, Multiple Data): In these systems, independent instruction streams
operate on different data at the same time. MIMD systems consist of independent processors, each
with its own control unit and local memory, allowing them to run different programs asynchronously.

Q.1 b. Explain shared memory system with a diagram. Discuss the various interconnects for shared
memory. (8 Marks)

Shared Memory System: In a shared memory system, multiple processors (CPUs or cores) access and share
a common main memory space. This model allows for implicit communication between processes or threads
through shared variables, meaning they do not need to send explicit messages to exchange data. While this
makes programming easier, it can lead to bottlenecks as the number of cores increases.

Diagram Reference: Figure 2.5: "A shared-memory system"

Interconnects for Shared Memory: The performance of these systems depends heavily on the interconnect
that links processors to memory.

1. Buses: A bus is a collection of parallel wires shared by all connected devices. They are low-cost and
flexible, but as more processors are added, the likelihood of contention increases because they must
wait for access to the shared wires, leading to decreased performance.
2. Crossbars: These use a grid of switches to provide bidirectional communication links. Crossbars are
much faster than buses because they allow multiple processors to communicate with different
memory modules simultaneously without conflict. However, they are significantly more expensive
and complex to build.

Q.1 c. Differentiate between static and dynamic threads with examples. (4 Marks)
• Creation: Dynamic threads are created (forked) only when work arrives; static threads are all created at the beginning, after setup.
• Lifecycle: Dynamic threads terminate and join the master thread once their specific task is done; static threads run continuously until all work is finished.
• Efficiency: Dynamic threads use resources efficiently, since resources are held only while a thread is active; static threads are less efficient because idle threads still hold memory and registers.
• Overhead: Dynamic threads incur high overhead due to frequent creation and termination; static threads have lower overhead since they are created only once.

• Example (Dynamic): A master thread in a web server waits for a network request; when one arrives,
it forks a worker thread to handle that specific request and shuts it down afterward.
• Example (Static): In a complex calculation, all threads are launched at the start of the program,
perform various tasks throughout the execution, and only join the master for cleanup at the very end.

Q.2 a. Discuss the characteristics of SIMD systems. How does a GPU use SIMD parallelism to
optimize performance? Explain. (7 Marks)

Characteristics of SIMD Systems:

• Single Control Unit: A central unit broadcasts one instruction to multiple datapaths or processing
elements.
• Synchronous Execution: All datapaths operate in lockstep, meaning they must wait for the next
broadcast instruction before proceeding.
• Data Parallelism: They excel at tasks where the same operation is applied to large, uniform datasets,
such as image pixels or signal samples.
• Idle States: If a datapath does not have data matching an instruction (e.g., due to conditional
branching logic), it remains idle, which can reduce efficiency.

GPU Performance Optimisation via SIMD: GPUs were originally designed for real-time graphics,
processing points and triangles into pixels. They use SIMD parallelism extensively because shader functions
(parallel routines) are applied uniformly to thousands of similar elements like vertices.

• Massive Throughput: A single GPU core contains dozens of datapaths.


• Efficiency Example: If a shader operates on 1000 pixels and a GPU core can process 128 elements
simultaneously, the task requires only about 8 instruction steps instead of 1000 sequential steps.
• Multithreading: GPUs use hardware multithreading to store and swap the state of hundreds of threads
quickly, avoiding delays during memory access.

Q.2 b. Explain Omega network with a neat diagram. (7 Marks)

An Omega network is a cost-efficient indirect interconnect used in distributed-memory systems. It is designed to be cheaper than a full crossbar while still supporting parallel communication.

• The network is built from 2×2 crossbar switches arranged in stages.
• It allows some parallel communications but has contention: certain communications cannot occur at the same time. Its bisection width is p/2, and it uses only (p/2) · log₂(p) of the 2×2 crossbar switches, making it significantly cheaper than a full crossbar.

Latency and Bandwidth:

• When evaluating interconnect performance, two key terms are used: latency and bandwidth.
• Latency is the time delay from the start of transmission to the reception of the first byte, while bandwidth is the rate of data transfer once transmission has started.
• The total transmission time for a message of size n bytes is given by:
Transmission Time = latency + (n / bandwidth)
(a worked example follows this list).
• It is important to note that in distributed-memory systems, latency may also include message assembly and disassembly time, such as adding headers, error-correction data, and message-size information.
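
Worked example (illustrative numbers): if the latency is 10 microseconds, the bandwidth is 1 GB/s, and the message size is n = 1 MB, the transmission time is 10 µs + (10⁶ bytes / 10⁹ bytes per second) = 10 µs + 1000 µs ≈ 1.01 ms.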

Q.2 c. Discuss the means of coordinating processes/threads with an example. (6 Marks)

Coordinating processes or threads is essential to ensure a parallel program produces correct results. The
programmer must manage three main aspects:

1. Work Division: Dividing the total task so that each thread has a roughly equal workload (load
balancing) while minimising communication between them.
2. Synchronization: Ensuring threads reach certain points in the code at the right time.
3. Communication: Enabling threads to exchange data, which is implicit in shared memory (via shared
variables) and explicit in distributed memory (via message passing).

Example (Race Condition and Mutex): Consider two threads trying to update a shared variable x (initially 0)
by adding their private my_val to it. The operation x += my_val is not atomic; it involves loading x, adding the
value, and storing it back. If both threads do this simultaneously, one might overwrite the other’s update, a
problem known as a race condition. To coordinate this, programmers use a mutex (mutual exclusion lock) to
create a critical section. When one thread locks the mutex to update x, the other must wait, ensuring only one
thread executes the update at a time.
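
A minimal sketch of this coordination using OpenMP (the shared variable and per-thread values are illustrative); the critical directive plays the role of the mutex described above:

#include <stdio.h>
#include <omp.h>

int main(void) {
    int x = 0;                                   /* shared variable */

    #pragma omp parallel num_threads(2)
    {
        int my_val = omp_get_thread_num() + 1;   /* each thread's private value */

        /* x += my_val is load/add/store; without protection one update can be lost.
           The critical section lets only one thread perform the update at a time. */
        #pragma omp critical
        x += my_val;
    }

    printf("x = %d\n", x);                       /* always 3 with the critical section */
    return 0;
}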

Analogy for Understanding: Think of Parallel Computing like a professional kitchen. In a sequential "SISD"
kitchen, one chef does everything: chops the onions, then fries the meat, then boils the pasta. In a parallel
"MIMD" kitchen, you have multiple chefs (cores). One chef makes the sauce while another boils the pasta.
Coordination (like a mutex) is like having a rule that only one chef can use the salt pot at a time to prevent
the dish from being seasoned twice!

Model Question Paper II


Q.1 a. What are MIMD systems? Discuss shared and distributed systems with neat diagrams for each.
(8 Marks)

MIMD (Multiple Instruction, Multiple Data) Systems: MIMD systems are a category of parallel computers
where multiple processors execute different instructions on different data sets simultaneously. Unlike
simpler systems, these are composed of independent processors, each equipped with its own control unit and
local memory or data path.

Key Characteristics:

• Asynchronous Execution: Processors operate at their own pace without needing to synchronize
clocks.

• No Global Clock: There is no requirement for processors to execute in "lockstep".

• Scalability: They are highly suitable for large-scale tasks as they scale well when adding more
processors.

1. Shared-Memory MIMD Systems: In these systems, all processors access a common, shared main memory space. Communication happens implicitly when processors read from or write to shared variables. To prevent data conflicts, programmers must use synchronization tools like locks or semaphores.

Diagram Reference: Figure 2.5: "A shared-memory system"

2. Distributed-Memory MIMD Systems: Here, every processor has its own private local memory. For
processors to work together, they must communicate by passing explicit messages over an interconnection
network. These systems are highly scalable and are commonly used in clusters and grid computing.

Diagram Reference: Figure 2.6: "A distributed-memory system"

Q.1 b. What is cache coherence? Discuss the various approaches to ensuring cache coherence with examples. (8 Marks)

Definition of Cache Coherence: Cache coherence is a problem that occurs in shared-memory systems where
multiple processor cores have their own private caches. If multiple cores cache the same memory location
and one core updates it, the other cores may continue using stale (old) data from their own caches. This
leads to an inconsistent view of memory where the update is not reflected across all processors.

Example: Imagine Core 0 and Core 1 both have a shared variable x (initially 2) in their caches. Core 0 updates x to 7. If Core 1 then performs a calculation using x, it might still use the stale value 2 from its own cache, producing an incorrect result (for instance, if the calculation were 4x + 8, Core 1 would get 16 instead of the expected 36).

Approaches to Ensuring Cache Coherence:



1. Snooping Cache Coherence: This is typically used in bus-based systems. All cores "snoop" (monitor)
the shared bus. When one core updates a variable, it broadcasts this on the bus; other cores see this
broadcast and invalidate their own local copies of that data.

2. Directory-Based Cache Coherence: This is used for larger systems where broadcasting would be too
slow. A central or distributed directory keeps track of which cores hold copies of specific cache
lines. When a core wants to write to a variable, the system checks the directory and notifies only the
specific cores that have a copy to invalidate or update them.

Q.1 c. Explain one-sided communication with an example. (4 Marks)


One-Sided Communication (Remote Memory Access): In traditional message passing, both the sender and
receiver must participate (one calls Send, the other calls Receive). In one-sided communication, only one
process performs the action—either reading from or writing directly to another process’s private memory
without that process being actively involved in the call.

Example: Suppose Process 0 (P0) needs to give data to Process 1 (P1). Using one-sided communication, P0
can copy the data directly into P1's memory. P1 does not need to execute an explicit "Receive" command.
To ensure safety, P1 might simply check a flag variable to see when the update is finished, or both processes
might synchronize before and after the operation to ensure the data is ready.
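
A brief sketch of this pattern using MPI's remote memory access interface (MPI_Win_create, MPI_Put, MPI_Win_fence); it assumes the program is run with at least two processes, and the buffer name is illustrative:

#include <stdio.h>
#include <mpi.h>

int main(void) {
    int my_rank;
    double buf = 0.0;                  /* P1's window memory (allocated on every process) */
    MPI_Win win;

    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    /* Every process exposes buf as a window; only P1's copy is written to. */
    MPI_Win_create(&buf, sizeof(double), sizeof(double), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);             /* synchronize before the access epoch */
    if (my_rank == 0) {
        double data = 42.0;
        /* P0 writes one double into P1's window; P1 posts no matching receive. */
        MPI_Put(&data, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
    }
    MPI_Win_fence(0, win);             /* synchronize after; the data is now visible */

    if (my_rank == 1)
        printf("Process 1 received %f without calling Recv\n", buf);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}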

Q.2 a. Discuss the direct and indirect distributed memory interconnects with neat diagrams for each.
(8 Marks)

1. Direct Interconnects: In a direct interconnect, every switch is permanently linked to a specific processor-memory pair, and the switches are then connected to each other.

Diagram Reference: Figure 2.8: "(a) A ring and (b) a toroidal mesh"
• Ring Topology: Each node is connected to exactly two neighbours. It uses 2p links for p processors.

• Toroidal Mesh: A more complex grid where each switch connects to five links, allowing more
simultaneous paths but at a higher cost.

2. Indirect Interconnects: In these systems, the switches are separate from the processors. Communication is
routed through a dedicated switching network to reach its destination.

• Crossbar: Every processor can talk to any other processor simultaneously as long as there is no destination conflict. It is fast but very expensive (p² switches).

• Omega Network: Built using log₂(p) stages.

Each stage contains p/2 two-input/two-output (2×2) switches.

Total switch count = (p/2) × log₂(p) 2×2 switches, which is significantly less than the p² of a crossbar.

Diagram Reference: Figure 2.15: "An omega network"
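
As a quick size comparison using the counts above: for p = 8, a crossbar needs p² = 64 switches, while an omega network needs log₂(8) = 3 stages of 8/2 = 4 switches each, i.e. only 12 of the 2×2 crossbar switches.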

Q.2 b. Discuss the characteristics of vector processing systems. (6 Marks)

Vector processors are SIMD-style systems designed to operate on entire arrays or vectors of data at once,
rather than one individual number (scalar) at a time.

Key Characteristics:

• Vector Registers: These hold a complete vector (typically 4 to 256 elements) and allow operations on
all elements at the same time.

• Vectorized Functional Units: These units can apply the same operation (like addition) to all elements
in a vector or between pairs of vectors in a pipelined fashion.

• Vector Instructions: A single instruction can handle an entire block of data, which significantly
reduces the total number of instructions the CPU has to process.

• Interleaved Memory: The system uses multiple memory banks so it can load and store vector
elements very quickly, reducing delays.

• Strided Access & Scatter/Gather: Hardware support allows the system to read data at regular
intervals (strides) or from irregular locations (scatter/gather) efficiently.

Q.2 c. Explain message passing and partitioned global address space languages with examples. (6
Marks)

1. Message Passing: Used primarily in distributed-memory systems, this model relies on explicitly sending
and receiving data between processes. Each process has its own private memory and a unique ID called a
rank. The most common API for this is MPI (Message Passing Interface).

• Example: Process 1 uses a Send() function to pass a string "Greetings" to Process 0, which must then
call a corresponding Receive() function to accept the data.

2. Partitioned Global Address Space (PGAS) Languages: PGAS languages provide a shared-memory
"illusion" on distributed-memory hardware. They allow the programmer to treat memory as a single global
space while still acknowledging that some data is physically local to a processor and some is remote.

• Example: A programmer declares an array as shared double x[n]. The language handles the
complexity of fetching data from other nodes, but the programmer can optimize performance by
ensuring each process mostly works on the part of x that is stored in its own local memory segment.

Analogy for Understanding: Think of Message Passing like two people in different houses sending letters to
each other; both have to actively check their mailboxes to communicate. PGAS is like having a shared
digital document (like Google Docs) where everyone can see the whole file, but it might "lag" slightly when
you try to edit a page that someone else is currently hosting on their computer.

Module – 2

Model Question Paper I

Q. 3 a. A program takes 100 sec. on a single processor. When run on 4 processors, it takes 30 sec.
Calculate the speedup and efficiency. (5 Marks)
Solution: To find the speedup and efficiency, we use the following definitions:

• Tserial: Time taken by the serial program = 100 seconds.


• Tparallel: Time taken by the parallel version = 30 seconds.
• p: Number of processors = 4.
1. Calculation of Speedup (S):
Speedup is the ratio of serial execution time to parallel execution time:
S = Tserial / Tparallel = 100 / 30 ≈ 3.33

This means the program runs approximately 3.33 times faster on 4 processors than on one.

2. Calculation of Efficiency (E):
Efficiency measures the average utilization of each core:
E = S / p = 3.33 / 4 ≈ 0.83

Interpretation:
An efficiency of about 0.83 indicates that each core is utilized at roughly 83% of its capacity, while the remaining time is lost due to parallel overheads such as synchronization or communication.

Q. 3 b. Explain how scalability is achieved in parallel programs with an example. (7 Marks)

Solution: Scalability refers to a program's ability to demonstrate performance gains as computing power
(such as the number of cores) increases.
How Scalability is Achieved: In a formal sense, a program is considered scalable if efficiency remains
constant when the number of processes/threads and the problem size are increased proportionally.

Example: Suppose we have a program where:

• The serial runtime is Tserial = n (the problem size).
• The parallel runtime is Tparallel = n/p + 1, where p is the number of cores and the 1 represents overhead.
• The efficiency is therefore:
E = Tserial / (p · Tparallel) = n / (n + p)

To maintain this efficiency as we increase the number of processors by a factor of k (to kp), we must also increase the problem size by a factor of x (to xn).

• The new efficiency becomes:
Enew = xn / (xn + kp)
• If we set the problem scaling factor x = k, then Enew = kn / (kn + kp) = n / (n + p), which is the original efficiency.

Because the efficiency remains constant when both the core count and problem size increase at the same rate, this program is weakly scalable.

Diagram Reference: For a visual representation of how increasing problem size maintains higher efficiency as core counts grow, refer to Figure 2.19: "Efficiencies of parallel program on different problem sizes".

Q. 3 c. Explain why MIMD performance metrics can’t be directly applied to GPU programs and how
GPU performance is evaluated. (8 Marks)
Solution: Why MIMD Metrics are Inapplicable: Standard MIMD metrics like efficiency and linear speedup
rely on the assumption that the serial and parallel versions of a program run on identical types of cores. In
GPU programming, this assumption fails because GPU cores are fundamentally different from traditional
general-purpose CPU cores. Therefore, comparing the utilization of a single CPU core to a single GPU core
(which is part of a massive SIMD group) does not provide a meaningful measure of efficiency.
How GPU Performance is Evaluated:

1. Relative Speedup: Performance is typically reported as the speedup of a GPU program compared to a
serial CPU program or a parallel MIMD (multi-core CPU) program.

2. Informal Scalability: A GPU program is considered scalable if it runs faster on a larger GPU compared to
a smaller one when the problem size is increased accordingly.

3. Amdahl’s Law Bound: If a fraction r of the program remains serial (running on the CPU) while the rest is
offloaded to the GPU, the maximum speedup is still limited to 1/r.

4. Timing Facilities:

◦ CPU Timers: Used to measure the entire duration of a task by starting the timer before the GPU kernel
launch and stopping it after completion.

◦ GPU APIs: If specific internal portions of the GPU code need to be timed, developers must use
specialized timing facilities provided by the GPU's own API.

Q. 4 a. State and explain Amdahl’s law. A program achieves a speedup of 5 on 8 processors. Obtain the
fraction of the program that can be parallelized. (7 Marks)
Solution: Amdahl’s Law: Formulated by Gene Amdahl, the law states that the overall speedup of a parallel
program is strictly limited by its serial (unparallelized) portion. No matter how many cores are added, the
total runtime can never be less than the time required to execute the serial part. If a fraction r of a program is
serial, the maximum speedup (S) can never exceed 1/r.
Calculation: We are given:

• Speedup (S) = 5

• Number of processors (p) = 8

• Let f be the parallelized fraction of the program.

• The serial fraction is r=(1−f).

The formula for speedup is:
S = 1 / ((1 − f) + f/p)

Substituting S = 5 and p = 8:
5 = 1 / ((1 − f) + f/8)
(1 − f) + f/8 = 1/5 = 0.2
1 − (7f/8) = 0.2
7f/8 = 0.8
f = 6.4/7 ≈ 0.9143

Answer: The fraction of the program that can be parallelized is approximately 0.9143, or 91.43%.

Q. 4 b. Explain the different approaches for taking timings in MIMD programs and discuss why total
CPU time may not reflect true parallel performance. (7 Marks)
Solution: Approaches for Taking Timings:
1. Wall Clock Time: This measures the actual elapsed time from the start to the end of a code segment,
similar to a stopwatch. It is the standard for parallel performance evaluation.
2. Global Elapsed Time: In parallel systems, we measure the time from when the first process starts to when
the last one finishes. This is typically done by placing a Barrier() before starting the timer and using a
Global_max() function to find the maximum time spent by any process.

3. Minimum Runtime: Because system noise can cause variability, developers usually report the minimum
runtime across multiple runs as the best approximation of true performance.

Why Total CPU Time is Misleading: CPU time (reported by functions like clock() in C) only accounts for the
time the processor is actively executing instructions. It excludes idle time. In parallel programs, a process
often spends significant time "idle" while blocked, waiting for communication from another process or
waiting at a synchronization barrier. Although the CPU isn't "working" during this time, the delay increases
the real-world time the user waits for the program to finish. Thus, CPU time fails to capture these critical
parallel overheads.
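
A minimal MPI sketch of the approach described above, combining a barrier, wall-clock timing with MPI_Wtime, and a maximum reduction to obtain the global elapsed time (the timed work is a placeholder):

#include <stdio.h>
#include <mpi.h>

int main(void) {
    int my_rank;
    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    MPI_Barrier(MPI_COMM_WORLD);              /* everyone starts the timer together */
    double my_start = MPI_Wtime();

    /* ... parallel work being timed goes here ... */

    double my_elapsed = MPI_Wtime() - my_start;

    /* The slowest process determines the global (wall-clock) elapsed time. */
    double global_elapsed;
    MPI_Reduce(&my_elapsed, &global_elapsed, 1, MPI_DOUBLE, MPI_MAX, 0,
               MPI_COMM_WORLD);

    if (my_rank == 0)
        printf("Elapsed time = %e seconds\n", global_elapsed);

    MPI_Finalize();
    return 0;
}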

Q. 4 c. Is a program that obtains linear speedup strongly scalable? Explain your answer. (6 Marks)
Solution: Yes, a program that obtains linear speedup is strongly scalable.
Explanation:

1. Strong Scalability is defined as a case where the efficiency of a program remains constant as the number
of threads/processes increases, without increasing the problem size.

2. Linear Speedup occurs when the speedup S is equal to the number of processors p (S = p).
3. Efficiency (E) is calculated as E=S/p. For linear speedup, E=p/p=1.0 (or 100%).

4. If a program maintains linear speedup as more processors are added to a fixed problem size, its efficiency
remains perfectly constant at 1.0. Since constant efficiency for a fixed problem size satisfies the definition of
strong scalability, the program is strongly scalable.

Analogy for Amdahl's Law: Imagine you are cooking a meal that takes 100 minutes. 10 minutes is spent
preheating the oven (serial), and 90 minutes is spent chopping vegetables (parallelizable). Even if you hire
90 chefs to chop all the vegetables in 1 minute, the meal will still take at least 11 minutes because you
cannot speed up the preheating. The oven is your "serial bottleneck"

Model Question Paper II

Q. 3 a. Suppose the run-time of a serial program is given by Tserial = n² microseconds. A parallelization has run-time Tparallel = n²/p + log₂(p). Write an OpenMP program that finds the speedups and efficiencies for n = 10, 20, ..., 320, and p = 1, 2, ..., 128. (10 Marks)

Solution: This program calculates speedup (S = Tserial / Tparallel) and efficiency (E = S / p) based on the provided mathematical models.

#include <stdio.h>
#include <math.h>
#include <omp.h>

int main(void) {
    printf("n\tp\tSpeedup\tEfficiency\n");
    printf("------------------------------------\n");

    /* n doubles from 10 to 320 (6 values); p doubles from 1 to 128 (8 values).
       The index loops are in canonical form, so they can be collapsed.        */
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < 6; i++) {
        for (int j = 0; j < 8; j++) {
            int n = 10 << i;                 /* n = 10 * 2^i */
            int p = 1 << j;                  /* p = 2^j      */
            double t_serial   = (double)n * n;
            /* log2(p) is calculated using log(p)/log(2) */
            double t_parallel = t_serial / p + log((double)p) / log(2.0);
            double speedup    = t_serial / t_parallel;
            double efficiency = speedup / p;

            /* Using critical to ensure output isn't jumbled */
            #pragma omp critical
            printf("%d\t%d\t%.2f\t%.2f\n", n, p, speedup, efficiency);
        }
    }
    return 0;
}

Q. 3 b. Define speedup and efficiency in parallel programs. Explain how scalability is achieved in
parallel programs with an example. (6 Marks)

Solution:

• Speedup (S): The ratio of the time taken by the serial program (Tserial) to the time taken by the parallel version (Tparallel). Ideally, if you use p cores, the speedup should be p (linear speedup).
o Formula: S = Tserial / Tparallel

• Efficiency (E): A measure of the average utilization of each core. It is the ratio of speedup to the number of processors.
o Formula: E = S / p

• Achieving Scalability: Scalability is the ability of a program to handle growth in computing power. A program is formally scalable if efficiency remains constant when the number of processes (p) and the problem size (n) are increased proportionally.

• Example: If Tserial = n and Tparallel = n/p + 1, then E = n / (n + p). If we double the processors to 2p, the efficiency would normally drop. However, if we also double the problem size to 2n, the new efficiency becomes:
Enew = 2n / (2n + 2p) = n / (n + p)

• Since the efficiency remains unchanged when both factors scale equally, the program is weakly scalable.

• Diagram Reference:

Figure 2.19: Efficiencies of parallel program on different problem sizes



Q. 3 c. Explain why Amdahl’s Law does not always accurately reflect real-world parallel program
performance. (4 Marks)

Solution: Amdahl's Law is often considered too pessimistic for the following reasons:

1. Fixed Problem Size: Amdahl’s Law assumes the problem size remains constant, but in the real
world, as we get more powerful systems, we usually run larger problems.

2. Shrinking Serial Fraction: As problem size increases, the fraction of the code that is inherently serial (r) often becomes a smaller percentage of the total work, allowing for better scaling (a worked example follows this list).

3. Practical Sufficiency: While Amdahl's Law might show a maximum theoretical speedup (e.g., 10x),
in many engineering fields, even a modest speedup is considered a significant success.
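
A small illustration of point 2 (the numbers are hypothetical): suppose the serial portion always takes 10 seconds. For a problem whose parallelizable portion takes 90 seconds, r = 10/100 = 0.1 and Amdahl's bound is 1/r = 10×. If the problem grows so that the parallelizable portion takes 990 seconds, r = 10/1000 = 0.01 and the bound rises to roughly 100×, even though the serial code is unchanged.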

Q. 4 a. Suppose Tserial = n and Tparallel = n/p + log₂(p). If we increase p by a factor k, find a formula for how much we need to increase n to maintain constant efficiency. Calculate for p = 8 to p = 16. Is it scalable? (10 Marks)

Solution:

1. Efficiency Formula:
E = Tserial / (p · Tparallel) = n / (p · (n/p + log₂(p))) = n / (n + p·log₂(p))

2. Finding the Scaling Factor (x):
To keep E constant when p becomes kp, let n become xn:
n / (n + p·log₂(p)) = xn / (xn + kp·log₂(kp))
n·(xn + kp·log₂(kp)) = xn·(n + p·log₂(p))
xn² + nkp·log₂(kp) = xn² + xnp·log₂(p)
k·log₂(kp) = x·log₂(p)  ⟹  x = k · log₂(kp) / log₂(p)

3. Calculation for doubling p (8 to 16):
Here, k = 2 and p = 8.
x = 2 · log₂(16) / log₂(8) = 2 · 4/3 ≈ 2.67

To maintain efficiency, n must increase by a factor of about 2.67.

4. Scalability:
Since efficiency can be maintained by increasing the problem size n as p increases, the program is scalable (specifically, weakly scalable).

Q. 4 b. Explain the role of the barrier function in synchronizing threads during performance
measurement with an example program. (6 Marks)

Solution: The Barrier function ensures that no process or thread proceeds past a certain point until all
processes/threads have reached that same point.

Role in Timing: When measuring "Global Elapsed Time," it is essential that all threads start their
stopwatches at the same time. Without a barrier, some threads might start earlier than others, leading to
inaccurate measurements of the parallel section.
Example Snippet:

Barrier(); // Everyone waits here so we start together

my_start = Get_current_time();

/* Parallel Work */

my_finish = Get_current_time();

my_elapsed = my_finish - my_start;

// Find the time taken by the slowest thread

global_elapsed = Global_max(my_elapsed);

Q. 4 c. Suppose 70% of a program can be parallelized. Find the speedup and efficiency for 4
processors. (4 Marks)

Solution: We use Amdahl’s Law:

• Parallel fraction (f): 0.70
• Serial fraction (r = 1 − f): 0.30
• Number of processors (p): 4

1. Speedup (S):
S = 1 / (r + f/p) = 1 / (0.30 + 0.70/4) = 1 / (0.30 + 0.175) = 1 / 0.475 ≈ 2.105

2. Efficiency (E):
E = S / p = 2.105 / 4 ≈ 0.526 (or 52.6%)

Analogy for Barrier: Imagine a group of 4 friends running a relay race but they must all start their
individual segments at the exact same moment for a synchronized photo. The barrier is the "Ready, Set,
Go!" shout that ensures no one gets a head start, allowing the photographer (the timer) to capture the total
time accurately.

Module 3
Model Question Paper I

Q.5 a. Define MPI. Explain the following MPI functions with code snippets: a) MPI_Init
b) MPI_Finalize c) MPI_Comm_Size d) MPI_Comm_Rank (8 Marks)

Definition of MPI: MPI (Message-Passing Interface) is a standardized library of functions used to write
parallel programs for distributed-memory systems. It allows multiple processes to communicate by
explicitly passing messages to one another.

Explanation of MPI Functions:

a) MPI_Init: This function tells the MPI system to perform all necessary initialisation steps, such as
allocating message buffers and assigning ranks to processes. It must be the first MPI function called in any
program.
Code Snippet:
int main(int argc, char* argv[]) {

MPI_Init(&argc, &argv); // Initialises the MPI environment

// ... MPI code ...

b) MPI_Finalize: This function signals that the program has finished using the MPI library. It releases any
resources allocated by MPI, and no MPI functions can be called after it.
Code Snippet:

MPI_Finalize(); // Cleans up and shuts down MPI

return 0;

c) MPI_Comm_Size: This function determines the total number of processes currently active in a specific
communicator (usually the default MPI_COMM_WORLD, which includes all started processes).
Code Snippet:

int comm_sz;

MPI_Comm_size(MPI_COMM_WORLD, &comm_sz); // Stores total processes in comm_sz

d) MPI_Comm_Rank: This function retrieves the rank (or ID) of the calling process within a communicator.
Ranks are non-negative integers ranging from 0 to p − 1, where p is the total number of processes.
Code Snippet:

int my_rank;

MPI_Comm_rank(MPI_COMM_WORLD, &my_rank); // Stores the unique process ID in my_rank

Q.5 b. With the help of MPI programs, demonstrate how to address the problem of getting input from
user and to overcome nondeterminism. (8 Marks)

1. Addressing User Input: In MPI, most implementations only allow process 0 to access stdin. If multiple
processes tried to read input simultaneously, it would be unclear which process should receive which data.
To solve this, process 0 reads the input and then sends or broadcasts the values to all other processes.

Demonstration Program (Input Handling): Based on Program 3.5 (a function for reading user input), process 0 uses scanf and then employs MPI_Send to distribute the data:

if (my_rank == 0) {
printf("Enter a, b, and n\n");
scanf("%lf %lf %d", a_p, b_p, n_p);
for (int dest = 1; dest < comm_sz; dest++) {
MPI_Send(a_p, 1, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
// ... send other values ...
}
} else {
MPI_Recv(a_p, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
// ... receive other values ...
}

2. Overcoming Nondeterminism: Nondeterminism occurs when multiple processes attempt to write to


stdout at the same time, leading to unpredictable output orders or garbled text. To maintain a predictable
order, the programmer can have each non-zero process send its output as a message to process 0, which then
prints them sequentially by rank.

Demonstration Program (Predictable Output): MPI program that prints greetings

if (my_rank != 0) {
sprintf(greeting, "Greetings from process %d!", my_rank);
MPI_Send(greeting, strlen(greeting)+1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
} else {
printf("Greetings from process 0!\n");
for (int q = 1; q < comm_sz; q++) {
MPI_Recv(greeting, MAX_STRING, MPI_CHAR, q, 0, MPI_COMM_WORLD,
MPI_STATUS_IGNORE);
printf("%s\n", greeting);
}
}

Q.5 c. Differentiate between collective and point-to-point communication. (4 Marks)


• Processes Involved: Point-to-point involves exactly two processes, one sender and one receiver; collective communication requires all processes within a communicator to participate.
• Functions Used: Point-to-point uses MPI_Send and MPI_Recv; collective uses MPI_Reduce, MPI_Bcast, MPI_Scatter, etc.
• Tags: Point-to-point uses tags to distinguish between different types of messages; collective communication does not use tags, and messages are matched by the order of the calls.
• Matching: A point-to-point receive must match a specific sender's rank, tag, and communicator; in collective communication, all processes must call the same function with compatible arguments.

Q.6 a. Explain the following MPI functions with the help of MPI programs: a) MPI_Reduce b)
MPI_Allreduce c) Broadcast (6 Marks)

a) MPI_Reduce: This function collects data from all processes in a communicator and combines it into a
single result using an operator (like MPI_SUM or MPI_MAX). The final result is stored only on the
destination (root) process. Example: Replacing manual summation in the trapezoidal rule:
MPI_Reduce(&local_int, &total_int, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

b) MPI_Allreduce: Similar to MPI_Reduce, but instead of storing the result on one process, the final
result is distributed to all processes. It is useful when every process needs the result to proceed with
further calculations. Example: MPI_Allreduce(&local_int, &total_int, 1, MPI_DOUBLE, MPI_SUM,
MPI_COMM_WORLD);

c) Broadcast: A broadcast operation sends data from a single source process to all other processes in the communicator. Program Example (based on Program 3.6):

// Process 0 reads 'n' then broadcasts it to everyone


if (my_rank == 0) scanf("%d", &n);
MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

Q.6 b. Write an MPI program that implements multiplication of a vector by a scalar and dot product.
The user should enter two vectors and a scalar, all of which are read in by process 0 and distributed
among the processes. (8 Marks)

(Note: The following program combines elements from the sources, specifically the vector reading logic
from Program 3.9 and the reduction logic for global sums.)

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(void) {
int my_rank, comm_sz, n, local_n;
double scalar, local_dot = 0.0, global_dot;
double *vec_a = NULL, *vec_b = NULL, *local_a, *local_b, *local_res;

MPI_Init(NULL, NULL);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);

// Process 0 reads the dimension and the scalar


if (my_rank == 0) {
printf("Enter vector size n and scalar:\n");
scanf("%d %lf", &n, &scalar);
}

// Broadcast n and scalar to all processes


MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast(&scalar, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

local_n = n / comm_sz;
local_a = malloc(local_n * sizeof(double));
local_b = malloc(local_n * sizeof(double));
local_res = malloc(local_n * sizeof(double));

// Process 0 reads vectors and scatters them


if (my_rank == 0) {
vec_a = malloc(n * sizeof(double));
vec_b = malloc(n * sizeof(double));
printf("Enter vector A:\n");
for(int i=0; i<n; i++) scanf("%lf", &vec_a[i]);
printf("Enter vector B:\n");
for(int i=0; i<n; i++) scanf("%lf", &vec_b[i]);
}

MPI_Scatter(vec_a, local_n, MPI_DOUBLE, local_a, local_n, MPI_DOUBLE, 0,


MPI_COMM_WORLD);
MPI_Scatter(vec_b, local_n, MPI_DOUBLE, local_b, local_n, MPI_DOUBLE, 0,
MPI_COMM_WORLD);

// Perform Local Computation


for (int i = 0; i < local_n; i++) {
local_res[i] = local_a[i] * scalar; // Vector-Scalar multiplication
local_dot += local_a[i] * local_b[i]; // Part of Dot Product
}

// Gather dot product using Reduce


MPI_Reduce(&local_dot, &global_dot, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

if (my_rank == 0) {
printf("Dot Product: %f\n", global_dot);
free(vec_a); free(vec_b);
}

free(local_a); free(local_b); free(local_res);


MPI_Finalize();
return 0;
}

Analogy for Understanding: Think of MPI like a professional kitchen staff. MPI_Init is the head chef calling
everyone to their stations to start the shift. Point-to-point communication is like one chef handing a specific
spice to another chef. Collective communication is like the head chef announcing the soup of the day to the
entire kitchen at once (Broadcast) or everyone contributing their chopped vegetables into one large pot to
make a single stew (Reduce). Finally, MPI_Finalize is the end of the shift when everyone cleans up their
station and goes home.

Q.6.c. Explain parallel odd-even transposition sorting algorithm with an example. (6 Marks)

Parallel Odd-Even Transposition Sort

The parallel odd-even transposition sort is a parallel sorting algorithm designed for distributed-memory
systems where data is spread across multiple processes. It is an adaptation of the serial bubble sort that
breaks dependencies by using alternating communication phases, allowing multiple comparisons and swaps
to happen simultaneously.

Algorithm Steps
1. Initialization: The n keys are distributed such that each of the p processes holds n/p keys.
2. Local Sort: Each process first sorts its assigned keys locally using a fast serial sorting algorithm, such as qsort.
3. Phase Execution: The algorithm executes for exactly p phases (where p is the number of processes) to guarantee the entire list is sorted.
• Even Phases (0, 2, 4...): Even-ranked processes (e.g., 0, 2, 4) pair with their right-hand neighbour
(e.g., 1, 3, 5).
• Odd Phases (1, 3, 5...): Odd-ranked processes (e.g., 1, 3, 5) pair with their right-hand neighbour
(e.g., 2, 4, 6).
4. Exchange and Merge: In each phase, the paired processes exchange their entire sorted sub-lists.
• The lower-ranked process in the pair keeps the smaller half of the combined keys.
• The higher-ranked process in the pair keeps the larger half of the combined keys.
• Optimization: Instead of a full sort, the processes perform a partial merge (a forward merge for the smaller keys, a backward merge for the larger keys) to stay efficient; a code sketch of one such phase is given after these steps.
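
A compact sketch of one exchange-and-merge phase, assuming each process already holds local_n sorted keys; the helper name and buffers are illustrative, and a full sort would call this once per phase after the local qsort:

#include <mpi.h>

/* One phase of parallel odd-even transposition sort. */
void Odd_even_iter(int local_keys[], int recv_keys[], int temp_keys[],
                   int local_n, int phase, int my_rank, int comm_sz,
                   MPI_Comm comm) {
    int partner;

    /* Determine this process's partner for the current phase. */
    if (phase % 2 == 0)
        partner = (my_rank % 2 == 0) ? my_rank + 1 : my_rank - 1;
    else
        partner = (my_rank % 2 == 0) ? my_rank - 1 : my_rank + 1;
    if (partner < 0 || partner >= comm_sz) return;   /* no partner: idle this phase */

    /* Exchange entire sorted sublists with the partner. */
    MPI_Sendrecv(local_keys, local_n, MPI_INT, partner, 0,
                 recv_keys,  local_n, MPI_INT, partner, 0,
                 comm, MPI_STATUS_IGNORE);

    if (my_rank < partner) {
        /* Lower rank keeps the smallest local_n keys (forward merge). */
        for (int i = 0, r = 0, t = 0; t < local_n; t++)
            temp_keys[t] = (r >= local_n ||
                            (i < local_n && local_keys[i] <= recv_keys[r]))
                               ? local_keys[i++] : recv_keys[r++];
    } else {
        /* Higher rank keeps the largest local_n keys (backward merge). */
        for (int i = local_n - 1, r = local_n - 1, t = local_n - 1; t >= 0; t--)
            temp_keys[t] = (r < 0 ||
                            (i >= 0 && local_keys[i] >= recv_keys[r]))
                               ? local_keys[i--] : recv_keys[r--];
    }
    for (int t = 0; t < local_n; t++) local_keys[t] = temp_keys[t];
}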

Example (Based on Table 3.8, Page 91)


Consider 4 processes (P0 to P3) and 16 keys (n = 16, p = 4); each process starts with 4 keys.

Analogy for Understanding: Imagine four librarians sorting a huge delivery of books. First, each librarian
sorts their own pile on their desk (Local Sort). Then, Librarian 0 and 1 compare their piles, Librarian 0 takes
all the A-M books and Librarian 1 takes all the N-Z books (Even Phase). Next, Librarian 1 and 2 do the
same thing with their new piles (Odd Phase). They keep swapping and splitting until all the books in
Librarian 0's pile are earlier in the alphabet than Librarian 1's, and so on. After everyone has had a turn to
swap with their neighbours, the entire library is in order.

Model Question Paper II


Q.5 a. Explain MPI_Send and MPI_Recv functions with MPI programs. (6 Marks)

MPI_Send and MPI_Recv are the fundamental functions for point-to-point communication in MPI, where
one process sends a message and another receives it.

• MPI_Send: Transmits data to a specific destination process. Its parameters include the message
buffer, the number of elements, the MPI datatype, the rank of the destination, a message tag (to
distinguish between message types), and the communicator (usually MPI_COMM_WORLD).
• MPI_Recv: Receives data from a specific source. It requires matching the sender's tag and
communicator. It includes an additional status_p argument (or MPI_STATUS_IGNORE) to retrieve
information about the received message.

MPI Program (Greetings Example): This program uses rank 0 to receive and print greetings from all other
processes.

#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(void) {
char greeting[100];
int comm_sz, my_rank;
MPI_Init(NULL, NULL);
MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

if (my_rank != 0) {
sprintf(greeting, "Greetings from process %d!", my_rank);
// Send message to process 0
MPI_Send(greeting, strlen(greeting)+1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
} else {
printf("Greetings from process 0!\n");
for (int q = 1; q < comm_sz; q++) {
// Receive message from process q
MPI_Recv(greeting, 100, MPI_CHAR, q, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
printf("%s\n", greeting);
}
}
MPI_Finalize();
return 0;
}

(Source: Program 3.1)



Q.5 b. Explain the following MPI functions with suitable MPI programs for each: a) Scatter b) Gather
c) Allgather (8 Marks)

• a) Scatter: Used by a root process to divide a single buffer into equal segments and distribute one segment to each process in the communicator.
o Program snippet: MPI_Scatter(send_buf, local_n, MPI_DOUBLE, recv_buf, local_n, MPI_DOUBLE, 0, comm);
• b) Gather: The reverse of scatter; it collects segments of data from all processes and stores them in rank order on a single destination process.
o Program snippet: MPI_Gather(local_buf, local_n, MPI_DOUBLE, total_buf, local_n, MPI_DOUBLE, 0, comm);
• c) Allgather: Gathers data from all processes and distributes the combined complete result to every process in the communicator.
o Program snippet: MPI_Allgather(local_x, local_n, MPI_DOUBLE, full_x, local_n, MPI_DOUBLE, comm);

Demonstration Program (Vector Reading/Printing): Based on Programs 3.9 and 3.10, process 0 reads a vector, scatters it for computation, and gathers it back for printing.

// Scatter example (Read_vector)


if (my_rank == 0) {
a = malloc(n * sizeof(double));
for (i = 0; i < n; i++) scanf("%lf", &a[i]);
}
MPI_Scatter(a, local_n, MPI_DOUBLE, local_a, local_n, MPI_DOUBLE, 0, comm);

// Gather example (Print_vector)


MPI_Gather(local_b, local_n, MPI_DOUBLE, b, local_n, MPI_DOUBLE, 0, comm);
if (my_rank == 0) {
for (i = 0; i < n; i++) printf("%f ", b[i]);
}

(Sources: Program 3.9, Page 52; Program 3.10, Page 54; Program 3.12)

Q.5 c. Write a MPI program to demonstrate deadlock using point to point communication and
avoidance of deadlock by altering the call sequence. (6 Marks)

Deadlock occurs when every process is waiting indefinitely for another to proceed, often because all
processes are trying to send large messages simultaneously while none are receiving.

Deadlock Demonstration (Unsafe Ring): If every process calls MPI_Send before MPI_Recv with a large
message, the program may hang.

// Potential Deadlock
MPI_Send(msg, size, MPI_INT, (my_rank + 1) % comm_sz, 0, comm);
MPI_Recv(new_msg, size, MPI_INT, (my_rank + comm_sz - 1) % comm_sz, 0, comm,
MPI_STATUS_IGNORE);

Avoidance (Restructured Sequence): By making even ranks send first and odd ranks receive first, the
dependency is broken.

if (my_rank % 2 == 0) {
MPI_Send(msg, size, MPI_INT, (my_rank + 1) % comm_sz, 0, comm);
MPI_Recv(new_msg, size, MPI_INT, (my_rank + comm_sz - 1) % comm_sz, 0, comm,
MPI_STATUS_IGNORE);

} else {
// Odd processes receive first to "unblock" the ring
MPI_Recv(new_msg, size, MPI_INT, (my_rank + comm_sz - 1) % comm_sz, 0, comm,
MPI_STATUS_IGNORE);
MPI_Send(msg, size, MPI_INT, (my_rank + 1) % comm_sz, 0, comm);
}

Q.6 a. Explain the trapezoidal rule in MPI. Write a program to demonstrate the trapezoidal rule in
MPI and illustrate the importance of Trap function. (8 Marks)

The trapezoidal rule estimates the area under a curve by dividing the interval into n trapezoids. In MPI, this is parallelised by assigning a subinterval [local_a, local_b] to each process. Each process calculates its local integral, and process 0 collects and sums these values.

Importance of the Trap function: The Trap function implements the serial trapezoidal rule logic. It is
essential because it allows each process to independently compute the area for its assigned portion of the
global interval without communicating with others during the calculation phase.

MPI Program Snippet:

local_n = n / comm_sz;
local_a = a + my_rank * local_n * h;
local_b = local_a + local_n * h;
local_int = Trap(local_a, local_b, local_n, h); // Crucial local calculation

if (my_rank != 0) {
MPI_Send(&local_int, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
} else {
total_int = local_int;
for (source = 1; source < comm_sz; source++) {
MPI_Recv(&local_int, 1, MPI_DOUBLE, source, 0, MPI_COMM_WORLD,
MPI_STATUS_IGNORE);
total_int += local_int;
}
}

• Diagram: Figure 3.5: Tasks and communications for the trapezoidal rule, Page 31
• Source: Program 3.2, Page 33; Program 3.3, Page 33.

Q.6 b. Explain MPI Derived data types with examples. (6 Marks)

MPI Derived Datatypes allow grouping multiple data elements (even of different types) into a single unit
that can be sent in one message. This reduces communication overhead by sending one large message
instead of many small ones.

Example (Trapezoidal Input): Instead of three MPI_Bcast calls for a (double), b (double), and n (int), we create one struct type.

1. Define block lengths: {1, 1, 1}.


2. Compute displacements: Using MPI_Get_address to find where a, b, and n sit in memory.
3. Define types: {MPI_DOUBLE, MPI_DOUBLE, MPI_INT}.
4. Commit: MPI_Type_commit(&input_mpi_t);.

Code snippet: MPI_Bcast(&a, 1, input_mpi_t, 0, comm); // Broadcasts all three values at once. (Source: Section 3.5, Pages 58-60; Program 3.13, Page 60)
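
A sketch of how the four steps above fit together, along the lines of the cited Program 3.13 (variable names follow the trapezoidal example; the caller is assumed to hold a, b, and n):

#include <mpi.h>

void Build_mpi_type(double* a_p, double* b_p, int* n_p,
                    MPI_Datatype* input_mpi_t_p) {
    int blocklengths[3] = {1, 1, 1};
    MPI_Datatype types[3] = {MPI_DOUBLE, MPI_DOUBLE, MPI_INT};
    MPI_Aint a_addr, b_addr, n_addr;
    MPI_Aint displacements[3] = {0};

    /* Displacements are measured relative to the address of a. */
    MPI_Get_address(a_p, &a_addr);
    MPI_Get_address(b_p, &b_addr);
    MPI_Get_address(n_p, &n_addr);
    displacements[1] = b_addr - a_addr;
    displacements[2] = n_addr - a_addr;

    MPI_Type_create_struct(3, blocklengths, displacements, types, input_mpi_t_p);
    MPI_Type_commit(input_mpi_t_p);
}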

Q.6 c. Find the speedups and efficiencies of the parallel odd-even sort. Does the program obtain linear
speedups? Determine whether it is scalable, strongly scalable or weakly scalable. (6 Marks)

• Speedup (S): Defined as S = Tserial / Tparallel.
For the parallel odd-even sort, using Table 3.9, for 3200 keys the speedup with 16 processes is S = 1800 ms / 130 ms ≈ 13.8.

• Efficiency (E): Defined as E = S / p. In the above example, E = 13.8 / 16 ≈ 0.86.

• Linear Speedup: A program has linear speedup if S ≈ p. Parallel odd-even sort does not obtain perfect linear speedup because as p increases, communication overhead grows. However, for large n it approaches linear speedup (efficiency stays high).

• Scalability:
o It is not strongly scalable, because if we keep the number of keys (n) constant and increase the number of processes (p), efficiency drops.
o It is weakly scalable, because if we increase n proportionally with p, the efficiency remains relatively constant or improves. For example, in Table 3.9, doubling both p and n (from p = 2, n = 200 to p = 4, n = 400) maintains similar runtimes (43 ms vs 46 ms).

MODULE – 4

Model Question Paper I

Q.7.a. Define OpenMP. Explain the following OpenMP pragmas with code snippets: a) atomic b) for
c) parallel d) sections [8 Marks]

Answer: OpenMP is an Application Programming Interface (API) specifically designed for shared-memory
MIMD (Multiple Instruction, Multiple Data) programming. The "MP" stands for multiprocessing, as the API
targets systems where multiple autonomous CPU cores access a common main memory. Unlike Pthreads,
which is low-level and requires explicit thread management, OpenMP provides a higher-level abstraction
where the programmer indicates that a block of code should run in parallel, and the compiler/runtime system
handles the threads.

OpenMP Pragmas:

• a) atomic: This directive ensures that a specific memory location is updated atomically, meaning the
load, modify, and store operations are not interrupted by other threads. It is generally faster than a
critical section because many processors support this at the hardware level.

Code Snippet:

#pragma omp atomic


count++; // Increments 'count' safely across threads

• b) for: Used to parallelize for-loops by dividing the loop iterations among the available threads. The
loop must be in canonical form (iteration count must be known before the loop starts).

Code Snippet:

#pragma omp parallel for


for (i = 0; i < n; i++) {
a[i] = b[i] + c[i]; // Iterations are divided among threads
}

• c) parallel: This is the most basic directive. When encountered, the system spawns a team of threads
to execute the structured block of code following it.

Code Snippet:

#pragma omp parallel num_threads(4)


{
printf("Hello from thread %d\n", omp_get_thread_num());
}

• d) sections: (Note: The provided sources do not contain detailed information on the sections pragma;
the following is from external knowledge.) This directive is used to assign different, independent
blocks of code (sections) to different threads. Each section is executed once by one thread in the
team.

Code Snippet (External Information):

#pragma omp parallel sections
{
    #pragma omp section
    function_A();
    #pragma omp section
    function_B();
}

Q.7.b. Write an OpenMP program to sort an array on n elements using both sequential and parallel
mergesort using Section. [8 Marks]

Answer: (Note: The provided sources cover Bubble Sort and Odd-Even Sort, but do not include a MergeSort
program or detailed sections usage. The following program is provided from external knowledge to
complete the requirement.)

In a parallel MergeSort, the two recursive calls (left and right halves) can be executed as independent tasks
or sections.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

/* Merge the two sorted halves a[l..m] and a[m+1..r]. */
void merge(int* a, int l, int m, int r) {
    int n = r - l + 1, i = l, j = m + 1, k = 0;
    int* tmp = malloc(n * sizeof(int));
    while (i <= m && j <= r) tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i <= m) tmp[k++] = a[i++];
    while (j <= r) tmp[k++] = a[j++];
    for (k = 0; k < n; k++) a[l + k] = tmp[k];
    free(tmp);
}

void parallel_mergesort(int* a, int l, int r) {
    if (l < r) {
        int m = l + (r - l) / 2;
        /* The two recursive calls are independent, so they can run as sections.
           The if clause falls back to sequential execution for small ranges.   */
        #pragma omp parallel sections if (r - l > 1000)
        {
            #pragma omp section
            parallel_mergesort(a, l, m);
            #pragma omp section
            parallel_mergesort(a, m + 1, r);
        }
        merge(a, l, m, r);
    }
}

int main(void) {
    int n = 10;
    int a[] = {9, 3, 7, 5, 6, 4, 8, 2, 0, 1};
    parallel_mergesort(a, 0, n - 1);
    for (int i = 0; i < n; i++) printf("%d ", a[i]);
    printf("\n");
    return 0;
}

Q.7.c. How are locks used in message-passing programs? Explain with examples. [4 Marks]

Answer: In OpenMP programs that simulate message passing with shared message queues, locks provide a more fine-grained way to ensure mutual exclusion than the critical directive. While an unnamed critical section serializes all protected blocks across the entire program, a lock can be associated with a specific data structure (such as an individual message queue).

Usage Pattern:

1. Initialize: The lock is initialized once using omp_init_lock().


2. Acquire: A thread calls omp_set_lock() before entering a critical section. If the lock is held by
another thread, it blocks.
3. Release: After finishing the operation, the thread calls omp_unset_lock().
4. Destroy: Once the resource is no longer needed, the lock is destroyed using omp_destroy_lock().

Example: Instead of a global critical section that stops all threads from accessing any queue, each queue has
its own lock.

/* q_p points to a specific thread's queue */


omp_set_lock(&q_p->lock);
Enqueue(q_p, my_rank, mesg); // Only blocks threads accessing the SAME queue
omp_unset_lock(&q_p->lock);
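
A minimal standalone sketch of the full four-step lifecycle (the shared counter is illustrative, standing in for the message-queue structure):

#include <stdio.h>
#include <omp.h>

int main(void) {
    omp_lock_t lock;
    int shared_count = 0;

    omp_init_lock(&lock);              /* 1. Initialize once */

    #pragma omp parallel num_threads(4)
    {
        omp_set_lock(&lock);           /* 2. Acquire before the update  */
        shared_count++;                /*    protected shared operation */
        omp_unset_lock(&lock);         /* 3. Release                    */
    }

    omp_destroy_lock(&lock);           /* 4. Destroy when no longer needed */
    printf("count = %d\n", shared_count);
    return 0;
}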

Q.8.a. Explain the trapezoidal rule with the help of an OpenMP program. [7 Marks]

Answer: The trapezoidal rule is a numerical method to estimate the area under a curve y = f(x) by dividing the interval [a, b] into n subintervals and treating each as a trapezoid.

OpenMP Implementation Strategy:

1. Partitioning: The interval [a, b] is divided into sub-intervals, and each thread is assigned a contiguous
block of trapezoids.
2. Local Computation: Each thread calculates its local sum (my_result) using the serial trapezoidal rule
over its assigned range.
3. Synchronization: Since global_result is a shared variable, threads must update it within a critical
section to avoid a race condition.

Diagrams:

• Refer to Figure 5.3, "The trapezoidal rule".
• Refer to Figure 5.4, "Assignment of trapezoids to threads".

OpenMP Program Snippet (Program 5.2):

void Trap(double a, double b, int n, double* global_result_p) {


double h, x, my_result;
double local_a, local_b;
int i, local_n;
int my_rank = omp_get_thread_num();
int thread_count = omp_get_num_threads();

h = (b-a)/n;
local_n = n/thread_count;
local_a = a + my_rank * local_n * h;
local_b = local_a + local_n * h;

my_result = (f(local_a) + f(local_b))/2.0;


for (i = 1; i <= local_n-1; i++) {
x = local_a + i*h;
my_result += f(x);
}
my_result = my_result*h;

#pragma omp critical


*global_result_p += my_result;
}

Q.8.b. Discuss the following concepts in OpenMP with examples: a) Message-passing b) Sending and
Receiving Messages c) Termination Detection d) Startup and Queues [8 Marks]

Answer:

• a) Message-passing: In OpenMP shared-memory systems, message-passing can be simulated using


message queues. Each thread has its own queue. To communicate, a "sender" thread adds data to a
"receiver" thread's queue, and the receiver checks its own queue to process the data.
• b) Sending and Receiving Messages:
o Sending: Involves an enqueue operation. Since multiple threads might try to send messages to
the same destination queue simultaneously, the enqueue operation is a critical section to
protect the queue's rear pointer.
o Receiving: Involves a dequeue operation. Only the owner thread dequeues from its own
queue. Synchronization is usually only needed if the queue has exactly one message to
prevent a race condition with a simultaneous enqueue.
• c) Termination Detection: Threads need a way to know when no more messages will arrive so they
can stop. This is often handled by a global counter (e.g., done_sending). A thread increments this
counter when it finishes sending all its messages. A thread can terminate only when its queue is
empty and the counter equals the total number of threads.
• d) Startup and Queues:
o Queues: These are FIFO (First-In, First-Out) data structures. In OpenMP, a queue struct
typically contains a list of messages, front/rear pointers, and counters for enqueued/dequeued
items.
o Startup: At startup, a master thread allocates an array of queue pointers shared by all threads.
To avoid a thread trying to send a message before the destination queue is allocated, a
#pragma omp barrier is used to ensure all threads finish allocation before communication
begins.

Q.8.c. Illustrate with an OpenMP program, the implementation of thread-safety using a multi-
threaded tokenizer. [5 Marks]

Answer:

1. Definition of Thread-Safety: Thread-safety refers to the ability of a block of code or a function to be


executed simultaneously by multiple threads without causing errors, crashes, or data inconsistencies. A
program is considered thread-safe if it guarantees correct results even when multiple threads access shared
resources at the same time.

2. The Problem with Standard strtok(): The standard C library function strtok() is not thread-safe. This is
because it uses an internal static variable to remember its position in the string between successive calls.
When multiple threads call strtok() simultaneously, they all share and overwrite this same static pointer,
leading to unpredictable results and corrupted data.

3. The Solution: strtok_r(): To ensure thread-safety, the C standard provides a re-entrant (thread-safe)
version called strtok_r(). Instead of an internal static variable, it uses a third argument, saveptr, which is a
user-managed pointer that keeps track of the string position. By ensuring each thread maintains its own
private saveptr, the function can be safely used in parallel.

4. OpenMP Implementation: In the following program, each thread receives a line of text and tokenizes it
independently. By declaring saveptr inside the parallel block (or as a private variable), we ensure that each
thread has its own state, preventing interference.

#include <stdio.h>
#include <string.h>
#include <omp.h>

void Tokenize(char* line, int my_rank) {
   char *token;
   char *saveptr;   /* Each thread has its own private pointer to track position */

   /* First call: pass the string to be tokenized */
   token = strtok_r(line, " ", &saveptr);
   while (token != NULL) {
      printf("Thread %d found token: %s\n", my_rank, token);
      /* Subsequent calls: pass NULL to continue with the same string */
      token = strtok_r(NULL, " ", &saveptr);
   }
}

int main(void) {
   /* Each thread tokenizes its own modifiable copy of a line
      (strtok_r writes into the string, so string literals cannot be used directly) */
   char lines[2][64] = {"OpenMP is fast", "Thread safety matters"};

#  pragma omp parallel num_threads(2)
   {
      int my_rank = omp_get_thread_num();   /* Get thread ID */
      /* Each thread processes its assigned line safely */
      Tokenize(lines[my_rank], my_rank);
   }

   return 0;
}

Key Points for Exams:

• Static Variables: Functions like strtok(), rand(), and localtime() are not thread-safe because they rely
on internal static memory.
• Re-entrancy: Using re-entrant functions like strtok_r() allows each thread to manage its own state.
• Non-determinism: Thread-safety issues (race conditions) are difficult to detect because they are non-
deterministic and may only cause errors in specific, unpredictable runs.

Model Question Paper II

Q.7.a. Define shared memory programming with OpenMP. Explain the following OpenMP pragmas
with code snippets: a) parallel for b) single c) section d) critical [8 Marks]

Answer: Shared Memory Programming with OpenMP: OpenMP is an Application Programming Interface
(API) designed specifically for shared-memory MIMD (Multiple Instruction, Multiple Data) programming.
In this model, multiple autonomous CPU cores access a common main memory. OpenMP offers a high-level
abstraction where the programmer uses pragmas (compiler directives) to indicate parallel blocks, and the
system handles thread management automatically.

OpenMP Pragmas:

• parallel for: This directive automatically parallelizes a for loop by dividing the iterations among the
available threads.

Snippet:

#pragma omp parallel for


for (i = 0; i < n; i++) {
a[i] = b[i] + c[i];
}

• single: This directive ensures that the following structured block is executed by only one thread in
the team. It is often used for tasks like I/O or launching tasks.

Snippet:

#pragma omp parallel


{
#pragma omp single
printf("This is printed only once.\n");
}

• section: Used within a sections block to define independent units of work. Each section is executed
once by one thread in the team.

Snippet:

#pragma omp parallel sections


{
#pragma omp section
func1();
#pragma omp section
func2();
}

• critical: Specifies a block of code that must be executed by only one thread at a time, providing
mutual exclusion to prevent race conditions when updating shared variables.

Snippet:

#pragma omp critical


global_result += my_result;

Q.7.b. Explain reduction clause with syntax and example. How does the reduction clause simplify
parallel computation compared to using a critical section? [8 Marks]

Answer: Reduction Clause: The reduction clause allows each thread to have its own private copy of a
variable during a parallel region. At the end of the region, OpenMP automatically combines these private
copies into a single shared variable using a specified operator.

• Syntax: reduction(<operator> : <variable list>)


• Common Operators: +, *, -, &, |, &&, ||.
• Example:

double approx = 0.0;


#pragma omp parallel for reduction(+: approx)
for (i = 1; i <= n - 1; i++) {
approx += f(a + i * h);
}

Comparison with Critical Section:

• Performance: A critical section serializes the update. Only one thread can update the shared variable
at a time, which forces other threads to wait and reduces parallelism.
• Simplicity: With reduction, the programmer does not need to manually manage local sums or wrap
updates in critical blocks. OpenMP handles the private initialization and the final global aggregation
automatically, making the code cleaner and faster.
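
For contrast, here is a sketch of the same sum written without reduction; the programmer must keep a local sum and protect the final update by hand (a, h, n, and f are assumed to be defined as in the example above):

double approx = 0.0;

#pragma omp parallel
{
   double my_sum = 0.0;

#  pragma omp for
   for (int i = 1; i <= n - 1; i++)
      my_sum += f(a + i * h);

#  pragma omp critical
   approx += my_sum;   /* serialized: only one thread updates approx at a time */
}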

Q.7.c. Write an OpenMP program that determines the default scheduling of parallel for loops. [4
Marks]

Answer: This program uses omp_get_thread_num() to identify which thread is executing which loop
iteration.

#include <stdio.h>
#include <omp.h>

int main() {
   int n;
   printf("Enter number of iterations: ");
   scanf("%d", &n);

#  pragma omp parallel for num_threads(2)
   for (int i = 0; i < n; i++) {
      int my_rank = omp_get_thread_num();
      printf("Thread %d: Iteration %d\n", my_rank, i);
   }

   return 0;
}

Q.8.a. Explain schedule clause with syntax and example. Describe the various schedule types with
examples. [8 Marks]

Diagram:

• Refer to Figure 5.5, "Scheduling visualization for the static, dynamic, and guided schedule types with 4 threads and 32 iterations. The first static schedule uses the default chunksize, whereas the second uses a chunksize of 2. The exact distribution of work across threads will vary between different executions of the program for the dynamic and guided schedule types, so this visualization shows one of many possible scheduling outcomes."

Answer: The schedule clause gives the programmer control over how loop iterations are mapped to threads.

• Syntax: schedule(<type> [, <chunksize>])


• Schedule Types:
o static: Iterations are divided into blocks of chunksize and assigned in round-robin fashion
before the loop runs.
▪ Example: schedule(static, 2) assigns iterations 0-1 to Thread 0, 2-3 to Thread 1, and so on in round-robin order.
o dynamic: Iterations are divided into chunks; threads request a new chunk from the runtime
after finishing their current one. Useful when iteration times are unpredictable.
o guided: Similar to dynamic, but chunk sizes decrease over time as iterations remain.
o runtime: The schedule type and chunk size are determined at execution time via the
OMP_SCHEDULE environment variable.
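
The following sketch shows the clauses in use (Work, a, and n are illustrative names, not from the text):

/* static, chunksize 2: blocks of 2 iterations dealt out round-robin before the loop runs */
#pragma omp parallel for schedule(static, 2)
for (int i = 0; i < n; i++)
   a[i] = Work(i);

/* dynamic, chunksize 4: each thread requests the next block of 4 when it finishes its current one */
#pragma omp parallel for schedule(dynamic, 4)
for (int i = 0; i < n; i++)
   a[i] = Work(i);

/* runtime: decided at execution time, e.g. by  export OMP_SCHEDULE="guided,2" */
#pragma omp parallel for schedule(runtime)
for (int i = 0; i < n; i++)
   a[i] = Work(i);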

Q.8.b. Discuss the scope of variables in OpenMP with examples. [6 Marks]

Answer: In OpenMP, the scope determines which threads can access a variable.

1. Shared Scope: All threads in a team share a single copy of the variable. Variables declared before a
parallel block are shared by default.
o Example: global_result in the trapezoidal rule program.
2. Private Scope: Each thread has its own unique copy of the variable. Changes made by one thread are
not visible to others. Variables declared inside the parallel block are private by default.
o Example: The loop index i in a parallel for is private.
3. default(none) Clause: This forces the programmer to explicitly declare the scope of every variable
used in the block, reducing errors.
o Example: #pragma omp parallel for default(none) shared(n) private(k, factor).
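
A short sketch combining the three ideas (the variable names and values are illustrative):

int n = 1000;          /* declared before the parallel directive: shared by default */
double sum = 0.0;

/* default(none) forces an explicit scope for every variable used in the block */
#pragma omp parallel for default(none) shared(n) reduction(+: sum)
for (int i = 0; i < n; i++) {
   double x = 0.5 * i;  /* i and x are declared inside the block: private to each thread */
   sum += x;
}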

Q.8.c. Explain the concept of cache coherence and false sharing in OpenMP programs. [6 Marks]

Answer: Cache Coherence: Modern CPUs use fast local caches to store data from main memory. In shared-
memory systems, multiple copies of a variable can exist (one in main memory and one in each core's cache).
When one core updates its local cache copy, the others become outdated. Cache coherence is the mechanism
that ensures all cores eventually see the most recent value, typically by invalidating outdated cache lines.

False Sharing: Cache coherence is managed at the level of cache lines (usually 64 bytes) rather than
individual variables. False sharing occurs when two threads update different variables that happen to reside
on the same cache line.

• The Conflict: When Thread 0 updates its variable, the hardware invalidates the entire cache line for
Thread 1.
• Result: Thread 1 is forced to reload the line from main memory even though its data wasn't actually
changed by Thread 0. This causes significant performance degradation.

Diagrams/Tables:

• Refer to Table 5.5, "Memory and cache accesses".
• Refer to Figure 5.6, "Matrix-vector multiplication" (context for false sharing).
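
A minimal sketch of the problem and a common fix (array sizes and the loop count are illustrative; omp.h is assumed to be included):

double partial[8] = {0};                 /* 8 doubles = 64 bytes: likely one cache line */

#pragma omp parallel num_threads(8)
{
   int my_rank = omp_get_thread_num();
   for (int i = 0; i < 100000000; i++)
      partial[my_rank] += 1.0;           /* false sharing: each update invalidates the
                                            line cached by the other threads */
}

/* One common fix: pad each thread's slot so it occupies its own cache line */
struct padded_double { double val; char pad[56]; };
struct padded_double padded[8];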

MODULE – 5
Model Question Paper I
Q.9 a. Define GPU and GPGPU. Explain the architecture of a GPU with a neat block diagram and
compare it with a CPU architecture. (8 Marks)

1. Definitions:

• GPU (Graphics Processing Unit): Originally developed in the late 1990s and early 2000s, GPUs are
processors specifically designed to accelerate programs that render complex, detailed images and
animations.
• GPGPU (General-Purpose computing on GPUs): This refers to the use of GPUs for general
computational problems beyond graphics, such as numerical simulations, searching, and sorting.

2. GPU Architecture: Modern GPU architecture is built around massive parallelism. Key components
include:

• Streaming Multiprocessors (SMs): A GPU consists of several SMs which operate asynchronously.
• Streaming Processors (SPs): Each SM contains many SPs, which are the actual datapaths that
execute instructions. For example, some Nvidia GPUs have over 10,000 SPs in total.
• Memory Hierarchy: Each SM has a small, fast shared memory accessible by its SPs, while all SMs
share a larger but slower global memory.

Diagram Reference: Please refer to FIGURE 6.1: Simplified block diagram of a GPU
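
These architectural parameters can be inspected at run time with the CUDA runtime API; the following is a small sketch, assuming device 0 is the GPU of interest:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
   cudaDeviceProp prop;
   cudaGetDeviceProperties(&prop, 0);            /* properties of device 0 */
   printf("Streaming Multiprocessors (SMs): %d\n", prop.multiProcessorCount);
   printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
   printf("Global memory: %zu bytes\n", prop.totalGlobalMem);
   return 0;
}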

3. Comparison: CPU vs. GPU Architecture:

• Taxonomy: A CPU is generally a SISD (Single Instruction, Single Data) device, whereas a GPU is a SIMD (Single Instruction, Multiple Data) or SIMT (Single Instruction, Multiple Thread) processor.
• Execution: On a CPU, each instruction operates on a small set of data; on a GPU, one instruction is applied to many data elements simultaneously.
• Parallelism: A CPU offers low parallelism and is optimized for serial tasks; a GPU offers massive parallelism and is optimized for parallel workloads.
• Memory: The CPU uses the Host memory, while the GPU uses the Device (global) memory.

Q.9 b. Define threads, blocks and grids. Write a CUDA program that prints greetings from threads in
multiple blocks, and mention the variables that are initialized in each thread’s memory when a kernel
begins execution. (6 Marks)

1. Definitions:

• Threads: The smallest units of execution in CUDA that run on the Streaming Processors (SPs).
• Blocks: Threads are grouped into blocks, which execute on a single Streaming Multiprocessor (SM).
• Grids: All thread blocks launched by a single kernel call are grouped into a grid.

2. CUDA Greetings Program (Multiple Blocks):

#include <stdio.h>
#include <cuda.h>

/* Kernel: runs on GPU */


__global__ void Hello(void) {
// Prints thread index and block index
printf("Hello from thread %d in block %d\n", threadIdx.x, blockIdx.x);
}

int main(int argc, char* argv[]) {


int blk_ct = 2; // Number of blocks
int th_per_blk = 4; // Threads per block

// Launch kernel with multiple blocks


Hello<<<blk_ct, th_per_blk>>>();

cudaDeviceSynchronize(); // Wait for GPU to finish


return 0;
}

(Based on Program 6.2)

3. Predefined Variables Initialized in Each Thread: When a kernel starts, each thread can identify its position
using these 3D variables:

• threadIdx: The thread’s index within its specific block.


• blockIdx: The block’s index within the grid.
• blockDim: The dimensions (number of threads) of the block.
• gridDim: The dimensions (number of blocks) of the grid.
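
A tiny kernel sketch showing how these variables combine for a one-dimensional grid and blocks (stdio.h is assumed to be included, as in the greetings program above):

__global__ void Where_am_i(void) {
   int my_rank = blockDim.x * blockIdx.x + threadIdx.x;  /* global rank in the grid   */
   int total   = gridDim.x  * blockDim.x;                /* total threads in the grid */
   printf("Thread %d of %d (block %d)\n", my_rank, total, blockIdx.x);
}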

Q.9 c. Discuss the possible approaches to returning a result to the host from a CUDA kernel with the
help of code snippet. (6 Marks)

CUDA kernels have a void return type, meaning they cannot return values directly. The following
approaches are used to send results back to the host:

1. Using Unified Memory: Memory is allocated with cudaMallocManaged(), which is accessible by both
host and device. The system automatically synchronizes the data.

int *sum_p;
cudaMallocManaged(&sum_p, sizeof(int));
Add<<<1,1>>>(2, 3, sum_p); // Kernel writes to sum_p
cudaDeviceSynchronize();
printf("Sum: %d", *sum_p);

2. Without Unified Memory (Explicit Copying): Separate pointers are used for host and device. Data must
be explicitly moved using cudaMemcpy.

int *hsum_p, *dsum_p;


hsum_p = (int*)malloc(sizeof(int));
cudaMalloc(&dsum_p, sizeof(int));
Add<<<1,1>>>(2, 3, dsum_p);
// Copy result from device to host
cudaMemcpy(hsum_p, dsum_p, sizeof(int), cudaMemcpyDeviceToHost);

3. Using a Managed Global Variable: A global variable is declared with the __managed__ qualifier, making
it visible to both the host and device.

__managed__ int sum;


__global__ void Add(int x, int y) { sum = x + y; }

Q.10 a. What are the several problems that arise during a CUDA implementation? Explain with a
program how to implement CUDA trapezoidal rule I to overcome it. (8 Marks)

1. Common Problems in CUDA Implementation:

• Uninitialized Variables: Formal arguments like step size must be correctly passed and initialized.
• Out-of-Range Indices: Threads with indices outside the valid computation range (e.g., my_i >= n)
must be prevented from executing.
• Race Conditions: Multiple threads updating a shared variable simultaneously can lead to incorrect
results.
• Return Type Restriction: Kernels must be void, requiring memory-based communication.

2. CUDA Trapezoidal Rule I Implementation: To overcome these problems, the implementation uses
Unified Memory for sharing data and atomicAdd to prevent race conditions during summation.

/* Kernel */

__global__ void Dev_trap(float a, float b, float h, int n, float* trap_p) {


int my_i = blockDim.x * blockIdx.x + threadIdx.x;

// Condition to avoid out-of-range threads


if (0 < my_i && my_i < n) {
float my_x = a + my_i * h;
float my_trap = f(my_x);
// atomicAdd prevents race conditions
atomicAdd(trap_p, my_trap);
}
}

/* Host Wrapper Function */


void Trap_wrapper(float a, float b, int n, float* trap_p, int blk_ct, int th_per_blk) {
*trap_p = 0.5 * (f(a) + f(b)); // Initialization
float h = (b - a) / n;

Dev_trap<<<blk_ct, th_per_blk>>>(a, b, h, n, trap_p);


cudaDeviceSynchronize();

*trap_p = h * (*trap_p); // Final scaling


}

(Based on Program 6.12)

Q.10 b. Explain the concept of warps in CUDA. Discuss with examples how warp shuffle operations
improve the efficiency of inter-thread communication within a warp. (8 Marks)

1. Concept of Warps: A warp is a group of 32 threads with consecutive ranks within a thread block. All
threads in a warp execute in SIMD fashion; they must execute the same instruction at the same time. If they
take different paths (like an if-else branch), they "diverge" and execute sequentially until they "converge"
again.

2. Warp Shuffle Operations: Warp shuffle functions (available on compute capability ≥ 3.0) allow a thread
to directly read values from the registers of another thread within the same warp.

Example: __shfl_down_sync In a tree-structured sum, instead of using slow shared memory, threads use
shuffles to pass values.

// Each thread adds a value from a thread 'diff' lanes away


var += __shfl_down_sync(0xffffffff, var, diff);

3. Why Warp Shuffles Improve Efficiency:

• Speed: They use registers, which are the fastest memory level (approx. 1 clock cycle), whereas
shared memory is slower and global memory is much slower.
• No Barriers: They eliminate the need for __syncthreads() because threads within a warp already
operate in lockstep.
• Reduced Memory Traffic: By avoiding shared/global memory, they reduce bank conflicts and
bandwidth congestion.
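
A sketch of a complete warp sum built from this operation (the function name Warp_sum is illustrative); after the loop, lane 0 of the warp holds the total:

__device__ float Warp_sum(float var) {
   /* Tree-structured sum across the 32 lanes of the calling warp */
   for (int diff = warpSize / 2; diff > 0; diff = diff / 2)
      var += __shfl_down_sync(0xffffffff, var, diff);
   return var;   /* only lane 0's return value is the full warp sum */
}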

Q.10 c. Write a note on heterogeneous computing. (4 Marks)

Heterogeneous computing refers to systems that use different types of processor architectures (e.g., a CPU
and a GPU) working together to execute a single program.

Key Points:

• Host and Device: The CPU is called the Host, and the specialized processor (like a GPU) is the
Device.
• SPMD Model: Programs often follow the Single Program, Multiple Data model, where the CPU
manages tasks and launches parallel kernels on the GPU.
• Performance: This approach became essential after 2003 when CPU single-thread performance
growth slowed. It allows programmers to use the massive parallelism of GPUs or the flexibility of
FPGAs and DSPs to achieve higher efficiency.
• Collaboration: A CUDA program is effectively two interacting programs—one running on the CPU
and one on the GPU—cooperating to solve a problem.

Model Question Paper II

Q.9 a. Write a CUDA program to perform addition of two vectors and explain the role of kernel and
main function in it. (8 Marks)

1. CUDA Vector Addition Program:

#include <stdio.h>
#include <cuda.h>

/* Kernel: Runs on GPU */


__global__ void Vec_add(const float x[], const float y[], float z[], const int n) {
// Calculate global thread index
int my_elt = blockDim.x * blockIdx.x + threadIdx.x;

// Check bounds to prevent out-of-range access


if (my_elt < n) {
z[my_elt] = x[my_elt] + y[my_elt];
}
}

/* Main Function: Runs on CPU */


int main(int argc, char* argv[]) {
int n = 1000; // Vector size
float *x, *y, *z;

// Allocate Unified Memory (accessible by CPU and GPU)


cudaMallocManaged(&x, n * sizeof(float));
cudaMallocManaged(&y, n * sizeof(float));
cudaMallocManaged(&z, n * sizeof(float));

// Initialize vectors x and y on host


for(int i=0; i<n; i++) { x[i] = 1.0; y[i] = 2.0; }

// Launch kernel with 4 blocks of 256 threads


Vec_add<<<4, 256>>>(x, y, z, n);

// Wait for GPU to finish



cudaDeviceSynchronize();

// Free memory
cudaFree(x); cudaFree(y); cudaFree(z);
return 0;
}

(Based on Program 6.3)

2. Role of the Kernel:

• The kernel (defined with __global__) is a function started by the host but executed by many threads
in parallel on the device.
• It performs the actual computation. Each thread computes its own global rank (index) using
blockDim.x * blockIdx.x + threadIdx.x to determine which specific vector element to process.
• It operates as a parallel version of a serial loop, where each iteration is assigned to a different thread.

3. Role of the Main Function:

• The main function runs on the Host (CPU) and manages the overall program flow.
• It handles memory management, such as allocating memory on the device using
cudaMallocManaged or cudaMalloc.
• It launches the kernel using the triple-angle bracket syntax <<<blocks, threads>>> and ensures the
host waits for the device to complete via cudaDeviceSynchronize().

Q.9 b. Explain the use of __syncthreads function in CUDA with an example. Write a CUDA program
implementing trapezoidal rule using shared memory. (8 Marks)

1. Use of __syncthreads():

• __syncthreads() is a barrier synchronization function.


• It ensures that all threads in a block reach the same point in the code before any of them are allowed
to proceed further.
• Example: In a block-level sum, warp 0 might try to add up results from other warps. If warp 0 starts
before warp 1 has finished its calculation, the result will be wrong (a race condition). Using
__syncthreads() forces warp 0 to wait until warp 1 is done.
• Note: It only works within a block, not across multiple blocks.

2. CUDA Trapezoidal Rule (Shared Memory):

#define MAX_BLKSZ 512   /* assumed upper bound on threads per block */

__global__ void Dev_trap(const float a, const float b, const float h, const int n, float* trap_p) {
   // Shared memory for the block: one slot per thread
   __shared__ float shared_vals[MAX_BLKSZ];
   int my_i = blockDim.x * blockIdx.x + threadIdx.x;

   shared_vals[threadIdx.x] = 0.0f;
   if (0 < my_i && my_i < n) {
      float my_x = a + my_i * h;
      shared_vals[threadIdx.x] = f(my_x);
   }

   // Synchronize to ensure all threads have written to shared_vals
   __syncthreads();

   // Reduction within shared memory (simplified; Shared_mem_sum returns the block's total)
   float result = Shared_mem_sum(shared_vals);

   // Only thread 0 of the block updates the global sum
   if (threadIdx.x == 0) atomicAdd(trap_p, result);
}

(Based on Program 6.14)

Q.9 c. Describe the working of atomicAdd and how it helps to avoid race conditions. (4 Marks)

• Race Condition: This occurs when multiple threads attempt to update a shared variable
simultaneously. Because addition involves multiple machine instructions (read, add, write), one
thread might overwrite the changes made by another.
• Working of atomicAdd: It is a built-in function that performs an indivisible operation. When a thread
calls atomicAdd, no other thread can observe or interfere with the variable until the addition is
complete.
• Benefit: It ensures consistency and correctness in concurrent execution by forcing threads to access
the memory location one at a time.

Q.10 a. Describe the concept of a tree-structured global sum and explain how it improves performance
over a basic global sum. (8 Marks)

1. Basic Global Sum:

• In a basic sum, threads are effectively serialized. Each thread waits for its turn to add its value to a
total using atomicAdd.
• If there are t threads, it requires t consecutive additions.
• Diagram Reference: See FIGURE 6.3: Basic sum .

2. Tree-Structured Global Sum:

• Threads are paired so that half of the active threads add their partial sum to their partner’s sum in
each step.
• This creates a binary tree (or "shrub-like") pattern where the number of active threads is halved in
every iteration.

3. Performance Improvement:

• While a basic sum grows linearly with the number of threads, a tree-structured sum grows
logarithmically.
• For 1,000,000 threads, a basic sum requires 1,000,000 steps, while a tree-structured sum requires
only 21 steps.
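
A sketch of a tree-structured sum within one thread block using shared memory (assuming 256 threads per block, a power-of-two block size, and an input length that is a multiple of the block size):

__global__ void Tree_sum(const float x[], float* sum_p) {
   __shared__ float vals[256];                  /* one slot per thread in the block */
   int t = threadIdx.x;

   vals[t] = x[blockDim.x * blockIdx.x + t];
   __syncthreads();

   /* Halve the number of active threads each step: 128, 64, ..., 1 additions deep */
   for (int half = blockDim.x / 2; half > 0; half /= 2) {
      if (t < half) vals[t] += vals[t + half];
      __syncthreads();
   }

   /* Thread 0 now holds the block's total; add it to the global result */
   if (t == 0) atomicAdd(sum_p, vals[0]);
}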

Q.10 b. Explain how shared memory is used to compute warp sums in CUDA. (6 Marks)

• Purpose: For GPUs with compute capability < 3.0 (which lack warp shuffle), shared memory is used
to allow threads in a warp to communicate.
• Implementation (Dissemination Sum):
1. A shared memory array (e.g., shared_vals) is declared.
2. Threads in a warp execute in lockstep (SIMD fashion), which prevents race conditions during
the process.
3. In each iteration, every thread reads a value from a "source" index in shared memory and
adds it to its own index.
4. By the end of the iterations, every thread in the warp possesses the same final sum.
• Efficiency: Performance is high because shared memory is much faster than global memory.
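
A sketch of the dissemination pattern is given below. The text relies on lockstep execution within a warp; this sketch adds __syncwarp() calls, available on current GPUs, to make the ordering explicit. The function name is illustrative, and shared_vals is assumed to point at this warp's own 32-float slice of a shared-memory array:

__device__ float Shared_warp_sum(float my_val, float shared_vals[]) {
   int my_lane = threadIdx.x % warpSize;
   shared_vals[my_lane] = my_val;
   __syncwarp();

   /* Dissemination sum: after log2(32) = 5 steps, every lane holds the full total */
   for (int diff = warpSize / 2; diff > 0; diff /= 2) {
      float partner = shared_vals[(my_lane + diff) % warpSize];
      __syncwarp();                      /* all reads complete before any writes      */
      shared_vals[my_lane] += partner;
      __syncwarp();                      /* all writes complete before the next reads */
   }
   return shared_vals[my_lane];
}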

Q.10 c. Discuss the differences between shared memory, registers, and global memory in CUDA in
terms of speed and size. (6 Marks)

GPU memory is organized in a three-level hierarchy:


• Registers: Smallest capacity (e.g., 504 bytes per thread); fastest access (about 1 clock cycle); private to each thread.
• Shared Memory: Intermediate capacity (e.g., 48 KB per block); fast (roughly 10x slower than registers); accessible only by threads within the same block.
• Global Memory: Largest capacity (e.g., 1 GB to 12 GB per GPU); slowest (100x to 1000x slower than registers); accessible by all threads in the grid.
