Parallel Computing Notes

This document provides an introduction to parallel programming, emphasizing the importance of utilizing multiple cores in modern computing. It classifies parallel computers using Flynn's Taxonomy (SISD, SIMD, MIMD) and memory-based classifications (shared vs. distributed memory systems). Additionally, it discusses the characteristics and applications of SIMD and MIMD systems, highlighting their advantages and challenges in programming.

MODULE -1

Introduction to parallel programming, Parallel hardware and parallel software


Introduction to parallel programming
 Why parallel at all? Modern computers have many cores (even phones!). To keep getting faster
programs, we have to write software that uses multiple cores together, instead of just one. That’s
what parallel programming does.
 Processes vs. threads: In shared-memory programs we usually start one process and create
multiple threads inside it. In distributed-memory programs we usually start multiple processes
(often on different machines). When a point applies to both, the book writes “processes/threads.”
 SPMD (Single Program, Multiple Data): A very common style: all cores run the same executable,
but each core does a different part based on its rank (ID). SPMD works for both data parallelism
(split an array) and task parallelism (split different jobs).
Parallel hardware and parallel software (the big picture)
Parallel Hardware
Parallel hardware = hardware that allows programs to run tasks at the same time.
 Multiple issue & pipelining: CPU can run several instructions together internally, but this is not
visible to programmers.
 In this subject, we only call it parallel hardware if the programmer can see and use it by
writing/modifying code (like multicore processors, SIMD, GPUs).
 So:
o Invisible parallelism (CPU does it automatically) → not counted.
o Visible parallelism (programmer must write code to use it) → counted as parallel hardware.

 Parallel hardware (what the machine looks like): We care about machine features that are visible to
the programmer (you change code to use them): SIMD units, multiple cores, shared vs. distributed
memory, interconnects, etc.
 Parallel software (how we program it): We must (1) divide work, (2) synchronize, and (3)
communicate. Sometimes it’s easy (“embarrassingly parallel,” e.g., add two big arrays by splitting
indices), but usually we must coordinate carefully.
📘 Classifications of Parallel Computers
We can classify parallel computers in two different ways:

1️⃣ Flynn’s Taxonomy (based on instructions & data)


Flynn looked at how many instruction streams and how many data streams a computer can handle at the
same time.
🔹 (a) SISD → Single Instruction, Single Data
 Normal, old-style computers (von Neumann model).
 Can only do one instruction at a time, on one data value.
 Example: A basic single-core computer running one program step by step.

🔹 (b) SIMD → Single Instruction, Multiple Data


 All processors run the same instruction, but on different data.
 Good for array/matrix problems.
 Example: Adding two arrays → each processor adds one pair at the same time.
 Used in GPUs, vector processors.

🔹 (c) MIMD → Multiple Instruction, Multiple Data


 Different processors can run different instructions on different data.
 Very flexible – processors don’t have to do the same thing.
 Example: In a server, one processor handles file storage, another handles calculations.
 Used in most modern multicore CPUs.

⚡ So:
 SISD → serial (one instruction, one data).
 SIMD → same instruction on many data.
 MIMD → different instructions on different data.

2️⃣ Memory-Based Classification (based on memory organization)


This looks at how processors use memory to share data.
🔹 (a) Shared Memory Systems
 All processors share the same memory.
 They communicate by reading/writing shared variables.
 Easy to program.
 Problem: If too many processors try to use the memory at once → slowdown.
 Example: Multicore desktop/laptop processors.

🔹 (b) Distributed Memory Systems


 Each processor has its own private memory.
 Processors communicate by sending messages across a network.
 Harder to program, but scales better (can have thousands of processors).
 Example: Supercomputers or clusters of computers connected by a network.

✨ Summary in One Line


 Flynn’s taxonomy → based on instructions and data streams (SISD, SIMD, MIMD).
 Memory classification → based on how processors access memory (shared vs distributed).

📘 SIMD Systems (Single Instruction, Multiple Data)


🔹 What is SIMD?
 SIMD = Single Instruction, Multiple Data.
 Means: One instruction is applied to many data items at the same time.
 Structure:
o One Control Unit → sends the instruction.
o Many Datapaths → each datapath works on its own data.
 So → same work is done in parallel on multiple pieces of data.

What does SIMD mean?


 SIMD = Single Instruction, Multiple Data.
 One central control unit sends the same instruction to many datapaths (also called lanes).
 Each lane works on its own piece of data. So, many data items are processed at the same time with
the same instruction.
Quick picture in words: one teacher (control unit) says, “Add 5,” and many students (lanes) each add 5 to
their own number at the same time.

How it runs (broadcast + lanes)


1. Control unit broadcasts an instruction (e.g., “add”).
2. Every lane either:
o Executes that instruction on its current data item, or
o Stays idle (does nothing) if it shouldn’t apply there.
All lanes move in lockstep: they wait for the next broadcast before doing the next thing.

Example: Vector (array) addition


We want:
for (i = 0; i < n; i++)
    x[i] += y[i];
 If the SIMD machine has n lanes, it can load x[i] and y[i] into lane i, add them all at once, and store
back.
 If it has m lanes (m < n), it works in batches of m:
o Do elements 0..m-1, then m..2m-1, … until all are done.
o If the last batch isn’t full, some lanes sit idle in that batch.
Tiny example: m=4, n=15 → batches are 0–3, 4–7, 8–11, 12–14 (last batch uses only 3 lanes; 1 lane idle).
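Below is a minimal, purely illustrative C sketch of this batching idea (N = 15 and M = 4 mirror the tiny example above; real SIMD hardware executes the inner lane loop as a single vector instruction, whereas this scalar code only mimics the grouping):

#include <stdio.h>
#define N 15          /* number of elements (matches the tiny example)   */
#define M 4           /* number of SIMD lanes (matches the tiny example) */

int main(void) {
    double x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = i; y[i] = 10.0; }

    /* Process the arrays in batches of M elements, as an M-lane SIMD unit would. */
    for (int start = 0; start < N; start += M) {
        int active = (N - start < M) ? (N - start) : M;   /* last batch leaves some lanes idle */
        for (int lane = 0; lane < active; lane++)         /* on real hardware this inner loop  */
            x[start + lane] += y[start + lane];           /* is one broadcast vector instruction */
    }
    printf("x[14] = %.1f\n", x[14]);
    return 0;
}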

Conditionals hurt SIMD (why some lanes idle)


Consider:
for (i = 0; i < n; i++)
    if (y[i] > 0.0) x[i] += y[i];
 Each lane checks its y[i].
 If y[i] > 0, that lane does the add; otherwise that lane waits while others add.
 Result: mixed work → some lanes idle → lower performance.
(You’ll also hear this called divergence in practice.)

Two important properties of (classic) SIMD lanes


 Synchronous/lockstep: lanes step together with the broadcast instruction stream.
 No per-lane instruction storage: a lane can’t “save” an instruction to run later; it either runs now or
waits.

When SIMD is a great fit


 Simple loops over large arrays where every element does the same operation.
 Image processing, audio/video filters, matrix/vector math, scientific number crunching.
 This style is called data parallelism: split data across lanes, apply the same steps.
When SIMD is a poor fit
 Irregular control flow (lots of if/else, different work per element).
 Pointer-heavy/graph algorithms with unpredictable memory access.
 Small problems (not enough work to keep many lanes busy).
Vector processors (a classic SIMD style)
Idea: work on vectors (arrays) directly, not just single numbers.
Key features:
 Vector registers: hold many elements at once (e.g., 4 to 256 64-bit numbers).
 Vector instructions: one “vector add” operates on a whole block in one go.
o That means a loop like x[i]+=y[i] becomes: load vector, add vector, store vector (per block),
instead of per element.
 Pipelined functional units: keep the “assembly line” full for high throughput.
 Interleaved (multi-bank) memory: multiple memory banks let you fetch/store consecutive
elements quickly without waiting on the same bank.
 Strided access & scatter/gather:
o Strided: access every k-th element (e.g., 0, 4, 8, … → stride 4).
o Gather/scatter: read/write elements at irregular positions; hardware speeds this up.
Pros:
 Very fast on many numeric codes.
 Compilers are good at auto-vectorizing simple loops (and tell you why a loop can’t be vectorized).
 High memory bandwidth; data you load tends to be fully used.
Cons:
 Not great for irregular data structures.
 Scalability limit: you can’t keep making vectors arbitrarily longer.
 Commodity CPUs often support only short vectors (wide, but not huge); long-vector systems are
custom and expensive.
 Scaling usually comes from more vector processors, not longer vectors.
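To make the auto-vectorization point above concrete, here is a hedged sketch with two illustrative loops (the names saxpy and running_sum are mine, not from the text, and whether a given compiler vectorizes the first loop depends on flags and target):

#include <stdio.h>

/* Loop a compiler can usually auto-vectorize: fixed stride, no branches,
   no dependence between iterations. */
void saxpy(int n, double a, const double *x, double *y) {
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* Loop that usually cannot be vectorized: a loop-carried dependence
   (each element needs the one computed just before it). */
void running_sum(int n, double *a, const double *b) {
    for (int i = 1; i < n; i++)
        a[i] = a[i - 1] + b[i];
}

int main(void) {
    double x[4] = {1, 2, 3, 4}, y[4] = {0, 0, 0, 0};
    saxpy(4, 2.0, x, y);
    running_sum(4, y, x);
    printf("y[3] = %.1f\n", y[3]);
    return 0;
}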

GPUs (modern SIMD-influenced systems)


Graphics basics: Graphics APIs represent objects with points/lines/triangles, then turn them into pixels via
a graphics pipeline. Parts of this pipeline are programmable via small functions called shaders (short, C-like
code).
How GPUs parallelize:
 A shader is run on many elements (e.g., many pixels/vertices) in parallel.
 Each GPU core contains many lanes (e.g., 128) → uses SIMD inside a core.
 GPUs keep lanes busy by running huge numbers of threads and swapping between them to hide
memory delays (hardware multithreading).
Trade-offs:
 Excellent when you have lots of data and lots of similar work.
 Weaker on small problems (not enough threads to keep lanes busy).
Not pure SIMD or pure MIMD:
 Inside a core, lanes execute SIMD-style.
 Across cores (and sometimes within a core), different instruction streams can run → MIMD-like.
 Memory setups vary; many GPUs expose shared memory within a core/block (and larger systems
can look “distributed” across cores). For our course/text, focus on shared-memory view.
General computing on GPUs:
 GPUs are now common for high-performance computing beyond graphics (AI, simulations, etc.).
 You program them with GPU languages/APIs (e.g., CUDA, OpenCL), launching kernels (GPU
functions) from the CPU.

Key takeaways (SIMD in one glance)


 SIMD = same instruction, many data at once.
 Shines on uniform array loops (data parallelism).
 Suffers when work per element differs (lanes idle).
 Vector processors = classic SIMD machines with vector registers, vector instructions, and smart
memory (banks, stride, gather/scatter).
 GPUs use SIMD inside, plus massive threading; great for big, regular workloads.

📘 MIMD Systems (Multiple Instruction, Multiple Data)

🔹 What is MIMD?
 MIMD = Multiple Instruction, Multiple Data.
 This means:
o Many processors (CPUs/cores).
o Each processor can run its own instruction stream.
o Each processor can work on its own data set.
 So, different processors can do different tasks at the same time.
👉 Think of it like a team project:
 Each student (processor) can follow a different set of instructions.
 Each student can use their own books/notes (data).

🔹 How is MIMD different from SIMD?


 SIMD: Same instruction, many data (all processors do the same thing together).
 MIMD: Each processor does its own thing independently.

🔹 Key Characteristics of MIMD


1. Independent processors
o Each has its own control unit (decides which instruction to run).
o Each has its own datapath (executes instructions).
2. Asynchronous operation
o Processors do not have to work in lockstep (unlike SIMD).
o No need for a global clock → processors may run at different speeds.
o Even if they run the same program, they might be at different steps at the same time.
3. Synchronization needed
o Since processors work independently, we need special programming techniques (locks,
barriers, message passing) to make them coordinate when required.

🔹 Two Types of MIMD Systems


1. Shared-Memory Systems

 Many processors connect to one common memory.


 Processors communicate by reading/writing to shared variables in memory.
 Advantages:
o Easy to program.
o Natural way of communication (just use common variables).
 Disadvantages:
o Memory becomes a bottleneck (too many processors accessing at once).
o Need to handle cache coherence (keeping all processors’ copies of data consistent).
👉 Example: Modern multicore desktops/laptops.
2. Distributed-Memory Systems

 Each processor has its own private memory.


 Processors communicate by sending messages over a network.
 Advantages:
o Scales well (can build very large systems).
o No single shared memory bottleneck.
 Disadvantages:
o More difficult to program (programmer must explicitly manage communication).
o Slower communication compared to shared memory.
👉 Example: Supercomputers, clusters of machines connected by a network.

📘 Shared-memory vs Distributed-memory Systems

🔹 Shared-memory systems

 Multiple cores (CPUs) connected to a common memory.


 Each core can directly access the same memory.
 Usually built using multicore processors (e.g., your laptop/desktop CPU).
🖥 How caches work
 Each core has its own small private cache (L1).
 Some higher-level caches (L2/L3) may be shared among cores.
 All cores ultimately connect to the same main memory (RAM).

🔸 Two types of shared-memory systems


1. UMA (Uniform Memory Access)
 All cores take the same amount of time to access any memory location.
 Simple for programmers → don’t need to worry about where data is stored.
 But can become slower if too many cores compete for the same memory bus.
👉 Example: Small-scale multicore systems (your desktop/laptop CPU).
2. NUMA (Non-Uniform Memory Access)

 Each processor (chip) has faster access to its own local memory.
 Accessing memory from another chip is slower.
 More complex for programmers → must keep in mind where data is stored.
 Advantage: can handle larger memory sizes and give faster local access.
👉 Example: Large servers with multiple CPU sockets.

🔹 Distributed-memory systems
 Each processor (or node) has its own private memory.
 Processors communicate via a network (e.g., Ethernet, InfiniBand).
 Communication happens through message passing instead of shared variables.
🔸 Clusters
 Collection of commodity computers (like PCs or servers) connected with a network.
 Each node is usually a shared-memory system (multicore).
 So clusters are often hybrid systems (nodes = shared memory, cluster = distributed memory).
 Widely used in scientific computing & supercomputers.
👉 Example: A university supercomputer made of 100 PCs linked with Ethernet.
🔸 Grids
 A larger version of clusters, but spread geographically (different places).
 Connects computers across cities or countries into one big system.
 Usually heterogeneous (different types of hardware combined).
 Needs special software (middleware) to make it appear like one big computer.
👉 Example: Worldwide LHC Computing Grid (used for CERN experiments).

✨ Summary
 Shared-memory systems: All cores share the same memory.
o UMA: Same access time everywhere → easy to program.
o NUMA: Local memory is faster than remote memory → bigger & faster but harder to
program.
 Distributed-memory systems: Each processor has private memory, communicate via messages.
o Clusters: Group of PCs connected → often hybrid (each PC has shared memory).
o Grids: Huge distributed systems spread over the globe, connecting different types of
machines.

✨ Summary
 MIMD systems allow processors to run different instructions on different data at the same time.
 They are asynchronous (independent clocks).
 Two main types:
o Shared-Memory MIMD: processors share one common memory.
o Distributed-Memory MIMD: each processor has its own memory, communication happens
via messages.

✅ So:
 SIMD = all processors do the same task on different data.
 MIMD = processors can do completely different tasks, either sharing memory or keeping separate
memory.

✨ Interconnection Networks
🔹 Introduction
 An interconnection network is the system that connects processors and memory.
 It decides how fast processors can communicate with memory and with each other.
 Even if processors are very powerful, a slow interconnect will make the whole system slow.
👉 Types:
1. Shared-memory interconnects
2. Distributed-memory interconnects

🔹 1. Shared-memory Interconnects
In shared-memory systems, all processors share the same memory.
They are connected through an interconnect.
(a) Bus
 A bus is a set of parallel wires.
 All processors and memory are connected to it.
 Only one processor can use it at a time.
✅ Advantages:
1. Very cheap and simple.
2. Easy to add devices.
❌ Disadvantages:
1. Slow when many processors are connected.
2. Only one can use it → others must wait.
3. Not good for large systems.
👉 Diagram to draw: Simple processors and memory connected by a single line (bus).

(b) Crossbar
 A crossbar uses switches to directly connect processors and memory.
 Many processors can access different memory modules at the same time.
📌 Fig. 2.7 shows:
 (a) Processors & memories connected with switches.
 (b) Switch positions.
 (c) Example of multiple processors accessing memory simultaneously.
✅ Advantages:
1. Very fast – allows parallel memory access.
2. No conflict unless two processors want the same memory.
❌ Disadvantages:
1. Very expensive – requires many switches and links.
2. Not practical for large systems.

🔹 2. Distributed-memory Interconnects
Here, each processor has its own memory.
Processors communicate by sending messages via switches.
(a) Direct Interconnects
 Each processor-memory pair is connected to a switch.
 Switches are connected in a fixed pattern.
👉 Common types:
1. Ring (Fig. 2.8a)
 Processors connected in a circle.
 Each has 2 neighbors.
 Better than bus (multiple communications).
 But some processors may have to wait for others.
2. 2D Toroidal Mesh (Fig. 2.8b)
 Processors arranged in a grid with extra “wrap-around” links.
 More costly than ring (needs more links).
 Allows much better simultaneous communication.

(b) Bisection Width


 A measure of how many communications can happen if we cut the system into 2 halves.
Examples:

 Ring (Fig. 2.9a) → only 2 communications across halves.


 Ring (Fig. 2.9b) → up to 4 communications.
 Toroidal Mesh (Fig. 2.10) → even higher bisection width → better performance.
Interconnection Networks
0) Why interconnects matter
 The interconnection network is the “roads + traffic lights” that let processors and memory talk.
 If the interconnect is slow, the whole parallel program becomes slow—even with very fast CPUs
and memory.

1) Shared-memory interconnects(processors share a single main memory)


1.1 Bus (classic, simple)
 A bus is a set of shared wires; all devices use the same path.
 Only one transfer at a time → others must wait (contention).
 Pros: cheap, simple, easy to add devices.
 Cons: gets saturated as devices increase → poor scalability.
 Why buses are being replaced: to grow core counts, systems switch to switched interconnects
(e.g., crossbar).
(Described in the “Shared-memory interconnects” section just before Fig. 2.7)
1.2 Switched interconnect: Crossbar
 Uses many small switches to connect each processor (Pi) to each memory module (Mj).
 If there are at least as many memory modules as processors, the only conflict is when two
processors request the same memory module at the same time.
 Much faster than a bus because many transfers can happen simultaneously.
 Downside: expensive (lots of switches and links). A small bus system is far cheaper than a same-size
crossbar system.
Figures to draw and label:
 Fig. 2.7(a) – Crossbar layout: circles = switches, squares = processors/memories, lines =
bidirectional links.
 Fig. 2.7(b) – The two possible 2×2 switch configurations.
 Fig. 2.7(c) – Example simultaneous accesses: e.g., P1→M4, P2→M3, P3→M1, P4→M2.

2) Distributed-memory interconnects
(each processor has its own private memory; nodes talk via a network)
Two families: Direct and Indirect.
2.1 Direct interconnects
(every switch is attached to a processor–memory node; switches link to each other)
Counting links correctly (important exam point)
 In direct networks, we normally count only switch-to-switch links, because CPU↔switch links may
run at different speeds.
 To get the total number of links, you can add p (the number of processors) to the switch-to-switch
count.
o Ring (Fig. 2.8a): count 3 switch-to-switch links (not 6 total).
o Toroidal mesh (Fig. 2.8b): count 18 switch-to-switch links (not 27 total).
(a) Ring
 Nodes in a circle; each has two neighbors.
 Better than a bus (multiple communications can happen).
 But easy to create traffic patterns that force waiting.
 Figure: Fig. 2.8(a) (Ring with attached processors).
(b) 2-D Toroidal Mesh
 Nodes in a grid with wrap-around connections (top↔bottom, left↔right).
 More links and more capable switches than a ring: each switch needs 5 links (N, E, S, W + local
processor) vs 3 in a ring (2 neighbors + processor).
 If there are p processors, number of switch-to-switch links is 2p in a toroidal mesh but p in a ring.
 Figure: Fig. 2.8(b).
Connectivity measure: Bisection width (direct networks)
 Split the network into two equal halves.
 Bisection width = minimum number of links that cross the split = max number of simultaneous
communications across the split.
 Ring with 8 nodes:
o Split like Fig. 2.9(a) → only 2 cross links.
o Another split Fig. 2.9(b) → 4 simultaneous connections can cross.
 Square 2-D toroidal mesh: bisection width = 2√p (much higher than ring).
 Figures: Fig. 2.9(a), Fig. 2.9(b) (ring bisections), Fig. 2.10 (mesh bisection).
(c) Fully connected network (ideal but impractical)
 Every switch connects to every other switch.
 Bisection width: p²/4 (very high).
 Cost: needs p²/2 − p/2 links; each switch needs p ports → impractical beyond a few nodes; used as a
theoretical best baseline.
 Figure: Fig. 2.11.
(d) Hypercube (highly connected & used in real systems)
 Built inductively:
o 1-D hypercube = 2 nodes fully connected.
o 2-D = connect two 1-D cubes node-to-node.
o 3-D = connect two 2-D cubes, etc.
 For dimension d:
o Nodes: p = 2^d
o Each switch degree: d neighbors + 1 processor ⇒ 1 + d = 1 + log₂(p) wires per switch.
o Bisection width: p/2 (better than ring/mesh).
 Trade-off: higher connectivity but switches get more complex as d grows; overall more expensive
than a toroidal mesh.
 Figures: Fig. 2.12(a), (b), (c) (1-D, 2-D, 3-D hypercubes).
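Quick worked check of these formulas for a 3-dimensional hypercube (the numbers follow directly from the bullets above):

    d = 3  →  p = 2^3 = 8 nodes;  wires per switch = 1 + d = 1 + log₂(8) = 4;  bisection width = p/2 = 4.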

2.2 Indirect interconnects


(messages pass through a separate “switching fabric”; links often drawn unidirectional)
(a) Generic view
 Processors each have an incoming and outgoing link into a switching network.
 Figure: Fig. 2.13.
(b) Crossbar for distributed memory
 Same “any-to-any” idea, drawn with unidirectional links.
 As long as two processors don’t target the same destination at the same time, every processor can
send simultaneously.
 Bisection width (p × p crossbar): p.
 Figure: Fig. 2.14.
(c) Omega network (multistage network)
 Built from many small 2×2 switches arranged in log₂(p) stages.
 Some communication patterns conflict (cannot all happen at once).
o Example from the book: if processor 0 → 6, then processor 1 → 7 cannot happen
simultaneously.
 Cost vs crossbar:
o Uses (p/2) · log₂(p) of the 2×2 switch elements (far fewer than a full crossbar, which needs
on the order of p² crosspoints).
 Bisection width: p/2 (smaller than crossbar’s p).
 Figures: Fig. 2.15 (whole omega), Fig. 2.16 (a 2×2 omega switch).

3) Bandwidth, Bisection Bandwidth, and Latency (performance terms)


3.1 Link bandwidth
 Bandwidth = data rate of a link (e.g., megabits/s, megabytes/s).
3.2 Bisection bandwidth
 Same idea as bisection width, but instead of counting links, we sum their bandwidths across the
cut.
 Example: If each ring link is 1 Gbit/s, a ring’s bisection bandwidth is 2 Gbit/s = 2000 Mbit/s.
(Because two links cross the best cut in a standard ring.)
3.3 Latency and total message time
 Latency (l): time from start of sending until the first byte arrives at the receiver.
 Bandwidth (b): rate (bytes/s) after the first byte starts arriving.
 Message of n bytes:
Transmission time = l + n / b
 Terminology caution: Some authors use “latency” to mean the entire transmission time, or to mean
the fixed overheads (e.g., building the message header, addressing, error-check; and unpacking at
the receiver).
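A small worked example with made-up numbers (the values of l, b, and n below are assumptions chosen only for illustration):

    If l = 2 ms, b = 10^9 bytes/s, and n = 10^6 bytes, then
    transmission time = l + n/b = 0.002 s + 10^6/10^9 s = 0.002 s + 0.001 s = 3 ms.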

4) Quick comparison (bisection widths from the sections above)
Network              Bisection width
Ring                 2
2-D toroidal mesh    2√p
Fully connected      p²/4
Hypercube            p/2
Crossbar (p × p)     p
Omega network        p/2
📘 Cache Coherence
1. What is Cache Coherence?
 In modern processors, each core has its own private cache to make memory access faster.
 But when multiple cores share data, problems can occur because each core might have different
copies of the same variable in its cache.
 Cache coherence means keeping all the caches in sync so that if one core updates a variable, the
other cores see the updated value.

2. Example to Understand the Problem

Refer to Figure 2.17 (Shared-memory system with two cores and two caches).
 Core 0 has its own cache.
 Core 1 has its own cache.
 Both share the same main memory via interconnect.
Let’s say:
 x = 2 (shared variable in memory).
 Core 0 executes: x = 7.
 Core 1 later executes: z1 = 4 * x.
❌ Problem:
 Core 1 may still use the old cached value of x = 2.
 Then, z1 = 4 * 2 = 8 instead of 28.
 This mismatch happens because caches are not updated across cores automatically.
👉 This problem is called the Cache Coherence Problem.

3. Why is this a Problem?


 Programmers cannot directly control caches.
 Even if memory is updated (with write-through or write-back), another core’s cache may still use
the stale (old) value.
 This makes the program give unpredictable results.

4. Solutions to Cache Coherence


There are two main approaches:
(A) Snooping Cache Coherence
 Works well in bus-based systems.
 Idea:
o When one core updates a variable in its cache, it broadcasts this update on the bus.
o Other cores “snoop” (watch) the bus.
o If they see that a variable they also have is updated, they invalidate or update their copy.
✅ Advantages:
 Simple and easy.
 Works with both write-through and write-back caches.
❌ Disadvantages:
 Needs a broadcast every time → high traffic.
 Not scalable (slow in large systems).

(B) Directory-Based Cache Coherence


 Works better in large distributed-memory systems.
 Idea:
o A special data structure called a directory keeps track of which core has cached which
memory block.
o When a core updates a variable:
 The directory is checked.
 Only those cores with the variable in their cache are informed → they invalidate or
update their copies.
✅ Advantages:
 No need for broadcasts (saves bandwidth).
 Scales well to large systems.
❌ Disadvantages:
 Needs extra storage for the directory.
 Directory lookup adds overhead.
5. Summary
 Cache coherence problem: Multiple cores may hold inconsistent values of shared variables in their
caches.
 Snooping method: Uses broadcasts, simple but not scalable.
 Directory method: Uses a directory, scalable but needs extra storage.

📘 False Sharing
1. What is False Sharing?
 CPU caches work with cache lines, not individual variables.
 A cache line is a block of memory (e.g., 64 bytes).
 If two variables are stored in the same cache line, and two different cores update them, the cache
system behaves as if both cores are sharing data, even if they are not actually using the same
variable.
 This causes unnecessary cache invalidations and extra memory traffic.
 This problem is called False Sharing.
👉 Important: False sharing does not cause wrong results, but it slows down performance a lot.

2. Example Program
int i, j, m, n;
double y[m];
/* Assign y = 0 */
for (i = 0; i < m; i++)
    for (j = 0; j < n; j++)
        y[i] += f(i, j);
 This program computes values of f(i,j) and stores them in y[i].

 If we parallelize it, different cores handle different parts of y.

3. Parallel Version
Suppose we divide work between two cores:
 Core 0: Updates y[0], y[1], y[2], y[3]
 Core 1: Updates y[4], y[5], y[6], y[7]
So logically, they are working on different elements of the array.

4. What Happens in Memory?


 Assume:
o m=8
o double = 8 bytes
o Cache line = 64 bytes
o y[0] is at the start of a cache line.
👉 This means that the entire array y[0] … y[7] fits into one cache line (8 doubles × 8 bytes = 64 bytes).

5. Problem (Why False Sharing Happens?)


 Even though Core 0 and Core 1 work on different parts of y, the entire array is in one cache line.
 When Core 0 updates y[0], the cache line is marked invalid in Core 1.
 Next, when Core 1 updates y[4], it reloads the cache line → which invalidates Core 0’s cache.
 This keeps repeating → lots of memory traffic between caches and main memory.
⚠️Result: Program becomes very slow because every update forces a cache line transfer.

6. Key Points
 False sharing:
o Does not give wrong results.
o But it ruins performance due to unnecessary memory traffic.
 It occurs when multiple cores update different variables in the same cache line.

7. How to Reduce False Sharing?


✅ Use temporary local storage for each core/thread.
✅ After finishing, copy local results into the shared array.
✅ Example: Each core stores its results in a private temp[] array, then combines into y[].
✅ Padding can also be used (place unused space between array elements so they fall into different cache
lines).
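A minimal sketch of both fixes (illustrative only; M, N, f(), and the core/thread setup below are assumptions matching the example above, not library code):

#define M 8                                       /* array length, as in the example  */
#define N 100                                     /* inner-loop count (made up)       */
#define CACHE_LINE 64                             /* cache-line size in bytes         */

double f(int i, int j) { return i + 0.01 * j; }   /* stand-in for the real f(i,j)     */

/* Fix 1: each core accumulates into its own private temp array and copies
   its block into the shared y only once at the end. */
void core_work(int my_rank, int cores, double y[M]) {
    double temp[M] = {0.0};                       /* private to this core/thread      */
    int first = my_rank * (M / cores);
    int last  = first + M / cores;
    for (int i = first; i < last; i++)
        for (int j = 0; j < N; j++)
            temp[i] += f(i, j);
    for (int i = first; i < last; i++)            /* one shared write per element     */
        y[i] = temp[i];
}

/* Fix 2: padding, so consecutive elements land in different cache lines. */
struct padded_double {
    double value;
    char   pad[CACHE_LINE - sizeof(double)];      /* unused space between elements    */
};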

8. Summary
 False Sharing happens when different cores update different variables that lie in the same cache
line.
 Causes frequent cache invalidations and reloads.
 Does not cause incorrect results but hurts performance badly.
 Can be avoided using:
o Private temporary storage,
o Padding to separate variables into different cache lines.
📘 Shared-memory vs Distributed-memory
1. Shared-Memory Systems
 All processors share a single address space (one big memory).
 Processors communicate with each other by reading/writing shared variables.
 Example: Multicore processors, small SMP (Symmetric Multiprocessor) systems.
Advantages:
1. Easy to program → programmers just read/write shared variables.
2. Natural communication → no need for explicit message passing.
3. Good for small systems (few processors).
Disadvantages:
1. Scalability problem – as more processors are added, memory access conflicts increase.
o Example: In a bus, multiple processors may want memory at the same time → contention.
2. Crossbar interconnects can reduce contention, but they are very expensive.
3. Not suitable for very large systems with hundreds/thousands of processors.

2. Distributed-Memory Systems
 Each processor has its own private memory.
 Processors communicate by explicitly sending and receiving messages.
 Example: Supercomputers, clusters (with thousands of nodes).
Advantages:
1. Scalable → thousands of processors can be connected.
2. Cheaper interconnects like mesh, toroidal mesh, hypercube, etc.
3. Well-suited for problems with huge data and computations.
Disadvantages:
1. Harder to program → programmer must explicitly manage communication (send/receive
messages).
2. Data sharing is not automatic like in shared-memory.
3. Higher communication overhead compared to shared-memory.

3. Why Not Always Shared-Memory?


 New programmers prefer shared-memory since it looks simple.
 But hardware reality:
o Buses → cheap but only good for small systems.
o Crossbars → allow many processors but are very expensive.
 For large systems (1000+ processors), distributed-memory is more practical.

4. Comparison Table
Feature           Shared-Memory                              Distributed-Memory
Memory Access     Common global memory                       Each processor has its own local memory
Communication     Through shared variables                   Through message passing
Programming       Easy, implicit                             Harder, explicit send/receive
Scalability       Poor (limited processors)                  Excellent (thousands possible)
Cost              Expensive (crossbars, coherence needed)    Cheaper interconnects (mesh, hypercube)
Examples          Multicore CPUs, SMP systems                Supercomputers, clusters

5. Summary (Exam-Ready)
 Shared-memory → simple, easy for programmers, but does not scale well (bus contention,
expensive crossbars).
 Distributed-memory → harder to program (need explicit communication) but scales cheaply to
thousands of processors using interconnects like hypercube, toroidal mesh.
 That’s why large parallel systems are usually distributed-memory.

✅ You can also draw a small diagram:


 Shared-memory: 3 processors connected to one memory.
 Distributed-memory: 3 processors each with private memory, connected via a network.

Coordinating the Processes/Threads


When we write parallel programs, the main challenge is how to make different processes or threads work
together properly.

✅ Case 1: Very Simple Example (Trivial Case)


 Suppose we want to add two arrays element by element:
double x[n], y[n];
for (int i = 0; i < n; i++)
    x[i] += y[i];
 To parallelize this:
o If we have p processes/threads, we just divide the array elements among them.
o Example:
Thread 0 → works on 0 … n/p - 1

 Thread 1 → works on n/p … 2n/p - 1
 Thread 2 → works on 2n/p … 3n/p - 1 … and so on.
So each thread works independently on its own portion.
👉 This kind of program is called embarrassingly parallel.
 Meaning → it is very easy to split the work, no complicated coordination needed.
 The term sounds negative (“embarrassing”), but in reality, if you find such a problem, it’s actually
great news because parallelizing becomes very easy.
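A minimal sketch of this block division in C (illustrative; in a real program my_rank, p, and n would come from the threading or MPI library, and the case where p does not divide n evenly must also be handled):

/* Thread my_rank (0 .. p-1) handles one block of the arrays; assumes p divides n. */
void add_my_block(int my_rank, int p, int n, double x[], double y[]) {
    int my_first = my_rank * (n / p);         /* e.g., thread 0 -> 0 .. n/p - 1     */
    int my_last  = my_first + n / p;          /*       thread 1 -> n/p .. 2n/p - 1  */
    for (int i = my_first; i < my_last; i++)
        x[i] += y[i];                         /* no coordination needed: embarrassingly parallel */
}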

✅ Key Requirements for Work Division


When dividing work among processes/threads, the programmer must ensure:
1. Load Balancing (equal work)
o Each process/thread should get roughly the same amount of work.
o If some threads get more work than others → overall program becomes slow (since others
have to wait).
2. Minimal Communication
o The amount of data exchange between threads/processes should be kept as small as
possible.
o Example: If two threads need to constantly send data to each other, they waste time in
communication instead of computation.

✅ Case 2: General (Harder) Problems


 Unfortunately, most problems are not so simple.
 For most real-world programs, we need coordination between processes/threads.
 This means:
1. Synchronization → making sure processes/threads work in the correct order.
 Example: If one thread produces data, another should wait until data is ready.
2. Communication → processes/threads often need to share or exchange results.

✅ Synchronization vs Communication
 Distributed-memory systems:
o Processes have their own private memory.
o They communicate by sending messages.
o Here, communication itself often provides synchronization (because a process must wait
until it receives a message).
 Shared-memory systems:
o All threads share the same memory.
o They communicate by reading/writing shared variables.
o To do this safely, we need synchronization tools (like locks, barriers, semaphores).
So, in both systems, communication and synchronization are deeply connected.
✅ Important Terms
 Parallelization: Process of converting a serial (normal) program into a parallel one.
 Embarrassingly parallel: A program that can be parallelized easily, just by dividing work, without
worrying about synchronization.
 Load balancing: Ensuring all processes/threads get equal work.

📖 Exam-Ready Answer (10 Marks)


To coordinate processes/threads in parallel programming:
1. Some programs are trivial to parallelize (like array addition) → just divide work evenly → these are
called embarrassingly parallel programs.
2. Programmer must ensure:
o Load balancing → equal work for all.
o Minimal communication → reduce data transfer overhead.
3. For most problems, coordination is required:
o Synchronization → to ensure proper ordering of operations.
o Communication → to share/exchange data.
4. In distributed-memory systems, communication provides synchronization (via message passing).
5. In shared-memory systems, communication happens by accessing shared variables, but
synchronization (locks, barriers) is needed to avoid conflicts.
👉 Hence, coordination is the key challenge in parallel programming, except for embarrassingly parallel
problems.
Shared-memory programming
1) Shared vs. Private variables (how threads “talk”)
 In a shared-memory program, multiple threads run inside the same process and see the same
address space.
 A shared variable can be read/written by any thread.
 A private variable belongs to one thread only (others can’t normally access it).
 Because everyone can see shared variables, communication is implicit: threads “communicate” just
by reading/writing the same memory locations. No explicit messages are needed.

2) Two threading styles: Dynamic vs. Static


A) Dynamic threads (on-demand workers)
 Typical pattern: one master thread + 0 or more worker threads.
 The master waits for work (e.g., requests from network or queue).
 When work arrives, master forks (creates) a worker thread.
 Worker does the job, then joins (finishes and hands control back) to the master and terminates.
 Why good? Uses system resources efficiently: a thread exists only while it has work.
B) Static threads (fixed team)
 Master does setup, then creates all threads once at the start.
 All threads run until all work is finished.
 Then they join back to the master; master may do cleanup and exit.
 Resource use: less efficient if some threads are idle (they still hold a stack, registers, etc.).
 Performance: often better because creating/joining threads repeatedly is costly. If resources allow,
static can be faster.
 Bonus: The static style resembles distributed-memory programming (fixed ranks that run for the
whole program), so the same “mental model” carries over. That’s why many examples use static
threads.

3) Nondeterminism (why outputs can change each run)


 In MIMD systems, threads run asynchronously (not in lock-step).
 Nondeterminism = with the same input, you might see different outputs or a different order of
outputs in different runs.
 Example:
// thread 0 has my_x = 7, thread 1 has my_x = 19
printf("Thread %d > my_x = %d\n", my_rank, my_x);
Possible outputs (both are correct):
Thread 0 > my_x = 7
Thread 1 > my_x = 19
or
Thread 1 > my_x = 19
Thread 0 > my_x = 7
 Sometimes lines can even interleave (part of one line mixed with another) because both threads
write at the same time.
 Takeaway: timing varies from run to run (OS scheduling, hardware timing), so the order isn’t
guaranteed unless you enforce it.

4) Race conditions (the classic shared-update bug)


 A race condition happens when two or more threads access the same shared resource at the same
time and the final result depends on who gets there first.
 Example goal: both threads compute my_val (private) and add into a shared x (initialized to 0):
my_val = Compute_val(my_rank);
x += my_val; // shared update
 What can go wrong? Addition isn’t one step; it’s typically:
1. load x into a register
2. add my_val
3. store result back to x
 If both threads interleave these steps, you can lose one update.
Example timeline (simplified):
o T0: Thread 0 loads x=0
o T1: Thread 1 loads x=0
o T2: Thread 0 adds +7 → plans to store 7
o T3: Thread 1 adds +19 → plans to store 19
o T4: Thread 0 stores 7 (x becomes 7)
o T5: Thread 1 stores 19 (x becomes 19) ← lost the +7
 Correct result should be 26, but we got 19. That’s a race.

5) Making updates safe: atomicity & critical sections


Atomic operation (what we want)
 An operation is atomic if it appears to happen all at once: no other thread can see or insert a
change to that memory between start and finish.
Critical section (how we enforce it)
 A critical section is a block of code that only one thread may execute at a time.
 We protect it with a mutex (mutual exclusion lock).
Mutex / Lock (most common tool)
 Pattern:
my_val = Compute_val(my_rank);
Lock(&add_my_val_lock); // wait until you “own” the lock
x += my_val; // critical section
Unlock(&add_my_val_lock); // release the lock
 While one thread holds the lock, others trying to lock wait.
 Important: this serializes that small piece of code, so:
o Keep as few critical sections as possible.
o Make each as short as possible (do only the essential shared update).
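The text uses generic Lock/Unlock names. With POSIX threads the same pattern might look like the sketch below (assuming x and the mutex are shared globals; error checking omitted):

#include <pthread.h>

double x = 0.0;                                  /* shared variable                    */
pthread_mutex_t add_my_val_lock = PTHREAD_MUTEX_INITIALIZER;

void add_in(double my_val) {                     /* my_val is private to the caller    */
    pthread_mutex_lock(&add_my_val_lock);        /* wait until we own the lock         */
    x += my_val;                                 /* critical section: keep it short    */
    pthread_mutex_unlock(&add_my_val_lock);      /* release the lock                   */
}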

6) Other synchronization tools (besides mutexes)


A) Busy-waiting (spin-wait)
 A thread loops checking a condition until it becomes true:
if (my_rank == 1)
    while (!ok_for_1);   // spin here

// critical section
x += my_val;

if (my_rank == 0)
    ok_for_1 = true;     // let thread 1 proceed
 Simple, but wastes CPU while spinning (doing no real work).
B) Semaphores
 Similar goal (control access / ordering), slightly different behavior than locks.
 Some patterns are easier to write with semaphores than with mutexes (you’ll see more in later
chapters).
C) Monitors
 A higher-level construct: an object whose methods are mutually exclusive (only one thread can be
inside any method at a time).
 Provides mutual exclusion without you manually placing locks around every method.

7) Thread safety (be careful with library functions)


 A function is thread-safe if multiple threads can call it at the same time safely.
 Local variables (non-static) inside a function live on the thread’s own stack, so they’re private and
safe.
 Static local variables inside a function persist across calls and are shared by all threads → danger.
 Example: C’s strtok
o First call: pass a string; strtok stores internal static pointer to remember where it left off.
o Next calls: it uses that static pointer to return the next token.
o If two threads both use strtok at the same time, they overwrite each other’s progress →
corrupted results.
 Takeaway: Not all serial-code functions are safe in multithreaded programs. Prefer thread-safe
variants (e.g., strtok_r) or write your own safe code.
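For example, the reentrant variant strtok_r keeps its position in a caller-supplied pointer instead of a hidden static, so each thread can safely tokenize its own string. A minimal sketch:

#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <string.h>

int main(void) {
    char line[] = "one two three";
    char *saveptr;                                       /* per-caller state, not a hidden static */
    for (char *tok = strtok_r(line, " ", &saveptr);
         tok != NULL;
         tok = strtok_r(NULL, " ", &saveptr))
        printf("token: %s\n", tok);
    return 0;
}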

8) Quick “how to think” checklist


 Decide what’s shared vs. private.
 If multiple threads update shared data, protect the update (lock or other mechanism).
 Minimize time spent holding locks.
 Expect nondeterminism in order/outputs unless you explicitly synchronize.
 Avoid busy-waiting unless you know it’s acceptable for your case.
 Be careful with library calls—check if they’re thread-safe.
 Prefer static threads when performance matters and resources allow; use dynamic threads when
workloads come and go.

9) Tiny exam snippets you can write


Race condition in one line:
A race condition occurs when two threads access and update the same shared variable at the same time,
and the final result depends on who finishes first.
Critical section in one line:
A critical section is code that must be executed by only one thread at a time to prevent races.
Mutex pattern (3 lines):
Lock(&L);
/* critical section: read/modify/write shared data */
Unlock(&L);
Thread-unsafe function example:
strtok uses a static internal pointer; simultaneous calls from different threads interfere. Use a thread-safe
alternative (e.g., strtok_r).

Distributed-Memory
In distributed-memory systems, each core (processor) has its own private memory.
Unlike shared-memory systems (where threads share variables directly), here a core cannot directly access
another core’s memory.
So, to communicate, cores use special mechanisms, mainly message-passing.

🔹 1. Message-Passing
 This is the most widely used method in distributed-memory systems.
 It is based on sending and receiving messages between processes.
 Processes (not threads) are usually used here because processes may run on completely different
CPUs with different operating systems.
 Each process is given a rank (like an ID): 0, 1, 2, …, p−1 (where p is total number of processes).
Example:
char message[100];
my_rank = Get_rank();

if (my_rank == 1) {
    sprintf(message, "Greetings from process 1");
    Send(message, MSG_CHAR, 100, 0);        // send to process 0
} else if (my_rank == 0) {
    Receive(message, MSG_CHAR, 100, 1);     // receive from process 1
    printf("Process 0 > Received: %s\n", message);
}
👉 Here:
 Process 1 → prepares a message and sends it to process 0.
 Process 0 → receives the message and prints it.
Important Points:
1. The program is SPMD (Single Program, Multiple Data) → All processes run the same program but
behave differently depending on their rank.
2. The variable message is private to each process (not the same memory).
3. Most systems allow all processes to use stdout/stderr (so printing is possible).
Send/Receive behavior:
 Send (blocking): The sending process waits until the receiving process starts receiving.
 Send (non-blocking/copying): The message is copied, and the sender continues execution without
waiting.
 Receive (usually blocking): The receiving process waits until the message arrives.
Collective Operations in Message-Passing:
 Broadcast: One process sends the same message to all other processes.
 Reduction: Combine results from all processes into one (e.g., sum, max).
 Others: Functions for complex data structures, managing processes, etc.
👉 The most widely used message-passing library: MPI (Message Passing Interface).
Almost all supercomputers and high-performance programs use MPI.
⚠️But:
 Message-passing is low-level → requires lots of detailed coding.
 Programmers must rewrite most of the program for parallelization.
 Data must be either replicated or explicitly divided across processes.
 Incremental rewriting (step by step) is hard → usually, a big rewrite is needed.
 Because of this, message-passing is called:
🔹 “The Assembly Language of Parallel Programming.”

🔹 2. One-Sided Communication (Remote Memory Access)


 In message-passing, two processes must participate: one sends, another receives.
 In one-sided communication, only one process needs to act.
Example:
 Process 0 directly writes into the memory of Process 1.
 Process 0 directly reads from the memory of Process 1.
This looks simpler, but it creates new problems:
1. Process 0 must know when it is safe to write (otherwise it may overwrite).
2. Process 1 must know when the data is ready (otherwise it may read old data).
o This requires synchronization or flag variables.
o Flag variable = Process 0 sets a value after writing, and Process 1 keeps checking (polling)
until it sees the flag.
⚠️Problems:
 Overheads from extra synchronization.
 Errors are harder to find (because processes don’t explicitly talk to each other).
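In MPI, one-sided communication is done through memory "windows" that other processes may read or write. The sketch below is a simplified illustration (error handling omitted; run with at least 2 processes, e.g., mpiexec -n 2):

#include <stdio.h>
#include <mpi.h>

int main(void) {
    int my_rank, value = 0;
    MPI_Win win;

    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    /* Every process exposes its local 'value' as a window others may access. */
    MPI_Win_create(&value, sizeof(int), sizeof(int), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);                       /* synchronization: open the access epoch   */
    if (my_rank == 0) {
        int data = 42;
        /* Process 0 writes directly into process 1's window; process 1 never calls Receive. */
        MPI_Put(&data, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);                       /* close the epoch: the data is now visible */

    if (my_rank == 1)
        printf("Process 1 > value written by process 0: %d\n", value);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}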

🔹 3. Partitioned Global Address Space (PGAS) Languages


 Many programmers prefer shared-memory style programming because it’s easier.
 PGAS languages try to combine shared-memory style with distributed-memory systems.
How it works:
 Shared arrays are declared, but the programmer explicitly decides how they are partitioned across
processes.
 Private variables always stay in the local memory of the process.
Example:
shared int n = ...;
shared double x[n], y[n];
private int i, my_first_element, my_last_element;

my_first_element = ...;
my_last_element = ...;

for (i = my_first_element; i <= my_last_element; i++)
    x[i] += y[i];
👉 Here:
 Arrays x and y are shared, but distributed among processes.
 Each process works only on its part of the arrays.
 If data is well distributed (close to the process’s local memory), performance is good.
 If badly distributed (all data on one process), performance is very poor.
PGAS Advantages:
 Easier to program than message-passing.
 Programmer still has control over data distribution (to avoid poor performance).
Current Status:
 Several research groups are developing PGAS languages.
 They are still evolving.
✅ Final Key Points Summary
1. Distributed-memory systems → each process has private memory.
2. Message-passing → processes use send and receive. (Most common, but low-level).
o Blocking and non-blocking communication.
o Collective operations (broadcast, reduction, etc.).
o MPI is the most widely used.
o Called “assembly language of parallel programming.”
3. One-sided communication → only one process actively communicates.
o Easier but requires synchronization (flag variables, etc.).
4. PGAS languages → mix of shared-memory style + distributed-memory hardware.
o Data explicitly partitioned by programmer.
o Performance depends on good data distribution.

👉 So in very simple words:


 Message-passing = writing letters to talk.
 One-sided communication = sneaking into someone’s notebook to write directly.
 PGAS = everyone shares a big notebook, but each person is responsible for only their own pages.
PC IA – 2

Q1. Explain with a neat diagram a Distributed Memory System.


Ans : A Distributed Memory System is a type of parallel computer system where:
 Each processor (CPU) has its own private memory.
 The processors are connected to each other using a communication network.
 A processor can directly access only its own memory.
 To share data with another processor, it must send and receive messages through the network.
So, the processors do not share the same memory — instead, they communicate by message passing.

Simple Explanation:
Think of a distributed system as a group of computers (or processors) connected through a network:
 Each computer has its own local memory and data.
 If one computer wants to share data with another, it sends a message across the network.
 This message-passing is done using MPI (Message Passing Interface) functions.

Key Characteristics:
1. Private Memory: Each processor has its own memory. No other processor can access it directly.
2. Network Connection: Processors are connected through a communication network like Ethernet or
InfiniBand.
3. Message Passing: Data sharing between processors happens using message passing (e.g., MPI_Send,
MPI_Recv).
4. Scalability: It’s easy to add more processors to the system — making it good for large-scale parallel
computing.
5. Independent Execution: Each processor works independently on its part of the problem.

Neat Diagram:
[Diagram: processors P0, P1, P2, each with its own private memory M0, M1, M2, all connected to an interconnection network.]
Explanation of Diagram:
 Each processor (P0, P1, P2) has its own memory (M0, M1, M2).
 These processors are connected through a network.
 If P0 wants to send data to P2, it must send a message over the network.
 No processor can directly access another processor’s memory.

Example: Suppose we have 3 processors calculating the total marks of students:


 P0 has marks of students 1–10
 P1 has marks of students 11–20
 P2 has marks of students 21–30
Each processor calculates its local sum. Then, they send their results to P0, which adds them all to get the
total marks.
This shows how work is divided and results are shared using message passing.
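A hedged MPI sketch of this marks example (the local sums below are made-up values; a real program would read the actual marks for each processor's students):

#include <stdio.h>
#include <mpi.h>

int main(void) {
    int my_rank, comm_sz;
    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);

    /* Each process computes a local sum of "its" students' marks
       (here just a made-up value derived from the rank). */
    int local_sum = 100 * (my_rank + 1);

    /* Reduction: combine all local sums into one total on process 0. */
    int total = 0;
    MPI_Reduce(&local_sum, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (my_rank == 0)
        printf("Total marks (from %d processes) = %d\n", comm_sz, total);

    MPI_Finalize();
    return 0;
}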

Advantages:
1. High scalability: Easy to add more processors.
2. No memory conflicts: Since memory is private, no issues with multiple access.
3. Good for large distributed systems: Works well in cluster or cloud computing.

Disadvantages:
1. Communication overhead: Message passing between processors can be slower than shared memory.
2. Programming is harder: Programmer must handle message passing manually.
3. Data duplication: Some data may need to be stored in multiple processors.

Difference between Shared and Distributed Memory (for clarity):


Feature          Shared Memory System        Distributed Memory System
Memory           Common to all processors    Private to each processor
Communication    Through shared variables    Through message passing
Speed            Faster (direct access)      Slower (via network)
Example          Multi-core CPU              Computer cluster using MPI

In short: A Distributed Memory System is a set of processors, each with its own memory, connected through a
network.
They communicate by sending messages — not by sharing memory.
MPI is used to write programs for such systems.

Final Answer Summary (Short Notes):


 Each processor has private memory.
 Communication happens via message passing (MPI).
 Suitable for clusters and large parallel systems.
 Offers scalability and independence.
 Example: Parallel supercomputers and cloud clusters.

Question 2: Write an MPI program that prints greetings from the processes.
Answer:
🌐 Introduction to MPI
 MPI stands for Message Passing Interface.
 It is not a new programming language — it is a library of functions that allows multiple processes
running on different computers (or cores) to communicate with each other by sending and
receiving messages.
 MPI is used for parallel programming in distributed memory systems, where each process has its
own private memory.

🧠 Idea of the Program


In this MPI program, several processes are created.
 Each process has a unique rank (ID) — like 0, 1, 2, etc.
 All processes will send a greeting message to process 0.
 Process 0 will then receive all greetings and print them on the screen.
This is similar to a “Hello, World” program but done in parallel using MPI.

🧩 MPI Functions Used


1. MPI_Init()
o Starts the MPI environment.
o Must be called before using any MPI function.
2. MPI_Comm_size()
o Gets the total number of processes running.
3. MPI_Comm_rank()
o Gets the rank (ID) of the current process.
o Ranks start from 0.
4. MPI_Send()
o Used by non-zero processes to send their greeting messages to process 0.
5. MPI_Recv()
o Used by process 0 to receive messages from all other processes.
6. MPI_Finalize()
o Ends the MPI environment.
💻 Program Code
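The listing itself is not included in these notes; the sketch below reconstructs it from the description above (the variable names, the tag value 0, and MAX_STRING = 100 are assumptions, in line with the standard textbook version of the greetings program):

#include <stdio.h>
#include <string.h>
#include <mpi.h>

#define MAX_STRING 100

int main(void) {
    char greeting[MAX_STRING];
    int  comm_sz, my_rank;

    MPI_Init(NULL, NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    if (my_rank != 0) {
        /* Every non-zero process sends its greeting to process 0. */
        sprintf(greeting, "Greetings from process %d of %d!", my_rank, comm_sz);
        MPI_Send(greeting, strlen(greeting) + 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    } else {
        /* Process 0 prints its own greeting, then receives and prints the rest. */
        printf("Greetings from process %d of %d!\n", my_rank, comm_sz);
        for (int q = 1; q < comm_sz; q++) {
            MPI_Recv(greeting, MAX_STRING, MPI_CHAR, q, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("%s\n", greeting);
        }
    }

    MPI_Finalize();
    return 0;
}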

📘 Explanation of the Working


1. MPI_Init: Starts all MPI processes.
2. Rank Assignment: Each process gets a unique rank (0, 1, 2, …, p−1).
3. Message Sending: All processes except process 0 prepare a message and send it using MPI_Send().
4. Message Receiving: Process 0 uses MPI_Recv() inside a loop to receive all greetings one by one.
5. Output: Process 0 prints its own message first, then prints all the messages received from other
processes.
6. MPI_Finalize: Closes the MPI environment properly.
🧱 Key Points
 Only process 0 prints all messages.
 Each process runs the same program but behaves differently depending on its rank.
→ This type of program is called SPMD (Single Program Multiple Data).
 The program can be executed with any number of processes.

✅ Advantages
 Shows basic message passing in MPI.
 Helps understand communication between multiple processes.
 Forms the foundation for writing more complex parallel programs.

🏁 Conclusion
This MPI “Greetings” program is a simple example that demonstrates:
 How multiple processes communicate in a distributed system.
 The use of basic MPI functions like MPI_Init, MPI_Send, MPI_Recv, and MPI_Finalize.
It helps beginners understand the core idea of parallel programming using MPI.
Q3. Explain the MPI_Init, MPI_Finalize, MPI_Comm_size, and MPI_Comm_rank functions.
Ans : MPI (Message Passing Interface) is used in parallel programming where many processes work
together and communicate by sending messages.
Every MPI program must begin by starting MPI and end by closing it properly.
For this, we use some basic MPI functions like MPI_Init, MPI_Finalize, MPI_Comm_size, and
MPI_Comm_rank.

🔹 1. MPI_Init
Meaning:
 This is the first function that should be called in every MPI program.
 It tells the computer system to start the MPI environment so that processes can talk to each other.
Purpose:
 It sets up all the things needed for MPI to work (like communication links between processes).
 No other MPI function can be used before this.
Syntax: int MPI_Init(int *argc, char ***argv);
In simple words:
→ It starts MPI.
→ Must be the first MPI command in the program.
Example: MPI_Init(NULL, NULL);

🔹 2. MPI_Finalize
Meaning:
 This is the last function called in every MPI program.
 It tells the system that we are done using MPI.
Purpose:
 It closes the MPI environment and frees all the memory and resources used by MPI.
 After calling it, no other MPI function can be used.
Syntax: int MPI_Finalize(void);
In simple words:
→ It ends MPI.
→ Must be the last MPI command in the program.
Example: MPI_Finalize();
🔹 3. MPI_Comm_size
Meaning:
 It is used to find how many processes are running in the communicator (the group of all processes).
Purpose:
 It gives the total number of processes in the MPI program.
Syntax: int MPI_Comm_size(MPI_Comm comm, int *comm_sz);
Explanation:
 comm is the communicator (normally MPI_COMM_WORLD).
 comm_sz stores the total number of processes.
In simple words: → It tells “How many processes are working?”
Example:
int comm_sz;
MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
If 4 processes are running, then comm_sz = 4.

🔹 4. MPI_Comm_rank
Meaning:
 It gives the unique ID number (rank) of each process.
 The ranks start from 0 up to total processes − 1.
Purpose:
 It helps each process know “Who am I?”
 This helps in assigning different tasks to different processes.
Syntax: int MPI_Comm_rank(MPI_Comm comm, int *my_rank);
Explanation:
 comm is usually MPI_COMM_WORLD.
 my_rank stores the process’s rank (like process number).
In simple words: → It tells “Which process am I?”
Example:
int my_rank;
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
If there are 4 processes, their ranks are 0, 1, 2, and 3.

🧠 Example Program
#include <stdio.h>
#include <mpi.h>

int main(void) {
    int my_rank, comm_sz;

    MPI_Init(NULL, NULL);                        // Start MPI

    MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);     // Find total processes
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);     // Find rank of each process

    printf("Hello from process %d of %d\n", my_rank, comm_sz);

    MPI_Finalize();                              // End MPI

    return 0;
}
If run with 4 processes, output:
Hello from process 0 of 4
Hello from process 1 of 4
Hello from process 2 of 4
Hello from process 3 of 4

🧾 Summary Table
Function Name Purpose Meaning in Simple Words
MPI_Init Starts the MPI system “Start MPI”
MPI_Finalize Ends the MPI system “Stop MPI”
MPI_Comm_size Gives total number of processes “How many processes are working?”
MPI_Comm_rank Gives rank (ID) of each process “Which process am I?”

🌟 In very short summary:


 MPI_Init() → Starts MPI system
 MPI_Comm_size() → Finds total number of processes
 MPI_Comm_rank() → Finds process ID (rank)
 MPI_Finalize() → Ends MPI system

💯 Answer (for 10 marks):


All four functions are basic and essential for every MPI program.
MPI_Init begins MPI, MPI_Finalize ends MPI, MPI_Comm_size tells how many processes are running, and
MPI_Comm_rank tells the ID number of each process. Together, they help to start, manage, and end
message passing between multiple processes in a parallel program.
4. Write a note on MPI Message Matching
Answer: In MPI (Message Passing Interface), message matching is the process of ensuring that a message
sent by one process (using MPI_Send) is correctly received by another process (using MPI_Recv).
It is an important concept in distributed-memory parallel programming because it helps processes
communicate accurately and efficiently.

🟩 1. Meaning of Message Matching


When two processes communicate in MPI, the system must make sure that:
 The message sent by a process is received by the intended destination.
 The data type, message size, and communication settings are all compatible.
This process of pairing the correct send and receive operations is called message matching.

🟩 2. Conditions for Successful Message Matching


Suppose:
 Process q sends a message using:
MPI_Send(send_buf_p, send_buf_sz, send_type, dest, send_tag, send_comm);
 Process r receives a message using:
MPI_Recv(recv_buf_p, recv_buf_sz, recv_type, src, recv_tag, recv_comm, &status);
Then, the message sent by q can be received by r only if the following four conditions are true:
1. Communicator matches: recv_comm = send_comm
(Both processes must belong to the same communication group.)
2. Tag matches: recv_tag = send_tag
(The tag acts as a label to identify the message type.)
3. Destination matches: The destination in the send call (dest) should be equal to the rank of the
receiver (r).
4. Source matches: The source in the receive call (src) should be equal to the rank of the sender (q).
✅ If all four conditions are true, then MPI considers the message as matched.

🟩 3. Compatibility of Buffers
Matching also depends on the data type and size of the message.
The send and receive buffers must be compatible:
 recv_type must be equal to send_type
 recv_buf_sz must be greater than or equal to send_buf_sz
So, a simple rule is:
If recv_type = send_type and recv_buf_sz ≥ send_buf_sz, then the message can be successfully received.
This ensures that no data is lost or corrupted during transmission.

🟩 4. Receiving Messages from Multiple Processes


Sometimes, one process may receive messages from many other processes — and the order in which
messages arrive is not predictable.
For example:
 Process 0 gives work to processes 1, 2, 3, etc.
 Each process sends back its result when done, but they may finish at different times.
If process 0 waits to receive results in fixed order (1, then 2, then 3), some processes may sit idle even if
they have already finished.
To solve this, MPI provides a special wildcard constant called:
➤ MPI_ANY_SOURCE
This allows the receiver to accept a message from any sender, no matter which process sends it first.
Example:
for (i = 1; i < comm_sz; i++) {
MPI_Recv(result, result_sz, result_type, MPI_ANY_SOURCE,
result_tag, comm, MPI_STATUS_IGNORE);
Process_result(result);
}
This lets process 0 receive results in the order they arrive, improving efficiency.

🟩 5. Receiving Messages with Different Tags


Sometimes, one process might receive different kinds of messages from another process — for example,
one for data and another for a signal.
If the receiver doesn’t know the order in which messages will come, it can use another wildcard:
➤ MPI_ANY_TAG
This allows the process to receive a message with any tag.
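A small snippet (meant to sit inside an ordinary MPI program; the variable names are assumed) showing how the receiver can find out, after a wildcard receive, who actually sent the message and which tag it carried:

MPI_Status status;
int result;

/* accept a message from any sender, with any tag */
MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
         MPI_COMM_WORLD, &status);

/* the status object records the actual sender and tag */
printf("Received %d from process %d with tag %d\n",
       result, status.MPI_SOURCE, status.MPI_TAG);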

🟩 6. Important Points About Wildcards


1. Only receivers can use wildcards.
o Senders must specify the destination process and a non-negative tag.
o MPI follows a “push” communication model (sender pushes data to receiver).

2. No wildcard for communicator.


o Both sender and receiver must always use the same communicator.
🟩 7. Simple Example of Message Matching
Sender (Process 1)
MPI_Send(data, 10, MPI_INT, 0, 5, MPI_COMM_WORLD);
Receiver (Process 0)
MPI_Recv(data, 10, MPI_INT, 1, 5, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
Here:
 Communicator = MPI_COMM_WORLD ✅
 Tag = 5 ✅
 Destination = 0 ✅
 Source = 1 ✅
Hence, the message is perfectly matched and will be received successfully.

🟩 8. Advantages of Proper Message Matching


✅ Prevents messages from being lost or mismatched
✅ Avoids program deadlocks and hanging processes
✅ Improves synchronization between parallel processes
✅ Ensures reliable communication between multiple processors

🟩 9. Summary Table


Aspect Description
Meaning Matching a send with its correct receive operation
Key Conditions Same communicator, tag, source, and destination
Buffer Rule recv_buf_sz ≥ send_buf_sz and same data type
Wildcards MPI_ANY_SOURCE (any sender), MPI_ANY_TAG (any tag)
Restrictions Wildcards only for receivers; no wildcard for communicator
Benefit Ensures smooth, correct, and efficient process communication

In very simple words: 👉 Message matching in MPI is like making sure that a letter (message) sent by one
person (process) reaches the right person (receiver) — with the correct address (rank), correct label (tag),
and in the same group (communicator).
5. Explain the Trapezoidal Rule in MPI
1. The trapezoidal rule is a method used in mathematics to find the area under a curve or to
approximate an integral.
2. The main idea is to divide the interval between two points a and b into many small equal parts
called subintervals.
3. Each small part is treated as a trapezoid instead of a curved shape, which makes it easy to calculate
the area.
4. The formula for the area of one trapezoid is:
Area = (h/2) × [f(xᵢ) + f(xᵢ₊₁)]
where h = (b − a) / n is the width of each trapezoid.
5. The total area (or approximate integral) is the sum of all trapezoid areas between a and b.
6. In a serial (normal) program, one computer or one process calculates all trapezoids one by one —
this takes more time for large values of n.
7. In MPI (Message Passing Interface), the total work is divided among many processes (cores or
computers) to make it faster.
8. Each process calculates the area of its part of the interval — for example, process 0 handles the first
few trapezoids, process 1 the next few, and so on.
9. After all processes finish their local area calculations, they send their results to process 0 using MPI
functions like MPI_Send and MPI_Recv.
10. Finally, process 0 adds all the local areas to get the total integral value, and prints the result.
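The steps above can be sketched in code as follows. This is a minimal sketch: the integrand f(x) = x², the limits a = 0, b = 3, and n = 1024 are assumed example values, and n is assumed to be evenly divisible by the number of processes.

#include <stdio.h>
#include <mpi.h>

/* assumed example integrand: f(x) = x*x */
double f(double x) { return x * x; }

/* serial trapezoidal rule over [left, right] using count trapezoids of width h */
double Trap(double left, double right, int count, double h) {
    double estimate = (f(left) + f(right)) / 2.0;
    for (int i = 1; i <= count - 1; i++)
        estimate += f(left + i * h);
    return estimate * h;
}

int main(void) {
    int my_rank, comm_sz, n = 1024, local_n;
    double a = 0.0, b = 3.0, h, local_a, local_b;
    double local_int, total_int;

    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);

    h = (b - a) / n;            /* width of every trapezoid */
    local_n = n / comm_sz;      /* trapezoids handled by each process */
    local_a = a + my_rank * local_n * h;
    local_b = local_a + local_n * h;
    local_int = Trap(local_a, local_b, local_n, h);

    if (my_rank != 0) {
        /* every non-zero process sends its partial area to process 0 */
        MPI_Send(&local_int, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    } else {
        /* process 0 adds its own area plus the areas received from the others */
        total_int = local_int;
        for (int source = 1; source < comm_sz; source++) {
            MPI_Recv(&local_int, 1, MPI_DOUBLE, source, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            total_int += local_int;
        }
        printf("With n = %d trapezoids, the integral from %f to %f = %.15e\n",
               n, a, b, total_int);
    }

    MPI_Finalize();
    return 0;
}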

1. Figure 3.3(a) shows the actual curved area under the graph of y = f(x) between points a and b.
2. This shaded part represents the true area (integral) that we want to calculate.
3. Figure 3.3(b) shows how we can approximate this curved area by dividing it into small trapezoids
instead of using the curve directly.
4. Each trapezoid fits under a small section of the curve, joining straight lines between points on the
graph.
5. The more trapezoids we use, the closer our total area will be to the real curved area.
6. Figure 3.4 zooms into one of those trapezoids to explain how its area is calculated.
7. The vertical sides of the trapezoid are at xᵢ and xᵢ₊₁.
8. The height (h) of the trapezoid is the distance between xᵢ and xᵢ₊₁.
9. The two parallel sides of the trapezoid have lengths f(xᵢ) and f(xᵢ₊₁), which are the function values at those points.
10. The area of one trapezoid is given by the formula:
Area = (h/2) × [f(xᵢ) + f(xᵢ₊₁)]
This is how the total area under the curve is estimated.

⚙️8. Advantages of MPI Trapezoidal Rule


✅ Faster computation because all processes work together
✅ Good use of multi-core or distributed systems
✅ Easy to extend for larger data
✅ Reduces the time compared to single-core processing

✅ In short:
The Trapezoidal Rule in MPI divides the curve into small trapezoids and shares the work among multiple
processes. Each process computes part of the area, and process 0 collects all results. This makes the
integration process faster, efficient, and parallel.
6. Illustrate the MPI Scatter and Gather Modules
MPI (Message Passing Interface) provides special collective communication functions called MPI_Scatter
and MPI_Gather.
These are used when data needs to be divided among multiple processes (Scatter) and later collected
back to one process (Gather).
Let’s understand both one by one in very simple words.

🟩 1. MPI_Scatter (Distributing Data)


Meaning:
MPI_Scatter is used to send parts of an array from one process (usually process 0) to all other processes in
a communicator.
Think of it like a teacher (process 0) distributing different pages of a worksheet to different students
(processes).
Each student only gets their part of the total pages.

Syntax:
MPI_Scatter(
void *send_buf_p, // data to send (only used by root)
int send_count, // how many elements each process gets
MPI_Datatype send_type,
void *recv_buf_p, // where received data will be stored
int recv_count,
MPI_Datatype recv_type,
int src_proc, // root process (usually 0)
MPI_Comm comm // communicator (like MPI_COMM_WORLD)
);

How it works:
 Suppose there are 4 processes and process 0 has an array of 8 numbers:
 [10, 20, 30, 40, 50, 60, 70, 80]
 We want each process to get 2 numbers.
 Then MPI_Scatter will do the following:
o Process 0 → gets [10, 20]
o Process 1 → gets [30, 40]
o Process 2 → gets [50, 60]
o Process 3 → gets [70, 80]
So the array is divided (scattered) into equal parts automatically.

Simple Example (from the textbook):


In Program 3.9 of the document, MPI_Scatter is used in a function called Read_vector to distribute parts of
a large vector to all processes.
Example code (simplified):
MPI_Scatter(a, local_n, MPI_DOUBLE,
local_a, local_n, MPI_DOUBLE,
0, MPI_COMM_WORLD);
✅ Process 0 sends parts of the full vector a.
✅ Each process receives its part in local_a.
✅ Saves memory and avoids sending the whole vector to everyone.

🟩 2. MPI_Gather (Collecting Data Back)


Meaning:
MPI_Gather does the opposite of MPI_Scatter.
It collects pieces of data from all processes and combines them into one array on a single process (usually
process 0).
Think of it like the teacher collecting all answer sheets back from the students.

Syntax:
MPI_Gather(
void *send_buf_p, // data to send from each process
int send_count, // how many items each sends
MPI_Datatype send_type,
void *recv_buf_p, // final array on root process
int recv_count, // how many items received from each
MPI_Datatype recv_type,
int dest_proc, // root process
MPI_Comm comm
);

How it works:
Continuing the previous example —
Each process has these small arrays:
P0 → [10, 20]
P1 → [30, 40]
P2 → [50, 60]
P3 → [70, 80]
Now, using MPI_Gather, all are collected into one large array on process 0:
[10, 20, 30, 40, 50, 60, 70, 80]

Simple Example (from the textbook):


In Program 3.10 (Print_vector function), MPI_Gather is used to collect distributed data:
MPI_Gather(local_b, local_n, MPI_DOUBLE,
b, local_n, MPI_DOUBLE,
0, MPI_COMM_WORLD);
✅ Each process sends its part local_b.
✅ Process 0 gathers all parts into b.
✅ Then process 0 prints the entire vector.
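Putting the two calls together, here is a compact, self-contained sketch (the vector length N = 8 and its contents are assumed for illustration): process 0 scatters a vector, every process updates its own slice, and the slices are gathered back and printed by process 0.

#include <stdio.h>
#include <mpi.h>

#define N 8   /* assumed total vector length; must be divisible by comm_sz */

int main(void) {
    int my_rank, comm_sz;
    double a[N], local_a[N], b[N];

    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
    int local_n = N / comm_sz;

    if (my_rank == 0)                    /* only the root fills the full vector */
        for (int i = 0; i < N; i++) a[i] = 10.0 * (i + 1);   /* 10, 20, ..., 80 */

    /* each process receives its own slice of a in local_a */
    MPI_Scatter(a, local_n, MPI_DOUBLE, local_a, local_n, MPI_DOUBLE,
                0, MPI_COMM_WORLD);

    for (int i = 0; i < local_n; i++)    /* do some local work on the slice */
        local_a[i] += 1.0;

    /* collect the slices back into b on process 0 */
    MPI_Gather(local_a, local_n, MPI_DOUBLE, b, local_n, MPI_DOUBLE,
               0, MPI_COMM_WORLD);

    if (my_rank == 0) {
        for (int i = 0; i < N; i++) printf("%g ", b[i]);
        printf("\n");
    }

    MPI_Finalize();
    return 0;
}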

🟨 3. Key Difference between Scatter and Gather


Feature     | MPI_Scatter                                     | MPI_Gather
Purpose     | Distributes data from one process to all others | Collects data from all processes to one process
Direction   | One → Many                                      | Many → One
Used in     | Input distribution                              | Output collection
Memory need | Only process 0 has full data                    | Only process 0 receives full data
Analogy     | Teacher giving pages to students                | Teacher collecting pages back

🟩 4. Simple Real-Life Example


 Suppose we have a class project:
o Teacher divides the topic into 4 parts and gives one part to each student → MPI_Scatter.
o After students complete their work, they submit it back to the teacher → MPI_Gather.

🟦 5. Advantages
 Efficient: Saves memory and reduces time compared to sending data manually using many
MPI_Send and MPI_Recv calls.
 Simpler code: Only one function call distributes or collects all data.
 Faster communication: MPI optimizes these internally for speed.

✅ Final Summary (In Simple Words)


 MPI_Scatter → Breaks and sends parts of data from one process to all.
 MPI_Gather → Collects all small parts back into one process.
 Both are collective communication functions that help share and combine data in parallel
programming using MPI.

In short:
“MPI_Scatter sends pieces of data out to everyone, and MPI_Gather collects them back together.”

🌳 Tree-Structured Communication (Detailed Answer – 10 Marks)

🧠 1. Meaning
Tree-structured communication is a method of organizing message passing among multiple processes in
the form of a binary tree structure.
It is mainly used in parallel and distributed computing (like MPI programs) to perform operations such as a
global sum, broadcast, or data collection efficiently.
Instead of every process sending its result directly to one process (which causes overload), communication
happens step-by-step through a tree of processes, just like the branches of a tree passing information
upward or downward.

🌿 2. How It Works (Step-by-Step Explanation)


Let’s assume we have 8 processes numbered from 0 to 7, and each process has a value.
We want to find the total sum of all values.
A binary tree structure is used as follows:
🔹 Phase 1:
 Processes 1, 3, 5, and 7 each send their values to 0, 2, 4, and 6 respectively.
 Processes 0, 2, 4, and 6 receive the values and add them to their own values.
🔹 Phase 2:
 Now, processes 2 and 6 send their new (summed) values to 0 and 4.
 Processes 0 and 4 again add these new values.
🔹 Phase 3:
 Finally, process 4 sends its newest value to process 0,
and process 0 adds it to its total.
Now, process 0 holds the final total sum of all the values.
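This phase pattern can be written with ordinary sends and receives. The snippet below is a minimal sketch meant to sit inside a normal MPI program after the rank/size calls; the starting value my_sum = my_rank + 1 is just an assumed example.

/* assumes my_rank and comm_sz have already been obtained */
int my_sum = my_rank + 1;      /* assumed example value held by this process */
int received;

for (int step = 1; step < comm_sz; step *= 2) {
    if (my_rank % (2 * step) != 0) {
        /* this process is a "sender" at this level: pass the partial sum down */
        MPI_Send(&my_sum, 1, MPI_INT, my_rank - step, 0, MPI_COMM_WORLD);
        break;                              /* done after sending once */
    } else if (my_rank + step < comm_sz) {
        /* this process is a "receiver": take the neighbour's sum and add it */
        MPI_Recv(&received, 1, MPI_INT, my_rank + step, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        my_sum += received;
    }
}
/* after the loop, process 0 holds the total of all the per-process values */

This mirrors the phases above: at each step, half of the remaining processes send their partial sums and drop out, while the rest receive and add.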

⚙️3. Why It Is Better Than the Simple Method


In the normal (simple) communication method:
 Every process (1–7) sends its value to process 0.
 Process 0 performs 7 receives + 7 additions.
 Other processes do almost no work.
 Process 0 becomes overloaded and slow.
In the tree-structured method:
 Each process only performs at most 2 sends and 2 additions.
 Many operations happen at the same time (concurrently).
 The total time to complete the communication is much less.

🕒 4. Example of Speed Improvement


Number of Processes | Simple Method: Receives/Adds at Process 0 | Tree-Structured Method: Receives/Adds at Process 0
8                   | 7                                          | 3
1024                | 1023                                       | 10
So, with 1024 processes, process 0 goes from 1023 receives and adds down to only 10, which is more than a 100-fold reduction in its work!

💡 5. Parallelism (Work Done Simultaneously)


 In the first phase, processes 0, 2, 4, and 6 can all receive and add at the same time.
 This parallel activity makes the overall communication much quicker.
 The total time depends only on the height of the tree (number of levels), not on the number of
processes.
Thus, as we increase the number of processes, time increases very slowly (logarithmically).

⚖️6. Advantages
1. Faster communication:
o Reduces time complexity from O(n) to O(log₂ n).
2. Balanced workload:
o Every process participates equally in sending and adding data.
3. Concurrent execution:
o Multiple processes can work at once.
4. Scalability:
o Works efficiently even when there are thousands of processes.
5. Less bottleneck:
o No single process is overloaded with all messages.

🧩 7. Different Tree Structures


There can be many ways to build a tree structure:
 Example 1: Pair (0,2), (1,3), (4,6), (5,7)
 Example 2: Pair (0,4), (1,5), (2,6), (3,7)
Each structure gives different performance depending on:
 The number of processes (small or large tree)
 The system hardware (system A or B)
Hence, selecting the best tree design depends on the system and the type of operation.

🧮 8. Real-Life Analogy
Think of it like a relay race:
 Instead of all runners handing their baton directly to the last runner,
 They first pass it to teammates in pairs, then pairs pass to others,
 Until finally one runner gets the combined baton (final result).
This saves time and energy — just like in tree-structured communication.

🏁 9. Conclusion
Tree-structured communication is a powerful and efficient method used in parallel programming to share
and combine data among multiple processes.
It reduces communication time dramatically by using a binary tree pattern, allowing multiple operations to
happen simultaneously.
This makes it ideal for global sum, reduction, and broadcast operations in systems like MPI.

✅ In short:
Tree-structured communication organizes processes like a tree, where data flows up and down through
branches.
It makes message passing faster, balanced, and highly parallel compared to direct one-to-one
communication.

🌳 Explanation of Figure 3.6 and Figure 3.7

FIGURE 3.6 – A Tree-Structured Global Sum

📘 What the Figure Shows:


Figure 3.6 shows a binary tree pattern used to perform a global sum (adding values from all processes)
efficiently.
Each process has some data (for example, a number), and instead of all sending directly to process 0, they
send in pairs like branches of a tree.

⚙️Step-by-Step Explanation:
1. Initial Step:
o Suppose there are 8 processes (0 to 7).
o Each process has a number to add.
2. Phase 1 (First Level of the Tree):
o Process 1 → 0
o Process 3 → 2
o Process 5 → 4
o Process 7 → 6
➤ Each receiver (0, 2, 4, 6) adds the received value to its own.
🟢 Now we have 4 new sums stored in processes 0, 2, 4, and 6.

3. Phase 2 (Second Level of the Tree):


o Process 2 → 0
o Process 6 → 4
➤ Processes 0 and 4 again add the received values.
🟢 Now processes 0 and 4 each have a bigger combined sum.

4. Phase 3 (Final Level of the Tree):


o Process 4 → 0
➤ Process 0 adds the received value and now holds the final total sum of all 8 values.

💡 Key Idea:
 Instead of 7 processes sending everything directly to process 0 (which would be slow), the data
moves up the tree in steps.
 At each level, some processes send, others receive and add.
 This allows parallel addition — several adds happen at the same time.

Efficiency:
 In the old (simple) method → Process 0 did 7 adds.
 In the tree method → Process 0 does only 3 adds.
 For 1024 processes, process 0 does only 10 adds instead of 1023!
So, Figure 3.6 shows how a binary tree structure reduces communication time and spreads work among
processes.

FIGURE 3.7 – An Alternative Tree-Structured Global Sum

📘 What the Figure Shows:


Figure 3.7 shows that there is not only one way to form a tree for communication.
It gives an alternative tree structure for the same global sum operation — the processes are paired
differently.

⚙️How This Alternative Works:


1. Phase 1:
o Pairing is done differently:
(0,4), (1,5), (2,6), (3,7)
o In each pair, one process sends its value to the other, and the receiver adds.
2. Phase 2:
o The next level of pairing changes:
(0,2) and (1,3)
o Each receiver adds the received value.
3. Phase 3:
o Finally, (0,1)
o Process 0 receives the last value and calculates the final total.

💡 Key Point:
 This shows that the tree structure is flexible — we can design many different pairing combinations.
 Each combination can work differently depending on:
o The size of the system
o The number of processes
o The hardware setup
Some trees may perform better on certain systems (for example, smaller vs larger clusters).

⚖️Why Figure 3.7 Is Important:


 It helps us understand that there is no single “perfect” tree.
 Different process-pairing patterns can exist, and each may have different speed or performance on
different systems.
 MPI (Message Passing Interface) later solves this problem by providing built-in optimized tree-
structured functions like MPI_Reduce() so the programmer doesn’t need to manually test all
possible trees.
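For example, a single call such as the following (the variable names are assumed for illustration) lets the MPI library choose an efficient, usually tree-structured, pattern internally:

int local_val = my_rank + 1;   /* assumed per-process value */
int global_sum = 0;

/* the library performs the tree-structured combination internally */
MPI_Reduce(&local_val, &global_sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);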

🧠 Summary of Both Figures


Feature         | Figure 3.6                                  | Figure 3.7
Type            | Basic binary tree                           | Alternative binary tree
Pairing example | (1→0), (3→2), (5→4), (7→6)                  | (0↔4), (1↔5), (2↔6), (3↔7)
Purpose         | To show basic tree-based global sum         | To show there are multiple tree pairing options
Benefit         | Reduces communication time, parallel adding | Shows flexibility and different possible trees
Result          | Process 0 gets the total sum                | Process 0 also gets total, but by another route

🏁 In Simple Words
 Figure 3.6 = Shows how a tree-structured method combines data efficiently step by step.
 Figure 3.7 = Shows another way to build such a tree, proving that there can be many different trees
for the same purpose.

✅ Final Line (for conclusion):


Both Figure 3.6 and Figure 3.7 explain that in tree-structured communication, data moves in a hierarchy like
a tree.
This reduces time, balances work, and makes global communication in MPI programs much faster and more
efficient.

Q8. Explain Collective vs. Point-to-Point Communications.


In MPI (Message Passing Interface), processes in a parallel program communicate using two main types of
communication: point-to-point and collective communications.
Let’s understand both and how they differ.

1. Point-to-Point Communication
 In point-to-point communication, only two processes are involved — one sender and one receiver.
 It uses functions such as MPI_Send and MPI_Recv.
 The sender sends a message to a particular process, and the receiver gets it.
 Each send and receive must match correctly (same communicator, same tag, correct
sender/receiver ranks).
Example:
If process 1 wants to send data to process 0:
MPI_Send(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
MPI_Recv(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
Only processes 0 and 1 are communicating here.

2. Collective Communication
 Collective communication involves all the processes in a communicator (like MPI_COMM_WORLD).
 Every process must call the same collective function (like MPI_Bcast, MPI_Reduce, MPI_Allreduce,
etc.).
 It is used for global operations such as broadcasting data, gathering results, or reducing data from
all processes to one.
Example:
All processes may add up their results together using:
MPI_Reduce(&local_sum, &total_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
Here, all processes participate, and process 0 receives the total sum.

3. Major Differences Between Collective and Point-to-Point Communications


Number of processes involved:
  Collective: All processes in the communicator must take part.
  Point-to-Point: Only two processes — a sender and a receiver.
Function calls:
  Collective: All processes call the same collective function (e.g., MPI_Reduce, MPI_Bcast).
  Point-to-Point: Each process calls its own matching send/receive functions (MPI_Send, MPI_Recv).
Matching:
  Collective: Matched only by communicator and order of calls — no tags are used.
  Point-to-Point: Matched by communicator, tag, and process ranks.
Arguments:
  Collective: All processes must give compatible arguments (e.g., same destination process, same operator).
  Point-to-Point: Each send/receive pair can have its own matching arguments.
Output usage:
  Collective: The output variable (e.g., in MPI_Reduce) is only valid on the destination process, but all must still pass an argument (even NULL).
  Point-to-Point: Output buffer belongs only to the receiving process.
Errors:
  Collective: If one process skips the collective call or uses incompatible arguments, the program can hang or crash.
  Point-to-Point: If send/receive pairs don’t match, only those two processes are affected.

4. Important Rules in Collective Communication


1. All processes must participate:
Every process in the communicator must call the same collective function.
Mixing collective and point-to-point calls (e.g., one process using MPI_Reduce, another using
MPI_Recv) will cause the program to hang or crash.
2. Compatible arguments:
All processes must provide compatible arguments. For example, if one process says the destination
is process 0 but another says 1, the result is unpredictable.
3. Output argument rule:
The output buffer is used only on the destination process (like process 0 in MPI_Reduce), but every
process must still provide it (even if it’s NULL).
4. Matching and order:
Collective communications are matched by communicator and order, not by tags.
For example, two MPI_Reduce calls are matched by the sequence they occur in the code — not by
variable names.
5. Aliasing not allowed:
You cannot use the same memory for both input and output in a collective function.
Example of illegal call:
MPI_Reduce(&x, &x, 1, MPI_DOUBLE, MPI_SUM, 0, comm); // ❌ illegal
This may cause incorrect results or a crash because input and output buffers are overlapping (aliased).

5. Example for Understanding


Consider this situation:
Process 0               | Process 1               | Process 2
a = 1; c = 2            | a = 1; c = 2            | a = 1; c = 2
MPI_Reduce(&a, &b, ...) | MPI_Reduce(&c, &d, ...) | MPI_Reduce(&a, &b, ...)
MPI_Reduce(&c, &d, ...) | MPI_Reduce(&a, &b, ...) | MPI_Reduce(&c, &d, ...)
At first glance, it seems b = 3 and d = 6, but actually:
 The first set of calls match with each other (in order), not by variable name.
 So b = 1 + 2 + 1 = 4 and d = 2 + 1 + 2 = 5.
This shows that order of calls decides which operations are grouped together.

✅ In Summary
 Point-to-point = One-to-one communication between two processes.
 Collective = One-to-all, all-to-one, or all-to-all communication involving all processes.
 Main differences are in participation, matching rules, and how arguments and results are handled.
 Collective calls are simpler for global data sharing but must follow strict rules — all processes must
join and arguments must match.
Q9. Write a parallel implementation of vector addition.
Ans : Concept:
Vector addition means adding two vectors element by element.
If we have
x = (x₀, x₁, x₂, …, xₙ₋₁) and
y = (y₀, y₁, y₂, …, yₙ₋₁),
then
z = x + y = (x₀+y₀, x₁+y₁, x₂+y₂, …, xₙ₋₁+yₙ₋₁)
In serial programming, this can be done using a simple loop — but it is slow for large vectors.
In parallel programming (MPI), we divide the work among multiple processes so that each process adds a
portion of the vectors.

Steps in Parallel Implementation


1. Distribute data (Partitioning):
Divide the total vector into equal parts.
o If total size = n and number of processes = p,
then each process will handle local_n = n / p elements.
o Example: if n=12 and p=3,
 Process 0 → elements 0–3
 Process 1 → elements 4–7
 Process 2 → elements 8–11
2. Send data to processes:
o Process 0 (master) reads the full vectors x and y.
o It uses MPI_Scatter() to send the respective parts of each vector to every process.
3. Each process adds its part:
Every process adds the corresponding elements of its local vectors
→ local_z[i] = local_x[i] + local_y[i]
4. Collect results:
o After addition, each process sends its result back to process 0.
o Process 0 uses MPI_Gather() to collect all the partial results and form the complete vector z.
5. Display result:
Process 0 prints the final summed vector.
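These steps can be put together as a complete program. This is a minimal sketch: the vector length n = 12 and the input values are chosen to match the output example shown below, and n is assumed to be evenly divisible by the number of processes.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

/* each process adds its local slices element by element */
void Parallel_vector_sum(double local_x[], double local_y[],
                         double local_z[], int local_n) {
    for (int local_i = 0; local_i < local_n; local_i++)
        local_z[local_i] = local_x[local_i] + local_y[local_i];
}

int main(void) {
    int my_rank, comm_sz;
    int n = 12;                      /* assumed total vector length */
    double *x = NULL, *y = NULL, *z = NULL;

    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);

    int local_n = n / comm_sz;
    double *local_x = malloc(local_n * sizeof(double));
    double *local_y = malloc(local_n * sizeof(double));
    double *local_z = malloc(local_n * sizeof(double));

    if (my_rank == 0) {
        /* process 0 builds the full input vectors */
        x = malloc(n * sizeof(double));
        y = malloc(n * sizeof(double));
        z = malloc(n * sizeof(double));
        for (int i = 0; i < n; i++) { x[i] = i + 1; y[i] = 2; }
    }

    /* distribute equal slices of x and y to every process */
    MPI_Scatter(x, local_n, MPI_DOUBLE, local_x, local_n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Scatter(y, local_n, MPI_DOUBLE, local_y, local_n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    Parallel_vector_sum(local_x, local_y, local_z, local_n);

    /* collect the partial results back on process 0 */
    MPI_Gather(local_z, local_n, MPI_DOUBLE, z, local_n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (my_rank == 0) {
        printf("z = ");
        for (int i = 0; i < n; i++) printf("%g ", z[i]);
        printf("\n");
        free(x); free(y); free(z);
    }

    free(local_x); free(local_y); free(local_z);
    MPI_Finalize();
    return 0;
}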

Output Example
If
x = [1 2 3 4 5 6 7 8 9 10 11 12]
y = [2 2 2 2 2 2 2 2 2 2 2 2]
Then final result printed by process 0:
z = [3 4 5 6 7 8 9 10 11 12 13 14]
→ In short:
Parallel vector addition in MPI means dividing the data among processes, performing local additions
independently, and combining results. This saves time and makes computation efficient. ✅

🌟 Detailed Explanation (In Very Simple Words)


1. What is vector addition?
 A vector is just a list of numbers.
Example:
 x = [2, 4, 6]
 y = [1, 3, 5]
 To add two vectors, we add each element one by one:
 z = [2+1, 4+3, 6+5] = [3, 7, 11]

2. What does “parallel implementation” mean?


 In a normal (serial) program, one CPU does all the additions one after another.
 In a parallel program, we divide the work among many CPUs (or processes).
o Each process handles a part of the vector.
o All processes work at the same time (in parallel).
 After everyone finishes, the results are combined to form the full vector.

3. How the program works


Each process runs this small function called Parallel_vector_sum.
Let’s see how it works step-by-step:
🧩 Function Inputs:
Parameter Meaning
local_x[] The part of vector x given to this process
local_y[] The part of vector y given to this process
local_z[] The part of the final result vector (output)
local_n Number of elements handled by this process

4. Inside the function:


for (local_i = 0; local_i < local_n; local_i++)
local_z[local_i] = local_x[local_i] + local_y[local_i];
 Each process starts a loop from 0 to local_n - 1.
 For every position local_i:
o It adds the corresponding elements from local_x and local_y.
o Stores the result in local_z.
So each process is independently adding its part of the vector.

5. Example
Let’s say we have 6 elements and 3 processes.
Process | Elements it gets from x | Elements it gets from y | Work done
P0      | [2, 4]                  | [1, 3]                  | [2+1, 4+3] = [3, 7]
P1      | [6, 8]                  | [5, 7]                  | [6+5, 8+7] = [11, 15]
P2      | [10, 12]                | [9, 11]                 | [10+9, 12+11] = [19, 23]
After all processes finish:
 The final vector z = [3, 7, 11, 15, 19, 23]
Each process handled only 2 elements — so the work was divided evenly.

6. Why is it called "parallel"?


Because:
 Each process or core works simultaneously.
 There is no dependency between the elements — each addition can be done independently.
 This makes the problem perfect for parallel computing.

7. Advantages
✅ Speed: Many additions happen at once → faster.
✅ Scalability: Works well even if vector has millions of elements.
✅ Simplicity: Each process performs the same simple operation.

8. Relation to MPI
 In MPI (Message Passing Interface) programs, each process has its own local arrays local_x, local_y,
and local_z.
 Data is divided using functions like MPI_Scatter.
 After the parallel addition, the results are combined using MPI_Gather.
So Parallel_vector_sum() is the core computation step inside an MPI-based parallel program.

🧠 In Short
 Each process adds its part of the vector.
 No communication is needed during addition.
 Final results are later collected together.
 It’s simple, efficient, and a great example of data parallelism.

📝 10-Mark Summary
Definition: Parallel vector addition means dividing two large vectors into smaller parts and adding them simultaneously using multiple processors.
Function Used: Parallel_vector_sum()
Inputs: Local parts of two vectors and their size.
Working: Each process adds corresponding elements and stores them in a local result vector.
Output: Partial sum vectors which are later combined into the final result.
Tools: Implemented using MPI in distributed-memory systems.
Type of Parallelism: Data parallelism (same operation on different data).
Advantage: Faster execution, easy to scale, minimal communication.
Example: If x=[1,2,3,4] and y=[5,6,7,8], two processes each handle half → result [6,8,10,12].
Conclusion: Parallel vector addition is a simple yet powerful example showing how tasks can be split evenly across processors for high performance.
10. Explain MPI-derived datatypes.
Ans : MPI-Derived Datatypes
In MPI (Message Passing Interface), sometimes we need to send different kinds of data together — like
integers, floats, and doubles — or data that is not stored continuously in memory.
To make this easy, MPI allows us to create our own user-defined data type, called a derived datatype.

1. What is a Derived Datatype?


An MPI-derived datatype is a way to group different data items together so they can be sent or received in
a single MPI call instead of many.
It tells MPI:
 What data types are included (like int, double, etc.)
 Where each item is stored in memory

2. Why Do We Need It?


Without a derived datatype, if we want to send three values — for example:
double a, b;
int n;
we must call MPI_Send three times.
This is slow and complicated.
With a derived datatype, we can group a, b, and n together and send them all in one MPI_Send or
MPI_Bcast call — saving time and making the program simpler.

3. How It Works
We tell MPI:
 How many items there are
 What types they are
 How far apart they are in memory (called displacement)
Then we “build” and “commit” this new datatype using special MPI functions.

4. Important MPI Functions


Function Purpose
MPI_Get_address() Finds memory address of each variable
MPI_Type_create_struct() Creates the new datatype
MPI_Type_commit() Finalizes (activates) the datatype
MPI_Type_free() Frees it when no longer needed

5. Example (Easy)
In the sketch below, the derived type input_type groups a, b, and n together, so all three values can be sent in one broadcast.
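The sketch is minimal: the values assigned on process 0 (a = 0.0, b = 3.0, n = 1024) are assumed for illustration, and the calls follow the functions listed in the table above.

#include <stdio.h>
#include <mpi.h>

int main(void) {
    int my_rank;
    double a, b;
    int n;

    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    if (my_rank == 0) { a = 0.0; b = 3.0; n = 1024; }   /* assumed input values */

    /* describe the three items: their counts, types, and memory offsets */
    int          block_lengths[3] = {1, 1, 1};
    MPI_Datatype types[3]         = {MPI_DOUBLE, MPI_DOUBLE, MPI_INT};
    MPI_Aint     displs[3], a_addr, b_addr, n_addr;
    MPI_Get_address(&a, &a_addr);
    MPI_Get_address(&b, &b_addr);
    MPI_Get_address(&n, &n_addr);
    displs[0] = 0;                 /* a is the reference point */
    displs[1] = b_addr - a_addr;   /* offset of b relative to a */
    displs[2] = n_addr - a_addr;   /* offset of n relative to a */

    MPI_Datatype input_type;
    MPI_Type_create_struct(3, block_lengths, displs, types, &input_type);
    MPI_Type_commit(&input_type);

    /* one broadcast sends a, b, and n together from process 0 to everyone */
    MPI_Bcast(&a, 1, input_type, 0, MPI_COMM_WORLD);
    printf("Process %d now has a = %f, b = %f, n = %d\n", my_rank, a, b, n);

    MPI_Type_free(&input_type);
    MPI_Finalize();
    return 0;
}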

6. Advantages
✅ Sends many values together (faster communication)
✅ Easy to handle mixed data types
✅ Reduces coding effort
✅ Improves program performance

7. In Simple Words
MPI-derived datatype means making your own “data packet” that combines many variables into one.
You can then send this packet easily between processes — just like sending one message instead of several
small ones.

→ In short:
MPI-derived datatypes help to send different or scattered data in one go, making MPI programs faster and
simpler.

11. A parallel sorting algorithm


Ans : 1. What is a parallel sorting algorithm?
A parallel sorting algorithm is a method of sorting data where multiple processes (or computers) work
together at the same time to sort a collection of keys. Instead of one process doing all the work, the data is
split among many processes, and each process does part of the sorting simultaneously. This makes sorting
much faster for large datasets.

Parallel Sorting Algorithm (Odd–Even Transposition Sort)


A parallel sorting algorithm is used to sort large data using many processors working together. Each
processor sorts its part of the data and exchanges information with others through MPI (Message Passing
Interface).

1. Idea
 Total data of size n is divided among p processors, so each gets n/p elements.
 Each processor first sorts its local data using a serial sort (like quicksort).
 Then all processors cooperate to arrange the global order.

2. Steps of Parallel Odd–Even Transposition Sort


1. Local Sort – Each process sorts its own data.
2. p Phases of Communication:
o In each phase, every process finds a partner process:
 Even phase: even-ranked ↔ next odd-ranked
 Odd phase: odd-ranked ↔ next even-ranked
o They exchange their data using MPI_Sendrecv().
o Each keeps either:
 the smaller half (lower-ranked process), or
 the larger half (higher-ranked process).
3. After all p phases, all elements across all processors are globally sorted.
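A minimal sketch of the whole algorithm follows (the helper functions Merge_low and Merge_high, the number of keys per process, and the random input are illustrative assumptions, not textbook code). After the phase loop, process 0 ends with the smallest keys and the highest-ranked process with the largest.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mpi.h>

/* keep the local_n smallest of the two sorted blocks in local_keys */
static void Merge_low(int local_keys[], int recv_keys[], int temp[], int local_n) {
    int i = 0, j = 0;
    for (int k = 0; k < local_n; k++)
        temp[k] = (j >= local_n || (i < local_n && local_keys[i] <= recv_keys[j]))
                      ? local_keys[i++] : recv_keys[j++];
    memcpy(local_keys, temp, local_n * sizeof(int));
}

/* keep the local_n largest of the two sorted blocks in local_keys */
static void Merge_high(int local_keys[], int recv_keys[], int temp[], int local_n) {
    int i = local_n - 1, j = local_n - 1;
    for (int k = local_n - 1; k >= 0; k--)
        temp[k] = (j < 0 || (i >= 0 && local_keys[i] >= recv_keys[j]))
                      ? local_keys[i--] : recv_keys[j--];
    memcpy(local_keys, temp, local_n * sizeof(int));
}

static int Compare(const void *a, const void *b) {
    return (*(const int *)a) - (*(const int *)b);
}

int main(void) {
    int my_rank, comm_sz, local_n = 4;          /* assumed keys per process */
    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);

    int *local_keys = malloc(local_n * sizeof(int));
    int *recv_keys  = malloc(local_n * sizeof(int));
    int *temp       = malloc(local_n * sizeof(int));

    srand(my_rank + 1);                          /* assumed random input */
    for (int i = 0; i < local_n; i++) local_keys[i] = rand() % 100;

    qsort(local_keys, local_n, sizeof(int), Compare);   /* step 1: local sort */

    for (int phase = 0; phase < comm_sz; phase++) {      /* step 2: p phases */
        int partner = (phase % 2 == 0)
                          ? ((my_rank % 2 == 0) ? my_rank + 1 : my_rank - 1)
                          : ((my_rank % 2 == 0) ? my_rank - 1 : my_rank + 1);
        if (partner < 0 || partner >= comm_sz) continue;  /* idle this phase */

        /* swap whole local blocks with the partner in one deadlock-safe call */
        MPI_Sendrecv(local_keys, local_n, MPI_INT, partner, 0,
                     recv_keys,  local_n, MPI_INT, partner, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        if (my_rank < partner) Merge_low(local_keys, recv_keys, temp, local_n);
        else                   Merge_high(local_keys, recv_keys, temp, local_n);
    }

    printf("Process %d final keys:", my_rank);
    for (int i = 0; i < local_n; i++) printf(" %d", local_keys[i]);
    printf("\n");

    free(local_keys); free(recv_keys); free(temp);
    MPI_Finalize();
    return 0;
}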

3. Example
If 4 processes are used, process 0 has the smallest values and process 3 has the largest after sorting.

4. Features
 Uses message passing (MPI) safely with MPI_Sendrecv.
 Simple and effective for moderate-sized data.
 Works on distributed memory systems.

5. Advantages
✅ Faster than serial sort
✅ Easy to implement
✅ Scales with number of processors
6. Limitations
❌ Needs many communication phases
❌ Slower if network cost is high

Result:
At the end, each process has sorted data, and all data across processes are in correct global order.
Hence, Parallel Odd–Even Transposition Sort provides an efficient way to sort data using MPI in parallel
systems.
