
Module-1

Why Parallel Computing?


Around the early 2000s there was a shift in computer processor design: performance improvements from faster single-core processors started to slow down.

1. Why do we care? Aren’t single-processor systems fast enough?


 At first glance, a 20% yearly increase still seems very good.
 But compared to the 50% yearly increase in the 1986–2002 era, it’s much smaller:
o 50% per year → ~60× faster in 10 years
o 20% per year → ~6× faster in 10 years
 Many applications (like gaming, scientific computing, big data analysis, AI) need
massive performance growth. A slowdown means we can’t rely on just waiting for
faster chips.

2. Why can’t we just keep making faster single processors?


There are fundamental limits:
 Power and heat: Higher clock speeds cause chips to overheat ("power wall").
 Physics: Shrinking transistors further makes leakage currents and energy inefficiency
worse.
 Memory bottleneck: CPUs are much faster than memory ("memory wall"), so the CPU
often waits for data.
So, manufacturers couldn’t just keep ramping up clock speed forever.

3. Why build multiprocessor (parallel) systems instead?


 By putting multiple cores (processors) on a single chip, manufacturers could still
increase performance without increasing clock speed too much.
 This means instead of one super-fast brain, you get many brains working together.
 The catch: programs need to be written in a way that uses multiple processors
(parallel programming).

4. Why can’t we automatically convert serial programs into parallel ones?


 It’s very hard because:
o Many programs have dependencies (one step must finish before the next).
o Automatic parallelization is limited — compilers can help, but they can’t always
figure out how to split work safely.
o Some tasks are inherently sequential (Amdahl’s Law; a small sketch of this limit follows below).
So, developers must design software with parallelism in mind to take advantage of multiple
cores.
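To make the Amdahl's Law limit concrete: if a fraction f of a program's runtime can be parallelized and the rest is inherently sequential, the best possible speedup on p cores is 1 / ((1 − f) + f/p). Below is a minimal C sketch (not from the text; the values of f and p are just illustrative):

#include <stdio.h>

/* Best-case speedup predicted by Amdahl's Law for a parallel fraction f
 * of the runtime and p cores. */
double amdahl_speedup(double f, int p) {
    return 1.0 / ((1.0 - f) + f / p);
}

int main(void) {
    /* Even with 90% of the work parallelizable, 8 cores give only ~4.7x,
     * and no number of cores can push the speedup past 10x. */
    printf("f = 0.9, p = 8    -> speedup = %.2f\n", amdahl_speedup(0.9, 8));
    printf("f = 0.9, p = 1000 -> speedup = %.2f\n", amdahl_speedup(0.9, 1000));
    return 0;
}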
Why we need ever-increasing performance:
1. Past advances depended on computation
 Human genome decoding,
 Medical imaging improvements,
 Fast & accurate web searches,
 Realistic computer games.
These wouldn’t have been possible without the huge leaps in processor performance over past decades.
Also, new advances depend on older ones → today’s progress builds on yesterday’s
computational power.

2. As power grows, so do the problems we can solve


When performance increases, problems that were once too big or complex become
solvable.

3. Key application areas needing more performance


 Climate modeling
o Need highly accurate simulations including atmosphere, oceans, land, and ice.
o Helps test the effects of interventions on climate change.
 Protein folding 🧬
o Misfolded proteins are linked to diseases (Huntington’s, Parkinson’s,
Alzheimer’s).
o Modeling proteins is extremely complex and limited by current computing
power.
 Drug discovery
o Computational genomics can identify alternative treatments when existing
drugs fail for some patients.
o Requires massive genome analysis.
 Energy research
o Simulations of wind turbines, solar cells, and batteries can lead to cleaner,
more efficient technologies.
 Data analysis
o Data worldwide doubles every ~2 years.
o Raw data (e.g., DNA sequences, collider data, medical imaging, astronomy, web
search logs) is useless unless analyzed.
o Analysis needs vast computational resources.

WHY WE’RE BUILDING PARALLEL SYSTEMS


1. The old model: Faster single processors
 Historically, performance improved by making transistors smaller.
 Smaller transistors lead to faster switching, and hence faster processors.
 For decades, this gave huge performance gains (Moore’s Law).

2. The problem: Power & heat (the “Power Wall”)


 As transistor speed increased, power consumption increased.
 More power → more heat dissipation.
 By the early 2000s, air cooling couldn’t handle the heat.
 Too much heat makes chips unreliable.
So, we can’t just keep raising clock speeds indefinitely.

3. But transistor density can still grow


 We can still fit more transistors on a chip (Moore’s Law continues).
 The challenge: How to use those extra transistors if we can’t just make them faster?

4. The solution is Parallelism


 Instead of one super-fast core, manufacturers build many simpler cores on one chip.
 Each core = a complete processor (CPU).
 A chip with many cores is called a multicore processor.
 Old-style processors with one CPU are now called single-core systems.

5. Why this matters


 Economic reason: The chip industry must keep improving products to survive.
 Moral/innovation reason: More computational power = more progress in science,
medicine, energy, etc.
Why We Need to Write Parallel Programs
1. The problem with old (serial) programs
 Most existing programs were written for single-core systems.
 On a multicore system, you can run multiple instances of the same program (e.g., run
4 games at once), but that’s not useful — users want one program to run faster and
better, not more copies.
 Therefore: To use multiple cores effectively, programs must be parallelized.
2. Automatic conversion isn’t enough
 Researchers have tried to create compilers that translate serial code to parallel code.
 Success has been limited because:
o Translating each step independently into parallel code often leads to
inefficiency.
o Sometimes the best parallel solution requires a completely new algorithm,
not just a step-by-step parallelization of the serial one.
o Example: Matrix multiplication — turning it into parallel dot-products may be
inefficient compared to designing a new parallel matrix multiplication
algorithm.
Serial code (one core): Summation
    sum = 0;
    for (i = 0; i < n; i++) {
        x = ComputeNextValue(...);
        sum += x;
    }
Parallel code (p cores, n ≫ p):
 Divide the loop into p chunks.
 Each core computes its own partial sum:
    my_sum = 0;
    for (my_i = my_first_i; my_i < my_last_i; my_i++) {
        my_x = ComputeNextValue(...);
        my_sum += my_x;
    }

This continues the earlier parallel summation example; it is about reducing bottlenecks in how the partial results are combined.
1. The first method: Centralized collection

 Each core computes its own partial sum (mysum).


 All cores send their results to the master core (say, core 0).
 The master receives each value one by one and adds them up.
Problem: The master core does all the work.
 With 8 cores, master core handles 7 receives + 7 adds.
 As the number of cores grows, the master becomes a bottleneck.

The second method: Pairwise (tree-style) reduction


 Instead of all cores sending to the master, we combine results in stages:
o Stage 1: Pair the cores:
 Core 0 + Core 1, Core 2 + Core 3, Core 4 + Core 5, Core 6 + Core 7.
o Stage 2: Pair the winners (even-numbered cores now hold results):
 Core 0 + Core 2, Core 4 + Core 6.
o Stage 3: Final combination:
 Core 0 + Core 4.
 Now, Core 0 ends up with the total sum, but the work was spread across cores.
Advantage:
 With 8 cores, Core 0 only does 3 adds instead of 7.
 In general, the number of steps is log₂(p) (tree depth) instead of p−1.
o Example:
 8 cores → log₂(8) = 3 steps
 1024 cores → log₂(1024) = 10 steps (instead of 1023 adds by master!)
1. This is called a parallel reduction (or tree-based reduction).
2. It avoids the bottleneck of a single master core.
3. The improvement grows dramatically with the number of cores
The diagram shows, in the top row, that each core (0–7) starts with its local sum:
 Core 0: 8, Core 1: 19, Core 2: 7, Core 3: 15, Core 4: 7, Core 5: 13, Core 6: 12, Core 7: 14
First stage (pairwise sums):
 Core 0 + Core 1 → 27
 Core 2 + Core 3 → 22
 Core 4 + Core 5 → 20
 Core 6 + Core 7 → 26
Second stage (next level of pairing):
 27 (Core 0) + 22 (Core 2) → 49
 20 (Core 4) + 26 (Core 6) → 46
Third stage (final reduction):
 49 + 46 → 95 (global sum, at Core 0).

Comparing the two global sum methods


 Method 1 (naïve / centralized):
o Master adds up results from all cores.
o Needs p − 1 operations (e.g., 999 adds for 1000 cores).
 Method 2 (tree reduction):
o Results combined in pairs over stages.
o Needs log₂(p) operations (e.g., only 10 adds for 1000 cores).
o Much more efficient, especially as p grows. (A small C sketch of the tree reduction follows.)
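The tree reduction can be sketched in plain C by simulating the stages over an array of partial sums; the partial[] array below holds the values from the 8-core diagram above, and the loop structure is only an illustration (on real hardware each "core" would perform its own step):

#include <stdio.h>

int main(void) {
    /* Partial sums from the 8-core example above. */
    int partial[8] = {8, 19, 7, 15, 7, 13, 12, 14};
    int p = 8;

    /* At each stage, core i (a multiple of 2*stride) adds in the value
     * held by core i + stride. */
    for (int stride = 1; stride < p; stride *= 2)
        for (int i = 0; i + stride < p; i += 2 * stride)
            partial[i] += partial[i + stride];

    printf("global sum = %d\n", partial[0]);   /* prints 95, after log2(8) = 3 stages */
    return 0;
}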
How we actually write parallel programs, and the main challenges
Two Main Approaches to Parallelism
1. Task Parallelism
o Different cores do different tasks.
o Example: In grading exams, one person grades only Question 1 (Shakespeare),
another grades Question 2 (Milton), and so on.
o Each is doing a different job, so the instructions differ.
2. Data Parallelism
o Different cores do the same task on different pieces of data.
o Example: Split the 100 exam papers into 5 piles of 20. Each TA grades all
questions on their pile.
o Same instructions, but applied to different data.
The Need for Coordination
Writing parallel programs isn’t just about dividing work—it’s also about making the cores
work together smoothly. This requires:
1. Communication
o Cores often need to send results to each other (e.g., partial sums in global
sum).
o One core may act as the "master" that collects and combines results.
2. Load Balancing
o Work must be divided fairly.
o If one core has too much work while others are idle, performance is wasted.
3. Synchronization
o Cores don’t all run at the exact same speed.
o Sometimes they need to "wait for each other" before moving to the next step.
o Example: If the master core is reading input data, other cores must wait until
it’s done before starting computation.
Parallel programming is about dividing work (task vs. data), making cores cooperate
(communication, load balancing, synchronization), and carefully coding so that the system
actually runs efficiently.

1. Concurrent computing
 Definition: Multiple tasks appear to be in progress at the same time.
 Key idea: It doesn’t require multiple cores/processors. Even a single-core system with
multitasking OS (like time-slicing between tasks) counts as concurrent.
 Example:
o Your laptop running a browser, music player, and text editor concurrently.
o Only one core may be executing at an instant, but the OS switches so fast that
tasks appear to run simultaneously.

2. Parallel computing
 Definition: Multiple tasks are actually executed at the same time on multiple
cores/processors.
 Key idea: Tasks are tightly coupled — they work together on one problem.
 Typical setting: Shared-memory systems or high-speed cluster.
 Example:
o Adding 1 billion numbers using 8 CPU cores.
o Each core sums a portion, and then partial sums are combined.
o Requires coordination (synchronization, communication, load balancing).

3. Distributed computing
 Definition: Multiple tasks run on different computers (often geographically separated)
that communicate via a network.
 Key idea: Tasks are loosely coupled — not necessarily created together.
 Typical setting: Cloud, internet-scale systems.
 Example:
o Google’s web search: thousands of servers across the world crawl, index, and
serve results.
o Each server does part of the work, but they aren’t “sharing memory,” they
exchange messages.
Parallel Hardware and Parallel Software: SOME BACKGROUND

1. Von Neumann Architecture (Classical Computer Design)


The “classical” von Neumann architecture consists of main memory, a central processing unit
(CPU) or processor or core, and an interconnection between the memory and the CPU. Main
memory consists of a collection of locations, each of which is capable of storing both
instructions and data. Every location consists of an address, which is used to access the
location and the contents of the location—the instructions or data stored in the location.
The central processing unit is divided into a control unit and an arithmetic and logic unit
(ALU). The control unit is responsible for deciding which instructions in a program should be
executed, and the ALU is responsible for executing the actual instructions. Data in the CPU
and information about the state of an executing program are stored in special, very fast
storage called registers. The control unit has a special register called the program counter. It
stores the address of the next instruction to be executed.
Instructions and data are transferred between the CPU and memory via the interconnect.
This has traditionally been a bus, which consists of a collection of parallel wires and some
hardware controlling access to the wires. A von Neumann machine executes a single
instruction at a time, and each instruction operates on only a few pieces of data. See Figure
2.1.
When data or instructions are transferred from memory to the CPU, we sometimes say the
data or instructions are fetched or read from memory. When data are transferred from the
CPU to memory, we sometimes say the data are written to memory or stored. The separation of memory and CPU is often called the von Neumann bottleneck.
Von Neumann Bottleneck
 Problem: The CPU is much faster than the memory access speed.
 CPU may execute 100+ instructions in the time it takes to fetch one piece of data from
memory.
 The bus/interconnect limits how quickly data & instructions travel.
Analogy:
 The CPU is like a factory making products.
 Memory is a warehouse storing raw materials (data) and finished products (results).
 The road (bus) is the transport system between them.
 If the road is too narrow (limited bandwidth), the factory workers sit idle because raw materials arrive too slowly.
Why This Matters
 This bottleneck makes computers inefficient.
 If CPU is starved of data/instructions, speed improvements in CPU alone don’t help
much.
 Hence, engineers started modifying this architecture (pipelining, caching, parallelism,
etc.) → to reduce idle time.

Key Concepts: Processes, Multitasking, and Threads


1. Process
When you run a program (say, open Chrome), the OS creates a process.
A process is an active instance of a program + its resources.
A process contains:
 The program code (executable machine instructions).
 Memory areas:
o Call stack → tracks active functions.
o Heap → dynamic memory (e.g., malloc/new).
o Other memory → global vars, data.
 Resource descriptors → files, network sockets, etc.
 Security info → what it can/can’t access.
 State info → is it running, waiting, blocked? Plus register values.
Think of a process like a workspace assigned to a worker in an office:
 Desk (memory),
 Tools (resources),
 Rules (security),
 Current task list (program counter + state).
2. Multitasking
 Modern OS allows multiple processes to appear to run at the same time.
 Even on a single CPU, the OS divides time into time slices (a few ms each).
 The CPU runs process A → then switches to process B → then C → then back to A (this is context switching). If a process is waiting for something (e.g., reading from disk), it gets blocked → the OS switches to another process that can keep working.
Analogy:
 One cook (CPU) works on 10 dishes (processes).
 Each dish gets a few minutes of attention before the cook moves to the next.
 If one dish has to simmer (blocked), the cook works on another.

3. Threads
 A thread is like a smaller unit of work inside a process.
 Processes usually have one “main” thread (the default execution path).
 But a process can create multiple threads and all share the same memory &
resources.
Threads are lightweight compared to processes:
 Don’t need separate memory space, faster switching.
 Still need their own program counter and own call stack (so they can run
independently).
Analogy:
 Process = restaurant.
 Threads = waiters inside that restaurant.
 They share the same kitchen & menu (resources), but each waiter keeps track of their
own table (stack + program counter).

4. Thread Lifecycle (Fork–Join Model)


 When a process starts a new thread → the execution forks (splits).
 When a thread finishes, it joins back to the main thread. (See Fig. 2.2.) A minimal pthreads sketch of this fork–join pattern follows.
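A minimal fork–join sketch using POSIX threads; the worker function, thread count, and messages are illustrative, not from the text:

#include <stdio.h>
#include <pthread.h>

/* Illustrative worker: each forked thread just reports its id. */
void *worker(void *arg) {
    long id = (long) arg;
    printf("thread %ld running\n", id);
    return NULL;
}

int main(void) {
    pthread_t threads[4];

    /* Fork: the main thread creates four new threads. */
    for (long t = 0; t < 4; t++)
        pthread_create(&threads[t], NULL, worker, (void *) t);

    /* Join: the main thread waits for each of them to finish. */
    for (long t = 0; t < 4; t++)
        pthread_join(threads[t], NULL);

    printf("all threads joined\n");
    return 0;
}

(Compile with -pthread.)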

Modifications to the von Neumann model


The von Neumann bottleneck means the CPU is very fast, but memory (RAM) is much
slower. Since the CPU often has to wait for memory, overall performance suffers.
To fix this, computer engineers added caching, virtual memory, and parallelism.
This part is about caching.
What is caching?
Think of it like this:
 CPU = factory
 Main memory = warehouse
 Road between them = slow, two-lane road
The CPU constantly needs raw materials (data & instructions) from memory. If every
time it has to go to the warehouse far away, it wastes time.
The solution is to build a small storeroom (a cache) right next to the CPU.
Cache stores a small amount of data that the CPU is very likely to need soon. It’s much
faster to access than main memory.
Key ideas behind caching
1. Locality principle – programs usually use data near what they just used.
o Spatial locality: If you read z[0], you’ll probably read z[1], z[2]... soon
(arrays).
o Temporal locality: If you use a variable once, you’ll probably use it again
soon.
2. Cache lines (blocks) – instead of fetching one item, CPU loads a whole block.
Example: If each cache line = 16 floats, and program asks for z[0], the CPU loads
z[0]..z[15]. The next 15 reads are super fast.
3. Levels of cache (hierarchy)
o L1: Smallest, fastest (inside CPU core).
o L2: Bigger, slower.
o L3: Even bigger, shared between cores.
CPU checks L1 → L2 → L3 → Main memory.
o If found = cache hit.
o If not = cache miss (stall, must fetch from slow RAM).
Writing data in cache
When CPU writes new data, cache & main memory may become different. Two ways
to handle this:
1. Write-through: Update both cache & memory immediately (slower but
consistent).
2. Write-back: Update cache only, mark it as dirty, and later update main memory
when the cache line is replaced (faster but needs careful management).
Cache mappings
When the CPU asks for data from main memory, and we bring it into the cache, we
must decide:
➡️Which cache slot should this memory block go into?
There are 3 main strategies:

1. Fully Associative
 A memory block can go anywhere in the cache.
 Super flexible, but needs extra hardware to search everywhere.
 Example: Main memory line 0 could go into cache slot 0, 1, 2, or 3.
Think of this like parking in a shopping mall with no assigned spots—you can park
anywhere.

2. Direct Mapped
 Each memory block has exactly one place in the cache.
 Simple & fast, but may cause many conflicts.
 Example: Cache has 4 slots, memory has 16 lines → slot = (line number mod 4).
o Memory line 0, 4, 8, 12 → cache slot 0
o Memory line 1, 5, 9, 13 → cache slot 1, etc.
Like a parking lot where your car has only one assigned spot.

3. N-way Set Associative


 A middle ground: each memory block can go into one of n possible slots (not
anywhere, but not fixed to just one).
 Example: In a 2-way set associative cache with 4 slots, the slots are grouped into 2 sets of 2:
o Memory line 0 → set 0 → can go into cache slot 0 or 1
o Memory line 1 → set 1 → can go into slot 2 or 3; memory line 2 maps back to set 0, etc.
Like parking in a lot where you have 2 assigned spots to choose from.
Eviction Policy (Which block gets kicked out?)
When the cache is full and a new block must be loaded:
 The most common rule = Least Recently Used (LRU)
→ Kick out the block that hasn’t been used in the longest time.
o Example: If line 0 (in slot 0) was just used, and line 2 (in slot 1) hasn’t
been used for a while, then line 2 will be replaced by line 4.
Like clearing out your fridge: you throw away the food you haven’t touched for the
longest time.
Caches and programs: an example
It’s important to remember that the workings of the CPU cache are controlled by the
system hardware, and we, the programmers, don’t directly determine which data and
which instructions are in the cache. However, knowing the principle of spatial and
temporal locality allows us to have some indirect control over caching. As an example,
C stores two-dimensional arrays in “row-major” order. That is, although we think of a
two-dimensional array as a rectangular block, memory is effectively a huge one-
dimensional array. So in row-major storage, we store row 0 first, then row 1, and so
on.
So memory layout of A[4][4] is:
A[0][0], A[0][1], A[0][2], A[0][3],
A[1][0], A[1][1], A[1][2], A[1][3],
A[2][0], A[2][1], A[2][2], A[2][3],
A[3][0], A[3][1], A[3][2], A[3][3]
Cache line:
 A cache doesn’t fetch single variables, it fetches a block (a "cache line").
 Example: 1 cache line = 4 elements.
Locality:
 Spatial locality: if you use A[0][0], you’ll probably use A[0][1], A[0][2] soon.
 Temporal locality: if you use a value once, you might use it again later.
Loop 1 (good locality)
for (i = 0; i < MAX; i++)
    for (j = 0; j < MAX; j++)
        y[i] += A[i][j] * x[j];
 Access pattern: row by row → contiguous memory.
 Example (MAX=4):
o Access order: A[0][0], A[0][1], A[0][2], A[0][3] (all in same cache line →
only 1 miss).
o Next row: A[1][0]...A[1][3] → again, just 1 miss.
 Total misses = 4 (one per row).
Cache lines are used efficiently. Once loaded, many elements are used before
eviction.
Loop 2 (bad locality)
for (j = 0; j < MAX; j++)
    for (i = 0; i < MAX; i++)
        y[i] += A[i][j] * x[j];
 Access pattern: column by column → non-contiguous memory.
 Example (MAX=4):
o Access order: A[0][0], A[1][0], A[2][0], A[3][0].
o Each of these is in a different cache line, so each access = cache miss.
o Then next column A[0][1], A[1][1]... → again 4 misses.
 Total misses = 16. Every element reloads the cache, wasting most of the cache’s
capacity.

Performance Difference
 Loop 1: 4 misses
 Loop 2: 16 misses
 With larger arrays, the difference becomes much bigger.
 That’s why in experiments with MAX=1000, loop 1 was ~3x faster. (A small benchmark sketch follows.)
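A self-contained benchmark along those lines might look like the sketch below (MAX, the fill values, and the use of clock() are illustrative; the exact timings depend on the machine and compiler):

#include <stdio.h>
#include <time.h>

#define MAX 1000

static double A[MAX][MAX], x[MAX], y[MAX];

int main(void) {
    for (int i = 0; i < MAX; i++) {          /* fill A and x with something */
        x[i] = 1.0;
        for (int j = 0; j < MAX; j++) A[i][j] = i + j;
    }

    clock_t start = clock();
    for (int i = 0; i < MAX; i++)            /* loop 1: row by row (good locality) */
        for (int j = 0; j < MAX; j++)
            y[i] += A[i][j] * x[j];
    printf("row order:    %f s\n", (double)(clock() - start) / CLOCKS_PER_SEC);

    for (int i = 0; i < MAX; i++) y[i] = 0.0;

    start = clock();
    for (int j = 0; j < MAX; j++)            /* loop 2: column by column (poor locality) */
        for (int i = 0; i < MAX; i++)
            y[i] += A[i][j] * x[j];
    printf("column order: %f s\n", (double)(clock() - start) / CLOCKS_PER_SEC);

    printf("checksum: %f\n", y[MAX / 2]);    /* keep the work from being optimized away */
    return 0;
}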
Virtual memory
Why Virtual Memory?
 Programs think they each have their own huge memory space (virtual
memory).
 In reality, many programs share the same physical RAM.
 Virtual memory gives:
1. Illusion of large memory (even bigger than RAM, because disk is used as
backup).
2. Protection (one program cannot overwrite another’s memory).
3. Flexibility (any program can use any free RAM block).
Memory is divided into pages (usually 4 KB–16 KB).
Disk also has swap space divided into same-sized pages.
A program uses virtual addresses → these get mapped to physical addresses in RAM.
Page Table
 Each process has a page table: maps virtual page number (VPN) → physical
page number (PPN).
 Example:
Virtual page 5 → Physical page 20
Virtual page 6 → Physical page 11
 Virtual address is split into:
o Page offset (within page, e.g., last 12 bits for 4 KB page).
o Virtual page number (rest of the bits).
Translation = find the VPN in the page table, get the PPN, and combine it with the offset. (A small bit-manipulation sketch follows.)
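A small bit-manipulation sketch of that split, assuming 4 KB pages (a 12-bit offset); the toy page_table[] array stands in for the per-process page table and uses the mappings 5 → 20 and 6 → 11 from the example above:

#include <stdio.h>
#include <stdint.h>

#define PAGE_SIZE   4096u     /* 4 KB pages */
#define OFFSET_BITS 12        /* log2(4096) */

/* Toy page table: page_table[vpn] = ppn (illustrative values only). */
static uint32_t page_table[16] = { [5] = 20, [6] = 11 };

uint32_t translate(uint32_t vaddr) {
    uint32_t vpn    = vaddr >> OFFSET_BITS;      /* virtual page number   */
    uint32_t offset = vaddr & (PAGE_SIZE - 1);   /* offset within page    */
    uint32_t ppn    = page_table[vpn];           /* look up physical page */
    return (ppn << OFFSET_BITS) | offset;
}

int main(void) {
    uint32_t va = (5u << OFFSET_BITS) | 0x123;   /* virtual page 5, offset 0x123 */
    printf("virtual 0x%x -> physical 0x%x\n", va, translate(va));
    return 0;
}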
Problem with Page Tables
 To access memory, we first need to look into the page table.
 That’s an extra memory access (slows things down).
 If the page table is big, it may not fit in cache → even more slowdowns.
Solution: TLB (Translation Lookaside Buffer)
 The TLB is a cache for page table entries.
 Stores most recently used virtual and physical mappings.
 Typically has 16–512 entries (very small but very fast).
 Works like the CPU cache:
o TLB hit → fast translation, no need to check the page table.
o TLB miss → must check the page table in RAM (slower).
Page Faults
 If a page is not in RAM at all (only on disk):
o Page fault occurs.
o OS must fetch the page from disk into RAM (super slow — millions of
cycles).
 To reduce disk writes:
o Virtual memory uses write-back, not write-through.
o A dirty bit marks pages that were modified, only those are written back.
Who Manages What?
 CPU hardware:
o Provides TLB.
o Helps with address translation.
 Operating system:
o Manages the page table.
o Handles page faults.
o Decides which pages to evict from RAM.
Instruction-Level Parallelism (ILP):
ILP = executing multiple instructions at the same time inside a CPU to improve
performance.
There are two common approaches:
1. Pipelining → break instruction execution into stages (like an assembly line).
2. Multiple issue → start more than one instruction at the same time (superscalar
processors).

⚙️Pipelining
Think of pipelining like a car assembly line:
 One worker bolts the engine.
 At the same time, another worker attaches the wheels on a different car.
 Another installs seats on yet another car.
Each worker specializes in one task → multiple cars are processed in parallel.

Example: Floating Point Addition


To add two floating point numbers (like 9.87×10⁴ + 6.54×10³), you need several steps:
1. Fetch operands (get the numbers).
2. Compare exponents (align the powers of 10).
3. Shift operand (so the exponents match).
4. Add the mantissas (the number parts).
5. Normalize (adjust so only one digit before decimal).
6. Round (keep precision).
7. Store result (save final answer).
Without pipelining → each addition takes 7 nanoseconds.
For 1000 additions, that’s 7000 nanoseconds.

With Pipelining
We build 7 hardware units, one for each step.
 While step 1 (fetch operands) is working on the 2nd addition,
 step 2 (compare exponents) is working on the 1st addition,
 step 3 is idle (until the pipeline fills)
After the first few cycles (pipeline “warm-up”), the CPU produces 1 result every nanosecond instead of every 7.
So instead of 7000 ns, the whole loop takes only about 6 + 1000 = 1006 ns (roughly 6 ns to fill the seven-stage pipeline, then one result per nanosecond), almost 7x faster.

Practical Limits
Pipelining doesn’t always give perfect speedup:
 If one stage is slower than the others, the whole pipeline slows to that stage’s speed.
 If data needed by one instruction isn’t ready yet, the pipeline stalls.
 Hazards (data hazards, control hazards, structural hazards) also cause delays.

From Table 2.3 we see that after time 5, the pipelined loop produces a result every
nanosecond, instead of every seven nanoseconds, so the total time to execute the for
loop has been reduced from 7000 nanoseconds to 1006 nanoseconds—an
improvement of almost a factor of seven.
Multiple Issue = instead of just one assembly line, you have two (or more) parallel
lines.
o Example: if you have two floating-point adders, while one adds z[0] =
x[0] + y[0], the other can add z[1] = x[1] + y[1].
o That way, the loop finishes in about half the time.
So:
 Pipeline ⇒ keeps the workers busy.
 Multiple issue ⇒ hires extra workers to do multiple tasks at the same time.
Static vs. Dynamic Multiple Issue
 Static = compiler decides in advance which instructions can run together.
(Think: pre-planned schedule.)
 Dynamic = processor decides at run-time based on what’s ready. This is
superscalar.
Speculation
Here’s the clever trick: processors don’t just wait for conditions to resolve — they
guess what will happen and keep going.
1. Branch speculation ( first example):
z = x + y;
if (z > 0) w = x;
else w = y;
CPU guesses z > 0. It executes w = x before knowing for sure.
If the guess was right → great, no time lost.
If wrong → undo that step (rollback) and do w = y instead.
2. Memory speculation (second example):
z = x + y;
w = *ap; // ap is a pointer
CPU guesses *ap is not pointing to z.
Executes both z = x + y; and w = *ap; in parallel.
If guess was right → saved time.
If wrong → discard speculative result and re-execute.
How CPUs Handle Wrong Guesses
 If compiler speculates → it inserts checks and corrective code.
 If hardware speculates → results are kept in a buffer until the CPU knows the
guess was correct.
o Correct guess ⇒ buffer committed to registers/memory.
o Wrong guess ⇒ buffer discarded and instructions re-run.
In-Order vs. Out-of-Order
 Even superscalar CPUs fetch in program order and commit results in order (to
avoid chaos in memory/registers).
 But they can execute instructions out of order internally — whichever ones are
ready.
 Optimizing compilers, on the other hand, can actually rearrange instruction
order ahead of time.
Thread-Level Parallelism (TLP) and hardware multithreading:
ILP = Instruction-Level Parallelism: The CPU tries to overlap or parallelize instructions
within one program/thread. But ILP is limited when instructions are dependent on
each other.
 Example: Fibonacci
f[0] = f[1] = 1;
for (i = 2; i <= n; i++)
f[i] = f[i-1] + f[i-2];
Each f[i] depends on f[i-1] and f[i-2]. No chance to execute instructions in parallel
inside one thread. So, ILP sometimes runs out of parallelism.
Thread-Level Parallelism (TLP)
Instead of squeezing parallelism inside one thread, we run multiple threads in
parallel.
 Each thread is a bigger unit of work than an instruction.
 If one thread gets stuck (e.g., waiting for memory), the CPU can execute
another thread.
 This is coarser-grained parallelism than ILP.
Hardware Multithreading
This is the hardware trick to keep the CPU busy by switching between threads.
1. Fine-Grained Multithreading
 CPU switches threads after every instruction.
 Advantage: no idle time if one thread stalls (say, waiting for memory).
 Disadvantage: even if one thread has plenty of ready instructions, it has to share cycles with the others, which slows that thread down.
2. Coarse-Grained Multithreading
 CPU switches only when a thread stalls on a long operation (like memory
access).
 Advantage: no unnecessary switching; a thread runs at full speed until it stalls.
 Disadvantage: small/short stalls still waste CPU time.

3. Simultaneous Multithreading (SMT)


 This is the modern solution (used in Intel’s Hyper-Threading, AMD’s SMT).
 Runs multiple threads at the same time on a superscalar CPU.
 Each cycle, different threads can issue instructions to different functional units.
o Example: Thread A issues an integer instruction while Thread B issues a
floating-point instruction in the same cycle.
This combines ILP + TLP:
 Exploits instruction-level parallelism inside each thread.
 Exploits thread-level parallelism across threads.
If we also prioritize “preferred threads” (threads that have many instructions ready),
SMT can reduce slowdown.

Distributed-memory interconnects
Distributed-memory interconnects are often divided into two groups: direct interconnects
and indirect interconnects.

1. Direct interconnect
In a direct interconnect, each switch is directly connected to a processor-memory pair, and the
switches are connected to each other. Fig. 2.8 shows a ring and a two-dimensional toroidal
mesh. The circles are switches, the squares are processors, and the lines are bidirectional
links.

One of the simplest measures of the power of a direct interconnect is the number of links.
When counting links in a direct interconnect, it’s customary to count only switch-to-switch
links. This is because the speed of the processor-to-switch links may be very different from the
speed of the switch-to-switch links.

a. Ring
A ring is superior to a simple bus, since it allows multiple simultaneous communications.
However, it’s easy to devise communication schemes, in which some of the processors must
wait for other processors to complete their communications.

To get the total number of links, just add the number of processors to the number of switch-to-switch links. So, in the diagram for a ring (Fig. 2.8a), we would ordinarily count 3 links instead of 6.

b. Toroidal mesh: The toroidal mesh will be more expensive than the ring, because the
switches are more complex—they must support five links instead of three—and if there
are p processors, the number of links is 2p in a toroidal mesh, while it’s only p in a ring.
However, the number of possible simultaneous communication patterns is greater with a mesh than with a ring. For the toroidal mesh (Fig. 2.8b), we would count 18 links instead of 27.

Bisection width: To understand this measure, imagine that the parallel system is divided into
two halves, and each half contains half of the processors or nodes. In Fig. 2.9(a) we’ve divided a
ring with eight nodes into two groups of four nodes, and we can see that only two
communications can take place between the halves.
To find the bisection width, remove the minimum number of links needed to split the set of nodes into two equal halves. The number of links removed is the bisection width.

If we have a square two-dimensional toroidal mesh with p = q² nodes (where q is even), then we can split the nodes into two halves by removing the “middle” horizontal links and the “wraparound” horizontal links. (See Fig. 2.10.) This suggests that the bisection width is at most 2q = 2√p.

The bandwidth of a link is the rate at which it can transmit data. It’s usually given in megabits
or megabytes per second. Bisection bandwidth is often used as a measure of network quality.

It’s similar to bisection width. However, instead of counting the number of links joining the
halves, it sums the bandwidth of the links. For example, if the links in a ring have a bandwidth
of one billion bits per second, then the bisection bandwidth of the ring will be two billion bits
per second or 2000 megabits per second.

c. fully connected network:


The ideal direct interconnect is a fully connected network, in which each switch is directly
connected to every other switch. See Fig. 2.11. Its bisection width is p²/4.

However, it’s impractical to construct such an interconnect for systems with more than a few
nodes, since it requires a total of p²/2 − p/2 links, and each switch must be capable of
connecting to p links. It is therefore more a “theoretical best possible” interconnect than a
practical one, and it is used as a basis for evaluating other interconnects.

d. hypercube:
The hypercube is a highly connected direct interconnect that has been used in actual
systems. Hypercubes are built inductively:
A one-dimensional hypercube is a fully connected system with two processors.
A two-dimensional hypercube is built from two one-dimensional hypercubes by joining “corresponding” switches.
A three-dimensional hypercube is built from two two-dimensional hypercubes. (See Fig. 2.12.)
Thus a hypercube of dimension d has p = 2ᵈ nodes, and a switch in a d-dimensional hypercube is directly connected to a processor and d switches.

The bisection width of a hypercube is p/2, so it has more connectivity than a ring or toroidal mesh, but the switches must be more powerful, since they must support 1 + d = 1 + log₂(p) wires, while the mesh switches only require five wires. So, a hypercube with p nodes is more expensive to construct than a toroidal mesh.
Indirect interconnects: Indirect interconnects provide an alternative to direct
interconnects. In an indirect interconnect, the switches may not be directly connected to a
processor. They’re often shown with unidirectional links and a collection of processors,
each of which
has an outgoing and an incoming link, and a switching network. (See Fig. 2.13.)

The crossbar and the omega network are relatively simple examples of indirect networks.

The diagram of a distributed-memory crossbar in Fig. 2.14 has unidirectional links. Notice
that as long as two processors don’t attempt to communicate with the same processor, all
the processors can simultaneously communicate with another processor.
An omega network is shown in Fig. 2.15. The switches are two-by-two crossbars (see Fig.
2.16). Observe that unlike the crossbar, there are communications that cannot occur
simultaneously.

For example, in Fig. 2.15, if processor 0 sends a message to processor 6, then processor 1
cannot simultaneously send a message to processor 7. On the other hand, the omega
network is less expensive than the crossbar. The omega network uses (1/2) p log₂(p) of the 2 × 2 crossbar switches, so it uses a total of 2p log₂(p) switches, while the crossbar uses p².
It’s a little bit more complicated to define bisection width for indirect networks. However,
the principle is the same: we want to divide the nodes into two groups of equal size and
determine how much communication can take place between the two halves.

Alternatively, the minimum number of links that need to be removed so that the two
groups can’t communicate. The bisection width of a p×p crossbar is p, and the bisection
width of an omega network is p/2.

Latency and bandwidth:


Any time data is transmitted, we care how long it will take for the data to reach its destination, whether we are transmitting data between main memory and cache, cache and register, hard disk and memory, or between two nodes in a distributed-memory or hybrid system.

The latency is the time that elapses between the source’s beginning to transmit the data
and the destination’s starting to receive the first byte.

The bandwidth is the rate at which the destination receives data after it has started to
receive the first byte. So if the latency of an interconnect is l seconds and the bandwidth is
b bytes per second, then the time it takes to transmit a message of n bytes is message
transmission time = l + n/b.
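To see the formula in action, here is a tiny sketch (the latency, bandwidth, and message size are made-up numbers):

#include <stdio.h>

/* message transmission time = latency + message size / bandwidth */
double transmission_time(double latency_s, double bandwidth_Bps, double n_bytes) {
    return latency_s + n_bytes / bandwidth_Bps;
}

int main(void) {
    /* e.g., l = 2 microseconds, b = 1 GB/s, n = 1 MB */
    double t = transmission_time(2e-6, 1e9, 1e6);
    printf("message time = %g s\n", t);   /* 2e-6 + 1e-3 = 0.001002 s */
    return 0;
}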
Cache coherence: CPU caches are managed by system hardware: programmers don’t have
direct control over them. This has several important consequences for shared-memory
systems.
To understand these issues, suppose we have a shared-memory system with two cores,
each of which has its own private data cache. (See Fig. 2.17.)

As long as the two cores only read shared data, there is no problem. For example, suppose
that x is a shared variable that has been initialized to 2, y0 is private and owned by core 0,
and y1 and z1 are private and owned by core 1.

Now suppose the following statements are executed at the indicated times:

Time 0:  Core 0: y0 = x;                       Core 1: y1 = 3*x;
Time 1:  Core 0: x = 7;                        Core 1: (statement not involving x)
Time 2:  Core 0: (statement not involving x)   Core 1: z1 = 4*x;
Then the memory location for y0 will eventually get the value 2, and the memory location
for y1 will eventually get the value 6. However, it’s not so clear what value z1 will get. It
might at first appear that since core 0 updates x to 7 before the assignment to z1, z1 will
get the value 4×7 = 28.

However, at time 0, x is in the cache of core 1. So, unless for some reason x is evicted from core 1’s cache and then reloaded into core 1’s cache, it actually appears that the original value x = 2 may be used, and z1 will get the value 4×2 = 8.
Note that this unpredictable behaviour will occur regardless of whether the system is using
a write-through or a write-back policy. If it’s using a write-through policy, the main
memory will be updated by the assignment x = 7.
However, this will have no effect on the value in the cache of core 1. If the system is using a
write-back policy, the new value of x in the cache of core 0 probably won’t even be
available to core 1 when it updates z1.

The caches designed for single-processor systems provide no mechanism for ensuring that when the
caches of multiple processors store the same variable, an update by one processor to the
cached variable is “seen” by the other processors. That is, that the cached value stored by
the other processors is also updated. This is called the cache coherence problem.

There are two main approaches to ensuring cache coherence: snooping cache coherence
and directory-based cache coherence.

1. Snooping cache coherence: The idea behind snooping comes from bus-based systems:
When the cores share a bus, any signal transmitted on the bus can be “seen” by all the
cores connected to the bus.

Thus, when core 0 updates the copy of x stored in its cache, if it also broadcasts this
information across the bus, and if core 1 is “snooping” the bus, it will see that x has been
updated, and it can mark its copy of x as invalid.

This is more or less how snooping cache coherence works. The principal difference
between our description and the actual snooping protocol is that the broadcast only
informs the other cores that the cache line containing x has been updated, not that x has
been updated.

First, it’s not essential that the interconnect be a bus, only that it support broadcasts from
each processor to all the other processors. Second, snooping works with both write-
through and write-back caches.

In principle, if the interconnect is shared—as with a bus—then with write-through caches there’s no need for additional traffic on the interconnect, since each core can simply “watch” for writes.

With write-back caches, on the other hand, an extra communication is necessary, since
updates to the cache don’t get immediately sent to memory.
2. Directory-based cache coherence:
In snooping cache coherence, whenever a variable changes in any core’s cache, the
update must be broadcast to all other cores.
 Example: Core 0 changes x = 5 → This info must be sent to every other core so their
caches are updated or invalidated.
In a small system (like 4 cores), broadcasting is fast enough: all cores hear about changes almost instantly. But in a large system (hundreds or thousands of cores), broadcasting is slow: there are too many “listeners,” update messages take longer to send, and the network connecting the cores gets crowded. Now consider a distributed-memory system with a single address space, in which each core has its own private memory.
But here, they set it up so any core can directly read/write to any variable in any other core’s
memory.
 Example: Core 0 can just do y = x even if x is in Core 1’s memory.
The system could grow to thousands of cores, but snooping makes it unscalable: each write broadcasts an update, the network gets overloaded, and performance crashes.

Directory-based cache coherence protocols attempt to solve this problem through the use
of a data structure called a directory.
The directory is not stored in one place; it’s distributed. Each core/memory pair keeps the directory info for the memory it owns. The directory records which cores have a copy of a given cache line and whether that line is valid, invalid, or shared.
When a core reads data (e.g., Core 0 reads variable x, which lives in Core 2’s memory), Core 2:
1. Sends the data to Core 0.
2. Updates its directory entry → “Core 0 has a copy of x now.”
When a core updates data (e.g., Core 0 changes x = 10), Core 0’s cache controller:
1. Checks the directory for x.
2. Finds only the cores that have a copy (say Core 3 and Core 5).
3. Sends invalidate messages only to those cores — not to all 1000 cores.
False sharing:
It’s important to remember that CPU caches are implemented in hardware, so they operate
on cache lines, not individual variables. This can have disastrous consequences for
performance.

You have:
 Two cores (Core 0 and Core 1)
 An array y[8] of doubles (each double = 8 bytes)
 Cache line size = 64 bytes → can hold 8 doubles
 So all of y[0] to y[7] fit in one single cache line.
Ex:
    for (i = 0; i < m; i++)
        for (j = 0; j < n; j++)
            y[i] += f(i, j);
We parallelize it:
 Core 0 → works on y[0] to y[3]
 Core 1 → works on y[4] to y[7]

/* Private variables */
int i, j, iter_count;

/* Shared variables initialized by one core */
int m, n, core_count;
double y[m];

iter_count = m / core_count;

/* Core 0 does this */
for (i = 0; i < iter_count; i++)
    for (j = 0; j < n; j++)
        y[i] += f(i, j);

/* Core 1 does this */
for (i = iter_count; i < 2*iter_count; i++)
    for (j = 0; j < n; j++)
        y[i] += f(i, j);
Logically they are working on different elements, so there’s no data dependency. But caches don’t work on variables: they store cache lines, not individual variables.
If y[0] to y[7] are in one cache line, then any update to any element in that line means:
1. The entire cache line gets marked as invalid in the other core’s cache.
2. That core must fetch the updated line again from memory (or other core’s cache)
before continuing.
In practice: Core 0 updates y[0], so the cache line is marked modified in Core 0’s cache and invalid in Core 1’s cache. Core 1 then tries to update y[4]; its cache controller sees that the line is invalid and fetches the whole line from Core 0 (or main memory). Now Core 1 owns the line and Core 0’s copy is invalidated. Core 0 next tries to update y[1] but must fetch the whole line back again.
This keeps happening for every single update, even though they’re working on different y[i]
values.
Why this is called False Sharing
 True sharing: both cores actually read/write the same variable.
 False sharing: they read/write different variables, but those variables happen to be on
the same cache line, so the hardware treats it like shared data.
Instead of using the fast L1 cache, most updates end up going through slow main memory or
cross-core cache transfers.
L1 cache access: ~1–4 CPU cycles.
Main memory access: ~100–300 cycles, a huge slowdown.
To reduce false sharing:
1. Pad the array so each core works on elements in a different cache line (a sketch follows this list).
2. Use local temporary storage inside each thread and merge results at the end.
3. Arrange data structures so different threads don’t share cache lines.
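The first fix (padding) could look like the minimal sketch below, assuming 64-byte cache lines; the struct name and slot count are illustrative:

#include <stdio.h>

#define CACHE_LINE 64   /* assumed cache-line size in bytes */

/* Each core gets its own cache-line-sized slot, so updates by one core
 * never invalidate the line holding the other core's accumulator. */
struct padded_double {
    double value;
    char   pad[CACHE_LINE - sizeof(double)];
};

int main(void) {
    struct padded_double partial[2] = {{0.0}, {0.0}};   /* one slot per core */

    /* In the real program each core would update only its own slot. */
    partial[0].value += 1.0;
    partial[1].value += 2.0;

    printf("each slot occupies %zu bytes\n", sizeof(struct padded_double));
    return 0;
}

The second fix (a private temporary per thread, merged at the end) is often even simpler and wastes no memory on padding.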
Shared-memory vs. distributed-memory
1. Shared-memory is easy for programmers
 In shared memory, all processors can just access the same variables without worrying
about explicit message passing.
 Example: Two threads can just read/write to the same array without sending
messages between them.
Shared-memory needs a high-speed interconnect (bus or crossbar) to connect all processors
to the same memory.
Bus problem:
 When few processors are connected, it works fine.
 But as the number of processors increases, they compete for the same bus.
 More processors lead to more conflicts, so performance drops fast.
 So, buses are good only for small systems (like your laptop with 4–16 cores).

Crossbar problem:
 Allows more simultaneous connections than a bus.
 But cost rises steeply as you add processors.
 Large crossbars are very expensive, so rare in big systems.
Distributed-memory is cheaper to scale
 Uses direct interconnects like:
o Hypercube
o Toroidal mesh
 Each processor has its own local memory, so there is no single “traffic jam” point.
 Can scale to thousands of processors at much lower cost.
 Great for huge problems needing massive computation or huge datasets.

Parallel software
Multicore processors are in:
 Desktops
 Servers
 Mobile phones and tablets
This means your device can do more than one thing at the same time in hardware. But
software hasn’t fully caught up. Back in 2011, the book said: “We have parallel hardware, but
not much parallel software.”
In 2021, the situation is better but still incomplete:
o System software (like your OS) now uses multiple cores.
o Many popular apps (Excel, Photoshop, Chrome) can use more than one core.
o But many programs still use only a single core.
o Many programmers have never written parallel code
In the past, we could just rely on:
 Faster CPUs
 Smarter compilers
This gave automatic speed boosts without changing the software.
Now that CPU clock speeds aren’t rising much, performance growth comes mainly from adding more cores. If a program only uses one core, it won’t get faster on newer hardware.
The solution is to learn parallel programming: developers must learn to write programs that can use:
o Shared-memory architectures (threads, common memory)
o Distributed-memory architectures (message passing)
o Both MIMD and SIMD execution models.
Caveats
This section gives two warnings (caveats) before diving into parallel programming details and also clarifies SPMD. We mainly focus on what are often called single program, multiple data (SPMD) programs.
In SPMD there is one program file, and multiple cores run it at the same time. Each core runs the same code, but can do different things depending on its ID.
Ex: if (I'm thread/process 0)
do this;
else
do that;
Here, all cores are running the same executable, but thread 0 does one thing and thread 1 does another. When the same operation is done on different chunks of data, this is called data parallelism.
if (I'm thread 0)
process first half of the array;
else // I'm thread 1
process second half of the array;

Task parallelism: Different threads do different tasks.


Example:
Thread 0: read data from disk
Thread 1: process the data
Thread 2: write results to a file
SPMD is flexible and it can implement both data-parallel and task-parallel designs using the
same "single program" model, just by branching logic based on the thread or process ID.
Coordinating the processes/threads
For example,
suppose we have two arrays and we want to add them:
double x[n], y[n];
...
for (int i = 0; i < n; i++)
    x[i] += y[i];
Serial version: One loop runs start to finish.
Parallel version: If you have p threads:

Thread 0 → handles x[0] to x[n/p - 1]

Thread 1 → handles x[n/p] to x[2n/p - 1]

and so on.
This is easy because each thread can work independently (no need to share intermediate results). No communication between threads is needed once the work is split; a small pthreads sketch of this split follows.
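A minimal shared-memory version of this block split using POSIX threads (the array size, thread count, and values are illustrative):

#include <stdio.h>
#include <pthread.h>

#define N 8
#define P 2                                    /* number of threads */

static double x[N] = {1, 2, 3, 4, 5, 6, 7, 8};
static double y[N] = {1, 1, 1, 1, 1, 1, 1, 1};

void *add_chunk(void *arg) {
    long rank  = (long) arg;
    int  first = rank * (N / P);               /* this thread's block of x */
    int  last  = first + (N / P);
    for (int i = first; i < last; i++)
        x[i] += y[i];
    return NULL;
}

int main(void) {
    pthread_t threads[P];
    for (long t = 0; t < P; t++)
        pthread_create(&threads[t], NULL, add_chunk, (void *) t);
    for (long t = 0; t < P; t++)
        pthread_join(threads[t], NULL);

    for (int i = 0; i < N; i++) printf("%g ", x[i]);
    printf("\n");
    return 0;
}

Because the blocks are disjoint, no locking is needed; this is exactly the "embarrassingly parallel" case described below.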

When splitting work:


1. Load balancing — Each process/thread should get roughly the same amount of work
(avoid one thread sitting idle while others are busy).
2. Minimize communication — Less data transfer between threads means faster
performance.
If both are satisfied, you get high efficiency.
Load balancing → Ensuring even distribution of work.
Parallelization → Turning a serial program into a parallel one.
Embarrassingly parallel → Problems so easy to parallelize that no coordination is needed
(e.g., image processing pixel-by-pixel, array addition).
Despite the name, it’s a good thing, not something to be ashamed of.
For complex problems: You need synchronization, making sure threads reach certain points
together (e.g., waiting for all to finish before moving on).
You need communication for Sharing results, sending data to other threads.
Examples:
 Shared-memory systems: Communicate by directly reading/writing shared variables,
but require synchronization to avoid conflicts.
 Distributed-memory systems: Communicate by sending messages between processes
(MPI), which can also serve as synchronization points.

Shared-memory:
Any thread can read/write shared variables, whereas private variables belong to one thread only. Communication is done via shared variables (implicit — no need to send explicit messages as in distributed memory).
Dynamic Threads Paradigm
 How it works:
1. Master thread waits for a new task (e.g., request from network).
2. When work arrives → master forks a new worker thread.
3. Worker thread does its job → joins back to master → ends.
4. Repeat for each task.
 Advantages:
o Efficient resource usage → Threads only exist while they are needed.
o Ideal for systems where workload changes a lot over time.
 Disadvantages:
o Overhead of forking/joining each time a task arrives.
o Not the fastest for continuous or heavy workloads.
Static Threads Paradigm
How it works:
1. Master thread creates all worker threads at the start.
2. All threads stay alive until all work is done.
3. At the end, all workers join the master, and then the program cleans up.
Advantages:
 Lower overhead → No repeated thread creation/destruction.
 Better performance for workloads that are steady or predictable.
Disadvantages:
 Less efficient resource usage since Idle threads still consume memory & CPU stack
space.
 Wastes resources if there’s not always enough work to keep all threads busy.

Nondeterminism?
 In MIMD systems (Multiple Instruction, Multiple Data) where multiple processors (or
threads) run at the same time, they usually don’t stay perfectly in sync.
 This means the same input might produce different outputs depending on how the
processors finish their tasks. This unpredictability is called nondeterminism.
Example 1: Printing with Two Threads
 Imagine we have two threads:
o Thread 0 → has a private variable my_x = 7
o Thread 1 → has a private variable my_x = 19
Both threads run this code:
printf("Thread %d > my_x = %d\n", my_rank, my_x);
Possible outputs: Case 1:
Thread 0 > my_x = 7
Thread 1 > my_x = 19

Case 2:
Thread 1 > my_x = 19
Thread 0 > my_x = 7
Their outputs might get mixed together (e.g., half of one line, half of another), because both
are writing to the screen at the same time. The point is that since the threads are independent, we cannot predict the order.
Is nondeterminism always bad?
 Sometimes it’s okay: in the print example, since each output is labelled with the thread ID, the order doesn’t matter.
 Sometimes it’s dangerous, especially in shared-memory programs, where threads
work on the same data.

Example 2: Race Condition with Shared Variable


Suppose:
 Each thread calculates some value and stores it in its own variable my_val.
 Then, both threads try to add their values into a shared variable x, which starts at 0.
The code looks like this:
my_val = Compute_val(my_rank);
x += my_val;
How computers actually do x += my_val
To simplify, assume:
1. Load x from memory into a register.
2. Load my_val into another register.
3. Add them.
4. Store result back into memory.
Time | Core 0 (Thread 0)           | Core 1 (Thread 1)
-----|-----------------------------|----------------------------
 0   | Done computing my_val = 7   | Still computing my_val
 1   | Loads x = 0                 | Finished computing my_val
 2   | Loads my_val = 7            | Loads x = 0
 3   | Adds → result = 7           | Loads my_val = 19
 4   | Stores x = 7                | Adds → 0 + 19 = 19
 5   | (Leaves)                    | Stores x = 19

Final result of x = 19 (Thread 0’s contribution is lost!)
This is called a race condition where both threads are “racing” to update x, and the outcome
depends on who finishes first.
Fixing Race Conditions using Critical Sections
 The update x += my_val must be done atomically (all steps happen as one unit).
 To guarantee that, we put it inside a critical section → a block of code that only one
thread can run at a time.
Using Mutex (Lock)
 A mutex (mutual exclusion lock) is a special object provided by the system/hardware.
 Process:
1. A thread must lock the mutex before entering the critical section.
2. Do the critical work (e.g., x += my_val).
3. Unlock the mutex when done.
Example:
my_val = Compute_val(my_rank);
Lock(&add_my_val_lock);
x += my_val;
Unlock(&add_my_val_lock);
This ensures only one thread updates x at a time, but it doesn’t force any specific order (thread 0 or 1 can go first). The downside is that it makes that part of the program serial (not parallel), so we want critical sections to be as short as possible. A pthreads version of this pattern is sketched below.
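The Lock/Unlock pseudocode above corresponds closely to a POSIX mutex. A hedged sketch, in which Compute_val is a stand-in for whatever each thread computes:

#include <stdio.h>
#include <pthread.h>

static double x = 0.0;                                    /* shared */
static pthread_mutex_t add_my_val_lock = PTHREAD_MUTEX_INITIALIZER;

static double Compute_val(long rank) {                    /* stand-in computation */
    return rank == 0 ? 7.0 : 19.0;
}

void *add_val(void *arg) {
    long   my_rank = (long) arg;
    double my_val  = Compute_val(my_rank);

    pthread_mutex_lock(&add_my_val_lock);                 /* enter critical section */
    x += my_val;
    pthread_mutex_unlock(&add_my_val_lock);               /* leave critical section */
    return NULL;
}

int main(void) {
    pthread_t t[2];
    for (long r = 0; r < 2; r++) pthread_create(&t[r], NULL, add_val, (void *) r);
    for (long r = 0; r < 2; r++) pthread_join(t[r], NULL);
    printf("x = %g\n", x);                                /* always 26, never 7 or 19 */
    return 0;
}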
Alternative 1: Busy-Waiting
 Instead of a mutex, a thread can just keep checking a condition until it’s allowed to
proceed.
 Example: Thread 1 must wait for Thread 0.
my_val = Compute_val(my_rank);
if (my_rank == 1)
    while (!ok_for_1);   // busy-wait loop

x += my_val;             // critical section

if (my_rank == 0)
    ok_for_1 = true;     // allow thread 1 to continue
Here, Thread 1 will loop endlessly until ok_for_1 is set to true by Thread 0. Simple but
wasteful → the CPU is busy “waiting” instead of doing real work.

Alternative 2: Semaphores
 Similar to mutexes but slightly more flexible.
 A semaphore can allow multiple threads to enter at once (depending on its count).
 Some synchronization problems are easier to solve with semaphores than mutexes.

Alternative 3: Monitors
 A monitor is a higher-level concept.
 Think of it like a special object where only one thread can call its methods at a time.
 Provides mutual exclusion automatically without explicit lock/unlock calls.

Nondeterminism → Output is unpredictable when threads run independently.


Race Condition → Happens when threads update shared data at the same time.
Critical Section → Code that must only be executed by one thread at a time.
Mutex (Lock/Unlock) → Most common way to protect critical sections.
Busy-Waiting → Thread keeps looping until allowed (simple but wasteful).
Semaphores → Like advanced locks, more flexible.
Monitors → Higher-level locks built into objects.

Thread Safety

What is Thread Safety?


 In many cases, functions written for serial (single-threaded) programs can also be used safely in parallel (multithreaded) programs.
 But there are exceptions: some functions don’t behave correctly if multiple threads call them at the same time. When that happens, the function is said to be not thread safe.

Why does this happen?


It usually happens because of static local variables in C.
 Normal local variables:
o Declared inside a function.
o Stored on the stack.
o Each thread has its own stack, so each thread gets its own copy of the variable.
Safe in multithreading.
 Static local variables:
o Declared inside a function with the static keyword.
o They keep their value between function calls (they are not destroyed when the function returns).
o All threads share the same single variable, which can cause problems when multiple threads call the function, as sketched below.
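A tiny sketch (ours, with a hypothetical function Next_id) of why a static local is risky:

int Next_id(void) {
    static int id = 0;   /* one copy, shared by every thread that calls Next_id */
    id = id + 1;         /* two threads can interleave here ...                 */
    return id;           /* ... and both return the same "unique" id            */
}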
Example: strtok in C
 The function strtok splits a string into smaller parts (substrings).
 It works like this:
o On the first call, you pass it a string, and it remembers that string using a static
pointer.
o On the next calls, it keeps giving you the next part of the string, using that
stored static variable.
The problem: what if two threads use strtok at the same time? Suppose Thread 0 calls strtok first with "apple orange". Before Thread 0 finishes splitting its string, Thread 1 calls strtok with "cat dog". Now the static variable inside strtok is overwritten.
 Thread 0’s string ("apple orange") is lost.
 On its next call, Thread 0 may get pieces of Thread 1’s string ("cat dog") instead, as the sketch below shows.
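Even a single thread can expose the hidden state; this sketch (ours) interleaves two tokenizations, which is exactly what two threads would do to each other. (POSIX offers strtok_r, which takes an explicit state pointer and so avoids the shared state.)

#include <stdio.h>
#include <string.h>

int main(void) {
    char a[] = "apple orange";
    char b[] = "cat dog";

    printf("%s\n", strtok(a, " "));    /* "apple"; strtok now remembers a */
    printf("%s\n", strtok(b, " "));    /* "cat"; strtok's state now points into b */
    printf("%s\n", strtok(NULL, " ")); /* "dog" -- continues b, NOT "orange" from a */
    return 0;
}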

What does this mean?


 strtok is not thread safe.
 If used in a multithreaded program it may produce wrong results or random errors.

General Rule:
 A function is not thread safe if multiple threads can access or modify the same
shared data inside it.
 Many functions from single-threaded libraries are thread safe, but you must be
careful with the ones that rely on shared state (like strtok).

Distributed-Memory

 In shared-memory systems, all processors (cores/threads) can access the same memory.
 But in distributed-memory systems, each processor has its own private memory. A processor can directly use only its own memory. So, if one processor wants data from another processor’s memory, it cannot read it directly; instead, the two processors must communicate. The most widely used way is message passing. Other APIs exist, but message passing is the standard.

 You can even use message passing on a shared-memory system, by pretending that memory is private to each thread and exchanging data through messages.

How are distributed-memory programs run?


 They are usually started as multiple processes, not threads, because the processors in a distributed-memory system may be:
o Independent CPUs
o Running their own operating systems
o Without shared infrastructure for creating threads across nodes
So the program runs as processes (not threads), each with its own private memory.

Message Passing
A message-passing API provides at least two functions:
1. Send → to send data to another process
2. Receive → to get data from another process
Each process is identified by a rank (like an ID number).
 If there are p processes, the ranks are 0, 1, …, p–1.

Example: Process 1 sends to Process 0


Pseudocode:
char message[100];
...
my_rank = Get_rank();

if (my_rank == 1) {
    sprintf(message, "Greetings from process 1");
    Send(message, MSG_CHAR, 100, 0);
} else if (my_rank == 0) {
    Receive(message, MSG_CHAR, 100, 1);
    printf("Process 0 > Received: %s\n", message);
}
What happens here?
 Get_rank() → tells the process its ID.
 If rank = 1 → it creates a message ("Greetings from process 1") and sends it to
process 0.
 If rank = 0 → it waits to receive the message and then prints it.
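For comparison, here is a sketch (ours) of the same exchange written with MPI, the most widely used message-passing API; run it with at least two processes (e.g., mpiexec -n 2).

#include <stdio.h>
#include <mpi.h>

int main(int argc, char* argv[]) {
    char message[100];
    int my_rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);   /* MPI's version of Get_rank() */

    if (my_rank == 1) {
        sprintf(message, "Greetings from process 1");
        MPI_Send(message, 100, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    } else if (my_rank == 0) {
        MPI_Recv(message, 100, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process 0 > Received: %s\n", message);
    }

    MPI_Finalize();
    return 0;
}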
SPMD Program (Single Program, Multiple Data)
 Both processes run the same code (same executable).
 But their behavior differs based on rank.
Local Variables → Each process’s message variable refers to a different (private) block of memory. Some programmers emphasize this by naming such variables my_message or local_message.
Output (stdout/stderr) → Usually, all processes can print to the screen.
 Even though message-passing APIs don’t guarantee it, most implementations
allow it.

Behaviour of Send and Receive


Different implementations may behave differently:
 Blocking Send → The Send call waits until the matching Receive starts.
 Buffered Send → The Send call copies data into its own buffer, so the sender
can continue immediately.
 Blocking Receive → The Receive call usually waits until the data actually arrives.
There are other variations too (non-blocking send/receive, synchronous send, etc.).
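As a sketch (ours) of how two of these variations look in MPI: MPI_Ssend is a blocking, synchronous send, while MPI_Isend is a non-blocking send that is completed later with MPI_Wait (a buffered MPI_Bsend also exists, but it requires attaching a buffer first). Run with at least two processes.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char* argv[]) {
    int my_rank, data = 42, recv_buf;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    if (my_rank == 1) {
        MPI_Ssend(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);        /* waits for the matching receive */
        MPI_Isend(&data, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, &req);  /* returns immediately */
        MPI_Wait(&req, MPI_STATUS_IGNORE);                         /* now safe to reuse data */
    } else if (my_rank == 0) {
        MPI_Recv(&recv_buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(&recv_buf, 1, MPI_INT, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process 0 > received %d twice\n", recv_buf);
    }

    MPI_Finalize();
    return 0;
}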

More Functions in Message Passing


Message-passing APIs usually provide extra communication tools:
 Broadcast → One process sends the same data to all processes.
 Reduction → Combine results from all processes into one (e.g., sum, max, min).
 Process Management → Start/stop/manage processes.
 Complex Data Handling → Sending structured data across processes.
The most widely used API is MPI (Message-Passing Interface).
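A sketch (ours) of two of these collectives in MPI: broadcast n from process 0 to every process, then reduce (sum) one value per process back onto process 0.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char* argv[]) {
    int my_rank, n = 0, local_val, global_sum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    if (my_rank == 0) n = 42;                      /* only the root knows n at first */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);  /* now every process has n */

    local_val = my_rank + n;                       /* stand-in for real per-process work */
    MPI_Reduce(&local_val, &global_sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (my_rank == 0) printf("global sum = %d\n", global_sum);

    MPI_Finalize();
    return 0;
}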

Advantages and Disadvantages of Message Passing


Advantages:
 Very powerful and flexible.
 Used in almost all supercomputers (the most powerful computers in the world).

Disadvantages:
 Very low-level (programmer must handle lots of details).
 To parallelize a serial program, you often need to rewrite most of it.
 Data structures must either:
o Be copied for each process, or
o Be split/distributed across processes.
 Incremental rewriting (parallelizing small parts step by step) is usually not
practical.
 Managing data movement between processes is tedious and error-prone.
Because of this, message passing is often called “the assembly language of
parallel programming.” It’s powerful but very detailed and hard to work with.
Key Points

 In distributed-memory systems, each process has its own private memory.
 Processes communicate by sending and receiving messages. Example: Process 1 sends "Hello" to Process 0, and Process 0 prints it.
 Message-passing programs are usually SPMD (same code; behaviour depends on rank).
 Send/Receive can block or not, depending on the API.
 APIs also support collective operations like broadcast and reduction.
 MPI is the most common message-passing API.
 Message passing is powerful but hard to use (like assembly language).

One-Sided Communication
Two-Sided (Normal) Message Passing
In regular message passing, communication always needs two processes:
o One process sends the data.
o The other process receives the data.
Both must participate at the same time (a send must be matched by a receive).

In one-sided communication (also called remote memory access, RMA):
o Only one process makes a function call.
o That single call either reads data from another process’s memory into its own memory, or writes data from its own memory into another process’s memory.
o The other process does not explicitly call send/receive at that moment.

Why use One-Sided Communication?


Simpler communication – only one process actively participates.
Faster communication – less overhead (no need for matching send/receive).
Fewer function calls – only one call instead of two.

But there are challenges. Imagine Process 0 writes data into Process 1’s memory. Two problems arise:
1. Safety problem → Process 0 must know that it is safe to overwrite that part of Process 1’s memory.
2. Awareness problem → Process 1 must know when its memory has been updated.

How to Solve These Problems?


 Synchronization before writing, to make sure Process 1 is ready before Process 0 writes.
 Synchronization after writing, to let Process 1 know the data has arrived.

Ways to do this:
 A second synchronization step, in which Process 0 and Process 1 coordinate after the write.
 A flag variable:
o Process 0 sets a special variable (e.g., done = true) after copying.
o Process 1 keeps checking (polling) the flag until it sees the update.
But polling wastes time: Process 1 may repeatedly check the flag without doing real work. And since only one process makes the communication call, the other process doesn’t “see” the transfer happening directly.
This can cause:
o Hard-to-detect bugs
o Overwriting wrong memory
o Processes waiting forever if synchronization isn’t handled properly
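MPI also offers one-sided operations; this sketch (ours) uses fence synchronization, which plays the role of the before/after coordination described above: the first fence says “it is safe to write,” and the second says “the data has arrived.” Run with at least two processes.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char* argv[]) {
    int my_rank, buf = 0, value = 7;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    /* Each process exposes one int (buf) that other processes may access. */
    MPI_Win_create(&buf, sizeof(int), sizeof(int), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);                 /* everyone is ready to be written to */
    if (my_rank == 0)
        MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);  /* write into process 1's buf */
    MPI_Win_fence(0, win);                 /* the write is now complete and visible */

    if (my_rank == 1) printf("Process 1 > buf = %d\n", buf);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}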

Partitioned global address space languages:


 Shared-memory programming is easier for many programmers (all processes just read from and write to the same memory).
 But in distributed-memory systems (such as clusters of computers, or multicore processors with private memory), memory is not actually shared: each core or node has its own local memory.
 If we pretend the entire system has one "giant shared memory," performance can become terrible:
o Accessing local memory is very fast.
o Accessing remote memory (another core’s memory) is very slow (often 100x–1000x slower).
So, if a program frequently accesses remote memory in an uncontrolled way, it will run very inefficiently.

Solution: PGAS Languages


Partitioned Global Address Space (PGAS) languages try to give programmers the
ease of shared memory but with awareness of locality (which data is local vs.
remote).
 Global Address Space: all memory across processes looks like one shared memory (as in shared-memory programming).
 Partitioned: the memory is divided so that:
o some parts of the shared memory are local to each process/core, and
o other parts may be remote (belonging to another process).
This way, a programmer knows which part of an array is in local memory and can design the program to minimize slow remote memory accesses.
shared int n = ...;
shared double x[n], y[n];
private int i, my_first_element, my_last_element;

my_first_element = ...;
my_last_element = ...;

/* Initialize x and y */
...

for (i = my_first_element; i <= my_last_element; i++)
    x[i] += y[i];
Here, arrays x and y are shared across all processes, but each process works only on its own part of them (my_first_element to my_last_element). If the compiler/runtime ensures that each process’s assigned portion of x and y is stored in its local memory, the code is fast. But if x is stored entirely on core 0 and y entirely on core 1, then every access in x[i] += y[i] involves remote memory and is very slow.

Why PGAS Helps


 PGAS languages let the programmer control the distribution of shared data
across processes.
 The programmer knows where each piece of data lives, so they can write
efficient code that avoids unnecessary remote access.
 Private variables are always local and no performance problem.

Examples of PGAS Languages


Some well-known PGAS programming models and languages are:
 UPC (Unified Parallel C)
 Coarray Fortran
 Chapel (from Cray)
 X10 (IBM project)
 Titanium (Java-based PGAS language)
