Module 1
This continues the earlier parallel summation example; it is about reducing bottlenecks in how
the partial results are combined, the first method being centralized collection of the partial
sums by a single core.
1. Concurrent computing
Definition: Multiple tasks appear to be in progress at the same time.
Key idea: It doesn’t require multiple cores/processors. Even a single-core system with a
multitasking OS (which time-slices between tasks) counts as concurrent.
Example:
o Your laptop running a browser, music player, and text editor concurrently.
o Only one core may be executing at an instant, but the OS switches so fast that
tasks appear to run simultaneously.
2. Parallel computing
Definition: Multiple tasks are actually executed at the same time on multiple
cores/processors.
Key idea: Tasks are tightly coupled — they work together on one problem.
Typical setting: Shared-memory systems or a high-speed cluster.
Example:
o Adding 1 billion numbers using 8 CPU cores.
o Each core sums a portion, and then partial sums are combined.
o Requires coordination (synchronization, communication, load balancing).
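A minimal sketch of this idea using POSIX threads (not from the text): each thread sums its own
block of the array and writes a partial sum, which the main thread then combines. The names
(thread_sum, NUM_THREADS, partial) are illustrative, and N is kept much smaller than a billion
so the sketch stays easy to run.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM_THREADS 8
#define N 10000000L          /* the text's example uses 1 billion; smaller here */

static double *a;                      /* shared input array */
static double partial[NUM_THREADS];    /* one partial sum per thread */

static void *thread_sum(void *arg) {
    long rank  = (long) arg;
    long chunk = N / NUM_THREADS;
    long first = rank * chunk;
    long last  = (rank == NUM_THREADS - 1) ? N : first + chunk;
    double sum = 0.0;
    for (long i = first; i < last; i++)
        sum += a[i];
    partial[rank] = sum;               /* each thread writes only its own slot */
    return NULL;
}

int main(void) {
    pthread_t tid[NUM_THREADS];
    a = malloc(N * sizeof *a);
    for (long i = 0; i < N; i++) a[i] = 1.0;
    for (long r = 0; r < NUM_THREADS; r++)
        pthread_create(&tid[r], NULL, thread_sum, (void *) r);
    double total = 0.0;
    for (long r = 0; r < NUM_THREADS; r++) {
        pthread_join(tid[r], NULL);    /* wait for the thread, then combine */
        total += partial[r];
    }
    printf("total = %f\n", total);
    free(a);
    return 0;
}

The join/combine loop is the coordination step mentioned above: the partial sums cannot be added
until every thread has finished its portion.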
3. Distributed computing
Definition: Multiple tasks run on different computers (often geographically separated)
that communicate via a network.
Key idea: Tasks are loosely coupled — not necessarily created together.
Typical setting: Cloud, internet-scale systems.
Example:
o Google’s web search: thousands of servers across the world crawl, index, and
serve results.
o Each server does part of the work, but they don’t share memory; they
exchange messages.
Parallel Hardware and Parallel Software: SOME BACKGROUND
3. Threads
A thread is like a smaller unit of work inside a process.
Processes usually have one “main” thread (the default execution path).
But a process can create multiple threads, and all of them share the same memory and
resources.
Threads are lightweight compared to processes:
They don’t need a separate memory space, so switching between them is faster.
They still need their own program counter and their own call stack (so they can run
independently).
Analogy:
Process = restaurant.
Threads = waiters inside that restaurant.
They share the same kitchen & menu (resources), but each waiter keeps track of their
own table (stack + program counter).
1. Fully Associative
A memory block can go anywhere in the cache.
Super flexible, but needs extra hardware to search everywhere.
Example: Main memory line 0 could go into cache slot 0, 1, 2, or 3.
Think of this like parking in a shopping mall with no assigned spots—you can park
anywhere.
2. Direct Mapped
Each memory block has exactly one place in the cache.
Simple & fast, but may cause many conflicts.
Example: Cache has 4 slots, memory has 16 lines → slot = (line number mod 4).
o Memory line 0, 4, 8, 12 → cache slot 0
o Memory line 1, 5, 9, 13 → cache slot 1, etc.
Like a parking lot where your car has only one assigned spot.
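A tiny illustrative program (not from the text) that just prints the direct-mapped assignment
slot = line mod 4 for 16 memory lines:

#include <stdio.h>

int main(void) {
    for (int line = 0; line < 16; line++)
        printf("memory line %2d -> cache slot %d\n", line, line % 4);
    return 0;
}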
Performance Difference
Loop 1: 4 misses
Loop 2: 16 misses
With larger arrays, the difference becomes much bigger.
That’s why in experiments with MAX=1000, loop 1 was ~3x faster.
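The two loops are not shown here; presumably they are the familiar pair that traverses a matrix
A[MAX][MAX] row by row versus column by column in a matrix-vector update. A sketch under that
assumption (with MAX = 4 and cache lines holding 4 doubles, loop 1 touches A one full line at a
time, giving 4 misses, while loop 2 jumps between lines, giving 16):

#include <stdio.h>
#define MAX 4

int main(void) {
    double A[MAX][MAX] = {{0}}, x[MAX] = {0}, y[MAX] = {0};
    int i, j;

    /* Loop 1: accesses A row by row, matching how C lays it out in memory */
    for (i = 0; i < MAX; i++)
        for (j = 0; j < MAX; j++)
            y[i] += A[i][j] * x[j];

    /* Loop 2: accesses A column by column, jumping across cache lines */
    for (j = 0; j < MAX; j++)
        for (i = 0; i < MAX; i++)
            y[i] += A[i][j] * x[j];

    printf("%f\n", y[0]);   /* keep the compiler from discarding the loops */
    return 0;
}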
Virtual Memory
Why Virtual Memory?
Programs think they each have their own huge memory space (virtual
memory).
In reality, many programs share the same physical RAM.
Virtual memory gives:
1. Illusion of large memory (even bigger than RAM, because disk is used as
backup).
2. Protection (one program cannot overwrite another’s memory).
3. Flexibility (any program can use any free RAM block).
Memory is divided into pages (usually 4 KB–16 KB).
Disk also has swap space divided into same-sized pages.
A program uses virtual addresses → these get mapped to physical addresses in RAM.
Page Table
Each process has a page table: maps virtual page number (VPN) → physical
page number (PPN).
Example:
Virtual page 5 → Physical page 20
Virtual page 6 → Physical page 11
Virtual address is split into:
o Page offset (within page, e.g., last 12 bits for 4 KB page).
o Virtual page number (rest of the bits).
Translation = find VPN in page table, get PPN and combine with offset.
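A toy sketch of the translation step for 4 KB pages (the page_table array and the addresses are
illustrative, standing in for the real per-process page table):

#include <stdint.h>
#include <stdio.h>

#define PAGE_BITS 12                  /* 4 KB page -> 12 offset bits */
#define PAGE_SIZE (1u << PAGE_BITS)

int main(void) {
    uint32_t page_table[8] = {0};
    page_table[5] = 20;               /* virtual page 5 -> physical page 20 */

    uint32_t vaddr  = (5u << PAGE_BITS) | 0x2A4;   /* VPN = 5, offset = 0x2A4 */
    uint32_t vpn    = vaddr >> PAGE_BITS;          /* upper bits */
    uint32_t offset = vaddr & (PAGE_SIZE - 1);     /* lower 12 bits */
    uint32_t paddr  = (page_table[vpn] << PAGE_BITS) | offset;

    printf("virtual 0x%x -> physical 0x%x\n", (unsigned) vaddr, (unsigned) paddr);
    return 0;
}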
Problem with Page Tables
To access memory, we first need to look into the page table.
That’s an extra memory access (slows things down).
If the page table is big, it may not fit in cache → even more slowdowns.
Solution: TLB (Translation Lookaside Buffer)
The TLB is a cache for page table entries.
Stores most recently used virtual and physical mappings.
Typically has 16–512 entries (very small but very fast).
Works like a CPU cache:
o TLB hit → fast translation; no need to check the page table.
o TLB miss → must check the page table in RAM (slower).
Page Faults
If a page is not in RAM at all (only on disk):
o Page fault occurs.
o OS must fetch the page from disk into RAM (super slow — millions of
cycles).
To reduce disk writes:
o Virtual memory uses write-back, not write-through.
o A dirty bit marks pages that were modified, only those are written back.
Who Manages What?
CPU hardware:
o Provides TLB.
o Helps with address translation.
Operating system:
o Manages the page table.
o Handles page faults.
o Decides which pages to evict from RAM.
Instruction-Level Parallelism (ILP):
ILP = executing multiple instructions at the same time inside a CPU to improve
performance.
There are two common approaches:
1. Pipelining → break instruction execution into stages (like an assembly line).
2. Multiple issue → start more than one instruction at the same time (superscalar
processors).
Pipelining
Think of pipelining like a car assembly line:
One worker bolts the engine.
At the same time, another worker attaches the wheels on a different car.
Another installs seats on yet another car.
Each worker specializes in one task → multiple cars are processed in parallel.
With Pipelining
We build 7 hardware units, one for each step.
While step 1 (fetch operands) is working on the 2nd addition,
step 2 (compare exponents) is working on the 1st addition,
and step 3 is idle (until the pipeline fills).
After the first few cycles (pipeline “warm-up”), the CPU produces 1 result every
nanosecond instead of every 7.
So instead of 7000 ns, the whole loop takes only about 1006 ns (almost 7x faster).
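The loop being timed is assumed to be the textbook's floating-point addition example: 1000
additions, each taking 7 ns unpipelined, which is where the 7000 ns figure comes from.

#include <stdio.h>

int main(void) {
    float x[1000], y[1000], z[1000];
    for (int i = 0; i < 1000; i++) { x[i] = i; y[i] = 2.0f * i; }

    for (int i = 0; i < 1000; i++)   /* each addition passes through the 7 stages */
        z[i] = x[i] + y[i];

    printf("z[999] = %f\n", z[999]);
    return 0;
}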
Practical Limits
Pipelining doesn’t always give perfect speedup:
If one stage is slower than the others, the whole pipeline slows to that stage’s
speed.
If data needed by one instruction isn’t ready yet, the pipeline stalls.
Hazards (data hazards, control hazards, structural hazards) also cause delays.
From Table 2.3 we see that after time 5, the pipelined loop produces a result every
nanosecond, instead of every seven nanoseconds, so the total time to execute the for
loop has been reduced from 7000 nanoseconds to 1006 nanoseconds—an
improvement of almost a factor of seven.
Multiple Issue = instead of just one assembly line, you have two (or more) parallel
lines.
o Example: if you have two floating-point adders, while one adds z[0] =
x[0] + y[0], the other can add z[1] = x[1] + y[1].
o That way, the loop finishes in about half the time.
So:
Pipeline ⇒ keeps the workers busy.
Multiple issue ⇒ hires extra workers to do multiple tasks at the same time.
Static vs. Dynamic Multiple Issue
Static = compiler decides in advance which instructions can run together.
(Think: pre-planned schedule.)
Dynamic = processor decides at run-time based on what’s ready. This is
superscalar.
Speculation
Here’s the clever trick: processors don’t just wait for conditions to resolve — they
guess what will happen and keep going.
1. Branch speculation (first example):
z = x + y;
if (z > 0) w = x;
else w = y;
CPU guesses z > 0. It executes w = x before knowing for sure.
If the guess was right → great, no time lost.
If wrong → undo that step (rollback) and do w = y instead.
2. Memory speculation (second example):
z = x + y;
w = *ap; // ap is a pointer
CPU guesses *ap is not pointing to z.
Executes both z = x + y; and w = *ap; in parallel.
If guess was right → saved time.
If wrong → discard speculative result and re-execute.
How CPUs Handle Wrong Guesses
If compiler speculates → it inserts checks and corrective code.
If hardware speculates → results are kept in a buffer until the CPU knows the
guess was correct.
o Correct guess ⇒ buffer committed to registers/memory.
o Wrong guess ⇒ buffer discarded and instructions re-run.
In-Order vs. Out-of-Order
Even superscalar CPUs fetch in program order and commit results in order (to
avoid chaos in memory/registers).
But they can execute instructions out of order internally — whichever ones are
ready.
Optimizing compilers, on the other hand, can actually rearrange instruction
order ahead of time.
Thread-Level Parallelism (TLP) and hardware multithreading:
ILP = Instruction-Level Parallelism: The CPU tries to overlap or parallelize instructions
within one program/thread. But ILP is limited when instructions are dependent on
each other.
Example: Fibonacci
f[0] = f[1] = 1;
for (i = 2; i <= n; i++)
    f[i] = f[i-1] + f[i-2];
Each f[i] depends on f[i-1] and f[i-2]. No chance to execute instructions in parallel
inside one thread. So, ILP sometimes runs out of parallelism.
Thread-Level Parallelism (TLP)
Instead of squeezing parallelism inside one thread, we run multiple threads in
parallel.
Each thread is a bigger unit of work than an instruction.
If one thread gets stuck (e.g., waiting for memory), the CPU can execute
another thread.
This is coarser-grained parallelism than ILP.
Hardware Multithreading
This is the hardware trick to keep the CPU busy by switching between threads.
1. Fine-Grained Multithreading
CPU switches threads after every instruction.
Advantage: no idle time if one thread stalls (say, waiting for memory).
Disadvantage: even if one thread has plenty of ready instructions, it still has to
share cycles with the others, which slows that thread down.
2. Coarse-Grained Multithreading
CPU switches only when a thread stalls on a long operation (like memory
access).
Advantage: no unnecessary switching; a thread runs at full speed until it stalls.
Disadvantage: short stalls still waste CPU time.
Distributed-memory interconnects
Distributed-memory interconnects are often divided into two groups: direct interconnects
and indirect interconnects.
1. Direct interconnect
In a direct interconnect, each switch is directly connected to a processor-memory pair, and the
switches are connected to each other. Fig. 2.8 shows a ring and a two-dimensional toroidal
mesh. The circles are switches, the squares are processors, and the lines are bidirectional
links.
One of the simplest measures of the power of a direct interconnect is the number of links.
When counting links in a direct interconnect, it’s customary to count only switch-to-switch
links. This is because the speed of the processor-to-switch links may be very different from the
speed of the switch-to-switch links.
a. Ring
A ring is superior to a simple bus, since it allows multiple simultaneous communications.
However, it’s easy to devise communication schemes in which some of the processors must
wait for other processors to complete their communications.
To get the total number of links, just add the number of processors to the number of switch-to-
switch links. So, in the ring diagram (Fig. 2.8a), we would ordinarily count 3 links instead of 6.
b. Toroidal mesh: The toroidal mesh will be more expensive than the ring, because the
switches are more complex—they must support five links instead of three—and if there
are p processors, the number of links is 2p in a toroidal mesh, while it’s only p in a ring.
However, the number of possible simultaneous communication patterns is greater with a mesh
than with a ring. For the toroidal mesh (Fig. 2.8b), we would count 18 links instead of 27.
Bisection width: To understand this measure, imagine that the parallel system is divided into
two halves, and each half contains half of the processors or nodes. In Fig. 2.9(a) we’ve divided a
ring with eight nodes into two groups of four nodes, and we can see that only two
communications can take place between the halves.
To find the bisection width, remove the minimum number of links needed to split the set of
nodes into two equal halves; the number of links removed is the bisection width.
If we have a square two-dimensional toroidal mesh with p = q² nodes (where q is even), then
we can split the nodes into two halves by removing the “middle” horizontal links and the
“wraparound” horizontal links. (See Fig. 2.10.) This suggests that the bisection width is at most
2q = 2√p.
The bandwidth of a link is the rate at which it can transmit data. It’s usually given in megabits
or megabytes per second. Bisection bandwidth is often used as a measure of network quality.
It’s similar to bisection width. However, instead of counting the number of links joining the
halves, it sums the bandwidth of the links. For example, if the links in a ring have a bandwidth
of one billion bits per second, then the bisection bandwidth of the ring will be two billion bits
per second or 2000 megabits per second.
c. Fully connected interconnect: In a fully connected interconnect, each switch is directly
connected to every other switch. However, it’s impractical to construct such an interconnect
for systems with more than a few nodes, since it requires a total of p²/2 − p/2 links, and each
switch must be capable of connecting to p links. It is therefore more a “theoretical best
possible” interconnect than a practical one, and it is used as a basis for evaluating other
interconnects.
d. hypercube:
The hypercube is a highly connected direct interconnect that has been used in actual
systems. Hypercubes are built inductively:
A one-dimensional hypercube is a fully connected system with two processors.
The bisection width of a hypercube is p/2, so it has more connectivity than a ring or
toroidal mesh, but the switches must be more powerful, since they must support 1 + d = 1
+ log₂(p) wires, while the mesh switches only require five wires. So, a hypercube with p
nodes is more expensive to construct than a toroidal mesh.
Indirect interconnects: Indirect interconnects provide an alternative to direct
interconnects. In an indirect interconnect, the switches may not be directly connected to a
processor. They’re often shown with unidirectional links and a collection of processors,
each of which
has an outgoing and an incoming link, and a switching network. (See Fig. 2.13.)
The crossbar and the omega network are relatively simple examples of indirect networks.
The diagram of a distributed-memory crossbar in Fig. 2.14 has unidirectional links. Notice
that as long as two processors don’t attempt to communicate with the same processor, all
the processors can simultaneously communicate with another processor.
An omega network is shown in Fig. 2.15. The switches are two-by-two crossbars (see Fig.
2.16). Observe that unlike the crossbar, there are communications that cannot occur
simultaneously.
For example, in Fig. 2.15, if processor 0 sends a message to processor 6, then processor 1
cannot simultaneously send a message to processor 7. On the other hand, the omega
network is less expensive than the crossbar. The omega network uses (1/2) p log₂(p) of the
2×2 crossbar switches, so it uses a total of 2p log₂(p) switches, while the crossbar uses p².
It’s a little bit more complicated to define bisection width for indirect networks. However,
the principle is the same: we want to divide the nodes into two groups of equal size and
determine how much communication can take place between the two halves.
Alternatively, it is the minimum number of links that need to be removed so that the two
groups can’t communicate. The bisection width of a p×p crossbar is p, and the bisection
width of an omega network is p/2.
The latency is the time that elapses between the source’s beginning to transmit the data
and the destination’s starting to receive the first byte.
The bandwidth is the rate at which the destination receives data after it has started to
receive the first byte. So if the latency of an interconnect is l seconds and the bandwidth is
b bytes per second, then the time it takes to transmit a message of n bytes is message
transmission time = l + n/b.
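A small sketch of that formula (the latency, bandwidth, and message size below are illustrative,
not from the text):

#include <stdio.h>

/* message transmission time = latency + message size / bandwidth */
double transmission_time(double latency_s, double bandwidth_Bps, double n_bytes) {
    return latency_s + n_bytes / bandwidth_Bps;
}

int main(void) {
    /* e.g. 5 microseconds latency, 1 GB/s bandwidth, 1 MB message */
    double t = transmission_time(5e-6, 1e9, 1e6);
    printf("%g seconds\n", t);       /* about 0.001005 s */
    return 0;
}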
Cache coherence: CPU caches are managed by system hardware: programmers don’t have
direct control over them. This has several important consequences for shared-memory
systems.
To understand these issues, suppose we have a shared-memory system with two cores,
each of which has its own private data cache. (See Fig. 2.17.)
As long as the two cores only read shared data, there is no problem. For example, suppose
that x is a shared variable that has been initialized to 2, y0 is private and owned by core 0,
and y1 and z1 are private and owned by core 1.
Now suppose the following statements are executed at the indicated times (x starts at 2):
Time 0: Core 0 executes y0 = x;        Core 1 executes y1 = 3*x;
Time 1: Core 0 executes x = 7;         Core 1 executes statements not involving x
Time 2: Core 0 executes statements not involving x;   Core 1 executes z1 = 4*x;
Then the memory location for y0 will eventually get the value 2, and the memory location
for y1 will eventually get the value 6. However, it’s not so clear what value z1 will get. It
might at first appear that since core 0 updates x to 7 before the assignment to z1, z1 will
get the value 4×7 = 28.
However, at time 0, x is in the cache of core 1. So, unless for some reason x is evicted from
core 0’s cache and then reloaded into core 1’s cache, it actually appears that the original
value x = 2 may be used, and z1 will get the value 4×2 = 8.
Note that this unpredictable behaviour will occur regardless of whether the system is using
a write-through or a write-back policy. If it’s using a write-through policy, the main
memory will be updated by the assignment x = 7.
However, this will have no effect on the value in the cache of core 1. If the system is using a
write-back policy, the new value of x in the cache of core 0 probably won’t even be
available to core 1 when it updates z1.
The caches designed for single-processor systems provide no mechanism for ensuring that, when
the caches of multiple processors store the same variable, an update by one processor to the
cached variable is “seen” by the other processors; that is, that the cached value stored by
the other processors is also updated. This is called the cache coherence problem.
There are two main approaches to ensuring cache coherence: snooping cache coherence
and directory-based cache coherence.
1. Snooping cache coherence: The idea behind snooping comes from bus-based systems:
When the cores share a bus, any signal transmitted on the bus can be “seen” by all the
cores connected to the bus.
Thus, when core 0 updates the copy of x stored in its cache, if it also broadcasts this
information across the bus, and if core 1 is “snooping” the bus, it will see that x has been
updated, and it can mark its copy of x as invalid.
This is more or less how snooping cache coherence works. The principal difference
between our description and the actual snooping protocol is that the broadcast only
informs the other cores that the cache line containing x has been updated, not that x has
been updated.
First, it’s not essential that the interconnect be a bus, only that it support broadcasts from
each processor to all the other processors. Second, snooping works with both write-
through and write-back caches.
With write-back caches, on the other hand, an extra communication is necessary, since
updates to the cache don’t get immediately sent to memory.
2. Directory-based cache coherence:
In snooping cache coherence, whenever a variable changes in any core’s cache, the
update must be broadcast to all other cores.
Example: Core 0 changes x = 5 → This info must be sent to every other core so their
caches are updated or invalidated.
In a small system (like 4 cores), broadcasting is fast enough: all cores hear about changes
almost instantly. But in a large system (hundreds or thousands of cores), broadcasting is slow,
because there are too many “listeners,” update messages take longer to send, and the network
connecting the cores gets crowded.
Directory-based coherence targets a distributed-memory system with a single address space:
each core has its own private memory, but the system is set up so that any core can directly
read/write any variable in any other core’s memory.
Example: Core 0 can just do y = x even if x is in Core 1’s memory.
Such a system could grow to thousands of cores, but snooping would make it unscalable,
because each write would broadcast an update, overloading the network and crashing performance.
Directory-based cache coherence protocols attempt to solve this problem through the use
of a data structure called a directory.
The directory is not stored in one place; it’s distributed. Each core/memory pair keeps the
directory information for the memory it owns. For each cache line, the directory records which
cores have a copy of it and whether that line is valid, invalid, or shared.
When a core reads data (e.g., Core 0 reads variable x, which lives in Core 2’s memory), Core 2:
1. Sends the data to Core 0.
2. Updates its directory entry → “Core 0 has a copy of x now.”
When a core updates data (e.g., Core 0 changes x = 10), Core 0’s cache controller:
1. Checks the directory for x.
2. Finds only the cores that have a copy (say Core 3 and Core 5).
3. Sends invalidate messages only to those cores — not to all 1000 cores.
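A toy sketch of a directory entry and the “invalidate only the sharers” step (the dir_entry_t
structure, its field names, and invalidate_sharers are illustrative, not an actual coherence
protocol):

#include <stdbool.h>
#include <stdio.h>

#define NUM_CORES 8

typedef struct {
    bool present[NUM_CORES];   /* which cores hold a copy of this cache line */
    int  owner;                /* core whose memory owns the line */
    bool dirty;                /* has the line been modified? */
} dir_entry_t;

/* Core `writer` updates the line: notify only the cores that share it. */
static void invalidate_sharers(dir_entry_t *e, int writer) {
    for (int c = 0; c < NUM_CORES; c++)
        if (e->present[c] && c != writer) {
            printf("send invalidate to core %d\n", c);   /* point-to-point message */
            e->present[c] = false;
        }
    e->present[writer] = true;
    e->dirty = true;
}

int main(void) {
    dir_entry_t x_line = { .owner = 2 };            /* x lives in core 2's memory */
    x_line.present[3] = x_line.present[5] = true;   /* cores 3 and 5 share x */
    invalidate_sharers(&x_line, 0);                 /* core 0 writes x */
    return 0;
}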
False sharing:
It’s important to remember that CPU caches are implemented in hardware, so they operate
on cache lines, not individual variables. This can have disastrous consequences for
performance.
You have:
Two cores (Core 0 and Core 1)
An array y[8] of doubles (each double = 8 bytes)
Cache line size = 64 bytes → can hold 8 doubles
So all of y[0] to y[7] fit in one single cache line.
Ex:
for (i = 0; i < m; i++)
    for (j = 0; j < n; j++)
        y[i] += f(i, j);
We parallelize it:
Core 0 → works on y[0] to y[3]
Core 1 → works on y[4] to y[7]
/* Private variables */
int i, j, iter_count;
double y[m];

iter_count = m / core_count;

/* Core 0 does this */
for (i = 0; i < iter_count; i++)
    for (j = 0; j < n; j++)
        y[i] += f(i, j);

/* Core 1 does this */
for (i = iter_count; i < 2*iter_count; i++)
    for (j = 0; j < n; j++)
        y[i] += f(i, j);
Logically they are working on different elements, so there’s no data dependency. But caches
don’t operate on individual variables: they store whole cache lines.
If y[0] to y[7] are in one cache line, then any update to any element in that line means:
1. The entire cache line gets marked as invalid in the other core’s cache.
2. That core must fetch the updated line again from memory (or other core’s cache)
before continuing.
In practice Core 0 updates y[0], cache line is marked modified in Core 0’s cache, invalid in
Core 1’s cache. Core 1 then tries to update y[4], Cache controller sees that its cache line is
invalid and fetches the whole line from Core 0 (or main memory).
Now Core 1 owns the line, Core 0’s copy is invalidated. Core 0 next tries to update y[1] but
must fetch the whole line back again.
This keeps happening for every single update, even though they’re working on different y[i]
values.
Why this is called False Sharing
True sharing: both cores actually read/write the same variable.
False sharing: they read/write different variables, but those variables happen to be on
the same cache line, so the hardware treats it like shared data.
Instead of using the fast L1 cache, most updates end up going through slow main memory or
cross-core cache transfers.
L1 cache access: ~1–4 CPU cycles
Main memory access: ~100–300 cycles (a huge slowdown).
To reduce false sharing:
1. Pad the array so each core works on elements in a different cache line.
2. Use local temporary storage inside each thread and merge results at the end.
3. Arrange data structures so different threads don’t share cache lines.
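A minimal sketch of approach 2 (thread-local accumulation), reusing the names from the snippet
above; f, M, N, and NUM_THREADS here are illustrative. Each thread sums into a private local
variable and writes to the shared array only once per element, so the shared cache line stops
ping-ponging on every update.

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 2
#define M 8
#define N 1000

static double y[M];                                        /* shared result array */

static double f(int i, int j) { return (double) i + j; }   /* stand-in for f(i,j) */

static void *work(void *arg) {
    long rank = (long) arg;
    int iter_count = M / NUM_THREADS;
    for (int i = rank * iter_count; i < (rank + 1) * iter_count; i++) {
        double local_sum = 0.0;                      /* private: no shared cache line */
        for (int j = 0; j < N; j++)
            local_sum += f(i, j);
        y[i] = local_sum;                            /* one write instead of N writes */
    }
    return NULL;
}

int main(void) {
    pthread_t tid[NUM_THREADS];
    for (long r = 0; r < NUM_THREADS; r++)
        pthread_create(&tid[r], NULL, work, (void *) r);
    for (long r = 0; r < NUM_THREADS; r++)
        pthread_join(tid[r], NULL);
    printf("y[0] = %f, y[%d] = %f\n", y[0], M - 1, y[M - 1]);
    return 0;
}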
Shared-memory vs. distributed-memory
1. Shared-memory is easy for programmers
In shared memory, all processors can just access the same variables without worrying
about explicit message passing.
Example: Two threads can just read/write to the same array without sending
messages between them.
Shared-memory needs a high-speed interconnect (bus or crossbar) to connect all processors
to the same memory.
Bus problem:
When few processors are connected, it works fine.
But as the number of processors increases, they compete for the same bus.
More processors lead to more conflicts, so performance drops quickly.
So, buses are good only for small systems (like your laptop with 4–16 cores).
Crossbar problem:
Allows more simultaneous connections than a bus.
But cost rises steeply as you add processors.
Large crossbars are very expensive, so rare in big systems.
2. Distributed-memory is cheaper to scale
Uses direct interconnects like:
o Hypercube
o Toroidal mesh
Each processor has its own local memory, so there is no single “traffic jam” point.
Can scale to thousands of processors at much lower cost.
Great for huge problems needing massive computation or huge datasets.
Parallel software
Multicore processors are in:
Desktops
Servers
Mobile phones and tablets
This means your device can do more than one thing at the same time in hardware. But
software hasn’t fully caught up. Back in 2011, the book said: “We have parallel hardware, but
not much parallel software.”
In 2021, the situation is better but still incomplete:
o System software (like your OS) now uses multiple cores.
o Many popular apps (Excel, Photoshop, Chrome) can use more than one core.
o But many programs still use only a single core.
o Many programmers have never written parallel code.
In the past, we could just rely on:
Faster CPUs
Smarter compilers
This gave automatic speed boosts without changing the software.
Now that CPU clock speeds aren’t rising much, performance growth comes mainly from adding
more cores. If a program only uses one core, it won’t get faster on newer hardware.
The solution is to learn parallel programming: developers must learn to write programs
that can use:
o Shared-memory architectures (threads, common memory)
o Distributed-memory architectures (message passing)
o Both MIMD and SIMD execution models.
Caveats
This section gives two warnings (caveats) before diving into parallel programming details,
and it also clarifies SPMD. We mainly focus on what are often called single program,
multiple data (SPMD) programs.
In SPMD there is one program file, and multiple cores run it at the same time. Each core runs
the same code, but can do different things depending on its ID.
Ex: if (I'm thread/process 0)
do this;
else
do that;
Here, all cores are running the same executable, but thread 0 does one thing and thread 1
does another. Applying the same operation to different chunks of data is called data
parallelism.
if (I'm thread 0)
process first half of the array;
else // I'm thread 1
process second half of the array;
and so on.
This is easy because each thread can work independently (no need to share intermediate
results); no communication between threads is needed once the work is split.
Shared-memory:
Shared variables can be read and written by any thread, whereas private variables belong to
one thread only. Communication happens through shared variables (implicitly; there is no need
to send explicit messages as in distributed memory).
Dynamic Threads Paradigm
How it works:
1. Master thread waits for a new task (e.g., request from network).
2. When work arrives → master forks a new worker thread.
3. Worker thread does its job → joins back to master → ends.
4. Repeat for each task.
Advantages:
o Efficient resource usage → Threads only exist while they are needed.
o Ideal for systems where workload changes a lot over time.
Disadvantages:
o Overhead of forking/joining each time a task arrives.
o Not the fastest for continuous or heavy workloads.
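A rough sketch of the dynamic (fork/join-per-task) pattern using POSIX threads; the loop over
three tasks stands in for “wait for a request to arrive,” and handling the task is just a
printf here.

#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg) {
    long task = (long) arg;
    printf("worker handling task %ld\n", task);   /* do the job for this task */
    return NULL;                                  /* worker ends here */
}

int main(void) {
    for (long task = 0; task < 3; task++) {       /* stand-in for waiting for work */
        pthread_t tid;
        pthread_create(&tid, NULL, worker, (void *) task);   /* fork a worker */
        pthread_join(tid, NULL);                             /* join when it finishes */
    }
    return 0;
}

A real server would usually not join each worker immediately (that would serialize the tasks);
it would let workers run concurrently and collect them as they finish.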
Static Threads Paradigm
How it works:
1. Master thread creates all worker threads at the start.
2. All threads stay alive until all work is done.
3. At the end, all workers join the master, and then the program cleans up.
Advantages:
Lower overhead → No repeated thread creation/destruction.
Better performance for workloads that are steady or predictable.
Disadvantages:
Less efficient resource usage, since idle threads still consume memory and CPU stack
space.
Wastes resources if there’s not always enough work to keep all threads busy.
Nondeterminism?
In MIMD systems (Multiple Instruction, Multiple Data) where multiple processors (or
threads) run at the same time, they usually don’t stay perfectly in sync.
This means the same input might produce different outputs depending on how the
processors finish their tasks. This unpredictability is called nondeterminism.
Example 1: Printing with Two Threads
Imagine we have two threads:
o Thread 0 → has a private variable my_x = 7
o Thread 1 → has a private variable my_x = 19
Both threads run this code:
printf("Thread %d > my_x = %d\n", my_rank, my_x);
Possible outputs: Case 1:
Thread 0 > my_x = 7
Thread 1 > my_x = 19
Case 2:
Thread 1 > my_x = 19
Thread 0 > my_x = 7
Their outputs might get mixed together (e.g., half of one line, half of another), because both
are writing to the screen at the same time. The point is since the threads are independent,
we cannot predict the order.
Is nondeterminism always bad?
Sometimes it’s okay: in the print example, since each output is labelled with the
thread ID, the order doesn’t matter.
Sometimes it’s dangerous, especially in shared-memory programs, where threads
work on the same data.
For example, suppose thread 1 must not continue until thread 0 sets a shared flag ok_for_1.
With busy-waiting:
if (my_rank == 1)
    while (!ok_for_1); /* busy-wait */
if (my_rank == 0)
    ok_for_1 = true; // allow thread 1 to continue
Here, Thread 1 will loop endlessly until ok_for_1 is set to true by Thread 0. Simple but
wasteful: the CPU is busy “waiting” instead of doing real work.
Alternative 2: Semaphores
Similar to mutexes but slightly more flexible.
A semaphore can allow multiple threads to enter at once (depending on its count).
Some synchronization problems are easier to solve with semaphores than mutexes.
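A minimal POSIX-semaphore sketch (not from the text): the semaphore starts at 0, so thread 1
blocks in sem_wait() until thread 0 posts it, the same hand-off as the ok_for_1 flag earlier
but without busy-waiting.

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

static sem_t ok_for_1;

static void *thread0(void *arg) {
    printf("thread 0: doing its work first\n");
    sem_post(&ok_for_1);                 /* allow thread 1 to continue */
    return NULL;
}

static void *thread1(void *arg) {
    sem_wait(&ok_for_1);                 /* blocks (no spinning) until posted */
    printf("thread 1: continuing after thread 0\n");
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    sem_init(&ok_for_1, 0, 0);           /* not shared across processes, count 0 */
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_create(&t0, NULL, thread0, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    sem_destroy(&ok_for_1);
    return 0;
}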
Alternative 3: Monitors
A monitor is a higher-level concept.
Think of it like a special object where only one thread can call its methods at a time.
Provides mutual exclusion automatically without explicit lock/unlock calls.
Thread Safety
General Rule:
A function is not thread safe if multiple threads can access or modify the same
shared data inside it.
Many functions from single-threaded libraries are thread safe, but you must be
careful with the ones that rely on shared state (like strtok).
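For example, strtok keeps hidden internal state between calls, so two threads tokenizing
different strings at the same time can corrupt each other’s results. The POSIX strtok_r variant
takes an explicit save pointer, so each thread keeps its own state; a short sketch:

#include <stdio.h>
#include <string.h>

int main(void) {
    char line[] = "apple,banana,cherry";
    char *save = NULL;                   /* per-caller state instead of hidden state */
    for (char *tok = strtok_r(line, ",", &save);
         tok != NULL;
         tok = strtok_r(NULL, ",", &save))
        printf("%s\n", tok);
    return 0;
}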
Distributed-Memory
Message Passing
A message-passing API provides at least two functions:
1. Send → to send data to another process
2. Receive → to get data from another process
Each process is identified by a rank (like an ID number).
If there are p processes, the ranks are 0, 1, …, p–1.
if (my_rank == 1) {
sprintf(message, "Greetings from process 1");
Send(message, MSG_CHAR, 100, 0);
}
else if (my_rank == 0) {
Receive(message, MSG_CHAR, 100, 1);
printf("Process 0 > Received: %s\n", message);
}
What happens here?
Get_rank() → tells the process its ID (its rank, stored here in my_rank).
If rank = 1 → it creates a message ("Greetings from process 1") and sends it to
process 0.
If rank = 0 → it waits to receive the message and then prints it.
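The Send/Receive calls above are generic pseudocode; they map naturally onto MPI, one concrete
message-passing API. A sketch of the same greeting exchange in MPI (assuming an MPI
installation; compile with mpicc and run with mpiexec -n 2):

#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[]) {
    char message[100];
    int my_rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);      /* the Get_rank() step */

    if (my_rank == 1) {
        sprintf(message, "Greetings from process 1");
        MPI_Send(message, strlen(message) + 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    } else if (my_rank == 0) {
        MPI_Recv(message, 100, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process 0 > Received: %s\n", message);
    }

    MPI_Finalize();
    return 0;
}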
SPMD Program (Single Program, Multiple Data)
Both processes run the same code (same executable).
But their behavior differs based on rank.
Local Variables → Each process’s message variable refers to different memory
blocks (private). Some programmers emphasize this by naming them my_message
or local_message.
Output (stdout/stderr) → Usually, all processes can print to the screen.
Even though message-passing APIs don’t guarantee it, most implementations
allow it.
Disadvantages:
Very low-level (programmer must handle lots of details).
To parallelize a serial program, you often need to rewrite most of it.
Data structures must either:
o Be copied for each process, or
o Be split/distributed across processes.
Incremental rewriting (parallelizing small parts step by step) is usually not
practical.
Managing data movement between processes is tedious and error-prone.
Because of this, message passing is often called “the assembly language of
parallel programming.” It’s powerful but very detailed and hard to work with.
One-Sided Communication: Key Points
Two-Sided (Normal) Message Passing: In regular message passing, communication
always needs two processes:
o One process sends data.
o Another process receives data.
Both must participate at the same time (a send must match a receive).
In one-sided communication, by contrast, a single process calls a function that copies data
directly into (or out of) another process’s memory, with no matching call on the other side.
But this raises challenges. Imagine Process 0 writes data into Process 1’s memory.
Problems:
1. Safety problem → Process 0 must know it’s safe to overwrite memory in
Process 1.
2. Awareness problem → Process 1 must know when its memory has been
updated.
Ways to handle this:
A second synchronization step, in which Process 0 and Process 1 coordinate after the write.
A flag variable:
o Process 0 sets a special variable (e.g., done = true) after copying.
o Process 1 keeps checking (polling) the flag until it sees the update.
But polling wastes time, because Process 1 may repeatedly check the flag without
doing real work. And since only one process makes the communication call, the other
process doesn’t “see” it happening directly.
This can cause:
o Hard-to-detect bugs
o Overwriting wrong memory
o Processes waiting forever if synchronization isn’t handled properly
my_first_element = ...;
my_last_element = ...;
/* Initialize x and y */
...