BCS702 Module1
2. SIMD Systems:
SIMD systems apply the same operation to multiple data points simultaneously.
- Structure: One control unit broadcasts to many data paths.
- Use Cases: Vector addition, image processing.
- Limitation: Poor with irregular control flow or task parallelism.
- Example: Vector processors and GPUs, which apply SIMD at the level of their individual datapaths. A single SIMD instruction can apply the update below to many array elements at once:
for (i = 0; i < n; i++)
    x[i] += y[i];
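For contrast, a small sketch (an illustration, not taken from the notes) of why irregular control flow hurts SIMD: when a loop contains a data-dependent branch, the datapaths whose test fails must idle while the others execute the body.
/* Sketch: only some SIMD lanes do useful work on each iteration. */
for (i = 0; i < n; i++)
    if (y[i] > 0.0)      /* lanes with y[i] <= 0 sit idle here */
        x[i] += y[i];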
Vector Processors:
Vector processors work on vectors (arrays of data) rather than on individual scalars. Core features:
- Vector registers.
- Vectorized and pipelined functional units.
- Interleaved memory.
- Strided access and scatter/gather.
Vector processors are a type of SIMD (Single Instruction, Multiple Data) architecture that are
specially designed to perform operations on entire arrays (vectors) of data with a single
instruction.
Key Characteristics:
- A single instruction operates on an entire vector of operands, using large vector registers and deeply pipelined functional units.
Advantages:
- Fast and easy to use for regular, data-parallel loops, with efficient use of memory bandwidth.
Limitations:
- Handle irregular data structures poorly and do not scale well to ever-larger problems.
Examples of Use:
Cray supercomputers
Modern CPUs with SIMD extensions (e.g., AVX, SSE)
GPU-like designs for accelerating array operations
Example loop:
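The notes do not show the loop itself; a typical vectorizable loop (names assumed for illustration) is
for (i = 0; i < n; i++)
    y[i] += a * x[i];   /* one vector multiply-add over the whole array */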
Interleaved Memory:
- Divides memory into 'banks' for parallel access
Strided Memory Access:
- Access elements at regular intervals (e.g., every 5th element)
Scatter/Gather:
- Read/write to irregular memory patterns
- Hardware support often required
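A brief sketch (assumed array and index names, not from the notes) of the three access patterns just listed:
/* Strided access, gather, and scatter on C arrays (illustrative only). */
double access_patterns(int n, int m, double a[], double x[], const int idx[]) {
    double sum = 0.0;
    for (int i = 0; i < n; i += 5)       /* strided: read every 5th element        */
        sum += a[i];
    for (int i = 0; i < m; i++)          /* gather: reads at irregular positions   */
        x[i] = a[idx[i]];
    for (int i = 0; i < m; i++)          /* scatter: writes at irregular positions */
        a[idx[i]] = x[i];
    return sum;
}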
GPU Architecture:
GPUs contain many cores (each core has multiple datapaths)
Shared vs Distributed Memory:
- Shared: Cores access same memory pool
- Distributed: Private memory per core
Real-world GPUs can execute multiple instruction streams (MIMD) too
3. MIMD Systems:
MIMD systems allow processors to execute independent instructions on different data.
- Asynchronous Execution: No global clock.
Types are:
- Shared-memory MIMD
- Distributed-memory MIMD
- Hybrid Systems: Combine both shared and distributed memory across nodes.
Shared-memory System:
A shared memory system is a type of computer architecture where multiple processors (or
cores) access a common memory space. It is widely used in multicore processors and
SMP (Symmetric Multiprocessing) systems.
The most widely available shared-memory systems use one or more multicore
processors. A multicore processor has multiple CPUs, or cores, on a single
chip.
Typically, the cores have private level 1 caches, while other caches may or may
not be shared between the cores.
In shared-memory systems with multiple multicore processors, the interconnect
can either connect all the processors directly to main memory, or each
processor can have a direct connection to a block of main memory, and the
processors can access each other’s blocks of main memory through special
hardware built into the processors. (See Figs. 1.2. and 1.3.)
In the first type of system, the time to access all the memory locations will be
the same for all the cores, while in the second type, a memory location to which
a core is directly connected, can be accessed more quickly than a memory
location that must be accessed through another chip.
Thus the first type of system is called a Uniform memory access, or UMA,
system, while the second type is called a Nonuniform memory access, or
NUMA, system.
4. Interconnection Networks:
The interconnect plays a decisive role in the performance of both distributed- and
shared-memory systems: even if the processors and memory have virtually unlimited
performance, a slow interconnect will seriously degrade the overall performance of all
but the simplest parallel program.
1. Shared-memory interconnection.
2. Distributed-memory interconnection.
1. Shared-memory interconnection.
Definition:
Shared memory interconnects are the communication pathways that allow
multiple processors (or cores) to access the same physical memory in a shared-
memory system.
Types of Interconnects:
Bus-based interconnects:
o All processors share a common communication bus.
o Simple and low-cost, but suffer from bandwidth limitations as
more processors are added.
Crossbar switches:
o Allow multiple simultaneous connections between processors and
memory.
o More expensive and complex but offer higher throughput than buses.
Multistage interconnection networks:
o Use multiple switching stages to connect processors and memory modules.
o Offer scalability and better performance for larger systems.
NUMA vs UMA:
In a UMA system every core sees the same access time to main memory, while in a NUMA
system memory attached to another processor is slower to reach. As the number of
processors increases, contention for shared memory can become a bottleneck, especially
with simpler interconnects such as buses.
Figure 1.5 (b) : Individual switches can assume one of the two configurations.
2. Distributed-memory interconnection.
Key Features:
No shared address space; each processor can only access its local memory
directly.
Interprocessor communication must occur through messages, typically using MPI
(Message Passing Interface).
The design and performance of the interconnection network play a crucial
role in overall system efficiency.
Key Metrics:
- Latency: Time taken to send a message.
- Bandwidth: Amount of data sent per second.
- Bisection Width: The minimum number of links that must be removed to split the network into two equal halves; a measure of network capacity.
1. Bus-based Networks:
o Simple and cost-effective but not scalable.
o Rare in large-scale distributed systems.
2. Ring and Mesh Networks:
o Each processor is connected to a few neighbors.
o Example: 2D Mesh where each processor connects to 4 adjacent nodes.
3. Torus Networks:
o Similar to a mesh but with wrap-around connections, which reduce the
network diameter and improve connectivity.
4. Hypercube Networks:
o Logarithmic number of links between nodes.
o Excellent for scalability and low diameter.
5. Fat-Tree and Multistage Networks:
o Hierarchical, high-bandwidth interconnects.
o Often used in supercomputers and large HPC clusters.
Scalability:
The bisection width of a network is the minimum number of links that must be removed to
split the nodes into two halves of equal size. If we have a square two-dimensional
toroidal mesh with p = q² nodes (where q is even), then we can split the nodes into two
halves by removing the "middle" horizontal links and the "wraparound" horizontal links.
(See Fig. 1.8.) This suggests that the bisection width is at most 2q = 2√p.
This is, in fact, the smallest possible number of links, and the bisection width of a
square two-dimensional toroidal mesh is 2√p.
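As a quick worked check (an illustration, not in the notes): a toroidal mesh with p = 36 nodes has q = 6, so its bisection width is 2√36 = 2 × 6 = 12 links.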
The bandwidth of a link is the rate at which it can transmit data. It’s usually given
in megabits or megabytes per second. Bisection bandwidth is often used as a
measure of network quality. It’s similar to bisection width. However, instead of
counting the number of links joining the halves, it sums the bandwidth of the
links.
For example, if the links in a ring have a bandwidth of one billion bits per
second, then the bisection bandwidth of the ring will be two billion bits per
second or 2000 megabits per second.
The ideal direct interconnect is a fully connected network, in which each switch is
directly connected to every other switch. (See Fig. 1.9.)
Its bisection width is p²/4. However, it's impractical to construct such an
interconnect for systems with more than a few nodes, since it requires a total of
p²/2 − p/2 links, and each switch must be capable of connecting to p links. It is
therefore more a “theoretical best possible” interconnect than a practical one,
and it is used as a basis for evaluating other interconnects.
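For example (a worked check with an assumed p, not from the notes): a fully connected network with p = 8 nodes needs 8²/2 − 8/2 = 32 − 4 = 28 links and has bisection width 8²/4 = 16.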
The hypercube is a highly connected direct interconnect that has been used in
actual systems. Hypercubes are built inductively: A one-dimensional hypercube is
a fully connected system with two processors.
A two-dimensional hypercube is built from two one-dimensional hypercubes by
joining "corresponding" switches. Similarly, a three-dimensional hypercube is
built from two two-dimensional hypercubes. (See Fig. 1.10.)
Thus a hypercube of dimension d has p = 2^d nodes, and a switch in a
d-dimensional hypercube is directly connected to a processor and d switches. The
bisection width of a hypercube is p/2, so it has more connectivity than a ring or
toroidal mesh, but the switches must be more powerful, since they must support
1 + d = 1 + log₂(p) wires, while the mesh switches only require five wires. So a
hypercube with p nodes is more expensive to construct than a toroidal mesh.
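As a quick check (an illustration with an assumed d, not from the notes): a hypercube of dimension d = 5 has p = 2^5 = 32 nodes, bisection width 32/2 = 16, and each switch must support 1 + log₂(32) = 6 wires.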
The bisection width of a p×p crossbar is p, and the bisection width of an omega network is
p/2.
Any time data is transmitted, we’re interested in how long it will take for the
data to reach its destination. This is true whether we’re talking about transmitting
data between main memory and cache, cache and register, hard disk and
memory, or between two nodes in a distributed-memory or hybrid system.
There are two figures that are often used to describe the performance of an
interconnect (regardless of what it’s connecting): the latency and the bandwidth.
The latency is the time that elapses between the source’s beginning to transmit the
data and the destination’s starting to receive the first byte.
The bandwidth is the rate at which the destination receives data after it has
started to receive the first byte. So if the latency of an interconnect is l seconds
and the bandwidth is b bytes per second, then the time it takes to transmit a
message of n bytes is

    message transmission time = l + n/b.
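As an illustrative check (the numbers are assumptions, not from the notes): with latency l = 5 microseconds and bandwidth b = 1 GB/s, a message of n = 1 MB takes about 0.000005 + 0.001 ≈ 0.001005 seconds, so the transfer time is dominated by the bandwidth term.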
5. Cache Coherence:
In shared-memory systems, copies of the same variable in different caches can
become inconsistent.
Problem: Updates by one processor may not be visible to others.
Solutions:
Snooping Protocols: Caches monitor memory access via a bus.
Directory-based Protocols: Maintain a directory of which caches hold which data.
False Sharing: Two threads using different variables in the same cache line can cause
unnecessary invalidations.
Figure 1.15: A shared-memory system with two cores and two caches.
x is a shared variable that has been initialized to 2, y0 is private and owned by core 0, and
y1 and z1 are private and owned by core 1. Now suppose the following statements are
executed at the indicated times:

Time   Core 0                          Core 1
0      y0 = x;                         y1 = 3*x;
1      x = 7;                          statement(s) not involving x
2      statement(s) not involving x    z1 = 4*x;
Then the memory location for y0 will eventually get the value 2, and the memory
location for y1 will eventually get the value 6. However, it’s not so clear what value z1
will get. It might at first appear that since core 0 updates x to 7 before the assignment to
z1, z1 will get the value 4 × 7 = 28. However, at time 0, x is in the cache of core 1.
So unless for some reason x is evicted from core 0’s cache and then reloaded into core
1’s cache, it actually appears that the original value x = 2 may be used, and z1 will get
the value 4 × 2 = 8.
Note that this unpredictable behavior will occur regardless of whether the system is
using a write-through or a write-back policy. If it’s using a write-through policy, the
main memory will be updated by the assignment x = 7. However, this will have no effect
on the value in the cache of core 1. If the system is using a write-back policy, the new
value of x in the cache of core 0 probably won’t even be available to core 1 when it
updates z1.
Clearly, we would like the caches to be aware that x has been updated; that is, the
cached value stored by the other cores should also be updated. This is called the
cache coherence problem. There are two main approaches to ensuring cache coherence:
snooping cache coherence and directory-based cache coherence.
Snooping cache coherence:
Key points:
- The cores share a bus, and any signal transmitted on the bus can be "seen" by all the cores connected to it.
- When a core updates its cached copy of a variable, it broadcasts this information across the bus; a core that is "snooping" the bus then knows that its own copy of that variable is invalid and marks it accordingly.
- Broadcasting to every core limits scalability as the number of cores grows.
Directory-based cache coherence:
How it works:
1. Directory Structure
o Each memory block has an associated entry in the directory.
o The entry stores information about which processors/caches have a copy of
that block and whether it is shared or modified.
2. On a Read/Write:
o When a core loads a block into its cache, the directory is updated to
reflect that this cache holds a copy.
o When a core writes to a block:
The directory looks up which caches have that block.
Those caches are then sent invalidate or update messages.
3. Advantages:
o Only caches that hold the block are contacted (no global broadcast).
o Much more scalable than snooping.
o Works well in systems with thousands of cores.
4. Drawbacks:
o Requires extra storage for the directory.
o Directory lookups add some latency compared to snooping.
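A minimal sketch of the idea (assumed names and layout; real directory protocols are more involved):

#include <stdio.h>

/* One directory entry per memory block: which cores cache it, and whether
   some core holds it modified.                                            */
typedef struct {
    unsigned long sharers;   /* bit k is set if core k has a cached copy */
    int           dirty;     /* nonzero if the block is held modified    */
} dir_entry_t;

/* On a write by core w, only the recorded sharers are contacted. */
void on_write(dir_entry_t *e, int w) {
    unsigned long others = e->sharers & ~(1UL << w);
    for (int k = 0; others != 0; k++, others >>= 1)
        if (others & 1UL)
            printf("invalidate the copy in core %d's cache\n", k);
    e->sharers = 1UL << w;   /* the writer is now the only holder */
    e->dirty   = 1;
}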
False Sharing:
Definition: False sharing occurs when two or more processors access
different variables that happen to be located in the same cache line.
Even though the variables are logically independent, the cache system treats them
as if they are shared.
This causes unnecessary invalidations and memory traffic, which hurts
performance (but does not cause incorrect results).
Example:
Suppose we have two cores updating different parts of an array y[] in parallel:
/* Shared variables */
int i, j, m, n, core_count;
double y[m];                            /* array shared by all cores */
/* Core 0 executes */
for (i = 0; i < n; i++)
    for (j = 0; j < m/2; j++) y[j] += f(i, j);   /* f is some per-element computation */
/* Core 1 executes */
for (i = 0; i < n; i++)
    for (j = m/2; j < m; j++) y[j] += f(i, j);
Assume:
o Cache line size = 64 bytes
o Each double = 8 bytes
o One cache line = 8 elements of y[]
If y[0]...y[7] fit in one cache line:
o Core 0 updates y[0..3], Core 1 updates y[4..7].
o Even though they don’t access each other’s elements, the entire line is
invalidated whenever one core writes.
o This forces frequent reloads from memory → performance
degrades badly.
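A common remedy (a sketch under assumed names such as f, my_first, and MAX_CHUNK, which are not in the notes) is for each core to accumulate into private storage and touch the shared array y[] only once at the end, so the shared cache lines stop ping-ponging between cores:
double z[MAX_CHUNK];                      /* private to each core          */
for (j = 0; j < m/2; j++) z[j] = 0.0;
for (i = 0; i < n; i++)
    for (j = 0; j < m/2; j++)
        z[j] += f(i, j);                  /* all updates stay in private z */
for (j = 0; j < m/2; j++)
    y[my_first + j] = z[j];               /* one burst of shared writes    */
Alternatively, padding y[] so that the elements used by different cores fall into different cache lines also removes the false sharing.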
Key Notes:
Shared-Memory Systems:
Definition: All processors share a single address space and can directly
access the same memory.
Communication: Done implicitly by reading/writing shared variables.
Hardware: Processors connected to memory via interconnect (bus, crossbar,
etc.).
Types are: UMA (uniform memory access) and NUMA (nonuniform memory access) systems.
Distributed-Memory Systems:
Pros:
o Highly scalable (can build systems with thousands of processors).
o Cheaper interconnects compared to shared memory.
Cons:
o Programming is harder (explicit message passing, MPI).
o More overhead for communication.
Synchronization:
Threads often must wait for each other before moving to the next step.
This ensures results are consistent.
// Pseudocode
compute_partial_result();
barrier();              // wait until every thread/process reaches this point
combine_results();
Communication:
In shared-memory systems:
o Communication happens implicitly via shared variables.
In distributed-memory systems:
o Communication happens explicitly using message passing.
if (my_rank == 0)
send(data, destination=1);
else if (my_rank == 1)
receive(data, source=0);
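The same exchange written with real MPI calls, as a self-contained sketch (it assumes the program is launched with at least two processes):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int my_rank, data = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    if (my_rank == 0) {
        data = 42;                            /* value to send */
        MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (my_rank == 1) {
        MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process 1 received %d\n", data);
    }
    MPI_Finalize();
    return 0;
}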
8. Shared-Memory:
In shared-memory programming, all threads/processes share a common memory space. For example, the threads might divide the iterations of a loop among themselves:
    for (i = 0; i < n; i++)
        x[i] += y[i];
Here, x[] and y[] are shared, but the loop index i is private to each thread.
A race condition can arise when multiple threads add values into a shared variable sum:
    int sum = 0;    // shared
Here, multiple threads may try to update sum simultaneously, leading to incorrect results.
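One standard fix, sketched here with Pthreads (the notes do not name a particular threading library; the name my_partial is illustrative), is to make the update mutually exclusive:

#include <pthread.h>

int sum = 0;                               /* shared accumulator */
pthread_mutex_t sum_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Each thread adds its private partial result into the shared sum. */
void add_partial(int my_partial) {
    pthread_mutex_lock(&sum_mutex);        /* only one thread updates sum at a time */
    sum += my_partial;
    pthread_mutex_unlock(&sum_mutex);
}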
9. Distributed-Memory:
In a distributed-memory system, each process has its own private memory, so data must be exchanged by explicitly sending and receiving messages.
Key Points:
- No process can directly read or write another process's memory.
- Communication happens through an explicit send/receive pair, as in the sketch below:
char message[100];
if (my_rank == 1)
    send(message, /* to process */ 0);        /* process 1 sends the message */
else if (my_rank == 0)
    receive(message, /* from process */ 1);   /* process 0 receives it       */
Explanation of Code:
Process 1 sends the contents of message to process 0, and process 0 receives it. Because each process has its own private copy of message, the data moves only through this explicit send/receive pair.