BCS702 Module1
2. SIMD Systems:
SIMD systems apply the same operation to multiple data points simultaneously.
- Structure: One control unit broadcasts to many data paths.
- Use Cases: Vector addition, image processing.
- Limitation: Poor with irregular control flow or task parallelism.
- Example: Vector processors and GPUs, which apply SIMD at the level of their individual datapaths. A single SIMD instruction can apply the update below to many array elements at once:
for (i = 0; i < n; i++)
    x[i] += y[i];
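For contrast, a small sketch (an illustration, not taken from the notes) of why irregular control flow hurts SIMD: when a loop contains a data-dependent branch, the datapaths whose test fails must idle while the others execute the body.
/* Sketch: only some SIMD lanes do useful work on each iteration. */
for (i = 0; i < n; i++)
    if (y[i] > 0.0)      /* lanes with y[i] <= 0 sit idle here */
        x[i] += y[i];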
Vector Processors:
Vector processors work on vectors (arrays of data) rather than on individual scalars. Core features:
- Vector registers.
- Vectorized and pipelined functional units.
- Interleaved memory.
- Strided access and scatter/gather.
Vector processors are a type of SIMD (Single Instruction, Multiple Data) architecture that are
specially designed to perform operations on entire arrays (vectors) of data with a single
instruction.
Key Characteristics:
- A single instruction operates on an entire vector of operands, using large vector registers and deeply pipelined functional units.
Advantages:
- Fast and easy to use for regular, data-parallel loops, with efficient use of memory bandwidth.
Limitations:
- Handle irregular data structures poorly and do not scale well to ever-larger problems.
Examples of Use:
Cray supercomputers
Modern CPUs with SIMD extensions (e.g., AVX, SSE)
GPU-like designs for accelerating array operations
Example loop:
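The notes do not show the loop itself; a typical vectorizable loop (names assumed for illustration) is
for (i = 0; i < n; i++)
    y[i] += a * x[i];   /* one vector multiply-add over the whole array */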
Interleaved Memory:
- Divides memory into 'banks' for parallel access
Strided Memory Access:
- Access elements at regular intervals (e.g., every 5th element)
Scatter/Gather:
- Read/write to irregular memory patterns
- Hardware support often required
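A brief sketch (assumed array and index names, not from the notes) of the three access patterns just listed:
/* Strided access, gather, and scatter on C arrays (illustrative only). */
double access_patterns(int n, int m, double a[], double x[], const int idx[]) {
    double sum = 0.0;
    for (int i = 0; i < n; i += 5)       /* strided: read every 5th element        */
        sum += a[i];
    for (int i = 0; i < m; i++)          /* gather: reads at irregular positions   */
        x[i] = a[idx[i]];
    for (int i = 0; i < m; i++)          /* scatter: writes at irregular positions */
        a[idx[i]] = x[i];
    return sum;
}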
GPU Architecture:
GPUs contain many cores (each core has multiple datapaths)
Shared vs Distributed Memory:
- Shared: Cores access same memory pool
- Distributed: Private memory per core
Real-world GPUs can execute multiple instruction streams (MIMD) too
3. MIMD Systems:
MIMD systems allow processors to execute independent instructions on different data.
- Asynchronous Execution: No global clock.
Types are:
- Shared-memory MIMD
- Distributed-memory MIMD
- Hybrid Systems: Combine both shared and distributed memory across nodes.
Shared-memory System:
A shared memory system is a type of computer architecture where multiple processors (or
cores) access a common memory space. It is widely used in multicore processors and
SMP (Symmetric Multiprocessing) systems.
The most widely available shared-memory systems use one or more multicore
processors. A multicore processor has multiple CPUs, or cores, on a single
chip.
Typically, the cores have private level 1 caches, while other caches may or may
not be shared between the cores.
In shared-memory systems with multiple multicore processors, the interconnect
can either connect all the processors directly to main memory, or each
processor can have a direct connection to a block of main memory, and the
processors can access each other’s blocks of main memory through special
hardware built into the processors. (See Figs. 1.2. and 1.3.)
In the first type of system, the time to access all the memory locations will be
the same for all the cores, while in the second type, a memory location to which
a core is directly connected, can be accessed more quickly than a memory
location that must be accessed through another chip.
Thus the first type of system is called a Uniform memory access, or UMA,
system, while the second type is called a Nonuniform memory access, or
NUMA, system.
4. Interconnection Networks:
The interconnect plays a decisive role in the performance of both distributed- and
shared-memory systems: even if the processors and memory have virtually unlimited
performance, a slow interconnect will seriously degrade the overall performance of all
but the simplest parallel program.
1. Shared-memory interconnection.
2. Distributed-memory interconnection.
1. Shared-memory interconnection.
Definition:
Shared memory interconnects are the communication pathways that allow
multiple processors (or cores) to access the same physical memory in a shared-
memory system.
Types of Interconnects:
Bus-based interconnects:
o All processors share a common communication bus.
o Simple and low-cost, but suffer from bandwidth limitations as
more processors are added.
Crossbar switches:
o Allow multiple simultaneous connections between processors and
memory.
o More expensive and complex but offer higher throughput than buses.
Multistage interconnection networks:
o Use multiple switching stages to connect processors and memory modules.
o Offer scalability and better performance for larger systems.
NUMA vs UMA:
In a UMA system every core sees the same access time to main memory, while in a NUMA
system memory attached to another processor is slower to reach. As the number of
processors increases, contention for shared memory can become a bottleneck, especially
with simpler interconnects such as buses.
Figure 1.5 (b) : Individual switches can assume one of the two configurations.
2. Distributed-memory interconnection.
Key Features:
No shared address space; each processor can only access its local memory
directly.
Interprocessor communication must occur through messages, typically using MPI
(Message Passing Interface).
The design and performance of the interconnection network play a crucial
role in overall system efficiency.
Key Metrics:
- Latency: Time taken to send a message.
- Bandwidth: Amount of data sent per second.
- Bisection Width: The minimum number of links that must be removed to split the network into two equal halves; a measure of network capacity.
1. Bus-based Networks:
o Simple and cost-effective but not scalable.
o Rare in large-scale distributed systems.
2. Ring and Mesh Networks:
o Each processor is connected to a few neighbors.
o Example: 2D Mesh where each processor connects to 4 adjacent nodes.
3. Torus Networks:
o Similar to a mesh but with wrap-around connections, which reduce the
network diameter and improve connectivity.
4. Hypercube Networks:
o Logarithmic number of links between nodes.
o Excellent for scalability and low diameter.
5. Fat-Tree and Multistage Networks:
o Hierarchical, high-bandwidth interconnects.
o Often used in supercomputers and large HPC clusters.
Scalability:
The bisection width of a network is the minimum number of links that must be removed to
split the nodes into two halves of equal size. If we have a square two-dimensional
toroidal mesh with p = q² nodes (where q is even), then we can split the nodes into two
halves by removing the "middle" horizontal links and the "wraparound" horizontal links.
(See Fig. 1.8.) This suggests that the bisection width is at most 2q = 2√p.
This is, in fact, the smallest possible number of links, and the bisection width of a
square two-dimensional toroidal mesh is 2√p.
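As a quick worked check (an illustration, not in the notes): a toroidal mesh with p = 36 nodes has q = 6, so its bisection width is 2√36 = 2 × 6 = 12 links.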
The bandwidth of a link is the rate at which it can transmit data. It’s usually given
in megabits or megabytes per second. Bisection bandwidth is often used as a
measure of network quality. It’s similar to bisection width. However, instead of
counting the number of links joining the halves, it sums the bandwidth of the
links.
For example, if the links in a ring have a bandwidth of one billion bits per
second, then the bisection bandwidth of the ring will be two billion bits per
second or 2000 megabits per second.
The ideal direct interconnect is a fully connected network, in which each switch is
directly connected to every other switch. (See Fig. 1.9.)
Its bisection width is p²/4. However, it's impractical to construct such an
interconnect for systems with more than a few nodes, since it requires a total of
p²/2 − p/2 links, and each switch must be capable of connecting to p links. It is
therefore more a “theoretical best possible” interconnect than a practical one,
and it is used as a basis for evaluating other interconnects.
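For example (a worked check with an assumed p, not from the notes): a fully connected network with p = 8 nodes needs 8²/2 − 8/2 = 32 − 4 = 28 links and has bisection width 8²/4 = 16.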
The hypercube is a highly connected direct interconnect that has been used in
actual systems. Hypercubes are built inductively: A one-dimensional hypercube is
a fully connected system with two processors.
A two-dimensional hypercube is built from two one-dimensional hypercubes by
joining "corresponding" switches. Similarly, a three-dimensional hypercube is
built from two two-dimensional hypercubes. (See Fig. 1.10.)
Thus a hypercube of dimension d has p = 2^d nodes, and a switch in a
d-dimensional hypercube is directly connected to a processor and d switches. The
bisection width of a hypercube is p/2, so it has more connectivity than a ring or
toroidal mesh, but the switches must be more powerful, since they must support
1 + d = 1 + log₂(p) wires, while the mesh switches only require five wires. So a
hypercube with p nodes is more expensive to construct than a toroidal mesh.
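As a quick check (an illustration with an assumed d, not from the notes): a hypercube of dimension d = 5 has p = 2^5 = 32 nodes, bisection width 32/2 = 16, and each switch must support 1 + log₂(32) = 6 wires.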
The bisection width of a p×p crossbar is p, and the bisection width of an omega network is
p/2.
Any time data is transmitted, we’re interested in how long it will take for the
data to reach its destination. This is true whether we’re talking about transmitting
data between main memory and cache, cache and register, hard disk and
memory, or between two nodes in a distributed-memory or hybrid system.
There are two figures that are often used to describe the performance of an
interconnect (regardless of what it’s connecting): the latency and the bandwidth.
The latency is the time that elapses between the source’s beginning to transmit the
data and the destination’s starting to receive the first byte.
The bandwidth is the rate at which the destination receives data after it has
started to receive the first byte. So if the latency of an interconnect is l seconds
and the bandwidth is b bytes per second, then the time it takes to transmit a
message of n bytes is

    message transmission time = l + n/b.
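As an illustrative check (the numbers are assumptions, not from the notes): with latency l = 5 microseconds and bandwidth b = 1 GB/s, a message of n = 1 MB takes about 0.000005 + 0.001 ≈ 0.001005 seconds, so the transfer time is dominated by the bandwidth term.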
5. Cache Coherence:
In shared-memory systems, copies of the same variable in different caches can
become inconsistent.
Problem: Updates by one processor may not be visible to others.
Solutions:
Snooping Protocols: Caches monitor memory access via a bus.
Directory-based Protocols: Maintain a directory of which caches hold which data.
False Sharing: Two threads using different variables in the same cache line can cause
unnecessary invalidations.
Figure 1.15: A shared-memory system with two cores and two caches.
x is a shared variable that has been initialized to 2, y0 is private and owned by core 0, and
y1 and z1 are private and owned by core 1. Now suppose the following statements are
executed at the indicated times:

Time   Core 0                          Core 1
0      y0 = x;                         y1 = 3*x;
1      x = 7;                          statement(s) not involving x
2      statement(s) not involving x    z1 = 4*x;
Then the memory location for y0 will eventually get the value 2, and the memory
location for y1 will eventually get the value 6. However, it’s not so clear what value z1
will get. It might at first appear that since core 0 updates x to 7 before the assignment to
z1, z1 will get the value 4 × 7 = 28. However, at time 0, x is in the cache of core 1.
So unless for some reason x is evicted from core 0’s cache and then reloaded into core
1’s cache, it actually appears that the original value x = 2 may be used, and z1 will get
the value 4 × 2 = 8.
Note that this unpredictable behavior will occur regardless of whether the system is
using a write-through or a write-back policy. If it’s using a write-through policy, the
main memory will be updated by the assignment x = 7. However, this will have no effect
on the value in the cache of core 1. If the system is using a write-back policy, the new
value of x in the cache of core 0 probably won’t even be available to core 1 when it
updates z1.
Clearly, we would like the caches to be aware that x has been updated; that is, the
cached value stored by the other cores should also be updated. This is called the
cache coherence problem. There are two main approaches to ensuring cache coherence:
snooping cache coherence and directory-based cache coherence.
Snooping cache coherence:
Key points:
- The cores share a bus, and any signal transmitted on the bus can be "seen" by all the cores connected to it.
- When a core updates its cached copy of a variable, it broadcasts this information across the bus; a core that is "snooping" the bus then knows that its own copy of that variable is invalid and marks it accordingly.
- Broadcasting to every core limits scalability as the number of cores grows.
Directory-based cache coherence:
How it works:
1. Directory Structure
o Each memory block has an associated entry in the directory.
o The entry stores information about which processors/caches have a copy of
that block and whether it is shared or modified.
2. On a Read/Write:
o When a core loads a block into its cache, the directory is updated to
reflect that this cache holds a copy.
o When a core writes to a block:
The directory looks up which caches have that block.
Those caches are then sent invalidate or update messages.
3. Advantages:
o Only caches that hold the block are contacted (no global broadcast).
o Much more scalable than snooping.
o Works well in systems with thousands of cores.
4. Drawbacks:
o Requires extra storage for the directory.
o Directory lookups add some latency compared to snooping.
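A minimal sketch of the idea (assumed names and layout; real directory protocols are more involved):

#include <stdio.h>

/* One directory entry per memory block: which cores cache it, and whether
   some core holds it modified.                                            */
typedef struct {
    unsigned long sharers;   /* bit k is set if core k has a cached copy */
    int           dirty;     /* nonzero if the block is held modified    */
} dir_entry_t;

/* On a write by core w, only the recorded sharers are contacted. */
void on_write(dir_entry_t *e, int w) {
    unsigned long others = e->sharers & ~(1UL << w);
    for (int k = 0; others != 0; k++, others >>= 1)
        if (others & 1UL)
            printf("invalidate the copy in core %d's cache\n", k);
    e->sharers = 1UL << w;   /* the writer is now the only holder */
    e->dirty   = 1;
}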
False Sharing:
Definition: False sharing occurs when two or more processors access
different variables that happen to be located in the same cache line.
Even though the variables are logically independent, the cache system treats them
as if they are shared.
This causes unnecessary invalidations and memory traffic, which hurts
performance (but does not cause incorrect results).
Example:
Suppose we have two cores updating different parts of an array y[] in parallel:
/* Shared variables */
int i, j, m, n, core_count;
double y[m];                            /* array shared by all cores */
/* Core 0 executes */
for (i = 0; i < n; i++)
    for (j = 0; j < m/2; j++) y[j] += f(i, j);   /* f is some per-element computation */
/* Core 1 executes */
for (i = 0; i < n; i++)
    for (j = m/2; j < m; j++) y[j] += f(i, j);
Assume:
o Cache line size = 64 bytes
o Each double = 8 bytes
o One cache line = 8 elements of y[]
If y[0]...y[7] fit in one cache line:
o Core 0 updates y[0..3], Core 1 updates y[4..7].
o Even though they don’t access each other’s elements, the entire line is
invalidated whenever one core writes.
o This forces frequent reloads from memory → performance
degrades badly.
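A common remedy (a sketch under assumed names such as f, my_first, and MAX_CHUNK, which are not in the notes) is for each core to accumulate into private storage and touch the shared array y[] only once at the end, so the shared cache lines stop ping-ponging between cores:
double z[MAX_CHUNK];                      /* private to each core          */
for (j = 0; j < m/2; j++) z[j] = 0.0;
for (i = 0; i < n; i++)
    for (j = 0; j < m/2; j++)
        z[j] += f(i, j);                  /* all updates stay in private z */
for (j = 0; j < m/2; j++)
    y[my_first + j] = z[j];               /* one burst of shared writes    */
Alternatively, padding y[] so that the elements used by different cores fall into different cache lines also removes the false sharing.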
Key Notes:
Shared-Memory Systems:
Definition: All processors share a single address space and can directly
access the same memory.
Communication: Done implicitly by reading/writing shared variables.
Hardware: Processors connected to memory via interconnect (bus, crossbar,
etc.).
Types are: UMA (uniform memory access) and NUMA (nonuniform memory access) systems.
Distributed-Memory Systems:
Pros:
o Highly scalable (can build systems with thousands of processors).
o Cheaper interconnects compared to shared memory.
Cons:
o Programming is harder (explicit message passing, MPI).
o More overhead for communication.
Synchronization:
Threads often must wait for each other before moving to the next step.
This ensures results are consistent.
// Pseudocode
compute_partial_result();
barrier();              // wait until every thread/process reaches this point
combine_results();
Communication:
In shared-memory systems:
o Communication happens implicitly via shared variables.
In distributed-memory systems:
o Communication happens explicitly using message passing.
if (my_rank == 0)
send(data, destination=1);
else if (my_rank == 1)
receive(data, source=0);
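The same exchange written with real MPI calls, as a self-contained sketch (it assumes the program is launched with at least two processes):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int my_rank, data = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    if (my_rank == 0) {
        data = 42;                            /* value to send */
        MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (my_rank == 1) {
        MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process 1 received %d\n", data);
    }
    MPI_Finalize();
    return 0;
}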
8. Shared-Memory:
In shared-memory programming, all threads/processes share a common memory space. For example, the threads might divide the iterations of a loop among themselves:
    for (i = 0; i < n; i++)
        x[i] += y[i];
Here, x[] and y[] are shared, but the loop index i is private to each thread.
A race condition can arise when multiple threads add values into a shared variable sum:
    int sum = 0;    // shared
Here, multiple threads may try to update sum simultaneously, leading to incorrect results.
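One standard fix, sketched here with Pthreads (the notes do not name a particular threading library; the name my_partial is illustrative), is to make the update mutually exclusive:

#include <pthread.h>

int sum = 0;                               /* shared accumulator */
pthread_mutex_t sum_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Each thread adds its private partial result into the shared sum. */
void add_partial(int my_partial) {
    pthread_mutex_lock(&sum_mutex);        /* only one thread updates sum at a time */
    sum += my_partial;
    pthread_mutex_unlock(&sum_mutex);
}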
9. Distributed-Memory:
In a distributed-memory system, each process has its own private memory, so data must be exchanged by explicitly sending and receiving messages.
Key Points:
- No process can directly read or write another process's memory.
- Communication happens through an explicit send/receive pair, as in the sketch below:
char message[100];
if (my_rank == 1)
    send(message, /* to process */ 0);        /* process 1 sends the message */
else if (my_rank == 0)
    receive(message, /* from process */ 1);   /* process 0 receives it       */
Explanation of Code:
Process 1 sends the contents of message to process 0, and process 0 receives it. Because each process has its own private copy of message, the data moves only through this explicit send/receive pair.