Parallel processors
Dr. Mrs. B. Janet,
Department of Computer Applications,
NIT, Trichy 15.
Computer Organization
SEQUENTIAL & PARALLEL PROCESSING
[Diagram: in SEQUENTIAL processing, a program's tasks (TASK 1, TASK 2, ...) run one after another on a single CPU to produce the result; in PARALLEL processing, the tasks (TASK 1, TASK 2, TASK 3) run simultaneously on multiple CPUs]
MASSIVELY PARALLEL COMPUTERS CAN HAVE THOUSANDS OF CPUs
Processor Designs
Pipelined ALU
Within operations
Across operations
Parallel ALUs
Parallel processors
Parallel Processor
Increase system performance by using multiple processors that can execute in parallel
Symmetric Multi-Processor
Cluster
Non-Uniform Memory Access (NUMA)
Pipelining
Instruction-level parallelism: overlapping the execution of instructions
Loop-level parallelism: executing iterations of a loop in parallel (see the sketch below)
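Loop-level parallelism is easy to make concrete. Below is a minimal sketch in C using OpenMP (standard OpenMP usage, not taken from these slides; compile with -fopenmp): since no iteration depends on another, the runtime may split them across processors.

#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N], b[N], c[N];     /* static: too large for the stack */

    for (int i = 0; i < N; i++) {
        a[i] = i;
        b[i] = 2.0 * i;
    }

    /* Iterations are independent, so OpenMP may distribute
       them across the available processors. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[42] = %f\n", c[42]);      /* 126.000000 */
    return 0;
}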
Multiple Processor Organization
Types of Parallel Processor Systems
Single instruction, single data stream - SISD
Single instruction, multiple data stream - SIMD
Multiple instruction, single data stream - MISD
Multiple instruction, multiple data stream- MIMD
Single Instruction, Single Data Stream - SISD
Single processor
Single instruction stream
Data stored in single memory
[Diagram: uni-processor (SISD) organization: the CU sends the instruction stream (IS) to the PU, which exchanges the data stream (DS) with the MU]
CU - Control Unit
IS - Instruction Stream
PU - Processing Unit
DS - Data Stream
MU - Memory Unit
Single Instruction, Multiple Data Stream - SIMD
A single machine instruction controls the simultaneous execution of a number of processing elements on a lockstep basis
Each processing element has an associated data memory
Each instruction is executed on a different set of data by the different processors
Vector and array processors (see the sketch below)
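On mainstream CPUs, SIMD appears as vector instructions. A minimal sketch in C, assuming an x86 processor with SSE (the _mm_* intrinsics are Intel's, from immintrin.h, not from these slides): a single add instruction operates on four floats in lockstep.

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float c[4];

    __m128 va = _mm_loadu_ps(a);        /* load four floats */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);     /* one instruction, four additions */
    _mm_storeu_ps(c, vc);

    for (int i = 0; i < 4; i++)
        printf("%.1f ", c[i]);          /* 11.0 22.0 33.0 44.0 */
    printf("\n");
    return 0;
}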
Parallel Organizations - SIMD
[Diagram: a single control unit broadcasts the instruction stream to multiple processing units, each with its own local memory]
LM - Local Memory
Multiple Instruction, Single Data Stream - MISD
A sequence of data is transmitted to a set of processors
Each processor executes a different instruction sequence
Never been implemented
Multiple Instruction, Multiple Data Stream - MIMD
A set of processors simultaneously execute different instruction sequences on different sets of data (see the sketch below)
SMPs, clusters and NUMA systems
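A minimal MIMD-style sketch using POSIX threads in C (the task functions and data are illustrative assumptions, not from these slides; compile with -pthread): two threads run different instruction sequences on different sets of data.

#include <pthread.h>
#include <stdio.h>

static void *sum_task(void *arg) {     /* instruction stream 1 */
    int *data = arg, s = 0;
    for (int i = 0; i < 4; i++) s += data[i];
    printf("sum = %d\n", s);
    return NULL;
}

static void *max_task(void *arg) {     /* instruction stream 2 */
    int *data = arg, m = data[0];
    for (int i = 1; i < 4; i++) if (data[i] > m) m = data[i];
    printf("max = %d\n", m);
    return NULL;
}

int main(void) {
    int a[4] = {1, 2, 3, 4}, b[4] = {7, 5, 9, 2};
    pthread_t t1, t2;

    /* Different instruction streams on different data streams. */
    pthread_create(&t1, NULL, sum_task, a);
    pthread_create(&t2, NULL, max_task, b);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}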
MIMD - Overview
General purpose processors
Each can process all instructions necessary
Further classified by method of processor communication
Parallel Organizations - MIMD
Distributed Memory
Taxonomy of Parallel Processor Architectures
SMP
Multiple similar processors within the same computer, interconnected by a bus or switching arrangement.
The main problem is cache coherence
Symmetric Multiprocessors
A stand-alone computer with the following characteristics:
Two or more similar processors of comparable capacity
Processors share the same memory and I/O
Processors are connected by a bus or other internal connection such that memory access time is approximately the same for each processor
All processors share access to I/O, either through the same channels or through different channels giving paths to the same devices
All processors can perform the same functions (hence symmetric)
System controlled by an integrated operating system providing interaction between processors: thread scheduling, synchronisation, and interaction at the job, task, file and data element levels (see the sketch below)
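As a sketch of the shared-memory interaction and synchronisation an SMP operating system must support, the following C program (POSIX threads; the counter and thread count are illustrative, not from these slides; compile with -pthread) has several threads update one location in shared memory under a mutex.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NITER 100000

static long counter = 0;                       /* shared memory */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < NITER; i++) {
        pthread_mutex_lock(&lock);             /* synchronisation */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);        /* 400000 */
    return 0;
}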
SMP Advantages
Performance
If some work can be done in parallel
Availability
Since all processors can perform the same
functions, failure of a single processor does not
halt the system
Incremental growth
User can enhance performance by adding
additional processors
Scaling
Vendors can offer range of products based on
number of processors
Block Diagram of Tightly Coupled Multiprocessor
Tightly Coupled - SMP
Processors share memory
Communicate via that shared memory
Symmetric Multiprocessor (SMP)
Share single memory or pool
Shared bus to access memory
Memory access time to given area of memory
is approximately the same for each processor
Symmetric Multiprocessor Organization
IBM z990 Multiprocessor Structure
Chip Multiprocessing
More than one processor implemented on
a single chip
Multithreading and Chip Multiprocessors
Instruction stream divided into smaller streams (threads)
Executed in parallel
Wide variety of multithreading designs
Cluster
A group of interconnected whole computers working together as a unified computing resource.
Clusters
Alternative to SMP
High performance
High availability
Server applications
A group of interconnected whole computers working together as a unified resource
Illusion of being one machine
Each computer called a node
Cluster Benefits
Absolute scalability
Incremental scalability
High availability
Superior price/performance
Cluster Configurations - Standby Server, No Shared Disk
Cluster Configurations - Shared Disk
Cluster v. SMP
Both provide multiprocessor support to high-demand applications
Both are available commercially
SMP has been available far longer
SMP:
Easier to manage and control
Closer to single processor systems
Scheduling is main difference
Less physical space
Lower power consumption
Clustering:
Superior incremental & absolute scalability
Superior availability
Redundancy
NUMA
A shared-memory multiprocessor in which the access time from a given processor to a word in memory varies with the location of the memory word.
Nonuniform Memory Access (NUMA)
Alternative to SMP & clustering
Uniform memory access
All processors have access to all parts of memory
Using load & store
Access time to all regions of memory is the same
Access time to memory is the same for all processors
As used by SMP
Nonuniform memory access
All processors have access to all parts of memory
Using load & store
Access time differs depending on the region of memory
Different processors access different regions of memory at different speeds
Cache coherent NUMA
Cache coherence is maintained among the caches of the various processors
Significantly different from SMP and clusters
Motivation
SMP has a practical limit to the number of processors
Bus traffic limits this to between 16 and 64 processors
In clusters each node has own memory
Apps do not see large global memory
Coherence maintained by software not hardware
NUMA retains SMP flavour while giving large-scale multiprocessing
e.g. Silicon Graphics Origin NUMA: 1024 MIPS R10000 processors
Objective is to maintain a transparent system-wide memory while permitting multiprocessor nodes, each with its own bus or internal interconnection system (see the sketch below)
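On Linux, memory placement on a NUMA machine can be controlled explicitly. A minimal sketch using libnuma (a real Linux library, though not mentioned on these slides; compile with -lnuma): memory is allocated on a specific node, so that node's processors get the faster, local access times.

#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not supported on this system\n");
        return 1;
    }

    printf("NUMA nodes: %d\n", numa_max_node() + 1);

    /* Allocate 1 MiB on node 0: local (fast) for node-0
       processors, remote (slower) for processors elsewhere. */
    size_t size = 1 << 20;
    char *buf = numa_alloc_onnode(size, 0);
    if (!buf) return 1;

    memset(buf, 0, size);                /* touch the pages */
    numa_free(buf, size);
    return 0;
}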
CC-NUMA Organization
Scalar Processor Approaches
Single-threaded scalar
Simple pipeline
No multithreading
Interleaved multithreaded scalar
Easiest multithreading to implement
Switch threads at each clock cycle
Pipeline stages kept close to fully occupied
Hardware needs to switch thread context between cycles
Blocked multithreaded scalar
Thread executed until latency event occurs
Would stop pipeline
Processor switches to another thread
Vector Computation
Maths problems involving physical processes present different difficulties for computation
Aerodynamics, seismology, meteorology
Continuous field simulation
High precision
Repeated floating-point calculations on large arrays of numbers
Supercomputers handle these types of problem
Hundreds of millions of flops
$10-15 million
Optimised for calculation rather than multitasking and I/O
Limited market
Research, government agencies, meteorology
Array processor
Alternative to supercomputer
Configured as peripherals to mainframes & minicomputers
Just run vector portion of problems
Vector Addition Example
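The slide's original figure is not reproduced here; as a stand-in, a minimal C sketch of the same computation: C = A + B element by element, which a vector processor performs with a single vector instruction rather than one scalar add per iteration.

#include <stdio.h>

#define N 8

int main(void) {
    double a[N], b[N], c[N];

    for (int i = 0; i < N; i++) {
        a[i] = i;
        b[i] = 10.0 * i;
    }

    /* A scalar processor issues one add per iteration; a vector
       machine performs this whole loop as one instruction:
       VADD C, A, B */
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    for (int i = 0; i < N; i++)
        printf("%.1f ", c[i]);
    printf("\n");
    return 0;
}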
Vector Processor
Processes vectors or arrays of data
Approaches to Vector Computation
Symmetric Multiprocessing to the Rescue
Multiple Processors
Multithreading
One instruction stream per slot
Multithreaded Processor
Replicates some components of the processor to execute multiple threads concurrently
Multithreading
Alleviates some of the memory latency problems
Still has problems
What if the red thread waits for data from memory on a cache miss? The yellow thread then waits unnecessarily (red and yellow are the thread slots in the slide's diagram)
Hyperthreading
More than one instruction stream per slot
SMP vs SMT
Having Multiple Cores
[Diagram comparing three organizations built from the same blocks (Arch. State, APIC, Processor Core, On-Die Cache, System Bus): Dual Processor - two complete processors, each with its own cache, on the system bus; HyperThreading - one processor core carrying two architectural states and APICs; Dual Core - two processor cores with on-die cache on a single chip]
Multicores
Two or more processors on the same chip
Each has an independent interface to the front-side bus
Both the OS and the applications must support thread-level parallelism (see the sketch below)
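A minimal sketch of application-level thread parallelism on a multicore machine, in C with POSIX threads (sysconf(_SC_NPROCESSORS_ONLN) is a standard POSIX/Linux facility, not from these slides; compile with -pthread): query the number of cores online and start one thread per core.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void *work(void *arg) {
    long id = (long)arg;
    printf("thread %ld running\n", id);
    return NULL;
}

int main(void) {
    long ncores = sysconf(_SC_NPROCESSORS_ONLN);
    printf("cores online: %ld\n", ncores);

    pthread_t *t = malloc(ncores * sizeof *t);
    if (!t) return 1;

    /* One software thread per hardware core. */
    for (long i = 0; i < ncores; i++)
        pthread_create(&t[i], NULL, work, (void *)i);
    for (long i = 0; i < ncores; i++)
        pthread_join(t[i], NULL);

    free(t);
    return 0;
}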