COMP 4901Q: High Performance Computing (HPC)
Lecture 2: Introduction to Parallel Computer Architecture
Instructor: Shaohuai SHI ([email protected])
Teaching assistants: Mingkai TANG ([email protected]), Yazhou XING ([email protected])
Course website: https://course.cse.ust.hk/comp4901q/
Outline
🞂 Computer Architecture
🞂 Parallel Computer Architecture
🞂 Performance Measurement in Parallel Computing
Computer Architecture
🞂 CPU
🞂 Central Processing Unit
🞂 Responsible for running programs
🞂 Main Memory (or RAM)
🞂 Random access memory
🞂 Temporary storage for program and data
🞂 Buses
🞂 Data movement
🞂 I/O Devices
🞂 Disk: Long-term storage
🞂 Mouse/Keyboard/Display
Hardware organization of a typical system [1]
The Processor (von Neumann architecture)
🞂 Arithmetic logic unit (ALU)
🞂 Performs mathematical and logic operations
🞂 Control unit (CU)
🞂 Directs the movement of instructions in and out of the processor
🞂 Sends control signals to the ALU
🞂 Registers
🞂 Instruction register
🞂 Program counter (PC)
🞂 General purpose registers
Source: http://computerscience.chemeketa.edu/cs160Reader/ComputerArchitecture/Processor.html
Instruction Cycle and Pipelining
🞂 Each instruction goes through the following cycle
🞂 Fetch Stage
🞂 The next instruction is fetched from the memory address held in the program counter (PC)
🞂 Decode Stage
🞂 The instruction in the instruction register is interpreted by the CU
🞂 Execute Stage
🞂 The ALU performs the mathematical or logic operations
🞂 Pipeline
🞂 Different stages of different instructions can be processed in parallel
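🞂 A rough way to see the benefit of pipelining from software (a sketch, not from the slides): a sum with a single accumulator forces every addition to wait for the previous one, while several independent accumulators give the pipeline independent instructions to overlap. The function names and sizes below are illustrative.

```c
#include <stdio.h>
#include <stdlib.h>

/* One accumulator: each add depends on the previous result,
 * so the floating-point pipeline stalls between iterations. */
static double sum_single(const double *a, long n) {
    double s = 0.0;
    for (long i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four independent accumulators: the adds in one iteration do not
 * depend on each other, so different additions can occupy different
 * pipeline stages at the same time. */
static double sum_multi(const double *a, long n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    long i;
    for (i = 0; i + 3 < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)            /* leftover elements */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}

int main(void) {
    long n = 1L << 24;            /* ~16M doubles, values are arbitrary */
    double *a = malloc(n * sizeof *a);
    for (long i = 0; i < n; i++)
        a[i] = 1.0;
    printf("%f %f\n", sum_single(a, n), sum_multi(a, n));
    free(a);
    return 0;
}
```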
Memory Hierarchy
🞂 Keep the processor busy by reducing data movement
🞂 Fast storage is expensive
🞂 L0 (Registers): 1ns, KB
🞂 L1, L2, L3 (caches): 10ns, MB
🞂 Main memory: 100ns, GB
🞂 DRAM
🞂 Double Data Rate (DDR)
🞂 High Bandwidth Memory (HBM)
🞂 Disk: 10ms, TB
🞂 Remote: 10sec, PB
Memory hierarchy of modern computers [1]
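🞂 The latency gap in the hierarchy is easy to observe from code (a sketch, not from the slides): traversing a large matrix row by row reuses cached data, while traversing it column by column keeps missing the cache and waiting on main memory. The matrix size below is an arbitrary choice.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 4096   /* 4096 x 4096 doubles = 128 MB, larger than typical caches */

int main(void) {
    double *m = malloc((size_t)N * N * sizeof *m);
    double sum = 0.0;
    for (long i = 0; i < (long)N * N; i++)
        m[i] = 1.0;

    /* Row-major traversal: consecutive accesses stay within a cache line. */
    clock_t t0 = clock();
    for (long i = 0; i < N; i++)
        for (long j = 0; j < N; j++)
            sum += m[i * N + j];
    printf("row-major:    %.3f s\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

    /* Column-major traversal: each access jumps N*8 bytes ahead,
     * so almost every access misses the cache and waits on DRAM. */
    t0 = clock();
    for (long j = 0; j < N; j++)
        for (long i = 0; i < N; i++)
            sum += m[i * N + j];
    printf("column-major: %.3f s\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

    printf("checksum: %g\n", sum);   /* keeps the loops from being optimized away */
    free(m);
    return 0;
}
```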
All Computers Become Parallel
🞂 From mobile phones to supercomputers
🞂 Mobile phones: e.g., Apple A9, ARMv8-A dual-core
🞂 Desktop computer: e.g., Intel Core i3, 2 Cores
🞂 Servers
🞂 Multiple Cores per CPU: e.g., Intel(R) Xeon(R) Gold 6230 CPU: 20 Cores
🞂 Multiple CPUs: e.g., Intel(R) Xeon(R) Gold 6230, 20 Cores
🞂 QPI: Intel QuickPath Interconnect (Unidirectional Speed: 6.4 GT/s)
🞂 UPI (starting in 2017): Intel Ultra Path Interconnect (Unidirectional Speed: 10.4 GT/s)
🞂 Multiple GPUs
🞂 PCIe for GPUs or other external devices: e.g., PCIe 3.0 x16 (8 GT/s per lane, ~1 GB/s per lane)
🞂 NVLink for GPUs: e.g., NVLink 3.0 (can be up to 96 lanes) for Nvidia A100 GPUs (50 Gbit/s per lane)
🞂 Clusters
🞂 Multiple servers connected with high-speed interconnects (e.g., Mellanox 100 Gbit/s EDR InfiniBand ConnectX-5)
🞂 Parallel Computers
🞂 Make use of multiple cores to finish one task
Parallel Hardware: Flynn’s Taxonomy, 1966
🞂 Michael J. Flynn: an American professor emeritus at Stanford University
🞂 SISD: single instruction stream, single data stream (no parallelism!)
🞂 SIMD: single instruction stream, multiple data streams (popular: SSE, AVX, GPU)
🞂 MISD: multiple instruction streams, single data stream (uncommon)
🞂 MIMD: multiple instruction streams, multiple data streams (popular: multi-core, cluster)
SISD
🞂 Both the instruction and data streams are executed sequentially
🞂 A single control unit (CU) fetches a single instruction stream (IS) from memory
🞂 The CU then generates the appropriate control signals to direct a single processing element (PE), also called a processing unit (PU), to operate on a single data stream (DS)
Image credit: https://en.wikipedia.org/wiki/Flynn%27s_taxonomy
SIMD
🞂 Parallelism is achieved by dividing data among the processors (data parallelism)
🞂 Applies the same instruction to multiple data items
🞂 Examples of SIMD architectures
🞂 Intel x86 CPUs: MMX, SSE, and AVX
🞂 MMX: MultiMedia eXtension, introduced in 1997
🞂 A single instruction can be applied to two 32-bit integers, four 16-bit integers, or eight 8-bit integers at once
🞂 SSE: Streaming SIMD Extensions, since 1999
🞂 One SSE instruction can perform 4 single-precision or 2 double-precision operations
🞂 AVX (Advanced Vector Extensions), AVX2, AVX-512, since 2008
🞂 One AVX-256 instruction can perform 8 single-precision (or 4 double-precision) operations
Image credit: https://en.wikipedia.org/wiki/Flynn%27s_taxonomy
Example of SISD vs. SIMD
🞂 SISD
🞂 Traditional mode
🞂 One operation produces one result
🞂 SIMD
🞂 E.g., Sandy Bridge: AVX (256 bits) or Cascade Lake: AVX-512 (512 bits)
🞂 One operation (256-bit AVX) produces 256/64 = 4 results (double-precision, 64 bits)
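🞂 A minimal C sketch of the contrast above, assuming an AVX-capable x86 CPU and a compiler flag such as gcc -mavx; the array names and sizes are illustrative. The scalar loop produces one result per operation, while each 256-bit AVX intrinsic produces 4 double-precision results.

```c
#include <stdio.h>
#include <immintrin.h>   /* AVX intrinsics */

#define N 8

int main(void) {
    double a[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    double b[N] = {8, 7, 6, 5, 4, 3, 2, 1};
    double c[N];

    /* SISD style: one scalar operation per iteration, one result each. */
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    /* SIMD style: one 256-bit AVX addition handles 256/64 = 4 doubles. */
    for (int i = 0; i < N; i += 4) {
        __m256d va = _mm256_loadu_pd(&a[i]);   /* load 4 doubles */
        __m256d vb = _mm256_loadu_pd(&b[i]);
        __m256d vc = _mm256_add_pd(va, vb);    /* 4 additions in one instruction */
        _mm256_storeu_pd(&c[i], vc);           /* store 4 results */
    }

    for (int i = 0; i < N; i++)
        printf("%g ", c[i]);
    printf("\n");
    return 0;
}
```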
MIMD
🞂 Supports multiple simultaneous instruction streams operating on multiple data streams
🞂 A collection of fully independent processing units or cores, each of which has its own control unit and its own ALU.
🞂 Shared-memory systems
🞂 Each processor can access each memory location.
🞂 Distributed-memory systems
🞂 Connected by a commodity interconnection network.
Image credit: https://en.wikipedia.org/wiki/Flynn%27s_taxonomy
Image credit: https://en.wikipedia.org/wiki/File:Shared_memory.svg
Image credit: https://hpc.llnl.gov/training/tutorials/introduction-parallel-computing-tutorial
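🞂 A minimal shared-memory MIMD sketch using OpenMP (not part of the slides; compile with something like gcc -fopenmp): every thread runs its own instruction stream on its own chunk of the data, and all threads see the same memory.

```c
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double a[N];
    double sum = 0.0;

    /* The array and "sum" live in memory shared by all threads;
     * each thread executes its own subset of the iterations. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = 2.0 * i;
        sum += a[i];
    }

    printf("threads available: %d, sum = %g\n", omp_get_max_threads(), sum);
    return 0;
}
```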
Shared-memory systems
🞂 Uniform memory access (UMA)
🞂 All processors can directly access main memory
🞂 Each processor has equal memory access time (latency) and access speed
🞂 E.g., Sun Starfire servers, Compaq AlphaServer, and HP V series
🞂 Less scalable
🞂 Increasingly difficult for the hardware to provide shared-memory behavior to an increasing number of CPU cores
🞂 Non-uniform memory access (NUMA)
🞂 The processors can access each other's blocks of main memory through special hardware (e.g., QPI, UPI)
🞂 The memory access time depends on the distance between the processor and the memory
Distributed-memory systems
🞂 A set of processing nodes interconnected by a network
🞂 Components
🞂 Links
🞂 Switches
🞂 Network interface cards (NIC)
🞂 Network communication speed
matters
🞂 Link speed
🞂 Routing
Image credit [1]
🞂 Network topology
🞂 Fat tree, Bcube,Torus, …
[1] http://www.cables-solutions.com/connectivity-options-comparison-10g-serversswitches-networking.html
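🞂 A minimal distributed-memory sketch using MPI (not covered in this lecture; typically compiled with mpicc and launched with mpirun): each process has its own private memory, and values are combined only through explicit messages over the interconnect.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* number of processes */

    /* Each process owns a private partial value in its own memory. */
    double local = rank + 1.0;
    double total = 0.0;

    /* Combining them requires communication over the network. */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d processes = %g\n", size, total);

    MPI_Finalize();
    return 0;
}
```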
Performance Measurement
🞂 Define Performance (Effective FLOPS) = FLOP/Execution_Time
🞂 “Processor X is n times faster than Processor Y when running a program”: Execution_Time_Y / Execution_Time_X = n
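🞂 A minimal sketch of measuring effective FLOPS by timing a loop with a known operation count (the kernel, iteration count, and 2-FLOP-per-iteration assumption are illustrative):

```c
#include <stdio.h>
#include <time.h>

#define ITERS 100000000L   /* 1e8 iterations, 2 FLOP each (one multiply + one add) */

int main(void) {
    double x = 0.0, y = 1.000000001;

    clock_t t0 = clock();
    for (long i = 0; i < ITERS; i++)
        x = x * y + 1.0;                 /* 2 floating-point operations */
    double seconds = (double)(clock() - t0) / CLOCKS_PER_SEC;

    double flop = 2.0 * ITERS;
    /* Performance (effective FLOPS) = FLOP / execution time */
    printf("result = %g, time = %.3f s, %.2f GFLOPS\n",
           x, seconds, flop / seconds / 1e9);
    return 0;
}
```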
Goal of Parallelism
🞂 Parallel program: instructions are executed in parallel by multiple processors (a single server or a cluster) to reduce the execution time of the program
🞂 Serial run-time = Tserial
🞂 Parallel run-time = Tparallel < Tserial
🞂 speedup = Tserial / Tparallel
🞂 Ideal: linear scaling with increasing number of cores
🞂 Fact: limited by Amdahl’s Law
Amdahl’s Law
🞂 Any computing task involves some part that can be parallelized (Tp denotes its time) and some part that cannot be parallelized (Ts denotes its time)
🞂 The theoretical speedup is limited by the part of the task that cannot benefit from parallelism
🞂 For a serial program (normalizing Tserial to 1):
🞂 Tserial = Tp + Ts = (1-s) + s
🞂 s is the Amdahl fraction: the fraction of work done sequentially
🞂 If it is parallelized with p processors:
🞂 Tparallel ≥ Tp/p + Ts = (1-s)/p + s
🞂 speedup = Tserial / Tparallel ≤ 1/((1-s)/p + s) ≤ 1/s
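🞂 A short sketch that evaluates the bound 1/((1-s)/p + s) for a few processor counts, showing how the speedup saturates at 1/s; the serial fraction s = 0.1 is just an example value.

```c
#include <stdio.h>

/* Amdahl's law: speedup <= 1 / ((1-s)/p + s) for serial fraction s and p processors. */
static double amdahl(double s, int p) {
    return 1.0 / ((1.0 - s) / p + s);
}

int main(void) {
    double s = 0.1;   /* example: 10% of the work is inherently sequential */
    int procs[] = {1, 2, 4, 8, 16, 64, 256, 1024};

    for (size_t i = 0; i < sizeof procs / sizeof procs[0]; i++)
        printf("p = %4d  speedup <= %6.2f\n", procs[i], amdahl(s, procs[i]));

    printf("limit as p grows: 1/s = %.2f\n", 1.0 / s);
    return 0;
}
```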
Strong-scaling vs. Weak-scaling
🞂 Strong-scaling
🞂 The total problem size (W) is fixed for p processors
🞂 Every processor has the problem size W/p
🞂 Limited by Amdahl's law
🞂 speedup ≤ 1/s
🞂 Weak-scaling
🞂 The total problem size is scaled for p processors: W × p
🞂 Every processor has the same problem size W
🞂 Limited by Gustafson's law
🞂 speedup ≤ s + (1-s) × p
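🞂 A short sketch contrasting the two bounds for the same serial fraction (the value of s is illustrative): the strong-scaling speedup is capped at 1/s, while the weak-scaling speedup keeps growing with p.

```c
#include <stdio.h>

int main(void) {
    double s = 0.1;   /* example serial fraction */

    printf("    p   strong (Amdahl)   weak (Gustafson)\n");
    for (int p = 1; p <= 1024; p *= 4) {
        double strong = 1.0 / ((1.0 - s) / p + s);   /* bounded by 1/s = 10 */
        double weak   = s + (1.0 - s) * p;           /* keeps growing with p */
        printf("%5d %17.2f %18.2f\n", p, strong, weak);
    }
    return 0;
}
```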
Image credit: http://kth.se/blogs/pdc/2018/11/scalability-strong-and-weak-scaling
Scaling Efficiency
🞂 Consider a parallel system that uses p processors and achieves a speedup of S = Tserial / Tparallel
🞂 Scaling Efficiency = S/p: it measures how efficient a parallel solution is
🞂 For example, by using 4 CPU cores, we decrease the running time from 1 hour to 20 minutes. Then the speedup is 3 and the efficiency is 3/4 = 75%.
🞂 Linear scaling => Scaling Efficiency = 100%
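🞂 The example above, written out as a small computation (assuming the same 1 hour vs. 20 minutes on 4 cores):

```c
#include <stdio.h>

int main(void) {
    double t_serial   = 60.0;   /* minutes: 1 hour on one core */
    double t_parallel = 20.0;   /* minutes: on 4 cores */
    int    p          = 4;

    double speedup    = t_serial / t_parallel;   /* 3.0 */
    double efficiency = speedup / p;             /* 0.75 -> 75% */

    printf("speedup = %.2f, efficiency = %.0f%%\n", speedup, efficiency * 100.0);
    return 0;
}
```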
Reading List
🞂 Thomas Sterling, Matthew Anderson, and Maciej Brodowicz (2018), “High Performance Computing: Modern Systems and Practices,” Morgan Kaufmann, Chapter 2. [PDF: https://www.sciencedirect.com/book/9780124201583/high-performance-computing]
🞂 Duncan, R. (1990). “A survey of parallel computer architectures,” Computer, 23(2), 5-16. [PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=44900]
Summary
🞂 Computer architecture
🞂 CPU
🞂 Memory
🞂 Instruction cycle
🞂 Instruction pipelining
🞂 Parallel computer architecture
🞂 Flynn’s Taxonomy
🞂 SISD, SIMD, MISD, and MIMD
🞂 Shared-memory and distributed-memory systems
🞂 Network topology
🞂 Performance measurement in parallel computing
🞂 Speedup
🞂 Amdahl’s law and Gustafson’s law