
COMP 4901Q: High Performance Computing (HPC)
Lecture 2: Introduction to Parallel Computer Architecture
Instructor: Shaohuai SHI ([email protected])
Teaching assistants: Mingkai TANG ([email protected]), Yazhou XING ([email protected])
Course website: https://course.cse.ust.hk/comp4901q/

Outline

🞂 Computer Architecture
🞂 Parallel Computer Architecture
🞂 Performance Measurement in Parallel Computing

Computer Architecture
🞂 CPU
  🞂 Central Processing Unit
  🞂 Responsible for running programs
🞂 Main Memory (or RAM)
  🞂 Random access memory
  🞂 Temporary storage for programs and data
🞂 Buses
  🞂 Data movement
🞂 I/O Devices
  🞂 Disk: Long-term storage
  🞂 Mouse/Keyboard/Display
Hardware organization of a typical system [1]

The Processor: von Neumann architecture

🞂 Arithmetic logic unit (ALU)
  🞂 Performs mathematical and logic operations
🞂 Control unit (CU)
  🞂 Directs the movement of instructions in and out of the processor
  🞂 Sends control signals to the ALU
🞂 Registers
  🞂 Instruction register
  🞂 Program counter (PC)
  🞂 General purpose registers

Source: http://computerscience.chemeketa.edu/cs160Reader/ComputerArchitecture/Processor.html

Instruction Cycle and Pipelining
🞂 Each instruction goes through the following cycle
  🞂 Fetch Stage
    🞂 the next instruction (at the address held in the PC) is fetched from memory
  🞂 Decode Stage
    🞂 the instruction held in the instruction register is interpreted by the CU
  🞂 Execute Stage
    🞂 the ALU performs the mathematical or logic functions
🞂 Pipeline
  🞂 Different stages from different instructions can be processed in parallel (a rough software-level sketch follows below)
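
A rough software-level illustration of why pipelining helps (our own C sketch, not part of the original slides): summing with one accumulator forms a chain of dependent adds, so each add must wait for the previous one to drain from the pipeline, while two independent accumulators let adds from different iterations occupy different pipeline stages at the same time.

#include <stddef.h>

/* One accumulator: every add depends on the result of the previous add. */
double sum_single(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Two accumulators: the two adds in each iteration are independent,
 * so they can sit in different pipeline stages simultaneously. */
double sum_dual(const double *a, size_t n) {
    double s0 = 0.0, s1 = 0.0;
    size_t i;
    for (i = 0; i + 1 < n; i += 2) {
        s0 += a[i];
        s1 += a[i + 1];
    }
    if (i < n)
        s0 += a[i];       /* leftover element when n is odd */
    return s0 + s1;
}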
Memory Hierarchy
🞂 Keep the processor busy by reducing data movement
🞂 Fast storage is expensive
  🞂 L0 (Registers): 1 ns, KB
  🞂 L1, L2, L3 caches: 10 ns, MB
  🞂 Main memory: 100 ns, GB
    🞂 DRAM
    🞂 Double Data Rate (DDR)
    🞂 High Bandwidth Memory (HBM)
  🞂 Disk: 10 ms, TB
  🞂 Remote: 10 sec, PB
Memory hierarchy of modern computers [1]
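
A hypothetical C sketch of why the hierarchy matters (our own example; the matrix size is an assumption, not from the slides): traversing a matrix row by row walks memory sequentially and reuses each cache line, while traversing it column by column jumps thousands of bytes per access and misses the caches far more often, so it typically runs several times slower.

#include <stddef.h>

#define N 2048
static double m[N][N];          /* 32 MB, far larger than the caches */

/* Cache-friendly: consecutive accesses touch consecutive addresses. */
double sum_row_major(void) {
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += m[i][j];
    return s;
}

/* Cache-unfriendly: consecutive accesses are N*8 bytes apart. */
double sum_col_major(void) {
    double s = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += m[i][j];
    return s;
}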
All Computers Become Parallel
🞂 From mobile phones to supercomputers
  🞂 Mobile phones: e.g., Apple A9, ARMv8-A dual-core
  🞂 Desktop computers: e.g., Intel Core i3, 2 cores
  🞂 Servers
    🞂 Multiple cores per CPU: e.g., Intel(R) Xeon(R) Gold 6230 CPU: 20 cores
    🞂 Multiple CPUs: e.g., Intel(R) Xeon(R) Gold 6230, 20 cores
      🞂 QPI: Intel QuickPath Interconnect (unidirectional speed: 6.4 GT/s)
      🞂 UPI (starting in 2017): Intel Ultra Path Interconnect (unidirectional speed: 10.4 GT/s)
    🞂 Multiple GPUs
      🞂 PCIe for GPUs or other external devices: e.g., PCIe 3.0 x16 (8 GT/s per lane, ~1 GB/s per lane)
      🞂 NVLink for GPUs: e.g., NVLink 3.0 (can be up to 96 lanes) for NVIDIA A100 GPUs (50 Gbit/s per lane)
  🞂 Clusters
    🞂 Multiple servers connected with high-speed interconnects (e.g., Mellanox 100 Gbit/s EDR InfiniBand ConnectX-5)
🞂 Parallel computers
  🞂 Make use of multiple cores to finish one task
Parallel Hardware: Flynn’s Taxonomy, 1966
🞂 Michael J. Flynn: an American professor emeritus at Stanford University

🞂 SISD: single instruction stream, single data stream (no parallelism!)
🞂 SIMD: single instruction stream, multiple data streams (popular: SSE, AVX, GPU)
🞂 MISD: multiple instruction streams, single data stream (uncommon)
🞂 MIMD: multiple instruction streams, multiple data streams (popular: multi-core, cluster)

SISD
🞂 Both the instruction and data streams are executed sequentially
🞂 A single control unit (CU) fetches a single instruction stream (IS) from memory
🞂 The CU then generates the appropriate control signals to direct a single processing element (PE), also called a processing unit (PU), to operate on a single data stream (DS)

Image credit: https://en.wikipedia.org/wiki/Flynn%27s_taxonomy

SIMD
🞂 Parallelism is achieved by dividing data among the processors: data parallelism
🞂 Applies the same instruction to multiple data items
🞂 Examples of SIMD architectures
  🞂 Intel x86 CPUs: MMX, SSE, and AVX
    🞂 MMX: MultiMedia eXtension, introduced in 1997
      🞂 A single instruction can be applied to two 32-bit integers, four 16-bit integers, or eight 8-bit integers at once
    🞂 SSE: Streaming SIMD Extensions, since 1999
      🞂 One SSE instruction can perform 4 single-precision or 2 double-precision operations
    🞂 AVX (Advanced Vector Extensions), AVX2, AVX-512, since 2008
      🞂 One AVX-256 instruction can perform 8 single-precision or 4 double-precision operations

Image credit: https://en.wikipedia.org/wiki/Flynn%27s_taxonomy

Example of SISD vs. SIMD
🞂 SISD
  🞂 Traditional mode
  🞂 One operation produces one result
🞂 SIMD
  🞂 E.g., Sandy Bridge: AVX (256 bits) or Cascade Lake: AVX-512 (512 bits)
  🞂 One operation (256-bit AVX) produces 256/64 = 4 results (double precision, 64 bits); see the code sketch below
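
A hedged C sketch of this contrast using Intel AVX intrinsics (compile with -mavx on an AVX-capable CPU; the loop itself is our own illustration, not taken from the lecture):

#include <immintrin.h>
#include <stddef.h>

/* SISD-style: one add produces one result per iteration. */
void add_scalar(const double *a, const double *b, double *c, size_t n) {
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* SIMD: one 256-bit AVX add produces 256/64 = 4 double-precision results. */
void add_avx(const double *a, const double *b, double *c, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m256d va = _mm256_loadu_pd(a + i);   /* load 4 doubles        */
        __m256d vb = _mm256_loadu_pd(b + i);
        __m256d vc = _mm256_add_pd(va, vb);    /* 4 additions at once   */
        _mm256_storeu_pd(c + i, vc);
    }
    for (; i < n; i++)                         /* scalar remainder      */
        c[i] = a[i] + b[i];
}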
MIMD
🞂 Supports multiple simultaneous instruction streams operating on multiple data streams
🞂 A collection of fully independent processing units or cores, each of which has its own control unit and its own ALU
🞂 Shared-memory systems
  🞂 Each processor can access each memory location
🞂 Distributed-memory systems
  🞂 Connected by a commodity interconnection network

Image credit: https://en.wikipedia.org/wiki/Flynn%27s_taxonomy
Image credit: https://en.wikipedia.org/wiki/File:Shared_memory.svg
Image credit: https://hpc.llnl.gov/training/tutorials/introduction-parallel-computing-tutorial

Shared-memory systems
🞂 Uniform memory access (UMA)
  🞂 All processors can directly access main memory
  🞂 Each processor has equal memory access time (latency) and access speed
  🞂 E.g., Sun Starfire servers, Compaq Alpha servers, and HP V-series servers
  🞂 Less scalable
    🞂 Increasingly difficult for hardware to provide shared-memory behavior to an increasing number of CPU cores
🞂 Non-uniform memory access (NUMA)
  🞂 The processors can access each others’ blocks of main memory through special hardware (e.g., QPI, UPI)
  🞂 The memory access time depends on how far the memory is from the processor
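
A minimal shared-memory example with OpenMP (our own sketch, assuming a compiler with -fopenmp support; it is not code from the lecture): every thread sees the same array and works on a different part of the iteration space.

#include <omp.h>
#include <stdio.h>

#define N 1000000
static double a[N];                     /* one array, visible to all threads */

int main(void) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = 2.0 * i;                 /* each thread writes its own chunk  */
        sum += a[i];
    }
    printf("max threads: %d, sum = %.0f\n", omp_get_max_threads(), sum);
    return 0;
}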
Distributed-memory systems
🞂 A set of processing nodes interconnected by a network
🞂 Components
  🞂 Links
  🞂 Switches
  🞂 Network interface cards (NICs)
🞂 Network communication speed matters
  🞂 Link speed
  🞂 Routing
  🞂 Network topology
    🞂 Fat tree, BCube, Torus, …

Image credit [1]
[1] http://www.cables-solutions.com/connectivity-options-comparison-10g-serversswitches-networking.html
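
A minimal distributed-memory example with MPI (our own sketch, assuming an MPI installation: mpicc reduce.c && mpirun -np 4 ./a.out; it is not code from the lecture): each process keeps its data in private memory, and values move between nodes only through explicit messages over the interconnect.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* id of this process        */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

    int local = rank * rank;                /* data private to this node */
    int total = 0;
    /* Combine the local values across the network onto rank 0. */
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of rank^2 over %d processes = %d\n", size, total);

    MPI_Finalize();
    return 0;
}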
Performance Measurement
🞂 Define Performance (effective FLOPS) = FLOP / Execution_Time
🞂 "Processor X is n times faster than Processor Y on running a program": Execution_Time_Y / Execution_Time_X = n
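
A small sketch (ours, with an arbitrary toy kernel) of how effective FLOPS can be measured: count the floating-point operations a loop performs and divide by its wall-clock execution time.

#include <stdio.h>
#include <time.h>

#define ITERS 100000000L

int main(void) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    double s = 0.0;
    for (long i = 0; i < ITERS; i++)
        s = s * 0.999999 + 1.0;            /* 2 FLOP per iteration */

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    double flop = 2.0 * ITERS;
    printf("s = %f, time = %.3f s, effective GFLOPS = %.2f\n",
           s, secs, flop / secs / 1e9);
    return 0;
}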
Goal of Parallelism
🞂 Parallel program: instructions are executed in parallel by multiple processors (a single server or a cluster) to reduce the execution time of the program
🞂 Serial run-time = Tserial
🞂 Parallel run-time = Tparallel < Tserial

  speedup = Tserial / Tparallel

🞂 Ideal: linear scaling with an increasing number of cores
🞂 Fact: limited by Amdahl’s law

Amdahl’s Law
🞂 Any computing task involves some part that can be parallelized (Tp denotes its time) and some part that cannot be parallelized (Ts denotes its time)
🞂 The theoretical speedup is limited by the part of the task that cannot benefit from parallelism
🞂 For a serial program:
  🞂 Tserial = Tp + Ts = (1-s) + s
  🞂 s is the Amdahl fraction: the fraction of the work done sequentially
🞂 When it is parallelized with p processors:
  🞂 Tparallel ≥ Tp / p + Ts = (1-s)/p + s
  🞂 speedup = Tserial / Tparallel ≤ 1 / ((1-s)/p + s) ≤ 1/s (see the numeric sketch below)
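
A numeric sketch of the bound (our own illustration): even with a modest 10% serial fraction, the speedup can never exceed 1/0.1 = 10, no matter how many processors are used.

#include <stdio.h>

/* s: serial fraction of the work, p: number of processors. */
double amdahl_speedup(double s, int p) {
    return 1.0 / ((1.0 - s) / p + s);
}

int main(void) {
    for (int p = 1; p <= 1024; p *= 4)
        printf("s = 0.10, p = %4d -> speedup <= %.2f\n",
               p, amdahl_speedup(0.10, p));
    return 0;
}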
Strong-scaling vs. Weak-scaling
🞂 Strong-scaling
  🞂 The total problem size (W) is fixed for p processors
  🞂 Every processor has the problem size W/p
  🞂 Limited by Amdahl’s law
    🞂 speedup ≤ 1/s
🞂 Weak-scaling
  🞂 The total problem size is scaled for p processors: W × p
  🞂 Every processor has the same problem size W
  🞂 Limited by Gustafson’s law
    🞂 speedup ≤ s + (1-s) × p

Image credit: http://kth.se/blogs/pdc/2018/11/scalability-strong-and-weak-scaling
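
A short comparison of the two bounds (our own sketch, again assuming a 10% serial fraction): the strong-scaling speedup saturates at 1/s, while the weak-scaling speedup keeps growing roughly linearly with p.

#include <stdio.h>

int main(void) {
    double s = 0.10;                                    /* serial fraction */
    for (int p = 1; p <= 64; p *= 2) {
        double amdahl    = 1.0 / ((1.0 - s) / p + s);   /* strong scaling  */
        double gustafson = s + (1.0 - s) * p;           /* weak scaling    */
        printf("p = %2d  Amdahl <= %5.2f  Gustafson <= %5.2f\n",
               p, amdahl, gustafson);
    }
    return 0;
}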
Scaling Efficiency
🞂 Consider a parallel system that uses p processors and achieves a speedup of
  S = Tserial / Tparallel
🞂 Scaling Efficiency = S / p: it measures how efficient a parallel solution is
🞂 For example, by using 4 CPU cores, we decrease the running time from 1 hour to 20 minutes. The speedup is S = 60/20 = 3, so the efficiency is 3/4 = 75%.
🞂 Linear scaling => Scaling Efficiency = 100%

Reading List

🞂 Thomas Sterling, Matthew Anderson and Maciej Brodowicz (2018), “High Performance Computing: Modern Systems and Practices,” Morgan Kaufmann, Chapter 2. [PDF: https://www.sciencedirect.com/book/9780124201583/high-performance-computing]
🞂 Duncan, R. (1990). “A survey of parallel computer architectures”. Computer, 23(2), 5-16. [PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=44900]

Summary
🞂 Computer architecture
  🞂 CPU
  🞂 Memory
  🞂 Instruction cycle
  🞂 Instruction pipelining
🞂 Parallel computer architecture
  🞂 Flynn’s Taxonomy
    🞂 SISD, SIMD, MISD, and MIMD
  🞂 Shared-memory and distributed-memory systems
  🞂 Network topology
🞂 Performance measurement in parallel computing
  🞂 Speedup
  🞂 Amdahl’s law and Gustafson’s law