COMP 4901Q: High Performance Computing (HPC)
Lecture 2: Introduction to Parallel Computer Architecture
Instructor: Shaohuai SHI ([email protected])
Teaching assistants: Mingkai TANG ([email protected]), Yazhou XING ([email protected])
Course website: https://course.cse.ust.hk/comp4901q/
Outline
🞂 Computer Architecture
🞂 Parallel Computer Architecture
🞂 Performance Measurement in Parallel Computing
Computer Architecture
🞂 CPU
🞂 Central Processing Unit
🞂 Responsible for running programs
🞂 Main Memory (or RAM)
🞂 Random access memory
🞂 Temporary storage for program and data
🞂 Buses
🞂 Data movement
🞂 I/O Devices
🞂 Disk: Long-term storage
🞂 Mouse/Keyboard/Display
Hardware organization of a typical system [1]
The Processor (von Neumann architecture)
🞂 Arithmetic logic unit (ALU)
🞂 Performs mathematical and logic operations
🞂 Control unit (CU)
🞂 Directs the movement of instructions in and out of the processor
🞂 Sends control signals to the ALU
🞂 Registers
🞂 Instruction register
🞂 Program counter (PC)
🞂 General purpose registers
Source: http://computerscience.chemeketa.edu/cs160Reader/ComputerArchitecture/Processor.html
Instruction Cycle and Pipelining
🞂 Each instruction goes through the following cycle
🞂 Fetch Stage
🞂 The next instruction is fetched from the memory address held in the program counter (PC)
🞂 Decode Stage
🞂 The instruction in the instruction register is interpreted by the CU
🞂 Execute Stage
🞂 The ALU performs the mathematical or logic operations
🞂 Pipeline
🞂 Different stages of different instructions can be processed in parallel
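🞂 A rough way to see the benefit of pipelining from software (a sketch, not from the slides): a sum with a single accumulator forces every addition to wait for the previous one, while several independent accumulators give the pipeline independent instructions to overlap. The function names and sizes below are illustrative.

```c
#include <stdio.h>
#include <stdlib.h>

/* One accumulator: each add depends on the previous result,
 * so the floating-point pipeline stalls between iterations. */
static double sum_single(const double *a, long n) {
    double s = 0.0;
    for (long i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four independent accumulators: the adds in one iteration do not
 * depend on each other, so different additions can occupy different
 * pipeline stages at the same time. */
static double sum_multi(const double *a, long n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    long i;
    for (i = 0; i + 3 < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)            /* leftover elements */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}

int main(void) {
    long n = 1L << 24;            /* ~16M doubles, values are arbitrary */
    double *a = malloc(n * sizeof *a);
    for (long i = 0; i < n; i++)
        a[i] = 1.0;
    printf("%f %f\n", sum_single(a, n), sum_multi(a, n));
    free(a);
    return 0;
}
```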
Memory Hierarchy
🞂 Keep the processor busy by reducing data movement
🞂 Fast storage is expensive
🞂 L0 (Registers): 1ns, KB
🞂 L1, L2, L3 (caches): 10ns, MB
🞂 Main memory: 100ns, GB
🞂 DRAM
🞂 Double Data Rate (DDR)
🞂 High Bandwidth Memory (HBM)
🞂 Disk: 10ms, TB
🞂 Remote: 10sec, PB
Memory hierarchy of modern computers [1]
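🞂 The latency gap in the hierarchy is easy to observe from code (a sketch, not from the slides): traversing a large matrix row by row reuses cached data, while traversing it column by column keeps missing the cache and waiting on main memory. The matrix size below is an arbitrary choice.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 4096   /* 4096 x 4096 doubles = 128 MB, larger than typical caches */

int main(void) {
    double *m = malloc((size_t)N * N * sizeof *m);
    double sum = 0.0;
    for (long i = 0; i < (long)N * N; i++)
        m[i] = 1.0;

    /* Row-major traversal: consecutive accesses stay within a cache line. */
    clock_t t0 = clock();
    for (long i = 0; i < N; i++)
        for (long j = 0; j < N; j++)
            sum += m[i * N + j];
    printf("row-major:    %.3f s\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

    /* Column-major traversal: each access jumps N*8 bytes ahead,
     * so almost every access misses the cache and waits on DRAM. */
    t0 = clock();
    for (long j = 0; j < N; j++)
        for (long i = 0; i < N; i++)
            sum += m[i * N + j];
    printf("column-major: %.3f s\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

    printf("checksum: %g\n", sum);   /* keeps the loops from being optimized away */
    free(m);
    return 0;
}
```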
All Computers Become Parallel
🞂 From mobile phones to supercomputers
🞂 Mobile phones: e.g., Apple A9, ARMv8-A dual-core
🞂 Desktop computer: e.g., Intel Core i3, 2 Cores
🞂 Servers
🞂 Multiple Cores per CPU: e.g., Intel(R) Xeon(R) Gold 6230 CPU: 20 Cores
🞂 Multiple CPUs: e.g., Intel(R) Xeon(R) Gold 6230, 20 Cores
🞂 QPI: Intel QuickPath Interconnect (Unidirectional Speed: 6.4 GT/s)
🞂 UPI (starting in 2017): Intel Ultra Path Interconnect (Unidirectional Speed: 10.4 GT/s)
🞂 Multiple GPUs
🞂 PCIe for GPUs or other external devices: e.g., PCIe 3.0 x16 (8 GT/s per lane, ~1 GB/s per lane)
🞂 NVLink for GPUs: e.g., NVLink 3.0 (can be up to 96 lanes) for Nvidia A100 GPUs (50 Gbit/s per lane)
🞂 Clusters
🞂 Multiple servers connected with high-speed interconnects (e.g., Mellanox 100 Gbit/s EDR InfiniBand ConnectX-5)
🞂 Parallel Computers
🞂 Make use of multiple cores to finish one task
Parallel Hardware: Flynn’s Taxonomy, 1966
🞂 Michael J. Flynn: an American professor emeritus at Stanford University
🞂 SISD: single instruction stream, single data stream (no parallelism!)
🞂 SIMD: single instruction stream, multiple data streams (popular: SSE, AVX, GPU)
🞂 MISD: multiple instruction streams, single data stream (uncommon)
🞂 MIMD: multiple instruction streams, multiple data streams (popular: multi-core, cluster)
SISD
🞂 Both the instruction and data streams are executed sequentially
🞂 A single control unit (CU) fetches a single instruction stream (IS) from memory
🞂 The CU then generates the appropriate control signals to direct a single processing element (PE), also called a processing unit (PU), to operate on a single data stream (DS)
Image credit: https://en.wikipedia.org/wiki/Flynn%27s_taxonomy
SIMD
🞂 Parallelism is achieved by dividing data among the processors (data parallelism)
🞂 Applies the same instruction to multiple data items
🞂 Examples of SIMD architectures
🞂 Intel x86 CPUs: MMX, SSE, and AVX
🞂 MMX: MultiMedia eXtension, introduced in 1997
🞂 A single instruction can be applied to two 32-bit integers, four 16-bit integers, or eight 8-bit integers at once
🞂 SSE: Streaming SIMD Extensions, since 1999
🞂 One SSE instruction can perform 4 single-precision or 2 double-precision operations
🞂 AVX (Advanced Vector Extensions), AVX2, AVX-512, since 2008
🞂 One AVX-256 instruction can perform 8 single-precision (or 4 double-precision) operations
Image credit: https://en.wikipedia.org/wiki/Flynn%27s_taxonomy
Example of SISD vs. SIMD
🞂 SISD
🞂 Traditional mode
🞂 One operation produces one result
🞂 SIMD
🞂 E.g., Sandy Bridge: AVX (256 bits) or Cascade Lake: AVX-512 (512 bits)
🞂 One operation (256-bit AVX) produces 256/64 = 4 results (double-precision, 64 bits)
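🞂 A minimal C sketch of the contrast above, assuming an AVX-capable x86 CPU and a compiler flag such as gcc -mavx; the array names and sizes are illustrative. The scalar loop produces one result per operation, while each 256-bit AVX intrinsic produces 4 double-precision results.

```c
#include <stdio.h>
#include <immintrin.h>   /* AVX intrinsics */

#define N 8

int main(void) {
    double a[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    double b[N] = {8, 7, 6, 5, 4, 3, 2, 1};
    double c[N];

    /* SISD style: one scalar operation per iteration, one result each. */
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    /* SIMD style: one 256-bit AVX addition handles 256/64 = 4 doubles. */
    for (int i = 0; i < N; i += 4) {
        __m256d va = _mm256_loadu_pd(&a[i]);   /* load 4 doubles */
        __m256d vb = _mm256_loadu_pd(&b[i]);
        __m256d vc = _mm256_add_pd(va, vb);    /* 4 additions in one instruction */
        _mm256_storeu_pd(&c[i], vc);           /* store 4 results */
    }

    for (int i = 0; i < N; i++)
        printf("%g ", c[i]);
    printf("\n");
    return 0;
}
```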
MIMD
🞂 Supports multiple simultaneous instruction streams operating on multiple data streams
🞂 A collection of fully independent processing units or cores, each of which has its own control unit and its own ALU.
🞂 Shared-memory systems
🞂 Each processor can access each memory location.
🞂 Distributed-memory systems
🞂 Connected by a commodity interconnection network.
Image credit: https://en.wikipedia.org/wiki/Flynn%27s_taxonomy
Image credit: https://en.wikipedia.org/wiki/File:Shared_memory.svg
Image credit: https://hpc.llnl.gov/training/tutorials/introduction-parallel-computing-tutorial
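🞂 A minimal shared-memory MIMD sketch using OpenMP (not part of the slides; compile with something like gcc -fopenmp): every thread runs its own instruction stream on its own chunk of the data, and all threads see the same memory.

```c
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double a[N];
    double sum = 0.0;

    /* The array and "sum" live in memory shared by all threads;
     * each thread executes its own subset of the iterations. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = 2.0 * i;
        sum += a[i];
    }

    printf("threads available: %d, sum = %g\n", omp_get_max_threads(), sum);
    return 0;
}
```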
Shared-memory systems
🞂 Uniform memory access (UMA)
🞂 All processors can directly access main memory
🞂 Each processor has equal memory access time (latency) and access speed
🞂 E.g., Sun Starfire servers, Compaq AlphaServer, and HP V series
🞂 Less scalable
🞂 Increasingly difficult for the hardware to provide shared-memory behavior to an increasing number of CPU cores
🞂 Non-uniform memory access (NUMA)
🞂 The processors can access each other's blocks of main memory through special hardware (e.g., QPI, UPI)
🞂 The memory access time depends on the distance between the processor and the memory
Distributed-memory systems
🞂 A set of processing nodes interconnected by a network
🞂 Components
🞂 Links
🞂 Switches
🞂 Network interface cards (NIC)
🞂 Network communication speed
matters
🞂 Link speed
🞂 Routing
Image credit [1]
🞂 Network topology
🞂 Fat tree, Bcube,Torus, …
[1] http://www.cables-solutions.com/connectivity-options-comparison-10g-serversswitches-networking.html
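🞂 A minimal distributed-memory sketch using MPI (not covered in this lecture; typically compiled with mpicc and launched with mpirun): each process has its own private memory, and values are combined only through explicit messages over the interconnect.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* number of processes */

    /* Each process owns a private partial value in its own memory. */
    double local = rank + 1.0;
    double total = 0.0;

    /* Combining them requires communication over the network. */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d processes = %g\n", size, total);

    MPI_Finalize();
    return 0;
}
```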
Performance Measurement
🞂 Define Performance (Effective FLOPS) = FLOP/Execution_Time
🞂 “Processor X is n times faster than Processor Y when running a program”: Execution_Time_Y / Execution_Time_X = n
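🞂 A minimal sketch of measuring effective FLOPS by timing a loop with a known operation count (the kernel, iteration count, and 2-FLOP-per-iteration assumption are illustrative):

```c
#include <stdio.h>
#include <time.h>

#define ITERS 100000000L   /* 1e8 iterations, 2 FLOP each (one multiply + one add) */

int main(void) {
    double x = 0.0, y = 1.000000001;

    clock_t t0 = clock();
    for (long i = 0; i < ITERS; i++)
        x = x * y + 1.0;                 /* 2 floating-point operations */
    double seconds = (double)(clock() - t0) / CLOCKS_PER_SEC;

    double flop = 2.0 * ITERS;
    /* Performance (effective FLOPS) = FLOP / execution time */
    printf("result = %g, time = %.3f s, %.2f GFLOPS\n",
           x, seconds, flop / seconds / 1e9);
    return 0;
}
```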
Goal of Parallelism
🞂 Parallel program: instructions are executed in parallel by multiple processors (a single server or a cluster) to reduce the execution time of the program
🞂 Serial run-time = Tserial
🞂 Parallel run-time = Tparallel < Tserial
🞂 speedup = Tserial / Tparallel
🞂 Ideal: linear scaling with increasing number of cores
🞂 Fact: limited by Amdahl’s Law
Amdahl’s Law
🞂 Any computing task involves some part that can be parallelized (Tp denotes its time) and some part that cannot be parallelized (Ts denotes its time)
🞂 The theoretical speedup is limited by the part of the task that cannot benefit from parallelism
🞂 For a serial program (normalizing Tserial to 1):
🞂 Tserial = Tp + Ts = (1-s) + s
🞂 s is the Amdahl fraction: the fraction of work done sequentially
🞂 If it is parallelized with p processors:
🞂 Tparallel ≥ Tp/p + Ts = (1-s)/p + s
🞂 speedup = Tserial / Tparallel ≤ 1/((1-s)/p + s) ≤ 1/s
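🞂 A short sketch that evaluates the bound 1/((1-s)/p + s) for a few processor counts, showing how the speedup saturates at 1/s; the serial fraction s = 0.1 is just an example value.

```c
#include <stdio.h>

/* Amdahl's law: speedup <= 1 / ((1-s)/p + s) for serial fraction s and p processors. */
static double amdahl(double s, int p) {
    return 1.0 / ((1.0 - s) / p + s);
}

int main(void) {
    double s = 0.1;   /* example: 10% of the work is inherently sequential */
    int procs[] = {1, 2, 4, 8, 16, 64, 256, 1024};

    for (size_t i = 0; i < sizeof procs / sizeof procs[0]; i++)
        printf("p = %4d  speedup <= %6.2f\n", procs[i], amdahl(s, procs[i]));

    printf("limit as p grows: 1/s = %.2f\n", 1.0 / s);
    return 0;
}
```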
Strong-scaling vs. Weak-scaling
🞂 Strong-scaling
🞂 The total problem size (W) is fixed for p processors
🞂 Every processor has the problem size W/p
🞂 Limited by Amdahl's law
🞂 speedup ≤ 1/s
🞂 Weak-scaling
🞂 The total problem size is scaled for p processors: W × p
🞂 Every processor has the same problem size W
🞂 Limited by Gustafson's law
🞂 speedup ≤ s + (1-s) × p
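🞂 A short sketch contrasting the two bounds for the same serial fraction (the value of s is illustrative): the strong-scaling speedup is capped at 1/s, while the weak-scaling speedup keeps growing with p.

```c
#include <stdio.h>

int main(void) {
    double s = 0.1;   /* example serial fraction */

    printf("    p   strong (Amdahl)   weak (Gustafson)\n");
    for (int p = 1; p <= 1024; p *= 4) {
        double strong = 1.0 / ((1.0 - s) / p + s);   /* bounded by 1/s = 10 */
        double weak   = s + (1.0 - s) * p;           /* keeps growing with p */
        printf("%5d %17.2f %18.2f\n", p, strong, weak);
    }
    return 0;
}
```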
Image credit: http://kth.se/blogs/pdc/2018/11/scalability-strong-and-weak-scaling
Scaling Efficiency
🞂 Consider a parallel system that uses p processors and achieves a speedup of S = Tserial / Tparallel
🞂 Scaling Efficiency = S/p: it measures how efficient a parallel solution is
🞂 For example, by using 4 CPU cores, we decrease the running time from 1 hour to 20 minutes. Then the speedup is 3 and the efficiency is 3/4 = 75%.
🞂 Linear scaling => Scaling Efficiency = 100%
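🞂 The example above, written out as a small computation (assuming the same 1 hour vs. 20 minutes on 4 cores):

```c
#include <stdio.h>

int main(void) {
    double t_serial   = 60.0;   /* minutes: 1 hour on one core */
    double t_parallel = 20.0;   /* minutes: on 4 cores */
    int    p          = 4;

    double speedup    = t_serial / t_parallel;   /* 3.0 */
    double efficiency = speedup / p;             /* 0.75 -> 75% */

    printf("speedup = %.2f, efficiency = %.0f%%\n", speedup, efficiency * 100.0);
    return 0;
}
```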
Reading List
🞂 Thomas Sterling, Matthew Anderson, and Maciej Brodowicz (2018), “High Performance Computing: Modern Systems and Practices,” Morgan Kaufmann, Chapter 2. [PDF: https://www.sciencedirect.com/book/9780124201583/high-performance-computing]
🞂 Duncan, R. (1990). “A survey of parallel computer architectures,” Computer, 23(2), 5-16. [PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=44900]
Summary
🞂 Computer architecture
🞂 CPU
🞂 Memory
🞂 Instruction cycle
🞂 Instruction pipelining
🞂 Parallel computer architecture
🞂 Flynn’s Taxonomy
🞂 SISD, SIMD, MISD, and MIMD
🞂 Shared-memory and distributed-memory systems
🞂 Network topology
🞂 Performance measurement in parallel computing
🞂 Speedup
🞂 Amdahl’s law and Gustafson’s law