Computer Organization
Chapter 1: Measuring &
Understanding Performance
Mª Antonia Trenas Castro
[email protected] [Adapted from Mary Jane Irwin’s slides (PSU) based on Computer Organization
and Design, ARM Edition, Patterson & Hennessy, © 2017, Elsevier]
EC Chapter 1.1 Dept. of Comp. Arch., UMA
The Computer Revolution
Progress in computer technology
Underpinned by Moore’s Law
Makes novel applications feasible
Computers in automobiles
Cell phones
Human genome project
World Wide Web
Search Engines
Computers are pervasive
EC Chapter 1.2 Dept. of Comp. Arch., UMA
Classes of Computers
Desktop computers
Designed to deliver good performance to a single user at low
cost, usually executing third-party software and usually incorporating a
graphics display, a keyboard, and a mouse
Servers
Used to run larger programs for multiple, simultaneous users;
typically accessed only via a network, with a greater emphasis
on dependability and (often) security
Supercomputers
A high performance, high cost class of servers with hundreds to
thousands of processors, terabytes of memory and petabytes of
storage that are used for high-end scientific and engineering
applications
Embedded computers (processors)
A computer inside another device used for running one
predetermined application
EC Chapter 1.3 Dept. of Comp. Arch., UMA
Connectivity driving computing
Source: Morgan Stanley
EC Chapter 1.4 Dept. of Comp. Arch., UMA
The Processor Market
embedded growth >> desktop growth
EC Chapter 1.5 Dept. of Comp. Arch., UMA
The Processor Market
EC Chapter 1.8 Dept. of Comp. Arch., UMA
Technology shift: ARM has cannibalized the processor market
EC Chapter 1.10 Dept. of Comp. Arch., UMA
A case of success
John Papandriopoulos, PhD from the U. Melbourne
EC Chapter 1.11 Dept. of Comp. Arch., UMA
Servers and supercomputers
Servers and supercomputing market
Google: search engine running on ARM; the Pixel 8 SoC will be based on the
Tensor TPU and ARM CPUs.
AMD Seattle Server-on-a-Chip based on Cortex-A57 (ARMv8)
Ampere One chip (192 core ARM) for Cloud servers (OCI and AWS).
Apple supports a broad shift to ARM with macOS (iPhone, iPad, SoC
M1, M2 used in laptops and desktops).
Mont Blanc, ExaNest, EUROEXA projects: supercomputers made of
ARM
- Commodity processors once took over this market; be prepared for the
moment mobile processors do the same
The 2nd-ranked system, Fugaku, installed at the RIKEN Center for
Computational Science (R-CCS) in Kobe, Japan, delivers 442 petaflops
(June 2023).
- Fugaku is powered by Fujitsu’s 48-core A64FX SoC, built on ARM cores.
- In single or further reduced precision, which is often used in machine
learning and AI applications, Fugaku’s peak performance is over 1,000
petaflops (2 exaflops).
EC Chapter 1.14 Dept. of Comp. Arch., UMA
Supercomputers
(top500, Jun 2023 )
https://www.top500.org/lists/top500/2022/06/
EC Chapter 1.15 Dept. of Comp. Arch., UMA
What You Will Learn
How programs are translated into the machine language
And how the hardware executes them
The hardware/software interface
What determines program performance
And how it can be improved
How hardware designers improve performance
What is parallel processing
EC Chapter 1.17 Dept. of Comp. Arch., UMA
Is it useful?
Marcos Fajardo wrote a rendering system, Arnold, from scratch;
it runs natively on x86 CPUs and tries to take advantage of all
available threads and SIMD lanes for optimal parallelism.
It is used in
Thor
Captain America
X-Men: First Class
The Avengers
Gravity
Guardians of the Galaxy
Star Wars: The Force Awakens
Game of Thrones
…
EC Chapter 1.18 Dept. of Comp. Arch., UMA
Understanding Performance
Algorithm
Determines number of operations executed
Programming language, compiler, architecture
Determine number of machine instructions executed
per operation
Processor and memory system
Determine how fast instructions are executed
I/O system (including OS)
Determines how fast I/O operations are executed
EC Chapter 1.19 Dept. of Comp. Arch., UMA
Below the Program
Applications software – written in a high-level language
Systems software
Hardware
System software
Operating system – supervising program that interfaces the
user’s program with the hardware (e.g., Linux, MacOS,
Windows)
- Handles basic input and output operations
- Allocates storage and memory
- Schedules tasks & Provides for protected sharing among multiple
applications
Compiler – translates programs written in a high-level language
(e.g., C, Java) into instructions that the hardware can execute
EC Chapter 1.20 Dept. of Comp. Arch., UMA
Below the Program, Con’t
High-level language program (in C)
void swap(int v[], int k)
{
   int temp;
   temp = v[k];
   v[k] = v[k+1];
   v[k+1] = temp;
}
      |  C compiler (one-to-many)
      v
Assembly language program (for MIPS)
swap: sll  $2,  $5, 2
      add  $2,  $4, $2
      lw   $15, 0($2)
      lw   $16, 4($2)
      sw   $16, 0($2)
      sw   $15, 4($2)
      jr   $31
      |  assembler (one-to-one)
      v
Machine (object, binary) code (for MIPS)
000000 00000 00101 0001000010000000
000000 00100 00010 0001000000100000
. . .
EC Chapter 1.21 Dept. of Comp. Arch., UMA
Advantages of Higher-Level Languages?
Higher-level languages
Allow the programmer to think in a more natural language, one
suited to the intended use (Fortran/C for scientific computation,
COBOL for business programming, Lisp for symbol
manipulation, Java for web programming, …)
Improve programmer productivity – more understandable
code that is easier to debug and validate
Improve program maintainability
Allow programs to be independent of the computer on which
they are developed (compilers and assemblers can translate
high-level language programs to the binary instructions of any
machine)
Emergence of optimizing compilers that produce very efficient
assembly code optimized for the target machine
As a result, very little programming is done today at
the assembler level
EC Chapter 1.22 Dept. of Comp. Arch., UMA
Disadvantages of scripting languages
Modern scripting languages are interpreted (Python,
JavaScript, PHP, Perl, Ruby …)
dynamically-typed and encourage reuse
Efficient for programmers but not for execution
Matrix Multiply: relative speedup over a Python version (18-core CPU)
EC Chapter 1.23 Dept. of Comp. Arch., UMA
Under the Covers
Same components for all kinds of computer
Five classic components of a computer – input, output, memory,
datapath, and control
datapath + control = processor (CPU)
Input/output includes
User-interface devices – display, keyboard, mouse
Storage devices – hard disk, CD/DVD, flash
Network adapters – for communicating with other computers
EC Chapter 1.26 Dept. of Comp. Arch., UMA
Evolution of processors for desktops and laptops
Intel 80286 (1982): 134×10³ transistors, 12 MHz, 68.7 mm²
Intel 80386 (1985): 275×10³ transistors, 33 MHz, 104 mm²
Intel Pentium (1993): 3.1×10⁶ transistors, 66 MHz, 264 mm²
Intel Pentium II (1997): 7.5×10⁶ transistors, 300 MHz, 209 mm²
Intel Pentium III (1999): 28×10⁶ transistors, 733 MHz, 140 mm²
Intel Pentium 4 (2000): 42×10⁶ transistors, 1.5 GHz, 224 mm²
Remarkable improvement in performance, without
considering power consumption or die size
EC Chapter 1.27 Dept. of Comp. Arch., UMA
12th Generation Intel Core Processors
Alder Lake
8 P-cores (1 GHz–5.5 GHz) + 8 E-cores (770 MHz–4.0 GHz)
GPU with 32 EU
TDP: 9 W–241 W
4–30 MB L3, 10 nm process
EC Chapter 1.28 Dept. of Comp. Arch., UMA
13th Generation Intel Core Proc.
Raptor Lake
8 P-cores (16 threads) + 16 E-cores; up to 6 GHz P-cores / 4.3 GHz E-cores
GPU with up to 96 EU
TDP: 35 W–253 W
36 MB L3, 7 nm process
EC Chapter 1.29 Dept. of Comp. Arch., UMA
Processors for desktops and laptops
Desktops (35–150 W), laptops (9–55 W) and mobiles (9–25 W):
Intel Raptor Lake U (mobile), 7 nm: 2 P + 4(8) E cores, 1.1–1.3 GHz; TDP 15 W; 8–12 MB L3 cache; Xe GPU with 48–96 EU
AMD Ryzen 7000 U, 5 nm: 4–16 cores, 1.9–2.2 GHz; TDP 9–28 W; 16 MB L3 cache; RDNA GPU with 12 CU
EC Chapter 1.31 Dept. of Comp. Arch., UMA
Processors for smartphones and IoT
Apple M2 (5 nm): 8–10 ARM cores, 20 billion transistors
CPU: 4 Firestorm cores (3.5 GHz) + 4 Icestorm cores (2.8 GHz)
Power: 1.3–13.8 W
GPU with 10 EU; Neural Engine (NE) with 16 cores
EC Chapter 1.32 Dept. of Comp. Arch., UMA
Evolution of processors for embedded systems: ARM
800x more energy efficient
500x smaller
EC Chapter 1.33 Dept. of Comp. Arch., UMA
ARM's primary CPU core families
EC Chapter 1.34 Dept. of Comp. Arch., UMA
Processors for smartphones and IoT
IoT (0.5–4 W):
ARM Cortex-A7 (28 nm): 4 cores, 1.3 GHz; 1 MB L2 cache; GPU with 2 EU
Qualcomm QCS410/QCS610 (11 nm): 4 cores (2 Gold / 2 Silver, 2.2–1.8 GHz); 2 MB L2 cache; GPU + DSP + AI engine
EC Chapter 1.35 Dept. of Comp. Arch., UMA
Technology Trends
Electronics technology continues to evolve
Increased capacity and performance
Reduced cost
Year | Technology                 | Relative performance/cost
1951 | Vacuum tube                | 1
1965 | Transistor                 | 35
1975 | Integrated circuit (IC)    | 900
1995 | Very large scale IC (VLSI) | 2,400,000
2005 | Ultra large scale IC       | 6,200,000,000
EC Chapter 1.39 Dept. of Comp. Arch., UMA
Moore’s Law
Intel’s Gordon Moore proposed Moore’s Law in 1965
Courtesy, Intel ®
EC Chapter 1.42 Dept. of Comp. Arch., UMA
Technology Scaling Road Map (ITRS)
Year                      | 2006 | 2008 | 2010 | 2012 | 2014 | 2020 | 2022
Feature size (nm)         |  65  |  45  |  32  |  22  |  14  |   5  |   4
Integration capacity (BT) |  0.3 |  0.7 |  1.1 |  1.4 |  1.9 | > 16 | > 90
Progression to a 2 nm transistor in 2024
Fun facts about 45nm transistors
30 million can fit on the head of a pin
If car prices had fallen at the same rate as the price of a
single transistor has since 1968, a new car today would cost
about 1 cent
The highest transistor count in a microprocessor is
134 billion transistors, in Apple's ARM-based M2 Ultra,
which is fabricated using a 5 nm process.
EC Chapter 1.43 Dept. of Comp. Arch., UMA
Moore’s Law Slowdown
EC Chapter 1.45 Dept. of Comp. Arch., UMA
Technology problems
The number of manufacturers is decreasing
EC Chapter 1.46 Dept. of Comp. Arch., UMA
Technology problems
The number of manufacturers is decreasing
EC Chapter 1.47 Dept. of Comp. Arch., UMA
Situation after 2016
More than Moore era
Technology scaling may end around the middle of the
next decade
2 nm may be the final node
- distance between silicon atoms: 0.5 nm
- size of the atom: 0.24 nm
Experts expect Moore's law will end by around 2025.
Continuation of technological progress in a variety of other areas:
new chip architectures, quantum computing, and AI and machine
learning.
EC Chapter 1.48 Dept. of Comp. Arch., UMA
End of Growth of Processor Performance Speed?
Dark Silicon and the end
of Multicore Scaling
EC Chapter 1.49 Dept. of Comp. Arch., UMA
The Sea Change: the switch to multiprocessors
Uniprocessor Performance
Constrained by power, instruction-level parallelism,
memory latency
EC Chapter 1.50 Dept. of Comp. Arch., UMA
Technology & Power: Dennard Scaling
Dennard Scaling: as transistors get smaller, their power
density stays constant, so that power use stays in
proportion to area. This is no longer true.
EC Chapter 1.51 Dept. of Comp. Arch., UMA
What Happened to Clock Rates and Why?
For reference: my hot plate dissipates 2000 W over 400 cm² = 5 W/cm²; my microprocessor dissipates 120 W over 4 cm² = 30 W/cm².
[Chart: power density (W/cm², log scale) vs. feature size (nm) for i386, i486, Pentium, Pentium Pro, Pentium II, Pentium III, Pentium 4 (2000), and Intel Core Duo (2006), with reference levels for a hot plate, a nuclear reactor, a rocket nozzle, and the Sun's surface]
“New Microarchitecture Challenges in the Coming Generations of CMOS Process
Technologies” – Fred Pollack, Intel. Micro32 Conf. keynote, 1999.
EC Chapter 1.52 Dept. of Comp. Arch., UMA
But What Happened to Clock Rates and Why?
Clock rates hit a “power wall”
[Chart: clock rate and power (Watts) across processor generations]
EC Chapter 1.53 Dept. of Comp. Arch., UMA
Power Trends
In CMOS IC technology
Power = Capacitive load × Voltage² × Frequency
Over this period, power grew by ×30 while clock frequency grew by ×1000.
How could clock rates grow by a factor of 1000 while
power grew by only a factor of 30?
The supply voltage dropped from 5 V to 1 V.
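As a rough sanity check of that question, here is a sketch in C; the capacitive-load ratio it derives is only what these assumed growth factors imply, not a quoted figure.
#include <stdio.h>

/* Dynamic power in CMOS: P = C * V^2 * F (back-of-envelope sketch). */
int main(void) {
    double freq_growth  = 1000.0;      /* clock rate grew about x1000        */
    double volt_ratio   = 1.0 / 5.0;   /* supply voltage dropped 5 V -> 1 V  */
    double power_growth = 30.0;        /* observed power growth about x30    */

    /* P_new/P_old = (C_new/C_old) * volt_ratio^2 * freq_growth, so the      */
    /* capacitive-load ratio consistent with these numbers is:               */
    double cap_ratio = power_growth / (volt_ratio * volt_ratio * freq_growth);
    printf("implied capacitive-load ratio: %.2f\n", cap_ratio);  /* ~0.75 */
    return 0;
}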
EC Chapter 1.55 Dept. of Comp. Arch., UMA
Reducing Power
Suppose a new CPU has
85% of capacitive load of old CPU
15% voltage and 15% frequency reduction
P_old = C_old × V_old² × F_old
P_new = (C_old × 0.85) × (V_old × 0.85)² × (F_old × 0.85)
P_old / P_new = 1 / 0.85⁴ ≈ 1.92
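The same calculation as a tiny C program (the 0.85 factors come straight from the slide):
#include <stdio.h>

/* Reducing Power example: the new CPU has 85% of the capacitive load and a
 * 15% reduction in both voltage and frequency, so every factor scales by 0.85. */
int main(void) {
    double scale   = 0.85;
    double p_ratio = scale * scale * scale * scale;   /* P_new / P_old = 0.85^4 */
    printf("P_new / P_old = %.3f (old/new = %.2f)\n", p_ratio, 1.0 / p_ratio);
    return 0;   /* prints ~0.522 and ~1.92, matching the slide */
}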
The power wall
We can’t reduce voltage further
We can’t remove more heat
How else can we improve performance?
EC Chapter 1.56 Dept. of Comp. Arch., UMA
Fallacy: Low Power at Idle
X4 power benchmark
- At 100% load: 295W
- At 50% load: 246W (83%)
- At 10% load: 180W (61%)
Although dynamic power is the primary source of power dissipation
in CMOS, static power dissipation occurs due to leakage current
that flows even when a transistor is off. Leakage was responsible for
about 40% of total power consumption (in 2008). Thus, increasing the
number of transistors increases power dissipation, even if the
transistors are always off.
Google data center
- Mostly operates at 10% – 50% load
- At 100% load less than 1% of the time
However even servers that are only 10% utilized burn about 2/3 of
their peak (100% load) power
Consider designing processors to make power proportional
to load
EC Chapter 1.57 Dept. of Comp. Arch., UMA
The sea change: multiprocessors
42 Years of Microprocessor Trend Data (Feb. 15, 2018)
EC Chapter 1.59 Dept. of Comp. Arch., UMA
The Sea Change: DSAs
The power challenge has forced a change in the design of
microprocessors
As of 2006, all desktop and server companies ship
microprocessors with multiple processors – cores – per chip, often
alongside other computing devices
2014: Domain Specific Architectures: Achieve higher efficiency by
tailoring the architecture to characteristics of the domain
Product        | AMD Ryzen 7000 | Intel Raptor Lake     | Qualcomm QCS2290 | Espressif ESP32-C3 (RISC-V)
Cores per chip | 4-16 + GPU     | 8+16 + GPU            | 4 + GPU + DSP    | 1/2
Clock rate     | 2.0-5.4 GHz    | 1.1-5.1 / 0.8-3.9 GHz | 2.0 GHz          | 160 MHz
Power          | ~15-75 W       | ~15-253 W             | 9 W              | < 1 W (5 µW)
EC Chapter 1.60 Dept. of Comp. Arch., UMA
The sea change: DSAs
Multicore microprocessors and DSAs
More than one processor per chip and different programming
model
Requires explicitly parallel programming
Compare with instruction level parallelism
- Hardware executes multiple instructions at once
- Hidden from the programmer
Hard to do
- Programming for performance for different devices
- Load balancing
- Optimizing communication and synchronization
EC Chapter 1.61 Dept. of Comp. Arch., UMA
Why DSAs Can Win (no magic)?
Tailor the Architecture to the Domain
More effective parallelism for a specific domain:
SIMD vs. MIMD
VLIW vs. Speculative, out-of-order
More effective use of memory bandwidth
User controlled versus caches
Eliminate unneeded accuracy
IEEE replaced by lower precision FP
32-64 bit integers to 8-16 bit integers
Domain specific programming language provides path for
software
EC Chapter 1.62 Dept. of Comp. Arch., UMA
DSA example: Tensor Processing Unit (TPU)
Perf/Watt TPU vs CPU & GPU
EC Chapter 1.63 Dept. of Comp. Arch., UMA
Performance
EC Chapter 1.73 Dept. of Comp. Arch., UMA
Performance Metrics
Purchasing perspective
given a collection of machines, which has the
- best performance ?
- least cost ?
- best cost/performance?
Design perspective
faced with design options, which has the
- best performance improvement ?
- least cost ?
- best cost/performance?
Both require
basis for comparison
metric for evaluation
Our goal is to understand what factors in the architecture
contribute to overall system performance and the relative
importance (and cost) of these factors
EC Chapter 1.74 Dept. of Comp. Arch., UMA
Throughput versus Response Time
Response time (execution time) – the time between the
start and the completion of a task
Important to individual users
Throughput (bandwidth) – the total amount of work done
in a given unit time
Important to data center managers
Will need different performance metrics as well as a
different set of applications to benchmark embedded and
desktop computers, which are more focused on response
time, versus servers, which are more focused on
throughput
How are response time and throughput affected by
● Replacing the processor with a faster version?
● Adding more processors?
We’ll focus on response time for now…
EC Chapter 1.75 Dept. of Comp. Arch., UMA
Defining (Speed) Performance
To maximize performance, need to minimize execution
time
performanceX = 1 / execution_timeX
If X is n times faster than Y, then
performance_X / performance_Y = execution_time_Y / execution_time_X = n
Decreasing response time almost always improves
throughput
EC Chapter 1.76 Dept. of Comp. Arch., UMA
Relative Performance Example
If computer A runs a program in 10 seconds and
computer B runs the same program in 15 seconds, how
much faster is A than B?
We know that A is n times faster than B if
performance_A / performance_B = execution_time_B / execution_time_A = n
The performance ratio is 15 / 10 = 1.5
So A is 1.5 times faster than B
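The same comparison as a small C helper (a sketch; the function name n_times_faster is ours, not from the slides):
#include <stdio.h>

/* "X is n times faster than Y" means perf_X / perf_Y = time_Y / time_X = n. */
double n_times_faster(double time_x, double time_y) {
    return time_y / time_x;
}

int main(void) {
    /* Computer A: 10 s, computer B: 15 s for the same program. */
    printf("A is %.1f times faster than B\n", n_times_faster(10.0, 15.0)); /* 1.5 */
    return 0;
}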
EC Chapter 1.77 Dept. of Comp. Arch., UMA
Measuring Execution Time
Elapsed time
Total response time = wall clock time = elapsed time
- it includes all aspects to complete a task
- Processing, I/O operations, OS overhead, idle time
Determines system performance
Productivity
Throughput: the total amount of work done in a given unit time
EC Chapter 1.78 Dept. of Comp. Arch., UMA
Measuring Execution Time
Elapsed time versus CPU time
[Timeline: the elapsed time of program P1 interleaves P1's own execution (its compiled instructions) with OS activity, I/O, and other programs (P2); the intervals where P1's instructions run on the CPU make up the user CPU time, and the OS work done on P1's behalf makes up the system CPU time]
CPU time = User CPU Time + System CPU time
EC Chapter 1.79 Dept. of Comp. Arch., UMA
Measuring Execution Time
CPU time
Time spent processing a given job
- Discounts I/O time, other jobs’ shares
Comprises user CPU time and system CPU time
Different programs are affected differently by CPU and system
performance
Example: the Unix ‘time’ command reports
90.7u 12.9s 2:39 65%
i.e., 90.7 s of user CPU time, 12.9 s of system CPU time, an elapsed
time of 2:39, and a CPU utilization of (90.7 + 12.9) / (2×60 + 39) = 65%;
the rest of the elapsed time goes to I/O and other processes.
Our goal: user CPU time + system CPU time
EC Chapter 1.80 Dept. of Comp. Arch., UMA
Review: Machine Clock Rate
Clock rate (clock cycles per second in MHz or GHz) is
inverse of clock cycle time (clock period)
CC = 1 / CR
one clock period
10 nsec clock cycle => 100 MHz clock rate
5 nsec clock cycle => 200 MHz clock rate
2 nsec clock cycle => 500 MHz clock rate
1 nsec (10-9) clock cycle => 1 GHz (109) clock rate
500 psec clock cycle => 2 GHz clock rate
250 psec clock cycle => 4 GHz clock rate
200 psec clock cycle => 5 GHz clock rate
EC Chapter 1.81 Dept. of Comp. Arch., UMA
Performance Factors
CPU execution time (CPU time) – time the CPU spends
working on a task
Does not include time waiting for I/O or running other programs
CPU execution time for a program = # CPU clock cycles for a program × clock cycle time
or
CPU execution time for a program = # CPU clock cycles for a program / clock rate
Can improve performance by increasing the clock
rate or by reducing the number of clock cycles required
for a program
● Hardware designer must often trade off clock rate against cycle
count
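Both forms of the equation, sketched in C (the cycle count and clock rate are illustrative; they match the example on the next slide):
#include <stdio.h>

/* CPU time = clock cycles * clock cycle time = clock cycles / clock rate */
int main(void) {
    double cycles     = 20e9;          /* CPU clock cycles for the program */
    double clock_rate = 2e9;           /* 2 GHz                            */
    double cycle_time = 1.0 / clock_rate;

    printf("CPU time = %.1f s (cycles x cycle time)\n", cycles * cycle_time);
    printf("CPU time = %.1f s (cycles / clock rate)\n", cycles / clock_rate);
    return 0;   /* both print 10.0 s */
}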
EC Chapter 1.82 Dept. of Comp. Arch., UMA
Improving Performance Example
A program runs on computer A with a 2 GHz clock in 10
seconds. What clock rate must computer B run at to run
this program in 6 seconds? Unfortunately, to accomplish
this, computer B will require 1.2 times as many clock
cycles as computer A to run the program.
EC Chapter 1.83 Dept. of Comp. Arch., UMA
Improving Performance Example
A program runs on computer A with a 2 GHz clock in 10
seconds. What clock rate must computer B run at to run
this program in 6 seconds? Unfortunately, to accomplish
this, computer B will require 1.2 times as many clock
cycles as computer A to run the program.
CPU time_A = CPU clock cycles_A / clock rate_A
CPU clock cycles_A = 10 sec × 2 × 10⁹ cycles/sec = 20 × 10⁹ cycles
CPU time_B = (1.2 × 20 × 10⁹ cycles) / clock rate_B
clock rate_B = (1.2 × 20 × 10⁹ cycles) / 6 seconds = 4 GHz
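The same calculation in C (a sketch that follows the slide's numbers):
#include <stdio.h>

/* Computer A: 2 GHz, 10 s. Computer B must run the program in 6 s but
 * needs 1.2x as many clock cycles. Find B's required clock rate.       */
int main(void) {
    double rate_a   = 2e9;                 /* 2 GHz                      */
    double time_a   = 10.0;                /* seconds                    */
    double cycles_a = time_a * rate_a;     /* 20e9 cycles                */
    double cycles_b = 1.2 * cycles_a;      /* 24e9 cycles                */
    double time_b   = 6.0;                 /* target, seconds            */
    double rate_b   = cycles_b / time_b;   /* required clock rate        */
    printf("clock rate B = %.1f GHz\n", rate_b / 1e9);   /* 4.0 GHz      */
    return 0;
}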
EC Chapter 1.84 Dept. of Comp. Arch., UMA
Clock Cycles per Instruction
Not all instructions take the same amount of time to
execute
One way to think about execution time is that it equals the
number of instructions executed multiplied by the average time
per instruction
# CPU clock cycles for a program = # instructions for a program × average clock cycles per instruction
Clock cycles per instruction (CPI) – the average number
of clock cycles each instruction takes to execute
A way to compare two different implementations of the same ISA
Example CPI for three instruction classes:
Instruction class:  A  B  C
CPI:                1  2  3
EC Chapter 1.85 Dept. of Comp. Arch., UMA
Using the Performance Equation
Computers A and B implement the same ISA. Computer
A has a clock cycle time of 250 ps and an effective CPI
of 2.0 for some program and computer B has a clock
cycle time of 500 ps and an effective CPI of 1.2 for the
same program. Which computer is faster and by how
much?
EC Chapter 1.86 Dept. of Comp. Arch., UMA
Using the Performance Equation
Computers A and B implement the same ISA. Computer
A has a clock cycle time of 250 ps and an effective CPI of
2.0 for some program and computer B has a clock cycle
time of 500 ps and an effective CPI of 1.2 for the same
program. Which computer is faster and by how much?
Each computer executes the same number of instructions, I,
so
CPU timeA = I x 2.0 x 250 ps = 500 x I ps
CPU timeB = I x 1.2 x 500 ps = 600 x I ps
Clearly, A is faster … by the ratio of execution times
performance_A / performance_B = execution_time_B / execution_time_A = (600 × I ps) / (500 × I ps) = 1.2
EC Chapter 1.87 Dept. of Comp. Arch., UMA
Effective (Average) CPI
Computing the overall effective CPI is done by looking at
the different types of instructions and their individual
cycle counts and averaging
Overall effective CPI = Σ (i = 1 to n) CPIᵢ × ICᵢ
Where ICi is the count (percentage) of the number of instructions
of class i executed
CPIi is the (average) number of clock cycles per instruction for
that instruction class
n is the number of instruction classes
The overall effective CPI varies by instruction mix – a
measure of the dynamic frequency of instructions across
one or many programs
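A minimal sketch of the effective-CPI sum in C (the three classes reuse the CPIs from the earlier CPI slide; the 50/20/30 mix is an assumed example, not from the slides):
#include <stdio.h>

/* Overall effective CPI = sum over classes of CPI_i * IC_i,
 * where IC_i is the fraction of executed instructions in class i. */
double effective_cpi(const double cpi[], const double frac[], int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += cpi[i] * frac[i];
    return sum;
}

int main(void) {
    double cpi[]  = {1.0, 2.0, 3.0};   /* classes A, B, C              */
    double frac[] = {0.5, 0.2, 0.3};   /* assumed instruction mix      */
    printf("effective CPI = %.2f\n", effective_cpi(cpi, frac, 3));  /* 1.80 */
    return 0;
}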
EC Chapter 1.88 Dept. of Comp. Arch., UMA
THE Performance Equation
Our basic performance equation is then
CPU time = Instruction_count x CPI x clock_cycle
or
CPU time = (Instruction_count × CPI) / clock_rate
These equations separate the three key factors that
affect performance
Can measure the CPU execution time by running the program
The clock rate is usually given
Can measure overall instruction count by using profilers/
simulators without knowing all of the implementation details
CPI varies by instruction type and ISA implementation for which
we must know the implementation details
EC Chapter 1.89 Dept. of Comp. Arch., UMA
Determinants of CPU Performance
CPU time = Instruction_count x CPI x clock_cycle
                     | Instruction_count | CPI | clock_cycle
Algorithm            |         X         |  X  |
Programming language |         X         |  X  |
Compiler             |         X         |  X  |
ISA                  |         X         |  X  |      X
Core organization    |                   |  X  |      X
Technology           |                   |     |      X
EC Chapter 1.90 Dept. of Comp. Arch., UMA
Determinants of CPU Performance
CPU time = Instruction_count x CPI x clock_cycle
EC Chapter 1.91 Dept. of Comp. Arch., UMA
A Simple Example
Op     | Freq | CPIᵢ | Freq × CPIᵢ
ALU    | 50%  |  1   |
Load   | 20%  |  5   |
Store  | 10%  |  3   |
Branch | 20%  |  2   |
Σ =    |      |      |
How much faster would the machine be if a better data cache
reduced the average load time to 2 cycles?
How does this compare with using branch prediction to shave
a cycle off the branch time?
What if two ALU instructions could be executed at once?
EC Chapter 1.92 Dept. of Comp. Arch., UMA
A Simple Example
Op     | Freq | CPIᵢ | Freq × CPIᵢ: base | faster loads | better branches | dual ALU
ALU    | 50%  |  1   |  .5   |  .5  |  .5  |  .25
Load   | 20%  |  5   |  1.0  |  .4  |  1.0 |  1.0
Store  | 10%  |  3   |  .3   |  .3  |  .3  |  .3
Branch | 20%  |  2   |  .4   |  .4  |  .2  |  .4
Σ =    |      |      |  2.2  |  1.6 |  2.0 |  1.95
How much faster would the machine be if a better data cache
reduced the average load time to 2 cycles?
CPU time new = 1.6 x IC x CC so 2.2/1.6 means 37.5% faster
How does this compare with using branch prediction to shave
a cycle off the branch time?
CPU time new = 2.0 x IC x CC so 2.2/2.0 means 10% faster
What if two ALU instructions could be executed at once?
CPU time new = 1.95 x IC x CC so 2.2/1.95 means 12.8% faster
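The whole table recomputed in C (a sketch; the dual-ALU case is modeled, as in the table, by halving the ALU CPI):
#include <stdio.h>

/* Effective CPI = sum of freq_i * CPI_i for ALU, load, store, branch. */
static double cpi(double alu, double load, double store, double branch) {
    return 0.5 * alu + 0.2 * load + 0.1 * store + 0.2 * branch;
}

int main(void) {
    double base    = cpi(1.0, 5.0, 3.0, 2.0);   /* 2.2                        */
    double cache   = cpi(1.0, 2.0, 3.0, 2.0);   /* 1.6: loads now 2 cycles    */
    double branch  = cpi(1.0, 5.0, 3.0, 1.0);   /* 2.0: branches now 1 cycle  */
    double dualalu = cpi(0.5, 5.0, 3.0, 2.0);   /* 1.95: two ALU ops at once  */

    printf("better cache:      %.1f%% faster\n", (base / cache   - 1) * 100);
    printf("branch prediction: %.1f%% faster\n", (base / branch  - 1) * 100);
    printf("dual ALU:          %.1f%% faster\n", (base / dualalu - 1) * 100);
    return 0;   /* 37.5%, 10.0%, 12.8%, as on the slide */
}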
EC Chapter 1.93 Dept. of Comp. Arch., UMA
THE Performance Equation
Our basic performance equation is then
CPU time = Instruction_count x CPI x clock_cycle
or
CPU time = (Instruction_count × CPI) / clock_rate
or
T_cpu = (NI × CPI) / F = NI × CPI × t_ck
EC Chapter 1.94 Dept. of Comp. Arch., UMA
Summary: Evaluating ISAs
Design-time metrics:
Can it be implemented, in how long, at what cost?
Can it be programmed? Ease of compilation?
Static Metrics:
How many bytes does the program occupy in memory?
Dynamic Metrics:
How many instructions are executed? How many bytes does the
processor fetch to execute the program?
How many clocks are required per instruction? CPI
How "lean" a clock is practical?
Best metric: time to execute the program!
It depends on the instruction set, the processor organization, and
compilation techniques (i.e., on instruction count, CPI, and cycle time).
EC Chapter 1.95 Dept. of Comp. Arch., UMA
Pitfall: MIPS as a Performance Metric
MIPS: Millions of Instructions Per Second
Doesn’t account for
- Differences in ISAs between computers
- Differences in complexity between instructions
MIPS = Instruction count / (Execution time × 10⁶)
     = Instruction count / ((Instruction count × CPI / Clock rate) × 10⁶)
     = Clock rate / (CPI × 10⁶)
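A small C sketch of why MIPS can mislead: with the same clock rate, the machine with the higher MIPS rating can still take longer if it must execute more instructions for the same program (all numbers are illustrative):
#include <stdio.h>

/* MIPS = clock_rate / (CPI * 1e6); execution time = IC * CPI / clock_rate. */
int main(void) {
    double clock_rate = 4e9;                    /* 4 GHz on both machines    */

    /* Machine/compiler A: fewer, more complex instructions.                 */
    double ic_a = 1e9, cpi_a = 2.0;
    /* Machine/compiler B: more, simpler instructions (higher MIPS rating).  */
    double ic_b = 3e9, cpi_b = 1.0;

    printf("A: %6.0f MIPS, time %.2f s\n",
           clock_rate / (cpi_a * 1e6), ic_a * cpi_a / clock_rate);
    printf("B: %6.0f MIPS, time %.2f s\n",
           clock_rate / (cpi_b * 1e6), ic_b * cpi_b / clock_rate);
    return 0;   /* B shows twice the MIPS yet takes 1.5x as long */
}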
CPI varies between programs on a given CPU
EC Chapter 1.96 Dept. of Comp. Arch., UMA
MFLOPS as a Performance Metric
MFLOPS: Millions of Floating Point Operations Per
Second
Doesn’t account for
- Differences in number of cycles for different operations
- It just models the Floating Point Unit (FPU)
- Only for FP applications
It is the usual metric for supercomputers (e.g., the Top500 list).
EC Chapter 1.97 Dept. of Comp. Arch., UMA
Workloads and Benchmarks
Benchmarks – a set of programs that form a “workload”
specifically chosen to measure performance
SPEC (System Performance Evaluation Cooperative)
creates standard sets of benchmarks starting with
SPEC89. The latest is SPEC CPU2017 which consists of
10 integer benchmarks (SPECspeed2017_int) and 10
floating-point benchmarks (SPECspeed2017_fp) for
measuring speed. There are also two suites for measuring
throughput (SPECrate2017_int and SPECrate2017_fp).
www.spec.org
There are also benchmark collections for power workloads
(SPECpower_ssj2008), for Cloud (SPEC Cloud IaaS
2018), for Graphics/Workstations (SPECviewperf 2020),
Virtualization (SPECvirt Datacenter 2021), Java
Client/Server …
EC Chapter 1.98 Dept. of Comp. Arch., UMA
SPEC CPU Benchmark
Programs used to measure performance
Supposedly typical of actual workload
Standard Performance Evaluation Corp (SPEC)
Develops benchmarks for CPU, I/O, Web, …
SPEC CPU2017
Elapsed time to execute a selection of programs
- Negligible I/O, so focuses on CPU performance
Normalize relative to reference machine
Summarize as geometric mean of performance ratios
- SPECspeed2017_int (integer)
– SPECspeed2017_int_base
– SPECspeed2017_int_peak
- SPECspeed2017_fp (floating-point)
– SPECspeed2017_fp_base
– SPECspeed2017_fp_peak
Overall result = n-th root of (Π from i = 1 to n of Execution time ratioᵢ)
EC Chapter 1.99 Dept. of Comp. Arch., UMA
Comparing and Summarizing Performance
How do we summarize the performance for benchmark
set with a single number?
First the execution times are normalized giving the “SPEC ratio”
(bigger is faster, i.e., SPEC ratio is the inverse of execution time)
The SPEC ratios are then “averaged” using the geometric mean
(GM)
GM = n-th root of (Π from i = 1 to n of SPEC ratioᵢ)
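A minimal C sketch of the geometric mean of SPEC ratios (the sample ratios are made up for illustration; compile with -lm):
#include <stdio.h>
#include <math.h>

/* Geometric mean = n-th root of the product of the n SPEC ratios. */
double geometric_mean(const double ratio[], int n) {
    double log_sum = 0.0;
    for (int i = 0; i < n; i++)
        log_sum += log(ratio[i]);      /* summing logs avoids overflow */
    return exp(log_sum / n);
}

int main(void) {
    double spec_ratio[] = {8.0, 2.0, 4.0};   /* assumed example ratios */
    printf("GM = %.2f\n", geometric_mean(spec_ratio, 3));   /* 4.00 */
    return 0;
}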
Guiding principle in reporting performance measurements
is reproducibility – list everything another experimenter
would need to duplicate the experiment (version of the
operating system, compiler settings, input set used,
specific computer configuration (clock rate, cache sizes
and speed, memory size and speed, etc.))
EC Chapter 1.100 Dept. of Comp. Arch., UMA
Workloads and Benchmarks SPEC CPU2017
EC Chapter 1.101 Dept. of Comp. Arch., UMA
Workloads and Benchmarks SPEC CPU2017
EC Chapter 1.102 Dept. of Comp. Arch., UMA
SPECINT2017 on a 2.20 GHz Intel Xeon E7-8890 v4
EC Chapter 1.103 Dept. of Comp. Arch., UMA
SPEC2017 Power measurements (new)
You will need additional hardware:
A power analyzer
A temperature sensor
An x86-based Linux or Windows "Controller System", where you
will run SPEC PTDaemon.
EC Chapter 1.104 Dept. of Comp. Arch., UMA
Amdahl’s Law
Pitfall: Improving an aspect of a computer and
expecting a proportional improvement in overall
performance
S = T_CPU_org / T_CPU_imp = 1 / ((1 − Fm) + Fm / Sm)
Fm = Fraction of improvement
Sm = Factor of improvement
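Amdahl's Law as a small C function, applied to the multiply example of the following slides (the function name amdahl is ours):
#include <stdio.h>

/* Amdahl's Law: overall speedup S = 1 / ((1 - Fm) + Fm / Sm), where
 * Fm = fraction of time affected and Sm = speedup of that fraction. */
double amdahl(double fm, double sm) {
    return 1.0 / ((1.0 - fm) + fm / sm);
}

int main(void) {
    /* Multiply accounts for 80 s out of 100 s, so Fm = 0.8. */
    printf("Sm = 10   -> overall %.2fx\n", amdahl(0.8, 10.0));    /* 3.57x */
    printf("Sm = 100  -> overall %.2fx\n", amdahl(0.8, 100.0));   /* 4.81x */
    printf("Sm -> inf -> overall %.2fx\n", amdahl(0.8, 1e12));    /* 5.00x only in the limit */
    return 0;
}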
EC Chapter 1.109 Dept. of Comp. Arch., UMA
Amdahl’s Law – How to compute it
Let T_cpu_unimp be the CPU time without the improvement and T_cpu_imp the CPU
time with it. Split T_cpu_unimp into t_affected (the part that the improvement
speeds up) and t_rest (the part it does not touch):
t_affected = Fm × T_cpu_unimp                                            (1)
t_rest = (1 − Fm) × T_cpu_unimp                                          (2)
T_cpu_imp = t_rest + t_affected / Sm
          = (1 − Fm) × T_cpu_unimp + (Fm / Sm) × T_cpu_unimp             (3)
S = T_cpu_unimp / T_cpu_imp = 1 / ((1 − Fm) + Fm / Sm)                   (4)
EC Chapter 1.110 Dept. of Comp. Arch., UMA
Amdahl’s Law
Pitfall: Improving an aspect of a computer and
expecting a proportional improvement in overall
performance
S = T_CPU_org / T_CPU_imp = 1 / ((1 − Fm) + Fm / Sm)
Fm = Fraction of improvement
Sm = Factor of improvement
The opportunity for improvement is affected by how
much time the modified event consumes
Corollary 1: make the common case fast
The performance enhancement possible with a given
improvement is limited by the amount the improved
feature is used
Corollary 2: the bottleneck will limit the improvement
EC Chapter 1.111 Dept. of Comp. Arch., UMA
Amdahl’s Law
Example: multiply accounts for 80s/100s
How much improvement in multiply performance to
get 5× overall?
Study the limit cases of Amdahl's Law:
Fm = 0:  S = 1 / (0 + 1) = 1
Fm = 1:  S = 1 / (1/Sm + 0) = Sm
Sm = 1:  S = 1 / (Fm + (1 − Fm)) = 1
Sm → ∞:  S = 1 / (0 + (1 − Fm)) = 1 / (1 − Fm)
EC Chapter 1.112 Dept. of Comp. Arch., UMA
Amdahl’s Law
Example: multiply accounts for 80s/100s
How much improvement in multiply performance to
get 5× overall?
To get a 5× overall speedup, total time must drop from 100 s to 20 s, but the
20 s not spent in multiply cannot be reduced:
20 = 20 + 80/n  ⟹  80/n = 0  ⟹  Can’t be done!
Study the limit cases of Amdahl's Law:
Fm = 0:  S = 1 / (0 + 1) = 1
Fm = 1:  S = 1 / (1/Sm + 0) = Sm
Sm = 1:  S = 1 / (Fm + (1 − Fm)) = 1
Sm → ∞:  S = 1 / (0 + (1 − Fm)) = 1 / (1 − Fm)
EC Chapter 1.113 Dept. of Comp. Arch., UMA
Concluding Remarks
Cost/performance is improving
Due to underlying technology development
Hierarchical layers of abstraction
In both hardware and software
Instruction set architecture
The hardware/software interface
Execution time: the best performance measure
Power is a limiting factor
Use parallelism to improve performance
EC Chapter 1.114 Dept. of Comp. Arch., UMA