Computer Organization
Chapter 1: Measuring &
Understanding Performance
Mª Antonia Trenas Castro
[email protected] [Adapted from Mary Jane Irwin’s slides (PSU) based on Computer Organization
and Design, ARM Edition, Patterson & Hennessy, © 2017, Elsevier]
EC Chapter 1.1 Dept. of Comp. Arch., UMA
The Computer Revolution
Progress in computer technology
Underpinned by Moore’s Law
Makes novel applications feasible
Computers in automobiles
Cell phones
Human genome project
World Wide Web
Search Engines
Computers are pervasive
EC Chapter 1.2 Dept. of Comp. Arch., UMA
Classes of Computers
Desktop computers
Designed to deliver good performance to a single user at low
cost, usually executing third-party software and usually incorporating a
graphics display, a keyboard, and a mouse
Servers
Used to run larger programs for multiple, simultaneous users;
typically accessed only via a network, with a greater emphasis
on dependability and (often) security
Supercomputers
A high performance, high cost class of servers with hundreds to
thousands of processors, terabytes of memory and petabytes of
storage that are used for high-end scientific and engineering
applications
Embedded computers (processors)
A computer inside another device used for running one
predetermined application
EC Chapter 1.3 Dept. of Comp. Arch., UMA
Connectivity driving computing
Source: Morgan Stanley
EC Chapter 1.4 Dept. of Comp. Arch., UMA
The Processor Market
embedded growth >> desktop growth
EC Chapter 1.5 Dept. of Comp. Arch., UMA
The Processor Market
EC Chapter 1.8 Dept. of Comp. Arch., UMA
Technology shift: ARM has cannibalized the processor market
EC Chapter 1.10 Dept. of Comp. Arch., UMA
A case of success
John Papandriopoulos, PhD from the U. Melbourne
EC Chapter 1.11 Dept. of Comp. Arch., UMA
Servers and supercomputers
Servers and supercomputing market
Google: search engine running on ARM; the Pixel 8 SoC will be based on the
Tensor TPU and ARM CPUs.
AMD Seattle Server-on-a-Chip based on Cortex-A57 (ARMv8)
Ampere One chip (192 core ARM) for Cloud servers (OCI and AWS).
Apple supports a broad shift to ARM with macOS (iPhone, iPad, SoC
M1, M2 used in laptops and desktops).
Mont Blanc, ExaNest, EUROEXA projects: supercomputers made of
ARM
- Commodity processors once took over this market; be prepared for the
moment mobile processors do the same
The 2nd-ranked system, Fugaku, installed at the RIKEN Center for
Computational Science (R-CCS) in Kobe, Japan, delivers 442 petaflops
(June 2023).
- Fugaku is powered by Fujitsu’s 48-core A64FX SoC, built on ARM cores.
- In single or further reduced precision, which is often used in machine
learning and AI applications, Fugaku’s peak performance is over 1,000
petaflops (2 exaflops).
EC Chapter 1.14 Dept. of Comp. Arch., UMA
Supercomputers
(top500, Jun 2023 )
https://www.top500.org/lists/top500/2022/06/
EC Chapter 1.15 Dept. of Comp. Arch., UMA
What You Will Learn
How programs are translated into the machine language
And how the hardware executes them
The hardware/software interface
What determines program performance
And how it can be improved
How hardware designers improve performance
What is parallel processing
EC Chapter 1.17 Dept. of Comp. Arch., UMA
Is it useful?
Marcos Fajardo wrote a rendering system, Arnold, from scratch;
it runs natively on x86 CPUs and tries to take advantage of all
available threads and SIMD lanes for optimal parallelism.
It is used in
Thor
Captain America
X-Men: First Class
The Avengers
Gravity
Guardians of the Galaxy
Star Wars: The Force Awakens
Game of Thrones
…
EC Chapter 1.18 Dept. of Comp. Arch., UMA
Understanding Performance
Algorithm
Determines number of operations executed
Programming language, compiler, architecture
Determine number of machine instructions executed
per operation
Processor and memory system
Determine how fast instructions are executed
I/O system (including OS)
Determines how fast I/O operations are executed
EC Chapter 1.19 Dept. of Comp. Arch., UMA
Below the Program
Applications software – written in a high-level language
Systems software
Hardware
System software
Operating system – supervising program that interfaces the
user’s program with the hardware (e.g., Linux, MacOS,
Windows)
- Handles basic input and output operations
- Allocates storage and memory
- Schedules tasks & Provides for protected sharing among multiple
applications
Compiler – translates programs written in a high-level language
(e.g., C, Java) into instructions that the hardware can execute
EC Chapter 1.20 Dept. of Comp. Arch., UMA
Below the Program, Con’t
High-level language program (in C)
void swap(int v[], int k)
{
   int temp;
   temp = v[k];
   v[k] = v[k+1];
   v[k+1] = temp;
}
      |  C compiler (one-to-many)
      v
Assembly language program (for MIPS)
swap: sll  $2,  $5, 2
      add  $2,  $4, $2
      lw   $15, 0($2)
      lw   $16, 4($2)
      sw   $16, 0($2)
      sw   $15, 4($2)
      jr   $31
      |  assembler (one-to-one)
      v
Machine (object, binary) code (for MIPS)
000000 00000 00101 0001000010000000
000000 00100 00010 0001000000100000
. . .
EC Chapter 1.21 Dept. of Comp. Arch., UMA
Advantages of Higher-Level Languages?
Higher-level languages
Allow the programmer to think in a more natural language, one
suited to the intended use (Fortran/C for scientific computation,
COBOL for business programming, Lisp for symbol
manipulation, Java for web programming, …)
Improve programmer productivity – more understandable
code that is easier to debug and validate
Improve program maintainability
Allow programs to be independent of the computer on which
they are developed (compilers and assemblers can translate
high-level language programs to the binary instructions of any
machine)
Emergence of optimizing compilers that produce very efficient
assembly code optimized for the target machine
As a result, very little programming is done today at
the assembler level
EC Chapter 1.22 Dept. of Comp. Arch., UMA
Disadvantages of scripting languages
Modern scripting languages are interpreted (Python,
JavaScript, PHP, Perl, Ruby …)
dynamically-typed and encourage reuse
Efficient for programmers but not for execution
Matrix Multiply: relative speedup over a Python version (18-core CPU)
EC Chapter 1.23 Dept. of Comp. Arch., UMA
Under the Covers
Same components for all kinds of computer
Five classic components of a computer – input, output, memory,
datapath, and control
datapath + control = processor (CPU)
Input/output includes
User-interface devices – display, keyboard, mouse
Storage devices – hard disk, CD/DVD, flash
Network adapters – for communicating with other computers
EC Chapter 1.26 Dept. of Comp. Arch., UMA
Evolution of processors for desktops and laptops
Intel 80286 (1982): 134×10³ transistors, 12 MHz, 68.7 mm²
Intel 80386 (1985): 275×10³ transistors, 33 MHz, 104 mm²
Intel Pentium (1993): 3.1×10⁶ transistors, 66 MHz, 264 mm²
Intel Pentium II (1997): 7.5×10⁶ transistors, 300 MHz, 209 mm²
Intel Pentium III (1999): 28×10⁶ transistors, 733 MHz, 140 mm²
Intel Pentium 4 (2000): 42×10⁶ transistors, 1.5 GHz, 224 mm²
Remarkable improvement in performance, without
considering power consumption or die size
EC Chapter 1.27 Dept. of Comp. Arch., UMA
12th Generation Intel Core Processors
Alder Lake
8 P-cores (1 GHz–5.5 GHz) + 8 E-cores (770 MHz–4.0 GHz)
GPU with 32 EU
TDP: 9 W–241 W
4–30 MB L3, 10 nm process
EC Chapter 1.28 Dept. of Comp. Arch., UMA
13th Generation Intel Core Proc.
Raptor Lake
8 P-cores (16 threads) + 16 E-cores; up to 6 GHz P-cores / 4.3 GHz E-cores
GPU with up to 96 EU
TDP: 35 W–253 W
36 MB L3, 7 nm process
EC Chapter 1.29 Dept. of Comp. Arch., UMA
Processors for desktops and laptops
Desktops (35–150 W), laptops (9–55 W) and mobiles (9–25 W):
Intel Raptor Lake U (mobile), 7 nm: 2 P + 4(8) E cores, 1.1–1.3 GHz; TDP 15 W; 8–12 MB L3 cache; Xe GPU with 48–96 EU
AMD Ryzen 7000 U, 5 nm: 4–16 cores, 1.9–2.2 GHz; TDP 9–28 W; 16 MB L3 cache; RDNA GPU with 12 CU
EC Chapter 1.31 Dept. of Comp. Arch., UMA
Processors for smartphones and IoT
Apple M2 (5 nm): 8–10 ARM cores, 20 billion transistors
CPU: 4 Firestorm cores (3.5 GHz) + 4 Icestorm cores (2.8 GHz)
Power: 1.3–13.8 W
GPU with 10 EU; Neural Engine (NE) with 16 cores
EC Chapter 1.32 Dept. of Comp. Arch., UMA
Evolution of processors for embedded systems: ARM
800x more energy efficient
500x smaller
EC Chapter 1.33 Dept. of Comp. Arch., UMA
ARM's primary CPU core families
EC Chapter 1.34 Dept. of Comp. Arch., UMA
Processors for smartphones and IoT
IoT (0.5–4 W):
ARM Cortex-A7 (28 nm): 4 cores, 1.3 GHz; 1 MB L2 cache; GPU with 2 EU
Qualcomm QCS410/QCS610 (11 nm): 4 cores (2 Gold / 2 Silver, 2.2–1.8 GHz); 2 MB L2 cache; GPU + DSP + AI engine
EC Chapter 1.35 Dept. of Comp. Arch., UMA
Technology Trends
Electronics technology continues to evolve
Increased capacity and performance
Reduced cost
Year | Technology                 | Relative performance/cost
1951 | Vacuum tube                | 1
1965 | Transistor                 | 35
1975 | Integrated circuit (IC)    | 900
1995 | Very large scale IC (VLSI) | 2,400,000
2005 | Ultra large scale IC       | 6,200,000,000
EC Chapter 1.39 Dept. of Comp. Arch., UMA
Moore’s Law
Intel’s Gordon Moore proposed Moore’s Law in 1965
Courtesy, Intel ®
EC Chapter 1.42 Dept. of Comp. Arch., UMA
Technology Scaling Road Map (ITRS)
Year                      | 2006 | 2008 | 2010 | 2012 | 2014 | 2020 | 2022
Feature size (nm)         |  65  |  45  |  32  |  22  |  14  |   5  |   4
Integration capacity (BT) |  0.3 |  0.7 |  1.1 |  1.4 |  1.9 | > 16 | > 90
Progression to a 2 nm transistor in 2024
Fun facts about 45nm transistors
30 million can fit on the head of a pin
If car prices had fallen at the same rate as the price of a
single transistor has since 1968, a new car today would cost
about 1 cent
The highest transistor count in a microprocessor is
134 billion transistors, in Apple's ARM-based M2 Ultra,
which is fabricated using a 5 nm process.
EC Chapter 1.43 Dept. of Comp. Arch., UMA
Moore’s Law Slowdown
EC Chapter 1.45 Dept. of Comp. Arch., UMA
Technology problems
The number of manufacturers is decreasing
EC Chapter 1.46 Dept. of Comp. Arch., UMA
Technology problems
The number of manufacturers is decreasing
EC Chapter 1.47 Dept. of Comp. Arch., UMA
Situation after 2016
More than Moore era
Technology scaling may end around the middle of the
next decade
2 nm may be the final node
- distance between silicon atoms: 0.5 nm
- size of the atom: 0.24 nm
Experts expect Moore's law will end by around 2025.
Continuation of technological progress in a variety of other areas:
new chip architectures, quantum computing, and AI and machine
learning.
EC Chapter 1.48 Dept. of Comp. Arch., UMA
End of Growth of Processor Performance Speed?
Dark Silicon and the end
of Multicore Scaling
EC Chapter 1.49 Dept. of Comp. Arch., UMA
The Sea Change: the switch to multiprocessors
Uniprocessor Performance
Constrained by power, instruction-level parallelism,
memory latency
EC Chapter 1.50 Dept. of Comp. Arch., UMA
Technology & Power: Dennard Scaling
Dennard Scaling: as transistors get smaller, their power
density stays constant, so that power use stays in
proportion to area. This is no longer true.
EC Chapter 1.51 Dept. of Comp. Arch., UMA
What Happened to Clock Rates and Why?
For reference: my hot plate dissipates 2000 W over 400 cm² = 5 W/cm²; my microprocessor dissipates 120 W over 4 cm² = 30 W/cm².
[Chart: power density (W/cm², log scale) vs. feature size (nm) for i386, i486, Pentium, Pentium Pro, Pentium II, Pentium III, Pentium 4 (2000), and Intel Core Duo (2006), with reference levels for a hot plate, a nuclear reactor, a rocket nozzle, and the Sun's surface]
“New Microarchitecture Challenges in the Coming Generations of CMOS Process
Technologies” – Fred Pollack, Intel. Micro32 Conf. keynote, 1999.
EC Chapter 1.52 Dept. of Comp. Arch., UMA
But What Happened to Clock Rates and Why?
Clock rates hit a “power wall”
[Chart: clock rate and power (Watts) across processor generations]
EC Chapter 1.53 Dept. of Comp. Arch., UMA
Power Trends
In CMOS IC technology
Power = Capacitive load × Voltage² × Frequency
Over this period, power grew by ×30 while clock frequency grew by ×1000.
How could clock rates grow by a factor of 1000 while
power grew by only a factor of 30?
The supply voltage dropped from 5 V to 1 V.
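As a rough sanity check of that question, here is a sketch in C; the capacitive-load ratio it derives is only what these assumed growth factors imply, not a quoted figure.
#include <stdio.h>

/* Dynamic power in CMOS: P = C * V^2 * F (back-of-envelope sketch). */
int main(void) {
    double freq_growth  = 1000.0;      /* clock rate grew about x1000        */
    double volt_ratio   = 1.0 / 5.0;   /* supply voltage dropped 5 V -> 1 V  */
    double power_growth = 30.0;        /* observed power growth about x30    */

    /* P_new/P_old = (C_new/C_old) * volt_ratio^2 * freq_growth, so the      */
    /* capacitive-load ratio consistent with these numbers is:               */
    double cap_ratio = power_growth / (volt_ratio * volt_ratio * freq_growth);
    printf("implied capacitive-load ratio: %.2f\n", cap_ratio);  /* ~0.75 */
    return 0;
}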
EC Chapter 1.55 Dept. of Comp. Arch., UMA
Reducing Power
Suppose a new CPU has
85% of capacitive load of old CPU
15% voltage and 15% frequency reduction
P_old = C_old × V_old² × F_old
P_new = (C_old × 0.85) × (V_old × 0.85)² × (F_old × 0.85)
P_old / P_new = 1 / 0.85⁴ ≈ 1.92
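The same calculation as a tiny C program (the 0.85 factors come straight from the slide):
#include <stdio.h>

/* Reducing Power example: the new CPU has 85% of the capacitive load and a
 * 15% reduction in both voltage and frequency, so every factor scales by 0.85. */
int main(void) {
    double scale   = 0.85;
    double p_ratio = scale * scale * scale * scale;   /* P_new / P_old = 0.85^4 */
    printf("P_new / P_old = %.3f (old/new = %.2f)\n", p_ratio, 1.0 / p_ratio);
    return 0;   /* prints ~0.522 and ~1.92, matching the slide */
}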
The power wall
We can’t reduce voltage further
We can’t remove more heat
How else can we improve performance?
EC Chapter 1.56 Dept. of Comp. Arch., UMA
Fallacy: Low Power at Idle
X4 power benchmark
- At 100% load: 295W
- At 50% load: 246W (83%)
- At 10% load: 180W (61%)
Although dynamic power is the primary source of power dissipation
in CMOS, static power dissipation occurs due to leakage current
that flows even when a transistor is off. Leakage was responsible for
about 40% of total power consumption (in 2008). Thus, increasing the
number of transistors increases power dissipation, even if the
transistors are always off.
Google data center
- Mostly operates at 10% – 50% load
- At 100% load less than 1% of the time
However even servers that are only 10% utilized burn about 2/3 of
their peak (100% load) power
Consider designing processors to make power proportional
to load
EC Chapter 1.57 Dept. of Comp. Arch., UMA
The sea change: multiprocessors
42 Years of Microprocessor Trend Data (Feb. 15, 2018)
EC Chapter 1.59 Dept. of Comp. Arch., UMA
The Sea Change: DSAs
The power challenge has forced a change in the design of
microprocessors
As of 2006, all desktop and server companies ship
microprocessors with multiple processors – cores – per chip, often
alongside other computing devices
2014: Domain Specific Architectures: Achieve higher efficiency by
tailoring the architecture to characteristics of the domain
Product        | AMD Ryzen 7000 | Intel Raptor Lake     | Qualcomm QCS2290 | Espressif ESP32-C3 (RISC-V)
Cores per chip | 4-16 + GPU     | 8+16 + GPU            | 4 + GPU + DSP    | 1/2
Clock rate     | 2.0-5.4 GHz    | 1.1-5.1 / 0.8-3.9 GHz | 2.0 GHz          | 160 MHz
Power          | ~15-75 W       | ~15-253 W             | 9 W              | < 1 W (5 µW)
EC Chapter 1.60 Dept. of Comp. Arch., UMA
The sea change: DSAs
Multicore microprocessors and DSAs
More than one processor per chip and different programming
model
Requires explicitly parallel programming
Compare with instruction level parallelism
- Hardware executes multiple instructions at once
- Hidden from the programmer
Hard to do
- Programming for performance for different devices
- Load balancing
- Optimizing communication and synchronization
EC Chapter 1.61 Dept. of Comp. Arch., UMA
Why DSAs Can Win (no magic)?
Tailor the Architecture to the Domain
More effective parallelism for a specific domain:
SIMD vs. MIMD
VLIW vs. Speculative, out-of-order
More effective use of memory bandwidth
User controlled versus caches
Eliminate unneeded accuracy
IEEE replaced by lower precision FP
32-64 bit integers to 8-16 bit integers
Domain specific programming language provides path for
software
EC Chapter 1.62 Dept. of Comp. Arch., UMA
DSA example: Tensor Processing Unit (TPU)
Perf/Watt TPU vs CPU & GPU
EC Chapter 1.63 Dept. of Comp. Arch., UMA
Performance
EC Chapter 1.73 Dept. of Comp. Arch., UMA
Performance Metrics
Purchasing perspective
given a collection of machines, which has the
- best performance ?
- least cost ?
- best cost/performance?
Design perspective
faced with design options, which has the
- best performance improvement ?
- least cost ?
- best cost/performance?
Both require
basis for comparison
metric for evaluation
Our goal is to understand what factors in the architecture
contribute to overall system performance and the relative
importance (and cost) of these factors
EC Chapter 1.74 Dept. of Comp. Arch., UMA
Throughput versus Response Time
Response time (execution time) – the time between the
start and the completion of a task
Important to individual users
Throughput (bandwidth) – the total amount of work done
in a given unit time
Important to data center managers
Will need different performance metrics as well as a
different set of applications to benchmark embedded and
desktop computers, which are more focused on response
time, versus servers, which are more focused on
throughput
How are response time and throughput affected by
● Replacing the processor with a faster version?
● Adding more processors?
We’ll focus on response time for now…
EC Chapter 1.75 Dept. of Comp. Arch., UMA
Defining (Speed) Performance
To maximize performance, need to minimize execution
time
performanceX = 1 / execution_timeX
If X is n times faster than Y, then
performance_X / performance_Y = execution_time_Y / execution_time_X = n
Decreasing response time almost always improves
throughput
EC Chapter 1.76 Dept. of Comp. Arch., UMA
Relative Performance Example
If computer A runs a program in 10 seconds and
computer B runs the same program in 15 seconds, how
much faster is A than B?
We know that A is n times faster than B if
performance_A / performance_B = execution_time_B / execution_time_A = n
The performance ratio is 15 / 10 = 1.5
So A is 1.5 times faster than B
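The same comparison as a small C helper (a sketch; the function name n_times_faster is ours, not from the slides):
#include <stdio.h>

/* "X is n times faster than Y" means perf_X / perf_Y = time_Y / time_X = n. */
double n_times_faster(double time_x, double time_y) {
    return time_y / time_x;
}

int main(void) {
    /* Computer A: 10 s, computer B: 15 s for the same program. */
    printf("A is %.1f times faster than B\n", n_times_faster(10.0, 15.0)); /* 1.5 */
    return 0;
}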
EC Chapter 1.77 Dept. of Comp. Arch., UMA
Measuring Execution Time
Elapsed time
Total response time = wall clock time = elapsed time
- it includes all aspects to complete a task
- Processing, I/O operations, OS overhead, idle time
Determines system performance
Productivity
Throughput: the total amount of work done in a given unit time
EC Chapter 1.78 Dept. of Comp. Arch., UMA
Measuring Execution Time
Elapsed time versus CPU time
[Timeline: the elapsed time of program P1 interleaves P1's own execution (its compiled instructions) with OS activity, I/O, and other programs (P2); the intervals where P1's instructions run on the CPU make up the user CPU time, and the OS work done on P1's behalf makes up the system CPU time]
CPU time = User CPU Time + System CPU time
EC Chapter 1.79 Dept. of Comp. Arch., UMA
Measuring Execution Time
CPU time
Time spent processing a given job
- Discounts I/O time, other jobs’ shares
Comprises user CPU time and system CPU time
Different programs are affected differently by CPU and system
performance
Example: the Unix ‘time’ command reports
90.7u 12.9s 2:39 65%
i.e., 90.7 s of user CPU time, 12.9 s of system CPU time, an elapsed
time of 2:39, and a CPU utilization of (90.7 + 12.9) / (2×60 + 39) = 65%;
the rest of the elapsed time goes to I/O and other processes.
Our goal: user CPU time + system CPU time
EC Chapter 1.80 Dept. of Comp. Arch., UMA
Review: Machine Clock Rate
Clock rate (clock cycles per second in MHz or GHz) is
inverse of clock cycle time (clock period)
CC = 1 / CR
one clock period
10 nsec clock cycle => 100 MHz clock rate
5 nsec clock cycle => 200 MHz clock rate
2 nsec clock cycle => 500 MHz clock rate
1 nsec (10-9) clock cycle => 1 GHz (109) clock rate
500 psec clock cycle => 2 GHz clock rate
250 psec clock cycle => 4 GHz clock rate
200 psec clock cycle => 5 GHz clock rate
EC Chapter 1.81 Dept. of Comp. Arch., UMA
Performance Factors
CPU execution time (CPU time) – time the CPU spends
working on a task
Does not include time waiting for I/O or running other programs
CPU execution time for a program = # CPU clock cycles for a program × clock cycle time
or
CPU execution time for a program = # CPU clock cycles for a program / clock rate
Can improve performance by increasing the clock
rate or by reducing the number of clock cycles required
for a program
● Hardware designer must often trade off clock rate against cycle
count
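Both forms of the equation, sketched in C (the cycle count and clock rate are illustrative; they match the example on the next slide):
#include <stdio.h>

/* CPU time = clock cycles * clock cycle time = clock cycles / clock rate */
int main(void) {
    double cycles     = 20e9;          /* CPU clock cycles for the program */
    double clock_rate = 2e9;           /* 2 GHz                            */
    double cycle_time = 1.0 / clock_rate;

    printf("CPU time = %.1f s (cycles x cycle time)\n", cycles * cycle_time);
    printf("CPU time = %.1f s (cycles / clock rate)\n", cycles / clock_rate);
    return 0;   /* both print 10.0 s */
}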
EC Chapter 1.82 Dept. of Comp. Arch., UMA
Improving Performance Example
A program runs on computer A with a 2 GHz clock in 10
seconds. What clock rate must computer B run at to run
this program in 6 seconds? Unfortunately, to accomplish
this, computer B will require 1.2 times as many clock
cycles as computer A to run the program.
EC Chapter 1.83 Dept. of Comp. Arch., UMA
Improving Performance Example
A program runs on computer A with a 2 GHz clock in 10
seconds. What clock rate must computer B run at to run
this program in 6 seconds? Unfortunately, to accomplish
this, computer B will require 1.2 times as many clock
cycles as computer A to run the program.
CPU time_A = CPU clock cycles_A / clock rate_A
CPU clock cycles_A = 10 sec × 2 × 10⁹ cycles/sec = 20 × 10⁹ cycles
CPU time_B = (1.2 × 20 × 10⁹ cycles) / clock rate_B
clock rate_B = (1.2 × 20 × 10⁹ cycles) / 6 seconds = 4 GHz
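The same calculation in C (a sketch that follows the slide's numbers):
#include <stdio.h>

/* Computer A: 2 GHz, 10 s. Computer B must run the program in 6 s but
 * needs 1.2x as many clock cycles. Find B's required clock rate.       */
int main(void) {
    double rate_a   = 2e9;                 /* 2 GHz                      */
    double time_a   = 10.0;                /* seconds                    */
    double cycles_a = time_a * rate_a;     /* 20e9 cycles                */
    double cycles_b = 1.2 * cycles_a;      /* 24e9 cycles                */
    double time_b   = 6.0;                 /* target, seconds            */
    double rate_b   = cycles_b / time_b;   /* required clock rate        */
    printf("clock rate B = %.1f GHz\n", rate_b / 1e9);   /* 4.0 GHz      */
    return 0;
}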
EC Chapter 1.84 Dept. of Comp. Arch., UMA
Clock Cycles per Instruction
Not all instructions take the same amount of time to
execute
One way to think about execution time is that it equals the
number of instructions executed multiplied by the average time
per instruction
# CPU clock cycles for a program = # instructions for a program × average clock cycles per instruction
Clock cycles per instruction (CPI) – the average number
of clock cycles each instruction takes to execute
A way to compare two different implementations of the same ISA
Example CPI for three instruction classes:
Instruction class:  A  B  C
CPI:                1  2  3
EC Chapter 1.85 Dept. of Comp. Arch., UMA
Using the Performance Equation
Computers A and B implement the same ISA. Computer
A has a clock cycle time of 250 ps and an effective CPI
of 2.0 for some program and computer B has a clock
cycle time of 500 ps and an effective CPI of 1.2 for the
same program. Which computer is faster and by how
much?
EC Chapter 1.86 Dept. of Comp. Arch., UMA
Using the Performance Equation
Computers A and B implement the same ISA. Computer
A has a clock cycle time of 250 ps and an effective CPI of
2.0 for some program and computer B has a clock cycle
time of 500 ps and an effective CPI of 1.2 for the same
program. Which computer is faster and by how much?
Each computer executes the same number of instructions, I,
so
CPU timeA = I x 2.0 x 250 ps = 500 x I ps
CPU timeB = I x 1.2 x 500 ps = 600 x I ps
Clearly, A is faster … by the ratio of execution times
performance_A / performance_B = execution_time_B / execution_time_A = (600 × I ps) / (500 × I ps) = 1.2
EC Chapter 1.87 Dept. of Comp. Arch., UMA
Effective (Average) CPI
Computing the overall effective CPI is done by looking at
the different types of instructions and their individual
cycle counts and averaging
Overall effective CPI = Σ (i = 1 to n) CPIᵢ × ICᵢ
Where ICi is the count (percentage) of the number of instructions
of class i executed
CPIi is the (average) number of clock cycles per instruction for
that instruction class
n is the number of instruction classes
The overall effective CPI varies by instruction mix – a
measure of the dynamic frequency of instructions across
one or many programs
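A minimal sketch of the effective-CPI sum in C (the three classes reuse the CPIs from the earlier CPI slide; the 50/20/30 mix is an assumed example, not from the slides):
#include <stdio.h>

/* Overall effective CPI = sum over classes of CPI_i * IC_i,
 * where IC_i is the fraction of executed instructions in class i. */
double effective_cpi(const double cpi[], const double frac[], int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += cpi[i] * frac[i];
    return sum;
}

int main(void) {
    double cpi[]  = {1.0, 2.0, 3.0};   /* classes A, B, C              */
    double frac[] = {0.5, 0.2, 0.3};   /* assumed instruction mix      */
    printf("effective CPI = %.2f\n", effective_cpi(cpi, frac, 3));  /* 1.80 */
    return 0;
}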
EC Chapter 1.88 Dept. of Comp. Arch., UMA
THE Performance Equation
Our basic performance equation is then
CPU time = Instruction_count x CPI x clock_cycle
or
CPU time = (Instruction_count × CPI) / clock_rate
These equations separate the three key factors that
affect performance
Can measure the CPU execution time by running the program
The clock rate is usually given
Can measure overall instruction count by using profilers/
simulators without knowing all of the implementation details
CPI varies by instruction type and ISA implementation for which
we must know the implementation details
EC Chapter 1.89 Dept. of Comp. Arch., UMA
Determinants of CPU Performance
CPU time = Instruction_count x CPI x clock_cycle
                     | Instruction_count | CPI | clock_cycle
Algorithm            |         X         |  X  |
Programming language |         X         |  X  |
Compiler             |         X         |  X  |
ISA                  |         X         |  X  |      X
Core organization    |                   |  X  |      X
Technology           |                   |     |      X
EC Chapter 1.90 Dept. of Comp. Arch., UMA
Determinants of CPU Performance
CPU time = Instruction_count x CPI x clock_cycle
EC Chapter 1.91 Dept. of Comp. Arch., UMA
A Simple Example
Op     | Freq | CPIᵢ | Freq × CPIᵢ
ALU    | 50%  |  1   |
Load   | 20%  |  5   |
Store  | 10%  |  3   |
Branch | 20%  |  2   |
Σ =    |      |      |
How much faster would the machine be if a better data cache
reduced the average load time to 2 cycles?
How does this compare with using branch prediction to shave
a cycle off the branch time?
What if two ALU instructions could be executed at once?
EC Chapter 1.92 Dept. of Comp. Arch., UMA
A Simple Example
Op     | Freq | CPIᵢ | Freq × CPIᵢ: base | faster loads | better branches | dual ALU
ALU    | 50%  |  1   |  .5   |  .5  |  .5  |  .25
Load   | 20%  |  5   |  1.0  |  .4  |  1.0 |  1.0
Store  | 10%  |  3   |  .3   |  .3  |  .3  |  .3
Branch | 20%  |  2   |  .4   |  .4  |  .2  |  .4
Σ =    |      |      |  2.2  |  1.6 |  2.0 |  1.95
How much faster would the machine be if a better data cache
reduced the average load time to 2 cycles?
CPU time new = 1.6 x IC x CC so 2.2/1.6 means 37.5% faster
How does this compare with using branch prediction to shave
a cycle off the branch time?
CPU time new = 2.0 x IC x CC so 2.2/2.0 means 10% faster
What if two ALU instructions could be executed at once?
CPU time new = 1.95 x IC x CC so 2.2/1.95 means 12.8% faster
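The whole table recomputed in C (a sketch; the dual-ALU case is modeled, as in the table, by halving the ALU CPI):
#include <stdio.h>

/* Effective CPI = sum of freq_i * CPI_i for ALU, load, store, branch. */
static double cpi(double alu, double load, double store, double branch) {
    return 0.5 * alu + 0.2 * load + 0.1 * store + 0.2 * branch;
}

int main(void) {
    double base    = cpi(1.0, 5.0, 3.0, 2.0);   /* 2.2                        */
    double cache   = cpi(1.0, 2.0, 3.0, 2.0);   /* 1.6: loads now 2 cycles    */
    double branch  = cpi(1.0, 5.0, 3.0, 1.0);   /* 2.0: branches now 1 cycle  */
    double dualalu = cpi(0.5, 5.0, 3.0, 2.0);   /* 1.95: two ALU ops at once  */

    printf("better cache:      %.1f%% faster\n", (base / cache   - 1) * 100);
    printf("branch prediction: %.1f%% faster\n", (base / branch  - 1) * 100);
    printf("dual ALU:          %.1f%% faster\n", (base / dualalu - 1) * 100);
    return 0;   /* 37.5%, 10.0%, 12.8%, as on the slide */
}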
EC Chapter 1.93 Dept. of Comp. Arch., UMA
THE Performance Equation
Our basic performance equation is then
CPU time = Instruction_count x CPI x clock_cycle
or
CPU time = (Instruction_count × CPI) / clock_rate
or
T_cpu = (NI × CPI) / F = NI × CPI × t_ck
EC Chapter 1.94 Dept. of Comp. Arch., UMA
Summary: Evaluating ISAs
Design-time metrics:
Can it be implemented, in how long, at what cost?
Can it be programmed? Ease of compilation?
Static Metrics:
How many bytes does the program occupy in memory?
Dynamic Metrics:
How many instructions are executed? How many bytes does the
processor fetch to execute the program?
How many clocks are required per instruction? CPI
How "lean" a clock is practical?
Best metric: time to execute the program!
It depends on the instruction set, the processor organization, and
compilation techniques (i.e., on instruction count, CPI, and cycle time).
EC Chapter 1.95 Dept. of Comp. Arch., UMA
Pitfall: MIPS as a Performance Metric
MIPS: Millions of Instructions Per Second
Doesn’t account for
- Differences in ISAs between computers
- Differences in complexity between instructions
MIPS = Instruction count / (Execution time × 10⁶)
     = Instruction count / ((Instruction count × CPI / Clock rate) × 10⁶)
     = Clock rate / (CPI × 10⁶)
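A small C sketch of why MIPS can mislead: with the same clock rate, the machine with the higher MIPS rating can still take longer if it must execute more instructions for the same program (all numbers are illustrative):
#include <stdio.h>

/* MIPS = clock_rate / (CPI * 1e6); execution time = IC * CPI / clock_rate. */
int main(void) {
    double clock_rate = 4e9;                    /* 4 GHz on both machines    */

    /* Machine/compiler A: fewer, more complex instructions.                 */
    double ic_a = 1e9, cpi_a = 2.0;
    /* Machine/compiler B: more, simpler instructions (higher MIPS rating).  */
    double ic_b = 3e9, cpi_b = 1.0;

    printf("A: %6.0f MIPS, time %.2f s\n",
           clock_rate / (cpi_a * 1e6), ic_a * cpi_a / clock_rate);
    printf("B: %6.0f MIPS, time %.2f s\n",
           clock_rate / (cpi_b * 1e6), ic_b * cpi_b / clock_rate);
    return 0;   /* B shows twice the MIPS yet takes 1.5x as long */
}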
CPI varies between programs on a given CPU
EC Chapter 1.96 Dept. of Comp. Arch., UMA
MFLOPS as a Performance Metric
MFLOPS: Millions of Floating Point Operations Per
Second
Doesn’t account for
- Differences in number of cycles for different operations
- It just models the Floating Point Unit (FPU)
- Only for FP applications
It is the usual metric for supercomputers (e.g., the Top500 list).
EC Chapter 1.97 Dept. of Comp. Arch., UMA
Workloads and Benchmarks
Benchmarks – a set of programs that form a “workload”
specifically chosen to measure performance
SPEC (System Performance Evaluation Cooperative)
creates standard sets of benchmarks starting with
SPEC89. The latest is SPEC CPU2017 which consists of
10 integer benchmarks (SPECspeed2017_int) and 10
floating-point benchmarks (SPECspeed2017_fp) for
measuring speed. There are also two suites for measuring
throughput (SPECrate2017_int and SPECrate2017_fp).
www.spec.org
There are also benchmark collections for power workloads
(SPECpower_ssj2008), for Cloud (SPEC Cloud IaaS
2018), for Graphics/Workstations (SPECviewperf 2020),
Virtualization (SPECvirt Datacenter 2021), Java
Client/Server …
EC Chapter 1.98 Dept. of Comp. Arch., UMA
SPEC CPU Benchmark
Programs used to measure performance
Supposedly typical of actual workload
Standard Performance Evaluation Corp (SPEC)
Develops benchmarks for CPU, I/O, Web, …
SPEC CPU2017
Elapsed time to execute a selection of programs
- Negligible I/O, so focuses on CPU performance
Normalize relative to reference machine
Summarize as geometric mean of performance ratios
- SPECspeed2017_int (integer)
– SPECspeed2017_int_base
– SPECspeed2017_int_peak
- SPECspeed2017_fp (floating-point)
– SPECspeed2017_fp_base
– SPECspeed2017_fp_peak
Overall result = n-th root of (Π from i = 1 to n of Execution time ratioᵢ)
EC Chapter 1.99 Dept. of Comp. Arch., UMA
Comparing and Summarizing Performance
How do we summarize the performance for benchmark
set with a single number?
First the execution times are normalized giving the “SPEC ratio”
(bigger is faster, i.e., SPEC ratio is the inverse of execution time)
The SPEC ratios are then “averaged” using the geometric mean
(GM)
GM = n-th root of (Π from i = 1 to n of SPEC ratioᵢ)
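A minimal C sketch of the geometric mean of SPEC ratios (the sample ratios are made up for illustration; compile with -lm):
#include <stdio.h>
#include <math.h>

/* Geometric mean = n-th root of the product of the n SPEC ratios. */
double geometric_mean(const double ratio[], int n) {
    double log_sum = 0.0;
    for (int i = 0; i < n; i++)
        log_sum += log(ratio[i]);      /* summing logs avoids overflow */
    return exp(log_sum / n);
}

int main(void) {
    double spec_ratio[] = {8.0, 2.0, 4.0};   /* assumed example ratios */
    printf("GM = %.2f\n", geometric_mean(spec_ratio, 3));   /* 4.00 */
    return 0;
}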
Guiding principle in reporting performance measurements
is reproducibility – list everything another experimenter
would need to duplicate the experiment (version of the
operating system, compiler settings, input set used,
specific computer configuration (clock rate, cache sizes
and speed, memory size and speed, etc.))
EC Chapter 1.100 Dept. of Comp. Arch., UMA
Workloads and Benchmarks SPEC CPU2017
EC Chapter 1.101 Dept. of Comp. Arch., UMA
Workloads and Benchmarks SPEC CPU2017
EC Chapter 1.102 Dept. of Comp. Arch., UMA
SPECINT2017 on a 2.20 GHz Intel Xeon E7-8890 v4
EC Chapter 1.103 Dept. of Comp. Arch., UMA
SPEC2017 Power measurements (new)
You will need additional hardware:
A power analyzer
A temperature sensor
An x86-based Linux or Windows "Controller System", where you
will run SPEC PTDaemon.
EC Chapter 1.104 Dept. of Comp. Arch., UMA
Amdahl’s Law
Pitfall: Improving an aspect of a computer and
expecting a proportional improvement in overall
performance
S = T_CPU_org / T_CPU_imp = 1 / ((1 − Fm) + Fm / Sm)
Fm = Fraction of improvement
Sm = Factor of improvement
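Amdahl's Law as a small C function, applied to the multiply example of the following slides (the function name amdahl is ours):
#include <stdio.h>

/* Amdahl's Law: overall speedup S = 1 / ((1 - Fm) + Fm / Sm), where
 * Fm = fraction of time affected and Sm = speedup of that fraction. */
double amdahl(double fm, double sm) {
    return 1.0 / ((1.0 - fm) + fm / sm);
}

int main(void) {
    /* Multiply accounts for 80 s out of 100 s, so Fm = 0.8. */
    printf("Sm = 10   -> overall %.2fx\n", amdahl(0.8, 10.0));    /* 3.57x */
    printf("Sm = 100  -> overall %.2fx\n", amdahl(0.8, 100.0));   /* 4.81x */
    printf("Sm -> inf -> overall %.2fx\n", amdahl(0.8, 1e12));    /* 5.00x only in the limit */
    return 0;
}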
EC Chapter 1.109 Dept. of Comp. Arch., UMA
Amdahl’s Law – How to compute it
Let T_cpu_unimp be the CPU time without the improvement and T_cpu_imp the CPU
time with it. Split T_cpu_unimp into t_affected (the part that the improvement
speeds up) and t_rest (the part it does not touch):
t_affected = Fm × T_cpu_unimp                                            (1)
t_rest = (1 − Fm) × T_cpu_unimp                                          (2)
T_cpu_imp = t_rest + t_affected / Sm
          = (1 − Fm) × T_cpu_unimp + (Fm / Sm) × T_cpu_unimp             (3)
S = T_cpu_unimp / T_cpu_imp = 1 / ((1 − Fm) + Fm / Sm)                   (4)
EC Chapter 1.110 Dept. of Comp. Arch., UMA
Amdahl’s Law
Pitfall: Improving an aspect of a computer and
expecting a proportional improvement in overall
performance
S = T_CPU_org / T_CPU_imp = 1 / ((1 − Fm) + Fm / Sm)
Fm = Fraction of improvement
Sm = Factor of improvement
The opportunity for improvement is affected by how
much time the modified event consumes
Corollary 1: make the common case fast
The performance enhancement possible with a given
improvement is limited by the amount the improved
feature is used
Corollary 2: the bottleneck will limit the improvement
EC Chapter 1.111 Dept. of Comp. Arch., UMA
Amdahl’s Law
Example: multiply accounts for 80s/100s
How much improvement in multiply performance to
get 5× overall?
Study the limit cases of Amdahl's Law:
Fm = 0:  S = 1 / (0 + 1) = 1
Fm = 1:  S = 1 / (1/Sm + 0) = Sm
Sm = 1:  S = 1 / (Fm + (1 − Fm)) = 1
Sm → ∞:  S = 1 / (0 + (1 − Fm)) = 1 / (1 − Fm)
EC Chapter 1.112 Dept. of Comp. Arch., UMA
Amdahl’s Law
Example: multiply accounts for 80s/100s
How much improvement in multiply performance to
get 5× overall?
To get a 5× overall speedup, total time must drop from 100 s to 20 s, but the
20 s not spent in multiply cannot be reduced:
20 = 20 + 80/n  ⟹  80/n = 0  ⟹  Can’t be done!
Study the limit cases of Amdahl's Law:
Fm = 0:  S = 1 / (0 + 1) = 1
Fm = 1:  S = 1 / (1/Sm + 0) = Sm
Sm = 1:  S = 1 / (Fm + (1 − Fm)) = 1
Sm → ∞:  S = 1 / (0 + (1 − Fm)) = 1 / (1 − Fm)
EC Chapter 1.113 Dept. of Comp. Arch., UMA
Concluding Remarks
Cost/performance is improving
Due to underlying technology development
Hierarchical layers of abstraction
In both hardware and software
Instruction set architecture
The hardware/software interface
Execution time: the best performance measure
Power is a limiting factor
Use parallelism to improve performance
EC Chapter 1.114 Dept. of Comp. Arch., UMA