Lecture 2: Performance
CMPS 221 – Computer Organization and Design
Slides by Mahmoud Bdeir and Izzat El Hajj
Clocks
• A computer is driven by a clock that determines
when events take place
• A clock cycle is a discrete time interval between two
pulses of an oscillator
• A clock period is the duration of a clock cycle
• The clock rate or frequency is the number of clock
cycles per second (inverse of the clock period)
• Example: the Intel Core i7-8700K has a clock rate of 3.7GHz.
What is its clock period?
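As a quick check of the example above, the clock period is the reciprocal of the clock rate. The sketch below works this out for 3.7 GHz; the function name `clock_period_seconds` is our own and does not come from the slides.

```python
def clock_period_seconds(clock_rate_hz: float) -> float:
    """Return the clock period in seconds for a clock rate given in hertz."""
    return 1.0 / clock_rate_hz


# Clock rate from the example: 3.7 GHz = 3.7e9 cycles per second
period_s = clock_period_seconds(3.7e9)
period_ns = period_s * 1e9  # convert seconds to nanoseconds

print(f"{period_ns:.3f} ns")  # about 0.270 ns
```

So each clock cycle lasts roughly 0.27 nanoseconds.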
Which has better performance?
• CPU1: 2.4 GHz
• CPU2: 3.8 GHz
Trick question!
Which has better performance: a car, a bus, or a bike?
It depends on the performance metric we care about
• If we care about minimizing the time to transport one person
from one place to another (i.e., execution time): the car
• If we care about maximizing the number of people we can
transport in a certain amount of time (i.e., throughput): the bus
• If we care about minimizing the energy it takes
to transport people (i.e., energy efficiency): bikes
Which has better execution time?
• CPU1: 2.4 GHz
• CPU2: 3.8 GHz
Still a trick question!
Components of Execution Time
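$$\frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Seconds}}{\text{Instruction}}$$
$$\frac{\text{Seconds}}{\text{Program}} = \underbrace{\frac{\text{Instructions}}{\text{Program}}}_{\text{Instruction Count}} \times \underbrace{\frac{\text{Clock Cycles}}{\text{Instruction}}}_{\text{CPI}} \times \underbrace{\frac{\text{Seconds}}{\text{Clock Cycle}}}_{(\text{Clock Rate})^{-1}}$$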
Which has better execution time?
       Instruction Count   CPI   Clock Rate
CPU1   ?                   ?     2.4 GHz
CPU2   ?                   ?     3.8 GHz
Cannot decide based on just the clock rate
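To see why the clock rate alone is not enough, here is a minimal sketch that plugs made-up numbers into the execution time formula. The instruction counts and CPI values are hypothetical (the slides leave them as "?"), chosen only to show that the CPU with the lower clock rate can still finish first.

```python
def execution_time(instruction_count: float, cpi: float, clock_rate_hz: float) -> float:
    """Seconds per program = instruction count x CPI x (1 / clock rate)."""
    return instruction_count * cpi / clock_rate_hz


# Hypothetical instruction counts and CPIs; only the clock rates come from the slide.
cpu1_time = execution_time(instruction_count=1e9, cpi=1.2, clock_rate_hz=2.4e9)
cpu2_time = execution_time(instruction_count=1e9, cpi=2.0, clock_rate_hz=3.8e9)

print(f"CPU1: {cpu1_time:.3f} s")
print(f"CPU2: {cpu2_time:.3f} s")
print("CPU1 is faster" if cpu1_time < cpu2_time else "CPU2 is faster")
```

With these assumed values CPU1 takes 0.5 s and CPU2 slightly longer, even though CPU2 has the higher clock rate. Different assumed values could just as easily favor CPU2.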
Improving a CPU’s Execution Time
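$$\underbrace{\frac{\text{Instructions}}{\text{Program}}}_{\text{Instruction Count}} \times \underbrace{\frac{\text{Clock Cycles}}{\text{Instruction}}}_{\text{CPI}} \times \underbrace{\frac{\text{Seconds}}{\text{Clock Cycle}}}_{(\text{Clock Rate})^{-1}}$$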
Approaches to decreasing execution time involve
decreasing one of these three components
For a long time, improvements in circuit technology allowed
processors to be driven at higher clock rates, improving execution
time without additional effort from software
developers and computer architects
This trend was called the “free lunch”
Moore’s “Law”
[Figure: number of transistors (thousands) versus year, 1970 to 2020, on a logarithmic axis from 10^0 to 10^7]
Source: M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, C. Batten (1970-2010). K. Rupp (2010-2017).
Moore’s “Law” predicted that the number of transistors
per unit area would double every 18-24 months
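To get a feel for what doubling every 18-24 months means, the sketch below computes the growth factor over a fixed period. The 10-year window and the function name `growth_factor` are our own choices for illustration.

```python
def growth_factor(years: float, doubling_period_months: float) -> float:
    """Growth factor if a quantity doubles every `doubling_period_months` months."""
    doublings = years * 12 / doubling_period_months
    return 2.0 ** doublings


# Compare both ends of the 18-24 month range over an illustrative 10 years.
for months in (18, 24):
    factor = growth_factor(10, months)
    print(f"doubling every {months} months: about {factor:.0f}x in 10 years")
```

Over 10 years this gives roughly 100x at the 18-month pace and 32x at the 24-month pace, so the exact doubling period matters a great deal.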
No More Free Lunch
[Figure: number of transistors (thousands) and frequency (MHz) versus year, 1970 to 2020, on a logarithmic axis from 10^0 to 10^7]
Source: M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, C. Batten (1970-2010). K. Rupp (2010-2017).
Processor frequency (clock rate) followed the same trend because
smaller transistors can be switched faster… until around 2005.
Power Wall
[Figure: number of transistors (thousands) and frequency (MHz) versus year, 1970 to 2020, on a logarithmic axis from 10^0 to 10^7, with the flattening of the frequency curve labeled "Power Wall"]
Source: M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, C. Batten (1970-2010). K. Rupp (2010-2017).
Around 2005, frequency stopped increasing due to the Power Wall
Power Breakdown
$$P \propto C V^2 f$$
where P is power, C is capacitance, V is voltage, and f is frequency
Increasing frequency increases power, which is dissipated as heat,
requiring more support for cooling the chip
Historically, the increase in power was partially offset by a
decrease in voltage, enabled by the decrease in transistor size
(over 20 years, there was a 1,000x increase in frequency but only
a 30x increase in power because voltage decreased by 5x)
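The effect of voltage scaling can be checked against the proportionality between power, capacitance, voltage squared, and frequency. The sketch below uses the round numbers from the slide; the implied capacitance factor is simply what the proportionality yields for those numbers, not a measured value, and the function name is our own.

```python
def relative_power(cap_factor: float, voltage_factor: float, freq_factor: float) -> float:
    """Relative change in power under P ∝ C * V^2 * f."""
    return cap_factor * voltage_factor ** 2 * freq_factor


# 1,000x frequency with no voltage change: power would also grow 1,000x.
without_voltage_scaling = relative_power(1.0, 1.0, 1000.0)

# 1,000x frequency with voltage reduced 5x: V^2 shrinks 25x, so power grows 40x.
with_voltage_scaling = relative_power(1.0, 1.0 / 5.0, 1000.0)

# The slide quotes about 30x; under this simple model the rest of the gap
# would be attributed to a change in capacitance (an inference, not data).
implied_cap_factor = 30.0 / with_voltage_scaling

print(f"no voltage scaling: {without_voltage_scaling:.0f}x")
print(f"5x lower voltage:   {with_voltage_scaling:.0f}x")
print(f"implied capacitance factor: {implied_cap_factor:.2f}")
```

Because voltage enters squared, a 5x reduction in voltage cancels a 25x increase in frequency, which is why lowering voltage was such an effective lever.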
Today, voltage can no longer be decreased because lowering it further makes transistors unreliable,
and power can no longer be increased because we have reached the limit of what we
can cool.
Therefore, frequency can no longer be increased.
Power Trend
[Figure: number of transistors (thousands), frequency (MHz), and typical power (Watts) versus year, 1970 to 2020, on a logarithmic axis from 10^0 to 10^7]
But we still get more transistors! What to do with them?
Source: M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, C. Batten (1970-2010). K. Rupp (2010-2017).
Stagnation in frequency is associated with a stagnation in power
Where to invest transistors?
• Increase number of cores (or threads per core)
  • Improves throughput
• Make cores more advanced
  • Improves execution time
• Tradeoff between execution time and throughput
Improving a CPU’s Execution Time
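$$\underbrace{\frac{\text{Instructions}}{\text{Program}}}_{\text{Instruction Count}} \times \underbrace{\frac{\text{Clock Cycles}}{\text{Instruction}}}_{\text{CPI}} \times \underbrace{\frac{\text{Seconds}}{\text{Clock Cycle}}}_{(\text{Clock Rate})^{-1}}$$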
Computer architects have developed a wide variety of
techniques for increasing the number of instructions that
can be executed each clock cycle (i.e., reducing CPI)
Techniques for Reducing CPI
• Pipelining (Chapter 4)
• Caching (Chapter 5)
• Speculative Execution (covered briefly)
• Out-of-order Execution (covered briefly)
Improving a CPU’s Execution Time
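$$\underbrace{\frac{\text{Instructions}}{\text{Program}}}_{\text{Instruction Count}} \times \underbrace{\frac{\text{Clock Cycles}}{\text{Instruction}}}_{\text{CPI}} \times \underbrace{\frac{\text{Seconds}}{\text{Clock Cycle}}}_{(\text{Clock Rate})^{-1}}$$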
Software plays a primary role in reducing a program’s instruction count
(low complexity algorithms, powerful compiler optimizations)
Computer architecture also plays a role by providing special purpose
hardware for common operations (increasingly popular trend)
Pitfalls
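$$\underbrace{\frac{\text{Instructions}}{\text{Program}}}_{\text{Instruction Count}} \times \underbrace{\frac{\text{Clock Cycles}}{\text{Instruction}}}_{\text{CPI}} \times \underbrace{\frac{\text{Seconds}}{\text{Clock Cycle}}}_{(\text{Clock Rate})^{-1}}$$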
A CPU manufacturer increases the clock rate of their processor,
decreasing the clock cycle duration.
As a result, some instructions that used to take 1 cycle to
complete now require 2 cycles, increasing the overall CPI.
If the CPI increases by a larger factor than the clock rate, execution time may increase.
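A small numeric sketch of this pitfall follows. All numbers are hypothetical: a 2.0 GHz design where every instruction takes 1 cycle, and a 3.0 GHz redesign in which 60% of instructions now take 2 cycles.

```python
def execution_time(instruction_count: float, cpi: float, clock_rate_hz: float) -> float:
    """Seconds per program = instruction count x CPI x (1 / clock rate)."""
    return instruction_count * cpi / clock_rate_hz


INSTRUCTIONS = 1e9  # hypothetical program size, same on both designs

# Old design (assumed): 2.0 GHz, every instruction completes in 1 cycle.
old_time = execution_time(INSTRUCTIONS, cpi=1.0, clock_rate_hz=2.0e9)

# New design (assumed): 3.0 GHz, but 60% of instructions now need 2 cycles.
slow_fraction = 0.6
new_cpi = (1 - slow_fraction) * 1 + slow_fraction * 2  # weighted average, about 1.6
new_time = execution_time(INSTRUCTIONS, cpi=new_cpi, clock_rate_hz=3.0e9)

print(f"old: {old_time:.3f} s, new: {new_time:.3f} s")
```

Here the clock rate rises by 1.5x while CPI rises by 1.6x, so the new design is slower despite the higher clock rate. With a smaller fraction of 2-cycle instructions the redesign would come out ahead.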
Pitfalls
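$$\underbrace{\frac{\text{Instructions}}{\text{Program}}}_{\text{Instruction Count}} \times \underbrace{\frac{\text{Clock Cycles}}{\text{Instruction}}}_{\text{CPI}} \times \underbrace{\frac{\text{Seconds}}{\text{Clock Cycle}}}_{(\text{Clock Rate})^{-1}}$$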
A CPU manufacturer creates a fused multiply-add (FMA)
instruction, since multiply followed by add is a common operation in linear algebra.
Assuming an add instruction takes 4 cycles and a multiply
instruction takes 8 cycles, if the FMA instruction takes 12 cycles,
then execution time does not improve: the instruction count
decreases, but the CPI increases by the same factor.
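The cycle counts from the slide can be tallied directly. The sketch below assumes a workload made only of multiply-then-add pairs at an unchanged clock rate; the workload size and helper names are our own.

```python
ADD_CYCLES = 4   # from the slide
MUL_CYCLES = 8   # from the slide
FMA_CYCLES = 12  # from the slide


def cycles_separate(pairs: int) -> int:
    """Total cycles when each multiply-add pair uses two instructions."""
    return pairs * (MUL_CYCLES + ADD_CYCLES)


def cycles_fused(pairs: int) -> int:
    """Total cycles when each multiply-add pair uses one FMA instruction."""
    return pairs * FMA_CYCLES


pairs = 1_000_000  # hypothetical number of multiply-add pairs

separate_instructions = 2 * pairs  # one multiply plus one add per pair
fused_instructions = pairs         # one FMA per pair

separate_cpi = cycles_separate(pairs) / separate_instructions  # 6.0
fused_cpi = cycles_fused(pairs) / fused_instructions           # 12.0

print(cycles_separate(pairs), cycles_fused(pairs))
print(separate_cpi, fused_cpi)
```

The instruction count halves but the CPI doubles, so the total cycle count, and therefore the execution time at a fixed clock rate, is unchanged. The FMA instruction only helps if it takes fewer than 12 cycles.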
Textbook Sections
• Some of the content in these slides corresponds to:
  • Textbook: Computer Organization and Design, 5th Edition by David
    Patterson and John Hennessy, Morgan Kaufmann, 2014.
  • Sections: 1.6, 1.7