Welcome To CSCE 4610/5610: Computer Architecture
Outcomes for CSCE 4610
1. Apply metrics to evaluate the performance of modern computer systems.
2. Design processor pipelines to meet specifications.
3. Design simple branch prediction for a pipelined processor.
4. Design out-of-order instruction execution using reservation stations and
reorder buffers.
5. Apply simple compiler techniques to improve performance.
6. Gain knowledge about various cache design alternatives.
You will be asked at the end of the semester whether we met these objectives
Review
What is computer architecture?
Instruction Set Architecture
Computer Organization
Micro-architecture
System Architecture
Role of a computer architect
Support functionality
Depends on the type of applications or target market
Desktop, server, scientific, mobile/personal devices
Embedded systems (controllers, etc.)
Understand technology trends
Denser chips
Denser memories
Memory wall
Clock frequencies
Heat dissipation
Support functionality with best performance
Speed performance
Reliability, availability
Power/energy performance
Hard or soft real-time requirements
Issues related to Cost
Cost of Integrated Chip
Cost of the die (or chip)
Cost of testing
Cost of packaging
Cost of the die depends on the number of dies per wafer and
how many of those dies are good (the yield)
Dies per wafer = pi * (wafer diameter / 2)^2 / (die area) - pi * (wafer diameter) / sqrt(2 * die area)
Die yield = (Wafer_yield) * 1 / [1 + (defects_per_unit_area) * (die_area)]^N
N is known as process complexity and in 2010 its value ranged between 11.5 and 15.5
Example from page 31
Note: Some of the problems from Chapter 1 do not work with this formula
Consider a 30 cm wafer and two different die sizes: 1.5 cm or 1.0 cm on a side
Dies per wafer:
With 1.5 cm dies (2.25 cm^2) we get 270 dies
With 1.0 cm dies (1.0 cm^2) we get 640 dies
Defects per square centimeter determine the yield
Given 0.031 defects per cm^2 and N = 13.5:
If we use 1.5 cm dies, die yield = 0.40, so we get 270 * 0.40 = 108 good chips from the 30 cm wafer
If we use 1.0 cm dies, die yield = 0.66, and we get 640 * 0.66 = 422 good chips
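For reference, a small Python sketch (not from the slides) that reproduces these numbers, using the standard dies-per-wafer approximation and the yield equation above:

import math

def dies_per_wafer(wafer_diameter_cm, die_area_cm2):
    # Usable wafer area divided by die area, minus a correction for
    # partial dies lost around the wafer edge
    wafer_area = math.pi * (wafer_diameter_cm / 2) ** 2
    edge_loss = math.pi * wafer_diameter_cm / math.sqrt(2 * die_area_cm2)
    return round(wafer_area / die_area_cm2 - edge_loss)

def die_yield(defects_per_cm2, die_area_cm2, n=13.5, wafer_yield=1.0):
    # Yield model from the slide: wafer_yield / (1 + D * A)^N
    return wafer_yield / (1 + defects_per_cm2 * die_area_cm2) ** n

for die_area in (2.25, 1.0):
    dies = dies_per_wafer(30, die_area)
    y = die_yield(0.031, die_area)
    # Prints 270 / 0.40 / ~109 and 640 / 0.66 / ~424; the slide rounds the
    # yields to 0.40 and 0.66 before multiplying and gets 108 and 422.
    print(f"{die_area} cm^2: {dies} dies, yield {y:.2f}, ~{dies * y:.0f} good dies")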
Another example: Problem 1.1 on page 62
Here we simply apply the yield equation
Die yield = (Wafer_yield) * 1 / [1 + (defects_per_unit_area) * (die_area)]^N
We will assume wafer yield to be 100% and N= 13.5
If we use the equation in the textbook, we get yield = 2.9 * 10^-5
VERY BAD!
The previous edition used the following equation for yield:
Die_yield = (Wafer_yield) * [1 + (defects_per_unit_area) * (die_area) / alpha]^(-alpha)
Assume wafer yield = 100% and alpha = 4
Now the yield for the Power5 turns out to be more reasonable: 0.36
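For comparison, a sketch of both yield formulas in Python. The die area and defect density below are not on this slide; they are assumed from the referenced textbook problem (roughly a 3.89 cm^2 Power5 die with 0.30 defects per cm^2), and they do reproduce the 2.9*10^-5 and 0.36 figures above:

def yield_new(defects_per_cm2, die_area_cm2, n=13.5, wafer_yield=1.0):
    # Current-edition model: wafer_yield / (1 + D * A)^N
    return wafer_yield / (1 + defects_per_cm2 * die_area_cm2) ** n

def yield_old(defects_per_cm2, die_area_cm2, alpha=4.0, wafer_yield=1.0):
    # Previous-edition model: wafer_yield * (1 + D * A / alpha)^(-alpha)
    return wafer_yield * (1 + defects_per_cm2 * die_area_cm2 / alpha) ** (-alpha)

d, a = 0.30, 3.89   # assumed defects/cm^2 and die area in cm^2 (from the referenced problem)
print(f"new formula: {yield_new(d, a):.1e}")   # ~2.9e-05 (unreasonably low)
print(f"old formula: {yield_old(d, a):.2f}")   # ~0.36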
Why does the Power5 have a lower defect rate?
IBM's process technology is older (note the larger feature size, in nm)
so it is more mature and has fewer manufacturing defects
Power consumed by a processor
Two types of power consumed
Static: consumed even if a hardware component is not active
sometimes called leakage
Dynamic: due to the switching of transistors
Power_dynamic = (1/2) * (capacitive load) * (supply voltage)^2 * (operating frequency)
Example: What happens if the voltage is dropped by 15%, with a proportional (15%) change
in operating frequency?
There is no change in capacitive load, so
P_new / P_old = (0.85)^2 * (0.85) = 0.85^3 = 0.61
So we reduced the dynamic power consumption by about 39%
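A minimal sketch of the same scaling argument (the absolute capacitance, voltage, and frequency values are arbitrary placeholders; only the ratios matter):

def dynamic_power(cap_load, voltage, frequency):
    # P_dynamic = 1/2 * C * V^2 * f
    return 0.5 * cap_load * voltage ** 2 * frequency

baseline = dynamic_power(1.0, 1.0, 1.0)         # arbitrary units
scaled = dynamic_power(1.0, 0.85, 0.85)         # 15% lower voltage and frequency
print(f"power ratio: {scaled / baseline:.2f}")  # ~0.61, i.e. about a 39% reduction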
Consider another example: problem 1.4
Note: we are using an Intel processor with 2 DRAM chips and a 7200 rpm disk
a) Total power = 66 W (processor) + 2 * 2.3 W (DRAM) + 7.9 W (disk)
= 78.5 W
However, if the power supply works at only 80% efficiency and needs to deliver 78.5 W, we
need a power supply rated for 78.5 / 0.8 = 98 W
b) The disk is 60% idle (or 40% busy):
average disk power = 7.9 * 40% + 4.0 * 60% = 5.56 W
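A sketch of parts (a) and (b), using only the component powers quoted above:

# Component powers quoted in the example (watts)
cpu_w, dram_w, disk_seek_w, disk_idle_w = 66.0, 2.3, 7.9, 4.0

# Part (a): everything active, then the supply rating at 80% efficiency
total = cpu_w + 2 * dram_w + disk_seek_w
print(f"total power: {total:.1f} W")                            # 78.5 W
print(f"supply rating at 80% efficiency: {total / 0.8:.1f} W")  # ~98 W

# Part (b): disk busy 40% of the time, idle 60%
avg_disk = disk_seek_w * 0.40 + disk_idle_w * 0.60
print(f"average disk power: {avg_disk:.2f} W")                  # 5.56 W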
c) The 7200 rpm disk can be idle longer because its seek time is shorter:
seek_7200 = 75% * seek_5400
For each disk, the seek and idle time fractions sum to 100% (seek_time = 1 - idle_time)
Total power_7200 = (seek power_7200) * seek_7200 + (idle power_7200) * idle_7200
Total power_5400 = (seek power_5400) * seek_5400 + (idle power_5400) * idle_5400
We equate these two power expressions and use the seek and idle power consumed by the two
disks given in the table
Solving, idle_7200 is approximately 29%
Note: More hardware means more power consumption
both static and dynamic
The capacitive load is proportional to the number of transistors
Dynamic voltage and frequency scaling
Changing voltage and clock speed can degrade performance
A better measure may be the time * energy product
If we change voltage and frequency in the middle of execution
we lose some time, since hardware components need to be resynchronized
Dropping the (supply) voltage reduces power consumption, but circuits may become more error prone
A lot of work has been done on changing frequencies, as well as on shutting off components, to save power
Globally Asynchronous Locally Synchronous (GALS)
Different units (different stages of a pipeline) run at different clock rates
Another criterion is the amount of silicon area needed
at least for embedded systems
Let us define the size needed for a 1-bit register as 1 rbe (register bit equivalent)
To build one bit of SRAM we need 0.6 rbe
To build one bit of DRAM we need 0.1 rbe
To build one bit of a direct-mapped cache we need ~0.8 rbe
So we need to decide whether to use DRAM, SRAM, cache, or registers
Tradeoff: registers are faster than caches, and caches are faster than DRAM
Logic circuits (such as control logic and arithmetic logic) consume more area than memory units
If we can reduce the amount of cache memory needed, we can potentially reduce the area
needed for the cache and the power consumed
We have explored some ideas -- keep the same performance but reduce the area and
power consumed by using different cache organizations
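A tiny sketch of how these rbe figures could be used to compare areas; the structure sizes below are hypothetical examples, not from the slide:

# rbe (register bit equivalent) costs per bit, as quoted above
RBE_PER_BIT = {"register": 1.0, "sram": 0.6, "dram": 0.1, "cache": 0.8}

def area_rbe(kind, num_bits):
    # Rough area estimate in rbe for num_bits of a given storage type
    return RBE_PER_BIT[kind] * num_bits

# Hypothetical sizes, purely for illustration:
print(area_rbe("register", 32 * 64))      # a 32-entry, 64-bit register file (~2,048 rbe)
print(area_rbe("cache", 16 * 1024 * 8))   # 16 KB of direct-mapped cache data bits (~105,000 rbe)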
Cost of computers
The cost of the CPU chip is only a small fraction of the overall cost of a computer
The CPU is about 22% of total system cost
This fraction keeps changing as the cost of other components changes
The cost of the system must be understood in relation to the selling price of the system
the actual cost of the system is only about 25% of the list price
the rest goes to marketing, profits, etc.
So, if the cost of the CPU increases by $1, the system cost increases by about $4.5
(since the CPU is ~22% of system cost), and the list price increases by about $18
(since cost is ~25% of the list price)!
So, if we are considering adding new functionality, we need to worry about the impact of that
functionality on cost and price
And the increase should be justified by performance: either speed, reliability/availability,
or lower power
How do we define the performance of a processor?
Execution time for a program?
Wall clock or CPU time?
User CPU and System CPU Time
For now we will only use user CPU time:
CPU time = (instruction count) * (CPI) / (clock rate)
Note: cycle time = 1 / clock_rate; a 1 GHz clock means 1 ns per cycle
CPI: Average number of Cycles Per Instruction.
How do we find this?
Consider for example that we collected average frequencies for various instruction types.
ALU operations occur 43% of the time and take 1 clock cycle to execute
Load instructions occur 21% of the time and need 2 cycles
Store instructions occur 12% of the time and need 2 cycles
Branch instructions occur 24% of the time and need 2 cycles
How do we get CPI, the average number of cycles per instruction?
CPI = Σ_{i} (instruction count_i * C_i) / (total instruction count)
    = Σ_{i} (fraction of instructions of type i) * (cycles for type i)
The average number of cycles per instruction =
0.43*1 + 0.21*2 + 0.12*2 + 0.24*2 = 1.57 cycles per instruction
Once we have the CPI and the clock speed, we can find the MIPS rating of a processor
If we are using a 1 GHz processor, the MIPS rating is given by
10^9 / (1.57 * 10^6) = 637 million instructions per second
Execution time = (instruction_count) * (1 / 637 MIPS)
               = (instruction_count) * 1.57 * 10^-9 seconds
Remember, the clock speed (or frequency) is inversely related to the clock period.
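The CPI/MIPS arithmetic above, as a short Python sketch (instruction mix and 1 GHz clock taken from the example):

# Instruction mix from the example: (fraction of instructions, cycles each)
mix = {"alu": (0.43, 1), "load": (0.21, 2), "store": (0.12, 2), "branch": (0.24, 2)}

cpi = sum(frac * cycles for frac, cycles in mix.values())
clock_hz = 1e9                                   # 1 GHz clock
mips = clock_hz / (cpi * 1e6)
print(f"CPI = {cpi:.2f}")                        # 1.57
print(f"MIPS = {mips:.0f}")                      # ~637
print(f"time per instruction = {1 / (mips * 1e6):.2e} s")  # ~1.57e-09 s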
Consider why MIPS rating can be misleading.
Suppose we have a compiler that can optimize the program
The optimizing compiler can eliminate 50% of the arithmetic (ALU) instructions.
Now let us consider how the equations change. What is the CPI?
Consider for example that we collect average frequencies for various instruction types.
ALU operations now occur 21.5% of the time and take 1 clock cycle to execute
Load instructions occur 21% of the time and need 2 cycles
Store instructions occur 12% of the time and need 2 cycles
Branch instructions occur 24% of the time and need 2 cycles
But we need to rescale these fractions, since they now total only 78.5%
So CPI = [(21.5%)*1 + (21%)*2 + (12%)*2 + (24%)*2] / (78.5%)
       = 1.73 cycles per instruction (a larger CPI!)
MIPS = (10^9) / (1.73 * 10^6) = 578 MIPS
So the computer with the optimizing compiler has a lower MIPS rating, even though it
executes fewer instructions and finishes sooner!
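A sketch of the rescaling step, assuming (as above) that the optimizer removes half of the ALU instructions and nothing else changes:

# Original mix: (fraction of the original instruction count, cycles each)
base = {"alu": (0.43, 1), "load": (0.21, 2), "store": (0.12, 2), "branch": (0.24, 2)}

# The optimizer removes 50% of ALU instructions; the other counts are unchanged,
# so the surviving fractions must be rescaled to sum to 1.
opt = dict(base, alu=(0.43 / 2, 1))
total_frac = sum(frac for frac, _ in opt.values())           # 0.785
cpi = sum(frac * cyc for frac, cyc in opt.values()) / total_frac
mips = 1e9 / (cpi * 1e6)                                     # 1 GHz clock
print(f"CPI = {cpi:.2f}, MIPS = {mips:.0f}")  # CPI ~1.73; MIPS ~579 (578 if CPI is rounded first)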
Another Example
Consider two processors with different ways of implementing conditional instructions
CPU-A: needs two instructions, a compare and a branch (e.g., SLT R3, R1, R2; BNZ R3, loop)
CPU-B: a single instruction to compare and branch (e.g., BLT R1, R2, loop)
Branches take 2 cycles and all other instructions take 1 cycle
Frequency of branches = 20%
CPU-A's clock is 25% faster (simpler instructions)
Time on CPU-A = (Instr_Count) * {0.80*1 + 0.20*(2+1)} * (Cycle_Time)
             = (Instr_Count) * 1.4 * (Cycle_Time)
Time on CPU-B = (Instr_Count) * (0.8*1 + 0.2*2) * (1.25 * Cycle_Time)
             = (Instr_Count) * 1.5 * (Cycle_Time)
CPU-A is faster even though it needs more instructions!
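The same comparison as a sketch, with the instruction count and CPU-A's cycle time normalized to 1:

def exec_time(instr_count, cpi, cycle_time):
    return instr_count * cpi * cycle_time

ic, ct = 1.0, 1.0   # normalized instruction count and CPU-A cycle time
# CPU-A: 20% of operations need a compare (1 cycle) plus a branch (2 cycles)
time_a = exec_time(ic, 0.80 * 1 + 0.20 * (2 + 1), ct)
# CPU-B: single compare-and-branch (2 cycles), but a 25% longer cycle time
time_b = exec_time(ic, 0.80 * 1 + 0.20 * 2, 1.25 * ct)
print(f"CPU-A: {time_a:.2f}, CPU-B: {time_b:.2f}")   # 1.40 vs 1.50 -> CPU-A wins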
Many Complex Interactions During Execution.
Pipeline Bubbles or Stalls or lost cycles due to branch instructions
Consider, for example, that on average 50% of all branches
are taken and cause 3-cycle stalls (lost cycles)
What is the CPI for branch instructions?
If not taken, CPI = 1
If taken, CPI = 4
Effective CPI for branches = 0.5*1+0.5*4 =2.5
If branches are 20%, total CPI = 80%*1 + 20%*2.5 = 1.3
Cache Misses
Affect only load and store instructions
If no cache miss, say CPI =2
If cache miss, we may have a CPI of 50
5% miss rate leads to 0.95*2+0.05*50 = 4.4
Remember the instruction frequencies from a previous example
ALU operations occur 43% of the time and take 1 clock cycle to execute
Load instructions occur 21% of the time and need 2 cycles without cache misses
Store instructions occur 12% of the time and need 2 cycles without cache misses
Branch instructions occur 24% of the time and need 2 cycles
But if we have 21% loads and 12% stores taking 4.4 cycles each (due to cache misses),
the new CPI = 33%*4.4 + 43%*1 + 24%*2 = 2.36
compared to 1.57 CPI with no cache misses
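A small sketch that reproduces the effective-CPI numbers above (branch and cache-miss penalties as given in the examples):

# Effective CPI for branches: 50% taken (1 + 3 stall cycles), 50% not taken (1 cycle)
branch_cpi = 0.5 * 1 + 0.5 * 4
print(f"branch CPI = {branch_cpi}, overall with 20% branches = {0.80 * 1 + 0.20 * branch_cpi:.2f}")

# Effective CPI for loads/stores with a 5% miss rate and a 50-cycle miss penalty
mem_cpi = 0.95 * 2 + 0.05 * 50
# Full mix: 43% ALU (1 cycle), 33% loads+stores, 24% branches (2 cycles)
print(f"memory CPI = {mem_cpi}, overall CPI = {0.43 * 1 + 0.33 * mem_cpi + 0.24 * 2:.2f}")  # 4.4, ~2.36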
How to report performance data?
Execution time for one program
Execution times for all programs
Average execution time across all programs
Weighted average etc.
Assuming n programs:
Arithmetic Mean = (1/n) * Σ_{i=1..n} (Time)_i
Weighted Arithmetic Mean = Σ_{i=1..n} (Weight)_i * (Time)_i
Harmonic Mean = n / Σ_{i=1..n} (1 / (Time)_i)
Let us look at an example. Here we are comparing 3 different computers using 2 programs
(execution times):

              Computer A   Computer B   Computer C
Pgm P1                 1           10           20
Pgm P2              1000          100           20
Total               1001          110           40

Let us find the weighted arithmetic average execution times, using 3 different sets of weights:
W1: P1 = 50%,   P2 = 50%
W2: P1 = 90.9%, P2 = 9.1%
W3: P1 = 99.9%, P2 = 0.1%
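A sketch that reproduces the weighted averages shown in the table that follows (times and weights as given above):

# Execution times from the table and the three weight sets
times = {"A": {"P1": 1, "P2": 1000}, "B": {"P1": 10, "P2": 100}, "C": {"P1": 20, "P2": 20}}
weights = {"W1": (0.5, 0.5), "W2": (0.909, 0.091), "W3": (0.999, 0.001)}

for wname, (w1, w2) in weights.items():
    avg = {m: w1 * t["P1"] + w2 * t["P2"] for m, t in times.items()}
    print(wname, {m: round(v, 2) for m, v in avg.items()})
# W1 -> A: 500.5, B: 55,    C: 20  (C looks best)
# W2 -> A: 91.91, B: 18.19, C: 20  (B looks best)
# W3 -> A: 2.0,   B: 10.09, C: 20  (A looks best)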
              Computer A   Computer B   Computer C
Pgm P1                 1           10           20
Pgm P2              1000          100           20
Total               1001          110           40
Avg with W1        500.5           55           20
Avg with W2        91.91        18.19           20
Avg with W3            2        10.09           20
So which computer is best?
If we use W1, C is best; with W2, B is best; and with W3, A is best
Can we think of a different way of computing averages?
Relative performance: for each program, use a relative execution time, compared to a
standard (reference) computer.
The relative execution times can then be used to compute an arithmetic (or weighted) mean.
We can also compute Geometric Mean.
Geometric Mean = ( Π_{i=1..n} (Relative_execution_time)_i )^(1/n)
Let us look at our example using geometric means. The relative performance of the 3 machines
remains the same.
Normalized to A:
                  Computer A   Computer B   Computer C
Pgm P1                     1           10           20
Pgm P2                     1          0.1         0.02
Arithmetic Mean            1         5.05        10.01
Geometric Mean             1            1         0.63

Normalized to B:
                  Computer A   Computer B   Computer C
Pgm P1                   0.1            1            2
Pgm P2                    10            1          0.2
Arithmetic Mean         5.05            1          1.1
Geometric Mean             1            1         0.63

Normalized to C:
                  Computer A   Computer B   Computer C
Pgm P1                  0.05          0.5            1
Pgm P2                    50            5            1
Arithmetic Mean        25.03         2.75            1
Geometric Mean          1.58         1.58            1

Now, by the geometric mean, C is always the best (the ranking does not depend on the reference machine)
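A sketch of the geometric-mean calculation behind these tables (same execution times as before; math.prod needs Python 3.8 or later):

from math import prod

times = {"A": [1, 1000], "B": [10, 100], "C": [20, 20]}   # P1 and P2 execution times

def geo_mean(values):
    return prod(values) ** (1 / len(values))

for ref in times:   # normalize the times to each machine in turn
    gm = {m: geo_mean([t / r for t, r in zip(times[m], times[ref])]) for m in times}
    print(f"normalized to {ref}:", {m: round(v, 2) for m, v in gm.items()})
# C has the smallest geometric mean regardless of which machine is the reference.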
Another example: see the table on page 43,
where we are looking at geometric means for the Opteron and the Itanium.
The Sun Ultra 5 (a SPARC machine) is used as the reference computer.
The Opteron runs about 30% slower than the Itanium.
What programs to use in evaluating performance?
The programs that will be run in the field
Benchmark programs
Real programs that are common in an application domain
e.g., SPEC benchmarks (SPEC CPU: integer and floating point)
SPECweb, SPECvirt
Bioinformatics
High-performance computing (SPEC OMP)
Program kernels:
e.g., embedded kernels (EEMBC)
NAS benchmarks, Livermore Loops
Synthetic program mixes
How to collect performance data using benchmarks?
Actual Measurements and Simulations
If the architecture already exists, run programs and collect data
Need to be careful in collecting data
Instrumentation may skew data
Performance Registers
Software profiling techniques
Or develop simulations.
Detailed simulations
Trace driven simulations
Monte Carlo simulations