Unit-2
Memory Hierarchy Design
By
Dr. Sarvesh Vishwakarma
Professor – CSE
TCS 704 - Advanced Computer Architecture
Dr. Sarvesh Vishwakarma TCS-704 ACA
Introduction
Introduction
• Programmers want unlimited amounts of memory with low
latency
• Fast memory technology is more expensive per bit than
slower memory
• Solution: organize memory system into a hierarchy
– Entire addressable memory space available in largest, slowest memory
– Incrementally smaller and faster memories, each containing a subset
of the memory below it, proceed in steps up toward the processor
• Temporal and spatial locality insures that nearly all references
can be found in smaller memories
– Gives the allusion of a large, fast memory being presented to the
processor
Dr. Sarvesh Vishwakarma TCS-704 ACA
Principle of Locality
• Temporal Locality
• Spatial Locality
Dr. Sarvesh Vishwakarma TCS-704 ACA
Principle of Temporal Locality
❖ Program spends 90% of its execution time in only 10% of the code
❖ It’s a property of program
❖ This word (Data or Instruction) is likely to be needed in again in near
future.
Dr. Sarvesh Vishwakarma TCS-704 ACA
Principle of Spatial Locality
❖ Data or Instruction whose addresses are near one another tend to
be referenced close together in near future time.
❖ It’s a property of program
❖ There is high probability that the other data in the block will be
needed soon.
Dr. Sarvesh Vishwakarma TCS-704 ACA
Processor-Memory Performance
µProc
1000 CPU 60%/yr.
CPU-DRAM Gap“Moore’s Law” (2X/1.5yr)
Performance
100 Processor-Memory
Performance Gap:
(grows 50% / year)
10
DRAM
DRAM 9%/yr.
1
1989
(2X/10 yrs)
1984
1986
1996
1999
1980
1981
1982
1983
1985
1987
1988
1990
1991
1992
1993
1994
1995
1997
1998
2000
Problem: Time
Improvements in access time are not enough to catch up
Solution:
Increase the bandwidth of main memory (improve throughput)
Dr. Sarvesh Vishwakarma TCS-704 ACA
Capacity Memory Hierarchy Upper Level
Access Time Staging
Transfer Unit faster
CPU Registers
1KB SIZE Registers
50,000-500,000 MB/sec
< 0.5 ns Instr. Operands Prog./compiler
Cache
16 MB Cache
5000-20,000 MB/sec
0.5-25 ns cache cntl
Blocks
Main Memory
2-8 G B Main Memory
2500-10,000 MB/sec
50-250 ns OS
Pages
Disk
G-T Bytes
50-500 MB/sec
Disk
50,000 ns
user/operator
Files
Tape Larger
infinite
sec-min Tape Lower Level
Dr. Sarvesh Vishwakarma TCS-704 ACA
Introduction
Memory Hierarchy
Dr. Sarvesh Vishwakarma TCS-704 ACA
Introduction
Memory Hierarchy Design
• Memory hierarchy design becomes more crucial with
recent multi-core processors:
– Aggregate peak bandwidth grows with # cores:
• Intel Core i7 can generate two references per core per clock
• Four cores and 3.2 GHz clock
– 25.6 billion 64-bit data references/second +
– 12.8 billion 128-bit instruction references
– = 409.6 GB/s!
• DRAM bandwidth is only 6% of this (25 GB/s)
• Requires:
– Multi-port, pipelined caches
– Two levels of cache per core
– Shared third-level cache on chip
Dr. Sarvesh Vishwakarma TCS-704 ACA
Introduction
Performance and Power
• High-end microprocessors have >10 MB on-chip
cache
– Consumes large amount of area and power budget
Dr. Sarvesh Vishwakarma TCS-704 ACA
Introduction
Memory Hierarchy Basics
• When a word is not found in the cache, a miss
occurs:
– Fetch word from lower level in hierarchy, requiring a
higher latency reference
– Lower level may be another cache or the main memory
– Also fetch the other words contained within the block
• Takes advantage of spatial locality
– Place block into cache in any location within its set,
determined by address
• block address MOD number of sets
Dr. Sarvesh Vishwakarma TCS-704 ACA
Introduction
Memory Hierarchy Basics
• n sets => n-way set associative
– Direct-mapped cache => one block per set
– Fully associative => one set
• Writing to cache: two strategies
– Write-through
• Immediately update lower levels of hierarchy
– Write-back
• Only update lower levels of hierarchy when an updated block is
replaced
– Both strategies use write buffer to make writes
asynchronous
Dr. Sarvesh Vishwakarma TCS-704 ACA
Introduction
Memory Hierarchy Basics
• Miss rate
– Fraction of cache access that result in a miss
• Causes of misses
– Compulsory
• First reference to a block
– Capacity
• Blocks discarded and later retrieved
– Conflict
• Program makes repeated references to multiple addresses from
different blocks that map to the same location in the cache
Dr. Sarvesh Vishwakarma TCS-704 ACA
Introduction
Memory Hierarchy Basics
• Note that speculative and multithreaded processors
may execute other instructions during a miss
– Reduces performance impact of misses
Dr. Sarvesh Vishwakarma TCS-704 ACA
Main Memory Background
• Performance of Main Memory:
– Latency: affects cache miss penalty
• Access Time: time between request and word arrives
• Cycle Time: time between requests
– Bandwidth: primary concern for I/O & large block
• Main Memory is DRAM: Dynamic RAM
– Dynamic since needs to be refreshed periodically
– Addresses divided into 2 halves (Row/Column)
• Cache uses SRAM: Static RAM
– No refresh
• 6 transistors/bit vs. 1 transistor/bit, 10X area
– Address not divided: Full address
Dr. Sarvesh Vishwakarma TCS-704 ACA
Memory hierarchy
•Pyramid
•Technical specification of memory elements
•Concepts of cache memory
•Temporal locality
•Spatial locality
•Hit
•Miss
•Hit Rate
•Miss Rate
•Miss penalty
✓Hit rate + Miss Rate = 1
Dr. Sarvesh Vishwakarma TCS-704 ACA
Categorize Cache Miss
1.Compulsory misses
2.Capacity misses
3.Conflict misses
3 C’s Model
Dr. Sarvesh Vishwakarma TCS-704 ACA
CASE STUDY-1
User program contain 320 instructions
Instruction Count: 320
Size of individual instruction: 1 byte
Program size: 320 bytes
Memory has 320 cells
Capacity of single cell: 1 byte or 1 instruction
Max carrying capacity of cache: 1 byte
Dr. Sarvesh Vishwakarma TCS-704 ACA
ALU Flags Capacity
Reg.[B] Reg.[C]
Reg.[D] Reg.[E]
PC SP
Cache
0 1 2 3 4 5 6 7 8 9 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
Dr. Sarvesh Vishwakarma TCS-704 ACA
CASE STUDY-1
➢Fail to implement principle of locality
✓Temporal locality
✓Spatial locality
➢Increase in miss rate (especially capacity miss)
Dr. Sarvesh Vishwakarma TCS-704 ACA
CASE STUDY-2
➢Try to implement principle of locality
✓Temporal locality
▪ Cache size: 80 cell or 80 instructions
▪Block size of cache (line length): 1 cell
▪Use bigger cache
➢Decrease in miss rate (reduction in capacity miss)
➢Drawback: (with bigger cache memory)
•Longer hit time
•High Cost
•More Power Consumption
Dr. Sarvesh Vishwakarma TCS-704 ACA
Capable of holding both old and new 0 1 2 3 4 5 6 7 ALU Flags
Information (Data/Instruction)
B C
D E
Temporal Locality PC SP
Now it has many cells to hold
the multiple instructions along Processor
with their data for longer time Cache
But Still cache is fetching the Data
from memory as per the demand of
Processor
0 1 2 3 4 5 6 7 8 9 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
Main Memory
Dr. Sarvesh Vishwakarma TCS-704 ACA
CASE STUDY-3
➢Also implement second principle of locality
✓Spatial locality
▪ Cache size: 80 cell or 80 instructions
▪Use Large block size
▪Block size of cache (line length): 10 cell
➢Decrease in miss rate (reduce in compulsory misses)
✓ Increase hit rate by increasing the amount of data
brought into the cache when a cache miss occur
➢Drawback of increasing the length of cache line
✓Line fetch time get longer
✓Spatial locality principle loss its strength
Dr. Sarvesh Vishwakarma TCS-704 ACA
Capable of holding both old and new 0 1 2 3 4 5 6 7 ALU Flags
Information (Data/Instruction)
B C
D E
Spatial Locality PC SP
Now it has many cells to hold
the multiple instructions along Processor
with their data for longer time Cache Obviously frequency
of transaction
Here cache is bringing the whole between Cache and
chunk of Data from memory Memory would get
reduce
0 1 2 3 4 5 6 7 8 9 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
Main Memory
Dr. Sarvesh Vishwakarma TCS-704 ACA
Mapping Function
Associativity
How many locations within the cache may contain a given
memory address………….?
If cache is partitioned into N number of cache blocks or line lengths…
How many groups can we form within the cache by distributing N number of
cache blocks or line lengths……………………………….???
Dr. Sarvesh Vishwakarma TCS-704 ACA
Different way of creating clusters in
Cache area
If N = 8
1. 8 cache blocks can be clubbed together in single group/set. 1 cluster/set
2. 8 cache blocks can be clubbed in a group of four blocks. 2 clusters/sets
3. 8 cache blocks can be clubbed in a group of two blocks. 4 clusters/sets
4. 8 cache blocks can be clubbed in a group of one blocks. 8 clusters/sets
If N = 16
1. 16 cache blocks can be clubbed together in single group/set. 1 cluster/set
2. 16 cache blocks can be clubbed in a group of eight blocks. 2 clusters/sets
3. 16 cache blocks can be clubbed in a group of four blocks. 4 clusters/sets
4. 16 cache blocks can be clubbed in a group of two blocks. 8 clusters/sets
5. 16 cache blocks can be clubbed in a group of one blocks. 16 clusters/sets
Dr. Sarvesh Vishwakarma TCS-704 ACA
Associativity
Referenced
address
Store Ref. Address in
➢ any location or
➢only one location or
➢fixed number of location
?
Dr. Sarvesh Vishwakarma TCS-704 ACA
Fully Associative Caches
Associativity
Referenced Referenced Referenced Referenced Referenced Referenced Referenced Referenced
address address address address address address address address
Referenced
address
Store Ref. Address in
➢ any location or
➢only one location or
➢fixed number of location
?
Dr. Sarvesh Vishwakarma TCS-704 ACA
Direct-Mapped Caches
Associativity
Referenced
address
Referenced
address
Store Ref. Address in
➢ any location or
➢only one location or
➢fixed number of location
?
Dr. Sarvesh Vishwakarma TCS-704 ACA
Set -Associative Caches
Associativity
SET-1 SET-2
Referenced Referenced Referenced Referenced
address address address address
Referenced
address
Store Ref. Address in
➢ any location or
➢only one location or
➢fixed number of location
?
There are TCS-704
Dr. Sarvesh Vishwakarma locations
ACAwhere a given address may be stored.
No. of locations in each set is the associativity of the cache
0 1 2 3 4 5 6 7 No. of Cache block: 8
Where to place
memory block with
index field 26?
No. of Memory block: 32
0 1 2 3 4 5 6 7 8 9 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
Dr. Sarvesh Vishwakarma TCS-704 ACA
(Block Address) MOD (Number of sets in cache)
0 MOD 8 = 0 16 MOD 8 = 0
0 1 2 3 4 5 6 7 There are eight groups/sets ;
1 MOD 8 = 1 17 MOD 8 = 1 Each group contain single
2 MOD 8 = 2 18 MOD 8 = 2
3 MOD 8 = 3 19 MOD 8 = 3 cache block
4 MOD 8 = 4 20 MOD 8 = 4
5 MOD 8 = 5 21 MOD 8 = 5
6 MOD 8 = 6 22 MOD 8 = 6
7 MOD 8 = 7 23 MOD 8 = 7 Where to place
8 MOD 8 = 0 24 MOD 8 = 0
9 MOD 8 = 1 25 MOD 8 = 1 memory block with
10 MOD 8 = 2
11 MOD 8 = 3
26 MOD 8 = 2 index field 26?
27 MOD 8 = 3
12 MOD 8 = 4 28 MOD 8 = 4
13 MOD 8 = 5 29 MOD 8 = 5 One-way set associative
14 MOD 8 = 6 30 MOD 8 = 6
15 MOD 8 = 7 31 MOD 8 = 7
0 1 2 3 4 5 6 7 8 9 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
Dr. Sarvesh Vishwakarma TCS-704 ACA
(Block Address) MOD (Number of sets in cache)
0 1 2 3 4 5 6 7 One-way set associative
0 1 2 3 4 5 6 7 8 9 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
Dr. Sarvesh Vishwakarma TCS-704 ACA
One-way set associative
0 1 2 3 4 5 6 7
0 1 2 3 4 5 6 7 8 9 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
Dr. Sarvesh Vishwakarma TCS-704 ACA
Where to place
memory block with
One-way set associative index field 26?
0 1 2 3 4 5 6 7
0 1 2 3 4 5 6 7 8 9 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
Dr. Sarvesh Vishwakarma TCS-704 ACA
(Block Address) MOD (Number of sets in cache)
0 MOD 4 = 0 16 MOD 4 = 0
0 1 2 3 All eight cache blocks are
1 MOD 4 = 1 17 MOD 4 = 1 distributed in a group of two
2 MOD 4 = 2 18 MOD 4 = 2
3 MOD 4 = 3 19 MOD 4 = 3 cache blocks.
4 MOD 4 = 0 20 MOD 4 = 0 It form four groups.
5 MOD 4 = 1 21 MOD 4 = 1 Four Sets
6 MOD 4 = 2 22 MOD 4 = 2
7 MOD 4 = 3 23 MOD 4 = 3 Two-way set associative
8 MOD 4 = 0 24 MOD 4 = 0
9 MOD 4 = 1 25 MOD 4 = 1
10 MOD 4 = 2 26 MOD 4 = 2
11 MOD 4 = 3 27 MOD 4 = 3
12 MOD 4 = 0 28 MOD 4 = 0
13 MOD 4 = 1 29 MOD 4 = 1
14 MOD 4 = 2 30 MOD 4 = 2 Memory block => 32
15 MOD 4 = 3 31 MOD 4 = 3
0 1 2 3 4 5 6 7 8 9 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
Dr. Sarvesh Vishwakarma TCS-704 ACA
(Block Address) MOD (Number of sets in cache)
0 MOD 4 = 0 16 MOD 4 = 0
4 MOD 4 = 0 20 MOD 4 = 0
8 MOD 4 = 0 24 MOD 4 = 0 Two-way set associative
12 MOD 4 = 0 28 MOD 4 = 0
Two-way for set `0’ 0 1 2 3
0 1 2 3 4 5 6 7 8 9 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
Dr. Sarvesh Vishwakarma TCS-704 ACA
(Block Address) MOD (Number of sets in cache)
1 MOD 4 = 1 17 MOD 4 = 1
5 MOD 4 = 1 21 MOD 4 = 1
9 MOD 4 = 1 25 MOD 4 = 1 Two-way set associative
13 MOD 4 = 1 29 MOD 4 = 1
Two-way for set ‘1’ 0 1 2 3
0 1 2 3 4 5 6 7 8 9 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
Dr. Sarvesh Vishwakarma TCS-704 ACA
(Block Address) MOD (Number of sets in cache)
2 MOD 4 = 2 18 MOD 4 = 2
6 MOD 4 = 2 22 MOD 4 = 2
10 MOD 4 = 2 26 MOD 4 = 2 Two-way set associative
14 MOD 4 = 2 30 MOD 4 = 2
Two-way for set ‘2’ 0 1 2 3
0 1 2 3 4 5 6 7 8 9 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
Dr. Sarvesh Vishwakarma TCS-704 ACA
(Block Address) MOD (Number of sets in cache)
3 MOD 4 = 3 19 MOD 4 = 3
7 MOD 4 = 3 23 MOD 4 = 3
11 MOD 4 = 3 27 MOD 4 = 3
Two-way set associative
15 MOD 4 = 3 31 MOD 4 = 3
Two-way for set ‘3’ 0 1 2 3
0 1 2 3 4 5 6 7 8 9 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
Dr. Sarvesh Vishwakarma TCS-704 ACA
(Block Address) MOD (Number of sets in cache)
0 1 All eight cache blocks are
0 MOD 2 = 0 16 MOD 2 = 0
1 MOD 2 = 1 17 MOD 2 = 1 distributed in a group of 4
2 MOD 2 = 0 18 MOD 2 = 0 cache blocks.
3 MOD 2 = 1 19 MOD 2 = 1
4 MOD 2 = 0 20 MOD 2 = 0
5 MOD 2 = 1 21 MOD 2 = 1 It form two sets/groups in
6 MOD 2 = 0 22 MOD 2 = 0
7 MOD 2 = 1 23 MOD 2 = 1 cache.
8 MOD 2 = 0 24 MOD 2 = 0
9 MOD 2 = 1 25 MOD 2 = 1
10 MOD 2 = 0 26 MOD 2 = 0 Four-way set associativity
11 MOD 2 = 1 27 MOD 2 = 1
12 MOD 2 = 0 28 MOD 2 = 0
13 MOD 2 = 1 29 MOD 2 = 1
14 MOD 2 = 0 30 MOD 2 = 0 Memory block => 32
15 MOD 2 = 1 31 MOD 2 = 1
0 1 2 3 4 5 6 7 8 9 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
Dr. Sarvesh Vishwakarma TCS-704 ACA
(Block Address) MOD (Number of sets in cache)
0 MOD 2 = 0 16 MOD 2 = 0
2 MOD 2 = 0 18 MOD 2 = 0
4 MOD 2 = 0 20 MOD 2 = 0
6 MOD 2 = 0 22 MOD 2 = 0
8 MOD 2 = 0 24 MOD 2 = 0
10 MOD 2 = 0 26 MOD 2 = 0
12 MOD 2 = 0 28 MOD 2 = 0
14 MOD 2 = 0 30 MOD 2 = 0
4-way set associativity
0 1
0 1 2 3 4 5 6 7 8 9 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
Dr. Sarvesh Vishwakarma TCS-704 ACA
(Block Address) MOD (Number of sets in cache)
1 MOD 2 = 1 17 MOD 2 = 1
3 MOD 2 = 1 19 MOD 2 = 1
4-way set
5 MOD 2 = 1 21 MOD 2 = 1 associativity
7 MOD 2 = 1 23 MOD 2 = 1
9 MOD 2 = 1 25 MOD 2 = 1
11 MOD 2 = 1 27 MOD 2 = 1
13 MOD 2 = 1 29 MOD 2 = 1
15 MOD 2 = 1 31 MOD 2 = 1
0 1
0 1 2 3 4 5 6 7 8 9 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
Dr. Sarvesh Vishwakarma TCS-704 ACA
(Block Address) MOD (Number of sets
in cache) 8-way set associativity
0 MOD 1 = 0 16 MOD 1 = 0
1 MOD 1 = 0 17 MOD 1 = 0
2 MOD 1 = 0 18 MOD 1 = 0
3 MOD 1 = 0 19 MOD 1 = 0 Fully Associative
4 MOD 1 = 0 20 MOD 1 = 0
5 MOD 1 = 0 21 MOD 1 = 0
6 MOD 1 = 0 22 MOD 1 = 0
7 MOD 1 = 0 23 MOD 1 = 0
8 MOD 1 = 0 24 MOD 1 = 0
9 MOD 1 = 0 25 MOD 1 = 0
10 MOD 1 = 0 26 MOD 1 = 0
11 MOD 1 = 0 27 MOD 1 = 0 0
12 MOD 1 = 0 28 MOD 1 = 0
13 MOD 1 = 0 29 MOD 1 = 0
14 MOD 1 = 0 30 MOD 1 = 0
15 MOD 1 = 0 31 MOD 1 = 0
0 1 2 3 4 5 6 7 8 9 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
Dr. Sarvesh Vishwakarma TCS-704 ACA
Processor
Cache
Memory
storage
Dr. Sarvesh Vishwakarma TCS-704 ACA
Processor
Reply for requested
Request for data
data
Cache
Memory
storage
Dr. Sarvesh Vishwakarma TCS-704 ACA
Processor
Reply for requested
Request for data
data
Cache
Queries for Reply for enquired
missing data data
Memory
storage
Dr. Sarvesh Vishwakarma TCS-704 ACA
Avg. Memory Access Time= Hit time L1:Cache +(Miss Rate L1: Cache x Miss Penalty L1: cache)
Miss Penalty L1: cache = Hit time Memory + Miss Rate Memory x Miss Penalty Memory
Miss Penalty Memory = 0
Dr. Sarvesh Vishwakarma TCS-704 ACA
Processor
Hit time for cache
Request for data
to cache
Cache
Queries for
Cache missing Reply for enquired
Miss data in data
cache
Request for Hit time for
Memory
missing data Main Memory
to Main Data found
Memory
storage
Dr. Sarvesh Vishwakarma TCS-704 ACA
Processor
L1: Cache
L2: Cache
Memory
storage
Dr. Sarvesh Vishwakarma TCS-704 ACA
Request for data to L1: cache Processor
t1 t2 L1: cache Hit
L1: Cache
L2: Cache
Memory
storage
Dr. Sarvesh Vishwakarma TCS-704 ACA
Request for data to L1: cache Processor
t1 t4 L1: cache Hit
Request for L1: cache Miss L1: Cache
missing data
to L2: cache t2
t3 L2: cache Hit
L2: Cache
Memory
storage
Dr. Sarvesh Vishwakarma TCS-704 ACA
Request for data to L1: cache Processor
t1 t6 L1: cache Hit
Request for L1: cache Miss L1: Cache
missing data
to L2: cache t2
t5 L2: cache Hit
Request for L2: cache Miss L2: Cache
missing data
to Memory
t3 Memory Hit
t4
Memory
storage
Dr. Sarvesh Vishwakarma TCS-704 ACA
Request for data to L1: cache Processor
t1 t8 L1: cache Hit
Request for L1: cache Miss L1: Cache
missing data
to L2: cache t2
t7 L2: cache Hit
Request for L2: cache Miss L2: Cache
missing data
to Memory
t3 Memory Hit
t6
Request for Page fault Memory
missing data
to hard disk
Virtual pages loaded
t4 into Memory
t5
Storage DATA
FOUND
Dr. Sarvesh Vishwakarma TCS-704 ACA
Global Miss Rate for L2: cache
L 2 : CacheMiss
L1 : Cache Re quest
Local Miss Rate for L2: cache
L 2 : CacheMiss
L 2 : Cache Re quest
Dr. Sarvesh Vishwakarma TCS-704 ACA
Global Miss Rate for L1: cache
L1 : CacheMiss
L1 : Cache Re quest
Local Miss Rate for L1: cache
L1 : CacheMiss
L1 : Cache Re quest
Dr. Sarvesh Vishwakarma TCS-704 ACA
Request for data to L1: cache Processor
t1 t6 L1: cache Hit
Request for L1: L1: Cache
L2:CacheRequest
missing data cache
= L1:CacheMiss to L2: cache Miss
t2 t5 L2: cache Hit
L2: Cache
MemoryRequest Request for L2:
missing data cache
= L2:CacheMiss to Memory Miss
t3 t4 Memory Hit
Memory
storage
Dr. Sarvesh Vishwakarma TCS-704 ACA
L 2 : CacheMiss
Global Miss Rate for L2: cache =
L1 : Cache Re quest
L1 : CacheMiss
Local Miss Rate for L1: cache =
L1 : Cache Re quest
L 2 : CacheMiss
Local Miss Rate for L2: cache =
L1 : CacheMiss
Global Miss Rate for L2: cache = (Local Miss Rate for L1: cache)x (Local Miss Rate for L2: cache)
Dr. Sarvesh Vishwakarma TCS-704 ACA
Avg. Memory Access Time= Hit time L1:Cache +(Miss Rate L1: Cache x Miss Penalty L1: cache)
Miss Penalty L1: cache = Hit time L2:Cache + (Miss Rate L2:Cache x Miss Penalty L2:Cache)
Miss Penalty L2:Cache = Hit time Memory + (Miss Rate Memory x Miss Penalty Memory)
Miss Penalty Memory = 0
Avg. Memory Access Time= Hit time L1:Cache +(Miss Rate L1: Cache) x (Hit time
L2:Cache + (Miss Rate L2:Cache x Miss Penalty L2:Cache))
Avg. Memory Access Time= Hit time L1:Cache +Miss Rate L1: Cache x (Hit time L2:Cache +
Miss Rate L2:Cache x Miss Penalty L2:Cache)
T = Hit time L1 +Miss Rate L1 x (Hit time L2 +Miss Rate L2 x Miss Penalty L2)
Dr. Sarvesh Vishwakarma TCS-704 ACA
Introduction
Memory Hierarchy
Dr. Sarvesh Vishwakarma TCS-704 ACA
Caches
❖Capacity : 32 KB, 64 KB, 2MB, 4 MB etc
❖Line length: 10 byte, 32 byte, 64 byte, Is it aligned or not
❖Associativity:
a) Direct mapped,
b) Fully associative,
c) Set associative
❖Replacement policy:
a) Least-recently Used
b) Random replacement
c) First in, first out
d) Not-most-recently used
❖How to handle storing the information in lower level of
memory hierarchy
a) Write Back
b) Write Through
Dr. Sarvesh Vishwakarma TCS-704 ACA
Replacement Algorithms
1. Least recently used or LRU LRU bit
2. Least frequently used or LFU counter
3. First In First Out or FIFO Circular
buffer
Dr. Sarvesh Vishwakarma TCS-704 ACA
Introduction
Memory Hierarchy Basics
• Six basic cache optimizations:
– Larger block size
• Reduces compulsory misses
• Increases capacity and conflict misses, increases miss penalty
– Larger total cache capacity to reduce miss rate
• Increases hit time, increases power consumption
– Higher associativity
• Reduces conflict misses
• Increases hit time, increases power consumption
– Higher number of cache levels
• Reduces overall memory access time
– Giving priority to read misses over writes
• Reduces miss penalty
– Avoiding address translation in cache indexing
• Reduces hit time
Dr. Sarvesh Vishwakarma TCS-704 ACA
Eleven Advanced cache Optimization
Techniques
1. Reducing the hit time:
a) small and simple caches,
b) way prediction, and
c) Trace caches
2. Increasing cache bandwidth:
a) pipelined caches,
b) multibanked caches, and
c) Nonblocking caches
3. Reducing the miss penalty:
a) critical word first and
b) merging write buffers
4. Reducing the miss rate: compiler optimizations
a) Code and Data Rearrangement
b) Loop interchange
c) Blocking
5. Reducing the miss penalty or miss rate via parallelism: hardware
prefetching and compiler prefetching
Dr. Sarvesh Vishwakarma TCS-704 ACA
Advanced Optimizations
Ten Advanced Optimizations
• Small and simple first level caches
– Critical timing path:
• addressing tag memory, then
• comparing tags, then
• selecting correct set
– Direct-mapped caches can overlap tag compare and
transmission of data
– Lower associativity reduces power because fewer cache
lines are accessed
Dr. Sarvesh Vishwakarma TCS-704 ACA
Advanced Optimizations
L1 Size and Associativity
Access time vs. size and associativity
Dr. Sarvesh Vishwakarma TCS-704 ACA
Advanced Optimizations
L1 Size and Associativity
Energy per read vs. size and associativity
Dr. Sarvesh Vishwakarma TCS-704 ACA
Advanced Optimizations
Way Prediction
• To improve hit time, predict the way to pre-set mux
– Mis-prediction gives longer hit time
– Prediction accuracy
• > 90% for two-way
• > 80% for four-way
• I-cache has better accuracy than D-cache
– First used on MIPS R10000 in mid-90s
– Used on ARM Cortex-A8
• Extend to predict block as well
– “Way selection”
– Increases mis-prediction penalty
Dr. Sarvesh Vishwakarma TCS-704 ACA
Advanced Optimizations
Pipelining Cache
• Pipeline cache access to improve bandwidth
– Examples:
• Pentium: 1 cycle
• Pentium Pro – Pentium III: 2 cycles
• Pentium 4 – Core i7: 4 cycles
• Increases branch mis-prediction penalty
• Makes it easier to increase associativity
Dr. Sarvesh Vishwakarma TCS-704 ACA
Advanced Optimizations
Nonblocking Caches
• Allow hits before
previous misses
complete
– “Hit under miss”
– “Hit under multiple
miss”
• L2 must support this
• In general, processors
can hide L1 miss
penalty but not L2
miss penalty
Dr. Sarvesh Vishwakarma TCS-704 ACA
Advanced Optimizations
Multibanked Caches
• Organize cache as independent banks to support
simultaneous access
– ARM Cortex-A8 supports 1-4 banks for L2
– Intel i7 supports 4 banks for L1 and 8 banks for L2
• Interleave banks according to block address
Dr. Sarvesh Vishwakarma TCS-704 ACA
Advanced Optimizations
Critical Word First, Early Restart
• Critical word first
– Request missed word from memory first
– Send it to the processor as soon as it arrives
• Early restart
– Request words in normal order
– Send missed work to the processor as soon as it arrives
• Effectiveness of these strategies depends on block
size and likelihood of another access to the
portion of the block that has not yet been fetched
Dr. Sarvesh Vishwakarma TCS-704 ACA
Advanced Optimizations
Merging Write Buffer
• When storing to a block that is already pending in the
write buffer, update write buffer
• Reduces stalls due to full write buffer
• Do not apply to I/O addresses
No write
buffering
Write buffering
Dr. Sarvesh Vishwakarma TCS-704 ACA
Advanced Optimizations
Compiler Optimizations
• Code and Data Rearrangement
Dr. Sarvesh Vishwakarma TCS-704 ACA
Compiler Optimizations Memory
• Traditional loop operations X[0][0] S1
– i-loop is nested inside j-loop to access memory in X[0][1] S5
nonsequential order X[0][2] S9
– for (j = 0; j < 4; j = j+1) X[0][3] S13
X[1][0] S2
for(i=0; i< 3; i=i+1)
X[1][1] S6
X[i][j] = 2*X[i][j];
X[1][2] S10
X[1][3] S14
j
X X[2][0] S3
0 1 2 3 X[2][1] S7
X[2][2] S11
0
X[2][3] S15
i X[3][0] S4
1
X[3][1] S8
2
X[3][2] S12
Dr. Sarvesh Vishwakarma TCS-704 ACA
X[3][3] S16
Compiler Optimizations Memory
• Loop Interchange X[0][0] S1
– Swap nested loops to access memory in X[0][1] S2
sequential order X[0][2] S3
– for (i = 0; i < 3; i = i+1) X[0][3] S4
for(j=0; j< 4; j=j+1) X[1][0] S5
X[i][j] = 2*X[i][j];
X[1][1] S6
– Reordering maximizes use of data in a cache
X[1][2] S7
block before they are discarded.
X[1][3] S8
j
X X[2][0] S9
0 1 2 3 X[2][1] S10
X[2][2] S11
0
X[2][3] S12
i X[3][0] S13
1
X[3][1] S14
2
X[3][2] S15
Dr. Sarvesh Vishwakarma TCS-704 ACA
X[3][3] S16
Compiler Optimizations
Advanced Optimizations
• Blocking
– Instead of accessing entire rows or columns, subdivide matrices
into blocks
– Requires more memory accesses but improves locality of accesses
Y: 1 2 3 1 2 2 Z: 1 1 0 1 2 1 X: 14
0 0 2 0 1 1 1 0 0 1 3 1
1 2 3 2 2 1 2 2 0 3 2 0
2 3 0 2 0 1 3 3 2 1 1 1
1 2 3 4 1 0 1 0 1 4 0 2
3 2 1 2 0 1 0 1 2 0 1 2
for (i=0;i<N; i = i+1)
1 1 + 2 1 + 3 2 + 1 3 + 2 1 + 2 0 for (j=0;j<N; j = j+1)
1+ 2 + 6 + 3 + 2 + 0 { r =0;
for (k=0;k<N; k = k+1)
14 r = r + Y[i][k]*Z[k][j];
X[i][j] = r;
Dr. Sarvesh Vishwakarma TCS-704 ACA
};
Compiler Optimizations
Advanced Optimizations
• Blocking
– Instead of accessing entire rows or columns, subdivide matrices
into blocks
– Requires more memory accesses but improves locality of accesses
Y: 1 2 3 1 2 2 Z: 1 1 0 1 2 1 X: 9
0 0 2 0 1 1 1 0 0 1 3 1
1 2 3 2 2 1 2 2 0 3 2 0
2 3 0 2 0 1 3 3 2 1 1 1
1 2 3 4 1 0 1 0 1 4 0 2
3 2 1 2 0 1 0 1 2 0 1 2
1 1 + 2 1 + 3 2 for (jj=0;jj<N; jj = jj+1)
for (kk=0;kk<N; kk = kk+B)
1+ 2 + 6 for (i=0;i<N; i = i+1)
9 for (j=jj;j<min(jj+B,N);j=j+1)
{ r =0;
Intermediate result for (k=kk;k<min(kk+B,N);k=k+1)
r = r + Y[i][k]*Z[k][j];
X[i][j] = X[i][j] +r;
Dr. Sarvesh Vishwakarma TCS-704 ACA
};
Advanced Optimizations
Hardware Prefetching
• Fetch two blocks on miss (include next sequential
block)
Pentium 4 Pre-fetching
Dr. Sarvesh Vishwakarma TCS-704 ACA
Advanced Optimizations
Compiler Prefetching
• Insert prefetch instructions before data is needed
• Non-faulting: prefetch doesn’t cause exceptions
• Register prefetch
– Loads data into register
• Cache prefetch
– Loads data into cache
• Combine with loop unrolling and software
pipelining
Dr. Sarvesh Vishwakarma TCS-704 ACA
Advanced Optimizations
Summary
Dr. Sarvesh Vishwakarma TCS-704 ACA
Virtual Memory
• Three memory problems
– Not enough RAM
– Holes in our address space
– Programs writing over each other
• What is virtual memory?
– Indirection
– How does it solve the problem
– Pages Tables and Transaction
• Implementing virtual memory
– Where do we store the page tables?
– Making translation fast
Dr. Sarvesh Vishwakarma TCS-704 ACA
Max. addressing capability of 8085 procesor :
65536
216 = 28 28 = 256 256 = 65536 or = 64KB
1024
System bus (16 address lines)
0000H
8085 ROM RAM 2000H Disk
CPU 8k 8k Drive
1FFFH 3FFFH
2000H Storage
(8 K)
2000H
3FFFH
Program
(16 KB)
4000H
5FFFH
(8 K)
5FFFH
Dr. Sarvesh Vishwakarma TCS-704 ACA
What if we don’t have enough memory?
• MIPS gives each program its own 32-bit address
space
• Programs can access any byte in their 32-bit address
space
– Indirection
– How does it solve the problem
– Pages Tables and Transaction
• Implementing virtual memory
– Where do we store the page tables?
– Making translation fast
Dr. Sarvesh Vishwakarma TCS-704 ACA
Overlay
• Customer billing software without overlay
Get data from customer
Calculate total (1 KB)
Print bill
(1 KB)
• Customer billing software with overlay
Get data from customer
Calculate total (1 KB) Print bill (1 KB)
Load printing routine
Dr. Sarvesh Vishwakarma TCS-704 ACA
Virtual Memory
If a program became too large for physical memory, it was the
programmer’s job to make it fit. Programmers divided programs
into pieces, then identified the pieces that were mutually exclusive,
and loaded or unloaded those overlays under user program control
during execution. The Programmer ensured that the program never
tried to access more physical main memory than was in the
computer, and that the proper overlay was loaded at the proper time.
This responsibility eroded programmer productivity.
Virtual memory was invented to relieve programmers of this
burden. It automatically manages the two levels of the memory
hierarchy represented by main memory and secondary storage and
allows the programmer to develop the software assuming there is
more memory available in the computer than in reality
Dr. Sarvesh Vishwakarma TCS-704 ACA
Virtual Memory
Virtual
address
0 A 0
4K B 4K C
8K C 8K
12K D 12K
Virtual memory 16K A
20K
Mapping of virtual memory
to physical memory for a 24K B
program with four pages A, 28K
B, C, and D Physical main memory
Dr. Sarvesh Vishwakarma TCS-704 ACA
Disk
Making VM work: translation
• How does a program access memory?
1. Program executes a load with a virtual address (VA)
2. Computer translates the address to the physical
address (PA) in memory
3. If the physical address (PA) is not in memory, the
operating system loads it in from disk
4. The computer then reads the RAM using the physical
address (PA) and returns the data to the program
Dr. Sarvesh Vishwakarma TCS-704 ACA
Making VM work: translation
Dr. Sarvesh Vishwakarma TCS-704 ACA
Making VM work: translation
Dr. Sarvesh Vishwakarma TCS-704 ACA
What happens if a page is not in RAM?
• Page Table Entry says the page is on disk
• CPU generates a page fault exception
• The hardware jumps to the OS page fault handler to clean
up
– The OS chooses a page to evict from RAM and write to disk
– If the page is dirty, it needs to be written back to disk first
– The OS then reads the page from disk and puts it in RAM
– The OS then changes the Page Table to map the new page
• The OS jumps back to the instruction that caused the page
fault.
– This time it won’t cause a page fault since the page has been
loaded Dr. Sarvesh Vishwakarma TCS-704 ACA
How does the virtual memory system prevent
programs from accessing each other’s data?
❖Each program has its own virtual address space.
❖The virtual memory system is responsible for ensuring that different
programs virtual addresses translate to different physical addresses. So no
two programs have virtual addresses that map onto the same physical
address at the same time.
❖This mean that memory references from one program can not target the
physical addresses containing another program’s data, preventing programs
from accessing each other’s data.
Dr. Sarvesh Vishwakarma TCS-704 ACA
Caches Vs Virtual Memory
Parameter First-Level cache Virtual Memory
Block (page) size 16-128 bytes 4096-65536 bytes
Hit time 1-3 clock cycles 50-150 clock cycles
Miss penalty 8-150 clock cycles 1000,000-10,000,000 clock cycles
Access time 6-130 clock cycles 800,000-8,000,000 clock cycles
Transfer time 2-20 clock cycles 200,000-2,000,000 clock cycles
Miss rate 0.1-10% 0.00001-0.001%
Address mapping 25-45 bit physical 32-64 bit virtual address to 25-45 bit
address to 14-20 bit physical address
cache address
Dr. Sarvesh Vishwakarma TCS-704 ACA
Caches Vs Virtual Memory
1. Replacement on cache misses is primarily controlled by
hardware, while virtual memory replacement is primarily
controlled by the operating system.
2. The size of the processor address determines the size of
virtual memory, but the cache size is independent of the
processor address size.
3. In addition to acting as the lower-level backing store for main
memory in the hierarchy, secondary storage is also used for
the file system. In fact, the file system occupies most of
secondary storage. It is not normally in the address space.
Dr. Sarvesh Vishwakarma TCS-704 ACA
Mapping of a virtual address to a physical
address via a page table
Virtual address
Virtual page number Page offset
Main memory
Page table
Physical address
Dr. Sarvesh Vishwakarma TCS-704 ACA
Techniques for Fast Address Translation
Address space
number <8>
Virtual page Page offset
number <35> <13>
ASN Prot V Tag Physical address
<8> <4> <1> <35> <31>
1 2
44 bits physical address
Low-order 13 bits of address
3
128:1 MUX 4
High-order 31 bits of address
Dr. Sarvesh Vishwakarma TCS-704 ACA
Techniques for Fast Address Translation
1 Step 1 & 2 : Translation begins by sending the virtual address to all tags. The
tag must be marked valid to allow a match.
2 Step 2: At the same time, the type of memory access is checked for a violation
against protection information in the TLB.
To reduce TLB misses due to context switches, each entry has an 8-bit address
space number (ASN), which plays the same role as the PID. If the context
switching returns to the process with the same ASN, it can still match the TLB.
The process ASN and the page table entry (PTE) ASN must also match for a
valid tag.
3 Step 3: The matching tag sends corresponding physical address through
effectively a 128:1 multiplexor.
4 Step 4: The page offset is then combined with the physical page frame to form
a full physical address.
Dr. Sarvesh Vishwakarma TCS-704 ACA
Dr. Sarvesh Vishwakarma TCS-704 ACA