The Memory Hierarchy
Ideally one would desire an indefinitely large memory capacity such that any particular word would be immediately available. ... We are forced to recognize the possibility of constructing a hierarchy of memories, each of which has a greater capacity than the preceding but which is less quickly accessible.
A.W. Burks, H.H. Goldstine, and J. von Neumann, 1946.
The same is true today. Access to main memory is too slow for modern microprocessors. On a 100 MHz (pretty slow) microprocessor an addition takes 10 ns, while a memory reference takes 60-110 ns (using DRAM) or 25 ns (using EDO DRAM). On a 500 MHz microprocessor an add takes 2 ns, but a memory reference still takes at least 25 ns. To bridge this gap we use a memory hierarchy whose main component is the cache.
The Principle of Locality
The principle of locality states that a program accesses only a small part of its address space at any instant of time.
There are two types of locality:
Temporal locality (locality in time): if an item is accessed, it is likely to be accessed again soon.
Spatial locality (locality in space): if an item is accessed, items whose addresses are close by are likely to be accessed soon.
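As a concrete illustration (my own, not from the original notes), a simple C loop shows both kinds of locality: the running total sum is touched on every iteration (temporal locality), and the array a is traversed through adjacent addresses (spatial locality).

    /* Sketch: summing an array exhibits both kinds of locality. */
    #include <stdio.h>

    #define N 1024

    int main(void) {
        int a[N];
        for (int i = 0; i < N; i++)
            a[i] = i;

        int sum = 0;                 /* reused every iteration: temporal locality      */
        for (int i = 0; i < N; i++)
            sum += a[i];             /* a[0], a[1], ... are adjacent: spatial locality */

        printf("sum = %d\n", sum);
        return 0;
    }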
We exploit the principle of locality by implementing a memory hierarchy composed of multiple levels of memory with different sizes and speeds. SRAMs, which are faster (5-20 ns) and more expensive ($100-250 per MByte), are used closer to the CPU, while DRAMs (25-100 ns, $3-8 per MByte) are used as main memory.
[Figure: the memory hierarchy. The level closest to the CPU is the fastest, the smallest, and has the highest cost per bit; the level farthest from the CPU is the slowest, the biggest, and has the lowest cost per bit.]
Hits and Misses
In the memory hierarchy an upper level (closer to the CPU) holds a subset of any lower level (farther from the CPU).
Although the memory hierarchy can have multiple levels, data is transferred between two adjacent levels at a time. The minimum unit of data that can be present or not present is called a block or line.
If the data requested by the CPU is in a block of the upper level, this is called a hit; if it isn't and the lower level has to be accessed, it is called a miss.
The hit rate or hit ratio is the fraction of accesses found in the upper level. The miss rate (1 - hit rate) is the fraction of accesses not found in the upper level.
Hit Time and Miss Penalty
The hit time is the time needed to access the upper level, decide if
the data is there and get it to the CPU.
The miss penalty is the time it takes to replace a block in the upper
level with the block we need from the lower level and get the data
to the CPU.
The hit time of an upper level is much shorter than the hit time of a lower level. Thus if we have a high hit ratio at the upper levels, we get an access time close to that of the highest (and fastest) level, together with the size of the lowest (and slowest) level.
[Figure: levels in the memory hierarchy. Distance from the CPU, and with it the access time, increases from Level 1 through Level 2 down to Level n, while the size of the memory at each level grows.]
The Cache
Cache: a safe place for hiding or storing things. (Webster's Dictionary)
The level between main memory and the CPU was first called the cache in a 1968 paper describing the IBM 360/85, the first commercial machine with a cache. Nowadays all levels between main memory and the CPU are called caches.
Let's look at a simple cache in which the processor accesses a word at a time and the block size is one word. The processor requests the word Xn. It isn't in the cache, which results in a cache miss, so the word Xn is brought from memory into the cache.
[Figure: the cache (a) before the reference to Xn, holding X1, X2, X3, X4, Xn-1, Xn-2, and (b) after the reference to Xn, with Xn added.]
Direct Mapped Caches
How do we map memory locations into cache locations?
The simplest way is to use the address. When each memory location can be mapped to only one cache location, we call the cache a direct mapped cache. Almost all direct mapped caches use the mapping:
cache index = (block address) mod (number of blocks in the cache)
When the number of blocks is a power of 2, the lower log2(number of blocks) bits of the block address are used to map the memory location into the cache.
[Figure: an 8-block direct mapped cache (indices 000-111). Memory addresses 00001, 01001, 10001, and 11001 map to cache index 001, and addresses 00101, 01101, 10101, and 11101 map to index 101: in each case the low 3 bits of the address select the cache block.]
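A minimal sketch (my own, using the addresses from the figure) showing that for a power-of-two number of blocks, the modulo mapping is the same as keeping the low-order bits of the block address:

    #include <assert.h>
    #include <stdio.h>

    #define NUM_BLOCKS 8                         /* must be a power of 2 */

    /* Direct mapped placement: block address modulo the number of blocks. */
    unsigned index_mod(unsigned block_addr) {
        return block_addr % NUM_BLOCKS;
    }

    /* Equivalent for a power-of-two cache: keep the low log2(NUM_BLOCKS) bits. */
    unsigned index_bits(unsigned block_addr) {
        return block_addr & (NUM_BLOCKS - 1);
    }

    int main(void) {
        /* The memory addresses from the figure: 00001, 00101, ..., 11101. */
        unsigned addrs[] = {1, 5, 9, 13, 17, 21, 25, 29};
        for (int i = 0; i < 8; i++) {
            assert(index_mod(addrs[i]) == index_bits(addrs[i]));
            printf("block address %2u -> cache index %u\n", addrs[i], index_mod(addrs[i]));
        }
        return 0;
    }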
Tags
If each cache location is mapped to several memory locations, how do we know whether the data in the cache corresponds to the data requested from memory?
We add a tag to each block. The tag contains the upper bits of the address, those not used to index the cache.
We also need a way to determine whether the cache holds valid information. At start-up the cache holds garbage. In order not to use it by mistake, a valid bit is added to each block. If the bit isn't set, there can't be a match.
[Figure: a direct mapped cache with 1024 one-word blocks. Bits 1-0 of the 32-bit address are the byte offset, bits 11-2 (10 bits) are the index, and bits 31-12 (20 bits) are the tag, which is compared against the stored tag of the indexed entry together with its valid bit to produce the hit signal and the 32-bit data.]
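A simplified sketch (my own, assuming the geometry of the figure: 1024 one-word blocks, a 10-bit index, and a 20-bit tag) of how a direct mapped cache decides hit or miss using the tag and the valid bit:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define NUM_BLOCKS 1024                       /* 2^10 one-word blocks */

    struct cache_line {
        bool     valid;                           /* set once the line holds real data */
        uint32_t tag;                             /* upper 20 bits of the address      */
        uint32_t data;                            /* one 32-bit word                   */
    };

    static struct cache_line cache[NUM_BLOCKS];

    /* Returns true on a hit and places the word in *word. */
    bool cache_read(uint32_t addr, uint32_t *word) {
        uint32_t index = (addr >> 2) & (NUM_BLOCKS - 1);   /* bits 11-2  */
        uint32_t tag   = addr >> 12;                       /* bits 31-12 */

        struct cache_line *line = &cache[index];
        if (line->valid && line->tag == tag) {             /* valid bit AND tag match */
            *word = line->data;
            return true;                                   /* hit */
        }
        return false;                                      /* miss: the lower level must be accessed */
    }

    int main(void) {
        uint32_t w, addr = 0x1234ABCC;                     /* an arbitrary word-aligned address */
        if (!cache_read(addr, &w))                         /* cold cache: a miss */
            cache[(addr >> 2) & (NUM_BLOCKS - 1)] =
                (struct cache_line){true, addr >> 12, 42}; /* fill the line from "memory" */
        cache_read(addr, &w);                              /* now a hit, w == 42 */
        printf("w = %u\n", w);
        return 0;
    }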
Cache Size
Assuming a 32-bit address, a direct mapped cache of 2^n words with 1-word (4-byte) blocks needs a tag field of 32 - (n+2) bits, because 2 bits are the byte offset and n bits are used for the index.
The total number of bits in such a direct mapped cache is
2^n * (32 + (32 - n - 2) + 1) = 2^n * (63 - n).
How many total bits are needed for a direct mapped cache with 64 KB of data in 1-word blocks?
64 KB = 16K words = 2^14 words = 2^14 blocks. Each block has 32 bits of data, 32 - 14 - 2 = 16 bits of tag, and a valid bit. Thus the total cache size is 2^14 * (32 + (32 - 14 - 2) + 1) = 2^14 * 49 bits = 98 KB. The cache is about 50% larger than the data it holds (for this configuration).
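A small sketch (my own) that evaluates the formula above for any number of index bits n:

    #include <stdio.h>

    /* Total bits in a direct mapped cache of 2^n one-word (4-byte) blocks
       with a 32-bit address: data + tag + valid bit per block. */
    long total_bits(int n) {
        long blocks = 1L << n;
        int  tag    = 32 - (n + 2);          /* 2 bits of byte offset, n index bits */
        return blocks * (32 + tag + 1);      /* = 2^n * (63 - n) */
    }

    int main(void) {
        int n = 14;                          /* 2^14 blocks = 64 KB of data */
        printf("total bits = %ld (= %ld KB)\n",
               total_bits(n), total_bits(n) / (8 * 1024));
        return 0;
    }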
The DECStation 3100
The DECStation 3100 was a workstation that used the MIPS R2000 processor and had two 64 KB caches, one for data and one for instructions.
A read is simple:
Send the address to the I-cache or D-cache. The address comes either from the PC (instruction fetch) or from the ALU (data access).
If the cache signals a hit, the data is ready on the cache lines. If the cache signals a miss, the address is sent to main memory. When the data is brought from memory, it is written into the cache.
[Figure: one of the DECStation 3100 caches. The 32-bit address is split into a 2-bit byte offset, a 14-bit index (bits 15-2) selecting one of 16K one-word entries, and a 16-bit tag (bits 31-16); the stored tag and valid bit of the indexed entry produce the hit signal alongside the 32-bit data.]
Writes in a Cache
Writes work differently. On a store instruction we write the data into the D-cache; now main memory has a different value from the cache. The cache and memory are inconsistent.
The simple way to keep them consistent is to write the data both to memory and to the cache. This is called write-through.
On a write miss there is no reason to read the block from memory. We can just overwrite the data in the cache block and change the tag. In fact we can do this for a write hit as well. Thus in the DECStation 3100 a write works like this:
Index the cache using bits 15-2 of the address.
Write bits 31-16 of the address into the tag, write the data value, and set the valid bit.
Write the word to main memory.
The problem with write-through is that the writes to memory slow
down the processor. The solution is to use a write-buffer.
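A sketch (my own, assuming the 16K-entry, one-word-block organization described above; the write-buffer is left out and main memory is modeled as a toy array) of the write-through store sequence listed above:

    #include <stdint.h>

    #define NUM_ENTRIES (16 * 1024)                        /* 16K one-word entries */
    #define MEM_WORDS   (1 << 20)                          /* toy main memory size */

    struct line { int valid; uint16_t tag; uint32_t data; };
    static struct line dcache[NUM_ENTRIES];
    static uint32_t    main_memory[MEM_WORDS];

    /* Stand-in for the path to main memory. */
    static void memory_write(uint32_t addr, uint32_t value) {
        main_memory[(addr >> 2) % MEM_WORDS] = value;
    }

    /* Write-through store: update the cache entry whether it is a hit or a
       miss, then write the word to main memory as well. */
    void cache_write(uint32_t addr, uint32_t value) {
        uint32_t index = (addr >> 2) & (NUM_ENTRIES - 1);  /* bits 15-2  */
        uint16_t tag   = (uint16_t)(addr >> 16);           /* bits 31-16 */

        dcache[index].valid = 1;                           /* set the valid bit    */
        dcache[index].tag   = tag;                         /* overwrite the tag    */
        dcache[index].data  = value;                       /* write the data value */

        memory_write(addr, value);                         /* write-through to memory */
    }

    int main(void) {
        cache_write(0x00001234, 99);   /* cache and memory now hold the same value */
        return 0;
    }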
Write-Back Caches
The data in the write-buffer is written to memory in parallel with the CPU continuing its computation. This cuts down the penalty of the write-through scheme.
If the write-buffer is full, the CPU is stalled until data in the buffer has been written to memory. In the DECStation 3100 the write-buffer can hold 4 blocks.
The alternative to write-through is write-back. On a write the data is written into the cache only; only when the block is replaced is it written back into main memory. Write-back caches are better when the CPU generates writes faster than main memory can handle them.
The miss rates for two popular programs on the DECStation 3100 are:

Program   I-cache miss rate   D-cache miss rate   Combined miss rate
gcc       6.1%                2.1%                5.4%
spice     1.2%                1.3%                1.2%
Taking Advantage of Spatial Locality
In order to take advantage of spatial locality we need a block size larger than 1 word. When a miss occurs we fetch multiple adjacent words that we will probably use shortly.
The mapping of a memory address to a cache entry is the same; we just use fewer bits for the index and more bits for the block offset.
Read misses are processed the same as for single-word blocks. Writes are different: we can't just write the new data into the cache. Assume there are two addresses X and Y that map into cache block C, and C currently contains Y. If the CPU writes into X and we simply overwrite the tag and write the value into C, then C holds 1 word of X and 3 words of Y under the tag of X. So on a write miss we must first read the block from memory.
Spatial locality improves the hit rate. Let's say the byte addresses 16, 24, and 20 are requested by the program (with 16-byte blocks). Reading the block that contains address 16 brings addresses 16-31 into the cache, so the accesses to 24 and 20 are now hits instead of misses.
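A small sketch (my own, assuming 16-byte / 4-word blocks) showing that byte addresses 16, 24, and 20 all fall in the same block:

    #include <stdio.h>

    #define BLOCK_BYTES 16                               /* 4-word blocks */

    int main(void) {
        unsigned addrs[] = {16, 24, 20};
        for (int i = 0; i < 3; i++) {
            unsigned block  = addrs[i] / BLOCK_BYTES;    /* block address        */
            unsigned offset = addrs[i] % BLOCK_BYTES;    /* byte offset in block */
            printf("byte address %2u -> block %u, offset %2u\n",
                   addrs[i], block, offset);
        }
        /* All three map to block 1 (bytes 16-31), so after the miss on
           address 16 the accesses to 24 and 20 hit in the cache. */
        return 0;
    }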
4 Word Block Diagram
[Figure: a 16 KB direct mapped cache with 4-word (16-byte) blocks: 4K entries, a 2-bit byte offset, a 2-bit block offset (bits 3-2) selecting one of the four 32-bit words of the 128-bit block through a multiplexor, a 12-bit index (bits 15-4), and a 16-bit tag (bits 31-16) compared against the stored tag to produce the hit signal.]
Miss Rate for 4 Word Block
The miss rates for gcc and spice with 1-word and 4-word blocks are:

Program   Block size (words)   I-cache   D-cache   Combined
gcc       1                    6.1%      2.1%      5.4%
gcc       4                    2.0%      1.7%      1.9%
spice     1                    1.2%      1.3%      1.2%
spice     4                    0.3%      0.6%      0.4%
[Figure: miss rate versus block size (16, 64, and 256 bytes) for cache sizes of 1 KB, 8 KB, 16 KB, 64 KB, and 256 KB.]
The miss rate rises if the block size is too large, because not all the data in a block is used and space in the cache is wasted.
Memory System Support for Caches
Cache misses read data from main memory, which is built from DRAMs. Although it is hard to reduce the latency to the first word, it is possible to increase the bandwidth from memory to the cache.
Let's define the access time for main memory:
1 clock cycle to send the address
15 clock cycles to read a word from DRAM
1 clock cycle to send a word of data
For a cache block of 4 words and a 1-word-wide memory bank, the miss penalty is 1 + 4*15 + 4*1 = 65 cycles.
If we widen the memory and the buses between memory and cache, we can reduce the miss penalty. For a 2-word-wide memory the miss penalty is 1 + 2*15 + 2*1 = 33 cycles. For a 4-word-wide memory the miss penalty is 1 + 1*15 + 1 = 17 cycles. But we pay the cost of a wide memory and wide buses.
Interleaved Memory
The third option interleaves memory into multiple banks, with sequential addresses in sequential banks. We can read 4 words at the same time, but the data is sent to the cache one word at a time. With interleaved memory the miss penalty is 1 + 1*15 + 4*1 = 20 cycles. Using banks also helps with write-through caches, since the write bandwidth is quadrupled.
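A small sketch (my own) that computes the miss penalty of a 4-word block under the four memory organizations discussed, using the timings assumed above (1 cycle to send the address, 15 cycles per DRAM access, 1 cycle per bus transfer):

    #include <stdio.h>

    #define ADDR_CYCLES 1      /* send the address      */
    #define DRAM_CYCLES 15     /* one DRAM access       */
    #define BUS_CYCLES  1      /* one bus transfer      */
    #define BLOCK_WORDS 4      /* words per cache block */

    /* Miss penalty when memory and bus are 'width' words wide. */
    int penalty_wide(int width) {
        int accesses = BLOCK_WORDS / width;
        return ADDR_CYCLES + accesses * DRAM_CYCLES + accesses * BUS_CYCLES;
    }

    /* Miss penalty with four interleaved one-word banks: the DRAM accesses
       overlap, but the words still cross the bus one at a time. */
    int penalty_interleaved(void) {
        return ADDR_CYCLES + 1 * DRAM_CYCLES + BLOCK_WORDS * BUS_CYCLES;
    }

    int main(void) {
        printf("1-word-wide memory : %d cycles\n", penalty_wide(1));       /* 65 */
        printf("2-word-wide memory : %d cycles\n", penalty_wide(2));       /* 33 */
        printf("4-word-wide memory : %d cycles\n", penalty_wide(4));       /* 17 */
        printf("interleaved memory : %d cycles\n", penalty_interleaved()); /* 20 */
        return 0;
    }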
[Figure: three memory organizations: (a) one-word-wide memory, (b) wide memory with a multiplexor between the cache and the CPU, and (c) interleaved memory with four one-word-wide banks on a one-word bus.]
Computing the Average Memory Access Time
The average memory access time is:
(hit rate * hit time) + (miss rate * miss penalty)
Assuming a hit time of 1 cycle and a miss penalty of 17 cycles, the average access time for a 98% hit rate is:
0.98*1 + 0.02*17 = 1.32 cycles
Even with a high hit ratio, the average access time is high compared to an R-type instruction, which takes a single cycle. The solution is to introduce another level of cache between main memory and the CPU.
All modern microprocessors have an on-chip cache, called the L1 cache, and another, larger off-chip cache called the L2 cache. Assume an L1 hit time of 1 cycle, an L2 hit time of 5 cycles, and an L2 miss penalty of 17 cycles. Given an L1 hit rate of 98% and an L2 hit rate of 98%, the average access time is:
0.98*1 + 0.02*(0.98*5 + 0.02*17) = 1.08 cycles
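A small sketch (my own) of these two calculations, using the convention above that the miss penalty is the total time to get the data on a miss:

    #include <stdio.h>

    /* Average access time as a weighted sum of hit time and miss penalty. */
    double amat(double hit_rate, double hit_time, double miss_penalty) {
        return hit_rate * hit_time + (1.0 - hit_rate) * miss_penalty;
    }

    int main(void) {
        /* One cache: hit time 1 cycle, miss penalty 17 cycles, 98% hit rate. */
        double one_level = amat(0.98, 1.0, 17.0);

        /* Two levels: an L1 miss costs the average access time of the L2. */
        double l2        = amat(0.98, 5.0, 17.0);
        double two_level = amat(0.98, 1.0, l2);

        printf("one level : %.2f cycles\n", one_level);    /* 1.32  */
        printf("two levels: %.2f cycles\n", two_level);    /* ~1.08 */
        return 0;
    }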
Cache Associativity
We have seen one mapping scheme, direct mapped: each memory location can be mapped to only one cache location.
At the other extreme is the fully associative cache: each memory location can be placed in any cache block. To find a block we must compare the tags of all blocks with the memory address in parallel.
The middle range is called set associative: a block can be mapped to a fixed number of locations, called a set. The tags in the set are compared in parallel with the memory address.
[Figure: locating a block in a direct mapped cache (only one possible block position), a set associative cache (search the tags of one set), and a fully associative cache (search the tags of all blocks).]
Set Associative Caches
We can look at all the mapping schemes as variations of set associative mapping. Direct mapping is 1-way set associative, and fully associative mapping is m-way set associative, where m is the number of blocks in the cache. A cache with 4 blocks per set is called 4-way set associative.
The advantage of set associative caches is a reduced miss rate; the disadvantage is an increased hit time.
[Figure: an 8-block cache organized as one-way set associative (direct mapped, 8 sets), two-way set associative (4 sets), four-way set associative (2 sets), and eight-way set associative (fully associative, 1 set), each set holding tag/data pairs.]
Set Associative Diagram
The number of index bits is log2(n) where n is the number of sets
in the cache, not blocks as
in a direct mapped cache.
[Figure: a four-way set associative cache with 256 sets. The address supplies an 8-bit index (bits 9-2) that selects a set and a 22-bit tag (bits 31-10) that is compared against the four stored tags of the set in parallel; a 4-to-1 multiplexor selects the matching 32-bit data word and the comparators produce the hit signal.]
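A sketch (my own, matching the figure's assumptions: 256 sets, 4 ways, one-word blocks) of splitting an address and searching the selected set; the four tag comparisons happen in parallel in hardware, shown here as a loop:

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_SETS 256                        /* 8 index bits          */
    #define NUM_WAYS 4                          /* 4-way set associative */

    struct way { bool valid; uint32_t tag; uint32_t data; };
    static struct way cache[NUM_SETS][NUM_WAYS];

    /* Look up a word: the index selects a set, then every way's tag is checked. */
    bool cache_lookup(uint32_t addr, uint32_t *word) {
        uint32_t index = (addr >> 2) & (NUM_SETS - 1);     /* bits 9-2   */
        uint32_t tag   = addr >> 10;                       /* bits 31-10 */

        for (int w = 0; w < NUM_WAYS; w++) {
            if (cache[index][w].valid && cache[index][w].tag == tag) {
                *word = cache[index][w].data;              /* the 4-to-1 mux selects the hit way */
                return true;
            }
        }
        return false;                                      /* miss in all ways of the set */
    }

    int main(void) {
        uint32_t w;
        cache[5][2] = (struct way){true, 0xABC, 7};        /* pretend a fill happened      */
        uint32_t addr = (0xABCu << 10) | (5u << 2);        /* tag 0xABC, index 5, offset 0 */
        return cache_lookup(addr, &w) ? 0 : 1;             /* hits, w == 7                 */
    }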
Replacing a Cache Block
In a direct mapped cache there is no decision to make: there is only one block that can be replaced. In a set associative cache we have to decide which block in the set to replace.
The most commonly used scheme is least recently used (LRU): the block that has been unused for the longest time is replaced. Exact LRU is hard to implement for associativities higher than 2, so the scheme actually used is an approximation of LRU called pseudo-LRU.
Some caches use a random replacement scheme, which isn't much worse than true LRU.
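For a 2-way set associative cache, exact LRU needs only a single bit per set. A minimal sketch (my own illustration):

    /* One LRU bit per set: it names the way that was used least recently. */
    #define NUM_SETS 256
    static int lru_way[NUM_SETS];                /* 0 or 1 */

    /* Called on every access that uses 'way' of 'set'. */
    void lru_touch(int set, int way) {
        lru_way[set] = 1 - way;                  /* the other way is now least recent */
    }

    /* Called on a miss to pick the victim way to replace. */
    int lru_victim(int set) {
        return lru_way[set];
    }

    int main(void) {
        lru_touch(0, 1);                         /* way 1 of set 0 was just used */
        return lru_victim(0);                    /* so way 0 would be replaced   */
    }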
The miss rates for gcc when using set associative caches are (for spice the results are the same for all three associativities):

Associativity   I-cache   D-cache   Combined
1               2.0%      1.7%      1.9%
2               1.6%      1.4%      1.5%
4               1.6%      1.4%      1.5%
Miss Rates for Set Associative Caches
The block size is 32 bytes. The applications are the SPEC92
benchmarks.
[Figure: miss rate versus associativity (one-way, two-way, four-way, eight-way) for cache sizes of 1 KB, 2 KB, 4 KB, 8 KB, 16 KB, 32 KB, 64 KB, and 128 KB.]
The 3 Cs
Cache misses can be divided into 3 categories:
Compulsory misses: misses that occur because the block has never been in the cache. These are also called cold-start misses. Increasing the block size reduces compulsory misses, but too large a block size can cause capacity misses and increases the miss penalty.
Capacity misses: misses caused when the cache can't contain all the blocks needed during a program's execution. The block was in the cache, was replaced, and is now needed again. Enlarging the cache reduces capacity misses, but enlarging it too much raises the access time.
Conflict misses: occur in direct mapped and set associative caches when multiple blocks compete for the same set. Increasing associativity reduces conflict misses, but too high an associativity increases the access time.