CSE 410
Computer Systems
Hal Perkins
Spring 2010

Lecture 13: Cache Writes and Performance
Reading

Computer Organization and Design
Section 5.1 Introduction
Section 5.2 Basics of Caches
Section 5.3 Measuring and Improving Cache Performance
Cache Writing & Performance

What's left?

Writing to caches: keeping memory consistent & write-allocation.

We'll also investigate some main memory organizations that can help increase memory system performance.

Later, we'll talk about Virtual Memory, where memory is treated like a cache of the disk.
Four important questions

1. When we copy a block of data from main memory to the cache, where exactly should we put it?
2. How can we tell if a word is already in the cache, or if it has to be fetched from main memory first?
3. Eventually, the small cache memory might fill up. To load a new block from main RAM, we'd have to replace one of the existing blocks in the cache... which one?
4. How can write operations be handled by the memory system?

We've answered the first three. Now we consider the fourth.
Writing to a cache

Writing to a cache raises several additional issues.

First, let's assume that the address we want to write to is already loaded in the cache. We'll assume a simple direct-mapped cache.

  Cache:  Index  V  Tag    Data
          110    1  11010  42803

  Memory: Address    Data
          1101 0110  42803
If we write a new value to that address, we can store the new data in
the cache, and avoid an expensive main memory access.
Mem[214] = 21763
  Cache:  Index  V  Tag    Data
          110    1  11010  21763

  Memory: Address    Data
          1101 0110  42803
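As a concrete illustration, here is a rough C sketch (not from the slides) of how an 8-bit address such as 1101 0110 splits into the 5-bit tag and 3-bit index of this direct-mapped cache, assuming one-word blocks (so there are no block-offset bits):

    #include <stdio.h>

    int main(void) {
        unsigned addr  = 0xD6;        /* 1101 0110 = 214 decimal */
        unsigned index = addr & 0x7;  /* low 3 bits:  110 (6)    */
        unsigned tag   = addr >> 3;   /* high 5 bits: 11010 (26) */
        printf("address %u -> tag %u, index %u\n", addr, tag, index);
        return 0;
    }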
Inconsistent memory

But now the cache and memory contain different, inconsistent data!

How can we ensure that subsequent loads will return the right value?

This is also problematic if other devices are sharing the main memory, as in a multiprocessor system.

  Cache:  Index  V  Tag    Data
          110    1  11010  21763

  Memory: Address    Data
          1101 0110  42803
Write-through caches

A write-through cache solves the inconsistency problem by forcing all writes to update both the cache and the main memory.

Mem[214] = 21763

  Cache:  Index  V  Tag    Data
          110    1  11010  21763

  Memory: Address    Data
          1101 0110  21763

This is simple to implement and keeps the cache and memory consistent.

Why is this not so good?
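As a rough C sketch (not the slides' code), a write-through store on a write hit might look like this; cache_line, index_bits, and mem[] are hypothetical stand-ins for the hardware, assuming a direct-mapped cache with one-word blocks:

    typedef struct { int valid; unsigned tag; int data; } cache_line;

    /* Write-through: update the cache on a hit, and always update memory. */
    void write_through(cache_line cache[], int mem[],
                       unsigned addr, int value, unsigned index_bits) {
        unsigned index = addr & ((1u << index_bits) - 1);
        unsigned tag   = addr >> index_bits;
        if (cache[index].valid && cache[index].tag == tag)
            cache[index].data = value;  /* keep the cached copy current */
        mem[addr] = value;              /* every write reaches main memory */
    }

The drawback hinted at above: every store pays the main memory latency, even when the address hits in the cache.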
Write-back caches

In a write-back cache, the memory is not updated until the cache block needs to be replaced (e.g., when loading data into a full cache set).

For example, we might write some data to the cache at first, leaving it inconsistent with the main memory as shown before. The cache block is marked dirty to indicate this inconsistency.

Mem[214] = 21763

  Cache:  Index  V  Dirty  Tag    Data
          110    1  1      11010  21763

  Memory: Address    Data
          1000 1110  1225
          1101 0110  42803

Subsequent reads to the same memory address will be serviced by the cache, which contains the correct, updated data.
Finishing the write back

We don't need to store the new value back to main memory unless the cache block gets replaced.

For example, on a read from Mem[142], which maps to the same cache block, the modified cache contents will first be written to main memory.

  Cache:  Index  V  Dirty  Tag    Data
          110    1  1      11010  21763

  Memory: Address    Data
          1000 1110  1225
          1101 0110  21763

Only then can the cache block be replaced with data from address 142.

  Cache:  Index  V  Dirty  Tag    Data
          110    1  0      10001  1225

  Memory: Address    Data
          1000 1110  1225
          1101 0110  21763
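A minimal C sketch of the dirty-bit bookkeeping (my own illustration; wb_line, index_bits, and mem[] are hypothetical, assuming a direct-mapped cache with one-word blocks):

    typedef struct { int valid, dirty; unsigned tag; int data; } wb_line;

    /* Write-back store: a dirty victim is flushed to memory only when
       its block is about to be replaced. */
    void write_back_store(wb_line cache[], int mem[],
                          unsigned addr, int value, unsigned index_bits) {
        unsigned index = addr & ((1u << index_bits) - 1);
        unsigned tag   = addr >> index_bits;
        wb_line *line  = &cache[index];
        if (!(line->valid && line->tag == tag)) {      /* write miss */
            if (line->valid && line->dirty) {
                unsigned victim = (line->tag << index_bits) | index;
                mem[victim] = line->data;   /* finish the write back first */
            }
            line->valid = 1;
            line->tag   = tag;
        }
        line->data  = value;   /* the store itself touches only the cache */
        line->dirty = 1;
    }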
Write-back cache discussion

The advantage of write-back caches is that not all write operations need to access main memory, as with write-through caches.

If a single address is frequently written to, then it doesn't pay to keep writing that data through to main memory.

If several bytes within the same cache block are modified, they will only force one memory write operation at write-back time.
Write-back cache discussion

Each block in a write-back cache needs a dirty bit to indicate whether or not it must be saved to main memory before being replaced; otherwise we might perform unnecessary writebacks.

Notice the penalty for the main memory access will not be applied until the execution of some subsequent instruction following the write.

In our example, the write to Mem[214] affected only the cache.

But the load from Mem[142] resulted in two memory accesses: one to save data to address 214, and one to load data from address 142.
Write misses

A second scenario is if we try to write to an address that is not already contained in the cache; this is called a write miss.

Let's say we want to store 21763 into Mem[1101 0110], but we find that address is not currently in the cache.

  Cache:  Index  V  Tag    Data
          110    1  00010  123456

  Memory: Address    Data
          1101 0110  6378

When we update Mem[1101 0110], should we also load it into the cache?
Write-around caches (a.k.a. write-no-allocate)

With a write-around policy, the write operation goes directly to main memory without affecting the cache.

Mem[214] = 21763

  Cache:  Index  V  Tag    Data
          110    1  00010  123456

  Memory: Address    Data
          1101 0110  21763

This is good when data is written but not immediately used again, in which case there's no point to load it into the cache yet.
for (int i = 0; i < SIZE; i++)
    a[i] = i;    /* each a[i] is written once and not re-read soon */
Allocate on write

An allocate on write strategy would instead load the newly written data into the cache.

Mem[214] = 21763

  Cache:  Index  V  Tag    Data
          110    1  11010  21763

  Memory: Address    Data
          1101 0110  21763
If that data is needed again soon, it will be available in the
cache.
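To contrast the two miss policies, here is a rough sketch (my own, reusing the hypothetical wb_line from the write-back example): write-around sends a missing store straight to memory, while allocate-on-write fills the block first.

    /* Write-around (write-no-allocate): on a miss, bypass the cache. */
    void write_around(wb_line cache[], int mem[],
                      unsigned addr, int value, unsigned index_bits) {
        unsigned index = addr & ((1u << index_bits) - 1);
        unsigned tag   = addr >> index_bits;
        if (cache[index].valid && cache[index].tag == tag)
            cache[index].data = value;  /* hits still update the cache */
        mem[addr] = value;
    }

    /* Allocate on write: write_back_store() above already behaves this
       way, since it brings the block into the cache before writing it. */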
Basic main memory design

There are some ways the main memory can be organized to reduce miss penalties and help with caching.

For some concrete examples, let's assume the following three steps are taken when a cache needs to load data from the main memory.

1. It takes 1 cycle to send an address to the RAM.
2. There is a 15-cycle latency for each RAM access.
3. It takes 1 cycle to return data from the RAM.

[Diagram: CPU, cache, and main memory connected by one-word-wide buses]

In the setup shown here, the buses from the CPU to the cache and from the cache to RAM are all one word wide. If the cache has one-word blocks, then filling a block from RAM (i.e., the miss penalty) would take 17 cycles.

1 + 15 + 1 = 17 clock cycles

The cache controller has to send the desired address to the RAM, wait, and receive the data.
Miss penalties for larger cache blocks

If the cache has four-word blocks, then loading a single block would need four individual main memory accesses, and a miss penalty of 68 cycles!

4 x (1 + 15 + 1) = 68 clock cycles

[Diagram: CPU, cache, and main memory, one-word-wide buses]
A wider memory

A simple way to decrease the miss penalty is to widen the memory and its interface to the cache, so we can read multiple words from RAM in one shot.

If we could read four words from the memory at once, a four-word cache load would need just 17 cycles.

1 + 15 + 1 = 17 cycles

[Diagram: CPU, cache, and main memory connected by four-word-wide buses]

The disadvantage is the cost of the wider buses: each additional bit of memory width requires another connection to the cache.
An interleaved memory

Another approach is to interleave the memory, or split it into banks that can be accessed individually. The main benefit is overlapping the latencies of accessing each word.

For example, if our main memory has four banks, each one byte wide, then we could load four bytes into a cache block in just 20 cycles.

1 + 15 + (4 x 1) = 20 cycles

[Diagram: CPU and cache connected to a main memory split into Bank 0, Bank 1, Bank 2, and Bank 3]

Our buses are still one byte wide here, so four cycles are needed to transfer data to the cache. This is cheaper than implementing a four-byte bus, but not too much slower.
Interleaved memory accesses

[Timing diagram: four overlapped loads (Load word 1 through Load word 4), each with a 1-cycle address send, a 15-cycle bank latency, and a 1-cycle data return]

Here is a diagram to show how the memory accesses can be interleaved.

The magenta cycles represent sending an address to a memory bank. Each memory bank has a 15-cycle latency, and it takes another cycle (shown in blue) to return data from the memory.

This is the same basic idea as pipelining! As soon as we request data from one memory bank, we can go ahead and request data from another bank as well.

Each individual load takes 17 clock cycles, but four overlapped loads require just 20 cycles.
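The three miss penalties above can be summarized in a small C sketch (my own, with the 1/15/1 cycle costs as assumed parameters):

    #include <stdio.h>

    enum { SEND = 1, LATENCY = 15, XFER = 1 };  /* cycle costs from above */

    int narrow_bus(int words)  { return words * (SEND + LATENCY + XFER); }
    int wide_bus(void)         { return SEND + LATENCY + XFER; }
    int interleaved(int words) { return SEND + LATENCY + words * XFER; }

    int main(void) {
        printf("one-word bus, 4 words: %d cycles\n", narrow_bus(4));   /* 68 */
        printf("four-word-wide bus:    %d cycles\n", wide_bus());      /* 17 */
        printf("4 interleaved banks:   %d cycles\n", interleaved(4));  /* 20 */
        return 0;
    }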
Summary

Writing to a cache poses a couple of interesting issues.

Write-through and write-back policies keep the cache consistent with main memory in different ways for write hits.

Write-around and allocate-on-write are two strategies to handle write misses, differing in whether updated data is loaded into the cache.

Memory system performance depends upon the cache hit time, miss rate, and miss penalty, as well as the actual program being executed.

We can use these numbers to find the average memory access time. We can also revise our CPU time formula to include stall cycles.

AMAT = Hit time + (Miss rate x Miss penalty)
Memory stall cycles = Memory accesses x miss rate x miss penalty
CPU time = (CPU execution cycles + Memory stall cycles) x Cycle time
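For instance, with illustrative values (chosen here only for the example) of a 1-cycle hit time, a 5% miss rate, and a 17-cycle miss penalty:

AMAT = 1 + (0.05 x 17) = 1.85 cycles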
The organization of a memory system affects its performance.

The cache size, block size, and associativity affect the miss rate.

We can organize the main memory to help reduce miss penalties. For example, interleaved memory supports pipelined data accesses.