ECE 511 - Fall 2019 - Lecture 5 - Cache Concept

The document discusses the cache concept in computer architecture, focusing on memory hierarchy, cache basics, and the effective memory access time (EMAT). It explains the differences between SRAM and DRAM technologies, their respective roles in cache and main memory, and the importance of locality principles in optimizing cache performance. Additionally, it covers cache organization types, hit/miss rates, and the impact of cache design on overall system performance.

ECE 511

Cache Concept

Khaled N. Khasawneh
George Mason University
Fall 2019
Topics Covered
n The Memory System
q Memory Hierarchy

q Cache, Main memory, and Secondary memory

n Cache Basics
q Block size

q Hit and Miss; Hit rate and Miss rate

q Effective memory access time (EMAT)

n How the cache works.


n Memory organization
n Cache organization
n Cache and the Pipeline
n Direct-mapped cache
n Fully-associative cache
n Set-associative cache

THE MEMORY SYSTEM


Memory Hierarchy
Principle of Locality
Cache, Main memory and Secondary memory
Memory Speed
n Thus far, it has been assumed that memory operates at
the same speed as the processor.

[Figure: five-stage pipeline datapath (IF, ID/RR, EX, MEM, WB), with the instruction memory (I-mem) in the IF stage and the data memory (D-mem) in the MEM stage.]

Instructions fetched from instruction memory.


Data read from and written to data memory.

Remember, memory is not part of the processor datapath.


Idealism

n Instruction Supply
q Zero-cycle latency
q Infinite capacity
q Zero cost
q Perfect control flow
n Pipeline (instruction execution)
q No pipeline stalls
q Perfect data flow (reg/memory dependencies)
q Zero-cycle interconnect (operand communication)
q Enough functional units
n Data Supply
q Zero-cycle latency
q Infinite capacity
q Infinite bandwidth
q Zero cost
Memory and Processor Performance Gap!
n Memory technology evolves slower!

[Figure: processor vs. memory performance over time, showing a widening gap.]
Memory Speed
n In fact, memory is orders of magnitude slower (than the CPU).
q Processor speeds are on the order of Gigahertz (GHz).

q Processor clock period is ~ 0.3ns-1ns.

q Main memory access time is ~ 50ns-100ns.

q Main memory access requires many processor clock cycles.


Memory Speed
n Must bridge the gap between the processor speed and the memory
speed for pipelining to be effective.
q Otherwise, pipeline must be stalled for every memory access.

n When fetching instructions from instruction memory.


n When reading data from or writing data to data memory.
The Memory Hierarchy
n The goal of the design of a memory system is to reduce the time
required to access the memory.
q Known as the Effective Memory Access Time (EMAT).
n While providing sufficient storage for the computing system, and
maintaining a reasonable cost.
n A hierarchical memory system is designed to meet these objectives.
n It is designed using a variety of memory technologies.

Why?
Register File
n The register file is a fast memory.
q It operates at the speed of the processor.
q It can be read/written in a portion of one
clock cycle.
q Pipeline does not need to be stalled when
accessing the register file.
q (It is part of the processor datapath).

n The speed of the register file can be attributed to


q The SRAM technology used to realize it.
q And its small size.

Why not realize main memory using the same technology?


Main Memory
n Consider the use of SRAM technology for main memory.
q High power consumption.
q Requires large area on die (chip).
q Cost per 1 GB (as of 2018) = $700-$1000
q Access time (as of 2018) = 0.3ns-0.5ns

[Figure: an SRAM bit-cell.]

EMAT = TSRAM = 0.5 nsec

n SRAM is fast, but expensive.
n Not feasible, financially, to realize large memories using SRAM.
n Use SRAM to realize small, high-speed cache memories.
Main Memory
n Instead, consider the use of DRAM technology.
q Low power consumption.
q Small size (per bit).
q Ideal for large memories.
q Cost per 1 GB (as of 2018) = $7-$10
q Access time (as of 2018) = 50ns-70ns

[Figure: a DRAM bit-cell.]

EMAT = TDRAM = 70 nsec

n DRAM is dense (small size per bit) and inexpensive, but slow.
n While it does not meet the speed requirements of the processor, it
is the technology of choice for the main memory.
Cache and Main Memory
n Design the memory system using both SRAM and DRAM.
n Cache (SRAM)
q Use a small amount of SRAM.
q Minimize cost.
q Achieve access time required by CPU.
n Main Memory (DRAM)
q Use a large amount of DRAM.
q Provide the needed storage capacity.

EMAT = TSRAM + ( mSRAM * TDRAM )

where TSRAM is the access time of SRAM, TDRAM is the access time of DRAM,
and mSRAM is the miss rate of the SRAM.
Analogy

[Figure: an analogy for the memory hierarchy, showing configurations with an L1 cache backed by main memory, and with additional L2 and L3 cache levels in between.]
Miss Rate and EMAT
n Miss rate
q The probability that the data or instruction needed by the processor is
not found in the high-speed cache (SRAM).

EMAT = TSRAM + ( mSRAM * TDRAM )    with TSRAM = 0.5 nsec, TDRAM = 70 nsec

mSRAM   EMAT
1       70.5 nsec
0.75    53.0 nsec
0.50    35.5 nsec
0.10    7.5 nsec
0.02    1.9 nsec
0       0.5 nsec

n The effective memory access time (EMAT) approaches the access time of
the SRAM as the miss rate of the SRAM decreases.
n The hierarchical memory system can achieve the speed of the processor
only when the SRAM miss rate is small.
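The table above can be reproduced with a short computation; this is an illustrative sketch using the slide's TSRAM and TDRAM values.

```python
# EMAT for the two-level (SRAM cache + DRAM main memory) system,
# using the access times given on this slide (in nanoseconds).
T_SRAM = 0.5   # cache (SRAM) access time
T_DRAM = 70.0  # main memory (DRAM) access time

def emat(miss_rate):
    # EMAT = TSRAM + ( mSRAM * TDRAM )
    return T_SRAM + miss_rate * T_DRAM

for m in (1.0, 0.75, 0.50, 0.10, 0.02, 0.0):
    print(f"mSRAM = {m:4.2f}  EMAT = {emat(m):5.1f} nsec")
```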
Cache Miss
n 3Cs:
q Compulsory
n The first access to a block is always a miss. Also referred to as
cold-start misses.
q Capacity
n Misses that result from the finite capacity of the cache.
q Conflict
n Misses that result from multiple blocks mapping to the same cache
location. Also referred to as collision misses.
The Principle of Locality
n The memory system can achieve the speed of the cache only when the
cache miss rate is small.
n This is possible because of the principle of locality.
n Spatial Locality
q If a program accesses memory location k, there is a high probability
that it will access the locations that follow it.
q Copy into the high-speed cache the data (or instruction) at location
k, as well as the data (or instructions) at locations k+1, k+2, etc. for
later use.
n Temporal Locality
q If a program is accessing a memory location now, there is a high
probability that it will access the same location again in the near
future.
q Copy into the high-speed cache the data (or instruction) at the
specified location for use now and again in the near future.
Take Advantage of Locality
n Design a hierarchical memory system.

1. Store all programs and data in secondary memory.
2. Transfer the instructions and data of the executing program from
secondary memory to the smaller, faster main memory.
3. Transfer the instructions being executed, and associated data, from
main memory to the even smaller, faster cache memory.
Cache and Main Memory
n Cache is hidden memory.
q It is not visible to the programmer.
n Main Memory (MM) is the physical memory that is visible to the
programmer.
q MM addresses are specified in the program.
q MM addresses are issued by the processor.
n Instructions and data that are in the cache are also in the main memory.
q Data and instructions are copied from main memory into the cache.
q Cache contains a subset of the data and instructions that are in
main memory.

[Figure: the memory hierarchy, faster and more expensive toward the top, slower and cheaper toward the bottom.]

CACHE BASICS
Block and Block size
Cache hit and miss
Effective Memory Access Time (EMAT)
Memory (and Cache) Block
n Block (aka Line)
q The smallest unit of memory that can be transferred between the
main memory and the cache.
q Can be a single word or many bytes.

[Figure: processor registers, cache, and main memory. The processor addresses individual bytes, but an entire block is transferred from main memory to the cache.]
Memory (and Cache) Block
n The word size and block size can be different.

n A word is a unit of memory access.


q The size of the data typically handled by the processor.

q For example: MIPS word = 4 bytes.

n A block is a unit of memory transfer.


q The size of the data transferred between main memory and the cache.

q The block size is specified by the memory system.

n The block size is often larger than the word size.


q It is typically a multiple of the word size.

[Figure: a 16-byte block subdivided into four 4-byte words.]
Cache Lookup
n The processor specifies a main memory address when
q Fetching an instruction from program memory.
q Reading data from or writing data to data memory.
n The cache is searched, using the MM address, for the instruction/data
needed.
n Hit: a successful search of the cache.
n Hit rate (h): the probability of finding the specified MM address in
the cache.
n Miss: an unsuccessful search of the cache.
n Miss rate (m): the probability of not finding the specified MM address
in the cache.
q m = 1 - h
n Miss penalty: the time penalty incurred when MM must be accessed
for the required instruction or data.

[Figure: the cache is searched using the MM address (e.g. lw $s0, 0x1040($zero) issues address 0x1040); on a hit, the data block is sent to the CPU.]
Effective Memory Access Time
n EMAT is the effective (or average) time experienced by
the processor when accessing the memory system.

EMAT = TC + ( mC * TMM ) Two-level Memory System

n TC = cache access time


q Time required to access cache
q Cache hit time
n mC = cache miss rate
n TMM = main memory access time
q Time required to access main memory
q Miss penalty
Effective Memory Access Time
n EMAT is defined recursively for a multi-level memory system.

EMATi = Ti + ( mi * EMATi+1 )

n Ti = access time of level i
n mi = miss rate of level i
n EMATi+1 = effective memory access time of level i+1

[Figure: a multi-level memory system with levels 0 through 3.]
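The recursion above can be evaluated from the innermost level out; the access times and miss rates below are illustrative assumptions, not values from the slides.

```python
# EMATi = Ti + ( mi * EMATi+1 ), with the last level always hitting.
# (access_time_ns, miss_rate) per level -- illustrative values only.
levels = [
    (0.5, 0.05),   # level 0: L1 cache
    (2.0, 0.20),   # level 1: L2 cache
    (10.0, 0.10),  # level 2: L3 cache
    (70.0, 0.00),  # level 3: main memory
]

def emat(i=0):
    t, m = levels[i]
    if i == len(levels) - 1:
        return t                  # last level: no further misses
    return t + m * emat(i + 1)    # Ti + mi * EMAT of next level

print(f"EMAT0 = {emat():.4f} ns")
```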
Multi-level Cache
n The goal of the design of a hierarchical memory system is to reduce
the effective memory access time.
n Modern processors use multiple levels of cache.
n Why not use a single large cache?
n Trying to meet two objectives at the same time:

EMAT = TC + ( mC * TMM )

q Reduce the cache access time by using a smaller cache, optimized for
speed.
q Decrease the cache miss rate by using a larger cache, optimized for a
low miss rate.

CACHE BASICS
How does the cache work?
Reading Data from Memory
n How does the CPU read data (and instructions) from memory?
n Processor issues a MM address.
n Cache is searched using the MM address.
q The address is also sent to the main memory.
n If the MM address is found (cache hit), the processor reads the data
from the cache.
q One clock cycle to read from cache.
n If the MM address is not found (miss), the data is transferred from MM
to the cache and the processor.
q Miss penalty = 100's of clock cycles.
q Data is copied into an unused cache block.
q Or must overwrite a used cache block.

[Figure: processor, cache, and main memory. The address goes to both the cache and MM; on a miss, the data block is copied from MM into the cache and delivered to the processor.]

MEMORY ORGANIZATION
Memory is organized into blocks.
Calculating the main memory block number.
Main Memory Block Number
n Main memory and cache are organized into blocks.
q The block size is the same for both memories.
n The processor issues an a-bit main memory address.
n To determine the block number, the address can be partitioned into
two fields: the upper a - b bits identify the block number, and the
lower b bits identify the byte number within the block.

a = log2(memory size in bytes)
b = log2(block size in bytes)
Block size = number of bytes/block = 2^b
Number of blocks in main memory = 2^(a-b)
MM Block Number: Example
n Consider a memory system with the given specifications:
q Size of main memory = 64 Kbytes
q Block size = 8 bytes
n Determine the block number for the MM address 0x3A17.

a = width of address = log2(64 Kbytes) = 16 bits
b = block offset = log2(8 bytes) = 3 bits

MM Address = 0x3A17: 0011 1010 0001 0111
Block number (upper a - b = 13 bits) = 0 0111 0100 0010 = 1858
Byte number (lower b = 3 bits) = 111 = 7

2^13 = 8192 blocks in MM
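The same block-number arithmetic can be checked with shifts and masks; a small sketch using this example's parameters:

```python
# Block number / byte number for a = 16, b = 3
# (64 KB main memory, 8-byte blocks), as in the example above.
import math

mem_size, block_size = 64 * 1024, 8
a = int(math.log2(mem_size))     # 16 address bits
b = int(math.log2(block_size))   # 3 offset bits

addr = 0x3A17
block_number = addr >> b                # upper a - b = 13 bits
byte_number = addr & (block_size - 1)   # lower b = 3 bits
print(a, b, block_number, byte_number)  # 16 3 1858 7
```

With a 32-byte block size instead (b = 5), the same address falls in block `0x3A17 >> 5` = 464 at byte 23.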
MM Block Number: Example
n Consider a memory system with the given specifications:
q Size of main memory = 64 Kbytes
q Block size = 32 bytes
n Determine the block number for the MM address 0x3A17.

a = width of address = log2(64 Kbytes) = 16 bits
b = block offset = log2(32 bytes) = 5 bits

MM Address = 0x3A17: 0011 1010 0001 0111
Block number (upper a - b = 11 bits) = 001 1101 0000 = 464
Byte number (lower b = 5 bits) = 1 0111 = 23

2^11 = 2048 blocks in MM

CACHE ORGANIZATION
Cache Organization
n Each main memory block is mapped to a block in the cache.
n The cache can be organized in a variety of ways.

n Three cache organizations will be considered:

1. Fully-associative
q Each main memory block can be mapped to any cache block.
2. Direct-mapped
q Each main memory block is mapped to exactly one cache block.
3. Set-associative
q Each main memory block is mapped to a set of cache blocks.

CACHE AND THE PIPELINE


Instruction and Data caches.
What happens on a cache hit and a cache miss?
Memory stalls
How does the cache affect processor performance?
Instruction and Data Caches
[Figure: five-stage pipeline datapath with a separate I-Cache in the IF stage and D-Cache in the MEM stage.]
Separate instruction and data caches

n Processor accesses separate instruction and data caches.


q Fetches an instruction from the I-cache in the IF stage.

q Reads data from and writes data to the D-cache in the MEM stage.

n Modified Harvard Architecture


q Separate instruction and data caches, rather than separate main
memories.
Effect on the Pipeline: Cache HIT
[Figure: five-stage pipeline datapath with I-Cache in the IF stage and D-Cache in the MEM stage.]

n The instruction and data caches make it possible for the IF and MEM
stages of the pipeline to have comparable clock cycle times to the
other stages of the pipeline.
q This assumes cache hits.

q If cache miss, pipeline must be stalled.

n The access time of the caches is comparable to the critical path delay
of the hardware in each of the other stages of the pipeline.
Effect on Pipeline Performance
[Figure: five-stage pipeline datapath with I-Cache in the IF stage and D-Cache in the MEM stage.]

n Miss on an instruction fetch (I-cache miss).


q Occurs in the IF stage of the pipeline.
q IF stage inserts hardware NOPs into pipeline until required instruction
retrieved from main memory (and copied into I-cache).
q Main memory access time = 10s to 100s of processor clock cycles.
Effect on Pipeline Performance
[Figure: five-stage pipeline datapath with I-Cache in the IF stage and D-Cache in the MEM stage.]

n Miss on a data read (D-cache miss).


q Occurs in the MEM stage of the pipeline.
q MEM stage inserts hardware NOPs into pipeline until required data is
retrieved from main memory (and copied into D-cache).
q Freezes (or stalls) all preceding stages of the pipeline.
q Relevant to load and store instructions only.
Memory Stall
n A memory stall occurs when the processor must wait for a memory
operation to complete.
q As with hazards, when the processor is stalled, clock cycles are
wasted.

n There are two types of memory stalls:


q Read stall

n Instruction fetch
n Load word instruction (read from memory)
q Write stall
n Store word instruction (write to memory)
Effect on Processor Performance
n Processor performance is measured using Execution Time (ET).
n Assuming a perfect cache (i.e. miss rate = 0 %)

ET = IC * CPIAVG * clock period

n Assuming a non-ideal cache.

ET = IC * (CPIAVG + Mem StallsAVG) * clock period

CPIEFFECTIVE = CPIAVG + Mem StallsAVG

ET = IC * CPIEFFECTIVE * clock period

ET: Execution Time
CPI: Cycles Per Instruction
IC: Instruction Count
Cache Performance: Example #1
n Given the following system specifications:
q CPIAVG = 1.8 (does not include memory stalls)

q Hit rate for the Instruction cache = 95%

q Hit rate for the Data cache = 98%

q Memory access instructions account for 30% of the instructions


executed.
n 80% are read instructions (load)
n 20% are write instructions (store)
q Read miss penalty = 20 clock cycles.
q Write miss penalty = 5 clock cycles.

n Calculate the effective CPI (CPIEFFECTIVE) of the processor.


Cache Performance: Example #1
n Instruction miss penalty = instruction miss rate * read miss penalty
= (1 – 0.95) * 20 clock cycles
= 1 clock cycle per instruction (average)

n Data read miss penalty = % of memory access instructions in program *


% of mem access instructions that are reads *
data miss rate * read miss penalty
= 0.3 * 0.8 * (1 – 0.98) * 20 clock cycles
= 0.096 clock cycles per instruction (average)
Cache Performance: Example #1
n Data write miss penalty = % of memory access instructions in program *
% of mem access instructions that are writes *
data miss rate * write miss penalty
= 0.3 * 0.2 * (1 – 0.98) * 5 clock cycles
= 0.006 clock cycles per instruction (average)

n CPIEFFECTIVE = base CPI + Effect of I-cache misses + Effect of D-cache misses


= 1.8 + 1 + (0.096 + 0.006) clock cycles
= 2.902 clock cycles

n Thus, due to the non-ideal cache, the time required to execute the program
will increase by a factor of 2.902 / 1.8 = 1.61.
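The arithmetic in this example can be checked programmatically; this sketch simply re-evaluates the numbers given above.

```python
# CPI_EFFECTIVE for Example #1: base CPI plus average stall cycles
# contributed by I-cache and D-cache misses.
cpi_avg = 1.8
i_miss_rate = 1 - 0.95    # I-cache hit rate = 95%
d_miss_rate = 1 - 0.98    # D-cache hit rate = 98%
mem_frac = 0.30           # fraction of instructions that access memory
read_frac, write_frac = 0.80, 0.20
read_pen, write_pen = 20, 5   # miss penalties in clock cycles

i_stalls = i_miss_rate * read_pen                           # ~1 cycle/instr
d_read = mem_frac * read_frac * d_miss_rate * read_pen      # ~0.096
d_write = mem_frac * write_frac * d_miss_rate * write_pen   # ~0.006
cpi_eff = cpi_avg + i_stalls + d_read + d_write
print(round(cpi_eff, 3), round(cpi_eff / cpi_avg, 2))  # 2.902 1.61
```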
Cache Performance: Example #2
n Given the following system specifications:
q CPIAVG = 2.0 (does not include memory stalls)

q Miss rate for the Instruction cache = 2%

q Miss rate for the Data cache = 4%

q Memory access instructions account for 36% of the instructions


executed.
q Miss penalty = 100 clock cycles.

n Calculate the effective CPI (CPIEFFECTIVE) of the processor.


Cache Performance: Example #2
n Instruction miss penalty= instruction miss rate * miss penalty
= (0.02) * 100 clock cycles
= 2 clock cycles per instruction (average)

n Data miss penalty = % of memory access instructions in program *


data miss rate * miss penalty
= 0.36 * (0.04) * 100 clock cycles
= 1.44 clock cycles per instruction (average)
Cache Performance: Example #2
n CPIEFFECTIVE = base CPI + Effect of I-cache misses + Effect of D-cache
misses
= 2.0 + 2 + 1.44 clock cycles
= 5.44 clock cycles

n Thus, a processor with a perfect cache is 5.44 / 2.0 = 2.72 times faster than
a processor with the non-ideal cache described above.
Improving Cache Performance
n Improve cache performance to improve processor performance.
n Processor performance can be improved by reducing pipeline stalls,
including those due to memory access (memory stalls).

EMAT = TC + ( mC * TMM )

n Reduce the number of memory stalls by decreasing


q The cache miss rate (mc)

q The main memory access time (TMM)

n Exploit spatial locality to reduce cache miss rate.


q Transfer an entire block rather than just the byte (or word)
specified.
q Increase the block size to decrease the cache miss rate.

CACHE ORGANIZATIONS
The (hierarchical) memory system.
Cache is “hidden” memory.
Fully-associative, Direct-mapped, Set-associative
Main memory (and cache) is organized into blocks.
The (Hierarchical) Memory System
n The memory system is organized hierarchically:

[Figure: the memory hierarchy: registers, cache, main memory, and secondary memory.]
Cache and Main Memory
n Cache is "hidden" memory.
q Not visible to the programmer.
q Not directly accessible to the processor.
n Main memory is the physical memory that is visible to the programmer.
q Main memory addresses are specified in the program code.
q Main memory addresses are issued by the processor.

[Figure: the memory hierarchy, faster and more expensive toward the registers, slower and cheaper toward main memory.]
Cache Organization
n Each main memory block is mapped to a block in the cache.
n Information in the cache can be organized in a variety of ways.

n Three cache organizations will be considered:

1. Fully-associative
q Each main memory block can be mapped to any cache block.
2. Direct-mapped
q Each main memory block is mapped to exactly one cache block.
3. Set-associative
q Each main memory block is mapped to a set of cache blocks.
Main Memory Block Number
n Main memory and cache are organized into blocks.
q The block size is the same for both memories.
n The processor issues an a-bit main memory address.
n To determine the block number, the address can be partitioned into
two fields: the upper a - b bits identify the block number, and the
lower b bits identify the byte number within the block.

a = log2(memory size in bytes)
b = log2(block size in bytes)
Block size = number of bytes/block = 2^b
Number of blocks in main memory = 2^(a-b)

DIRECT-MAPPED CACHE
Basics
Organization of a direct-mapped cache
Example
Direct-mapped Cache
n The cache is organized into blocks.
n Each main memory block maps to exactly one cache block.
q A cache block is identified by its cache index.
n Many main memory blocks map to the same cache block.
q Only one MM block can reside in a cache block at a time.

[Figure: a direct-mapped cache with 8 cache blocks. Each entry, selected by its cache index, holds a tag (the tag field of the block copied from MM) and the data block (instructions or data copied from MM).]
Direct-mapped Cache
n The processor issues an a-bit main memory address.
n When accessing a direct-mapped cache, the address is partitioned
into three fields:
q Tag field: distinguishes between the different main memory
blocks that can map to the same cache block in a direct-mapped
cache.
q Cache index: identifies the cache block to which the main
memory block is mapped in a direct-mapped cache.
q Block offset: identifies a byte within the block.

n The main memory address is used to search the cache.

a bits:  | Tag field (t bits) | Index (n bits) | Block offset (b bits) |

b = log2(block size in bytes)
n = log2(cache size in blocks)
t = a - (n + b)
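Splitting an address into these three fields can be sketched with shifts and masks; `split_address` is a hypothetical helper name, and it assumes the block size and cache size are powers of two.

```python
# Tag / index / offset extraction for a direct-mapped cache, using the
# field widths defined above (b offset bits, n index bits, t tag bits).
def split_address(addr, cache_blocks, block_size):
    b = (block_size - 1).bit_length()      # b = log2(block size in bytes)
    n = (cache_blocks - 1).bit_length()    # n = log2(cache size in blocks)
    offset = addr & (block_size - 1)       # lower b bits
    index = (addr >> b) & (cache_blocks - 1)  # next n bits
    tag = addr >> (b + n)                  # remaining t = a - (n + b) bits
    return tag, index, offset

# The later example: 256-byte MM, 4-block cache, 8-byte blocks.
print(split_address(0x32, 4, 8))  # (1, 2, 2): tag 1, index 2, byte 2
```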
Direct-mapped Cache
n How does the CPU read data (and instructions) from memory?
n Processor issues a MM address.
n Cache is searched using the MM address.
q The index field is used to look up the MM address in the cache.
q The address is also sent to the main memory.
n If the MM address is found (cache hit), the processor reads the data
from the cache.
q One clock cycle to read from cache.
n If the MM address is not found (miss), the data is transferred from MM
to the cache and the processor.
q Miss penalty = 100's of clock cycles.
q Data is copied into an unused cache block.
q Or must overwrite a used cache block.

[Figure: processor, cache, and main memory, with the address sent to both, and data returned from the cache on a hit or copied in from MM on a miss.]
Direct-mapped Cache
n Main memory address fields:
n Index field
q Specifies the cache block to be searched for the requested
instruction/data.
n Tag field
q Compared against the tag bits stored in the cache to determine whether
there is a cache hit or a cache miss.
n Block offset
q Selects one of the bytes within the block.
Direct-mapped Cache
n Each cache location contains
q Data
q Metadata (tag bits + valid bit)
n Data (or instruction)
q Block copied from main memory.
n Tag bits
q Identify the MM block that is
stored in the cache block.
q Upper t bits of the MM address.
q Disambiguates MM blocks.
n Valid bit
q Indicates validity of block.
n Index
q Specifies location of the block
within the cache.
Valid Bit
n Blocks are copied from main memory to the cache (as needed).
n Is the data (instruction) in the cache block valid?

n Cache is initially “empty”.


q In reality, cache initially
contains garbage (i.e. invalid
data).
n Must store an additional bit in
each cache block to indicate
the validity of the data (or
instruction).
n The valid bit indicates whether
the data is valid or invalid.
q Valid bit = 0 : invalid
q Valid bit = 1 : valid
Direct-mapped Cache
Cache Search (hit or miss)
n CPU issues a main memory address.
q Index field selects cache block.
q Tag field compared against tag
stored in selected cache block.
q Valid bit checked.
n If cache hit
q Processor reads data from cache.
q Processor does not need to be
stalled.
n If cache miss
q Processor reads data from MM.
q MM block copied into appropriate
cache block.
q Processor must be stalled.
[Figure: hardware (i.e. digital logic) that generates the cache hit control signal.]
Direct-mapped Cache: Example
Memory Specifications:
q Main Memory size = 256 bytes (a = 8)
q Block size = 8 bytes (b = 3)
q Cache size = 4 blocks (n = 2)

MM Address format: Tag (3 bits) | Index (2 bits) | Byte # (3 bits)

Access sequence:
lbu $s0, 0x32($zero)
lbu $s0, 0xC9($zero)
lbu $s0, 0xF4($zero)
lbu $s0, 0x37($zero)
lbu $s0, 0x5D($zero)
lbu $s0, 0x58($zero)
lbu $s0, 0x1E($zero)

Initial cache state (all entries invalid):
Index  V  Tag  Data
0      0   -    -
1      0   -    -
2      0   -    -
3      0   -    -

[Figure: main memory organized into 32 blocks (0-31) of 8 bytes each.]
Direct-mapped Cache: Example
Memory Specifications: Processor
0x32
Main Memory size = 256 bytes (a = 8)
Block size = 8 bytes (b = 3) Registers

Cache
H/M
Index V Tag Data
lbu $s0, 0x32($zero) M 0 0
lbu $s0, 0xC9($zero) 1 0

lbu $s0, 0xF4($zero) 2 1 1 D06


3 0
lbu $s0, 0x37($zero)
lbu $s0, 0x5D($zero) Main Memory
0 1 2 3 4 5 6 7
lbu $s0, 0x58($zero)
8 9 10 11 12 13 14 15
lbu $s0, 0x1E($zero) 16 17 18 19 20 21 22 23
D06
Block 6 24 25 26 27 28 29 30 31

MM Address: 0011 0010 Processor is stalled 100’s of clock cycles.

1 2 Byte 2
74
Direct-mapped Cache: Example
Memory Specifications: Processor
0xC9
Main Memory size = 256 bytes (a = 8)
Block size = 8 bytes (b = 3) Registers

Cache
H/M
Index V Tag Data
lbu $s0, 0x32($zero) M 0 0
lbu $s0, 0xC9($zero) M 1 1 6 D25

lbu $s0, 0xF4($zero) 2 1 1 D06


3 0
lbu $s0, 0x37($zero)
lbu $s0, 0x5D($zero) Main Memory
0 1 2 3 4 5 6 7
lbu $s0, 0x58($zero)
8 9 10 11 12 13 14 15
lbu $s0, 0x1E($zero) 16 17 18 19 20 21 22 23
D25
Block 25 24 25 26 27 28 29 30 31

MM Address: 1100 1001 Processor is stalled 100’s of clock cycles.


6 1 Byte 1
75
Direct-mapped Cache: Example
Memory Specifications: Processor
0xF4
Main Memory size = 256 bytes (a = 8)
Block size = 8 bytes (b = 3) Registers

Cache
H/M
Index V Tag Data
lbu $s0, 0x32($zero) M 0 0
lbu $s0, 0xC9($zero) M 1 1 6 D25
2 1 7 D31
lbu $s0, 0xF4($zero) M
3 0
lbu $s0, 0x37($zero)
lbu $s0, 0x5D($zero) Main Memory
0 1 2 3 4 5 6 7
lbu $s0, 0x58($zero)
8 9 10 11 12 13 14 15
lbu $s0, 0x1E($zero) 16 17 18 19 20 21 22 23
D31
Block 31 11110 = 30
24 25 26 27 28 29 30 31

MM Address: 1111 0100 Processor is stalled 100’s of clock cycles.


7 2 Byte 4
76
Direct-mapped Cache: Example
Memory Specifications: Processor
0x37
Main Memory size = 256 bytes (a = 8)
Block size = 8 bytes (b = 3) Registers

Cache
H/M
Index V Tag Data
lbu $s0, 0x32($zero) M 0 0
lbu $s0, 0xC9($zero) M 1 1 6 D25
2 1 1 D06
lbu $s0, 0xF4($zero) M
3 0
lbu $s0, 0x37($zero) M
lbu $s0, 0x5D($zero) Main Memory
0 1 2 3 4 5 6 7
lbu $s0, 0x58($zero)
8 9 10 11 12 13 14 15
lbu $s0, 0x1E($zero) 16 17 18 19 20 21 22 23
D06
Block 6 24 25 26 27 28 29 30 31

MM Address: 0011 0111


Processor is stalled 100’s of clock cycles.
1 2 Byte 7
77
Direct-mapped Cache: Example
Memory Specifications: Processor
0x5D
Main Memory size = 256 bytes (a = 8)
Block size = 8 bytes (b = 3) Registers

Cache
H/M
Index V Tag Data
lbu $s0, 0x32($zero) M 0 0
lbu $s0, 0xC9($zero) M 1 1 6 D25
2 1 1 D06
lbu $s0, 0xF4($zero) M
3 1 2 D11
lbu $s0, 0x37($zero) M
lbu $s0, 0x5D($zero) M Main Memory
0 1 2 3 4 5 6 7
lbu $s0, 0x58($zero)
8 9 10 11 12 13 14 15
lbu $s0, 0x1E($zero) 16 17 18 19 20 21 22 23
D11
Block 11 24 25 26 27 28 29 30 31

MM Address: 0101 1101 Processor is stalled 100’s of clock cycles.

2 3 Byte 5
78
Direct-mapped Cache: Example
Memory Specifications: Processor
0x58
Main Memory size = 256 bytes (a = 8)
Block size = 8 bytes (b = 3) Registers

Cache
H/M
Index V Tag Data
lbu $s0, 0x32($zero) M 0 0
lbu $s0, 0xC9($zero) M 1 1 6 D25
2 1 1 D06
lbu $s0, 0xF4($zero) M D11
3 1 2 D11
lbu $s0, 0x37($zero) M
lbu $s0, 0x5D($zero) M Main Memory
0 1 2 3 4 5 6 7
lbu $s0, 0x58($zero) H 8 9 10 11 12 13 14 15
lbu $s0, 0x1E($zero) 16 17 18 19 20 21 22 23
Block 11 24 25 26 27 28 29 30 31

MM Address: 0101 1000 Processor is not stalled.


2 3 Byte 0
79
Direct-mapped Cache: Example
Memory Specifications: Processor
0x1E
Main Memory size = 256 bytes (a = 8)
Block size = 8 bytes (b = 3) Registers

Cache
H/M
Index V Tag Data
lbu $s0, 0x32($zero) M 0 0
lbu $s0, 0xC9($zero) M 1 1 6 D25
2 1 1 D06
lbu $s0, 0xF4($zero) M
3 1 0 D03
lbu $s0, 0x37($zero) M
lbu $s0, 0x5D($zero) M Main Memory
0 1 2 3 4 5 6 7
lbu $s0, 0x58($zero) H 8 9 10 11 12 13 14 15
lbu $s0, 0x1E($zero) M 16 17 18 19 20 21 22 23
D03
Block 3 24 25 26 27 28 29 30 31

MM Address: 0001 1110 Processor is stalled 100’s of clock cycles.


0 3 Byte 6
80
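The direct-mapped trace above can be replayed in software. Below is a minimal sketch (not part of the original slides) using the example's parameters: a 4-block cache (n = 2 index bits) and 8-byte blocks (b = 3 offset bits), with each index holding exactly one tag, so a miss always overwrites.

```python
# Minimal direct-mapped cache simulator for the example above.
# Parameters from the slides: 4 cache blocks (n = 2 index bits),
# 8-byte blocks (b = 3 offset bits), 256-byte main memory (a = 8 bits).
N_BITS, B_BITS = 2, 3

def split(addr):
    """Split a byte address into (tag, index, byte offset)."""
    offset = addr & ((1 << B_BITS) - 1)
    index = (addr >> B_BITS) & ((1 << N_BITS) - 1)
    tag = addr >> (B_BITS + N_BITS)
    return tag, index, offset

def run(trace):
    cache = {}            # index -> tag currently resident (valid if present)
    results = []
    for addr in trace:
        tag, index, _ = split(addr)
        if cache.get(index) == tag:
            results.append("H")
        else:
            results.append("M")
            cache[index] = tag   # only one block per index: overwrite
    return results

trace = [0x32, 0xC9, 0xF4, 0x37, 0x5D, 0x58, 0x1E]
print(run(trace))  # ['M', 'M', 'M', 'M', 'M', 'H', 'M']
```

The output matches the H/M column of the slides: only the 0x58 access hits, because it falls in the same block (index 3, tag 2) as the preceding 0x5D access.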

REPLACEMENT POLICY
Direct-mapped cache
81
Replacement Policy
n Empty block
q Use an “empty” cache block, if one is available.

q An “empty” or invalid cache block is indicated by V = 0.

n Overwrite contents of cache block.


q Each MM block maps to just one cache block in a direct-mapped cache.

q Therefore, cache block must be overwritten with new MM block.

q Only one MM block can reside in a cache block at a time.


82

CACHE SIZE
Direct-mapped cache: Example
83
Cache Size
n Memory system specifications: Direct-mapped Cache
q Main memory size = 1M byte Index V Tag Data
q Cache size = 128 bytes 0 0
q Block size = 8 bytes 1 0
q Direct-mapped cache 2 0
(One-way Set-associative cache). 3 0

(illustrated here:
4 sets with one block each)
n Calculate the total size of the cache (in bits).
n Determine the number of blocks in main memory.
q # blocks in MM = Main memory size / Block size
q # blocks in MM = 1M byte / 8 bytes = 220 / 23 = 217 = 131072
blocks.
n Determine the number of blocks in the cache.
q # blocks in cache = Cache size / Block size
q # blocks in cache = 128 bytes / 8 bytes = 16 blocks.
84
Cache Size
n Determine the number of address bits for the tag field.
q a = address bits = log2(memory size in bytes) = log2(1M bytes) = 20 bits.
q n = index = log2(cache size in blocks) = log2(16) = 4 bits.
q b = block offset = log2(block size in bytes) = log2(8 bytes) = 3 bits.
q t = tag = a – (b + n) = 20 – (3 + 4) = 13 bits.

n Determine the number of bits in each cache entry (i.e. block).


q Data bits = block size = 8 bytes * 8 bits/byte = 64 bits.
q Metadata bits = Valid bit + Tag bits = 1 + 13 = 14 bits.
q Total bits in each cache block = data bits + metadata bits = 78 bits.

n Determine total size of cache (in bits).


q Total size of cache = # cache blocks * total bits in each cache block
q Total size of cache = 16 blocks * 78 bits/block = 1248 bits.
85
Cache Size
n Determine the cache overhead.
q Overhead = the percentage of the cache bits needed for the metadata.

q Total data bits = 16 blocks * 8 bytes/block * 8 bits/byte = 1024 bits.

q Total metadata bits = 16 blocks * 14 bits/block = 224 bits.

q Total bits in the cache = 1024 + 224 = 1248 bits.

q Overhead = metadata / total = 224 / 1248 = 17.95 %

n This implies that just 82.05% of the cache is used to store data.
n There may be a more efficient design for the cache.
q One in which a greater percentage of the cache is used to store data.
86
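The sizing arithmetic on these slides can be checked programmatically. The sketch below (an illustration, not from the slides) computes tag bits, total cache size, and overhead for a direct-mapped cache, reproducing both configurations: 8-byte blocks (1248 bits, 17.95% overhead) and 32-byte blocks (1080 bits, 5.19% overhead).

```python
import math

def cache_cost(mem_bytes, cache_bytes, block_bytes):
    """Return (tag_bits, total_cache_bits, overhead) for a direct-mapped cache."""
    a = int(math.log2(mem_bytes))          # address bits
    blocks = cache_bytes // block_bytes    # blocks in the cache
    n = int(math.log2(blocks))             # index bits
    b = int(math.log2(block_bytes))        # block-offset bits
    t = a - (n + b)                        # tag bits
    data_bits = block_bytes * 8
    meta_bits = 1 + t                      # valid bit + tag
    total = blocks * (data_bits + meta_bits)
    overhead = blocks * meta_bits / total
    return t, total, overhead

print(cache_cost(2**20, 128, 8))    # 8-byte blocks:  tag=13, 1248 bits, ~17.95%
print(cache_cost(2**20, 128, 32))   # 32-byte blocks: tag=13, 1080 bits, ~5.19%
```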
More Efficient Cache Design
n How can the size of the cache be reduced?

n The cache is composed of data and metadata.


q It is not useful to reduce the amount of data stored in the cache.

q However, it may be possible to reduce the amount of metadata.

n Reduce the metadata by increasing the block size.


q The same metadata is now included for larger cache blocks.

q This leads to a reduction in the cache overhead and total cache size.

n Overhead = metadata / (data + metadata)


n Total cache size = data + metadata

n Increasing the block size takes advantage of spatial locality.


q This also leads to a reduction in the cache miss rate.
87
Cache Size
n Memory system specifications: Direct-mapped Cache
q Main memory size = 1M byte
Index V Tag Data
q Cache size = 128 bytes
0 0
q Block size = 32 bytes 1 0
q Direct-mapped cache 2 0
(One-way Set-associative cache). 3 0

(illustrated here:
n Calculate the total size of the cache (in bits). 4 sets with one block each)
n Determine the number of blocks in main memory.
q # blocks in MM = Main memory size / Block size

q # blocks in MM = 1M byte / 32 bytes = 2


20 / 25 = 215 = 32768
blocks.

n Determine the number of blocks in the cache.


q # blocks in cache = Cache size / Block size

q # blocks in cache = 128 bytes / 32 bytes = 4 blocks.


88
Cache Size
n Determine the number of address bits for the tag field.
q a = address bits = log2(memory size in bytes) = log2(1M bytes) = 20 bits.
q n = index = log2(cache size in blocks) = log2(4) = 2 bits.
q b = block offset = log2(block size in bytes) = log2(32 bytes) = 5 bits.
q t = tag = a – (b + n) = 20 – (2 + 5) = 13 bits.

n Determine the number of bits in each cache entry (i.e. block).


q Data bits = block size = 32 bytes * 8 bits/byte = 256 bits.
q Metadata bits = Valid bit + Tag bits = 1 + 13 = 14 bits.
q Total bits in each cache block = data bits + metadata bits = 270 bits.

n Determine total size of cache (in bits).


q Total size of cache = # cache blocks * total bits in each cache block
q Total size of cache = 4 blocks * 270 bits/block = 1080 bits.
89
Cache Size
n Determine the cache overhead.
q Overhead = the percentage of the cache bits needed for the metadata.

q Total data bits = 4 blocks * 32 bytes/block * 8 bits/byte = 1024 bits.

q Total metadata bits = 4 blocks * 14 bits/block = 56 bits.

q Total bits in the cache = 1024 + 56 = 1080 bits.

q Overhead = metadata / total = 56 / 1080 = 5.19 %

n By increasing the block size from 8 bytes to 32 bytes, the cache overhead was
decreased from 17.95 % to 5.19 % .
q This implies that nearly 95 % of the cache is now used for data.

q The total cache size was also decreased from 1248 bits to 1080 bits, which

corresponds to a decrease in size by 13.5 %.


90
Impact of Increasing Block Size
n For a given cache size, we
expect the miss rate to
decrease with increasing block
size.
q Exploits spatial locality.
n However, in a fixed-size cache,
larger blocks lead to fewer
blocks.
q More competition for blocks.
q Results in an increase in the
miss rate.
91

DIRECT-MAPPED CACHE
The problem with a direct-mapped cache: Example.
92
Direct-mapped Cache: Example
n The code segment given below is executed on
a processor with a 4-block direct-mapped Cache specifications:
cache. • Cache size = 4 blocks
• Block size = 4 bytes

L1: lw $s0, 0x1000($zero)
lw $s1, 0x1010($zero)
lw $s2, 0x1020($zero)
lw $s3, 0x1030($zero)
…

Direct-mapped Cache (Index: 00, 01, 10, 11; Tag and Data columns initially empty)

Cache hit rate = 0%

n Problem: the direct-mapped cache design allows each main memory block
to be placed in just one cache block (specified by the cache index).

n Solution: design a cache that has a more flexible placement algorithm.


93

FULLY-ASSOCIATIVE CACHE
Basics
Organization of fully-associative cache
Example
94
Fully-associative Cache
n The cache is organized into blocks.
n Each cache block contains storage space for the
q Tag field: contains the tag field of the MM block copied into the cache.
q Data: contains a copy of the data in the specified main memory block.

n Each main memory block can be mapped to any cache block.

Tag field of block copied from MM Block (data or instructions) copied from MM

Tag Data Tag Data Tag Data Tag Data Tag Data Tag Data Tag Data Tag Data

8 cache blocks
95
Fully-associative Cache
n The processor issues an a-bit main memory address.
n When accessing a fully-associative cache, the address is partitioned into
two fields:
q Tag field: identifies the main memory block (i.e. the block number).

q Block offset: identifies a byte within the block.

n The main memory address is used to search the cache.

a bits

Tag field Block offset


t bits b bits

a = log2(memory size in bytes) Block size = number of bytes/block = 2b.


Number of blocks in main memory = 2t.
b = log2(block size in bytes)
t=a–b
t = a – (n+b) but n=0
96
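Since n = 0 in a fully-associative cache, the tag is simply the main memory block number. A one-line split under the example parameters (a = 8, b = 3) — an illustrative sketch, not from the slides:

```python
B_BITS = 3  # 8-byte blocks (b = 3), as in the examples

def fa_split(addr):
    """Fully-associative split: tag = MM block number, plus byte offset."""
    return addr >> B_BITS, addr & ((1 << B_BITS) - 1)

print(fa_split(0x32))  # (6, 2) -> block 6, byte 2
```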
Fully-associative Cache
n How does the CPU read data (and instructions) from memory?

n Processor issues a MM address.


n Cache is searched using the MM
address. Processor
q Address is also sent to the main
memory. Registers
n If the MM address is found (cache hit),
processor reads data from cache.
q One clock cycle to read from cache.
n If the MM address is not found (miss), Cache
data is transferred from MM to the
cache and the processor.
q Miss penalty = 100’s of clock cycles.
q Data is copied into an unused cache
block.
q Or must overwrite a used cache block.
Main Memory
97
Fully-associative Cache
Main memory address
n Tag field
q compared against the tag bits stored in each of the cache blocks to determine if there is a cache hit or a cache miss.
n Block offset
q selects one of the bytes within the addressed block.
[Diagram: block size = 4 bytes (b = 2 bits); each cache block holds a 30-bit tag and 32-bit data; all tags are compared in parallel and a multiplexer selects the addressed byte.]
102
Fully-associative Cache
Cache Blocks
n Each block contains
q Data (or instructions)
q Metadata (V bit, tag)
n Valid bit
q Indicates validity of corresponding block.
n Tag bits
q Identify the MM block that was copied into the cache block.
q Upper t bits of MM addr.
n Data
q Block copied from MM.
q Data or instructions.
103
Valid bit
n Blocks are copied from main memory to the cache (as needed).

n Is the data (or instruction) in the cache block valid?

n Cache is initially “empty”.
q In reality, cache initially contains invalid data.
n Must store an additional bit in each cache block to indicate the validity of the data.
n The valid bit indicates whether the data is valid or invalid.
q Valid bit = 0 : invalid
q Valid bit = 1 : valid
n The valid bit is updated when data is copied into the cache.
104
Fully-associative Cache
Cache Search (hit or miss)
n CPU issues a MM addr.
q Tag field compared against the tag bits in each cache block (parallel comparison).
q Valid bit checked.
n If cache hit
q Processor reads data from cache.
q Processor does not need to be stalled.
n If cache miss
q CPU reads data from MM.
q MM block copied into a cache block.
q CPU must be stalled.
Hardware to generate the cache hit control signal.


105
Fully-associative Cache
Hardware to compare the tag field of the MM address with the tag bits in every cache block is expensive.
The large OR gate and multiplexer are difficult and costly to realize.
Is there another way to organize the cache?
106
Fully-associative Cache: Example
Memory Specifications:
Main Memory size = 256 bytes (a = 8)
Block size = 8 bytes (b = 3) Processor
Registers
H/M
lbu $s0, 0x32($zero)
lbu $s0, 0x37($zero)
lbu $s0, 0x5B($zero)
Cache
lbu $s0, 0xC9($zero)
lbu $s0, 0x58($zero) 0 1 2 3 4 5 6 7
lbu $s0, 0xA4($zero) 8 9 10 11 12 13 14 15
16 17 18 19 20 21 22 23
lbu $s0, 0x3D($zero)
24 25 26 27 28 29 30 31

MM Address: 0000 0000 Main Memory

Block # Byte #
(Tag field)
107
Fully-associative Cache: Example
Memory Specifications:
Main Memory size = 256 bytes (a = 8)
Block size = 8 bytes (b = 3) Processor
0x32
Registers
H/M
lbu $s0, 0x32($zero) M
lbu $s0, 0x37($zero)
06 D06
lbu $s0, 0x5B($zero)
Cache
lbu $s0, 0xC9($zero)
lbu $s0, 0x58($zero) 0 1 2 3 4 5 6 7
lbu $s0, 0xA4($zero) 8 9 10 11 12 13 14 15
16 17 18 19 20 21 22 23
lbu $s0, 0x3D($zero) D06
24 25 26 27 28 29 30 31

MM Address: 0011 0010 Main Memory

Block 6 Byte 2
Processor is stalled 100’s of clock cycles.
(Tag field)
108
Fully-associative Cache: Example
Memory Specifications:
Main Memory size = 256 bytes (a = 8)
Block size = 8 bytes (b = 3) Processor
0x37
D06
Registers
H/M
lbu $s0, 0x32($zero) M
lbu $s0, 0x37($zero) H
06 D06
lbu $s0, 0x5B($zero)
Cache
lbu $s0, 0xC9($zero)
lbu $s0, 0x58($zero) 0 1 2 3 4 5 6 7
lbu $s0, 0xA4($zero) 8 9 10 11 12 13 14 15
16 17 18 19 20 21 22 23
lbu $s0, 0x3D($zero)
24 25 26 27 28 29 30 31

MM Address: 0011 0111 Main Memory

Block 6 Byte 7 Processor is not stalled.


(Tag field)
109
Fully-associative Cache: Example
Memory Specifications:
Main Memory size = 256 bytes (a = 8)
Block size = 8 bytes (b = 3) Processor
0x5B
Registers
H/M
lbu $s0, 0x32($zero) M
lbu $s0, 0x37($zero) H
06 D06 11 D11
lbu $s0, 0x5B($zero) M
Cache
lbu $s0, 0xC9($zero)
lbu $s0, 0x58($zero) 0 1 2 3 4 5 6 7
lbu $s0, 0xA4($zero) 8 9 10 11 12 13 14 15
16 17 18 19 20 21 22 23
lbu $s0, 0x3D($zero) D11
24 25 26 27 28 29 30 31

MM Address: 0101 1011 Main Memory

Block 11 Byte 3
Processor is stalled 100’s of clock cycles.
(Tag field)
110
Fully-associative Cache: Example
Memory Specifications:
Main Memory size = 256 bytes (a = 8)
Block size = 8 bytes (b = 3) Processor
0xC9
Registers
H/M
lbu $s0, 0x32($zero) M
lbu $s0, 0x37($zero) H
06 D06 11 D11 25 D25
lbu $s0, 0x5B($zero) M
Cache
lbu $s0, 0xC9($zero) M
lbu $s0, 0x58($zero) 0 1 2 3 4 5 6 7
lbu $s0, 0xA4($zero) 8 9 10 11 12 13 14 15
16 17 18 19 20 21 22 23
lbu $s0, 0x3D($zero) D25
24 25 26 27 28 29 30 31

MM Address: 1100 1001 Main Memory

Block 25 Byte 1
Processor is stalled 100’s of clock cycles.
(Tag field)
111
Fully-associative Cache: Example
Memory Specifications:
Main Memory size = 256 bytes (a = 8)
Block size = 8 bytes (b = 3) Processor
0x58
D11
Registers
H/M
lbu $s0, 0x32($zero) M
lbu $s0, 0x37($zero) H
06 D06 11 D11 25 D25
lbu $s0, 0x5B($zero) M
Cache
lbu $s0, 0xC9($zero) M
lbu $s0, 0x58($zero) H 0 1 2 3 4 5 6 7
lbu $s0, 0xA4($zero) 8 9 10 11 12 13 14 15
16 17 18 19 20 21 22 23
lbu $s0, 0x3D($zero)
24 25 26 27 28 29 30 31

MM Address: 0101 1000 Main Memory

Block 11 Byte 0
Processor is not stalled.
(Tag field)
112
Fully-associative Cache: Example
Memory Specifications:
Main Memory size = 256 bytes (a = 8)
Block size = 8 bytes (b = 3) Processor
0xA4
Registers
H/M
lbu $s0, 0x32($zero) M
lbu $s0, 0x37($zero) H
06 D06 11 D11 25 D25 20 D20
lbu $s0, 0x5B($zero) M
Cache
lbu $s0, 0xC9($zero) M
lbu $s0, 0x58($zero) H 0 1 2 3 4 5 6 7
lbu $s0, 0xA4($zero) M 8 9 10 11 12 13 14 15
16 17 18 19 20 21 22 23
lbu $s0, 0x3D($zero) D20
24 25 26 27 28 29 30 31

MM Address: 1010 0100 Main Memory

Block 20 Byte 4
Processor is stalled 100’s of clock cycles.
(Tag field)
113
Fully-associative Cache: Example
Memory Specifications:
Main Memory size = 256 bytes (a = 8)
Block size = 8 bytes (b = 3) Processor
0x3D
Registers
H/M
lbu $s0, 0x32($zero) M
To where should Block 7 be copied?
lbu $s0, 0x37($zero) H
06 D06 11 D11 25 D25 20 D20
lbu $s0, 0x5B($zero) M
Cache
lbu $s0, 0xC9($zero) M
lbu $s0, 0x58($zero) H 0 1 2 3 4 5 6 7
lbu $s0, 0xA4($zero) M 8 9 10 11 12 13 14 15
16 17 18 19 20 21 22 23
lbu $s0, 0x3D($zero) M D07
24 25 26 27 28 29 30 31

MM Address: 0011 1101 Main Memory

Block 7 Byte 5
Processor is stalled 100’s of clock cycles.
(Tag field)
114
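The fully-associative miss/hit sequence above can also be reproduced in software. A minimal sketch (illustrative, not from the slides): the tag is the full block number, and a lookup searches every resident block; a simple FIFO stands in for the replacement policy, which this short trace never exercises.

```python
# Fully-associative cache simulator for the example above.
# 256-byte main memory (a = 8), 8-byte blocks (b = 3), tag = block number.
B_BITS = 3

def run_fa(trace, capacity=8):
    tags = []                      # resident block numbers, oldest first
    results = []
    for addr in trace:
        tag = addr >> B_BITS       # fully associative: tag = MM block number
        if tag in tags:
            results.append("H")
        else:
            results.append("M")
            if len(tags) == capacity:
                tags.pop(0)        # FIFO stand-in for a replacement policy
            tags.append(tag)
    return results

trace = [0x32, 0x37, 0x5B, 0xC9, 0x58, 0xA4, 0x3D]
print(run_fa(trace))  # ['M', 'H', 'M', 'M', 'H', 'M', 'M']
```

Note that 0x37 now hits immediately (same block 6 as 0x32), and 0x58 hits on block 11 brought in by 0x5B — exactly the H/M column of the slides.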

REPLACEMENT POLICY
Fully-associative cache
115
Replacement Policy
n Empty block
q Use an “empty” cache block, if one is available.

q An “empty” or invalid cache block is indicated by V = 0.

n Least Recently Used (LRU)


q Choose the cache block that has not been used for the longest time.

q Simple for 2-way Set-associative caches.

q Manageable for 4-way Set-associative caches.

q Too difficult for Fully-associative caches.

n Random
q Gives approximately the same performance as LRU for high
associativity.
q Used for Fully-associative caches.
116
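For low associativities, LRU can be modeled by keeping a set's blocks ordered by recency of use. The sketch below is one illustrative software realization (hardware implementations differ), using an ordered map where the least recently used tag is always first.

```python
from collections import OrderedDict

class LRUSet:
    """One cache set with LRU replacement, tracked by access order."""
    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()          # tag -> data; most recent last

    def access(self, tag):
        if tag in self.blocks:
            self.blocks.move_to_end(tag)     # mark as most recently used
            return "H"
        if len(self.blocks) == self.ways:
            self.blocks.popitem(last=False)  # evict the least recently used
        self.blocks[tag] = None
        return "M"

s = LRUSet(ways=2)
print([s.access(t) for t in [6, 25, 6, 30, 25]])
# -> ['M', 'M', 'H', 'M', 'M']: touching 6 makes 25 the LRU victim for 30,
#    so the second access to 25 misses.
```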

CACHE SIZE
Fully-associative cache: Example
117
Cache Size
n Memory system specifications:
q Main memory size = 1M byte
q Cache size = 128 bytes Fully-associative Cache
q Block size = 8 bytes V Tag Data

q Fully-associative cache. (illustrated here: 1 set with four blocks)

n Calculate the total size of the cache (in bits).


n Determine the number of blocks in main memory.
q # blocks in MM = Main memory size / Block size
q # blocks in MM = 1M byte / 8 bytes = 220 / 23 = 217 = 131072
blocks.
n Determine the number of blocks in the cache.
q # blocks in cache = Cache size / Block size
q # blocks in cache = 128 bytes / 8 bytes = 16 blocks.
118
Cache Size
n Determine the number of address bits for the tag field.
q a = address bits = log2(memory size in bytes) = log2(1MB) = 20 bits.

q b = block offset = log2(block size in bytes) = log2(8 bytes) = 3 bits.

q t = tag = a – (b) = 20 – 3 = 17 bits.

n Determine the number of bits in each cache block.


q Data bits = block size = 8 bytes * 8 bits/byte = 64 bits.

q Metadata bits = Valid bit + Tag bits = 1 + 17 = 18 bits.

q Total bits in each cache block = data bits + metadata bits = 82 bits.

n Determine total size of cache (in bits).


q Total size of cache = # cache blocks * total bits in each cache block

q Total size of cache = 16 blocks * 82 bits/block = 1312 bits.


119
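The fully-associative sizing follows the same arithmetic with n = 0 index bits. A quick check under the slide's parameters (a sketch, not from the slides):

```python
import math

# Fully-associative cache: no index bits, tag = full block number.
mem_bytes, cache_bytes, block_bytes = 2**20, 128, 8
a = int(math.log2(mem_bytes))        # 20 address bits
b = int(math.log2(block_bytes))      # 3 block-offset bits
t = a - b                            # 17 tag bits (n = 0)
blocks = cache_bytes // block_bytes  # 16 blocks
per_block = block_bytes * 8 + 1 + t  # data + valid + tag = 82 bits
total = blocks * per_block           # 1312 bits
overhead = blocks * (1 + t) / total  # 288 / 1312
print(t, total, round(overhead * 100, 2))  # 17 1312 21.95
```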
Cache Size
n Determine the cache overhead.
q Overhead = the percentage of the cache bits needed for the
metadata.
q Total data bits = 16 blocks * 8 bytes/block * 8 bits/byte = 1024 bits.

q Total metadata bits = 16 blocks * 18 bits/block = 288 bits.

q Total bits in the cache = 1024 + 288 = 1312 bits.

q Overhead = metadata / total = 288 / 1312 = 21.95 %

n This implies that just 78.05% of the cache is used to store data.

n There may be a more efficient design for the cache.


q One in which a greater percentage of the cache is used to store data.

n How does the block size affect the cache overhead?
120
Problem with Fully-associative Cache
n What is the problem with a fully-associative cache?
121

SET-ASSOCIATIVE CACHE
Basics
Organization of a set-associative cache
Example
122
Set-associative Cache
m-way Set-associative cache
n The cache is organized into sets.
q Each set contains m blocks for data or instructions.

q A cache set is identified by its cache index.

n Each main memory block maps to a cache set.


q The MM block can be copied into any of the blocks in the cache set.

n Many main memory blocks map to the same cache set.

2 blocks

Block 0 Block 1
cache index Index Tag Data Tag Data

Tag field of block 4 cache sets


copied from MM
Block (data or instructions)
copied from MM
123
Set-associative Cache
n The processor issues an a-bit main memory address.
n When accessing a set-associative cache, the address is partitioned into
three fields:
q Tag field: distinguishes between the different main memory blocks
that can map to the same cache set in a set-associative cache.
q Cache index: identifies the cache set to which the main memory
block is mapped in a set-associative cache.
q Block offset: identifies a byte within the block.

n The main memory address is used to search the cache.

a bits b = log2(block size in bytes)


Tag field Index Block offset n = log2(# of sets in cache)

t bits n bits b bits t = a – (n + b)


124
Set-associative Cache
n How does the CPU read data (and instructions) from memory?

n Processor issues a MM address.


n Cache is searched using the MM address. Processor
q Index field is used to lookup MM address in Registers
cache.
q Address is also sent to the main memory.
n If the MM address is found (cache hit),
processor reads data from cache. Cache
q One clock cycle to read from cache.
n If the MM address is not found (miss),
data is transferred from MM to the cache
and the processor.
q Miss penalty = 100’s of clock cycles.
Main Memory
q Data is copied into an unused cache block.
q Or must overwrite a used cache block.
125
Set-associative Cache
Main memory address
n Index field
q specifies the cache set
to be searched for the
requested data/instr.
n Tag field
q compared against the
tag bits stored in all m
cache blocks to
determine if there is a
cache hit or a cache
miss.
n Block offset
q selects one of the bytes
within the block.

m=4
130
Set-associative Cache
Cache
n Each cache set contains
q m blocks of data
q Metadata for each block
n Data (or instruction)
q Block copied from MM.
n Tag bits
q Identify the MM block
that is stored in the
cache block.
q Upper t bits of MM addr.
n Valid bit
q Indicates validity of
corresponding block.
n Index
q specifies location of the
m=4 set within the cache
131
Set-associative Cache
Cache Search (hit or miss)
n CPU issues a MM addr.
q Index field selects set.
Parallel comparison
q Tag field compared
against tag stored in all m
blocks of selected set.
q Valid bit checked
n If cache hit
q Processor reads data from
cache.
q Processor does not need
to be stalled.
n If cache miss
q CPU reads data from MM.
q MM block copied into
appropriate cache set.
m=4 q CPU must be stalled.

Hardware to generate the cache hit control signal.


132
Set-associative Cache: Example
Memory Specifications: Processor
Main Memory size = 256 bytes (a = 8)
Block size = 8 bytes (b = 3) Registers

H/M
lbu $s0, 0x32($zero) Block 0 Block 1
Index V Tag Data V Tag Data
lbu $s0, 0xC9($zero) 0 0 0
lbu $s0, 0xF4($zero) 1 0 0

lbu $s0, 0x37($zero) Cache


lbu $s0, 0x5D($zero)
0 1 2 3 4 5 6 7
lbu $s0, 0x58($zero)
8 9 10 11 12 13 14 15
lbu $s0, 0x1E($zero) 16 17 18 19 20 21 22 23

Block # 24 25 26 27 28 29 30 31

Main Memory
MM Address: 0000 0000
Cache specs: 2-way (w = 2), 2 sets (n = 1)
Tag index Byte #
133
Set-associative Cache: Example
Memory Specifications: Processor
0x32
Main Memory size = 256 bytes (a = 8)
Block size = 8 bytes (b = 3) Registers

H/M
Block 0 Block 1
lbu $s0, 0x32($zero) M
Index V Tag Data V Tag Data
lbu $s0, 0xC9($zero) 0 1 3 D06 0
lbu $s0, 0xF4($zero) 1 0 0

lbu $s0, 0x37($zero) Cache


lbu $s0, 0x5D($zero)
0 1 2 3 4 5 6 7
lbu $s0, 0x58($zero)
8 9 10 11 12 13 14 15
lbu $s0, 0x1E($zero) 16 17 18 19 20 21 22 23
D06
Block 6 24 25 26 27 28 29 30 31

Main Memory
MM Address: 0011 0010
Processor is stalled 100’s of clock cycles.
3 0 2
134
Set-associative Cache: Example
Memory Specifications: Processor
0xC9
Main Memory size = 256 bytes (a = 8)
Block size = 8 bytes (b = 3) Registers

H/M
Block 0 Block 1
lbu $s0, 0x32($zero) M
Index V Tag Data V Tag Data
lbu $s0, 0xC9($zero) M 0 1 3 D06 0
lbu $s0, 0xF4($zero) 1 1 12 D25 0

lbu $s0, 0x37($zero) Cache


lbu $s0, 0x5D($zero)
0 1 2 3 4 5 6 7
lbu $s0, 0x58($zero)
8 9 10 11 12 13 14 15
lbu $s0, 0x1E($zero) 16 17 18 19 20 21 22 23
D25
Block 25 24 25 26 27 28 29 30 31

Main Memory
MM Address: 1100 1001
Processor is stalled 100’s of clock cycles.
12 1 1
135
Set-associative Cache: Example
Memory Specifications: Processor
0xF4
Main Memory size = 256 bytes (a = 8)
Block size = 8 bytes (b = 3) Registers

H/M
Block 0 Block 1
lbu $s0, 0x32($zero) M
Index V Tag Data V Tag Data
lbu $s0, 0xC9($zero) M 0 1 3 D06 1 15 D30
lbu $s0, 0xF4($zero) M 1 1 12 D25 0

lbu $s0, 0x37($zero) Cache


lbu $s0, 0x5D($zero)
0 1 2 3 4 5 6 7
lbu $s0, 0x58($zero)
8 9 10 11 12 13 14 15
lbu $s0, 0x1E($zero) 16 17 18 19 20 21 22 23
D30
Block 30 24 25 26 27 28 29 30 31

Main Memory
MM Address: 1111 0100
Processor is stalled 100’s of clock cycles.
15 0 4
136
Set-associative Cache: Example
Memory Specifications: Processor
0x37 D06
Main Memory size = 256 bytes (a = 8)
Block size = 8 bytes (b = 3) Registers

H/M
Block 0 Block 1
lbu $s0, 0x32($zero) M
Index V Tag Data V Tag Data
lbu $s0, 0xC9($zero) M 0 1 3 D06 1 15 D30
lbu $s0, 0xF4($zero) M 1 1 12 D25 0

lbu $s0, 0x37($zero) H Cache


lbu $s0, 0x5D($zero)
0 1 2 3 4 5 6 7
lbu $s0, 0x58($zero)
8 9 10 11 12 13 14 15
lbu $s0, 0x1E($zero) 16 17 18 19 20 21 22 23

Block 6 24 25 26 27 28 29 30 31

Main Memory
MM Address: 0011 0111
3 0 7 Processor is not stalled.
137
Set-associative Cache: Example
Memory Specifications: Processor
0x5D
Main Memory size = 256 bytes (a = 8)
Block size = 8 bytes (b = 3) Registers

H/M
Block 0 Block 1
lbu $s0, 0x32($zero) M
Index V Tag Data V Tag Data
lbu $s0, 0xC9($zero) M 0 1 3 D06 1 15 D30
lbu $s0, 0xF4($zero) M 1 1 12 D25 1 5 D11

lbu $s0, 0x37($zero) H Cache


lbu $s0, 0x5D($zero) M
0 1 2 3 4 5 6 7
lbu $s0, 0x58($zero)
8 9 10 11 12 13 14 15
lbu $s0, 0x1E($zero) 16 17 18 19 20 21 22 23
D11
Block 11 24 25 26 27 28 29 30 31

Main Memory
MM Address: 0101 1101
Processor is stalled 100’s of clock cycles.
5 1 5
138
Set-associative Cache: Example
Memory Specifications: Processor
0x58 D11
Main Memory size = 256 bytes (a = 8)
Block size = 8 bytes (b = 3) Registers

H/M
Block 0 Block 1
lbu $s0, 0x32($zero) M
Index V Tag Data V Tag Data
lbu $s0, 0xC9($zero) M 0 1 3 D06 1 15 D30
lbu $s0, 0xF4($zero) M 1 1 12 D25 1 5 D11

lbu $s0, 0x37($zero) H Cache


lbu $s0, 0x5D($zero) M
0 1 2 3 4 5 6 7
lbu $s0, 0x58($zero) H 8 9 10 11 12 13 14 15
lbu $s0, 0x1E($zero) 16 17 18 19 20 21 22 23

Block 11 24 25 26 27 28 29 30 31

Main Memory
MM Address: 0101 1000
Processor is not stalled.
5 1 0
139
Set-associative Cache: Example
Memory Specifications: Processor
0x1E
Main Memory size = 256 bytes (a = 8)
Block size = 8 bytes (b = 3) Registers

H/M Replace which block in set #1?


Block 0 Block 1
lbu $s0, 0x32($zero) M
Index V Tag Data V Tag Data
lbu $s0, 0xC9($zero) M 0 1 3 D06 1 15 D30
lbu $s0, 0xF4($zero) M 1 1 12 D25 1 5 D11

lbu $s0, 0x37($zero) H Cache


lbu $s0, 0x5D($zero) M
0 1 2 3 4 5 6 7
lbu $s0, 0x58($zero) H 8 9 10 11 12 13 14 15
lbu $s0, 0x1E($zero) M 16 17 18 19 20 21 22 23
D03
Block 3 24 25 26 27 28 29 30 31

Main Memory
MM Address: 0001 1110
Processor is stalled 100’s of clock cycles.
1 1 6
140
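The 2-way trace above can be reproduced with a small simulator. A sketch (illustrative, not from the slides) with the example's parameters — 2 sets (n = 1), 2 ways, 8-byte blocks — and LRU replacement within each set:

```python
# 2-way set-associative cache simulator for the example above.
N_BITS, B_BITS, WAYS = 1, 3, 2   # 2 sets, 8-byte blocks, 2 blocks per set

def run_sa(trace):
    sets = [[] for _ in range(1 << N_BITS)]   # each set: tags, LRU first
    results = []
    for addr in trace:
        index = (addr >> B_BITS) & ((1 << N_BITS) - 1)
        tag = addr >> (B_BITS + N_BITS)
        s = sets[index]
        if tag in s:
            s.remove(tag)
            s.append(tag)                     # refresh recency on a hit
            results.append("H")
        else:
            if len(s) == WAYS:
                s.pop(0)                      # evict the LRU block
            s.append(tag)
            results.append("M")
    return results

trace = [0x32, 0xC9, 0xF4, 0x37, 0x5D, 0x58, 0x1E]
print(run_sa(trace))  # ['M', 'M', 'M', 'H', 'M', 'H', 'M']
```

With two ways, 0x37 now hits (its block 6 survives the 0xF4 miss that evicted it in the direct-mapped version of this trace), illustrating the benefit of the extra associativity.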

REPLACEMENT POLICY
Set-associative cache
141
Replacement Policy
n Empty block
q Use an “empty” cache block, if one is available.

q An “empty” or invalid cache block is indicated by V = 0.

n Least Recently Used (LRU)


q Choose the cache block that has not been used for the longest time.

q Simple for 2-way Set-associative caches.

q Manageable for 4-way Set-associative caches.

q Too difficult for Fully-associative caches.

n Random
q Gives approximately the same performance as LRU for high
associativity.
q Used for Fully-associative caches.
142

SPECTRUM OF ASSOCIATIVITY
Set-associative cache
143
Spectrum of Associativity
n Consider a cache with 8 blocks for data storage.

[Figure: the same 8 blocks organized four ways -- m = 1 (direct-mapped, 8 sets of 1),
 m = 2 (4 sets of 2), m = 4 (2 sets of 4), m = 8 (fully associative, 1 set of 8)]
144
Spectrum of Associativity
n Memory system specifications:
  q Main memory size = 64K bytes -> a = log2(64K) = 16 bits
  q Cache size = 128 bytes (cache size is fixed).
  q Block size = 16 bytes -> b = log2(16) = 4 bits

  Cache Type                  Cache Entries   Ways (m)   Tag bits   Index bits   Block Offset (b)
  Direct Mapped                     8             1          9           3              4
  Two-way Set Associative           4             2         10           2              4
  Four-way Set Associative          2             4         11           1              4
  Fully Associative                 1             8         12           0              4
145
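The rows of this table follow mechanically from the specifications: with 8 data blocks total, m ways give 8/m sets, the index needs log2(8/m) bits, and the tag gets whatever address bits remain. A short sketch reproducing the table:

```python
from math import log2

A, B, BLOCKS = 16, 4, 8          # address bits, block-offset bits, total data blocks
for m in (1, 2, 4, 8):           # ways per set
    sets = BLOCKS // m           # "Cache Entries" column
    n = int(log2(sets))          # index bits
    t = A - n - B                # tag bits = remaining address bits
    print(f"m={m}: sets={sets}, tag={t}, index={n}, offset={B}")
```

Note that each doubling of associativity removes one index bit and adds one tag bit, while the block offset is unchanged.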
How much Associativity?
n Increased associativity decreases miss rate.
  q But with diminishing returns.
n System: 64 KB D-cache, 128-byte blocks
n Simulation using SPEC2000 Benchmark:
  q 1-way: miss rate = 10.3%
  q 2-way: miss rate = 8.6%
  q 4-way: miss rate = 8.3%
  q 8-way: miss rate = 8.1%
146
CACHE SIZE
Set-associative cache: Example
147
Cache Size: Set-associative Cache
n Memory system specifications:
  q Main memory size = 1M byte
  q Cache size = 128 bytes
  q Block size = 8 bytes
  q Two-way Set-associative cache (w = 2).

  (illustrated here: 2 sets with two blocks each)
  Index | Block 0: V  Tag  Data | Block 1: V  Tag  Data
    0   |          0            |          0
    1   |          0            |          0

n Calculate the total size of the cache (in bits).
n Determine the number of blocks in main memory.
  q # blocks in MM = Main memory size / Block size
  q # blocks in MM = 1M byte / 8 bytes = 2^20 / 2^3 = 2^17 = 131072 blocks.
n Determine the number of blocks in the cache.
  q # blocks in cache = Cache size / Block size
  q # blocks in cache = 128 bytes / 8 bytes = 16 blocks.
148
Cache Size
n Determine the number of sets in the cache.
  q # sets in cache = # blocks in cache / # blocks in each set
  q # blocks in cache = 16 blocks
  q # blocks in each set = 2 (w = 2)
  q # sets in cache = 16 / 2 = 8 sets.
n Determine the number of address bits for the tag field.
  q a = address bits = log2(memory size in bytes) = log2(1MB) = 20 bits.
  q n = index = log2(# of sets in cache) = log2(8) = 3 bits.
  q b = block offset = log2(block size in bytes) = log2(8 bytes) = 3 bits.
  q t = tag = a - (b + n) = 20 - (3 + 3) = 14 bits.
149
Cache Size
n Determine the number of bits in each cache block.
  q Data bits = block size = 8 bytes * 8 bits/byte = 64 bits.
  q Metadata bits = Valid bit + Tag bits = 1 + 14 = 15 bits.
  q Total bits in each cache block = data bits + metadata bits = 79 bits.
n Determine the number of bits in each cache set.
  q Total bits in each cache set = # blocks in each set * total bits in each cache block
  q Total bits in each cache set = 2 blocks/set * 79 bits/block = 158 bits.
n Determine total size of cache (in bits).
  q Total size of cache = # cache sets * total bits in each cache set
  q Total size of cache = 8 sets * 158 bits/set = 1264 bits.
150
Cache Size
n Determine the cache overhead.
  q Overhead = the percentage of the cache bits needed for the metadata.
  q Total data bits = 16 blocks * 8 bytes/block * 8 bits/byte = 1024 bits.
  q Total metadata bits = 16 blocks * 15 bits/block = 240 bits.
  q Total bits in the cache = 1024 + 240 = 1264 bits.
  q Overhead = metadata / total = 240 / 1264 = 18.99 %
n This implies that only 81.01% of the cache bits are used to store data.
n As previously discussed, increasing the block size will reduce the cache overhead and the total cache size.
  q It may also lead to a reduction in the cache miss rate.
151
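The whole size-and-overhead calculation from the last few slides can be collected in one place. This is a sketch (my own, not the lecture's) assuming the same example: 1 MB main memory, 128-byte two-way set-associative cache, 8-byte blocks.

```python
from math import log2

MM, CACHE, BLOCK, WAYS = 1 << 20, 128, 8, 2   # sizes in bytes; w = 2

blocks = CACHE // BLOCK          # 16 cache blocks
sets   = blocks // WAYS          # 8 sets
a      = int(log2(MM))           # 20 address bits
b      = int(log2(BLOCK))        # 3 block-offset bits
n      = int(log2(sets))         # 3 index bits
t      = a - (b + n)             # 14 tag bits

data_bits  = blocks * BLOCK * 8  # 16 * 64 = 1024 bits of data
meta_bits  = blocks * (1 + t)    # 16 * 15 = 240 bits of valid + tag
total_bits = data_bits + meta_bits
overhead   = meta_bits / total_bits

print(total_bits, f"{overhead:.2%}")   # 1264 18.99%
```

Doubling the block size (to 16 bytes) halves the number of blocks and shrinks the metadata, which is why larger blocks reduce the overhead percentage.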
Questions?