Lecture 8: Cont. Cache Memory

William Stallings
Computer Organization and Architecture
8th Edition

Chapter 4: Cache Memory

Original slides by Adrian J. Pullin
Cont. Cache Memory
Lecture Outcomes
Understanding of:
• Replacement Algorithms
• Write Policy
• Cache Performance
• Locality of Reference
• Pentium 4 Cache Organization
• ARM Cache Organization
Replacement Algorithms (1) Direct mapping

• No choice
• Each block only maps to one line
• Replace that line
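Direct mapping needs no replacement policy because the line is fully determined by the block address. A one-line Python sketch (the function name is illustrative):

```python
# Direct mapping: block i can live only in line (i mod number_of_lines),
# so on a miss that line is replaced unconditionally.
def line_for_block(block, num_lines):
    return block % num_lines

print(line_for_block(17, num_lines=8))   # block 17 always maps to line 1
```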
Replacement Algorithms (2) Associative & Set Associative
• Hardware-implemented algorithm (for speed)
• Least Recently used (LRU)
• e.g. in 2 way set associative
– Which of the two blocks is least recently used?
• First in first out (FIFO)
– replace block that has been in cache longest
• Least frequently used (LFU)
– replace the block that has had the fewest hits
• Random
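A minimal Python sketch of LRU in a 2-way set-associative cache (the class and method names are invented for illustration; real hardware typically tracks recency with a single USE bit per set):

```python
# Sketch of LRU replacement in a 2-way set-associative cache.
class Cache2Way:
    def __init__(self, num_sets):
        self.num_sets = num_sets
        # Each set holds up to two block numbers, ordered LRU -> MRU.
        self.sets = [[] for _ in range(num_sets)]

    def access(self, block):
        """Return True on a hit, False on a miss (with LRU replacement)."""
        s = self.sets[block % self.num_sets]
        if block in s:
            s.remove(block)          # hit: move block to MRU position
            s.append(block)
            return True
        if len(s) == 2:
            s.pop(0)                 # miss: evict the least recently used block
        s.append(block)              # fill with the new block
        return False

cache = Cache2Way(num_sets=4)
for b in [0, 4, 0, 8, 4]:            # blocks 0, 4 and 8 all map to set 0
    print(b, "hit" if cache.access(b) else "miss")
```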
Write Policy
• Must not overwrite a cache block unless main memory is
up to date
• Multiple CPUs may have individual caches
• I/O may address main memory directly
Write through
• All writes go to main memory as well as cache
• Multiple CPUs can monitor main memory traffic to keep
local (to CPU) cache up to date
• Lots of traffic
• Slows down writes
• Remember bogus write-through caches!
Write back
• Updates initially made in cache only
• Update bit for cache slot is set when update occurs
• If block is to be replaced, write to main memory only if
update bit is set
• Other caches get out of sync
• I/O must access main memory through cache
• N.B. 15% of memory references are writes
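A minimal sketch contrasting the two policies on a single cache line (names are illustrative; the dirty flag plays the role of the update bit):

```python
# Write-through vs. write-back on a single cache line (illustrative only).
class Line:
    def __init__(self, tag):
        self.tag, self.data, self.dirty = tag, None, False

def write_through(line, data, memory):
    line.data = data
    memory[line.tag] = data          # every write also goes to main memory

def write_back(line, data):
    line.data = data
    line.dirty = True                # defer the memory update (set update bit)

def evict(line, memory):
    if line.dirty:                   # write back only if the line was modified
        memory[line.tag] = line.data

memory = {}
a, b = Line(tag=1), Line(tag=2)
write_through(a, 10, memory)         # memory[1] == 10 immediately
write_back(b, 20)                    # memory still has no entry for tag 2
evict(b, memory)                     # now memory[2] == 20
print(memory)                        # {1: 10, 2: 20}
```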
Multilevel Caches
• High logic density enables caches on chip
– Faster than bus access
– Frees bus for other transfers
• Common to use both on and off chip cache
– L1 on chip, L2 off chip in static RAM
– L2 access much faster than DRAM or ROM
– L2 often uses separate data path
– L2 may now be on chip
– Resulting in L3 cache
– L3 accessed via the bus, or now also on chip…
Measuring Cache Performance
• No cache: Often about 10 cycles per memory access
• Simple cache:
– tave = hC + (1-h)M
– C is often 1 clock cycle
– Assume M is 17 cycles (to load an entire cache line)
– Assume h is about 90%
– tave = 0.9(1) + 0.1(17) = 2.6 cycles/access
– What happens when h is 95%?

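These figures are easy to check, and the same arithmetic answers the 95% question (a plain Python sketch):

```python
# Average access time: t_ave = h*C + (1 - h)*M
def t_ave(h, C=1, M=17):
    return h * C + (1 - h) * M

print(t_ave(0.90))   # 2.6 cycles/access, as above
print(t_ave(0.95))   # 1.8 cycles/access
```

Raising the hit rate from 90% to 95% cuts the average access time from 2.6 to 1.8 cycles per access.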
Multi-level cache performance
• tave = h1C1 + (1-h1) h2C2 + (1-h1) (1-h2) M
– h1 = hit rate in primary cache
– h2 = hit rate in secondary cache
– C1 = time to access primary cache
– C2 = time to access secondary cache
– M = miss penalty (time to load an entire cache line
from main memory)
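The two-level formula can be evaluated the same way; as a sketch, here it is with the parameter values used in the worked example later in the lecture (h1 = 0.95, h2 = 0.90, C1 = 1, C2 = 25, M = 500):

```python
# Two-level average access time:
# t_ave = h1*C1 + (1 - h1)*h2*C2 + (1 - h1)*(1 - h2)*M
def t_ave2(h1, h2, C1, C2, M):
    return h1 * C1 + (1 - h1) * h2 * C2 + (1 - h1) * (1 - h2) * M

print(t_ave2(h1=0.95, h2=0.90, C1=1, C2=25, M=500))   # 4.575 cycles/access
```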
Processor Performance Without Cache

• 5GHz processor, cycle time = 0.2ns


• Memory access time = 100ns = 500 cycles
• Ignoring memory access time, Clocks Per Instruction (CPI) = 1
• Assuming one instruction fetch per instruction and no memory data accesses:
CPI = 1 + # stall cycles
= 1 + 500 = 501

Performance with Level 1 Cache

• Assume hit rate, h1 = 0.95


• 5GHz processor, cycle time = 0.2ns
• Memory access time = 100ns = 500 cycles
• L1 access time = 0.2 ns / cycle time (0.2 ns) = 1 cycle
• CPI = 1 + # stall cycles
= 1 + 0.05 x 500
= 26
• Processor speed increase due to cache
= 501/26 ≈ 19.3×

Performance with L1 and L2 Caches

• Assume:
– L1 hit rate, h1 = 0.95
– L2 hit rate, h2 = 0.90 (this is very optimistic!)
– L2 access time = 5ns = 25 cycles
• CPI = 1 + # stall cycles
= 1 + 0.05 (25 + 0.10 x 500)
= 1 + 3.75 = 4.75
• Processor speed increase due to both caches
= 501/4.75 ≈ 105.5×
• Speed increase due to L2 cache (over L1 alone)
= 26/4.75 ≈ 5.47×
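The three CPI figures (no cache, L1 only, L1 + L2) can be reproduced directly from the stall-cycle model used in these slides:

```python
# CPI = 1 + stall cycles per instruction (instruction fetches only,
# no data accesses), using the slide parameters.
M, C2 = 500, 25            # main memory and L2 access times in cycles
h1, h2 = 0.95, 0.90        # L1 and L2 hit rates

cpi_no_cache = 1 + M                                   # 501
cpi_l1       = 1 + (1 - h1) * M                        # 26
cpi_l1_l2    = 1 + (1 - h1) * (C2 + (1 - h2) * M)      # 4.75

print(cpi_no_cache, cpi_l1, cpi_l1_l2)
print(cpi_no_cache / cpi_l1_l2)                        # ~105.5x overall speedup
```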

Example: Hit Ratio (L1 & L2)
[Figure omitted: total hit ratio (L1 and L2) for 8 Kbyte and 16 Kbyte L1 caches]
Unified vs. Split Caches
• One cache for data and instructions or two, one for data and one for
instructions
• Advantages of unified cache
– Higher hit rate: balances the load between instruction and data fetches
– Only one cache to design and implement
• Advantages of split cache
– Eliminates cache contention between the instruction fetch/decode unit and the execution unit
– Important in pipelining
Pentium 4 Cache
• 80386 – no on chip cache
• 80486 – 8 KByte cache using 16-byte lines and four-way set-associative organization
• Pentium (all versions) – two on chip L1 caches
– Data & instructions
• Pentium III – L3 cache added off chip
• Pentium 4
– L1 caches
• 8k bytes
• 64 byte lines
• four way set associative
– L2 cache
• Feeding both L1 caches
• 256k
• 128 byte lines
• 8 way set associative
– L3 cache on chip
Pentium 4 Design Reasoning
• Decodes instructions into RISC-like micro-ops before L1 cache
• Micro-ops fixed length
– Superscalar pipelining and scheduling
• Pentium instructions long & complex
• Performance improved by separating decoding from scheduling & pipelining
– (More later – ch14)
• Data cache is write back
– Can be configured to write through
• L1 cache controlled by two bits in control register CR0
– CD = cache disable
– NW = not write-through
– Two instructions: invalidate (flush) the cache, and write back then invalidate
• L2 and L3 8-way set-associative
– Line size 128 bytes
ARM Cache Features

Core              Cache Type   Cache Size (kB)    Line Size (words)   Associativity   Location   Write Buffer Size (words)
ARM720T           Unified      8                  4                   4-way           Logical    8
ARM920T           Split        16/16 D/I          8                   64-way          Logical    16
ARM926EJ-S        Split        4-128/4-128 D/I    8                   4-way           Logical    16
ARM1022E          Split        16/16 D/I          8                   64-way          Logical    16
ARM1026EJ-S       Split        4-128/4-128 D/I    8                   4-way           Logical    8
Intel StrongARM   Split        16/16 D/I          4                   32-way          Logical    32
Intel XScale      Split        32/32 D/I          8                   32-way          Logical    32
ARM1136-JF-S      Split        4-64/4-64 D/I      8                   4-way           Physical   32
ARM Cache Organization
• Small FIFO write buffer
– Enhances memory write performance
– Between cache and main memory
– Small compared with the cache
– Data put in write buffer at processor clock speed
– Processor continues execution
– External writes to memory proceed in parallel until the buffer is empty
– If buffer full, processor stalls
– Data in write buffer not available until written
• So keep buffer small
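A minimal sketch of the stall-on-full behavior described above (the depth and names are illustrative, not a model of any particular ARM core):

```python
# FIFO write buffer: the processor deposits writes at full speed and
# stalls only when the buffer is full; a slower external interface
# drains entries to main memory in parallel. Illustrative sketch.
from collections import deque

class WriteBuffer:
    def __init__(self, depth=8):
        self.fifo = deque()
        self.depth = depth

    def cpu_write(self, addr, data):
        """Return False if the processor must stall (buffer full)."""
        if len(self.fifo) >= self.depth:
            return False
        self.fifo.append((addr, data))   # accepted at processor clock speed
        return True

    def drain_one(self, memory):
        """Called by the memory interface; empties the buffer in parallel."""
        if self.fifo:
            addr, data = self.fifo.popleft()
            memory[addr] = data
```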
ARM Cache and Write Buffer Organization
[Figure omitted]
Review Questions

❑What are the differences among sequential access, direct access, and random
access?
❑What is the general relationship among access time, memory cost, and capacity?
❑How does the principle of locality relate to the use of multiple memory levels?
❑What is the distinction between spatial locality and temporal locality?
❑In general, what are the strategies for exploiting spatial locality and temporal
locality?
Thank you
