Lecture4-Ch2-Memory Hierarchy Design
Computer Architecture
A Quantitative Approach, Fifth Edition
Chapter 2
Memory Hierarchy Design
(Pages 78 – 125) + Appendix B
Introduction
Programmers want unlimited amounts of memory with low latency.
Fast memory technology is more expensive per bit than slower memory.
Solution: organize the memory system into a hierarchy:
  The entire addressable memory space is available in the largest, slowest memory.
  Incrementally smaller and faster memories, each containing a subset of the memory below it, proceed in steps up toward the processor.
Temporal and spatial locality ensure that nearly all references can be found in the smaller memories:
  Gives the illusion of a large, fast memory being presented to the processor.
Memory Hierarchy
Memory Performance Gap
[Figure: the growing processor–memory performance gap over time, bridged by caches and high-bandwidth memory]
High-bandwidth demand example (Intel Core i7):
Four cores and a 3.2 GHz clock can generate 25.6 billion 64-bit data references/second, plus 12.8 billion 128-bit instruction references/second = 409.6 GB/s!
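A quick runnable check of the arithmetic above (a sketch; the 12.8-billion instruction-reference figure comes from the textbook example this slide summarizes):

```python
# Peak bandwidth demand of the four-core, 3.2 GHz example (sketch).
data_refs_per_sec = 25.6e9   # 64-bit data references/second (from the slide)
inst_refs_per_sec = 12.8e9   # 128-bit instruction references/second (textbook example)

total_bytes_per_sec = data_refs_per_sec * 8 + inst_refs_per_sec * 16
print(total_bytes_per_sec / 1e9, "GB/s")   # -> 409.6 GB/s
```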
Performance and Power
High-end microprocessors have >10 MB of on-chip cache:
  Consumes a large amount of the area and power budget.
Thus, designs must increasingly consider both performance and power trade-offs.
Memory Hierarchy Basics
n blocks per set => n-way set associative
Direct-mapped cache => one block per set (one-way)
Fully associative => one set
A block may be placed in any location within its set; the set is determined by the address (see the sketch below):
  (block address) MOD (number of sets)
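A minimal sketch of this placement rule, with a hypothetical 64-byte block size and 128 sets:

```python
# Map a byte address to its cache set (sketch; both parameters are assumptions).
BLOCK_SIZE = 64    # bytes per block (hypothetical)
NUM_SETS = 128     # number of sets in the cache (hypothetical)

def cache_set(addr: int) -> int:
    block_address = addr // BLOCK_SIZE   # strip the block offset
    return block_address % NUM_SETS      # block address MOD number of sets

print(cache_set(0x1234ABCD))   # the set this address maps to
```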
Hit: the data appears in some block in the upper level (example: Block X)
  Hit Rate: the fraction of memory accesses found in the upper level
  Hit Time: time to access the upper level = RAM access time + time to determine hit/miss
Miss: the data must be retrieved from a block in the lower level (Block Y)
  Miss Rate = 1 − (Hit Rate)
  Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
[Figure: Processor ↔ Cache ↔ Lower-Level Memory]
Memory Hierarchy Basics
Six basic cache optimizations (see Appendix B for quantitative examples); each trades off the terms of the AMAT equation sketched after this list:
  Larger block size
    Reduces compulsory misses
    Increases capacity and conflict misses; increases miss penalty
  Larger total cache capacity, to reduce miss rate
    Increases hit time; increases power consumption
  Higher associativity
    Reduces conflict misses
    Increases hit time; increases power consumption
  Higher number of cache levels
    Reduces overall memory access time
  Giving priority to read misses over writes
    Reduces miss penalty (a read checks the write buffer and does not wait for writes)
  Avoiding address translation during cache indexing
    Reduces hit time
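These optimizations all trade off the three components of average memory access time, AMAT = Hit time + Miss rate × Miss penalty. A minimal sketch for a two-level hierarchy (all parameter values are made up for illustration):

```python
# AMAT for a two-level cache (all numbers are illustrative, not measured).
hit_time_l1 = 1          # cycles
miss_rate_l1 = 0.05      # fraction of accesses that miss in L1
hit_time_l2 = 10         # cycles
miss_rate_l2 = 0.20      # fraction of L1 misses that also miss in L2
miss_penalty_mem = 100   # cycles to main memory

amat = hit_time_l1 + miss_rate_l1 * (hit_time_l2 + miss_rate_l2 * miss_penalty_mem)
print(amat, "cycles")    # 1 + 0.05 * (10 + 0.20 * 100) = 2.5 cycles
```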
Advanced Optimizations
1- Small and Simple First-Level Caches
[Figure: L1 size and associativity vs. access time]
[Figure: L1 size and associativity vs. power]
2- Pipelined Cache Access
Pipeline the cache access to improve bandwidth (divide the cache-access stage of the instruction pipeline into multiple stages).
The effective latency of a first-level cache hit can then be multiple clock cycles, giving a fast clock cycle time and high bandwidth but slow hits.
Example: accessing instructions from the I-cache:
  Pentium: 1 cycle
  Pentium Pro – Pentium III: 2 cycles
  Pentium 4 – Core i7: 4 cycles
Drawback: increasing the number of pipeline stages leads to a greater penalty for mispredicted branches.
3- Nonblocking Caches
In out-of-order execution processors, a nonblocking cache allows the cache to continue servicing hits while misses are outstanding, increasing cache bandwidth:
  “Hit under miss”
  “Hit under multiple misses” (requires multibanked memories)
Reduces the effective miss penalty by being helpful during a miss instead of ignoring the requests of the processor, but increases cache complexity.
4- Multibanked Caches
Organize the cache as independent banks to support simultaneous accesses (see the interleaving sketch below):
  ARM Cortex-A8 supports 1–4 banks for L2
  Intel i7 supports 4 banks for L1 and 8 banks for L2
Banking supports simultaneous accesses only when the addresses are spread across multiple banks.
The mapping of addresses to banks affects the behavior of the memory system:
  Interleave banks according to block address.
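A minimal sketch of block-address interleaving, assuming four banks and 64-byte blocks:

```python
# Interleave cache blocks across banks by block address (sketch).
NUM_BANKS = 4      # e.g., the i7's four L1 banks
BLOCK_SIZE = 64    # bytes per block (assumed)

def bank_of(addr: int) -> int:
    return (addr // BLOCK_SIZE) % NUM_BANKS

# Consecutive blocks land in consecutive banks, so a sequential access
# stream can keep all four banks busy at once.
for addr in range(0, 6 * BLOCK_SIZE, BLOCK_SIZE):
    print(hex(addr), "-> bank", bank_of(addr))
```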
5- Hardware Prefetching
Prefetch instructions or data into the cache (or an external buffer) before the processor requests them.
Hardware prefetching reduces miss rate or miss penalty.
[Figure: performance gains from hardware prefetching]
Summary
[Table: the advanced cache optimizations and their impact on hit time, bandwidth, miss penalty, miss rate, and power]
Example
In an L2 cache, a cache hit takes 0.8 ns and a cache miss takes 5.1 ns on average. The cache hit ratio is 95% and the cache miss ratio is 5%. Assuming a cycle time of 0.5 ns, compute the average memory access time.
Solution: a cache hit takes 0.8/0.5 = 1.6 → 2 cycles, and a cache miss takes 5.1/0.5 = 10.2 → 11 cycles (latencies are rounded up to whole cycles).
Average memory access cycles = 0.95 × 2 + 0.05 × 11 = 2.45 cycles.
Average memory access time = 2.45 × 0.5 = 1.225 ns.
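The same computation as a short runnable check (latencies rounded up to whole cycles, as in the solution above):

```python
import math

hit_ns, miss_ns, cycle_ns = 0.8, 5.1, 0.5
hit_ratio = 0.95

hit_cycles = math.ceil(hit_ns / cycle_ns)     # 0.8 / 0.5 = 1.6  -> 2 cycles
miss_cycles = math.ceil(miss_ns / cycle_ns)   # 5.1 / 0.5 = 10.2 -> 11 cycles

avg_cycles = hit_ratio * hit_cycles + (1 - hit_ratio) * miss_cycles
print(round(avg_cycles, 4), "cycles =",
      round(avg_cycles * cycle_ns, 4), "ns")  # 2.45 cycles = 1.225 ns
```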
http://www.bit-tech.net/hardware/memory/2007/11/15/the_secrets_of_pc_memory_part_1/3
Memory Technology
Performance metrics:
  Latency is the primary concern of the cache.
  Bandwidth is the primary concern of main memory, multiprocessors, and I/O:
    External approach (e.g., multi-bank memory)
    Internal approach (e.g., SDRAM, DDR)
Memory latency:
  Access time (AT): the time between a read request and when the desired word arrives.
SRAM:
  Requires low power to retain its bits
  Requires 6 transistors/bit
DRAM:
  Must be re-written after being read
  Must also be periodically refreshed (every ~8 ms); an entire row is refreshed at once
  One transistor/bit
  Address lines are multiplexed (see the sketch below):
    Upper half of the address: row access strobe (RAS)
    Lower half of the address: column access strobe (CAS)
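A minimal sketch of the multiplexed address split, assuming a hypothetical 24-bit DRAM address (12-bit row, 12-bit column):

```python
# Split a DRAM address into row (sent with RAS) and column (sent with CAS).
ADDR_BITS = 24              # hypothetical total address width
HALF = ADDR_BITS // 2       # upper half = row, lower half = column

def ras_cas(addr: int):
    row = addr >> HALF                # upper half of the address
    col = addr & ((1 << HALF) - 1)    # lower half of the address
    return row, col

row, col = ras_cas(0xABC123)
print(hex(row), hex(col))   # 0xabc 0x123
```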
[Figure: (a) dynamic RAM (DRAM) cell — one transistor and a storage capacitor; (b) static RAM (SRAM) cell — six transistors]
A DRAM Example
[Figure: internal organization of a DRAM chip — RAS/CAS/WE/OE controls, refresh counter and refresh circuitry, row/column address buffers and decoders, data input/output buffers]
Memory Optimizations
Amdahl suggested that memory capacity should grow linearly with processor speed; unfortunately, memory capacity and speed have not kept pace with processors.
Some optimizations:
  Multiple column accesses to the same row (the original asynchronous interface had an overhead problem)
  Synchronous DRAM (SDRAM): adds a clock to the DRAM interface and enables pipelining
  Burst mode (block transfer) with critical word first
  Wider interfaces (4 bits, 8 bits, 16 bits)
  Double data rate (DDR): transfers data on both the rising and falling clock edges
  Multiple banks on each DRAM device
DRAMs are commonly sold on small boards called dual
inline memory modules (DIMMs) that contain 4–16 DRAM
chips.
http://en.wikipedia.org/wiki/DIMM
Memory Optimizations
DDR generations:
  DDR2:
    Lower power (2.5 V → 1.8 V)
    Higher clock rates (266 MHz, 333 MHz, 400 MHz)
  DDR3:
    1.5 V
    800 MHz
  DDR4:
    1–1.2 V
    1600 MHz
Graphics DRAM is a special class of DRAM based on SDRAM designs but tailored to the higher bandwidth demands of graphics processing units; GDDR5, for example, is graphics memory based on DDR3.
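The peak transfer rate of a DDR DIMM follows directly from these numbers: bus clock × 2 transfers per cycle × bus width. A minimal sketch, assuming DDR3's 800 MHz clock from above and a standard 64-bit DIMM data bus:

```python
# Peak DIMM bandwidth = clock * 2 edges * bus width (sketch).
clock_mhz = 800     # DDR3 bus clock (from the slide)
ddr_factor = 2      # double data rate: one transfer on each clock edge
bus_bytes = 8       # 64-bit DIMM data bus (assumed)

peak_mb_per_s = clock_mhz * ddr_factor * bus_bytes
print(peak_mb_per_s, "MB/s")   # 800 * 2 * 8 = 12800 MB/s (sold as PC3-12800)
```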
Memory Optimizations
Graphics memory:
  Achieves 2–5× the bandwidth per DRAM of DDR3:
    Wider interfaces (32 vs. 16 bits)
    Higher clock rates
  Possible because the chips are soldered to the board rather than mounted on socketed DIMM modules.
Memory Power Consumption
Flash Memory
Used as secondary storage in PMDs (personal mobile devices).
A type of EEPROM; flash uses a very different architecture and has different properties than standard DRAM:
  Reads are sequential and read an entire page
  Must be erased (in blocks) before being overwritten
  Nonvolatile
  Limited number of write cycles (at least 100,000)
  Cheaper than SDRAM, more expensive than disk ($2/GiB for flash, $20–$40/GiB for SDRAM, $0.09/GiB for magnetic disk)
  Slower than SDRAM, faster than disk
Comparison
Virtual Memory
The Limits of Physical Addressing
[Figure: CPU connected directly to memory — 32-bit address lines (A0–A31) and data lines (D0–D31) carry physical addresses and data]
  All programs share one address space: the physical address space.
  Machine-language programs must be aware of the machine organization.
  There is no way to prevent a program from accessing any machine resource.
Virtual Memory
Solution: Add a Layer of Indirection
[Figure: the CPU issues virtual addresses; address-translation hardware maps them to physical addresses before they reach memory]
  User programs run in a standardized virtual address space.
  Address-translation hardware, managed by the operating system (OS), maps each virtual address to physical memory.
  Hardware supports “modern” OS features: protection, translation, sharing.
Virtual Memory
Multiprogramming, where several concurrently running programs share a computer, led to demands for protection and sharing among programs.
Protection via virtual memory:
  Keeps processes in their own memory spaces.
Role of the architecture:
  Provide two modes: user mode and supervisor mode.
  Protect certain aspects of process state (read/write privileges).
  Provide mechanisms for switching between user mode and supervisor mode.
  Provide mechanisms to limit memory accesses.
  Provide a TLB (Translation Lookaside Buffer) to translate addresses (see the sketch below); some bits in each TLB or page entry are used for page protection.
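As a rough illustration of what the translation does (not the actual hardware mechanism), here is a sketch assuming 4 KB pages, with a Python dictionary standing in for the page table/TLB and made-up entries:

```python
# Virtual-to-physical translation with a protection check (sketch).
PAGE_BITS = 12   # 4 KB pages (assumed)

# A dictionary stands in for the page table / TLB; entries are made up.
page_table = {
    0x00400: (0x12345, "r-x"),   # VPN -> (PPN, permission bits)
    0x00401: (0x12346, "rw-"),
}

def translate(vaddr: int, access: str) -> int:
    vpn = vaddr >> PAGE_BITS
    offset = vaddr & ((1 << PAGE_BITS) - 1)
    ppn, perms = page_table[vpn]          # a missing entry would be a page fault
    if access not in perms:
        raise PermissionError("protection violation")  # the per-entry protection bits
    return (ppn << PAGE_BITS) | offset

print(hex(translate(0x00400ABC, "r")))   # -> 0x12345abc
```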
Virtual Machines
Support isolation and security:
  Sharing a computer among many unrelated users.
Enabled by the raw speed of modern processors, which makes the overhead more acceptable.
Two types: system VMs (like IBM VM/370) and application VMs (like the Java VM and the .NET Framework).
A system VM allows different ISAs and operating systems to be presented to user programs:
  System VM software is called a “virtual machine monitor” or “hypervisor”.
  Individual virtual machines running under the monitor are called “guest VMs”.
Cache Review
[Figure sequence: blocks of memory (word addresses 00–4C) being placed into a cache, step by step]
[Figure: 4-way set-associative cache — a 32-bit address with a 22-bit tag indexes one of 256 sets (…, 253, 254, 255); four comparators and a 4-to-1 multiplexor produce Hit and Data]
Disadvantages of higher associativity:
  More tag bits
  More hardware
  Higher access time
Exercise
Given the following requirements for a cache design for a 32-bit-address computer (word addressable): (1) the cache contains 16 KB of data, (2) each cache block contains 16 words, and (3) the placement policy is 4-way set-associative.
  What are the lengths (in bits) of the block-offset field and the index field of the address?
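A sketch of one way to work this out, assuming 4-byte words (the exercise does not state the word size):

```python
import math

# Exercise parameters
cache_bytes = 16 * 1024   # 16 KB of data
word_bytes = 4            # assumed word size (not stated in the exercise)
words_per_block = 16
ways = 4                  # 4-way set associative

num_blocks = cache_bytes // (words_per_block * word_bytes)   # 256 blocks
num_sets = num_blocks // ways                                # 64 sets

# Word-addressable machine: the block offset selects a word within the block.
offset_bits = int(math.log2(words_per_block))   # log2(16) = 4 bits
index_bits = int(math.log2(num_sets))           # log2(64) = 6 bits
print("block offset:", offset_bits, "bits; index:", index_bits, "bits")
```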
Cache size    Miss rate
16 KB         5.7%    5.2%
64 KB         2.0%    1.9%
256 KB        1.17%   1.15%
Using the principle of locality: the larger the block, the greater the chance that parts of it will be used again.
[Figure: miss rate (0–20%) vs. block size (16, 32, 64, 128) for cache sizes 1K, 4K, 16K, 64K, and 256K]