11: Virtual Memory
Looking Ahead: Multicore Cache Hierarchies
• Shared vs Private L2
  • Shared: Data distributed across all cores' L2s (capacity)
  • Private: Data local to L1 (locality + no interference)
• Uniform vs Non-uniform
  • For shared, does any hit take the same amount of time?
• Centralized vs Distributed?
  • Centralized physical structure, or on-chip network?
• Maintain Coherence? (all copies of one cache line are the same)
• Maintain Consistency? (semantics of loads and stores with shared memory)
Today: Virtual Memory
• The operating system (OS)
  • A super-application
  • Hardware support for an OS
• Virtual memory
  • Page tables and address translation
  • TLBs and memory hierarchy issues
• [Figure: App / App / App running on system software over Mem, CPU, and I/O]
A Computer System: Hardware
• CPUs and memories
• Connected by memory bus
• I/O peripherals: storage, input, display, network, …
• With separate or built-in DMA
• Connected by system bus (which is connected to memory bus)
A Computer System: + App Software
• Application software: computer must do something
• Pain to just use hardware directly – how to share?
• [Figure: application software running on the hardware: two CPU/$ and memory on the memory bus, a bridge to the system (I/O) bus, DMA engines, and an I/O controller]
A Computer System: + OS
• Operating System (OS): virtualizes hardware for apps
• Abstraction: provides services (e.g., threads, files, etc.)
+ Simplifies app programming model, raw hardware is nasty
• Isolation: gives each app illusion of private CPU, memory, I/O
+ Simplifies app programming model
+ Increases hardware resource utilization
• [Figure: multiple applications running on the OS, which manages the keyboard, display, disk, and NIC]
Operating System (OS) and User Apps
• Sane system development requires a split
• Hardware itself facilitates/enforces this split
System Calls
• Controlled transfers to/from OS
Interrupts – OS wants your attention
• Exceptions: synchronous, generated by running app
• E.g., illegal insn, divide by zero, etc.
• Interrupts: asynchronous events generated externally
• E.g., timer, I/O request/reply, etc.
• E.g., timer: a programmable on-chip interrupt
  • Initialize with some number of microseconds
  • Timer counts down and interrupts when it reaches zero
• “Interrupt” handling: same mechanism for both
• “Interrupts” are on-chip signals/bits
• Either internal (e.g., timer, exceptions) or from I/O devices
• Processor continuously monitors interrupt status, when one is high …
• Hardware jumps to some preset address in OS code (interrupt vector)
• Like an asynchronous, non-programmatic SYSCALL
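As a rough illustration of this flow, a hedged C sketch; the register address, vector slot, and all names (TIMER_COUNT, interrupt_vector, os_init_timer) are hypothetical, not any particular chip's interface:

    #include <stdint.h>

    /* Hypothetical memory-mapped countdown register: software writes a
       count in microseconds; hardware decrements it and raises an
       interrupt when it reaches zero. */
    volatile uint32_t *const TIMER_COUNT = (uint32_t *)0xFFFF0000;

    /* Hypothetical interrupt vector: preset handler addresses the
       processor jumps to when an interrupt line goes high. */
    void (*interrupt_vector[256])(void);

    void timer_handler(void) {
        /* OS regains control here, e.g., to preempt the running app */
        *TIMER_COUNT = 10000;                 /* re-arm: fire again in 10ms */
    }

    void os_init_timer(void) {
        interrupt_vector[0] = timer_handler;  /* hypothetical slot 0 */
        *TIMER_COUNT = 10000;                 /* initialize countdown */
    }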
Virtualizing Processors
• How do multiple apps (and OS) share the processors?
• Goal: applications think there are an infinite # of processors
Virtualizing Main Memory
• How do multiple apps (and the OS) share main memory?
• Goal: each application thinks it has infinite memory
• Biggest memory we have: disk (so slooooow)
• Faster memory: DRAM (so expensive!)
• One app may want more memory than is in main memory
• App’s insn/data footprint > DRAM
• Requires main memory to act like a cache
• With disk as next level in memory hierarchy (slow)
• Write-back, write-allocate, large blocks or “pages”
• Solution:
• Part #1: treat memory as a “cache”
• Store the overflowed blocks in “swap” space on disk
• Part #2: add a level of indirection (address translation)
Virtual Memory
• Programs use virtual addresses (VA): 0…2^N–1
  • VA size also referred to as machine size
  • E.g., Pentium4 is 32-bit, SPARC is 64-bit
• 4th level: disk (swap space)
  • SSD/HDD/etc. -- SLOW!
• [Figure: a program's code, heap, and stack in the virtual address space, backed by disk]
Low %miss At All Costs
• For a memory component: thit vs. %miss tradeoff
Memory Organization Parameters
Parameter           I$/D$      L2           Main Memory
thit                <1ns       2–10ns       100ns
tmiss               2–10ns     100ns        10ms (10M ns)
Capacity            8–64KB     256KB–20MB   up to 1TB
Block size          16–64B     64–256B      4+KB
Associativity       1–8        4–16         full
Replacement policy  LRU/NMRU   LRU/NMRU     working set
Write-through?      either     no           no
Write-allocate?     either     yes          yes
Victim buffer?      yes        maybe        no
Prefetching?        either     yes          software
Uses of Virtual Memory
• Virtual memory is quite a useful feature
• Automatic, transparent memory management just one use
• “Functionality problems are solved by adding levels of indirection”
• Example: program isolation and multiprogramming
• Each process thinks it has 2N bytes of address space
• Each thinks its stack starts at address 0xFFFFFFFF
• System maps VPs from different processes to different PPs
+ Prevents processes from reading/writing each other’s memory
• [Figure: Program1 and Program2 virtual pages mapped to disjoint physical pages]
More Uses of Virtual Memory
• Protection
• Piggy-back mechanism to implement page-level protection
• Map virtual page to physical page
… and to Read/Write/Execute protection bits in page table
• Attempt an illegal access, execute data, or write read-only data?
• Exception → OS terminates program
• Inter-process communication
• Map virtual pages in different processes to same physical page
• Share files via the UNIX mmap() call
• [Figure: OS, App1, and App2 page tables; some virtual pages map to the same physical page]
• Example above
• 64KB pages → 16-bit POFS
• 32-bit machine → 32-bit VA → 16-bit VPN
• Maximum 256MB memory → 28-bit PA → 12-bit PPN
Mechanics of Address Translation
• How are addresses translated?
• In software (now) but with hardware acceleration (a little later)
• Each process allocated a page table (PT)
  • Managed by the operating system
• Translation is a table lookup, indexed by VPN:

struct PTE {
    union { int ppn; int disk_block; };
    bool is_valid, is_dirty;
};
struct PTE pt[NUM_VIRTUAL_PAGES];

• [Figure: VPN indexes the page table (PT)]
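A minimal sketch of the lookup in C, assuming the structure above, a 32-bit VA, and 4KB pages (translate and PAGE_BITS are illustrative names):

    #include <stdbool.h>
    #include <stdint.h>

    #define PAGE_BITS 12                    /* assume 4KB pages */
    #define NUM_VIRTUAL_PAGES (1u << 20)    /* 32-bit VA with 4KB pages */

    struct PTE {
        union { int ppn; int disk_block; };
        bool is_valid, is_dirty;
    };
    struct PTE pt[NUM_VIRTUAL_PAGES];

    /* Index the page table with the VPN, then splice the PPN onto the
       unchanged page offset. */
    uint32_t translate(uint32_t va) {
        uint32_t vpn  = va >> PAGE_BITS;
        uint32_t pofs = va & ((1u << PAGE_BITS) - 1);
        if (!pt[vpn].is_valid) {
            /* page fault: OS fetches the page from pt[vpn].disk_block */
        }
        return ((uint32_t)pt[vpn].ppn << PAGE_BITS) | pofs;
    }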
Page Table Size
• How big is a page table on the following machine?
• 4B page table entries (PTEs)
• 32-bit machine
• 4KB pages
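• Worked answer: a 32-bit machine has 2^32 B of virtual address space; with 4KB (2^12 B) pages that is 2^20 virtual pages, so the flat table holds 2^20 PTEs × 4B = 4MB per process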
Multi-Level Page Table
• 20-bit VPN
  • Upper 10 bits (VPN[19:10]) index the 1st-level table of "pointers"
  • Lower 10 bits (VPN[9:0]) index a 2nd-level table of PTEs
• The pt "root" register points to the 1st-level table

struct PTE {
    union { int ppn; int disk_block; };
    bool is_valid, is_dirty;
};
struct L2PT {
    struct PTE ptes[1024];
};
struct L2PT *pt[1024];
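A minimal sketch of the two-level walk under the same assumptions (4KB pages, 20-bit VPN; walk is an illustrative name):

    #include <stdbool.h>
    #include <stdint.h>

    struct PTE {
        union { int ppn; int disk_block; };
        bool is_valid, is_dirty;
    };
    struct L2PT { struct PTE ptes[1024]; };

    struct PTE *walk(struct L2PT **root, uint32_t va) {
        uint32_t vpn = va >> 12;                      /* 20-bit VPN */
        struct L2PT *l2 = root[(vpn >> 10) & 0x3ff];  /* VPN[19:10] */
        if (l2 == NULL)
            return NULL;  /* 2nd-level table never allocated: saves space */
        return &l2->ptes[vpn & 0x3ff];                /* VPN[9:0] */
    }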
Translation in x86
• An app issues a virtual address (VA); the OS sets CR3 to the page-table root; hardware walks the page table to produce the physical address (PA)
• At most 4 memory accesses per translation (one per level)
• Courtesy Jayneel Gandhi (Agile Paging)
Alternative: Inverted/Hashed Page Tables
• One entry per physical page rather than per virtual page, so table size tracks physical memory
• Lookup: hash(VPN, PID) + table base → PA of the IPTE; compare the stored (VPN, PID) tag and use the PTE on a match
• [Figure: VPN and PID hashed into an offset from the inverted page table's base]
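A hedged sketch of that lookup in C; the table size, hash function, and names (ipt_lookup, IPT_ENTRIES) are illustrative, and a real design would probe or chain on hash conflicts:

    #include <stdbool.h>
    #include <stdint.h>

    #define IPT_ENTRIES 4096   /* one entry per physical page (illustrative) */

    struct IPTE {
        uint32_t vpn;   /* tag: which virtual page maps here */
        uint32_t pid;   /* tag: owning process */
        uint32_t pte;   /* translation + permission bits */
        bool     valid;
    };
    struct IPTE ipt[IPT_ENTRIES];

    /* Hash (VPN, PID) to a slot; compare tags; single probe shown here. */
    bool ipt_lookup(uint32_t vpn, uint32_t pid, uint32_t *pte) {
        uint32_t slot = (vpn ^ (pid * 2654435761u)) % IPT_ENTRIES;
        if (ipt[slot].valid && ipt[slot].vpn == vpn && ipt[slot].pid == pid) {
            *pte = ipt[slot].pte;
            return true;
        }
        return false;   /* fall back to a slower search / fault */
    }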
Address Translation Mechanics
• Three remaining questions
• Who performs translation?
• When do you translate?
• Where does page table reside?
• Conceptual view:
• Who: OS
• Disallow program from modifying its own page table entries
• When: Translate virtual address before every cache access
• Walk the page table for every load/store/instruction-fetch
• Where: Memory
• Actual approach:
• Cache translations in a “translation cache” to avoid repeated lookup
Translation Lookaside Buffer
• Functionality problem? Add indirection. Performance problem? Add a cache.
• Address translation too slow?
  • Cache translations in a translation lookaside buffer (TLB)
  • Small cache: 16–512 entries
  • Small TLBs often fully associative (<64 entries)
  + Exploits temporal locality in the page table (PT)
  • Each entry: "tag" = VPN, "data" = PPN
• What if an entry isn't found in the TLB?
  • Invoke TLB miss handler
• [Figure: CPU issues VAs to the TLB; the resulting PAs access I$/D$, L2, and main memory]
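A minimal sketch of a fully associative TLB lookup (sizes and names such as TLB_ENTRIES and tlb_lookup are illustrative; hardware compares all entries in parallel rather than looping):

    #include <stdbool.h>
    #include <stdint.h>

    #define TLB_ENTRIES 64

    struct TLBEntry {
        bool     valid;
        uint32_t vpn;   /* "tag"  */
        uint32_t ppn;   /* "data" */
    };
    struct TLBEntry tlb[TLB_ENTRIES];

    bool tlb_lookup(uint32_t vpn, uint32_t *ppn) {
        for (int i = 0; i < TLB_ENTRIES; i++) {
            if (tlb[i].valid && tlb[i].vpn == vpn) {
                *ppn = tlb[i].ppn;
                return true;    /* TLB hit */
            }
        }
        return false;           /* TLB miss: invoke miss handler */
    }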
TLB Misses and Miss Handling
• TLB miss: requested PTE not in TLB, search page table
• Software routine, e.g., Alpha, SPARC, MIPS
• Special instructions for accessing TLB directly
• Latency: one or two memory accesses + trap
• Hardware finite state machine (FSM), e.g., x86
• “Page table walker”
• Store page table root in hardware register
• Page table root and table pointers are physical addresses
+ Latency: saves cost of OS call
• In both cases, reads use the standard cache hierarchy
+ Allows caches to help speed up search of the page table
Page Faults
• Page fault: PTE not in page table
• Page is simply not in memory
• Starts out as a TLB miss, detected by OS handler/hardware FSM
• OS routine
• Choose a physical page to replace
• “Working set”: more refined software version of LRU
• Tries to see which pages are actively being used
• Balances needs of all current running applications
• If dirty, write to disk
• Read missing page from disk
• Takes so long (~10ms), OS schedules another task
• Treat like a normal TLB miss from here
Physical (Address) Caches
• Memory hierarchy so far: physical caches
  • Indexed and tagged by PAs: Physically Indexed (PI), Physically Tagged (PT)
  • Translate VA to PA at the outset, before any cache access
  + Cached inter-process communication works
    • Single copy indexed by PA
  – Slow: adds at least one cycle to thit
• [Figure: CPU → TLB → I$/D$ → L2 → main memory; every cache is accessed with PAs]
Virtual Address Caches (VI/VT)
• Alternative: virtual caches
  • Indexed and tagged by VAs (VI and VT)
  • Translate to PAs only to access L2
  + Fast: avoids translation latency in the common case
  – Problem: VAs from different processes name distinct physical locations (with different values)
    • Homonyms: same address, different meaning
  • What to do on process switches?
    • Flush caches? Slow
    • Add process IDs to cache tags
  • Does inter-process communication work?
    • Synonyms: multiple VAs map to the same PA
    • Can't allow the same PA in the cache twice
    • Can be handled, but very complicated (bonus)
• [Figure: CPU → I$/D$ (accessed with VAs) → TLB → L2 → main memory (accessed with PAs)]
Parallel TLB/Cache Access (VI/PT)
• Compromise: index the cache with untranslated page-offset bits while the TLB translates the VPN in parallel, then compare (==) the TLB's PPN against the cache's physical tags
• [Figure: TLB and cache data/tag arrays accessed in parallel; the tag compare uses the translated PPN]
Cache Size And Page Size
• For the index to come entirely from the page offset: cache size / associativity ≤ page size
TLB Organization
• Like caches: TLBs also have ABCs
• Capacity
• Associativity (At least 4-way associative, fully-associative common)
• What does it mean for a TLB to have a block size of two?
• Two consecutive VPs share a single tag
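• For example, with a block size of two, the entry's tag would be VPN >> 1, VPN bit 0 would select between the entry's two PPNs, and one fill brings in translations for both neighboring pages (exploiting spatial locality)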
What addresses in the pipeline?
• What type of address for the LSQ?
  • Physical addresses make the comparisons reliable: synonyms within a process resolve to the same physical address
• [Figure: load queue (LQ) and store queue (SQ), each with head/tail pointers and per-entry address comparators (==) ordered by age, feeding the D$/TLB]
Intel Core i7 Memory System
• [Figure: processor package with four cores; each core has instruction fetch, an MMU (addr translation), and a page-table walker; all connect to main memory]
End-to-end Core i7 Address Translation
• [Figure: 36-bit VPN + 12-bit VPO virtual address; the VPN splits into a 32-bit TLBT and 4-bit TLBI for the L1 TLB (16 sets, 4 entries/set); on a TLB miss, a CR3-rooted 4-level walk (VPN1–VPN4, 9 bits each) fetches PTEs to produce the 40-bit PPN + 12-bit PPO physical address (PA); the PA splits into 40-bit CT, 6-bit CI, and 6-bit CO for the L1 d-cache (64 sets, 8 lines/set); L1 misses go to L2, L3, and main memory; the result returns 32/64 bits to the CPU]
Alternative to VM: base/bound registers
• Each process is given a non-overlapping, contiguous
physical memory region
• When a process is swapped in, OS sets base to the start of the process's memory region and bound to the end of the region
• On memory references, HW does translation & protection check
  • PA = EA + base
  • provided (PA < bound), else violation
• [Figure: base and bound registers delimit the active process's region in physical memory]
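A minimal sketch of that check in C (base, bound, and trap are illustrative stand-ins for the hardware registers and trap path):

    #include <stdint.h>

    uint32_t base, bound;          /* loaded by the OS when the process is swapped in */

    void trap(void) { /* OS handles the protection violation */ }

    uint32_t bb_translate(uint32_t ea) {
        uint32_t pa = ea + base;   /* PA = EA + base */
        if (!(pa < bound))
            trap();                /* bound violation */
        return pa;
    }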
Also Segmented Address Space
• segment == a base and bound pair
• segmented addressing gives each process multiple segments
• initially, separate code and data segments
  • 2 sets of base-and-bound registers, for inst and data fetch
  • Allowed sharing of code segments
• Became more and more elaborate: code, data, stack, etc.
• Also (ab)used as a way for an ISA with a small EA space to address a larger physical memory space
• Segment tables must be privileged data structures, private/unique to each process
• Lookup: SEG # indexes the segment table for (base, bound); then PA = EA + base, checked against bound (okay?)
• [Figure: SEG # selects a segment-table entry; an adder and comparator produce the PA and the protection check]
Memory Protection and Isolation
• Important role of virtual memory today
Stack Smashing via Buffer Overflow
int i = 0;
char buf[128];
while ((buf[i++] = getchar()) != '\n')
    ;   /* length-unchecked copy into a 128B stack buffer */
return;

• Stack smashing via buffer overflow
  • Oldest trick in the virus book
  • Exploits stack frame layout and …
  • Sloppy code: length-unchecked copy to stack buffer
  • "Attack string": code (128B) + &buf[0] (4B)
  • Caller return address (ra) replaced with pointer to attack code
  • Caller return … executes attack code at caller's privilege level
• [Figure: stack frame with buf[128] below the saved ra; the overflow overwrites ra with &buf[0], which points to the attack code]
Page-Level Protection
struct PTE {
    union { int ppn; int disk_block; };
    int is_valid, is_dirty, permissions;  /* permission bits: R/W/X */
};
• Page-level protection
• Piggy-backs on translation infrastructure
• Each PTE associated with permission bits: Read, Write, eXecute
• Read/execute (RX): for code
• Read (R): read-only data
• Read/write (RW): read-write data
• TLB access traps on illegal operations (e.g., write to RX page)
• To defeat stack-smashing? Set stack permissions to RW
• Will trap if you try to execute &buf[0]
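A minimal sketch of the permission check (the bit encoding and access_ok are illustrative):

    #include <stdbool.h>

    #define PERM_R 0x4
    #define PERM_W 0x2
    #define PERM_X 0x1

    /* True if every requested right is present in the page's permission
       bits; on false, hardware raises an exception to the OS. */
    bool access_ok(int permissions, int requested) {
        return (permissions & requested) == requested;
    }

    /* E.g., executing from an RW stack page:
       access_ok(PERM_R | PERM_W, PERM_X) == false -> trap. */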
(Not really VM, but let's talk about it)
Background: μArch Dependence Graphs
[Fields, ISCA 2001]
• Nodes represent execution events: F (fetch), D (dispatch), E (execute), P (complete), C (commit)
• Edges represent in-order dependencies and their latencies
  • Fetch→dispatch latency, dispatch-before, execution latency, complete-before
  • Plus data dependences, resource constraints, and branch-misprediction edges
• Execution time = length of the critical path through the graph
• [Figure: per-instruction F→D→E→P→C chains with weighted edges for the sequence bgz r1; r2 = ld[r3]; r4 = r2 * 2; r5 = r6 * …]
• Reduced complexity model with only
• D: Dispatch
• E: Execute
• C: Commit
Claim: Only critical L2 Hits matter
Intuition: Latency matters much more for L1 cache!
Intuition: L2 cache least sensitive to non-critical access latency
Criticality Aware Tiered Cache Hierarchy (CATCH)
Criticality detection in hardware
TACT: Data Prefetchers
Backups
Itanium Prevalidated tags
• [Figure: VA accesses the TLB and I$ in parallel]
RAM
• This one is 4x2 (four words of two bits)
• Bits in a word connected by wordline
• Bits in a position connected by bitline
• Operation
  • Address decodes into M wordlines
  • High wordline → word driven onto bitlines
  • Bit/bitline connection → read/write
• Access latency ∝ #ports * √#bits
• [Figure: 4x2 RAM array with wordlines 0–3 and bitlines driving the data outputs]
SRAM
• "Static" means
  • Inverters connected to pwr/gnd
  + Bits naturally/continuously "refreshed"
• [Figure: SRAM cell built from cross-coupled inverters with access ports]
DRAM
• DRAM: dynamic RAM
  • Bits stored as capacitors
  + Single transistors as ports
• "Dynamic" means
  • Capacitors not connected to pwr/gnd
  – Stored charge decays over time
  – Must be explicitly refreshed
• [Figure: DRAM array with address decode and data lines]
Moore’s Law
Year Capacity $/MB Access time
1980 64Kb $1500 250ns
1988 4Mb $50 120ns
1996 64Mb $10 60ns
2004 1Gb $0.5 35ns
DRAM Operation I
• Read: similar to cache read
  • Phase I: pre-charge bitlines to 0.5V
• Write: two steps
  • Step IA: read selected word into row buffer
  • Step IB: write data into row buffer
  • Step II: write row buffer back to selected word
• [Figure: DRAM array with sense amps (sa) and data latches (DL) on the read/write path]
DRAM Refresh
• DRAM periodically refreshes all contents
• Loops through all words
  • Reads each word into the row buffer, which restores the stored charge
• [Figure: refresh logic sweeping each row through the sense amps (sa) and data latches (DL)]
DRAM Parameters
• DRAM parameters
  • Large capacity: e.g., 64–256Mb
  • Arranged as square
    + Minimizes wire length
    + Maximizes refresh efficiency
  • Narrow data interface: 1–16 bit
    • Cheap packages → few bus pins
• [Figure: DRAM bit array with address in, data out]
Two-Level Addressing
• Two-level addressing
  • Row decoder and column muxes share the address lines
    • Address [23:12] → row decoder; address [11:2] → column muxes
  • Two strobes (RAS, CAS) signal which part of the address is currently on the bus
• Asynchronous access
  • Level 1: RAS high
    • Upper address bits on address bus
    • 12-to-4K decoder reads one row of the 4K x 4K bit array into the row buffer
  • Level 2: CAS high
    • Lower address bits on address bus
    • Four 1K-to-1 muxes steer the row buffer onto the data bus
Access Latency and Cycle Time
• DRAM access much slower than SRAM
• More bits → longer wires
• Buffered access with two-level addressing
• SRAM access latency: <1ns
• DRAM access latency: 30–50ns
DRAM Latency and Power Derivations
• Same basic form as SRAM
• Most of the equations are geometrically derived
• Same structure for decoders, wordlines, muxes
• Some differences
• Somewhat different pre-charge/sensing scheme
• Array access represents smaller part of total access
• Arrays not multi-ported
Building a Memory System
• How to build an efficient main memory out of standard DRAM chips?
• How many DRAM chips?
• What width/speed (data) bus to use?
  • Assume a separate address bus
• [Figure: CPU with I$/D$ and L2, main memory built from DRAM, and disk (swap) below]
An Example Memory System
• Parameters
• 32-bit machine
• L2 with 32B blocks
• 4Mx16b DRAMs, 20ns access time, 40ns cycle time
• 100MHz (10ns period) data bus
• 100MHz, 32-bit address bus
First Memory System Design
T (ns)  DRAM (4Mx2B)  Data Bus (2B)
10      [31:30]
20      [31:30]
30      refresh       [31:30]
40      refresh
50      [29:28]
Second Memory System Design
T (ns)  DRAM (4Mx2B)  Bus (4b)
10      [31:30]
20      [31:30]
30      refresh       [31H]
40      refresh       [31L]
50      [29:28]       [30H]
Third Memory System Design
T (ns)  DRAM0    DRAM1    …  DRAM15   Bus (32B)
10      [31:30]  [29:28]  …  [1:0]
20      [31:30]  [29:28]  …  [1:0]
30      refresh  refresh  …  refresh  [31:0]
40      refresh  refresh  …  refresh

(16 chips, each 4Mx2B)
Latency and Bandwidth
• In general, given bus parameters …
• Find smallest number of chips that minimizes cycle time
• Approach: match bandwidths
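• Worked check with the running example: the 2B bus can move 2B every 10ns, while a single 4Mx2B DRAM delivers 2B only once per 40ns cycle, so matching bandwidths takes 40ns / 10ns = 4 chips, which is exactly the fourth design that follows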
Fourth Memory System Design
• 2B bus
  • Bus b/w: 2B/10ns
  • DRAM b/w: 2B/40ns
  • → 4 DRAM chips (each 4Mx2B)

T (ns)  DRAM0    DRAM1    DRAM2    DRAM3    Bus (2B)
10      [31:30]  [29:28]  [27:26]  [25:24]
20      [31:30]  [29:28]  [27:26]  [25:24]
30      refresh  refresh  refresh  refresh  [31:30]
40      refresh  refresh  refresh  refresh  [29:28]
50      [23:22]  [21:20]  [19:18]  [17:16]  [27:26]
60      [23:22]  [21:20]  [19:18]  [17:16]  [25:24]
…       …        …        …        …        …
110     refresh  refresh  refresh  refresh  [15:14]
120     refresh  refresh  refresh  refresh  [13:12]
130     [7:6]    [5:4]    [3:2]    [1:0]    [11:10]
140     [7:6]    [5:4]    [3:2]    [1:0]    [9:8]
More Bandwidth From One DRAM
• EDO: extended data out
• Multiple row buffer reads/writes
• Send only column addresses
Memory Access and Clock Frequency
• Nominal clock frequency applies to CPU and caches
• Memory bus has its own clock, typically much slower
• DRAM has no clock (SDRAM operates on bus clock)
Memory/Clock Frequency Example
• Parameters
• 1GHz CPU, base CPI = 1
• I$: 1% miss rate, 32B blocks (ignore D$, L2)
• Data bus: 100MHz, 8B (ignore address bus)
• DRAM: 10ns access, 20ns cycle, #chips to match bus bandwidth
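• A plausible worked answer, assuming the 50ns total memory latency used on the next slide (10ns DRAM access + 4 bus transfers x 10ns to move a 32B block over the 8B bus): 50ns at 1GHz = 50 cycles, so CPI = 1 + (0.01 x 50) = 1.5 and MIPS = 1000MHz / 1.5 CPI ≈ 667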
Memory/Clock Frequency Example
• What are CPI and MIPS if clock speed is doubled?
• Memory parameters same: 50ns access, 40ns cycle
• 2GHz clock → 50ns = 100 cycles
• CPI+memory = 1 + (0.01*100) = 1 + 1 = 2
• MIPS+memory = 2GHz / 2 CPI = 2000MHz / 2 CPI = 1000
Digital Rights Management
• Digital rights management
• Question: how to enforce digital copyright?
• Electronically, not legally
• “Trying to make bits un-copiable is like trying to make water un-wet”
Aside: Public-Key Cryptography
• Public-key cryptography
• Asymmetric: pair of keys
• Kpub : used for encryption, published
• Kpriv : used for decryption, secret
• acrypt(acrypt(M, Kpub), Kpriv) = acrypt(acrypt(M, Kpriv), Kpub) = M
• Well-known example: RSA
• Two uses
• Encryption
• Someone sends you encrypted message M: C = acrypt(M, Kpub)
• You are the only one that can decrypt it
• Authentication/Digital Signature
• You send someone a chosen plaintext M
• They “sign” it by sending back DS = acrypt(M, Kpriv)
• If acrypt(DS, Kpub) = M, then they are who Kpub says they are
Research: XOM
• [Figure: the vendor encrypts the program © with your chip's Kpub; an on-chip engine between the caches and memory decrypts it with Kpriv]
• eXecute Only Memory (XOM)
• Stanford research project [Lie+, ASPLOS’00]
• Two registers: Kpriv, Kpub different for every chip (Flash program)
• Software can get at Kpub, but Kpriv is hardware’s secret
• Hardware encryption/decryption engine on L2 fill/spill path
• Vendor sells you acrypt(©, Kpub)
+ Even if someone copies it, they won’t have Kpriv to decrypt it
• Plaintext © only exists on-chip
+ Even OS can never see plaintext ©
XOM: Not Quite
• Performance consideration
• Asymmetric en-/de-cryption is slow, symmetric (one key) faster
• E.g., DES, AES (Rijndael)
– Problem: can’t publish encryption key without also...
• XOM Take II
• Vendor chooses random symmetric key Ksym
• Sells you scrypt(©, Ksym) + acrypt(Ksym, Kpub)
• Two-stage decryption
• Decrypt Ksym using Kpriv : slow (but for one piece of data)
• Decrypt © using Ksym : fast
• Note: SSL does the same thing
• Uses asymmetric cryptography to choose symmetric session key
Error Detection: Parity
• Parity: simplest scheme
• f(data[N–1:0]) = XOR(data[N–1], …, data[1], data[0])
+ Single-error detect: detects a single bit flip (common case)
• Will miss two simultaneous bit flips …
• But what are the odds of that happening?
– Zero-error correct: no way to tell which bit flipped
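A minimal sketch of the parity computation (parity is an illustrative name):

    #include <stdint.h>

    /* Even parity: XOR of all data bits, as in f() above. */
    int parity(uint64_t data) {
        int p = 0;
        while (data) {
            p ^= (int)(data & 1);   /* fold in each bit */
            data >>= 1;
        }
        return p;   /* stored with the data; recomputed and compared on read */
    }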
Error Correction: Hamming Codes
• Hamming Code
• H(A,B) = number of 1’s in A^B (number of bits that differ)
• Called “Hamming distance”
• Use D data bits + C check bits to construct a set of "codewords"
  • Check bits are parities over different subsets of the data bits
• For all pairs of codewords A, B: H(A,B) ≥ α
  • No combination of α–1 bit flips transforms one codeword into another
• For simple parity: α = 2
• Errors of δ bits (or fewer) can be detected if α = δ + 1
• Errors of β bits or fewer can be corrected if α = 2β + 1
• Errors of δ bits can be detected and errors of β bits can be
corrected if α = β + δ + 1
SEC Hamming Code
• SEC: single-error correct
• C = log2D + 1
+ Relative overhead decreases as D grows
• Example: D = 4 → C = 3
• d1 d2 d3 d4 c1 c2 c3 → c1 c2 d1 c3 d2 d3 d4
• c1 = d1 ^ d2 ^ d4 , c2 = d1 ^ d3 ^ d4 , c3 = d2 ^ d3 ^ d4
• Syndrome: ci ^ c’i = 0 ? no error : points to flipped-bit
• Working example
  • Original data = 0110 → c1 = 1, c2 = 1, c3 = 0
  • Flip d2 → data = 0010 → c’1 = 0, c’2 = 1, c’3 = 1
  • Syndrome = 101 (binary 5) → 5th bit = d2
  • Flip c2 instead → c’1 = 1, c’2 = 0, c’3 = 0
  • Syndrome = 010 (binary 2) → 2nd bit = c2
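A minimal C sketch of this D = 4 code, packing the slide's layout c1 c2 d1 c3 d2 d3 d4 into the low 7 bits of a byte (encode and correct are illustrative names):

    #include <stdint.h>

    /* Encode 4 data bits (d1 = MSB of the nibble) into 7 bits:
       positions 1..7 = c1 c2 d1 c3 d2 d3 d4, position 1 = MSB. */
    uint8_t encode(uint8_t d) {
        int d1 = (d >> 3) & 1, d2 = (d >> 2) & 1;
        int d3 = (d >> 1) & 1, d4 = d & 1;
        int c1 = d1 ^ d2 ^ d4;
        int c2 = d1 ^ d3 ^ d4;
        int c3 = d2 ^ d3 ^ d4;
        return (uint8_t)((c1 << 6) | (c2 << 5) | (d1 << 4) |
                         (c3 << 3) | (d2 << 2) | (d3 << 1) | d4);
    }

    /* Recompute the check bits; a nonzero syndrome is the 1-based
       position of the flipped bit, which is flipped back. */
    uint8_t correct(uint8_t w) {
        int c1 = (w >> 6) & 1, c2 = (w >> 5) & 1, d1 = (w >> 4) & 1;
        int c3 = (w >> 3) & 1, d2 = (w >> 2) & 1, d3 = (w >> 1) & 1;
        int d4 = w & 1;
        int s1 = c1 ^ d1 ^ d2 ^ d4;   /* covers positions 1,3,5,7 */
        int s2 = c2 ^ d1 ^ d3 ^ d4;   /* covers positions 2,3,6,7 */
        int s3 = c3 ^ d2 ^ d3 ^ d4;   /* covers positions 4,5,6,7 */
        int syndrome = (s3 << 2) | (s2 << 1) | s1;
        if (syndrome)
            w ^= (uint8_t)(1 << (7 - syndrome));
        return w;
    }

    /* Slide's example: encode(0x6) == 0b1100110 for data 0110; flipping
       d2 (position 5) gives syndrome 5, and correct() restores it. */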
SECDED Hamming Code
• SECDED: single error correct, double error detect
• C = log2D + 2
• Additional parity bit to detect additional error
• Example: D = 4 → C = 4
• d1 d2 d3 d4 c1 c2 c3 → c1 c2 d1 c3 d2 d3 d4 c4
• c4 = c1 ^ c2 ^ d1 ^ c3 ^ d2 ^ d3 ^ d4
• Syndrome == 0 and c’4 == c4 → no error
• Syndrome != 0 and c’4 != c4 → 1-bit error
• Syndrome != 0 and c’4 == c4 → 2-bit error
• Syndrome == 0 and c’4 != c4 → c4 error
• Many machines today use 64-bit SECDED code
• C = 8 (one additional byte, 12% overhead)
• ChipKill: correct any aligned 4-bit error
• If an entire DRAM chips dies, the system still works!
Bonus
This Unit: Main Memory
• Memory hierarchy review
• Virtual memory
  • Address translation and page tables
  • Virtual memory's impact on caches
  • Page-based protection
• Organizing a memory system
  • Bandwidth matching
• Error correction
• [Figure: system stack from application through OS, compiler, firmware, CPU/memory/I/O, digital circuits, and gates & transistors]
Static Random Access Memory
• Read sequence
  1. Address decode
  2. Drive row select
• [Figure: SRAM array with row select, bitline/_bitline pairs, and a differential sense amp and mux on the outputs]
Dynamic Random Access Memory
• Bits stored as charges on node capacitance (non-restorative)
• [Figure: 1T DRAM cell with row enable and bitline]

"the one single development that put computers on their feet was the invention of a reliable form of memory, namely the core memory … Its cost was reasonable, it was reliable, and because it was reliable it could in due course be made large"
  Maurice Wilkes, Memoirs of a Computer Pioneer, 1985
DRAM Basics [Jacob and Wang]
• Precharge and Row Access
DRAM Basics, cont.
• Column Access
DRAM Basics, cont.
• Data Transfer
Open v. Closed Pages
• Open Page
• Row stays active until another row needs to be accessed
• Acts as memory-level cache to reduce latency
• Variable access latency complicates memory controller
• Higher power dissipation (sense amps remain active)
• Closed Page
• Immediately deactivate row after access
• All accesses become Activate Row, Read/Write, Precharge
DRAM Bandwidth
• Use multiple DRAM chips to increase bandwidth
  • Recall: accesses are the same size as a second-level cache block
  • Example: 16 two-byte-wide chips for a 32B access
DRAM Evolution
• Survey by Cuppu et al.
1. Early Asynchronous Interface
2. Fast Page Mode/Nibble Mode/Static Column (skip)
3. Extended Data Out
4. Synchronous DRAM & Double Data Rate
5. Rambus & Direct Rambus
6. FB-DIMM
Old 64Mbit DRAM Example from Micron
• [Figure: Micron datasheet block diagram showing clock recovery and the row-address and column-address paths]
Extended Data Out (EDO)
• [Figure: RAS'/CAS' timing diagrams; EDO latches the row address once, then streams multiple column accesses from the row buffer]
FB-DIMM
Wide v. Narrow Interfaces
• High frequency → short wavelength → data skew issues
• Balance wire lengths
Rambus RDRAM
• High-frequency, narrow channel
• Time-multiplexed "bus" → dynamic point-to-point channels
• ~40 pins, 1.6GB/s
• Proprietary solution
• Never gained industry-wide acceptance (cost and power)
• Used in some game consoles (e.g., PS2)
DRAM Reliability
• One last thing about DRAM technology … errors
• DRAM bits can flip from 0 → 1 or 1 → 0
• Small charge stored per bit
• Energetic α-particle strikes disrupt stored charge
• Many more bits
• Modern DRAM systems: built-in error detection/correction
• Today all servers; most new desktop and laptops
• Key idea: checksum-style redundancy
• Main DRAM chips store data, additional chips store f(data)
• |f(data)| < |data|
• On read: re-compute f(data), compare with stored f(data)
• Different ? Error …
• Option I (detect): kill program
• Option II (correct): enough information to fix error? fix and go on
DRAM Error Detection and Correction
• [Figure: address and data buses to four 4Mx2B data chips (0–3) plus one extra 4Mx2B chip storing f(data); a mismatch on recomputation raises the error signal]
Interleaved Main Memory
• Divide memory into M banks and “interleave” addresses
across them, so word A is
• in bank (A mod M)
• at word (A div M)
Bank 0     Bank 1     Bank 2     …   Bank M–1
word 0     word 1     word 2     …   word M–1
word M     word M+1   word M+2   …   word 2M–1
word 2M    word 2M+1  word 2M+2  …   word 3M–1

• [Figure: the PA splits into word-in-doubleword, bank, and doubleword-in-bank fields; four memory controllers (MC), each with its own DRAM chips, serve interleaved blocks B, B+64, B+128, B+192 over the data bus]
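A minimal sketch of the mapping arithmetic (M and interleave are illustrative names; with M a power of two these are just the low and high address bits):

    #include <stdint.h>

    #define M 4u   /* number of banks (illustrative) */

    /* Word address A lives in bank (A mod M) at index (A div M). */
    void interleave(uint32_t A, uint32_t *bank, uint32_t *index) {
        *bank  = A % M;
        *index = A / M;
    }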
Research: Processing in Memory
• Processing in memory
  • Embed some ALUs in DRAM
  • Picture is logical, not physical
  • Do computation in DRAM rather than move data from DRAM to the CPU and compute on the CPU
• [Figure: DRAM bit array with embedded compute logic]
• Memory hierarchy (recap)
  • Upper components: small, fast, expensive
  • Lower components: big, slow, cheap
  • tavg of hierarchy is close to thit of upper (fastest) component
  • 10/90 rule: 90% of stuff found in fastest component
  • Temporal/spatial locality: automatic up-down data movement
Software Managed Memory
• Isn’t full associativity difficult to implement?
• Yes … in hardware
• Implement fully associative memory in software