
Advanced Computer Architecture I

Virtual Memory
Looking Ahead: Multicore Cache
Hierarchies

• Shared vs Private L2
• Shared: data distributed across all cores' L2 (capacity)
• Private: data local to L1 (locality + no interference)
• Uniform vs Non-uniform
• For shared, does any hit take the same amount of time?
• Centralized vs Distributed?
• Centralized physical structure, or on-chip network?
• Maintain Coherence? (all copies of one cache line are the same)
• Maintain Consistency? (semantics of loads and stores with shared memory)
17
Today: Virtual Memory
• The operating system (OS)
• A super-application (system software)
• Hardware support for an OS
• Virtual memory
• Page tables and address translation
• TLBs and memory hierarchy issues
[Figure: apps running on system software, on top of Mem/CPU/I/O hardware]

18
A Computer System: Hardware
• CPUs and memories
• Connected by memory bus
• I/O peripherals: storage, input, display, network, …
• With separate or built-in DMA
• Connected by system bus (which is connected to memory bus)

[Figure: CPUs/caches and memory on the memory bus; a bridge connects it to the system (I/O) bus, which hosts DMA engines, an I/O controller, keyboard, display, NIC, and disk]

19
A Computer System: + App Software
• Application software: computer must do something
• Pain to just use hardware directly – how to share?

[Figure: the same hardware diagram, with application software layered on top]
20
A Computer System: + OS
• Operating System (OS): virtualizes hardware for apps
• Abstraction: provides services (e.g., threads, files, etc.)
+ Simplifies app programming model, raw hardware is nasty
• Isolation: gives each app illusion of private CPU, memory, I/O
+ Simplifies app programming model
+ Increases hardware resource utilization
[Figure: multiple applications running on the OS, which runs on the hardware diagram from before]
21
Operating System (OS) and User Apps
• Sane system development requires a split
• Hardware itself facilitates/enforces this split

• Operating System (OS): a super-privileged process


• Manages hardware resource allocation/revocation for all processes
• Has direct access to resource allocation features
• Aware of many low level hardware details
• Aware of other processes
• Talks directly to input/output devices (device driver software)

• User-level apps: ignorance is bliss


• Unaware of most nasty hardware details
• Unaware of other apps (and OS)
• Explicitly denied access to resource allocation features

22
System Calls
• Controlled transfers to/from OS

• System Call: a user-level app “function call” to OS


• Leave description of what you want done in registers
• SYSCALL instruction (also called TRAP or INT)
• Entry point into the OS for privileged operations
• Basic operation
• Processor jumps to requested handler
• Sets privileged mode
• OS code performs operation
• OS does a “return from system call”
• Unsets privileged mode
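A minimal sketch of the same request made two ways on a Linux/glibc system (the file descriptor, message, and use of SYS_write are illustrative assumptions): write() is the usual libc wrapper, while syscall() issues the trap with an explicit system call number.

#include <unistd.h>        /* write(), syscall() */
#include <sys/syscall.h>   /* SYS_write */

int main(void) {
    const char msg[] = "hello from user level\n";

    /* Usual path: the libc wrapper loads the arguments and executes the
       SYSCALL/TRAP instruction for you. */
    write(1, msg, sizeof msg - 1);

    /* Same request made explicitly: leave the description of what you want
       done (call number + arguments) and trap into the OS. */
    syscall(SYS_write, 1, msg, sizeof msg - 1);
    return 0;
}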

23
Interrupts – OS wants your attention
• Exceptions: synchronous, generated by running app
• E.g., illegal insn, divide by zero, etc.
• Interrupts: asynchronous events generated externally
• E.g., timer, I/O request/reply, etc.
• E.g., timer: programmable on-chip interrupt
• Initialize with some number of micro-seconds
• Timer counts down and interrupts when reaches zero
• “Interrupt” handling: same mechanism for both
• “Interrupts” are on-chip signals/bits
• Either internal (e.g., timer, exceptions) or from I/O devices
• Processor continuously monitors interrupt status, when one is high …
• Hardware jumps to some preset address in OS code (interrupt vector)
• Like an asynchronous, non-programmatic SYSCALL

24
Virtualizing Processors
• How do multiple apps (and OS) share the processors?
• Goal: applications think there are an infinite # of processors

• Solution: time-share the resource


• Trigger a context switch at a regular interval (~5ms)
• Pre-emptive: app doesn’t yield CPU, OS forcibly takes it
+ Stops greedy apps from starving others
• Architected state: PC, registers
• Save and restore them on context switches
• Memory state?
• Non-architected state: caches, predictor tables, etc.
• Ignore or flush
• OS responsible to handle context switching
• Hardware support is just a timer interrupt
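A sketch of what the timer-interrupt handler does with architected state; the structure layout, register count, and round-robin policy are assumptions for illustration, not any particular OS's code.

#include <string.h>

struct arch_state { unsigned long pc; unsigned long regs[32]; };  /* hypothetical layout */

struct arch_state hw;        /* stands in for the real CPU registers           */
struct arch_state proc[4];   /* saved architected state, one slot per process  */
int cur = 0;

/* Called from the timer interrupt (pre-emptive, ~5ms interval). */
void on_timer_interrupt(void) {
    memcpy(&proc[cur], &hw, sizeof hw);   /* save PC + registers                  */
    cur = (cur + 1) % 4;                  /* pick the next process (round robin)  */
    memcpy(&hw, &proc[cur], sizeof hw);   /* restore its architected state        */
    /* non-architected state (caches, predictor tables) is ignored or flushed    */
}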

25
Virtualizing Main Memory
• How do multiple apps (and the OS) share main memory?
• Goal: each application thinks it has infinite memory
• Biggest memory we have: Disk (so slooooow)
• Faster memory DRAM (so expensive!)
• One app may want more memory than is in main memory
• App’s insn/data footprint > DRAM
• Requires main memory to act like a cache
• With disk as next level in memory hierarchy (slow)
• Write-back, write-allocate, large blocks or “pages”
• Solution:
• Part #1: treat memory as a “cache”
• Store the overflowed blocks in “swap” space on disk
• Part #2: add a level of indirection (address translation)

27
Virtual Memory
• Programs use virtual addresses (VA)
• 0…2^N–1
• VA size also referred to as machine size
• E.g., Pentium4 is 32-bit, SPARC is 64-bit
• Memory uses physical addresses (PA)
• 0…2^M–1 (typically M<N, especially if N=64)
• 2^M is most physical memory the machine supports
• VA→PA at page granularity (VP→PP)
• Small granularity too expensive!
• By "system"
• Mapping need not preserve contiguity
• VP need not be mapped to any PP
• Unmapped VPs live on disk (swap)
[Figure: a program's code/heap/stack virtual pages mapped to main memory physical pages or to disk]
28
How to manage our memory as a cache?
• 1st/2nd levels: SRAM caches (I$, D$, L2)
• Set associative / hardware managed
• 3rd level: main memory (DRAM)
• Made of DRAM
• Managed in software (how?)
• Block size? 4KB+
• Associativity? Fully associative
• Replacement policy? In software
• Write-back vs. write-through? Write-back
• Write-allocate vs. write-non-allocate? Write-allocate
• 4th level: disk (swap space)
• SSD/HDD/etc. -- SLOW!
[Figure: CPU → I$/D$ → L2 → main memory → disk]
30
Low %miss At All Costs
• For a memory component: thit vs. %miss tradeoff

• Upper components (I$, D$) emphasize low thit


• Frequent access → minimal thit important
• tmiss is not bad → minimal %miss less important
• Low capacity/associativity, write-back or write-thru

• Moving down (L2) emphasis turns to %miss


• Infrequent access → minimal thit less important
• tmiss is bad → minimal %miss important
• High capacity/associativity, write-back

• For memory, emphasis entirely on %miss


• tmiss is disk access time (measured in ms, not ns)

31
Memory Organization Parameters
Parameter I$/D$ L2 Main Memory
thit <1ns 2-10ns 100ns
tmiss 2-10ns 100ns 10ms (10M ns)
Capacity 8–64KB 256KB–20MB Up to 1TB
Block size 16–64B 64–256B 4+KB
Associativity 1–8 4–16 Full
Replacement Policy LRU/NMRU LRU/NMRU working set
Write-through? Either No No
Write-allocate? Either Yes Yes
Victim buffer? Yes Maybe No
Prefetching? Either Yes Software

32
Uses of Virtual Memory
• Virtual memory is quite a useful feature
• Automatic, transparent memory management just one use
• “Functionality problems are solved by adding levels of indirection”
• Example: program isolation and multiprogramming
• Each process thinks it has 2N bytes of address space
• Each thinks its stack starts at address 0xFFFFFFFF
• System maps VPs from different processes to different PPs
+ Prevents processes from reading/writing each other’s memory

[Figure: virtual pages of Program1 and Program2 mapped to disjoint physical pages]

33
More Uses of Virtual Memory
• Protection
• Piggy-back mechanism to implement page-level protection
• Map virtual page to physical page
… and to Read/Write/Execute protection bits in page table
• Attempt an illegal access, to execute data, or to write read-only data?
• Exception → OS terminates program
• Inter-process communication
• Map virtual pages in different processes to same physical page
• Share files via the UNIX mmap() call
[Figure: OS, App1, and App2 virtual pages mapped to the same physical page]

• Don’t duplicate shared libraries!


34
Address Translation
virtual address[31:0]  = VPN[31:16] | POFS[15:0]
                           translate    don't touch
physical address[27:0] = PPN[27:16] | POFS[15:0]

• VA→ PA mapping called address translation


• Split VA into virtual page number (VPN) and page offset (POFS)
• Translate VPN into physical page number (PPN)
• POFS is not translated
• VA→ PA = [VPN, POFS] → [PPN, POFS]

• Example above
• 64KB pages → 16-bit POFS
• 32-bit machine → 32-bit VA → 16-bit VPN
• Maximum 256MB memory → 28-bit PA → 12-bit PPN
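A small sketch of the bit manipulation for the 64KB-page example above; the page-table lookup that produces the PPN is shown on the next slide, so it is simply passed in here.

#include <stdint.h>

#define PAGE_BITS 16   /* 64KB pages -> 16-bit POFS */

uint32_t va_to_pa(uint32_t va, uint32_t ppn /* result of the VPN lookup */) {
    uint32_t pofs = va & ((1u << PAGE_BITS) - 1);  /* "don't touch"        */
    uint32_t vpn  = va >> PAGE_BITS;               /* what gets translated */
    (void)vpn;                                     /* lookup not shown     */
    return (ppn << PAGE_BITS) | pofs;              /* [PPN, POFS]          */
}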

35
Mechanics of Address Translation
• How are addresses translated?
• In software (now) but with hardware acceleration (a little later)
• Each process allocated a page table (PT)
• Managed by the operating system

• Maps VPs to PPs or to disk (swap) addresses


• VP entries empty if page never referenced

[Figure: the page table (PT), indexed by VPN; each entry maps a VP to a PP in main memory or to a block on disk (swap)]

struct PTE {
    union { int ppn; int disk_block; };
    bool is_valid, is_dirty;
};
struct PTE pt[NUM_VIRTUAL_PAGES];

int translate(int vpn) {
    if (pt[vpn].is_valid)
        return pt[vpn].ppn;
    /* otherwise: page is on disk or unmapped (handled by the OS) */
}

36
Page Table Size
• How big is a page table on the following machine?
• 4B page table entries (PTEs)
• 32-bit machine
• 4KB pages

• 32-bit machine → 32-bit VA → 4GB virtual memory


• 4GB virtual memory / 4KB page size → 1M VPs
• 1M VPs * 4B PTE → 4MB (per process)

• How big would the page table be with 64KB pages?


• 256KB
• How big would it be for a 64-bit machine?
• 18,014,398,509,481,984 bytes, i.e. 16PB (oops)
• Page tables can get big
• There are ways of making them smaller
• PA = f(VA): many different data structures possible
• Want a compact representation that has fast lookups!
37
Here’s an interesting datastructure!
• What is a "trie" data structure?
• Also called a "prefix tree"
• What is it used for?
• What properties does it have?
• How is it different from a binary tree?
• How is it different than a hash table?

Better worst case # accesses compared to hash table!

[Figure: a trie rooted at "root"; each node branches on the next symbol ("a"–"d") of the key]
38
Multi-Level Page Table
• One way: multi-level page tables
• Tree of page tables
• Lowest-level tables hold PTEs
• Upper-level tables hold pointers to lower-level tables
• Different parts of VPN used to index different levels

• Example: two-level page table


• Compute number of pages needed for lowest-level (PTEs)
• 4KB pages / 4B PTEs → 1K PTEs/page
• 1M PTEs / (1K PTEs/page) → 1K pages
• Compute number of pages needed for upper-level (pointers)
• 1K lowest-level pages → 1K pointers
• 1K pointers * 4B each (32-bit) → 4KB → 1 upper-level page

39
Multi-Level Page Table
• 20-bit VPN
• Upper 10 bits (VPN[19:10]) index the 1st-level table
• Lower 10 bits (VPN[9:0]) index a 2nd-level table
[Figure: pt "root" holds the 1st-level pointers; each pointer selects a page of 2nd-level PTEs]

struct PTE {
    union { int ppn; int disk_block; };
    bool is_valid, is_dirty;
};

struct L2PT {
    struct PTE ptes[1024];
};
struct L2PT *pt[1024];

int translate(int vpn) {
    struct L2PT *l2pt = pt[vpn >> 10];
    if (l2pt && l2pt->ptes[vpn & 1023].is_valid)
        return l2pt->ptes[vpn & 1023].ppn;
    /* otherwise: 2nd-level table absent or page not resident */
}
40
Multi-Level Page Table (PT)
• Have we saved any space?
• Isn’t total size of 2nd level tables same as single-level
table (i.e., 4MB)?
• Yes, but …

• Large virtual address regions unused


• Corresponding 2nd-level tables need not exist
• Corresponding 1st-level pointers are null

• Example: 2MB code, 64KB stack, 16MB heap


• Each 2nd-level table maps 4MB of virtual addresses
• 1 for code, 1 for stack, 4 for heap, (+1 1st-level)
• 7 total pages = 28KB (much less than 4MB)

41
Translation in X86

[Figure: the application issues a virtual address (VA); hardware walks the four-level x86 page table rooted at CR3 (set by the OS) to produce the physical address (PA)]
Courtesy Jayneel Gandhi (Agile Paging)
At most memory accesses per translation = 4
42
Alternative: Inverted/Hashed Page Tables
[Figure: a hash of (VPN, PID) plus the table base gives the address of an inverted page table entry (IPTE); each entry holds a VPN, PID, and PTE, one entry per physical page]

• Size of the inverted page table only needs to be proportional to the size of the physical memory
• Each VPN can only be mapped to a small set of entries according to a hash function
• To translate a VPN, check all allowed table entries for a matching VPN and PID
• How many memory lookups per translation?

43
Address Translation Mechanics
• Three remaining questions
• Who performs translation?
• When do you translate?
• Where does page table reside?

• Conceptual view:
• Who: OS
• Disallow program from modifying its own page table entries
• When: Translate virtual address before every cache access
• Walk the page table for every load/store/instruction-fetch
• Where: Memory

• Actual approach:
• Cache translations in a “translation cache” to avoid repeated lookup

44
Translation Lookaside Buffer
• Functionality problem? Add indirection
• Performance problem? Add a cache
• Address translation too slow?
• Cache translations in a translation lookaside buffer (TLB)
• Small cache: 16–512 entries
• Small TLBs often fully associative (<64)
+ Exploits temporal locality in the page table (PT)
• What if an entry isn't found in the TLB?
• Invoke TLB miss handler
[Figure: CPU presents a VA to the TLB; on a hit, the PA goes to I$/D$, L2, and main memory; each TLB entry is a VPN "tag" paired with a PPN "data" field]
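A minimal sketch of a fully associative TLB lookup; the entry count and the linear search are illustrative (real hardware compares all tags at once).

#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 64
struct tlb_entry { uint32_t vpn; uint32_t ppn; bool valid; };
struct tlb_entry tlb[TLB_ENTRIES];

/* Hit: return the PPN. Miss: return -1 and invoke the TLB miss handler,
   which walks the page table and refills an entry. */
int tlb_lookup(uint32_t vpn) {
    for (int i = 0; i < TLB_ENTRIES; i++)
        if (tlb[i].valid && tlb[i].vpn == vpn)
            return (int)tlb[i].ppn;
    return -1;
}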

45
TLB Misses and Miss Handling
• TLB miss: requested PTE not in TLB, search page table
• Software routine, e.g., Alpha, SPARC, MIPS
• Special instructions for accessing TLB directly
• Latency: one or two memory accesses + trap
• Hardware finite state machine (FSM), e.g., x86
• “Page table walker”
• Store page table root in hardware register
• Page table root and table pointers are physical addresses
+ Latency: saves cost of OS call
• In both cases, reads use the standard cache hierarchy
+ Allows caches to help speed up search of the page table

• Nested TLB miss: miss handler itself misses in the TLB


• Solution #1: Allow recursive TLB misses (very tricky)
• Solution #2: Lock TLB entries for page table into TLB
• Solution #3: Avoid the problem by using physical addresses in the page table
46
TLB Performance
• TLB Reach = # TLB entries * page size = 64 * 4KB = 256KB << L2 cache size
• Solution #1: Big pages (e.g., 4MB)
• TLB Reach = 256MB, but internal fragmentation
• How to support both big and small pages?
• Solution #2: Two-level TLB
• L1: 64-128 entries, L2: 512-2048 entries
• Solution #3: Software TLB (aka TSB)
• In-memory TLB: 32K entries (or more)
• Low associativity (e.g., 2-way), longer hit time
• Much faster than page table access

47
Page Faults
• Page fault: PTE not in page table
• Page is simply not in memory
• Starts out as a TLB miss, detected by OS handler/hardware FSM

• OS routine
• Choose a physical page to replace
• “Working set”: more refined software version of LRU
• Tries to see which pages are actively being used
• Balances needs of all current running applications
• If dirty, write to disk
• Read missing page from disk
• Takes so long (~10ms), OS schedules another task
• Treat like a normal TLB miss from here

48
Physical (Address) Caches
• Memory hierarchy so far: physical caches
• Indexed and tagged by PAs
• Physically Indexed (PI), Physically Tagged (PT)
• Translate to PA at the outset
+ Cached inter-process communication works
• Single copy indexed by PA
– Slow: adds at least one cycle to thit
[Figure: CPU → TLB (VA→PA) → I$/D$ → L2 → main memory, all indexed and tagged by PA]

49
Virtual Address Caches (VI/VT)
• Alternative: virtual caches
• Indexed and tagged by VAs (VI and VT)
• Translate to PAs only to access L2
+ Fast: avoids translation latency in common case
– Problem: the same VA in different processes names distinct physical locations (with different values)
  (homonyms: same address, different meaning)
• What to do on process switches?
• Flush caches? Slow
• Add process IDs to cache tags
• Does inter-process communication work?
• Synonyms: multiple VAs map to same PA
• Can't allow same PA in the cache twice
• Can be handled, but very complicated (bonus)
[Figure: CPU → I$/D$ indexed and tagged by VA → TLB → L2 and main memory accessed by PA]

50
Parallel TLB/Cache Access (VI/PT)

• Compromise: access TLB in parallel
• In small caches, the index bits of the VA and PA are the same
• VI == PI
• Use the VA to index the cache
• Tagged by PAs
• Cache access and address translation in parallel
+ No context-switching/aliasing problems
+ Fast: no additional thit cycles
• Common organization in processors today
[Figure: CPU sends the VA to the TLB and to the I$/D$ index in parallel; the TLB's PA is compared against the cache tags; misses go to L2 and main memory by PA]

Cache view:    tag [31:12] | index [11:5] | off [4:0]
Virtual view:  VPN [31:16] | page offset [15:0]
Physical view: PPN [31:16] | page offset [15:0]

51
Parallel Cache/TLB Access

Fully associative TLB


==
• Parallel cache/TLB … ==
• If address translation
doesn’t change index ==
• VPN/index don’t overlap

TLB ==

TLB hit/miss— — cache


cache hit/miss
virtual tag [31:12] index [11:5] [4:0]
address VPN [31:16] page offset [15:0]

data
52
Cache Size And Page Size

[Figure: the cache index bits must lie within the page offset; here tag [31:12] | index [11:5] | [4:0] against VPN [31:16] | page offset [15:0]]

• Relationship between page size and L1 cache size


• Forced by non-overlap between VPN and IDX portions of VA
• Which is required for TLB access
• Rule: (cache size) / (associativity) ≤ page size
• Result: associativity increases allowable cache sizes
• Example: Pentium 4, 4KB pages, 8KB, 2-way SA L1 data cache
• If cache is too big, same issues as virtually-indexed caches
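The rule above for the Pentium 4 example, written out as a check (numbers taken from the slide):

#include <assert.h>

int main(void) {
    unsigned page_size  = 4 * 1024;   /* 4KB pages              */
    unsigned cache_size = 8 * 1024;   /* 8KB L1 data cache      */
    unsigned assoc      = 2;          /* 2-way set associative  */
    /* index bits fit inside the page offset only if this holds: */
    assert(cache_size / assoc <= page_size);   /* 4KB <= 4KB: OK */
    return 0;
}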

53
TLB Organization
• Like caches: TLBs also have ABCs
• Capacity
• Associativity (At least 4-way associative, fully-associative common)
• What does it mean for a TLB to have a block size of two?
• Two consecutive VPs share a single tag

• Like caches: there can be L2 TLBs


• Why? Think about this …

• Rule of thumb: TLB should “cover” L2 contents (maybe LLC)


• In other words: (#PTEs in TLB) * page size ≥ L2 size
• Why? Think about relative miss latency in each …

54
What addresses in the pipeline?
• What type of address for the LSQ?
• (synonyms within a process -> same physical address)
[Figure: load queue (LQ) and store queue (SQ), each with head and tail pointers; every load's address is compared (==) against older stores, with age logic deciding forwarding, store position, and flushes; addresses come from the D$/TLB]
55
Intel Core i7 Memory System
[Figure: processor package with 4 cores. Each core: instruction fetch, MMU (address translation), and a hardware page table walker;
 L1 d-cache: 32 KB, 8-way; L1 i-cache: 32 KB, 8-way; L1 d-TLB: 64 entries, 4-way; L1 i-TLB: 128 entries, 4-way;
 private L2 unified cache: 256 KB, 8-way; private L2 unified TLB: 512 entries, 4-way.
 Shared L3 unified cache: 8 MB, 16-way (shared by all cores); QuickPath interconnect: 4 links @ 25.6 GB/s each, to other cores and the I/O bridge;
 DDR3 memory controller: 3 x 64 bit @ 10.66 GB/s, 32 GB/s total (shared by all cores), to main memory]
56
End-to-end Core i7 Address Translation
[Figure: the CPU issues a 48-bit virtual address (36-bit VPN + 12-bit VPO). The VPN is split into a TLB tag (TLBT, 32 bits) and TLB index (TLBI, 4 bits) for the L1 TLB (16 sets, 4 entries/set). On a TLB miss, hardware walks the four-level page table rooted at CR3 (VPN1–VPN4, 9 bits each, selecting successive PTEs). The resulting physical address (PA) is 52 bits (40-bit PPN + 12-bit PPO), split into cache tag/index/offset (CT 40, CI 6, CO 6 bits) for the L1 d-cache (64 sets, 8 lines/set); L1 misses go to L2, L3, and main memory, and the 32/64-bit result returns to the CPU]
57
Alternative to VM: base/bound registers
• Each process is given a non-overlapping, contiguous
physical memory region

• When a process is swapped in, the OS sets base to the start of the process's memory region and bound to the end of the region
• On memory references, HW does translation & a protection check
• PA = EA + base
• allowed provided (PA < bound), else a violation
• base and bound are privileged control registers
[Figure: base and bound delimit the active process's region in physical memory; other processes occupy other regions]
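A sketch of the per-reference check; the register names are placeholders (in a real machine base/bound are privileged control registers, not C globals).

#include <stdint.h>

uint32_t base, bound;   /* set by the OS when the process is swapped in */

/* Hardware translation + protection check on every memory reference. */
uint32_t bb_translate(uint32_t ea, int *violation) {
    uint32_t pa = ea + base;
    *violation = !(pa < bound);   /* outside the process's region -> trap */
    return pa;
}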

58
Also Segmented Address Space
• segment == a base and bound pair
• segmented addressing gives each process multiple segments
• initially, separate code and data segments
  - 2 sets of base-and-bound registers, for instruction and data fetch
  - allowed sharing code segments
• became more and more elaborate: code, data, stack, etc.
• also (ab)used as a way for an ISA with a small EA space to address a larger physical memory space
• segment tables must be privileged data structures and private/unique to each process
[Figure: SEG # selects a segment-table entry; (1) PA = EA + base, (2) PA is checked against bound (okay?)]

Cheaper translation, but fragmentation?


59
Virtual Memory
• Virtual memory ubiquitous today
• Certainly in general-purpose (in a computer) processors
• But even many embedded (in non-computer) processors support it

• Several forms of virtual memory


• Paging (aka flat memory): equal sized translation blocks
• Most systems do this
• Segmentation: variable sized (overlapping?) translation blocks
• x86 used segmentation (rather than wider addresses) to break the 16-bit (64KB) limit
• Memory allocation not fun … (fragmentation?)
• Paged segments: too many indirections …
• Today: X86-64 does not use segmentation in 64-bit mode
• forces all segment registers (CS, SS, DS, ES) to a flat 0–2^64 range

60
Memory Protection and Isolation
• Important role of virtual memory today

• Virtual memory protects applications from one another


• OS uses indirection to isolate applications
• One buggy program should not corrupt the OS or other programs
+ Comes “for free” with translation
– However, the protection is limited

– What about protection from …


• Viruses and worms?
• Stack smashing
• Malicious/buggy services?
• Other applications with which you want to communicate

61
Stack Smashing via Buffer Overflow
int i = 0;
char buf[128];
while ((buf[i++] = getchar()) != '\n') ;
return;

• Stack smashing via buffer overflow
• Oldest trick in the virus book
• Exploits stack frame layout and …
• Sloppy code: length-unchecked copy to stack buffer
• "Attack string": code (128B) + &buf[0] (4B)
• Caller return address (ra) replaced with pointer to attack code, &buf[0]
• Caller return …
• … executes attack code at caller's privilege level
[Figure: stack frame with buf[128] just below the saved return address (ra); the overflow fills buf with attack code and overwrites ra with &buf[0]]
62
Page-Level Protection
struct PTE {
    union { int ppn; int disk_block; };
    int is_valid, is_dirty, permissions;
};

• Page-level protection
• Piggy-backs on translation infrastructure
• Each PTE associated with permission bits: Read, Write, eXecute
• Read/execute (RX): for code
• Read (R): read-only data
• Read/write (RW): read-write data
• TLB access traps on illegal operations (e.g., write to RX page)
• To defeat stack-smashing? Set stack permissions to RW
• Will trap if you try to execute &buf[0]

– Unfortunately, hackers have many other tricks (return-oriented programming)
63
Bonus

64
(not really VM, but lets talk about it)

• Criticality Aware Cache Hierarchies


• Problem?
• Insight?
• Approach?
• Benefits?

65
66
Background: μArch Dependence Graphs
[Fields, ISCA 2001]
• Represent execution as a graph: one node per pipeline event per instruction (Fetch F, Dispatch D, Execute E, Complete P, Commit C)
• Edges capture in-order dependencies and latencies: fetch-dispatch latency, dispatch-before (in-order fetch/dispatch/commit), execution latency, complete-before
[Figure: F→D→E→P→C chains for Inst. 1–4, with 0-weight in-order edges linking the F, D, and C nodes of consecutive instructions]
67
Background: μArch Dependence Graphs (cont.)
[Figure: the same F/D/E/P/C graph for the four instructions below, annotated with edge weights; highlighted edges show a branch misprediction penalty, data dependences (e.g., the load feeding the multiply), execution latencies, and resource dependences; the critical path (total execution time) is the longest weighted path]
bgz r1;  r2 = ld[r3];  r4 = r2 * 2;  r5 = r6 *
68
• Reduced complexity model with only
• D: Dispatch
• E: Execute
• C: Commit

69
Claim: Only critical L2 Hits matter

70
Intuition: Latency matters much more for L1 cache! 71
Intuition: L2 cache least sensitive to non-critical access latency 72
Criticality Aware Tiered Cache Hierarchy
(CATCH)

• Track critical load PCs


• Served from non-L1 on-die
caches

• Prefetch critical loads into


L1
• Accelerate the critical path

73
Criticality detection in hardware

(graph is backwards now …

74
TACT: Data Prefetchers

75
76
Backups

Slide History/Attribution Diagram:


UW Madison UPenn UW Madison
UCLA
Hill, Sohi, Amir Roth, Hill, Sohi, Wood,
Nowatzki
Smith, Wood Milo Martin Various Universities Sankaralingam, Sinclair
Asanovic, Falsafi, Hoe, Lipasti,
Shen, Smith, Vijaykumar
Safe and Efficient Services
• Scenario: module (application) A wants service B provides
• A doesn’t “trust” B and vice versa (e.g., B is kernel)
• How is service provided?
• Option I: conventional call in same address space
+ Can easily pass data back and forth (pass pointers)
– Untrusted module can corrupt your data
• Option II: trap or cross address space call
– Copy data across address spaces: slow, hard if data uses pointers
+ Data is not vulnerable
• Page-level protection helps somewhat, but …
• Page-level protection can be too coarse grained
• If modules share address space, both can change protections

78
Itanium Prevalidated tags
[Figure: the VA is looked up in the TLB and the I$ in parallel]
• I$ tag is a bit vector (one bit per TLB entry), not an address tag
• a cache hit requires a match against the hitting TLB location
• TLB miss → I$ miss
• TLB size = tag size (32 entries / 32 bits in Itanium 2)

79
RAM

• RAM: large storage arrays
• Basic structure
• MxN array of bits (M N-bit words); the example array is 4x2
• Bits in a word connected by a wordline
• Bits in a position connected by a bitline
• Operation
• Address decodes into M wordlines
• High wordline → word driven onto bitlines
• Bit/bitline connection → read/write
• Access latency: ~ #ports * √#bits
[Figure: 4x2 bit array; the address decoder drives wordline0–wordline3, and bitline0/bitline1 carry the data]

80
SRAM

• SRAM: static RAM
• Bits stored as cross-coupled inverters (CCI)
– Four transistors per bit
– More transistors for ports
• "Static" means
• Inverters connected to pwr/gnd
+ Bits naturally/continuously "refreshed"
• Designed for speed
[Figure: SRAM array; each cell is a pair of cross-coupled inverters selected by the address decoder and read/written over the bitlines]

81
DRAM
• DRAM: dynamic RAM
• Bits stored as capacitors
+ Single transistors as ports
+ One transistor per bit/port
• "Dynamic" means
• Capacitors not connected to pwr/gnd
– Stored charge decays over time
– Must be explicitly refreshed
• Designed for density
[Figure: DRAM array; each cell is a capacitor behind a single pass transistor, selected by the address decoder and read over the bitlines]

82
Moore’s Law
Year Capacity $/MB Access time
1980 64Kb $1500 250ns
1988 4Mb $50 120ns
1996 64Mb $10 60ns
2004 1Gb $0.5 35ns

• Commodity DRAM parameters


• 16X every 8 years is 2X every 2 years
• Not quite 2X every 18 months but still close

83
DRAM Operation I
• Read: similar to cache read
• Phase I: pre-charge bitlines to 0.5V
• Phase II: decode address, enable wordline
• Capacitor swings bitline voltage up (down)
• Sense-amplifier interprets swing as 1 (0)
– Destructive read: word bits now discharged
• Write: similar to cache write
• Phase I: decode address, enable wordline
• Phase II: enable bitlines
• High bitlines charge corresponding capacitors
– What about leakage over time?
[Figure: DRAM array with sense amps (sa) on the bitlines; a write enable drives the bitlines with the data]


84
DRAM Operation II
• Solution: add a set of D-latches (row buffer)
• Read: two steps
• Step I: read selected word into row buffer
• Step IIA: read row buffer out to pins
• Step IIB: write row buffer back to selected word
+ Solves "destructive read" problem
• Write: two steps
• Step IA: read selected word into row buffer
• Step IB: write data into row buffer
• Step II: write row buffer back to selected word
[Figure: DRAM array, sense amps, and a row of D-latches (DL) between the array and the data pins]

85
DRAM Refresh
• DRAM periodically refreshes all contents
• Loops through all words
• Reads word into row buffer
• Writes row buffer back into DRAM array
• 1–2% of DRAM time occupied by refresh
[Figure: the same array / sense-amp / row-buffer structure is reused for refresh]

86
DRAM Parameters
• DRAM parameters
• Large capacity: e.g., 64–256Mb
• Arranged as a square
+ Minimizes wire length
+ Maximizes refresh efficiency
• Narrow data interface: 1–16 bit
• Cheap packages → few bus pins
• Narrow address interface: N/2 bits
• A 16Mb DRAM has a 12-bit address bus
• How does that work?
[Figure: DRAM bit array with an address input, a row buffer, and a narrow data interface]

87
Two-Level Addressing
• Two-level addressing
• Row decoder and column muxes share the address lines ([23:12] and [11:2])
• Two strobes (RAS, CAS) signal which part of the address is currently on the bus
• Asynchronous access
• Level 1: RAS high
• Upper address bits on address bus
• Read row into row buffer
• Level 2: CAS high
• Lower address bits on address bus
• Mux row buffer onto data bus
[Figure: a 12-to-4K row decoder selects one row of a 4K x 4K bit array into the row buffer; four 1K-to-1 column muxes drive the data pins; RAS latches the row address, CAS the column address]

88
Access Latency and Cycle Time
• DRAM access much slower than SRAM
• More bits → longer wires
• Buffered access with two-level addressing
• SRAM access latency: <1ns
• DRAM access latency: 30–50ns

• DRAM cycle time also longer than access time


• Cycle time: time between start of consecutive accesses
• SRAM: cycle time = access time
• Begin second access as soon as first access finishes
• DRAM: cycle time = 2 * access time
• Why? Can’t begin new access while DRAM is refreshing row

89
DRAM Latency and Power Derivations
• Same basic form as SRAM
• Most of the equations are geometrically derived
• Same structure for decoders, wordlines, muxes
• Some differences
• Somewhat different pre-charge/sensing scheme
• Array access represents smaller part of total access
• Arrays not multi-ported

90
Building a Memory System
• How to build an efficient main memory out of standard DRAM chips?
• How many DRAM chips?
• What width/speed (data) bus to use?
• Assume a separate address bus
• Main memory interface: L2 miss blocks
• What do you want tmiss-L2 to be?
[Figure: CPU → I$/D$ → L2 → main memory → disk (swap)]

91
An Example Memory System
• Parameters
• 32-bit machine
• L2 with 32B blocks
• 4Mx16b DRAMs, 20ns access time, 40ns cycle time
• 100MHz (10ns period) data bus
• 100MHz, 32-bit address bus

• How many DRAM chips?


• How wide to make the data bus?

92
First Memory System Design
• 1 DRAM chip (4M x 2B) + 16b data bus
• Access time: 630ns (not including address)
• Cycle time: 640ns (when the DRAM is ready to handle another miss)

T (ns)   DRAM       Data bus (2B)
10       [31:30]
20       [31:30]
30       refresh    [31:30]
40       refresh
50       [29:28]
60       [29:28]
70       refresh    [29:28]
80       refresh
…        …          …
600      refresh
610      [1:0]
620      [1:0]
630      refresh    [1:0]
640      refresh

93
Second Memory System Design
• 1 DRAM chip + 4b data bus
• With one DRAM chip, we don't need a 16b bus
• Balanced system → match bandwidths
• DRAM: 2B / 40ns → 4b / 10ns
• Access time: 660ns (30ns longer, 4%)
• Cycle time: 640ns (same)
+ Much cheaper

T (ns)   DRAM       Bus (4b)
10       [31:30]
20       [31:30]
30       refresh    [31H]
40       refresh    [31L]
50       [29:28]    [30H]
60       [29:28]    [30L]
70       refresh    [29H]
80       refresh    [29L]
…        …          …
600      [1:0]      [2H]
610      [1:0]      [2L]
620      refresh    [1H]
640      refresh    [1L]
650                 [0H]
660                 [0L]

94
Third Memory System Design
• How fast can we go?
• 16 DRAM chips (4M x 2B each, chips 0–15) + 32B bus
• Stripe data across chips: byte M in chip (M/2)%16
• Access time: 30ns
• Cycle time: 40ns
– 32B bus is very expensive
– 128MB of memory isn't, but you may not want that much

T (ns)   DRAM0      DRAM1      …   DRAM15     Bus (32B)
10       [31:30]    [29:28]    …   [1:0]
20       [31:30]    [29:28]    …   [1:0]
30       refresh    refresh    …   refresh    [31:0]
40       refresh    refresh    …   refresh

95
Latency and Bandwidth
• In general, given bus parameters …
• Find smallest number of chips that minimizes cycle time
• Approach: match bandwidths

96
Fourth Memory System Design
• 2B bus, 4 DRAM chips (0–3)
• Bus b/w: 2B/10ns; DRAM b/w: 2B/40ns per chip → 4 chips match the bus
• Access time: 180ns
• Cycle time: 160ns

T (ns)   DRAM0      DRAM1      DRAM2      DRAM3      Bus (2B)
10       [31:30]    [29:28]    [27:26]    [25:24]
20       [31:30]    [29:28]    [27:26]    [25:24]
30       refresh    refresh    refresh    refresh    [31:30]
40       refresh    refresh    refresh    refresh    [29:28]
50       [23:22]    [21:20]    [19:18]    [17:16]    [27:26]
60       [23:22]    [21:20]    [19:18]    [17:16]    [25:24]
…        …          …          …          …          …
110      refresh    refresh    refresh    refresh    [15:14]
120      refresh    refresh    refresh    refresh    [13:12]
130      [7:6]      [5:4]      [3:2]      [1:0]      [11:10]
140      [7:6]      [5:4]      [3:2]      [1:0]      [9:8]
150      refresh    refresh    refresh    refresh    [7:6]
160      refresh    refresh    refresh    refresh    [5:4]
170                                                  [3:2]
180                                                  [1:0]

97
More Bandwidth From One DRAM
• EDO: extended data out
• Multiple row buffer reads/writes
• Send only column addresses

• SDRAM: synchronous DRAM


• Read/write row buffer chunks on clock edge
• No need to send column addresses at all
• DDR SDRAM: double-data rate SDRAM
• Read/write on both clock edges
• Popular these days

• RDRAM: aka RAMBUS


• Multiple row buffers, “split” transactions, other complex behaviors
• Very expensive, high end systems only

98
Memory Access and Clock Frequency
• Nominal clock frequency applies to CPU and caches
• Memory bus has its own clock, typically much slower
• DRAM has no clock (SDRAM operates on bus clock)

• Careful when doing calculations


• Clock frequency increases don’t reduce memory or bus latency
• May make misses come out faster
• At some point memory bandwidth may become a bottleneck
• Further increases in clock speed won’t help at all

99
Memory/Clock Frequency Example
• Parameters
• 1GHz CPU, base CPI = 1
• I$: 1% miss rate, 32B blocks (ignore D$, L2)
• Data bus: 100MHz, 8B (ignore address bus)
• DRAM: 10ns access, 20ns cycle, #chips to match bus bandwidth

• What are CPI and MIPS including memory latency?


• Bus: frequency = 100MHz → latency = 10ns (for 8B)
• Memory system cycle time = bus latency to transfer 32B = 40ns
• Memory system access time = 50ns (10ns DRAM access + bus)
• 1GHz clock → 50ns = 50 cycles
• CPI+memory = 1 + (0.01*50) = 1 + 0.5 = 1.5
• MIPS+memory = 1GHz / 1.5 CPI = 1000MHz / 1.5 CPI = 667

100
Memory/Clock Frequency Example
• What are CPI and MIPS if clock speed is doubled?
• Memory parameters same: 50ns access, 40ns cycle
• 2GHz clock → 50ns = 100 cycles
• CPI+memory = 1 + (0.01*100) = 1 + 1 = 2
• MIPS+memory = 2GHz / 2 CPI = 2000MHz / 2 CPI = 1000

• What is the peak MIPS if we can only change clock?


• Available bandwidth: 32B/40ns = 0.8B/ns
• Needed bandwidth: 0.01*32B/cycle = 0.32B/cycle * X cycle/ns
• Memory is a bottleneck at 0.8/0.32 cycle/ns = 2.5GHz
• No sustained speedup possible after that point
• 2.5GHz clock → 50ns = 125 cycles
• CPI+memory = 1 + (0.01*125) = 1 + 1.25 = 2.25
• MIPS+memory = 2.5GHz / 2.25 CPI = 2500MHz / 2.25 CPI = 1111
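The arithmetic from these two slides, reproduced as a quick sanity check (parameters copied from above):

#include <stdio.h>

int main(void) {
    double miss_rate = 0.01, mem_ns = 50.0, base_cpi = 1.0;
    double clk_ghz[] = {1.0, 2.0, 2.5};
    for (int i = 0; i < 3; i++) {
        double miss_cycles = mem_ns * clk_ghz[i];        /* 50ns in cycles */
        double cpi  = base_cpi + miss_rate * miss_cycles;
        double mips = clk_ghz[i] * 1000.0 / cpi;
        printf("%.1fGHz: CPI = %.2f, MIPS = %.0f\n", clk_ghz[i], cpi, mips);
    }
    return 0;   /* prints 1.5/667, 2.0/1000, 2.25/1111 */
}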

101
Digital Rights Management
• Digital rights management
• Question: how to enforce digital copyright?
• Electronically, not legally
• “Trying to make bits un-copiable is like trying to make water un-wet”

• Suppose you have some piece of copyrighted material ©…


– You can easily make a copy of ©
• But, what if © is encrypted?
• In order to use ©, you must also have the decryptor
– Can hack decryptor to spit out unencrypted ©
– Or hack OS to look at decryptor’s physical memory

102
Aside: Public-Key Cryptography
• Public-key cryptography
• Asymmetric: pair of keys
• Kpub : used for encryption, published
• Kpriv : used for decryption, secret
• acrypt(acrypt(M, Kpub), Kpriv) = acrypt(acrypt(M, Kpriv), Kpub) = M
• Well-known example: RSA
• Two uses
• Encryption
• Someone sends you encrypted message M: C = acrypt(M, Kpub)
• You are the only one that can decrypt it
• Authentication/Digital Signature
• You send someone a chosen plaintext M
• They “sign” it by sending back DS = acrypt(M, Kpriv)
• If acrypt(DS, Kpub) = M, then they are who Kpub says they are

103
Research: XOM
[Figure: the vendor encrypts © with the chip's public key Kpub; the encrypted © passes through memory and the L2 fill/spill path and is decrypted on-chip with Kpriv, which never leaves the CPU]
• eXecute Only Memory (XOM)
• Stanford research project [Lie+, ASPLOS’00]
• Two registers: Kpriv, Kpub different for every chip (Flash program)
• Software can get at Kpub, but Kpriv is hardware’s secret
• Hardware encryption/decryption engine on L2 fill/spill path
• Vendor sells you acrypt(©, Kpub)
+ Even if someone copies it, they won’t have Kpriv to decrypt it
• Plaintext © only exists on-chip
+ Even OS can never see plaintext ©

104
XOM: Not Quite
• Performance consideration
• Asymmetric en-/de-cryption is slow, symmetric (one key) faster
• E.g., DES, AES (Rijndael)
– Problem: can’t publish encryption key without also...
• XOM Take II
• Vendor chooses random symmetric key Ksym
• Sells you scrypt(©, Ksym) + acrypt(Ksym, Kpub)
• Two-stage decryption
• Decrypt Ksym using Kpriv : slow (but for one piece of data)
• Decrypt © using Ksym : fast
• Note: SSL does the same thing
• Uses asymmetric cryptography to choose symmetric session key

105
Error Detection: Parity
• Parity: simplest scheme
• f(dataN–1…0) = XOR(dataN–1, …, data1, data0)
+ Single-error detect: detects a single bit flip (common case)
• Will miss two simultaneous bit flips …
• But what are the odds of that happening?
– Zero-error correct: no way to tell which bit flipped
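A one-function sketch of even parity over a data word (word width is arbitrary here):

#include <stdint.h>

/* XOR of all data bits; stored alongside the word. A recomputed parity that
   differs from the stored one flags a single-bit flip, but cannot locate it. */
unsigned parity(uint64_t data) {
    unsigned p = 0;
    while (data) { p ^= (unsigned)(data & 1); data >>= 1; }
    return p;
}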

106
Error Correction: Hamming Codes
• Hamming Code
• H(A,B) = number of 1’s in A^B (number of bits that differ)
• Called “Hamming distance”
• Use D data bits + C check bits to construct a set of "codewords"
• Check bits are parities on different subsets of data bits
• For all codewords A, B: H(A,B) ≥ α
• No combination of α–1 bit flips transforms one codeword into another
• For simple parity: α = 2
• Errors of δ bits (or fewer) can be detected if α = δ + 1
• Errors of β bits or fewer can be corrected if α = 2β + 1
• Errors of δ bits can be detected and errors of β bits can be
corrected if α = β + δ + 1

107
SEC Hamming Code
• SEC: single-error correct
• C = log2D + 1
+ Relative overhead decreases as D grows

• Example: D = 4 → C = 3
• d1 d2 d3 d4 c1 c2 c3 → c1 c2 d1 c3 d2 d3 d4
• c1 = d1 ^ d2 ^ d4 , c2 = d1 ^ d3 ^ d4 , c3 = d2 ^ d3 ^ d4
• Syndrome: ci ^ c’i = 0 ? no error : points to flipped-bit
• Working example
• Original data = 0110 → c1 = 1, c2 = 1, c3 = 0
• Flip d2 = 0010 → c’1 = 0, c’2 = 1, c’3 = 1
• Syndrome = 101 (binary 5) → 5th bit? D2
• Flip c2 → c’1 = 1, c’2 = 0, c’3 = 0
• Syndrome = 010 (binary 2) → 2nd bit? c2
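A sketch of the SEC Hamming(7,4) code above; packing the codeword into an unsigned is my own convention, but the layout (positions 1..7 hold c1 c2 d1 c3 d2 d3 d4) and check-bit equations follow the slide.

unsigned hamming74_encode(unsigned d) {   /* d1..d4 in bits 0..3 of d */
    unsigned d1 = d & 1, d2 = (d >> 1) & 1, d3 = (d >> 2) & 1, d4 = (d >> 3) & 1;
    unsigned c1 = d1 ^ d2 ^ d4, c2 = d1 ^ d3 ^ d4, c3 = d2 ^ d3 ^ d4;
    /* pack positions 1..7 into bits 0..6 */
    return c1 | (c2 << 1) | (d1 << 2) | (c3 << 3) | (d2 << 4) | (d3 << 5) | (d4 << 6);
}

/* Syndrome: XOR together the positions of all 1 bits. A valid codeword XORs
   to 0; a single flipped bit makes the result equal to its position. */
unsigned hamming74_syndrome(unsigned cw) {   /* 7-bit codeword, possibly corrupted */
    unsigned s = 0;
    for (unsigned pos = 1; pos <= 7; pos++)
        if ((cw >> (pos - 1)) & 1) s ^= pos;
    return s;
}
/* Example: encode data 0110 (d1=0, d2=1, d3=1, d4=0), flip bit position 5 (d2),
   and hamming74_syndrome() returns 5, matching the worked example above. */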

108
SECDED Hamming Code
• SECDED: single error correct, double error detect
• C = log2D + 2
• Additional parity bit to detect additional error
• Example: D = 4 → C = 4
• d1 d2 d3 d4 c1 c2 c3 → c1 c2 d1 c3 d2 d3 d4 c4
• c4 = c1 ^ c2 ^ d1 ^ c3 ^ d2 ^ d3 ^ d4
• Syndrome == 0 and c’4 == c4 → no error
• Syndrome != 0 and c’4 != c4 → 1-bit error
• Syndrome != 0 and c’4 == c4 → 2-bit error
• Syndrome == 0 and c’4 != c4 → c4 error
• Many machines today use 64-bit SECDED code
• C = 8 (one additional byte, 12% overhead)
• ChipKill - correct any aligned 4-bit error
• If an entire DRAM chips dies, the system still works!

109
Bonus

110
111
This Unit: Main Memory
• Memory hierarchy review
• Virtual memory
• Address translation and page tables
• Virtual memory's impact on caches
• Page-based protection
• Organizing a memory system
• Bandwidth matching
• Error correction
[Figure: system stack from Application, OS, Compiler, and Firmware down through CPU/Memory/I/O, Digital Circuits, and Gates & Transistors]

112
Static Random Access Memory
- Read sequence
  1. address decode
  2. drive row select
  3. selected bit-cells drive bitlines
  4. differential sensing and column select
  5. precharge all bitlines
- Access latency dominated by steps 2 and 3
- Cycle time dominated by steps 2, 3 and 5
- Step 2 proportional to 2^m
- Steps 3 and 5 proportional to 2^n
- Usually encapsulated by synchronous (sometimes pipelined) interface logic
[Figure: 2^n-row x 2^m-column bit-cell array (n ≈ m to minimize overall latency); row select drives a row, bitline/_bitline pairs feed 2^m differential sense amps and a column mux that outputs 1 bit]
113
Dynamic Random Access Memory
- Bits stored as charge on a node capacitance (non-restorative)
  - bit cell loses charge when read
  - bit cell loses charge over time
- Read sequence
  1~3 same as SRAM
  4. a "flip-flopping" sense amp amplifies and regenerates the bitline; the data bit is mux'ed out
  5. precharge all bitlines
- A DRAM controller must periodically, either distributed or in a burst, read all rows within the allowed refresh time (10s of ms)
- synchronous interfaces
- various hacks to allow faster repeated reads to the same row
[Figure: 2^n-row x 2^m-column bit-cell array (n ≈ m to minimize overall latency), addressed with RAS/CAS through sense amps and a mux; a DRAM die comprises multiple such arrays]
114
Brief History of DRAM
• DRAM (memory): a major force behind computer industry
• Modern DRAM came with introduction of IC (1970)
• Preceded by magnetic “core” memory (1950s)
• Each cell was a small magnetic “donut”
• And by mercury delay lines before that (ENIAC)
• Re-circulating vibrations in mercury tubes

“the one single development that put computers on their feet was the
invention of a reliable form of memory, namely the core memory … Its
cost was reasonable, it was reliable, and because it was reliable it
could in due course be made large”
Maurice Wilkes
Memoirs of a Computer Pioneer, 1985

115
DRAM Basics [Jacob and Wang]
• Precharge and Row Access

116
DRAM Basics, cont.
• Column Access

117
DRAM Basics, cont.
• Data Transfer

118
Open v. Closed Pages
• Open Page
• Row stays active until another row needs to be accessed
• Acts as memory-level cache to reduce latency
• Variable access latency complicates memory controller
• Higher power dissipation (sense amps remain active)
• Closed Page
• Immediately deactivate row after access
• All accesses become Activate Row, Read/Write, Precharge

• Complex power v. performance trade off

119
DRAM Bandwidth
• Use multiple DRAM chips to increase bandwidth
• Recall, accesses are the same size as a second-level cache block
• Example, 16 2-byte wide chips for 32B access

• DRAM density increasing faster than demand


• Result: number of memory chips per system decreasing

• Need to increase the bandwidth per chip


• Especially important in game consoles
• SDRAM → DDR → DDR2 → FBDIMM (→ DDR3)
• Rambus - high-bandwidth memory
• Used by several game consoles

120
DRAM Evolution
• Survey by Cuppu et al.
1. Early Asynchronous Interface
2. Fast Page Mode/Nibble Mode/Static Column (skip)
3. Extended Data Out
4. Synchronous DRAM & Double Data Rate
5. Rambus & Direct Rambus
6. FB-DIMM

121
Old 64MbitDRAM Example from Micron
[Figure: Micron 64Mb SDRAM block diagram, including clock recovery and the multiplexed ROW ADDRESS / COLUMN ADDRESS paths]
122
Extended Data Out (EDO)
[Figure: timing diagram: RAS' asserts once with the row address; CAS' pulses repeatedly, each with a new column address, and each data word appears overlapped with the next column address assert]

• Similar to Fast Page Mode
• But overlaps the column address assert with Data Out
123
Synchronous DRAM (SDRAM)

[Figure: timing diagram: with a clock, a single RAS'/CAS' pair with row and column addresses yields multiple data transfers on successive clock edges]

• Adds a clock and wider data!
• Also multiple transfers per RAS/CAS
124
Enhanced SDRAM & DDR
• Evolutionary Enhancements on SDRAM:
1. ESDRAM (Enhanced): Overlap row buffer access with
refresh

2. DDR (Double Data Rate): Transfer on both clock edges


3. DDR2’s small improvements
lower voltage, on-chip termination, driver calibration
prefetching, conflict buffering
4. DDR3, more small improvements
lower voltage, 2X speed, 2X prefetching,
2X banks, “fly-by topology”, automatic calibration

125
FB-DIMM

126
Wide v. Narrow Interfaces
• High frequency → short wavelength → data skew issues
• Balance wire lengths
[Figure: DDR-2 serpentine board routing vs. FB-DIMM board routing]

127
Rambus RDRAM
• High-frequency, narrow channel
• Time-multiplexed "bus" → dynamic point-to-point channels
• ~40 pins → 1.6GB/s
• Proprietary solution
• Never gained industry-wide acceptance (cost and power)
• Used in some game consoles (e.g., PS2)
[Figure: CPU/memory controller connected to a chain of RDRAM chips over a 16-bit data bus @ 800 MHz, with from_clock/to_clock signals]

128
DRAM Reliability
• One last thing about DRAM technology … errors
• DRAM bits can flip from 0→1 or 1→0
• Small charge stored per bit
• Energetic α-particle strikes disrupt stored charge
• Many more bits
• Modern DRAM systems: built-in error detection/correction
• Today all servers; most new desktop and laptops
• Key idea: checksum-style redundancy
• Main DRAM chips store data, additional chips store f(data)
• |f(data)| < |data|
• On read: re-compute f(data), compare with stored f(data)
• Different ? Error …
• Option I (detect): kill program
• Option II (correct): enough information to fix error? fix and go on

129
DRAM Error Detection and Correction
[Figure: the memory controller spreads data across DRAM chips 0–3 (4M x 2B each) plus an additional chip that stores f(data) for error detection/correction]

• Performed by memory controller (not the DRAM chip)


• Error detection/correction schemes distinguished by …
• How many (simultaneous) errors they can detect
• How many (simultaneous) errors they can correct

130
Interleaved Main Memory
• Divide memory into M banks and “interleave” addresses
across them, so word A is
• in bank (A mod M)
• at word (A div M)
Bank 0 Bank 1 Bank 2 Bank n
word 0 word 1 word 2 word n-1
word n word n+1 word n+2 word 2n-1
word 2n word 2n+1 word 2n+2 word 3n-1

PA fields: [doubleword in bank | bank | word in doubleword]

• Interleaved memory increases memory BW without a wider bus
• Use parallelism in memory banks to hide memory latency
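The mapping above, written out (M = 4 banks is chosen just for the example):

#define M 4   /* number of banks */

/* word A lives in bank (A mod M), at word (A div M) within that bank */
unsigned bank_of(unsigned A)      { return A % M; }
unsigned word_in_bank(unsigned A) { return A / M; }

/* consecutive words land in different banks, so a block fill can be
   overlapped across all M banks to hide DRAM latency */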
Copyright © 2002 Falsafi, from Hill,
Smith, Sohi, Vijaykumar, and Wood 131
Block interleaved memory systems
• Cache blocks map to separate memory controllers
• Interleave across DRAMs w/i a MC
• Interleave across intra-DRAM banks w/i a DRAM

[Figure: CPU connected over the data bus to four memory controllers (MCs), each with its own DRAMs; consecutive cache blocks B, B+64, B+128, B+192 map to different MCs]

132
Research: Processing in Memory
• Processing in memory
• Embed some ALUs in DRAM
• Picture is logical, not physical
• Do computation in DRAM rather than …
• Move data from DRAM to CPU, compute on CPU, move data from CPU back to DRAM
• Will come back to this in "vectors" unit
• E.g., IRAM: intelligent RAM
• Berkeley research project [Patterson+, ISCA'97]
[Figure: DRAM bit array with ALUs attached to the row buffer]
133
Memory Hierarchy Review
• Storage: registers, memory, disk
• Memory is the fundamental element

• Memory component performance


• tavg = thit + %miss * tmiss
• Can’t get both low thit and %miss in a single structure

• Memory hierarchy
• Upper components: small, fast, expensive
• Lower components: big, slow, cheap
• tavg of hierarchy is close to thit of upper (fastest) component
• 10/90 rule: 90% of stuff found in fastest component
• Temporal/spatial locality: automatic up-down data movement

134
Software Managed Memory
• Isn’t full associativity difficult to implement?
• Yes … in hardware
• Implement fully associative memory in software

• Let’s take a step back…

135
