Memory Management
Fred Kuhns
Washington University in St. Louis
UNIX Memory Management
UNIX uses a demand-paged (rather than anticipatory-paging) virtual memory architecture. Memory is managed at the page level:
page frame == physical page; page == virtual page
Basic memory allocation is the responsibility of the page-level allocator.
Page-level Allocation
Kernel maintains a list of free physical pages. Since kernel and user-space programs use virtual memory addresses, the physical location of a page is not important. Pages are allocated from the free list. Two principal clients:
the paging system
the kernel memory allocator
Memory allocation
[Diagram: the page-level allocator hands physical pages to two clients: the kernel memory allocator (network buffers, data structures, temporary storage) and the paging system (process address spaces, buffer cache).]
Kernel Memory management
Memory organization: the kernel is either permanently mapped into the same address range of all processes or given a separate address space
up to 1/3 of kernel time may be spent copying data between user and kernel space
user space cannot read or write kernel space directly
the KMA uses a preallocated pool of physical pages
Kernel Memory Allocation
The typical request is for less than one page. Originally, the kernel used statically allocated, fixed-size tables - what is limiting about this? The kernel requires a general-purpose allocator for both large and small chunks of memory; it handles memory requests from kernel modules, not user-level applications:
e.g., the pathname translation routine, STREAMS or I/O buffers, zombie structures, table entries (proc structures, etc.)
Requirements of the KMA
Minimize Waste
utilization factor = requested memory / required memory
a useful metric that factors in fragmentation; 50% is considered good
The KMA must be fast, since it is used extensively.
Simple API, similar to malloc and free
it is desirable to be able to free portions of allocated space; this differs from the typical user-space malloc/free interface
Properly aligned allocations: for example, 4-byte alignment
Support cyclical and bursty usage patterns
Interaction with the paging system: able to borrow pages from the paging system if running low
Example KMAs
Resource Map Allocator
Simple Power-of-Two Free Lists
The McKusick-Karels Allocator
The Buddy System
SVR4 Lazy Buddy Allocator
Mach-OSF/1 Zone Allocator
Solaris Slab Allocator
Resource Map Allocator
A resource map is a set of <base, size> pairs that monitor areas of free memory. Initially, the pool is described by a single map entry, <pool_starting_address, pool_size>. Allocations cause the pool to fragment, with one map entry for each contiguous free region. Entries are sorted in order of increasing base address. Requests are satisfied using one of three policies:
First fit - allocates from the first free region with sufficient space. Used by UNIX; fastest, but fragmentation is a concern.
Best fit - allocates from the smallest region that satisfies the request. May leave several regions that are too small to be useful.
Worst fit - allocates from the largest region unless a perfect fit is found. The goal is to leave behind larger regions after allocation.
Resource Map Allocator Example
supports operations: offset_t rmalloc(size), void rmfree(base, size)
initial map: <0,1024>
after rmalloc(256), rmalloc(320), rmfree(256,128): <256,128> <576,448>
after rmfree(128,128): <128,256> <576,448>
a fragmented map: <128,32> <288,64> <544,128> <832,32>
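The rmalloc()/rmfree() interface above can be sketched as a first-fit search over a sorted array of <base, size> entries. This is a minimal illustration under assumed names (the map array, MAPSIZE, and the overflow handling are not the SVR4 code):

    #define MAPSIZE 64                          /* assumed fixed capacity of the map */

    struct mapent { unsigned long base, size; };           /* one <base,size> pair */
    static struct mapent map[MAPSIZE] = { { 0, 1024 } };   /* the <0,1024> pool above */
    static int nent = 1;

    /* First fit: scan entries in address order and carve the request out of the
     * first region that is large enough. Returns 0 on failure (real resource maps
     * avoid handing out base 0 so that 0 can signal failure). */
    unsigned long rmalloc(unsigned long size)
    {
        for (int i = 0; i < nent; i++) {
            if (map[i].size >= size) {
                unsigned long base = map[i].base;
                map[i].base += size;
                map[i].size -= size;
                if (map[i].size == 0) {                    /* drop the exhausted entry */
                    for (int j = i; j < nent - 1; j++)
                        map[j] = map[j + 1];
                    nent--;
                }
                return base;
            }
        }
        return 0;
    }

    /* Insert a freed region at its sorted position, coalescing with the previous
     * and/or next entry when the regions are adjacent. */
    void rmfree(unsigned long base, unsigned long size)
    {
        int i = 0;
        while (i < nent && map[i].base < base)
            i++;
        int prev = (i > 0    && map[i - 1].base + map[i - 1].size == base);
        int next = (i < nent && base + size == map[i].base);
        if (prev && next) {                                /* freed region bridges two free regions */
            map[i - 1].size += size + map[i].size;
            for (int j = i; j < nent - 1; j++)
                map[j] = map[j + 1];
            nent--;
        } else if (prev) {
            map[i - 1].size += size;
        } else if (next) {
            map[i].base = base;
            map[i].size += size;
        } else {
            if (nent == MAPSIZE)                           /* map overflow: see the disadvantages slide */
                return;
            for (int j = nent; j > i; j--)
                map[j] = map[j - 1];
            map[i].base = base;
            map[i].size = size;
            nent++;
        }
    }

Note that every rmalloc() is a linear scan and every rmfree() may shift map entries; this is exactly the cost discussed on the disadvantages slide.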
Resource Map - Good/Bad
Advantages:
simple, easy to implement
not restricted to memory allocation: works for any collection of objects that are sequentially ordered and must be allocated and freed in contiguous chunks
can allocate the exact size requested (within any alignment restrictions), so there is no internal fragmentation
the client may release a portion of the allocated memory
adjacent free regions are coalesced
Resource Map - Good/Bad
Disadvantages:
The map may become highly fragmented, resulting in low utilization; poor for large requests.
The resource map size increases with fragmentation:
a static table will overflow
a dynamic table needs its own allocator
The map must be kept sorted for free-region coalescing; sorting operations are expensive.
Requires a linear search of the map to find a free region that matches the allocation request.
Difficult to return borrowed pages to the paging system.
Resource map final word
Poor performance is why the resource map is considered unsuitable as a general-purpose kernel memory allocator.
[Diagram: a resource map fragmented into many small <base,size> entries.]
Simple Power of Twos
Has been used to implement malloc() and free() in the user-level C library (libc). Uses a set of free lists, with each list storing buffers of a particular size; buffer sizes are powers of two. Each buffer has a one-word header:
when free, the header stores a pointer to the next element on the free list
when allocated, the header stores a pointer to the associated free list (where the buffer is returned when freed); alternatively, the header may contain the size of the buffer
Simple Power of Two Free List
Example free list: one-word header per buffer (a pointer)
malloc(X): size = roundup(X + sizeof(header))
roundup(Y) = 2^n, where 2^(n-1) < Y <= 2^n
free(buf) must free the entire buffer.
freelistarr[] holds one free list per buffer size: 32, 64, 128, 256, 512, 1024
Simple Power of Twos - Advantages
Simple and reasonably fast; eliminates the linear search and the fragmentation problem of the resource map.
Bounded time for allocations when buffers are available
Familiar API. Simple to share buffers between kernel modules, since freeing a buffer does not require knowing its size.
Simple Power of Two - Disadvantages
Rounding requests up to a power of two results in wasted memory and poor utilization:
aggravated by the buffer header, since it is not unusual for memory requests to already be a power of two (the header then pushes them into the next larger size)
no provision for coalescing free buffers, since buffer sizes are generally fixed
no provision for borrowing pages from the paging system (although some implementations do this)
no provision for returning unused buffers to the page-level allocator
Simple Power of Two
void *malloc(size_t size)
{
    int ndx = 0;                        /* free list index */
    int bufsize = 1 << MINPOWER;        /* size of smallest buffer */

    size += 4;                          /* add room for the one-word header */
    assert(size <= MAXBUFSIZE);
    while (bufsize < size) {            /* round up to the next power of two */
        ndx++;
        bufsize <<= 1;
    }
    /* ndx is the index on the freelist array from which a buffer
     * will be allocated */
}
McKusick-Karels Allocator
An improved power-of-twos implementation. Managed memory must consist of contiguous pages, and all buffers within a page must be of equal size. Adds a page usage array, kmemsizes[], to manage pages, so buffers do not need headers recording their size. When freeing memory, free(buf) simply masks off the low-order bits of the address to get the page (actually the page offset, pg), which is used as an index into the kmemsizes[] array.
McKusick-Karels Allocator
Used in several UNIX variants. Adds a page usage array (kmemsizes[]) in which a page (pg) is in one of three states:
free: kmemsizes[pg] contains a pointer to the next free page
divided into buffers: kmemsizes[pg] == buffer size
part of a buffer spanning multiple pages: kmemsizes[first_pg] = buffer size, where first_pg is the first page of the buffer
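A rough sketch of how free() recovers the buffer size without a header, for the simple single-page-buffer case only; the arena base (kmembase), the 4 KB page size, and the flat types of kmemsizes[] and freelistarr[] are assumptions, not the 4.3BSD declarations:

    #define PAGESHIFT   12                  /* assumed 4 KB pages */
    #define MINBUFSIZE  32                  /* smallest buffer size (free list index 0) */
    extern unsigned long kmembase;          /* assumed base address of the managed arena */
    extern unsigned long kmemsizes[];       /* per-page buffer size (or free-page link) */
    extern void *freelistarr[];             /* one free list per power-of-two size */

    void free(void *buf)
    {
        /* mask off the low-order bits: the page offset pg indexes kmemsizes[] */
        unsigned long pg = ((unsigned long)buf - kmembase) >> PAGESHIFT;
        unsigned long size = kmemsizes[pg]; /* every buffer in this page has this size */

        int ndx = 0;                        /* map the size to its free-list index */
        for (unsigned long b = MINBUFSIZE; b < size; b <<= 1)
            ndx++;

        *(void **)buf = freelistarr[ndx];   /* first word of a free buffer = next pointer */
        freelistarr[ndx] = buf;
    }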
McKusick-Karels Allocator
Disadvantages:
similar drawbacks to the simple power-of-twos allocator
vulnerable to bursty usage patterns, since there is no provision for moving memory between lists
Advantages:
eliminates space wastage in the common case where the allocation request is already a power of two
optimizes the round-up computation and eliminates it entirely if the size is known at compile time
McKusick-Karels
[Diagram: the power-of-twos freelistarr[] (sizes 32-1024) alongside the kmemsizes[] map of managed pages, which records the size of the buffers allocated from each page or a pointer to the next free page.]
No buffer header
vulnerable to bursty usage
memory not returned to the paging system
McKusick-Karels Example Macros
#define NDX(size) \
    ((size) > 128 ? ((size) > 256 ? 4 : 3) \
                  : ((size) > 64 ? 2 : ((size) > 32 ? 1 : 0)))

#define MALLOC(space, cast, size, flags) \
{ \
    register struct freelisthdr *flh; \
    if ((size) <= 512 && (flh = freelistarr[NDX(size)]) != NULL) { \
        space = (cast)flh->next; \
        flh->next = *(caddr_t *)space; \
    } else \
        space = (cast)malloc(size, flags); \
}
Buddy System
An allocation scheme combining a power-of-two allocator with free buffer coalescing.
The binary buddy system is the simplest and most popular form; other variants split buffers into four, eight, or more pieces.
Approach: create small buffers by repeatedly halving a large buffer (buddy pairs) and coalescing adjacent free buffers when possible. Requests are rounded up to a power of two.
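The key property is that a block of size 2^k (at an offset relative to the start of the pool) has its buddy at the same offset with the 2^k bit flipped. A minimal coalescing sketch under that assumption; is_free() and remove_from_freelist() are assumed helpers standing in for the bitmap and per-size free lists shown on the following slides:

    /* The buddy of a power-of-two block is found by flipping the bit equal
     * to the block's size in the block's offset. */
    unsigned long buddy_of(unsigned long offset, unsigned long size)
    {
        return offset ^ size;
    }

    extern int  is_free(unsigned long offset, unsigned long size);              /* assumed: bitmap check */
    extern void remove_from_freelist(unsigned long offset, unsigned long size); /* assumed helper */

    /* On release: while the buddy is also free, merge the pair into a block of
     * twice the size and repeat with the parent block. Returns the offset of
     * the (possibly merged) block; *size is updated to its final size. */
    unsigned long coalesce(unsigned long offset, unsigned long *size,
                           unsigned long max_size)
    {
        while (*size < max_size) {
            unsigned long buddy = buddy_of(offset, *size);
            if (!is_free(buddy, *size))
                break;                          /* buddy in use: stop coalescing */
            remove_from_freelist(buddy, *size);
            if (buddy < offset)
                offset = buddy;                 /* merged block starts at the lower offset */
            *size <<= 1;                        /* parent block is twice as large */
        }
        return offset;                          /* caller puts <offset, *size> on its free list */
    }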
Buddy System, Example
Minimum allocation size = 32 bytes
Initial free memory size = 1024 bytes
Use a bitmap to monitor 32-byte chunks:
bit set if chunk is in use
bit clear if chunk is free
Maintain a free list for each possible buffer size:
power-of-two buffer sizes from 32 to 512: sizes = {32, 64, 128, 256, 512}
Initially one block = the entire buffer
Buddy System in Action
[Diagram: per-size free lists (32-512 bytes) and the 32-byte-chunk bitmap, with set bits marking in-use chunks, after the following sequence:]
allocate(256), allocate(128), allocate(64), allocate(128), release(C, 128), release(D, 64)
Buddy System
Advantages:
does a good job of coalescing adjacent free buffers
easy exchange of memory with the paging system:
can allocate a new page and split it as necessary
when coalescing yields a complete page, the page may be returned to the paging system
Disadvantage:
performance
recursive coalescing is expensive, with poor worst-case performance
back-to-back allocate and release operations alternate between splitting and coalescing the same memory
poor programming interface:
release needs both the buffer address and its size
the entire buffer must be released
SVR4 Lazy Buddy Algorithm
Addresses the main problem with the buddy system: poor performance due to repetitive coalescing and splitting of buffers. Under steady-state conditions, the number of in-use buffers of each size remains relatively constant, and coalescing offers no advantage. Coalescing is necessary only to deal with bursty conditions, where there are large fluctuations in memory demand.
SVR4 Lazy Buddy Algorithm
Coalescing delay: the time taken to either coalesce a single buffer with its buddy or determine that its buddy is not free. Coalescing is recursive and doesn't stop until a buffer is found that cannot be combined with its buddy. Each release operation results in at least one coalescing delay. Solution:
defer coalescing until it is necessary - results in poor worst-case performance
lazy coalescing - an intermediate solution
Lazy Coalescing
The release operation has two steps:
place the buffer on the free list, making it locally free
coalesce it with its buddy, making it globally free
Buffers are divided into classes.
Assume N buffers in a given class. N = A + L + G, where
A = number of active buffers, L = number of locally free buffers and G = number of globally free buffers
Buffer class states are defined by slack = N - 2L - G
lazy: buffer usage is in steady state, coalescing not necessary (slack >= 2)
reclaiming: borderline consumption, coalescing needed (slack == 1)
accelerated: non-steady-state consumption, must coalesce faster (slack == 0)
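An illustrative sketch of the per-class bookkeeping and state test (not the SVR4 code; the struct layout and names are assumptions):

    /* Per-class counters for one buffer size (hypothetical layout). */
    struct buf_class {
        int active;         /* A: buffers currently handed out */
        int local_free;     /* L: released but not coalesced (locally free) */
        int global_free;    /* G: coalesced with their buddies (globally free) */
    };

    enum coalesce_state { LAZY, RECLAIMING, ACCELERATED };

    /* slack = N - 2L - G with N = A + L + G, which simplifies to A - L. */
    static int slack(const struct buf_class *c)
    {
        int n = c->active + c->local_free + c->global_free;
        return n - 2 * c->local_free - c->global_free;
    }

    static enum coalesce_state class_state(const struct buf_class *c)
    {
        int s = slack(c);
        if (s >= 2)
            return LAZY;            /* steady state: skip coalescing on release */
        else if (s == 1)
            return RECLAIMING;      /* coalesce the buffer being released */
        else
            return ACCELERATED;     /* must coalesce faster to catch up */
    }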
Lazy Coalescing
Improvement over the basic buddy system:
in steady state, all lists are in the lazy state and no time is wasted splitting and coalescing
in the worst case, the algorithm limits coalescing to no more than two buffers (two coalescing delays)
shown to have an average latency 10% to 32% better than the simple buddy system
greater variance and poorer worst-case performance for the release routine
MACH Zone Allocator
Used in Mach and DEC UNIX. Fast memory allocation, with garbage collection in the background. Each class of object (proc structures, message headers, etc.) is assigned its own pool of free objects, called a zone:
a set of power-of-two zones serves anonymous objects
memory is allocated from the page-level allocator
a page can only hold objects from the same zone
a struct zone heads the free list of objects for each zone
zone structs are themselves allocated from a zone of zones
MACH Zone Allocator
Zone initialization: zinit(size, max, alloc, name)
size = object size in bytes
max = maximum size of the zone in bytes
alloc = amount of memory to add when the free list is empty
name = string describing the objects in the zone
A zone struct is allocated, and the parameters are stored in it. Active zone structures are kept on a linked list referenced by first_zone and last_zone; the first element on the list is the zone of zones. Allocations are very fast since object sizes are fixed.
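A hedged usage sketch of the zinit()/zalloc()/zfree() interface described on these slides (signatures simplified; the struct proc example, the size limits, and the wrapper function names are placeholders):

    #include <stddef.h>

    struct zone;                               /* opaque here; defined by the allocator */

    extern struct zone *zinit(size_t size, size_t max, size_t alloc, const char *name);
    extern void *zalloc(struct zone *z);
    extern void  zfree(struct zone *z, void *obj);

    struct proc { int pid; /* ... */ };        /* example object class (placeholder) */

    static struct zone *proc_zone;

    void proc_zone_setup(void)
    {
        /* one zone per object class: fixed-size proc structures, growing one
         * 4 KB chunk at a time up to a 1 MB cap (illustrative numbers only) */
        proc_zone = zinit(sizeof(struct proc), 1024 * 1024, 4096, "proc structures");
    }

    struct proc *proc_alloc(void)
    {
        return (struct proc *)zalloc(proc_zone);   /* constant-time pop from the zone's free list */
    }

    void proc_release(struct proc *p)
    {
        zfree(proc_zone, p);                       /* push back onto the zone's free list */
    }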
MACH Zone Allocator
Garbage collection happens in the background to free unused memory within zones:
frees pages that are no longer used
expensive: linear search, locks
keeps a zone page map referring to all pages allocated to zones; each map entry contains:
in_free_list = number of the page's objects on the free list (calculated during garbage collection)
alloc_count = total number of objects in the page
Zone Garbage Collection
Zone garbage collector: zone_gc() is invoked by the swapper each time it runs. It walks the zone list and, for each zone, makes two passes through the zone's free list:
first pass: increments in_free_list for each free object's page; if in_free_list equals alloc_count, the page can be freed
second pass: all objects in the pages being freed are removed from the zone's free list
after pass two, it calls kmem_free() to return the pages
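A simplified sketch of the two-pass scan for a single zone; the data-structure layout (free list threaded through the objects, a page_of() lookup into the zone page map) is assumed for illustration and is not the Mach source:

    struct free_obj  { struct free_obj *next; };       /* free list threaded through objects */
    struct zone_page { int in_free_list, alloc_count; };
    struct zone      { struct free_obj *free_list; };
    extern struct zone_page *page_of(void *obj);       /* zone page map lookup (assumed) */

    void zone_gc_one(struct zone *z)                   /* in_free_list counters assumed reset */
    {
        /* Pass 1: count how many of each page's objects sit on the free list;
         * a page is reclaimable when every object it holds is free. */
        for (struct free_obj *o = z->free_list; o != NULL; o = o->next)
            page_of(o)->in_free_list++;

        /* Pass 2: unlink objects that live in reclaimable pages. */
        struct free_obj **link = &z->free_list;
        while (*link != NULL) {
            struct zone_page *pg = page_of(*link);
            if (pg->in_free_list == pg->alloc_count)
                *link = (*link)->next;                 /* drop the object from the free list */
            else
                link = &(*link)->next;
        }

        /* Finally, each reclaimable page would be returned with kmem_free(). */
    }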
MACH Zone Allocator
Advantages
fast and efficient with a simple API
obj = (void *)zalloc (struct zone* z); void zfree (struct zone *z, void *obj);
objects are the exact size needed
garbage collection provides for memory reuse
Disadvantages
no provision for releasing part of an object
garbage collection efficiency is an issue:
slow, and must complete before other parts of the system can run
a complex and inefficient algorithm
since the zone allocator is used by the memory manager itself, the zone of zones is statically allocated
Mach Zones
[Diagram: the zone of zones, referenced by first_zone and last_zone, links the struct zone for each object class; each zone (e.g., the Object X zone and Object Y zone) heads a free list of its own objects.]
Solaris Slab Allocator
Better performance and memory utilization than the other approaches studied. Based on Mach's zone allocator, but the slab allocator is concerned with object state while the zone allocator is not. Focus on:
object reuse and hardware cache utilization
allocator footprint
Slab Allocator
Object Reuse (i.e. Object Caching!).
Allocator must perform 5 operations:
1) allocate memory, 2) initialize object, 3) use object, 4) destruct object, 5) free memory
HW Cache Utilization
Due to typical buffer address distributions, cache contention may occur, resulting in poor cache utilization; the slab allocator addresses this.
Footprint: the portion of the hardware cache and TLB that is overwritten by the allocator itself.
a large footprint can displace useful data in the cache and TLB
a small footprint is desirable
Design of Slab Allocator
cachep = kmem_cache_create(name, size, align, ctor, dtor);
[Diagram: the page-level allocator is the back end; per-object caches (vnode cache, proc cache, mbuf cache, msgb cache) are the front end, handing out objects in use by the kernel.]
Slab Allocator Design
Organized as a collection of caches, with mechanisms for adding and removing memory from a cache. An object cache is created with an object constructor (ctor) and destructor (dtor).
objp = kmem_cache_alloc(cachep, flags);
kmem_cache_free(cachep, objp);
Before releasing an object, the kernel must return it to its initial (constructed) state so that it can be handed out again without re-initialization.
A cache grows by allocating a slab: contiguous pages managed as a unit. kmem_cache_reap(cachep) - garbage collection.
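A usage sketch of the interface named above (signatures simplified; the msgb example object, its constructor and destructor, and the alignment value are assumptions):

    #include <stddef.h>

    struct kmem_cache;                         /* opaque cache handle */

    extern struct kmem_cache *kmem_cache_create(const char *name, size_t size,
                                                size_t align,
                                                void (*ctor)(void *),
                                                void (*dtor)(void *));
    extern void *kmem_cache_alloc(struct kmem_cache *cachep, int flags);
    extern void  kmem_cache_free(struct kmem_cache *cachep, void *objp);

    struct msgb { struct msgb *next; int len; };          /* example object (placeholder) */

    static void msgb_ctor(void *obj) { struct msgb *m = obj; m->next = NULL; m->len = 0; }
    static void msgb_dtor(void *obj) { (void)obj; }       /* nothing to tear down */

    static struct kmem_cache *msgb_cache;

    void msgb_cache_init(void)
    {
        /* one cache per object type; objects come back from the cache already
         * constructed, so allocation skips re-initialization */
        msgb_cache = kmem_cache_create("msgb", sizeof(struct msgb), 8,
                                       msgb_ctor, msgb_dtor);
    }

    struct msgb *msgb_get(void)
    {
        return kmem_cache_alloc(msgb_cache, 0);
    }

    void msgb_put(struct msgb *m)
    {
        m->next = NULL;                        /* restore the constructed state */
        m->len = 0;
        kmem_cache_free(msgb_cache, m);
    }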
Slab Organization
[Diagram: a slab of contiguous pages holding free and active buffers, with free-list pointers, an unused-space region, a coloring area, and a 32-byte kmem_slab struct linked into the cache's list of slabs.]
Coloring area - varies the starting offset of objects from slab to slab to optimize hardware cache and bus utilization
Slab Allocator Implementation
The slab containing an object is found by rounding the object's address down to a slab-size boundary (address minus address mod slab size). The cache stores its slabs in a partly sorted list: fully active, then partially active, then free slabs.
Why?
For large objects (> page size):
management structures are kept in a separate memory pool
Summary
Limited space overhead (the kmem_slab struct and a free-list pointer)
Services requests quickly (removes a preinitialized object from the free list)
Coloring scheme for better hardware cache utilization
Small footprint
Simple garbage collection (cost shared across multiple requests, using an in-use count)
Increased management overhead compared to simpler methods (power of twos)