Presentation of Chapter 4,
LINUX Kernel Internals
Zhihua (Scott) Jiang
Computer Science Department
University of Maryland, Baltimore County
Baltimore, MD 21250
Outline
• The Architecture-independent Memory
Model in LINUX
• The Virtual Address Space for a
Process
• Block Device Caching
• Paging Under LINUX
The architecture-independent
memory model
• Pages of Memory
• Virtual Address Space
• Converting the Linear Address
• The Page Directory
• The Page Middle Directory
• The Page Table
Pages of memory
• Defined by the PAGE_SIZE macro in the
asm/page.h
• For x86, the page size is 4 Kbytes
• For Alpha, the page size is 8 Kbytes
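A minimal sketch of the arithmetic a page-size macro makes possible, assuming the 4-Kbyte x86 page size from the slide. The DEMO_ names are hypothetical stand-ins, not the kernel's own definitions in asm/page.h.

```c
#include <assert.h>

/* Hypothetical stand-ins for the kernel's page macros (assumed 4-Kbyte pages). */
#define DEMO_PAGE_SHIFT 12UL
#define DEMO_PAGE_SIZE  (1UL << DEMO_PAGE_SHIFT)
#define DEMO_PAGE_MASK  (~(DEMO_PAGE_SIZE - 1))

/* Round an address up to the next page boundary. */
static unsigned long demo_page_align(unsigned long addr)
{
    /* Guard against wrap-around at the top of the address range. */
    assert(addr <= ~0UL - (DEMO_PAGE_SIZE - 1));
    return (addr + DEMO_PAGE_SIZE - 1) & DEMO_PAGE_MASK;
}

/* Number of whole pages needed to hold len bytes. */
static unsigned long demo_pages_needed(unsigned long len)
{
    return demo_page_align(len) >> DEMO_PAGE_SHIFT;
}
```

With these definitions, an address of 1 aligns up to 4096 and a 4097-byte request needs two pages.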
Virtual address space
• Given by reference to a segment selector and the offset
within the segment
• C pointers hold the offsets
• Defined in asm/segment.h
– KERNEL_DS (segment selector for kernel data)
– USER_DS (segment selector for user data)
• By carrying out a conversion on the segment selector register,
a system function can be given pointers to the kernel
segment.
– Used by UMSDOS file system to simulate a Unix file system
Continued
• MMU of an x86 processor converts the virtual address to a
linear address
• The 32-bit width of the linear address gives an address space of 4 Gbytes
– 3 Gbytes for user segment
– 1 Gbyte for kernel segment
• Alpha does not support segmentation
– Offset addresses for the user segment not permitted to overlap
with the offset addresses for the kernel segment
Converting the linear address
[Figure: linear address conversion in the architecture-independent memory model]
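The conversion can be sketched as splitting a linear address into three table indices and a page offset. This is an illustration only: it assumes an x86-style layout (10-bit page directory index, a page middle directory folded to a single entry, 10-bit page table index, 12-bit offset), and all DEMO_ names are hypothetical.

```c
#include <assert.h>

/* Assumed x86-style layout; the middle directory has one entry (folded). */
#define DEMO_PAGE_SHIFT   12
#define DEMO_PMD_SHIFT    22
#define DEMO_PGDIR_SHIFT  22
#define DEMO_PTRS_PER_PGD 1024UL
#define DEMO_PTRS_PER_PMD 1UL
#define DEMO_PTRS_PER_PTE 1024UL

struct demo_split {
    unsigned long pgd;     /* index into the page directory        */
    unsigned long pmd;     /* index into the page middle directory */
    unsigned long pte;     /* index into the page table            */
    unsigned long offset;  /* byte offset within the page          */
};

static struct demo_split demo_split_linear(unsigned long addr)
{
    struct demo_split s;

    assert(addr <= 0xFFFFFFFFUL);  /* the sketch models 32-bit linear addresses */
    s.pgd    = (addr >> DEMO_PGDIR_SHIFT) & (DEMO_PTRS_PER_PGD - 1);
    s.pmd    = (addr >> DEMO_PMD_SHIFT)   & (DEMO_PTRS_PER_PMD - 1);
    s.pte    = (addr >> DEMO_PAGE_SHIFT)  & (DEMO_PTRS_PER_PTE - 1);
    s.offset = addr & ((1UL << DEMO_PAGE_SHIFT) - 1);
    return s;
}
```

Because the middle directory is folded in this layout, its index is always 0 and the conversion behaves like a two-level lookup.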
The virtual address space for a
process
• The User Segment
• Virtual Memory Areas
• The System Call brk
• Mapping Functions
• The Kernel Segment
• Static Memory Allocation in the Kernel Segment
• Dynamic Memory Allocation in the Kernel
Segment
The user segment
• In user mode, access only in user segment
• Individual page tables for different processes
• system call fork
– child and parent processes have different page directories and page
tables
– however, in the kernel segment page tables are shared by all
processes
• system call clone
– old and new threads share the memory fully
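The separation of parent and child after fork can be observed from user space. The sketch below is an assumption-laden illustration, not kernel code: demo_counter and demo_fork_isolation() are hypothetical names, and only standard POSIX calls are used.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int demo_counter = 1;

/* Fork a child that overwrites demo_counter, then return the value the
 * parent still sees (or -1 on error). */
static int demo_fork_isolation(void)
{
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        demo_counter = 99;                  /* changes only the child's copy */
        _exit(demo_counter == 99 ? 0 : 1);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    assert(WIFEXITED(status));
    return demo_counter;                    /* the parent's copy is untouched */
}
```

The parent still reads 1 after the child has written 99, since each process has its own page tables for the user segment.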
Continued
• Some explanation for shared libraries in the user
segment
– Originally, libraries were linked into one binary, which is efficient at run time
– The drawback is the growth in the size of the binaries
– Shared libraries are stored in separate files and loaded at program start
– They were linked to static addresses
– ELF allows shared libraries to be loaded during
program execution
– This requires that the compiled code contains no absolute address references
Virtual memory areas
• A process does not use all of its functions at any one time
• Processes can share code if they are running the
same executable file
• Copy-on-write strategy used for memory
management
The system call brk
• The brk field points to the end of the BSS segment, which holds data that is
not statically initialized
• It is used for allocating or releasing dynamic memory
• The system call brk can be used to find the current value of
the pointer or, subject to a protection check, to set it to a new one
• The request is rejected if the memory required exceeds the estimated
amount of memory available
• The function sys_brk() calls do_mmap() to map a private and
anonymous area between the old and new values of brk
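From user space, the break can be inspected and moved with the C library wrapper sbrk(). The sketch below assumes a Linux/glibc environment where sbrk() is available; demo_grow_break() is a hypothetical helper, and the code does not show what sys_brk() does internally.

```c
#define _DEFAULT_SOURCE   /* assumed necessary to expose sbrk() in glibc */
#include <assert.h>
#include <stdint.h>
#include <unistd.h>

/* Move the break by `bytes` (may be negative) and return how far it
 * actually moved, or -1 if the request was rejected. */
static long demo_grow_break(long bytes)
{
    void *old_brk = sbrk(0);               /* query the current break */

    assert(old_brk != (void *)-1);
    if (sbrk((intptr_t)bytes) == (void *)-1)
        return -1;                          /* the kernel refused the request */

    void *new_brk = sbrk(0);
    return (long)((char *)new_brk - (char *)old_brk);
}
```

Calling demo_grow_break(4096) followed by demo_grow_break(-4096) first allocates and then releases one 4-Kbyte area at the end of the data segment.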
Mapping functions
• The C library provides the following functions in sys/mman.h
– caddr_t mmap(caddr_t addr, size_t len, int prot, int flags,
int fd, off_t off);
– int munmap(caddr_t addr, size_t len);
– int mprotect(caddr_t addr, size_t len, int prot);
– int msync(caddr_t addr, size_t len, int flags);
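A minimal user-space sketch of these calls, assuming a modern Linux C library in which the prototypes take void * rather than caddr_t. MAP_ANONYMOUS is not mentioned on the slide and is used here only to avoid needing a file; demo_mmap_roundtrip() is a hypothetical name.

```c
#define _DEFAULT_SOURCE   /* assumed necessary to expose MAP_ANONYMOUS in glibc */
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map one private anonymous page, write to it, make it read-only,
 * read it back, and unmap it. Returns 0 on success, -1 on failure. */
static int demo_mmap_roundtrip(void)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    size_t len = (size_t)page;

    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    strcpy(p, "hello");                       /* page is writable here */

    if (mprotect(p, len, PROT_READ) != 0) {   /* drop write permission */
        munmap(p, len);
        return -1;
    }
    int ok = (strcmp(p, "hello") == 0);       /* reading is still allowed */

    if (munmap(p, len) != 0)
        return -1;
    return ok ? 0 : -1;
}
```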
The kernel segment
• In x86 architecture, a system call is generally initiated by the
software interrupt 128 (0x80) being triggered.
• Every process in system mode sees the same kernel
segment
• The kernel segment in the Alpha architecture cannot start at address 0
• A PAGE_OFFSET is provided between physical and virtual addresses
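The role of a fixed offset between physical and virtual addresses can be sketched as two conversion helpers. The offset value below is a hypothetical number picked for illustration and is not taken from any kernel header; the function names are likewise assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical offset, for illustration only. */
#define DEMO_PAGE_OFFSET UINT64_C(0xC0000000)

/* Kernel virtual address -> physical address. */
static uint64_t demo_virt_to_phys(uint64_t vaddr)
{
    assert(vaddr >= DEMO_PAGE_OFFSET);   /* must lie in the kernel segment */
    return vaddr - DEMO_PAGE_OFFSET;
}

/* Physical address -> kernel virtual address. */
static uint64_t demo_phys_to_virt(uint64_t paddr)
{
    return paddr + DEMO_PAGE_OFFSET;
}
```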
Static memory allocation in the kernel
segment
• Initialization routine for character-oriented
devices is called as follows
memory_start = console_init(memory_start, memory_end);
• Reserves memory by returning a value higher
than the parameter memory_start
• The memory between the return value and
memory_start can be used as desired by the
initialized component
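The calling convention above can be sketched with a hypothetical initialization routine that claims a fixed amount of memory. demo_init(), demo_dev and the 512-byte requirement are assumptions made for the example; addresses are modelled as plain integers so the sketch runs in user space.

```c
#include <assert.h>

/* Hypothetical component state: where its reserved table lives. */
struct demo_state {
    unsigned long table;      /* start address of the reserved area */
    unsigned long table_len;  /* size of the reserved area in bytes */
};
static struct demo_state demo_dev;

/* Reserve `need` bytes starting at memory_start and return the new
 * start of free memory; return memory_start unchanged if there is
 * not enough room. */
static unsigned long demo_init(unsigned long memory_start,
                               unsigned long memory_end)
{
    const unsigned long need = 512;

    assert(memory_start <= memory_end);
    if (memory_end - memory_start < need)
        return memory_start;             /* nothing reserved */

    demo_dev.table = memory_start;       /* the component keeps this area */
    demo_dev.table_len = need;
    return memory_start + need;          /* the caller continues from here */
}
```

Chaining such routines, each taking the previous return value as its memory_start, hands out consecutive areas without any allocator.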
Dynamic memory allocation in the kernel
segment
• In LINUX kernel, kmalloc() and kfree() used for dynamic
memory allocation
– void * kmalloc(size_t size, int priority);
– void kfree(void *obj);
• To increase efficiency, the memory reserved is not initialized
• In LINUX kernel 1.2, __get_free_pages() can only reserve
contiguous areas of memory of 4, 8, 16, 32, 64, and 128
Kbytes in size
• kmalloc() can reserve far smaller areas of memory
Continued
• sizes[] contains descriptors for the different sizes of memory area;
each descriptor keeps two lists
– one manages memory suitable for DMA
– the other is responsible for ordinary memory
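Choosing a descriptor for a request can be sketched as a search for the smallest block size that fits. The table below is illustrative only: the kernel's real sizes[] values differ, and the demo_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative block sizes; not the kernel's actual table. */
static const size_t demo_sizes[] = { 32, 64, 128, 256, 512, 1024, 2048, 4096 };
#define DEMO_NSIZES (sizeof demo_sizes / sizeof demo_sizes[0])

/* Index of the smallest size class that can hold `size` bytes,
 * or -1 if the request is too large for any class. */
static int demo_size_class(size_t size)
{
    for (size_t i = 0; i < DEMO_NSIZES; i++)
        if (size <= demo_sizes[i])
            return (int)i;
    return -1;
}

/* Block size that belongs to a valid size class index. */
static size_t demo_block_size(int index)
{
    assert(index >= 0 && (size_t)index < DEMO_NSIZES);
    return demo_sizes[index];
}
```

A 100-byte request lands in the 128-byte class under these assumed sizes, so 28 bytes of the block go unused.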
Continued
[Figure: structures for kmalloc]
Continued
• kmalloc() and kfree() are restricted to the size of one page of
memory
• vmalloc() and vfree() reserve and release memory in multiples of the size of
one page of memory
• The maximum value of size is limited by the amount of physical
memory available
• Memory reserved by vmalloc() won’t be copied to external
storage
Continued
• vmalloc() compared with kmalloc()
– the size of the area of memory requested can be better
adjusted to actual needs
– Limited only by the size of free physical memory and not
by its fragmentation (as kmalloc() is)
– Does not return any physical address
– reserved memory can be non-consecutive pages
– not suitable for reserving memory for DMA
Block Device Caching
• Block Buffering
• The update and bdflush Processes
• List Structures for the Buffer Cache
• Using the Buffer Cache
Block Buffering
• Block size may be 512, 1024, 2048, or 4096 bytes
• Held in memory via a buffering system
• A special case applies for blocks taken from files
opened with the flag O_SYNC
– Transferred to disk every time their contents are modified
• Data is organized so that frequently requested data lie
very close together and can be kept in the processor
cache
The update and bdflush
Processes
• At periodic intervals, the update process calls the system call
bdflush with a parameter
• All modified buffer blocks are written back to disk with all
superblock and inode information
• bdflush writes back the number of block buffers marked
“dirty” that is given in the bdflush parameter
• Always activated when a block is released by means of
brelse()
• Also activated when new block buffers are requested or the
size of the buffer cache needs to be reduced
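Writing back a bounded number of dirty buffers per activation can be sketched as below. This is a user-space illustration under stated assumptions: struct demo_bh, demo_bufs and the helper names are hypothetical, and clearing the dirty flag stands in for the real disk write.

```c
#include <assert.h>
#include <stddef.h>

struct demo_bh {
    int block;   /* block number                               */
    int dirty;   /* non-zero if contents differ from the disk  */
};

/* Example buffer set: three of four buffers are dirty. */
static struct demo_bh demo_bufs[4] = {
    { 10, 1 }, { 11, 0 }, { 12, 1 }, { 13, 1 }
};

/* Write back at most max_writes dirty buffers; return how many were written. */
static int demo_write_back(struct demo_bh *bufs, size_t n, int max_writes)
{
    int written = 0;

    assert(max_writes >= 0);
    for (size_t i = 0; i < n && written < max_writes; i++) {
        if (bufs[i].dirty) {
            bufs[i].dirty = 0;   /* stands in for the actual disk write */
            written++;
        }
    }
    return written;
}

/* Count buffers that are still dirty. */
static int demo_count_dirty(const struct demo_bh *bufs, size_t n)
{
    int count = 0;

    for (size_t i = 0; i < n; i++)
        if (bufs[i].dirty)
            count++;
    return count;
}
```

With a limit of two, one pass over demo_bufs cleans blocks 10 and 12 and leaves block 13 dirty for a later activation.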
List structure for the buffer cache
• LINUX manages its block buffers via a number of different doubly
linked lists
• Block buffers in use are managed in a set of special LRU lists
LRU list (index)   Description
BUF_CLEAN          Block buffers not managed in other lists; content matches the relevant block on hard disk
BUF_UNSHARED       Block buffers formerly (but no longer) managed in BUF_SHARED
BUF_LOCKED         Locked block buffers (b_lock != 0)
BUF_LOCKED1        Locked block buffers for inodes and superblocks
BUF_DIRTY          Block buffers with contents not matching the relevant block on hard disk
BUF_SHARED         Block buffers situated in a page of memory mapped to the user segment of a process
The various LRU lists
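One such list can be sketched as a circular doubly linked list whose head is the least recently used buffer. The structures and function names below are hypothetical and much simpler than the kernel's buffer heads; the sketch only shows the list discipline.

```c
#include <assert.h>
#include <stddef.h>

struct demo_buf {
    int block;
    struct demo_buf *prev, *next;
};

struct demo_lru {
    struct demo_buf *head;   /* least recently used buffer, or NULL */
};

/* Unlink b from the circular list. */
static void demo_lru_remove(struct demo_lru *l, struct demo_buf *b)
{
    assert(l->head != NULL);
    if (b->next == b) {                  /* b was the only element */
        l->head = NULL;
    } else {
        b->prev->next = b->next;
        b->next->prev = b->prev;
        if (l->head == b)
            l->head = b->next;
    }
    b->prev = b->next = NULL;
}

/* Append b at the tail (most recently used position). */
static void demo_lru_add_tail(struct demo_lru *l, struct demo_buf *b)
{
    if (l->head == NULL) {
        b->prev = b->next = b;
        l->head = b;
        return;
    }
    b->next = l->head;
    b->prev = l->head->prev;
    l->head->prev->next = b;
    l->head->prev = b;
}

/* Mark b as just used by moving it to the tail. */
static void demo_lru_touch(struct demo_lru *l, struct demo_buf *b)
{
    demo_lru_remove(l, b);
    demo_lru_add_tail(l, b);
}

/* Insert blocks 1, 2, 3, touch block 1, and report the LRU block. */
static int demo_lru_example(void)
{
    struct demo_lru l = { NULL };
    struct demo_buf a = { 1, NULL, NULL };
    struct demo_buf b = { 2, NULL, NULL };
    struct demo_buf c = { 3, NULL, NULL };

    demo_lru_add_tail(&l, &a);
    demo_lru_add_tail(&l, &b);
    demo_lru_add_tail(&l, &c);
    demo_lru_touch(&l, &a);              /* order is now 2, 3, 1 */
    return l.head->block;
}
```

After block 1 is touched it moves behind blocks 2 and 3, so block 2 becomes the first candidate for reuse.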
Using the buffer cache
• Function bread() is called for block read
• A variant of bread(), breada(), reads not only the block
requested into the buffer cache but also a number of
following blocks
Paging under LINUX
• Page Cache and Management
• Finding a Free Page
• Page Errors and Reloading a Page
Page Cache and Management
• LINUX can save pages to external media in 2 ways
– a complete block device as the external medium, typically
a partition on a hard disk
– fixed-length files on a file system for its external storage
• Data that belong together are stored in a cache line
(16 bytes)
Finding a free page
• __get_free_pages() is called when physical pages of memory
are to be reserved
– unsigned long __get_free_pages(int priority, unsigned long
order, int dma) ;
Priority       Description
GFP_BUFFER     A free page is returned only if free pages are still available in physical memory
GFP_ATOMIC     The function __get_free_page() must not interrupt the current process, but a page should be returned if possible
GFP_USER       The current process may be interrupted to swap pages
GFP_KERNEL     This parameter is the same as GFP_USER
GFP_NOBUFFER   The buffer cache won’t be reduced by an attempt to find a free page in memory
GFP_NFS        Differs from GFP_USER in that the number of pages reserved for GFP_ATOMIC is reduced from min_free_pages to five; this speeds up NFS operations
Priorities for the function __get_free_page()
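The order argument selects the size of the contiguous area as a power-of-two number of pages. The sketch below assumes 4-Kbyte pages and a maximum order of 5, which reproduces the 4 to 128 Kbyte range quoted earlier for kernel 1.2; the DEMO_ names are hypothetical.

```c
#include <assert.h>

#define DEMO_PAGE_SIZE 4096UL
#define DEMO_MAX_ORDER 5UL     /* 4 Kbytes << 5 = 128 Kbytes */

/* Size in bytes of an area of the given order. */
static unsigned long demo_order_to_bytes(unsigned long order)
{
    assert(order <= DEMO_MAX_ORDER);
    return DEMO_PAGE_SIZE << order;
}

/* Smallest order whose area holds `bytes`, or -1 if the request
 * is larger than the biggest area. */
static int demo_bytes_to_order(unsigned long bytes)
{
    for (unsigned long order = 0; order <= DEMO_MAX_ORDER; order++)
        if (bytes <= (DEMO_PAGE_SIZE << order))
            return (int)order;
    return -1;
}
```

A request one byte over a page already needs order 1, that is, two contiguous pages.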
Page errors and reloading a page
• do_page_fault() is called when a
page fault interrupt is generated
– void do_page_fault(struct pt_regs *regs, unsigned long
error_code);
• If the address is in a virtual memory area, the legality of the
read or write operation is checked by reference to the
flags for the virtual memory area
• do_no_page() or do_wp_page() is then called
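The decision described above can be sketched as a lookup followed by a permission check. This is a simplified user-space model under stated assumptions: the structures, flag values and names are hypothetical, and cases such as stack growth or swapped-out pages are left out.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_VM_READ  0x1u
#define DEMO_VM_WRITE 0x2u

struct demo_vma {
    unsigned long start, end;   /* covers [start, end) */
    unsigned int flags;         /* DEMO_VM_* permissions */
};

enum demo_action { DEMO_SIGSEGV, DEMO_NO_PAGE, DEMO_WP_PAGE };

/* Example address space: one read-only area, one read-write area. */
static const struct demo_vma demo_areas[2] = {
    { 0x1000, 0x3000, DEMO_VM_READ },
    { 0x8000, 0xA000, DEMO_VM_READ | DEMO_VM_WRITE }
};

/* Find the area containing addr, or NULL if there is none. */
static const struct demo_vma *demo_find_vma(const struct demo_vma *v,
                                            size_t n, unsigned long addr)
{
    for (size_t i = 0; i < n; i++)
        if (addr >= v[i].start && addr < v[i].end)
            return &v[i];
    return NULL;
}

/* present: the page is already mapped; write: the access was a write. */
static enum demo_action demo_fault(const struct demo_vma *v, size_t n,
                                   unsigned long addr, int present, int write)
{
    const struct demo_vma *vma = demo_find_vma(v, n, addr);

    if (vma == NULL)
        return DEMO_SIGSEGV;                 /* address outside every area */
    if (write && !(vma->flags & DEMO_VM_WRITE))
        return DEMO_SIGSEGV;                 /* illegal write */
    if (!write && !(vma->flags & DEMO_VM_READ))
        return DEMO_SIGSEGV;                 /* illegal read */
    if (!present)
        return DEMO_NO_PAGE;                 /* page must be loaded */

    /* In this model a legal read of a present page never faults. */
    assert(write);
    return DEMO_WP_PAGE;                     /* write to a protected page */
}
```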