Operating Systems
Course Notes
Fall 2011
David R. Cheriton
School of Computer Science
University of Waterloo
Intro 4
• Some terminology:
kernel: The operating system kernel is the part of the operating system that
responds to system calls, interrupts and exceptions.
operating system: The operating system as a whole includes the kernel, and
may include other related programs that provide services for applications.
This may include things like:
– utility programs
– command interpreters
– programming libraries
Intro 5
[Figure: the operating system, with the kernel at its core, sits between user programs and the machine's resources]
Intro 7
Course Outline
• Introduction
• Threads and Concurrency
• Synchronization
• Processes and the Kernel
• Virtual Memory
• Scheduling
• File Systems
• Interprocess Communication and Networking (time permitting)
Threads and Concurrency 1
• Registers
– program counter, stack pointer, . . .
• Memory
– program code
– program data
– program stack containing procedure activation records
• CPU
– fetches and executes instructions
Threads and Concurrency 3
What is a Thread?
Imagine that you would like to suspend the program execution, and
resume it again later. Think of the thread context as the information
you would need in order to restart program execution from where
it left off when it was suspended.
Threads and Concurrency 5
Thread Context
[Figure: a thread's context, split between memory (code, data, stack) and the CPU registers]
Concurrent Threads
• more than one thread may exist simultaneously (why might this be a good
idea?)
• each thread has its own context, though they may share access to program
code and data
• on a uniprocessor (one CPU), at most one thread is actually executing at any
time. The others are paused, waiting to resume execution.
Threads and Concurrency 7
for (i = 0; i < NumLoops; i++) {
    /* for now, this mouse chooses a random bowl from
     * which to eat, and it is not synchronized with
     * other cats and mice
     */
    /* legal bowl numbers range from 1 to NumBowls */
    bowl = ((unsigned int)random() % NumBowls) + 1;
    mouse_eat(bowl, 1);
}
Implementing Threads
• the thread library stores threads’ contexts (or pointers to the threads’ contexts)
when they are not running
• the data structure used by the thread library to store a thread context is
sometimes called a thread control block
Threads and Concurrency 9
[Figure: the thread library and its data in memory]
/* see kern/include/thread.h */
struct thread {
Threads and Concurrency 11
[Figure: thread structures stored by the thread library in memory]
• the act of pausing the execution of one thread and resuming the execution of
another is called a (thread) context switch
• what happens during a context switch?
1. save the context of the currently running thread
2. decide which thread will run next
3. restore the context of the thread that is to run next
• the act of saving the context of the current thread and installing the context of
the next thread to run is called dispatching (the next thread)
• sounds simple, but . . .
– architecture-specific implementation
– thread must save/restore its context carefully, since thread execution
continuously changes the context
– can be tricky to understand (at what point does a thread actually stop?
what is it executing when it resumes?)
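The save/decide/restore steps can be sketched at user level with the POSIX ucontext API. This is an illustration only, not OS/161's actual mips_switch; the names (run_switch_demo, other_thread) are ours:

```c
#include <ucontext.h>

static ucontext_t main_ctx, other_ctx;
static char other_stack[16384];
static volatile int other_ran = 0;

static void other_thread(void)
{
    /* the "other" thread runs, then switches back, saving
     * its own context into other_ctx as it goes */
    other_ran = 1;
    swapcontext(&other_ctx, &main_ctx);
}

/* One round trip: save the current context, restore the other
 * context, and resume when it switches back.  Returns 1 on success. */
int run_switch_demo(void)
{
    getcontext(&other_ctx);
    other_ctx.uc_stack.ss_sp = other_stack;
    other_ctx.uc_stack.ss_size = sizeof(other_stack);
    other_ctx.uc_link = &main_ctx;
    makecontext(&other_ctx, other_thread, 0);
    /* save our context in main_ctx, install other_ctx: a context switch */
    swapcontext(&main_ctx, &other_ctx);
    return other_ran;
}
```

swapcontext performs exactly the "save current, restore next" pair described above, which is why the caller appears to return from it only when some other context dispatches it again.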
CS350 Operating Systems Fall 2011
Threads and Concurrency 13
/* Get the new stack pointer from the new control block */
lw sp, 0(a1)
nop /* delay slot for load */
j ra /* and return. */
addi sp, sp, 44 /* in delay slot */
.end mips_switch
Threads and Concurrency 15
• Yield() causes the calling thread to stop and wait, and causes the thread library
to choose some other waiting thread to run in its place. In other words, Yield()
causes a context switch.
Threads and Concurrency 17
Scheduling
Threads and Concurrency 19
Preemption
Review: Interrupts
Threads and Concurrency 21
Round-Robin Scheduling
• as in FIFO scheduling, there is a ready queue and the thread at the front of the
ready queue runs
• unlike FIFO, a limit is placed on the amount of time that a thread can run
before it is preempted
• the amount of time that a thread is allocated is called the scheduling quantum
• when the running thread’s quantum expires, it is preempted and moved to the
back of the ready queue. The thread at the front of the ready queue is
dispatched and allowed to run.
• suppose that the system timer generates an interrupt every t time units, e.g.,
once every millisecond
• suppose that the thread library wants to use a scheduling quantum q = 500t,
i.e., it will preempt a thread after half a second of execution
• to implement this, the thread library can maintain a variable called
running_time to track how long the current thread has been running:
– when a thread is initially dispatched, running_time is set to zero
– when an interrupt occurs, the timer-specific part of the interrupt handler
can increment running_time and then test its value
∗ if running_time is less than q, the interrupt handler simply returns
and the running thread resumes its execution
∗ if running_time is equal to q, then the interrupt handler invokes
Yield() to cause a context switch
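This bookkeeping is a few lines of C. The names are hypothetical, and a real handler would also acknowledge the timer device; the sketch only shows the quantum accounting:

```c
#define Q 500  /* scheduling quantum, in timer ticks (q = 500t) */

static unsigned int running_time;

/* called whenever a thread is dispatched */
void thread_dispatched(void)
{
    running_time = 0;
}

/* timer-specific part of the interrupt handler;
 * returns 1 when the quantum has expired and Yield() should be called */
int timer_tick(void)
{
    running_time++;
    return (running_time >= Q);
}

/* helper: count how many ticks elapse before the handler asks to yield */
int ticks_until_yield(void)
{
    int n = 0;
    thread_dispatched();
    while (!timer_tick())
        n++;
    return n + 1;
}
```

With t = 1 ms, the 500th tick is the one that triggers the Yield(), giving the half-second quantum described above.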
Threads and Concurrency 23
[Figure: stack contents at the point where a thread's context is saved. During preemption: application stack frame(s), a trap frame, interrupt handling stack frame(s), a Yield() stack frame, and the saved thread context. During a voluntary yield: application stack frame(s), a thread_yield() stack frame, and the saved thread context. Stacks grow downward.]
Synchronization 1
Concurrency
Synchronization 2
Thread Synchronization
Synchronization 3
Synchronization 4
Synchronization 5
• mutual exclusion algorithms ensure that only one thread at a time executes the
code in a critical section
• several techniques for enforcing mutual exclusion
– exploit special hardware-specific machine instructions, e.g., test-and-set or
compare-and-swap, that are intended for this purpose
– use mutual exclusion algorithms, e.g., Peterson’s algorithm, that rely only
on atomic loads and stores
– control interrupts to ensure that threads are not preempted while they are
executing a critical section
Synchronization 6
Disabling Interrupts
Synchronization 7
• advantages:
– does not require any hardware-specific synchronization instructions
– works for any number of concurrent threads
• disadvantages:
– indiscriminate: prevents all preemption, not just preemption that would
threaten the critical section
– ignoring timer interrupts has side effects, e.g., kernel unaware of passage
of time. (Worse, OS/161’s splhigh() disables all interrupts, not just
timer interrupts.) Keep critical sections short to minimize these problems.
– will not enforce mutual exclusion on multiprocessors (why??)
Synchronization 8
/* shared variables */
/* note: one flag array and turn variable */
/* for each critical section */
boolean flag[2]; /* shared, initially false */
int turn; /* shared */
flag[i] = false;
Ensures mutual exclusion and avoids starvation, but works only for
two threads. (Why?)
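The fragment above shows only the shared variables and the exit section (flag[i] = false). A complete sketch of Peterson's algorithm for threads i ∈ {0, 1} follows; the memory barrier is our addition, since on real hardware the stores can be reordered and the algorithm's correctness argument assumes they are not:

```c
static volatile int flag[2];  /* flag[i]: thread i wants to enter */
static volatile int turn;     /* which thread must wait if both want in */

void peterson_enter(int i)
{
    int j = 1 - i;            /* the other thread */
    flag[i] = 1;              /* announce intent */
    turn = j;                 /* politely let the other thread go first */
    __sync_synchronize();     /* full memory barrier (GCC builtin) */
    while (flag[j] && turn == j)
        ;                     /* busy-wait only while j is also trying */
}

void peterson_exit(int i)
{
    flag[i] = 0;              /* the exit section from the slide */
}
```

The atomic loads and stores of flag and turn are all the algorithm relies on, which is why it needs no special machine instructions.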
Synchronization 9
Synchronization 10
Synchronization 11
Semaphores
Synchronization 12
OS/161 Semaphores
struct semaphore {
char *name;
volatile int count;
};
see
• kern/include/synch.h
• kern/thread/synch.c
Synchronization 13
Synchronization 14
Producer/Consumer Synchronization
• suppose we have threads that add items to a list (producers) and threads that
remove items from the list (consumers)
• suppose we want to ensure that consumers do not consume if the list is empty
- instead they must wait until the list has something in it
Synchronization 15
Producer’s Pseudo-code:
add item to the list (call list_append())
V(s);
Consumer’s Pseudo-code:
P(s);
remove item from the list (call list_remove_front())
Synchronization 16
• suppose we add one more requirement: the number of items in the list should
not exceed N
• producers that try to add items when the list is full should be made to wait
until the list is no longer full
Synchronization 17
Producer’s Pseudo-code:
P(empty);
add item to the list (call list_append())
V(full);
Consumer’s Pseudo-code:
P(full);
remove item from the list (call list_remove_front())
V(empty);
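The pseudo-code above maps directly onto POSIX semaphores (sem_wait is P, sem_post is V). This sketch uses a circular buffer as the list; with more than one producer or consumer, the index updates would additionally need a mutex, which the slide's pseudo-code leaves out:

```c
#include <semaphore.h>

#define NUM_SLOTS 4   /* N: maximum number of items in the list */

static int buffer[NUM_SLOTS];
static int in_pos, out_pos;
static sem_t empty_slots;   /* counts free slots, initially N */
static sem_t full_slots;    /* counts used slots, initially 0 */

void buf_init(void)
{
    sem_init(&empty_slots, 0, NUM_SLOTS);
    sem_init(&full_slots, 0, 0);
}

void produce(int item)
{
    sem_wait(&empty_slots);              /* P(empty): wait for a free slot */
    buffer[in_pos] = item;               /* add item to the list */
    in_pos = (in_pos + 1) % NUM_SLOTS;
    sem_post(&full_slots);               /* V(full): one more item */
}

int consume(void)
{
    sem_wait(&full_slots);               /* P(full): wait for an item */
    int item = buffer[out_pos];          /* remove item from the list */
    out_pos = (out_pos + 1) % NUM_SLOTS;
    sem_post(&empty_slots);              /* V(empty): one more free slot */
    return item;
}
```

A producer blocks in P(empty) exactly when the list holds N items, and a consumer blocks in P(full) exactly when it is empty.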
Synchronization 18
void
P(struct semaphore *sem)
{
    int spl;
    assert(sem != NULL);

    /*
     * May not block in an interrupt handler.
     * For robustness, always check, even if we can actually
     * complete the P without blocking.
     */
    assert(in_interrupt == 0);

    spl = splhigh();
    while (sem->count == 0) {
        thread_sleep(sem);
    }
    assert(sem->count > 0);
    sem->count--;
    splx(spl);
}
Synchronization 19
Thread Blocking
• Sometimes a thread will need to wait for an event. One example is on the
previous slide: a thread that attempts a P() operation on a zero-valued
semaphore must wait until the semaphore’s value becomes positive.
Synchronization 20
Synchronization 21
Thread States
[Figure: thread state transitions — "dispatch" moves a thread from ready to running; "quantum expires or thread_yield()" moves it from running back to ready]
• the states:
running: currently executing
ready: ready to execute
blocked: waiting for something, so not ready to execute.
Synchronization 22
void
V(struct semaphore *sem)
{
    int spl;
    assert(sem != NULL);
    spl = splhigh();
    sem->count++;
    assert(sem->count > 0);
    thread_wakeup(sem);
    splx(spl);
}
Synchronization 23
OS/161 Locks
lock_acquire(mylock);
critical section /* e.g., call to list_remove_front */
lock_release(mylock);
• A lock is similar to a binary semaphore with an initial value of 1. However,
locks also enforce an additional constraint: the thread that releases a lock
must be the same thread that most recently acquired it.
• The system enforces this additional constraint to help ensure that locks are
used as intended.
Synchronization 24
Condition Variables
Synchronization 25
• Condition variables get their name because they allow threads to wait for
arbitrary conditions to become true inside of a critical section.
Synchronization 26
Synchronization 27
Produce(item) {
    lock_acquire(mutex);
    while (count == N) {
        cv_wait(notfull, mutex);
    }
    add item to buffer (call list_append())
    count = count + 1;
    cv_signal(notempty, mutex);
    lock_release(mutex);
}
Synchronization 28
Consume() {
    lock_acquire(mutex);
    while (count == 0) {
        cv_wait(notempty, mutex);
    }
    remove item from buffer (call list_remove_front())
    count = count - 1;
    cv_signal(notfull, mutex);
    lock_release(mutex);
}
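The same Produce/Consume pattern can be written against POSIX condition variables; pthread_cond_wait releases the mutex while waiting and re-acquires it before returning, just like cv_wait. This is a sketch with our own names (cv_produce, cv_consume, CAP for N):

```c
#include <pthread.h>

#define CAP 4   /* N: buffer capacity */

static int buf[CAP];
static int count, head, tail;
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t notfull = PTHREAD_COND_INITIALIZER;
static pthread_cond_t notempty = PTHREAD_COND_INITIALIZER;

void cv_produce(int item)
{
    pthread_mutex_lock(&mutex);
    while (count == CAP)                      /* re-test after every wakeup */
        pthread_cond_wait(&notfull, &mutex);  /* sleeps with mutex released */
    buf[tail] = item;
    tail = (tail + 1) % CAP;
    count++;
    pthread_cond_signal(&notempty);           /* an item is now available */
    pthread_mutex_unlock(&mutex);
}

int cv_consume(void)
{
    pthread_mutex_lock(&mutex);
    while (count == 0)
        pthread_cond_wait(&notempty, &mutex);
    int item = buf[head];
    head = (head + 1) % CAP;
    count--;
    pthread_cond_signal(&notfull);            /* a slot is now free */
    pthread_mutex_unlock(&mutex);
    return item;
}
```

The while loops (rather than ifs) around the waits matter: the condition must be re-checked on wakeup, since another thread may have consumed the item or filled the slot first.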
Synchronization 29
Monitors
• only one monitor method may be active at a time, i.e., the monitor methods
(together) form a critical section
– if two threads attempt to execute methods at the same time, one will be
blocked until the other finishes
• inside a monitor, condition variables can be declared and used
Synchronization 30
Monitors in OS/161
Synchronization 31
Deadlocks
• Suppose there are two threads and two locks, lockA and lockB, both
initially unlocked.
• Suppose the following sequence of events occurs
1. Thread 1 does lock_acquire(lockA).
2. Thread 2 does lock_acquire(lockB).
3. Thread 1 does lock_acquire(lockB) and blocks, because lockB is
held by thread 2.
4. Thread 2 does lock_acquire(lockA) and blocks, because lockA is
held by thread 1.
Synchronization 32
Synchronization 33
[Figure: resource allocation graph with threads T1–T3 and resource types R1–R5]
Synchronization 34
[Figure: another resource allocation graph with threads T1–T3 and resource types R1–R5]
Synchronization 35
Deadlock Prevention
No Hold and Wait: prevent a thread from requesting resources if it currently has
resources allocated to it. A thread may hold several resources, but to do so it
must make a single request for all of them.
Preemption: take resources away from a thread and give them to another (usually
not possible). Thread is restarted when it can acquire all the resources it needs.
Resource Ordering: Order (e.g., number) the resource types, and require that
each thread acquire resources in increasing resource type order. That is, a
thread may make no requests for resources of type less than or equal to i if it
is holding resources of type i.
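Resource ordering is easy to apply to locks: impose one global order (here, by address — an arbitrary but consistent choice of ours) and have every thread acquire in that order. Thread 1 and Thread 2 from the earlier deadlock sequence would then both take lockA before lockB, and the circular wait cannot form. A sketch:

```c
#include <pthread.h>

/* Acquire two locks in a single global order (by address).
 * Every thread agrees on the order, so no circular wait can form. */
void acquire_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
    if (a > b) {                    /* impose the global order */
        pthread_mutex_t *t = a;
        a = b;
        b = t;
    }
    pthread_mutex_lock(a);
    pthread_mutex_lock(b);
}

void release_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
    pthread_mutex_unlock(a);
    pthread_mutex_unlock(b);
}

/* both "request orders" end up taking the locks in the same order */
int ordering_demo(void)
{
    static pthread_mutex_t lockA = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t lockB = PTHREAD_MUTEX_INITIALIZER;

    acquire_pair(&lockA, &lockB);   /* thread 1's request */
    release_pair(&lockA, &lockB);
    acquire_pair(&lockB, &lockA);   /* thread 2's request, same actual order */
    release_pair(&lockB, &lockA);
    return 1;
}
```

Ordering by address is just one way to number the resource types; any fixed numbering agreed on by all threads works.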
Synchronization 36
• main idea: the system maintains the resource allocation graph and tests it to
determine whether there is a deadlock. If there is, the system must recover
from the deadlock situation.
Synchronization 37
Synchronization 38
/* initialization */
R = U
for all i, fi = false
/* can each thread finish? */
while ∃ i ( ¬ fi ∧ (Di ≤ R) ) {
    R = R + Ai
    fi = true
}
/* if not, there is a deadlock */
if ∃ i ( ¬ fi ) then report deadlock
else report no deadlock
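The detection algorithm above translates directly into C. Here Di is thread i's outstanding demand, Ai its current allocation, and U the unallocated resources; detect_demo runs the second worked example from the following slides (the one with U = (0, 0, 0, 0, 0)), which is not deadlocked:

```c
#include <string.h>

#define NT 3   /* threads */
#define NR 5   /* resource types */

/* returns 1 if the system is deadlocked, 0 otherwise */
int detect_deadlock(const int D[NT][NR], const int A[NT][NR], const int U[NR])
{
    int R[NR];
    int finished[NT] = { 0, 0, 0 };
    int progress = 1;

    memcpy(R, U, sizeof(R));               /* R = U */
    while (progress) {                     /* repeat while some thread finishes */
        progress = 0;
        for (int i = 0; i < NT; i++) {
            int can = !finished[i];
            for (int r = 0; r < NR && can; r++)
                if (D[i][r] > R[r])        /* test Di <= R */
                    can = 0;
            if (can) {
                for (int r = 0; r < NR; r++)
                    R[r] += A[i][r];       /* R = R + Ai */
                finished[i] = 1;           /* fi = true */
                progress = 1;
            }
        }
    }
    for (int i = 0; i < NT; i++)
        if (!finished[i])
            return 1;                      /* some thread can never finish */
    return 0;
}

/* second example from the slides: T3 can finish, then T1, then T2 */
int detect_demo(void)
{
    const int D[NT][NR] = { {0,1,0,0,0}, {1,0,0,0,0}, {0,0,0,0,0} };
    const int A[NT][NR] = { {1,0,0,1,0}, {0,2,1,0,0}, {0,1,1,0,1} };
    const int U[NR]     = { 0, 0, 0, 0, 0 };
    return detect_deadlock(D, A, U);
}
```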
Synchronization 39
[Figure: resource allocation graph with threads T1–T3 and resource types R1–R5]
• D1 = (0, 1, 0, 0, 0)
• D2 = (0, 0, 0, 0, 1)
• D3 = (0, 1, 0, 0, 0)
• A1 = (1, 0, 0, 0, 0)
• A3 = (0, 1, 1, 0, 1)
• U = (0, 0, 1, 1, 0)
Synchronization 40
[Figure: resource allocation graph with threads T1–T3 and resource types R1–R5]
• D1 = (0, 1, 0, 0, 0)
• D2 = (1, 0, 0, 0, 0)
• D3 = (0, 0, 0, 0, 0)
• A1 = (1, 0, 0, 1, 0)
• A2 = (0, 2, 1, 0, 0)
• A3 = (0, 1, 1, 0, 1)
• U = (0, 0, 0, 0, 0)
Processes and the Kernel 1
What is a Process?
Multiprogramming
Processes and the Kernel 3
The OS Kernel
• The kernel is a program. It has code and data like any other program.
• Usually kernel code runs in a privileged execution mode, while other
programs do not
[Figure: application and kernel memory, the thread library, and the CPU registers]
Processes and the Kernel 5
System Calls
Processes and the Kernel 7
• The hardware provides a mechanism that a running program can use to cause
a system call. Often, it is a special instruction, e.g., the MIPS syscall
instruction.
Processes and the Kernel 9
[Figure: timeline of a system call — the thread's execution path crosses from the process into the kernel at the system call, and crosses back at the system call return]
Synopsis:
#include <unistd.h>
int
close(int fd);
Processes and the Kernel 11
/* Program: SyscallExample */
#include <unistd.h>
#include <errno.h>

int
main()
{
    int x;
    x = close(999);
    if (x < 0) {
        return errno;
    }
    return x;
}
SyscallExample, Disassembled
00400100 <main>:
400100: 27bdffe8 addiu sp,sp,-24
400104: afbf0010 sw ra,16(sp)
400108: 0c100077 jal 4001dc <close>
40010c: 240403e7 li a0,999
400110: 04400005 bltz v0,400128 <main+0x28>
400114: 00401821 move v1,v0
400118: 8fbf0010 lw ra,16(sp)
40011c: 00601021 move v0,v1
400120: 03e00008 jr ra
400124: 27bd0018 addiu sp,sp,24
400128: 3c031000 lui v1,0x1000
40012c: 8c630000 lw v1,0(v1)
400130: 08100046 j 400118 <main+0x18>
400134: 00000000 nop
Processes and the Kernel 13
...
004001d4 <write>:
4001d4: 08100060 j 400180 <__syscall>
4001d8: 24020006 li v0,6
004001dc <close>:
4001dc: 08100060 j 400180 <__syscall>
4001e0: 24020007 li v0,7
004001e4 <reboot>:
4001e4: 08100060 j 400180 <__syscall>
4001e8: 24020008 li v0,8
...
Processes and the Kernel 15
...
#define SYS_read 5
#define SYS_write 6
#define SYS_close 7
#define SYS_reboot 8
#define SYS_sync 9
#define SYS_sbrk 10
...
00400180 <__syscall>:
400180: 0000000c syscall
400184: 10e00005 beqz a3,40019c <__syscall+0x1c>
400188: 00000000 nop
40018c: 3c011000 lui at,0x1000
400190: ac220000 sw v0,0(at)
400194: 2403ffff li v1,-1
400198: 2402ffff li v0,-1
40019c: 03e00008 jr ra
4001a0: 00000000 nop
The system call and return processing, from the standard C library.
Like the rest of the library, this is unprivileged, user-level code.
Processes and the Kernel 17
exception:
move k1, sp /* Save previous stack pointer in k1 */
mfc0 k0, c0_status /* Get status register */
andi k0, k0, CST_KUp /* Check the we-were-in-user-mode bit */
beq k0, $0, 1f /* If clear, from kernel, already have stack */
nop /* delay slot */
/* Coming from user mode - load kernel stack into sp */
la k0, curkstack /* get address of "curkstack" */
lw sp, 0(k0) /* get its value */
nop /* delay slot for the load */
1:
mfc0 k0, c0_cause /* Now, load the exception cause. */
j common_exception /* Skip to common code */
nop /* delay slot */
[Figure: a thread's application stack and kernel stack, with the thread library and CPU registers]
Each OS/161 thread has two stacks, one that is used while the
thread is executing unprivileged application code, and another that
is used while the thread is executing privileged kernel code.
Processes and the Kernel 19
[Figure: the application's saved state in a trap frame on the thread's kernel stack, with the CPU registers holding kernel execution state]
While the kernel handles the system call, the application’s CPU
state is saved in a trap frame on the thread’s kernel stack, and the
CPU registers are available to hold kernel execution state.
Processes and the Kernel 21
• On the MIPS, the same exception handler is invoked to handle system calls,
exceptions and interrupts
• The hardware sets a code to indicate the reason (system call, exception, or
interrupt) that the exception handler has been invoked
default:
kprintf("Unknown syscall %d\n", callno);
err = ENOSYS;
break;
}
Processes and the Kernel 23
if (err) {
tf->tf_v0 = err;
tf->tf_a3 = 1; /* signal an error */
} else {
/* Success. */
tf->tf_v0 = retval;
tf->tf_a3 = 0; /* signal no error */
}
mips_syscall must ensure that the kernel adheres to the system
call return convention.
Exceptions
• Exceptions are another way that control is transferred from a process to the
kernel.
• Exceptions are conditions that occur during the execution of an instruction by
a process. For example, arithmetic overflows, illegal instructions, or page
faults (to be discussed later).
• Exceptions are detected by the hardware.
Exception handling is similar to, but not identical to, system call
handling. (What is different?)
Processes and the Kernel 25
MIPS Exceptions
EX_IRQ 0 /* Interrupt */
EX_MOD 1 /* TLB Modify (write to read-only page) */
EX_TLBL 2 /* TLB miss on load */
EX_TLBS 3 /* TLB miss on store */
EX_ADEL 4 /* Address error on load */
EX_ADES 5 /* Address error on store */
EX_IBE 6 /* Bus error on instruction fetch */
EX_DBE 7 /* Bus error on data load *or* store */
EX_SYS 8 /* Syscall */
EX_BP 9 /* Breakpoint */
EX_RI 10 /* Reserved (illegal) instruction */
EX_CPU 11 /* Coprocessor unusable */
EX_OVF 12 /* Arithmetic overflow */
Interrupts (Revisited)
Processes and the Kernel 27
• interrupts, exceptions and system calls are three mechanisms by which control
is transferred from an application program to the kernel
• when these events occur, the hardware switches the CPU into privileged mode
and transfers control to a predefined location, at which a kernel handler
should be located
• the handler saves the application thread context so that the kernel code can be
executed on the CPU, and restores the application thread context just before
control is returned to the application
Implementation of Processes
• The kernel maintains information about all of the processes in the system in a
data structure often called the process table.
Processes and the Kernel 29
Implementing Timesharing
• notice that these context switches always occur while a process’ thread is
executing kernel code
[Figure: physical memory holding the stack, data, and code of several processes]
Processes and the Kernel 31
[Figure: timesharing — process A's thread enters the kernel via a system call, exception, or interrupt; a context switch leaves A's thread ready (not running) while B's thread runs; a later context switch and kernel return make A's thread running again while B's is ready, not running]
Processes and the Kernel 33
Implementing Preemption
• the kernel uses interrupts from the system timer to measure the passage of
time and to determine whether the running process’s quantum has expired.
• a timer interrupt (like any other interrupt) transfers control from the running
program to the kernel.
• this gives the kernel the opportunity to preempt the running thread and
dispatch a new one.
[Figure: preemption — a timer interrupt transfers control from the running thread to the kernel, which may context switch to another ready thread before the interrupt return. Key: ready thread, running thread, context switches]
Processes and the Kernel 35
                 Linux                                  OS/161
Creation         fork, execv                            fork, execv
Destruction      exit, kill                             exit
Synchronization  wait, waitpid, pause, ...              waitpid
Attribute Mgmt   getpid, getuid, nice, getrusage, ...   getpid
Virtual Memory 1
• Programs use virtual addresses. As a program runs, the hardware (with help
from the operating system) converts each virtual address to a physical address.
• The conversion of a virtual address to a physical address is called address
translation.
Virtual Memory 2
• at run-time, the contents of the relocation register are added to each virtual
address to determine the corresponding physical address
• the OS maintains a separate relocation register value for each process, and
ensures that the relocation register is reset on each context switch
• Properties
– each virtual address space corresponds to a contiguous range of physical
addresses
– OS must allocate/deallocate variable-sized chunks of physical memory
– potential for external fragmentation of physical memory: wasted,
unallocated space
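Translation under dynamic relocation is a single addition plus a bounds check (the limit check is how the OS confines a process to its own space). A sketch with hypothetical names:

```c
#include <stdint.h>

/* per-process MMU state for dynamic relocation (names are ours) */
struct reloc_mmu {
    uint32_t base;    /* relocation register: start of the physical region */
    uint32_t limit;   /* size of the process's virtual address space */
};

/* translate a virtual address; returns the physical address,
 * or -1 to indicate an addressing exception */
int64_t reloc_translate(const struct reloc_mmu *m, uint32_t vaddr)
{
    if (vaddr >= m->limit)
        return -1;                      /* beyond the address space: fault */
    return (int64_t)m->base + vaddr;    /* add the relocation register */
}
```

Because every valid virtual address maps to base + vaddr, each address space necessarily occupies one contiguous range of physical memory, which is exactly what leads to the external fragmentation noted above.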
Virtual Memory 3
[Figure: dynamic relocation — Proc 1's virtual address space (size max1) occupies physical addresses A through A + max1, and Proc 2's virtual address space (size max2) occupies C through C + max2, within a physical memory of 2^m bytes]
Virtual Memory 4
[Figure: address translation with a relocation register — the register's value is added to each virtual address to produce an m-bit physical address]
Virtual Memory 5
• Each virtual address space is divided into fixed-size chunks called pages
• The physical address space is divided into frames. Frame size matches page
size.
• OS maintains a page table for each process. Page table specifies the frame in
which each of the process’s pages is located.
• At run time, MMU translates virtual addresses to physical using the page table
of the running process.
• Properties
– simple physical memory management
– potential for internal fragmentation of physical memory: wasted, allocated
space
– virtual address space need not be physically contiguous in physical space
after translation.
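The translation itself splits the virtual address into a virtual page number and an offset, then swaps the page number for a frame number. A sketch with 4 Kbyte pages and our own names (the valid and read-only checks anticipate the protection discussion on the next slides):

```c
#include <stdint.h>

#define PAGE_BITS 12                        /* 4 Kbyte pages */
#define PAGE_MASK ((1u << PAGE_BITS) - 1)

struct pte {
    uint32_t frame;                         /* physical frame number */
    unsigned valid : 1;                     /* is this mapping valid? */
    unsigned readonly : 1;                  /* may the page be written? */
};

/* example table: page 0 -> frame 5 (writable), page 1 -> frame 9 (read-only) */
static struct pte page_table[2] = { { 5, 1, 0 }, { 9, 1, 1 } };

/* translate vaddr through pt; returns physical address, or -1 (exception) */
int64_t page_translate(const struct pte *pt, uint32_t npages,
                       uint32_t vaddr, int is_write)
{
    uint32_t vpn = vaddr >> PAGE_BITS;      /* virtual page number */
    uint32_t offset = vaddr & PAGE_MASK;    /* offset within the page */
    if (vpn >= npages || !pt[vpn].valid)
        return -1;                          /* invalid virtual address */
    if (is_write && pt[vpn].readonly)
        return -1;                          /* protection violation */
    return ((int64_t)pt[vpn].frame << PAGE_BITS) | offset;
}
```

The offset passes through unchanged; only the page number is remapped, which is why pages need not be physically contiguous.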
Virtual Memory 6
[Figure: paging — Proc 1's and Proc 2's virtual address spaces (sizes max1 and max2) are mapped page-by-page into frames scattered through a physical memory of 2^m bytes]
Virtual Memory 7
Paging Mechanism
[Figure: paging mechanism — the page table base register locates the running process's page table; the virtual page number selects an entry, whose frame # is combined with the page offset to form an m-bit physical address]
Virtual Memory 8
Memory Protection
• during address translation, the MMU checks to ensure that the process uses
only valid virtual addresses
– typically, each PTE contains a valid bit which indicates whether that PTE
contains a valid page mapping
– the MMU may also check that the virtual page number does not index a
PTE beyond the end of the page table
• the MMU may also enforce other protection rules
– typically, each PTE contains a read-only bit that indicates whether the
corresponding page may be modified by the process
The kernel controls which pages are valid and which are protected
by setting the contents of PTEs and/or MMU registers.
Virtual Memory 9
• Kernel:
– save/restore MMU state on context switches
– create and manage page tables
– manage (allocate/deallocate) physical memory
– handle exceptions raised by the MMU
• MMU (hardware):
– translate virtual addresses to physical addresses
– check for and raise exceptions when necessary
Virtual Memory 10
Remaining Issues
sparseness: Many programs will only need a small part of the available space for
their code and data.
the kernel: Each process has a virtual address space in which to run. What about
the kernel? In which address space does it run?
Virtual Memory 11
• Execution of each machine instruction may involve one, two or more memory
operations
– one to fetch instruction
– one or more for instruction operands
• Address translation through a page table adds one extra memory operation
(for page table entry lookup) for each memory operation performed during
instruction execution
– Simple address translation through a page table can cut instruction
execution rate in half.
– More complex translation schemes (e.g., multi-level paging) are even
more expensive.
• Solution: include a Translation Lookaside Buffer (TLB) in the MMU
– TLB is a fast, fully associative address translation cache
– TLB hit avoids page table lookup
CS350 Operating Systems Fall 2011
Virtual Memory 12
TLB
• Each entry in the TLB contains a (page number, frame number) pair.
• Otherwise, translate through the page table, and add the resulting translation
to the TLB, replacing an existing entry if necessary. In a hardware controlled
TLB, this is done by the MMU. In a software controlled TLB, it is done by the
kernel.
• If the MMU cannot distinguish TLB entries from different address spaces,
then the kernel must clear or invalidate the TLB. (Why?)
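A real TLB is fully associative hardware that searches all entries in parallel; the loop below is only a software model of that search, in the style a kernel managing a software-controlled TLB might use. TLB_SIZE matches the 64-entry MIPS R3000 TLB; the round-robin victim choice is our simplification:

```c
#include <stdint.h>

#define TLB_SIZE 64       /* the MIPS R3000 TLB has 64 entries */

struct tlb_entry {
    uint32_t vpn;         /* virtual page number */
    uint32_t frame;       /* physical frame number */
    int valid;
};

static struct tlb_entry tlb[TLB_SIZE];
static unsigned victim;   /* next entry to replace */

/* hit: return the frame number; miss: return -1, and the caller
 * (MMU or kernel) must translate through the page table instead */
int64_t tlb_lookup(uint32_t vpn)
{
    for (int i = 0; i < TLB_SIZE; i++)
        if (tlb[i].valid && tlb[i].vpn == vpn)
            return tlb[i].frame;          /* hit: page table lookup avoided */
    return -1;                            /* miss */
}

/* after a miss, add the translation, replacing an entry if necessary */
void tlb_insert(uint32_t vpn, uint32_t frame)
{
    tlb[victim].vpn = vpn;
    tlb[victim].frame = frame;
    tlb[victim].valid = 1;
    victim = (victim + 1) % TLB_SIZE;     /* simple replacement policy */
}
```

In a hardware-controlled TLB the insert is done by the MMU; in a software-controlled TLB (as on the MIPS) the kernel does it in its TLB-fault handler.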
Virtual Memory 13
• Each TLB entry includes a virtual page number, a physical frame number, an
address space identifier (not used by OS/161), and several flags (valid,
read-only).
See kern/arch/mips/include/tlb.h
Virtual Memory 14
[Figure: layout of the virtual address space from 0x00000000 to 0xffffffff, with arrows showing the directions of segment growth]
This diagram illustrates the layout of the virtual address space for
the OS/161 test application testbin/sort
Virtual Memory 15
[Figure: layout of the virtual address space from 0x00000000 to 0xffffffff, with arrows showing the directions of segment growth]
• Consider the page table for testbin/sort, assuming a 4 Kbyte page size:
– need 2^19 page table entries (PTEs) to cover the bottom half of the virtual
address space.
– the text segment occupies 2 pages, the data segment occupies 289 pages,
and OS/161 sets the initial stack size to 12 pages
• The kernel will mark a PTE as invalid if its page is not mapped.
• In the page table for testbin/sort, only 303 of the 2^19 PTEs will be valid.
Virtual Memory 16
Segmentation
• Often, programs (like sort) need several virtual address segments, e.g., for
code, data, and stack.
• One way to support this is to turn segments into first-class citizens, understood
by the application and directly supported by the OS and the MMU.
Virtual Memory 17
[Figure: segmentation — Proc 1's segments 0, 1, and 2 and Proc 2's segment 0 are placed at various locations in a physical memory of 2^m bytes]
Virtual Memory 18
[Figure: segmented address translation — a v-bit virtual address is split into a segment # and an offset; the segment table base register (m bits) locates the segment table, each entry of which records a segment's start, length, and protection; the offset is added to the start to produce the physical address]
Virtual Memory 19
[Figure: another placement of Proc 1's segments 0–2 and Proc 2's segment 0 in a physical memory of 2^m bytes]
Virtual Memory 20
[Figure: combined segmentation and paging — the segment table base register (m bits) locates the segment table; each entry records a segment's length, protection, and the page table used to translate offsets within that segment]
Virtual Memory 21
Virtual Memory 22
• the MIPS MMU tries to translate each virtual address using the entries in the
TLB
• If there is no valid entry for the page the MMU is trying to translate, the
MMU generates a TLB fault (called an address exception)
• On return from exception, the MIPS retries the instruction that caused the
exception. This time, it may succeed.
Virtual Memory 23
• virtual memory sharing allows parts of two or more address spaces to overlap
Virtual Memory 24
[Figure: shared virtual memory — parts of Proc 1's and Proc 2's virtual address spaces (sizes max1 and max2) map to the same region of a physical memory of 2^m bytes]
Virtual Memory 25
[Figure: sharing with segmentation — one segment of Proc 1 and one segment of Proc 2 are marked (shared) and map to the same region of a physical memory of 2^m bytes]
Virtual Memory 26
• Each process has its own address space. What about the kernel?
• Two possibilities
Kernel in physical space: disable address translation in privileged system
execution mode, enable it in unprivileged mode
Kernel in separate virtual address space: need a way to change address
translation (e.g., switch page tables) when moving between privileged and
unprivileged code
• OS/161, Linux, and other operating systems use a third approach: the kernel
is mapped into a portion of the virtual address space of every process
• memory protection mechanism is used to isolate the kernel from applications
• one advantage of this approach: application virtual addresses (e.g., system call
parameters) are easy for the kernel to use
Virtual Memory 27
[Figure: the kernel mapped into a portion of both Process 1's and Process 2's address spaces]
Virtual Memory 28
[Figure: MIPS virtual address space — kuseg (TLB mapped) in the lower half, kseg0 starting at 0x80000000, kseg1 starting at 0xa0000000, and kseg2 starting at 0xc0000000]
In OS/161, user programs live in kuseg, kernel code and data struc-
tures live in kseg0, devices are accessed through kseg1, and kseg2
is not used.
Virtual Memory 29
• When the kernel creates a process to run a particular program, it must create
an address space for the process, and load the program’s code and data into
that address space
Virtual Memory 30
ELF Files
• ELF files contain address space segment descriptions, which are useful to the
kernel when it is loading a new address space
• the ELF file identifies the (virtual) address of the program’s first instruction
• the ELF file also contains lots of other information (e.g., section descriptors,
symbol tables) that is useful to compilers, linkers, debuggers, loaders and
other tools used to build programs
Virtual Memory 31
• Each ELF segment describes a contiguous region of the virtual address space.
• For each segment, the ELF file includes a segment image and a header, which
describes:
– the virtual address of the start of the segment
– the length of the segment in the virtual address space
– the location of the start of the image in the ELF file
– the length of the image in the ELF file
• the image is an exact copy of the binary data that should be loaded into the
specified portion of the virtual address space
• the image may be smaller than the address space segment, in which case the
rest of the address space segment is expected to be zero-filled
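That loading rule — copy the image, zero the rest — is a few lines of C. This is a sketch of just the copy step; a real kernel must first allocate and map the pages of the region:

```c
#include <stddef.h>
#include <string.h>

/* Load one ELF segment: the image (filesz bytes) is copied to the
 * start of the memsz-byte region, and the remainder is zero-filled.
 * The zero-filled tail is where uninitialized data (.bss) ends up. */
void load_segment(unsigned char *region, const unsigned char *image,
                  size_t filesz, size_t memsz)
{
    memcpy(region, image, filesz);
    memset(region + filesz, 0, memsz - filesz);
}

/* small self-check: 3-byte image loaded into a 6-byte region */
int load_demo(void)
{
    const unsigned char image[3] = { 1, 2, 3 };
    unsigned char region[6] = { 9, 9, 9, 9, 9, 9 };
    load_segment(region, image, 3, 6);
    return region[0] == 1 && region[2] == 3 &&
           region[3] == 0 && region[5] == 0;
}
```

In ELF terms, filesz and memsz correspond to the FileSiz and MemSiz columns of the program headers shown later; the second LOAD segment of the example program has MemSiz larger than FileSiz for exactly this reason.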
Virtual Memory 32
• the ELF file does not describe the stack (why not?)
• dumbvm creates a stack segment for each process. It is 12 pages long, ending
at virtual address 0x7fffffff
Virtual Memory 33
• In the ELF file, a program’s code and data are grouped together into sections,
based on their properties. Some sections:
.text: program code
.rodata: read-only global data
.data: initialized global data
.bss: uninitialized global data (Block Started by Symbol)
.sbss: small uninitialized global data
• space for local program variables is allocated on the stack when the program
runs
Virtual Memory 34
#include <unistd.h>
#define N (200)
int x = 0xdeadbeef;
int y1;
int y2;
int y3;
int array[4096];
char const *str = "Hello World\n";
const int z = 0xabcddcba;
struct example {
int ypos;
int xpos;
};
Virtual Memory 35
int
main()
{
    int count = 0;
    const int value = 1;
    y1 = N;
    y2 = 2;
    count = x + y1;
    y2 = z + y2 + value;
    reboot(RB_POWEROFF);
    return 0; /* avoid compiler warnings */
}
Virtual Memory 36
Section Headers:
[Nr] Name Type Addr Off Size ES Flg
[ 0] NULL 00000000 000000 000000 00
[ 1] .reginfo MIPS_REGINFO 00400094 000094 000018 18 A
[ 2] .text PROGBITS 004000b0 0000b0 000200 00 AX
[ 3] .rodata PROGBITS 004002b0 0002b0 000020 00 A
[ 4] .data PROGBITS 10000000 001000 000010 00 WA
[ 5] .sbss NOBITS 10000010 001010 000014 00 WAp
[ 6] .bss NOBITS 10000030 00101c 004000 00 WA
[ 7] .comment PROGBITS 00000000 00101c 000036 00
...
Flags: W (write), A (alloc), X (execute), p (processor specific)
Virtual Memory 37
Program Headers:
Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
REGINFO 0x000094 0x00400094 0x00400094 0x00018 0x00018 R 0x4
LOAD 0x000000 0x00400000 0x00400000 0x002d0 0x002d0 R E 0x1000
LOAD 0x001000 0x10000000 0x10000000 0x00010 0x04030 RW 0x1000
• segment info, like section info, can be inspected using the cs350-readelf
program
• the REGINFO section is not used
• the first LOAD segment includes the .text and .rodata sections
• the second LOAD segment includes .data, .sbss, and .bss
Virtual Memory 38
Virtual Memory 39
The .rodata section contains the “Hello World” string literal and the
constant integer variable z.
Virtual Memory 40
...
## Size = 0x10 bytes = 16 bytes
## int x = 0xdeadbeef (4 bytes)
## char const *str = "Hello World\n"; (4 bytes)
## value stored in str = 0x004002b0.
## NOTE: this is the address of the start
## of the string literal in the .rodata section
The .data section contains the initialized global variables str and x.
Virtual Memory 41
...
10000010 A __bss_start
10000010 A _edata
10000010 A _fbss
10000010 S y3 ## S indicates sbss section
10000014 S y2
10000018 S y1
1000001c S errno
10000020 S __argv
...
10000030 B array ## B indicates bss section
10004030 A _end
The y1, y2, and y3 variables are in the .sbss section. The
array variable is in the .bss section. There are no values
for these variables in the ELF file, as they are uninitialized.
The cs350-nm program can be used to inspect symbols de-
fined in ELF files: cs350-nm -n <filename>, in this case
cs350-nm -n segments.
Virtual Memory 42
Virtual Memory 43
Goals:
• Allow virtual address spaces that are larger than the physical address space.
• Allow greater multiprogramming levels by using less of the available
(primary) memory for each process.
Method:
• Allow pages (or segments) from the virtual address space to be stored in
secondary memory, as well as primary memory.
• Move pages (or segments) between secondary and primary memory so that
they are in primary memory when they are needed.
Virtual Memory 44
[Diagram: the memory hierarchy - L1 cache (~10^4 bytes), L2 cache
(~10^6 bytes), primary memory (~10^8 to 10^9 bytes), secondary
memory / disk (~10^12 bytes)]
76
Virtual Memory 45
• Virtual memory allows for very large virtual address spaces, and very large
virtual address spaces require large page tables.
• example: 2^48 byte virtual address space, 8 Kbyte (2^13 byte) pages, 4 byte
(2^2 byte) page table entries means
(2^48 / 2^13) × 2^2 = 2^37 bytes per page table
• page tables for large address spaces may be very large, and
– they must be in memory, and
– they must be physically contiguous
• some solutions:
– multi-level page tables - page the page tables
– inverted page tables
Virtual Memory 46
Two-Level Paging
[Diagram: two-level paging - the page table base register points to the
level-1 page table; each level-1 entry points to one of the level-2
page tables, whose entries map the pages themselves]
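As a sketch of how a two-level lookup proceeds, here is a minimal illustration in C. The field widths (10-bit level-1 index, 10-bit level-2 index, 12-bit offset for 4 KB pages) are hypothetical and not the layout of any particular MMU; an entry value of 0 stands in for "invalid".

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical 32-bit split: 10-bit level-1 index, 10-bit level-2
 * index, 12-bit offset (4 KB pages). */
#define L1_BITS 10
#define L2_BITS 10
#define OFFSET_BITS 12

typedef struct {
    uint32_t *l2[1 << L1_BITS]; /* level-2 tables, allocated on demand */
} pagetable_t;

/* Translate vaddr to a physical address, or return 0 on a "page fault"
 * (no level-2 table, or an invalid entry). Frame numbers are stored
 * directly in level-2 entries in this sketch. */
uint32_t translate(pagetable_t *pt, uint32_t vaddr) {
    uint32_t l1 = vaddr >> (L2_BITS + OFFSET_BITS);
    uint32_t l2 = (vaddr >> OFFSET_BITS) & ((1 << L2_BITS) - 1);
    uint32_t off = vaddr & ((1 << OFFSET_BITS) - 1);
    if (pt->l2[l1] == NULL || pt->l2[l1][l2] == 0)
        return 0; /* page fault */
    return (pt->l2[l1][l2] << OFFSET_BITS) | off;
}

void map(pagetable_t *pt, uint32_t vaddr, uint32_t frame) {
    uint32_t l1 = vaddr >> (L2_BITS + OFFSET_BITS);
    uint32_t l2 = (vaddr >> OFFSET_BITS) & ((1 << L2_BITS) - 1);
    if (pt->l2[l1] == NULL)
        pt->l2[l1] = calloc(1 << L2_BITS, sizeof(uint32_t));
    pt->l2[l1][l2] = frame;
}
```

Note that only level-2 tables for regions actually in use are allocated, which is exactly why multi-level tables avoid the "very large, physically contiguous" problem above.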
77
Virtual Memory 47
• A normal page table maps virtual pages to physical frames. An inverted page
table maps physical frames to virtual pages.
• Other key differences between normal and inverted page tables:
– there is only one inverted page table, not one table per process
– entries in an inverted page table must include a process identifier
• An inverted page table only specifies the location of virtual pages that are
located in memory. Some other mechanism (e.g., regular page tables) must be
used to locate pages that are not in memory.
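A minimal sketch of the core inverted-table lookup, with hypothetical sizes: one entry per physical frame records which (process, virtual page) occupies it. A real inverted table would hash on (pid, vpn) rather than scan linearly, and a miss would fall back to the per-process structures mentioned above.

```c
#include <assert.h>
#include <stdint.h>

#define NFRAMES 8
#define INVALID_PID -1

/* One entry per physical frame: which (pid, virtual page number)
 * currently occupies it. */
struct ipt_entry {
    int pid;
    uint32_t vpn;
};

static struct ipt_entry ipt[NFRAMES];

void ipt_init(void) {
    for (int f = 0; f < NFRAMES; f++)
        ipt[f].pid = INVALID_PID;
}

/* Return the frame holding (pid, vpn), or -1 if the page is not
 * resident (the OS would then consult some other mechanism). */
int ipt_lookup(int pid, uint32_t vpn) {
    for (int f = 0; f < NFRAMES; f++)
        if (ipt[f].pid == pid && ipt[f].vpn == vpn)
            return f;
    return -1;
}
```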
Virtual Memory 48
Paging Policies
When to Page?:
Demand paging brings pages into memory when they are used. Alternatively,
the OS can attempt to guess which pages will be used, and prefetch them.
What to Replace?:
Unless there are unused frames, one page must be replaced for each page that
is loaded into memory. A replacement policy specifies how to determine
which page to replace.
78
Virtual Memory 49
• When the system’s page reference string is generated by more than one
process, should the replacement policy take this into account?
Global Policy: A global policy is applied to all in-memory pages, regardless
of the process to which each one “belongs”. A page requested by process
X may replace a page that belongs to another process, Y.
Local Policy: Under a local policy, the available frames are allocated to
processes according to some memory allocation policy. A replacement
policy is then applied separately to each process’s allocated space. A page
requested by process X replaces another page that “belongs” to process X.
Virtual Memory 50
Paging Mechanism
• A valid bit (V ) in each page table entry is used to track which pages are in
(primary) memory, and which are not.
V = 1: valid entry which can be used for translation
V = 0: invalid entry. If the MMU encounters an invalid page table entry, it
raises a page fault exception.
• To handle a page fault exception, the operating system must:
– Determine which page table entry caused the exception. (In SYS/161, and
in real MIPS processors, the MMU puts the offending virtual address into a
register on the CP0 co-processor (register 8, c0_vaddr/BadVaddr). The
kernel can read that register.)
– Ensure that that page is brought into memory.
On return from the exception handler, the instruction that resulted in the page
fault will be retried.
• If (pure) segmentation is being used, there will be a valid bit in each segment
table entry to indicate whether the segment is in memory.
CS350 Operating Systems Fall 2011
79
Virtual Memory 51
• the FIFO policy: replace the page that has been in memory the longest
• a three-frame example:
Num      1  2  3  4  5  6  7  8  9 10 11 12
Refs     a  b  c  d  a  b  e  a  b  c  d  e
Frame 1  a  a  a  d  d  d  e  e  e  e  e  e
Frame 2     b  b  b  a  a  a  a  a  c  c  c
Frame 3        c  c  c  b  b  b  b  b  d  d
Fault?   x  x  x  x  x  x  x        x  x
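The FIFO example above can be checked with a short simulation; a minimal sketch (page names as single characters) that reproduces the 9 faults in the three-frame table:

```c
#include <assert.h>
#include <string.h>

/* Count page faults under FIFO replacement: frames form a circular
 * buffer, so the slot at the pointer always holds the page that has
 * been in memory the longest. */
int fifo_faults(const char *refs, int nframes) {
    char frames[16];
    int next = 0, faults = 0;
    memset(frames, 0, sizeof(frames));
    for (int i = 0; refs[i] != '\0'; i++) {
        int hit = 0;
        for (int f = 0; f < nframes; f++)
            if (frames[f] == refs[i]) { hit = 1; break; }
        if (!hit) {
            frames[next] = refs[i];      /* replace the oldest page */
            next = (next + 1) % nframes; /* advance the FIFO pointer */
            faults++;
        }
    }
    return faults;
}
```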
Virtual Memory 52
The same reference string under the optimal (MIN) policy, which replaces
the page whose next use is furthest in the future:
Num      1  2  3  4  5  6  7  8  9 10 11 12
Refs     a  b  c  d  a  b  e  a  b  c  d  e
Frame 1  a  a  a  a  a  a  a  a  a  c  c  c
Frame 2     b  b  b  b  b  b  b  b  b  d  d
Frame 3        c  d  d  d  e  e  e  e  e  e
Fault?   x  x  x  x        x        x  x
80
Virtual Memory 53
Virtual Memory 54
Locality
• Temporal locality says that pages that have been used recently are likely to be
used again.
• Spatial locality says that pages “close” to those that have been used are likely
to be used next.
81
Virtual Memory 55
• Counting references to pages can be used as the basis for page replacement
decisions.
• Example: LFU (Least Frequently Used)
Replace the page with the smallest reference count.
Virtual Memory 56
• LRU is based on the principle of temporal locality: replace the page that has
not been used for the longest time
• To implement LRU, it is necessary to track each page’s recency of use. For
example: maintain a list of in-memory pages, and move a page to the front of
the list when it is used.
• Although LRU and variants have many applications, LRU is often considered
to be impractical for use as a replacement policy in virtual memory systems.
Why?
82
Virtual Memory 57
The same reference string under LRU:
Num      1  2  3  4  5  6  7  8  9 10 11 12
Refs     a  b  c  d  a  b  e  a  b  c  d  e
Frame 1  a  a  a  d  d  d  e  e  e  c  c  c
Frame 2     b  b  b  a  a  a  a  a  a  d  d
Frame 3        c  c  c  b  b  b  b  b  b  e
Fault?   x  x  x  x  x  x  x        x  x  x
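The LRU example can likewise be checked by simulation. A sketch that keeps the frames ordered from most- to least-recently used, so the victim is always the last entry (10 faults on this string):

```c
#include <assert.h>
#include <string.h>

/* Count page faults under LRU. frames[0] is the most recently used
 * page; frames[used-1] is the least recently used. */
int lru_faults(const char *refs, int nframes) {
    char frames[16];
    int used = 0, faults = 0;
    for (int i = 0; refs[i] != '\0'; i++) {
        int pos = -1;
        for (int f = 0; f < used; f++)
            if (frames[f] == refs[i]) { pos = f; break; }
        if (pos < 0) {                 /* page fault */
            faults++;
            if (used < nframes) used++;
            pos = used - 1;            /* victim slot: the LRU page */
        }
        /* move the referenced page to the front (MRU position) */
        char page = refs[i];
        memmove(&frames[1], &frames[0], pos * sizeof(char));
        frames[0] = page;
    }
    return faults;
}
```

The list-maintenance on every reference is exactly the cost that makes true LRU impractical for virtual memory, as the previous slide asks.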
Virtual Memory 58
• A use bit (or reference bit) is a bit found in each TLB entry that:
– is set by the MMU each time the page is used, i.e., each time the MMU
translates a virtual address on that page
– can be read and modified by the operating system
– operating system copies use information into page table
83
Virtual Memory 59
• the kernel can emulate the “use” bit, at the cost of extra exceptions
1. When a page is loaded into memory, mark it as invalid (even though it has
been loaded) and set its simulated “use” bit to false.
2. If a program attempts to access the page, an exception will occur.
3. In its exception handler, the OS sets the page’s simulated “use” bit to
“true” and marks the page valid so that further accesses do not cause
exceptions.
• This technique requires that the OS maintain extra bits of information for each
page:
1. the simulated “use” bit
2. an “in memory” bit to indicate whether the page is in memory
Virtual Memory 60
• The clock algorithm (also known as “second chance”) is one of the simplest
algorithms that exploits the use bit.
• Clock is identical to FIFO, except that a page is “skipped” if its use bit is set.
• The clock algorithm can be visualized as a victim pointer that cycles through
the page frames. The pointer moves whenever a replacement is necessary:
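A minimal sketch of the victim pointer's movement, with a small fixed number of frames and simplified state (one use bit per frame, pages named by single characters):

```c
#include <assert.h>

#define NFRAMES 4

static char frame_page[NFRAMES];
static int use_bit[NFRAMES];
static int victim = 0; /* the clock hand */

/* Reference a page; return 1 on a page fault, 0 on a hit. On a hit
 * the use bit is set. On a fault the hand sweeps forward, clearing
 * use bits and skipping used frames, until it finds a victim. */
int clock_ref(char page) {
    for (int f = 0; f < NFRAMES; f++)
        if (frame_page[f] == page) { use_bit[f] = 1; return 0; }
    while (use_bit[victim]) {
        use_bit[victim] = 0;               /* give a second chance */
        victim = (victim + 1) % NFRAMES;
    }
    frame_page[victim] = page;
    use_bit[victim] = 1;
    victim = (victim + 1) % NFRAMES;
    return 1;
}
```

If every use bit is set, the hand sweeps all the way around, clearing bits as it goes, and then replaces the first frame it cleared - which is exactly FIFO behaviour, as the slide notes.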
84
Virtual Memory 61
• A page is modified (sometimes called dirty) if it has been changed since it was
loaded into memory.
• Operating system clears the modified bit when it cleans the page
• The modified bit potentially has two roles:
– Indicates which pages need to be cleaned.
– Can be used to influence the replacement policy.
Virtual Memory 62
• This technique requires that the OS maintain two extra bits of information for
each page:
1. the simulated “modified” bit
2. a “writeable” bit to indicate whether the page is supposed to be writeable
85
Virtual Memory 63
Virtual Memory 64
Page Cleaning
86
Virtual Memory 65
Belady’s Anomaly
Virtual Memory 66
Stack Policies
• Let B(m, t) represent the set of pages in a memory of size m at time t under
some given replacement policy, for some given reference string.
• A replacement policy is called a stack policy if, for all reference strings, all m
and all t:
B(m, t) ⊆ B(m + 1, t)
• Examples: LRU is a stack algorithm. FIFO and CLOCK are not stack
algorithms. (Why?)
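Because FIFO is not a stack policy, it can exhibit Belady's anomaly: adding frames can increase the number of faults. A small simulation demonstrates this on the classic reference string 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5, which faults 9 times with 3 frames but 10 times with 4:

```c
#include <assert.h>

/* FIFO fault count over an integer reference string: a circular
 * buffer of frames, the hand always pointing at the oldest page. */
int fifo_faults(const int *refs, int n, int nframes) {
    int frames[8] = {0}, next = 0, faults = 0;
    for (int i = 0; i < n; i++) {
        int hit = 0;
        for (int f = 0; f < nframes; f++)
            if (frames[f] == refs[i]) { hit = 1; break; }
        if (!hit) {
            frames[next] = refs[i];
            next = (next + 1) % nframes;
            faults++;
        }
    }
    return faults;
}
```

For a stack policy, B(m, t) ⊆ B(m + 1, t) guarantees the larger memory holds everything the smaller one does, so extra frames can never hurt; FIFO gives no such guarantee.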
87
Virtual Memory 67
Prefetching
• Prefetching means moving virtual pages into memory before they are needed,
i.e., before a page fault results.
• The goal of prefetching is latency hiding: do the work of bringing a page into
memory in advance, not while a process is waiting.
• To prefetch, the operating system must guess which pages will be needed.
• Hazards of prefetching:
– guessing wrong means the work that was done to prefetch the page was
wasted
– guessing wrong means that some other potentially useful page has been
replaced by a page that is not used
Virtual Memory 68
Page Size
• the virtual memory page size must be understood by both the kernel and the
MMU
88
Virtual Memory 69
• The resident set of a process is the set of pages that are located in memory.
Virtual Memory 70
89
Virtual Memory 71
• Define |WS(t, ∆)| to be the size of WS(t, ∆), i.e., the number of distinct
pages referenced by the process.
• If the operating system could track WS(t, ∆), it could:
– use |WS(t, ∆)| to determine the number of frames to allocate to the
process under a local page replacement policy
– use WS(t, ∆) directly to implement a working-set based page
replacement policy: any page that is no longer in the working set is a
candidate for replacement
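A minimal sketch of computing |WS(t, ∆)| from a reference string: count the distinct pages among the ∆ most recent references up to time t (times are 1-based here; the window definition is an illustrative simplification).

```c
#include <assert.h>

/* |WS(t, delta)|: number of distinct pages among the delta most
 * recent references up to and including time t (1-based). */
int ws_size(const int *refs, int t, int delta) {
    int start = t - delta;          /* window covers times (t-delta, t] */
    if (start < 0) start = 0;
    int distinct = 0;
    for (int i = start; i < t; i++) {
        int seen = 0;
        for (int j = start; j < i; j++)
            if (refs[j] == refs[i]) { seen = 1; break; }
        if (!seen) distinct++;      /* first occurrence in the window */
    }
    return distinct;
}
```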
Virtual Memory 72
• The working set model suggests that a page fault frequency plot should have a
sharp “knee”.
90
Virtual Memory 73
[Graph: page fault frequency vs. number of frames allocated to the
process, falling from high at few frames to low at many frames, with
high and low thresholds marked]
Virtual Memory 74
91
Virtual Memory 75
• Swapping a process out means removing all of its pages from memory, or
marking them so that they will be removed by the normal page replacement
process. Suspending a process ensures that it is not runnable while it is
swapped out.
• Which process(es) to suspend?
– low priority processes
– blocked processes
– large processes (lots of space freed) or small processes (easier to reload)
• There must also be a policy for making suspended processes ready when
system load has decreased.
92
Processor Scheduling 1
• A running thread can be modeled as an alternating series of CPU bursts and
I/O bursts
– during a CPU burst, a thread is executing instructions
– during an I/O burst, a thread is waiting for an I/O operation to be
performed and is not executing instructions
Processor Scheduling 2
• A non-preemptive scheduler runs only when the running thread gives up the
processor through its own actions, e.g.,
– the thread terminates
– the thread blocks because of an I/O or synchronization operation
– the thread performs a Yield system call (if one is provided by the operating
system)
• A preemptive scheduler may, in addition, force a running thread to stop
running
– typically, a preemptive scheduler will be invoked periodically by a timer
interrupt handler, as well as in the circumstances listed above
– a running thread that is preempted is moved to the ready state
93
Processor Scheduling 3
Processor Scheduling 4
• non-preemptive
• ready threads are scheduled according to the length of their next CPU burst -
the thread with the shortest burst goes first
• the length of the next burst can be predicted by exponential averaging of
previous bursts:
B_{i+1} = α b_i + (1 − α) B_i
where B_i is the predicted length of the ith CPU burst, and b_i is its actual
length, and 0 ≤ α ≤ 1.
• Shortest Remaining Time First is a preemptive variant of SJF. Preemption
may occur when a new thread enters the ready queue.
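The exponential-averaging predictor B_{i+1} = α b_i + (1 − α) B_i can be sketched in a few lines; α and the initial guess B_0 below are arbitrary illustrative values:

```c
#include <assert.h>
#include <math.h>

/* Exponential averaging of CPU burst lengths:
 *   B[i+1] = alpha * b[i] + (1 - alpha) * B[i]
 * Larger alpha weights recent bursts more heavily; alpha = 0 ignores
 * history entirely and keeps the initial guess. */
double predict_next_burst(const double *actual, int n,
                          double b0, double alpha) {
    double pred = b0; /* initial prediction B[0] */
    for (int i = 0; i < n; i++)
        pred = alpha * actual[i] + (1.0 - alpha) * pred;
    return pred;
}
```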
94
Processor Scheduling 5
[Gantt chart (FCFS example): execution of Pa, Pb, Pc, Pd over time 0 to 20]
Initial ready queue: Pa = 5 Pb = 8 Pc = 3
Thread Pd (=2) "arrives" at time 5
Processor Scheduling 6
[Gantt chart (round robin example): execution of Pa, Pb, Pc, Pd over time 0 to 20]
Initial ready queue: Pa = 5 Pb = 8 Pc = 3
Thread Pd (=2) "arrives" at time 5 Quantum = 2
95
Processor Scheduling 7
SJF Example
[Gantt chart: SJF schedule of Pa, Pb, Pc, Pd over time 0 to 20]
Initial ready queue: Pa = 5 Pb = 8 Pc = 3
Thread Pd (=2) "arrives" at time 5
Processor Scheduling 8
SRTF Example
[Gantt chart: SRTF schedule of Pa, Pb, Pc, Pd over time 0 to 20]
Initial ready queue: Pa = 5 Pb = 8 Pc = 3
Thread Pd (=2) "arrives" at time 5
96
Processor Scheduling 9
• non-preemptive
Processor Scheduling 10
HRRN Example
[Gantt chart: HRRN schedule of Pa, Pb, Pc, Pd over time 0 to 20]
Initial ready queue: Pa = 5 Pb = 8 Pc = 3
Thread Pd (=4) "arrives" at time 5
97
Processor Scheduling 11
Prioritization
Processor Scheduling 12
• scheduler never chooses a thread in queue i if there are threads in any queue
j < i.
• threads in queue i use quantum q_i, and q_i < q_j if i < j
98
Processor Scheduling 13
[State diagram: multi-level feedback queues - a thread is dispatched
from ready(i) to run(i); if preempted in run(i) it moves down to
ready(i+1); if it blocks it enters the blocked state, and on unblocking
it returns to ready(0)]
Processor Scheduling 14
Suspending Processes
99
Processor Scheduling 15
[State diagram: ready, running, and blocked states plus suspended/ready
and suspended/blocked - suspend moves a ready or running thread to
suspended/ready and a blocked thread to suspended/blocked; resume moves
it back; dispatch moves ready to running, and quantum expiry moves
running back to ready]
100
I/O 1
• network interface
• graphics adapter
• co-processors
• ...
I/O 2
[Diagram: bus architecture - the CPU and its cache connect through a
bridge to memory and to the PCI bus; SCSI, USB, and graphics
controllers sit on the PCI bus; a second bridge connects the PCI bus
to the ISA bus]
101
I/O 3
[Diagram: a simple bus - the CPU, memory (M), and several device
controllers (K) attached to a shared bus]
I/O 4
• LAMEbus controller
• timer/clock - current time, timer, beep
102
I/O 5
Device Interactions
• device registers
– command, status, and data registers
– CPU accesses register via:
∗ special I/O instructions
∗ memory mapping
• interrupts
– used by device for asynchronous notification (e.g., of request completion)
– handled by interrupt handlers in the operating system
I/O 6
103
I/O 7
I/O 8
[Diagram: LAMEbus physical address range 0x1fe00000 to 0x1fffffff,
divided into 64 KB device "slots"]
104
I/O 9
I/O 10
105
I/O 11
/* FROM: kern/arch/mips/mips/lamebus_mips.c */
/* Read 32-bit register from a LAMEbus device. */
u_int32_t
lamebus_read_register(struct lamebus_softc *bus,
                      int slot, u_int32_t offset)
{
    u_int32_t *ptr = lamebus_map_area(bus, slot, offset);
    return *ptr;
}
I/O 12
• Devices may have data buffers for such data - but how to get the data between
the device and memory?
• If the data buffer is memory-mapped, the kernel can move the data iteratively,
one word at a time. This is called program-controlled I/O.
• Program controlled I/O is simple, but it means that the CPU is busy executing
kernel code while the data is being transferred.
• The alternative is called Direct Memory Access (DMA). During a DMA data
transfer, the CPU is not busy and is free to do something else, e.g., run an
application.
106
I/O 13
• DMA is used for block data transfers between devices (e.g., a disk controller)
and memory
• Under DMA, the CPU initiates the data transfer and is notified when the
transfer is finished. However, the device (not the CPU) controls the transfer
itself.
[Diagram: DMA - (1) the CPU issues the transfer request to the device
controller (K, e.g., a disk controller), (2) the controller transfers
the data directly to or from memory (M), (3) the controller interrupts
the CPU when the transfer is complete]
I/O 14
107
I/O 15
• storage is non-volatile, i.e., data persists even when the device is without
power
I/O 16
[Diagram: a disk platter surface, showing tracks and sectors]
108
I/O 17
[Diagram: a disk with multiple platters on a common shaft, showing
tracks, sectors, and a cylinder (the same track on every platter)]
I/O 18
• request service time is the sum of seek time, rotational latency, and transfer
time
t_service = t_seek + t_rot + t_transfer
• note that there are other overheads but they are typically small relative to these
three
109
I/O 19
I/O 20
Seek Time
• seek time depends on the speed of the arm on which the read/write heads are
mounted.
• a simple linear seek time model:
t_seek(n) = (n / C) * t_maxseek
– t_maxseek is the time required to move the read/write heads from the
innermost cylinder to the outermost cylinder
– C is the total number of cylinders
– n is the number of cylinders the heads must traverse
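Under a linear model, a seek across n cylinders takes a proportional share of the full-stroke time. A minimal sketch (assuming pure proportionality; real disks also have a fixed startup and settle cost that this ignores):

```c
#include <assert.h>
#include <math.h>

/* Linear seek model: moving across n of the C cylinders takes a
 * proportional fraction of the full-stroke time t_maxseek. */
double seek_time(int n, int cylinders, double t_maxseek) {
    return ((double)n / cylinders) * t_maxseek;
}
```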
110
I/O 21
• larger transfers to/from a disk device are more efficient than smaller ones.
That is, the cost (time) per byte is smaller for larger transfers. (Why?)
I/O 22
• goal: reduce seek times by controlling the order in which requests are serviced
111
I/O 23
[Diagram: disk head movement servicing a request queue in arrival order]
I/O 24
[Diagram: disk head movement servicing the closest pending request first]
112
I/O 25
• Under LOOK, the disk head moves in one direction until there are no more
requests in front of it, then reverses direction.
• seek time reduction without starvation
• SCAN is like LOOK, except the read/write heads always move all the way to
the edge of the disk in each direction.
I/O 26
SCAN Example
[Diagram: disk head movement under SCAN, sweeping to each edge of the disk]
113
I/O 27
I/O 28
C-LOOK Example
[Diagram: disk head movement under C-LOOK, servicing in one direction only]
114
File Systems 1
File Systems 2
File Interface
• open, close
– open returns a file identifier (or handle or descriptor), which is used in
subsequent operations to identify the file. (Why is this done?)
• read, write
– must specify which file to read, which part of the file to read, and where to
put the data that has been read (similar for write).
– often, file position is implicit (why?)
• seek
115
File Systems 3
File Read
[Diagram: a file read - length bytes starting at the (implicit)
fileoffset in the file are copied into the process's virtual address
space starting at vaddr]
File Systems 4
File Position
• may be associated with the file, with a process, or with a file descriptor (Unix
style)
116
File Systems 5
char buf[512];
int i;
int f = open("myfile",O_RDONLY);
for(i=0; i<100; i++) {
    read(f,(void *)buf,512);
}
close(f);
Read the first 100 ∗ 512 bytes of a file, 512 bytes at a time.
File Systems 6
char buf[512];
int i;
int f = open("myfile",O_RDONLY);
for(i=1; i<=100; i++) {
    lseek(f,(100-i)*512,SEEK_SET);
    read(f,(void *)buf,512);
}
close(f);
Read the first 100 ∗ 512 bytes of a file, 512 bytes at a time, in
reverse order.
117
File Systems 7
char buf[512];
int i;
int f = open("myfile",O_RDONLY);
for(i=0; i<100; i+=2) {
    pread(f,(void *)buf,512,i*512);
}
close(f);
Read every second 512 byte chunk of a file, until 50 have been
read.
File Systems 8
Memory-Mapped Files
• generic interface:
vaddr ← mmap(file descriptor,fileoffset,length)
munmap(vaddr,length)
• mmap call returns the virtual address to which the file is mapped
• munmap call unmaps mapped files within the specified virtual address range
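The generic interface can be illustrated with the POSIX mmap call. A sketch that writes a small file, maps it read-only, and reads its bytes through memory (the file name is illustrative and error handling is deliberately minimal):

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Write msg (including its NUL terminator) to path, map the file
 * read-only, and compare the mapped bytes against expect.
 * Returns 1 on a match, 0 on any failure or mismatch. */
int mmap_file_matches(const char *path, const char *msg,
                      const char *expect) {
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) return 0;
    size_t len = strlen(msg) + 1;
    if (write(fd, msg, len) != (ssize_t)len) { close(fd); return 0; }

    /* fileoffset = 0, length = len; pages of the file are brought
     * into memory on demand, like ordinary virtual memory. */
    char *vaddr = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    if (vaddr == MAP_FAILED) { close(fd); return 0; }

    int ok = (strcmp(vaddr, expect) == 0);
    munmap(vaddr, len);
    close(fd);
    unlink(path);
    return ok;
}
```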
118
File Systems 9
[Diagram: a memory-mapped file - a region of length bytes starting at
fileoffset in the file is mapped to length bytes starting at vaddr in
the virtual address space]
File Systems 10
• what should happen if the virtual memory to which a file has been mapped is
updated?
• some options:
– prohibit updates (read-only mapping)
– eager propagation of the update to the file (too slow!)
– lazy propagation of the update to the file
∗ user may be able to request propagation (e.g., POSIX msync())
∗ propagation may be guaranteed by munmap()
– allow updates, but do not propagate them to the file
119
File Systems 11
File Systems 12
File Names
• namespace structure provides a way for users and applications to organize and
manage information
• in a structured namespace, objects may be identified by pathnames, which
describe a path from a root object to the object being identified, e.g.:
/home/kmsalem/courses/cs350/notes/filesys.ps
120
File Systems 13
[Diagram: an example hierarchical namespace - a tree of directories
containing files and other directories; key: directory vs. file]
File Systems 14
Hard Links
File systems ensure referential integrity for hard links. A hard link
refers to the object it was created for until the link is explicitly
destroyed. (What are the implications of this?)
121
File Systems 15
[Diagram: after link(/y/k/g, /z/m), the new directory entry /z/m and
the existing entry /y/k/g are both hard links to the same underlying
file]
File Systems 16
Unlink Example
[Diagram: after link(/y/k/g, /z/m) followed by unlink(/y/k/g), the
name /y/k/g is gone but the file survives, still reachable as /z/m]
Removing the last link to a file causes the file itself to be deleted.
Deleting a file that has a link would destroy the referential integrity
of the link.
122
File Systems 17
Symbolic Links
• a symbolic link, or soft link, is an association between two names in the file
namespace. Think of it as a way of defining a synonym for a filename.
– symlink(oldpath,newpath) creates a symbolic link from
newpath to oldpath, i.e., newpath becomes a synonym for
oldpath.
• symbolic links relate filenames to filenames, while hard links relate filenames
to underlying file objects!
• referential integrity is not preserved for symbolic links, e.g., the system call
above can succeed even if there is no object named oldpath
File Systems 18
[Diagram: after symlink(/y/k/g, /z/m), a new symlink object /z/m
records the pathname /y/k/g; the file itself still has only one hard
link]
/y/k/g still has only one hard link after the symlink call.
A new symlink object records the association between /z/m
and /y/k/g. open(/z/m) will now have the same effect as
open(/y/k/g).
123
File Systems 19
[Diagram: after symlink(/y/k/g, /z/m) and unlink(/y/k/g), the symlink
/z/m remains but refers to a name that no longer exists - a dangling
link]
File Systems 20
124
File Systems 21
% /bin/rm file1
% ls -li
685844 -rw------- 1 kmsalem kmsalem 15 2008-08-20 link1
685845 lrwxrwxrwx 1 kmsalem kmsalem 5 2008-08-20 sym1 -> file1
% cat link1
This is file1.
% cat sym1
cat: sym1: No such file or directory
% cat > file1
This is a brand new file1.
% ls -li
685846 -rw------- 1 kmsalem kmsalem 27 2008-08-20 file1
685844 -rw------- 1 kmsalem kmsalem 15 2008-08-20 link1
685845 lrwxrwxrwx 1 kmsalem kmsalem 5 2008-08-20 sym1 -> file1
% cat link1
This is file1.
% cat sym1
This is a brand new file1.
File Systems 22
125
File Systems 23
[Diagram: a file system namespace before and after a second file
system (with root r) is mounted at mount point /x; after the mount,
paths such as /x/a/x/g refer to objects in the mounted file system]
File Systems 24
• a hard link associates a name in the file system namespace with a file in that
file system
• for example, after the mount operation illustrated on the previous slide:
– symlink(/x/a/x/g,/z/d) would succeed
– open(/z/d) would succeed, with the effect of opening /x/a/x/g.
• even if the symlink operation were to occur before the mount command, it
would succeed
126
File Systems 25
• space management
• file indexing (how to locate file data and meta-data)
• directories
• links
File Systems 26
[Diagram: fixed-size allocation (all chunks the same size) vs.
variable-size allocation (chunks of different sizes)]
• layout matters! Try to lay a file out sequentially, or in large sequential extents
that can be read and written efficiently.
127
File Systems 27
File Indexing
• in general, a file will require more than one chunk of allocated space
File Systems 28
Chaining
128
File Systems 29
[Diagram: chaining with an external chain (file access table) - the
chain pointers are kept in a separate table rather than in the data
blocks themselves]
File Systems 30
Per-File Indexing
129
File Systems 31
• typically, a file system will assign a unique internal identifier to each file,
directory or other object
• given an identifer, the file system can directly locate a record containing key
information about the file, such as:
– the per-file index to the file data (if per-file indexing is used), or the
location of the file’s first data block (if chaining is used)
– file meta-data (or a reference to the meta-data), such as
∗ file owner
∗ file access permissions
∗ file access timestamps
∗ file type
File Systems 32
130
File Systems 33
i-node Diagram
[Diagram: an i-node - attribute values, several direct block pointers,
and single, double, and triple indirect pointers leading through
indirect blocks to the data blocks]
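The reach of this scheme is easy to compute. A sketch with hypothetical parameters - 4 KB blocks, 4-byte block pointers, and 12 direct pointers; actual values vary by file system:

```c
#include <assert.h>
#include <stdint.h>

/* Maximum file size reachable through an i-node with ndirect direct
 * pointers plus one single, one double, and one triple indirect
 * pointer. ppb = pointers that fit in one indirect block. */
uint64_t inode_max_bytes(uint64_t block_size, uint64_t ptr_size,
                         uint64_t ndirect) {
    uint64_t ppb = block_size / ptr_size;
    uint64_t blocks = ndirect        /* direct blocks */
                    + ppb            /* via single indirect */
                    + ppb * ppb      /* via double indirect */
                    + ppb * ppb * ppb; /* via triple indirect */
    return blocks * block_size;
}
```

With these assumed parameters the triple indirect level dominates, giving a maximum file size of about 4 TB.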
File Systems 34
Directories
• The file system should not allow directory files to be directly written by
application programs. Instead, the directory is updated by the file system as
files are created and destroyed
131
File Systems 35
• to implement this:
1. find out the internal file identifier for /y/k/g
2. create a new entry in directory /z
– file name in new entry is m
– file identifier (i-number) in the new entry is the one discovered in step 1
File Systems 36
132
File Systems 37
[Diagram: a table of i-nodes (entries 0, 1, 2, 3) with cached i-nodes
held in memory]
File Systems 38
• a single logical file system operation may require several disk I/O operations
• example: deleting a file
– remove entry from directory
– remove file index (i-node) from i-node table
– mark file’s data blocks free in free space index
• what if, because of a failure, some but not all of these changes are reflected on
the disk?
133
File Systems 39
Fault Tolerance
134
Interprocess Communication 1
• shared storage
– These mechanisms have already been covered. Examples:
∗ shared virtual memory
∗ shared files
– processes must agree on a name (e.g., a file name, or a shared virtual
memory key) in order to establish communication
• message based
– signals
– sockets
– pipes
– ...
Interprocess Communication 2
Message Passing
[Diagram: message passing - the sender invokes send and the receiver
invokes receive, with the operating system carrying the message
between them]
135
Interprocess Communication 3
Directionality:
• simplex (one-way)
• duplex (two-way)
• half-duplex (two-way, but only one way at a time)
Message Boundaries:
datagram model: message boundaries
stream model: no boundaries
Interprocess Communication 4
Reliability:
• can messages get lost?
• can messages get reordered?
• can messages get damaged?
136
Interprocess Communication 5
Sockets
• if two processes are to communicate, each process must create its own socket
• two common types of sockets
stream sockets: support connection-oriented, reliable, duplex
communication under the stream model (no message boundaries)
datagram sockets: support connectionless, best-effort (unreliable), duplex
communication under the datagram model (message boundaries)
Interprocess Communication 6
s = socket(addressType, SOCK_DGRAM);
bind(s,address);
recvfrom(s,buf,bufLength,sourceAddress);
...
close(s);
137
Interprocess Communication 7
s = socket(addressType, SOCK_DGRAM);
sendto(s,buf,msgLength,targetAddress)
...
close(s);
Interprocess Communication 8
138
Interprocess Communication 9
Interprocess Communication 10
s = socket(addressType, SOCK_STREAM);
bind(s,address);
listen(s,backlog);
ns = accept(s,sourceAddress);
recv(ns,buf,bufLength);
send(ns,buf,bufLength);
...
close(ns); // close accepted connection
close(s); // don’t accept more connections
• listen specifies the number of connection requests for this socket that will
be queued by the kernel
• accept accepts a connection request and creates a new socket (ns)
139
Interprocess Communication 11
Interprocess Communication 12
s = socket(addressType, SOCK_STREAM);
connect(s,targetAddress);
send(s,buf,bufLength);
recv(s,buf,bufLength);
...
close(s);
• connect sends a connection request to the socket with the specified address
– connect blocks until the connection request has been accepted
• active process may (optionally) bind an address to the socket (using bind)
before connecting. This is the address that will be returned by the accept
call in the passive process
• if the active process does not choose an address, the system will choose one
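A single-process illustration of stream-socket send/recv: socketpair() (POSIX, AF_UNIX) creates an already-connected pair of stream sockets, standing in here for the full bind/listen/accept vs. connect handshake shown above.

```c
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Send msg (including its NUL terminator) down one end of a connected
 * stream-socket pair and read it back from the other end into buf.
 * Returns the number of bytes received, or -1 on failure. */
int stream_roundtrip(const char *msg, char *buf, size_t buflen) {
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    size_t len = strlen(msg) + 1;
    if (send(sv[0], msg, len, 0) != (ssize_t)len) {
        close(sv[0]); close(sv[1]);
        return -1;
    }
    ssize_t got = recv(sv[1], buf, buflen, 0);
    close(sv[0]);
    close(sv[1]);
    return (int)got;
}
```

Because stream sockets follow the stream model, a single recv here may in general return fewer bytes than were sent; the short message in this sketch arrives in one piece.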
140
Interprocess Communication 13
[Diagram: a passive process (process 2) with listening socket s
accepts connections from active processes 1 and 3; each accepted
connection gets a new socket (s2, s3) in process 2]
Interprocess Communication 14
Pipes
• a pipe() system call creates a pipe and returns two descriptors, one for each
end of the pipe
– for a simplex pipe, one descriptor is for reading, the other is for writing
– for a duplex pipe, both descriptors can be used for reading and writing
141
Interprocess Communication 15
int fd[2];
int pid;
char m[] = "message for parent";
char y[100];
pipe(fd);      // create pipe
pid = fork();  // create child process
if (pid == 0) {
    // child executes this
    close(fd[0]);       // close read end of pipe
    write(fd[1],m,19);  // 19 bytes: message plus NUL terminator
    ...
} else {
    // parent executes this
    close(fd[1]);       // close write end of pipe
    read(fd[0],y,100);
    ...
}
Interprocess Communication 16
[Diagram: a pipe connecting the parent process and its child]
142
Interprocess Communication 17
Interprocess Communication 18
143
Interprocess Communication 19
named pipe:
• similar to pipes, but with an associated name (usually a file name)
• name allows arbitrary processes to communicate by opening the same
named pipe
• must be explicitly deleted, unlike an unnamed pipe
message queue:
• like a named pipe, except that there are message boundaries
• msgsend call sends a message into the queue, msgrecv call receives the
next message from the queue
Interprocess Communication 20
Signals
144
Interprocess Communication 21
Interprocess Communication 22
Signal Handling
• operating system determines default signal handling for each new process
• example default actions:
– ignore (do nothing)
– kill (terminate the process)
– stop (block the process)
• a running process can change the default for some types of signals
145
Interprocess Communication 23
Implementing IPC
• kernel descriptor tables (or other similar mechanism) are used to associate
descriptors with kernel data structures that implement IPC objects
• kernel provides bounded buffer space for data that has been sent using an IPC
mechanism, but that has not yet been received
– for IPC objects, like pipes, buffering is usually on a per-object basis
– for IPC end points, like sockets, buffering is associated with each endpoint
[Diagram: processes P1 and P2 communicating through buffers maintained
in the operating system]
Interprocess Communication 24
• some sockets can be used to connect processes that are running on different
machines
• the kernel:
– controls access to network interfaces
– multiplexes socket connections across the network
[Diagram: processes P1, P2, P3 on each of two machines; each operating
system controls its network interface and multiplexes its processes'
socket connections over the network]
146
Interprocess Communication 25
• ISO/OSI Reference Model
7 Application Layer
6 Presentation Layer
5 Session Layer
4 Transport Layer
3 Network Layer
2 Data Link Layer
1 Physical Layer
[Diagram: corresponding layers on two machines communicate with each
other using a layer protocol, e.g., the two Layer 1 instances speak
the layer 1 protocol]
• Internet Model
– layers 1-4 and 7
Interprocess Communication 26
• every machine has one (or more) IP address, in addition to its data link layer
address(es)
• In IPv4, addresses are 32 bits, and are commonly written using “dot” notation,
e.g.:
– cpu06.student.cs = 129.97.152.106
– www.google.ca = 216.239.37.99 or 216.239.51.104 or ...
147
Interprocess Communication 27
Interprocess Communication 28
148
Interprocess Communication 29
• since there can be many TCP or UDP communications end points (sockets) on
a single machine, there must be a way to distinguish among them
Interprocess Communication 30
[Diagram: application processes on two hosts communicate through
transport instances; the path between them may cross several networks
joined by gateways]
149
Interprocess Communication 31
[Diagram: encapsulation - the application message is carried as the
payload of a UDP datagram, which is in turn carried as the payload of
an IP packet]
Interprocess Communication 32
[Diagram: kernel networking layers - a process enters via system calls
to the socket layer (with per-socket queues); below it the protocol
layer (TCP, UDP, IP, ...) with its protocol queue; below that the
interface queues and the interface layer (ethernet, PPP, loopback, ...)]
150