The Evolving Solaris Kernel
The Evolving Solaris Kernel Past, Present & Future
Jim Mauro
Senior Staff Engineer - Performance & Availability Engineering
Sun Microsystems, Inc.
400 Atrium Drive, Somerset, NJ 08812
[email protected]
Richard McDougall
Senior Staff Engineer - Performance & Availability Engineering
Sun Microsystems, Inc.
[email protected]
copyright (c) 2002 Jim Mauro and Richard McDougall
Nov 2002
Agenda
Introduction
- Solaris Overview
- Distribution
- Releases
- System Overview & Kernel Features
- 64-bits
The Evolution
- Things added, things changed
- Tips and tidbits along the way...
Major Features Review
- Solaris 7
- Solaris 8
- Solaris 9
Introduction
What is Solaris?
- A complete operating environment, built on a modular, dynamic kernel
The Solaris Operating Environment (SOE)
- SunOS - the kernel (the 5.X thing)
- Windowing - desktop environment. CDE default, OpenWindows still included
  - GNOME 2 Beta available
  - GNOME is the strategic direction
- Open Network Computing (ONC+): NFS (V2 & V3), NIS/NIS+, RPC/XDR, LDAP
Solaris Distribution
Many CDs in the distribution
- WEB start CD (Installation)
- OS bits, disks 1 and 2
- Software Supplement (more optional bits)
- Flash PROM Update
- Maintenance Update
- Sun Management Center
- Forte Workshop (try n buy)
Bonus Software
- Software Companion (GNU, etc)
- StarOffice 6
- SunONE Advantage Software (2 CDs)
- Oracle Enterprise Server
Releases
Base release, followed by quarterly update releases
- Solaris 8 - released 2/00
  - Solaris 8, 6/00 (update 1)
  - Solaris 8, 10/00 (update 2)
  - Solaris 8, 1/01 (update 3)
  - Solaris 8, 4/01 (update 4)
  - Solaris 8, 7/01 (update 5)
  - Solaris 8, 10/01 (update 6)
  - Solaris 8, 2/02 (update 7)
- Solaris 9 - base release, May 2002
The model is designed to
- Provide predictability for planning
- Provide a vehicle for getting new features, functionality and patches out in a regular and timely fashion
Releases (cont)
So, which release am I running?

    sunsys> cat /etc/release
    Solaris 8 6/00 s28s_u1wos_08 SPARC
    Copyright 2000 Sun Microsystems, Inc. All Rights Reserved.
    Assembled 26 April 2000
    sunsys>

Check out the "What's New" document for a specific release at http://docs.sun.com
Kernel Features
System Overview
[Figure: block diagram of the Solaris kernel. Boxes shown: System Call Interface; Thread Scheduling and Process Management (scheduling classes TS/IA, RT, FX, FSS); Virtual File System Framework (UFS, NFS, SPEC FS); Kernel Services (Clocks & Timers, Callouts); Networking (TCP, IP, Sockets); Virtual Memory System; Hardware Address Translation (HAT); Bus and Device Drivers (SD, SSD); HARDWARE.]
Solaris Kernel Features
Dynamic Kernel
- Small core unix modules
- Major subsystems implemented as dynamically loadable modules (file systems, scheduling classes, STREAMS modules, system calls)
- Dynamic resource sizing & allocation (processes, files, locks, memory, etc)
  - Dynamic sizing based on system size
  - Goal is to minimize/eliminate the need to use /etc/system tuneable parameters
Solaris Kernel Features
Preemptive kernel
- Does NOT require interrupt disable/blocking via PIL for synchronization
- Most kernel code paths are preemptable
- A few non-preemption points in critical code paths
- SCALABILITY & LOW LATENCY INTERRUPTS
Well-defined, layered interfaces
- Module support, synchronization primitives, etc
Solaris Kernel Features
Multithreaded kernel
- Kernel threads perform core system services
- Fine grained locking for concurrency
- Threaded subsystems
Multithreaded process model
- User level threads and synchronization primitives
- Solaris (UI) & POSIX threads
- Two-level (M x N) model, evolved to one-level model
  - Alternate thread library in Solaris 8
  - Default thread library in Solaris 9
Solaris Kernel Features
Table-driven dispatcher with multiple scheduling class support
- Dynamically loadable/modifiable table values
- Relatively easy to add new scheduling classes
  - FSS and FX in Solaris 9
Realtime support with preemptive kernel
- Additional kernel support for realtime applications (memory page locking, asynchronous I/O, processor sets, interrupt control, highres clock)
Kernel tuning via text file (/etc/system, driver.conf)
- Some things can be done on the fly
  - mdb(1)
Solaris Kernel Features
Tightly integrated virtual memory and file system support
- Dynamic page cache memory implementation
Virtual File System (VFS) Implementation
- Object-like abstractions for files and file systems
- Facilitates new features/functionality
  - Kernel sockets via sockfs
  - procfs (/proc) enhancements
  - Doors (doorfs)
  - fdfs, swapfs, tmpfs
- Disk-based, distributed & pseudo file systems
Solaris Kernel Features
32-bit and 64-bit kernel
- 64-bit kernel required for UltraSPARC-III based systems (SunBlade, SunFire)
- 32-bit apps run just fine...
Solaris DDI/DKI Implementation
- Device driver interfaces
- Includes interfaces for dynamic attach/detach/pwr
Rich set of standards-compliant interfaces
- POSIX, UNIX International
Solaris Kernel Features
Integrated networking facilities
- TCP/IP
  - IPv4, IPSec, IPv6
- Name services - DNS, NIS, NIS+, LDAP
- NFS - de facto standard distributed file system, NFS-V2 & NFS-V3
- Remote Procedure Call/External Data Representation (RPC/XDR) facilities
- Sockets, TLI, Federated Naming APIs
64-Bits
64-bit Solaris
Since Solaris 7, full 32-bit binary compatibility
A simple directory namespace rule provides for the support and co-existence of 32-bit binaries on a 64-bit Solaris 8 system:
- For every directory on the system that contains binary object files (executables, shared object libraries, etc), there is a sparcv9 subdirectory containing the 64-bit versions
All kernel modules must be of the same data model: ILP32 (32-bit data model) or LP64 (64-bit data model)
64-bit kernel required to run 64-bit apps
32 bit limits
Solaris 2.5
- Heap is limited to 2GB, malloc will fail beyond 2GB
Solaris 2.5.1
- Heap limited to 2GB by default
- Can go beyond 2GB with kernel patch 103640-08+
- Can raise limit to 3.75G by using ulimit or rlimit() if uid=root
- Do not need to be root with 103640-23+
Solaris 2.6
- Heap limited to 2GB by default
- Can raise limit to 3.75G by using ulimit or rlimit()
Solaris 7 & 8
- Limits are raised by default
- 32 bit program can malloc 3.99GB
Solaris/SPARC V8/V9 Data Model
Defines the width of integral data types
- 32-bit Solaris - ILP32
- 64-bit Solaris - LP64

    C data type   ILP32   LP64   (widths in bits)
    char              8      8
    short            16     16
    int              32     32
    long             32     64
    longlong         64     64
    pointer          32     64
    enum             32     32
    float            32     32
    double           64     64
    quad            128    128
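As a small illustration of what the two data models mean for porting, the sketch below copies the ILP32 and LP64 widths listed above into two dictionaries and reports which C types change width. The dictionary and function names are hypothetical and used only for this example.

```python
# Widths in bits, copied from the ILP32 / LP64 listing on this slide.
ILP32 = {"char": 8, "short": 16, "int": 32, "long": 32, "longlong": 64,
         "pointer": 32, "enum": 32, "float": 32, "double": 64, "quad": 128}
LP64 = {"char": 8, "short": 16, "int": 32, "long": 64, "longlong": 64,
        "pointer": 64, "enum": 32, "float": 32, "double": 64, "quad": 128}


def changed_types(old, new):
    """Return the type names whose width differs between two data models."""
    return [name for name in old if old[name] != new[name]]


changed = changed_types(ILP32, LP64)
print(changed)
```

Only `long` and `pointer` differ, which is why code that stores a pointer or a `long` in an `int` is the usual trouble spot when moving to LP64.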
64-bit Performance
64 Bit Virtual Address Space
- (+) Free from the 3.9GB barrier
- (+) Memory map large files
64 Bit data types
- (+) 64 Bit Arithmetic, 64 Bit Registers
- (-) Pointers/Longs require moving 8 bytes
  - Typically ~5% delta
  - Larger cache footprint
- (-) Larger Stack
Which Data Model Is Booted?
Use isainfo(1)

    sunsys> isainfo
    sparcv9 sparc
    sunsys> isainfo -b
    64
    sunsys> isainfo -v
    64-bit sparcv9 applications
    32-bit sparc applications

Or isalist(1)

    sunsys> isalist
    sparcv9+vis sparcv9 sparcv8plus+vis sparcv8plus sparcv8 sparcv8-fsmuld sparcv7 sparc
    sunsys>

man isaexec(3C)
- Invoke isa-specific executable
- To create wrappers for shipping both 32-bit and 64-bit binaries, and automatically launching the correct one
Evolving Features & Technical Tidbits
The Evolution
[Figure: timeline of Solaris releases and major kernel features; year axis 1992, 1993, 1994, 1995, 1996, 1998, 2000, 2002]
- Solaris 2.0: VFS/Vnode, ISM, UP only
- Solaris 2.1: 4-way SMP
- Solaris 2.2: sun4d SMP, Large UFS
- Solaris 2.3: 8-way SMP, New DNLC
- Solaris 2.4: 20-way SMP, New KMA Slab Allocator, Cachefs, CDE
- Solaris 2.5: Large pages (kernel), Doors, NFS V3, sun4u
- Solaris 2.5.1: sun4u MP
- Solaris 2.6: Large files, Processor Sets, Kernel Sockets, lockstat, UFS directio, DR
- Solaris 7: 64-bit kernel, 64-bit procs, UFS logging, Priority Paging
- Solaris 8: New KMA, Cyclics, T2, US-III, SunFire, StarCat, Freeware, UFS++
- Solaris 9: SVM, MPSS, MPO, Resource Pools, FSS, FX
General Priorities
Reliability, scalability, performance
- on-going
Standards compliance
SunOS 4.X binary compatibility
Threads / SMP scalability
Big systems performance
- VM & I/O
Lessons learned on threads
Resource management
- Consolidation, ROI, TCO
- Resource Pools, Service Containers, Resource Virtualization
Virtual Memory & The Dynamic Page Cache
Creating a dynamic page cache allows for all of physical memory to be used as disk buffer cache (read(2), write(2))
The evolution of systems hardware, RAID and general I/O tuning can create environments where the buffer cache throttles the VM system
- The VM roller coaster (keeping the freelist sane)
Priority paging (2.6 & 7) provided a band-aid
Using directio bypasses the page cache for UFS reads/writes
Solaris 8 implements a new cyclic page cache
Global Memory Management
Demand Paged
- Not recently used (NRU) algorithm
Dynamic file system cache
- Where has all my memory gone?
Page scanner
- Operates bottom up from physical pages
- Default mode treats all memory equally
The Old Page Cache
[Figure: the old page cache. Labels: kernel memory; process memory (heap, data, stack); segmap; pages pushed out of segmap; reclaim; page scanner; free list.]
The Cyclic Page Cache
[Figure: the cyclic page cache. Labels: kernel memory; process memory (heap, data, stack); segmap; pages pushed out of segmap; reclaim; cache list; free list.]
Global Paging Dynamics
[Figure: scan rate versus free memory (1GB example). Scan Rate axis: slowscan (100) up to fastscan (8192). Free Memory axis: throttlefree (4MB), minfree (4MB), desfree (8MB), lotsfree (16MB), cachefree (32MB), cachefree + deficit. pages_before_pager is also marked.]
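The shape of the scan-rate curve can be sketched numerically. The model below is only an illustration: it assumes the scanner is idle at or above lotsfree and ramps linearly from slowscan to fastscan as free memory falls to zero, and it uses the 1GB example values from the figure. The function name is hypothetical, and priority paging (which starts the ramp at cachefree) is not modelled.

```python
def scan_rate(free_mb, lotsfree_mb=16, slowscan=100, fastscan=8192):
    """Illustrative pages/second scan rate for a given amount of free memory.

    Assumption: linear ramp from slowscan (just below lotsfree) to
    fastscan (no free memory); zero when free memory is at or above lotsfree.
    """
    if free_mb >= lotsfree_mb:
        return 0
    shortfall = (lotsfree_mb - free_mb) / lotsfree_mb
    return round(slowscan + (fastscan - slowscan) * shortfall)


for free_mb in (32, 16, 8, 4, 0):
    print(f"{free_mb:>3} MB free -> {scan_rate(free_mb)} pages/sec")
```

Under these assumptions the rate at desfree (8MB) is roughly halfway between slowscan and fastscan.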
Priority Paging
Solaris 7 FCS or Solaris 2.6 with T-105181-09
- http://www.sun.com/sun-on-net/performance/priority_paging.html
- Set priority_paging=1 or cachefree in /etc/system
Solaris 7 Extended vmstat
- ftp://playground.sun.com/pub/rmc/memstat
Solaris 8
- New VM system, priority paging implemented at the core (make sure it's disabled in Sol 8!)
- New vmstat flag, -p
Solaris 9
- Multiple page size support (MPSS)
- Memory Placement Optimizations (MPO)
Memory Monitoring
Use vmstat or the memstat command on Solaris 7

    # vmstat 3
     procs     memory            page                 disk         faults      cpu
     r b w   swap  free  re  mf  pi  po  fr de  sr f0 s0 s4 s6   in   sy   cs us sy  id
     0 0 0 269776 21160   0   0   0   0   0  0   0  0  0  0  2  154  200   92  0  0 100
     0 0 0 269776 21152   0   0   0   0   0  0   0  0  0  0  2  155  203  113  0  0  99
     0 0 0 269720  3896   5  17  80   0 109  0  59  0  0  0  2  221  773  134  0  2  98
     0 0 0 269616  3792   0   0 160   0 160  0  76  0  0  0  2  279  242  130  0  1  99
     0 0 0 269616  3792   0   0 192   0 192  0 105  0  0  0  2  294  225  138  0  1  99
     0 0 0 269616  3800   1  90 234   5 232  0  99  0  0  0  2  323  964  305  5  3  92
     0 0 0 269656  3832   0   0 106   0 106  0  51  0  0  0  2  237  212  121  0  1  99

ftp://playground.sun.com/pub/rmc/memstat

    # memstat 3 (Solaris 7 Only) or # vmstat -p 3 (Solaris 8+)
    memory ---------- paging ----------- - executable -  - anonymous -  -- filesys --  --- cpu ---
      free  re  mf  pi  po  fr  de  sr    epi epo epf    api apo apf    fpi fpo fpf   us sy wt  id
     21160   0  22   0   5   5   0   0      0   0   0      0   0   0      0   5   5    0  1  0  99
     21152   0   0   0   0   0   0   0      0   0   0      0   0   0      0   0   0    0  0  0 100
     21152   0  18  34   2   2   0   0      0   0   0      0   0   0     34   2   2    0  1  0  99
     11920   0   0 277 106 272   0 153      0   0  32      0  98 149    277   8  90    0  3  0  97
     11888   0   0 256  69 224   0 106      0   0  16      0  69 178    256   0  29    0  3  1  96
     11896   0   0 213 106 261   0 124      0   0  26      0 106 232    213   0   2    0  3 13  84
     11904   0   0 245  66 242   0 122      0   0  16      0  64 221    245   2   5    0  2  0  98
     11896   0   0 245  64 224   0 132      0   0  21      0  64 189    245   0  13    0  2  0  98
Simple Memory Rule:
Identifying a memory shortage without PP:
- Scanner not scanning -> no memory shortage
- Scanner running, page ins and page outs, swap device activity -> potential memory shortage (use separate swap disk or 2.6 iostat -p to measure swap partition activity)
Identifying a memory shortage with PP on Sol 7:
- api and apo should be zero in memstat, non zero is a clear sign of memory shortage
Identifying a memory shortage on Sol 8:
- scan rate != 0
- freemem is real
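The three rules of thumb on this slide can be written down as one small decision function. This is a sketch only: the function name, the mode labels, and the argument names are hypothetical, while the inputs correspond to the sr column of vmstat and the api/apo columns of memstat shown earlier.

```python
def memory_shortage(mode, sr=0, api=0, apo=0, swap_io=False):
    """Apply the slide's rule of thumb for one sample interval.

    mode: "no-pp"   (no priority paging),
          "sol7-pp" (Solaris 7 with priority paging),
          "sol8"    (Solaris 8 cyclic page cache).
    Returns True when the sample suggests a memory shortage.
    """
    if mode == "no-pp":
        # Scanning alone is inconclusive; look for swap device activity too.
        return sr > 0 and swap_io
    if mode == "sol7-pp":
        # Anonymous page-ins or page-outs should stay at zero.
        return api > 0 or apo > 0
    if mode == "sol8":
        # With the cyclic page cache, any scanning means memory is short.
        return sr > 0
    raise ValueError(f"unknown mode: {mode}")


# Fourth memstat sample from the earlier slide: sr=153, api=0, apo=98.
print(memory_shortage("sol7-pp", sr=153, api=0, apo=98))
```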
Memory Summary
Solaris 9

    # mdb -k
    > ::memstat
    Page Summary            Pages       MB   %Tot
    ----------------     --------   ------   ----
    Kernel                  21146      165     9%
    Anon                    16891      131     7%
    Exec and libs            8389       65     3%
    Page cache               8248       64     3%
    Free (cachelist)         2490       19     1%
    Free (freelist)        190309     1486    77%
    Total                  247473     1933

Solaris 8 and earlier

    # prtmem
    Total memory:          1933 Megabytes
    Kernel Memory:          164 Megabytes
    Application:            128 Megabytes
    Executable & libs:       65 Megabytes
    File Cache:              64 Megabytes
    Free, file cache:        19 Megabytes
    Free, free:            1491 Megabytes
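The MB and %Tot columns of ::memstat follow directly from the page counts. The sketch below reproduces them, assuming the 8K base page size mentioned elsewhere in these slides; the truncating division for MB and the rounding for the percentage are inferred from the numbers shown, not from any documentation, and the names are hypothetical.

```python
PAGE_SIZE = 8192  # bytes; assumed base page size

# Page counts copied from the ::memstat output on this slide.
MEMSTAT_PAGES = {
    "Kernel": 21146,
    "Anon": 16891,
    "Exec and libs": 8389,
    "Page cache": 8248,
    "Free (cachelist)": 2490,
    "Free (freelist)": 190309,
}


def pages_to_mb(pages):
    """Convert a page count to whole megabytes (truncating)."""
    return pages * PAGE_SIZE // (1024 * 1024)


total_pages = sum(MEMSTAT_PAGES.values())
summary = {
    name: (pages_to_mb(pages), round(100 * pages / total_pages))
    for name, pages in MEMSTAT_PAGES.items()
}

for name, (mb, pct) in summary.items():
    print(f"{name:<18}{mb:>6} MB {pct:>3}%")
print(f"{'Total':<18}{pages_to_mb(total_pages):>6} MB")
```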
The Threads Model
Original 2-level, MxN model design goals
- Scalability
- Lightweight threads
- Pools of Virtual Processors (LWPs)
- Bound threads available
Lessons learned...
- User level thread scheduling is complex
- Signal delivery is, at times, a nightmare
- Kernel threads are not as expensive as they used to be
What we have today
- Alternate thread library in Solaris 8 (/usr/lib/lwp/libthread.so)
- 1-level is the default in Solaris 9 (/usr/lib/libthread.so)
2-Level MxN Model
[Figure: the 2-level MxN model. User layer: processes (proc 1 through proc 4), user threads, LWPs. Kernel layer: kernel threads, the dispatcher, an unattached kernel thread. Hardware layer: processors (CPUs).]
The 1 level model is effectively all bound threads (proc 4)
Resource Management
Effective management of hardware resources for applications
- Large application performance
- Multiple apps per Solaris instance (consolidation)
- Provide boundaries on resource consumption by applications
Resource categories
- Processors (CPUs)
- Memory (physical memory)
- Disk IO bandwidth/latency/IOPS
- Network bandwidth/latency
This is an on-going effort, with significant improvements in subsequent Solaris 9 quarterly releases
Processor Control Commands
CPU related commands
- psrinfo(1M) - provides information about the processors on the system. Use -v for verbose
- psradm(1M) - online/offline processors, interrupt enable/disable
- psrset(1M) - creation and management of processor sets
- pbind(1M) - original processor bind command. Does not provide exclusive binding
- processor_bind(2), processor_info(2), pset_bind(2), pset_info(2), pset_create(2), p_online(2) - system calls to do things programmatically
Solaris 9 Resource Management
Tasks, Projects & Extended Accounting
- Task - A collection of processes
- Project - A collection of tasks
[Figure: a project containing three tasks, each task containing several processes (proc).]
Solaris 9 Resource Management
Tasks & Projects provide abstractions for binding together related processes, for the purpose of
- Resource management. Tasks and Projects can be bound to processor sets, have scheduler changes applied to them, etc.
- Resource controls. Resource limits can be applied at the Project or Task level.
- Resource monitoring. Tools have been enhanced to monitor utilization at the Project or Task level.
  - prstat -J - Display statistics for processes and projects
  - prstat -T - Display statistics for processes and tasks
- Extended accounting. The accounting facility has been enhanced to provide project and task level accounting data.
Solaris 9 Resource Controls
The following resource controls are available
- project.cpu-shares: Number of CPU shares (FSS) available to this project
- task.max-cpu-time: Maximum CPU time available to the processes in this task (milliseconds)
- task.max-lwps: Maximum number of LWPs available to the processes in this task
- process.max-cpu-time: Max CPU time available to this process
- process.max-file-descriptor: Max number of open files for this process
- process.max-file-size: Max file size
- process.max-core-size: Max core file size
- process.max-data-size: Max size of the process's data segment
- process.max-stack-size: Max size of the process's stack
Solaris 9 Fair Share Scheduler
Share based (versus priority based) process scheduling
Designed to provide a guaranteed minimum amount of CPU resources to a specific application (project/task)
- Defining a maximum, or ceiling, not currently available
Shares are allocated to projects
Shares are not percentages
- Shares allocated are relative to shares allocated to other projects
- The total number of shares allocated also matters
FSS can be used in conjunction with processor sets
- Finer grained management and control
FSS & Processor Sets
Processor Set 1 - 2 CPUs, 25% of System
- Project A 16.66% (1/6)
- Project B 33.33% (2/6)
- Project C 50% (3/6)
Processor Set 2 - 4 CPUs, 50% of System
- Project B 40% (2/5)
- Project C 60% (3/5)
Processor Set 3 - 2 CPUs, 25% of System
- Project C 100% (3/3)
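The fractions on this slide can be reproduced with a few lines of arithmetic. The sketch below assumes the share counts implied by the fractions (project A has 1 share, B has 2, C has 3) and that a project competes only against the other projects running in the same processor set. The system-wide totals are an extra calculation that does not appear on the slide, and all names are hypothetical.

```python
from fractions import Fraction

# (CPUs in the set, {project: shares}) as implied by the slide's fractions.
PSETS = {
    "Processor Set 1": (2, {"A": 1, "B": 2, "C": 3}),
    "Processor Set 2": (4, {"B": 2, "C": 3}),
    "Processor Set 3": (2, {"C": 3}),
}


def cpu_fractions(psets):
    """Return per-set entitlements and each project's system-wide fraction."""
    total_cpus = sum(cpus for cpus, _ in psets.values())
    per_set, system = {}, {}
    for name, (cpus, shares) in psets.items():
        active_shares = sum(shares.values())
        per_set[name] = {p: Fraction(s, active_shares) for p, s in shares.items()}
        for project, frac in per_set[name].items():
            weight = Fraction(cpus, total_cpus)  # this set's slice of the machine
            system[project] = system.get(project, Fraction(0)) + frac * weight
    return per_set, system


per_set, system = cpu_fractions(PSETS)
for project, frac in sorted(system.items()):
    print(f"Project {project}: {float(frac):.1%} of the system")
```

The same 2 shares give project B one third of Processor Set 1 but two fifths of Processor Set 2, which is the point behind "shares are not percentages".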
Resource Pools
Provides a facility for stateful (persistent) processor sets and project binding, as well as scheduling class assignment
Resource pool management is done via pooladm(1M), poolbind(1M), and poolcfg(1M)
/etc/pooladm.conf provides persistence across reboots (managed via poolcfg(1M))
poolbind(1M) provides for binding of projects or tasks to a resource pool
/etc/project can define a resource pool for a project or task
Solaris Release Features Summary
Solaris 7 - New Features
64-Bits
- Kernel
- 64-bit binary support
- Full binary compatibility for 32-bit executables
UFS logging
- mount -o logging
- Logs to spare blocks in cylinder group
- No fsck
UFS noatime
- Disable access time update to inodes
pgrep & pkill
- Ends ps -ef | grep proc_name | awk '{ print $2 }'
Solaris 7 - New Features
traceroute bundled
dumpadm(1M)
- Configure a separate raw partition for dumps
- Dump running systems
LDAP Client Library
TCP with SACK
- Selective Acknowledgement - RFC 2018
libdevinfo(3)
- Device configuration information APIs
truss(1) Enhanced
- User level function tracing. -u, -U
Solaris 8 - New Features
Cyclic Page Cache
- Enhanced VM page management functionality
- Priority for page allocation given to process segments
- freemem is real!
System Message IDs
- Numeric ID generated for syslog messages
devfsadm(1M)
- One tool for device configuration/management
- DR events managed through devfsadmd
mmap MAP_ANON
- a = mmap(addr, len, prot, flags | MAP_ANON, -1, off);
POSIX High Resolution Timers
- CLOCK_HIGHRES via new Cyclics kernel subsystem
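The MAP_ANON idea (memory that is mapped without any backing file) can be tried from a scripting language. The sketch below uses Python's standard mmap module, where passing -1 as the file descriptor requests an anonymous mapping; it illustrates the concept rather than the Solaris C interface itself, and the variable names are hypothetical.

```python
import mmap

length = 4 * mmap.PAGESIZE

# fileno -1 asks for an anonymous mapping: no file, no /dev/zero needed.
region = mmap.mmap(-1, length)

zeroed = region[:8]        # anonymous memory starts out zero-filled
region[:5] = b"hello"      # the mapping behaves like a writable byte buffer
readback = region[:5]
size = len(region)

region.close()
print(zeroed, readback, size)
```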
Solaris 8 - New Features
prstat(1)
- Top-like curses based process monitor utility
apptrace(1)
- truss-like utility for tracing user-level library calls
/proc tools enhanced to work on core files
- pstack(1), pcred(1), pfiles(1)
coreadm(1M)
- System-wide core file management
mdb(1)
- New kernel debugger - replaces adb & crash
- Supports use of adb macros and crash utilities
- Evolved to manage user code debugging (Sol 9)
Solaris 8 - New Features
User Level Priority Inheritance
- User defined mutex locks attribute
Forced unmount
- umount -f
Alternate threads library
- /usr/lib/lwp/libthread.so - provides all bound threads
- Does not require re-compilation
Freeware CD
- apache, bash, bzip2, tcsh, gcc, mkisofs, less, zsh, Glib, GTK+, etc, etc,...
Solaris 9 - New Features
Many, many (but not all) Solaris 9 features have been backported to Solaris 8
- Available in various Solaris 8 update releases
Resource Management
- Resource pools - configure boundaries on resources consumed by processes and tasks
- Processors today, memory coming
- Resource pools persist across reboots (unlike processor sets and bindings)
- See prctl(1), pooladm(1M), poolcfg(1M), poolbind(1M), rctladm(1M), project(4)
Fixed-Priority Scheduling Class (FX)
- TS class priority range, but priorities remain fixed
Fair Share Scheduling Class (FSS)
- Share-based (versus priority-based) CPU allocation
Solaris 9 - New Features
Command line process facilities
- pargs(1) - dump args and env associated with a live process, or core file
- preap(1) - remove zombies (Harry Cooper & Ben could have used this in 1968!)
du(1), df(1M) and ls(1) - New -h option
- -h - provide human-readable output format. Lists sizes in Kbytes, Mbytes, Gbytes, etc...
Multiple Page Size Support (MPSS)
- Support of pages larger than 8k for process stack, heap and mmap'd anonymous memory
- Actual supported page sizes hardware dependent
- UltraSPARC-III supports 8k, 64k, 512k, 4MB...
Solaris 9 - New Features
MPSS (cont)

    jurassic> pagesize -a
    8192
    65536
    524288
    4194304
    jurassic>

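To see why larger pages matter, it helps to count how many translations are needed to map the same region at each of the page sizes reported above. The 256MB region size and the function name below are assumptions chosen purely for illustration.

```python
# Page sizes in bytes, as reported by `pagesize -a` above.
PAGE_SIZES = [8192, 65536, 524288, 4194304]


def pages_needed(region_bytes, page_size):
    """Number of pages (and so translations) needed to map a region."""
    return -(-region_bytes // page_size)  # ceiling division


region = 256 * 1024 * 1024  # hypothetical 256MB heap or shared segment
counts = {size: pages_needed(region, size) for size in PAGE_SIZES}

for size, count in counts.items():
    print(f"{size // 1024:>5}K pages: {count:>6} translations")
```

Going from 8K to 4MB pages cuts the number of translations for this region by a factor of 512, which is where the reduction in TLB misses comes from.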
New Threads Library/Model
- 1 Level threads model - all bound threads
- What was the alternate threads library in Solaris 8 is the default (in /usr/lib) in Solaris 9.
Dynamic Intimate Shared Memory (DISM)
- Allows database to dynamically shrink/grow the shared segment
- Original ISM implementation was a big performance win (shared translation information, large pages), but was fixed in size
- DISM gives the best of both worlds
Solaris 9 - New Features
Security
- Internet Key Exchange (IKE) Protocol
- Secure Shell (ssh) - SSHv1 & SSHv2
- Kerberos Key Distribution Center (KDC) & Admin Tools
- Secure LDAP
- 128-bit Encryption
- Role-Based Access Controls (RBAC) Enhanced
- tcp-wrappers 7.6 in freeware CD
- Xserver encrypted connections supported
Solaris 9 - New Features
iPlanet Directory Server
- LDAP Server bundled/integrated
LDAP Name Service Support
- NIS+ - to - LDAP Migration Tool
FTP Server
- Based on WU-ftp server
PPP 4.0
- Includes PPPoE (Solaris 8 7/01)
IP Network Multipathing (Solaris 8 10/00)
Solaris Volume Manager
- Formerly Solstice DiskSuite
- Soft partitions and Device ID support
Summary
Steady, sustained progress on key areas - scalability, reliability, performance, features
Going forward
- Resource management - memory, service containers
- Observability - More & better tools
- Resilience - fault detection, isolation, containment
- Management - Zero downtime admin
  - patches, upgrades
- Reliability, performance, always at the top
Supplemental Slides
Kernel Statistics
Solaris uses a central mechanism for kernel statistics
- kstat
- Kernel providers
  - raw statistics (C structure)
  - typed data
  - classed statistics
- Perl and C API
- kstat(1M) command

    # kstat -n system_misc
    module: unix                            instance: 0
    name:   system_misc                     class:    misc
            avenrun_15min                   90
            avenrun_1min                    86
            avenrun_5min                    87
            boot_time                       1020713737
            clk_intr                        2999968
            crtime                          64.1117776
            deficit                         0
            lbolt                           2999968
            ncpus                           2
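The avenrun values in this output are not load averages in the familiar uptime form. The sketch below converts them, assuming they are fixed-point numbers with 8 fractional bits (that is, scaled by 256); this scaling is not stated on the slide and should be checked against the kstat documentation for the release in use. The constant and dictionary names are hypothetical.

```python
FSCALE = 256  # assumed fixed-point scale (8 fractional bits)

# Raw values copied from the kstat output on this slide.
SYSTEM_MISC = {"avenrun_1min": 86, "avenrun_5min": 87, "avenrun_15min": 90}

load_average = {name: round(raw / FSCALE, 2) for name, raw in SYSTEM_MISC.items()}

for name, value in load_average.items():
    print(f"{name}: {value:.2f}")
```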
Memory Accounting
The ps command
- SZ = Virtual Size
- RSS = Resident Set Size (including shared)

    # ps -ale
    USER       PID %CPU %MEM   SZ  RSS TT      S    START    TIME COMMAND
    root     22998 12.0  0.8 4584 1992 ?       S 10:05:30    3:22 /usr/sbin/nsr/nsrc
    root     23672  1.0  0.7 1736 1592 pts/16  O 10:22:54    0:00 /usr/ucb/ps -aux
    root         3  0.4  0.0    0    0 ?       S   Sep 28  166:38 fsflush
    root       733  0.4  1.0 6352 2496 ?       S   Sep 28  174:29 /opt/SUNWsymon/jre
    root       345  0.3  0.7 2968 1736 ?       S   Sep 28   55:39 /usr/sbin/nsr/nsrd
    root     23100  0.2  0.5 3880 1104 ?       S   Oct 15    0:25 rpc.rstatd
    root       732  0.2  2.5 9920 6304 ?       S   Sep 28   94:43 esd - init topolog
The pmap command
Verbose Process mappings
- Solaris 8 private/shared
- Solaris 9 private=Anon, shared=RSS-Anon

    # pmap -x 123
    Address   Kbytes     RSS    Anon  Locked Mode   Mapped File
    00010000       8       8       -       - r-x-   mmap
    00020000       8       8       8       - rwx-   mmap
    01000000    1024    1024       -       - rw-s   dev:0,2 ino:5304657
    02000000    1024    1024     512       - rw--   dev:0,2 ino:5304657
    03000000    1024    1024     512       - rw--R  dev:0,2 ino:5304657
    04000000    1024    1024    1024       - rw--   [ anon ]
    05000000     512     512     512       - rw--R  [ anon ]
    FF280000     680     680       -       - r-x-   libc.so.1
    FF33A000      32      32      32       - rwx-   libc.so.1
    FF380000      16      16       -       - r-x-   libc_psr.so.1
    FF3A0000       8       8       -       - r-x-   libdl.so.1
    FF3B0000       8       8       8       - rwx-   [ anon ]
    FF3C0000     152     152       -       - r-x-   ld.so.1
    FF3F6000       8       8       8       - rwx-   ld.so.1
    FFBFA000      24      24      24       - rwx-   [ stack ]
    -------- ------- ------- ------- -------
    total Kb    5552    5552    2640       -
SWAP Space ctd...

    # swap -s
    total: 101456k bytes allocated + 12552k reserved = 114008k used, 597736k available

should read:

    total: 101456k bytes unallocated + 12552k allocated = 114008k reserved, 597736k available

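The arithmetic in the swap -s line is easy to check mechanically. The sketch below parses the sample line from this slide and derives the total virtual swap as used plus available; that last interpretation is an assumption added here, and the function and key names are hypothetical.

```python
import re

LINE = ("total: 101456k bytes allocated + 12552k reserved = "
        "114008k used, 597736k available")


def parse_swap_s(line):
    """Extract the four kilobyte figures from a `swap -s` summary line."""
    allocated, reserved, used, available = (
        int(n) for n in re.findall(r"(\d+)k", line)
    )
    return {"allocated": allocated, "reserved": reserved,
            "used": used, "available": available}


swap = parse_swap_s(LINE)
total_virtual_swap_k = swap["used"] + swap["available"]

print(swap)
print(f"total virtual swap: {total_virtual_swap_k}k")
```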
Swap:

    # ./prtswap -l
    Swap Reservations:
    --------------------------------------------------------------------------
    Total Virtual Swap Configured:                           767MB =
    RAM Swap Configured:                                       255MB
    Physical Swap Configured:                                + 512MB

    Total Virtual Swap Reserved Against:                     513MB =
    RAM Swap Reserved Against:                                   1MB
    Physical Swap Reserved Against:                            512MB

    Total Virtual Swap Unresv. & Avail. for Reservation:     253MB =
    Physical Swap Unresv. & Avail. for Reservations:             0MB
    RAM Swap Unresv. & Avail. for Reservations:                253MB

    Swap Allocations: (Reserved and Phys pages allocated)
    --------------------------------------------------------------------------
    Total Virtual Swap Configured:                           767MB
    Total Virtual Swap Allocated Against:                    467MB

    Physical Swap Utilization: (pages swapped out)
    --------------------------------------------------------------------------
    Physical Swap Free (should not be zero!):                232MB =
    Physical Swap Configured:                                  512MB
    Physical Swap Used (pages swapped out):                    279MB

The pmap command
Swap reservations (Solaris 9):

    # pmap -S 123
    Address   Kbytes    Swap Mode   Mapped File
    00010000       8       - r-x-   mmap
    00020000       8       8 rwx-   mmap
    01000000    1024       - rw-s   dev:0,2 ino:5304657
    02000000    1024    1024 rw--   dev:0,2 ino:5304657
    03000000    1024     512 rw--R  dev:0,2 ino:5304657
    04000000    1024    1024 rw--   [ anon ]
    05000000     512     512 rw--R  [ anon ]
    FF280000     680       - r-x-   libc.so.1
    FF33A000      32      32 rwx-   libc.so.1
    FF380000      16       - r-x-   libc_psr.so.1
    FF3A0000       8       - r-x-   libdl.so.1
    FF3B0000       8       8 rwx-   [ anon ]
    FF3C0000     152       - r-x-   ld.so.1
    FF3F6000       8       8 rwx-   ld.so.1
    FFBFA000      24      24 rwx-   [ stack ]
    -------- ------- -------
    total Kb    5552    3152
Shared Memory
System V Intimate Shared Memory (ISM)
- Shared translation data structures
- 4MB TLB Page Size
- Locked pages
- Invoke with an additional flag to shmat() - SHARE_MMU
- Default shared memory mode for Oracle RDBMS
System V Dynamic Intimate Shared Memory (DISM)
- Solaris 8 U3
- Pageable variant of ISM
- Integrated with Oracle 9i (dynamic SGA)
- 8k TLB Page Size for Solaris 8
- 4MB TLB Page Size for Solaris 9 U1
The pmap command

    # pmap -x 15492
    15492:  ./maps
    Address   Kbytes     RSS    Anon  Locked Mode   Mapped File
    00010000       8       8       -       - r-x-   maps
    00020000       8       8       8       - rwx-   maps
    00022000   20344   16248   16248       - rwx-   [ heap ]
    03000000    1024    1024       -       - rw-s   dev:0,2 ino:4628487
    04000000    1024    1024     512       - rw--   dev:0,2 ino:4628487
    05000000    1024    1024     512       - rw--R  dev:0,2 ino:4628487
    06000000    1024    1024    1024       - rw--   [ anon ]
    07000000     512     512     512       - rw--R  [ anon ]
    08000000    8192    8192       -    8192 rwxs   [ dism shmid=0x5 ]
    09000000    8192    4096       -       - rwxs   [ dism shmid=0x4 ]
    0A000000    8192    8192       -    8192 rwxsR  [ ism shmid=0x2 ]
    0B000000    8192    8192       -    8192 rwxsR  [ ism shmid=0x3 ]
    FF280000     680     672       -       - r-x-   libc.so.1
    FF33A000      32      32      32       - rwx-   libc.so.1
    FF390000       8       8       -       - r-x-   libc_psr.so.1
    FF3A0000       8       8       -       - r-x-   libdl.so.1
    FF3B0000       8       8       8       - rwx-   [ anon ]
    FF3C0000     152     152       -       - r-x-   ld.so.1
    FF3F6000       8       8       8       - rwx-   ld.so.1
    FFBFA000      24      24      24       - rwx-   [ stack ]
    -------- ------- ------- ------- -------
    total Kb   50464   42264   18888   16384
Multiple Page Size Support
Solaris 8
- Large (4MB) pages with ISM/DISM for shared memory
Solaris 9
- "Multiple Page Size Support"
- Optional large pages for heap/stack
- A wrapper for unchanged programs (ppgsz)
- Programmatically via memcntl(3C)
- Shared library for existing binaries (LD_PRELOAD) (/usr/lib/libmpss.so)
- pmap enhancements to observe page sizes (pmap -sx)
- Tool to observe potential gains (trapstat -T)
TLB Trap CPU Accounting
# trapstat -t 3 cpu | itlb-miss %tim itsb-miss %tim | dtlb-miss %tim dtsb-miss %tim | %tim -----+-------------------------------+-------------------------------+----0 k| 25 0.0 0 0.0 | 29558 0.5 6 0.0 | 0.6 0 u| 9728 0.1 1 0.0 | 17943 0.3 3 0.0 | 0.5 -----+-------------------------------+-------------------------------+----1 k| 0 0.0 0 0.0 | 19001 1.2 3 0.0 | 1.2 1 u| 7872 0.2 0 0.0 | 16300 0.5 0 0.0 | 0.8 =====+===============================+===============================+===== ttl | 17625 0.2 1 0.0 | 82802 1.3 12 0.0 | 1.5
The pmap command
# pmap -xs 15492
 Address  Kbytes     RSS    Anon  Locked Pgsz Mode   Mapped File
00010000       8       8       -       -   8K r-x--  maps
00020000       8       8       8       -   8K rwx--  maps
00022000    3960    3960    3960       -   8K rwx--    [ heap ]
00400000    8192    8192    8192       -   4M rwx--    [ heap ]
00C00000    4096       -       -       -    - rwx--    [ heap ]
01000000    4096    4096    4096       -   4M rwx--    [ heap ]
03000000    1024    1024       -       -   8K rw-s-  dev:0,2 ino:4628487
08000000    8192    8192       -    8192    - rwxs-    [ dism shmid=0x5 ]
09000000    4096    4096       -       -   8K rwxs-    [ dism shmid=0x4 ]
0A000000    4096       -       -       -    - rwxs-    [ dism shmid=0x2 ]
0B000000    8192    8192       -    8192   4M rwxsR    [ ism shmid=0x3 ]
FF280000     136     136       -       -   8K r-x--  libc.so.1
...
FF390000       8       8       -       -   8K r-x--  libc_psr.so.1
FF3A0000       8       8       -       -   8K r-x--  libdl.so.1
FF3B0000       8       8       8       -   8K rwx--    [ anon ]
FF3C0000     152     152       -       -   8K r-x--  ld.so.1
FF3F6000       8       8       8       -   8K rwx--  ld.so.1
FFBFA000      24      24      24       -   8K rwx--    [ stack ]
-------- ------- ------- ------- -------
total Kb   50464   42264   18888   16384
Memory Placement Optimization
Memory locality optimization for non-uniform memory architectures
- Solaris 9 Update 1
- Ex800 machines are slightly non-uniform
- F15k systems are slightly more non-uniform
Machine described as groups of latency (lgroups)
- Unit is typically a system board (processors+memory)
- Lgroups are an artifact of the hardware architecture (not user configurable)
Memory allocated close to the threads accessing it
- Threads are assigned a home lgroup
- Program heap and stack are allocated on the same lgroup
- Shared memory is allocated round robin across boards in the system or processor set
- Different programmatic policies are also provided
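The two default placement policies described here (private memory on the home lgroup, shared memory round robin) can be modeled in a few lines. This is only a toy sketch of the policy: the Placement class and its method names are hypothetical and do not correspond to any kernel or library interface.

```python
from itertools import cycle


class Placement:
    """Toy model of the default memory placement policies (hypothetical API)."""

    def __init__(self, lgroups):
        self.lgroups = list(lgroups)
        # Shared memory walks the lgroups in round-robin order.
        self._round_robin = cycle(self.lgroups)

    def private_page(self, home_lgroup):
        """Heap and stack pages come from the thread's home lgroup."""
        return home_lgroup

    def shared_page(self):
        """Shared memory pages are spread across lgroups in turn."""
        return next(self._round_robin)
```

For example, with three lgroups, five successive shared pages would land on lgroups 0, 1, 2, 0, 1, while a thread homed on lgroup 1 gets all of its heap pages from lgroup 1.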
Lock Statistics - mpstat
# mpstat 1
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  8    0   0 6611   456  300 1637    7   26 1110    0   135   33  45   2  21
  9    1   0 1294   250  100 2156    3   29 1659    0    68    9  63   0  28
 10    0   0 3232   308  100 2357    2   36 1893    0   104    2  66   2  30
 11    0   0  647   385  100 1952    1   19 1418    0    21    4  83   0  13
 12    0   0  190   225  100  307    0    1  589    0     0    0  98   0   2
 13    0   0  624   373  100 1689    2   14 1175    0    87    7  80   2  12
 14    0   0  392   312  100 1810    1   12 1302    0    49    2  80   2  15
 15    0   0  146   341  100 2586    2   13 1676    0     8    0  82   1  17
 16    0   0  382   355  100 1968    2    7 1628    0     4    0  88   0  12
 17    0   0   88   283  100  689    0    4  474    0    95    1  94   2   3
 18    0   0 3571   152  104  568    0    7 2007    0    15    0  93   1   6
 19    0   0 3133   278  100 2043    2   24 1307    0   113    7  69   1  22
 20    0   0  385   242  127 2127    2   22 1296    0    36    0  73   0  26
 21    0   0  152   369  100 2259    0   10 1400    0   140    2  84   2  12
 22    0   0 3964   241  120 1754    3   25 1085    0    91   11  62   1  26
 23    0   2  555   193  100 1827    2   23 1148    0   288    7  64   7  22
 24    0   0  811   245  113 1327    2   23 1228    0   110    3  76   4  17
 25    0   0  105   500  100 2369    0   11 1736    0     6    0  88   0  11
 26    0   0  163   395  131 2383    2   16 1487    0    64    2  79   1  18
 27    0   1  718  1278 1051 2073    4   23 1311    0   237    9  67   6  19
 28    0   0  868   271  100 2287    4   27 1309    0   139    9  55   0  36
 29    0   0  931   302  103 2480    3   29 1569    0   165    9  66   2  23
 30    0   0 2800   303  100 2146    2   13 1266    0   152   11  70   3  16
 31    0   1 1778   320  100 2368    2   24 1381    0   261   11  56   5  28
Lock Statistics - lockstat
# lockstat sleep 10

Adaptive mutex spin: 293311 events in 10.015 seconds (29288 events/sec)

 Count indv cuml rcnt     spin Lock                   Caller
-------------------------------------------------------------------------------
218549  75%  75% 1.00     3337 0x71ca3f50             entersq+0x314
 26297   9%  83% 1.00     2533 0x71ca3f50             putnext+0x104
 19875   7%  90% 1.00     4074 0x71ca3f50             strlock+0x534
 14112   5%  95% 1.00     3577 0x71ca3f50             qcallbwrapper+0x274
  2696   1%  96% 1.00     3298 0x71ca51d4             putnext+0x50
  1821   1%  97% 1.00       59 0x71c9dc40             putnext+0xa0
  1693   1%  97% 1.00     2973 0x71ca3f50             qdrain_syncq+0x160
   683   0%  97% 1.00       66 0x71c9dc00             putnext+0xa0
   678   0%  98% 1.00       55 0x71c9dc80             putnext+0xa0
   586   0%  98% 1.00       25 0x71c9ddc0             putnext+0xa0
   513   0%  98% 1.00       42 0x71c9dd00             putnext+0xa0
   507   0%  98% 1.00       28 0x71c9dd80             putnext+0xa0
   407   0%  98% 1.00       42 0x71c9dd40             putnext+0xa0
   349   0%  98% 1.00     4085 0x8bfd7e1c             putnext+0x50
   264   0%  99% 1.00       44 0x71c9dcc0             putnext+0xa0
   187   0%  99% 1.00       12 0x908a3d90             putnext+0x454
   183   0%  99% 1.00     2975 0x71ca3f50             putnext+0x45c
   170   0%  99% 1.00     4571 0x8b77e504             strwsrv+0x10
   168   0%  99% 1.00     4501 0x8dea766c             strwsrv+0x10
   154   0%  99% 1.00     3773 0x924df554             strwsrv+0x10
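The report lists one row per lock and caller pair, so a single hot lock can be spread over many rows. Summing the counts per lock address makes the contention picture clearer. The sketch below uses the first seven rows of the adaptive mutex spin report above; the helper names are hypothetical, and parsing of live lockstat output is deliberately left out.

```python
from collections import defaultdict

# (count, lock, caller) rows copied from the adaptive mutex spin report.
ROWS = [
    (218549, "0x71ca3f50", "entersq+0x314"),
    (26297, "0x71ca3f50", "putnext+0x104"),
    (19875, "0x71ca3f50", "strlock+0x534"),
    (14112, "0x71ca3f50", "qcallbwrapper+0x274"),
    (2696, "0x71ca51d4", "putnext+0x50"),
    (1821, "0x71c9dc40", "putnext+0xa0"),
    (1693, "0x71ca3f50", "qdrain_syncq+0x160"),
]
TOTAL_EVENTS = 293311  # from the report header


def spins_by_lock(rows):
    """Sum spin counts per lock address across all callers."""
    totals = defaultdict(int)
    for count, lock, _caller in rows:
        totals[lock] += count
    return dict(totals)


def hottest(rows):
    """Return (lock, total_count) for the most contended lock."""
    totals = spins_by_lock(rows)
    lock = max(totals, key=totals.get)
    return lock, totals[lock]


if __name__ == "__main__":
    lock, count = hottest(ROWS)
    print(lock, count, "spins,", round(100.0 * count / TOTAL_EVENTS), "% of all events")
```

In this sample one lock, reached from several STREAMS routines, accounts for the large majority of all spin events, which is harder to see when reading the per-caller rows one at a time.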
Lock Statistics - lockstat
Adaptive mutex block: 2818 events in 10.015 seconds (281 events/sec)

 Count indv cuml rcnt     nsec Lock                   Caller
-------------------------------------------------------------------------------
  2134  76%  76% 1.00  1423591 0x71ca3f50             entersq+0x314
   272  10%  85% 1.00   893097 0x71ca3f50             strlock+0x534
   152   5%  91% 1.00   753279 0x71ca3f50             putnext+0x104
   134   5%  96% 1.00   654330 0x71ca3f50             qcallbwrapper+0x274
    65   2%  98% 1.00   872630 0x71ca51d4             putnext+0x50
     9   0%  98% 1.00   260444 0x71ca3f50             qdrain_syncq+0x160
     7   0%  98% 1.00  1390807 0x8dea766c             strwsrv+0x10
     6   0%  99% 1.00   906048 0x88876094             strwsrv+0x10
     5   0%  99% 1.00  2266267 0x8bfd7e1c             putnext+0x50
     4   0%  99% 1.00   468550 0x924df554             strwsrv+0x10
     3   0%  99% 1.00   834125 0x8dea766c             cv_wait_sig+0x198
     2   0%  99% 1.00   759290 0x71ca3f50             drain_syncq+0x380
     2   0%  99% 1.00  1906397 0x8b77e504             cv_wait_sig+0x198
     2   0%  99% 1.00   645358 0x71dd69e4             qdrain_syncq+0xa0
Lock Statistics - lockstat
Spin lock spin: 52335 events in 10.015 seconds (5226 events/sec)

 Count indv cuml rcnt     spin Lock                   Caller
-------------------------------------------------------------------------------
 23531  45%  45% 1.00     4352 turnstile_table+0x79c  turnstile_lookup+0x48
  1864   4%  49% 1.00       71 cpu[19]+0x40           disp+0x90
  1420   3%  51% 1.00       74 cpu[18]+0x40           disp+0x90
  1228   2%  54% 1.00       23 cpu[10]+0x40           disp+0x90
  1159   2%  56% 1.00       60 cpu[16]+0x40           disp+0x90
  1138   2%  58% 1.00       22 cpu[24]+0x40           disp+0x90
  1108   2%  60% 1.00       57 cpu[17]+0x40           disp+0x90
  1082   2%  62% 1.00       24 cpu[11]+0x40           disp+0x90
  1039   2%  64% 1.00       25 cpu[29]+0x40           disp+0x90
  1009   2%  66% 1.00       17 cpu[23]+0x40           disp+0x90
  1007   2%  68% 1.00       21 cpu[31]+0x40           disp+0x90
   882   2%  70% 1.00       29 cpu[13]+0x40           disp+0x90
   846   2%  71% 1.00       25 cpu[28]+0x40           disp+0x90
   833   2%  73% 1.00       27 cpu[30]+0x40           disp+0x90
Lock Statistics - lockstat
Thread lock spin: 1232 events in 10.015 seconds (123 events/sec)

 Count indv cuml rcnt     spin Lock                   Caller
-------------------------------------------------------------------------------
   468  38%  38% 1.00     1018 turnstile_table+0x79c  ts_tick+0x8
   251  20%  58% 1.00      683 turnstile_table+0x79c  turnstile_block+0x1f4
   180  15%  73% 1.00      152 sleepq_head+0x7f4      ts_tick+0x8
    68   6%  78% 1.00       35 sleepq_head+0x7f4      turnstile_block+0x1f4
    31   3%  81% 1.00      650 sleepq_head+0x7f4      ts_update_list+0x60
    17   1%  82% 1.00       34 cpu[27]+0x64           cv_wait+0x18
     7   1%  83% 1.00       64 cpu[13]+0x64           cv_wait+0x18
     7   1%  84% 1.00      146 cpu[30]+0x64           ts_tick+0x8
     6   0%  84% 1.00       56 cpu[29]+0x64           ts_tick+0x8
     6   0%  84% 1.00       37 cpu[8]+0x64            turnstile_block+0x1f4
     6   0%  85% 1.00       96 cpu[9]+0x64            ts_tick+0x8
Lock Statistics - lockstat
R/W writer blocked by writer: 1 events in 10.015 seconds (0 events/sec)

 Count indv cuml rcnt     nsec Lock                   Caller
-------------------------------------------------------------------------------
     1 100% 100% 1.00   169634 0x9d42d620             segvn_pagelock+0x150
-------------------------------------------------------------------------------

R/W reader blocked by writer: 3 events in 10.015 seconds (0 events/sec)

 Count indv cuml rcnt     nsec Lock                   Caller
-------------------------------------------------------------------------------
     3 100% 100% 1.00  1841415 0x75b7abec             mir_wsrv+0x18
-------------------------------------------------------------------------------
lockstat - kernel profiling
# lockstat -kIi997 sleep 10

Profiling interrupt: 10596 events in 5.314 seconds (1994 events/sec)

 Count indv cuml rcnt     nsec CPU+PIL                Caller
-------------------------------------------------------------------------------
  5122  48%  48% 1.00     1419 cpu[0]                 default_copyout
  1292  12%  61% 1.00     1177 cpu[1]                 splx
  1288  12%  73% 1.00     1118 cpu[1]                 idle
   911   9%  81% 1.00     1169 cpu[1]                 disp_getwork
   695   7%  88% 1.00     1170 cpu[1]                 i_ddi_splhigh
   440   4%  92% 1.00     1163 cpu[1]+11              splx
   414   4%  96% 1.00     1163 cpu[1]+11              i_ddi_splhigh
   254   2%  98% 1.00     1176 cpu[1]+11              disp_getwork
    27   0%  99% 1.00     1349 cpu[0]                 uiomove
    27   0%  99% 1.00     1624 cpu[0]                 bzero
    24   0%  99% 1.00     1205 cpu[0]                 mmrw
    21   0%  99% 1.00     1870 cpu[0]                 (usermode)
     9   0%  99% 1.00     1174 cpu[0]                 xcopyout
     8   0%  99% 1.00      650 cpu[0]                 ktl0
     6   0%  99% 1.00     1220 cpu[0]                 mutex_enter
     5   0%  99% 1.00     1236 cpu[0]                 default_xcopyout
     3   0% 100% 1.00     1383 cpu[0]                 write
     3   0% 100% 1.00     1330 cpu[0]                 getminor
     3   0% 100% 1.00      333 cpu[0]                 utl0
     2   0% 100% 1.00      961 cpu[0]                 mmread
     2   0% 100% 1.00     2000 cpu[0]+10              read_rtc
Kernel Process Model
Processes
- All processes begin life as a program
- All processes begin life as a disk file (ELF object)
- All processes have state or context that defines their execution environment - hardware & software context
- Can be further divided into hardware context and software context
Hardware context
- The processor state, which is CPU architecture dependent
- In general, the state of the hardware registers (general registers, privileged registers)
- Maintained in the LWP
Software context
- Address space, credentials, open files, resource limits, etc - stuff shared by all the threads in a process
Dispatcher Views
[Figure: two processes, each with several user threads, LWPs holding machine state, and the process software context (open files, credentials, address space, process group, session control, ...). Each LWP is paired with a kthread; the kernel dispatcher view shows the kthreads being scheduled onto the CPU.]
Unbound user threads are scheduled within the threads library, where the selected user thread is linked to an available LWP. This does not apply to bound threads.
Dispatcher & Scheduling Classes
Solaris supports multiple scheduling classes
- Allows for the co-existence of different priority schemes and scheduling algorithms (policies) within the kernel
- Each scheduling class provides class-specific functions to manage thread priorities, administration, creation, termination, etc.
- The class-specific functions are called using a macro scheme, similar to what is used at the VFS layer
  ... CL_PREEMPT(thread) -> ts_preempt() ...
- Each scheduling class is assigned a range of priorities
- For each loaded scheduling class, the priority range falls within the system's total range of global priorities
The dispatcher is the kernel subsystem that manages the dispatch queues (run queues), handles thread selection, context switching, preemption, etc.
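The macro scheme amounts to an indirect call through a per-class table of functions, selected by the class of the thread being operated on. The sketch below models that dispatch in Python. It is an illustration of the idea only: the function bodies are placeholders, the dictionary layout is hypothetical, and only the names CL_PREEMPT and ts_preempt come from the slide.

```python
# Each scheduling class exports a table of class-specific functions.
# The bodies below are placeholders, not the kernel's actual logic.
def ts_preempt(thread):
    return "ts_preempt(%s)" % thread["name"]


def rt_preempt(thread):
    return "rt_preempt(%s)" % thread["name"]


# Hypothetical per-class operations tables, keyed by class name.
CLASS_OPS = {
    "TS": {"preempt": ts_preempt},
    "RT": {"preempt": rt_preempt},
}


def CL_PREEMPT(thread):
    """Analogue of the CL_PREEMPT macro: call through the thread's class table."""
    return CLASS_OPS[thread["class"]]["preempt"](thread)


if __name__ == "__main__":
    print(CL_PREEMPT({"name": "t1", "class": "TS"}))
    print(CL_PREEMPT({"name": "t2", "class": "RT"}))
```

The dispatcher code calling CL_PREEMPT never names a specific class, which is what lets a new scheduling class be loaded without changing the dispatcher.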
Scheduling Classes
SunOS currently implements the following scheduling classes
- Timeshare (TS)
- Fixed Priority (FX)
- Fair Share (FSS)
- Interactive (IA)
- System (SYS)
- Realtime (RT)
Global priorities, from highest (best) to lowest (worst):
  160-169  interrupt
  100-159  realtime
   60-99   system
    0-59   timesharing and interactive
Interrupt thread priorities sit above system; if the realtime class is not loaded, they occupy priorities 100-109.
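The priority layout on this slide can be written down as a small lookup. The sketch below maps a global priority to its band, including the case noted on the slide where the realtime class is not loaded and interrupt priorities occupy 100-109. The function and table names are hypothetical.

```python
# Global priority bands as laid out on the slide (realtime class loaded).
BANDS = [
    (160, 169, "interrupt"),
    (100, 159, "realtime"),
    (60, 99, "system"),
    (0, 59, "timesharing/interactive"),
]


def band_for(global_pri, rt_loaded=True):
    """Return the name of the band a global priority falls into."""
    if rt_loaded:
        bands = BANDS
    else:
        # Without the RT class, interrupts sit directly above system.
        bands = [(100, 109, "interrupt")] + BANDS[2:]
    for low, high, name in bands:
        if low <= global_pri <= high:
            return name
    raise ValueError("priority out of range: %d" % global_pri)
```

For instance, priority 105 is a realtime priority when the RT class is loaded, but an interrupt priority when it is not.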
Scheduling Classes - Priorities
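[Figure: the global priority range, 0 to 169. Timeshare and interactive threads each have a user priority range of -60 to +60 that maps onto global priorities 0 to 59. System, realtime, and interrupt priorities sit above that, with interrupt levels 1 to 10 at the top of the range, ending at 169.]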
Quick Tidbit
Use dispadmin(1M) or mdb(1) for scheduling class info
# dispadmin -l
CONFIGURED CLASSES
==================
SYS     (System Class)
TS      (Time Sharing)
FX      (Fixed Priority)
IA      (Interactive)

# mdb -k
> ::class
SLOT NAME    INIT FCN    CLASS FCN
   0 SYS     sys_init    sys_classfuncs
   1 TS      ts_init     ts_classfuncs
   2 FX      fx_init     fx_classfuncs
   3 IA      ia_init     ia_classfuncs
   4         0           0
   5         0           0
Note the RT class is not loaded
Thread Priorities & Scheduling
- Every thread has 2 priorities: a global priority, derived from its scheduling class, and (potentially) an inherited priority
- Priority is inherited from the parent, alterable via the priocntl(1) command or system call
- Typically, threads run as either TS or IA threads
  - IA threads are created when a thread is associated with a windowing system
- RT threads are explicitly created
- SYS class is used by kernel threads, and for TS/IA threads when a higher priority is warranted
  - A temporary boost when an important resource is being held
- Interrupts run at interrupt priority
File System Types
Filesystem  Type     Device          Description
ufs         Regular  Disk            Unix Fast Filesystem, default in Solaris
pcfs        Regular  Disk            MSDOS filesystem
hsfs        Regular  Disk            High Sierra File System (CDROM)
tmpfs       Regular  Memory          Uses memory and swap
nfs         Pseudo   Network         Network filesystem
cachefs     Pseudo   Filesystem      Uses a local disk as cache for another NFS file system
autofs      Pseudo   Filesystem      Uses a dynamic layout to mount other file systems
specfs      Pseudo   Device Drivers  Filesystem for the /dev devices
procfs      Pseudo   Kernel          /proc filesystem representing processes
sockfs      Pseudo   Network         Filesystem of socket connections
fifofs      Pseudo   Files           FIFO File System
The virtual file system framework
[Figure: file system calls enter the kernel through the System Call Interface and pass to the VFS, the File System Independent Layer (VFS & VNODE INTERFACES), which sits above the individual file systems: UFS, PCFS, HSFS, VxFS, NFS, PROCFS.]
VNODE OPERATIONS: open(), close(), read(), write(), ioctl(), seek(), creat(), link(), unlink(), rename(), mkdir(), rmdir(), fsync()
VFS OPERATIONS: mount(), umount(), statfs(), sync()
The VFS Interface
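[Figure: rootvfs points to the VFS structures for each mount point (/, /usr, /var, /opt). Each VFS structure holds the mount point vnode, the VFS type as an index into vfssw[] (ufs, nfs, etc...), blocksize, flags, device, synclist, and hashlist. Operations are invoked through VFSOP_xxx macros.]
VFS operation   UFS implementation
mount()         ufs_mount()
unmount()       ufs_unmount()
root()          ufs_root()
statvfs()       ufs_statvfs()
sync()          ufs_sync()
vget()          ufs_vget()
mountroot()     ufs_mountroot()
swapvp()        ufs_swapvp()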
The vnode interface
[Figure: a VNODE structure with its operations vector, a filesystem pointer, its memory pages, and a VNODE type.]
VNODE Ops   UFS implementation
close()     ufs_close()
read()      ufs_read()
write()     ufs_write()
ioctl()     ufs_ioctl()
create()    ufs_create()
link()      ufs_link()
. .         . .
VNODE Type: Regular File, Directory, Block Device, Character Device, Link, FIFO, Process, Socket
File system Caching
- Solaris file systems use the VM system to cache and move data
- Regular reads are page ins, delayed writes are page outs
- VM parameters and load dramatically affect file system performance
Solaris 8 gives executable, stack and heap pages priority over file system pages
File System Caching
[Figure: file system caching layers. A process (Stack, Heap, Binary (Data), Binary (Text), STDIO Buffers) reaches files through read()/write(), fread()/fwrite(), and mmap(). File name lookups go to the Directory Name Cache (ncsize). File data passes through the Level 1 Page Cache, the segmap page cache (256MB on Ultra), and the Level 2 Page Cache, the Dynamic Page Cache. Metadata is held in the Inode Cache (ufsninode) and the Buffer Cache (BUFHWM; direct. blocks). Storage Devices sit at the bottom.]
- The DNLC cache hit ratio can be observed with vmstat -s
- The cache hit ratio of the segmap cache can be measured with netstat -k segmap
- The buffer cache hit ratio can be observed with sar -b
- Files mapped with mmap() bypass the segmap cache
UFS
Block based allocation
- 2TB max file system size
- A file can grow to the max file system size
  - triple indirect is implemented
- Prior to 2.6, max file size was 2GB
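The need for triple indirect blocks follows from simple arithmetic on how many data blocks each level of indirection can address. The sketch below works that out under three assumptions that are not stated on the slide: an 8192-byte file system block, 12 direct block pointers per inode, and 4-byte block addresses. The function name is hypothetical.

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB
TB = 1024 * GB


def ufs_reach(block_size=8192, ndirect=12, addr_bytes=4):
    """Bytes addressable at each level of block indirection.

    Assumed layout: ndirect direct pointers in the inode, and indirect
    blocks packed with addr_bytes-wide block addresses.
    """
    nindir = block_size // addr_bytes  # pointers per indirect block
    return {
        "direct": ndirect * block_size,
        "single": nindir * block_size,
        "double": nindir ** 2 * block_size,
        "triple": nindir ** 3 * block_size,
    }


if __name__ == "__main__":
    reach = ufs_reach()
    for level in ("direct", "single", "double", "triple"):
        print(level, reach[level], "bytes")
```

Under these assumptions, direct, single, and double indirect blocks together address only a little over 32GB, so a file that grows toward the 2TB file system limit must use the triple indirect level.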
UFS Block Allocation
# filestat /home/bigfile
Inodes per cyl group:    64
Inodes per block:        64
Cylinder Group no:       0
Cylinder Group blk:      64
File System Block Size:  8192
Device block size:       512
Number of device blocks: 204928

Start Block    End Block    Length (Device Blocks)
-----------    ---------    ----------------------
      66272 ->     66463      192
      66480 ->     99247    32768
    1155904 ->   1188671    32768
    1277392 ->   1310159    32768
    1387552 ->   1420319    32768
    1497712 ->   1530479    32768
    1607872 ->   1640639    32768
    1718016 ->   1725999     7984
    1155872 ->   1155887       16

Number of extents:       9
Average extent size:     22769 Blocks

Note: The filestat command is shown for demonstration purposes, and is not as yet included with the Solaris operating system
UFS Logging
- Beginning in Solaris 7, UFS logging became a mount option
- Log to spare blocks in the file system (no metadevice)
- Fast reboots - no fsck required
UFS Direct I/O
- File systems cause a lot of paging activity
- Solaris 2.6 introduced a mechanism to bypass the VM system
  - Forces completely unbuffered I/Os
  - Very slow writes (synchronous)
  - Useful for copying large files or when the application does its own caching, e.g. Oracle
- Enabled per file system or per file:
  mount -o forcedirectio /dev/xyz /mountpt
  directio(fd, DIRECTIO_ON | DIRECTIO_OFF)
Direct I/O Checklist
- Must be aligned
  - sector aligned (512 byte boundary)
- Must not be mapped
- Buffer must be word aligned
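The checklist can be expressed as a single predicate over an I/O request. The sketch below is one reading of it, with two labeled assumptions: that sector alignment applies to both the file offset and the transfer length, and that a word is 4 bytes. The function name and parameters are hypothetical and are not part of any Solaris interface.

```python
SECTOR = 512  # sector size from the checklist


def directio_eligible(offset, length, buf_addr, mapped=False, word=4):
    """Return True if a request passes the direct I/O checklist.

    Assumptions (not from the slide): alignment is checked on both the
    file offset and the transfer length, and a word is `word` bytes.
    """
    if mapped:
        # A file that is also mmap()ed does not qualify.
        return False
    if offset % SECTOR != 0 or length % SECTOR != 0:
        return False
    return buf_addr % word == 0
```

A request at offset 8192 for 65536 bytes from a word-aligned buffer passes, while the same request at offset 100, or from a buffer at an odd address, does not.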
UFS Write Throttle
A throttle exists in UFS to limit the amount of memory UFS can saturate, per file
- Controlled by three parameters
  - ufs_WRITES (1 = enabled)
  - ufs_HW = 393216 bytes (high water mark to suspend IO)
  - ufs_LW = 262144 bytes (low water mark to start IO)
- Almost always need to set these higher to get maximum sequential write performance
  set ufs_LW=4194304
  set ufs_HW=67108864
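The high and low water marks form a hysteresis loop: writers are suspended once the outstanding bytes for a file pass the high mark, and resume only after completions bring the total back under the low mark. The sketch below simulates that behavior with the default values from the slide. It is a simplified model, not the UFS implementation, and the class and method names are hypothetical.

```python
class WriteThrottle:
    """Toy per-file write throttle with high/low water marks."""

    def __init__(self, hw=393216, lw=262144, enabled=True):
        self.hw = hw              # high water mark: suspend writers above this
        self.lw = lw              # low water mark: resume writers below this
        self.enabled = enabled    # models ufs_WRITES
        self.outstanding = 0      # bytes queued but not yet written to disk
        self.suspended = False

    def queue_write(self, nbytes):
        """Return True if the write is accepted, False if the writer must wait."""
        if self.enabled and self.suspended:
            return False
        self.outstanding += nbytes
        if self.enabled and self.outstanding > self.hw:
            self.suspended = True
        return True

    def complete_write(self, nbytes):
        """Account for bytes that reached disk; resume below the low mark."""
        self.outstanding = max(0, self.outstanding - nbytes)
        if self.outstanding < self.lw:
            self.suspended = False
```

With the defaults, a sequential writer issuing 64KB writes is stalled after only a handful of outstanding requests, which is why raising both marks helps streaming write throughput.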
UFS Performance
Adjacent blocks are grouped and written together or read ahead
- Controlled by the maxcontig parameter
- Defaults to 128k on most platforms, 1MB on SPARCstorage array 100, 200
- Must be set higher to achieve adequate write performance
- maxphys must be raised beyond 128k also