The Evolving Solaris Kernel
Past, Present & Future

Jim Mauro
Senior Staff Engineer - Performance & Availability Engineering
Sun Microsystems, Inc.
400 Atrium Drive, Somerset, NJ 08812

Richard McDougall
Senior Staff Engineer - Performance & Availability Engineering
Sun Microsystems, Inc.

copyright (c) 2002 Jim Mauro and Richard McDougall

Nov 2002

Agenda
- Introduction
  - Solaris Overview
  - Distribution
  - Releases
- System Overview & Kernel Features
  - 64-bits
- The Evolution
  - Things added, things changed
  - Tips and tidbits along the way...
- Major Features Review
  - Solaris 7
  - Solaris 8
  - Solaris 9

Introduction
What is Solaris?
- A complete operating environment, built on a modular, dynamic kernel

The Solaris Operating Environment (SOE)
- SunOS - the kernel (the 5.X thing)
- Windowing - desktop environment. CDE is the default; OpenWindows is still included
  - GNOME 2 Beta available
  - GNOME is the strategic direction
- Open Network Computing (ONC+): NFS (V2 & V3), NIS/NIS+, RPC/XDR, LDAP

Solaris Distribution
Many CDs in the distribution
- WEB start CD (Installation)
- OS bits, disks 1 and 2
- Software Supplement (more optional bits)
- Flash PROM Update
- Maintenance Update
- Sun Management Center
- Forte Workshop (try n buy)

Bonus Software
- Software Companion (GNU, etc)
- StarOffice 6
- SunONE Advantage Software (2 CDs)
- Oracle Enterprise Server

Releases
Base release, followed by quarterly update releases
- Solaris 8 - released 2/00
  - Solaris 8, 6/00 (update 1)
  - Solaris 8, 10/00 (update 2)
  - Solaris 8, 1/01 (update 3)
  - Solaris 8, 4/01 (update 4)
  - Solaris 8, 7/01 (update 5)
  - Solaris 8, 10/01 (update 6)
  - Solaris 8, 2/02 (update 7)
- Solaris 9 - base release, May 2002

The model is designed to
- Provide predictability for planning
- Provide a vehicle for getting new features, functionality and patches out in a regular and timely fashion

Releases (cont)
So, which release am I running?

    sunsys> cat /etc/release
    Solaris 8 6/00 s28s_u1wos_08 SPARC
    Copyright 2000 Sun Microsystems, Inc. All Rights Reserved.
    Assembled 26 April 2000
    sunsys>

Check out http://docs.sun.com for the "What's New" document for a specific release

Kernel Features


System Overview
[Figure: kernel block diagram. Labels, top to bottom: System Call Interface; Thread Scheduling and Process Management (TS/IA, RT, FX, FSS); Virtual File System Framework (UFS, NFS, SPEC FS); Kernel Services (Clocks & Timers, Callouts); Networking (TCP, IP, Sockets); Virtual Memory System; Hardware Address Translation (HAT); Bus and Device Drivers (SD, SSD); HARDWARE.]

Solaris Kernel Features
Dynamic Kernel
- Small core unix modules
- Major subsystems implemented as dynamically loadable modules (file systems, scheduling classes, STREAMS modules, system calls)
- Dynamic resource sizing & allocation (processes, files, locks, memory, etc)
  - Dynamic sizing based on system size
  - Goal is to minimize/eliminate the need to use /etc/system tuneable parameters

Solaris Kernel Features
Preemptive kernel
- Does NOT require interrupt disable/blocking via PIL for synchronization
- Most kernel code paths are preemptable
- A few non-preemption points in critical code paths
- SCALABILITY & LOW LATENCY INTERRUPTS

Well-defined, layered interfaces
- Module support, synchronization primitives, etc

Solaris Kernel Features
Multithreaded kernel
- Kernel threads perform core system services
- Fine grained locking for concurrency
- Threaded subsystems

Multithreaded process model
- User level threads and synchronization primitives
- Solaris (UI) & POSIX threads
- Two-level (M x N) model, evolved to one-level model
  - Alternate thread library in Solaris 8
  - Default thread library in Solaris 9

Solaris Kernel Features
Table-driven dispatcher with multiple scheduling class support
- Dynamically loadable/modifiable table values
- Relatively easy to add new scheduling classes
  - FSS and FX in Solaris 9

Realtime support with preemptive kernel
- Additional kernel support for realtime applications (memory page locking, asynchronous I/O, processor sets, interrupt control, highres clock)

Kernel tuning via text file (/etc/system, driver.conf)
- Some things can be done on the fly
  - mdb(1)

Solaris Kernel Features
Tightly integrated virtual memory and file system support
- Dynamic page cache memory implementation

Virtual File System (VFS) Implementation
- Object-like abstractions for files and file systems
- Facilitates new features/functionality
  - Kernel sockets via sockfs
  - procfs (/proc) enhancements
  - Doors (doorfs)
  - fdfs, swapfs, tmpfs
- Disk-based, distributed & pseudo file systems

Solaris Kernel Features
32-bit and 64-bit kernel
- 64-bit kernel required for UltraSPARC-III based systems (SunBlade, SunFire)
- 32-bit apps run just fine...

Solaris DDI/DKI Implementation
- Device driver interfaces
- Includes interfaces for dynamic attach/detach/pwr

Rich set of standards-compliant interfaces
- POSIX, UNIX International

Solaris Kernel Features
Integrated networking facilities
- TCP/IP
  - IPv4, IPSec, IPv6
- Name services - DNS, NIS, NIS+, LDAP
- NFS - de facto standard distributed file system, NFS-V2 & NFS-V3
- Remote Procedure Call/External Data Representation (RPC/XDR) facilities
- Sockets, TLI, Federated Naming APIs

64-Bits


64-bit Solaris
- Since Solaris 7, full 32-bit binary compatibility
- A simple directory namespace rule provides for the support and co-existence of 32-bit binaries on a 64-bit Solaris 8 system:
  - For every directory on the system that contains binary object files (executables, shared object libraries, etc), there is a sparcv9 subdirectory containing the 64-bit versions
- All kernel modules must be of the same data model: ILP32 (32-bit data model) or LP64 (64-bit data model)
- 64-bit kernel required to run 64-bit apps

32 bit limits
Solaris 2.5
- Heap is limited to 2GB, malloc will fail beyond 2GB

Solaris 2.5.1
- Heap limited to 2GB by default
- Can go beyond 2GB with kernel patch 103640-08+
- Can raise limit to 3.75G by using ulimit or rlimit() if uid=root
- Do not need to be root with 103640-23+

Solaris 2.6
- Heap limited to 2GB by default
- Can raise limit to 3.75G by using ulimit or rlimit()

Solaris 7 & 8
- Limits are raised by default
- 32 bit program can malloc 3.99GB

Solaris/SPARC V8/V9 Data Model
Defines the width of integral data types
- 32-bit Solaris - ILP32
- 64-bit Solaris - LP64

    C data type    ILP32 (bits)    LP64 (bits)
    char                      8              8
    short                    16             16
    int                      32             32
    long                     32             64
    longlong                 64             64
    pointer                  32             64
    enum                     32             32
    float                    32             32
    double                   64             64
    quad                    128            128
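
To make the difference between the two data models concrete, the sketch below encodes the widths from the table above and reports which types change between ILP32 and LP64. The dictionary layout and the function name `widened_types` are illustrative choices, not part of any Solaris tool.

```python
# Widths in bits, copied from the ILP32/LP64 table on this slide.
DATA_MODELS = {
    "ILP32": {"char": 8, "short": 16, "int": 32, "long": 32, "longlong": 64,
              "pointer": 32, "enum": 32, "float": 32, "double": 64, "quad": 128},
    "LP64": {"char": 8, "short": 16, "int": 32, "long": 64, "longlong": 64,
             "pointer": 64, "enum": 32, "float": 32, "double": 64, "quad": 128},
}


def widened_types(narrow="ILP32", wide="LP64"):
    """Return the C types whose width differs between the two data models."""
    return sorted(
        ctype
        for ctype, bits in DATA_MODELS[narrow].items()
        if DATA_MODELS[wide][ctype] != bits
    )


if __name__ == "__main__":
    # Only long and pointer grow, which is what the names ILP32 and LP64 encode.
    print(widened_types())
```

Code that stores a pointer in an int, or assumes that long and int have the same size, is what typically needs attention when moving to LP64.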


64-bit Performance
64 Bit Virtual Address Space
- (+) Free from the 3.9GB barrier
- (+) Memory map large files

64 Bit data types
- (+) 64 Bit Arithmetic, 64 Bit Registers
- (-) Pointers/Longs require moving 8 bytes
  - Typically ~5% delta
  - Larger cache footprint
- (-) Larger Stack

Which Data Model Is Booted?
Use isainfo(1)

    sunsys> isainfo
    sparcv9 sparc
    sunsys> isainfo -b
    64
    sunsys> isainfo -v
    64-bit sparcv9 applications
    32-bit sparc applications

Or isalist(1)

    sunsys> isalist
    sparcv9+vis sparcv9 sparcv8plus+vis sparcv8plus sparcv8 sparcv8-fsmuld sparcv7 sparc
    sunsys>

man isaexec(3C)
- Invoke isa-specific executable
- To create wrappers for shipping both 32-bit and 64-bit binaries, and automatically launching the correct one

Evolving Features & Technical Tidbits


The Evolution
[Figure: release timeline. Time axis: 1992, 1993, 1994, 1995, 1996, 1998, 2000, 2002.]
- Solaris 2.0: VFS/Vnode, ISM, UP only
- Solaris 2.1: 4-way SMP
- Solaris 2.2: sun4d SMP, Large UFS
- Solaris 2.3: 8-way SMP, New DNLC
- Solaris 2.4: 20-way SMP, New KMA, Slab Allocator, Cachefs, CDE
- Solaris 2.5: Large pages (kernel), Doors, NFS V3, sun4u
- Solaris 2.5.1: sun4u MP
- Solaris 2.6: Large files, Processor Sets, Kernel Sockets, lockstat, UFS directio, DR
- Solaris 7: 64-bit kernel, 64-bit procs, UFS logging, Priority Paging
- Solaris 8: New KMA, Cyclics, T2, US-III, SunFire, StarCat, Freeware, UFS++
- Solaris 9: SVM, MPSS, MPO, Resource Pools, FSS, FX

General Priorities
- Reliability, scalability, performance
  - on-going
- Standards compliance
- SunOS 4.X binary compatibility
- Threads / SMP scalability
- Big systems performance
  - VM & I/O
- Lessons learned on threads
- Resource management
  - Consolidation, ROI, TCO
  - Resource Pools, Service Containers, Resource Virtualization

Virtual Memory & The Dynamic Page Cache
- Creating a dynamic page cache allows for all of physical memory to be used as disk buffer cache (read(2), write(2))
- The evolution of systems hardware, RAID and general I/O tuning can create environments where the buffer cache throttles the VM system
  - The VM roller coaster (keeping the freelist sane)
- Priority paging (2.6 & 7) provided a band-aid
- Using directio bypasses the page cache for UFS reads/writes
- Solaris 8 implements a new cyclic page cache

Global Memory Management
Demand Paged
- Not recently used (NRU) algorithm

Dynamic file system cache
- Where has all my memory gone?

Page scanner
- Operates bottom up from physical pages
- Default mode treats all memory equally

The Old Page Cache
[Figure: the old page cache. Labels: kernel memory; segmap; pages pushed out of segmap; reclaim; process memory (heap, data, stack); page scanner; free list.]

The Cyclic Page Cache
[Figure: the cyclic page cache. Labels: kernel memory; segmap; pages pushed out of segmap; reclaim; process memory (heap, data, stack); cache list; free list.]

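Global Paging Dynamics
[Figure (1GB example): scan rate plotted against free memory. Y axis: Scan Rate, from slowscan (100) up to fastscan (8192). X axis: Free Memory, with thresholds throttlefree (4MB), minfree (4MB), desfree (8MB), lotsfree (16MB), cachefree (32MB) and cachefree+deficit. pages_before_pager is also marked.]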
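
The shape of the figure can be approximated numerically. The sketch below assumes, as a simplification not stated on the slide, that the scan rate rises linearly from slowscan when free memory drops to lotsfree up to fastscan when free memory reaches zero. It uses the 1GB example values and ignores priority paging (cachefree) and the deficit term. The function name `scan_rate` is hypothetical.

```python
def scan_rate(free_mb, lotsfree_mb=16, slowscan=100, fastscan=8192):
    """Approximate pages scanned per second for a given amount of free memory.

    Assumes a linear ramp between slowscan (at lotsfree) and fastscan (at zero
    free memory); the real kernel computation may differ in detail.
    """
    if free_mb >= lotsfree_mb:
        return 0  # scanner is idle while free memory is at or above lotsfree
    if free_mb <= 0:
        return fastscan
    shortfall = (lotsfree_mb - free_mb) / lotsfree_mb
    return round(slowscan + (fastscan - slowscan) * shortfall)


if __name__ == "__main__":
    for free in (32, 16, 8, 4, 0):
        print(f"{free:>3} MB free -> {scan_rate(free)} pages/sec")
```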


Priority Paging
Solaris 7 FCS or Solaris 2.6 with T-105181-09
- http://www.sun.com/sun-on-net/performance/priority_paging.html
- Set priority_paging=1 or cachefree in /etc/system

Solaris 7 Extended vmstat
- ftp://playground.sun.com/pub/rmc/memstat

Solaris 8
- New VM system, priority paging implemented at the core (make sure it's disabled in Sol 8!)
- New vmstat flag, -p

Solaris 9
- Multiple page size support (MPSS)
- Memory Placement Optimizations (MPO)

Memory Monitoring
Use vmstat or the memstat command on Solaris 7

    # vmstat 3
     procs     memory            page              disk        faults       cpu
     r b w   swap   free  re  mf  pi  po  fr  de  sr f0 s0 s4 s6   in   sy   cs us sy  id
     0 0 0 269776  21160   0   0   0   0   0   0   0  0  0  0  2  154  200   92  0  0 100
     0 0 0 269776  21152   0   0   0   0   0   0   0  0  0  0  2  155  203  113  0  0  99
     0 0 0 269720   3896   5  17  80   0 109   0  59  0  0  0  2  221  773  134  0  2  98
     0 0 0 269616   3792   0   0 160   0 160   0  76  0  0  0  2  279  242  130  0  1  99
     0 0 0 269616   3792   0   0 192   0 192   0 105  0  0  0  2  294  225  138  0  1  99
     0 0 0 269616   3800   1  90 234   5 232   0  99  0  0  0  2  323  964  305  5  3  92
     0 0 0 269656   3832   0   0 106   0 106   0  51  0  0  0  2  237  212  121  0  1  99

ftp://playground.sun.com/pub/rmc/memstat

    # memstat 3 (Solaris 7 Only) or # vmstat -p 3 (Solaris 8+)
    memory ---------- paging ----------  - executable -  - anonymous -  -- filesys --  --- cpu ---
      free  re  mf  pi  po  fr  de  sr    epi epo epf     api apo apf    fpi fpo fpf   us sy wt  id
     21160   0  22   0   5   5   0   0      0   0   0       0   0   0      0   5   5    0  1  0  99
     21152   0   0   0   0   0   0   0      0   0   0       0   0   0      0   0   0    0  0  0 100
     21152   0  18  34   2   2   0   0      0   0   0       0   0   0     34   2   2    0  1  0  99
     11920   0   0 277 106 272   0 153      0   0  32       0  98 149    277   8  90    0  3  0  97
     11888   0   0 256  69 224   0 106      0   0  16       0  69 178    256   0  29    0  3  1  96
     11896   0   0 213 106 261   0 124      0   0  26       0 106 232    213   0   2    0  3 13  84
     11904   0   0 245  66 242   0 122      0   0  16       0  64 221    245   2   5    0  2  0  98
     11896   0   0 245  64 224   0 132      0   0  21       0  64 189    245   0  13    0  2  0  98

Simple Memory Rule:
Identifying a memory shortage without PP:
- Scanner not scanning -> no memory shortage
- Scanner running, page ins and page outs, swap device activity -> potential memory shortage (use separate swap disk or 2.6 iostat -p to measure swap partition activity)

Identifying a memory shortage with PP on Sol 7:
- api and apo should be zero in memstat; non-zero is a clear sign of memory shortage

Identifying a memory shortage on Sol 8:
- scan rate != 0
- freemem is real
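
The rules on this slide reduce to simple checks on sampled counters. The sketch below is one way to express two of them; the function names and the idea of passing pre-parsed column values are assumptions made for illustration, and the sample numbers are the sr, api and apo columns from the vmstat and memstat output on the Memory Monitoring slide.

```python
def shortage_sol8(scan_rates):
    """Solaris 8 rule: any non-zero scan rate (sr) indicates a memory shortage."""
    return any(sr != 0 for sr in scan_rates)


def shortage_pp_sol7(anon_samples):
    """Solaris 7 with priority paging: non-zero api or apo indicates a shortage.

    anon_samples is an iterable of (api, apo) pairs taken from memstat output.
    """
    return any(api != 0 or apo != 0 for api, apo in anon_samples)


if __name__ == "__main__":
    sr_column = [0, 0, 59, 76, 105, 99, 51]                # sr values from the vmstat sample
    anon_columns = [(0, 0), (0, 0), (0, 0), (0, 98), (0, 69)]  # first (api, apo) rows from memstat
    print("scanner-based rule:", shortage_sol8(sr_column))
    print("priority-paging rule:", shortage_pp_sol7(anon_columns))
```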



Memory Summary
Solaris 9

    # mdb -k
    > ::memstat
    Page Summary            Pages      MB   %Tot
    ----------------     --------   -----   ----
    Kernel                  21146     165     9%
    Anon                    16891     131     7%
    Exec and libs            8389      65     3%
    Page cache               8248      64     3%
    Free (cachelist)         2490      19     1%
    Free (freelist)        190309    1486    77%
    Total                  247473    1933

Solaris 8 and earlier

    # prtmem
    Total memory:          1933 Megabytes
    Kernel Memory:          164 Megabytes
    Application:            128 Megabytes
    Executable & libs:       65 Megabytes
    File Cache:              64 Megabytes
    Free, file cache:        19 Megabytes
    Free, free:            1491 Megabytes

The Threads Model
Original 2-level, MxN model design goals
- Scalability
- Lightweight threads
- Pools of Virtual Processors (LWPs)
- Bound threads available

Lessons learned...
- User level thread scheduling is complex
- Signal delivery is, at times, a nightmare
- Kernel threads are not as expensive as they used to be

What we have today
- Alternate thread library in Solaris 8 (/usr/lib/lwp/libthread.so)
- 1-level is the default in Solaris 9 (/usr/lib/libthread.so)

2-Level MxN Model
[Figure: the 2-level MxN threads model. User Layer: processes (proc 1, proc 2, proc 3, proc 4), user threads and LWPs. Kernel Layer: kernel threads, including an unattached kernel thread, and the dispatcher. Hardware Layer: processors (CPUs).]

The 1 level model is effectively all bound threads (proc 4)

Resource Management
Effective management of hardware resources to applications
- Large application performance
- Multiple apps per Solaris instance (consolidation)
- Provide boundaries on resource consumption by applications

Resource categories
- Processors (CPUs)
- Memory (physical memory)
- Disk IO bandwidth/latency/IOPS
- Network bandwidth/latency

This is an on-going effort, with significant improvements in subsequent Solaris 9 quarterly releases

Processor Control Commands
CPU related commands
- psrinfo(1M) - provides information about the processors on the system. Use -v for verbose
- psradm(1M) - online/offline processors, interrupt enable/disable
- psrset(1M) - creation and management of processor sets
- pbind(1M) - original processor bind command. Does not provide exclusive binding
- processor_bind(2), processor_info(2), pset_bind(2), pset_info(2), pset_creat(2), p_online(2): system calls to do things programmatically

Solaris 9 Resource Management
Tasks, Projects & Extended Accounting
- Task - A collection of processes
- Project - A collection of tasks

[Figure: a project containing three tasks, each task containing several processes (proc).]

Solaris 9 Resource Management
Tasks & Projects provide abstractions for binding together related processes, for the purpose of
- Resource management. Tasks and Projects can be bound to processor sets, have scheduler changes applied to them, etc.
- Resource controls. Resource limits can be applied at the Project or Task level.
- Resource monitoring. Tools have been enhanced to monitor utilization at the Project or Task level.
  - prstat -J - Display statistics for processes and projects
  - prstat -T - Display statistics for processes and tasks
- Extended accounting. The accounting facility has been enhanced to provide project and task level accounting data.

Solaris 9 Resource Controls
The following resource controls are available
- project.cpu-shares: Number of CPU shares (FSS) available to this project
- task.max-cpu-time: Maximum CPU time available to the processes in this task (milliseconds)
- task.max-lwps: Maximum number of LWPs available to the processes in this task
- process.max-cpu-time: Max CPU time available to this process
- process.max-file-descriptor: Max number of open files for this process
- process.max-file-size: Max file size
- process.max-core-size: Max core file size
- process.max-data-size: Max size of the process's data segment
- process.max-stack-size: Max size of the process's stack

Solaris 9 Fair Share Scheduler
- Share based (versus priority based) process scheduling
- Designed to provide a guaranteed minimum amount of CPU resources to a specific application (project/task)
  - Defining a maximum, or ceiling, not currently available
- Shares are allocated to projects

Shares are not percentages
- Shares allocated are relative to shares allocated to other projects
- The total number of shares allocated also matters

FSS can be used in conjunction with processor sets
- Finer grained management and control

FSS & Processor Sets
- Processor Set 1 (2 CPUs, 25% of System)
  - Project A 16.66% (1/6)
  - Project B 33.33% (2/6)
  - Project C 50% (3/6)
- Processor Set 2 (4 CPUs, 50% of System)
  - Project B 40% (2/5)
  - Project C 60% (3/5)
- Processor Set 3 (2 CPUs, 25% of System)
  - Project C 100% (3/3)
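
The fractions on this slide follow from dividing each project's shares by the total shares of the projects active in the same processor set. The sketch below reproduces that arithmetic. The share counts (A=1, B=2, C=3) and the assignment of projects to sets are inferred from the numerators and denominators shown, and the function names are hypothetical.

```python
from fractions import Fraction

# Shares per project, inferred from the fractions on the slide (1/6, 2/6, 3/6, ...).
SHARES = {"A": 1, "B": 2, "C": 3}

# Which projects run in which processor set, inferred from the denominators.
PSETS = {
    "pset1": {"cpus": 2, "projects": ["A", "B", "C"]},
    "pset2": {"cpus": 4, "projects": ["B", "C"]},
    "pset3": {"cpus": 2, "projects": ["C"]},
}


def entitlement(pset):
    """CPU fraction of one processor set that each active project is entitled to."""
    active = PSETS[pset]["projects"]
    total = sum(SHARES[p] for p in active)
    return {p: Fraction(SHARES[p], total) for p in active}


def system_wide(project):
    """Fraction of the whole machine a project is entitled to across all sets."""
    total_cpus = sum(ps["cpus"] for ps in PSETS.values())
    return sum(
        entitlement(name).get(project, Fraction(0)) * Fraction(ps["cpus"], total_cpus)
        for name, ps in PSETS.items()
    )


if __name__ == "__main__":
    for name in PSETS:
        print(name, {p: f"{float(f):.2%}" for p, f in entitlement(name).items()})
    for project in SHARES:
        print(project, f"{float(system_wide(project)):.2%} of the system")
```

The same three share values give project C anywhere from 50% to 100% of a set, which is why shares should not be read as percentages.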


Resource Pools
- Provides a facility for stateful (persistent) processor sets and project binding, as well as scheduling class assignment
- Resource pool management is done via pooladm(1M), poolbind(1M), and poolcfg(1M)
- /etc/pooladm.conf provides persistence across reboots (managed via poolcfg(1M))
- poolbind(1M) provides for binding of projects or tasks to a resource pool
- /etc/project can define a resource pool for a project or task

Solaris Release Features Summary


Solaris 7 - New Features
64-Bits
- Kernel
- 64-bit binary support
- Full binary compatibility for 32-bit executables

UFS logging
- mount -o logging
- Logs to spare blocks in cylinder group
- No fsck

UFS noatime
- Disable access time update to inodes

pgrep & pkill
- Ends ps -ef | grep proc_name | awk '{ print $2 }'

Solaris 7 - New Features
traceroute bundled

dumpadm(1M)
- Configure a separate raw partition for dumps
- Dump running systems

LDAP Client Library

TCP with SACK
- Selective Acknowledgement - RFC 2018

libdevinfo(3)
- Device configuration information APIs

truss(1) Enhanced
- User level function tracing. -u, -U

Solaris 8 - New Features
Cyclic Page Cache
- Enhanced VM page management functionality
- Priority for page allocation given to process segments
- freemem is real!

System Message IDs
- Numeric ID generated for syslog messages

devfsadm(1M)
- One tool for device configuration/management
- DR events managed through devfsadmd

mmap MAP_ANON
- a = mmap(addr, len, prot, flags | MAP_ANON, -1, off);

POSIX High Resolution Timers
- CLOCK_HIGHRES via new Cyclics kernel subsystem
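
For readers who want to try an anonymous mapping without writing C, Python's standard `mmap` module exposes the same idea: passing -1 as the file descriptor requests memory that is not backed by a file. This is a sketch of the concept rather than the Solaris C call shown on the slide, and the helper name `anon_roundtrip` is made up for the example.

```python
import mmap


def anon_roundtrip(payload, length=mmap.PAGESIZE):
    """Write payload into an anonymous mapping and read it back."""
    if len(payload) > length:
        raise ValueError("payload does not fit in the mapping")
    # fileno -1 asks for an anonymous mapping, the counterpart of MAP_ANON with fd -1.
    with mmap.mmap(-1, length) as region:
        region[: len(payload)] = payload
        return bytes(region[: len(payload)])


if __name__ == "__main__":
    print(anon_roundtrip(b"solaris"))
```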


Solaris 8 - New Features
prstat(1)
- Top-like curses based process monitor utility

apptrace(1)
- truss-like utility for tracing user-level library calls

/proc tools enhanced to work on core files
- pstack(1), pcred(1), pfiles(1)

coreadm(1M)
- System-wide core file management

mdb(1)
- New kernel debugger - replaces adb & crash
- Supports use of adb macros and crash utilities
- Evolved to manage user code debugging (Sol 9)

Solaris 8 - New Features
User Level Priority Inheritance
- User defined mutex locks attribute

Forced unmount
- umount -f

Alternate threads library
- /usr/lib/lwp/libthread.so - provides all bound threads
- Does not require re-compilation

Freeware CD
- apache, bash, bzip2, tcsh, gcc, mkisofs, less, zsh, Glib, GTK+, etc, etc,...

Solaris 9 - New Features
Many, many (but not all) Solaris 9 features have been backported to Solaris 8
- Available in various Solaris 8 update releases

Resource Management
- Resource pools - configure boundaries on resources consumed by processes and tasks
- Processors today, memory coming
- Resource pools persist across reboots (unlike processor sets and bindings)
- See prctl(1), pooladm(1M), poolcfg(1M), poolbind(1M), rctladm(1M), project(4)

Fixed-Priority Scheduling Class (FX)
- TS class priority range, but priorities remain fixed

Fair Share Scheduling Class (FSS)
- Share-based (versus priority-based) CPU allocation

Solaris 9 - New Features
Command line process facilities
- pargs(1) - dump args and env associated with a live process, or core file
- preap(1) - remove zombies (Harry Cooper & Ben could have used this in 1968!)

du(1), df(1M) and ls(1) - New -h option
- -h - provide human-readable output format. Lists sizes in Kbytes, Mbytes, Gbytes, etc...

Multiple Page Size Support (MPSS)
- Support of pages larger than 8k for process stack, heap and mmap'd anonymous memory
- Actual supported page sizes hardware dependent
- UltraSPARC-III supports 8k, 64k, 512k, 4MB...

Solaris 9 - New Features


MPSS (cont)

    jurassic> pagesize -a
    8192
    65536
    524288
    4194304
    jurassic>
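
One way to see why larger pages matter is to count how many pages, and therefore how many translation entries, are needed to map a region at each size reported by pagesize -a. The sketch below does that arithmetic; the 8MB region size and the function name `pages_needed` are illustrative assumptions.

```python
# Page sizes in bytes, as printed by `pagesize -a` on this slide.
PAGE_SIZES = [8192, 65536, 524288, 4194304]


def pages_needed(region_bytes, page_size):
    """Number of pages of page_size needed to cover region_bytes (rounded up)."""
    return -(-region_bytes // page_size)  # ceiling division on integers


if __name__ == "__main__":
    region = 8 * 1024 * 1024  # a hypothetical 8MB heap segment
    for size in PAGE_SIZES:
        print(f"{size:>8} byte pages: {pages_needed(region, size)}")
```

Going from 8k to 4MB pages cuts the number of mappings for this region from 1024 to 2, which is the effect trapstat -T is meant to help quantify.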

New Threads Library/Model
- 1 Level threads model - all bound threads
- What was the alternate threads library in Solaris 8 is the default (in /usr/lib) in Solaris 9

Dynamic Intimate Shared Memory (DISM)
- Allows database to dynamically shrink/grow the shared segment
- Original ISM implementation was a big performance win (shared translation information, large pages), but was fixed in size
- DISM gives the best of both worlds

Solaris 9 - New Features
Security
- Internet Key Exchange (IKE) Protocol
- Secure Shell (ssh) - SSHv1 & SSHv2
- Kerberos Key Distribution Center (KDC) & Admin Tools
- Secure LDAP
- 128-bit Encryption
- Role-Based Access Controls (RBAC) Enhanced
- tcp-wrappers 7.6 in freeware CD
- Xserver encrypted connections supported

Solaris 9 - New Features
iPlanet Directory Server
- LDAP Server bundled/integrated

LDAP Name Service Support
- NIS+ - to - LDAP Migration Tool

FTP Server
- Based on WU-ftp server

PPP 4.0
- Includes PPPoE (Solaris 8 7/01)

IP Network Multipathing (Solaris 8 10/00)

Solaris Volume Manager
- Formerly Solaris DiskSuite
- Soft partitions and Device ID support

Summary
Steady, sustained progress on key areas - scalability, reliability, performance, features

Going forward
- Resource management - memory, service containers
- Observability - More & better tools
- Resilience - fault detection, isolation, containment
- Management - Zero downtime admin
  - patches, upgrades

Reliability, performance, always at the top

Supplemental Slides


Kernel Statistics
Solaris uses a central mechanism for kernel statistics
- kstat
- Kernel providers
  - raw statistics (c structure)
  - typed data
  - classed statistics
- Perl and C API
- kstat(1M) command

    # kstat -n system_misc
    module: unix                            instance: 0
    name:   system_misc                     class:    misc
            avenrun_15min                   90
            avenrun_1min                    86
            avenrun_5min                    87
            boot_time                       1020713737
            clk_intr                        2999968
            crtime                          64.1117776
            deficit                         0
            lbolt                           2999968
            ncpus                           2
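
The avenrun values in this output are not load averages as uptime prints them. The sketch below assumes they are fixed-point numbers with 8 fractional bits (a scale factor of 256), which is not stated on the slide and should be checked against the system headers before relying on it. The constant and function names are illustrative.

```python
# Assumed fixed-point scale for avenrun_* kstat values (8 fractional bits).
FSCALE = 256


def load_average(avenrun):
    """Convert a raw avenrun_* kstat value to a floating-point load average."""
    return avenrun / FSCALE


if __name__ == "__main__":
    sample = {"avenrun_1min": 86, "avenrun_5min": 87, "avenrun_15min": 90}
    for name, raw in sample.items():
        print(f"{name}: {load_average(raw):.2f}")
```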


Memory Accounting
The ps command
- SZ = Virtual Size
- RSS = Resident Set Size (including shared)

    # /usr/ucb/ps -aux
    USER       PID %CPU %MEM   SZ  RSS TT      S    START    TIME COMMAND
    root     22998 12.0  0.8 4584 1992 ?       S 10:05:30    3:22 /usr/sbin/nsr/nsrc
    root     23672  1.0  0.7 1736 1592 pts/16  O 10:22:54    0:00 /usr/ucb/ps -aux
    root         3  0.4  0.0    0    0 ?       S   Sep 28  166:38 fsflush
    root       733  0.4  1.0 6352 2496 ?       S   Sep 28  174:29 /opt/SUNWsymon/jre
    root       345  0.3  0.7 2968 1736 ?       S   Sep 28   55:39 /usr/sbin/nsr/nsrd
    root     23100  0.2  0.5 3880 1104 ?       S   Oct 15    0:25 rpc.rstatd
    root       732  0.2  2.5 9920 6304 ?       S   Sep 28   94:43 esd - init topolog

The pmap command
Verbose Process mappings
- Solaris 8: private/shared
- Solaris 9: private=Anon, shared=RSS-Anon

    # pmap -x 123
     Address  Kbytes     RSS    Anon  Locked Mode   Mapped File
    00010000       8       8       -       - r-x--  mmap
    00020000       8       8       8       - rwx--  mmap
    01000000    1024    1024       -       - rw-s-  dev:0,2 ino:5304657
    02000000    1024    1024     512       - rw---  dev:0,2 ino:5304657
    03000000    1024    1024     512       - rw--R  dev:0,2 ino:5304657
    04000000    1024    1024    1024       - rw---    [ anon ]
    05000000     512     512     512       - rw--R    [ anon ]
    FF280000     680     680       -       - r-x--  libc.so.1
    FF33A000      32      32      32       - rwx--  libc.so.1
    FF380000      16      16       -       - r-x--  libc_psr.so.1
    FF3A0000       8       8       -       - r-x--  libdl.so.1
    FF3B0000       8       8       8       - rwx--    [ anon ]
    FF3C0000     152     152       -       - r-x--  ld.so.1
    FF3F6000       8       8       8       - rwx--  ld.so.1
    FFBFA000      24      24      24       - rwx--    [ stack ]
    -------- ------- ------- ------- -------
    total Kb    5552    5552    2640       -
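
The Solaris 9 rule on this slide (private = Anon, shared = RSS - Anon) can be checked against the listing. The sketch below hard-codes the RSS and Anon columns from the pmap -x 123 output, treating blank Anon cells as zero, and recomputes the totals. The variable and function names are illustrative.

```python
# (RSS, Anon) in Kbytes for each mapping in the `pmap -x 123` listing, top to bottom.
MAPPINGS = [
    (8, 0), (8, 8), (1024, 0), (1024, 512), (1024, 512),
    (1024, 1024), (512, 512), (680, 0), (32, 32), (16, 0),
    (8, 0), (8, 8), (152, 0), (8, 8), (24, 24),
]


def memory_split(mappings):
    """Return (rss, private, shared) in Kbytes using private=Anon, shared=RSS-Anon."""
    rss = sum(r for r, _ in mappings)
    private = sum(a for _, a in mappings)
    return rss, private, rss - private


if __name__ == "__main__":
    rss, private, shared = memory_split(MAPPINGS)
    print(f"RSS {rss}K = private {private}K + shared {shared}K")
```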


SWAP Space ctd...

    # swap -s
    total: 101456k bytes allocated + 12552k reserved = 114008k used, 597736k available

should read:

    total: 101456k bytes unallocated + 12552k allocated = 114008k reserved, 597736k available
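
Because the labels printed by swap -s are easy to misread, it can help to pull the four numbers out and check how they relate. The sketch below parses the line shown on this slide with a regular expression and confirms that the first two figures sum to the third. The function name and the dictionary keys (which follow the labels as printed, not the corrected wording) are assumptions for illustration.

```python
import re

SWAP_RE = re.compile(
    r"total: (\d+)k bytes allocated \+ (\d+)k reserved = (\d+)k used, (\d+)k available"
)


def parse_swap_s(line):
    """Extract the four Kbyte figures from a `swap -s` summary line."""
    match = SWAP_RE.search(line)
    if match is None:
        raise ValueError("unrecognized swap -s output")
    allocated, reserved, used, available = (int(v) for v in match.groups())
    return {"allocated": allocated, "reserved": reserved,
            "used": used, "available": available}


if __name__ == "__main__":
    line = ("total: 101456k bytes allocated + 12552k reserved = "
            "114008k used, 597736k available")
    figures = parse_swap_s(line)
    print(figures)
    print("sum matches:", figures["allocated"] + figures["reserved"] == figures["used"])
```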


Swap:

    # ./prtswap -l
    Swap Reservations:
    --------------------------------------------------------------------------
    Total Virtual Swap Configured:                          767MB =
    RAM Swap Configured:                                      255MB
    Physical Swap Configured:                               + 512MB

    Total Virtual Swap Reserved Against:                    513MB =
    RAM Swap Reserved Against:                                  1MB
    Physical Swap Reserved Against:                           512MB

    Total Virtual Swap Unresv. & Avail. for Reservation:    253MB =
    Physical Swap Unresv. & Avail. for Reservations:            0MB
    RAM Swap Unresv. & Avail. for Reservations:               253MB

    Swap Allocations: (Reserved and Phys pages allocated)
    --------------------------------------------------------------------------
    Total Virtual Swap Configured:                          767MB
    Total Virtual Swap Allocated Against:                   467MB

    Physical Swap Utilization: (pages swapped out)
    --------------------------------------------------------------------------
    Physical Swap Free (should not be zero!):               232MB =
    Physical Swap Configured:                                 512MB
    Physical Swap Used (pages swapped out):                   279MB

The pmap command
Swap reservations (Solaris 9):

    # pmap -S 123
     Address  Kbytes    Swap Mode   Mapped File
    00010000       8       - r-x--  mmap
    00020000       8       8 rwx--  mmap
    01000000    1024       - rw-s-  dev:0,2 ino:5304657
    02000000    1024    1024 rw---  dev:0,2 ino:5304657
    03000000    1024     512 rw--R  dev:0,2 ino:5304657
    04000000    1024    1024 rw---    [ anon ]
    05000000     512     512 rw--R    [ anon ]
    FF280000     680       - r-x--  libc.so.1
    FF33A000      32      32 rwx--  libc.so.1
    FF380000      16       - r-x--  libc_psr.so.1
    FF3A0000       8       - r-x--  libdl.so.1
    FF3B0000       8       8 rwx--    [ anon ]
    FF3C0000     152       - r-x--  ld.so.1
    FF3F6000       8       8 rwx--  ld.so.1
    FFBFA000      24      24 rwx--    [ stack ]
    -------- ------- -------
    total Kb    5552    3152

Shared Memory
System V Intimate Shared Memory (ISM)
- Shared translation data structures
- 4MB TLB Page Size
- Locked pages
- Invoke with an additional flag to shmat() - SHM_SHARE_MMU
- Default shared memory mode for Oracle RDBMS

System V Dynamic Intimate Shared Memory (DISM)
- Solaris 8 U3
- Pageable variant of ISM
- Integrated with Oracle 9i (dynamic SGA)
- 8k TLB Page Size for Solaris 8
- 4MB TLB Page Size for Solaris 9 U1

The pmap command


# pmap -x 15492 15492: ./maps Address Kbytes RSS Anon Locked 00010000 8 8 00020000 8 8 8 00022000 20344 16248 16248 03000000 1024 1024 04000000 1024 1024 512 05000000 1024 1024 512 06000000 1024 1024 1024 07000000 512 512 512 08000000 8192 8192 8192 09000000 8192 4096 0A000000 8192 8192 8192 0B000000 8192 8192 8192 FF280000 680 672 FF33A000 32 32 32 FF390000 8 8 FF3A0000 8 8 FF3B0000 8 8 8 FF3C0000 152 152 FF3F6000 8 8 8 FFBFA000 24 24 24 -------- ------- ------- ------- ------total Kb 50464 42264 18888 16384 Mode r-x-rwx-rwx-rw-srw--rw--R rw--rw--R rwxsrwxsrwxsR rwxsR r-x-rwx-r-x-r-x-rwx-r-x-rwx-rwx-Mapped File maps maps [ heap ] dev:0,2 ino:4628487 dev:0,2 ino:4628487 dev:0,2 ino:4628487 [ anon ] [ anon ] [ dism shmid=0x5 ] [ dism shmid=0x4 ] [ ism shmid=0x2 ] [ ism shmid=0x3 ] libc.so.1 libc.so.1 libc_psr.so.1 libdl.so.1 [ anon ] ld.so.1 ld.so.1 [ stack ]


Multiple Page Size Support


Solaris 8
- Large (4MB) pages with ISM/DISM for shared memory

Solaris 9
- "Multiple Page Size Support"
- Optional large pages for heap/stack
- A wrapper for unchanged programs (ppgsz)
- Programmatically via memcntl(2)
- Shared library for existing binaries (LD_PRELOAD) (/usr/lib/libmpss.so)
- pmap enhancements to observe page sizes (pmap -sx)
- Tool to observe potential gains (trapstat -T)


TLB Trap CPU Accounting


# trapstat -t 3
cpu  | itlb-miss %tim itsb-miss %tim | dtlb-miss %tim dtsb-miss %tim | %tim
-----+-------------------------------+-------------------------------+-----
 0 k |        25  0.0         0  0.0 |     29558  0.5         6  0.0 |  0.6
 0 u |      9728  0.1         1  0.0 |     17943  0.3         3  0.0 |  0.5
-----+-------------------------------+-------------------------------+-----
 1 k |         0  0.0         0  0.0 |     19001  1.2         3  0.0 |  1.2
 1 u |      7872  0.2         0  0.0 |     16300  0.5         0  0.0 |  0.8
=====+===============================+===============================+=====
 ttl |     17625  0.2         1  0.0 |     82802  1.3        12  0.0 |  1.5
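The ttl line in the trapstat sample is simply the sum of the per-CPU kernel (k) and user (u) rows. A minimal sketch of that bookkeeping follows; sum_tlb_misses is a hypothetical helper written for this illustration, not part of Solaris, and it assumes rows laid out exactly as in the sample.

```python
# Per-CPU rows copied from the trapstat -t sample (separator lines omitted).
SAMPLE = """\
 0 k |        25  0.0         0  0.0 |     29558  0.5         6  0.0 |  0.6
 0 u |      9728  0.1         1  0.0 |     17943  0.3         3  0.0 |  0.5
 1 k |         0  0.0         0  0.0 |     19001  1.2         3  0.0 |  1.2
 1 u |      7872  0.2         0  0.0 |     16300  0.5         0  0.0 |  0.8
"""


def sum_tlb_misses(text):
    """Sum the itlb/itsb/dtlb/dtsb miss counts over all per-CPU rows."""
    totals = {"itlb": 0, "itsb": 0, "dtlb": 0, "dtsb": 0}
    for line in text.splitlines():
        fields = line.split("|")
        if len(fields) != 4:
            continue  # not a data row
        instr = fields[1].split()  # itlb-miss %tim itsb-miss %tim
        data = fields[2].split()   # dtlb-miss %tim dtsb-miss %tim
        totals["itlb"] += int(instr[0])
        totals["itsb"] += int(instr[2])
        totals["dtlb"] += int(data[0])
        totals["dtsb"] += int(data[2])
    return totals


totals = sum_tlb_misses(SAMPLE)
print(totals)
```

In this sample the data-side TLB misses dominate, which is the kind of workload where larger pages are worth testing.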


The pmap command


# pmap -xs 15492
 Address  Kbytes     RSS    Anon  Locked Pgsz Mode   Mapped File
00010000       8       8       -       -   8K r-x--  maps
00020000       8       8       8       -   8K rwx--  maps
00022000    3960    3960    3960       -   8K rwx--    [ heap ]
00400000    8192    8192    8192       -   4M rwx--    [ heap ]
00C00000    4096       -       -       -    - rwx--    [ heap ]
01000000    4096    4096    4096       -   4M rwx--    [ heap ]
03000000    1024    1024       -       -   8K rw-s-  dev:0,2 ino:4628487
08000000    8192    8192       -    8192    - rwxs-    [ dism shmid=0x5 ]
09000000    4096    4096       -       -   8K rwxs-    [ dism shmid=0x4 ]
0A000000    4096       -       -       -    - rwxs-    [ dism shmid=0x2 ]
0B000000    8192    8192       -    8192   4M rwxsR    [ ism shmid=0x3 ]
FF280000     136     136       -       -   8K r-x--  libc.so.1
...
FF390000       8       8       -       -   8K r-x--  libc_psr.so.1
FF3A0000       8       8       -       -   8K r-x--  libdl.so.1
FF3B0000       8       8       8       -   8K rwx--    [ anon ]
FF3C0000     152     152       -       -   8K r-x--  ld.so.1
FF3F6000       8       8       8       -   8K rwx--  ld.so.1
FFBFA000      24      24      24       -   8K rwx--    [ stack ]
-------- ------- ------- ------- -------
total Kb   50464   42264   18888   16384


Memory Placement Optimization

Memory locality optimization for non-uniform memory architectures
- Solaris 9 Update 1
- Ex800 machines are slightly non-uniform
- F15k systems are slightly more non-uniform

Machine described as groups of latency (lgroups)
- Unit is typically a system board (processors + memory)
- Lgroups are an artifact of the hardware architecture (not user configurable)

Memory allocated close to the threads accessing it
- Threads are assigned a home lgroup
- Program heap and stack are allocated on the same lgroup
- Shared memory is allocated round-robin across boards in the system or processor set
- Different programmatic policies are also provided


Lock Statistics - mpstat


# mpstat 1
CPU minf mjf xcal intr ithr  csw icsw migr smtx srw syscl usr sys wt idl
  8    0   0 6611  456  300 1637    7   26 1110   0   135  33  45  2  21
  9    1   0 1294  250  100 2156    3   29 1659   0    68   9  63  0  28
 10    0   0 3232  308  100 2357    2   36 1893   0   104   2  66  2  30
 11    0   0  647  385  100 1952    1   19 1418   0    21   4  83  0  13
 12    0   0  190  225  100  307    0    1  589   0     0   0  98  0   2
 13    0   0  624  373  100 1689    2   14 1175   0    87   7  80  2  12
 14    0   0  392  312  100 1810    1   12 1302   0    49   2  80  2  15
 15    0   0  146  341  100 2586    2   13 1676   0     8   0  82  1  17
 16    0   0  382  355  100 1968    2    7 1628   0     4   0  88  0  12
 17    0   0   88  283  100  689    0    4  474   0    95   1  94  2   3
 18    0   0 3571  152  104  568    0    7 2007   0    15   0  93  1   6
 19    0   0 3133  278  100 2043    2   24 1307   0   113   7  69  1  22
 20    0   0  385  242  127 2127    2   22 1296   0    36   0  73  0  26
 21    0   0  152  369  100 2259    0   10 1400   0   140   2  84  2  12
 22    0   0 3964  241  120 1754    3   25 1085   0    91  11  62  1  26
 23    0   2  555  193  100 1827    2   23 1148   0   288   7  64  7  22
 24    0   0  811  245  113 1327    2   23 1228   0   110   3  76  4  17
 25    0   0  105  500  100 2369    0   11 1736   0     6   0  88  0  11
 26    0   0  163  395  131 2383    2   16 1487   0    64   2  79  1  18
 27    0   1  718 1278 1051 2073    4   23 1311   0   237   9  67  6  19
 28    0   0  868  271  100 2287    4   27 1309   0   139   9  55  0  36
 29    0   0  931  302  103 2480    3   29 1569   0   165   9  66  2  23
 30    0   0 2800  303  100 2146    2   13 1266   0   152  11  70  3  16
 31    0   1 1778  320  100 2368    2   24 1381   0   261  11  56  5  28


Lock Statistics - lockstat


# lockstat sleep 10

Adaptive mutex spin: 293311 events in 10.015 seconds (29288 events/sec)

 Count indv cuml rcnt     spin Lock                   Caller
-------------------------------------------------------------------------------
218549  75%  75% 1.00     3337 0x71ca3f50             entersq+0x314
 26297   9%  83% 1.00     2533 0x71ca3f50             putnext+0x104
 19875   7%  90% 1.00     4074 0x71ca3f50             strlock+0x534
 14112   5%  95% 1.00     3577 0x71ca3f50             qcallbwrapper+0x274
  2696   1%  96% 1.00     3298 0x71ca51d4             putnext+0x50
  1821   1%  97% 1.00       59 0x71c9dc40             putnext+0xa0
  1693   1%  97% 1.00     2973 0x71ca3f50             qdrain_syncq+0x160
   683   0%  97% 1.00       66 0x71c9dc00             putnext+0xa0
   678   0%  98% 1.00       55 0x71c9dc80             putnext+0xa0
   586   0%  98% 1.00       25 0x71c9ddc0             putnext+0xa0
   513   0%  98% 1.00       42 0x71c9dd00             putnext+0xa0
   507   0%  98% 1.00       28 0x71c9dd80             putnext+0xa0
   407   0%  98% 1.00       42 0x71c9dd40             putnext+0xa0
   349   0%  98% 1.00     4085 0x8bfd7e1c             putnext+0x50
   264   0%  99% 1.00       44 0x71c9dcc0             putnext+0xa0
   187   0%  99% 1.00       12 0x908a3d90             putnext+0x454
   183   0%  99% 1.00     2975 0x71ca3f50             putnext+0x45c
   170   0%  99% 1.00     4571 0x8b77e504             strwsrv+0x10
   168   0%  99% 1.00     4501 0x8dea766c             strwsrv+0x10
   154   0%  99% 1.00     3773 0x924df554             strwsrv+0x10


Lock Statistics - lockstat


Adaptive mutex block: 2818 events in 10.015 seconds (281 events/sec)

Count indv cuml rcnt     nsec Lock                   Caller
-------------------------------------------------------------------------------
 2134  76%  76% 1.00  1423591 0x71ca3f50             entersq+0x314
  272  10%  85% 1.00   893097 0x71ca3f50             strlock+0x534
  152   5%  91% 1.00   753279 0x71ca3f50             putnext+0x104
  134   5%  96% 1.00   654330 0x71ca3f50             qcallbwrapper+0x274
   65   2%  98% 1.00   872630 0x71ca51d4             putnext+0x50
    9   0%  98% 1.00   260444 0x71ca3f50             qdrain_syncq+0x160
    7   0%  98% 1.00  1390807 0x8dea766c             strwsrv+0x10
    6   0%  99% 1.00   906048 0x88876094             strwsrv+0x10
    5   0%  99% 1.00  2266267 0x8bfd7e1c             putnext+0x50
    4   0%  99% 1.00   468550 0x924df554             strwsrv+0x10
    3   0%  99% 1.00   834125 0x8dea766c             cv_wait_sig+0x198
    2   0%  99% 1.00   759290 0x71ca3f50             drain_syncq+0x380
    2   0%  99% 1.00  1906397 0x8b77e504             cv_wait_sig+0x198
    2   0%  99% 1.00   645358 0x71dd69e4             qdrain_syncq+0xa0


Lock Statistics - lockstat


Spin lock spin: 52335 events in 10.015 seconds (5226 events/sec)

Count indv cuml rcnt     spin Lock                   Caller
-------------------------------------------------------------------------------
23531  45%  45% 1.00     4352 turnstile_table+0x79c  turnstile_lookup+0x48
 1864   4%  49% 1.00       71 cpu[19]+0x40           disp+0x90
 1420   3%  51% 1.00       74 cpu[18]+0x40           disp+0x90
 1228   2%  54% 1.00       23 cpu[10]+0x40           disp+0x90
 1159   2%  56% 1.00       60 cpu[16]+0x40           disp+0x90
 1138   2%  58% 1.00       22 cpu[24]+0x40           disp+0x90
 1108   2%  60% 1.00       57 cpu[17]+0x40           disp+0x90
 1082   2%  62% 1.00       24 cpu[11]+0x40           disp+0x90
 1039   2%  64% 1.00       25 cpu[29]+0x40           disp+0x90
 1009   2%  66% 1.00       17 cpu[23]+0x40           disp+0x90
 1007   2%  68% 1.00       21 cpu[31]+0x40           disp+0x90
  882   2%  70% 1.00       29 cpu[13]+0x40           disp+0x90
  846   2%  71% 1.00       25 cpu[28]+0x40           disp+0x90
  833   2%  73% 1.00       27 cpu[30]+0x40           disp+0x90


Lock Statistics - lockstat


Thread lock spin: 1232 events in 10.015 seconds (123 events/sec)

Count indv cuml rcnt     spin Lock                   Caller
-------------------------------------------------------------------------------
  468  38%  38% 1.00     1018 turnstile_table+0x79c  ts_tick+0x8
  251  20%  58% 1.00      683 turnstile_table+0x79c  turnstile_block+0x1f4
  180  15%  73% 1.00      152 sleepq_head+0x7f4      ts_tick+0x8
   68   6%  78% 1.00       35 sleepq_head+0x7f4      turnstile_block+0x1f4
   31   3%  81% 1.00      650 sleepq_head+0x7f4      ts_update_list+0x60
   17   1%  82% 1.00       34 cpu[27]+0x64           cv_wait+0x18
    7   1%  83% 1.00       64 cpu[13]+0x64           cv_wait+0x18
    7   1%  84% 1.00      146 cpu[30]+0x64           ts_tick+0x8
    6   0%  84% 1.00       56 cpu[29]+0x64           ts_tick+0x8
    6   0%  84% 1.00       37 cpu[8]+0x64            turnstile_block+0x1f4
    6   0%  85% 1.00       96 cpu[9]+0x64            ts_tick+0x8


Lock Statistics - lockstat


R/W writer blocked by writer: 1 events in 10.015 seconds (0 events/sec)

Count indv cuml rcnt     nsec Lock                   Caller
-------------------------------------------------------------------------------
    1 100% 100% 1.00   169634 0x9d42d620             segvn_pagelock+0x150
-------------------------------------------------------------------------------

R/W reader blocked by writer: 3 events in 10.015 seconds (0 events/sec)

Count indv cuml rcnt     nsec Lock                   Caller
-------------------------------------------------------------------------------
    3 100% 100% 1.00  1841415 0x75b7abec             mir_wsrv+0x18
-------------------------------------------------------------------------------


lockstat - kernel profiling

# lockstat -kIi997 sleep 10

Profiling interrupt: 10596 events in 5.314 seconds (1994 events/sec)

Count indv cuml rcnt     nsec CPU+PIL                Caller
-------------------------------------------------------------------------------
 5122  48%  48% 1.00     1419 cpu[0]                 default_copyout
 1292  12%  61% 1.00     1177 cpu[1]                 splx
 1288  12%  73% 1.00     1118 cpu[1]                 idle
  911   9%  81% 1.00     1169 cpu[1]                 disp_getwork
  695   7%  88% 1.00     1170 cpu[1]                 i_ddi_splhigh
  440   4%  92% 1.00     1163 cpu[1]+11              splx
  414   4%  96% 1.00     1163 cpu[1]+11              i_ddi_splhigh
  254   2%  98% 1.00     1176 cpu[1]+11              disp_getwork
   27   0%  99% 1.00     1349 cpu[0]                 uiomove
   27   0%  99% 1.00     1624 cpu[0]                 bzero
   24   0%  99% 1.00     1205 cpu[0]                 mmrw
   21   0%  99% 1.00     1870 cpu[0]                 (usermode)
    9   0%  99% 1.00     1174 cpu[0]                 xcopyout
    8   0%  99% 1.00      650 cpu[0]                 ktl0
    6   0%  99% 1.00     1220 cpu[0]                 mutex_enter
    5   0%  99% 1.00     1236 cpu[0]                 default_xcopyout
    3   0% 100% 1.00     1383 cpu[0]                 write
    3   0% 100% 1.00     1330 cpu[0]                 getminor
    3   0% 100% 1.00      333 cpu[0]                 utl0
    2   0% 100% 1.00      961 cpu[0]                 mmread
    2   0% 100% 1.00     2000 cpu[0]+10              read_rtc


Kernel Process Model


Processes
- All processes begin life as a program
- All processes begin life as a disk file (ELF object)
- All processes have state or context that defines their execution environment
- Context can be further divided into hardware context and software context

Hardware context
- The processor state, which is CPU architecture dependent
- In general, the state of the hardware registers (general registers, privileged registers)
- Maintained in the LWP

Software context
- Address space, credentials, open files, resource limits, etc. - stuff shared by all the threads in a process


Dispatcher Views

[Figure: two processes, each with several user threads, its software context (open files, credentials, address space, process group, session control, ...) and its LWPs, each LWP carrying machine state. Below them is the kernel dispatcher view: kthreads scheduled onto the CPU.]

Unbound user threads are scheduled within the threads library, where the selected user thread is linked to an available LWP. This does not apply to bound threads.


Dispatcher & Scheduling Classes


Solaris supports multiple scheduling classes
- Allows for the co-existence of different priority schemes and scheduling algorithms (policies) within the kernel
- Each scheduling class provides class-specific functions to manage thread priorities, administration, creation, termination, etc.
- The class-specific functions are called using a macro scheme, similar to what is used at the VFS layer
  ... CL_PREEMPT(thread) -> ts_preempt() ...
- Each scheduling class is assigned a range of priorities
- For each loaded scheduling class, the priority range falls within the system's total range of global priorities

The dispatcher is the kernel subsystem that manages the dispatch queues (run queues), handles thread selection, context switching, preemption, etc.
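The CL_PREEMPT(thread) -> ts_preempt() indirection can be modeled as a per-class table of functions. The sketch below is a toy model in Python, not kernel code: the real kernel uses C structures of function pointers, and the CLASS_OPS table and the thread dictionaries here are assumptions made for illustration.

```python
# Toy model of class-specific dispatch. Only the names CL_PREEMPT and
# ts_preempt come from the slide; everything else is hypothetical.

def ts_preempt(thread):
    """Stand-in for the timeshare class preempt routine."""
    return "ts_preempt(%s)" % thread["name"]


def rt_preempt(thread):
    """Stand-in for the realtime class preempt routine."""
    return "rt_preempt(%s)" % thread["name"]


# One operations table per scheduling class, keyed by operation name.
CLASS_OPS = {
    "TS": {"preempt": ts_preempt},
    "RT": {"preempt": rt_preempt},
}


def CL_PREEMPT(thread):
    """Call the preempt routine of whichever class the thread belongs to."""
    return CLASS_OPS[thread["class"]]["preempt"](thread)


ts_thread = {"name": "t1", "class": "TS"}
rt_thread = {"name": "t2", "class": "RT"}
print(CL_PREEMPT(ts_thread))
print(CL_PREEMPT(rt_thread))
```

The caller never names a class: adding a scheduling class means adding a table entry, which is the same design the VFS layer uses for file systems.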

Scheduling Classes

SunOS currently implements the following scheduling classes
- Timeshare (TS)
- Fixed Priority (FX)
- Fair Share (FSS)
- Interactive (IA)
- System (SYS)
- Realtime (RT)

Global priority ranges, from highest (best) to lowest (worst) priority
- 160-169: interrupt
- 100-159: realtime
- 60-99: system
- 0-59: timesharing and interactive

Interrupt thread priorities sit above system: if the realtime class is not loaded, they occupy priorities 100-109.
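The priority ranges on this slide amount to a small lookup table. A minimal sketch, assuming only the ranges stated on the slide (priority_band is a hypothetical name, not a kernel interface):

```python
def priority_band(pri, rt_loaded=True):
    """Return the band a global priority falls into, per the slide's ranges."""
    if rt_loaded:
        bands = [(160, 169, "interrupt"), (100, 159, "realtime"),
                 (60, 99, "system"), (0, 59, "timesharing and interactive")]
    else:
        # Without the RT class, interrupts sit directly above system.
        bands = [(100, 109, "interrupt"), (60, 99, "system"),
                 (0, 59, "timesharing and interactive")]
    for low, high, name in bands:
        if low <= pri <= high:
            return name
    raise ValueError("priority %d is outside the global range" % pri)


print(priority_band(59))
print(priority_band(105))
print(priority_band(105, rt_loaded=False))
```

The same number can mean different things depending on what is loaded: 105 is a realtime priority when RT is present and an interrupt priority when it is not.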


Scheduling Classes - Priorities


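[Figure: class-specific priority ranges mapped onto the global priority range (0 to 169). Timeshare and interactive each have a user priority range of -60 to +60, realtime has a user priority range of 0 to 59, system sits above them, and interrupts occupy 10 levels (1 to 10) at the top.]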


Quick Tidbit
Use dispadmin(1M) or mdb(1) for scheduling class info

# dispadmin -l
CONFIGURED CLASSES
==================

SYS     (System Class)
TS      (Time Sharing)
FX      (Fixed Priority)
IA      (Interactive)

# mdb -k
> ::class
SLOT NAME INIT FCN  CLASS FCN
   0 SYS  sys_init  sys_classfuncs
   1 TS   ts_init   ts_classfuncs
   2 FX   fx_init   fx_classfuncs
   3 IA   ia_init   ia_classfuncs
   4      0         0
   5      0         0

Note that the RT class is not loaded.


Thread Priorities & Scheduling


- Every thread has 2 priorities: a global priority, derived from its scheduling class, and (potentially) an inherited priority
- Priority is inherited from the parent, and is alterable via the priocntl(1) command or system call
- Typically, threads run as either TS or IA threads
  - IA threads are created when a thread is associated with a windowing system
- RT threads are explicitly created
- The SYS class is used by kernel threads, and for TS/IA threads when a higher priority is warranted
  - A temporary boost when an important resource is being held
- Interrupts run at interrupt priority

File System Types


Filesystem  Type     Device          Description
ufs         Regular  Disk            Unix Fast Filesystem, default in Solaris
pcfs        Regular  Disk            MSDOS filesystem
hsfs        Regular  Disk            High Sierra File System (CDROM)
tmpfs       Regular  Memory          Uses memory and swap
nfs         Pseudo   Network         Network filesystem
cachefs     Pseudo   Filesystem      Uses a local disk as cache for another NFS file system
autofs      Pseudo   Filesystem      Uses a dynamic layout to mount other file systems
specfs      Pseudo   Device Drivers  Filesystem for the /dev devices
procfs      Pseudo   Kernel          /proc filesystem representing processes
sockfs      Pseudo   Network         Filesystem of socket connections
fifofs      Pseudo   Files           FIFO File System


The virtual file system framework

[Figure: file-related system calls enter the kernel through the system call interface and pass to the VFS, the file system independent layer (VFS and vnode interfaces), which dispatches to the individual file systems: UFS, PCFS, HSFS, VxFS, NFS, PROCFS.]

VNODE OPERATIONS: open(), close(), read(), write(), ioctl(), seek(), creat(), link(), unlink(), rename(), mkdir(), rmdir(), fsync()

VFS OPERATIONS: mount(), umount(), statfs(), sync()


The VFS Interface


[Figure: *rootvfs heads a list of VFS structures, one per mount point (/, /usr, /var, /opt). Each VFS structure records its vnode, its VFS type (ufs, nfs, etc., an index into vfssw[]), blocksize, flags, device, synclist and hashlist, and points to its VFSOP_xxx operations.]

VFSOP_xxx     UFS implementation
mount()       ufs_mount()
unmount()     ufs_unmount()
root()        ufs_root()
statvfs()     ufs_statvfs()
sync()        ufs_sync()
vget()        ufs_vget()
mountroot()   ufs_mountroot()
swapvp()      ufs_swapvp()


The vnode interface


[Figure: each VNODE points to its vnode operations, its memory pages, and its file system. The VNODE type is one of: regular file, directory, block device, character device, link, FIFO, process, socket.]

VNODE Ops   UFS implementation
close()     ufs_close()
read()      ufs_read()
write()     ufs_write()
ioctl()     ufs_ioctl()
create()    ufs_create()
link()      ufs_link()
...         ...


File system Caching


- Solaris file systems use the VM system to cache and move data
- Regular reads are page-ins, delayed writes are page-outs
- VM parameters and load dramatically affect file system performance
  - Solaris 8 gives executable, stack and heap pages priority over file system pages


File System Caching

[Figure: the file system cache hierarchy. User side: read()/write(), fread()/fwrite() through the stdio buffers, and mmap() into the process address space (stack, heap, binary data, binary text). Kernel side: the level 1 page cache (segmap, 256MB on Ultra), the level 2 dynamic page cache, the directory name cache (ncsize) for file name lookups, the inode cache (ufsninode), and the buffer cache (BUFHWM, labeled "direct. blocks"), all in front of the storage devices.]

- The DNLC hit ratio can be observed with vmstat -s
- The hit ratio of the segmap cache can be measured with netstat -k segmap
- The buffer cache hit ratio can be observed with sar -b
- Files mapped with mmap() bypass the segmap cache


UFS
- Block based allocation
- 2TB max file system size
- A file can grow to the max file system size
  - triple indirect is implemented
- Prior to 2.6, max file size is 2GB
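Why triple indirection matters for a 2TB file can be checked with a little arithmetic. The sketch below assumes an 8 KB block size, 12 direct block pointers per inode, and 4-byte block addresses; none of these values is stated on this slide, so treat them as illustrative assumptions.

```python
BLOCK_SIZE = 8192   # assumed 8 KB file system block
PTR_SIZE = 4        # assumed 32-bit block addresses
NDIRECT = 12        # assumed number of direct block pointers in the inode

PTRS_PER_BLOCK = BLOCK_SIZE // PTR_SIZE  # pointers held by one indirect block


def addressable_bytes(indirect_levels):
    """Bytes reachable through direct blocks plus N levels of indirection."""
    blocks = NDIRECT + sum(PTRS_PER_BLOCK ** level
                           for level in range(1, indirect_levels + 1))
    return blocks * BLOCK_SIZE


TWO_TB = 2 * 1024 ** 4
double_limit = addressable_bytes(2)  # direct + single + double indirect
triple_limit = addressable_bytes(3)  # adds the triple indirect block

print(double_limit, double_limit < TWO_TB)
print(triple_limit, triple_limit >= TWO_TB)
```

Under these assumptions, double indirection tops out at roughly 32 GB, so a file can only reach the 2TB file system limit if the triple indirect block is implemented.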


UFS Block Allocation


# filestat /home/bigfile
Inodes per cyl group:    64
Inodes per block:        64
Cylinder Group no:       0
Cylinder Group blk:      64
File System Block Size:  8192
Device block size:       512
Number of device blocks: 204928

Start Block    End Block    Length (Device Blocks)
-----------    -----------  ----------------------
      66272 ->       66463                     192
      66480 ->       99247                   32768
    1155904 ->     1188671                   32768
    1277392 ->     1310159                   32768
    1387552 ->     1420319                   32768
    1497712 ->     1530479                   32768
    1607872 ->     1640639                   32768
    1718016 ->     1725999                    7984
    1155872 ->     1155887                      16

Number of extents:   9
Average extent size: 22769 Blocks

Note: The filestat command is shown for demonstration purposes, and is not as yet included with the Solaris operating system.
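Each extent length in the filestat listing is end - start + 1, counted in 512-byte device blocks. The sketch below recomputes the lengths from the start and end blocks shown; the tuple layout and variable names are choices made for this illustration.

```python
# (start, end, reported length) in 512-byte device blocks, from the listing.
EXTENTS = [
    (66272, 66463, 192),
    (66480, 99247, 32768),
    (1155904, 1188671, 32768),
    (1277392, 1310159, 32768),
    (1387552, 1420319, 32768),
    (1497712, 1530479, 32768),
    (1607872, 1640639, 32768),
    (1718016, 1725999, 7984),
    (1155872, 1155887, 16),
]
DEVICE_BLOCK = 512

# Both the start and the end block belong to the extent, hence the +1.
lengths = [end - start + 1 for start, end, _ in EXTENTS]
largest_kb = max(lengths) * DEVICE_BLOCK // 1024

print(lengths)
print(largest_kb)
```

The largest contiguous runs are 32768 device blocks, or 16 MB each. The extent lengths sum to 204800 blocks, slightly below the 204928 device blocks reported for the file; the slide does not explain the difference.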


UFS Logging
- Beginning in Solaris 7, UFS logging became a mount option
- Log to spare blocks in the file system (no metadevice)
- Fast reboots - no fsck required


UFS Direct I/O


- File systems cause a lot of paging activity
- Solaris 2.6 introduces a mechanism to bypass the VM system
  - Forces completely unbuffered I/Os
  - Very slow writes (synchronous)
  - Useful for copying large files, or when the application does its own caching, e.g. Oracle
  - Per file system: mount -o forcedirectio /dev/xyz /mountpt
  - Per file: directio(fd, DIRECTIO_ON | DIRECTIO_OFF)


Direct I/O Checklist


- Must be aligned
  - sector aligned (512 byte boundary)
- Must not be mapped
- Buffer must be word aligned


UFS Write Throttle


A throttle exists in UFS to limit the amount of memory UFS can saturate, per file
- Controlled by three parameters
  - ufs_WRITES (1 = enabled)
  - ufs_HW = 393216 bytes (high water mark to suspend IO)
  - ufs_LW = 262144 bytes (low water mark to start IO)

These almost always need to be set higher to get maximum sequential write performance
  set ufs_LW=4194304
  set ufs_HW=67108864
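The two water marks form a hysteresis loop: a writer is suspended once the bytes in flight for a file exceed ufs_HW, and is only resumed after they drain to ufs_LW. The sketch below is a simplified model using the default values from the slide. It is not the kernel implementation; the class and method names are hypothetical, and the exact comparisons (greater than ufs_HW, at or below ufs_LW) are assumptions.

```python
class WriteThrottle:
    """Simplified per-file model of the UFS write throttle (not kernel code)."""

    def __init__(self, high_water=393216, low_water=262144):
        self.high_water = high_water  # ufs_HW default from the slide
        self.low_water = low_water    # ufs_LW default from the slide
        self.outstanding = 0          # bytes queued but not yet on disk
        self.suspended = False

    def queue_write(self, nbytes):
        """Account for a queued write; suspend the writer above ufs_HW."""
        self.outstanding += nbytes
        if self.outstanding > self.high_water:
            self.suspended = True
        return self.suspended

    def complete_write(self, nbytes):
        """Account for a finished write; resume at or below ufs_LW."""
        self.outstanding -= nbytes
        if self.suspended and self.outstanding <= self.low_water:
            self.suspended = False
        return self.suspended


throttle = WriteThrottle()
for _ in range(48):                  # 48 x 8 KB = 393216 bytes, exactly ufs_HW
    throttle.queue_write(8192)
at_high_water = throttle.suspended   # limit reached but not exceeded
over_high_water = throttle.queue_write(8192)  # one more block crosses ufs_HW
print(at_high_water, over_high_water)
```

With the defaults, fewer than 50 8 KB blocks can be in flight per file before the writer stalls, which is why the slide recommends raising both values for sequential write workloads.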


UFS Performance
- Adjacent blocks are grouped and written together or read ahead
  - Controlled by the maxcontig parameter
  - Defaults to 128k on most platforms, 1MB on SPARCstorage Array 100, 200
  - Must be set higher to achieve adequate write performance
  - maxphys must be raised beyond 128k also
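maxcontig is expressed in file system blocks rather than bytes, so the cluster sizes quoted on this slide need converting before they can be applied. A minimal sketch, assuming an 8 KB UFS block size (maxcontig_blocks is a hypothetical helper):

```python
FS_BLOCK = 8192  # assumed 8 KB UFS block size


def maxcontig_blocks(cluster_bytes, block_size=FS_BLOCK):
    """Convert a desired cluster size in bytes to a maxcontig block count."""
    if cluster_bytes % block_size:
        raise ValueError("cluster size must be a multiple of the block size")
    return cluster_bytes // block_size


default_blocks = maxcontig_blocks(128 * 1024)  # the 128k default
array_blocks = maxcontig_blocks(1024 * 1024)   # the 1MB SPARCstorage Array value
print(default_blocks, array_blocks)
```

Under that assumption the 128k default corresponds to 16 blocks and a 1MB cluster to 128 blocks.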
