AIX Performance Tuning for Databases
December 2, 2005
Mathew Accapadi
© 2005 IBM Corporation
AIX Performance Tools
Monitoring/Analysis tools for CPU
- Profiling tools: gprof, xprofiler, prof, tprof, time, timex
- Monitoring tools: vmstat, iostat, sar, emstat, alstat, mpstat, lparstat, topas, nmon, PTX, jtopas, ps, wlmmon, pprof, procmon
- Trace tools: trace, trcrpt, curt, splat, truss
- Hardware counter tools: PMAPI, tcount
Monitoring/Analysis tools for memory
- Profiling tools: svmon, MALLOCDEBUG, ps
- Monitoring tools: svmon, vmstat, topas, jtopas, nmon, PTX, lsps, ps
- Trace tools: trace, trcrpt
- Hardware counter tools: PMAPI, tcount
AIX Performance Tools cont.
Monitoring/Analysis tools for Network
– netstat, nfsstat, netpmon
– iptrace, ipreport, ipfilter, tcpdump
– topas, jtopas, nmon, PTX
– trace, trcrpt, curt
Monitoring/Analysis tools for I/O
– iostat, vmstat
– filemon, fileplace, lvmstat
– topas, jtopas, nmon, PTX
– trace, trcrpt, curt
AIX Performance Tools cont.
/proc tools
– proccred, procfiles, procflags, procldd, procmap, procrun, procsig, procstack, procstop, proctree, procwait, procwdx
SPMI
– provides access to all PTX metrics
– allows application metrics to be exported to PTX tools
Hardware counter tools
– PMAPI
– hpmcount, hpmstat, libhpm
AIX Tuning Tools
CPU
– schedo (scheduler options)
– priority tools: nice/renice
– affinity tools: bindprocessor, bindintcpu, rset tools
– ulimit (cpu limit)
Memory
– vmo (virtual memory options)
– ioo (io options but related to memory)
– fdpr, chdev (sys0 device), ulimit (data/stack/rss limits)
AIX Tuning Tools cont.
Network
– no (network options), nfso (NFS options)
– chdev (network adapter tuning), ifconfig (interface tuning)
I/O
– ioo (I/O options)
– lvmo (LVM options)
– chdev (hdisk, adapter tuning)
– migratepv, migratelp, reorgvg
Standard Tuning Framework
Starting with AIX 5.2, the main tuning tools will be
part of a standard framework
– Tools will end with ‘o’ (schedo, vmo, ioo, no, nfso, …)
– Each command will have the following options (see the example below):
• -a displays all the parameters
• -p sets the value now and makes it permanent
• -r applies the value at the next reboot and makes it permanent
• -o specifies the tunable parameter with its value
• -h shows help information for the tunable
• -L shows current/default/ranges for the tunable
• -d resets the tunable to its default value
• -D resets all tunables to default values
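For example (the tunables and values shown are illustrative):
# ioo -a                              # list all I/O tunables and current values
# vmo -L minfree                      # current/default/range for one tunable
# vmo -p -o maxfree=184               # set now and keep across reboots
# ioo -r -o j2_maxPageReadAhead=256   # apply at the next reboot, permanently
# schedo -d maxspin                   # reset one tunable to its default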
Standard Tuning Framework cont.
Tuning commands store values in /etc/tunables directory
– Standard tuning framework commands will modify the following files:
• /etc/tunables/nextboot will be used to apply tunable values when the system boots
• /etc/tunables/lastboot contains the values that were set at the last system boot
• /etc/tunables/lastboot.log contains log information for any tunable that was changed
– Tunables file commands (see the example below):
• tunrestore – sets tunables based on parameters in a file (used in /etc/inittab to set tunables from /etc/tunables/nextboot)
• tuncheck – used to validate the parameter values in a file
• tunsave – saves tunable values to a stanza file
• tundefault – resets tunable parameters to default values
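A sketch of the workflow (the file name mytunables is hypothetical):
# tunsave -f /etc/tunables/mytunables      # snapshot current values to a stanza file
# tuncheck -f /etc/tunables/mytunables     # validate the file before applying it
# tunrestore -f /etc/tunables/mytunables   # apply the saved values now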
Semaphore Tuning
Other operating systems may require tuning of semaphore parameters
Oracle on AIX uses a post/wait mechanism instead of semaphores
– No need to tune semaphores on AIX
AIX does not provide tunables for these semaphore parameters, but
there is no need to tune them since they are already set as high as
they can go:
– semmni (131072) – max number of semaphore id’s
– semmsl (65535) – max number of semaphores per id
– semopm (1024) – max number of operations per semop call
– semume (1024) – max number of undo entries per process
– semvmx (32767) – maximum value of a semaphore
Use ipcs to see semaphores/message queues/shared memory
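For example:
# ipcs -s    # active semaphore IDs
# ipcs -q    # active message queues
# ipcs -m    # active shared memory segments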
Message Queue Tuning
– Message queue structures are also dynamically scaled by AIX
up to their maximum values
– Message queue parameters do not need to be tuned
– Upper limits of message queue parameters:
• msgmax (4194304) – maximum size of message in bytes
• msgmnb (4194304) – maximum number of bytes on a queue
• msgmni (131072) - maximum number of message queue IDs
• msgmnm (524288) - maximum number of messages per queue
Shared Memory Tuning
Following shared memory parameters do not need to be tuned on AIX
(shared memory structures are scaled dynamically up to their upper limits)
– shmmni (131072) – number of shared memory IDs
– shmmin (1) – minimum shared memory segment size in bytes
– shmmax (3.25 GB) – maximum shared memory region size for a 32-bit process
– shmmax64 (32 TB) – maximum shared memory region size for a 64-bit process on a 64-bit kernel
– shmmax64 (1 TB) – maximum shared memory region size for a 64-bit process on a 32-bit kernel
An application can request the size of the shared memory area, and
whether or not to use large pages or pin memory, using shmget()
Pinned Shared Memory
Pinning shared memory requires two steps
– vmo -p -o v_pinshm=1
– set the application parameter to pin the memory (i.e.,
lock_sga=TRUE for Oracle)
• Application must specify SHM_PIN in the shmget system call
Make sure that pinned memory does not go above
the maxpin value (defaults to 80% of RAM but can
be changed with vmo -p -o maxpin%=value)
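A minimal check, using counters that ‘vmstat -v’ already reports:
# vmo -p -o v_pinshm=1       # step 1: allow pinning of shared memory
# vmstat -v | grep -i pin    # compare 'pinned pages' against 'maxpin percentage'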
Using Large Pages for the Buffer Cache
Large pages (16MB) can make a noticeable improvement in
performance when the buffer cache is large
Steps needed to use large pages
– To enable the use of large pages for shared memory, vmo parameter
v_pinshm must be set to 1
– Use vmo parameters lgpg_regions and lgpg_size to specify how many
large pages to pin, then run bosboot and reboot to enable large pages
– Application must be configured to use large pages (i.e., lock_sga=TRUE
for Oracle)
– Application user ID must be configured with permission to use large
pages:
• chuser capabilities=CAP_BYPASS_RAC_VMM,CAP_PROPAGATE <user id>
– Verify that large pages are there using ‘vmstat -l’ or ‘svmon -G’
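Putting the steps together (the 8GB sizing and the oracle user ID are
assumptions for illustration):
# vmo -p -o v_pinshm=1
# vmo -r -o lgpg_size=16777216 -o lgpg_regions=512   # 512 x 16MB = 8GB of large pages
# bosboot -ad /dev/ipldevice                         # rebuild the boot image
# shutdown -Fr                                       # reboot to create the large page pool
# chuser capabilities=CAP_BYPASS_RAC_VMM,CAP_PROPAGATE oracle
# vmstat -l                                          # verify the large page pool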
vmo Parameters
# vmo -a
memory_frames = 1048576
pinnable_frames = 999141
maxfree = 152
minfree = 144
minperm% = 20
minperm = 200476
maxperm% = 50
maxperm = 501190
strict_maxperm = 0
maxpin% = 80
maxpin = 838861
maxclient% = 50
lrubucket = 131072
defps = 1
nokilluid = 0
numpsblks = 131072
npskill = 1024
npswarn = 4096
v_pinshm = 0
vmo Parameters cont.
pta_balance_threshold = 50
pagecoloring = 0
framesets = 2
mempools = 1
lgpg_size = 0
lgpg_regions = 0
num_spec_dataseg = n/a
spec_dataseg_int = n/a
memory_affinity = n/a
htabscale = -1
force_relalias_lite = 0
relalias_percentage = 0
data_stagger_interval = 161
large_page_heap_size = n/a
kernel_heap_psize = n/a
soft_min_lgpgs_vmpool = 0
vmm_fork_policy = 0
low_ps_handling = 1
mbuf_heap_psize = n/a
strict_maxclient = 1
cpu_scale_memp = 8
vmo Additional Parameters (5.3)
memplace_data = 2
memplace_mapped_file = 2
memplace_shm_anonymous = 2
memplace_shm_named = 2
memplace_stack = 2
memplace_text = 2
memplace_unmapped_file = 2
npsrpgmax = 12288
npsrpgmin = 9216
npsscrubmax = 12288
npsscrubmin = 9216
rpgclean = 0
rpgcontrol = 2
scrub = 0
scrubclean = 0
ioo Parameters
# ioo -a
minpgahead = 2
maxpgahead = 32
pd_npages = 65536
maxrandwrt = 0
numclust = 1
numfsbufs = 186
sync_release_ilock = 0
lvm_bufcnt = 9
j2_minPageReadAhead = 2
j2_maxPageReadAhead = 128
j2_nBufferPerPagerDevice = 512
j2_nPagesPerWriteBehindCluster = 32
j2_maxRandomWrite = 0
j2_nRandomCluster = 0
jfs_clread_enabled = 0
jfs_use_read_lock = 1
hd_pvs_opn = 2
hd_pbuf_cnt = 384
j2_inodeCacheSize = 400
j2_metadataCacheSize = 400
j2_dynamicBufferPreallocation = 16
j2_maxUsableMaxTransfer = 512
pgahd_scale_thresh = 0
pv_min_pbuf = 512 (5.3)
schedo Parameters
# schedo -a
v_repage_hi = 0
v_repage_proc = 4
v_sec_wait = 1
v_min_process = 2
v_exempt_secs = 2
pacefork = 10
sched_D = 16
sched_R = 16
timeslice = 1
maxspin = 16384
%usDelta = 100
affinity_lim = 7
idle_migration_barrier = 4
fixed_pri_global = 0
big_tick_size = 1
force_grq = 0
schedo Additional Parameters (5.3)
krlock_confer2self = 0
krlock_conferb4alloc = 0
krlock_enable = 1
krlock_spinb4alloc = 1
krlock_spinb4confer = 1024
n_idle_loop_vlopri = n/a
search_globalrq_mload = n/a
search_smtrunq_mload = n/a
setnewrq_sidle_mload = n/a
shed_primrunq_mload = n/a
sidle_S1runq_mload = n/a
sidle_S2runq_mload = n/a
sidle_S3runq_mload = n/a
sidle_S4runq_mload = n/a
slock_spinb4confer = 1024
smt_snooze_delay = n/a
smtrunq_load_diff = n/a
unboost_inflih = n/a
I/O Layers
Database Application --> Library
Asynchronous I/O (optional)
Filesystem
Virtual Memory Manager (optional)
LVM
Disk Subsystem powerpath/vpath layer (optional)
Disk Driver
Fibre Channel Protocol
Fibre Channel Device Driver
Host Bus Adapter -> Switch -> Disk Subsystem
Application I/O layer
Databases can do reads using read(), readv(), pread(),
lio_listio(), aio_read()
Databases can do writes with write(), writev(), pwrite(),
lio_listio(), aio_write()
Database I/O sizes (with exceptions for the redo log writer
and archiver) are based on the database block buffer size and
parameters such as db_multiblock_read_count
Usually, biggest improvements in performance are achieved
through tuning at the application/database layer
Database Logical vs Physical I/Os
It’s more efficient to rely on database caching than on
filesystem caching
– No system call required to do reads
– Database logical I/Os usually referred to as ‘buffer gets’
– Database physical I/Os handled via read system call but may
not be a physical I/O from an operating system perspective
(could be a logical I/O on the OS side since the data may be
retrieved from the filesystem cache)
• OS physical I/Os (seen in output of iostat command) always go to
the disk layer whereas DB physical I/Os can be retrieved from the
filesystem cache (DB physical I/Os >= OS physical I/Os)
• If filesystem cache is not used, DB physical I/Os would be equal to
the number of OS physical I/Os
Asynchronous I/O Layer
AIO to raw LVM devices can take a fastpath and bypass the AIO servers
– No tuning needs to be done in this case
AIO to filesystems currently uses a set of AIO queues and AIO
server threads
– AIO server threads take I/Os off the queues and submit them to the
filesystem
– Number of AIO server threads can be tuned (maxservers, a per CPU
value)
– AIO server thread does synchronous or non-synchronous I/O based
on the file open flags
– The AIO parameter ‘maxreqs’ specifies how many AIOs can be in
progress and on the queues at any one time
• Once the limit is reached, EAGAIN error is returned to the application
Asynchronous I/O Parameters
# lsattr -E -l aio0
autoconfig available STATE to be configured at system restart True
fastpath enable State of fast path True
kprocprio 39 Server PRIORITY True
maxreqs 4096 Maximum number of REQUESTS True
maxservers 10 MAXIMUM number of servers per cpu True
minservers 1 MINIMUM number of servers True
Asynchronous I/O Tuning
maxservers and maxreqs may need to be tuned
AIO server threads are created dynamically as needed up to
the maxservers parameter and will stay in existence from
then on
AIO parameters can be tuned on a permanent basis using
SMIT or chdev -l aio0 -a parameter=value
AIO parameters can be dynamically increased temporarily
using the aioo command
– aioo -o parameter=value
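For example (the values are illustrative, not recommendations):
# chdev -l aio0 -a maxservers=20 -a maxreqs=8192   # permanent; effective after reboot
# aioo -o maxservers=20                            # temporary; effective immediately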
Filesystem Sequential Read Tuning
Sequential reads benefit from readahead parameters
– minpgahead, maxpgahead for JFS
– j2_minPageReadAhead, j2_maxPageReadAhead for JFS2
– Increasing max readahead parameters can benefit sequential
reads (i.e., table scans); see the sketch below
• Need to increase maxfree parameter also to ensure that LRU keeps
up with readahead
– ‘rbr’ mount option releases pages from the file cache once they
are read into the application buffers (only for sequential reads)
– Filesystem readahead only available when using filesystem
cache (must rely on database readahead otherwise)
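A hedged JFS2 sketch (the values and the /oradata mount point are
illustrative; maxfree should be at least minfree + j2_maxPageReadAhead):
# ioo -p -o j2_maxPageReadAhead=512   # larger readahead for table scans
# vmo -p -o maxfree=656               # >= minfree (144) + readahead (512)
# mount -o rbr /oradata               # release-behind-read for sequential reads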
Filesystem Sequential Write Tuning
‘rbw’ mount will release pages on writes if writes
are sequential and file was opened non-
synchronously
Non-synchronous sequential writes may benefit
from write-behind tuning
– numclust parameter for JFS,
j2_nPagesPerWriteBehindCluster for JFS2
– In most cases, databases will not be doing such writes
since most database files are opened with a sync flag
Filesystem Buffer Tuning
I/Os to filesystems use buffer structures called
bufstructs
– Each filesystem preallocates a pool of bufstructs when
the filesystem is mounted
– If a bufstruct is not available, the I/O is queued until an
already submitted I/O completes and releases its
bufstruct
– A counter is incremented for each filesystem type when a
bufstruct is not available; run ‘vmstat -v’ to examine the
counters
Filesystem Buffer Tuning cont.
vmstat -v
---------
1572864 memory pages
1509213 lruable pages
18938 free pages
3 memory pools
211903 pinned pages
80.1 maxpin percentage
5.0 minperm percentage
5.0 maxperm percentage
58.4 numperm percentage
882475 file pages
0.0 compressed percentage
0 compressed pages
27.2 numclient percentage
5.0 maxclient percentage
411210 client pages
0 remote pageouts scheduled
0 pending disk I/Os blocked with no pbuf
0 paging space I/Os blocked with no psbuf
6263 filesystem I/Os blocked with no fsbuf
0 client filesystem I/Os blocked with no fsbuf
0 external pager filesystem I/Os blocked with no fsbuf
Filesystem Buffer Tuning cont.
Increase ioo parameter numfsbufs for JFS filesystems if
‘filesystem I/Os blocked with no fsbuf’ counter continues to
increase
Increase ioo parameter j2_nBufferPerPagerDevice if
‘external pager filesystem I/Os blocked with no fsbuf’
counter continues to increase
– Starting with 5.2 ML4, it should be rare that this parameter
would need to be tuned since JFS2 does dynamic buffer
allocation (increases bufstructs as needed)
– ioo parameter j2_dynamicBufferPreallocation controls how
many Kbytes’ worth of bufstructs are allocated each time
• Can be set to 0 to disable dynamic buffer allocation
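For example (values illustrative; numfsbufs takes effect only for JFS
filesystems mounted after the change):
# ioo -p -o numfsbufs=372                     # JFS: double the 186 default shown earlier
# ioo -p -o j2_dynamicBufferPreallocation=32  # JFS2: preallocate bufstructs faster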
Filesystem Direct I/O
By default, JFS and JFS2 filesystem I/Os are cached in
memory
Caching can be bypassed through the use of Direct I/O or
Concurrent I/O
Direct I/O is used by default with Oracle on GPFS
filesystems by specifying an open() flag (O_DIO)
Direct I/O is enabled at the filesystem by using the ‘dio’
mount option
– I/O size must be a multiple of the filesystem block size
Direct I/O mount supported by JFS, JFS2, GPFS
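For example (the /oradata mount point is hypothetical):
# mount -o dio /oradata    # all I/O to this filesystem bypasses the file cache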
Performance Impacts of Direct I/O
Bypassing caching eliminates VMM overhead
Direct I/O reads will not be able to take advantage of
readahead at the filesystem layer (disk subsystem may
provide readahead)
– Database may provide its own prefetch mechanism
Useful for random I/O and I/O that would normally not have a
high cache hit rate
Direct I/O writes will not be considered complete until they
have made it to disk
– Many databases open their files with a sync flag, so writes
must go to disk each time anyway
Filesystem Concurrent I/O
Concurrent I/O is a JFS2 feature which is Direct I/O without
inode locking
– Inode lock is still held when the file is being extended
– Inode locking by the filesystem is not necessary if the
application is performing proper serialization
• Application vendor must provide support for Concurrent I/O
– Concurrent I/O has all characteristics of Direct I/O
• No readahead, so it may not be optimal in all cases
– No inode locking can provide a large increase in performance
(when DB has a lot of writes and reads to the same files)
Concurrent I/O Enablement
Concurrent I/O is enabled either through the ‘cio’ mount
option or the O_CIO flag in the open() system call
Oracle 10g will open files with O_CIO if the Oracle parameter
filesystemio_options is set to ‘setall’
Directory where the application executables and libraries
reside should not be mounted with CIO (ex.
$ORACLE_HOME – oracle doesn’t use latches on files there)
– A ‘namefs’ mount can be used to mount a subdirectory
using ‘cio’
• mount -v namefs -o cio /filesystem/subdir /filesystem/subdir
For Oracle redo logs, put the logs in a separate filesystem
created with a 512-byte block size and mount it with ‘cio’
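A sketch of such a filesystem (the volume group datavg and the size are
assumptions; on older AIX levels, specify size in 512-byte blocks):
# crfs -v jfs2 -g datavg -a size=4G -a agblksize=512 -m /oraredo
# mount -o cio /oraredo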
Filesystem File Caching
When DIO or CIO is not used, files are cached in
real memory
– Size of file cache is based on ‘vmo’ parameters maxperm
(for JFS) and maxclient (for JFS2)
• strict_maxperm=0 (default) makes JFS cache size a soft limit
• strict_maxclient=1 (default) makes JFS2 cache size a hard
limit
• With soft limits, the number of file pages in RAM can exceed the
limit, but when page replacement needs to occur only file pages
are stolen/replaced
Page Replacement (LRU)
Page replacement, also known as LRU (Least Recently Used), is
handled by one or more threads of the multi-threaded kernel
process lrud
LRU runs when certain thresholds are reached
– If number of file pages in memory reaches within ‘minfree’
pages of the file cache limit (if limit is strict)
• LRU stops when number of file pages in memory is within ‘maxfree’
pages of the file cache limit
– If number of free memory pages on VMM freelist reaches
‘minfree’
• LRU stops when number of free memory pages on freelist reaches
‘maxfree’
– If a WLM class has reached its limit
LRU scanning
LRU scans the page frame table looking for eligible pages using a simple
least recently used criterion
– If the page has its reference bit set, it is not stolen, but the reference bit is
reset for the next pass of LRU
– If the reference bit is not set, the page may be stolen if the page meets certain
criteria
• If the page is a file page, then if the number of file pages is above the file cache limit,
the page can be stolen
• If the number of file pages in memory is in between the minperm value and the
maxperm/maxclient value, then repaging rates are used to determine if the page can
be stolen
– if the lru_file_repage parameter is set to 0, then if the number of file pages in memory is above the
minperm value, file pages are stolen
> Recommendation: set lru_file_repage=0, minperm%=1
• If the number of file pages in memory is below minperm, then LRU steals an
unreferenced page regardless of whether it’s a file page or a computational page
LRU Scanning cont.
If not enough eligible pages are found after scanning
‘lrubucket’ worth of pages (default is 131072 pages), LRU
starts over in that bucket and scans again
– The scan rate can be viewed under the ‘sr’ column in the
output of the vmstat command
When pages are stolen, they may be freed immediately if the
pages were not modified or freed after the pages are written
out to disk
– The free rate can be viewed under the ‘fr’ column in the output
of the vmstat command
Page Replacement and Paging Space
Computational pages that are modified and stolen must be
paged out to paging space
– Paging space I/O activity can be seen in the ‘pi’ (page-in rate)
and ‘po’ (page-out rate) columns of vmstat
– With vmstat -I, file pages read in from the filesystem (for
cached filesystems) show page-in rates under the ‘fi’ column of
vmstat; file writes show page-out rates under the ‘fo’ column
Performance can be noticeably impacted if computational
pages (such as the database buffer caches or the process
heap) are paged out and have to be paged in again
vmstat -I
# vmstat -I 10
System Configuration: lcpu=32 mem=191488MB
kthr memory page faults cpu
r b p avm fre fi fo pi po fr sr in sy cs us sy id wa
8 1 0 12945134 4213 151 7488 0 21 179 193 711 202851 203538 15 9 76 1
9 1 0 12945157 3926 25 6423 0 23 179 277 453 116885 191633 14 8 78 1
8 1 0 12945194 5759 15 9065 0 24 231 463 2008 125516 190439 14 9 76 1
12 1 0 12945211 5486 31 9958 0 15 243 428 3799 117624 189488 14 18 64 4
10 1 0 12945247 4280 29 6193 0 7 140 224 427 113468 190980 12 8 79 0
11 1 0 12945258 3921 10 5845 0 0 0 0 484 112393 191256 11 8 80 0
11 0 0 12945262 4092 12 5823 0 3 51 89 407 112539 191034 12 8 80 0
7 2 0 12946529 4025 88 6353 0 32 383 493 541 114747 191927 11 9 79 1
6 1 0 12945285 3868 80 6564 0 19 218 433 622 118519 190818 14 11 74 1
9 1 0 12945301 4663 60 9375 0 17 165 240 3114 118963 192304 13 10 77 1
8 7 0 12945308 4282 11 9270 0 0 0 0 1878 109050 185043 10 16 72 2
9 1 0 12945398 3898 10 5835 0 0 0 0 499 113986 193613 12 8 79 0
avm = 12945406 pages or 12945406*4K bytes = 49GB Active Virtual Memory
This server has 187GB of RAM.
How Many File Pages in Memory?
# vmstat -v
49020928 memory pages
47128656 lruable pages
5807 free pages
4 memory pools
2404954 pinned pages
80.0 maxpin percentage
20.0 minperm percentage
80.0 maxperm percentage
77.9 numperm percentage
36732037 file pages
0.0 compressed percentage
0 compressed pages
78.0 numclient percentage
80.0 maxclient percentage
36767316 client pages
0 remote pageouts scheduled
321640 pending disk I/Os blocked with no pbuf
763 paging space I/Os blocked with no psbuf
2888 filesystem I/Os blocked with no fsbuf
9832 client filesystem I/Os blocked with no fsbuf
2038066 external pager filesystem I/Os blocked with no fsbuf
Tuning to Prevent Paging
Assuming that memory is not over-committed, tuning vmo
parameters may eliminate paging
For JFS, lowering maxperm below numperm while keeping it
a soft limit should eliminate paging
For JFS2, maxclient can be lowered but it would have to be
changed to make it a soft limit
– maxperm would have to be lowered to the same value or
higher than maxclient
Best solution is to simply disable the use of repage counters
Disabling Use of Repage Counters
vmo parameter ‘lru_file_repage’ can be set to 0,
which means to not use the repage counters
If value is 1 (default),
then
if numperm is between minperm and maxperm
or if numclient is between minperm and maxclient,
repage counters are used
If value is 0,
then
if numperm is higher than minperm (for JFS)
or if numclient is higher than minperm (for JFS2),
only file pages are stolen
Best solution for paging issues when using filesystems:
lower minperm to a low value like 5%
and set lru_file_repage=0
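That recommendation as commands:
# vmo -p -o lru_file_repage=0   # ignore repage counters
# vmo -p -o minperm%=5          # steal only file pages while numperm/numclient > 5%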
Result of Setting lru_file_repage=0
# vmstat -I 10
System Configuration: lcpu=16 mem=191488MB
kthr memory page faults cpu
r b p avm fre fi fo pi po fr sr in sy cs us sy id wa
9 2 0 35226223 4272 950 9879 0 0 1016 1810 3939 99613 53246 37 7 55 1
6 0 0 35226288 4284 272 8256 0 0 618 1088 2883 63720 47746 30 5 66 0
7 1 0 35226356 4194 469 8078 0 0 758 1214 2805 59068 45870 26 5 69 0
5 0 0 35226431 4320 542 8101 0 0 865 1387 2886 58182 43960 28 4 68 0
7 0 0 35226479 4338 640 8002 0 0 923 1561 2779 54933 40913 28 6 66 0
8 1 0 35226556 4355 565 22850 0 0 959 1928 9190 74983 48209 40 9 49 2
9 1 0 35226899 3910 379 8232 0 0 756 1717 2889 63565 48098 31 5 64 0
8 1 0 35226971 3968 489 8351 0 0 878 2190 3177 67044 50490 32 5 63 0
9 0 0 35228294 3965 632 8473 0 0 1284 2755 2923 71085 48734 33 5 61 0
8 0 0 35227083 4423 639 8406 0 0 1113 1807 2597 58393 42845 29 6 64 1
5 1 0 35227125 3905 876 8059 0 0 1092 1845 3029 55988 42565 26 5 69 0
5 1 0 35227164 4240 1898 9557 0 0 2502 4365 4544 65081 45162 29 6 64 1
8 2 0 35227229 3960 840 21796 0 0 1097 2011 9045 59728 44146 34 9 54 3
7 1 0 35227279 4693 750 8321 0 0 1321 2156 2827 58103 43630 30 7 63 1
avm = 35227279 pages or 35227279*4K bytes = 134GB Active Virtual Memory
This server has 187GB of RAM.
Memory Over-Commitment
Memory is considered over-committed if the working storage
requirements (computational) exceed the real memory size
– ‘avm’ column in the output of vmstat shows the working storage
number of pages
• Multiply this by 4K; if it is greater than RAM, then memory is
over-committed
– If memory is over-committed, it is recommended to reduce the
workload or add more real memory
– Important to have sufficient paging space in the case of
memory over-commitment
• Should be at least the size of ‘avm’ if prior to AIX 5.3
• AIX 5.3 provides paging space garbage collection
Page Replacement by Memory Pool
By default (if memory_affinity=1), each chip module (MCM on
Power4 or DCM on Power5) will have at least one memory pool
Each memory pool will have its own LRU daemon to do page
replacement and VMM parameters such as minfree, maxfree,
minperm, maxperm, maxclient apply on a per-pool basis
LRU will run on its own pool when thresholds are reached
Number of memory pools in the chip module is based on amount of
RAM on the chip module (if memory_affinity=1) and the number of
CPUs in the LPAR
If memory_affinity is disabled (set to 0 using ‘vmo’, bosboot,
reboot), then number of pools is based on total amount of RAM and
number of CPUs
– This method guarantees evenly sized memory pools which is desirable
– AIX 5.3 will not allow disabling of memory affinity until ML3
Monitoring Memory Usage
‘vmstat -v’ can be used to show the number of file pages in
memory
‘avm’ column in vmstat shows working
storage/computational memory usage
‘fre’ column in vmstat shows free real memory
– Note that in other OS’s like Solaris, free memory may not really
be free but also include the filesystem cache
Process memory usage can be monitored using commands
such as ‘ps’, ‘svmon’, ‘topas’, ‘nmon’, or PTX (Performance
Tool Box)
Monitoring Process Memory Usage using PS
ps reports memory in 1KB units
# ps gv
PID PGIN SIZE RSS TRS DRS C PRI NI %CPU TIME CMD
0 7 64 64 0 64 120 16 -- 0.1 2:25 swapper
1 108 844 880 36 844 0 60 20 0.0 0:03 init
8196 0 48 48 0 48 120 255 -- 27.0 954:21 wait
12294 0 48 48 0 48 120 255 -- 26.2 926:39 wait
16392 0 48 48 0 48 120 255 -- 26.0 918:13 wait
20490 0 48 48 0 48 0 255 -- 0.0 0:00 wait
24588 0 56 56 0 56 120 17 -- 0.0 0:33 reaper
28686 0 92 92 0 92 0 16 -- 0.3 12:01 lrud
Monitoring System Memory Usage using svmon
# svmon -G
size inuse free pin virtual
memory 1572864 1554348 18516 211932 652201
pg space 1048576 5363
work pers clnt lpage
pin 211932 0 0 0
in use 652220 495384 406744 0
Monitoring Process Memory Using svmon
# svmon -P 978946
Pid Command Inuse Pin Pgsp Virtual 64-bit Mthrd LPage
978946 oracle 50541 3840 0 46773 Y N N
Vsid Esid Type Description LPage Inuse Pin Pgsp Virtual
73ece 70000000 work default shmat/mmap - 36255 0 0 36255
0 0 work kernel segment - 6696 3818 0 6696
5becb 10 pers text data BSS heap, - 3750 0 - -
/dev/lv63:86028
9bed3 11 work text data BSS heap - 1789 0 0 1789
688ad 90000000 work loader segment - 1221 0 0 1221
7124e 90020014 work shared library text - 420 0 0 420
3bee7 9001000a work shared library text - 121 0 0 121
e3ebc 80020014 work private load - 103 0 0 103
c3eb8 8001000a work private load - 90 0 0 90
13ec2 f00000002 work process private - 29 22 0 29
808f0 9ffffffe work other kernel segments - 29 0 0 29
49249 9fffffff pers shared library text, - 11 0 - -
/dev/hd2:76083
53eca 8fffffff work private load - 10 0 0 10
d3eba ffffffff work application stack - 10 0 0 10
bbed7 - pers /dev/lv63:32787 - 6 0 - -
6becd - pers /dev/app_oracle:10518 - 1 0 - -
LVM Tuning
LVM stores data in LTG units (default 128K)
– I/Os sent to the LVM larger than LTG size are broken up into
multiple I/Os but disk layer can coalesce them back into larger
I/Os
LVM initiates I/O to the disk layer once a buffer structure
called a pbuf is available
– Shortages of pbufs can be viewed using ‘vmstat -v’
– Pbufs can be dynamically increased by increasing the value of
hd_pbuf_cnt (using ioo in AIX 5.2)
– AIX 5.3 has a pbuf pool per Volume Group
• Shortages are seen in the output of the ‘lvmstat’ command
• Per-VG pbufs can be increased using the lvmo command
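An AIX 5.3 per-VG example (the volume group name datavg is hypothetical):
# lvmo -v datavg -a                      # show pbuf statistics for one VG
# lvmo -v datavg -o pv_pbuf_count=1024   # raise pbufs added per physical volume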
LVM Tuning cont.
Sequential I/O can benefit from LVM striping across multiple
disks
– Stripe size and width should take into account the typical
application I/O size
LVM can also provide mirroring; however, it is usually more
efficient to do hardware mirroring if available
lvmstat can be used to monitor LVM hotspots
– Individual LVM partitions can be moved from one hdisk to
another using the migratelp command
– Entire LVM devices can be moved from one hdisk to another
using the migratepv command, even while the LVM device is in
use
lvmo - LVM tuning command
# lvmo -a
vgname = rootvg
pv_pbuf_count = 512
total_vg_pbufs = 512
max_vg_pbuf_count = 16384
pervg_blocked_io_count = 17
global_pbuf_count = 512
global_blocked_io_count = 17
Disk Tuning
Disks may have a max_coalesce and/or a max_transfer
parameter which are upper limits on the size of a disk I/O
– Increasing max_coalesce and/or max_transfer can allow
coalescing of sequential I/Os into larger I/Os
Disks may have a tunable queue_depth parameter which
places a limit on how many I/Os can be sent to the disk at a
time
– I/Os not queued to the disk may be coalesced by the device
driver
– Logical disks with many physical disks in them can benefit from
larger values for the queue_depth
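An illustrative change (attribute names and limits vary by disk driver,
so check the lsattr output first):
# lsattr -E -l hdisk4 | egrep 'queue_depth|max_transfer'
# chdev -l hdisk4 -a queue_depth=32 -P   # -P defers the change to the next reboot if the disk is busy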
Disk Bottlenecks
High disk utilizations (viewed by iostat) can be an indication
of a bottleneck
Logical hdisks may not have enough physical disks or may
not be optimally organized in the disk subsystem (striping, #
of LUNs, # of paths, # of ports)
Disk multipathing software may have bottlenecks (could be
limited by disk queue depths or by multipath process)
Other servers attached to the SAN may affect the I/O
response time of a server
Bottleneck can also occur on the SAN switch
Disk subsystem monitoring software should be used to
detect bottlenecks
iostat
# iostat 5
tty: tin tout avg-cpu: % user % sys % idle % iowait
0.0 5.7 6.7 5.9 59.9 27.6
Disks: % tm_act Kbps tps Kb_read Kb_wrtn
hdisk1144 8.6 127.5 15.3 284 992
hdisk1145 7.1 110.0 11.3 108 993
hdisk2 0.0 0.0 0.0 0 0
hdisk3 0.0 0.0 0.0 0 0
hdisk11 0.0 0.0 0.0 0 0
hdisk13 0.0 0.0 0.0 0 0
hdisk14 0.0 0.0 0.0 0 0
hdisk15 0.0 0.0 0.0 0 0
hdisk35 0.2 2.8 0.7 0 28
hdisk41 0.0 0.0 0.0 0 0
hdisk42 0.0 0.0 0.0 0 0
hdisk43 0.0 0.0 0.0 0 0
hdisk46 0.0 0.0 0.0 0 0
hdisk47 23.3 77.9 19.5 780 0
iostat –a, iostat -s
# iostat -a 1
System configuration: lcpu=2 drives=3
tty: tin tout avg-cpu: % user % sys % idle % iowait
0.0 37.3 0.0 0.5 99.5 0.0
Adapter: Kbps tps Kb_read Kb_wrtn
scsi0 0.0 0.0 0 0
Disks: % tm_act Kbps tps Kb_read Kb_wrtn
hdisk0 0.0 0.0 0.0 0 0
hdisk1 0.0 0.0 0.0 0 0
iostat –D (detailed disk stats)
# iostat -D 5
hdisk11 xfer: %tm_act bps tps bread bwrtn
0.0 0.0 0.0 0.0 0.0
read: rps avgserv minserv maxserv timeouts fails
0.0 0.0 0.0 0.0 0 0
write: wps avgserv minserv maxserv timeouts fails
0.0 0.0 0.0 0.0 0 0
queue: avgtime mintime maxtime avgwqsz avgsqsz sqfull
0.0 0.0 0.0 0.0 0.0 0
hdisk12 xfer: %tm_act bps tps bread bwrtn
0.0 0.0 0.0 0.0 0.0
read: rps avgserv minserv maxserv timeouts fails
0.0 0.0 0.0 0.0 0 0
write: wps avgserv minserv maxserv timeouts fails
0.0 0.0 0.0 0.0 0 0
queue: avgtime mintime maxtime avgwqsz avgsqsz sqfull
0.0 0.0 0.0 0.0 0.0 0
iostat –A (Asynchronous I/O stats)
# iostat -A 1
System configuration: lcpu=2 drives=3
aio: avgc avfc maxg maxf maxr avg-cpu: % user % sys % idle % iowait
0 0 0 0 4096 0.0 0.0 100.0 0.0
Disks: % tm_act Kbps tps Kb_read Kb_wrtn
hdisk0 0.0 0.0 0.0 0 0
hdisk1 0.0 0.0 0.0 0 0
cd0 0.0 0.0 0.0 0 0
avgc Average global AIO request count per second for the specified interval (filesystem)
avfc Average fastpath request count per second for the specified interval (raw LV)
maxg Maximum global AIO request count since the last time this value was fetched (filesystem)
maxf Maximum fastpath request count since the last time this value was fetched (raw LV)
maxr Maximum AIO requests allowed (the maxreqs attribute)
Network Layers
Transmit
– Application -> socket -> TCP or UDP -> IP -> Interface ->
Device Driver -> Adapter -> wire
Receive
– Wire -> Adapter -> Device Driver -> Demux -> IP -> TCP
or UDP -> socket -> Application
Network Tuning
Databases that transmit data over a network (local or
remote) may run into network bottlenecks
Parameters such as tcp_sendspace and tcp_recvspace can
be used to submit larger network I/Os without blocking
Parameters such as tcp_nodelayack and tcp_nagle_limit can
be used to eliminate delays that can occur due to algorithms
such as Nagle and delayed acknowledgements
– Set tcp_nodelayack=1
– Set tcp_nagle_limit=0
Check for media mismatches (sometimes non-zero CRC
errors in the output of ‘netstat -v’ are a clue)
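The tunables above as commands (the socket buffer sizes are illustrative):
# no -p -o tcp_nodelayack=1    # acknowledge immediately instead of delaying
# no -p -o tcp_nagle_limit=0   # eliminate Nagle coalescing delays
# no -p -o tcp_sendspace=262144 -o tcp_recvspace=262144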
CPU monitoring
A server is considered starved for CPU resources if the
number of runnable threads exceeds the number of logical
CPUs and if CPU utilization is 100%
CPU utilization in this case refers to sum of %user and
%system
%iowait is simply a form of idle time
– Indicates % of time the CPU was idle but there was at least
one I/O in progress
CPU monitoring tools have new options for SMT and
SPLPAR
CPU Utilization in an SMT enabled LPAR
Each physical CPU has two hardware threads; each hardware thread is viewed as a logical
processor by AIX
– each logical processor still collects 100 utilization samples per second
– "ticks" will still be collected in per-logical processor cpuinfo structures (for binary compatibility)
– additional PURR-based metrics (from the PURR registers) will be collected in new structures
– and sorted into the same four categories: user, sys, iowait, and idle
– values are accumulated PURR ticks
New "physical" cpu utilization calculation
– current metrics can be misleading unless they’re modified to use PURR
– the case of one hardware thread 100% busy and one hardware thread idle would result in 50%
utilization with the old method
• but physical processor is really 100% busy
– displayed %user,%sys,%idle,%wait will now be calculated using the purr-based metrics
• in case of one thread 100% busy, PURR-based utilization would be 100%
• one thread would receive (almost) all the purr increments, the other (practically) none
• practically 100% of purr increments would go into the %user and %sys buckets
topas
Topas Monitor for host: server EVENTS/QUEUES FILE/TTY
Tue Apr 12 19:36:59 2005 Interval: 2 Cswitch 177 Readch 1901
Syscall 200 Writech 686
Kernel 0.4 |# | Reads 4 Rawin 0
User 0.0 |# | Writes 1 Ttyout 678
Wait 10.7 |#### | Forks 0 Igets 0
Idle 89.0 |######################### | Execs 0 Namei 14
Runqueue 0.0 Dirblk 0
Network KBPS I-Pack O-Pack KB-In KB-Out Waitqueue 0.0
en2 0.8 2.5 0.5 0.1 0.7
en7 0.0 0.0 0.0 0.0 0.0 PAGING MEMORY
Faults 69 Real,MB 128256
Disk Busy% KBPS TPS KB-Read KB-Writ Steals 0 % Comp 7.0
hdisk2 0.0 0.0 0.0 0.0 0.0 PgspIn 0 % Noncomp 0.5
hdisk0 0.0 12.0 2.5 0.0 12.0 PgspOut 0 % Client 0.5
PageIn 0
Name PID CPU% PgSp Owner PageOut 0 PAGING SPACE
syncd 78034 0.3 0.5 root Sios 0 Size,MB 512
topas 270336 0.0 1.4 root % Used 3.5
gil 25156 0.0 0.1 root NFS (calls/sec) % Free 96.4
sched 4920 0.0 0.1 root ServerV2 0
sched 4652 0.0 0.1 root ClientV2 0 Press:
sched 4384 0.0 0.1 root ServerV3 0 "h" for help
aixmibd 320434 0.0 0.6 root ClientV3 0 "q" to quit
nfsd 373102 0.0 0.2 root
rpc.lock 73904 0.0 0.2 root
Performance Data Collection
If a performance problem requires IBM support, a tool called
PERFPMR is used to collect performance data
PERFPMR is downloadable from a public ftp site:
– ftp ftp.software.ibm.com using anonymous ftp
– cd /aix/tools/perftools/perfpmr/perfXX (where XX is the AIX
release)
– Get the compressed tar file in that directory and install it using
the directions in the provided README file
– PERFPMR is updated periodically, so it’s advisable to check
the FTP site for the most recent version
Running PERFPMR
Once PERFPMR has been installed, you can run it in any directory
– To determine the amount of space needed, estimate at least 20MB
per logical CPU plus an extra 50MB of space
– Run “perfpmr.sh <# of seconds>” at a time when the performance
problem is occurring
– A pair of 5-second traces are collected first
– Then various monitoring tools are run for the duration of time specified
as a parameter to perfpmr.sh
– After this, tprof, filemon, iptrace, and tcpdump data are collected
– Finally, system config data is collected
– Data can be tar’d up and sent to testcase.software.ibm.com with the
filename having the pmr# in it
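For example, a 10-minute collection in a scratch directory (the path is
hypothetical):
# mkdir /tmp/perfdata && cd /tmp/perfdata
# perfpmr.sh 600    # monitor for 600 seconds while the problem is reproducing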