HNAS Performance Data Collection and Analysis
January 2013
Gokula Rangarajan
File Services Competency Center,
Performance Measurement Group, Technical Operations
HDS employees and TrueNorth™ Partners only. NDA required for customers.
Technical Operations
The Hitachi Storage Advantage !
© 2011 Hitachi Data Systems
Agenda
Prerequisites
What to collect?
How to collect?
Analysis
Summary
Goal & expectations
• The goal of this presentation is to explain the HNAS performance
information collection procedures and to provide guidance on analysis.
• Performance troubleshooting is hard. It takes a lot of time and
practice/experience.
• HNAS PIR analysis is complex and I am not aware of anyone
who can analyze the entire contents within a PIR. We will
always need extra hands.
• Eliminate the potential/known issues first.
• Before performing in-depth PIR analysis, eliminate any storage
system side issues first.
• The information available in this presentation is for education
purposes only. Let GSC do the expert review of all the available
information.
• Not all data presented in this PPT are from the same PIR.
HNAS performance data collection - Introduction
nodes.
Prerequisites
What information should you collect?
Below are some of the key performance related questions the GSC typically asks:
1) Provide a description of the symptom / problem [application, latency, performance, symptoms, are all
applications / clients affected, what is expected performance / behavior etc.]:
2) Is the problem reproducible?
3) What have you done thus far to troubleshoot this issue?
4) Is this client specific? Or across multiple clients?
5) Which OS are the clients running? [e.g. SW version and patch level]
6) When was the problem first noticed?
7) What changes have been made to the infrastructure?
8) What protocols are being used (NFS, CIFS, iSCSI, FTP)? If NFS, what mount options are being used? If
iSCSI which initiator is being used?
9) Is Anti-Virus configured on the BlueArc / HNAS?
10) What is the network topology between the affected client and the BlueArc server?
11) Please provide the port configuration showing flow-control and port counters for the ingress and egress
ports on each switch / router involved.
12) Can you gather a network trace from the client? Please provide a description of the associated IPs
and what was happening during the trace.
13) Is there any additional relevant information we should be aware of?
14) After all PIRs have completed, please gather the system diagnostics and, if FS related, the storage
diagnostics.
RUSC
•The RUSC (Ruby Statistics Client) command reports aggregated HNAS
performance information for a given window at a high level.
•It queries the stats server on the HNAS (the same way the GUI does) and reports the
data.
•The prerequisite is Ruby. RUSC is preinstalled on the SMU and can be copied over to any
client system.
•Stats are organized into groups: basic, protocol, disk etc.
•It can output in CSV format.
•./rusc --period 5 --header-every 20 --timestamp --csv --groups
basic,protocol,disk 192.0.2.2 >filename.csv
•With the options above, the command captures ops/sec, network Megabits/sec,
FC Megabits/sec, protocol level ops/sec, disk latency etc.
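Once the CSV has been captured, a short script can reduce a column to its minimum, mean and maximum over the window. The following is a minimal sketch under stated assumptions: the first row is taken to be a header, repeated header rows (which --header-every may produce) are skipped, and the column names in the sample ("ops/sec", "net Mb/s") are hypothetical. Check the header of your own capture for the real names.

```python
import csv
import io
from statistics import mean


def summarize_column(csv_text, column):
    """Return (min, mean, max) of one numeric column in RUSC-style CSV text.

    Assumes the first row is a header. Rows that repeat the header or hold
    non-numeric values in the chosen column are skipped.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = [name.strip() for name in next(reader)]
    idx = header.index(column)
    values = []
    for row in reader:
        if len(row) <= idx:
            continue  # short or empty row
        try:
            values.append(float(row[idx]))
        except ValueError:
            continue  # repeated header line or blank cell
    if not values:
        raise ValueError(f"no numeric samples found for {column!r}")
    return min(values), mean(values), max(values)


# Hypothetical sample data, not taken from a real capture.
SAMPLE = """timestamp,ops/sec,net Mb/s
10:00:00,1000,200
10:00:05,3000,400
timestamp,ops/sec,net Mb/s
10:00:10,2000,300
"""

low, avg, high = summarize_column(SAMPLE, "ops/sec")
print(f"ops/sec min={low:.0f} mean={avg:.0f} max={high:.0f}")
# prints: ops/sec min=1000 mean=2000 max=3000
```

For a real capture, read the file with open("filename.csv").read() and pass the text to summarize_column.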
What information to look for in RUSC
command output?
•showall.txt
•The third file to look at. A collection of diagnostic and configuration summary command line
outputs.
•old-statistics.csv
•Historical samples from before the PIR sampling period.
•old-command-output.txt
•Historical samples from before the PIR sampling period.
What files are included in PIR?
•eventlog.txt
•A high-level list of events that occurred on the node, spanning reboots. Each
entry is tagged with an Event ID and a Severity level.
•trouble.txt
•The output from the trouble fault reporters (performance reporters are not
included)
•vlsishowall.txt
•Contains the output from the vlsishowall command – includes the combined
output of nim-info, fsm-info and sim-info.
•scsidiags.txt
•A point in time summary of attached SCSI devices by port
•registry.tgz
•Machine-readable configuration files (not directly editable)
•current_dblog.txt
•In-depth log of what was occurring on the Mercury.
What information to look for in PIR? –
loggedstatistics.csv
• Operations received per second: ops/sec
received by this physical node.
• Forwarded ops/sec: ops/sec forwarded by other
nodes to this physical node.
• Ops/sec: ops/sec processed on this physical
node (own ops/sec and forwarded by other
nodes)
• Operations forwarded per second: ops/sec
forwarded by this node to other physical nodes.
• SMB: SMB ops/sec (typically from Windows 2003
and below)
• SMB2: SMB2 ops/sec (typically from Windows 2008
and above)
• NFS/iSCSI/FTP: Similar ops/sec data broken
down by protocol.
What information to look for in PIR? –
loggedstatistics.csv
• MMB load: Processor busy rate on the MMB. It will
typically be busy for CIFS, iSCSI, FTP, NFSv4 and dedupe
type workloads. >90% busy is a concern.
• MFB load: Busy rate of the busiest FPGA on the MFB.
>90% busy is a concern.
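The 90% rule of thumb is easy to apply to a column of load samples taken from loggedstatistics.csv. This is a small sketch, assuming the load values have already been extracted into a list of percentages; the sample values are hypothetical.

```python
def busy_samples(samples, threshold=90.0):
    """Return (index, value) pairs for load samples above threshold percent."""
    return [(i, v) for i, v in enumerate(samples) if v > threshold]


def busy_fraction(samples, threshold=90.0):
    """Return the fraction of samples above threshold (0.0 for no samples)."""
    if not samples:
        return 0.0
    return len(busy_samples(samples, threshold)) / len(samples)


# Hypothetical MFB load samples in percent, one per sampling interval.
mfb_load = [72.0, 88.5, 91.0, 95.5, 60.0]

print(busy_samples(mfb_load))   # prints: [(2, 91.0), (3, 95.5)]
print(busy_fraction(mfb_load))  # prints: 0.4
```

A single sample above 90% matters less than a sustained run, so the fraction over the window is usually the more useful number.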
What information to look for in PIR? –
loggedstatistics.csv
• FC read/write ops/sec: ops/sec at the FC layer. The amount of
overall IOPS that the storage system will see from the HNAS.
• Current disk requests: Overall number of disk requests across
the entire node (includes both reads and writes)
• Current disk reads: Overall disk read requests
• Current disk writes: Overall disk write requests.
• Device current disk reads: Number of disk read requests on a
particular SD (LUN) (max = 32)
• Device current disk writes: Number of disk write requests on a
particular SD (LUN) (max = 32)
• Queued disk reads/writes: Overall number of queued disk
requests. For reads, fewer than 100 per SD is acceptable. More
than 100 per SD results in higher read latency.
• For writes, it’s tricky. Drive characteristics and RAID type play
a major role. In general, a few hundred queued writes per SD
is acceptable. Thousands are a definite concern.
• More queued writes result in more NVRAM waiting allocs and
thus write smoothing.
• Si_tach_bytes_read_last_sec: Bytes of data received from the
storage system over the last second on this FC interface.
• Si_tach_bytes_written_last_sec: Bytes of data written to the
storage system over the last second on this FC interface.
• Tach 0, 1, 2 and 3 refer to the four FC ports in a 3090/3080.
• Remember the per port FC limits? Use that knowledge to see if
we are reaching the per port limits in the PIR.
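The queued-request guidance above can be turned into a quick check. This is a sketch only, and it makes two assumptions that are not from the PIR itself: the overall queued count is divided evenly across the SDs, and "thousands" of queued writes is read as 1000 or more per SD.

```python
# Per-SD limits taken from the guidance above. The write cutoff of 1000 is
# an interpretation of "in thousands is a definite concern".
PER_SD_LIMIT = {"read": 100, "write": 1000}


def queued_per_sd(total_queued, sd_count):
    """Average queued requests per SD, assuming an even spread across SDs."""
    if sd_count <= 0:
        raise ValueError("sd_count must be positive")
    return total_queued / sd_count


def classify_queue(total_queued, sd_count, kind):
    """Return 'acceptable' or 'concern' for kind 'read' or 'write'."""
    per_sd = queued_per_sd(total_queued, sd_count)
    return "concern" if per_sd >= PER_SD_LIMIT[kind] else "acceptable"


# Hypothetical node with 16 SDs.
print(classify_queue(800, 16, "read"))     # 50 per SD   -> acceptable
print(classify_queue(4000, 16, "read"))    # 250 per SD  -> concern
print(classify_queue(4000, 16, "write"))   # 250 per SD  -> acceptable
print(classify_queue(32000, 16, "write"))  # 2000 per SD -> concern
```

An even spread is rarely exact, so where per-SD counters are available they are the better input.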
What information to look for in PIR? –
command-output.txt
•command-output.txt is a text file that contains more granular performance information.
•Tools like Notepad++ and WordPad work well.
•There’s a table of contents at the end of the file. Use the TOC to look for specific items.
•It displays the CLI commands that were used internally to collect the output. Some of
those commands can be run manually without the DEV password.
•A lot of the data in this file can be understood only by the developers and maybe
Engineering. So don’t feel bad.
•Depending on the software release version, the data is arranged in this order:
•Trouble
•NIM software
•NIM stats
•FSM software
•FSM stats
•FSM protocols
•FSM Protocol errors
•FSM vlsi
•FSM per-file-system info
•SIM software
•SIM storage
•Cluster stats
What information to look for in PIR? –
command-output.txt
•The output from the trouble fault reporters (performance specific) is displayed in this section.
•It highlights the counters/issues that stand out from the others.
•Any counter with a much higher or lower value than normal is highlighted here.
•For example, it says that “read-ahead” tuning is disabled, but we are issuing several read requests. In such cases, turning on read-ahead
could help.
•Another example is read/modify/write. For example, an application writes 32KB files but later updates a part of the file (like 4KB or 8KB)
very frequently. This creates read/modify/writes and affects write performance.
•Another example is the write and read size. It says that most of the client writes are about 16KB, but it considers >30KB to be a good value.
•Some of the FPGAs in this example are about 90% busy.
What information to look for in PIR? –
command-output.txt
•This section displays the NFS performance statistics broken down by NFS operation at the network layer.
•The samples here refer to the ops across the whole PIR capture window (like 10 minutes).
•It reports the op count, average latency and peak latency for each NFS operation.
•A good indicator of what kind of NFS workload the HNAS is handling.
•In this example, it is handling a lot of NFS metadata operations.
•The response times observed here are what the clients should expect to see as well.
•Look for similar information for the CIFS protocol as well.
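To put a number on "a lot of metadata operations", the per-operation counts can be reduced to a metadata share. This sketch assumes the counts have been copied by hand into a dictionary keyed by NFS procedure name and treats everything other than READ and WRITE as metadata. The counts shown are hypothetical, and the exact operation labels in a PIR may differ.

```python
# Operations that move file data; every other operation counts as metadata.
DATA_OPS = {"read", "write"}


def metadata_share(op_counts):
    """Return the fraction of operations that are not READ or WRITE."""
    total = sum(op_counts.values())
    if total == 0:
        return 0.0
    meta = sum(n for op, n in op_counts.items() if op.lower() not in DATA_OPS)
    return meta / total


# Hypothetical counts for one capture window.
counts = {"GETATTR": 500, "LOOKUP": 300, "READ": 150, "WRITE": 50}

print(f"metadata share: {metadata_share(counts):.0%}")  # prints: metadata share: 80%
```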
What information to look for in PIR? –
command-output.txt
•This section displays the amount of data transferred by the NFS clients.
•The samples here refer to the ops across the whole PIR capture window (like 10 minutes).
•It reports the ops for different NFS operations, the amount of read data and write data, and the average and maximum
I/O size.
•A good indicator of how much data the HNAS is handling during a given window.
•Look for similar information for the CIFS protocol as well.
•In this example, it received (write workload) about 24GB of data from the clients and transferred (read workload)
about 11GB of data back to the clients.
•The max= values should be 64KB (at least 32KB). Check the NFS mount options on the clients and make sure they
are correct.
•Average write/read size is application dependent. Smaller read or write sizes (like 4KB) make the workload look
random at the HNAS. This results in more ops/sec, but fewer MB/s.
•Larger I/O sizes (like 32KB) make the workload mostly sequential at the HNAS (assuming the FS is not fragmented
and the storage is well configured). This results in fewer ops/sec, but more MB/s.
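The trade-off between ops/sec and MB/s is plain arithmetic: throughput is the operation rate multiplied by the I/O size. A short sketch, using 1 MB = 1024 KB and illustrative rates that are not from any PIR:

```python
def throughput_mb_per_s(ops_per_sec, io_size_kb):
    """Throughput in MB/s for a given op rate and I/O size (1 MB = 1024 KB)."""
    return ops_per_sec * io_size_kb / 1024


def ops_needed(target_mb_per_s, io_size_kb):
    """Ops/sec required to reach a target throughput at a given I/O size."""
    return target_mb_per_s * 1024 / io_size_kb


# Same op rate, different I/O sizes.
print(throughput_mb_per_s(10_000, 4))   # prints: 39.0625
print(throughput_mb_per_s(10_000, 32))  # prints: 312.5

# Same target throughput, different I/O sizes.
print(ops_needed(100, 4))   # prints: 25600.0
print(ops_needed(100, 32))  # prints: 3200.0
```

At 4KB the system needs eight times as many operations as at 32KB to move the same amount of data, which is why small-I/O workloads show high ops/sec and low MB/s.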
What information to look for in PIR? –
command-output.txt
•This section displays the per-connection (per client) TCP statistics for up to 50 connections.
•Identify your rogue client systems from this section. One or a few clients may be doing many more operations than
the others and affecting overall performance. This is the only place (or command) to identify such clients.
•If you are working on a POC with performance tests, make sure all the clients send a uniform workload to the HNAS
for optimal performance.
•This section also displays the connections with the highest error counts, such as duplicate ACKs and retransmitted
packets. Check the netstat -st command output on those client systems and troubleshoot the issue.
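One way to spot a rogue client is to compare each connection's operation count with the median across all connections. This is a sketch under stated assumptions: the per-connection counts have been extracted into a dictionary by hand, the addresses and counts are hypothetical, and the factor of 5 is an arbitrary starting point rather than an HNAS recommendation.

```python
from statistics import median


def heavy_clients(ops_by_client, factor=5.0):
    """Return (client, ops) pairs whose ops exceed factor x the median.

    The result is sorted with the busiest client first.
    """
    if not ops_by_client:
        return []
    typical = median(ops_by_client.values())
    flagged = [(c, n) for c, n in ops_by_client.items() if n > factor * typical]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical per-client operation counts for one capture window.
ops = {
    "192.0.2.11": 1200,
    "192.0.2.12": 1100,
    "192.0.2.13": 98000,
    "192.0.2.14": 900,
}

print(heavy_clients(ops))  # prints: [('192.0.2.13', 98000)]
```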
What information to look for in PIR? –
command-output.txt
•This section displays detailed information about all the NFS operations.
•A good indicator of what type of NFS workload the HNAS is handling during a given window.
•This PIR was captured for 3 minutes. In this example, the workload is metadata heavy. At the
same time it is also disk intensive.
•Overall writes+reads+creates+removes = 4,572,101 operations for 3 minutes, or ~25K disk
intensive ops/sec. (Whew..). 25K is the potential disk ops/sec in this example, not the actual
figure.
•The HNAS would Superflush the writes, which results in fewer write ops/sec.
•Look for similar information for other protocols as well.
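The ~25K figure is the total operation count divided by the capture window in seconds. A minimal sketch of that conversion, using the total quoted above:

```python
def ops_per_second(total_ops, window_seconds):
    """Average ops/sec over a capture window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return total_ops / window_seconds


# Total of writes + reads + creates + removes over a 3 minute PIR.
nfs_total = 4_572_101
rate = ops_per_second(nfs_total, 3 * 60)

print(round(rate))  # prints: 25401
```

The same function applies to any counter in the PIR that is reported as a total over the capture window.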
What information to look for in PIR? –
command-output.txt
•This section displays detailed information about the load on different hardware components (MMB
CPU, MFB FPGAs etc).
•Values over 90% indicate a very busy system.
What information to look for in PIR? –
command-output.txt
•This section provides detailed information about overall storage ops, average and peak response
times of the entire system (including all FS and SDs).
•There are FS specific and per SD storage statistics available in the PIR as well.
•Drill down into the FS or SD specifics for accurate information.
•This PIR was captured over 3 minutes. During the 3 minute window, the average read response
time was 13ms, single buffer write response time was 45ms and multi buffer write response time
was 54ms. In general, these are good numbers.
•The sample counts are the storage ops during the capture window and a good indicator of read vs.
write intensive workloads from the HNAS to the storage system.
What information to look for in PIR? –
command-output.txt
•This section provides detailed information about storage ops/sec, average and peak response
times of a given system drive.
•This PIR was captured over 3 minutes. During the 3 minute window, the average read response
time was 13ms, single buffer write response time was 43ms and multi buffer write response time
was 58ms. In general, these are good numbers.
•The sample counts are the storage ops during the capture window and a good indicator of read vs.
write intensive workloads from the HNAS to the storage system.
•Good information for future sizing as well. For example, 243,377 read ops over 3 minutes = 1352
read ops/sec for this SD. This is equivalent (or pretty close) to 1352 LUN read IOPS at the storage
system.
•But the writes are split into single and multi buffer writes (Superflush). Watch it carefully.
What are the nominal LUN response times?
(from storage system perspective)
• Response time at the storage system varies with the cache hit rate, RAID type, disk characteristics,
thread count per disk (command tags) etc.
• The tables below can be used as a reference for different drive types and RAID types, in a 0% cache hit
scenario with 100% random workloads and 8KB I/O size.
• But be aware that when Superflush is in play, the average write I/O will be >256KB, and that results in
higher response times between the HNAS and the storage system. The clients will NOT see higher
response times in such scenarios.
What information to look for in PIR? –
showall.txt
•The showall.txt file includes a few diagnostics and complete configuration summary
command line outputs.
•A few key things to review are:
•Hardware model and revision (Gen1, Gen2)
•Software version
•Number of nodes, EVS
•Installed license keys
•Network configuration and advanced IP configuration
•System drive, host path, target path, LUN mapping details
•Queue depth settings
•System drive group, storage pool details
•Number of file systems and where they reside (SP, EVS etc.)
•And many more
•Use the information from this file to understand how the HNAS is configured.
•While the PIR identifies possible issues, correlate that information with the configuration
summary and see if the issue is due to a misconfigured system.
Summary
• As you can see, HNAS PIR analysis can be complex and time consuming. I
have only covered some of the key areas that are of highest concern.
• There are more sections within the PIR that need additional analysis. Let
GSC do the expert review of all the available information.
• But the areas that I have highlighted so far will eliminate quite a few issues.
• Always combine PIR analysis with storage system performance data
analysis.
• Be aware of the hardware platform and hardware configuration limitations.
• Practice with multiple PIRs and see if you spot any issues. The more you
review, the higher your confidence.
• The attached PIR was captured during a SPECsfs test for a target load of 95K
NFS ops/sec (the busiest it could get). It will be a good learning exercise for
you.
Thank You