Solving the I/O Slowdown:
The "Noisy Neighbor" Problem
Rice Oil and Gas Conference 2019
John Fragalla
Principal Engineer
Cray, Inc.
Agenda
• Today’s I/O challenge with shared storage
• New Lustre Features enabling Flash to improve Shared Application Performance
• Performance results isolating “Noisy Neighbor” Applications
• Summary
Today’s I/O Challenge
• When multiple users share a high-speed parallel filesystem, “bad” applications will affect “good” application performance
• Bad applications: lots of small files, random small I/O, unaligned I/O
• Good applications: large streaming I/O, sequential access, aligned I/O
• Recent features available in Lustre help automate I/O isolation and placement
with transparent use of Flash and HDD Devices in a Single Namespace
• Progressive File Layout with Lustre Storage Pools
• Data on Metadata
• Distributed Namespace (DNE) 2 – Clustered Metadata
Hybrid File System Architecture
[Diagram: a compute cluster (CN nodes) connected to scalable MDS nodes and tiered SSD/HDD storage, all presented as a single-namespace parallel file system]
Flash MDT Tier:
• Large number of inodes supported per file system
• Improved metadata operations
• Improved small I/O latency
Flash OST Tier:
• Optimized for throughput and IOPS ($/GB/sec)
• Improved performance for intermediate results
• Improved small/random I/O performance
High Performance HDD Tier:
• Optimized for throughput/capacity ($/GB/sec)
• Optional flash/cache to accelerate small-block I/O within the HDD tier
Capacity HDD Tier:
• Optimized for cost ($/GB)
• Lower performance, longer-term data retention
Lustre | Flexibility and Usability
• Progressive File Layouts (PFL)
• Optimized striping based on file size
• Layout changes at specific thresholds
• Can locate components on specific pools
• A fixed amount of each file on flash, the rest on disk (see the sketch below)
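As a minimal sketch (the thresholds and mount point here are hypothetical), a PFL layout is set per directory with lfs setstripe, and every file created underneath inherits it:

# Hypothetical thresholds and path, for illustration only:
# first 64 MiB of each file on a single OST, the remainder
# striped across all OSTs with a 4 MiB stripe size.
lfs setstripe -E 64M -c 1 \
              -E -1 -c -1 -S 4M \
              /mnt/lustre/mydir

Components are instantiated lazily, so a file only allocates objects for a component once it grows past the previous threshold.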
Lustre Storage Pools Are More Relevant Now
• Historically, Lustre pools were used for debugging, to isolate performance issues to a subset of OSTs or OSS nodes
• Now, with PFL, Lustre storage pools are a powerful tool for automatic data placement across different storage media (see the sketch below):
• Flash pool
• High-performing disk pool
• Slower disk pool (e.g. focused on capacity)
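A sketch of how this fits together (the filesystem, pool, and OST names are hypothetical): pools are defined with lctl on the MGS, then referenced per PFL component from a client.

# On the MGS: define pools and assign OSTs to them (names are hypothetical).
lctl pool_new myfs.flash
lctl pool_add myfs.flash myfs-OST[0-1]
lctl pool_new myfs.hdd
lctl pool_add myfs.hdd myfs-OST[2-5]

# On a client: keep the first 4 MiB of each file on the flash pool
# and place the remainder on the HDD pool.
lfs setstripe -E 4M -p flash \
              -E -1 -p hdd \
              /mnt/myfs/mixed_io_dir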
Lustre | Improved Small File Performance
• Data on Metadata
• Ideal for small file workloads
• File data stored directly on metadata storage
• Lower communication overhead for data access
• Scales with Distributed Namespace (DNE)
• Avoids contention by not placing small files on OSTs
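A minimal sketch (the component size and path are assumptions): a DoM layout marks the first PFL component with -L mdt so it is stored on the MDT itself.

# First 64 KiB of each file lives on the MDT (DoM);
# anything larger spills over to a normal OST component.
lfs setstripe -E 64K -L mdt \
              -E -1 -S 1M -c -1 \
              /mnt/lustre/smallfiles

Files at or below 64 KiB then never touch an OST at all, which is what keeps small-file traffic off the disk tier.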
Lustre | Data on Metadata and PFL
• Leverage DoM and PFL together for a more flexible solution
• Small files land on the MDT component
• Medium files land on flash, with larger files growing onto disk (for example)
• Compatible with Progressive File Layouts
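Putting the pieces together (the pool names and thresholds are hypothetical), a single layout can express the small/medium/large placement described above:

# Small files (<= 64 KiB) live on the MDT (DoM),
# medium files grow onto the flash pool (up to 1 GiB),
# and large files continue onto the HDD pool.
lfs setstripe -E 64K -L mdt \
              -E 1G -p flash -c 1 \
              -E -1 -p hdd -c -1 -S 4M \
              /mnt/lustre/project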
DNE Phase 2
• Allows a user to spread a single large directory across multiple MDTs using the DNE striped directory feature
• Note: due to some overhead, this should only be done for very large directories, with file counts in the 50K+ range
[Diagram: DNE phase 1 (whole directories placed on individual MDTs) vs. DNE phase 2 (a single directory striped across multiple MDTs)]
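As a sketch (paths and stripe counts are assumptions, and creating striped or remote directories may require administrative privileges depending on server settings):

# DNE phase 2: stripe a single large directory across 4 MDTs.
lfs mkdir -c 4 /mnt/lustre/very_large_dir

# DNE phase 1, for contrast: place a whole directory on a chosen MDT.
lfs mkdir -i 1 /mnt/lustre/dir_on_mdt1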
System Setup for Benchmarks
Hardware:
• 4 flash MDTs with RAID-10
• 2 flash IOPS-optimized OSTs with RAID-10
• 4 GridRAID OSTs (parity-declustered, RAID-6-equivalent data protection)
• Up to 64 client nodes (FDR connectivity)
• EDR InfiniBand non-blocking fabric
Software:
• Lustre 2.11.0 clients and servers
• CentOS Linux release 7.5 (server and client)
• Spectre/Meltdown-enabled kernels on clients; mitigations disabled on servers
• Client kernel: 3.10.0-862.el7.x86_64
• Server kernel: 3.10.0-693.21.1.x3.1.9.x86_64
LUSTRE PFL STREAMING PERFORMANCE
Flash MDTs -> HDD OST Tier (DoM)
[Chart: write and read mean throughput (MB/sec) versus the PFL small-component (DoM) size: No DoM, 64K, 256K, 1024K, 4096K. The goal is no change in performance across the various sizes, and throughput stays flat near peak for both writes and reads.]
Progressive File Layout maintains peak performance for streaming workloads
LUSTRE PFL NOISY NEIGHBOR ISOLATION
Two Competing Workloads on Same Resources
[Diagram, top: a streaming-workload file and a small-file workload (Files 1-4) all placed on the same Lustre HDD OSTs]
[Diagram, bottom: the two workloads separated using PFL; the streaming file stays on the HDD OSTs while the small files land on a flash OST or MDT]
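One way to realize this separation (the directory names are hypothetical, with the flash component size chosen to mirror the benchmark that follows): give the small-file directory a layout whose first 1 MiB lands on flash while the streaming directory writes straight to HDD.

# Streaming workload: stripe directly across the HDD pool.
lfs setstripe -E -1 -p hdd -c -1 -S 4M /mnt/lustre/streaming

# Small-file workload: first 1 MiB on flash, overflow to HDD, so the
# small/random I/O no longer competes with the streams on disk.
lfs setstripe -E 1M -p flash \
              -E -1 -p hdd \
              /mnt/lustre/small_files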
LUSTRE PFL NOISY NEIGHBOR ISOLATION
Flash Tier (OST or DoM with MDTs) -> HDD OST Tier
[Chart, two panels for 1 MB and 4 MB noisy-neighbor files: write and read mean streaming throughput (MB/s) versus the PFL size on flash (None, 1024K, 4096K). With no isolation, the interfered runs fall well below the baseline; moving the noisy-neighbor files onto flash restores near-baseline streaming performance.]
PFL isolation of IOPS from streaming improves performance
Summary
• New Lustre features such as PFL, DoM, and DNE2 help improve mixed-I/O performance on a high-speed shared parallel filesystem
• Transparent data placement on flash MDTs and/or OSTs and on HDDs for various I/O sizes optimizes throughput and IOPS
• Isolating small files and small I/Os from streaming I/O solves the “Noisy Neighbor” slowdown for sequential performance
THANK YOU
QUESTIONS?
cray.com