PARALLEL I/O AND PORTABLE DATA FORMATS
INTRODUCTION AND PARALLEL I/O STRATEGIES
27.01.2020 | SEBASTIAN LÜHRS ([Link]@[Link])
JUST STORAGE SYSTEM
JUelich STorage
[Figure: JUST at the centre of the JSC infrastructure, serving JUWELS (2600+ nodes), JURECA + JURECA Booster (3600+ nodes), JUROPA3, JUROPA3-ZEA, JUDAC, DEEP, JuAMS and EUDAT]
[Figure: JUST building blocks: the JUST CES and JUSTDSS clusters run IBM Spectrum Scale (GPFS); JUSTTSM/XCST runs IBM Spectrum Protect (TSM) for backup, restore and HSM. File systems: $HOME, $PROJECT and $DATA (exported via NFS over JuNet), $SCRATCH, $FASTDATA and $ARCHIVE; tape storage is attached via SAN]
File I/O to GPFS
JUST – 5th generation
21 x DSS240 + 1 x DSS260 → 44 x NSD servers, 90 x enclosures → 7,500+ 10 TB disks
[Figure: JUST 5 server and network layout: NSD servers (2 x 200 GE + 1 x 100 GE), 2 cluster export servers and 2 cluster management servers (3 x 100 GE), 8 TSM servers (Power 8), 5 NFS servers, GPFS manager, monitoring]
Declustered RAID
Traditional RAID: a disk failure causes a full disk rebuild
• Volume degraded for a long time
• Performance impact for the file system
Declustered RAID: a disk failure causes a strip rebuild
• All disks are involved
• Volume degraded for a short time
• Minimized performance impact
JUST – Characteristics
Spectrum Scale (GPFS 5.0.1) + GNR (GPFS Native RAID)
• Declustered RAID technology
• End-to-End data integrity
Spectrum Protect (TSM) for Backup & HSM
Hardware:
• x86 based server + RHEL 7
• IBM Power 8 + AIX 7.2
• 100GE network fabric
75 PB gross capacity
Bandwidth: 400 GB/s
PARALLEL I/O STRATEGIES
Parallel I/O Strategies
One process performs I/O
[Figure: 16 processes (P00-P15); a single process collects the data and performs all file system I/O]
Parallel I/O Strategies
One process performs I/O
+ Simple to implement
- I/O bandwidth is limited to the rate of this single process
- Additional communication might be necessary
- Other processes may idle and waste computing resources during I/O time
Parallel I/O Pitfalls
Frequent flushing on small blocks
• Modern file systems in HPC have large file system blocks (e.g. 4MB)
• A flush on a file handle forces the file system to perform all pending write operations
• If the application writes small data blocks, the same file system block has to be read and
written multiple times
• Performance degradation due to the inability to combine several write calls
Parallel I/O Strategies
Task-local files
[Figure: 16 processes (P00-P15); each process writes its own file to the file system]
Parallel I/O Strategies
Task-local files
+ Simple to implement
+ No coordination between processes needed
+ No false sharing of file system blocks
- Number of files quickly becomes unmanageable
- Files often need to be merged to create a canonical dataset
- File system might serialize meta data modification
Parallel I/O Pitfalls
Serialization of meta data modification
[Figure: file system metadata structures on the I/O client: file i-node, indirect blocks, FS blocks]
Example: creating files in parallel in the same directory
• Meta-data wall on file level
• File changes by multiple processes can cause serialization
  • File meta-data management
  • Locking
[Figure: parallel file creation times on JUQUEEN, 0.5-28 racks, 64 tasks/node, W. Frings]
The creation of 2,097,152 files cost 113,595 core hours on JUQUEEN!
Parallel I/O Strategies
Shared files
[Figure: 16 processes (P00-P15); all processes write to a single shared file]
Parallel I/O Strategies
Shared files
+ Number of files is independent of number of processes
+ File can be in canonical representation (no post-processing)
- Uncoordinated client requests might induce time penalties
- File layout may induce false sharing of file system blocks
Parallel I/O Pitfalls
False sharing of file system blocks
• Data blocks of individual processes do not fill up a complete file system block
• Several processes share a file system block
• Exclusive access (e.g. write) must be serialized
• The more processes have to synchronize, the more waiting time propagates
[Figure: two tasks whose data blocks only partially fill a file system block must lock the shared FS block one after the other (t1, t2), serializing their writes]
I/O Workflow
• Post-processing can be very time-consuming (it may exceed the data creation time)
• Widely used portable data formats avoid post processing
• Data transportation time can be long:
• Use shared file system for file access, avoid raw data transport
• Avoid renaming/moving of big files (can block backup)
[Figure: workflow: data creation → post-processing (merge files, switch to a different file format) → visualization]
Parallel I/O Pitfalls
Portability
• Endianness (byte order) of binary data
• Conversion of files might be necessary and expensive
2,712,847,316 = 10100001 10110010 11000011 11010100

Address  Little Endian  Big Endian
1000     11010100       10100001
1001     11000011       10110010
1002     10110010       11000011
1003     10100001       11010100
Parallel I/O Pitfalls
Portability
• Memory order depends on programming language
• Transpose of array might be necessary when using different programming languages in
the same workflow
• Solution: Choosing a portable data format (HDF5, NetCDF)
The 3 x 3 array

1 2 3
4 5 6
7 8 9

is stored in memory as:

Address  row-major order  column-major order
         (e.g. C/C++)     (e.g. Fortran)
1000     1                1
1001     2                4
1002     3                7
1003     4                2
1004     5                5
...      ...              ...
Storage Tiers
Different storage tiers with different optimization targets
Data staging at JSC
[Figure: data staging at JSC across the storage tiers HPST, $SCRATCH, $FASTDATA, $DATA and $ARCHIVE (tape library), all provided by JUST 5]
How to choose the I/O strategy?
• Performance considerations
• Amount of data
• Frequency of reading/writing
• Scalability
• Portability
• Different HPC architectures
• Data exchange with others
• Long-term storage
• E.g. use two formats and converters:
• Internal: Write/read data “as-is”
• External: Write/read data in a non-decomposed format (portable, system-independent, self-describing)
Parallel I/O Software Stack
Layers, top to bottom:
• Parallel application
• High-level libraries: P-HDF5, NetCDF-4, PNetCDF, ... (or direct shared-file / task-local-file access)
• MPI-I/O, SIONlib, ...
• POSIX I/O
• Parallel file system
Data is stored in global view (shared files) or in local view (task-local files).
HANDS ON PREPARATION
HPC access
JUDOOR account and part of the training2000 project?
• Register and join the course project: [Link] [Link]/projects/join/training2000
Available SSH key?
• Create a new public/private key pair: ssh-keygen (or use puttygen)
• Add the private key to your agent: ssh-add <private_key>
Public key on the system?
• Upload your public key to the system via [Link] [Link]
Login possible?
• Try to log in: ssh <userid>@[Link]
Course exercise: Mandelbrot set
• Set of all complex numbers c in the complex plane for which
  z_{n+1} = z_n^2 + c,  z_0 = 0
  does not approach infinity
[Figure: Mandelbrot set plotted in the complex plane, axes Re[c] and Im[c]]
Course exercise: Mandelbrot set
• I/O comparison example
• Four different decomposition types
• stride
• static
• master-worker (workers write)
• master-worker (master writes)
• Five different output formats
• SIONlib
• HDF5
• MPI-IO
• parallel-netcdf
• netcdf4
Decomposition types
[Figure: the four decomposition types of the width x height image: stride, static, master-worker (workers write) and master-worker (master writes); block dimensions are controlled by the blocksize parameter]
mandelmpi
Command line options:
-v  use verbose mode
-t  decomposition type (0: stride, 1: static, 2: master-worker, workers write;
    3: master-worker, master writes), default: 0
-w  width, default: 256
-h  height, default: 256
-b  blocksize (not used for type = 1), default: 64
-p  number of procs in x-direction (only used for type = 1)
-q  number of procs in y-direction (only used for type = 1)
-x  coordinates of initial area: x1 x2 y1 y2, default: -1.5 0.5 -1.0 1.0
-i  max. iterations, default: 256
-f  output type (0: SIONlib, 1: HDF5, 2: MPI-IO, 3: pnetcdf, 4: netcdf4), default: 0
mandelseq
Command line options:
-f  output type (0: SIONlib, 1: HDF5, 2: MPI-IO, 3: pnetcdf, 4: netcdf4), default: 0
Output:
• image
• process distribution image (only available for SIONlib)
Mandelbrot exercise workflow
[Figure: mandelmpi creates the data file via open_<lib>, write_to_<lib>_file and close_<lib>; mandelseq reads it back via collect_<lib>]
Mandelbrot exercise workflow
1. Load modules
. load_modules_jureca.sh
2. Run compilation
make
3. Change runtime parameters in the "[Link]" file, or use srun directly
4. Submit a job if not using srun directly
sbatch [Link]
5. Create result image
./mandelseq -f <format>
6. View image (not in interactive session)
display [Link]
Mandelbrot exercise API
C:
typedef struct _infostruct
{
  int type;
  int width;
  int height;
  int numprocs;
  double xmin, xmax, ymin, ymax;
  int maxiter;
} _infostruct;

Fortran:
type :: t_infostruct
  integer :: type, width, height
  integer :: numprocs
  real :: xmin, xmax, ymin, ymax
  integer :: maxiter
end type t_infostruct
Mandelbrot exercise API
C:
void open_<lib>(<type> *fid, _infostruct *infostruct,
                int *blocksize, int *start, int rank)

Fortran:
open_<lib>(fid, info, blocksize, start, rank)
  <type>, intent(out) :: fid
  type(t_infostruct), intent(in) :: info
  integer, dimension(2), intent(in) :: blocksize
  integer, dimension(2), intent(in) :: start
  integer, intent(in) :: rank
fid lib specific file_id (can occur twice if multiple IDs are needed)
info global information structure
blocksize chosen (or calculated) blocksizes (C: [y,x], Fortran: [x,y])
start calculated start point (C: [y,x], Fortran: [x,y], starting at 0)
rank process MPI rank
Mandelbrot exercise API
C:
void close_<lib>(<type> *fid, _infostruct *infostruct, int rank)

Fortran:
close_<lib>(fid, info, rank)
  <type>, intent(inout) :: fid
  type(t_infostruct), intent(in) :: info
  integer, intent(in) :: rank
fid lib specific file_id (can occur twice if multiple IDs are needed)
info global information structure
rank process MPI rank
Mandelbrot exercise API
C:
void write_to_<lib>_file(<type> *fid, _infostruct *infostruct,
                         int *iterations, int width, int height,
                         int xpos, int ypos)

Fortran:
write_to_<lib>_file(fid, info, iterations, width, height, xpos, ypos)
  <type>, intent(in) :: fid
  type(t_infostruct), intent(in) :: info
  integer, dimension(:), intent(in) :: iterations
  integer, intent(in) :: width
  integer, intent(in) :: height
  integer, intent(in) :: xpos
  integer, intent(in) :: ypos
iterations data array
width, height size of current data block (pixel coordinates)
xpos, ypos position of current data block (pixel coordinates starting at 0)
Mandelbrot exercise API
C:
void collect_<lib>(int **iterations, int **proc_distribution,
                   _infostruct *infostruct)

Fortran:
collect_<lib>(iterations, proc_distribution, info)
  integer, dimension(:), pointer :: iterations
  integer, dimension(:), pointer :: proc_distribution
  type(t_infostruct), intent(inout) :: info
iterations data array
proc_distribution process distribution array (only used with SIONlib)
info global information structure