0% found this document useful (0 votes)
22 views38 pages

Parallel Io Intro

Uploaded by

mamalee393
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
22 views38 pages

Parallel Io Intro

Uploaded by

mamalee393
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

PARALLEL I/O AND PORTABLE DATA FORMATS

INTRODUCTION AND PARALLEL I/O STRATEGIES

27.01.2020 I SEBASTIAN LÜHRS ([Link]@[Link])


JUST STORAGE SYSTEM
JUelich STorage

EUDAT

DEEP
JUWELS
2600+ Nodes

JUST

JUROPA3

JUDAC

JURECA + JURECA Booster JUROPA3-ZEA


3600+ Nodes
JuAMS

3
JUST CES
$DATA
JUSTTSM
XCST

IBM Spectrum Scale


(GPFS)
IBM Spectrum Scale
(GPFS)

JUSTDSS

Backup
SAN
HSM

IBM Spectrum Scale


(GPFS)
IBM Spectrum Protect
(TSM)
$SCRATCH
$FASTDATA
Restore
Backup

$PROJECT

NFS
$ARCHIVE
$HOME
JuNet

4
File I/O to GPFS

5
JUST – 5th generation
21 x DSS240 + 1 x DSS260 → 44 x NSD Server, 90 x Enclosure → +7.500 10TB disks

●●●

3 x 100 GE

2 x 200 GE 1 x 100 GE

Monitoring

2 2
Cluster Export Cluster Management
8
TSM Server 5 Server (NFS)
Power 8 GPFS
Manager

Page 6
Declustered RAID

Disk failure causes disk rebuild Disk failure causes strips rebuild
• Volume degraded for a long time • all disc involved
• performance impact for file system • Volume degraded for a short time
• minimized performance impact

7
JUST – Characteristics
Spectrum Scale (GPFS 5.0.1) + GNR (GPFS Native RAID)
• Declustered RAID technology
• End-to-End data integrity
Spectrum Protect (TSM) for Backup & HSM
Hardware:
• x86 based server + RHEL 7
• IBM Power 8 + AIX 7.2
• 100GE network fabric
75 PB gross capacity
Bandwidth: 400 GB/s

8
PARALLEL I/O STRATEGIES
Parallel I/O Strategies
One process performs I/O

P00 P01 P02 P03

P04 P05 P06 P07

P08 P09 P10 P11

file system P12 P13 P14 P15

processes

10
Parallel I/O Strategies
One process performs I/O

+ Simple to implement

- I/O bandwidth is limited to the rate of this single process


- Additional communication might be necessary
- Other processes may idle and waste computing resources during I/O time

11
Parallel I/O Pitfalls
Frequent flushing on small blocks

• Modern file systems in HPC have large file system blocks (e.g. 4MB)
• A flush on a file handle forces the file system to perform all pending write operations
• If application writes in small data blocks, the same file system block it has to be read and
written multiple times
• Performance degradation due to the inability to combine several write calls

12
Parallel I/O Strategies
Task-local files

P00 P01 P02 P03

P04 P05 P06 P07

P08 P09 P10 P11

file system P12 P13 P14 P15

processes

13
Parallel I/O Strategies
Task-local files

+ Simple to implement
+ No coordination between processes needed
+ No false sharing of file system blocks

- Number of files quickly becomes unmanageable


- Files often need to be merged to create a canonical dataset
- File system might serialize meta data modification

14
Parallel I/O Pitfalls
Serialization of meta data modification
file i-node
indirect
I/O- blocks
Example: Creating files in parallel in the same directory client

FS blocks

• Meta-data wall on file level


• File changes by multiple processes can
cause serialization
• File meta-data management
• Locking

Parallel file creation on JUQUEEN


0.5-28 racks, 64 tasks/node
W. Frings

The creation of 2.097.152 files costs 113.595 core hours on JUQUEEN!

15
Parallel I/O Strategies
Shared files

P00 P01 P02 P03

P04 P05 P06 P07

P08 P09 P10 P11

file system P12 P13 P14 P15

processes

16
Parallel I/O Strategies
Shared files

+ Number of files is independent of number of processes


+ File can be in canonical representation (no post-processing)

- Uncoordinated client requests might induce time penalties


- File layout may induce false sharing of file system blocks

17
Parallel I/O Pitfalls
False sharing of file system blocks

• Data blocks of individual processes do not fill up a complete file system block
• Several processes share a file system block
• Exclusive access (e.g. write) must be serialized
• The more processes have to synchronize the more waiting time will propagate

data block free file system block


t1 t2
lock lock
FS Block FS Block FS Block

file system block … data


task 1
data
task 2

18
I/O Workflow
• Post processing can be very time-consuming (> data creation)
• Widely used portable data formats avoid post processing
• Data transportation time can be long:
• Use shared file system for file access, avoid raw data transport
• Avoid renaming/moving of big files (can block backup)

data post processing


(merge files, switch to
different file format) visualization

data creation

19
Parallel I/O Pitfalls
Portability

• Endianness (byte order) of binary data


• Conversion of files might be necessary and expensive

2,712,847,316
=
10100001 10110010 11000011 11010100
Address Little Endian Big Endian
1000 11010100 10100001
1001 11000011 10110010
1002 10110010 11000011
1003 10100001 11010100

20
Parallel I/O Pitfalls
Portability

• Memory order depends on programming language


• Transpose of array might be necessary when using different programming languages in
the same workflow
• Solution: Choosing a portable data format (HDF5, NetCDF)
Address row-major order column-major order
(e.g. C/C++) (e.g. Fortran)
1000 1 1
1 2 3
1001 2 4
4 5 6
1002 3 7
7 8 9 1003 4 2
1004 5 5
… … …

21
Storage Tiers
Different storage tiers with different optimization targets
Data staging at JSC

HPST

$SCRATCH

$FASTDATA

$DATA

$ARCHIVE

Tape Library JUST 5

22
How to choose the I/O strategy?
• Performance considerations
• Amount of data
• Frequency of reading/writing
• Scalability
• Portability
• Different HPC architectures
• Data exchange with others
• Long-term storage
• E.g. use two formats and converters:
• Internal: Write/read data “as-is”
• External: Write/read data in non-decomposed format (portable, system-independent, self-
describing)

23
Parallel I/O Software Stack

Parallel application

Task-

Shared
P-HDF5 NetCDF-4 PNetCDF … local
files

file
MPI-I/O
SIONlib …

POSIX I/O

Parallel file system


data stored in global view in local view

24
HANDS ON PREPARATION
HPC access
JUDOOR account and • Register and join the course project:
part of the training2000 [Link]
project? [Link]/projects/join/training2000

• Create a new public/private Key-pair:


ssh-keygen or use puttygen
Available SSH-Key?
• Add the private key into your agent:
ssh-add <private_key>

Public key on the • Upload your public key via [Link]


system? [Link] to the system

• Try to login:
Login possible? ssh <userid>@[Link]

26
Course exercise: Mandelbrot set
• Set of all complex numbers 𝑐 in the complex plane for which
𝑧𝑛+1 = 𝑧𝑛 2 + 𝑐
𝑧0 = 0
does not approach infinity
Im[c]

Re[c]

27
Course exercise: Mandelbrot set
• I/O comparison example
• Four different decomposition types
• stride
• static
• master-worker (workers write)
• master-worker (master writes)
• Five different output formats
• SIONlib
• HDF5
• MPI-IO
• parallel-netcdf
• netcdf4
28
Decomposition types
stride static
blocksize

height

width
master-worker, workers write master-worker, master writes

blocksize

blocksize

29
mandelmpi

-v use verbose mode


-t decomposition type (0: stride, 1: static, 2: master-worker worker write,
3: master-worker master write), default: 0
-w width, default: 256
Command line options

-h height, default: 256


-b blocksize (not used for type = 1), default: 64
-p number of procs in x-direction (only used for type = 1)
-q number of procs in y-direction (only used for type = 1)
-x coordinates of initial area: x1 x2 y1 y2,
default: -1.5 0.5 -1.0 1.0
-i max. iterations, default 256
-f output type (0: SIONlib, 1: HDF5, 2: MPI-IO, 3: pnetcdf, 4: netcdf4),
default: 0

30
mandelseq
Command line options
-f output type (0: SIONlib, 1: HDF5, 2: MPI-IO, 3: pnetcdf, 4: netcdf4),
default: 0
Output

process distribution image


only available for SIONlib

31
Mandelbrot exercise workflow

mandelmpi mandelseq
open_<lib> collect_<lib>

write_to_<lib>_file

close_<lib>

data
file

32
Mandelbrot exercise workflow
1. Load modules
. load_modules_jureca.sh
2. Run compilation
make
3. Change runtime parameter in "[Link]" file or use srun
4. Submit a job if not using srun directly
sbatch [Link]
5. Create result image
./mandelseq -f <format>
6. View image (not in interactive session)
display [Link]

33
Mandelbrot exercise API
typedef struct _infostruct
{
int type; int width; int height;
int numprocs;

C
double xmin; double xmax; double ymin; double ymax;
int maxiter;
} _infostruct;

type :: t_infostruct
integer :: type, width, height
Fortran

integer :: numprocs
real :: xmin, xmax, ymin, ymax
integer :: maxiter
end type t_infostruct

34
Mandelbrot exercise API
void open_<lib>(<type> *fid, _infostruct *infostruct,

C
int *blocksize, int *start, int rank)

open_<lib>(fid, info, blocksize, start, rank)


<type>, intent(out) :: fid
Fortran type(t_infostruct), intent(in) :: info
integer, dimension(2), intent(in) :: blocksize
integer, dimension(2), intent(in) :: start
integer, intent(in) :: rank

fid lib specific file_id (can occurs twice if multiple ids needed)
info global information structure
blocksize chosen (or calculated) blocksizes (C: [y,x], Fortran: [x,y])
start calculated start point (C: [y,x], Fortran: [x,y], starting at 0)
rank process MPI rank

35
Mandelbrot exercise API
void close_<lib>(<type> *fid, _infostruct *infostruct,

C
int rank)

close_<lib>(fid, info, rank)

Fortran
<type>, intent(inout) :: fid
type(t_infostruct), intent(in) :: info
integer, intent(in) :: rank

fid lib specific file_id (can occurs twice if multiple ids needed)
info global information structure
rank process MPI rank

36
Mandelbrot exercise API
void write_to_<lib>_file(
<type> *fid, _infostruct *infostruct, int *iterations,

C
int width, int height, int xpos, int ypos)
write_to_<lib>_file(fid, info, iterations, width, height,
xpos, ypos)
<type>, intent(in) :: fid
type(t_infostruct), intent(in) :: info
Fortran

integer, dimension(:), intent(in) :: iterations


integer, intent(in) :: width
integer, intent(in) :: height
integer, intent(in) :: xpos
integer, intent(in) :: ypos

iterations data array


width, height size of current data block (pixel coordinates)
xpos, ypos position of current data block (pixel coordinates starting at 0)

37
Mandelbrot exercise API
void collect_<lib>(
int **iterations, int **proc_distribution,

C
_infostruct *infostruct)

collect_<lib>(iterations, proc_distribution, info)


Fortran integer, dimension(:), pointer :: iterations
integer, dimension(:), pointer :: proc_distribution
type(t_infostruct), intent(inout) :: info

iterations data array


proc_distribution process distribution array (only in Sionlib)
info global information structure

38

You might also like