Efficient Data Management with
HDF5
Name: Rohan Gupta
Roll No: B22MT037
B.Tech Project
Under the guidance of
Dr. Devendra Negi
Department of Metallurgical and Materials Engineering
What is HDF5
HDF5 stands for Hierarchical Data Format, version 5.
Therefore, we should first understand what a
hierarchical data structure is.
What is a Hierarchical Data Structure
❏ A hierarchical data structure is an organizational
system where data are arranged in levels.
❏ It resembles a tree, with parent–child
relationships between nodes.
How HDF5 Works
❏ Hierarchy:
➢ At the top is the “root” group, below which are nested groups and datasets.
➢ Each dataset can be a multi-dimensional array containing numbers, text, images, or
other scientific measurements.
How HDF5 Works
❏ Access: Data and metadata are accessed through APIs in languages such as
Python (h5py), C/C++, Fortran, or MATLAB. Groups are navigated like
directories, and datasets behave like arrays.
❏ Only the requested data are loaded into memory (“lazy loading”), so even
huge files remain manageable.
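A minimal sketch of this directory-like navigation with h5py (the file, group, and dataset names here are illustrative):

```python
import h5py
import numpy as np

# Create a small file with a nested group structure under the root group.
with h5py.File("experiment.h5", "w") as f:
    grp = f.create_group("measurements/run1")  # nested groups, like directories
    grp.create_dataset("temperature", data=np.arange(5.0))

# Reopen and navigate the hierarchy with directory-style paths.
with h5py.File("experiment.h5", "r") as f:
    dset = f["measurements/run1/temperature"]  # a handle; no data loaded yet
    first_two = dset[:2]                       # only this slice is read into memory
```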
How to use HDF5 in Python
In Python, there are two libraries that can interface with the HDF5 format: PyTables and
h5py.
PyTables → employed by Pandas
h5py → maps the features of the HDF5 specification onto NumPy arrays.
Some features are shared between the two libraries, but we will focus on h5py.
Installing
● The HDF5 format is supported by the HDF Group, and it is based on open source
standards, meaning that your data will always be accessible, even if the group disappears.
● The command will also install numpy, in case you don't have it already in your
environment.
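Assuming installation via pip (the slide does not show the command itself), the usual command is the following; h5py declares numpy as a dependency, so pip pulls it in automatically:

```shell
pip install h5py
```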
HDF5 Viewer
● When working with HDF5 files, it is handy to have a tool that allows you to
explore the data graphically
● The HDF Group provides a graphical tool called HDFView.
Basic Saving and Reading Data
● Create datasets with f.create_dataset("name", data=array).
● Open file in write (w), append (a), or read (r) mode.
● Access datasets like a dictionary: f['dataset_name'].
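The steps above can be sketched as follows (the file and dataset names are illustrative):

```python
import h5py
import numpy as np

arr = np.random.rand(100)

# "w" creates/truncates the file; "a" appends, "r" opens read-only.
with h5py.File("data.h5", "w") as f:
    f.create_dataset("my_array", data=arr)

# Datasets are accessed like dictionary entries.
with h5py.File("data.h5", "r") as f:
    loaded = f["my_array"][:]  # [:] reads the full dataset as a NumPy array
```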
Basic Saving and Reading Data
Reading the data back is very similar to reading a NumPy file:
● Use list(f.keys()) or f.visit(print) to
browse the file hierarchy.
● Data is usually retrieved as NumPy
arrays.
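A short sketch of browsing a file's hierarchy (the file and dataset names are illustrative):

```python
import h5py
import numpy as np

# Write a file with two datasets so there is something to browse.
with h5py.File("browse.h5", "w") as f:
    f.create_dataset("a", data=np.zeros(3))
    f.create_dataset("b", data=np.ones(3))

with h5py.File("browse.h5", "r") as f:
    names = list(f.keys())  # top-level group/dataset names
    f.visit(print)          # recursively prints every name in the hierarchy
    data = f["b"][:]        # retrieved as a NumPy array
```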
Selective Reading from HDF5 files
● Slice datasets without loading the
full array: data[:10].
● Dataset object (h5py.Dataset) ≠
NumPy array.
● Efficient when working with very
large datasets.
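A minimal example of slicing without loading the whole array (file and dataset names are illustrative):

```python
import h5py
import numpy as np

with h5py.File("big.h5", "w") as f:
    f.create_dataset("data", data=np.arange(1_000_000))

with h5py.File("big.h5", "r") as f:
    dset = f["data"]        # an h5py.Dataset: a handle into the file
    head = dset[:10]        # only these 10 values are read from disk
    # The handle is not itself a NumPy array:
    is_array = isinstance(dset, np.ndarray)
```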
Selective Writing to HDF5 Files
● Create empty datasets with
create_dataset(shape).
● Write subsets directly: dset[10:20]
= arr[50:60].
● Mistake to avoid: dset = arr
(rebinds the Python variable; nothing is written to the file).
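A sketch of writing into part of a pre-allocated dataset (names are illustrative; unwritten entries default to zero):

```python
import h5py
import numpy as np

arr = np.arange(100.0)

with h5py.File("write.h5", "w") as f:
    # Create an empty dataset of a given shape, then fill only part of it.
    dset = f.create_dataset("partial", shape=(100,))
    dset[10:20] = arr[50:60]  # writes straight into the file
    # dset = arr  # WRONG: rebinds the variable; nothing reaches the file

with h5py.File("write.h5", "r") as f:
    out = f["partial"][:]
```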
Specify Data Types to Optimize Space
● Optimize storage with dtype (i1, i8, c16, etc.).
● Smaller dtypes save disk space but may truncate values.
● Default NumPy arrays = float64 (f8).
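A small comparison of dtypes (names illustrative; note that i1 can only hold −128…127, so larger values would be truncated):

```python
import h5py
import numpy as np

arr = np.arange(10)  # small integers, safe for any integer dtype

with h5py.File("types.h5", "w") as f:
    f.create_dataset("small", data=arr, dtype="i1")  # 1-byte ints: 8x smaller
    f.create_dataset("big", data=arr, dtype="i8")    # 8-byte ints

with h5py.File("types.h5", "r") as f:
    small_dtype = f["small"].dtype
    big_dtype = f["big"].dtype
```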
Save Data in Chunks
● Chunking splits datasets into smaller fixed-size blocks instead of storing as
one continuous array.
● Allows efficient reading/writing of only the needed parts (instead of the whole
dataset).
● Enables compression and resizing, since HDF5 operates on chunks
internally.
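A sketch of a chunked, compressed, resizable dataset (shapes and names are illustrative):

```python
import h5py
import numpy as np

with h5py.File("chunked.h5", "w") as f:
    dset = f.create_dataset(
        "grid",
        shape=(1000, 1000),
        chunks=(100, 100),      # stored as 100x100 blocks on disk
        compression="gzip",     # compression is applied per chunk
        maxshape=(None, 1000),  # chunked datasets can be resized later
    )
    # Only the chunks touched by this write are accessed.
    dset[0:100, 0:100] = np.ones((100, 100))

with h5py.File("chunked.h5", "r") as f:
    chunk_shape = f["grid"].chunks
    corner = f["grid"][0, 0]
```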
Storing Metadata in HDF5
● Metadata is stored as attributes attached to datasets or groups
(like properties/labels).
● Use .attrs to add key–value pairs (e.g., units, descriptions, author).
● Attributes make datasets self-describing, improving readability
and portability.
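A minimal example of attaching attributes (the keys and values here are illustrative):

```python
import h5py
import numpy as np

with h5py.File("meta.h5", "w") as f:
    dset = f.create_dataset("temperature", data=np.arange(5.0))
    # Attributes are key-value pairs attached to a dataset or group.
    dset.attrs["units"] = "K"
    dset.attrs["description"] = "furnace temperature readings"
    f.attrs["author"] = "R. Gupta"  # attributes on the root group

with h5py.File("meta.h5", "r") as f:
    units = f["temperature"].attrs["units"]
    author = f.attrs["author"]
```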