Efficient Data Management with

HDF5

Name: Rohan Gupta

Roll No: B22MT037

B.Tech Project

Under the guidance of


Dr. Devendra Negi
Department of Metallurgical and Materials Engineering
What is HDF5
HDF5 stands for Hierarchical Data Format, version 5.
To understand it, we should first look at what a
hierarchical data structure is.

What is a Hierarchical Data Structure

❏ A hierarchical data structure is an organizational
system where data are arranged in levels.
❏ It resembles a tree-like format with parent-child
relationships.
How HDF5 Works
❏ Hierarchy:
➢ At the top is the “root” group, below which are nested groups and datasets.
➢ Each dataset can be a multi-dimensional array containing numbers, text, images, or
other scientific measurements.
How HDF5 Works
❏ Access: Data and metadata are accessed through APIs in languages such as
Python (h5py), C/C++, Fortran, or MATLAB. Groups are navigated like
directories; datasets behave like arrays.

❏ Only the requested data are loaded into memory (“lazy loading”), so even
huge files remain manageable.
How to Use HDF5 in Python
In Python, there are two libraries that can interface with the HDF5 format: PyTables and
h5py.

PyTables → employed by Pandas

h5py → maps the features of the HDF5 specification to NumPy arrays.

The two libraries share some features, but we will focus on h5py.
Installing

● The HDF5 format is supported by the HDF Group, and it is based on open source
standards, meaning that your data will always be accessible, even if the group disappears.

● The command will also install numpy, in case you don't have it already in your
environment.
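The install command referred to above is not shown in the text; presumably it is the standard pip one, sketched here:

```shell
# Installs h5py; numpy is pulled in as a dependency if not already present
pip install h5py
```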

HDF5 Viewer

● When working with HDF5 files, it is handy to have a tool that allows you to
explore the data graphically
● The HDF Group provides a graphical viewer called HDFView.
Basic Saving and Reading Data

● Open the file in write (w), append (a), or read (r) mode.

● Create datasets with f.create_dataset("name", data=array).

● Access datasets like a dictionary: f['dataset_name'].
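The steps above can be sketched as follows; the file name and dataset name are illustrative:

```python
import numpy as np
import h5py

arr = np.random.random(100)

# "w" mode creates a new file, truncating any existing one
with h5py.File("example.h5", "w") as f:
    f.create_dataset("default", data=arr)

# "r" mode opens the file read-only; datasets are accessed like dict entries
with h5py.File("example.h5", "r") as f:
    data = f["default"][:]  # [:] pulls the full dataset into a NumPy array

print(data.shape)  # (100,)
```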


Basic Saving and Reading Data

To read the data back, we can do it in much the same way as reading a NumPy file:

● Use list(f.keys()) or f.visit(print) to browse the file hierarchy.

● Data is usually retrieved as NumPy arrays.
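A minimal sketch of browsing a file's hierarchy; the group and dataset names are made up for the example:

```python
import numpy as np
import h5py

with h5py.File("browse.h5", "w") as f:
    f.create_dataset("grp/vals", data=np.arange(5))  # creates group "grp" implicitly

with h5py.File("browse.h5", "r") as f:
    names = list(f.keys())      # top-level members only
    all_names = []
    f.visit(all_names.append)   # visits every group/dataset name recursively
    vals = f["grp/vals"][:]     # retrieved as a NumPy array

print(names)      # ['grp']
print(all_names)  # ['grp', 'grp/vals']
```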
Selective Reading from HDF5 Files

● Slice datasets without loading the full array: data[:10].

● Dataset object (h5py.Dataset) ≠ NumPy array.

● Efficient when working with very large datasets.
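A small sketch of selective reading, using an illustrative file name:

```python
import numpy as np
import h5py

with h5py.File("big.h5", "w") as f:
    f.create_dataset("values", data=np.arange(1000))

with h5py.File("big.h5", "r") as f:
    dset = f["values"]       # an h5py.Dataset: no data in memory yet
    first_ten = dset[:10]    # only these ten values are read from disk
    is_dataset = isinstance(dset, h5py.Dataset)

print(first_ten)   # [0 1 2 3 4 5 6 7 8 9]
print(is_dataset)  # True
```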

Selective Writing to HDF5 Files

● Create empty datasets with create_dataset("name", shape=...).

● Write subsets directly: dset[10:20] = arr[50:60].

● Mistake to avoid: dset = arr (rebinds the variable; nothing is written to the file).
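The pattern and the pitfall above can be sketched as follows (file and dataset names are illustrative):

```python
import numpy as np
import h5py

arr = np.arange(100)

with h5py.File("partial.h5", "w") as f:
    # allocate an empty 100-element integer dataset
    dset = f.create_dataset("values", shape=(100,), dtype="i8")
    dset[10:20] = arr[50:60]  # only this slice is written to disk
    # dset = arr  # WRONG: rebinds the Python name; nothing is saved

with h5py.File("partial.h5", "r") as f:
    stored = f["values"][:]

print(stored[10:20])  # [50 51 52 53 54 55 56 57 58 59]
print(stored[0])      # 0  (unwritten entries default to zero)
```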
Specify Data Types to Optimize Space

● Optimize storage with dtype (i1, i8, c16, etc.).

● Smaller dtypes save disk space but may truncate values.

● Default NumPy arrays = float64 (f8).
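A quick sketch of the trade-off: storing the same float data as f8 versus a 1-byte integer (the latter truncates values in [0, 1) down to 0):

```python
import numpy as np
import h5py

arr = np.random.random(1000)  # float64 ("f8") by default

with h5py.File("dtypes.h5", "w") as f:
    f.create_dataset("as_f8", data=arr)              # 8 bytes per value
    f.create_dataset("as_i1", data=arr, dtype="i1")  # 1 byte per value, values truncated

with h5py.File("dtypes.h5", "r") as f:
    f8 = f["as_f8"][:]
    i1 = f["as_i1"][:]

print(f8.dtype, i1.dtype)  # float64 int8
```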


Save Data in Chunks

● Chunking splits datasets into smaller fixed-size blocks instead of storing them as one contiguous array.

● Allows efficient reading/writing of only the needed parts (instead of the whole dataset).

● Enables compression and resizing, since HDF5 operates on chunks internally.
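A minimal sketch of a chunked, compressed dataset; the chunk shape and gzip choice are illustrative:

```python
import numpy as np
import h5py

with h5py.File("chunked.h5", "w") as f:
    dset = f.create_dataset(
        "values",
        shape=(1000, 1000),
        dtype="f8",
        chunks=(100, 100),   # stored on disk as 100x100 blocks
        compression="gzip",  # compression is applied per chunk
    )
    dset[0:100, 0:100] = np.random.random((100, 100))  # touches a single chunk

with h5py.File("chunked.h5", "r") as f:
    chunk_shape = f["values"].chunks

print(chunk_shape)  # (100, 100)
```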
Storing Metadata in HDF5
● Metadata is stored as attributes attached to datasets or groups (like properties/labels).

● Use .attrs to add key–value pairs (e.g., units, descriptions, author).

● Attributes make datasets self-describing, improving readability and portability.
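A short sketch of attaching attributes with .attrs; the attribute names and values here are hypothetical:

```python
import numpy as np
import h5py

with h5py.File("meta.h5", "w") as f:
    dset = f.create_dataset("temperature", data=np.random.random(10))
    dset.attrs["units"] = "K"                      # hypothetical metadata
    dset.attrs["description"] = "sample readings"  # hypothetical metadata

with h5py.File("meta.h5", "r") as f:
    meta = dict(f["temperature"].attrs)

print(meta["units"])  # K
```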
