Efficient Data Management with
HDF5
Name: Rohan Gupta
Roll No: B22MT037
B.Tech Project
Under the guidance of
Dr. Devendra Negi
Department of Metallurgical and Materials Engineering
What is HDF5
HDF5 stands for Hierarchical Data Format, version 5.
Therefore, we should first understand what a
hierarchical data structure is.
What is a Hierarchical Data Structure
❏ A hierarchical data structure is an organizational
system where data are arranged in levels.
❏ It resembles a tree, with parent–child
relationships between nodes.
How HDF5 Works
❏ Hierarchy:
➢ At the top is the “root” group, below which are nested groups and datasets.
➢ Each dataset can be a multi-dimensional array containing numbers, text, images, or
other scientific measurements.
How HDF5 Works
❏ Access: Data and metadata are accessed through APIs in languages such as
Python (h5py), C/C++, Fortran, or MATLAB. Groups are navigated like
directories, and datasets behave like arrays.
❏ Only the requested data are loaded into memory (“lazy loading”), so even
huge files remain manageable.
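A minimal sketch of this directory-like navigation with h5py (the file, group, and dataset names here are illustrative):

```python
import h5py
import numpy as np

# Create a small file with a nested group structure under the root group.
with h5py.File("experiment.h5", "w") as f:
    grp = f.create_group("measurements/run1")  # nested groups, like directories
    grp.create_dataset("temperature", data=np.arange(5.0))

# Reopen and navigate the hierarchy with directory-style paths.
with h5py.File("experiment.h5", "r") as f:
    dset = f["measurements/run1/temperature"]  # a handle; no data loaded yet
    first_two = dset[:2]                       # only this slice is read into memory
```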
How to use HDF5 in Python
In Python, there are two libraries that can interface with the HDF5 format: PyTables and
h5py.
PyTables → employed by Pandas
h5py → maps the features of the HDF5 specification onto NumPy arrays.
Some features are shared between the two libraries, but we will focus on h5py.
Installing
● The HDF5 format is supported by the HDF Group, and it is based on open source
standards, meaning that your data will always be accessible, even if the group disappears.
● The command will also install numpy, in case you don't have it already in your
environment.
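Assuming installation via pip (the slide does not show the command itself), the usual command is the following; h5py declares numpy as a dependency, so pip pulls it in automatically:

```shell
pip install h5py
```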
HDF5 Viewer
● When working with HDF5 files, it is handy to have a tool that allows you to
explore the data graphically
● The HDF Group provides a graphical tool called HDFView.
Basic Saving and Reading Data
● Create datasets with f.create_dataset("name", data=array).
● Open file in write (w), append (a), or read (r) mode.
● Access datasets like a dictionary: f['dataset_name'].
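The steps above can be sketched as follows (the file and dataset names are illustrative):

```python
import h5py
import numpy as np

arr = np.random.rand(100)

# "w" creates/truncates the file; "a" appends, "r" opens read-only.
with h5py.File("data.h5", "w") as f:
    f.create_dataset("my_array", data=arr)

# Datasets are accessed like dictionary entries.
with h5py.File("data.h5", "r") as f:
    loaded = f["my_array"][:]  # [:] reads the full dataset as a NumPy array
```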
Basic Saving and Reading Data
Reading the data back is very similar to reading a NumPy file:
● Use list(f.keys()) or f.visit(print) to
browse the file hierarchy.
● Data is usually retrieved as NumPy
arrays.
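A short sketch of browsing a file's hierarchy (the file and dataset names are illustrative):

```python
import h5py
import numpy as np

# Write a file with two datasets so there is something to browse.
with h5py.File("browse.h5", "w") as f:
    f.create_dataset("a", data=np.zeros(3))
    f.create_dataset("b", data=np.ones(3))

with h5py.File("browse.h5", "r") as f:
    names = list(f.keys())  # top-level group/dataset names
    f.visit(print)          # recursively prints every name in the hierarchy
    data = f["b"][:]        # retrieved as a NumPy array
```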
Selective Reading from HDF5 files
● Slice datasets without loading the
full array: data[:10].
● Dataset object (h5py.Dataset) ≠
NumPy array.
● Efficient when working with very
large datasets.
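A minimal example of slicing without loading the whole array (file and dataset names are illustrative):

```python
import h5py
import numpy as np

with h5py.File("big.h5", "w") as f:
    f.create_dataset("data", data=np.arange(1_000_000))

with h5py.File("big.h5", "r") as f:
    dset = f["data"]        # an h5py.Dataset: a handle into the file
    head = dset[:10]        # only these 10 values are read from disk
    # The handle is not itself a NumPy array:
    is_array = isinstance(dset, np.ndarray)
```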
Selective Writing to HDF5 Files
● Create empty datasets with
create_dataset(shape).
● Write subsets directly: dset[10:20]
= arr[50:60].
● Mistake to avoid: dset = arr
(rebinds the Python variable; nothing is written to the file).
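A sketch of writing into part of a pre-allocated dataset (names are illustrative; unwritten entries default to zero):

```python
import h5py
import numpy as np

arr = np.arange(100.0)

with h5py.File("write.h5", "w") as f:
    # Create an empty dataset of a given shape, then fill only part of it.
    dset = f.create_dataset("partial", shape=(100,))
    dset[10:20] = arr[50:60]  # writes straight into the file
    # dset = arr  # WRONG: rebinds the variable; nothing reaches the file

with h5py.File("write.h5", "r") as f:
    out = f["partial"][:]
```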
Specify Data Types to Optimize Space
● Optimize storage with dtype (i1, i8, c16, etc.).
● Smaller dtypes save disk space but may truncate values.
● Default NumPy arrays = float64 (f8).
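A small comparison of dtypes (names illustrative; note that i1 can only hold −128…127, so larger values would be truncated):

```python
import h5py
import numpy as np

arr = np.arange(10)  # small integers, safe for any integer dtype

with h5py.File("types.h5", "w") as f:
    f.create_dataset("small", data=arr, dtype="i1")  # 1-byte ints: 8x smaller
    f.create_dataset("big", data=arr, dtype="i8")    # 8-byte ints

with h5py.File("types.h5", "r") as f:
    small_dtype = f["small"].dtype
    big_dtype = f["big"].dtype
```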
Save Data in Chunks
● Chunking splits datasets into smaller fixed-size blocks instead of storing as
one continuous array.
● Allows efficient reading/writing of only the needed parts (instead of the whole
dataset).
● Enables compression and resizing, since HDF5 operates on chunks
internally.
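A sketch of a chunked, compressed, resizable dataset (shapes and names are illustrative):

```python
import h5py
import numpy as np

with h5py.File("chunked.h5", "w") as f:
    dset = f.create_dataset(
        "grid",
        shape=(1000, 1000),
        chunks=(100, 100),      # stored as 100x100 blocks on disk
        compression="gzip",     # compression is applied per chunk
        maxshape=(None, 1000),  # chunked datasets can be resized later
    )
    # Only the chunks touched by this write are accessed.
    dset[0:100, 0:100] = np.ones((100, 100))

with h5py.File("chunked.h5", "r") as f:
    chunk_shape = f["grid"].chunks
    corner = f["grid"][0, 0]
```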
Storing Metadata in HDF5
● Metadata is stored as attributes attached to datasets or groups
(like properties/labels).
● Use .attrs to add key–value pairs (e.g., units, descriptions, author).
● Attributes make datasets self-describing, improving readability
and portability.
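A minimal example of attaching attributes (the keys and values here are illustrative):

```python
import h5py
import numpy as np

with h5py.File("meta.h5", "w") as f:
    dset = f.create_dataset("temperature", data=np.arange(5.0))
    # Attributes are key-value pairs attached to a dataset or group.
    dset.attrs["units"] = "K"
    dset.attrs["description"] = "furnace temperature readings"
    f.attrs["author"] = "R. Gupta"  # attributes on the root group

with h5py.File("meta.h5", "r") as f:
    units = f["temperature"].attrs["units"]
    author = f.attrs["author"]
```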