GigaDatasets

A Unified and Lightweight Framework for Data Curation, Evaluation and Visualization

| Quick Start | Contributing | License | Citation |

✨ Introduction

GigaDatasets is a unified and lightweight framework for data curation, evaluation, and visualization, designed to make handling massive datasets simple, efficient, and consistent.

Major features
  • 🔍 Unified Workflow: Unify all steps from data curation and packaging to loading, evaluation, and visualization.
  • Lightweight and Easy to Use: Install from PyPI with one command (pip3 install giga-datasets) or from source, load data with one line of code (dataset = load_dataset(data_path)), and evaluate with one line of code (eval_results = FIDEvaluator(datasets)(pred_results)).
  • 🗂️ Multi-format and Multi-structure Data Support: File, LMDB, Pickle, and LeRobot datasets with flexible loading. Unified support for images, videos, 2D/3D boxes, 2D/3D points, and other structured data.
  • 🚀 Efficient Processing: Optimized for speed and memory, suitable for large-scale data processing needs.

⚡ Installation

GigaDatasets can be installed from PyPI and has to be installed in a virtual environment (venv or conda, for instance):

pip3 install giga-datasets

Alternatively, install directly from source to get the latest updates:

conda create -n giga_datasets python=3.11.10
conda activate giga_datasets
git clone https://github.com/open-gigaai/giga-datasets.git
cd giga-datasets
pip3 install -e .

🚀 Usage

We provide accessible demo data and Jupyter notebooks in getting_started. Utility scripts can be found in the scripts folder.

1. Load dataset

Datasets can be loaded with the load_dataset function from the giga_datasets library. A demo dataset is provided in the giga_data directory for you to try out. Here is a quick example; the full code is available here:

from giga_datasets import load_dataset

dataset = load_dataset('./getting_started/giga_data')
data_dict = dataset[0]
print('Dataset size:', len(dataset))
print('First item in dataset:', data_dict)

The giga_data directory contains the following structure:

giga_data/
├── config.json          # Configuration file describing the dataset
├── labels/              # Directory containing label files
│   ├── config.json      # Additional configuration for labels
│   └── data.pkl         # Serialized label data
└── images/              # Directory containing image files
    ├── config.json      # Additional configuration for images
    ├── data.mdb         # LMDB data file for images
    └── lock.mdb         # LMDB lock file

The config.json file in the giga_data directory contains the following structure:

{
    "_class_name": "Dataset",
    "config_paths": [
        "labels/config.json",
        "images/config.json"
    ]
}

This file specifies:

  • _class_name: Indicates the class type used for the dataset, which is Dataset in this case.
  • config_paths: Lists paths to additional configuration files for specific components of the dataset, such as labels/config.json and images/config.json.
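As an illustration of how the two keys fit together, here is a minimal sketch that reads a top-level config.json and resolves each entry of config_paths against the dataset root. It uses only the standard library and does not reflect how load_dataset is implemented internally; the helper name resolve_config_paths and the relative-to-root interpretation are assumptions made for this example.

```python
import json
import tempfile
from pathlib import Path


def resolve_config_paths(dataset_dir):
    """Return the class name and the paths of the component config files."""
    root = Path(dataset_dir)
    with open(root / "config.json", encoding="utf-8") as f:
        config = json.load(f)
    # Assumption: config_paths entries are relative to the dataset root.
    paths = [root / rel for rel in config["config_paths"]]
    return config["_class_name"], paths


# Build a throwaway directory holding only the top-level config shown above.
with tempfile.TemporaryDirectory() as tmp:
    top_config = {
        "_class_name": "Dataset",
        "config_paths": ["labels/config.json", "images/config.json"],
    }
    (Path(tmp) / "config.json").write_text(json.dumps(top_config), encoding="utf-8")
    class_name, component_paths = resolve_config_paths(tmp)
    relative = [p.relative_to(tmp).as_posix() for p in component_paths]
    print(class_name, relative)
```

Each resolved path points at a component directory (labels or images) that carries its own config.json.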

2. Package dataset

For an unstructured dataset, you can use the Writer classes (PklWriter, FileWriter, and LmdbWriter) to package your data into a structured format. Below is an example of how to package a dataset consisting of images and labels.

The raw_data directory contains the following structure:

raw_data/
├── 0.json               # Annotation file for image 0
├── 0.png                # Image file 0
├── 1.json               # Annotation file for image 1
├── 1.png                # Image file 1
├── ...

You can run the following Python code to package the dataset; the full code is available here:

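import json
import os

from tqdm import tqdm

# Assumption: these names are importable from the package top level, like load_dataset;
# check the full script for the exact imports.
from giga_datasets import Dataset, LmdbWriter, PklWriter, load_dataset, utils

# image_dir (raw data) and save_dir (output) must be defined before this point.
image_paths = utils.list_dir(image_dir, recursive=True, exts=['.png', '.jpg', '.jpeg'])
label_writer = PklWriter(os.path.join(save_dir, 'labels'))
image_writer = LmdbWriter(os.path.join(save_dir, 'images'))
for idx, image_path in enumerate(tqdm(image_paths)):
    # The annotation shares the image's base name, whatever the image extension.
    label_path = os.path.splitext(image_path)[0] + '.json'
    with open(label_path) as f:
        label_dict = json.load(f)
    label_dict['data_index'] = idx
    label_writer.write_dict(label_dict)
    image_writer.write_image(idx, image_path)
label_writer.write_config()
image_writer.write_config()
label_writer.close()
image_writer.close()
# Reload both parts, combine them into one dataset, and save it.
label_dataset = load_dataset(os.path.join(save_dir, 'labels'))
image_dataset = load_dataset(os.path.join(save_dir, 'images'))
dataset = Dataset([label_dataset, image_dataset])
dataset.save(save_dir)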
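The packaging loop assumes that every image has an annotation file with the same base name. Before packaging a large directory, it can be useful to verify that assumption. The following is a minimal standard-library sketch; the helper pair_images_with_labels is hypothetical and not part of giga_datasets.

```python
import tempfile
from pathlib import Path

IMAGE_EXTS = {".png", ".jpg", ".jpeg"}


def pair_images_with_labels(raw_dir):
    """Return (image, label) path pairs and the images that lack a label."""
    pairs, missing = [], []
    for image_path in sorted(Path(raw_dir).rglob("*")):
        if image_path.suffix.lower() not in IMAGE_EXTS:
            continue
        # The annotation is expected next to the image, with a .json suffix.
        label_path = image_path.with_suffix(".json")
        if label_path.is_file():
            pairs.append((image_path, label_path))
        else:
            missing.append(image_path)
    return pairs, missing


# Demo on a throwaway directory: two complete pairs and one image without a label.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp)
    for name in ("0.png", "0.json", "1.jpg", "1.json", "2.png"):
        (raw / name).write_text("")
    pairs, missing = pair_images_with_labels(raw)
    pair_names = [(img.name, lbl.name) for img, lbl in pairs]
    missing_names = [img.name for img in missing]
    print(pair_names, missing_names)
```

Images reported in the second list can be fixed or skipped before the writers are created, so a missing annotation does not abort a long packaging run.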

We support packaging and reading different data formats. In addition to packaging images, we also provide an example of packaging video data, where we store the video's metadata.

# package video samples in the input directory to the output directory
python getting_started/pack_videos.py --video_dir /path/to/your/raw_videos --save_dir ./giga_videos

# if you want to package videos into lmdb format for better read performance
python getting_started/pack_videos.py --video_dir /path/to/your/raw_videos --save_dir ./giga_videos --pack-lmdb

# if you want to package samples, but not copy the video files and only store the metadata and absolute paths
python getting_started/pack_videos.py --video_dir /path/to/your/raw_videos --save_dir ./giga_videos --only_path

3. Add new field

During model training or inference, a sample is often represented as a dictionary with multiple fields. Our framework is designed to be easily extensible to accommodate new data fields. Below is an example of how to add canny maps as a new field:

python getting_started/add_new_filed.py --data_dir getting_started/giga_data
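The idea of deriving a new field from an existing one can be sketched without the library. The example below treats a sample as a plain dictionary and stores a derived map under a new key. It is illustrative only: the 'image' key, the add_field helper, and the simple gradient-threshold edge_map (a stand-in, not the Canny algorithm) are assumptions, and the script above may work differently.

```python
def edge_map(gray, threshold=50):
    """Binary edge map from neighbor differences (a stand-in, not Canny)."""
    height, width = len(gray), len(gray[0])
    edges = [[0] * width for _ in range(height)]
    for y in range(height - 1):
        for x in range(width - 1):
            dx = abs(gray[y][x + 1] - gray[y][x])
            dy = abs(gray[y + 1][x] - gray[y][x])
            if dx + dy >= threshold:
                edges[y][x] = 255
    return edges


def add_field(sample, name, fn, source="image"):
    """Return a copy of the sample with fn(sample[source]) stored under name."""
    updated = dict(sample)
    updated[name] = fn(sample[source])
    return updated


# A tiny grayscale image with one vertical edge between columns 1 and 2.
sample = {
    "data_index": 0,
    "image": [
        [0, 0, 200, 200],
        [0, 0, 200, 200],
        [0, 0, 200, 200],
    ],
}
sample = add_field(sample, "edges", edge_map)
print(sorted(sample))
print(sample["edges"])
```

Returning a copy keeps the original sample untouched, which makes it easy to compare fields before and after the new one is added.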

Additional Usage Examples

Note: More usage examples and feature documentation will be added in future updates. Stay tuned!

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

📄 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

📖 Citation

@misc{gigaai2025gigadatasets,
    author = {GigaAI},
    title = {GigaDatasets: A Unified and Lightweight Framework for Data Curation, Evaluation and Visualization},
    year = {2025},
    publisher = {GitHub},
    journal = {GitHub repository},
    howpublished = {\url{https://github.com/open-gigaai/giga-datasets}}
}
