- 📄 Paper (arXiv): LLM as Dataset Analyst
- 📘 ECCV 2024 Camera Ready: Springer Link
- 🌐 Project Website: llm-as-dataset-analyst.github.io
- 🎬 Video Presentation: YouTube
🧠 Discover subpopulation structures using Large Language Models (LLMs) with linguistic interpretability and automation.
SSD-LLM is an innovative framework for discovering subpopulation structures within datasets using Large Language Models (LLMs). By leveraging LLMs' extensive world knowledge and advanced reasoning capabilities, SSD-LLM offers:
- ✨ Linguistically Interpretable Subpopulation Discovery: Understand dataset structures through natural language
- 🤖 Automated Dataset Analysis: Use LLMs to uncover subpopulation patterns with minimal human effort
- 🔄 Comprehensive Workflow: End-to-end pipeline for subpopulation discovery and evaluation
- 🔌 Flexible Integration: Compatible with various MLLMs and LLMs
Future Directions:
- 📊 Diverse structure discovery: Tailoring subpopulation forms to specific downstream needs
- 🖼️ Multimodal extension: Applying SSD-LLM to vision and multimodal datasets
- ✅ Unbiased data generation: Supporting fair and balanced dataset development
1. Clone the repository: `git clone https://github.com/llm-as-dataset-analyst/SSDLLM.git`
2. Navigate to the project directory: `cd SSDLLM`
3. Install dependencies: `pip install -e .`
SSDLLM/
├── captions/ # Pre-captioned datasets (e.g., with LLaVA1.5-7B)
├── config/ # YAML configs for datasets and pipeline settings
├── run.sh # Main entry script
├── utils.py # Configuration and utility functions
└── step1_image_caption/ # Custom dataset captioning logic
- `captions/`: Pre-captioned datasets using LLaVA1.5-7B; ready for direct use.
- `config/`: Contains YAML configs:
  - `0_summary.yaml`: Sets pipeline hyperparameters (auto-saved to `output/` after running).
  - Other files define the task name, dataset name, class count, etc.; follow the provided examples to format your own.
- `run.sh`: Pipeline launcher script. Supports switching `mllm` and `llm`; ensure the corresponding models are prepared.
- `utils.py`: Contains the OpenAI API key and helper functions.
- `step1_image_caption/`: Batch captioning scripts for custom datasets (supports the ImageFolder format).
1. Set your OpenAI API key in `utils.py`:

   `api_key = "your-openai-api-key"`
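As a small sketch (not part of the repo), the hard-coded key in `utils.py` could instead be read from an environment variable; the `OPENAI_API_KEY` name and the fallback value below are assumptions, mirroring the placeholder in `utils.py`:

```python
import os

# Prefer an environment variable over a hard-coded key; the fallback
# mirrors the placeholder used in utils.py (both names are illustrative).
api_key = os.environ.get("OPENAI_API_KEY", "your-openai-api-key")
```

This keeps the key out of version control while remaining a drop-in replacement for the hard-coded assignment.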
2. Adjust the `run.sh` parameters:

   `mllm_name=llava1.5-7b`
   `llm_name=gpt-3.5-turbo`
   `for class_name in mood`

   - `mllm_name`: multimodal LLM used for captioning
   - `llm_name`: main LLM for reasoning and subpopulation discovery
   - `class_name`: dataset to analyze (check `config/` for available options)
3. Prepare a custom dataset (optional):

   - Format your images using the ImageFolder structure:

     dataset/
     ├── class_a/
     ├── class_b/
     └── ...

   - Modify `step1_image_caption/scripts_infer_batch.sh` for your inference logic.
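Before captioning, it can help to sanity-check the layout. The helper below is an illustrative sketch (not part of the repo); the function name and the set of image extensions are assumptions:

```python
from pathlib import Path

# Extensions to treat as images (illustrative; extend as needed).
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".bmp", ".webp"}

def count_images_per_class(root):
    """Count image files per class subfolder in an ImageFolder-style layout."""
    counts = {}
    for cls_dir in sorted(Path(root).iterdir()):
        if cls_dir.is_dir():
            counts[cls_dir.name] = sum(
                1 for p in cls_dir.iterdir() if p.suffix.lower() in IMAGE_EXTS
            )
    return counts
```

Calling `count_images_per_class("dataset")` returns a dict like `{"class_a": 120, "class_b": 95}`, making empty or misnamed class folders easy to spot before running the captioning step.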
Run the pipeline using:

`bash run.sh`

This project benefits from the following repositories:
Thanks for their great work!
If you find our work helpful, please cite us:
@inproceedings{luo2025llm,
title={LLM as dataset analyst: Subpopulation structure discovery with large language model},
author={Luo, Yulin and An, Ruichuan and Zou, Bocheng and Tang, Yiming and Liu, Jiaming and Zhang, Shanghang},
booktitle={European Conference on Computer Vision},
pages={235--252},
year={2025},
organization={Springer}
}

🌟 Star this repo if you find it useful! Contributions and feedback are welcome 🙌
