- 📄 Paper (arXiv): LLM as Dataset Analyst
- 📘 ECCV 2024 Camera Ready: Springer Link
- 🌐 Project Website: llm-as-dataset-analyst.github.io
- 🎬 Video Presentation: YouTube
🧠 Discover subpopulation structures using Large Language Models (LLMs) with linguistic interpretability and automation.
SSD-LLM is an innovative framework for discovering subpopulation structures within datasets using Large Language Models (LLMs). By leveraging LLMs' extensive world knowledge and advanced reasoning capabilities, SSD-LLM offers:
- ✨ Linguistically Interpretable Subpopulation Discovery: Understand dataset structures through natural language
- 🤖 Automated Dataset Analysis: Use LLMs to uncover subpopulation patterns with minimal human effort
- 🔄 Comprehensive Workflow: End-to-end pipeline for subpopulation discovery and evaluation
- 🔌 Flexible Integration: Compatible with various MLLMs and LLMs
Future Directions:
- 📊 Diverse structure discovery: Tailoring subpopulation forms to specific downstream needs
- 🖼️ Multimodal extension: Applying SSD-LLM to vision and multimodal datasets
- ✅ Unbiased data generation: Supporting fair and balanced dataset development
1. Clone the repository: `git clone https://github.com/llm-as-dataset-analyst/SSDLLM.git`
2. Navigate to the project directory: `cd SSDLLM`
3. Install dependencies: `pip install -e .`
SSDLLM/
├── captions/ # Pre-captioned datasets (e.g., with LLaVA1.5-7B)
├── config/ # YAML configs for datasets and pipeline settings
├── run.sh # Main entry script
├── utils.py # Configuration and utility functions
└── step1_image_caption/ # Custom dataset captioning logic
- `captions/`: Pre-captioned datasets using LLaVA1.5-7B; ready for direct use.
- `config/`: Contains YAML configs:
  - `0_summary.yaml`: Sets pipeline hyperparameters (auto-saved to `output/` after running).
  - Other files define the task name, dataset name, class count, etc.; follow the provided examples to format your own.
- `run.sh`: Pipeline launcher script. Supports switching `mllm` and `llm`; ensure the corresponding models are prepared.
- `utils.py`: Contains the OpenAI API key and helper functions.
- `step1_image_caption/`: Batch captioning scripts for custom datasets (supports the ImageFolder format).
1. Set your OpenAI API key in `utils.py`:

   `api_key = "your-openai-api-key"`
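As a small sketch (not part of the repo), the hard-coded key in `utils.py` could instead be read from an environment variable; the `OPENAI_API_KEY` name and the fallback value below are assumptions, mirroring the placeholder in `utils.py`:

```python
import os

# Prefer an environment variable over a hard-coded key; the fallback
# mirrors the placeholder used in utils.py (both names are illustrative).
api_key = os.environ.get("OPENAI_API_KEY", "your-openai-api-key")
```

This keeps the key out of version control while remaining a drop-in replacement for the hard-coded assignment.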
2. Adjust the `run.sh` parameters:

   `mllm_name=llava1.5-7b`
   `llm_name=gpt-3.5-turbo`
   `for class_name in mood`

   - `mllm_name`: multimodal LLM used for captioning
   - `llm_name`: main LLM for reasoning and subpopulation discovery
   - `class_name`: dataset to analyze (check `config/` for available options)
3. Prepare a custom dataset (optional):

   - Format your images using the ImageFolder structure:

     dataset/
     ├── class_a/
     ├── class_b/
     └── ...

   - Modify `step1_image_caption/scripts_infer_batch.sh` for your inference logic.
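Before captioning, it can help to sanity-check the layout. The helper below is an illustrative sketch (not part of the repo); the function name and the set of image extensions are assumptions:

```python
from pathlib import Path

# Extensions to treat as images (illustrative; extend as needed).
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".bmp", ".webp"}

def count_images_per_class(root):
    """Count image files per class subfolder in an ImageFolder-style layout."""
    counts = {}
    for cls_dir in sorted(Path(root).iterdir()):
        if cls_dir.is_dir():
            counts[cls_dir.name] = sum(
                1 for p in cls_dir.iterdir() if p.suffix.lower() in IMAGE_EXTS
            )
    return counts
```

Calling `count_images_per_class("dataset")` returns a dict like `{"class_a": 120, "class_b": 95}`, making empty or misnamed class folders easy to spot before running the captioning step.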
Run the pipeline using:

`bash run.sh`

This project benefits from the following repositories:
Thanks for their great work!
If you find our work helpful, please cite us:
@inproceedings{luo2025llm,
title={LLM as dataset analyst: Subpopulation structure discovery with large language model},
author={Luo, Yulin and An, Ruichuan and Zou, Bocheng and Tang, Yiming and Liu, Jiaming and Zhang, Shanghang},
booktitle={European Conference on Computer Vision},
pages={235--252},
year={2025},
organization={Springer}
}

🌟 Star this repo if you find it useful! Contributions and feedback are welcome 🙌
