VisualTrans is the first comprehensive benchmark specifically designed for Visual Transformation Reasoning (VTR) in real-world human-object interaction scenarios.
📄 Paper: VisualTrans: A Benchmark for Real-World Visual Transformation Reasoning
🗂️ Dataset: VisualTrans Benchmark Dataset - Download benchmark data and images
Key Features:
- 🎯 12 manipulation tasks covering diverse real-world scenarios
- 🧠 3 reasoning dimensions: Spatial, Procedural, and Quantitative
- 📊 472 high-quality QA pairs in multiple formats
- 🔄 End-to-end pipeline from data processing to evaluation
To get started, clone the repository and set up the environment:

```bash
# Clone the repository
git clone https://github.com/WangYipu2002/VisualTrans.git
cd VisualTrans

# Create and activate conda environment
conda create -n VisualTrans python=3.10
conda activate VisualTrans

# Install required dependencies
pip install -r requirements.txt
```

VisualTrans provides an end-to-end pipeline with four main components:
- Data Cleaning - Filter and preprocess raw visual data
- Meta Annotation - Annotate images with structured metadata (uses the Grounding DINO model)
- Question Generation - Synthesize reasoning questions and answers
- Model Evaluation - Evaluate vision-language models
📌 Usage Options:
- Full Pipeline: Start from Step 1 to generate your own transformation QA data (the four stage scripts are sequenced in the sketch after this list)
- Evaluation Only: Skip directly to Step 4 if you only want to evaluate models on VisualTrans
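For the full pipeline, the four components map onto four stage scripts, run in order. This is just the sequencing; each script's path variables must be set first, as described in the Path Configuration section below:

```bash
# Full pipeline from the repository root; configure each script's
# path variables (see Path Configuration below) before running.
bash VisualTrans/filter/filter.bash                    # Step 1: Data Cleaning
bash VisualTrans/meta_annotation/meta_annotation.bash  # Step 2: Meta Annotation
bash VisualTrans/qa_gen/qa_gen.bash                    # Step 3: Question Generation
bash VisualTrans/eval/eval.bash                        # Step 4: Model Evaluation
```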
Step 1: Data Cleaning

Configure paths in `VisualTrans/filter/filter.bash` and run:

```bash
bash VisualTrans/filter/filter.bash
```

Step 2: Meta Annotation

First, download the Grounding DINO model from HuggingFace.
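For example, a minimal sketch using the huggingface_hub CLI; the repo ID `IDEA-Research/grounding-dino-base` and the local directory are assumptions, so substitute the checkpoint your setup expects:

```bash
# Assumption: a recent huggingface_hub is installed, and the
# IDEA-Research/grounding-dino-base checkpoint is the one needed.
huggingface-cli download IDEA-Research/grounding-dino-base \
  --local-dir path/to/your/grounding-dino
```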
Then configure paths in `VisualTrans/meta_annotation/meta_annotation.bash` and run:

```bash
bash VisualTrans/meta_annotation/meta_annotation.bash
```

Step 3: Question Generation

Configure paths in `VisualTrans/qa_gen/qa_gen.bash` and run:
```bash
bash VisualTrans/qa_gen/qa_gen.bash
```

You can write your own script to sample a specific number of QA pairs and save them as a JSON file for evaluation.
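For example, a minimal sketch using jq and shuf, assuming the generated QA file is a JSON array; the file names and the sample size of 100 are placeholders (the output name matches the `BENCHMARK_PATH` used by the evaluation step):

```bash
# Sample 100 random QA pairs from a JSON array of QA pairs and
# write them back out as a JSON array for evaluation.
# Input/output paths and sample size are placeholders.
jq -c '.[]' path/to/generated_qa.json | shuf -n 100 | jq -s '.' > VisualTrans.json
```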
Step 4: Model Evaluation

Configure paths in `VisualTrans/eval/eval.bash`, choose the model you want to evaluate, and run:

```bash
bash VisualTrans/eval/eval.bash
```

Path Configuration:

Before running each step, edit the corresponding bash files to set your paths:
`VisualTrans/filter/filter.bash`:

```bash
IMAGE_BASE_DIR="path/to/your/image/base/dir"
FILTER_BASE_DIR="path/to/your/filter/jsonl/dir"
DISCARDED_OUTPUT_DIR="path/to/your/discarded/image/dir"
```

`VisualTrans/meta_annotation/meta_annotation.bash`:
```bash
IMAGE_BASE_DIR="path/to/your/image/base/dir"
CROP_IMAGE_DIR="path/to/your/crop/image/dir"
META_OUTPUT_DIR="path/to/your/meta/output/dir"
```

`VisualTrans/qa_gen/qa_gen.bash`:
```bash
IMAGE_BASE_DIR="path/to/your/image/base/dir"
META_OUTPUT_DIR="path/to/your/meta/output/dir"
```

`VisualTrans/eval/eval.bash`:
```bash
MODEL_NAME="your_model_name"
API_KEY="your_api_key"
BENCHMARK_PATH="path/to/your/VisualTrans.json"
IMAGE_BASE="path/to/your/image/base/dir"
RESULT_DIR="path/to/your/result/dir"
```

If you use this framework, please cite our work:
```bibtex
@misc{ji2025visualtransbenchmarkrealworldvisual,
  title={VisualTrans: A Benchmark for Real-World Visual Transformation Reasoning},
  author={Yuheng Ji and Yipu Wang and Yuyang Liu and Xiaoshuai Hao and Yue Liu and Yuting Zhao and Huaihai Lyu and Xiaolong Zheng},
  year={2025},
  eprint={2508.04043},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2508.04043},
}
```