AutoCaption is a novel framework that uses Monte Carlo Tree Search (MCTS) to iteratively construct rich, diverse, and detailed video captions that thoroughly cover objects, actions, environments, and temporal dynamics.
MCTS-VCB is a fine-grained video captioning benchmark automatically constructed using AutoCaption, enabling comprehensive evaluation of Multimodal Large Language Models (MLLMs) on video understanding tasks.
- 🧠 AutoCaption Framework: Iteratively constructs high-quality video descriptions using MCTS, covering objects, actions, environments, and more
- 📊 MCTS-VCB Benchmark: Contains diverse, multi-faceted video captions for robust MLLM evaluation
- 🔍 Comprehensive Evaluation: More than 20 MLLMs benchmarked, with Gemini-1.5-Pro achieving the top F1 score of 71.2%
- 📈 Fine-tuning Results: InternVL2.5-8B fine-tuned on AutoCaption data achieved:
  - +25.0% improvement on MCTS-VCB
  - +16.3% improvement on DREAM-1K
Requirements:

- Python 3.8+
- CUDA-compatible GPU (recommended)
- 16GB+ GPU memory for Qwen2-VL-7B
Clone the repository and install the dependencies:

```bash
git clone https://github.com/your-username/autocaption.git
cd autocaption
pip install -r requirements.txt
```

Or install as an editable package:

```bash
git clone https://github.com/your-username/autocaption.git
cd autocaption
pip install -e .
```

Optional dependencies:

```bash
# For distributed processing
pip install mpi4py

# For experiment tracking
pip install wandb
```
{"video_name": "video1.mp4", "video_path": "/path/to/video1.mp4", "index": 0}
{"video_name": "video2.mp4", "video_path": "/path/to/video2.mp4", "index": 1}# Copy and modify configuration
Copy and modify the configuration:

```bash
cp config/config.yaml config/my_config.yaml
# Edit config/my_config.yaml as needed
```
Run with multiple GPUs:

```bash
python main.py \
    --input_path data/videos.jsonl \
    --output_path results/captions.jsonl \
    --process_num 4 \
    --gpu_nums_one_process 2 \
    --max_rollout_times 25 \
    --log_level INFO
```
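Once the run finishes, the generated captions can be inspected with a few lines of Python. Treat the record keys below (`video_name`, `caption`) as assumptions about the output schema rather than a guaranteed format:

```python
import json

# Hedged sketch: print the generated captions. The field names
# ("video_name", "caption") are assumptions about the output schema.
with open("results/captions.jsonl") as f:
    for line in f:
        record = json.loads(line)
        print(record.get("video_name"), "->", record.get("caption"))
```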
Project structure:

```
autocaption/
├── 📁 generator/              # Model generators
│   └── qwen2vl_7b.py          # Qwen2-VL-7B wrapper
├── 📁 scripts/                # Utility scripts
│   └── run_autocaption.sh     # Main execution script
├── 🐍 main.py                 # Main entry point
├── 🐍 mcts.py                 # MCTS algorithm implementation
├── 🐍 util.py                 # Utility functions
├── 📋 requirements.txt        # Python dependencies
├── ⚙️ setup.py                # Package setup
└── 📄 README.md               # This file
```
AutoCaption uses six different action types for comprehensive video analysis (an illustrative selection sketch follows the list):
- ACTION1: Overall video description
- ACTION2: Detail-focused observation (weighted selection)
- ACTION3: Temporal perspective analysis
- ACTION4: Spatial perspective analysis
- ACTION5: Background description
- ACTION6: Camera movement analysis
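To make the search loop concrete, below is a small, self-contained sketch of how an MCTS-style loop could balance these six action types with a standard UCT rule. Everything here (names, constants, the random stand-in reward) is an illustrative assumption, not the actual implementation; the real selection logic, including the weighted selection for ACTION2, lives in mcts.py.

```python
import math
import random

# Illustrative action set mirroring ACTION1-ACTION6 above.
ACTIONS = [
    "overall_description",     # ACTION1
    "detail_observation",      # ACTION2 (weighted selection in the real code)
    "temporal_analysis",       # ACTION3
    "spatial_analysis",        # ACTION4
    "background_description",  # ACTION5
    "camera_movement",         # ACTION6
]

def uct_select(stats, c=1.4):
    """Pick the next action by the UCT rule; untried actions go first."""
    total = sum(s["visits"] for s in stats.values())

    def score(action):
        s = stats[action]
        if s["visits"] == 0:
            return float("inf")  # explore every action at least once
        exploit = s["value"] / s["visits"]
        explore = c * math.sqrt(math.log(total) / s["visits"])
        return exploit + explore

    return max(ACTIONS, key=score)

stats = {a: {"visits": 0, "value": 0.0} for a in ACTIONS}
for _ in range(25):  # cf. --max_rollout_times 25
    action = uct_select(stats)
    reward = random.random()  # stand-in for a caption-quality score
    stats[action]["visits"] += 1
    stats[action]["value"] += reward

print({a: stats[a]["visits"] for a in ACTIONS})
```

In the real framework, the reward would come from scoring the caption content produced by the rollout rather than from a random number.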
If you use AutoCaption or MCTS-VCB in your research, please cite our paper:
```bibtex
@misc{yu2025evaluatingmultimodallargelanguage,
      title={Evaluating Multimodal Large Language Models on Video Captioning via Monte Carlo Tree Search},
      author={Linhao Yu and Xinguang Ji and Yahui Liu and Fanheng Kong and Chenxi Sun and Jingyuan Zhang and Hongzhi Zhang and V. W. and Fuzheng Zhang and Deyi Xiong},
      year={2025},
      eprint={2506.11155},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.11155},
}
```