AutoCaption is a novel framework that uses Monte Carlo Tree Search (MCTS) to iteratively construct rich, diverse, and detailed video captions that thoroughly cover objects, actions, environments, and temporal dynamics.
MCTS-VCB is a fine-grained video captioning benchmark automatically constructed using AutoCaption, enabling comprehensive evaluation of Multimodal Large Language Models (MLLMs) on video understanding tasks.
- 🧠 AutoCaption Framework: Iteratively constructs high-quality video descriptions using MCTS, covering objects, actions, environments, and more
- 📊 MCTS-VCB Benchmark: Contains diverse, multi-faceted video captions for robust MLLM evaluation
- 🔍 Comprehensive Evaluation: More than 20 MLLMs benchmarked, with Gemini-1.5-Pro achieving the top F1 score of 71.2%
- 📈 Fine-tuning Results: InternVL2.5-8B fine-tuned on AutoCaption data achieved:
  - +25.0% improvement on MCTS-VCB
  - +16.3% improvement on DREAM-1K
Requirements:

- Python 3.8+
- CUDA-compatible GPU (recommended)
- 16GB+ GPU memory for Qwen2-VL-7B
Clone the repository and install the dependencies:

```bash
git clone https://github.com/your-username/autocaption.git
cd autocaption
pip install -r requirements.txt
```

Or install as an editable package:

```bash
git clone https://github.com/your-username/autocaption.git
cd autocaption
pip install -e .
```

Optional dependencies:

```bash
# For distributed processing
pip install mpi4py

# For experiment tracking
pip install wandb
```
{"video_name": "video1.mp4", "video_path": "/path/to/video1.mp4", "index": 0}
{"video_name": "video2.mp4", "video_path": "/path/to/video2.mp4", "index": 1}# Copy and modify configuration
Copy and modify the configuration:

```bash
cp config/config.yaml config/my_config.yaml
# Edit config/my_config.yaml as needed
```
Run with multiple GPUs:

```bash
python main.py \
    --input_path data/videos.jsonl \
    --output_path results/captions.jsonl \
    --process_num 4 \
    --gpu_nums_one_process 2 \
    --max_rollout_times 25 \
    --log_level INFO
```
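Once the run finishes, the generated captions can be inspected with a few lines of Python. Treat the record keys below (`video_name`, `caption`) as assumptions about the output schema rather than a guaranteed format:

```python
import json

# Hedged sketch: print the generated captions. The field names
# ("video_name", "caption") are assumptions about the output schema.
with open("results/captions.jsonl") as f:
    for line in f:
        record = json.loads(line)
        print(record.get("video_name"), "->", record.get("caption"))
```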
Project structure:

```
autocaption/
├── 📁 generator/              # Model generators
│   └── qwen2vl_7b.py          # Qwen2-VL-7B wrapper
├── 📁 scripts/                # Utility scripts
│   └── run_autocaption.sh     # Main execution script
├── 🐍 main.py                 # Main entry point
├── 🐍 mcts.py                 # MCTS algorithm implementation
├── 🐍 util.py                 # Utility functions
├── 📋 requirements.txt        # Python dependencies
├── ⚙️ setup.py                # Package setup
└── 📄 README.md               # This file
```
AutoCaption uses six different action types for comprehensive video analysis (an illustrative selection sketch follows the list):
- ACTION1: Overall video description
- ACTION2: Detail-focused observation (weighted selection)
- ACTION3: Temporal perspective analysis
- ACTION4: Spatial perspective analysis
- ACTION5: Background description
- ACTION6: Camera movement analysis
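To make the search loop concrete, below is a small, self-contained sketch of how an MCTS-style loop could balance these six action types with a standard UCT rule. Everything here (names, constants, the random stand-in reward) is an illustrative assumption, not the actual implementation; the real selection logic, including the weighted selection for ACTION2, lives in mcts.py.

```python
import math
import random

# Illustrative action set mirroring ACTION1-ACTION6 above.
ACTIONS = [
    "overall_description",     # ACTION1
    "detail_observation",      # ACTION2 (weighted selection in the real code)
    "temporal_analysis",       # ACTION3
    "spatial_analysis",        # ACTION4
    "background_description",  # ACTION5
    "camera_movement",         # ACTION6
]

def uct_select(stats, c=1.4):
    """Pick the next action by the UCT rule; untried actions go first."""
    total = sum(s["visits"] for s in stats.values())

    def score(action):
        s = stats[action]
        if s["visits"] == 0:
            return float("inf")  # explore every action at least once
        exploit = s["value"] / s["visits"]
        explore = c * math.sqrt(math.log(total) / s["visits"])
        return exploit + explore

    return max(ACTIONS, key=score)

stats = {a: {"visits": 0, "value": 0.0} for a in ACTIONS}
for _ in range(25):  # cf. --max_rollout_times 25
    action = uct_select(stats)
    reward = random.random()  # stand-in for a caption-quality score
    stats[action]["visits"] += 1
    stats[action]["value"] += reward

print({a: stats[a]["visits"] for a in ACTIONS})
```

In the real framework, the reward would come from scoring the caption content produced by the rollout rather than from a random number.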
If you use AutoCaption or MCTS-VCB in your research, please cite our paper:
```bibtex
@misc{yu2025evaluatingmultimodallargelanguage,
      title={Evaluating Multimodal Large Language Models on Video Captioning via Monte Carlo Tree Search},
      author={Linhao Yu and Xinguang Ji and Yahui Liu and Fanheng Kong and Chenxi Sun and Jingyuan Zhang and Hongzhi Zhang and V. W. and Fuzheng Zhang and Deyi Xiong},
      year={2025},
      eprint={2506.11155},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.11155},
}
```