Official Codebase for CVPR 2026 Highlight Paper: "Building a Precise Video Language with Human–AI Oversight"
Zhiqiu Lin¹, Chancharik Mitra¹, Siyuan Cen¹, Isaac Li¹, Yuhan Huang¹, Yu Tong Tiffany Ling¹, Hewei Wang¹, Irene Pi¹, Shihang Zhu¹, Ryan Rao¹, George Liu¹, Jiaxi Li¹, Ruojin Li¹, Yili Han¹, Yilun Du², Deva Ramanan¹
¹Carnegie Mellon University ²Harvard University
- 2026-05-13: Released evaluation code, updated test set, and published the CHAI SFT 8B model.
Video–language models learn to reason about dynamic scenes through natural language, yet producing precise video captions remains challenging. CHAI (Critique-based Human–AI) is an oversight framework that pairs trained human experts with model-generated pre-captions: experts provide correctional critiques that guide revisions into improved post-captions. This division of labor offloads text generation to models so that humans can focus on verification, improving both accuracy and efficiency.
We release open datasets, benchmarks, and training recipes built on a structured captioning specification covering subjects, scenes, motion, spatial layout, and camera dynamics—grounded in hundreds of visual primitives developed with professional filmmakers. The resulting critiques and preferences provide rich supervision for improving open-source VLMs (Qwen3-VL) through SFT, DPO, and inference-time scaling on three tasks: caption generation, reward modeling, and critique generation.
- Python 3.10+
- Conda (recommended)
- GPU(s) for model inference (tested on NVIDIA A6000)
# Clone the repository
git clone https://github.com/TODO/CHAI.git
cd CHAI
# Create and activate conda environment
conda create -n chai python=3.10 -y
conda activate chai
# Install conda dependencies
conda install -c conda-forge ffmpeg=6.1.2 -y
# Install the package
pip install --no-build-isolation -e .
Note: The --no-build-isolation flag is required because the t2v_metrics dependency uses a legacy setup.py without explicit setuptools declarations.
If you want to use the LLM judge for generation evaluation, create a .env file in the project root:
OPENAI_API_KEY=your-openai-key
This is not required for the default evaluation pipeline, which uses BLEU-4 and ROUGE-L.
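As a rough illustration of the .env setup described above, the sketch below parses simple KEY=VALUE lines with the standard library and checks whether a key is available. The helper names load_env_file and llm_judge_available are hypothetical and are not part of this repository; the actual pipeline may read the file differently.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=VALUE lines; blank lines and '#' comments are skipped."""
    values: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def llm_judge_available(env: dict[str, str]) -> bool:
    """True when a non-empty OPENAI_API_KEY is set in the file or the process environment."""
    return bool(env.get("OPENAI_API_KEY") or os.environ.get("OPENAI_API_KEY"))
```

Calling load_env_file() from the project root returns an empty dict when no .env file exists, which matches the default BLEU-4 and ROUGE-L path that needs no key.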
The evaluation data and videos are hosted on HuggingFace. See Evaluation Data below for details on what each file contains.
# Download the full dataset (videos + evaluation JSONs)
hf download chancharikm/CHAI_testset --repo-type dataset --local-dir ./eval_data
This populates eval_data/ with the test split, task-specific evaluation files, and all corresponding videos.
All evaluation files live under eval_data/. The raw test split and three task-specific reformatted versions are provided. The corresponding videos and copies of all test split files are hosted on HuggingFace at chancharikm/CHAI_testset.
test_split.json
The raw evaluation data. Each entry contains a video path, the model-generated pre-caption, a human-written critique, the revised final caption (post-caption), a pre-caption score (1–5), the caption type (e.g., Subject, Scene, Motion, Spatial, Camera), and associated metadata. This file serves as the source from which all task-specific evaluation sets below are derived.
eval_caption_generation_test.json
Formatted for the caption generation task. Each sample pairs a video with a task instruction as the user turn and the final (post) caption as the target assistant response. Used to evaluate a model's ability to directly produce high-quality captions from video.
eval_critique_generation_test.json
Formatted for the critique generation task. Each sample provides a video, a task instruction, and a caption to critique as the user turn. The target assistant response is a critique. For pre-captions scoring below 5, two training pairs are generated: one pairing the pre-caption with its human critique, and one pairing the final caption with a "perfect caption" sentinel critique, teaching the model to both identify errors and recognize when a caption needs no revision.
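The pairing rule described above can be sketched as follows. This is a minimal illustration, not the repository's conversion code: the RawEntry field names, the PERFECT_CRITIQUE text, and the choice to skip entries with a score of 5 are all assumptions, since the text does not specify the JSON keys, the sentinel wording, or how score-5 entries are handled.

```python
from dataclasses import dataclass

# Placeholder sentinel text (assumption); the dataset's actual wording may differ.
PERFECT_CRITIQUE = "The caption is perfect."
MAX_SCORE = 5


@dataclass
class RawEntry:
    """Hypothetical view of one raw test-split entry."""
    video: str
    pre_caption: str
    critique: str
    final_caption: str
    score: int  # pre-caption score, 1 to 5


def critique_samples(entry: RawEntry) -> list[dict[str, str]]:
    """Build (caption to critique, target critique) samples for one entry."""
    if entry.score >= MAX_SCORE:
        # Handling of score-5 entries is not described in the text; this sketch skips them.
        return []
    return [
        # The flawed pre-caption is paired with the human critique.
        {"video": entry.video, "caption": entry.pre_caption, "target": entry.critique},
        # The revised caption is paired with the "needs no revision" sentinel.
        {"video": entry.video, "caption": entry.final_caption, "target": PERFECT_CRITIQUE},
    ]
```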
eval_caption_yes_or_no_test.json
Formatted for the reward modeling (binary alignment scoring) task. Given a video, a task instruction, and a candidate caption, the model must judge whether the caption aligns with the video by responding "Yes" or "No". For pre-captions scoring below 5, two samples are generated: the final caption as a positive example ("Yes") and the pre-caption as a negative example ("No"), providing balanced supervision for learning caption quality.
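To make the balanced "Yes"/"No" construction concrete, here is a small sketch under stated assumptions: the function names and dictionary keys are hypothetical, and entries with a score of 5 are skipped only because the text does not say how they are treated.

```python
from collections import Counter


def yes_no_samples(video: str, pre_caption: str, final_caption: str, score: int) -> list[dict[str, str]]:
    """Build one positive and one negative alignment sample for a revised caption."""
    if score >= 5:
        # Not specified in the text; skipped in this sketch.
        return []
    return [
        {"video": video, "caption": final_caption, "answer": "Yes"},
        {"video": video, "caption": pre_caption, "answer": "No"},
    ]


def label_counts(samples: list[dict[str, str]]) -> dict[str, int]:
    """Count samples per answer label to check class balance."""
    return dict(Counter(sample["answer"] for sample in samples))
```

Because every qualifying entry contributes exactly one sample per label, the label counts from this construction are equal by design.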
The evaluation pipeline supports three tasks: caption generation, critique generation, and reward modeling (caption yes/no scoring). A single bash script orchestrates generation and evaluation across model checkpoints. The script also enables parallel workers per GPU for faster inference.
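One plausible way to picture parallel workers per GPU is to enumerate (GPU, worker) slots and split the samples across them round-robin. The sketch below is an assumption about the general technique, not a description of how run_unified_evaluations.sh or the parallel modules actually distribute work; both function names are hypothetical.

```python
def build_worker_slots(gpus: str, workers_per_gpu: int) -> list[tuple[int, int]]:
    """Expand a comma-separated GPU list into (gpu_id, worker_index) slots."""
    gpu_ids = [int(g) for g in gpus.split(",") if g.strip()]
    return [(gpu, worker) for gpu in gpu_ids for worker in range(workers_per_gpu)]


def shard_round_robin(items: list, num_shards: int) -> list[list]:
    """Split items into num_shards lists, assigning item i to shard i % num_shards."""
    return [items[i::num_shards] for i in range(num_shards)]
```

With eight GPUs and two workers per GPU this yields 16 slots, so each worker would process roughly one sixteenth of the samples.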
# Run the full pipeline with default settings
bash run_unified_evaluations.sh
Edit the top of run_unified_evaluations.sh to configure your run:
# GPU setup
GPUS="0,1,2,3,4,5,6,7"
WORKERS_PER_GPU=2
# Data
DATA_FILE="eval_data/test_split.json"
VIDEO_DIR="eval_data/captioning_videos"
# Models to evaluate (base or base;checkpoint)
MODELS=(
"qwen3-vl-8b" # base model
"qwen3-vl-8b;chancharikm/CHAI_SFT_model_8b" # fine-tuned
)
# Scoring formats (sequential evaluation via unified_eval.py)
SCORING_FORMATS=(
"caption_yes_or_no"
)
# Generation formats (parallel evaluation)
GENERATION_FORMATS=(
"caption_generation"
"critique_generation"
)
For each model, the pipeline runs:
- Scoring generation: computes VQA scores (P(Yes) probability) for each caption using the caption_yes_or_no format
- Scoring evaluation: calculates pairwise accuracy from the generated scores
- Caption/critique generation: produces captions and critiques for test set videos
- Generation evaluation: evaluates outputs against ground truth using BLEU-4 and ROUGE-L by default, with an optional LLM judge as an alternative (requires OPENAI_API_KEY in .env and USE_LLM_JUDGE="true" in the script)
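The two default metric families named in the steps above can be sketched in a few lines. This is an illustration under assumptions, not the code in eval_code/: pairwise accuracy is taken here to mean the fraction of (positive, negative) caption pairs where the positive caption receives the higher P(Yes), and ROUGE-L is computed as an F1 score over whitespace tokens using the longest common subsequence. The repository's tokenization, tie handling, and F-measure weighting may differ.

```python
def pairwise_accuracy(pairs: list[tuple[float, float]]) -> float:
    """Fraction of (positive, negative) P(Yes) pairs where the positive scores strictly higher."""
    if not pairs:
        return 0.0
    correct = sum(1 for pos, neg in pairs if pos > neg)
    return correct / len(pairs)


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, using one rolling DP row."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 over whitespace tokens (assumed tokenization and weighting)."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Ties count as incorrect in this pairwise accuracy sketch, which is the stricter of the two common conventions.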
evaluation_outputs/
├── inference/
│ ├── scoring_<model>_<timestamp>.json
│ └── generation_<model>_<timestamp>.json
└── evaluation/
├── scoring_eval_<model>_<timestamp>.json
└── generation_eval_<model>_<timestamp>.json
CHAI/
├── assets/ # Banner, logo
├── eval_code/ # Evaluation modules
│ ├── __init__.py
│ ├── constants.py # Task constants and format definitions
│ ├── formats.py # Format conversion utilities
│ ├── parallel_unified_eval.py # Multi-worker evaluation
│ ├── parallel_unified_generation.py # Multi-GPU generation
│ ├── unified_eval.py # Single-process evaluation
│ ├── unified_generation.py # Single-process generation
│ └── video_caption_api.py # Video captioning API
├── eval_data/ # Evaluation data and videos
│ ├── test_split.json
│ ├── eval_caption_generation_test.json
│ ├── eval_critique_generation_test.json
│ ├── eval_caption_yes_or_no_test.json
│ └── captioning_videos/
├── pyproject.toml
├── run_unified_evaluations.sh # Main evaluation entry point
└── README.md
If you find this work useful, please cite:
@inproceedings{chai2026,
title = {Building a Precise Video Language with Human--AI Oversight},
author = {Zhiqiu Lin and Chancharik Mitra and Siyuan Cen and Isaac Li and Yuhan Huang and Yu Tong Tiffany Ling and Hewei Wang and Irene Pi and Shihang Zhu and Ryan Rao and George Liu and Jiaxi Li and Ruojin Li and Yili Han and Yilun Du and Deva Ramanan},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2026}
}
We are actively advancing CHAI with larger-scale datasets and stronger video understanding models. We welcome collaborations and funding opportunities with researchers and practitioners working on video understanding, captioning, and multimodal agents for professional-level video content.
If you're interested in accessing improved data or models, please reach out:
- Zhiqiu Lin — [email protected]
- Chancharik Mitra — [email protected]
This material is based upon work supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE2140739. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

