Video SimpleQA is the first comprehensive benchmark specifically designed for evaluating factual grounding capabilities in Large Video Language Models (LVLMs). Unlike existing video benchmarks that often involve subjective speculation or conflate factual grounding with reasoning skills, Video SimpleQA focuses exclusively on objective factuality evaluation through multi-hop fact-seeking questions.
- 🎯 Knowledge Required: Questions demand integration of external knowledge beyond video content
- 🔗 Multi-hop Fact-seeking: Each question involves multiple explicit facts with step-by-step sub-QAs
- ✅ Short-form Definitive Answers: Unambiguous, universally agreed-upon answers
- ⏰ Temporal Grounding: Answers rely on temporal segments rather than single frames
- 🌍 Open Domain: Covers diverse video types across 4 primary and 84 tertiary categories
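To make these properties concrete, here is a hypothetical 2-hop item in the spirit of the benchmark (an illustration only, not an actual dataset entry; the field names are assumptions):

```python
# Hypothetical 2-hop example (NOT from the dataset): answering requires first
# grounding an entity in the video, then recalling an external fact about it.
example = {
    "question": "In which year was the landmark shown at the start of the video completed?",
    "sub_qa_pairs": [
        {"question": "Which landmark appears at the start of the video?",
         "answer": "The Eiffel Tower"},
        {"question": "In which year was the Eiffel Tower completed?",
         "answer": "1889"},
    ],
    "answer": "1889",
}
```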
- Total QA Pairs: 1,504
- Unique Videos: 1,079
- Multi-hop Questions: 2/3/4-hop (928/469/107 pairs)
- Categories: 4 primary, 15 secondary, 84 tertiary
- Average Question Length: 15.64 words
- Average Answer Length: 1.28 words
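A minimal sketch of loading the dataset file and tallying the hop distribution, assuming the file holds a list of QA records; the `sub_qa_pairs` field name matches the hypothetical example above and is an assumption about the actual schema:

```python
import json
from collections import Counter

# Load the main dataset file (see the repository layout below).
with open("data/VideoSimpleQA.json", "r", encoding="utf-8") as f:
    records = json.load(f)

print(f"Total QA pairs: {len(records)}")  # expected: 1504

# Count questions by number of hops (i.e., number of sub-QA pairs).
hops = Counter(len(r.get("sub_qa_pairs", [])) for r in records)
for n_hops, count in sorted(hops.items()):
    print(f"{n_hops}-hop: {count}")
```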
```
code/
├── bestofn_evaluation.py               # Best-of-N evaluation strategy
├── metrics_analyzer.py                 # Performance metrics analysis
├── multi_round_qa_fscore.py            # Multi-round QA F-score calculation
├── self_refine_evaluation.py           # Self-refinement evaluation
├── video_qa_evaluation_multi_round.py  # Multi-round video QA evaluation
├── video_qa_evaluation.py              # Main video QA evaluation script
├── video_qa_generator.py               # Video QA generation utilities
└── download.py                         # Video downloader from Wikimedia Commons
data/
└── VideoSimpleQA.json                  # Main dataset file
README.md
```
Install the required Python packages:

```bash
pip install openai tqdm requests beautifulsoup4
```

(`asyncio` and `multiprocessing` ship with the Python standard library and do not need to be installed via pip.)

For video downloading, also install FFmpeg:
- Windows: download from [ffmpeg.org](https://ffmpeg.org)
- macOS: `brew install ffmpeg`
- Linux: `sudo apt-get install ffmpeg`
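A quick sanity check (not part of the repository's scripts) that FFmpeg is actually on your PATH before downloading:

```python
import shutil

# FFmpeg must be installed system-wide; pip cannot provide it.
if shutil.which("ffmpeg") is None:
    raise SystemExit("FFmpeg not found on PATH; install it first.")
print("FFmpeg is available.")
```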
The videos for VideoSimpleQA can be downloaded using the provided download script:
```bash
python download.py
```

Output:
- Downloaded MP4 videos are saved in the `videos/` directory
- Videos are named using the date format `YYYY-MM-DD.mp4`
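The evaluation scripts below expect pre-extracted frames under `frames_{frame_num}/` directories, but the extraction step itself is not shown in this listing. A minimal sketch that uniformly samples frames with FFmpeg, assuming an output layout of `frames_32/{video_id}/`:

```python
import subprocess
from pathlib import Path

def extract_frames(video: Path, out_dir: Path, num_frames: int = 32) -> None:
    """Uniformly sample num_frames frames from a video with ffprobe/ffmpeg."""
    # Probe the video duration in seconds.
    duration = float(subprocess.check_output(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", str(video)],
        text=True,
    ).strip())
    out_dir.mkdir(parents=True, exist_ok=True)
    for i in range(num_frames):
        # Seek to evenly spaced timestamps and grab one frame at each.
        t = duration * (i + 0.5) / num_frames
        subprocess.run(
            ["ffmpeg", "-y", "-ss", f"{t:.3f}", "-i", str(video),
             "-frames:v", "1", str(out_dir / f"{i:04d}.jpg")],
            check=True, capture_output=True,
        )

for video in Path("videos").glob("*.mp4"):
    extract_frames(video, Path("frames_32") / video.stem)
```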
Generate question-answer pairs from video frames and descriptions:
```bash
python video_qa_generator.py
```

Configuration required:
- Update `API_KEY` and `BASE_URL` in the script
- Prepare video frames in `../../data/wiki_videos/frames_15/{video_id}/`
- Create `descriptions.json` with video descriptions
- Create `videos_name.txt` with a list of video files
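The exact formats of `descriptions.json` and `videos_name.txt` are not documented here; a plausible sketch of preparing them, with the structure (video ID to description map, one filename per line) assumed for illustration:

```python
import json

# Assumed structure: a map from video ID (the dated filename stem) to its
# textual description. The real format may differ.
descriptions = {"2023-05-01": "Aerial footage of a coastal lighthouse."}
with open("descriptions.json", "w", encoding="utf-8") as f:
    json.dump(descriptions, f, ensure_ascii=False, indent=2)

# Assumed structure: one video filename per line.
with open("videos_name.txt", "w", encoding="utf-8") as f:
    f.write("2023-05-01.mp4\n")
```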
Expected Output:
- `output.json`: Generated QA pairs with verification status
- `error_output.txt`: Error log for failed processing
Evaluate a model's performance on video QA tasks:
```bash
python video_qa_evaluation.py \
    --target-model "gpt-4-vision-preview" \
    --grader-model "gpt-4" \
    --frame-num 32 \
    --data-file "VideoSimpleQA.json" \
    --pool-processes 20
```

Parameters:
- `--target-model`: Model to be evaluated
- `--grader-model`: Model used for grading responses
- `--frame-num`: Number of frames to use (default: 32)
- `--frames-path`: Path to the frames directory (default: `./frames_{frame_num}/`)
- `--data-file`: Path to the evaluation dataset
- `--pool-processes`: Number of parallel processes
Expected Output:
- `evaluation_results_{model}_{frames}frames.json`: Detailed evaluation results
- `error_log_{model}.txt`: Error log
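For intuition, a sketch of what a grading call might look like, assuming the grader follows the SimpleQA convention of labeling each response CORRECT, INCORRECT, or NOT_ATTEMPTED; the prompt wording and client setup are illustrative, not the script's actual implementation:

```python
from openai import OpenAI

# Hypothetical client setup; substitute your own credentials.
client = OpenAI(api_key="YOUR_API_KEY", base_url="YOUR_BASE_URL")

GRADER_PROMPT = (
    "Question: {question}\n"
    "Gold answer: {gold}\n"
    "Model answer: {pred}\n"
    "Grade the model answer as CORRECT, INCORRECT, or NOT_ATTEMPTED. "
    "Reply with the single grade only."
)

def grade(question: str, gold: str, pred: str, grader_model: str = "gpt-4") -> str:
    """Ask the grader model to label one response (SimpleQA-style grades)."""
    resp = client.chat.completions.create(
        model=grader_model,
        temperature=0,
        messages=[{"role": "user", "content": GRADER_PROMPT.format(
            question=question, gold=gold, pred=pred)}],
    )
    return resp.choices[0].message.content.strip()
```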
Evaluate multi-round question answering:
```bash
python video_qa_evaluation_multi_round.py \
    --target-model "claude-sonnet4" \
    --grader-model "gpt-4" \
    --frame-num 30 \
    --data-file "./data/test_data.json" \
    --output-file "./results/multi_round_results.json" \
    --processes 20
```

Parameters:
- `--frame-num`: Number of frames to use
- `--frames-path`: Path to the frames directory
- `--target-model`: Model to evaluate (required)
- `--grader-model`: Grading model (required)
- `--output-file`: Output file path
- `--processes`: Number of parallel processes
Expected Output:
- Evaluation results with QA pair grades and Multi-hop evaluation
- Progress tracking and error logging
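A condensed, text-only sketch of the multi-round protocol this implies: sub-questions are posed turn by turn within one conversation, and each answer is graded afterward. It reuses the hypothetical `client` from the grading sketch above; the real script also supplies video frames:

```python
def multi_round_qa(sub_questions: list[str], model: str) -> list[str]:
    """Ask each sub-question in turn, keeping the dialogue history."""
    messages, answers = [], []
    for q in sub_questions:
        messages.append({"role": "user", "content": q + " Answer concisely."})
        resp = client.chat.completions.create(model=model, messages=messages)
        answer = resp.choices[0].message.content.strip()
        messages.append({"role": "assistant", "content": answer})
        answers.append(answer)
    return answers
```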
Calculate F-scores for multi-round QA results:
```bash
python multi_round_qa_fscore.py
```

Input Requirements:
- Place evaluation result JSON files in the `evaluation_results/` directory
- Files should follow the naming pattern `evaluation_results_*.json`
Expected Output:
- `all_models_fscore.json`: F-scores for all models
- Console summary table showing F-scores for each QA round and Multi-hop evaluation
Example Output:
```
MODEL F-SCORE SUMMARY (F1 Scores in %)
========================================
Model            Samples   QA_Pair_1   QA_Pair_2   Multi_hop
claude-sonnet4   1504      65.2        58.7        42.3
gpt-4-vision     1504      72.1        69.4        55.8
```
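The F-score here is presumably the SimpleQA-style harmonic mean of overall accuracy (correct over all questions) and accuracy over attempted questions only; a small sketch under that assumption:

```python
def simpleqa_fscore(correct: int, incorrect: int, not_attempted: int) -> float:
    """Harmonic mean of overall accuracy and correct-given-attempted."""
    total = correct + incorrect + not_attempted
    attempted = correct + incorrect
    acc_overall = correct / total if total else 0.0
    acc_attempted = correct / attempted if attempted else 0.0
    if acc_overall + acc_attempted == 0:
        return 0.0
    return 2 * acc_overall * acc_attempted / (acc_overall + acc_attempted)

# Illustrative numbers only: 900 correct, 400 incorrect, 204 not attempted.
print(f"{100 * simpleqa_fscore(900, 400, 204):.1f}")  # 64.2
```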
Analyze evaluation metrics with category and round breakdowns:
```bash
python metrics_analyzer.py
```

Input Requirements:
- Evaluation result files in the `evaluation_results/` directory
- Optional: `category_mapping.json` for category analysis
- Optional: `round_mapping.json` for round analysis
Expected Output:
- `all_model_metrics.json`: Comprehensive metrics
- Console tables showing overall, category-wise, and round-wise F-scores
Features:
- Overall accuracy, attempt rates, F1 scores
- Category-wise performance breakdown
- Round-wise performance analysis
Evaluate models using iterative self-refinement:
```bash
python self_refine_evaluation.py
```

Configuration:
- Update `API_KEY` and `BASE_URL`
- Configure `MODEL_FRAMES_CONFIG` with model specifications
- Set the `data_file` path to your dataset
Process:
- Generate initial answer
- Provide feedback on the answer
- Refine answer based on feedback
- Repeat for up to 3 iterations
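A condensed, text-only sketch of this loop; the prompts and the `ask_model` helper are illustrative stand-ins (the actual script also passes video frames):

```python
from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY", base_url="YOUR_BASE_URL")

def ask_model(prompt: str, model: str = "gpt-4") -> str:
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content.strip()

def self_refine(question: str, max_iters: int = 3) -> str:
    """Generate an answer, critique it, refine it; repeat up to max_iters."""
    answer = ask_model(f"Question: {question}\nAnswer concisely.")
    for _ in range(max_iters):
        feedback = ask_model(
            f"Question: {question}\nAnswer: {answer}\n"
            "Point out any factual problems with this answer."
        )
        answer = ask_model(
            f"Question: {question}\nPrevious answer: {answer}\n"
            f"Feedback: {feedback}\nGive a refined, concise answer."
        )
    return answer
```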
Expected Output:
- `self_refine_results.json`: Final results with accuracies per iteration
- `self_refine_intermediate_results.json`: Intermediate results for recovery
- `error_log_self_refine.txt`: Error log
Evaluate using Best-of-N sampling strategy:
```bash
python bestofn_evaluation.py
```

Configuration:
- Update `API_KEY` and `BASE_URL`
- Configure `MODEL_FRAMES_MAP` with the model-to-frames mapping
- Set the dataset file path
Process:
- Generate N candidate answers per question
- Select best answer using selector model
- Evaluate accuracy for different N values (1, 2, 4, 8, 16)
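A sketch of the selection step, reusing the hypothetical `ask_model` helper from the self-refinement sketch; the selector prompt is an assumption:

```python
def best_of_n(question: str, n: int = 8) -> str:
    """Sample n candidate answers, then let a selector model pick one."""
    candidates = [
        ask_model(f"Question: {question}\nAnswer concisely.") for _ in range(n)
    ]
    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(candidates))
    choice = ask_model(
        f"Question: {question}\nCandidate answers:\n{numbered}\n"
        "Reply with only the number of the most likely correct answer."
    )
    # Fall back to the first candidate if the selector's reply isn't a number.
    try:
        return candidates[int(choice.strip().rstrip(".")) - 1]
    except (ValueError, IndexError):
        return candidates[0]
```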
Expected Output:
- `best_of_n_results.json`: Accuracy results for different N values
- `intermediate_results.json`: Intermediate candidate answers
- `error_log_BoN.txt`: Error log
The project is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license. The dataset and code are available strictly for non-commercial research. Further dataset guidelines can be found here.
We uphold the rights of individuals and copyright holders. If you are featured in any of our video annotations, or hold the copyright to a video and wish to have its annotation removed from our dataset, please reach out to us: send an email to [email protected] with a subject line beginning with VideoSimpleQA-optout, or open an issue with the same title format. We will review your request promptly and take appropriate action.