Skip to content

dengandong/SciVideoBench

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

32 Commits
 
 
 
 
 
 

Repository files navigation

SciVideoBench: Benchmarking Scientific Video Reasoning in Large Multimodal Models

SciVideoBench Dataset Examples

[🔬 Project Page] | [📖 arXiv Paper] | [📊 Dataset]


🔥 News

  • 2025.11.15 🔥 SciVideoBench is is now integrated into lmms-eval!
  • 2025.10.25 🚀 GPT-5 achieves 55.7% overall accuracy on SciVideoBench, setting a new record in Quantitative Reasoning (55.1%)!
  • 2025.10.20 🏆 Honored to share that SciVideoBench received the Best Paper Award (Benchmark Track) at the ICCV 2025 KnowledgeMR Workshop!
  • 2025.10.17 🔥 We evaluate Qwen3-VL on SciVideoBench. Check our leaderboard!
  • 2025.10.10 🎉 SciVideoBench introduced as the first benchmark for scientific video reasoning.

👀 Overview

Scientific experiments present unique challenges for video-language models (VLMs): precise perception of visual details, integration of multimodal signals (video, audio, transcripts), and complex reasoning across temporal scales. To address this gap, we introduce SciVideoBench, the first comprehensive benchmark dedicated to scientific video reasoning.

SciVideoBench evaluates models across Physics, Chemistry, Biology, and Medicine, covering both perceptual understanding and high-level reasoning tasks. It provides a rigorous benchmark for evaluating long-form video reasoning in domains where accuracy and explainability matter most.

SciVideoBench Overview

Figure 1: The overall design of SciVideoBench, showing multi-stage data construction, annotation protocol, and evaluation pipeline.

🎥 Dataset Examples

SciVideoBench Dataset Examples

Figure 2: Examples of SciVideoBench videos and their associated QA pairs across Physics, Chemistry, Biology, and Medicine.

📌 Key Features

  • Domain Coverage: 4 scientific disciplines (Physics, Chemistry, Biology, Medicine) with diverse experimental settings.
  • Scale: 1,000 high-quality, human-verified multiple-choice questions.
  • Reasoning Dimensions:
    • Conceptual Reasoning – understanding principles and experimental setups.
    • Quantitative Reasoning – extracting and reasoning with measurements, numbers, and calculations.
    • Hypothetical Reasoning – counterfactual and “what-if” scientific scenarios.
  • Rich Metadata: Each QA pair is annotated with discipline, subject, timestamp breakdowns, and rationale.
  • Evaluation Protocols: Compatible with lmms-eval for standardized model comparison.

📂 Dataset

  • Total Videos: 240+ scientific experiments
  • Disciplines: Physics, Chemistry, Biology, Medicine
  • QA Pairs: 1,000 carefully curated multiple-choice questions
  • Formats: JSON/JSONL with video-level metadata and QA annotations
  • Annotations: Timestamp breakdowns, rationales, difficulty levels

Example QA format:

{
  "video_id": "58827",
  "question_id": "1",
  "question": "What is the purpose of transferring the sample between chambers as shown between 02:22 and 02:33?",
  "options": {
    "A": "Align the sample with the deposition target",
    "B": "Cool the sample before deposition",
    "C": "Measure sample thickness prior to coating",
    "D": "Preserve main chamber vacuum integrity"
  },
  "answer": "D",
  "question_type": "Conceptual Reasoning",
  "discipline": "Chemistry",
  "subject": "Nanomaterials",
  "timestamp_breakdown": ["..."]
}

🏆 Leaderboard

Part of the evaluation results of proprietary and open-source models on SciVideoBench (%, higher is better).

Models Overall Conceptual Hypothetical Quantitative Biology Chemistry Medicine Physics
Random Guess 10.00 10.00 10.00 10.00 10.00 10.00 10.00 10.00
Human Evaluation 17.40 18.11 18.70 14.29 15.88 16.06 21.19 18.88
Gemini-2.5-Pro 64.30 69.73 67.79 50.61 64.79 61.82 74.77 61.44
GPT-5 55.70 61.35 50.65 55.10 57.12 52.73 58.88 54.23
Gemini-2.5-Flash 46.40 50.81 44.16 43.27 44.01 49.70 55.14 44.83
InternVL-3-78B 38.50 56.76 39.22 9.80 37.65 37.58 46.73 37.30
InternVL-3-14B 35.70 53.51 35.32 9.39 35.94 33.94 38.32 35.42
Gemini-1.5-Pro 27.50 27.84 28.31 25.71 27.38 26.06 27.10 28.53
InternVL2-Llama3-76B 26.30 29.70 29.91 25.08 24.94 27.01 29.91 28.38
Gemini-2.0-Flash 25.70 28.38 24.94 22.86 24.69 26.06 22.43 27.90
GPT-4o 24.90 30.27 28.05 11.84 21.52 29.70 31.78 24.45
LLaVA-NeXT-Video-32B 21.10 22.42 23.36 21.32 19.80 22.86 23.36 26.22
Qwen2.5-VL-72B 20.30 16.97 24.30 19.44 21.27 20.78 24.30 21.62

🧪 Evaluation (via lmms-eval)

SciVideoBench has now been integrated into lmms-eval!

1) Install (Please refer to lmms-eval)

git clone https://github.com/EvolvingLMMs-Lab/lmms-eval
cd lmms-eval
pip install -e .

2) Repo Layout

After cloning lmms-eval, place the scivideobench/ folder under tasks/:

lmms-eval/
  tasks/
    ├── activitynetqa/
    ├── ai2d/
    ├── aime/
    ├── air_bench/
    ├── ...
    ├── scivideobench/              # ✅ our benchmark lives here
    │   ├── scivideobench.yaml      # task definition(s) for evaluation
    │   ├── utils.py                # dataset loader, metrics, post-processing
    │   └── (optional) extra yaml   # if you split configs (chat, cot, etc.)
  ...
  • scivideobench.yaml → Defines how lmms-eval loads SciVideoBench (dataset path, media fields, eval settings).
  • utils.py → Custom dataloader + evaluation metrics (accuracy, discipline/reasoning type breakdown).
  • You can create multiple YAMLs (e.g., scivideobench_chat.yaml, scivideobench_cot.yaml) if you want variants, similar to how air_bench has multiple YAMLs.

3) Quick Start

Local Hugging Face models (Qwen2.5-VL, InternVL-3, etc.)

accelerate launch --num_processes 8 --main_process_port 12380 -m lmms_eval \
    --model internvl3 \
    --config lmms-eval/lmms_eval/tasks/scivideobench/scivideobench.yaml \
    --model_args pretrained=OpenGVLab/InternVL3-2B,modality=video,num_frame=32 \
    --gen_kwargs=max_new_tokens=1024 \
    --tasks scivideobench \
    --batch_size 1 \
    --log_samples \

📂 License

License & Access:

SciVideoBench is only used for academic research. Commercial use in any form is strictly prohibited.
The copyright of all videos belongs to the original video owners and JoVE.
If there is any infringement in SciVideoBench, please email us and we will promptly remove the content.
Without prior approval, you cannot distribute, publish, copy, disseminate, or modify SciVideoBench.
You must strictly comply with the above restrictions.

➡️ Access requirement: Please complete and sign our Dataset Access Agreement before using SciVideoBench:
🔏 Google Form — SciVideoBench Dataset Access Agreement

The SciVideoBench is available in HuggingFace.

For any questions, contact [email protected].

✨ Citation

If you use SciVideoBench, please cite our paper:

    @article{deng2025scivideobench,
        title={SciVideoBench: Benchmarking Scientific Video Reasoning in Large Multimodal Models},
        author={Andong Deng and Taojiannan Yang and Shoubin Yu and Lincoln Spencer and Mohit Bansal and Chen Chen and Serena Yeung-Levy and Xiaohan Wang},
        journal={arXiv preprint arXiv:2510.08559},
        year={2025}
    }

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages