Awesome-Video-LMM-Post-Training

Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models

Yunlong Tang1, Jing Bi1, Pinxin Liu1, Zhenyu Pan2, Zhangyun Tan1, Qianxiang Shen1, Jiani Liu1, Hang Hua1, Junjia Guo1, Yunzhong Xiao3, Chao Huang1, Zhiyuan Wang4, Susan Liang1, Xinyi Liu1, Yizhi Song5, Junhua Huang6, Jia-Xing Zhong7, Bozheng Li8, Daiqing Qi9, Ziyun Zeng1, Ali Vosoughi1, Luchuan Song1, Zeliang Zhang1, Daiki Shimada10, Han Liu2, Jiebo Luo1, Chenliang Xu1

1University of Rochester, 2Northwestern University, 3CMU, 4UCSB, 5Purdue University, 6UCLA, 7University of Oxford, 8Brown University, 9University of Virginia, 10Sony Group Corporation

Hugging Face Paper | arXiv


News

  • [2025/10/06] πŸŽ‰ Our survey paper on Video-LMM Post-Training for Video Reasoning is now available on arXiv and Hugging Face Papers!
  • [2025/06/18] πŸš€ Initial release of the Awesome-Video-LMM-Post-Training repository! We welcome contributions via Pull Requests.
  • [2025/05/04] πŸ“’ Our survey paper on Video Understanding with Large Language Models has been accepted by IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)! πŸ‘‰ IEEE Xplore | GitHub

Overview

This Awesome list systematically curates and tracks the latest research on the post-training of Video-LMMs (video large multimodal models), with a special emphasis on works that enhance their reasoning capabilities. Following the taxonomy of our survey, we organize the literature around three key post-training paradigms, along with the benchmarks used to evaluate them:

  • 🧠 Reinforced Video-LMMs: Exploring how reinforcement learning (RL) techniques are used to align Video-LMMs with human preferences or task-specific metrics. This includes methods such as RLHF, DPO, and GRPO, as well as the design of effective reward models that enhance the logical consistency and factuality of model outputs (a minimal GRPO-style sketch follows this list).

  • βš™οΈ SFT for Reasoning: Collecting studies that leverage SFT on meticulously curated, reasoning-centric datasets. These works often incorporate CoT or other structured formats to directly teach models how to perform complex, multi-step reasoning.

  • πŸš€ Test-Time Scaling in Video Reasoning: Focusing on strategies that enhance reasoning at inference time without further model training, including agentic frameworks, tool use, retrieval-augmented generation (RAG), long CoT, and other methods that scale reasoning through additional computation (a simple self-consistency sketch follows this list).

  • πŸ“Š Benchmarks for Video Reasoning: Including the latest and most challenging benchmarks designed specifically to evaluate the complex reasoning abilities of Video-LMMs.
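
To make the reinforced post-training paradigm above concrete, here is a minimal, framework-agnostic Python sketch of the group-relative advantage computation used by GRPO-style methods, paired with a simple rule-based (verifiable) reward that checks output format and answer correctness. This is only an illustrative example, not code from any paper listed here; the <think>/<answer> tag convention and the 0.1/0.9 reward weights are assumptions.

import re
from statistics import mean, pstdev

def rule_based_reward(completion: str, gold_answer: str) -> float:
    """Toy verifiable reward: small format bonus plus exact-match accuracy (weights are illustrative)."""
    fmt_ok = bool(re.search(r"<think>.*</think>\s*<answer>.*</answer>", completion, re.DOTALL))
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    pred = match.group(1).strip().lower() if match else ""
    acc = 1.0 if pred == gold_answer.strip().lower() else 0.0
    return 0.1 * fmt_ok + 0.9 * acc

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: normalize each completion's reward by the group mean/std (no value network needed)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-6) for r in rewards]

if __name__ == "__main__":
    # One prompt, a group of completions sampled from the current policy.
    gold = "B"
    group = [
        "<think>The ball leaves the frame right after the cut.</think> <answer>B</answer>",
        "<think>...</think> <answer>C</answer>",
        "B",  # correct content, but unparseable under the tag format, so it earns no reward
    ]
    rewards = [rule_based_reward(c, gold) for c in group]
    print(rewards, grpo_advantages(rewards))

In a full pipeline these advantages would weight the policy-gradient update on each completion's tokens, typically with a KL penalty toward a reference model.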
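
For the SFT-for-reasoning paradigm, here is a hedged sketch of how a reasoning-centric video SFT example is often serialized for chat-style fine-tuning: sampled frames plus a question form the user turn, and a chain of thought followed by the final answer forms the assistant turn. The field names (frames, question, cot, answer) and the <think>/<answer> tags are illustrative assumptions rather than the schema of any particular dataset.

import json

def build_sft_example(frames: list[str], question: str, cot: str, answer: str) -> dict:
    """Serialize one reasoning-centric video SFT example into chat messages.

    The assistant turn interleaves an explicit chain of thought with the final
    answer, so the model is directly supervised to reason before answering.
    """
    user_content = (
        [{"type": "image", "path": p} for p in frames]  # sampled video frames
        + [{"type": "text", "text": question}]
    )
    assistant_text = f"<think>{cot}</think>\n<answer>{answer}</answer>"
    return {
        "messages": [
            {"role": "user", "content": user_content},
            # During SFT, the loss is typically computed only on this assistant turn.
            {"role": "assistant", "content": assistant_text},
        ]
    }

if __name__ == "__main__":
    example = build_sft_example(
        frames=["clip_000/frame_001.jpg", "clip_000/frame_032.jpg"],
        question="Why does the person return to the kitchen?",
        cot="In the earlier frame the stove is still on; returning suggests they forgot to turn it off.",
        answer="To turn off the stove they left on.",
    )
    print(json.dumps(example, indent=2))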
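
Finally, for test-time scaling, this is a small sketch of one of the simplest training-free strategies, self-consistency: sample several chain-of-thought completions for the same video question and majority-vote the extracted answers. The generate callable is a placeholder standing in for any Video-LMM inference API, not a specific library function.

from collections import Counter
from typing import Callable
import re

def self_consistency_answer(
    generate: Callable[[str, float], str],  # placeholder for a Video-LMM inference call
    prompt: str,
    num_samples: int = 8,
    temperature: float = 0.7,
) -> str:
    """Sample several CoT completions and majority-vote the extracted answers."""
    votes = []
    for _ in range(num_samples):
        completion = generate(prompt, temperature)
        match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
        if match:
            votes.append(match.group(1).strip())
    if not votes:
        return ""  # no parseable answer in any sample
    return Counter(votes).most_common(1)[0][0]

if __name__ == "__main__":
    import random

    # Stub generator for demonstration; in practice this wraps a Video-LMM.
    def fake_generate(prompt: str, temperature: float) -> str:
        return f"<think>...</think><answer>{random.choice(['A', 'A', 'A', 'B'])}</answer>"

    print(self_consistency_answer(fake_generate, "Which event happens first?"))

More elaborate test-time strategies in the tables below (agentic frame search, tool use, RAG) follow the same pattern of spending extra inference compute to improve the final answer.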

We hope this repository serves as a comprehensive and up-to-date resource hub for researchers and developers in this cutting-edge field. Contributions from the community are highly welcome via Pull Requests!


πŸ“ Citation

If you find our survey useful for your research, please cite the following paper:

@misc{tang2025videollmposttraining,
  title={Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models},
  author={Yunlong Tang and Jing Bi and Pinxin Liu and Zhenyu Pan and Zhangyun Tan and Qianxiang Shen and Jiani Liu and Hang Hua and Junjia Guo and Yunzhong Xiao and Chao Huang and Zhiyuan Wang and Susan Liang and Xinyi Liu and Yizhi Song and Junhua Huang and Jia-Xing Zhong and Bozheng Li and Daiqing Qi and Ziyun Zeng and Ali Vosoughi and Luchuan Song and Zeliang Zhang and Daiki Shimada and Han Liu and Jiebo Luo and Chenliang Xu},
  journal={arXiv preprint arXiv:2510.05034},
  year={2025}
}

Latest Research in Video-LMM Post-Training

Reinforced Video-LMMs

Title Paper Code Dataset Venue
Self-alignment of Large Video Language Models with Refined Regularized Preference Optimization Paper GitHub Dataset NeurIPS 2025
VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception Paper GitHub NeurIPS 2025
MOSS-ChatV: Reinforcement Learning with Process Reasoning Reward for Video Temporal Reasoning Paper
TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs Paper NeurIPS 2025
ChronoForge-RL: Chronological Forging through Reinforcement Learning for Enhanced Video Understanding Paper
AdsQA: Towards Advertisement Video Understanding Paper GitHub ICCV 2025
Kwai Keye-VL 1.5 Technical Report Paper GitHub
Video-MTR: Reinforced Multi-Turn Reasoning for Long Video Understanding Paper
Ovis2.5 Technical Report Paper GitHub
ReasoningTrack: Chain-of-Thought Reasoning for Long-term Vision-Language Tracking Paper GitHub
TAR-TVG: Enhancing VLMs with Timestamp Anchor-Constrained Reasoning for Temporal Video Grounding Paper
VQAThinker: Exploring Generalizable and Explainable Video Quality Assessment via Reinforcement Learning Paper GitHub
Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning Paper GitHub Dataset
AVATAR: Reinforcement Learning to See, Hear, and Reason Over Video Paper GitHub
ReasonAct: Progressive Training for Fine-Grained Video Reasoning in Small Models Paper
ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts Paper GitHub
METER: Multi-modal Evidence-based Thinking and Explainable Reasoning -- Algorithm and Benchmark Paper
EmbRACE-3K: Embodied Reasoning and Action in Complex Environments Paper
Scaling RL to Long Videos Paper GitHub Dataset NeurIPS 2025
Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning Paper GitHub EMNLP 2025
Tempo-R0: A Video-MLLM for Temporal Video Grounding through Efficient Temporal Sensing Reinforcement Learning Paper
VRAgent-R1: Boosting Video Recommendation with MLLM-based Agents via Reinforcement Learning Paper
Kwai Keye-VL Technical Report Paper GitHub
VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning Paper
Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning Paper GitHub Dataset
VersaVid-R1: A Versatile Video Understanding and Reasoning Model from Question Answering to Captioning Tasks Paper GitHub Dataset
VidBridge-R1: Bridging QA and Captioning for RL-based Video Understanding Models with Intermediate Proxy Tasks Paper GitHub
DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware Regressive GRPO Paper GitHub
AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs Paper GitHub
MiMo-VL Technical Report Paper GitHub
EgoVLM: Policy Optimization for Egocentric Video Understanding Paper GitHub Dataset
Reinforcement Learning Tuning for VideoLLMs: Reward Design and Data Efficiency Paper GitHub
VideoCap-R1: Enhancing MLLMs for Video Captioning via Structured Thinking Paper
ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding Paper GitHub NeurIPS 2025
ReFoCUS: Reinforcement-guided Frame Optimization for Contextual Understanding Paper
Reinforcing Video Reasoning with Focused Thinking Paper GitHub
VAU-R1: Advancing Video Anomaly Understanding via Reinforcement Fine-Tuning Paper GitHub Dataset
A2Seek: Towards Reasoning-Centric Benchmark for Aerial Anomaly Understanding Paper
MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding Paper GitHub
Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration Paper GitHub
Vad-R1: Towards Video Anomaly Reasoning via Perception-to-Cognition Chain-of-Thought Paper
VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Guided Iterative Policy Optimization Paper GitHub
Fact-R1: Towards Explainable Video Misinformation Detection with Deep Reasoning Paper
From Evaluation to Defense: Advancing Safety in Video Large Language Models Paper
Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning Paper
ViaRL: Adaptive Temporal Grounding via Visual Iterated Amplification Reinforcement Learning Paper GitHub
UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning Paper GitHub Dataset
BusterX: MLLM-Powered AI-Generated Video Forgery Detection and Explanation Paper
VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning Paper GitHub Dataset NeurIPS 2025
Seed1.5-VL Technical Report Paper
Compile Scene Graphs with Reinforcement Learning Paper
Mavors: Multi-granularity Video Representation for Multimodal Large Language Model Paper
TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning Paper GitHub
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning Paper GitHub
Spatial-R1: Enhancing MLLMs in Video Spatial Reasoning Paper GitHub Dataset
Improved Visual-Spatial Reasoning via R1-Zero-Like Training Paper
Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 Paper GitHub Dataset
Video-R1: Reinforcing Video Reasoning in MLLMs Paper GitHub Dataset
Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation Paper GitHub Dataset
TimeZero: Temporal Video Grounding with Reasoning-Guided LVLM Paper GitHub Dataset
ST-Think: How Multimodal Large Language Models Reason About 4D Worlds from Ego-Centric Videos Paper GitHub
Memory-enhanced Retrieval Augmentation for Long Video Understanding Paper
video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model Paper GitHub
Unhackable Temporal Rewarding for Scalable Video MLLMs Paper
Temporal Preference Optimization for Long-Form Video Understanding Paper
InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model Paper GitHub ACL 2025 Findings
VidChain: Chain-of-Tasks with Metric-based Direct Preference Optimization for Dense Video Captioning Paper
VideoSAVi: Self-Aligned Video Language Models without Human Supervision Paper
Veason-R1: Reinforcing Video Reasoning Segmentation to Think Before It Segments Paper
SAIL-VL2 Technical Report Paper

Video-LMM SFT for Reasoning

Title Paper Code Dataset Venue
Kwai Keye-VL 1.5 Technical Report Paper
Strefer: Empowering Video LLMs with Space-Time Referring and Reasoning via Synthetic Instruction Data Paper
Video-MTR: Reinforced Multi-Turn Reasoning for Long Video Understanding Paper
Ovis2.5 Technical Report Paper
ReasoningTrack: Chain-of-Thought Reasoning for Long-term Vision-Language Tracking Paper
TAR-TVG: Enhancing VLMs with Timestamp Anchor-Constrained Reasoning for Temporal Video Grounding Paper
Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning Paper
ReasonAct: Progressive Training for Fine-Grained Video Reasoning in Small Models Paper
ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts Paper
METER: Multi-modal Evidence-based Thinking and Explainable Reasoning -- Algorithm and Benchmark Paper
CoTasks: Chain-of-Thought based Video Instruction Tuning Tasks Paper
EmbRACE-3K: Embodied Reasoning and Action in Complex Environments Paper
Scaling RL to Long Videos Paper GitHub Dataset NeurIPS 2025
Video Event Reasoning and Prediction by Fusing World Knowledge from LLMs with Vision Foundation Models Paper
Kwai Keye-VL Technical Report Paper GitHub
VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning Paper
Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning Paper GitHub Dataset
DAVID-XR1: Detecting AI-Generated Videos with Explainable Reasoning Paper
VersaVid-R1: A Versatile Video Understanding and Reasoning Model from Question Answering to Captioning Tasks Paper GitHub Dataset
VidBridge-R1: Bridging QA and Captioning for RL-based Video Understanding Models with Intermediate Proxy Tasks Paper
AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs Paper GitHub
Video-Skill-CoT: Skill-based Chain-of-Thoughts for Domain-Adaptive Video Reasoning Paper GitHub EMNLP 2025 Findings
MiMo-VL Technical Report Paper
ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding Paper GitHub NeurIPS 2025
Chain-of-Frames: Advancing Video Understanding in Multimodal LLMs via Frame-Aware Reasoning Paper GitHub
Universal Visuo-Tactile Video Understanding for Embodied Interaction Paper
Fostering Video Reasoning via Next-Event Prediction Paper GitHub Dataset
A2Seek: Towards Reasoning-Centric Benchmark for Aerial Anomaly Understanding Paper
Vad-R1: Towards Video Anomaly Reasoning via Perception-to-Cognition Chain-of-Thought Paper
Fact-R1: Towards Explainable Video Misinformation Detection with Deep Reasoning Paper
From Evaluation to Defense: Advancing Safety in Video Large Language Models Paper
Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning Paper
UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning Paper GitHub Dataset
VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning Paper GitHub Dataset NeurIPS 2025
Seed1.5-VL Technical Report Paper
TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action Paper
VEU-Bench: Towards Comprehensive Understanding of Video Editing Paper
Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models Paper
Compile Scene Graphs with Reinforcement Learning Paper
Mavors: Multi-granularity Video Representation for Multimodal Large Language Model Paper
LVC: A Lightweight Compression Framework for Enhancing VLMs in Long Video Understanding Paper
From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models Paper
Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 Paper GitHub Dataset
Video-R1: Reinforcing Video Reasoning in MLLMs Paper GitHub Dataset
PAVE: Patching and Adapting Video Large Language Models Paper
Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation Paper GitHub Dataset
VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning Paper GitHub Dataset
ST-Think: How Multimodal Large Language Models Reason About 4D Worlds from Ego-Centric Videos Paper GitHub
TIME: Temporal-sensitive Multi-dimensional Instruction Tuning and Benchmarking for Video-LLMs Paper
Memory-enhanced Retrieval Augmentation for Long Video Understanding Paper
Token-Efficient Long Video Understanding for Multimodal LLMs Paper
M-LLM Based Video Frame Selection for Efficient Video Understanding Paper
video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model Paper GitHub
Unhackable Temporal Rewarding for Scalable Video MLLMs Paper
Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy Paper
InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model Paper GitHub ACL 2025 Findings
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks Paper
LongViTU: Instruction Tuning for Long-Form Video Understanding Paper
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM Paper
Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces Paper
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling Paper GitHub
STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training Paper
ReVisionLLM: Recursive Vision-Language Model for Temporal Grounding in Hour-Long Videos Paper GitHub CVPR 2025
VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection Paper GitHub Dataset
Veason-R1: Reinforcing Video Reasoning Segmentation to Think Before It Segments Paper
SAIL-VL2 Technical Report Paper

Test-Time Scaling in Video Reasoning

Title Paper Code Dataset Venue
VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception Paper
Video-MTR: Reinforced Multi-Turn Reasoning for Long Video Understanding Paper
Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning Paper
VideoForest: Person-Anchored Hierarchical Reasoning for Cross-Video Question Answering Paper
Free-MoRef: Instantly Multiplexing Context Perception Capabilities of Video-MLLMs within Single Inference Paper
ReasonAct: Progressive Training for Fine-Grained Video Reasoning in Small Models Paper
EgoPrune: Efficient Token Pruning for Egomotion Video Reasoning in Embodied Agent Paper
Towards Video Thinking Test: A Holistic Benchmark for Advanced Video Reasoning and Understanding Paper
ViTCoT: Video-Text Interleaved Chain-of-Thought for Boosting Video Understanding in Large Language Models Paper
Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning Paper GitHub EMNLP 2025
StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling Paper
VRAgent-R1: Boosting Video Recommendation with MLLM-based Agents via Reinforcement Learning Paper
Temporal Chain of Thought: Long-Video Understanding by Thinking in Frames Paper
DIVE: Deep-search Iterative Video Exploration A Technical Report for the CVRR Challenge at CVPR 2025 Paper GitHub
How Far Can Off-the-Shelf Multimodal Large Language Models Go in Online Episodic Memory Question Answering? Paper
Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning Paper GitHub Dataset
VideoDeepResearch: Long Video Understanding With Agentic Tool Using Paper
CogStream: Context-guided Streaming Video Question Answering Paper GitHub Dataset
Wait, We Don't Need to "Wait"! Removing Thinking Tokens Improves Reasoning Efficiency Paper
Video-CoT: A Comprehensive Dataset for Spatiotemporal Understanding of Videos Based on Chain-of-Thought Paper
CyberV: Cybernetics for Test-time Scaling in Video Understanding Paper
Flattery in Motion: Benchmarking and Analyzing Sycophancy in Video-LLMs Paper
VideoChat-A1: Thinking with Long Videos by Chain-of-Shot Reasoning Paper
MiMo-VL Technical Report Paper
Video-Skill-CoT: Skill-based Chain-of-Thoughts for Domain-Adaptive Video Reasoning Paper
ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding Paper GitHub NeurIPS 2025
ReFoCUS: Reinforcement-guided Frame Optimization for Contextual Understanding Paper
SiLVR: A Simple Language-based Video Reasoning Framework Paper GitHub
Multi-RAG: A Multimodal Retrieval-Augmented Generation System for Adaptive Video Understanding Paper
VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning? Paper GitHub Dataset
Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration Paper GitHub
Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding Paper
Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning Paper
ViQAgent: Zero-Shot Video Question Answering via Agent with Open-Vocabulary Grounding Validation Paper
ViaRL: Adaptive Temporal Grounding via Visual Iterated Amplification Reinforcement Learning Paper GitHub
RVTBench: A Benchmark for Visual Reasoning Tasks Paper GitHub Dataset
CoT-Vid: Dynamic Chain-of-Thought Routing with Self Verification for Training-Free Video Reasoning Paper
VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large Video Language Models Paper
Seed1.5-VL Technical Report Paper
Empowering Agentic Video Analytics Systems with Video Language Models Paper
Divide and Conquer: Exploring Language-centric Tree Reasoning for Video Question-Answering Paper
SeriesBench: A Benchmark for Narrative-Driven Drama Series Understanding Paper GitHub CVPR 2025
VideoMultiAgents: A Multi-Agent Framework for Video Question Answering Paper
MR. Video: "MapReduce" is the Principle for Long Video Understanding Paper
Multimodal Long Video Modeling Based on Temporal Dynamic Context Paper GitHub
VideoAgent2: Enhancing the LLM-Based Agent System for Long-Form Video Understanding by Uncertainty-Aware CoT Paper
WikiVideo: Article Generation from Multiple Videos Paper
Aurelia: Test-time Reasoning Distillation in Audio-Visual LLMs Paper
From Trial to Triumph: Advancing Long Video Understanding via Visual Context Sample Scaling and Self-reward Alignment Paper
Agentic Keyframe Search for Video Question Answering Paper
VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning Paper GitHub Dataset
Does Your Vision-Language Model Get Lost in the Long Video Sampling Dilemma? Paper
Memory-enhanced Retrieval Augmentation for Long Video Understanding Paper
Everything Can Be Described in Words: A Simple Unified Multi-Modal Framework with Semantic and Temporal Alignment Paper
QuoTA: Query-oriented Token Assignment via CoT Query Decouple for Long Video Comprehension Paper GitHub
Token-Efficient Long Video Understanding for Multimodal LLMs Paper
M-LLM Based Video Frame Selection for Efficient Video Understanding Paper
TheoremExplainAgent: Towards Multimodal Explanations for LLM Theorem Understanding Paper GitHub Dataset ACL 2025 main
CoS: Chain-of-Shot Prompting for Long Video Understanding Paper
Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy Paper
Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge Paper GitHub ICLR 2025
InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model Paper GitHub ACL 2025 Findings
MECD+: Unlocking Event-Level Causal Graph Discovery for Video Reasoning Paper
VidChain: Chain-of-Tasks with Metric-based Direct Preference Optimization for Dense Video Captioning Paper
Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs Paper
Embodied VideoAgent: Persistent Memory from Egocentric Videos and Embodied Sensors Enables Dynamic Scene Understanding Paper
PruneVid: Visual Token Pruning for Efficient Video Large Language Models Paper
Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces Paper
VCA: Video Curious Agent for Long Video Understanding Paper
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling Paper GitHub
VidHalluc: Evaluating Temporal Hallucinations in Multimodal Large Language Models for Video Understanding Paper
VideoICL: Confidence-based Iterative In-context Learning for Out-of-Distribution Video Understanding Paper
ReVisionLLM: Recursive Vision-Language Model for Temporal Grounding in Hour-Long Videos Paper GitHub CVPR 2025
VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection Paper GitHub Dataset
Adaptive Video Understanding Agent: Enhancing efficiency with dynamic frame sampling and feedback-driven reasoning Paper
VideoINSTA: Zero-shot Long Video Understanding via Informative Spatial-Temporal Reasoning with LLMs Paper
MECD: Unlocking Multi-Event Causal Discovery in Video Reasoning Paper GitHub Dataset NeurIPS 2024 (Spotlight)
Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition Paper GitHub ICML 2024 Oral

Benchmarks for Video Reasoning

Title Paper Code Dataset Venue
Scaling RL to Long Videos Paper GitHub Dataset NeurIPS 2025
AdsQA: Towards Advertisement Video Understanding Paper
CVBench: Evaluating Cross-Video Synergies for Complex Multimodal Understanding and Reasoning Paper
ReasoningTrack: Chain-of-Thought Reasoning for Long-term Vision-Language Tracking Paper
METER: Multi-modal Evidence-based Thinking and Explainable Reasoning -- Algorithm and Benchmark Paper
Towards Video Thinking Test: A Holistic Benchmark for Advanced Video Reasoning and Understanding Paper
ImplicitQA: Going beyond frames towards Implicit Video Reasoning Paper Dataset
Video-CoT: A Comprehensive Dataset for Spatiotemporal Understanding of Videos Based on Chain-of-Thought Paper
Looking Beyond Visible Cues: Implicit Video Question Answering via Dual-Clue Reasoning Paper GitHub
MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning Paper GitHub Dataset
Time Blindness: Why Video-Language Models Can't See What Humans Can? Paper
ScaleLong: A Multi-Timescale Benchmark for Long Video Understanding Paper
VidText: Towards Comprehensive Evaluation for Video Text Understanding Paper GitHub Dataset
Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning? Paper GitHub
From Evaluation to Defense: Advancing Safety in Video Large Language Models Paper
VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation Paper
Breaking Down Video LLM Benchmarks: Knowledge, Spatial Perception, or True Temporal Understanding? Paper
RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video Paper GitHub Dataset NeurIPS 2025
MINERVA: Evaluating Complex Video Reasoning Paper GitHub
VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning Paper GitHub Dataset
Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 Paper GitHub Dataset
H2VU-Benchmark: A Comprehensive Benchmark for Hierarchical Holistic Video Understanding Paper
OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts Paper GitHub Dataset CVPR 2025
Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation Paper GitHub Dataset
V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning Paper GitHub Dataset
Reasoning is All You Need for Video Generalization: A Counterfactual Benchmark with Sub-question Evaluation Paper
Towards Fine-Grained Video Question Answering Paper
SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding Paper
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding Paper GitHub Dataset
OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding? Paper GitHub Dataset
HLV-1K: A Large-scale Hour-Long Video Benchmark for Time-Specific Long Video Understanding Paper
Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces Paper
3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark Paper
Black Swan: Abductive and Defeasible Video Reasoning in Unpredictable Events Paper GitHub Dataset CVPR 2025
VidHalluc: Evaluating Temporal Hallucinations in Multimodal Large Language Models for Video Understanding Paper
TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models Paper
TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models Paper

Related Surveys

Title Paper Code Dataset Venue
Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models Paper GitHub
VideoLLM Benchmarks and Evaluation: A Survey Paper
Reinforced MLLM: A Survey on RL-Based Reasoning in Multimodal Large Language Models Paper
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey Paper
From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding Paper
Video Understanding with Large Language Models: A Survey Paper

🌟 Star History

Star History Chart
