# Awesome-MLLM-Reasoning-Collection

License: MIT

πŸ‘ Welcome to the Awesome-MLLM-Reasoning-Collections repository! This repository is a carefully curated collection of papers, code, datasets, benchmarks, and resources focused on reasoning within Multimodal Large Language Models (MLLMs).

Feel free to ⭐ star and fork this repository to keep up with the latest advancements and contribute to the community.

## Table of Contents

- Papers and Projects 📄
  - Commonsense Reasoning
    - Image MLLM
    - Video MLLM
    - Audio MLLM
      - Utilizing GRPO to enhance audio reasoning performance
    - Omni MLLM
  - Reasoning Segmentation and Detection
    - Image MLLM
    - Video MLLM
    - Audio MLLM
    - Omni MLLM
  - Spatial and Temporal Grounding and Understanding
    - Image MLLM
    - Video MLLM
    - Audio MLLM
    - Omni MLLM
  - Math Reasoning
    - Image MLLM
  - Chart Reasoning
    - Benchmark
  - Visual-Audio Generation
    - Image MLLM
    - Video MLLM
    - Audio MLLM
  - Reasoning with Agent/Tool
  - Medical Reasoning
    - Image MLLM
    - Audio MLLM
    - Omni MLLM
  - Embodied Reasoning
  - Others
    - Image MLLM
    - Video MLLM
    - Audio MLLM
    - Omni MLLM
- Benchmarks 📊
- Open-source Projects
- Contributing

## Benchmarks 📊

| Date | Project | Task | Links |
|------|---------|------|-------|
| 26.02 | A Very Big Video Reasoning Suite (VBVR): 1M+ video clips across 200 reasoning tasks | Video Reasoning | [📑 Paper] [🤗 Model] [🤗 Data] |
| 26.02 | OmniGAIA: Omni-Modal AI Agent Benchmark with hindsight-guided exploration | Omni-Modal Agent Reasoning | [📑 Paper] [💻 Code] [🤗 Data] |
| 26.02 | SpatiaLab: Wild Spatial Reasoning benchmark across 6 VQA categories | Spatial Reasoning | [📑 Paper] [💻 Code] [🤗 Data] |
| 26.02 | MuRGAt: Multimodal Fact-Level Attribution benchmark for verifiable reasoning | Multimodal Attribution | [📑 Paper] [💻 Code] |
| 26.02 | DeepVision-103K: Verifiable multimodal math dataset for RLVR training | Math Reasoning | [📑 Paper] [💻 Code] [🤗 Data] |
| 26.02 | UniVBench: Unified evaluation for video foundation models across understanding, generation, editing | Video Foundation Model Evaluation | [📑 Paper] [💻 Code] |
| 26.02 | RISE-Video: Benchmark for video generators decoding implicit world rules | Video Generation Reasoning | [📑 Paper] [💻 Code] [🤗 Data] |
| 26.02 | SAW-Bench: Egocentric Situated Awareness evaluation with 786 smart-glass videos and 2,071+ QA pairs | Spatial Reasoning | [📑 Paper] |
| 26.02 | BrowseComp-V3: 300-question visual benchmark for complex multi-hop multimodal web search | Multimodal Browsing | [📑 Paper] |
| 26.02 | BiManiBench: Hierarchical benchmark for bimanual coordination evaluation in MLLMs | Bimanual Robotics | [📑 Paper] [💻 Code] |
| 26.01 | MMFineReason: Closing the Multimodal Reasoning Gap via Open Data-Centric Methods | Multimodal Reasoning | [📑 Paper] [🤗 Model] [🤗 Data] |
| 26.01 | ChartVerse: Scaling Chart Reasoning via Reliable Programmatic Synthesis | Chart Reasoning | [📑 Paper] [💻 Code] [🤗 Model] [🤗 Data] |
| 26.01 | VideoLoom: Joint Spatial-Temporal Understanding with LoomBench | Spatial-Temporal Reasoning | [📑 Paper] [💻 Code] [🤗 Model] |
| 26.01 | PROGRESSLM: Towards Progress Reasoning in Vision-Language Models | Task Progress Reasoning | [📑 Paper] [💻 Code] [🤗 Data] |
| 26.01 | FutureOmni: Evaluating Future Forecasting from Omni-Modal Context | Omni-Modal Temporal Reasoning | [📑 Paper] |
| 26.01 | Afri-MCQA: Multimodal Cultural Question Answering for African Languages | Multilingual Multimodal Reasoning | [📑 Paper] |
| 26.01 | AVMeme Exam: A Multimodal Multilingual Multicultural Benchmark | Cultural Multimodal Reasoning | [📑 Paper] |
| 25.12 | HERBench: Multi-Evidence Integration in Video Question Answering | Video Reasoning | [📑 Paper] |
| 25.12 | SVBench: Evaluation of Video Generation Models on Social Reasoning | Video Social Reasoning | [📑 Paper] |
| 25.12 | IF-Bench: Benchmarking MLLMs for Infrared Images | Infrared Image Understanding | [📑 Paper] |
| 25.12 | VABench: Comprehensive Benchmark for Audio-Video Generation | Audio-Video Generation | [📑 Paper] |
| 25.11 | MME-CC: Challenging Multi-Modal Evaluation Benchmark of Cognitive Capacity | Cognitive Capacity | [📑 Paper] |
| 25.11 | GGBench: Geometric Generative Reasoning Benchmark for Unified Multimodal Models | Geometric Reasoning | [📑 Paper] |
| 25.11 | WEAVE: Benchmarking In-context Interleaved Comprehension and Generation | Multimodal Comprehension & Generation | [📑 Paper] |
| 25.10 | Uni-MMMU: Massive Multi-discipline Multimodal Unified Benchmark | Multimodal Multi-discipline Reasoning | [📑 Paper] |
| 25.10 | PhysToolBench: Benchmarking Physical Tool Understanding for MLLMs | Physical Tool Understanding | [📑 Paper] |
| 25.10 | BEAR: Benchmarking Multimodal Language Models for Atomic Embodied Capabilities | Embodied AI Capabilities | [📑 Paper] |
| 25.10 | OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs | Long-context Video-Audio Understanding & Reasoning | [📑 Paper] [💻 Code] [🌐 Project] [🤗 Data] |
| 25.10 | XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models | Capability Balancing among Different Modalities | [📑 Paper] [💻 Code] [🌐 Project] |
| 25.10 | StreamingCoT: A Dataset for Temporal Dynamics and Multimodal Chain-of-Thought Reasoning in Streaming VideoQA | Temporal Reasoning | [📑 Paper] |
| 25.10 | Valor32k-AVQA v2.0: Open-Ended Audio-Visual Question Answering Dataset and Benchmark | Common Sense Omni Reasoning | [📑 Paper] |
| 25.09 | MARS2 2025 Challenge on Multimodal Reasoning | Multimodal Reasoning Challenge | [📑 Paper] |
| 25.09 | Visual-TableQA: Open-Domain Benchmark for Reasoning over Table Images | Table Reasoning | [📑 Paper] |
| 25.09 | AHELM: A Holistic Evaluation of Audio-Language Models | Audio-Language Understanding | [📑 Paper] |
| 25.09 | MDAR: A Multi-scene Dynamic Audio Reasoning Benchmark | Complex, Multi-scene & Dynamically Evolving Speech & Audio Reasoning | [📑 Paper] [💻 Code] |
| 25.09 | MiMo-Audio-Eval Toolkit | Speech/Sound/Music Reasoning | [💻 Code] |
| 25.08 | SpeechR: A Benchmark for Speech Reasoning in Large Audio-Language Models | Speech Reasoning | [📑 Paper] [💻 Code] [Data] |
| 25.08 | MMAU-Pro: A Challenging and Comprehensive Benchmark for Holistic Evaluation of Audio General Intelligence | Long-form, Spatial, and Multi-audio Reasoning on Speech/Music/Sound | [📑 Paper] [🤗 Data] |
| 25.08 | R²-AVSBench: Think Before You Segment: An Object-aware Reasoning Agent for Referring Audio-Visual Segmentation | Segmentation Reasoning | [📑 Paper] [🤗 Data] |
| 25.07 | Towards Video Thinking Test: A Holistic Benchmark for Advanced Video Reasoning and Understanding | Video Reasoning and Understanding | [📑 Paper] [🌐 Project] [🤗 Data] |
| 25.06 | FinMME: Benchmark Dataset for Financial Multi-Modal Reasoning Evaluation | Financial Multi-Modal Reasoning | [📑 Paper] [💻 Code] [🤗 Data] |
| 25.06 | MMR-V: What's Left Unsaid? A Benchmark for Multimodal Deep Reasoning in Videos | Video Reasoning | [📑 Paper] [💻 Code] [🌐 Project] [🤗 Data] |
| 25.06 | OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models | Spatial Reasoning | [📑 Paper] [💻 Code] [🌐 Project] [🤗 Data] |
| 25.06 | MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark | Phonetics, Prosody, Rhetoric, Syntactics, Semantics, and Paralinguistics in Speech Understanding & Reasoning | [📑 Paper] [💻 Code] [🤗 Data] |
| 25.05 | Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities | Video & Audio Reasoning | [📑 Paper] [💻 Code] [🌐 Project] [🤗 Data] |
| 25.05 | MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix | Multi-step Audio Reasoning | [📑 Paper] [💻 Code] [🎥 Demo] [🤗 Data] |
| 25.05 | On Path to Multimodal Generalist: General-Level and General-Bench | Multimodal Generation | [🌐 Project] [📑 Paper] [🤗 Data] |
| 25.04 | VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models | Visual Reasoning | [🌐 Project] [📑 Paper] [💻 Code] [🤗 Data] |
| 25.04 | IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs | Image-Grounded Video Perception and Reasoning | [📑 Paper] [💻 Code] |
| 25.04 | Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing | Reasoning-Informed viSual Editing | [📑 Paper] [💻 Code] |
| 25.04 | CMI-Bench: A Comprehensive Benchmark for Evaluating Music Instruction Following | Music Information Retrieval & Knowledge | [📑 Paper] [💻 Code] |
| 25.03 | MAVERIX: Multimodal Audio-Visual Evaluation Reasoning IndeX | Common Sense Omni Reasoning | [📑 Paper] [🌐 Project] |
| 25.03 | V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning | Spatio-temporal Reasoning | [🌐 Project] [📑 Paper] [🤗 Data] |
| 25.03 | MM-Spatial: Exploring 3D Spatial Understanding in Multimodal LLMs | Spatio-temporal Understanding | [📑 Paper] |
| 25.03 | Integrating Chain-of-Thought for Multimodal Alignment: A Study on 3D Vision-Language Learning | 3D-CoT | [📑 Paper] [🤗 Data] |
| 25.02 | MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models | MM-IQ | [📑 Paper] [💻 Code] |
| 25.02 | MM-RLHF: The Next Step Forward in Multimodal LLM Alignment | MM-RLHF-RewardBench, MM-RLHF-SafetyBench | [📑 Paper] |
| 25.02 | ZeroBench: An Impossible* Visual Benchmark for Contemporary Large Multimodal Models | ZeroBench | [🌐 Project] [🤗 Dataset] [💻 Code] |
| 25.02 | MME-CoT: Benchmarking Chain-of-Thought in LMMs for Reasoning Quality, Robustness, and Efficiency | MME-CoT | [📑 Paper] [💻 Code] |
| 25.02 | OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference | MM-AlignBench | [📑 Paper] [💻 Code] |
| 25.01 | AVTrustBench: Assessing and Enhancing Reliability and Robustness in Audio-Visual LLMs | Adversarial Attack, Compositional Reasoning, and Modality-specific Dependency in Visual & Audio | [📑 Paper] |
| 25.01 | LlamaV-o1: Rethinking Step-By-Step Visual Reasoning in LLMs | VRCBench | [📑 Paper] [💻 Code] |
| 24.12 | Online Video Understanding: A Comprehensive Benchmark and Memory-Augmented Method | VideoChat-Online | [📑 Paper] [💻 Code] |
| 24.11 | VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models | VLRewardBench | [📑 Paper] |
| 24.11 | Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos | MH-VidQA | [📑 Paper] [💻 Code] |
| 24.10 | OmnixR: Evaluating Omni-modality Language Models on Reasoning across Modalities | Video & Audio Reasoning | [📑 Paper] |
| 24.10 | MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark | Audio Understanding & Reasoning | [🌐 Project] [📑 Paper] [💻 Code] [🤗 Data] |
| 24.09 | MECD: Unlocking Multi-Event Causal Discovery in Video Reasoning | Video Causal Reasoning | [📑 Paper] [💻 Code] [🤗 Data] |
| 24.09 | OmniBench: Towards The Future of Universal Omni-Language Models | Reasoning with Image & Speech/Sound/Music | [📑 Paper] [💻 Code] [🌐 Project] [🤗 Data] |
| 24.08 | MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models | Music Knowledge & Reasoning | [🌐 Project] [📑 Paper] [💻 Code] [Data] |
| 24.07 | REXTIME: A Benchmark Suite for Reasoning-Across-Time in Videos | REXTIME | [📑 Paper] [💻 Code] |
| 24.06 | AudioBench: A Universal Benchmark for Audio Large Language Models | Speech & Sound Understanding | [📑 Paper] [💻 Code] |
| 24.06 | ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation | ChartBench | [🌐 Project] [📑 Paper] [💻 Code] |
| 24.05 | M3CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought | M3CoT | [📑 Paper] |
| 24.02 | AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension | Speech & Sound Understanding | [📑 Paper] [💻 Code] |
| 23.10 | CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models | Audio Reasoning (Attributes & Orders) | [🌐 Project] [📑 Paper] |

## Open-source Projects

| Project | Links |
|---------|-------|
| Reason-RFT | [💻 GitHub] [🤗 Dataset] |
| EasyR1 | [💻 GitHub] |
| Multimodal Open R1 | [💻 GitHub] [🤗 Model] [🤗 Dataset] |
| LMM-R1 | [💻 GitHub] |
| MMR1 | [💻 GitHub] [🤗 Model] [🤗 Dataset] |
| R1-V | [💻 GitHub] [🎯 Blog] [🤗 Dataset] |
| R1-Multimodal-Journey | [💻 GitHub] |
| VLM-R1 | [💻 GitHub] [🤗 Model] [🤗 Dataset] [🤗 Demo] |
| R1-Vision | [💻 GitHub] [🤗 Cold-Start Dataset] |
| R1-Onevision | [💻 GitHub] [🤗 Model] [🤗 Dataset] [🤗 Demo] [📝 Report] |
| Open R1 Video | [💻 GitHub] [🤗 Model] [🤗 Dataset] |
| Video-R1 | [💻 GitHub] [🤗 Dataset] |
| Open-LLaVA-Video-R1 | [💻 GitHub] |
| R1V-Free | [💻 GitHub] |
| SeekWorld | [💻 GitHub] |
| IE-Critic-R1 | [💻 GitHub] [🤗 Model] [🤗 Data] [🤗 ColdStart SFT] |

## Contributing

If you are interested in contributing, please refer to HERE for contribution instructions.
