Welcome to the Awesome-MLLM-Reasoning-Collections repository! This repository is a carefully curated collection of papers, code, datasets, benchmarks, and resources focused on reasoning within Multimodal Large Language Models (MLLMs).
Feel free to star and fork this repository to keep up with the latest advancements and contribute to the community.
- Awesome-MLLM-Reasoning-Collections
  - Table of Contents
    - Papers and Projects
    - Benchmarks
    - Open-source Projects
    - Contributing
- 26.02 From Blind Spots to Gains: Diagnostic-Driven Iterative Training for Large Multimodal Models | Paperπ Codeπ₯οΈ Modelπ€
- Spiral-loop framework diagnosing capability gaps in MLLMs and generating targeted data and RL training to close them iteratively. | Task: Reasoning & Understanding
- 26.02 Imagination Helps Visual Reasoning, But Not Yet in Latent Space | Paperπ Codeπ₯οΈ
- CapImagine proposes text-based explicit imagination outperforming latent-space baselines on vision-centric benchmarks via causal mediation analysis. | Task: Reasoning & Understanding
- 26.02 NoLan: Mitigating Object Hallucinations in Large Vision-Language Models via Dynamic Suppression of Language Priors | Paperπ Codeπ₯οΈ
- Training-free decoding method that dynamically suppresses language priors by comparing multimodal vs. text-only output distributions, achieving +6.45/+7.21 accuracy on POPE (see the decoding sketch after this list). | Task: Reasoning & Understanding
- 26.02 Selective Training for Large Vision Language Models via Visual Information Gain | Paperπ
- Visual Information Gain (VIG) metric quantifying how much visual input reduces prediction uncertainty for improved visual grounding and reduced language bias. | Task: Reasoning & Understanding
- 26.02 MetaphorStar: Image Metaphor Understanding and Reasoning with End-to-End Visual Reinforcement Learning | Paperπ Codeπ₯οΈ Modelπ€
- End-to-end visual RL framework for image metaphor comprehension with TFQ-GRPO method, achieving 82.6% average improvement on image implication benchmarks. | Task: Reasoning & Understanding
- 26.02 Learning Self-Correction in Vision-Language Models via Rollout Augmentation | Paperπ Modelπ€
- Octopus synthesizes dense self-correction examples for VLMs via RL, achieving SOTA among open-source VLMs on 7 benchmarks. | Task: Reasoning & Understanding
- 26.02 SPARC: Separating Perception And Reasoning Circuits for Test-time Scaling of VLMs | Paperπ
- Decouples visual perception from reasoning in VLMs via a two-stage pipeline, enabling efficient test-time scaling with a 200× lower token budget. | Task: Reasoning & Understanding
- 26.02 Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models | Paperπ Codeπ₯οΈ
- Fixed-frame Modality Gap Theory with training-free ReAlign alignment and scalable ReVision pretraining using unpaired data to bridge the modality gap. | Task: Reasoning & Understanding
- 26.02 Kimi K2.5: Visual Agentic Intelligence | Paperπ Codeπ₯οΈ Modelπ€
- Open-source multimodal agentic model achieving SOTA across coding, vision, reasoning, and agentic tasks via joint text-vision RL and Agent Swarm parallel execution. | Task: Reasoning & Understanding
- 26.02 Toward Cognitive Supersensing in Multimodal Large Language Model | Paperπ Codeπ₯οΈ Modelπ€ Datasetπ€
- Trains MLLMs to generate internal visual imagery sequences for abstract visual reasoning, evaluated on CogSense-Bench spanning five cognitive dimensions. | Task: Reasoning & Understanding
- 26.02 Thinking with Comics: Enhancing Multimodal Reasoning through Structured Visual Storytelling | Paperπ Codeπ₯οΈ
- Uses comics as a visual medium to improve multimodal reasoning efficiency while preserving temporal structure and narrative coherence. | Task: Reasoning & Understanding
- 26.02 SwimBird: Eliciting Switchable Reasoning Mode in Hybrid Autoregressive MLLMs | Paperπ Codeπ₯οΈ Datasetπ€
- Hybrid autoregressive MLLM dynamically switching between text-only, vision-only, and interleaved vision-text reasoning modes based on input queries. | Task: Reasoning & Understanding
- 26.02 What does RL improve for Visual Reasoning? A Frankenstein-Style Analysis | Paperπ Codeπ₯οΈ Modelπ€
- Analyzes what RL actually improves in VLMs for visual reasoning, finding RL primarily refines mid-to-late transformer layers that improve vision-to-reasoning alignment. | Task: Reasoning & Understanding
- 26.02 On Robustness and Chain-of-Thought Consistency of RL-Finetuned VLMs | Paperπ
- Shows RL fine-tuning of VLMs introduces vulnerability to textual perturbations and reveals an accuracy-faithfulness trade-off undermining chain-of-thought reliability. | Task: Reasoning & Understanding
- 26.02 Thinking with Drafting: Optical Decompression via Logical Reconstruction | Paperπ
- TwD reconstructs logical structures from compressed visual tokens via Domain-Specific Language, forcing models to draft reasoning as executable code for self-verification. | Task: Reasoning & Understanding
- 26.02 Visual Persuasion: What Influences Decisions of Vision-Language Models? | Paperπ
- Studies VLM visual decision-making through controlled image-based choice tasks with systematic perturbations to identify visual vulnerabilities and safety concerns. | Task: Reasoning & Understanding
- 26.02 Adapting Vision-Language Models for E-commerce Understanding at Scale | Paperπ
- Adapts general-purpose VLMs for e-commerce product understanding via a 4M-item visual instruction tuning dataset covering deep product attributes and dynamic extraction. | Task: Reasoning & Understanding
- 26.02 Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception | Paperπ Codeπ₯οΈ Modelπ€
- Trains MLLMs to internally perform iterative zooming during inference via distillation, eliminating repeated tool calls while improving fine-grained visual perception. | Task: Reasoning & Understanding
- 26.02 DeepImageSearch: Benchmarking Multimodal Agents for Context-Aware Image Retrieval in Visual Histories | Paperπ Codeπ₯οΈ Datasetπ€
- Reformulates image retrieval as multi-step reasoning over visual histories with DISBench benchmark and a modular agent with dual-memory system. | Task: Reasoning & Understanding
- 26.01 MMFineReason: Closing the Multimodal Reasoning Gap via Open Data-Centric Methods | Paperπ Modelπ€ Datasetπ€
- A 1.8M-sample multimodal reasoning dataset with high-quality CoT annotations; the 8B model approaches Qwen3-VL-32B-Thinking performance. | Task: Reasoning & Understanding
- 26.01 DiffThinker: Towards Generative Multimodal Reasoning with Diffusion Models | Paperπ Codeπ₯οΈ Modelπ€ Datasetπ€
- Reformulates multimodal reasoning as a native image-to-image generative task using diffusion models. | Task: Reasoning & Understanding
- 26.01 Visual Generation Unlocks Human-Like Reasoning through Multimodal World Models | Paperπ Codeπ₯οΈ Datasetπ€
- Proposes the visual superiority hypothesis: visual generation serves as a more natural world model for physical/spatial reasoning tasks. | Task: Reasoning & Understanding
- 26.01 VTC-R1: Vision-Text Compression for Efficient Long-Context Reasoning | Paperπ Codeπ₯οΈ
- Compresses textual reasoning traces into compact images as "optical memory" for VLMs, achieving 3.4x token compression. | Task: Reasoning & Understanding
- 26.01 UniCorn: Towards Self-Improving Unified Multimodal Models through Self-Generated Supervision | Paperπ Codeπ₯οΈ Modelπ€
- Self-improvement framework partitioning a single model into Proposer/Solver/Judge roles via self-play to improve comprehension and generation. | Task: Reasoning & Understanding
- 26.01 LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning | Paperπ Codeπ₯οΈ Modelπ€ Datasetπ€
- Addresses the "Perception Gap" by aligning latent visual thoughts (attention trajectories) between teacher and student models. | Task: Reasoning & Understanding
- 26.01 STEP3-VL-10B Technical Report | Paperπ Codeπ₯οΈ Modelπ€
- A 10B multimodal foundation model with Parallel Coordinated Reasoning (PaCoRe) for test-time compute scaling. | Task: Reasoning & Understanding
- 26.01 Generation Enhances Understanding in Unified Multimodal Models via Multi-Representation Generation | Paperπ Codeπ₯οΈ
- Trains unified multimodal models to generate pixel, depth, and segmentation representations alongside understanding. | Task: Reasoning & Understanding
- 26.01 What Matters in Data Curation for Multimodal Reasoning? Insights from the DCVLR Challenge | Paperπ
- First-place NeurIPS 2025 DCVLR challenge submission revealing difficulty-based example selection as dominant driver in data curation. | Task: Reasoning & Understanding
- 26.01 MAD: Modality-Adaptive Decoding for Mitigating Cross-Modal Hallucinations in Multimodal Large Language Models | Paperπ
- Modality-adaptive decoding to mitigate cross-modal hallucinations in MLLMs by dynamically adjusting decoding. | Task: Reasoning & Understanding
- 25.12 OneThinker: All-in-one Reasoning Model for Image and Video | Paperπ
- Unifies image and video understanding across diverse visual tasks using RL with EMA-GRPO technique. | Task: Reasoning & Understanding
- 25.12 Puzzle Curriculum GRPO for Vision-Centric Reasoning | Paperπ
- Supervision-free RL method enhancing visual reasoning in VLMs through self-supervised puzzle environments. | Task: Reasoning & Understanding
- 25.12 Robust-R1: Degradation-Aware Reasoning for Robust Visual Understanding | Paperπ
- Enhances MLLM robustness to visual degradations by modeling degradation parameters through structured reasoning chains. | Task: Reasoning & Understanding
- 25.12 See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning | Paperπ
- Improves VLM multimodal reasoning via paired masked views to enforce fine-grained visual reliance. | Task: Reasoning & Understanding
- 25.11 OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe | Paperπ
- Open general-purpose framework for advancing multimodal reasoning. | Task: Reasoning & Understanding
- 25.11 ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning | Paperπ
- Studies emergent properties in multimodal interleaved chain-of-thought reasoning. | Task: Reasoning & Understanding
- 25.11 TiDAR: Think in Diffusion, Talk in Autoregression | Paperπ
- Combines diffusion-based thinking with autoregressive generation for multimodal reasoning. | Task: Reasoning & Understanding
- 25.10 TTRV: Test-Time Reinforcement Learning for Vision Language Models | Paperπ
- Test-time reinforcement learning applied to vision-language models for improved reasoning. | Task: Reasoning & Understanding
- 25.10 VLM-FO1: Bridging the Gap Between High-Level Reasoning and Fine-Grained Perception in VLMs | Paperπ
- Improves VLMs' ability to combine high-level reasoning with detailed visual perception. | Task: Reasoning & Understanding
- 25.10 ARES: Multimodal Adaptive Reasoning via Difficulty-Aware Token-Level Entropy Shaping | Paperπ
- Adaptive reasoning for multimodal models using entropy shaping. | Task: Reasoning & Understanding
- 25.09 R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and RL | Paperπ
- Training method using RL and annealing to improve auto-thinking and reasoning in multimodal LLMs. | Task: Reasoning & Understanding
- 25.09 LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training | Paperπ
- Open-source framework for training multimodal vision-language models. | Task: Reasoning & Understanding
- 25.08 Thyme: Think Beyond Images | Paperπ
- Multimodal reasoning system that extends beyond surface-level image understanding to higher-level thinking. | Task: Reasoning & Understanding
- 25.08 Controlling Multimodal LLMs via Reward-guided Decoding | Paperπ
- Controls MLLM reasoning outputs through reward-based generation guidance at decoding time. | Task: Reasoning & Understanding
- 25.08 Self-Rewarding Vision-Language Model via Reasoning Decomposition | Paperπ
- VLM that uses reasoning decomposition and self-reward to improve visual reasoning quality. | Task: Reasoning & Understanding
- 25.08 GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models | Paperπ
- Foundation model with strong agentic, reasoning, and coding capabilities across modalities. | Task: Reasoning & Understanding
- 25.07 GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning | Paperπ Codeπ₯οΈ
- A reasoning-centric training framework for general-purpose multimodal reasoning. | Task: Reasoning & Understanding
- 25.07 MiCo: Multi-image Contrast for Reinforcement Visual Reasoning | Paperπ
- Construct image triplets comprising two augmented views of the same image and a third, similar but distinct image. | Task: Reasoning & Understanding
- 25.06 Vision Matters: Simple Visual Perturbations Can Boost Multimodal Math Reasoning | Paperπ Codeπ₯οΈ Modelπ€
- Simple visual perturbation framework that can be easily integrated into existing post-training pipelines including SFT, DPO, and GRPO. | Task: Reasoning & Understanding
- 25.05 Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning | Paperπ Codeπ₯οΈ Modelπ€
- 25.05 Sherlock: Self-Correcting Reasoning in Vision-Language Models | Paperπ Codeπ₯οΈ Modelπ€
- Explore self-correction as a strategy to enhance reasoning VLMs | Task: Reasoning & Understanding
- 25.05 EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning | Paperπ Codeπ₯οΈ Modelπ€
- The first general framework for unified audio-visual reasoning via reinforcement learning | Task: Reasoning & Understanding
- 25.03 Skywork-R1V: Pioneering Multimodal Reasoning with CoT | Paperπ Codeπ₯οΈ Modelπ€
- The first industry open-sourced multimodal reasoning model with advanced visual chain-of-thought capabilities | Task: Reasoning & Understanding
- 25.03 CMMCoT: Enhancing Complex Multi-Image Comprehension via Multi-Modal Chain-of-Thought and Memory Augmentation | Paperπ
- Mimic human-like "slow thinking" for multi-image understanding. | Task: VQA
- 25.03 DAPO: an Open-Source LLM Reinforcement Learning System at Scale | Paperπ Codeπ₯οΈ Dataπ€
- Propose the Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO) algorithm. | Task: Math
- 25.03 VisRL: Intention-Driven Visual Perception via Reinforced Reasoning | Paperπ Codeπ₯οΈ
- The first framework that applies reinforcement learning (RL) to the problem of intention-driven visual perception | Task: VQA
- 25.03 Unified Reward Model for Multimodal Understanding and Generation | Paperπ Codeπ₯οΈ Datasetπ€
- Improve MLLM's understanding and generation ability with DPO | Task: VQA & Generation
- 25.02 Qwen2.5-VL Technical Report | Paperπ Codeπ₯οΈ Huggingfaceπ€
- The latest flagship model of the Qwen vision-language series for various multimodal tasks | Task: Reasoning & Understanding
- 25.02 MM-RLHF: The Next Step Forward in Multimodal LLM Alignment | Paperπ Projectπ
- A comprehensive project for aligning MLLMs with human preferences | Task: Reward & VQA
- 25.01 Kimi k1.5: Scaling Reinforcement Learning with LLMs (MoonshotAI) | Projectπ
- The latest flagship model of the Kimi series for various multimodal tasks | Task: Reasoning & Understanding
- 25.01 InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model | Paperπ Codeπ₯οΈ
- A simple yet effective multi-modal reward model that aligns MLLMs with human preferences | Task: Reward & VQA
- 25.01 LlamaV-o1: Rethinking Step-By-Step Visual Reasoning in LLMs | Paperπ Codeπ₯οΈ
- A multimodal reasoning model combining multi-step curriculum learning and beam search | Task: VQA
- 25.01 ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding | Paperπ Codeπ₯οΈ Modelπ€
- Perform visual chain of thought via input-image editing to help multimodal reasoning. | Task: VQA
- 24.12 Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search | Paperπ Codeπ₯οΈ Datasetπ€
- Improve MLLM reasoning ability via collective Monte Carlo tree search | Task: VQA
- 24.11 LLaVA-CoT: Let Vision Language Models Reason Step-by-Step | Paperπ Codeπ₯οΈ Modelπ€
- A novel MLLM designed to conduct autonomous multistage reasoning. | Task: VQA
- 24.11 Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models | Paperπ Codeπ₯οΈ Modelπ€
- Explore long-chain visual reasoning with MLLMs | Task: VQA
- 24.11 Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization | Paperπ Codeπ₯οΈ Modelπ€
- A preference optimization (PO) process to enhance the multimodal reasoning capabilities of MLLMs. | Task: VQA
- 24.10 Improve Vision Language Model Chain-of-thought Reasoning | Paperπ Codeπ₯οΈ
- Apply reinforcement learning on 193k CoT SFT data for reasoning | Task: VQA
- 24.03 (NeurIPS24) Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning | Paperπ Codeπ₯οΈ Datasetπ€
- Visual CoT to improve MLLMs' reasoning ability | Task: VQA
- 23.02 Multimodal Chain-of-Thought Reasoning in Language Models | Paperπ Codeπ₯οΈ
- Visual CoT for MLLM reasoning | Task: VQA
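Several training-free decoding entries above (e.g., NoLan) work by contrasting the model's multimodal output distribution against a text-only one to damp over-strong language priors. The snippet below is a minimal sketch of that general contrastive idea in PyTorch; the function name, the `alpha` knob, and the plain log-softmax difference are illustrative assumptions, not any paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def prior_suppressed_logits(logits_mm: torch.Tensor,
                            logits_text: torch.Tensor,
                            alpha: float = 0.5) -> torch.Tensor:
    """Contrast multimodal logits against text-only logits (hypothetical helper).

    logits_mm:   next-token logits conditioned on (image, prompt)
    logits_text: next-token logits conditioned on the text prompt alone
    alpha:       strength of the language-prior penalty (assumed knob)
    """
    log_p_mm = F.log_softmax(logits_mm, dim=-1)
    log_p_text = F.log_softmax(logits_text, dim=-1)
    # Tokens that are likely mainly because of the language prior
    # (high probability even without the image) get pushed down.
    return log_p_mm - alpha * log_p_text

# Toy usage with random logits over a 32k-token vocabulary.
vocab = 32_000
adjusted = prior_suppressed_logits(torch.randn(1, vocab), torch.randn(1, vocab))
next_token = adjusted.argmax(dim=-1)
```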
- 26.02 A Very Big Video Reasoning Suite | Paperπ Modelπ€ Datasetπ€
- 1M+ video clip dataset spanning 200 reasoning tasks (VBVR) with VBVR-Bench for verifiable evaluation, enabling emergent generalization via large-scale scaling. | Task: Video Understanding & Reasoning
- 26.02 Thinking in Frames: How Visual Context and Test-Time Scaling Empower Video Reasoning | Paperπ Projectπ
- Video generation models achieve zero-shot generalization for visual reasoning by using generated frames as intermediate reasoning steps with a visual test-time scaling law. | Task: Video Understanding & Reasoning
- 26.02 Multimodal Fact-Level Attribution for Verifiable Reasoning | Paperπ Codeπ₯οΈ
- MuRGAt benchmark requiring MLLMs to provide precise fact-level citations across video, audio, and other modalities, finding that strong models frequently hallucinate citations despite correct reasoning. | Task: Video Understanding & Reasoning
- 26.02 Towards Universal Video MLLMs with Attribute-Structured and Quality-Verified Instructions | Paperπ Codeπ₯οΈ Modelπ€ Datasetπ€
- ASID-Caption suite with 1M structured audiovisual annotations and quality verification pipeline for fine-grained audiovisual video understanding across multiple attribute dimensions. | Task: Video Understanding & Reasoning
- 26.02 VisPhyWorld: Probing Physical Reasoning via Code-Driven Video Reconstruction | Paperπ Codeπ₯οΈ
- Execution-based framework evaluating physical reasoning in MLLMs by requiring executable simulator code from visual observations; VisPhyBench (209 scenes) reveals MLLMs struggle to infer physical parameters. | Task: Video Understanding & Reasoning
- 26.01 Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation | Paperπ
- Uses counterfactual video generation to reduce hallucinations and improve temporal reasoning in multimodal LLMs. | Task: Video Understanding & Reasoning
- 25.12 Rethinking Chain-of-Thought Reasoning for Videos | Paperπ
- Proposes improved chain-of-thought reasoning strategies specifically designed for video understanding tasks. | Task: Video Understanding & Reasoning
- 25.12 SAGE: Training Smart Any-Horizon Agents for Long Video Reasoning with RL | Paperπ
- RL-based framework training agents for long-horizon video reasoning across variable time spans. | Task: Video Understanding & Reasoning
- 25.11 Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination | Paperπ
- Enhances reasoning over text-rich video content via visual rumination. | Task: Video Understanding & Reasoning
- 25.10 Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning | Paperπ
- Reasoning framework enabling models to think with video inputs via RL. | Task: Video Understanding & Reasoning
- 25.10 StreamingVLM: Real-Time Understanding for Infinite Video Streams | Paperπ
- Real-time video stream understanding with multimodal LLMs. | Task: Video Understanding & Reasoning
- 25.09 Video models are zero-shot learners and reasoners | Paperπ
- Demonstrates zero-shot reasoning capabilities in video models. | Task: Video Understanding & Reasoning
- 25.07 Scaling RL to Long Videos | Paperπ Modelπ€ Codeπ₯οΈ
- 25.06 DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware Regressive GRPO | Paperπ
- 25.06 VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning | Paperπ Modelπ€ Codeπ₯οΈ
- Extend Reinforcement Fine-Tuning (RFT) to the video reasoning domain, a long-standing challenge. | Task: Video Understanding & Reasoning
- 25.06 VersaVid-R1: A Versatile Video Understanding and Reasoning Model from Question Answering to Captioning Tasks | Paperπ Modelπ€ Codeπ₯οΈ
- 25.05 SpaceR: Reinforcing MLLMs in Video Spatial Reasoning | Paperπ Modelπ€ Codeπ₯οΈ
- 25.05 Video-R1: Reinforcing Video Reasoning in MLLMs | Paperπ Modelπ€ Codeπ₯οΈ
- 25.04 TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning | Paperπ Modelπ€ Codeπ₯οΈ
- 25.04 Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning | Paperπ Projectπ Codeπ₯οΈ
- The first unified multimodal CoT reward model, capable of step-by-step long-chain reasoning for visual understanding and generation reward tasks. | Task: Video Understanding and Generation
- 25.04 ViSMaP: Unsupervised Hour-long Video Summarisation by Meta-Prompting | Paperπ
- A system to summarise hour-long videos without supervision. | Task: Video Summary
- 25.04 TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning | Paperπ Codeπ₯οΈ | Modelπ€
- Present the small-scale video reasoning model TinyLLaVA-Video-R1 | Task: Video QA
- 25.04 VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning | Paperπ Codeπ₯οΈ | Datasetπ€
- A novel video-language agent designed for temporal-grounded video understanding. | Task: Video QA
- 25.04 Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 | Paperπ Codeπ₯οΈ | Datasetπ€
- Reveals that RL enhances visual perception but often produces less logically coherent reasoning chains. | Task: Video QA
- 25.03 VIDEOTREE: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos | Paperπ Codeπ₯οΈ
- 25.02 CoS: Chain-of-Shot Prompting for Long Video Understanding | Paperπ Codeπ₯οΈ
- Approach long video understanding by optimising input video information to fully utilise the MLLM's ability to comprehend long videos. | Task: Video VQA
- 25.02 video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model | Paperπ Demoπ₯οΈ
- An open-source reasoning-enhanced audio-visual LLM designed for general video understanding tasks. | Task: Video QA
- 25.02 Open-R1-Video | Codeπ₯οΈ Datasetπ€
- An open-source R1-style video understanding model | Task: Video QA
- 25.01 Temporal Preference Optimization for Long-Form Video Understanding | Paperπ Codeπ₯οΈ
- A novel post-training framework designed to enhance the temporal grounding capabilities of video-LMMs through preference learning (see the DPO-style loss sketch after this list) | Task: Video QA
- 25.01 Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding | Paperπ Codeπ₯οΈ Modelπ€
- A family of VLMs designed for high-quality video captioning and understanding | Task: Video captioning & QA
- 24.12 (ECCV24) VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding | Paperπ Codeπ₯οΈ Projectπ
- Explore how reconciling several foundation models with a novel unified memory mechanism could tackle the challenging video understanding problem | Task: Video captioning & QA
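Temporal Preference Optimization above post-trains video-LMMs with preference learning; the core of most such recipes is a DPO-style objective over preferred vs. dispreferred responses (e.g., temporally well-grounded vs. poorly grounded answers). Below is a generic, self-contained DPO loss sketch; the beta value and the summed sequence log-probability inputs are placeholders, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO objective on summed sequence log-probabilities (each shape: (batch,))."""
    chosen_ratio = logp_chosen - ref_logp_chosen        # policy vs. reference, preferred answer
    rejected_ratio = logp_rejected - ref_logp_rejected  # policy vs. reference, dispreferred answer
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy usage with random log-probabilities for a batch of 4 preference pairs.
batch = [torch.randn(4) for _ in range(4)]
print(dpo_loss(*batch).item())
```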
- 25.10 UALM: Unified Audio Language Model for Understanding, Generation and Reasoning | Projectπ
- 25.09 MiMo Audio: Audio Language Models are Few-Shot Learners | Projectπ Codeπ₯οΈ
- 25.07 Audio Entailment: Assessing Deductive Reasoning for Audio Understanding
- 25.07 Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models
- 25.05 AudSemThinker: Enhancing Audio-Language Models Through Reasoning over Semantics of Sound
- 25.05 Omni-R1: Do You Really Need Audio to Fine-Tune Your Audio LLM?
- Utilizing GRPO to enhance audio reasoning performance
- 25.04 SARI: Structured Audio Reasoning via Curriculum-Guided Reinforcement Learning
- 25.04 Kimi-Audio Technical Report | Codeπ₯οΈ
- 25.03 Reinforcement Learning Outperforms Supervised Fine-Tuning: A Case Study on Audio Question Answering
- 25.03 Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models | Projectπ
- Utilizing CoT data for audio understanding tasks.
- 25.03 Mellow: a small audio language model for reasoning | Codeπ₯οΈ
- Small audio-language model (167M) designed for audio understanding, audio entailment, audio difference and captioning.
- 25.03 Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities | Projectπ
- NVIDIA audio-language model for various audio understanding and reasoning tasks.
- 25.02 Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction | Codeπ₯οΈ
- 25.01 Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model
- Finetuning Qwen2-Audio with CoT data for audio understanding and retrieval tasks.
- 24.07 Qwen2-Audio Technical Report | Paperπ Codeπ₯οΈ
- Qwen audio-language series for various audio understanding tasks, especially speech.
- 24.07 (EMNLP2024) GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities | Projectπ
- NVIDIA audio-language model for various audio understanding and reasoning tasks.
- 24.02 (ICML2024) Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities | Codeπ₯οΈ
- Audio-language model for various audio understanding and reasoning tasks, built with Q-Formers.
- 23.11 Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models | Codeπ₯οΈ
- Qwen audio-language series for various audio understanding tasks across speech, sound, and music.
- 23.10 (ICLR2024) SALMONN: Towards Generic Hearing Abilities for Large Language Models | Codeπ₯οΈ
- ByteDance audio-language model for various audio understanding tasks, especially speech and sound, built with a Q-Former.
- 23.09 (NAACL2024) MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response
- Music-language model for music understanding and captioning tasks.
- 26.02 OmniGAIA: Towards Native Omni-Modal AI Agents | Paperπ Codeπ₯οΈ Modelπ€ Datasetπ€
- OmniGAIA benchmark for omni-modal agent evaluation on cross-modal reasoning and tool-use, with OmniAtlas agent trained via hindsight-guided tree exploration and OmniDPO. | Task: Reasoning & Understanding
- 26.02 Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device | Paperπ Codeπ₯οΈ Modelπ€
- Compact on-device unified multimodal model (~3 s per 512×512 image on iPhone) outperforming Show-O and JanusFlow on generation and visual understanding benchmarks. | Task: Reasoning & Understanding
- 26.02 OmniVideo-R1: Reinforcing Audio-visual Reasoning with Query Intention and Modality Attention | Paperπ
- Reinforced framework for omnivideo models improving mixed-modality reasoning by combining query-intention grounding and modality-attentive fusion via contrastive learning. | Task: Reasoning & Understanding
- 26.02 UniT: Unified Multimodal Chain-of-Thought Test-time Scaling | Paperπ
- Framework enabling unified multimodal models to perform iterative CoT test-time scaling, showing sequential reasoning is more efficient than parallel sampling for both generation and understanding. | Task: Reasoning & Understanding
- 26.02 Best of Both Worlds: Multimodal Reasoning and Generation via Unified Discrete Flow Matching | Paperπ
- UniDFlow, a unified discrete flow-matching framework decoupling understanding and generation via low-rank adapters and multimodal preference alignment, achieving SOTA across 8 benchmarks. | Task: Reasoning & Understanding
- 26.02 Understanding vs. Generation: Navigating Optimization Dilemma in Multimodal Models | Paperπ Codeπ₯οΈ
- R3 (Reason-Reflect-Refine) framework reformulating single-step generation into a multi-step generate-understand-regenerate process to resolve the trade-off between multimodal understanding and generation (see the loop sketch after this list). | Task: Reasoning & Understanding
- 26.02 BrowseComp-V3: A Visual, Vertical, and Verifiable Benchmark for Multimodal Browsing Agents | Paperπ
- 300-question benchmark for complex multi-hop reasoning across text and visual modalities with deep web search; even SOTA models achieve only 36% accuracy, with the OmniSeeker unified browsing agent. | Task: Reasoning & Understanding
- 25.12 Qwen3-VL Technical Report | Paperπ
- Advanced VLM excelling in text and multimodal understanding supporting up to 256K tokens of interleaved text, images, and video. | Task: Reasoning & Understanding
- 25.10 InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue
- 25.10 Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation | Paperπ
- Unified sparse architecture for multimodal perception and generation across modalities. | Task: Reasoning & Understanding
- 25.10 OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM | Paperπ
- Multimodal LLM for comprehensive understanding across all modalities. | Task: Reasoning & Understanding
- 25.09 Qwen3-Omni Technical Report
- 25.09 Lavida-O: Elastic Large Masked Diffusion Models for Unified Multimodal Understanding and Generation | Paperπ
- Unified model for multimodal understanding and generation across modalities. | Task: Reasoning & Understanding
- 25.07 Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
- 25.05 EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning
- 25.03 Qwen2.5-Omni Technical Report
- 25.01 OpenOmni: Advancing Open-Source Omnimodal Large Language Models with Progressive Multimodal Alignment and Real-Time Self-Aware Emotional Speech Synthesis
- 24.10 Baichuan-Omni Technical Report
- 24.09 MIO: A Foundation Model on Multimodal Tokens
- 24.08 MiniCPM-V: A GPT-4V Level MLLM on Your Phone | Codeπ₯οΈ
- 24.02 AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
- 23.12 Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
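The R3 entry above reformulates single-step image generation as a generate-understand-regenerate loop inside one unified model. The sketch below only illustrates that control flow; the `generate`, `critique`, and `is_satisfactory` callables are hypothetical stand-ins, not the paper's API.

```python
from typing import Callable

def reason_reflect_refine(prompt: str,
                          generate: Callable[[str], str],
                          critique: Callable[[str, str], str],
                          is_satisfactory: Callable[[str], bool],
                          max_rounds: int = 3) -> str:
    """Generic generate -> understand -> regenerate loop (hypothetical interface)."""
    image = generate(prompt)
    for _ in range(max_rounds):
        feedback = critique(prompt, image)            # understanding pass over the current draft
        if is_satisfactory(feedback):
            break
        image = generate(f"{prompt}\nRevise according to: {feedback}")  # refinement pass
    return image

# Toy usage with stub callables standing in for a unified multimodal model.
result = reason_reflect_refine(
    "a red cube left of a blue sphere",
    generate=lambda p: f"<image for: {p}>",
    critique=lambda p, img: "ok",
    is_satisfactory=lambda fb: fb == "ok",
)
```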
- 26.02 Retrieve and Segment: Are a Few Examples Enough to Bridge the Supervision Gap in Open-Vocabulary Segmentation? | Paperπ Codeπ₯οΈ
- Retrieval-augmented test-time adapter for open-vocabulary segmentation fusing textual prompts with pixel-annotated visual support features to narrow zero-shot vs. supervised gap. | Task: Reasoning Segmentation
- 26.02 Seg-ReSearch: Segmentation with Interleaved Reasoning and External Search | Paperπ Codeπ₯οΈ Datasetπ€
- Novel segmentation paradigm enabling interleaved reasoning and external search to overcome knowledge bottlenecks, with OK-VOS benchmark for open-knowledge video object segmentation. | Task: Reasoning Segmentation
- 26.02 Conversational Image Segmentation: Grounding Abstract Concepts with Scalable Supervision | Paperπ Codeπ₯οΈ
- CIS task grounding abstract intent-driven concepts into pixel-accurate masks beyond categorical queries, with ConverSeg benchmark, ConverSeg-Net model, and AI-powered scalable data engine. | Task: Reasoning Segmentation
- 26.01 Urban Socio-Semantic Segmentation with Vision-Language Reasoning | Paperπ Codeπ₯οΈ Modelπ€ Datasetπ€
- Vision-language reasoning framework for urban satellite segmentation identifying both physical and social categories via multi-stage reasoning. | Task: Reasoning Segmentation
- 26.01 SAMTok: Representing Any Mask with Two Words | Paperπ
- Efficient mask tokenization representing arbitrary segmentation masks with just two tokens, enabling reasoning-driven segmentation. | Task: Reasoning Segmentation
- 26.01 Towards Pixel-Level VLM Perception via Simple Points Prediction | Paperπ
- Enables pixel-level perception in VLMs through simple points prediction, bridging VLM reasoning and fine-grained spatial detection. | Task: Detection & Grounding
- 25.12 ReVSeg: Incentivizing the Reasoning Chain for Video Segmentation with Reinforcement Learning | Paperπ
- Uses RL to incentivize reasoning chains for improved video segmentation. | Task: Reasoning Segmentation
- 25.12 InSight-o3: Empowering Multimodal Foundation Models with Generalized Visual Search | Paperπ
- Enhances multimodal models with generalized visual search for improved grounding. | Task: Detection & Grounding
- 25.11 SAM 3: Segment Anything with Concepts | Paperπ
- Advances segmentation with concept-based reasoning. | Task: Reasoning Segmentation
- 25.10 Decomposed Attention Fusion in MLLMs for Training-Free Video Reasoning Segmentation | Paperπ
- Video reasoning and segmentation with multimodal models without training. | Task: Reasoning Segmentation
- 25.09 RefAM: Attention Magnets for Zero-Shot Referral Segmentation | Paperπ
- Zero-shot referral segmentation using attention-based visual reasoning. | Task: Reasoning Segmentation
- 25.07 UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence with Spatial Reasoning and Understanding | Paperπ Codeπ₯οΈ
- A multi-stage training framework that decouples spatial reasoning enhancement from domain knowledge learning, thereby improving performance across diverse urban tasks. | Task: Urban tasks
- 25.07 Fine-Grained Preference Optimization Improves Spatial Reasoning in VLMs | Paperπ
- A novel fine-grained preference optimization approach that significantly improves spatial reasoning capabilities in VLMs | Task: Spatial Tasks
- 25.06 Rex-Thinker: Grounded Object Referring via Chain-of-Thought Reasoning | Paperπ Codeπ₯οΈ Modelπ€
- A grounded model that reasons step-by-step, just like a human would | Task: Detection & Grounding
- 25.03 Visual-RFT: Visual Reinforcement Fine-Tuning | Paperπ Codeπ₯οΈ Datasetπ€
- Extend Reinforcement Fine-Tuning on visual tasks with GRPO | Task: Detection & Grounding & Classification
- 25.03 Boosting the Generalization and Reasoning of Vision Language Models with Curriculum Reinforcement Learning | Paperπ
- Improve generalization and reasoning of VLMs with GRPO | Task: Detection & Classification & Math
- 25.03 Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement | Paperπ Codeπ₯οΈ Modelπ€
- Address object detection and segmentation with GRPO (see the IoU-reward sketch after this list) | Task: Object Detection & Object Segmentation
- 24.08 (NeurIPS) Leveraging Hallucinations to Reduce Manual Prompt Dependency in Promptable Segmentation | Paperπ Codeπ₯οΈ
- Utilize hallucinations to mine task-related information from images and verify its accuracy for enhancing precision of the generated prompts. | Task: Reasoning Segmentation
- 24.07 (CVPR24) Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs | Paperπ
- Explore how instruction fine-tuning objectives could inject spatial awareness into V-LLMs | Task: Reasoning Localization
- 23.04 (AAAI24) Relax Image-Specific Prompt Requirement in SAM: A Single Generic Prompt for Segmenting Camouflaged Objects | Paperπ Codeπ₯οΈ
- Employ Cross-modal Chains of Thought Prompting (CCTP) to reason visual prompts using the semantic information given by a generic text prompt. | Task: Reasoning segmentation
- 23.12 (CVPR24) PixelLM: Pixel Reasoning with Large Multimodal Model | Paperπ Codeπ₯οΈ
- An effective and efficient LMM for pixel-level reasoning and understanding | Task: Reasoning Segmentation
- 23.08 (CVPR24) LISA: Reasoning Segmentation via Large Language Model | Paperπ Codeπ₯οΈ Datasetπ€
- Inherit the language generation capabilities of the MLLM while also possessing the ability to produce segmentation masks. | Task: Reasoning Segmentation
- 26.02 VidEoMT: Your ViT is Secretly Also a Video Segmentation Model | Paperπ Codeπ₯οΈ Modelπ€
- Lightweight encoder-only video segmentation on plain ViT with query propagation and fusion, achieving 160 FPS with ViT-L without dedicated tracking modules. | Task: Reasoning Segmentation
- 26.02 Learning Cross-View Object Correspondence via Cycle-Consistent Mask Prediction | Paperπ Codeπ₯οΈ
- Conditional binary segmentation with cycle-consistency training for object-level correspondence across egocentric/exocentric viewpoints without ground-truth annotations (CVPR 2026). | Task: Reasoning Segmentation
- 24.08 (ECCV24) VISA: Reasoning Video Object Segmentation via Large Language Model | Paperπ Codeπ₯οΈ Datasetπ€
- Leverage the world knowledge reasoning capabilities of MLLMs while possessing the ability to segment and track objects in videos with a mask decoder | Task: Reasoning Segmentation
- 24.07 (NeurIPS24) One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos | Paperπ Codeπ₯οΈ Modelπ€
- Integrating a Sparse Dense Sampling strategy into the video-LLM to balance temporal context and spatial detail within computational constraints | Task: Reasoning Segmentation
- 24.01 (CVPR24) OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding | Paperπ Codeπ₯οΈ
- A transformer-based encoder-decoder architecture with task-specific queries and outputs for multiple tasks | Task: Reasoning Segmentation/Detection
- 25.07 Towards Omnimodal Expressions and Reasoning in Referring Audio-Visual Segmentation
- 24.08 Unleashing the Temporal-Spatial Reasoning Capacity of GPT for Training-Free Audio and Language Referenced Video Object Segmentation
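Seg-Zero above optimizes segmentation with GRPO, and rule-based rewards in this line of work are typically built from mask overlap. The function below is a plain IoU reward between a predicted and a ground-truth binary mask, a generic choice shown for illustration rather than any specific paper's exact reward.

```python
import numpy as np

def mask_iou_reward(pred_mask: np.ndarray, gt_mask: np.ndarray, eps: float = 1e-6) -> float:
    """IoU between two binary masks, usable as a verifiable RL reward."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(intersection / (union + eps))

# Toy usage on 4x4 masks: overlap is 4 pixels, union is 9 pixels.
pred = np.zeros((4, 4)); pred[1:3, 1:3] = 1
gt = np.zeros((4, 4)); gt[1:4, 1:4] = 1
print(mask_iou_reward(pred, gt))  # ~0.444
```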
- 26.02 GeoWorld: Geometric World Models | Paperπ
- Hyperbolic JEPA preserving latent state structures for improved long-horizon world model prediction and Geometric RL planning (CVPR 2026). | Task: Spatial Reasoning
- 26.02 When and How Much to Imagine: Adaptive Test-Time Scaling with World Models for Visual Spatial Reasoning | Paperπ Codeπ₯οΈ
- AVIC adaptively invokes visual imagination via world models to match or outperform fixed imagination strategies for spatial reasoning with far fewer calls. | Task: Spatial Reasoning
- 26.02 Theory of Space: Can Foundation Models Construct Spatial Beliefs through Active Exploration? | Paperπ Codeπ₯οΈ Datasetπ€
- Evaluates VLMs' ability to construct spatial beliefs through active exploration, revealing an Active-Passive Gap and Belief Inertia: VLMs fail to update stale spatial priors. | Task: Spatial Understanding
- 26.02 SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild? | Paperπ Codeπ₯οΈ Datasetπ€
- Benchmark of 1,400 VQA pairs across six spatial reasoning categories revealing VLMs achieve only ~55% vs. 87.6% human accuracy (ICLR 2026). | Task: Spatial Reasoning
- 26.02 LangMap: A Hierarchical Benchmark for Open-Vocabulary Goal Navigation | Paperπ Codeπ₯οΈ
- Multi-granularity open-vocabulary navigation task with 414 object categories and 18K+ navigation tasks across scene, room, region, and instance levels. | Task: Spatial Grounding & Navigation
- 26.02 GeoAgent: Learning to Geolocate Everywhere with Reinforced Geographic Characteristics | Paperπ Codeπ₯οΈ Modelπ€ Datasetπ€
- Geolocation reasoning model using RL with geo-similarity and consistency rewards over GeoSeek dataset, enabling fine-grained address-level localization with human-like reasoning. | Task: Spatial Reasoning
- 26.01 CoV: Chain-of-View Prompting for Spatial Reasoning | Paperπ Codeπ₯οΈ
- Training-free test-time reasoning framework transforming VLMs into active viewpoint reasoners through coarse-to-fine 3D exploration, +11.56% on OpenEQA. | Task: Spatial Reasoning
- 26.01 Think3D: Thinking with Space for Spatial Reasoning | Paperπ
- Framework for spatial reasoning enabling models to reason in 3D space for improved visual understanding tasks. | Task: Spatial Reasoning & 3D Understanding
- 25.12 SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL | Paperπ
- Tool-augmented spatial reasoning using double interactive reinforcement learning. | Task: Spatial Reasoning
- 25.12 COOPER: A Unified Model for Cooperative Perception and Reasoning in Spatial Intelligence | Paperπ
- Unified model combining cooperative perception with spatial intelligence reasoning. | Task: Spatial Reasoning
- 25.11 SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards | Paperπ
- Uses reinforcement learning with spatial rewards to improve 3D reasoning in MLLMs. | Task: Spatial Reasoning & 3D Understanding
- 25.11 G2VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning | Paperπ
- Unifies 3D reconstruction and spatial reasoning in a geometry-grounded VLM. | Task: Spatial Reasoning & 3D Understanding
- 25.10 SpaceVista: All-Scale Visual Spatial Reasoning from mm to km | Paperπ
- Spatial reasoning across multiple scales in visual understanding. | Task: Spatial Reasoning
- 25.08 3D-R1: Enhancing Reasoning in 3D VLMs for Unified Scene Understanding | Paperπ
- Enhances reasoning capabilities of 3D vision-language models for unified 3D scene understanding. | Task: Spatial Reasoning & 3D Understanding
- 25.04 Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation | Paperπ Projectπ Codeπ₯οΈ
- A framework for perspective-aware reasoning in vision-language models (VLMs) through mental imagery simulation. | Task: Spatial Reasoning & Understanding
- 25.04 Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning | Paperπ Projectπ Codeπ₯οΈ Datasetπ€
- Introduce a combined RL and SFT training paradigm to enhance visual reasoning capabilities in multimodal models. | Task: Spatial Reasoning & Understanding
- 25.04 InteractVLM: 3D Interaction Reasoning from 2D Foundational Models | Paperπ Codeπ»
- Harnesses the broad visual knowledge of large Vision-Language Models (VLMs), fine-tuned with limited 3D contact data. | Task: 3D Reconstruction
- 25.03 Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks | Paperπ Codeπ» Projectπ Datasetπ€
- A model that extends O1-style reasoning to interactive embodied tasks. | Task: Interactive Embodied Tasks
- 25.03 VisualThinker-R1-Zero | Paperπ Codeπ»
- R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model | Task: Counting & Reasoning & 3D Understanding (CV-Bench)
- 25.03 (CVPR2025)GFlowVLM: Enhancing Multi-step Reasoning in Vision-Language Models with Generative Flow Networks | Paperπ
- Fine-tune VLMs using GFlowNet to promote generation of diverse solutions. | Task: NumberLine (NL) & BlackJack (BJ)
- 25.02 R1-V: Reinforcing Super Generalization Ability in Vision Language Models with Less Than $3 | Codeπ₯οΈ
- An open-source project for VLM reasoning with GRPO (see the reward sketch after this list) | Task: Counting, Number Related Reasoning and Geometry Reasoning
- 25.01 Imagine while Reasoning in Space: Multimodal Visualization-of-Thought | Paperπ
- Enables visual thinking in MLLMs by generating image visualizations of their reasoning traces. | Task: Spatial Reasoning
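R1-V above trains counting and other number-related reasoning with GRPO and simple verifiable rewards. A minimal exact-match numeric reward of the kind such recipes rely on is sketched below; the `<answer>` tag format is an assumption about how the model is prompted, not R1-V's actual parser.

```python
import re

def counting_reward(model_output: str, gt_count: int) -> float:
    """Return 1.0 if the final numeric answer matches the ground truth, else 0.0."""
    match = re.search(r"<answer>\s*(\d+)\s*</answer>", model_output)
    if match is None:
        return 0.0
    return 1.0 if int(match.group(1)) == gt_count else 0.0

print(counting_reward("I count the apples one by one... <answer>7</answer>", 7))  # 1.0
```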
- 26.02 TimeChat-Captioner: Scripting Multi-Scene Videos with Time-Aware and Structural Audio-Visual Captions | Paperπ Codeπ₯οΈ Modelπ€
- Omni Dense Captioning with six-dimensional structural schema generating time-aware audio-visual narratives with explicit timestamps, surpassing Gemini-2.5-Pro on the task. | Task: Temporal Grounding/Understanding
- 26.02 4RC: 4D Reconstruction via Conditional Querying Anytime and Anywhere | Paperπ
- 4D dynamic scene reconstruction framework with conditional querying at arbitrary space-time locations for flexible spatiotemporal understanding of dynamic scenes. | Task: Spatial-Temporal Understanding
- 26.02 Learning Situated Awareness in the Real World | Paperπ
- SAW-Bench evaluates egocentric situated awareness using 786 real-world videos from smart glasses with 2,071+ QA pairs, revealing a 37.66% human-model performance gap in observer-centric spatial reasoning. | Task: Temporal Grounding/Understanding
- 26.02 MoRL: Reinforced Reasoning for Unified Motion Understanding and Generation | Paperπ Codeπ₯οΈ
- Unified multimodal motion model combining SFT and RL with Chain-of-Motion (CoM) reasoning and large-scale CoT datasets for human motion understanding and generation. | Task: Spatial-Temporal Understanding
- 26.01 VideoLoom: A Video Large Language Model for Joint Spatial-Temporal Understanding | Paperπ Codeπ₯οΈ Modelπ€
- Unified Video LLM for joint spatial-temporal understanding with LoomData-8.7k dataset and LoomBench benchmark. | Task: Spatial-Temporal Understanding
- 26.01 VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice | Paperπ Codeπ₯οΈ
- Video understanding framework with "reason-when-necessary" strategy using confidence-based reasoning activation, reducing response length 3.3x. | Task: Video Understanding & Reasoning
- 26.01 Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding | Paperπ Codeπ₯οΈ
- Open-source video-language model family with point-driven grounding and video tracking capabilities surpassing Gemini 3 Pro on grounding. | Task: Spatial Understanding & Grounding
- 26.01 PROGRESSLM: Towards Progress Reasoning in Vision-Language Models | Paperπ Codeπ₯οΈ Modelπ€ Datasetπ€
- Addresses task progress estimation in VLMs with Progress-Bench benchmark and ProgressLM-3B model. | Task: Temporal Reasoning & Understanding
- 26.01 HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding | Paperπ
- Efficient streaming video understanding via hierarchical KV cache memory enabling temporal reasoning over long videos. | Task: Temporal Reasoning
- 25.12 4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation | Paperπ
- Region-level 4D (3D + temporal) understanding through perceptual distillation. | Task: Spatial-Temporal Understanding
- 25.12 MMSI-Video-Bench: A Holistic Benchmark for Video-Based Spatial Intelligence | Paperπ
- Comprehensive benchmark for evaluating spatial intelligence in video understanding. | Task: Spatial-Temporal Understanding
- 25.11 VLA-4D: Embedding 4D Awareness into Vision-Language-Action Models for SpatioTemporally Coherent Robotic Manipulation | Paperπ
- Incorporates 4D spatiotemporal awareness into VLA models for coherent robotic manipulation. | Task: Spatial-Temporal Understanding
- 25.10 Trace Anything: Representing Any Video in 4D via Trajectory Fields | Paperπ
- 4D spatial-temporal representation learning from video. | Task: Spatial-Temporal Understanding
- 25.08 VLM4D: Towards Spatiotemporal Awareness in Vision Language Models | Paperπ
- Extends VLMs with spatiotemporal reasoning for understanding spatial and temporal dynamics. | Task: Spatial-Temporal Understanding
- 25.05 MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding | Paperπ Codeπ»
- 25.04 VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning | Paperπ Codeπ»
- A novel spatio-temporal perception framework with GRPO | Task: Spatial Understanding and Grounding
- 25.04 VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search | Paperπ Codeπ»
- A novel framework that seamlessly integrates visuospatial and linguistic domains | Task: Geometry and Spatial Reasoning
- 25.04 Improved Visual-Spatial Reasoning via R1-Zero-Like Training | Paperπ Codeπ»
- Incorporate GRPO training for improved visual-spatial reasoning, using the carefully curated VSI-100k dataset. | Task: Video Understanding
- 25.03 Evolving Temporal Reasoning Capability into LMMs via Temporal Consistent Reward | Codeπ» Modelπ€
- Investigate the potential of GRPO in the video temporal grounding task, which demands precise temporal alignment between visual and linguistic modalities as well as advanced reasoning capabilities | Task: Temporal Grounding
- 25.03 TimeZero: Temporal Video Grounding with Reasoning-Guided LVLM | Paperπ Codeπ» Modelπ€
- A reasoning-guided MLLM for temporal video grounding, trained with GRPO (see the temporal-IoU sketch after this list). | Task: Temporal Grounding
- 25.03 LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding | Paperπ Codeπ»
- A MLLM for fine-grained spatial-temporal multimodal understanding. | Task: Spatial-Temporal Understanding
- 25.03 MetaSpatial: Reinforcing 3D Spatial Reasoning in VLMs for the Metaverse | Codeπ₯οΈ
- Enhance spatial reasoning in VLMs using GRPO | Task: 3D Spatial Reasoning
- 25.02 Video-R1: Towards Super Reasoning Ability in Video Understanding | Codeπ₯οΈ
- Integrate deep thinking capabilities into video understanding tasks through the R1 paradigm | Task: Video Counting
- 24.12 TIMEREFINE: Temporal Grounding with Time Refining Video LLM | Paperπ | Codeπ₯οΈ
- Enhance Video LLMs to handle the temporal grounding task by modifying the learning objective | Task: Temporal Grounding
- 24.11 (CVPR2025) Number it: Temporal Grounding Videos like Flipping Manga | Paperπ | Codeπ»
- Enhances Video-LLMs by overlaying frame numbers onto video frames | Task: Temporal Grounding
- 24.11 TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability | Paperπ | Codeπ»
- A versatile Video-LLM featuring robust temporal localization abilities | Task: Temporal Grounding and Video QA
- 24.08 (AAAI2025) Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos | Paperπ | Codeπ»
- Leverage the world knowledge reasoning capabilities of MLLMs to retrieve temporal evidence in the video with flexible grounding tokens. | Task: Multi-Hop VideoQA
- 24.08 (ICLR2025) TRACE: Temporal Grounding Video LLM via Causal Event Modeling | Paperπ | Codeπ»
- Tailored to implement the causal event modeling framework through timestamps, salient scores, and textual captions. | Task: Temporal Grounding
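Several entries above (e.g., TimeZero and the temporal-consistent-reward work) apply GRPO to temporal grounding, where the natural verifiable signal is the overlap between a predicted and a ground-truth time span. A generic temporal-IoU reward is sketched below for illustration; it is not claimed to be any paper's exact reward definition.

```python
def temporal_iou(pred_span: tuple[float, float], gt_span: tuple[float, float]) -> float:
    """IoU between two (start, end) time spans in seconds."""
    p_start, p_end = pred_span
    g_start, g_end = gt_span
    inter = max(0.0, min(p_end, g_end) - max(p_start, g_start))
    union = (p_end - p_start) + (g_end - g_start) - inter
    return inter / union if union > 0 else 0.0

print(temporal_iou((12.0, 20.0), (15.0, 25.0)))  # 5 / 13 ~ 0.385
```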
- 25.07 Towards Spatial Audio Understanding via Question Answering
- 24.06 (InterSpeech 2024) Can Large Language Models Understand Spatial Audio?
- 24.02 (ICML 2024) BAT: Learning to Reason about Spatial Sounds with Large Language Models
- 26.02 P1-VL: Bridging Visual Perception and Scientific Reasoning in Physics Olympiads | Paperπ Codeπ₯οΈ Modelπ€
- Open-source VLM family for advanced scientific reasoning using curriculum RL and agentic augmentation; the first open-source model to win 12 gold medals at physics-olympiad level. | Task: Math
- 26.02 DeepVision-103K: A Visually Diverse, Broad-Coverage, and Verifiable Mathematical Dataset for Multimodal Reasoning | Paperπ Codeπ₯οΈ Datasetπ€
- 103K-sample RLVR training dataset for multimodal K12 mathematical reasoning with diverse topics and rich visual elements, generalizing to general multimodal reasoning tasks. | Task: Math
- 26.02 Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models | Paperπ Codeπ₯οΈ
- Multimodal deep-research paradigm enabling multi-turn, multi-entity, multi-scale visual and textual search via cold-start supervision and RL. | Task: Math
- 26.02 LaViDa-R1: Advancing Reasoning for Unified Multimodal Diffusion Language Models | Paperπ
- Multimodal reasoning diffusion LM using SFT + multi-task RL with answer-forcing, tree search, and complementary likelihood estimation for visual math reasoning and image editing. | Task: Math
- 26.02 Embed-RL: Reinforcement Learning for Reasoning-Driven Multimodal Embeddings | Paperπ Codeπ₯οΈ
- Reasoning-driven multimodal embedding framework using Embedder-Guided RL (EG-RL) to optimize Traceability Chain-of-Thought generation for improved cross-modal semantic consistency. | Task: Math
- 26.01 CogFlow: Bridging Perception and Reasoning through Knowledge Internalization for Visual Mathematical Problem Solving | Paperπ Projectπ
- Cognitive-inspired three-stage framework (Perception-Internalization-Reasoning) for visual math with MathCog dataset of 120K+ annotations. | Task: Math
- 26.01 MindWatcher: Toward Smarter Multimodal Tool-Integrated Reasoning | Paperπ
- Multimodal tool-integrated reasoning framework enhancing chain-of-thought with tool use for complex math/science problems. | Task: Math
- 26.01 MMFormalizer: Multimodal Autoformalization in the Wild | Paperπ
- Framework for automatically formalizing multimodal mathematical content from images and text into formal representations. | Task: Math
- 25.11 MathSE: Improving Multimodal Mathematical Reasoning via Self-Evolving Iterative Reflection and Reward-Guided Fine-Tuning | Paperπ
- Improves multimodal math reasoning via iterative self-evolution and reward-guided training. | Task: Math
- 25.10 Training Vision-Language Process Reward Models for Test-Time Scaling in Multimodal Reasoning | Paperπ
- Process reward models for scaling multimodal reasoning at test time. | Task: Math
- 25.09 BaseReward: A Strong Baseline for Multimodal Reward Model | Paperπ
- Strong baseline reward model for multimodal RL-based alignment. | Task: Math
- 25.08 MathReal: A Real Scene Benchmark for Evaluating Math Reasoning in MLLMs | Paperπ
- Benchmark for evaluating multimodal math reasoning using real-world scene photographs. | Task: Math
- 25.11 Perceptual-Evidence Anchored Reinforced Learning for Multimodal Reasoning | Paperπ Codeπ₯οΈ Modelπ€
- Introduce a perception checklist to anchor RL policy updates in verified visual evidence and prevent hallucinations. | Task: Math
- 25.11 Metis-HOME: Hybrid Optimized Mixture-of-Experts for Multimodal Reasoning | Paperπ Codeπ₯οΈ Modelπ€
- Use a mixture-of-experts framework with dynamic routing for balancing complex reasoning and general tasks. | Task: Math
- 25.10 Metis-SPECS: Decoupling Multimodal Learning via Self-distilled Preference-based Cold Start | Paperπ Codeπ₯οΈ
- Replace supervised fine-tuning with self-distilled, preference-based cold starts to improve RL generalization. | Task: Math
- 25.09 DeepSketcher: Internalizing Visual Manipulation for Multimodal Reasoning | Paperπ Codeπ₯οΈ
- Internalize visual reasoning by directly manipulating visual embeddings using code-rendered trajectories, bypassing external tools and reducing grounding noise. | Task: Math
- 25.07 The Synergy Dilemma of Long-CoT SFT and RL: Investigating Post-Training Techniques for Reasoning VLMs | Paperπ Codeπ₯οΈ
- A systematic investigation into the distinct roles and interplay of long-CoT SFT and RL across multiple multimodal reasoning benchmarks. | Task: Math
- 25.06 Metis-RISE: RL Incentivizes and SFT Enhances Multimodal Reasoning Model Learning | Paperπ Codeπ₯οΈ Modelπ€
- Reverse the training pipeline by first using RL for reasoning exploration, then applying SFT with self-distilled and expert-augmented trajectories for stability and capability enhancement. | Task: Math
- 25.06 SynthRL: Scaling Visual Reasoning with Verifiable Data Synthesis | Paperπ Codeπ₯οΈ Modelπ€
- A novel framework that enhances the reasoning capabilities of multimodal large language models. | Task: Math
- 25.06 SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning | Paperπ Codeπ₯οΈ Modelπ€
- Scale the training data with correctness and distribution guarantees to achieve better performance. | Task: Math
- 25.05 Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO | Paperπ Codeπ₯οΈ
- An unsupervised post-training method for multi-modal LLM reasoning via GRPO. | Task: Math
- 25.05 X-Reasoner: Towards Generalizable Reasoning Across Modalities and Domains | Paperπ Codeπ₯οΈ
- A training recipe that optimizes the reasoning capability of VLMs with SFT and RL on general-domain text-only data. | Task: Math
- 25.04 NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation | Paperπ Codeπ₯οΈ Modelπ€
- Introduces targeted rollout diversity by mixing rollouts from both clean and moderately distorted images, encouraging the model to learn more robust behaviors. | Task: Math
- 25.04 VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning | Paperπ Codeπ₯οΈ Modelπ€
- Aim to enhance the slow-thinking capabilities of vision-language models using reinforcement learning (without relying on distillation) to advance the SOTA. | Task: Math
- 25.04 SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement | Paperπ Codeπ₯οΈ
- Propose a novel way of repurposing Monte Carlo Tree Search (MCTS) to enable effective data filtering. | Task: Math reasoning
- 25.04 GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning | Paperπ Projectπ Codeπ₯οΈ
- A generative process reward model that performs explicit CoT reasoning with code verification before providing judgment for each reasoning step. | Task: Math
- 25.03 OpenVLThinker: An Early Exploration to Vision-Language Reasoning via Iterative Self-Improvement | Paperπ Codeπ₯οΈ Datasetπ€
- Investigate whether R1-like reasoning capabilities can be successfully integrated into LVLMs and assesses their impact on challenging multimodal reasoning tasks. | Task: Math
- 25.03 R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization | Paperπ Codeπ₯οΈ Datasetπ€
- Design Step-wise Group Relative Policy Optimization (StepGRPO) that enables MLLMs to self-improve reasoning ability. | Task: Math
- 25.03 LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL | Paperπ Codeπ₯οΈ Datasetπ€
- A two-stage rule-based RL framework that efficiently enhances reasoning capabilities | Task: Math & Sokoban
- 25.03 VisualPRM: An Effective Process Reward Model for Multimodal Reasoning | Paperπ Codeπ₯οΈ Datasetπ€
- Improve the reasoning abilities of existing MLLMs with Best-of-N evaluation strategies | Task: Math & MMMU
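For readers unfamiliar with Best-of-N selection under a process reward model, here is a minimal, framework-agnostic sketch; the mean-aggregation choice and the variable names are illustrative assumptions, not VisualPRM's exact recipe.

```python
def best_of_n(candidates, step_scores):
    """Return the candidate whose reasoning steps a PRM rates highest.

    candidates[i] is a full response; step_scores[i] holds the per-step
    scores the process reward model assigned to it. Mean aggregation is one
    common choice; min or product of step scores are also used in practice.
    """
    aggregate = lambda scores: sum(scores) / max(len(scores), 1)
    best = max(range(len(candidates)), key=lambda i: aggregate(step_scores[i]))
    return candidates[best]

# Toy usage with three sampled answers and made-up step scores.
print(best_of_n(["A", "B", "C"], [[0.2, 0.4], [0.9, 0.8, 0.7], [0.6]]))  # -> "B"
```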
- 25.03 R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization | Paperπ Codeπ₯οΈ Datasetπ€
- A multimodal reasoning model that bridges the gap between multimodal capabilities and reasoning abilities with GRPO | Task: Math
- 25.03 MMR1: Advancing the Frontiers of Multimodal Reasoning | Codeπ₯οΈ
- A large multimodal model specialized in mathematical tasks, trained with GRPO | Task: Math
- 25.03 Boosting the Generalization and Reasoning of Vision Language Models with Curriculum Reinforcement Learning | Paperπ
- Improve generalization and reasoning of VLMs with GRPO | Task: Detection & Classification & Math
- 25.03 Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models | Paperπ Codeπ₯οΈ
- Improve reasoning ability of MLLM with GRPO | Task: Math
- 25.03 MM-EUREKA: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning | Paperπ Codeπ₯οΈ Datasetπ€
- Extend large-scale rule-based reinforcement learning to multimodal reasoning | Task: Math
- 25.03 EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework | Codeπ₯οΈ
- A Multimodal GRPO training framework | Task: Math
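Several entries above (StepGRPO, MM-EUREKA, EasyR1, and others) build on GRPO-style training, where advantages are computed relative to a group of rollouts for the same prompt instead of a learned value function. A minimal sketch of that group-relative advantage, with a made-up rule-based reward, is shown below.

```python
import numpy as np

def grpo_advantages(group_rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Normalize each rollout's reward against the other rollouts sampled
    for the same prompt (no value network needed)."""
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + eps)

# Example: 8 rollouts for one math question; reward 1 when the final answer
# is verified correct by a rule-based checker, 0 otherwise.
rewards = np.array([1, 0, 0, 1, 0, 0, 0, 1], dtype=float)
print(grpo_advantages(rewards))
```

Step-wise variants such as StepGRPO apply, roughly speaking, the same group normalization to per-step rewards rather than to a single outcome reward.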
- 25.02 [Qwen2.5-VL] Qwen2.5-VL Technical Report | Paperπ Codeπ₯οΈ Huggingfaceπ€
- The latest flagship model of the Qwen vision-language series for various multimodal tasks | Task: Reasoning & Understanding
- 25.02 Multimodal Open R1 | Codeπ₯οΈ
- An open-source repository for reproducing video R1. | Task: Math
- 25.02 Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking | Paperπ
- An automated structured thinking paradigm for multimodal reasoning via Monte Carlo Tree Search | Task: Math
- 25.02 MM-Verify: Enhancing Multimodal Reasoning with Chain-of-Thought Verification | Paperπ Codeπ₯οΈ
- Enhance multimodal reasoning through longer inference and more robust verification. | Task: Math
- 25.01 Kimi k1.5: Scaling Reinforcement Learning with LLMs (MoonshotAI) | Projectπ
- The latest flagship model of the Kimi series for various multimodal tasks | Task: Reasoning & Understanding
- 25.01 Virgo: A Preliminary Exploration on Reproducing o1-like MLLM | Paperπ Codeπ₯οΈ Modelπ€
- An o1-like MLLM for multimodal reasoning | Task: Math & MMMU
- 26.02 OCR-Agent: Agentic OCR with Capability and Memory Reflection | Paperπ Codeπ₯οΈ
- Iterative self-correction framework using Capability Reflection (error diagnosis) and Memory Reflection (avoiding repeated attempts), achieving SOTA on OCRBench v2 without training. | Task: Document Reasoning
- 26.02 OmniOCR: Generalist OCR for Ethnic Minority Languages | Paperπ Codeπ₯οΈ
- Universal OCR framework using Dynamic LoRA for low-resource ethnic minority scripts, achieving 39-66% accuracy improvements on Tibetan, Shui, and other scripts. | Task: Document Reasoning
- 26.02 DODO: Discrete OCR Diffusion Models | Paperπ
- Adapts block discrete diffusion for OCR enabling parallel token processing, achieving up to 3Γ faster inference while maintaining near-SOTA accuracy. | Task: Document Reasoning
- 26.02 PaddleOCR-VL-1.5: Towards a Multi-Task 0.9B VLM for Robust In-the-Wild Document Parsing | Paperπ
- Compact 0.9B VLM for multi-task document parsing in diverse real-world conditions covering OCR, layout understanding, and chart comprehension. | Task: Document Reasoning
- 26.02 ExStrucTiny: A Benchmark for Schema-Variable Structured Information Extraction from Document Images | Paperπ
- Benchmark unifying key entity extraction, relation extraction, and VQA for structured information extraction from document images, evaluating VLMs on schema adaptation and answer localization. | Task: Document Reasoning
- 26.02 MemOCR: Layout-Aware Visual Memory for Efficient Long-Horizon Reasoning | Paperπ
- Layout-aware visual memory mechanisms for MLLMs to improve long-horizon document and OCR reasoning efficiency. | Task: Document Reasoning
- 26.01 ChartVerse: Scaling Chart Reasoning via Reliable Programmatic Synthesis from Scratch | Paperπ Codeπ₯οΈ Modelπ€ Datasetπ€
- Scalable chart reasoning framework using Rollout Posterior Entropy; ChartVerse-8B surpasses its teacher model Qwen3-VL-30B. | Task: Chart Reasoning
- 25.10 From Charts to Code: A Hierarchical Benchmark for Multimodal Models | Paperπ
- Benchmark for chart understanding and code generation from charts. | Task: Chart Reasoning
- 25.09 Visual-TableQA: Open-Domain Benchmark for Reasoning over Table Images | Paperπ
- Benchmark for visual question answering and reasoning over table images. | Task: Chart Reasoning
- 25.09 MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing | Paperπ
- Efficient VLM for parsing and understanding high-resolution documents. | Task: Document Reasoning
- 25.09 Visual Programmability: A Guide for Code-as-Thought in Chart Understanding | Paperπ Codeπ₯οΈ
- Introduce an adaptive framework that enables VLMs to dynamically choose between code-based and visual reasoning pathways for chart understanding. | Task: Chart Reasoning
- 25.07 Chart-R1: Chain-of-Thought Supervision and Reinforcement for Advanced Chart Reasoner | Paperπ Codeπ₯οΈ Modelπ€
- Combine chain-of-thought supervision with reinforcement learning, supported by programmatically synthesized step-by-step reasoning data. | Task: Chart Reasoning
- 25.06 ChartReasoner: Code-Driven Modality Bridging for Long-Chain Reasoning in Chart Question Answering | Paperπ
- Combine chart code generation with long-chain reasoning LLMs to produce detailed reasoning processes. | Task: Chart Reasoning
- 25.05 Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning | Paperπ
- Introduce a visually grounded chain-of-thought (CoT) paradigm, enabling the model to generate CoT reasoning aligned with visual elements. | Task: Chart Reasoning
- 25.04 Bespoke-MiniChart-7B: Pushing The Frontiers Of Open VLMs For Chart Understanding | Projectπ Modelπ€
- Employ a three-stage training process, combining rejection sampling and DPO optimization to enhance out-of-distribution generalization. | Task: Chart Reasoning
- 25.03 MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding | Paperπ Codeπ₯οΈ
- Integrate text and image retrieval through various agents, enabling collaborative reasoning across modalities. | Task: Document Reasoning
- 24.11 ChartGemma: Visual Instruction-tuning for Chart Reasoning in the Wild | Paperπ Codeπ₯οΈ Modelπ€ Datasetπ€
- Generate multi-task instruction-tuning data from real chart images and integrate both CoT and PoT reasoning pathways. | Task: Chart Reasoning
- 24.09 (ICLR25 Oral) ChartMoE: Mixture of Diversely Aligned Expert Connector for Chart Understanding | Paperπ Codeπ₯οΈ Modelπ€ Datasetπ€
- Utilize diverse chart-text aligned tasks (chart -> table/json/python-code) to augment chart understanding and reasoning. | Task: Chart Reasoning
- 24.09 ChartX & ChartVLM: A Versatile Benchmark and Foundation Model for Complicated Chart Reasoning | Projectπ Paperπ Codeπ₯οΈ
- Offer a new perspective on handling chart reasoning tasks that strongly depend on interpretable patterns. | Task: Chart Reasoning
- 24.07 (EMNLP24) Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model | Paperπ Projectπ Codeπ₯οΈ Datasetπ€
- A multimodal self-instruct pipeline that utilizes large language models and their code capabilities to synthesize massive abstract images and visual reasoning instructions across daily scenarios. | Task: Chart Reasoning
- 24.04 (EMNLP24) TinyChart: Efficient Chart Understanding with Visual Token Merging and Program-of-Thoughts Learning | Paperπ Codeπ₯οΈ Modelπ€ Datasetπ€
- Employ PoT learning for numerical reasoning and Vision Token Merging to compress visual features from high-resolution images. | Task: Chart Reasoning
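Program-of-Thoughts (PoT), used by TinyChart and ChartGemma above, has the model emit a short program over extracted chart values and executes it to obtain the numeric answer. A toy, self-contained illustration (the chart values, the question, and the generated program are all made up):

```python
# Values the model is assumed to have read off a chart (made up for illustration).
chart_values = {"2019": 42.0, "2020": 55.5, "2021": 61.0}

# A program the model might emit for "What is the % growth from 2019 to 2021?"
generated_program = """
growth = (chart_values['2021'] - chart_values['2019']) / chart_values['2019'] * 100
answer = round(growth, 1)
"""

scope = {"chart_values": chart_values}
exec(generated_program, scope)   # real systems execute this inside a sandbox
print(scope["answer"])           # 45.2
```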
- 24.04 (MM24) OneChart: Purify the Chart Structural Extraction via One Auxiliary Token | Paperπ Projectπ Codeπ₯οΈ Modelπ€
- Introduce an auxiliary token and decoder combined with a customized L1 loss to enhance the reliability of structured and numerical information extraction. | Task: Chart Reasoning
- 24.04 (MM24) NovaChart: A Large-scale Dataset towards Chart Understanding and Generation of Multimodal Large Language Models | Paperπ Codeπ₯οΈ Datasetπ€
- Construct a large-scale dataset for chart understanding and generation, covering 18 different chart types and 15 unique tasks. | Task: Chart Reasoning
- 24.02 (ACL24) ChartAssisstant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning | Paperπ Codeπ₯οΈ Datasetπ€
- Use large-scale chart data for chart-to-table alignment and multitask instruction tuning | Task: Chart Reasoning
- 23.11 ChartLlama: A Multimodal LLM for Chart Understanding and Generation | Paperπ Projectπ Codeπ₯οΈ Modelπ€ Datasetπ€
- Generate a diverse and high-quality instruction-tuning dataset using GPT-4, and use LLaVA for unified multi-task training. | Task: Chart Reasoning
- 23.10 (EMNLP23) UniChart: A Universal Vision-language Pretrained Model for Chart Comprehension and Reasoning | Paperπ Codeπ₯οΈ Modelπ€ Datasetπ€
- Pretrains on a large and diverse chart dataset, explicitly modeling visual elements and structures. | Task: Chart Reasoning
- 25.11 (EMNLP25) ChartM3: A Multi-Stage Code-Driven Pipeline for Constructing Multi-Dimensional and Multi-Step Visual Reasoning Data in Chart Comprehension | Paperπ
- Provide an evaluation set of 2,871 high-quality samples covering 62 chart types and 60 real-world scenarios, focusing on multi-dimensional and multi-step visual reasoning and complex business analysis. | Task: Chart Reasoning
- 25.05 ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models | Paperπ Projectπ Codeπ₯οΈ Datasetπ€
- Feature real-world chart images and four distinct question types that assess textual, visual, combined, and synthesis reasoning abilities. | Task: Chart Reasoning
- 25.04 CHARTQAPRO: A More Diverse and Challenging Benchmark for Chart Question Answering | Paperπ Codeπ₯οΈ Datasetπ€
- Introduce a diverse benchmark with 1,341 charts and 1,948 questions covering various chart types and question formats, designed to rigorously evaluate the chart reasoning capabilities of large vision-language models in real-world scenarios. | Task: Chart Reasoning
- 25.01 (AAAI25) EvoChart: A Benchmark and a Self-Training Approach Towards Real-World Chart Understanding | Paperπ Codeπ₯οΈ Datasetπ€
- Feature 650 real-world charts, 1,250 expert-curated questions, and strict and flexible automatic evaluation metrics to assess chart comprehension abilities of VLMs in practical scenarios. | Task: Chart Reasoning
- 24.06 (NIPS24) CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs | Paperπ Projectπ Codeπ₯οΈ Datasetπ€
- Focus on real and complex charts from arXiv papers, covering eight major domains. All content is expert-curated and verified, with evaluation using GPT-4o scoring and binary correctness metrics. | Task: Chart Reasoning
- 24.06 (VRISP25) ChartBench: A Benchmark for Complex Visual Reasoning in Charts | Paperπ Projectπ Codeπ₯οΈ Datasetπ€
- Cover 9 major categories and 42 subcategories of charts without data point annotations, emphasizing numerical extraction ability. | Task: Chart Reasoning
- 24.04 (NAACL24) MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning | Paperπ Codeπ₯οΈ Datasetπ€
- Propose a comprehensive human-annotated benchmark with nine distinct tasks evaluating reasoning capabilities over various charts, and support both GPT-4 scoring and multiple-choice exact matching. | Task: Chart Reasoning
- 22.05 (ACL22) ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning | Paperπ Codeπ₯οΈ Datasetπ€
- Use real-world charts and open-ended questions to evaluate chart understanding, reasoning, and data extraction, with relaxed accuracy as the metric. | Task: Chart Reasoning
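ChartQA's "relaxed accuracy" metric is referenced by several of the chart benchmarks above; it is commonly implemented as exact match for textual answers and a small relative tolerance (typically 5%) for numeric ones. A minimal sketch under that assumption:

```python
def relaxed_match(pred: str, gold: str, tol: float = 0.05) -> bool:
    """Numeric answers count as correct within a relative tolerance;
    everything else requires a case-insensitive exact match."""
    try:
        p, g = float(pred), float(gold)
        return p == g if g == 0 else abs(p - g) / abs(g) <= tol
    except ValueError:
        return pred.strip().lower() == gold.strip().lower()

assert relaxed_match("45", "44")              # within 5% of the gold value
assert not relaxed_match("50", "44")          # off by more than 5%
assert relaxed_match("Increasing", "increasing")
```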
- 26.02 DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing | Paperπ Codeπ₯οΈ Modelπ€
- Lightweight 5B unified model for image generation and editing using hierarchical feature extraction, learnable think tokens, and MR-GRPO reinforcement learning, outperforming much larger models. | Task: Image Generation
- 26.02 UniWeTok: An Unified Binary Tokenizer with Codebook Size 2^128 for Unified Multimodal Large Language Model | Paperπ Codeπ₯οΈ
- Unified discrete tokenizer with massive binary codebook (2^128) for high-fidelity image reconstruction and generation in multimodal LLMs, achieving FID 1.38 with lower training compute. | Task: Image Generation
- 26.02 UniReason 1.0: A Unified Reasoning Framework for World Knowledge Aligned Image Generation and Editing | Paperπ Codeπ₯οΈ Modelπ€
- Integrates text-to-image generation and editing through dual reasoning with world knowledge planning and visual refinement on reasoning-intensive benchmarks. | Task: Image Generation
- 26.02 Generated Reality: Human-centric World Simulation using Interactive Video Generation | Paperπ Projectπ
- Human-centric video world model conditioned on tracked head and hand poses via bidirectional video diffusion for dexterous XR interactions. | Task: Image/Video Generation
- 26.01 Think-Then-Generate: Reasoning-Aware Text-to-Image Diffusion with LLM Encoders | Paperπ Codeπ₯οΈ
- "Think-then-generate" paradigm where LLM encoders reason about prompts before image generation using Dual-GRPO reinforcement optimization. | Task: Image Generation
- 26.01 Re-Align: Structured Reasoning-guided Alignment for In-Context Image Generation and Editing | Paperπ Codeπ₯οΈ
- Bridges multimodal understanding and image generation via In-Context Chain-of-Thought (IC-CoT) with RL-based training. | Task: Image Generation & Editing
- 26.01 Unified Thinker: A General Reasoning Modular Core for Image Generation | Paperπ
- General reasoning modular core enhancing image generation models with chain-of-thought reasoning capabilities. | Task: Image Generation
- 25.12 REASONEDIT: Towards Reasoning-Enhanced Image Editing Models | Paperπ
- Enhances image editing models with explicit reasoning capabilities. | Task: Image Editing
- 25.12 EditThinker: Unlocking Iterative Reasoning for Any Image Editor | Paperπ
- Enables iterative reasoning in image editing through a reasoning-aware framework. | Task: Image Editing
- 25.11 IE-Critic-R1: Advancing the Explanatory Measurement of Text-Driven Image Editing for Human Perception Alignment | Paperπ Codeπ₯οΈ Modelπ€ Datasetπ€ ColdStart SFTπ€
- IE-Critic-R1 treats image editing quality assessment as a reasoning task and implements an "R1 moment" (longer reasoning thoughts, better performance). It is a pointwise, generative reward model that leverages Chain-of-Thought (CoT) SFT and RLVR to provide accurate, human-aligned evaluations of image editing. | Task: Image Editing Quality Assessment
- 25.05 T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT | Paperπ Codeπ₯οΈ
- A novel reasoning-enhanced text-to-image generation model powered by RL with a bi-level CoT reasoning process | Task: Image Generation
- 25.03 GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing | Paperπ
- A paradigm that enables generation and editing through an explicit language reasoning process before outputting images | Task: Image Generation
- 25.03 Unified Reward Model for Multimodal Understanding and Generation | Paperπ Codeπ₯οΈ Datasetπ€
- Improve MLLM's understanding and generation ability with DPO | Task: VQA & Generation
- 25.01 (CVPR25) Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step | Paperπ Codeπ₯οΈ Modelπ€
- The first comprehensive investigation of the potential of CoT reasoning to enhance autoregressive image generation. | Task: Image Generation
- 24.12 EVLM: Self-Reflective Multimodal Reasoning for Cross-Dimensional Visual Editing | Paperπ
- A system designed to interpret editing instructions in conjunction with reference visuals, producing precise and context-aware editing prompts. | Task: Image Generation
- 26.02 SkyReels-V4: Multi-modal Video-Audio Generation, Inpainting and Editing model | Paperπ
- Unified multimodal video foundation model enabling simultaneous video+audio generation, editing, and inpainting via dual-stream architecture, supporting 1080p/32FPS/15s with synchronized audio. | Task: Video Generation
- 26.02 AnchorWeave: World-Consistent Video Generation with Retrieved Local Spatial Memories | Paperπ Codeπ₯οΈ
- Addresses long-term video generation consistency using multiple local geometric memories and multi-anchor weaving controller for camera-controllable long-horizon scene generation. | Task: Video Generation
- 26.02 OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence | Paperπ Codeπ₯οΈ Modelπ€
- Vision encoder applying codec-aligned sparsity to focus on 3.1%-25% of signal-rich patches, outperforming Qwen3-ViT and SigLIP2 across 16 benchmarks with 4.1% average improvement on video understanding. | Task: Video Understanding
- 26.02 CoPE-VideoLM: Codec Primitives For Efficient Video Language Models | Paperπ
- Uses video codec primitives (motion vectors and residuals) instead of dense per-frame embeddings, reducing time-to-first-token by up to 86% and token usage by up to 93% across 14 video benchmarks. | Task: Video Understanding
- 26.02 Solaris: Building a Multiplayer Video World Model in Minecraft | Paperπ Codeπ₯οΈ Modelπ€
- Multiplayer video world model for consistent multi-view observations in coordinated multi-agent Minecraft environments using Checkpointed Self Forcing technique. | Task: Video Generation
- 26.02 MOVA: Towards Scalable and Synchronized Video-Audio Generation | Paperπ Codeπ₯οΈ Modelπ€
- Open-source 32B MoE model generating high-quality synchronized audio-visual content including lip-synced speech, environment sounds, and music from image-text inputs. | Task: Video-Audio Generation
- 25.11 Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation | Paperπ
- Foundation model family for image and video generation. | Task: Video Generation
- 25.11 Planning with Sketch-Guided Verification for Physics-Aware Video Generation | Paperπ
- Physics-aware video generation with sketch-based planning and verification. | Task: Video Generation
- 25.10 PhysMaster: Mastering Physical Representation for Video Generation via RL | Paperπ
- Physical reasoning for video generation with reinforcement learning. | Task: Video Generation
- 25.02 C-Drag:Chain-of-Thought Driven Motion Controller for Video Generation | Paperπ Codeπ₯οΈ Datasetπ€
- A Chain-of-Thought-based motion controller for controllable video generation | Task: Video Generation
- 26.02 AVERE: Improving Audiovisual Emotion Reasoning with Preference Optimization | Paperπ Datasetπ€
- AVEm-DPO preference optimization improves audiovisual emotion reasoning in MLLMs by aligning responses with audiovisual cues and reducing text-prior hallucinations. | Task: Audio-Visual Reasoning
- 26.02 EgoAVU: Egocentric Audio-Visual Understanding | Paperπ Datasetπ€
- Scalable data engine and 3M-sample dataset for egocentric audio-visual understanding, enabling up to 113% performance improvement on joint audio-visual reasoning tasks. | Task: Audio-Visual Reasoning
- 26.01 LTX-2: Efficient Joint Audio-Visual Foundation Model | Paperπ Codeπ₯οΈ Modelπ€
- Open-source 14B+5B asymmetric dual-stream audiovisual diffusion model generating synchronized video and audio with bidirectional cross-attention. | Task: Audio-Visual Generation
- 25.11 UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions | Paperπ
- Unified audio-video generation using cross-modal interactions. | Task: Audio-Visual Generation
- 25.11 Harmony: Harmonizing Audio and Video Generation through Cross-Task Synergy | Paperπ
- Harmonizes audio and video generation via cross-task synergy. | Task: Audio-Visual Generation
- 25.06 ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing
- 26.02 Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents | Paperπ Codeπ₯οΈ Modelπ€
- GUI-Owl-1.5 multi-platform GUI agent family achieving SOTA on GUI automation (56.5 OSWorld, 71.6 AndroidWorld) and grounding (80.3 ScreenSpotPro) via MRPO multi-platform RL. | Task: GUI Agent
- 26.02 GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL | Paperπ Codeπ₯οΈ Modelπ€
- Trains open-source GUI agents using action-aware SFT (81K curated dataset) and conservative RL with KL regularization for web and mobile tasks. | Task: GUI Agent
- 26.02 PyVision-RL: Forging Open Agentic Vision Models via RL | Paperπ Codeπ₯οΈ Modelπ€
- RL framework for open-weight multimodal agents using oversampling-filtering-ranking rollout; releases PyVision-Image-7B and PyVision-Video-7B for tool-augmented reasoning. | Task: Agent/Tool Use
- 26.02 Computer-Using World Model | Paperπ
- World model for desktop software predicting UI state changes via two-stage factorization to help agents simulate candidate actions before execution. | Task: GUI Agent
- 26.02 V-Retrver: Evidence-Driven Agentic Reasoning for Universal Multimodal Retrieval | Paperπ Codeπ₯οΈ Modelπ€
- Reformulates multimodal retrieval as an agentic reasoning process where an MLLM selectively acquires visual evidence via external tools, achieving 23% average improvement. | Task: Agent/Tool Use
- 26.02 Reasoning-Augmented Representations for Multimodal Retrieval | Paperπ Codeπ₯οΈ
- Data-centric framework externalizing reasoning before retrieval by using VLMs to densely caption visual evidence and resolve ambiguous multimodal queries. | Task: Agent/Tool Use
- 26.02 WebArbiter: A Principle-Guided Reasoning Process Reward Model for Web Agents | Paperπ Codeπ₯οΈ
- Reasoning-first WebPRM formulating reward modeling as text generation to improve web navigation through structured justifications and preference verdicts (ICLR 2026). | Task: GUI Agent
- 26.02 Sparse Video Generation Propels Real-World Beyond-the-View Vision-Language Navigation | Paperπ Codeπ₯οΈ
- SparseVideoNav uses video generation for sparse future planning in beyond-the-view VLN tasks, achieving 27Γ speed-up and 2.5Γ higher success rate over LLM baselines. | Task: Visual Reasoning Agent
- 26.02 WebWorld: A Large-Scale World Model for Web Agent Training | Paperπ
- Open-web simulator trained on 1M+ interactions enabling long-horizon reasoning for web agents; models trained on WebWorld-synthesized trajectories show +9.2% improvement on WebArena. | Task: GUI Agent
- 26.02 AutoWebWorld: Synthesizing Infinite Verifiable Web Environments via Finite State Machines | Paperπ
- Synthesizes controllable web environments as FSMs translated to interactive websites by coding agents for automated trajectory generation at $0.04/trajectory, with 7B agent outperforming baselines on WebVoyager. | Task: GUI Agent
- 26.02 MMA: Multimodal Memory Agent | Paperπ Codeπ₯οΈ
- Improves long-horizon multimodal agent performance via dynamic memory reliability scoring and introduces the "Visual Placebo Effect" with MMA-Bench for evaluating belief dynamics. | Task: Multimodal Agent
- 26.01 AdaReasoner: Dynamic Tool Orchestration for Iterative Visual Reasoning | Paperπ Codeπ₯οΈ Modelπ€
- Multimodal model family learning tool usage as a reasoning skill via Tool-GRPO, +24.9% improvement surpassing GPT-4 on visual reasoning benchmarks. | Task: Visual Reasoning with Tools
- 26.01 SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning | Paperπ
- Multimodal agentic reasoning and search framework using RL to empower visual reasoning with agent capabilities. | Task: Multimodal Agentic Reasoning
- 26.01 EvoCUA: Evolving Computer Use Agents via Learning from Scalable Synthetic Experience | Paperπ Codeπ₯οΈ Modelπ€
- SOTA computer-use agent (56.7% OSWorld) using autonomous task generation and iterative evolving learning with self-correction. | Task: GUI Agent
- 26.01 DocDancer: Towards Agentic Document-Grounded Information Seeking | Paperπ
- Agentic framework for document-grounded multimodal information seeking and reasoning. | Task: Document Reasoning Agent
- 26.01 ShowUI-pi: Flow-based Generative Models as GUI Dexterous Hands | Paperπ
- Flow-based generative models applied as GUI interaction agents with visual reasoning capabilities. | Task: GUI Agent
- 26.01 PersonalAlign: Hierarchical Implicit Intent Alignment for Personalized GUI Agent | Paperπ
- Personalized GUI agent aligning hierarchical implicit user intent with long-term user-centric records. | Task: GUI Agent
- 25.12 Step-GUI Technical Report | Paperπ
- Step-by-step GUI agent with visual understanding. | Task: GUI Agent
- 25.12 MAI-UI Technical Report: Real-World Centric Foundation GUI Agents | Paperπ
- Foundation model for real-world GUI agent interaction with visual grounding. | Task: GUI Agent
- 25.11 Agent-Omni: Test-Time Multimodal Reasoning via Model Coordination for Understanding Anything
- 25.11 DeepEyesV2: Toward Agentic Multimodal Model | Paperπ
- Agentic multimodal model with tool-use and reasoning capabilities. | Task: Multimodal Agent
- 25.11 GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization | Paperπ
- Combines visual reasoning with web augmentation for agentic geolocalization. | Task: Visual Reasoning Agent
- 25.10 AudioToolAgent: An Agentic Framework for Audio-Language Models | Paperπ
- 25.10 GUI-KV: Efficient GUI Agents via KV Cache with Spatio-Temporal Awareness | Paperπ
- Efficient GUI interaction agents using visual understanding with spatio-temporal KV cache. | Task: GUI Agent
- 25.09 UItron: Foundational GUI Agent with Advanced Perception and Planning | Paperπ
- Multimodal agent for GUI understanding and interaction. | Task: GUI Agent
- 25.09 BTL-UI: Blink-Think-Link Reasoning Model for GUI Agent | Paperπ
- Reasoning model for GUI agent visual understanding and interaction. | Task: GUI Agent
- 25.08 Think Before You Segment: An Object-aware Reasoning Agent for Referring Audio-Visual Segmentation
- 25.08 OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use | Paperπ
- Survey of MLLM-based agents that operate computing devices via visual understanding. | Task: GUI Agent
- 25.08 InfiGUI-G1: Advancing GUI Grounding with Adaptive Exploration Policy Optimization | Paperπ
- Multimodal agent for GUI understanding with visual grounding and adaptive exploration. | Task: GUI Agent
- 25.08 CODA: Coordinating the Cerebrum and Cerebellum for a Dual-Brain Computer Use Agent | Paperπ
- Dual-brain architecture for multimodal computer-use agent with decoupled RL. | Task: GUI Agent
- 25.06 Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning | Paperπ Codeπ₯οΈ Projectπ
- 25.05 ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning | Paperπ Codeπ₯οΈ
- 25.05 Reinforcement Learning for Long-Horizon Interactive LLM Agents | Paperπ
- 25.05 RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning | Paperπ Codeπ₯οΈ Projectπ
- 25.05 Nemotron-Research-Tool-N1: Exploring Tool-Using Language Models with Reinforced Reasoning | Paperπ Codeπ₯οΈ
- 25.05 Agent RL Scaling Law: Spontaneous Code Execution for Mathematical Problem Solving | Paperπ Codeπ₯οΈ
- 25.04 ToolRL: Reward is All Tool Learning Needs | Paperπ Codeπ₯οΈ
- 25.04 Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning | Paperπ Codeπ₯οΈ
- 25.04 Acting Less is Reasoning More! Teaching Model to Act Efficiently
- 25.04 Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning | Paperπ
- 25.04 DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments | Paperπ Codeπ₯οΈ
- 25.03 TORL: Scaling Tool-Integrated RL | Paperπ Codeπ₯οΈ
- 25.03 R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning | Paperπ
- 25.02 (CVPR25) Enhancing Video-LLM Reasoning via Agent-of-Thoughts Distillation | Paperπ
- 24.12 (ECCV24) VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding | Paperπ Codeπ₯οΈ Projectπ
- Explore how reconciling several foundation models with a novel unified memory mechanism could tackle the challenging video understanding problem | Task: Video captioning & QA
- 26.02 MediX-R1: Open Ended Medical Reinforcement Learning | Paperπ Codeπ₯οΈ Modelπ€ Datasetπ€
- Open-ended RL framework for medical MLLMs enabling free-form clinical answers via Group-Based RL with composite rewards; 8B model outperforms 27B MedGemma with ~51K training samples. | Task: Medical Reasoning
- 26.02 Baichuan-M3: Modeling Clinical Inquiry for Reliable Medical Decision-Making | Paperπ Codeπ₯οΈ Modelπ€
- Medical LLM shifting from passive Q&A to active clinical-grade decision support via proactive information acquisition, long-horizon reasoning, and hallucination suppression, achieving SOTA on HealthBench. | Task: Medical Reasoning
- 26.02 MedSAM-Agent: Empowering Interactive Medical Image Segmentation with Multi-turn Agentic Reinforcement Learning | Paperπ Codeπ₯οΈ Modelπ€
- Reformulates medical image segmentation as multi-step decision-making using hybrid prompting and two-stage training with process rewards for autonomous reasoning. | Task: Medical Reasoning
- 26.02 Hepato-LLaVA: An Expert MLLM for Hepatocellular Pathology Analysis on Whole Slide Images | Paperπ Projectπ
- Specialized MLLM for hepatocellular carcinoma diagnosis with Sparse Topo-Pack Attention modeling tissue topology; includes HepatoPathoVQA (33K expert-validated Q&A pairs). | Task: Medical Reasoning
- 26.02 MedCLIPSeg: Probabilistic Vision-Language Adaptation for Medical Image Segmentation | Paperπ Codeπ₯οΈ Modelπ€
- Adapts CLIP for medical image segmentation via Probabilistic Vision-Language Adapter with uncertainty-aware attention, tested across 16 datasets spanning 5 modalities and 6 organs. | Task: Medical Reasoning
- 26.02 MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs | Paperπ
- Medical VLM combining entity-aware continual pretraining for rare diseases, RL for expert-level reasoning, and tool-augmented agentic training for multi-step diagnostic reasoning with reduced hallucination. | Task: Medical Reasoning
- 26.02 ClinAlign: Scaling Healthcare Alignment from Clinician Preference | Paperπ Codeπ₯οΈ Modelπ€
- Two-stage LLM alignment using physician-verified examples and distilled clinical principles, with a 30B model activating 3B parameters at inference achieving SOTA on medical benchmarks. | Task: Medical Reasoning
- 26.02 Uncertainty-Aware Vision-Language Segmentation for Medical Imaging | Paperπ Codeπ₯οΈ
- Multimodal segmentation with Modality Decoding Attention Blocks (MoDAB) and Spectral-Entropic Uncertainty Loss for medical image segmentation from radiological images and clinical text. | Task: Medical Reasoning
- 26.01 UniX: Unifying Autoregression and Diffusion for Chest X-Ray Understanding and Generation | Paperπ Codeπ₯οΈ Modelπ€
- Unified medical foundation model combining autoregressive understanding and diffusion generation for chest X-rays, +46.1% in understanding. | Task: Medical Image Understanding & Generation
- 25.12 OralGPT-Omni: A Versatile Dental Multimodal Large Language Model | Paperπ
- Versatile dental MLLM for oral health diagnosis and reasoning across modalities. | Task: Medical Reasoning
- 25.12 DentalGPT: Incentivizing Multimodal Complex Reasoning in Dentistry | Paperπ
- Incentivizes complex multimodal reasoning for dental diagnosis and treatment. | Task: Medical Reasoning
- 25.12 Colon-X: Advancing Intelligent Colonoscopy from Multimodal Understanding to Clinical Reasoning | Paperπ
- Advances colonoscopy with multimodal understanding and clinical reasoning capabilities. | Task: Medical Reasoning
- 25.10 M3Retrieve: Benchmarking Multimodal Retrieval for Medicine | Paperπ
- Multimodal retrieval benchmark for medical domain. | Task: Medical Reasoning
- 25.09 MedVista3D: Vision-Language Modeling for Reducing Diagnostic Errors in 3D CT Disease Detection | Paperπ
- VLM for medical 3D CT analysis to reduce diagnostic errors. | Task: Medical Reasoning
- 25.08 MedBLINK: Probing Basic Perception in Multimodal Language Models for Medicine | Paperπ
- Tests multimodal LLMs on basic medical visual perception tasks. | Task: Medical Reasoning
- 25.04 (ICASSP 2025) AuscMLLM: Bridging Classification and Reasoning in Heart Sound Analysis with a Multimodal Large Language Model
- 24.09 (JBHI 2024) Multi-Task Learning for Audio-Based Infant Cry Detection and Reasoning
- 25.06 (ACL 2025) MAM: Modular Multi-Agent Framework for Multi-Modal Medical Diagnosis via Role-Specialized Collaboration | Paperπ Codeπ₯οΈ
- 26.02 VLANeXt: Recipes for Building Strong VLA Models | Paperπ Codeπ₯οΈ Modelπ€
- Systematically identifies 12 key design findings across foundational components for VLA models, yielding SOTA simulation and real-world benchmark performance (CVPR 2026). | Task: Robot Control
- 26.02 SimVLA: A Simple VLA Baseline for Robotic Manipulation | Paperπ Codeπ₯οΈ Modelπ€
- Minimal VLA baseline strictly decoupling perception from control with standard VL backbone, achieving SOTA on simulation benchmarks with only 0.5B parameters. | Task: Robotic Manipulation
- 26.02 GigaBrain-0.5M*: a VLA That Learns From World Model-Based Reinforcement Learning | Paperπ Codeπ₯οΈ Projectπ
- VLA trained via world model-based RL (RAMP) on 10,000+ hours of robot data, achieving ~30% improvement on challenging tasks like laundry folding and espresso preparation. | Task: Robotic Manipulation
- 26.02 Recurrent-Depth VLA: Implicit Test-Time Compute Scaling of Vision-Language-Action Models via Latent Iterative Reasoning | Paperπ Codeπ₯οΈ Projectπ
- Recurrent VLA using latent iterative refinement instead of chain-of-thought tokens to adaptively scale compute at inference, achieving 0%β90%+ task success with 4 iterations. | Task: Robotic Manipulation
- 26.02 VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model | Paperπ Codeπ₯οΈ Modelπ€
- JEPA-style pretraining for VLA policies predicting future latent states from current observations, improving robustness to camera motion and irrelevant backgrounds. | Task: Robotic Manipulation
- 26.02 DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos | Paperπ Modelπ€ Projectπ
- Foundation world model trained on 44k hours of egocentric human video enabling teleoperation, policy evaluation, and model-based planning for dexterous robotics at 10.81 FPS. | Task: Robotic Manipulation
- 26.02 ABot-N0: Technical Report on the VLA Foundation Model for Versatile Embodied Navigation | Paperπ Codeπ₯οΈ Projectπ
- Unified VLA navigation model with hierarchical Brain-Action architecture achieving SOTA on 7 benchmarks across 5 navigation task types, trained on 16.9M expert trajectories. | Task: Embodied Navigation
- 26.02 TIC-VLA: A Think-in-Control Vision-Language-Action Model for Robot Navigation in Dynamic Environments | Paperπ Codeπ₯οΈ Projectπ
- Latency-aware VLA framework modeling delayed semantic reasoning during action generation via delayed semantic-control interface for real-time navigation. | Task: Embodied Navigation
- 26.02 QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models | Paperπ Codeπ₯οΈ
- Training-free PTQ framework for VLA models combining selective quantization, attention temperature matching, and output head balancing, achieving ~70% memory savings (CVPR 2026). | Task: Robot Control
- 26.02 FRAPPE: Infusing World Modeling into Generalist Policies via Multiple Future Representation Alignment | Paperπ Codeπ₯οΈ Modelπ€
- Improves world-awareness in robotic policies via parallel progressive latent alignment with visual foundation models, reducing error accumulation in multi-step prediction. | Task: Robotic Manipulation
- 26.02 TactAlign: Human-to-Robot Policy Transfer via Tactile Alignment | Paperπ Projectπ
- Cross-embodiment tactile alignment using rectified flow for zero-shot transfer on contact-rich manipulation tasks including pivoting, insertion, and lid closing. | Task: Robotic Manipulation
- 26.02 World Guidance: World Modeling in Condition Space for Action Generation | Paperπ Projectπ
- WoG maps predicted future observations into compact condition representations for fine-grained action generation, validated across simulation and real-world robot environments. | Task: Robot Control
- 26.02 Green-VLA: Staged Vision-Language-Action Model for Generalist Robots | Paperπ Codeπ₯οΈ
- Five-stage VLA framework for real-world robot deployment achieving generalization across embodiments via multimodal training and RL, reaching 69.5% success on ALOHA Table-Cleaning. | Task: Robotic Manipulation
- 26.02 Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs | Paperπ Codeπ₯οΈ
- Reflective Test-Time Planning with reflection-in-action and reflection-on-action enabling long-horizon credit assignment in robot decision-making. | Task: Embodied Reasoning
- 26.02 RISE: Self-Improving Robot Policy with Compositional World Model | Paperπ
- Addresses VLA brittleness in contact-rich manipulation using a compositional world model to predict multi-view futures and evaluate imagined outcomes, achieving +35-45% gains across real-world tasks. | Task: Robotic Manipulation
- 26.02 chi_0: Resource-Aware Robust Manipulation via Taming Distributional Inconsistencies | Paperπ Codeπ₯οΈ Modelπ€
- Resource-efficient robotic manipulation using model arithmetic weight-space merging and stage-aware advantage estimation for dual-arm garment tasks, achieving 250% higher success than pi_0.5. | Task: Robotic Manipulation
- 26.02 EgoHumanoid: Unlocking In-the-Wild Loco-Manipulation with Robot-Free Egocentric Demonstration | Paperπ
- Enables humanoid loco-manipulation via co-training VLA policies using abundant egocentric human demonstrations with limited robot data, achieving 51% improvement over robot-only baselines. | Task: Robotic Manipulation
- 26.02 MolmoSpaces: A Large-Scale Open Ecosystem for Robot Navigation and Manipulation | Paperπ Codeπ₯οΈ
- Open ecosystem with 230k+ diverse indoor environments and 130k annotated assets supporting MuJoCo/Isaac/ManiSkill, with 8 benchmark tasks and strong sim-to-real correlation (R=0.96). | Task: Robot Simulation
- 26.02 ABot-M0: VLA Foundation Model for Robotic Manipulation with Action Manifold Learning | Paperπ Codeπ₯οΈ Modelπ€
- Unified robotic manipulation framework standardizing 6M+ trajectories and introducing Action Manifold Learning (AML) for improved action prediction on low-dimensional manifolds. | Task: Robotic Manipulation
- 26.02 RLinf-Co: Reinforcement Learning-Based Sim-Real Co-Training for VLA Models | Paperπ
- RL-based sim-real co-training for VLA policies using two-stage warm-start SFT + RL fine-tuning with auxiliary supervised loss, achieving +24% on OpenVLA and +20% on pi_0.5 in real-world success. | Task: Robotic Manipulation
- 26.02 Xiaomi-Robotics-0: An Open-Sourced Vision-Language-Action Model with Real-Time Execution | Paperπ Codeπ₯οΈ
- Open-source VLA combining large-scale cross-embodiment pretraining with asynchronous execution for real-time deployment, achieving SOTA on simulation benchmarks with consumer-grade GPU compatibility. | Task: Robotic Manipulation
- 26.02 GeneralVLA: Generalizable Vision-Language-Action Models with Knowledge-Guided Trajectory Planning | Paperπ Codeπ₯οΈ
- Hierarchical VLA for zero-shot robotic manipulation via affordance segmentation, 3D trajectory planning, and 3D-aware control policy, outperforming VoxPoser without real-world demonstrations. | Task: Robotic Manipulation
- 26.02 RynnBrain: Open Embodied Foundation Models | Paperπ Codeπ₯οΈ Modelπ€ Datasetπ€
- Open-source spatiotemporal foundation model unifying perception, reasoning, and planning for embodied intelligence, outperforming existing models on 20 embodied benchmarks. | Task: Embodied Reasoning
- 26.02 Learning Humanoid End-Effector Control for Open-Vocabulary Visual Loco-Manipulation | Paperπ
- HERO enables humanoid robots to manipulate diverse real-world objects using inverse kinematics and neural forward model, reducing end-effector tracking error 3.2x for reliable surface manipulation. | Task: Robotic Manipulation
- 26.02 World Action Models are Zero-shot Policies | Paperπ Codeπ₯οΈ
- DreamZero uses video diffusion as a World Action Model (WAM) for robot skill generalization with 2x improvement in novel environments at real-time 7Hz closed-loop control. | Task: Robotic Manipulation
- 26.02 Learning Native Continuation for Action Chunking Flow Policies | Paperπ
- Legato, a training-time continuation method for action-chunked VLA policies producing smoother trajectories, achieving ~10% improvements in smoothness and task completion across 5 manipulation tasks. | Task: Robotic Manipulation
- 26.02 BiManiBench: A Hierarchical Benchmark for Evaluating Bimanual Coordination of Multimodal Large Language Models | Paperπ Codeπ₯οΈ
- Evaluates 30+ MLLMs on bimanual robotic tasks across spatial reasoning, action planning, and end-effector control tiers, finding persistent failures in dual-arm spatial grounding. | Task: Robotic Manipulation
- 26.01 ACoT-VLA: Action Chain-of-Thought for Vision-Language-Action Models | Paperπ
- Action Chain-of-Thought paradigm for VLA models with Explicit and Implicit Action Reasoner components, achieving 98.5% on LIBERO. | Task: Robotic Manipulation
- 26.01 Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning | Paperπ Modelπ€
- Adapts pretrained video models into robot policies through single-stage post-training, achieving 98.5% on LIBERO and SOTA on real-world bimanual manipulation. | Task: Robot Control
- 26.01 DynamicVLA: A Vision-Language-Action Model for Dynamic Object Manipulation | Paperπ Codeπ₯οΈ Datasetπ€
- Compact 0.4B VLA model for dynamic object manipulation with continuous inference and latent-aware action streaming. | Task: Robotic Manipulation
- 26.01 SOP: A Scalable Online Post-Training System for Vision-Language-Action Models | Paperπ Projectπ
- Scalable online distributed post-training system for VLA models enabling real-world robot policy adaptation through fleet learning. | Task: Robot Control
- 26.01 FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-Language Navigation | Paperπ Codeπ₯οΈ Modelπ€
- Implicit reasoning framework for vision-language navigation encoding imagined visual tokens in latent space, reducing inference latency by an order of magnitude. | Task: Vision-Language Navigation
- 26.01 RoboVIP: Multi-View Video Generation with Visual Identity Prompting Augments Robot Manipulation | Paperπ Codeπ₯οΈ
- Visual identity prompting for multi-view video generation to augment robot manipulation data. | Task: Robotic Manipulation
- 26.01 VLingNav: Embodied Navigation with Adaptive Reasoning and Visual-Assisted Linguistic Memory | Paperπ
- Embodied navigation agent with adaptive reasoning combining visual perception and linguistic memory. | Task: Embodied Navigation
- 25.12 DualVLA: Building a Generalizable Embodied Agent via Partial Decoupling of Reasoning and Action | Paperπ
- Decouples reasoning and action for more generalizable embodied agents. | Task: Robotic Manipulation
- 25.12 HiF-VLA: Hindsight, Insight and Foresight through Motion Representation for VLA Models | Paperπ
- Enriches VLA models with hindsight, insight, and foresight via motion representations. | Task: Robotic Manipulation
- 25.12 LEO-RobotAgent: A General-purpose Robotic Agent for Language-driven Embodied Operator | Paperπ
- General-purpose language-driven robotic agent for embodied task execution. | Task: Robotic Manipulation
- 25.12 Steering VLA Models as Anti-Exploration: A Test-Time Scaling Approach | Paperπ
- Test-time scaling approach for steering VLA models for safe embodied behavior. | Task: Robot Control
- 25.11 WMPO: World Model-based Policy Optimization for Vision-Language-Action Models | Paperπ
- World model-based policy optimization for VLA models in robotics. | Task: Robot Control
- 25.11 RynnVLA-002: A Unified Vision-Language-Action and World Model | Paperπ
- Unified VLA and world model for robotic manipulation. | Task: Robot Control
- 25.11 Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight | Paperπ
- VLA model with disentangled visual foresight for robotic control. | Task: Robot Control
- 25.11 MobileVLA-R1: Reinforcing Vision-Language-Action for Mobile Robots | Paperπ
- Reinforcement-based VLA model for mobile robot tasks. | Task: Robot Control
- 25.10 VLA-RFT: Vision-Language-Action Reinforcement Fine-tuning with Verified Rewards | Paperπ
- Fine-tuning VLA models using RL with verified rewards in world simulators. | Task: Robot Control
- 25.10 InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy | Paperπ
- VLA framework for robotic control with spatial grounding. | Task: Robot Control
- 25.10 X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model | Paperπ
- Cross-embodiment VLA model for scalable robot learning. | Task: Robot Control
- 25.10 GigaBrain-0: A World Model-Powered Vision-Language-Action Model | Paperπ
- VLA model integrating world models for robot reasoning. | Task: Robot Control
- 25.09 Robix: A Unified Model for Robot Interaction, Reasoning and Planning | Paperπ
- Unified robotics model combining visual reasoning with interaction and planning. | Task: Robot Control
- 25.09 FLOWER: Democratizing Generalist Robot Policies with Efficient VLA Flow Policies | Paperπ
- Vision-language-action model for generalist robot policies. | Task: Robot Control
- 25.08 RynnEC: Bringing MLLMs into Embodied World | Paperπ
- Integrates multimodal LLMs into embodied AI settings for physical-world reasoning. | Task: Embodied Reasoning
- 25.08 Do What? Teaching Vision-Language-Action Models to Reject the Impossible | Paperπ
- Trains VLA models to reason about task feasibility and reject impossible instructions. | Task: Robot Control
- 25.08 Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in VLA Policies | Paperπ
- Uses discrete diffusion for action decoding in vision-language-action robotic policies. | Task: Robot Control
- 23.07 RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control | Paperπ Projectπ
- Co-finetunes a VLM on web and robot data, establishing the VLA paradigm by transferring internet-scale knowledge to robot control. | Task: General Robotic Manipulation
- 24.05 Octo: An Open-Source Generalist Robot Policy | Paperπ Codeπ₯οΈ Projectπ Modelπ€
- An open-source, generalist transformer policy pretrained on the large-scale Open X-Embodiment dataset, designed for efficient fine-tuning to new robots and tasks. | Task: Robotics
- 24.06 OpenVLA: An Open-Source Vision-Language-Action Model | Paperπ Codeπ₯οΈ Projectπ Modelπ€
- A 7B-parameter open-source VLA model trained on the Open X-Embodiment dataset, achieving state-of-the-art performance for generalist manipulation. | Task: VLA
- 24.10 Οβ: A Vision-Language-Action Flow Model for General Robot Control | Paperπ Codeπ₯οΈ
- A generalist policy using a novel flow matching architecture atop a pretrained VLM, enabling zero-shot generalization for dexterous manipulation. | Task: Robot Control
- 25.01 FAST: Efficient Action Tokenization for Vision-Language-Action Models | Paperπ Codeπ₯οΈ
- A compression-based action tokenization scheme that accelerates autoregressive VLA training by 5x with performance comparable to diffusion models. | Task: Robot Control
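As context for the action-tokenization entries here, the simplest scheme discretizes each continuous action dimension into uniform bins so an autoregressive VLA can predict token ids; the sketch below shows only that baseline idea. It is not FAST's method, which the paper describes as compression-based, and the value range and bin count are arbitrary assumptions.

```python
import numpy as np

def tokenize_actions(actions: np.ndarray, low: float = -1.0, high: float = 1.0,
                     n_bins: int = 256) -> np.ndarray:
    """Map continuous actions in [low, high] to integer token ids by uniform binning."""
    clipped = np.clip(actions, low, high)
    return np.floor((clipped - low) / (high - low) * (n_bins - 1)).astype(int)

# A made-up chunk of two 3-DoF actions.
chunk = np.array([[0.12, -0.40, 0.95],
                  [0.10, -0.38, 0.90]])
print(tokenize_actions(chunk))
```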
- 25.02 Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models | Paperπ
- A hierarchical VLA model with a high-level VLM for reasoning and a low-level VLA for execution, enabling complex, open-ended instruction following. | Task: Robot Control
- 25.03 Gemini Robotics: Bringing AI into the Physical World | Paperπ Codeπ₯οΈ Projectπ Datasetπ€
- A VLA model built on the Gemini foundation model, demonstrating significant improvements in generality, interactivity, and dexterity for complex tasks. | Task: Advanced & Dexterous Manipulation
- 25.03 COT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models | Paperπ Projectπ
- A method that incorporates explicit visual CoT reasoning into VLAs by predicting future image frames autoregressively as visual goals before generating a short action sequence to achieve these goals. | Task: Robotics
- 25.03 GR00T: A Foundation Model for General-Purpose Robotics | Paperπ Codeπ₯οΈ Modelπ€ Datasetπ€
- A general-purpose foundation model for robot learning that takes multimodal instructions and past observations to generate actions for the robot to execute. | Task: Robotics
- 25.04 Ο0.5: a Vision-Language-Action Model with Open-World Generalization | Paperπ
- An evolution of Οβ that uses co-training on diverse tasks to achieve long-horizon, dexterous manipulation in novel, unseen environments. | Task: Robot Control
- 25.06 Chain-of-Action: Faithful and Deterministic Robot Policy via Language-guided State-Action Augmentation | Paperπ Codeπ₯οΈ Projectπ Modelπ€
- A novel robot policy, Chain-of-Action (CoA), that uses language as an intermediate representation to explicitly reason about the chain of actions for a given task, while being fully deterministic during inference. | Task: Robotics
- 25.07 Vision-Language-Action Instruction Tuning: From Understanding to Manipulation | Paperπ Codeπ₯οΈ Projectπ Modelπ€
- An end-to-end VLA model, InstructVLA, that introduces a novel training paradigm called Vision-Language-Action Instruction Tuning (VLA-IT) to preserve the flexible reasoning of VLMs while delivering high-performance robotic manipulation. | Task: Robotic
- 25.07 MinD: Learning A Dual-System World Model for Real-Time Planning and Implicit Risk Analysis | Paperπ Codeπ₯οΈ Projectπ
- A dual-system world model, MinD, that enables real-time, risk-aware planning by conditioning a high-frequency action policy on single-step latent predictions from a low-frequency video generation model. | Task: Robotic
- 26.02 VISTA-Bench: Do Vision-Language Models Really Understand Visualized Text as Well as Pure Text? | Paperπ
- Benchmark testing whether VLMs truly understand text rendered visually in images as well as plain text, revealing a significant comprehension gap. | Task: Reasoning
- 26.02 From Perception to Action: An Interactive Benchmark for Vision Reasoning | Paperπ Codeπ₯οΈ
- CHAIN 3D physics-driven interactive benchmark evaluating whether VLMs understand causal constraints and execute structured action sequences in mechanical puzzles. | Task: Reasoning
- 26.02 SAM 3D Body: Robust Full-Body Human Mesh Recovery | Paperπ Codeπ₯οΈ
- Promptable model for single-image 3D human mesh recovery using the Momentum Human Rig (MHR) parametric representation, supporting 2D keypoint/mask prompts with strong generalization. | Task: 3D Human Pose Estimation
- 26.01 CoF-T2I: Video Models as Pure Visual Reasoners for Text-to-Image Generation | Paperπ
- Uses video generation models as visual reasoners for text-to-image generation, showing temporal modeling transfers to improved spatial reasoning. | Task: Image Generation
- 26.01 OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models | Paperπ
- Holistic OCR framework within end-to-end vision-language models for comprehensive text understanding in images. | Task: OCR & Document Understanding
- 25.12 GroundingME: Exposing the Visual Grounding Gap in MLLMs through Multi-Dimensional Evaluation | Paperπ
- Exposes and evaluates visual grounding gaps in MLLMs across multiple dimensions. | Task: Visual Grounding
- 25.11 Monet: Reasoning in Latent Visual Space Beyond Images and Language | Paperπ
- Enables vision-language reasoning in latent visual space, going beyond standard image-text paradigms. | Task: Reasoning
- 25.10 SeeingEye: Agentic Information Flow Unlocks Multimodal Reasoning In Text-only LLMs | Paperπ
- Enables multimodal reasoning in text-only LLMs through agentic information flow. | Task: Reasoning
- 25.04 InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners | Paperπ Codeπ₯οΈ
- An MLLM-based GUI agent designed to progressively evolve agents from Reactive Actors to Deliberative Reasoners. | Task: UI
- 25.04 GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents | Paperπ
- Enhances GUI agent through RL with unified action space modeling, achieving superior cross-platform performance using only 0.02% of the data required by previous methods. | Task: UI
- 25.03 UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning | Paperπ
- Introduce a unified rule-based action reward, enabling model optimization via policy-based algorithms like GRPO. | Task: UI
- 25.03 VLM-R1: A stable and generalizable R1-style Large Vision-Language Model | Codeπ₯οΈ Datasetπ€ Modelπ€
- A reproduced R1-style VLM | Task: Referring Expression Comprehension
- 25.02 MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning | Paperπ
- An MLLM trained with GRPO for medical image VQA. | Task: Medical Image VQA
- 25.03 R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning | Paperπ Codeπ₯οΈ Modelπ€
- Improve reasoning capability, emotion recognition accuracy, and generalization ability with RLVR. | Task: Emotion recognition
- 26.01 The Sonar Moment: Benchmarking Audio-Language Models in Audio Geo-Localization | Paperπ
- Benchmark for audio-language models on spatial audio geo-localization reasoning tasks. | Task: Audio Reasoning
- 25.02 ADIFF: Explaining audio difference using natural language | Codeπ₯οΈ Modelπ€
- 24.09 What Are They Doing? Joint Audio-Speech Co-Reasoning
- 24.09 Chain-of-Thought Prompting for Speech Translation
- 26.01 FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs | Paperπ
- Benchmark evaluating multimodal LLMs' ability to forecast future events from omni-modal context including temporal reasoning. | Task: Omni Reasoning
- 25.05 AVCD: Mitigating Hallucinations in Audio-Visual Large Language Models through Contrastive Decoding
- 25.03 R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning
- 23.11 X-InstructBLIP: A Framework for Aligning Image, 3D, Audio, Video to LLMs and its Emergent Cross-Modal Reasoning
| Date | Project | Task | Links |
|---|---|---|---|
| 26.02 | A Very Big Video Reasoning Suite (VBVR): 1M+ video clips across 200 reasoning tasks | Video Reasoning | [π Paper] [π€ Model] [π€ Data] |
| 26.02 | OmniGAIA: Omni-Modal AI Agent Benchmark with hindsight-guided exploration | Omni-Modal Agent Reasoning | [π Paper] [π» Code] [π€ Data] |
| 26.02 | SpatiaLab: Wild Spatial Reasoning benchmark across 6 VQA categories | Spatial Reasoning | [π Paper] [π» Code] [π€ Data] |
| 26.02 | MuRGAt: Multimodal Fact-Level Attribution benchmark for verifiable reasoning | Multimodal Attribution | [π Paper] [π» Code] |
| 26.02 | DeepVision-103K: Verifiable multimodal math dataset for RLVR training | Math Reasoning | [π Paper] [π» Code] [π€ Data] |
| 26.02 | UniVBench: Unified evaluation for video foundation models across understanding, generation, editing | Video Foundation Model Evaluation | [π Paper] [π» Code] |
| 26.02 | RISE-Video: Benchmark for video generators decoding implicit world rules | Video Generation Reasoning | [π Paper] [π» Code] [π€ Data] |
| 26.02 | SAW-Bench: Egocentric Situated Awareness evaluation with 786 smart-glass videos and 2,071+ QA pairs | Spatial Reasoning | [π Paper] |
| 26.02 | BrowseComp-V3: 300-question visual benchmark for complex multi-hop multimodal web search | Multimodal Browsing | [π Paper] |
| 26.02 | BiManiBench: Hierarchical benchmark for bimanual coordination evaluation in MLLMs | Bimanual Robotics | [π Paper] [π» Code] |
| 26.01 | MMFineReason: Closing the Multimodal Reasoning Gap via Open Data-Centric Methods | Multimodal Reasoning | [π Paper] [π€ Model] [π€ Data] |
| 26.01 | ChartVerse: Scaling Chart Reasoning via Reliable Programmatic Synthesis | Chart Reasoning | [π Paper] [π» Code] [π€ Model] [π€ Data] |
| 26.01 | VideoLoom: Joint Spatial-Temporal Understanding with LoomBench | Spatial-Temporal Reasoning | [π Paper] [π» Code] [π€ Model] |
| 26.01 | PROGRESSLM: Towards Progress Reasoning in Vision-Language Models | Task Progress Reasoning | [π Paper] [π» Code] [π€ Data] |
| 26.01 | FutureOmni: Evaluating Future Forecasting from Omni-Modal Context | Omni-Modal Temporal Reasoning | [π Paper] |
| 26.01 | Afri-MCQA: Multimodal Cultural Question Answering for African Languages | Multilingual Multimodal Reasoning | [π Paper] |
| 26.01 | AVMeme Exam: A Multimodal Multilingual Multicultural Benchmark | Cultural Multimodal Reasoning | [π Paper] |
| 25.12 | HERBench: Multi-Evidence Integration in Video Question Answering | Video Reasoning | [π Paper] |
| 25.12 | SVBench: Evaluation of Video Generation Models on Social Reasoning | Video Social Reasoning | [π Paper] |
| 25.12 | IF-Bench: Benchmarking MLLMs for Infrared Images | Infrared Image Understanding | [π Paper] |
| 25.12 | VABench: Comprehensive Benchmark for Audio-Video Generation | Audio-Video Generation | [π Paper] |
| 25.11 | MME-CC: Challenging Multi-Modal Evaluation Benchmark of Cognitive Capacity | Cognitive Capacity | [π Paper] |
| 25.11 | GGBench: Geometric Generative Reasoning Benchmark for Unified Multimodal Models | Geometric Reasoning | [π Paper] |
| 25.11 | WEAVE: Benchmarking In-context Interleaved Comprehension and Generation | Multimodal Comprehension & Generation | [π Paper] |
| 25.10 | Uni-MMMU: Massive Multi-discipline Multimodal Unified Benchmark | Multimodal Multi-discipline Reasoning | [π Paper] |
| 25.10 | PhysToolBench: Benchmarking Physical Tool Understanding for MLLMs | Physical Tool Understanding | [π Paper] |
| 25.10 | BEAR: Benchmarking Multimodal Language Models for Atomic Embodied Capabilities | Embodied AI Capabilities | [π Paper] |
| 25.10 | OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs | Long-context, Video-Audio Understanding & Reasoning | [π Paper] [π» Code] [π Project] [π€ Data] |
| 25.10 | XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models | Capability Balancing among Different Modalities | [π Paper] [π» Code] [π Project] |
| 25.10 | StreamingCoT: A Dataset for Temporal Dynamics and Multimodal Chain-of-Thought Reasoning in Streaming VideoQA | Temporal Reasoning | [π Paper] |
| 25.10 | Valor32k-AVQA v2.0: Open-Ended Audio-Visual Question Answering Dataset and Benchmark | Common Sense Omni Reasoning | [π Paper] |
| 25.09 | MARS2 2025 Challenge on Multimodal Reasoning | Multimodal Reasoning Challenge | [π Paper] |
| 25.09 | Visual-TableQA: Open-Domain Benchmark for Reasoning over Table Images | Table Reasoning | [π Paper] |
| 25.09 | AHELM: A Holistic Evaluation of Audio-Language Models | Audio-Language Understanding | [π Paper] |
| 25.09 | MDAR: A Multi-scene Dynamic Audio Reasoning Benchmark | Complex, Multi-scene, & Dynamically Evolving Speech & Audio Reasoning | [π Paper] [π» Code] |
| 25.09 | MiMo-Audio-Eval Toolkit | Speech/Sound/Music Reasoning | [π» Code] |
| 25.08 | SpeechR: A Benchmark for Speech Reasoning in Large Audio-Language Models | Speech Reasoning | [π Paper] [π» Code] [π€ Data] |
| 25.08 | MMAU-Pro: A Challenging and Comprehensive Benchmark for Holistic Evaluation of Audio General Intelligence | Long-form, Spatial, and Multi-audio Reasoning on Speech/Music/Sound | [π Paper] [π€ Data] |
| 25.08 | RΒ²-AVSBench: Think Before You Segment: An Object-aware Reasoning Agent for Referring Audio-Visual Segmentation | Segmentation Reasoning | [π Paper] [π€ Data] |
| 25.07 | Towards Video Thinking Test: A Holistic Benchmark for Advanced Video Reasoning and Understanding | Video Reasoning and Understanding | [π Paper] [π Project] [π€ Data] |
| 25.06 | FinMME: Benchmark Dataset for Financial Multi-Modal Reasoning Evaluation | Financial Multi-Modal Reasoning | [π Paper] [π» Code] [π€ Data] |
| 25.06 | MMR-V: What's Left Unsaid? A Benchmark for Multimodal Deep Reasoning in Videos | Video Reasoning | [π Paper] [π» Code] [π Project] [π€ Data] |
| 25.06 | OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models | Spatial Reasoning | [π Paper] [π» Code] [π Project] [π€ Data] |
| 25.06 | MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark | Phonatics, Prosody, Rhetoric, Syntactics, Semantics, and Paralinguistics in Speech Understanding & Reasoning | [π Paper] [π» Code] [π€ Data] |
| 25.05 | Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities | Video&Audio Reasoning | [π Paper] [π» Code] [π Project] [π€ Data] |
| 25.05 | MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix | Multi-step Audio Reasoning | [π Paper] [π» Code] [π₯ Demo] [π€ Data] |
| 25.05 | On Path to Multimodal Generalist: General-Level and General-Bench | Multimodal Generation | [π Project] [π Paper] [π€ Data] |
| 25.04 | VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models | Visual Reasoning | [π Project] [π Paper] [π» Code] [π€ Data] |
| 25.04 | IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs | Image-Grounded Video Perception and Reasoning | [π Paper] [π» Code] |
| 25.04 | Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing | Reasoning-Informed viSual Editing | [π Paper] [π» Code] |
| 25.04 | CMI-Bench: A Comprehensive Benchmark for Evaluating Music Instruction Following | Music Information Retrieval & Knowledge | [π Paper] [π» Code] |
| 25.03 | MAVERIX: Multimodal Audio-Visual Evaluation Reasoning IndeX | Common Sense Omni Reasoning | [π Paper] [π Project] |
| 25.03 | V-STaR : Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning | Spatio-temporal Reasoning | [π Project] [π Paper] [π€ Data] |
| 25.03 | MM-Spatial: Exploring 3D Spatial Understanding in Multimodal LLMs | 3D Spatial Understanding | [π Paper] |
| 25.03 | Integrating Chain-of-Thought for Multimodal Alignment: A Study on 3D Vision-Language Learning | 3D-CoT | [π Paper] [π€ Data] |
| 25.02 | MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models | MM-IQ | [π Paper] [π» Code] |
| 25.02 | MM-RLHF: The Next Step Forward in Multimodal LLM Alignment | MM-RLHF-RewardBench, MM-RLHF-SafetyBench | [π Paper] |
| 25.02 | ZeroBench: An Impossible* Visual Benchmark for Contemporary Large Multimodal Models | ZeroBench | [π Project] [π€ Dataset] [π» Code] |
| 25.02 | MME-CoT: Benchmarking Chain-of-Thought in LMMs for Reasoning Quality, Robustness, and Efficiency | MME-CoT | [π Paper] [π» Code] |
| 25.02 | OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference | MM-AlignBench | [π Paper] [π» Code] |
| 25.01 | AVTrustBench: Assessing and Enhancing Reliability and Robustness in Audio-Visual LLMs | Adversarial attack, Compositional reasoning, and Modality-specific dependency in Visual&Audio | [π Paper] |
| 25.01 | LlamaV-o1: Rethinking Step-By-Step Visual Reasoning in LLMs | VRCBench | [π Paper] [π» Code] |
| 24.12 | Online Video Understanding: A Comprehensive Benchmark and Memory-Augmented Method | VideoChat-Online | [Paperπ] [Codeπ»] |
| 24.11 | VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models | VLRewardBench | [π Paper] |
| 24.11 | Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos | MH-VidQA | [Paperπ] [Codeπ»] |
| 24.10 | OmnixR: Evaluating Omni-modality Language Models on Reasoning across Modalities | Video&Audio Reasoning | [π Paper] |
| 24.10 | MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark | Audio Understanding & Reasoning | [π Project] [π Paper] [π» Code] [π€ Data] |
| 24.09 | MECD: Unlocking Multi-Event Causal Discovery in Video Reasoning | Video Causal Reasoning | [π Paper] [π» Code] [π€ Data] |
| 24.09 | OmniBench: Towards The Future of Universal Omni-Language Models | Reasoning with Image & Speech/Sound/Music | [π Paper] [Codeπ»] [π Project] [π€ Data] |
| 24.08 | MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models | Music Knowledge & Reasoning | [π Project] [π Paper] [π» Code] [π€ Data] |
| 24.07 | REXTIME: A Benchmark Suite for Reasoning-Across-Time in Videos | REXTIME | [Paperπ] [Codeπ»] |
| 24.06 | AudioBench: A Universal Benchmark for Audio Large Language Models | Speech & Sound Understanding | [Paperπ] [Codeπ₯οΈ] |
| 24.06 | ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation | ChartMimic | [Projectπ] [Paperπ] [Codeπ₯οΈ] |
| 24.05 | M3CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought | M3CoT | [π Paper] |
| 24.02 | AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension | Speech & Sound Understanding | [π Paper] [Codeπ»] |
| 23.10 | CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models | Audio Reasoning (Attributes & Orders) | [Projectπ] [Paperπ] |
| Project | Links |
|---|---|
| Reason-RFT | π» GitHub π€ Dataset |
| EasyR1 | π» GitHub |
| Multimodal Open R1 | π» GitHub π€ Model π€ Dataset |
| LMM-R1 | π» GitHub |
| MMR1 | π» GitHub π€ Model π€ Dataset |
| R1-V | π» GitHub π― Blog π€ Dataset |
| R1-Multimodal-Journey | π» GitHub |
| VLM-R1 | π» GitHub π€ Model π€ Dataset π€ Demo |
| R1-Vision | π» GitHub π€ Cold-Start Dataset |
| R1-Onevision | π» GitHub π€ Model π€ Dataset π€ Demo π Report |
| Open R1 Video | π» GitHub π€ Model π€ Dataset |
| Video-R1 | π» GitHub π€ Dataset |
| Open-LLaVA-Video-R1 | π» GitHub |
| R1V-Free | π» GitHub |
| SeekWorld | π» GitHub |
| IE-Critic-R1 | π» GitHub π€ Model π€ Data π€ ColdStart SFT |
If you are interested in contributing, please refer to HERE for contribution instructions.