
Metis-HOME: Hybrid Optimized Mixture-of-Experts for Multimodal Reasoning

Paper: [arXiv:2510.20519](https://arxiv.org/abs/2510.20519)

💡 Overview

Current multimodal reasoning models face a critical dilemma: they often "overthink" on simple tasks (inefficiency) and suffer from general capability degradation when optimized for reasoning.

We introduce Metis-HOME (Hybrid Optimized Mixture-of-Experts), a novel framework that enables a "Hybrid Thinking" paradigm. By structuring the original dense model (Qwen2.5-VL-7B) into two distinct expert branches—a Thinking Expert for complex reasoning and a Non-Thinking Expert for rapid inference—controlled by a lightweight router, Metis-HOME effectively resolves the reasoning-vs-generalization trade-off.
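To make the split concrete, here is a minimal, illustrative PyTorch sketch of the hybrid layer (the module names, mean-pooling, and hard top-1 routing are our assumptions for illustration, not the released implementation):

```python
# Illustrative sketch only: two expert branches cloned from one dense layer,
# with a lightweight binary router dispatching each query to exactly one branch.
import copy
import torch
import torch.nn as nn

class HybridThinkingRouter(nn.Module):
    """Lightweight binary router: picks the thinking ("System 2") or
    non-thinking ("System 1") branch for each query."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.score = nn.Linear(hidden_size, 2)  # logits: [non-thinking, thinking]

    def forward(self, query_repr: torch.Tensor) -> torch.Tensor:
        # query_repr: (batch, hidden) pooled multimodal query representation
        return self.score(query_repr).argmax(dim=-1)  # hard top-1 routing

class HybridExpertLayer(nn.Module):
    """Two expert branches initialized from the same dense layer; the router
    sends each query down exactly one of them."""
    def __init__(self, dense_mlp: nn.Module, hidden_size: int):
        super().__init__()
        self.thinking_expert = copy.deepcopy(dense_mlp)
        self.non_thinking_expert = copy.deepcopy(dense_mlp)
        self.router = HybridThinkingRouter(hidden_size)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, hidden); pool over tokens for a per-query decision
        route = self.router(hidden.mean(dim=1))
        out = torch.empty_like(hidden)
        think = route == 1
        if think.any():
            out[think] = self.thinking_expert(hidden[think])
        if (~think).any():
            out[~think] = self.non_thinking_expert(hidden[~think])
        return out

# Toy usage: clone a dense MLP into two experts and route a small batch
layer = HybridExpertLayer(
    nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64)), hidden_size=64
)
print(layer(torch.randn(4, 10, 64)).shape)  # torch.Size([4, 10, 64])
```

If the router is trained in its own stage (as the multi-stage strategy below suggests), the non-differentiable argmax is not an obstacle; this toy version only shows the dispatch logic.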

Figure: Metis-HOME framework overview.

✨ Highlights

  • 🧠 Hybrid Thinking Paradigm: Explicitly decouples "System 1" (fast, intuitive) and "System 2" (slow, deliberative) reasoning within a unified multimodal MoE architecture.

  • 🔄 Router Mechanism: A lightweight, trainable router dynamically dispatches queries by complexity, avoiding computational waste on simple tasks such as OCR or captioning.

  • 🚀 Performance:

    • +6.9% improvement on reasoning benchmarks (MathVista, etc.) compared to the baseline.
    • ~1% gain on general benchmarks, reversing the degradation trend observed in other reasoning-specialized models.
  • 🛠️ Efficient Training: A multi-stage strategy combining reinforcement learning (RL) for reasoning enhancement and mixed supervised fine-tuning (SFT) for expert specialization; a rough router-training sketch follows this list.
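The staged RL + mixed-SFT recipe itself is beyond a README snippet, but as a rough, hypothetical illustration of how a lightweight complexity router can be fit (reusing `HybridThinkingRouter` from the sketch above; the binary labels and cross-entropy loss are our assumptions):

```python
# Hypothetical router fitting as binary classification: complex queries get
# label 1 (thinking expert), simple ones (e.g. OCR/captioning) get label 0.
# The actual staged training recipe is described in the paper.
import torch
import torch.nn.functional as F

def router_training_step(router, optimizer, query_reprs, needs_thinking):
    """One illustrative optimization step on pooled query representations.

    query_reprs:    (batch, hidden) float tensor
    needs_thinking: (batch,) long tensor of 0/1 labels
    """
    logits = router.score(query_reprs)           # (batch, 2) raw logits
    loss = F.cross_entropy(logits, needs_thinking)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```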

📊 Results

Thinking Ratio

As shown in the following figure, the thinking ratio analysis of Metis-HOME reveals adaptive routing behavior:

  • High ratios (78%–98%) on reasoning-heavy benchmarks (WeMath, MathVision, etc.), indicating effective use of the thinking expert for multi-step inference.
  • Low ratios (2%–5%) on general benchmarks (MMBench, OCRBench), showing preference for the non-thinking expert.

This aligns with our design: deliberate reasoning for complex tasks and fast inference for simple ones, improving computational efficiency.

Figure: thinking ratio of Metis-HOME across reasoning and general benchmarks.
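The thinking ratio itself is just the fraction of a benchmark's queries that the router sends to the thinking expert; a minimal sketch of the bookkeeping (the function name and data layout are illustrative assumptions):

```python
# Thinking ratio = (#queries routed to the thinking expert) / (#queries),
# computed per benchmark from the router's per-query decisions.
from collections import defaultdict

def thinking_ratio(route_decisions):
    """route_decisions: iterable of (benchmark_name, routed_to_thinking)."""
    counts = defaultdict(lambda: [0, 0])  # benchmark -> [thinking, total]
    for bench, to_thinking in route_decisions:
        counts[bench][0] += int(to_thinking)
        counts[bench][1] += 1
    return {b: t / n for b, (t, n) in counts.items()}

# Example: heavy thinking on a math benchmark, almost none on an OCR one
decisions = [("MathVision", True)] * 9 + [("MathVision", False)] \
          + [("OCRBench", True)] + [("OCRBench", False)] * 24
print(thinking_ratio(decisions))  # {'MathVision': 0.9, 'OCRBench': 0.04}
```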

Benchmarks

The first six score columns are reasoning benchmarks, followed by their average; the last column is the average over the general benchmarks.

| Model | MathVista | MathVision | MathVerse | DynaMath | WeMath | LogicVista | Reasoning Avg. | General Avg. |
|---|---|---|---|---|---|---|---|---|
| *Proprietary Models* | | | | | | | | |
| Gemini-2.0-Pro | 71.3 | 48.1 | 67.3 | 43.3 | 56.5 | 53.2 | 56.6 | 73.3 |
| Gemini-2.0-Flash | 70.4 | 43.6 | 47.8 | 42.1 | 47.4 | 52.3 | 50.6 | 72.6 |
| Claude 3.7 Sonnet | 66.8 | 41.9 | 46.7 | 39.7 | 49.3 | 58.2 | 50.4 | 70.1 |
| ChatGPT-4o | 60.0 | 31.2 | 40.6 | 34.5 | 45.8 | 52.8 | 44.2 | 72.0 |
| *Open-source Models* | | | | | | | | |
| LLaVA-OneVision-72B | 67.1 | 25.3 | 27.2 | 15.6 | 32.0 | 40.9 | 34.7 | 68.0 |
| Kimi-VL-A3B-Instruct | 66.0 | 21.8 | 34.1 | 18.0 | 32.3 | 42.7 | 35.8 | 69.1 |
| InternVL3-8B | 70.5 | 30.0 | 38.5 | 25.7 | 39.5 | 44.5 | 41.4 | 73.6 |
| VL-Rethinker-7B | 75.5 | 29.3 | 47.2 | 25.4 | 37.8 | 47.0 | 43.7 | 68.3 |
| Metis-RISE-7B | 75.8 | 28.7 | 51.0 | 27.7 | 45.2 | 49.7 | 46.4 | 68.4 |
| Baseline | 67.4 | 26.2 | 41.1 | 20.2 | 34.5 | 45.6 | 39.2 | 70.3 |
| Baseline+RL | 72.8 | 28.7 | 46.8 | 26.2 | 43.3 | 46.5 | 44.0 | 67.2 |
| Metis-HOME | 76.0 | 29.5 | 47.7 | 26.4 | 45.6 | 51.5 | 46.1 | 71.2 |

🔍 Usage Example

You can use the demo inference script in the examples folder:

```bash
python examples/demo_inference.py
```
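Since Metis-HOME is built on Qwen2.5-VL-7B, inference should follow the standard Qwen2.5-VL pipeline in `transformers`; the sketch below is a hypothetical illustration (the checkpoint path, image path, and prompt are placeholders, and the actual demo script may differ):

```python
# Hypothetical sketch following the standard Qwen2.5-VL inference pipeline;
# "path/to/Metis-HOME" and the sample image are placeholders.
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "path/to/Metis-HOME", torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained("path/to/Metis-HOME")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "examples/sample.png"},  # placeholder image
        {"type": "text", "text": "Solve the problem shown in the image."},
    ],
}]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=512)
trimmed = generated[:, inputs.input_ids.shape[1]:]  # drop the prompt tokens
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```

Under the hybrid-thinking design, the same call should transparently produce either a long reasoning trace or a direct answer, depending on where the router sends the query.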

📌 Acknowledgement

We sincerely appreciate LLaMA-Factory and MM-EUREKA for providing the reference training frameworks.

📖 Citation

```bibtex
@article{lan2025metis,
  title={Metis-HOME: Hybrid Optimized Mixture-of-Experts for Multimodal Reasoning},
  author={Lan, Xiaohan and Liu, Fanfan and Qiu, Haibo and Yang, Siqi and Ruan, Delian and Shi, Peng and Ma, Lin},
  journal={arXiv preprint arXiv:2510.20519},
  year={2025}
}
```
