
# 📐 A Toolkit for Evaluating NEO Models

Comprehensive evaluation of NEO across Knowledge, Hallucination, General VQA, and OCR VQA benchmarks.

Models · Benchmarks · Usage & Scripts · Model Zoo · Results


## 🏗️ QuickStart

See [QuickStart] for a quick start guide.

## 🤖 Model Zoo

We release 2B and 9B NEO models at the Pre-Training (PT), Mid-Training (MT), and Supervised Fine-Tuning (SFT) stages. All weights are listed below; a short download sketch follows the table.

| Model Name | Model Weight |
| --- | --- |
| NEO-2B-PT | 🤗 NEO-2B-PT HF link |
| NEO-2B-MT | 🤗 NEO-2B-MT HF link |
| NEO-2B-SFT | 🤗 NEO-2B-SFT HF link |
| NEO-9B-PT | 🤗 NEO-9B-PT HF link |
| NEO-9B-MT | 🤗 NEO-9B-MT HF link |
| NEO-9B-SFT | 🤗 NEO-9B-SFT HF link |
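
To pull a checkpoint locally before running the demo or evaluations, `huggingface_hub`'s `snapshot_download` works for any of the weights above. A minimal sketch, assuming a hypothetical repo id (substitute the actual id behind the HF link in the table):

```python
# Minimal sketch: download NEO weights from the Hugging Face Hub.
# NOTE: "NEO-VLM/NEO-2B-SFT" is a placeholder repo id (assumption);
# use the actual id behind the "Model Weight" link above.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="NEO-VLM/NEO-2B-SFT",        # hypothetical repo id
    local_dir="checkpoints/NEO-2B-SFT",  # local destination for the weights
)
print(f"Weights downloaded to: {local_path}")
```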

## 📊 Benchmark Results

TABLE NOTE:

  • “#Data” = data scale for pre-training / mid-training / supervised fine-tuning (PT·MT·SFT).
  • “†” = vision-language models using Reinforcement Learning (RL).
  • “Any Res.” = any resolution; “Tile-wise” = image split into tiles; “Any Rat.” = any aspect ratio; “Fix Res.” = fixed resolution.
  • “MoE” = Mixture-of-Experts; “DaC” = Divide-and-Conquer.
  • Bold = best score in each column.

Instruct-2B models. Benchmark columns cover Knowledge, General VQA, OCR VQA, and Hallucination.

| Model | Base LLM | #Data (PT·MT·SFT) | Input Type | RoPE Type | MMMU | MMB | MMVet | MMStar | SEED-I | AI2D | DocVQA | ChartQA | InfoVQA | TextVQA | OCRBench | POPE | HallB |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **🔻 Modular Vision Language Models (Instruct-2B)** |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Qwen2-VL | Qwen2-1.5B | --·--·-- | Any Res. | M-RoPE | 41.1 | 74.9 | 49.5 | 48.0 | -- | 74.7 | 90.1 | 73.5 | 65.5 | **79.7** | 80.9 | -- | 41.7 |
| InternVL2.5 | InternLM2.5-1.8B | >6B·100M·16M | Tile-wise | 1D-RoPE | 43.6 | 74.7 | 60.8 | 53.7 | -- | 74.9 | 88.7 | 79.2 | 60.9 | 74.3 | 80.4 | **90.6** | 42.6 |
| InternVL3† | Qwen2.5-1.5B | >6B·100M·22M | Tile-wise | 1D-RoPE | 48.6 | **81.1** | **62.2** | **60.7** | -- | 78.7 | 88.3 | 80.2 | 66.1 | 77.0 | **83.5** | 89.6 | 42.5 |
| Qwen2.5-VL† | Qwen2.5-3B | --·--·-- | Any Res. | M-RoPE | **51.2** | 79.1 | 61.8 | 55.9 | -- | **81.6** | **93.9** | **84.0** | **77.1** | 79.3 | 79.7 | -- | **46.3** |
| Encoder-Based | Qwen3-1.7B | >6B·40M·4M | Tile-wise | 1D-RoPE | 47.1 | 75.8 | 37.4 | 52.7 | 73.6 | 77.4 | 89.9 | 78.4 | 65.9 | 73.3 | **83.5** | 87.0 | 44.4 |
| **🔻 Native Vision Language Models (Instruct-2B)** |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Mono-InternVL | InternLM2-1.8B | 1.2B·143M·7M | Tile-wise | 1D-RoPE | 33.7 | 65.5 | 40.1 | -- | 67.4 | 68.6 | 80.0 | 73.7 | 43.0 | 72.6 | 76.7 | -- | 34.8 |
| Mono-InternVL-1.5 | InternLM2-1.8B | 400M·150M·7M | Tile-wise | 1D-RoPE | 39.1 | 64.0 | 54.0 | -- | 66.9 | 67.4 | 81.7 | 72.2 | 47.9 | 73.7 | 80.1 | -- | 32.5 |
| HoVLE | InternLM2-1.8B | 550M·50M·7M | Tile-wise | 1D-RoPE | 32.2 | 73.3 | 43.8 | -- | 70.9 | 73.0 | 86.1 | 78.6 | 55.7 | 70.9 | 74.0 | 87.4 | 38.4 |
| OneCAT | Qwen2.5-1.5B | 436M·70M·13M | Any Res. | M-RoPE | 39.0 | 72.4 | 42.4 | -- | 70.9 | 72.4 | 87.1 | 76.2 | 56.3 | 67.0 | -- | -- | -- |
| NEO | Qwen3-1.7B | 345M·40M·4M | Any Res. | Native RoPE | 48.6 | 76.0 | 49.6 | 54.2 | **74.2** | 80.1 | 89.9 | 81.2 | 63.2 | 74.0 | 77.1 | 87.5 | 43.1 |

Instruct-8B models. Benchmark columns cover 📚 Knowledge, 💬 General VQA, 🔍 OCR VQA, and 👻 Hallucination.

| Model | Base LLM | #Data (PT·MT·SFT) | Input Type | RoPE Type | MMMU | MMB | MMVet | MMStar | SEED-I | AI2D | DocVQA | ChartQA | InfoVQA | TextVQA | OCRBench | POPE | HallB |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **🔻 Modular Vision Language Models (Instruct-8B)** |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Qwen2-VL | Qwen2-7B | --·--·-- | Any Res. | M-RoPE | 54.1 | 83.0 | 62.0 | 60.7 | -- | 83.0 | 94.5 | 83.0 | 76.5 | 84.3 | 86.6 | 88.1 | 50.6 |
| InternVL2.5 | InternLM2.5-7B | >6B·50M·4M | Tile-wise | 1D-RoPE | 56.0 | **84.6** | 62.8 | 64.4 | -- | 84.5 | 93.0 | 84.8 | 77.6 | 79.1 | 82.2 | 90.6 | 50.1 |
| Qwen2.5-VL† | Qwen2.5-7B | --·--·-- | Any Res. | M-RoPE | 55.0 | 83.5 | 67.1 | 63.9 | -- | 83.9 | **95.7** | **87.3** | **82.6** | **84.9** | 86.4 | 86.4 | 52.9 |
| InternVL3† | Qwen2.5-7B | >6B·100M·22M | Tile-wise | 1D-RoPE | **62.7** | 83.4 | **81.3** | **68.2** | -- | **85.2** | 92.7 | 86.6 | 76.8 | 80.2 | **88.0** | **91.1** | 49.9 |
| Encoder-Based | Qwen3-8B | >6B·40M·4M | Tile-wise | 1D-RoPE | 54.1 | 84.0 | 60.0 | 63.5 | 76.2 | 82.9 | 92.1 | 83.5 | 75.0 | 77.1 | 85.3 | 87.8 | 51.4 |
| **🔻 Native Vision Language Models (Instruct-8B)** |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Fuyu | Persimmon-8B | --·--·-- | Any Res. | 1D-RoPE | 27.9 | 10.7 | 21.4 | -- | 59.3 | 64.5 | -- | -- | -- | -- | 36.6 | 84.0 | -- |
| Chameleon | from scratch | 1.4B·0M·1.8M | Fix Res. | 1D-RoPE | 25.4 | 31.1 | 8.3 | -- | 30.6 | 46.0 | 1.5 | 2.9 | 5.0 | 4.8 | 0.7 | 19.4 | 17.1 |
| EVE | Vicuna-7B | 33M·0M·1.8M | Any Rat. | 1D-RoPE | 32.6 | 52.3 | 25.7 | -- | 64.6 | 61.0 | 53.0 | 59.1 | 25.0 | 56.8 | 39.8 | 85.0 | 26.4 |
| SOLO | Mistral-7B | 44M·0M·2M | Any Res. | 1D-RoPE | -- | 67.7 | 30.4 | -- | 64.4 | 61.4 | -- | -- | -- | -- | 12.6 | 78.6 | -- |
| Emu3 | from scratch | --·--·-- | Fix Res. | 1D-RoPE | 31.6 | 58.5 | 37.2 | -- | 68.2 | 70.0 | 76.3 | 68.6 | 43.8 | 64.7 | 68.7 | 85.2 | -- |
| EVEv2 | Qwen2.5-7B | 77M·15M·7M | Any Rat. | 1D-RoPE | 39.3 | 66.3 | 45.0 | -- | 71.4 | 74.8 | -- | 73.9 | -- | 71.1 | 70.2 | 87.6 | -- |
| BREEN | Qwen2.5-7B | 13M·0M·4M | Any Res. | 1D-RoPE | 42.7 | 71.4 | 38.9 | 51.2 | -- | 76.4 | -- | -- | -- | 65.7 | -- | -- | 37.0 |
| VoRA | Qwen2.5-7B | 30M·0M·0.6M | Any Res. | 1D-RoPE | 32.0 | 61.3 | 33.7 | -- | 68.9 | 61.1 | -- | -- | -- | 58.7 | -- | 85.5 | -- |
| SAIL | Mistral-7B | 512M·86M·6M | Any Res. | M-RoPE | -- | 70.1 | 46.3 | 53.1 | 72.9 | 76.7 | -- | -- | -- | 77.1 | 78.3 | 85.8 | **54.2** |
| NEO | Qwen3-8B | 345M·40M·4M | Any Res. | Native RoPE | 54.6 | 82.1 | 53.6 | 62.4 | **76.3** | 83.1 | 88.6 | 82.1 | 60.9 | 75.0 | 77.7 | 88.4 | 46.4 |

## 📊 Demonstration

```python
# Demo
from vlmeval.config import supported_VLM

model = supported_VLM['NEO1_0-2B-SFT']()

# Forward a single image
ret = model.generate(['assets/apple.jpg', 'What is in this image?'])
print(ret)  # The image features a red apple with a leaf on it.

# Forward multiple images
ret = model.generate(['assets/apple.jpg', 'assets/apple.jpg', 'How many apples are there in the provided images?'])
print(ret)  # There are two apples in the provided images.
```
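
The `generate` interface shown above also composes naturally into small batch loops. A minimal sketch built only on the calls demonstrated here (the image paths and questions are hypothetical examples):

```python
# Minimal batch-inference sketch over (image, question) pairs,
# using only the generate() interface demonstrated above.
from vlmeval.config import supported_VLM

model = supported_VLM['NEO1_0-2B-SFT']()

# Hypothetical example inputs
samples = [
    ('assets/apple.jpg', 'What color is the apple?'),
    ('assets/apple.jpg', 'Is there a leaf on the apple?'),
]

for image_path, question in samples:
    # generate() takes a flat list: image path(s) first, prompt last
    answer = model.generate([image_path, question])
    print(f'{question} -> {answer}')
```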