
# 📐 A Toolkit for Evaluating NEO Models

Comprehensive evaluation of NEO across Knowledge, Hallucination, General VQA, and OCR VQA benchmarks.

Models · Benchmarks · Usage & Scripts · Model Zoo · Results


## 🏗️ QuickStart

See [QuickStart] for a quick start guide.

## 🤖 Model Zoo

We release 2B and 9B NEO models at the Pre-Training (PT), Mid-Training (MT), and Supervised Fine-Tuning (SFT) stages. All weights are listed below; a short download sketch follows the table.

| Model Name | Model Weight |
| --- | --- |
| NEO-2B-PT | 🤗 NEO-2B-PT HF link |
| NEO-2B-MT | 🤗 NEO-2B-MT HF link |
| NEO-2B-SFT | 🤗 NEO-2B-SFT HF link |
| NEO-9B-PT | 🤗 NEO-9B-PT HF link |
| NEO-9B-MT | 🤗 NEO-9B-MT HF link |
| NEO-9B-SFT | 🤗 NEO-9B-SFT HF link |
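
To pull a checkpoint locally before running the demo or evaluations, `huggingface_hub`'s `snapshot_download` works for any of the weights above. A minimal sketch, assuming a hypothetical repo id (substitute the actual id behind the HF link in the table):

```python
# Minimal sketch: download NEO weights from the Hugging Face Hub.
# NOTE: "NEO-VLM/NEO-2B-SFT" is a placeholder repo id (assumption);
# use the actual id behind the "Model Weight" link above.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="NEO-VLM/NEO-2B-SFT",        # hypothetical repo id
    local_dir="checkpoints/NEO-2B-SFT",  # local destination for the weights
)
print(f"Weights downloaded to: {local_path}")
```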

## 📊 Benchmark Results

TABLE NOTE:

  • “#Data” = data scale for pre-training / mid-training / supervised fine-tuning (PT·MT·SFT).
  • “†” = vision-language models using Reinforcement Learning (RL).
  • “Any Res.” = any resolution; “Tile-wise” = image split into tiles; “Any Rat.” = any aspect ratio; “Fix Res.” = fixed resolution.
  • “MoE” = Mixture-of-Experts; “DaC” = Divide-and-Conquer.
  • Bold = best score in each column.

Instruct-2B models. Benchmark columns cover Knowledge, General VQA, OCR VQA, and Hallucination.

| Model | Base LLM | #Data (PT·MT·SFT) | Input Type | RoPE Type | MMMU | MMB | MMVet | MMStar | SEED-I | AI2D | DocVQA | ChartQA | InfoVQA | TextVQA | OCRBench | POPE | HallB |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **🔻 Modular Vision Language Models (Instruct-2B)** |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Qwen2-VL | Qwen2-1.5B | --·--·-- | Any Res. | M-RoPE | 41.1 | 74.9 | 49.5 | 48.0 | -- | 74.7 | 90.1 | 73.5 | 65.5 | **79.7** | 80.9 | -- | 41.7 |
| InternVL2.5 | InternLM2.5-1.8B | >6B·100M·16M | Tile-wise | 1D-RoPE | 43.6 | 74.7 | 60.8 | 53.7 | -- | 74.9 | 88.7 | 79.2 | 60.9 | 74.3 | 80.4 | **90.6** | 42.6 |
| InternVL3† | Qwen2.5-1.5B | >6B·100M·22M | Tile-wise | 1D-RoPE | 48.6 | **81.1** | **62.2** | **60.7** | -- | 78.7 | 88.3 | 80.2 | 66.1 | 77.0 | **83.5** | 89.6 | 42.5 |
| Qwen2.5-VL† | Qwen2.5-3B | --·--·-- | Any Res. | M-RoPE | **51.2** | 79.1 | 61.8 | 55.9 | -- | **81.6** | **93.9** | **84.0** | **77.1** | 79.3 | 79.7 | -- | **46.3** |
| Encoder-Based | Qwen3-1.7B | >6B·40M·4M | Tile-wise | 1D-RoPE | 47.1 | 75.8 | 37.4 | 52.7 | 73.6 | 77.4 | 89.9 | 78.4 | 65.9 | 73.3 | **83.5** | 87.0 | 44.4 |
| **🔻 Native Vision Language Models (Instruct-2B)** |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Mono-InternVL | InternLM2-1.8B | 1.2B·143M·7M | Tile-wise | 1D-RoPE | 33.7 | 65.5 | 40.1 | -- | 67.4 | 68.6 | 80.0 | 73.7 | 43.0 | 72.6 | 76.7 | -- | 34.8 |
| Mono-InternVL-1.5 | InternLM2-1.8B | 400M·150M·7M | Tile-wise | 1D-RoPE | 39.1 | 64.0 | 54.0 | -- | 66.9 | 67.4 | 81.7 | 72.2 | 47.9 | 73.7 | 80.1 | -- | 32.5 |
| HoVLE | InternLM2-1.8B | 550M·50M·7M | Tile-wise | 1D-RoPE | 32.2 | 73.3 | 43.8 | -- | 70.9 | 73.0 | 86.1 | 78.6 | 55.7 | 70.9 | 74.0 | 87.4 | 38.4 |
| OneCAT | Qwen2.5-1.5B | 436M·70M·13M | Any Res. | M-RoPE | 39.0 | 72.4 | 42.4 | -- | 70.9 | 72.4 | 87.1 | 76.2 | 56.3 | 67.0 | -- | -- | -- |
| NEO | Qwen3-1.7B | 345M·40M·4M | Any Res. | Native RoPE | 48.6 | 76.0 | 49.6 | 54.2 | **74.2** | 80.1 | 89.9 | 81.2 | 63.2 | 74.0 | 77.1 | 87.5 | 43.1 |

Instruct-8B models. Benchmark columns cover 📚 Knowledge, 💬 General VQA, 🔍 OCR VQA, and 👻 Hallucination.

| Model | Base LLM | #Data (PT·MT·SFT) | Input Type | RoPE Type | MMMU | MMB | MMVet | MMStar | SEED-I | AI2D | DocVQA | ChartQA | InfoVQA | TextVQA | OCRBench | POPE | HallB |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **🔻 Modular Vision Language Models (Instruct-8B)** |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Qwen2-VL | Qwen2-7B | --·--·-- | Any Res. | M-RoPE | 54.1 | 83.0 | 62.0 | 60.7 | -- | 83.0 | 94.5 | 83.0 | 76.5 | 84.3 | 86.6 | 88.1 | 50.6 |
| InternVL2.5 | InternLM2.5-7B | >6B·50M·4M | Tile-wise | 1D-RoPE | 56.0 | **84.6** | 62.8 | 64.4 | -- | 84.5 | 93.0 | 84.8 | 77.6 | 79.1 | 82.2 | 90.6 | 50.1 |
| Qwen2.5-VL† | Qwen2.5-7B | --·--·-- | Any Res. | M-RoPE | 55.0 | 83.5 | 67.1 | 63.9 | -- | 83.9 | **95.7** | **87.3** | **82.6** | **84.9** | 86.4 | 86.4 | 52.9 |
| InternVL3† | Qwen2.5-7B | >6B·100M·22M | Tile-wise | 1D-RoPE | **62.7** | 83.4 | **81.3** | **68.2** | -- | **85.2** | 92.7 | 86.6 | 76.8 | 80.2 | **88.0** | **91.1** | 49.9 |
| Encoder-Based | Qwen3-8B | >6B·40M·4M | Tile-wise | 1D-RoPE | 54.1 | 84.0 | 60.0 | 63.5 | 76.2 | 82.9 | 92.1 | 83.5 | 75.0 | 77.1 | 85.3 | 87.8 | 51.4 |
| **🔻 Native Vision Language Models (Instruct-8B)** |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Fuyu | Persimmon-8B | --·--·-- | Any Res. | 1D-RoPE | 27.9 | 10.7 | 21.4 | -- | 59.3 | 64.5 | -- | -- | -- | -- | 36.6 | 84.0 | -- |
| Chameleon | from scratch | 1.4B·0M·1.8M | Fix Res. | 1D-RoPE | 25.4 | 31.1 | 8.3 | -- | 30.6 | 46.0 | 1.5 | 2.9 | 5.0 | 4.8 | 0.7 | 19.4 | 17.1 |
| EVE | Vicuna-7B | 33M·0M·1.8M | Any Rat. | 1D-RoPE | 32.6 | 52.3 | 25.7 | -- | 64.6 | 61.0 | 53.0 | 59.1 | 25.0 | 56.8 | 39.8 | 85.0 | 26.4 |
| SOLO | Mistral-7B | 44M·0M·2M | Any Res. | 1D-RoPE | -- | 67.7 | 30.4 | -- | 64.4 | 61.4 | -- | -- | -- | -- | 12.6 | 78.6 | -- |
| Emu3 | from scratch | --·--·-- | Fix Res. | 1D-RoPE | 31.6 | 58.5 | 37.2 | -- | 68.2 | 70.0 | 76.3 | 68.6 | 43.8 | 64.7 | 68.7 | 85.2 | -- |
| EVEv2 | Qwen2.5-7B | 77M·15M·7M | Any Rat. | 1D-RoPE | 39.3 | 66.3 | 45.0 | -- | 71.4 | 74.8 | -- | 73.9 | -- | 71.1 | 70.2 | 87.6 | -- |
| BREEN | Qwen2.5-7B | 13M·0M·4M | Any Res. | 1D-RoPE | 42.7 | 71.4 | 38.9 | 51.2 | -- | 76.4 | -- | -- | -- | 65.7 | -- | -- | 37.0 |
| VoRA | Qwen2.5-7B | 30M·0M·0.6M | Any Res. | 1D-RoPE | 32.0 | 61.3 | 33.7 | -- | 68.9 | 61.1 | -- | -- | -- | 58.7 | -- | 85.5 | -- |
| SAIL | Mistral-7B | 512M·86M·6M | Any Res. | M-RoPE | -- | 70.1 | 46.3 | 53.1 | 72.9 | 76.7 | -- | -- | -- | 77.1 | 78.3 | 85.8 | **54.2** |
| NEO | Qwen3-8B | 345M·40M·4M | Any Res. | Native RoPE | 54.6 | 82.1 | 53.6 | 62.4 | **76.3** | 83.1 | 88.6 | 82.1 | 60.9 | 75.0 | 77.7 | 88.4 | 46.4 |

## 📊 Demonstration

```python
# Demo
from vlmeval.config import supported_VLM

model = supported_VLM['NEO1_0-2B-SFT']()

# Forward a single image
ret = model.generate(['assets/apple.jpg', 'What is in this image?'])
print(ret)  # The image features a red apple with a leaf on it.

# Forward multiple images
ret = model.generate(['assets/apple.jpg', 'assets/apple.jpg', 'How many apples are there in the provided images?'])
print(ret)  # There are two apples in the provided images.
```
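
The `generate` interface shown above also composes naturally into small batch loops. A minimal sketch built only on the calls demonstrated here (the image paths and questions are hypothetical examples):

```python
# Minimal batch-inference sketch over (image, question) pairs,
# using only the generate() interface demonstrated above.
from vlmeval.config import supported_VLM

model = supported_VLM['NEO1_0-2B-SFT']()

# Hypothetical example inputs
samples = [
    ('assets/apple.jpg', 'What color is the apple?'),
    ('assets/apple.jpg', 'Is there a leaf on the apple?'),
]

for image_path, question in samples:
    # generate() takes a flat list: image path(s) first, prompt last
    answer = model.generate([image_path, question])
    print(f'{question} -> {answer}')
```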