[2025.08.15] We-Math 2.0 homepage is live at we-math2.github.io.
[2025.08.15] We-Math 2.0 paper is now available on arXiv.
[2025.08.15] We-Math 2.0 dataset is now available on Hugging Face Datasets.
[2025.05.16] We-Math is accepted by ACL 2025.
[2025.02.20] We-Math is officially supported by VLMEvalKit for fast evaluation.
[2024.07.02] Our paper is now accessible at https://arxiv.org/abs/2407.01284.
[2024.07.02] Our dataset is now accessible at Hugging Face Datasets.
[2024.07.02] Our project homepage can be accessed at https://we-math.github.io/.
- 🔥 News 🔥
- About We-Math
- Leaderboard on We-Math
- Evaluation Pipelines on We-Math
- We-Math Dataset
- License
- Contributors
Inspired by human-like mathematical reasoning, we introduce We-Math, the first benchmark specifically designed to explore problem-solving principles beyond end-to-end performance. We meticulously collect and categorize 6.5K visual math problems, spanning 67 hierarchical knowledge concepts and 5 layers of knowledge granularity.
Overview diagram and the statistics of We-Math.
We first decompose composite problems into sub-problems according to the required knowledge concepts and introduce a novel four-dimensional metric, namely Insufficient Knowledge (IK), Inadequate Generalization (IG), Complete Mastery (CM), and Rote Memorization (RM), to hierarchically assess inherent issues in LMMs' reasoning process.
The pipeline of knowledge-based data decomposition (an example of a three-step problem in We-Math).
An example of the four-dimensional metrics for evaluating a two-step problem, using both loose and strict settings.
With We-Math, we conduct a thorough evaluation of existing LMMs in visual mathematical reasoning and reveal a negative correlation between the number of solving steps and problem-specific performance. We confirm that the IK issue of LMMs can be effectively improved via a knowledge augmentation strategy. More notably, the primary challenge of GPT-4o has significantly transitioned from IK to IG, establishing it as the first LMM advancing towards the knowledge generalization stage. In contrast, other LMMs exhibit a marked inclination towards Rote Memorization: they correctly solve composite problems involving multiple knowledge concepts, yet fail to answer the corresponding sub-problems. We anticipate that We-Math will open new pathways for advancements in visual mathematical reasoning for LMMs.
Overview of LMMs' performances on We-Math. Figures from left to right illustrate (1) the accuracy of different LMMs on various problem-solving steps, (2) the performance in different visual mathematics categories, and (3) the results of the knowledge-based reasoning evaluation.
🚨🚨 The Leaderboard is continuously being updated. We welcome the results of your model! To submit your results on the testmini subset to the leaderboard, please send your result JSON file and score CSV file to this email.
The models generate responses based on the given questions and images. Examples of generating responses with several LMMs are provided in the evaluation code. Our prompt specifies the answer-generation format to facilitate subsequent extraction of the answer via string matching. Please refer to the following template when preparing your result JSON files for evaluation.
{
"ID": "3steps_165",
"split": "testmini",
"knowledge concept": "Area of Circles",
"question": "As shown in the figure, there is a circular flower bed. Mary walked from the northernmost point of the flower bed along the edge to the easternmost point, taking a total of 80 steps. It is known that Mary's average step length is 0.628 cm, what is the area of the flower bed ( ) mΒ²?(Ο = 3.14)",
"option": "A. 200.96;B. 3215.36;C. 6280;D. 32; E. No correct answer",
"answer": "B",
"image_path": "3steps/image/165-3.png",
"key": "3steps_3",
"question number": 1575,
"knowledge concept description": "Area of ...",
"response": "<Thought process>: ... <Answer>: ..."
}
Due to the multiple-choice format of our dataset and the specific answer-generation prompt, we use string matching to extract answers directly, which eliminates the high cost of using additional models for answer extraction. The extracted answer is normalized to an option letter, and scores on our proposed four-dimensional metrics are calculated in four_dimensional_metrics.py.
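As an illustration, a minimal sketch of this kind of answer extraction (a hypothetical helper, not the repository's actual implementation) might look like the following:

# Hypothetical sketch of string-matching answer extraction; the actual logic in
# the repository's evaluation scripts may differ in detail.
import re

def extract_option(response: str) -> str:
    """Return the option letter (A-E) following the <Answer> tag, or "" if none is found."""
    answer_part = response.split("<Answer>:")[-1]  # text after the last <Answer> tag
    match = re.search(r"\b([A-E])\b", answer_part)
    return match.group(1) if match else ""

# e.g. extract_option("<Thought process>: ... <Answer>: B") -> "B"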
cd evaluation
python four_dimensional_metrics_refine.py \
--model_name GPT-4o \
--output_json ../output/GPT-4o.json \
--main_results_csv_path ../result/four_dimensional_metrics.csv
Performances on One-Step / Two-Step / Three-Step problems and on the different problem domains are obtained from accuracy.py.
cd evaluation
python accuracy.py \
--model_name GPT-4o \
--output_json ../output/GPT-4o.json \
--knowledge_structure_nodes_path /data/knowledge_structure_nodes.json
Based on the decomposed multi-step problems, we further reveal the inherent issues of LMMs in the problem-solving process. We feed both the M one-step sub-problems and the original problem into LMMs and classify the responses into the following four categories (a minimal code sketch follows the list):
- Insufficient Knowledge (IK): Some of the one-step sub-problems are answered incorrectly, and the multi-step problem is also wrong. This is reasonable: the model's insufficient grasp of a single knowledge concept may lead to errors in the multi-step problem.
- Inadequate Generalization (IG): The one-step sub-problems are all answered correctly, but the multi-step problem is incorrect. This is also considered reasonable: while LMMs can understand individual knowledge concepts, they may struggle to generalize that knowledge to solve composite problems.
- Complete Mastery (CM): The one-step sub-problems are all answered correctly, and the multi-step problem is also answered correctly. This demonstrates that the model's results are both reliable and accurate.
- Rote Memorization (RM): The one-step sub-problems contain errors, but the multi-step problem is answered correctly, which contradicts human logical thinking. If a model can solve a composite multi-step problem but fails on the one-step problems required in the process, the reliability of its results is called into question.
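To make the rule concrete, here is a simplified sketch of this classification. It is an illustration only; the released evaluation script additionally distinguishes the loose and strict settings mentioned above.

# Simplified illustration of the four-dimensional classification; the official
# evaluation script is the reference and also handles the loose/strict settings.
def classify(sub_correct: list, multi_correct: bool) -> str:
    all_subs_right = all(sub_correct)  # correctness of the M one-step sub-problems
    if all_subs_right and multi_correct:
        return "CM"  # Complete Mastery
    if all_subs_right and not multi_correct:
        return "IG"  # Inadequate Generalization
    if not all_subs_right and multi_correct:
        return "RM"  # Rote Memorization
    return "IK"      # Insufficient Knowledge

# e.g. classify([True, False], True) -> "RM"; classify([True, True], False) -> "IG"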
Our dataset is distributed under the CC BY-NC 4.0 license.
If you find We-Math useful for your research and applications, please kindly cite using this BibTeX:
@inproceedings{qiao2025we,
title={We-math: Does your large multimodal model achieve human-like mathematical reasoning?},
author={Qiao, Runqi and Tan, Qiuna and Dong, Guanting and Wu, Minhui and Sun, Chong and Song, Xiaoshuai and Wang, Jiapeng and GongQue, Zhuoma and Lei, Shanglin and Zhang, Yifan and others},
booktitle={Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
pages={20023--20070},
year={2025}
}

@article{qiao2025wemath2,
title={We-Math 2.0: A Versatile MathBook System for Incentivizing Visual Mathematical Reasoning},
author={Qiao, Runqi and Tan, Qiuna and Yang, Peiqing and Wang, Yanzi and Wang, Xiaowan and Wan, Enhui and Zhou, Sitong and Dong, Guanting and Zeng, Yuchen and Xu, Yida and others},
journal={arXiv preprint arXiv:2508.10433},
year={2025}
}

@article{qiao2025v,
title={V-Thinker: Interactive Thinking with Images},
author={Qiao, Runqi and Tan, Qiuna and Yang, Minghan and Dong, Guanting and Yang, Peiqing and Lang, Shiqiang and Wan, Enhui and Wang, Xiaowan and Xu, Yida and Yang, Lan and others},
journal={arXiv preprint arXiv:2511.04460},
year={2025}
}

Here are the key contributors to this project:
Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Zhuoma GongQue, Shanglin Lei, Zhe Wei, Miaoxuan Zhang, Runfeng Qiao, Yifan Zhang, Xiao Zong, Yida Xu, Muxi Diao, Zhimin Bao, Chen Li, Honggang Zhang.


