Weihao Xuan*
Rui Yang†
Heli Qi‡
Qingcheng Zeng§
Yunze Xiao¶
Aosong Feng#
Dairui Liu♯
Yun Xing♮
Junjue Wang*
Fan Gao*
Jinghui Lu♭
Yuang Jiang♭
Huitao Li†
Xin Li†
Kunyu Yu†
Ruihai Dong♯
Shangding Gu⊕
Yuekang Li⊗
Xiaofei XieΔ
Felix Juefei-XuΛ
Foutse KhomhΩ
Osamu Yoshie‡
Qingyu Chen#
Douglas TeodoroΨ
Nan Liu†
Randy GoebelΓ
Lei Ma*
Edison Marrese-Taylor*
Shijian Lu♮
Yusuke Iwasawa*
Yutaka Matsuo*
Irene Li*
*The University of Tokyo, Japan, †Duke-NUS Medical School, Singapore, ‡Waseda University, Japan,
§Northwestern University, United States, ¶Carnegie Mellon University, United States,
#Yale University, United States, ♯University College Dublin, Ireland, ♮Nanyang Technological University, Singapore,
♭Smartor Inc, Japan, ⊕University of California, Berkeley, United States, ⊗University of New South Wales, Australia,
ΔSingapore Management University, Singapore, ΛNew York University, United States,
ΩPolytechnique Montreal, Canada, ΨUniversity of Geneva, Switzerland, ΓUniversity of Alberta, Canada
MMLU-ProX is a multilingual benchmark that builds upon MMLU-Pro, extending it to 29 typologically diverse languages and evaluating large language models' reasoning capabilities across linguistic and cultural boundaries.
MMLU-ProX addresses critical limitations in existing multilingual benchmarks by:
- Extending coverage to 29 typologically diverse languages
- Building upon the challenging, reasoning-focused design of MMLU-Pro
- Employing a rigorous semi-automatic translation process with expert validation
- Ensuring conceptual accuracy, terminological consistency, and cultural relevance
- [2025/08] 🎉 MMLU-ProX was accepted to EMNLP 2025 (Main)! We are working on more languages for the second version. Stay tuned!
- [2025/05] MMLU-ProX now contains 29 languages, all available on Hugging Face! We provide both a lite version and a full version (see the loading sketch after this list).
- [2025/03] MMLU-ProX's evaluation is now available on lm-evaluation-harness!
- [2025/03] MMLU-ProX is now available on Hugging Face!
- [2025/03] We are still expanding this dataset to more languages! Stay tuned!
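To inspect the data directly, you can pull it from Hugging Face with the `datasets` library. The sketch below is illustrative only: the repository id (`li-lab/MMLU-ProX`), the per-language configuration name, and the split name are assumptions based on the release notes above, so please check the Hugging Face page for the exact identifiers.

```python
# Minimal sketch: load one language of MMLU-ProX from Hugging Face.
# Assumptions (verify on the Hugging Face page): the dataset repo id is
# "li-lab/MMLU-ProX", each language is a configuration (here "ja"),
# and a "test" split exists.
from datasets import load_dataset

dataset = load_dataset("li-lab/MMLU-ProX", "ja")  # hypothetical config name

print(dataset)             # available splits and row counts
print(dataset["test"][0])  # first record: question, options, answer
```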
To reproduce the results reported in our paper, we support vLLM-based evaluation through lm-evaluation-harness (Here) with the following command:
```bash
model_id=<your-target-model>
tensor_parallel_size=<number-of-gpus-to-use>
lang=<your-target-language>

python -m lm_eval \
    --model vllm \
    --model_args pretrained=${model_id},tensor_parallel_size=${tensor_parallel_size},dtype=auto,gpu_memory_utilization=0.9 \
    --batch_size auto \
    --tasks mmlu_prox_${lang}
```
Please refer to lm-evaluation-harness for more details on setup.
Note: Please install vllm==0.7.3 to reproduce our results; the only exception is Llama3.1-405B, which was evaluated with vllm==0.6.6.
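To run the evaluation across several languages, you can wrap the command above in a small driver script. Below is a minimal sketch: the model id and language codes are illustrative examples, and the full list of `mmlu_prox_*` tasks is documented in lm-evaluation-harness.

```python
# Sketch: sweep the lm-evaluation-harness command above over several languages.
# The model id and language codes are examples; consult lm-evaluation-harness
# for the complete set of mmlu_prox_* tasks.
import subprocess

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # example model
tensor_parallel_size = 1
langs = ["en", "ja", "fr"]  # illustrative subset of the 29 languages

for lang in langs:
    subprocess.run(
        [
            "python", "-m", "lm_eval",
            "--model", "vllm",
            "--model_args",
            f"pretrained={model_id},tensor_parallel_size={tensor_parallel_size},"
            "dtype=auto,gpu_memory_utilization=0.9",
            "--batch_size", "auto",
            "--tasks", f"mmlu_prox_{lang}",
        ],
        check=True,  # stop the sweep if any single run fails
    )
```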
```bibtex
@article{xuan2025mmluprox,
  title={{MMLU-ProX}: A Multilingual Benchmark for Advanced Large Language Model Evaluation},
  author={Xuan, Weihao and Yang, Rui and Qi, Heli and Zeng, Qingcheng and Xiao, Yunze and Feng, Aosong and Liu, Dairui and Xing, Yun and Wang, Junjue and Gao, Fan and others},
  journal={arXiv preprint arXiv:2503.10497},
  year={2025}
}
```
For questions or feedback about MMLU-ProX, please open an issue.