MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation

Full Version | arXiv

Affiliations: The University of Tokyo, Japan; Duke-NUS Medical School, Singapore; Waseda University, Japan; Northwestern University, United States; Carnegie Mellon University, United States; Yale University, United States; University College Dublin, Ireland; Nanyang Technological University, Singapore; Smartor Inc., Japan; University of California, Berkeley, United States; University of New South Wales, Australia; Singapore Management University, Singapore; New York University, United States; Polytechnique Montreal, Canada; University of Geneva, Switzerland; University of Alberta, Canada

Overview

MMLU-ProX is a multilingual benchmark that builds upon MMLU-Pro and extends it to 29 typologically diverse languages. It is designed to evaluate large language models' reasoning capabilities across linguistic and cultural boundaries.

MMLU-ProX addresses critical limitations in existing multilingual benchmarks by:

  • Extending coverage to 29 typologically diverse languages
  • Building upon the challenging, reasoning-focused design of MMLU-Pro
  • Employing a rigorous semi-automatic translation process with expert validation
  • Ensuring conceptual accuracy, terminological consistency, and cultural relevance

News

  • [2025/08] 🎉 MMLU-ProX was accepted by EMNLP 2025 (Main)! We are working on more languages for the second version. Stay tuned!
  • [2025/05] MMLU-ProX now covers 29 languages, all available on Hugging Face! We provide both a lite version and a full version.
  • [2025/03] MMLU-ProX's evaluation is now available on lm-evaluation-harness!
  • [2025/03] MMLU-ProX is now available on Hugging Face!
  • [2025/03] We are still expanding this dataset to more languages! Stay tuned!

Usage

To reproduce the results reported in our paper, we support vLLM-based evaluation through lm-evaluation-harness (here) with the following command:

model_id=<your-target-model>
tensor_parallel_size=<number-of-gpus-to-use>
lang=<your-target-language>

python -m lm_eval \
  --model vllm \
  --model_args pretrained=${model_id},tensor_parallel_size=${tensor_parallel_size},dtype=auto,gpu_memory_utilization=0.9 \
  --batch_size auto \
  --tasks mmlu_prox_${lang}

Please refer to lm-evaluation-harness for more details on how to set it up.

Note: Please install vllm==0.7.3 to reproduce our results, except for Llama3.1-405B, which was evaluated with vllm==0.6.6.
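
If you prefer to build your own evaluation pipeline instead of using lm-evaluation-harness, the dataset can also be loaded directly from Hugging Face with the datasets library. The sketch below is a minimal example, not an official script: the dataset repository id, configuration name, and field names are assumptions, so check the Hugging Face dataset card for the exact identifiers.

# Minimal sketch: load MMLU-ProX from Hugging Face for a custom evaluation loop.
# The repository id, configuration name, and field names below are assumptions;
# verify them against the dataset card on Hugging Face before use.
from datasets import load_dataset

lang = "de"  # target language code, one of the 29 supported languages

# Hypothetical dataset id; replace it with the id shown on the dataset card.
dataset = load_dataset("MMLU-ProX", lang, split="test")

for example in dataset.select(range(3)):
    # Items are multiple-choice questions following the MMLU-Pro format
    # (question text, candidate options, gold answer); confirm the field
    # names before building prompts around them.
    print(example)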

Citation

@article{xuan2025mmluprox,
  title={MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation},
  author={Xuan, Weihao and Yang, Rui and Qi, Heli and Zeng, Qingcheng and Xiao, Yunze and Feng, Aosong and Liu, Dairui and Xing, Yun and Wang, Junjue and Gao, Fan and others},
  journal={arXiv preprint arXiv:2503.10497},
  year={2025}
}

Contact

For questions or feedback about MMLU-ProX, please open an issue.

