Pushing Evaluation to the Limit
Abstract: Hundreds of benchmarks for evaluating large models from multiple perspectives have been presented over the past few years. Despite these substantial efforts, most of them remain closed-ended and are prone to overfitting due to potential data contamination in the ever-growing training corpora of large models, thereby undermining the credibility of the evaluation. Moreover, the increasing scale and scope of current benchmarks with transient metrics, as well as the heavily human-dependent curation procedure, pose significant challenges for timely maintenance and adaptation to gauge the advancing capabilities of large models. In this paper, we introduce MACEval, a Multi-Agent Continual Evaluation network for dynamic evaluation of large models, and define a new set of metrics to quantify performance longitudinally and sustainably. MACEval adopts an interactive and autonomous evaluation mode that employs role assignment, in-process data generation, and evaluation routing through a cascaded agent network. Extensive experiments on 9 open-ended tasks with 23 participating large models demonstrate that MACEval is (1) human-free and automatic, eliminating laborious result processing through guided inter-agent judgment; (2) efficient and economical, requiring considerably less data and overhead than related benchmarks to obtain comparable results; and (3) flexible and scalable, migrating or integrating existing benchmarks via customized evaluation topologies. We hope that MACEval can broaden future directions of large model evaluation.
Unlike existing LLM or MLLM benchmarks, which rely heavily on human involvement in source content collection and evaluate different abilities relatively independently, our MACEval follows two basic principles:
- (1) Not requiring pre-collected evaluation datasets, and all visual or text query-answer pairs are dynamically generated during the process;
- (2) Progressive capability evaluation scheme with real-time task adjustment.
Since existing benchmark datasets are finite and closed-ended, the measured performance scores do not reflect a model's maximum capabilities. To push the evaluated model to its limit, we adopt a stress-testing strategy in which the model is continuously challenged with increasingly difficult query-answer tasks until it fails to provide a correct response. We can then derive a long-standing performance metric by iteratively updating the envelope area formed by connecting performance points across difficulty levels.
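The stress-testing loop above can be sketched as follows. This is a minimal, hypothetical illustration; `solve` and `make_task` are placeholder callables standing in for the evaluated model and MACEval's in-process data generation, not the actual framework API.

```python
# Hypothetical sketch of the stress-testing strategy: raise the task
# difficulty one level at a time until the model first answers incorrectly.
# `solve` and `make_task` are illustrative placeholders, not MACEval's API.

def stress_test(solve, make_task, max_level=100):
    """Return per-level correctness flags up to and including the first failure."""
    results = []
    for level in range(1, max_level + 1):
        task = make_task(level)      # in-process query-answer generation
        correct = solve(task)        # the evaluated model's attempt
        results.append(correct)
        if not correct:              # the model has reached its limit
            break
    return results

# Toy usage: a "model" that can only handle difficulty levels up to 3.
print(stress_test(lambda task: task <= 3, lambda level: level))
```

The length of the returned list is the maximum difficulty level the model reached, which feeds directly into the envelope-area metric described next.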
We conduct a preliminary exploration of 9 tasks across five key domains currently emphasized in the field: visual perception, textual comprehension, math, algorithms, and coding.
We design an Area Under Curve (AUC)-inspired metric, ACC-AUC, that aggregates accuracy across difficulty levels.
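A hedged sketch of how an ACC-AUC-style score can be computed: the trapezoidal (envelope) area under accuracy points plotted against difficulty level, normalized to [0, 1]. The function name and the normalization by level span are illustrative assumptions, not the paper's exact definition.

```python
# Assumed ACC-AUC-style score: trapezoidal area under per-level accuracies,
# normalized by the difficulty span so a perfect model scores 1.0.
# This is an illustrative sketch, not the paper's official formula.

def acc_auc(accuracies):
    """`accuracies[i]` is the accuracy measured at difficulty level i + 1."""
    if len(accuracies) < 2:
        return accuracies[0] if accuracies else 0.0
    # Trapezoidal rule with unit spacing between adjacent difficulty levels.
    area = sum((a + b) / 2 for a, b in zip(accuracies, accuracies[1:]))
    return area / (len(accuracies) - 1)  # divide by level span to normalize

print(acc_auc([1.0, 0.5]))  # → 0.75
```

Because evaluation stops at the first failure, models that survive more difficulty levels accumulate area over a longer span, so the metric rewards both per-level accuracy and the maximum level reached.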
Our experiments include 23 large models in total, spanning cutting-edge proprietary models, open-source LLMs, and open-source MLLMs. For proprietary models, we include OpenAI models such as GPT-4o and GPT-4.1 (2025-04-14), and Google models such as Gemini-1.5-Pro, Gemini-2.0-Flash, and Gemini-2.5-Pro. For open-source LLMs, we include three mainstream language backbones widely used in many large models, i.e., the DeepSeek series (DeepSeek-V3 and DeepSeek-R1), the Qwen series (Qwen3-8B, Qwen2.5-{7, 14, 72}B, and Qwen2-7B), and the Llama series (Llama3.3-70B, Llama3.2-3B, and Llama3.1-8B). For open-source MLLMs, we include Qwen2.5-VL-{3, 7, 72}B, Qwen2-VL-7B, InternVL3-8B, InternVL2.5-{8, 38}B, and InternVL2-8B.
The performance of different interviewees (LLMs) on 4 text-only capabilities under a 1-hop line evaluation topology configuration.
Maximum performance level vs. ACC-AUC. We plot the performance of 15 LLMs against their corresponding maximum difficulty levels, each with a fit line and correlation coefficient. The red dotted lines denote the performance saturation line.
To be released.
@misc{chen2025maceval,
title={MACEval: A Multi-Agent Continual Evaluation Network for Large Models},
author={Zijian Chen and Yuze Sun and Yuan Tian and Wenjun Zhang and Guangtao Zhai},
year={2025},
eprint={2511.09139},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2511.09139},
}
Please contact the first author of this paper for any queries.
- Zijian Chen ([email protected])
