Pushing Evaluation to the Limit
Abstract: Hundreds of benchmarks for evaluating large models from multiple perspectives have been presented over the past few years. Despite these substantial efforts, most of them remain closed-ended and are prone to overfitting due to potential data contamination in the ever-growing training corpora of large models, thereby undermining the credibility of the evaluation. Moreover, the increasing scale and scope of current benchmarks with transient metrics, as well as the heavily human-dependent curation procedure, pose significant challenges for timely maintenance and adaptation to gauge the advancing capabilities of large models. In this paper, we introduce MACEval, a Multi-Agent Continual Evaluation network for dynamic evaluation of large models, and define a new set of metrics to quantify performance longitudinally and sustainably. MACEval adopts an interactive and autonomous evaluation mode that employs role assignment, in-process data generation, and evaluation routing through a cascaded agent network. Extensive experiments on 9 open-ended tasks with 23 participating large models demonstrate that MACEval is (1) human-free and automatic, eliminating laborious result processing through guided inter-agent judgment; (2) efficient and economical, requiring considerably less data and overhead than related benchmarks to obtain comparable results; and (3) flexible and scalable, migrating or integrating existing benchmarks via customized evaluation topologies. We hope that MACEval can broaden future directions of large model evaluation.
Unlike existing LLM or MLLM benchmarks, which rely heavily on human involvement in source content collection and evaluate different abilities relatively independently, our MACEval follows two basic principles:
- (1) Not requiring pre-collected evaluation datasets, and all visual or text query-answer pairs are dynamically generated during the process;
- (2) Progressive capability evaluation scheme with real-time task adjustment.
Since existing benchmark datasets are finite and closed-ended, the measured performance scores do not reflect a model's maximum capabilities. To push the evaluated model to its limit, we adopt a stress-testing strategy in which the model is continuously challenged with increasingly difficult query-answer tasks until it fails to provide a correct response. We can then derive a long-standing performance metric by iteratively updating the envelope area formed by connecting performance points across difficulty levels.
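The stress-testing loop above can be sketched as follows. This is a minimal, hypothetical illustration; `solve` and `make_task` are placeholder callables standing in for the evaluated model and MACEval's in-process data generation, not the actual framework API.

```python
# Hypothetical sketch of the stress-testing strategy: raise the task
# difficulty one level at a time until the model first answers incorrectly.
# `solve` and `make_task` are illustrative placeholders, not MACEval's API.

def stress_test(solve, make_task, max_level=100):
    """Return per-level correctness flags up to and including the first failure."""
    results = []
    for level in range(1, max_level + 1):
        task = make_task(level)      # in-process query-answer generation
        correct = solve(task)        # the evaluated model's attempt
        results.append(correct)
        if not correct:              # the model has reached its limit
            break
    return results

# Toy usage: a "model" that can only handle difficulty levels up to 3.
print(stress_test(lambda task: task <= 3, lambda level: level))
```

The length of the returned list is the maximum difficulty level the model reached, which feeds directly into the envelope-area metric described next.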
We conduct a preliminary exploration of 9 tasks across five key domains currently emphasized in the field: visual perception, textual comprehension, math, algorithms, and coding.
We design an Area Under Curve (AUC)-inspired metric, ACC-AUC, that aggregates accuracy across difficulty levels.
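A hedged sketch of how an ACC-AUC-style score can be computed: the trapezoidal (envelope) area under accuracy points plotted against difficulty level, normalized to [0, 1]. The function name and the normalization by level span are illustrative assumptions, not the paper's exact definition.

```python
# Assumed ACC-AUC-style score: trapezoidal area under per-level accuracies,
# normalized by the difficulty span so a perfect model scores 1.0.
# This is an illustrative sketch, not the paper's official formula.

def acc_auc(accuracies):
    """`accuracies[i]` is the accuracy measured at difficulty level i + 1."""
    if len(accuracies) < 2:
        return accuracies[0] if accuracies else 0.0
    # Trapezoidal rule with unit spacing between adjacent difficulty levels.
    area = sum((a + b) / 2 for a, b in zip(accuracies, accuracies[1:]))
    return area / (len(accuracies) - 1)  # divide by level span to normalize

print(acc_auc([1.0, 0.5]))  # → 0.75
```

Because evaluation stops at the first failure, models that survive more difficulty levels accumulate area over a longer span, so the metric rewards both per-level accuracy and the maximum level reached.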
Our experiments include 23 large models in total, spanning cutting-edge proprietary models, open-source LLMs, and open-source MLLMs. For proprietary models, we include OpenAI models such as GPT-4o and GPT-4.1 (2025-04-14), and Google models such as Gemini-1.5-Pro, Gemini-2.0-Flash, and Gemini-2.5-Pro. For open-source LLMs, we include three mainstream language backbones widely used in many large models, i.e., the DeepSeek series (DeepSeek-V3 and DeepSeek-R1), the Qwen series (Qwen3-8B, Qwen2.5-{7, 14, 72}B, and Qwen2-7B), and the Llama series (Llama3.3-70B, Llama3.2-3B, and Llama3.1-8B). For open-source MLLMs, we include Qwen2.5-VL-{3, 7, 72}B, Qwen2-VL-7B, InternVL3-8B, InternVL2.5-{8, 38}B, and InternVL2-8B.
The performance of different interviewees (LLMs) on 4 text-only capabilities under a 1-hop line evaluation topology configuration.
Maximum performance level vs. ACC-AUC. We plot the performance of 15 LLMs against their corresponding maximum difficulty levels, each with a fit line and correlation coefficient. The red dotted lines denote the performance saturation line.
To be released.
@misc{chen2025maceval,
title={MACEval: A Multi-Agent Continual Evaluation Network for Large Models},
author={Zijian Chen and Yuze Sun and Yuan Tian and Wenjun Zhang and Guangtao Zhai},
year={2025},
eprint={2511.09139},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2511.09139},
}
Please contact the first author of this paper for any queries.
- Zijian Chen ([email protected])
