arXiv:2511.09139v1 Announce Type: new 
Abstract: Hundreds of benchmarks dedicated to evaluating large models from multiple perspectives have been presented over the past few years. Despite substantial efforts, most of them remain closed-ended and are prone to overfitting due to potential data contamination in the ever-growing training corpora of large models, which undermines the credibility of the evaluation. Moreover, the increasing scale and scope of current benchmarks with transient metrics, as well as their heavily human-dependent curation procedures, pose significant challenges for timely maintenance and for adapting to the advancing capabilities of large models. In this paper, we introduce MACEval, a Multi-Agent Continual Evaluation network for dynamic evaluation of large models, and define a new set of metrics to quantify performance longitudinally and sustainably. MACEval adopts an interactive and autonomous evaluation mode that employs role assignment, in-process data generation, and evaluation routing through a cascaded agent network. Extensive experiments on 9 open-ended tasks with 23 participating large models demonstrate that MACEval is (1) human-free and automatic, mitigating laborious result processing through guided inter-agent judgment; (2) efficient and economical, requiring considerably less data and overhead than related benchmarks to obtain similar results; and (3) flexible and scalable, migrating or integrating existing benchmarks via customized evaluation topologies. We hope that MACEval can broaden future directions of large model evaluation.

The paper 'MACEval: A Multi-Agent Continual Evaluation Network for Large Models' introduces an evaluation framework for large models that addresses overfitting and data contamination in existing benchmarks. MACEval uses an autonomous, interactive evaluation mode, and experiments on nine open-ended tasks with 23 models indicate that it is efficient and flexible. The authors present this as a way to make model evaluation more credible and easier to adapt as AI capabilities advance.
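
To make the evaluation mode named in the abstract more concrete (role assignment, in-process data generation, and evaluation routing through a cascade), the following is a toy sketch only. It is not the authors' implementation: the class names `Interviewer` and `Judge`, the function `evaluate`, the arithmetic task, and the stop-on-first-failure routing rule are all assumptions made for illustration, since the abstract does not specify these details.

```python
from typing import Callable, List, Tuple

# All names here are hypothetical and chosen for illustration;
# none of them are taken from the MACEval paper.


class Interviewer:
    """Role 1: generates a fresh task per level (in-process data generation)."""

    def make_task(self, level: int) -> Tuple[str, int]:
        # Toy task: sum the integers 1..n, where n grows with the level.
        n = 10 ** level
        prompt = f"What is the sum of the integers from 1 to {n}?"
        return prompt, n * (n + 1) // 2


class Judge:
    """Role 2: scores an answer with no human in the loop."""

    def score(self, answer: int, reference: int) -> bool:
        return answer == reference


def evaluate(model: Callable[[str], int], max_level: int = 5) -> List[bool]:
    """Route the model under test through a cascade of increasingly hard levels."""
    interviewer, judge = Interviewer(), Judge()
    results: List[bool] = []
    for level in range(1, max_level + 1):
        prompt, reference = interviewer.make_task(level)
        passed = judge.score(model(prompt), reference)
        results.append(passed)
        if not passed:
            # Assumed routing rule: stop escalating after the first failure.
            break
    return results


def toy_model(prompt: str) -> int:
    """Stand-in for a large model: correct only while n is at most 1000."""
    n = int(prompt.rstrip("?").split()[-1])
    return n * (n + 1) // 2 if n <= 1000 else -1
```

Calling `evaluate(toy_model)` walks the stand-in model through levels until it fails, so the number of tasks generated depends on the model's ability rather than on a fixed dataset size. That is one plausible reading of how a cascaded design could reduce data and overhead, under the assumptions stated above.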

MACEval: A Multi-Agent Continual Evaluation Network for Large Models
