ContrastScore: Towards Higher Quality, Less Biased, More Efficient Evaluation Metrics with Contrastive Evaluation

arXiv — cs.CL · Wednesday, November 12, 2025
ContrastScore is a newly introduced evaluation metric designed to improve the quality and efficiency of automatic text assessment in natural language generation (NLG). Traditional metrics often align poorly with human evaluations, motivating the search for more reliable alternatives. Tested on machine translation and summarization tasks, ContrastScore consistently correlates better with human judgments than both single-model and ensemble-based baselines. Notably, it outperforms larger models such as Qwen 7B while using fewer parameters, underscoring its efficiency. ContrastScore also mitigates common evaluation biases, such as preferences for longer or higher-likelihood outputs, making it a meaningful advance in automatic text evaluation.
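The summary does not spell out ContrastScore's exact formulation, but the core idea of contrastive evaluation is to compare likelihoods from a stronger and a weaker language model, so that text which is merely generically probable (favored by both models) is down-weighted. The sketch below is a minimal, hypothetical illustration of that general idea, not the paper's actual method: the model pair (GPT-2 medium vs. GPT-2), the subtraction form, the `alpha` weight, and the length normalization are all illustrative assumptions.

```python
# Hypothetical sketch of contrastive scoring. ContrastScore's exact formula
# is not given in the summary; this only illustrates the general idea of
# contrasting a stronger model's likelihood against a weaker model's.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def sequence_logprob(model, tokenizer, text: str) -> float:
    """Average per-token log-probability of `text` under `model`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Shift so each position's logits predict the following token.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = log_probs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    # Length-normalize to reduce the length bias the article mentions.
    return token_lp.mean().item()

# Stand-in strong/weak pair (assumption); both share the GPT-2 vocabulary.
strong = AutoModelForCausalLM.from_pretrained("gpt2-medium").eval()
weak = AutoModelForCausalLM.from_pretrained("gpt2").eval()
tok = AutoTokenizer.from_pretrained("gpt2")

def contrast_score(candidate: str, alpha: float = 1.0) -> float:
    """Higher when the strong model prefers the text *more than* the weak
    one does, which penalizes generic text both models rate as likely."""
    return (sequence_logprob(strong, tok, candidate)
            - alpha * sequence_logprob(weak, tok, candidate))

print(contrast_score("The translation preserves the source meaning."))
```

Under this reading, subtracting the weak model's score acts as a correction term: plain likelihood rewards bland, high-probability text, whereas the contrastive difference highlights quality signals only the stronger model captures, which is consistent with the bias reduction the article describes.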
— via World Pulse Now AI Editorial System
