Bridging Human and LLM Judgments: Understanding and Narrowing the Gap

arXiv — stat.ML · Wednesday, December 3, 2025, 5:00:00 AM
  • Bridge is a new statistical framework for aligning the evaluations of large language models (LLMs) with human judgments. It addresses the discrepancies that arise when LLMs are used as judges of model outputs by refining LLM ratings through a latent human preference score estimated for each prompt-response pair (a simple illustrative sketch follows this summary).
  • Bridge matters because it improves the accuracy and reliability of LLM-as-judge assessments: by achieving higher agreement with human ratings, it could make LLM judges more dependable components of evaluation and decision-making pipelines.
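The paper's actual model is not spelled out in this summary, so the following is only a minimal sketch of the general idea: treat the "latent human preference score" as a recalibrated judge rating, fit on a small subset of pairs that also carry human labels. All data, the affine functional form, and the variable names here are illustrative assumptions, not Bridge's method.

```python
import numpy as np

# Hypothetical data: LLM-judge scores for n prompt-response pairs, with
# human ratings available for only a small labelled subset (all synthetic).
rng = np.random.default_rng(0)
n = 500
llm_scores = rng.uniform(1, 10, size=n)            # judge ratings on a 1-10 scale
labelled = rng.choice(n, size=50, replace=False)   # indices with human labels
human_scores = 0.8 * llm_scores[labelled] + 1.0 + rng.normal(0, 0.7, size=50)

# Fit a simple affine calibration from LLM scores to human scores on the
# labelled pairs -- a stand-in for estimating a latent human preference score.
A = np.vstack([llm_scores[labelled], np.ones(labelled.size)]).T
slope, intercept = np.linalg.lstsq(A, human_scores, rcond=None)[0]

# A proxy "latent preference" for every prompt-response pair: the calibrated rating.
latent_pref = slope * llm_scores + intercept
print(f"calibration: human ~= {slope:.2f} * llm + {intercept:.2f}")
```

In practice a framework like Bridge would presumably model rater noise and per-prompt effects rather than a single global affine map; the sketch only shows where the small set of human labels enters.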
— via World Pulse Now AI Editorial System


Continue Reading
Causal Judge Evaluation: Calibrated Surrogate Metrics for LLM Systems
Neutral · Artificial Intelligence
A new framework called Causal Judge Evaluation (CJE) has been introduced to address the statistical shortcomings of using large language models (LLMs) as judges in model assessments. CJE achieves 99% pairwise ranking accuracy on 4,961 prompts from Chatbot Arena while significantly reducing costs by using a calibrated judge trained with only 5% of oracle labels.
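CJE's actual estimator is not described in this teaser; purely as an illustration of "calibrating a judge with a small fraction of oracle labels," the sketch below uses isotonic regression on a 5% labelled slice and then measures pairwise ranking accuracy against the oracle. The data, the choice of isotonic regression, and the sample sizes are all assumptions.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Hypothetical data: judge scores and oracle (human) scores for n prompts;
# only ~5% of prompts carry an oracle label (all synthetic).
rng = np.random.default_rng(1)
n = 4000
judge = rng.uniform(0, 1, size=n)
oracle = np.clip(judge**2 + rng.normal(0, 0.05, size=n), 0, 1)  # monotone but nonlinear

labelled = rng.choice(n, size=int(0.05 * n), replace=False)

# Calibrate the judge onto the oracle scale using only the labelled slice.
iso = IsotonicRegression(out_of_bounds="clip")
iso.fit(judge[labelled], oracle[labelled])
calibrated = iso.predict(judge)

# Pairwise ranking accuracy of the calibrated judge against the oracle.
i, j = rng.integers(0, n, size=(2, 20000))
valid = oracle[i] != oracle[j]
acc = np.mean((calibrated[i] > calibrated[j])[valid] == (oracle[i] > oracle[j])[valid])
print(f"pairwise ranking accuracy vs. oracle: {acc:.3f}")
```

Note that a monotone recalibration like this puts judge scores on the oracle scale without changing their rank order, so any ranking gains reported for CJE would come from its actual estimator rather than from this stand-in.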
