Estimating LLM Consistency: A User Baseline vs Surrogate Metrics
Neutral · Artificial Intelligence
- A recent user study with 2,976 participants found that existing surrogate methods for measuring the consistency of large language models (LLMs) align poorly with human perceptions of consistency. The result highlights a known difficulty in ensuring reliable LLM outputs, which can be inconsistent and sometimes erroneous due to hallucinations and sensitivity to prompt phrasing.
- The findings underscore the need for metrics that better reflect actual user experience: current surrogate metrics, typically automatic similarity scores computed over repeated model outputs, may miss the nuances of human judgments about consistency in AI-generated responses (see the sketch after this list).
- The work is part of a broader discourse on the reliability and accountability of AI systems, as researchers pursue methods to improve LLM behavior, including reducing verbosity and over-refusal and mitigating the effect of spurious correlations on output quality.
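For context, a typical surrogate metric of this kind scores a prompt's consistency as the average pairwise similarity among several responses sampled from the same model. The sketch below is a minimal illustration using token-level Jaccard similarity; the function names and the choice of similarity measure are assumptions made here for illustration, not the specific metrics or human baseline evaluated in the study.

```python
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two responses (illustrative choice)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def surrogate_consistency(responses: list[str]) -> float:
    """Mean pairwise similarity across repeated samples of one prompt.

    A higher score means the sampled outputs vary less; this is the kind of
    automatic proxy that studies like this compare against human judgments.
    """
    pairs = list(combinations(responses, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Example: three answers sampled for the same prompt.
samples = [
    "Paris is the capital of France.",
    "The capital of France is Paris.",
    "France's capital city is Paris.",
]
print(f"surrogate consistency: {surrogate_consistency(samples):.2f}")
```

Note that such lexical scores can rate paraphrases as inconsistent even when readers would judge them equivalent, which is one plausible source of the mismatch with human perception the study reports.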
— via World Pulse Now AI Editorial System