Demo: Statistically Significant Results On Biases and Errors of LLMs Do Not Guarantee Generalizable Results

arXiv — cs.CL · Wednesday, November 5, 2025 at 5:00:00 AM
Recent research on medical chatbots powered by large language models (LLMs) reveals persistent biases and errors in their responses, which can be influenced by demographic factors. Although these chatbots aim to provide consistent medical advice, studies indicate that statistically significant results about their performance do not guarantee that the findings will generalize across diverse populations or contexts. This underscores the complexity of evaluating such systems and highlights the importance of understanding the specific conditions under which they fail. Consequently, the research emphasizes the urgent need for better evaluation infrastructure and methodologies to address these limitations and improve the reliability of medical chatbot outputs. These insights contribute to ongoing discussions about the responsible deployment of AI in healthcare, where demographic impacts and inherent biases must be carefully managed to ensure equitable and accurate medical guidance.
— via World Pulse Now AI Editorial System
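The central statistical point — that a significant difference measured in one study population need not replicate in another — can be sketched with a toy simulation. This is an illustrative example only, not code or data from the paper: the error rates, sample sizes, and group labels below are invented assumptions.

```python
# Toy illustration (NOT from the paper): a chatbot's error rate differs
# "significantly" between two demographic groups in one study population,
# yet the same comparison need not replicate in a second population where
# the underlying gap is smaller.
import math
import random


def two_proportion_z(errors_a: int, n_a: int, errors_b: int, n_b: int) -> float:
    """Two-proportion z statistic for comparing error rates of two groups."""
    p_a, p_b = errors_a / n_a, errors_b / n_b
    p_pool = (errors_a + errors_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se


def simulate_study(rate_a: float, rate_b: float, n: int, rng: random.Random) -> float:
    """Draw n Bernoulli 'chatbot error' outcomes per group and test the gap."""
    errs_a = sum(rng.random() < rate_a for _ in range(n))
    errs_b = sum(rng.random() < rate_b for _ in range(n))
    return two_proportion_z(errs_a, n, errs_b, n)


rng = random.Random(0)
# Study population: a real 5-point gap in error rates (20% vs 15%).
z_study = simulate_study(0.20, 0.15, n=1000, rng=rng)
# New population: the gap shrinks to 1 point (16% vs 15%).
z_new = simulate_study(0.16, 0.15, n=1000, rng=rng)
# |z| > 1.96 corresponds to p < 0.05 in a two-sided test.
print(f"study population: z = {z_study:.2f}")
print(f"new population:   z = {z_new:.2f}")
```

A significant z statistic in the study population says little about the new one: the effect size, not the p-value, governs whether the finding carries over, which is exactly the generalization gap the article describes.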
