WER is Unaware: Assessing How ASR Errors Distort Clinical Understanding in Patient-Facing Dialogue

arXiv — cs.CL · Monday, November 24, 2025 at 5:00:00 AM
  • A recent study has highlighted the limitations of relying on Word Error Rate (WER) in evaluating Automatic Speech Recognition (ASR) systems used in clinical dialogues. The research indicates that traditional metrics do not accurately reflect the clinical impact of transcription errors, as assessed by expert clinicians comparing ASR outputs to ground-truth utterances.
  • This development is significant as it challenges the current evaluation standards in ASR technology, particularly in healthcare settings, where accurate communication between doctors and patients is crucial for effective treatment and understanding.
  • The findings underscore a broader conversation about the need for more sophisticated evaluation methods for ASR systems, particularly as techniques like retrieval-augmented generation are explored to improve transcription accuracy in challenging contexts involving rare clinical terms.
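To make the study's core complaint concrete, here is a minimal sketch of how WER is conventionally computed: a word-level Levenshtein edit distance divided by reference length. The clinical examples are invented for illustration; the point is that WER charges every substitution the same cost, regardless of clinical severity.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference length,
    computed via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# Both hypotheses differ by one substitution, so WER is identical (1/3),
# yet only the first error inverts the clinical meaning:
print(wer("patient has hypertension", "patient has hypotension"))
print(wer("patient has hypertension", "patient had hypertension"))
```

This is exactly the blindness the paper targets: a metric that cannot distinguish a meaning-inverting drug or diagnosis error from a harmless function-word slip.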
— via World Pulse Now AI Editorial System


Continue Reading
ASR Error Correction in Low-Resource Burmese with Alignment-Enhanced Transformers using Phonetic Features
Positive · Artificial Intelligence
A recent study has introduced a novel approach to automatic speech recognition (ASR) error correction in low-resource Burmese, utilizing sequence-to-sequence Transformer models that integrate phonetic features and alignment information. This research marks the first dedicated effort to address ASR error correction specifically for the Burmese language, demonstrating significant improvements in word and character accuracy.
Enabling Differentially Private Federated Learning for Speech Recognition: Benchmarks, Adaptive Optimizers and Gradient Clipping
Positive · Artificial Intelligence
A recent study has established the first benchmark for applying differential privacy to federated learning for automatic speech recognition, addressing the challenges of training large transformer models under privacy constraints. The research highlights the problem of gradient heterogeneity and proposes techniques such as per-layer clipping and layer-wise gradient normalization to improve convergence.
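As a rough illustration of the per-layer clipping idea mentioned above, the sketch below clips each layer's gradient independently to a fixed L2 norm, so that one dominant layer cannot consume the whole clipping budget. This is a generic sketch, not the paper's implementation; the layer names and gradient representation are invented for the example.

```python
import math

def clip_per_layer(grads: dict, max_norm: float) -> dict:
    """Clip each layer's gradient independently to L2 norm <= max_norm.
    `grads` maps a layer name to its flattened gradient (list of floats)."""
    clipped = {}
    for name, g in grads.items():
        norm = math.sqrt(sum(x * x for x in g))
        # Scale down only if the layer exceeds the budget; leave small layers intact.
        scale = min(1.0, max_norm / (norm + 1e-12))
        clipped[name] = [x * scale for x in g]
    return clipped

# Hypothetical two-layer model: the encoder gradient dominates the decoder's.
grads = {"encoder": [3.0, 4.0], "decoder": [0.3, 0.4]}
out = clip_per_layer(grads, max_norm=1.0)
# encoder (norm 5.0) is rescaled to norm 1.0; decoder (norm 0.5) passes through.
```

Under global clipping, the encoder's large gradient would force the shared scale factor down and crush the decoder's signal; clipping per layer sidesteps that heterogeneity, which is the motivation the summary describes.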
Structured Prompting Enables More Robust, Holistic Evaluation of Language Models
Positive · Artificial Intelligence
A new framework, DSPy+HELM, has been introduced to enhance the evaluation of language models (LMs) by employing structured prompting methods that improve reasoning capabilities. This approach addresses the limitations of fixed prompts that often yield inaccurate performance estimates across various LMs. The framework aims to provide a more holistic assessment of LMs, which is crucial as their adoption grows across multiple domains.
Mispronunciation Detection and Diagnosis Without Model Training: A Retrieval-Based Approach
Positive · Artificial Intelligence
A novel framework for Mispronunciation Detection and Diagnosis (MDD) has been proposed, utilizing retrieval techniques with a pretrained Automatic Speech Recognition (ASR) model, eliminating the need for model training. This approach demonstrated a superior F1 score of 69.60% on the L2-ARCTIC dataset, showcasing its effectiveness in identifying pronunciation errors without the complexities of traditional methods.