arXiv:2511.09085v1
Abstract: In this work, we propose a streaming speech recognition framework for Amdo Tibetan, built upon a hybrid CTC/Attention architecture with a context-aware dynamic chunking mechanism. The proposed strategy adaptively adjusts chunk widths based on encoding states, enabling flexible receptive fields, cross-chunk information exchange, and robust adaptation to varying speaking rates, thereby alleviating the context truncation problem of fixed-chunk methods. To further capture the linguistic characteristics of Tibetan, we construct a lexicon grounded in its orthographic principles, providing linguistically motivated modeling units. During decoding, an external language model is integrated to enhance semantic consistency and improve recognition of long sentences. Experimental results show that the proposed framework achieves a word error rate (WER) of 6.23% on the test set, yielding a 48.15% relative improvement over the fixed-chunk baseline, while significantly reducing recognition latency and maintaining performance close to global decoding.
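The abstract does not say how chunk widths are derived from the encoding states, so the following is only a minimal sketch of the part that can be illustrated safely: how a sequence of variable chunk widths turns into a frame-level attention mask with cross-chunk context. The function name `chunk_attention_mask`, the `left_chunks` parameter, and the example widths are assumptions made for illustration and are not taken from the paper.

```python
from typing import List


def chunk_attention_mask(chunk_widths: List[int], left_chunks: int = 1) -> List[List[bool]]:
    """Build a frame-level attention mask from variable chunk widths.

    mask[i][j] is True when frame i may attend to frame j, i.e. when j lies
    in frame i's own chunk or in one of the `left_chunks` chunks before it.
    The widths are taken as given; choosing them adaptively is out of scope.
    """
    # Map every frame index to the index of the chunk that contains it.
    frame_to_chunk: List[int] = []
    for chunk_idx, width in enumerate(chunk_widths):
        if width <= 0:
            raise ValueError("chunk widths must be positive")
        frame_to_chunk.extend([chunk_idx] * width)

    n = len(frame_to_chunk)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # gap == 0: same chunk; 1..left_chunks: earlier chunks in range.
            gap = frame_to_chunk[i] - frame_to_chunk[j]
            mask[i][j] = 0 <= gap <= left_chunks
    return mask


# Hypothetical widths (in frames): three chunks of 2, 3, and 1 frames.
mask = chunk_attention_mask([2, 3, 1], left_chunks=1)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

With a fixed-chunk scheme every entry of `chunk_widths` would be the same number; allowing the widths to differ is what gives each frame a receptive field that can follow the speaking rate, while `left_chunks` controls how much earlier context crosses chunk boundaries.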

A new framework for streaming speech recognition in Amdo Tibetan has been developed, using a hybrid CTC/Attention architecture with a context-aware dynamic chunking mechanism. The approach adapts chunk widths based on encoding states and reaches a word error rate of 6.23% on the test set, a 48.15% relative improvement over the fixed-chunk baseline. The advance matters for the accessibility and usability of Tibetan language technology.
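To make the "relative improvement" figure concrete, the short sketch below inverts the usual definition, (baseline WER - new WER) / baseline WER, to estimate what the fixed-chunk baseline's WER would have been. The baseline value is inferred from the two reported numbers, not stated in the abstract, and the function names are illustrative.

```python
def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer is lower than baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer


def implied_baseline(new_wer: float, rel_improvement: float) -> float:
    """Invert the definition above to recover the baseline WER."""
    return new_wer / (1.0 - rel_improvement)


# Reported figures: 6.23% WER and a 48.15% relative improvement.
baseline = implied_baseline(6.23, 0.4815)
print(f"implied fixed-chunk baseline WER: {baseline:.2f}%")
print(f"round-trip check: {relative_improvement(baseline, 6.23):.4f}")
```

Under that definition the baseline works out to roughly 12%, so the reported gain corresponds to a drop of about six absolute WER points.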

Context-Aware Dynamic Chunking for Streaming Tibetan Speech Recognition
