Efficient ASR for Low-Resource Languages: Leveraging Cross-Lingual Unlabeled Data

arXiv — cs.CL · Tuesday, December 9, 2025, 5:00 AM
  • The paper presents a systematic investigation of automatic speech recognition (ASR) for low-resource languages written in the Perso-Arabic script, including Persian, Arabic, and Urdu. It demonstrates that cross-lingual unlabeled data can effectively enhance ASR performance without extensive labeled datasets: a 300M-parameter model trained on a 3,000-hour multilingual corpus achieves results comparable to much larger systems.
  • This development is significant because it addresses data scarcity, the central obstacle in low-resource ASR, improving recognition accuracy and accessibility for speakers of Persian, Arabic, and Urdu. By combining techniques such as continual pretraining and morphologically-aware tokenization, the compact model marks a substantial advance in ASR technology, potentially broadening communication and technology access in these linguistic communities.
  • The findings resonate with ongoing efforts to improve natural language processing (NLP) capabilities across various languages, particularly those with limited resources. The integration of context-aware strategies in ASR, as seen in recent advancements, highlights a growing recognition of the importance of linguistic diversity and the need for tailored solutions in AI. This trend reflects a broader commitment to inclusivity in technology, addressing the unique challenges posed by dialectal variations and complex grammatical structures in languages like Arabic and Persian.
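To make the idea of morphologically-aware tokenization concrete, here is a minimal sketch of how a tokenizer might split Arabic-script words at known affix boundaries so that inflected forms of the same stem share subword units. The affix lists, the greedy split rule, and the `@@` boundary marker are illustrative assumptions, not the paper's actual method.

```python
# Hedged sketch: morphologically-aware tokenization for Arabic-script text.
# The affix inventories below are a tiny illustrative subset, not a real
# morphological analyzer.

PREFIXES = ["ال", "و", "ب", "لل"]   # e.g. definite article "al-", conjunction "wa-"
SUFFIXES = ["ها", "ون", "ات", "ي"]  # common nominal/verbal endings

def morph_tokenize(word: str) -> list[str]:
    """Greedily strip at most one known prefix and one known suffix,
    keeping the stem intact so related word forms share subword units."""
    tokens = []
    # Try longest prefixes first; require a non-trivial remaining stem.
    for p in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(p) and len(word) > len(p) + 1:
            tokens.append(p + "@@")   # "@@" marks a subword boundary
            word = word[len(p):]
            break
    suffix = None
    for s in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(s) and len(word) > len(s) + 1:
            suffix = "@@" + s
            word = word[:-len(s)]
            break
    tokens.append(word)
    if suffix:
        tokens.append(suffix)
    return tokens

# "الكتاب" (al-kitab, "the book") splits into the article prefix plus stem:
# morph_tokenize("الكتاب") → ["ال@@", "كتاب"]
```

The design intuition is that splitting at morpheme boundaries rather than at arbitrary byte-pair positions keeps acoustically and semantically coherent stems as single units, which can reduce vocabulary fragmentation for morphologically rich languages.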
— via World Pulse Now AI Editorial System


Continue Reading
AraLingBench: A Human-Annotated Benchmark for Evaluating Arabic Linguistic Capabilities of Large Language Models
Neutral · Artificial Intelligence
AraLingBench has been introduced as a human-annotated benchmark aimed at evaluating the Arabic linguistic capabilities of large language models (LLMs), covering grammar, morphology, spelling, reading comprehension, and syntax through 150 expert-designed questions. The evaluation of 35 Arabic and bilingual LLMs indicates a disparity between high performance on knowledge-based benchmarks and true linguistic understanding, with many models relying on memorization rather than comprehension.