arXiv:2511.08389v1 Announce Type: cross 
Abstract: Speech Foundation Models have gained significant attention recently. Prior works have shown that the fusion of representations from multiple layers of the same model or the fusion of multiple models can improve performance on downstream tasks. We unify these two fusion strategies by proposing an interface module that enables fusion across multiple upstream speech models while integrating information across their layers. We conduct extensive experiments on different self-supervised and supervised models across various speech tasks, including ASR and paralinguistic analysis, and demonstrate that our method outperforms prior fusion approaches. We further analyze its scalability concerning model size and count, highlighting the importance of selecting appropriate upstream models. Our results show that the proposed interface provides an additional performance boost when given a suitable upstream model selection, making it a promising approach for utilizing Speech Foundation Models.
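To make the two fusion strategies named in the abstract concrete, here is a minimal sketch in plain Python. It is not the interface module proposed in the paper, whose design the abstract does not describe. It assumes the common baseline of a softmax-weighted sum over each model's layers (layer fusion), followed by per-frame concatenation across models (model fusion). The function names, the toy features, and the assumption that all models emit the same number of frames are hypothetical; in practice the layer scores would be learned parameters.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_layers(layers, layer_scores):
    """Layer fusion: weighted sum over one model's layers.

    layers: list of layers, each a list of frames, each frame a list of floats.
    layer_scores: one raw score per layer (hypothetical; learned in practice).
    """
    weights = softmax(layer_scores)
    num_frames = len(layers[0])
    dim = len(layers[0][0])
    fused = [[0.0] * dim for _ in range(num_frames)]
    for w, layer in zip(weights, layers):
        for t in range(num_frames):
            for d in range(dim):
                fused[t][d] += w * layer[t][d]
    return fused


def fuse_models(per_model_features):
    """Model fusion: concatenate each frame's features across models.

    Assumes every model produced the same number of frames.
    """
    num_frames = len(per_model_features[0])
    return [
        [x for feats in per_model_features for x in feats[t]]
        for t in range(num_frames)
    ]


# Toy upstream outputs: model_a has 2 layers of dimension 2,
# model_b has 1 layer of dimension 3; both cover 2 frames.
model_a = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[3.0, 2.0], [2.0, 3.0]],
]
model_b = [
    [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]],
]

fused = fuse_models([
    fuse_layers(model_a, [0.0, 0.0]),  # equal scores give a plain average
    fuse_layers(model_b, [0.0]),
])
print(len(fused), len(fused[0]))  # frames, fused feature dimension
```

The fused feature dimension is the sum of the per-model dimensions, which is one reason upstream model selection matters: each added model increases the input size of the downstream head.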

A new study proposes a unified approach to fusion in Speech Foundation Models: an interface module that fuses multiple upstream models while also integrating information across their layers. Tested with both self-supervised and supervised models on speech tasks including ASR and paralinguistic analysis, the method outperforms prior fusion approaches. The authors also analyze how it scales with model size and model count, and report that the additional performance boost depends on selecting suitable upstream models.

Unifying Model and Layer Fusion for Speech Foundation Models
