End-to-end Contrastive Language-Speech Pretraining Model For Long-form Spoken Question Answering

arXiv — cs.CL · Thursday, November 13, 2025 at 5:00:00 AM
The introduction of CLSR marks a notable advance in spoken question answering (SQA), addressing the limitations of existing methods on long audio recordings. Traditional large audio language models often fail to process lengthy inputs effectively, leaving a gap in performance. CLSR's design adds an intermediate step that transforms acoustic features into text-like representations, bridging the audio and text modalities. Experimental results across four cross-modal retrieval datasets show that CLSR outperforms both end-to-end speech retrievers and conventional pipeline approaches that chain speech recognition with text retrieval. This improves the efficiency of extracting relevant segments from long audio and lays a solid foundation for practical long-form SQA applications, potentially changing how users interact with audio content.
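The summary above does not spell out CLSR's training objective, but contrastive language-speech pretraining for retrieval is typically trained with a symmetric InfoNCE loss over paired speech and text embeddings, in the style of CLIP. The sketch below is a minimal, hypothetical illustration of that general technique (the function name, batch setup, and temperature value are assumptions, not details from the paper):

```python
import numpy as np

def contrastive_loss(speech_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    Row i of speech_emb is assumed paired with row i of text_emb;
    every other row in the batch acts as a negative example.
    This is a generic sketch, not CLSR's actual implementation.
    """
    # L2-normalize so dot products become cosine similarities
    s = speech_emb / np.linalg.norm(speech_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    logits = s @ t.T / temperature  # (batch, batch) similarity matrix

    def cross_entropy(lg):
        # the correct match for row i sits on the diagonal (column i)
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # average the speech->text and text->speech retrieval directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

# toy check: aligned pairs should score a lower loss than shuffled pairs
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
aligned = contrastive_loss(emb, emb + 0.01 * rng.normal(size=(4, 8)))
shuffled = contrastive_loss(emb, emb[::-1])
```

Minimizing such a loss pulls each audio segment's embedding toward its transcript's embedding and pushes it away from unrelated text, which is what lets a retriever match a text question against long-form audio segments directly.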
— via World Pulse Now AI Editorial System
