Lost in Translation, Found in Embeddings: Sign Language Translation and Alignment

arXiv — cs.CV · Wednesday, December 10, 2025 at 5:00:00 AM
  • A new unified model for sign language understanding has been developed, covering sign language translation (SLT) and sign-subtitle alignment (SSA). The model converts continuous signing videos into spoken-language text and temporally aligns signing with subtitles, supporting practical communication and educational applications.
  • The work matters because it gives deaf and hard-of-hearing communities more effective communication tools, improving access to information and educational resources through stronger sign language translation technology.
  • It also reflects a broader push to apply advanced machine learning to sign language recognition, with new frameworks tackling challenges such as high-speed fingerspelling recognition and semantic alignment, part of a wider trend toward inclusive technology.
— via World Pulse Now AI Editorial System
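
To make the sign-subtitle alignment (SSA) task concrete, here is a minimal sketch that treats it as monotonic matching between signing-segment embeddings and subtitle embeddings. The embedding sources and the DTW-style dynamic program are illustrative assumptions, not the method described in the paper.

```python
# Minimal SSA sketch: assign each signing segment to a subtitle so that the
# assignment is monotonic (no reordering) and total similarity is maximized.
# Embedding models, dimensions, and this DP formulation are assumptions.
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between rows of a (T x d) and b (S x d)."""
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-8)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-8)
    return a @ b.T

def align_monotonic(sim: np.ndarray) -> list[tuple[int, int]]:
    """DTW-style dynamic program over the similarity matrix: each video
    segment t is assigned a subtitle index s that never decreases."""
    T, S = sim.shape
    dp = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)  # 0 = same subtitle, 1 = next subtitle
    dp[0, 0] = sim[0, 0]
    for t in range(T):
        for s in range(S):
            if t == 0 and s == 0:
                continue
            best, arg = -np.inf, 0
            if t > 0 and dp[t - 1, s] > best:                 # stay on subtitle s
                best, arg = dp[t - 1, s], 0
            if t > 0 and s > 0 and dp[t - 1, s - 1] > best:   # advance to subtitle s
                best, arg = dp[t - 1, s - 1], 1
            dp[t, s], back[t, s] = best + sim[t, s], arg
    # Trace back the optimal monotonic assignment from the last cell.
    path, t, s = [], T - 1, S - 1
    while t >= 0:
        path.append((t, s))
        if t == 0:
            break
        s -= back[t, s]
        t -= 1
    return path[::-1]

# Toy example: 5 video-segment embeddings aligned to 3 subtitle embeddings.
rng = np.random.default_rng(0)
video_emb = rng.normal(size=(5, 64))
subtitle_emb = rng.normal(size=(3, 64))
print(align_monotonic(cosine_sim(video_emb, subtitle_emb)))
```

The monotonicity constraint encodes the assumption that signing and subtitles appear in the same order; real systems would also handle unsubtitled segments and soft boundaries.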


Continue Reading
RVLF: A Reinforcing Vision-Language Framework for Gloss-Free Sign Language Translation
Positive · Artificial Intelligence
A new framework called RVLF has been introduced to improve gloss-free sign language translation by addressing challenges in sign representation and semantic alignment. The three-stage reinforcing vision-language framework combines a large vision-language model with reinforcement learning, drawing on skeleton-based motion cues and DINOv2 visual features to improve translation performance.
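
As a rough illustration of how the ingredients named above (per-frame DINOv2 features and skeleton-based motion cues) could be combined into inputs for a translation model, the sketch below fuses the two streams into per-frame tokens. The fusion layer, dimensions, keypoint layout, and torch.hub entry point are assumptions for illustration; RVLF's actual three-stage training and reinforcement-learning objective are not reproduced here.

```python
# Illustrative fusion of DINOv2 visual features with skeleton motion cues.
# Not RVLF's architecture: layer sizes and the keypoint format are assumed.
import torch
import torch.nn as nn

class SignFeatureFusion(nn.Module):
    def __init__(self, visual_dim=384, skeleton_dim=2 * 133, hidden_dim=512):
        super().__init__()
        # Project each stream into a shared space, then fuse by concatenation.
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        self.skeleton_proj = nn.Linear(skeleton_dim, hidden_dim)
        self.fuse = nn.Sequential(nn.Linear(2 * hidden_dim, hidden_dim), nn.GELU())

    def forward(self, visual_feats, skeleton_feats):
        # visual_feats:   (T, visual_dim)   per-frame DINOv2 embeddings
        # skeleton_feats: (T, skeleton_dim) flattened 2D keypoints per frame
        v = self.visual_proj(visual_feats)
        s = self.skeleton_proj(skeleton_feats)
        return self.fuse(torch.cat([v, s], dim=-1))  # (T, hidden_dim) tokens

# DINOv2 ViT-S/14 (384-dim output) is available via torch.hub, e.g.:
#   dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
#   visual_feats = dinov2(frames)  # frames: (T, 3, 224, 224) preprocessed

# Toy run with random tensors standing in for real features.
fusion = SignFeatureFusion()
visual = torch.randn(16, 384)        # 16 frames of DINOv2 features
skeleton = torch.randn(16, 2 * 133)  # 16 frames of whole-body 2D keypoints
tokens = fusion(visual, skeleton)
print(tokens.shape)                  # torch.Size([16, 512])
```

The fused per-frame tokens would then feed a sequence-to-sequence translator producing spoken-language text; the framework's reinforcement-learning stage would fine-tune that translator on a reward such as translation quality.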