arXiv:2509.00030v3 Announce Type: replace-cross 
Abstract: Despite progress in gloss-free Sign Language Translation (SLT), traditional single-modality end-to-end approaches consistently fail on two critical components of natural signing: the precise recognition of high-speed fingerspelling and the integration of asynchronous non-manual cues from the face. Recent progress in SLT with Large Language Models has sidestepped this challenge, forcing a single network to learn both simultaneously, which results in poor performance when translating crucial information such as names, places, and technical terms. We introduce SignBind-LLM, a modular framework designed to overcome these limitations. Our approach employs separate, specialized predictors for continuous signing, fingerspelling, and lipreading. Each expert network first decodes its specific modality into a sequence of tokens. These parallel streams are then fused by a lightweight transformer that resolves temporal misalignments before passing the combined representation to a Large Language Model (LLM) for final sentence generation. Our method establishes a new state-of-the-art on the How2Sign, ChicagoFSWildPlus, and BOBSL datasets, with a BLEU-4 score of 22.1, a letter accuracy of 73.2%, and a BLEU-4 score of 6.8, respectively. These results validate our core hypothesis: isolating and solving distinct recognition tasks before fusion provides a more powerful and effective pathway to robust, high-fidelity sign language translation.
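
As a rough illustration of the "parallel streams" idea in the abstract, the sketch below lays out three expert token streams as one time-ordered, modality-tagged sequence, which is one plausible input format for a fusion module. It is not the SignBind-LLM fusion transformer, which is a learned component. The `Token` class, the per-token timestamps, the modality labels, and the `interleave` helper are all assumptions introduced for this example.

```python
from dataclasses import dataclass
import heapq


@dataclass(frozen=True)
class Token:
    time: float      # timestamp in seconds (assumed to be available per token)
    modality: str    # hypothetical labels: "sign", "fingerspell", "lip"
    text: str


def interleave(*streams):
    """Merge already time-sorted token streams into one time-ordered list."""
    return list(heapq.merge(*streams, key=lambda tok: tok.time))


# Toy outputs from three hypothetical expert predictors.
sign = [Token(0.0, "sign", "MY"), Token(0.4, "sign", "NAME")]
fingerspell = [Token(0.9, "fingerspell", "A-N-N-A")]
lip = [Token(0.5, "lip", "name"), Token(1.0, "lip", "anna")]

fused_input = interleave(sign, fingerspell, lip)
# Tag each token with its source so a downstream model can tell streams apart.
tagged = [f"<{t.modality}> {t.text}" for t in fused_input]
```

Sorting by timestamp only places the streams on a shared timeline; resolving the asynchrony between, say, a fingerspelled name and the matching lip pattern is the job the abstract assigns to the learned fusion transformer.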

SignBind-LLM is a modular framework for Sign Language Translation (SLT) that targets two weak points of end-to-end models: high-speed fingerspelling recognition and the integration of non-manual cues. It uses specialized predictors for continuous signing, fingerspelling, and lipreading, whose token outputs are fused by a lightweight transformer and passed to a Large Language Model for sentence generation.

SignBind-LLM: Multi-Stage Modality Fusion for Sign Language Translation

arXiv:2512.08040v1 Announce Type: new 
Abstract: Our aim is to develop a unified model for sign language understanding that performs sign language translation (SLT) and sign-subtitle alignment (SSA). Together, these two tasks enable the conversion of continuous signing videos into spoken language text and the temporal alignment of signing with subtitles, both essential for practical communication, large-scale corpus construction, and educational applications. To achieve this, our approach is built upon three components: (i) a lightweight visual backbone that captures manual and non-manual cues from human keypoints and lip-region images while preserving signer privacy; (ii) a Sliding Perceiver mapping network that aggregates consecutive visual features into word-level embeddings to bridge the vision-text gap; and (iii) a multi-task scalable training strategy that jointly optimises SLT and SSA, reinforcing both linguistic and temporal alignment. To promote cross-linguistic generalisation, we pretrain our model on large-scale sign-text corpora covering British Sign Language (BSL) and American Sign Language (ASL) from the BOBSL and YouTube-SL-25 datasets. With this multilingual pretraining and strong model design, we achieve state-of-the-art results on the challenging BOBSL (BSL) dataset for both SLT and SSA. Our model also demonstrates robust zero-shot generalisation and finetuned SLT performance on How2Sign (ASL), highlighting the potential of scalable translation across different sign languages.
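
To make the "aggregates consecutive visual features into word-level embeddings" step more concrete, here is a minimal sketch of sliding-window aggregation over frame features. Mean pooling is used purely as a stand-in: a Perceiver-style module would instead use learned latent queries that cross-attend to each window. The function names, window size, and stride are assumptions for illustration and are not taken from the paper.

```python
def sliding_windows(num_frames, window, stride):
    """Yield (start, end) frame index pairs that cover the whole sequence."""
    if stride <= 0 or window <= 0:
        raise ValueError("window and stride must be positive")
    if num_frames <= 0:
        return
    start = 0
    while True:
        end = min(start + window, num_frames)
        yield start, end
        if end == num_frames:
            break
        start += stride


def aggregate(features, window=4, stride=2):
    """Mean-pool each window of frame features into a single embedding.

    Mean pooling is a simplification chosen for brevity; it is not the
    learned aggregation described in the abstract.
    """
    pooled = []
    for start, end in sliding_windows(len(features), window, stride):
        chunk = features[start:end]
        dim = len(chunk[0])
        pooled.append([sum(f[d] for f in chunk) / len(chunk) for d in range(dim)])
    return pooled


# Six frames with 2-dimensional toy features.
frames = [[float(t), float(2 * t)] for t in range(6)]
embeddings = aggregate(frames, window=4, stride=2)
```

With a window of 4 and a stride of 2, the six frames collapse into two overlapping embeddings, which shows how a long frame sequence can be shortened toward roughly word-level granularity before it reaches a text decoder.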

A new unified model for sign language understanding has been developed, focusing on sign language translation (SLT) and sign-subtitle alignment (SSA). This model aims to convert continuous signing videos into spoken language text and align signing with subtitles, enhancing practical communication and educational applications.

Lost in Translation, Found in Embeddings: Sign Language Translation and Alignment

arXiv:2512.07273v1 Announce Type: new 
Abstract: Gloss-free sign language translation (SLT) is hindered by two key challenges: **inadequate sign representation** that fails to capture nuanced visual cues, and **sentence-level semantic misalignment** in current LLM-based methods, which limits translation quality. To address these issues, we propose a three-stage **r**einforcing **v**ision-**l**anguage **f**ramework (**RVLF**). We build a large vision-language model (LVLM) specifically designed for sign language, and then combine it with reinforcement learning (RL) to adaptively enhance translation performance. First, to obtain a sufficiently rich representation of sign language, RVLF introduces an effective semantic representation learning mechanism that fuses skeleton-based motion cues with semantically rich visual features extracted via DINOv2, followed by instruction tuning to obtain a strong SLT-SFT baseline. Then, to reduce sentence-level semantic misalignment, we introduce a GRPO-based optimization strategy that fine-tunes the SLT-SFT model with a reward function combining translation fidelity (BLEU) and sentence completeness (ROUGE), yielding the optimized model termed SLT-GRPO. Our conceptually simple framework yields substantial gains under the gloss-free SLT setting without pre-training on any external large-scale sign language datasets, improving BLEU-4 scores by +5.1, +1.11, +1.4, and +1.61 on the CSL-Daily, PHOENIX-2014T, How2Sign, and OpenASL datasets, respectively. To the best of our knowledge, this is the first work to incorporate GRPO into SLT. Extensive experiments and ablation studies validate the effectiveness of GRPO-based optimization in enhancing both translation quality and semantic consistency.
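
The reward described above, which mixes a fidelity term (BLEU) with a completeness term (ROUGE), and the group-relative scoring that GRPO relies on can be sketched as follows. This is a simplified illustration under stated assumptions, not the RVLF implementation: the fidelity term averages clipped 1-gram and 2-gram precision instead of computing full BLEU, completeness uses ROUGE-L F1, and the mixing weight `alpha` and all function names are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def ngram_precision(hyp, ref, n):
    """Clipped n-gram precision, the building block of BLEU (no brevity penalty)."""
    hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(hyp_ngrams.values())
    if total == 0:
        return 0.0
    overlap = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
    return overlap / total


def rouge_l_f1(hyp, ref):
    """ROUGE-L F1 computed from the longest common subsequence of two token lists."""
    if not hyp or not ref:
        return 0.0
    # Classic dynamic-programming LCS table.
    dp = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i, h in enumerate(hyp, 1):
        for j, r in enumerate(ref, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if h == r else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(hyp), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def reward(hyp, ref, alpha=0.5):
    """Hypothetical mix: alpha * fidelity + (1 - alpha) * completeness."""
    fidelity = mean(ngram_precision(hyp, ref, n) for n in (1, 2))
    return alpha * fidelity + (1 - alpha) * rouge_l_f1(hyp, ref)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of candidate translations sampled for the same signing clip.
reference = "the weather is nice today".split()
candidates = [
    "the weather is nice today".split(),
    "weather nice".split(),
    "it is raining".split(),
]
rewards = [reward(c, reference) for c in candidates]
advantages = group_relative_advantages(rewards)
```

Because advantages are computed relative to the other samples in the same group, no separate value network is needed: a candidate is reinforced only to the extent that it beats its siblings, which is the core idea behind GRPO.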

A new framework called RVLF has been introduced to enhance gloss-free sign language translation by addressing challenges in sign representation and semantic alignment. This three-stage reinforcing vision-language framework combines a large vision-language model with reinforcement learning to improve translation performance, utilizing advanced techniques such as skeleton-based motion cues and DINOv2 visual features.
