RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics

arXiv — cs.CV · Tuesday, October 28, 2025 at 4:00:00 AM
RoboRefer is a new vision-language model for robotics that improves how robots understand and interact with 3D environments. It targets a weakness of existing models: accurately interpreting complex scenes and reasoning about spatial instructions. By strengthening these spatial referring capabilities, RoboRefer enables more effective and intelligent robotic interaction in real-world settings, marking a significant advancement in the field.
— via World Pulse Now AI Editorial System
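
In concrete terms, spatial referring means grounding an instruction such as "the mug left of the laptop" to a location a robot can act on. A minimal sketch of what such an interface could look like follows; the class, function, and method names are illustrative assumptions, not RoboRefer's actual API:

```python
from dataclasses import dataclass

@dataclass
class ReferringResult:
    """A grounded spatial reference: a 2D image point plus a confidence."""
    x: float           # normalized image x-coordinate in [0, 1]
    y: float           # normalized image y-coordinate in [0, 1]
    confidence: float  # model confidence in [0, 1]

def refer(image, instruction: str) -> ReferringResult:
    """Hypothetical spatial-referring call: map a natural-language
    instruction to a point in the image for the robot to act on.
    A real system would run a vision-language model here; this stub
    only illustrates the input/output contract."""
    raise NotImplementedError("placeholder for a vision-language model")

# Intended usage (camera_frame and robot are assumed to exist):
# result = refer(camera_frame, "the empty cup left of the keyboard")
# if result.confidence > 0.5:
#     robot.move_to_pixel(result.x, result.y)
```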

Continue Reading
Robot learns to lip sync by watching YouTube
Neutral · Artificial Intelligence
A robot has learned to lip sync by observing YouTube videos, addressing a significant challenge in robotics: humanoids often struggle with realistic lip movements during conversations. This advancement highlights the importance of lip motion in human interaction, which draws nearly half of a listener's attention during face-to-face communication.
MVGGT: Multimodal Visual Geometry Grounded Transformer for Multiview 3D Referring Expression Segmentation
Positive · Artificial Intelligence
The Multimodal Visual Geometry Grounded Transformer (MVGGT) has been introduced as a novel framework for Multiview 3D Referring Expression Segmentation (MV-3DRES), addressing the limitations of existing methods that depend on dense point clouds. MVGGT enables segmentation directly from sparse multi-view images, enhancing efficiency and performance in real-world applications.
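
The MV-3DRES task itself implies a concrete input/output contract: several RGB views with known camera poses go in, and a per-view mask of the referred object comes out. A minimal sketch under that assumption (the function name and signature are illustrative, not taken from the paper):

```python
import numpy as np

def mv_3dres(images: list[np.ndarray],
             camera_poses: list[np.ndarray],
             expression: str) -> list[np.ndarray]:
    """Hypothetical signature for multiview 3D referring expression
    segmentation: given sparse multi-view RGB images, their 4x4 camera
    poses, and a referring expression, return one binary mask per view
    covering the referred object. MVGGT's architecture is not
    reproduced here; this stub only shows the task's inputs and outputs."""
    h, w = images[0].shape[:2]
    # Placeholder output: an empty (all-False) mask per input view.
    return [np.zeros((h, w), dtype=bool) for _ in images]
```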
