arXiv:2511.17335v1 Announce Type: cross 
Abstract: Human-robot collaboration towards a shared goal requires robots to understand human actions and their interaction with the surrounding environment. This paper focuses on human-robot interaction (HRI) based on human-robot dialogue that relies on robot action confirmation and action step generation using multimodal scene understanding. The state-of-the-art approach uses multimodal transformers to generate robot action steps aligned with robot action confirmation from a single clip showing a task composed of multiple micro steps. Although the actions in a long-horizon task depend on each other throughout an entire video, current approaches mainly focus on clip-level processing and do not leverage long-context information. This paper proposes a long-context Q-former that incorporates left and right context dependencies across full videos. Furthermore, this paper proposes a text-conditioning approach that feeds text embeddings directly into the LLM decoder to mitigate the Q-former's high abstraction of textual information. Experiments with the YouCook2 corpus show that the accuracy of confirmation generation is a major factor in the performance of action planning. Furthermore, we demonstrate that the long-context Q-former improves confirmation and action planning by integrating VideoLLaMA3.
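
The abstract does not say how the left and right context is assembled, so the following is only an illustrative sketch of the general idea: when processing one clip, also gather the features of its neighboring clips from the same video. The function names, the fixed symmetric window, and the plain-list feature format are all assumptions made for illustration, not the paper's method.

```python
from typing import List, Sequence


def context_window(num_clips: int, index: int, left: int, right: int) -> List[int]:
    """Return the indices of the clips visible when processing clip `index`.

    Hypothetical helper: the window is clamped at the video boundaries,
    so clips near the start or end simply see fewer neighbors.
    """
    if not 0 <= index < num_clips:
        raise IndexError("clip index out of range")
    start = max(0, index - left)
    stop = min(num_clips, index + right + 1)
    return list(range(start, stop))


def gather_context(
    clip_features: Sequence[Sequence[float]],
    index: int,
    left: int,
    right: int,
) -> List[float]:
    """Concatenate the features of the target clip and its neighbors.

    Features are joined in temporal order (left context, target, right
    context). A real model would attend over these features rather than
    concatenate them; concatenation keeps this sketch dependency-free.
    """
    gathered: List[float] = []
    for i in context_window(len(clip_features), index, left, right):
        gathered.extend(clip_features[i])
    return gathered
```

With `left=0` and `right=0` the sketch reduces to clip-level processing, which is the baseline setting the abstract contrasts against.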

A new study presents a method for enhancing human-robot interaction by utilizing a long-context Q-former integrated with multimodal large language models (LLMs). This approach focuses on generating robot action confirmations and planning action steps based on comprehensive scene understanding, addressing limitations of current methods that primarily rely on clip-level processing.

Robot Confirmation Generation and Action Planning Using Long-context Q-Former Integrated with Multimodal LLM
