arXiv:2509.11514v2 Announce Type: replace 
Abstract: During spontaneous conversations, speakers collaborate on novel referring expressions, which they can then re-use in subsequent conversations. Understanding such referring expressions is an important ability for an embodied agent, so that it can carry out tasks in the real world. This requires integrating and understanding language, vision, and conversational interaction. We study the capabilities of seven state-of-the-art Large Vision Language Models (LVLMs) as overhearers to a corpus of spontaneous conversations between pairs of human discourse participants engaged in a collaborative object-matching task. We find that such a task remains challenging for current LVLMs and they all fail to show a consistent performance improvement as they overhear more conversations from the same discourse participants repeating the same task for multiple rounds. We release our corpus and code for reproducibility and to facilitate future research.

A recent study evaluates seven state-of-the-art large vision language models (LVLMs) as overhearers of spontaneous human conversations in a collaborative object-matching task. The models struggle with the novel referring expressions that speakers create and reuse, and none shows a consistent improvement as it overhears more rounds from the same pair of speakers. The results point to the need for better integration of language, vision, and conversational interaction in embodied agents.

LVLMs are Bad at Overhearing Human Referential Communication

arXiv:2410.01870v3 Announce Type: replace 
Abstract: Fine-tuning large pre-trained foundation models often yields excellent downstream performance but is prohibitively expensive when updating all parameters. Parameter-efficient fine-tuning (PEFT) methods such as LoRA alleviate this by introducing lightweight update modules, yet they commonly rely on weight-agnostic linear approximations, limiting their expressiveness. In this work, we propose PEANuT, a novel PEFT framework that introduces weight-aware neural tweakers, compact neural modules that generate task-adaptive updates conditioned on frozen pre-trained weights. PEANuT provides a flexible yet efficient way to capture complex update patterns without full model tuning. We theoretically show that PEANuT achieves equivalent or greater expressivity than existing linear PEFT methods with comparable or fewer parameters. Extensive experiments across four benchmarks with over twenty datasets demonstrate that PEANuT consistently outperforms strong baselines in both NLP and vision tasks, while maintaining low computational overhead.

PEANuT is a novel parameter-efficient fine-tuning (PEFT) framework that aims to improve the adaptation of large pre-trained models. It uses weight-aware neural tweakers: compact neural modules that generate task-adaptive updates conditioned on the frozen pre-trained weights. This addresses a limitation of existing methods such as LoRA, which commonly rely on weight-agnostic linear approximations.
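
To make the contrast with LoRA concrete, here is a minimal NumPy sketch. The abstract does not give the exact form of a neural tweaker, so `tweaker_update`, its single ReLU layer, and the parameter names `theta1` and `theta2` are assumptions chosen for illustration, not the authors' implementation. The only point it shows is that a LoRA update ignores the frozen weight `W0`, while a weight-aware update is computed from it.

```python
import numpy as np

def lora_update(A, B):
    """Weight-agnostic low-rank update: delta_W = B @ A (never reads W0)."""
    return B @ A

def tweaker_update(W0, theta1, theta2):
    """Hypothetical weight-aware update: delta_W = relu(W0 @ theta1) @ theta2.

    This one-hidden-layer form is an assumption for illustration only.
    """
    hidden = np.maximum(W0 @ theta1, 0.0)   # (d_out, r), depends on W0
    return hidden @ theta2                  # (d_out, d_in)

rng = np.random.default_rng(0)
d_out, d_in, r = 8, 6, 2
W0 = rng.standard_normal((d_out, d_in))     # frozen pre-trained weight
x = rng.standard_normal(d_in)

# LoRA parameters: B starts at zero so the adapted layer starts equal to W0.
A = rng.standard_normal((r, d_in))
B = np.zeros((d_out, r))

# Tweaker parameters: theta2 starts at zero for the same reason.
theta1 = 0.1 * rng.standard_normal((d_in, r))
theta2 = np.zeros((r, d_in))

y_base = W0 @ x
y_lora = (W0 + lora_update(A, B)) @ x
y_tweak = (W0 + tweaker_update(W0, theta1, theta2)) @ x

# After some (here: random stand-in) training of theta2, the update
# changes when the frozen weight changes, unlike the LoRA update.
theta2_trained = rng.standard_normal((r, d_in))
delta_a = tweaker_update(W0, theta1, theta2_trained)
delta_b = tweaker_update(2.0 * W0, theta1, theta2_trained)

# Trainable parameter counts for this toy layer.
n_lora = A.size + B.size
n_tweak = theta1.size + theta2.size
```

In this toy setup both adapters leave the layer output unchanged at initialization, and only `theta1`, `theta2` (or `A`, `B`) would be trained while `W0` stays frozen. The parameter counts depend on the layer shape and on the assumed tweaker architecture, so they say nothing about the numbers reported in the paper.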
