arXiv:2511.16454v1 Announce Type: new 
Abstract: Developing a multi-modal language model capable of understanding 3D scenes remains challenging due to the limited availability of 3D training data, in contrast to the abundance of 2D datasets used for vision-language models (VLMs). As an alternative, we introduce LLaVA$^3$ (pronounced LLaVA-Cube), a novel method that improves the 3D scene understanding capabilities of VLMs using only multi-view 2D images and without any fine-tuning. Inspired by Cubist painters, who represented multiple viewpoints of a 3D object within a single picture, we propose to describe the 3D scene for the VLM through omnidirectional visual representations of each object. These representations are derived from an intermediate multi-view 3D reconstruction of the scene. Extensive experiments on 3D VQA and 3D language grounding show that our approach outperforms previous 2D-based VLM solutions.

LLaVA$^3$ (pronounced LLaVA-Cube) is a method that improves the 3D scene understanding of vision-language models (VLMs) using only multi-view 2D images and no fine-tuning. Inspired by Cubist painters, who depicted multiple viewpoints of an object in a single picture, it describes the scene to the VLM through omnidirectional visual representations of each object, derived from an intermediate multi-view 3D reconstruction. In experiments on 3D visual question answering (VQA) and 3D language grounding, it outperforms previous 2D-based VLM solutions.

LLaVA$^3$: Representing 3D Scenes like a Cubist Painter to Boost 3D Scene Understanding of VLMs
