Stitch and Tell: A Structured Multimodal Data Augmentation Method for Spatial Understanding

arXiv — cs.CV · Tuesday, December 9, 2025 at 5:00:00 AM
  • A new method called Stitch and Tell (SiTe) has been proposed to enhance the spatial understanding of vision-language models, addressing spatial hallucinations in which models describe object positions in an image incorrectly. The method constructs stitched image-text pairs and generates spatially-aware captions for them, without requiring extensive annotations or more capable teacher models (an illustrative sketch of this kind of augmentation follows below).
  • The introduction of SiTe is significant as it offers a cost-effective solution to improve the performance of various vision-language models, including LLaVA-v1.5-7B, LLaVA-Qwen2-1.5B, and HALVA-7B, potentially leading to advancements in AI applications that rely on accurate spatial reasoning.
— via World Pulse Now AI Editorial System
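
To make the idea of stitched image-text pairs concrete, the sketch below stitches two image-caption pairs into one spatially-aware training example. This is a minimal illustration, not the paper's implementation: the horizontal concatenation, the `stitch_pair` helper, and the left/right caption template are assumptions made for this example; the actual SiTe procedure may differ.

```python
# Minimal sketch of stitched image-text pair construction, in the spirit of SiTe.
# Assumptions (not taken from the paper): stitching is a plain horizontal
# concatenation, and the spatially-aware caption is a left/right template
# filled with the two source captions.

from PIL import Image


def stitch_pair(img_left: Image.Image, img_right: Image.Image,
                cap_left: str, cap_right: str) -> tuple[Image.Image, str]:
    """Concatenate two images side by side and emit a templated spatial caption."""
    # Resize both images to a common height so they stitch cleanly.
    height = min(img_left.height, img_right.height)
    img_left = img_left.resize((int(img_left.width * height / img_left.height), height))
    img_right = img_right.resize((int(img_right.width * height / img_right.height), height))

    # Paste the two images onto a single canvas: left image first, right image after it.
    canvas = Image.new("RGB", (img_left.width + img_right.width, height))
    canvas.paste(img_left, (0, 0))
    canvas.paste(img_right, (img_left.width, 0))

    # Templated caption that makes the relative layout explicit.
    caption = (f"On the left side of the image, {cap_left}. "
               f"On the right side of the image, {cap_right}.")
    return canvas, caption


if __name__ == "__main__":
    # Two toy images stand in for real image-caption pairs from a dataset.
    a = Image.new("RGB", (64, 64), "red")
    b = Image.new("RGB", (64, 64), "blue")
    stitched, caption = stitch_pair(a, b, "a red square is shown", "a blue square is shown")
    print(stitched.size)   # (128, 64)
    print(caption)
```

The appeal of this style of augmentation, as the summary notes, is that the spatial relation in the caption is known by construction, so no manual annotation or stronger captioning model is needed to supervise spatial grounding.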
