arXiv:2511.13719v1 Announce Type: cross 
Abstract: Despite remarkable progress, multimodal foundation models still exhibit surprising deficiencies in spatial intelligence. In this work, we explore scaling up multimodal foundation models to cultivate spatial intelligence within the SenseNova-SI family, built upon established multimodal foundations including visual understanding models (i.e., Qwen3-VL and InternVL3) and unified understanding and generation models (i.e., Bagel). We take a principled approach to constructing high-performing and robust spatial intelligence by systematically curating SenseNova-SI-8M: eight million diverse data samples under a rigorous taxonomy of spatial capabilities. SenseNova-SI demonstrates unprecedented performance across a broad range of spatial intelligence benchmarks: 68.7% on VSI-Bench, 43.3% on MMSI, 85.6% on MindCube, 54.6% on ViewSpatial, and 50.1% on SITE, while maintaining strong general multimodal understanding (e.g., 84.9% on MMBench-En). More importantly, we analyze the impact of data scaling, discuss early signs of emergent generalization capabilities enabled by diverse data training, analyze the risk of overfitting and language shortcuts, present a preliminary study on spatial chain-of-thought reasoning, and validate the potential downstream application. SenseNova-SI is an ongoing project, and this report will be updated continuously. All newly trained multimodal foundation models are publicly released to facilitate further research in this direction.

Despite rapid progress, multimodal foundation models still show notable deficiencies in spatial intelligence. The SenseNova-SI family addresses these by building on established models (Qwen3-VL, InternVL3, and Bagel) and training them on SenseNova-SI-8M, a dataset of eight million diverse samples organized under a taxonomy of spatial capabilities. The models report 68.7% on VSI-Bench, 43.3% on MMSI, 85.6% on MindCube, 54.6% on ViewSpatial, and 50.1% on SITE, while retaining strong general multimodal understanding (84.9% on MMBench-En).

Scaling Spatial Intelligence with Multimodal Foundation Models

arXiv:2511.17952v1 Announce Type: new 
Abstract: Understanding social interaction in video requires reasoning over a dynamic interplay of verbal and non-verbal cues: who is speaking, to whom, and with what gaze or gestures. While Multimodal Large Language Models (MLLMs) are natural candidates, simply adding visual inputs yields surprisingly inconsistent gains on social tasks. Our quantitative analysis of cross-modal attention inside state-of-the-art MLLMs reveals a core failure mode: in multi-speaker scenes, visual and textual tokens lack speaker-consistent alignment, exhibiting substantially weaker cross-modal attention than in object-centric images. To address this, we propose a multimodal multi-speaker attention alignment method that can be integrated into existing MLLMs. First, we introduce dynamic cross-modal head selection to identify attention heads most responsible for grounding. Then, an adaptive social-aware attention bias, computed from existing attention patterns and speaker locations, is injected into the attention mechanism. This bias reinforces alignment between a speaker's visual representation and their utterances without introducing trainable parameters or architectural changes. We integrate our method into three distinct MLLMs (LLaVA-NeXT-Video, Qwen2.5-VL, and InternVL3) and evaluate on three benchmarks (TVQA+, MMSI, OnlineMMSI). Across four social tasks, results demonstrate that our approach improves the ability of MLLMs and achieves state-of-the-art results. Attention visualizations confirm our method successfully focuses the model on speaker-relevant regions, enabling more robust multi-party social reasoning. Our implementation and model will be available at https://github.com/ut-vision/SocialInteraction.
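The two steps named in the abstract (picking grounding heads, then adding a bias to the attention scores) can be illustrated with a toy sketch. This is not the authors' implementation: the head-scoring rule (text-to-visual attention mass), the constant `strength`, and all function names here are assumptions made for illustration, whereas the paper describes an adaptive bias computed from existing attention patterns and speaker locations. Causal masking is also ignored.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of attention logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_modal_mass(attn, text_idx, visual_idx):
    """Mean attention that text queries place on visual keys for one head."""
    return sum(attn[q][k] for q in text_idx for k in visual_idx) / len(text_idx)

def select_heads(attn_per_head, text_idx, visual_idx, top_k):
    """Assumed selection rule: keep the top_k heads with the most text-to-visual mass."""
    scores = [cross_modal_mass(a, text_idx, visual_idx) for a in attn_per_head]
    ranked = sorted(range(len(scores)), key=lambda h: scores[h], reverse=True)
    return sorted(ranked[:top_k])

def biased_attention(logits, speaker_text, speaker_visual, strength):
    """Add a bias to (utterance token, same-speaker visual token) logits, then softmax.

    No weights are trained or added; only the pre-softmax scores change.
    A constant strength is a simplification of the adaptive bias in the abstract.
    """
    biased = [row[:] for row in logits]
    for speaker, q_idx in speaker_text.items():
        for q in q_idx:
            for k in speaker_visual.get(speaker, []):
                biased[q][k] += strength
    return [softmax(row) for row in biased]

# Toy scene: tokens 0-1 are visual (speaker A's region, speaker B's region),
# tokens 2-3 are text (A's utterance, B's utterance).
visual_idx, text_idx = [0, 1], [2, 3]
head_uniform = [[0.25] * 4 for _ in range(4)]
head_grounded = [[0.25] * 4, [0.25] * 4,
                 [0.45, 0.45, 0.05, 0.05], [0.45, 0.45, 0.05, 0.05]]
selected = select_heads([head_uniform, head_grounded], text_idx, visual_idx, top_k=1)

logits = [[0.0] * 4 for _ in range(4)]
attn = biased_attention(logits, {"A": [2], "B": [3]}, {"A": [0], "B": [1]}, strength=1.0)
```

In this toy setup the second head is selected because its text tokens attend more to visual tokens, and after the bias each utterance token places more weight on its own speaker's visual token than on the other speaker's.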

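A method has been proposed to improve how Multimodal Large Language Models (MLLMs) understand social interaction in multi-speaker video, without adding trainable parameters or changing the architecture. It targets a failure mode the authors identify in existing MLLMs: in multi-speaker scenes, visual and textual tokens lack speaker-consistent alignment and show weaker cross-modal attention than in object-centric images. The method selects the attention heads most responsible for grounding and injects a social-aware attention bias that reinforces the link between each speaker's visual representation and their utterances.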

Multi-speaker Attention Alignment for Multimodal Social Interaction

arXiv:2511.18448v1 Announce Type: new 
Abstract: Multimodal large language models (MLLMs) have made significant advancements in event-based vision, yet the comprehensive evaluation of their capabilities within a unified benchmark remains largely unexplored. In this work, we introduce EventBench, a benchmark that offers eight diverse task metrics together with a large-scale event stream dataset. EventBench differs from existing event-based benchmarks in four key aspects: (1) openness in accessibility, releasing all raw event streams and task instructions across eight evaluation metrics; (2) diversity in task coverage, spanning understanding, recognition, and spatial reasoning tasks for comprehensive capability assessment; (3) integration in spatial dimensions, pioneering the design of 3D spatial reasoning tasks for event-based MLLMs; and (4) scale in data volume, with an accompanying training set of over one million event-text pairs supporting large-scale training and evaluation. Using EventBench, we evaluate state-of-the-art closed-source models such as GPT-5 and Gemini-2.5 Pro, leading open-source models including Qwen2.5-VL and InternVL3, and event-based MLLMs such as EventGPT that directly process raw event streams. Extensive evaluation reveals that while current event-based MLLMs demonstrate strong performance in event stream understanding, they continue to struggle with fine-grained recognition and spatial reasoning.

EventBench is a new benchmark for evaluating multimodal large language models (MLLMs) on event-based vision. It pairs eight task metrics with a large-scale event stream dataset, including a training set of over one million event-text pairs, and covers understanding, recognition, and spatial reasoning tasks. The evaluation finds that current event-based MLLMs perform well on event stream understanding but still struggle with fine-grained recognition and spatial reasoning.

EventBench: Towards Comprehensive Benchmarking of Event-based MLLMs

arXiv:2511.01295v2 Announce Type: replace 
Abstract: Recent advances in multi-modal generative models have driven substantial improvements in image editing. However, current generative models still struggle with handling diverse and complex image editing tasks that require implicit reasoning, underscoring the need for a comprehensive benchmark to systematically assess their performance across various reasoning scenarios. Existing benchmarks primarily focus on single-object attribute transformation in realistic scenarios, which, while effective, encounter two key challenges: (1) they largely overlook multi-object interactions as well as game-world scenarios that involve human-defined rules, which are common in real-life applications; (2) they only rely on textual references to evaluate the generated images, potentially leading to systematic misjudgments, especially in complex reasoning scenarios. To this end, this work proposes UniREditBench, a unified benchmark for reasoning-based image editing evaluation. It comprises 2,700 meticulously curated samples, covering both real- and game-world scenarios across 8 primary dimensions and 18 sub-dimensions. To improve evaluation reliability, we introduce multimodal dual-reference evaluation, providing both textual and ground-truth image references for each sample assessment. Furthermore, we design an automated multi-scenario data synthesis pipeline and construct UniREdit-Data-100K, a large-scale synthetic dataset with high-quality chain-of-thought (CoT) reasoning annotations. We fine-tune Bagel on this dataset and develop UniREdit-Bagel, demonstrating substantial improvements in both in-domain and out-of-distribution settings. Through thorough benchmarking of both open-source and closed-source image editing models, we reveal their strengths and weaknesses across various aspects.

UniREditBench is a unified benchmark for evaluating reasoning-based image editing. It addresses two gaps in existing benchmarks: they largely overlook multi-object interactions and game-world scenarios with human-defined rules, and they rely only on textual references when judging generated images. The benchmark comprises 2,700 curated samples across 8 primary dimensions and 18 sub-dimensions, and it evaluates each sample against both a textual reference and a ground-truth image reference.

UniREditBench: A Unified Reasoning-based Image Editing Benchmark
