arXiv:2510.21323v1 Announce Type: cross 
Abstract: The alignment of vision-language representations endows current Vision-Language Models (VLMs) with strong multi-modal reasoning capabilities. However, the interpretability of the alignment component remains uninvestigated due to the difficulty in mapping the semantics of multi-modal representations into a unified concept set. To address this problem, we propose VL-SAE, a sparse autoencoder that encodes vision-language representations into its hidden activations. Each neuron in its hidden layer correlates to a concept represented by semantically similar images and texts, thereby interpreting these representations with a unified concept set. To establish the neuron-concept correlation, we encourage semantically similar representations to exhibit consistent neuron activations during self-supervised training. First, to measure the semantic similarity of multi-modal representations, we perform their alignment in an explicit form based on cosine similarity. Second, we construct the VL-SAE with a distance-based encoder and two modality-specific decoders to ensure the activation consistency of semantically similar representations. Experiments across multiple VLMs (e.g., CLIP, LLaVA) demonstrate the superior capability of VL-SAE in interpreting and enhancing the vision-language alignment. For interpretation, the alignment between vision and language representations can be understood by comparing their semantics with concepts. For enhancement, the alignment can be strengthened by aligning vision-language representations at the concept level, contributing to performance improvements in downstream tasks, including zero-shot image classification and hallucination elimination. Codes are available at https://github.com/ssfgunner/VL-SAE.

VL-SAE is a sparse autoencoder that makes the vision-language alignment inside VLMs interpretable by mapping image and text representations onto one shared set of concepts. Each hidden neuron corresponds to a concept backed by semantically similar images and texts, so the alignment can be inspected by comparing the concepts each modality activates. According to the authors, aligning representations at the concept level also improves downstream performance on models such as CLIP and LLaVA, including zero-shot image classification and hallucination elimination.
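To make the described architecture concrete, here is a minimal forward-pass sketch in plain Python. It is not the authors' implementation (see their repository for that). The class name `ToySharedSAE`, the "2 minus Euclidean distance" activation, the top-k sparsification, and the random weights are all assumptions chosen for illustration. Only the overall shape (cosine-based similarity, a distance-based encoder, and two modality-specific decoders) comes from the abstract.

```python
import math
import random


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def cosine(u, v):
    """Cosine similarity, the explicit alignment measure named in the abstract."""
    return sum(a * b for a, b in zip(l2_normalize(u), l2_normalize(v)))


class ToySharedSAE:
    """Hypothetical stand-in for VL-SAE: one shared encoder, two decoders."""

    def __init__(self, dim, n_concepts, k, seed=0):
        rng = random.Random(seed)

        def rand_matrix():
            return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_concepts)]

        self.k = k
        # One unit-norm "concept" vector per hidden neuron (assumption).
        self.concepts = [l2_normalize(row) for row in rand_matrix()]
        # Separate linear decoder per modality, as the abstract describes.
        self.decoders = {"vision": rand_matrix(), "text": rand_matrix()}

    def encode(self, x):
        """Distance-based activations, sparsified by keeping the top k (assumption)."""
        x = l2_normalize(x)
        # Unit vectors are at most 2 apart, so closer concepts get larger activations.
        acts = [2.0 - math.dist(x, c) for c in self.concepts]
        keep = set(sorted(range(len(acts)), key=acts.__getitem__, reverse=True)[: self.k])
        return [a if i in keep else 0.0 for i, a in enumerate(acts)]

    def decode(self, acts, modality):
        """Reconstruct a representation with the decoder of the given modality."""
        weights = self.decoders[modality]
        return [sum(a * row[d] for a, row in zip(acts, weights)) for d in range(len(weights[0]))]


rng = random.Random(1)
image_rep = [rng.gauss(0.0, 1.0) for _ in range(8)]
text_rep = [3.0 * x for x in image_rep]  # same direction, different scale

sae = ToySharedSAE(dim=8, n_concepts=16, k=4)
image_acts = sae.encode(image_rep)
text_acts = sae.encode(text_rep)
print("cosine similarity:", round(cosine(image_rep, text_rep), 6))
print("active neurons (image):", [i for i, a in enumerate(image_acts) if a > 0])
print("active neurons (text): ", [i for i, a in enumerate(text_acts) if a > 0])
```

Because the toy encoder normalizes its input and depends only on distances to shared concept vectors, two representations that point in the same direction activate the same neurons, which is the activation consistency the abstract aims for. Training, loss terms, and real VLM features are deliberately left out.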

VL-SAE: Interpreting and Enhancing Vision-Language Alignment with a Unified Concept Set

arXiv:2601.06204v2 Announce Type: replace 
Abstract: Intelligent anomaly detection in dynamic visual environments requires reconciling real-time performance with semantic interpretability. Conventional approaches address only fragments of this challenge. Reconstruction-based models capture low-level deviations without contextual reasoning, object detectors provide speed but limited semantics, and large vision-language systems deliver interpretability at prohibitive computational cost. This work introduces a cascading multi-agent framework that unifies these complementary paradigms into a coherent and interpretable architecture. Early modules perform reconstruction-gated filtering and object-level assessment, while higher-level reasoning agents are selectively invoked to interpret semantically ambiguous events. The system employs adaptive escalation thresholds and a publish-subscribe communication backbone, enabling asynchronous coordination and scalable deployment across heterogeneous hardware. Extensive evaluation on large-scale monitoring data demonstrates that the proposed cascade achieves a threefold reduction in latency compared to direct vision-language inference, while maintaining high perceptual fidelity (PSNR = 38.3 dB, SSIM = 0.965) and consistent semantic labeling. The framework advances beyond conventional detection pipelines by combining early-exit efficiency, adaptive multi-agent reasoning, and explainable anomaly attribution, establishing a reproducible and energy-efficient foundation for scalable intelligent visual monitoring.

This work introduces a cascading multi-agent framework for anomaly detection in surveillance systems, built on vision-language models and embedding-based classification. Early modules perform reconstruction-gated filtering and object-level assessment, and higher-level reasoning agents are invoked only for semantically ambiguous events. The authors report a threefold reduction in latency compared with direct vision-language inference, with consistent semantic labeling.
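The early-exit idea behind such a cascade can be sketched in a few lines of Python. This is an illustrative toy, not the paper's system. The stage names, the score fields (`recon_error`, `object_score`, `vlm_score`), the threshold values, and the `adapt_threshold` update rule are all hypothetical, and the publish-subscribe backbone is omitted.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Frame = Dict[str, float]  # hypothetical per-frame scores, each in [0, 1]


@dataclass
class Stage:
    name: str
    score: Callable[[Frame], float]  # anomaly score produced by this stage
    threshold: float                 # escalate to the next stage when score >= threshold


def run_cascade(frame: Frame, stages: List[Stage]) -> Tuple[str, List[Tuple[str, float]]]:
    """Run stages in order and exit early once a stage clears the frame."""
    trace = []
    for stage in stages:
        value = stage.score(frame)
        trace.append((stage.name, value))
        if value < stage.threshold:
            return "normal", trace  # early exit: later, costlier stages never run
    return "anomalous", trace  # every stage, including the last, flagged the frame


def adapt_threshold(threshold, escalated_fraction, target_fraction=0.1, step=0.05):
    """Toy adaptive rule (assumption): nudge the threshold toward a target escalation rate."""
    if escalated_fraction > target_fraction:
        threshold += step  # escalating too often, so be stricter
    else:
        threshold -= step  # escalating rarely, so be more permissive
    return min(1.0, max(0.0, threshold))


# Cheap stages first, the expensive vision-language stage last.
STAGES = [
    Stage("reconstruction_gate", lambda f: f["recon_error"], 0.3),
    Stage("object_assessment", lambda f: f["object_score"], 0.5),
    Stage("vlm_reasoning", lambda f: f["vlm_score"], 0.5),
]

quiet = {"recon_error": 0.1, "object_score": 0.9, "vlm_score": 0.9}
suspicious = {"recon_error": 0.8, "object_score": 0.7, "vlm_score": 0.9}
print(run_cascade(quiet, STAGES))
print(run_cascade(suspicious, STAGES))
```

In this sketch a frame with low reconstruction error never reaches the vision-language stage, which is where a latency saving of the kind the abstract reports would come from. In a real deployment each `score` callable would wrap a model call rather than read a precomputed field.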

Cascading multi-agent anomaly detection in surveillance systems via vision-language models and embedding-based classification

arXiv:2508.13680v3 Announce Type: replace-cross 
Abstract: We introduce VMMU, a Vietnamese Multitask Multimodal Understanding and Reasoning Benchmark designed to evaluate how vision-language models (VLMs) interpret and reason over visual and textual information beyond English. VMMU consists of 2.5k multimodal questions across 7 tasks, covering a diverse range of problem contexts, including STEM problem solving, data interpretation, rule-governed visual reasoning, and abstract visual reasoning. All questions require genuine multimodal integration, rather than reliance on text-only cues or OCR-based shortcuts. We evaluate a diverse set of state-of-the-art proprietary and open-source VLMs on VMMU. Despite strong Vietnamese OCR performance, proprietary models achieve only 66% mean accuracy. Further analysis shows that the primary source of failure is not OCR, but instead multimodal grounding and reasoning over text and visual evidence. Code and data are available at https://vmmu.github.io.

VMMU is a Vietnamese Multitask Multimodal Understanding and Reasoning Benchmark for evaluating how vision-language models (VLMs) interpret and reason over visual and textual information in Vietnamese. It contains 2.5k multimodal questions across seven tasks, all of which require genuine multimodal integration rather than text-only cues or OCR-based shortcuts. Proprietary models reach only 66% mean accuracy, and the authors attribute the failures mainly to multimodal grounding and reasoning rather than to OCR.

VMMU: Vietnamese Multitask Multimodal Understanding and Reasoning Benchmark
