arXiv:2510.26441v1 Announce Type: new 
Abstract: Test-time prompt tuning (TPT) has emerged as a promising technique for adapting large vision-language models (VLMs) to unseen tasks without relying on labeled data. However, the lack of dispersion between textual features can hurt calibration performance, which raises concerns about VLMs' reliability, trustworthiness, and safety. Current TPT approaches primarily focus on improving prompt calibration by either maximizing average textual feature dispersion or enforcing orthogonality constraints to encourage angular separation. However, these methods may not always have optimal angular separation between class-wise textual features, which implies overlooking the critical role of angular diversity. To address this, we propose A-TPT, a novel TPT framework that introduces angular diversity to encourage uniformity in the distribution of normalized textual features induced by corresponding learnable prompts. This uniformity is achieved by maximizing the minimum pairwise angular distance between features on the unit hypersphere. We show that our approach consistently surpasses state-of-the-art TPT methods in reducing the aggregate average calibration error while maintaining comparable accuracy through extensive experiments with various backbones on different datasets. Notably, our approach exhibits superior zero-shot calibration performance on natural distribution shifts and generalizes well to medical datasets. We provide extensive analyses, including theoretical aspects, to establish the grounding of A-TPT. These results highlight the potency of promoting angular diversity to achieve well-dispersed textual features, significantly improving VLM calibration during test-time adaptation. Our code will be made publicly available.
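The abstract's central quantity, the minimum pairwise angular distance between normalized textual features, can be illustrated with a short NumPy sketch. This is not the authors' implementation: the function names `min_pairwise_angle` and `angular_diversity_loss` are hypothetical, and the features here are plain arrays rather than outputs of a text encoder driven by learnable prompts.

```python
import numpy as np


def min_pairwise_angle(features: np.ndarray) -> float:
    """Smallest angle (in radians) between any two rows of `features`.

    Rows are L2-normalized first, so they lie on the unit hypersphere.
    Assumes at least two rows and no all-zero row.
    """
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    # Cosine similarity between every pair; clip guards arccos against round-off.
    cos = np.clip(unit @ unit.T, -1.0, 1.0)
    # Exclude each vector's similarity with itself from the maximum.
    np.fill_diagonal(cos, -1.0)
    # The largest cosine corresponds to the smallest angle.
    return float(np.arccos(cos.max()))


def angular_diversity_loss(features: np.ndarray) -> float:
    """Hypothetical regularizer: minimizing it maximizes the minimum angle."""
    return -min_pairwise_angle(features)
```

For three mutually orthogonal class features the minimum angle is π/2; moving any two of them closer together lowers it, which is the situation the regularizer penalizes.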

A recent study introduces A-TPT, a test-time prompt tuning (TPT) framework for vision-language models (VLMs) that targets poor calibration when these models are adapted to new tasks without labeled data. It spreads the normalized class-wise textual features apart by maximizing the minimum pairwise angular distance between them on the unit hypersphere. The authors report lower average calibration error than prior TPT methods at comparable accuracy, including under natural distribution shifts and on medical datasets.

A-TPT: Angular Diversity Calibration Properties for Test-Time Prompt Tuning of Vision-Language Models

arXiv:2601.06204v2 Announce Type: replace 
Abstract: Intelligent anomaly detection in dynamic visual environments requires reconciling real-time performance with semantic interpretability. Conventional approaches address only fragments of this challenge. Reconstruction-based models capture low-level deviations without contextual reasoning, object detectors provide speed but limited semantics, and large vision-language systems deliver interpretability at prohibitive computational cost. This work introduces a cascading multi-agent framework that unifies these complementary paradigms into a coherent and interpretable architecture. Early modules perform reconstruction-gated filtering and object-level assessment, while higher-level reasoning agents are selectively invoked to interpret semantically ambiguous events. The system employs adaptive escalation thresholds and a publish-subscribe communication backbone, enabling asynchronous coordination and scalable deployment across heterogeneous hardware. Extensive evaluation on large-scale monitoring data demonstrates that the proposed cascade achieves a threefold reduction in latency compared to direct vision-language inference, while maintaining high perceptual fidelity (PSNR = 38.3 dB, SSIM = 0.965) and consistent semantic labeling. The framework advances beyond conventional detection pipelines by combining early-exit efficiency, adaptive multi-agent reasoning, and explainable anomaly attribution, establishing a reproducible and energy-efficient foundation for scalable intelligent visual monitoring.
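The early-exit idea in the abstract, where cheap stages run first and costlier reasoning is invoked only for ambiguous events, can be sketched as follows. Everything here is an assumption made for illustration: the `Stage` type, the stage names, the frame keys, and the fixed thresholds (the paper describes adaptive thresholds and a publish-subscribe backbone, neither of which is modeled).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Stage:
    name: str
    score: Callable[[Dict[str, float]], float]  # suspicion score in [0, 1]
    escalate_above: float  # escalate to the next stage when score exceeds this


def run_cascade(frame: Dict[str, float], stages: List[Stage]) -> Tuple[str, float]:
    """Return (name of the stage that made the decision, its score)."""
    decision = ("none", 0.0)
    for stage in stages:
        s = stage.score(frame)
        decision = (stage.name, s)
        if s <= stage.escalate_above:
            break  # early exit: not suspicious enough to pay for the next stage
    return decision


# Hypothetical three-stage cascade reading precomputed scores from a dict.
STAGES = [
    Stage("reconstruction", lambda f: f["recon_error"], escalate_above=0.3),
    Stage("object", lambda f: f["object_score"], escalate_above=0.5),
    Stage("vlm", lambda f: f["vlm_score"], escalate_above=1.0),  # final stage
]
```

A frame with a low reconstruction error exits at the first stage, so the expensive vision-language stage runs only on the small fraction of frames that earlier stages could not settle.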

A new cascading multi-agent framework for anomaly detection in surveillance systems combines reconstruction-gated filtering, object-level assessment, and selectively invoked vision-language reasoning. According to the abstract, the cascade cuts latency threefold relative to direct vision-language inference while maintaining high perceptual fidelity (PSNR = 38.3 dB, SSIM = 0.965) and consistent semantic labeling.

Cascading multi-agent anomaly detection in surveillance systems via vision-language models and embedding-based classification

arXiv:2508.13680v3 Announce Type: replace-cross 
Abstract: We introduce VMMU, a Vietnamese Multitask Multimodal Understanding and Reasoning Benchmark designed to evaluate how vision-language models (VLMs) interpret and reason over visual and textual information beyond English. VMMU consists of 2.5k multimodal questions across 7 tasks, covering a diverse range of problem contexts, including STEM problem solving, data interpretation, rule-governed visual reasoning, and abstract visual reasoning. All questions require genuine multimodal integration, rather than reliance on text-only cues or OCR-based shortcuts. We evaluate a diverse set of state-of-the-art proprietary and open-source VLMs on VMMU. Despite strong Vietnamese OCR performance, proprietary models achieve only 66% mean accuracy. Further analysis shows that the primary source of failure is not OCR, but instead multimodal grounding and reasoning over text and visual evidence. Code and data are available at https://vmmu.github.io.
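A multitask benchmark score such as the mean accuracy quoted above can be computed per task and then averaged. The sketch below assumes a macro-average over tasks; the abstract does not say whether VMMU averages over tasks or over questions, and the record format and function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def per_task_accuracy(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Map each task name to the fraction of its questions answered correctly."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for task, is_correct in records:
        total[task] += 1
        correct[task] += int(is_correct)
    return {task: correct[task] / total[task] for task in total}


def mean_accuracy(records: Iterable[Tuple[str, bool]]) -> float:
    """Macro-average: every task counts equally, regardless of its size."""
    acc = per_task_accuracy(list(records))
    return sum(acc.values()) / len(acc)
```

With a macro-average, a small task weighs as much as a large one, so the result can differ from the plain fraction of correct answers over all questions.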

VMMU is a Vietnamese Multitask Multimodal Understanding and Reasoning Benchmark for assessing how well vision-language models (VLMs) interpret and reason over visual and textual information in Vietnamese. It comprises 2.5k multimodal questions across seven tasks, all requiring genuine multimodal integration rather than text-only cues or OCR-based shortcuts. Proprietary models reach only 66% mean accuracy, and the authors trace failures mainly to multimodal grounding and reasoning rather than OCR.

VMMU: A Vietnamese Multitask Multimodal Understanding and Reasoning Benchmark

One More Thing in AI – Your Shortcut to AI Mastery
