arXiv:2511.00067v1 Announce Type: new 
Abstract: The objective of domain generalization (DG) is to enable models to be robust against domain shift. DG is crucial for deploying vision-language models (VLMs) in real-world applications, yet most existing methods rely on domain labels that may be unavailable and are often ambiguous. We instead study the DG setting where models must generalize well without access to explicit domain labels. Our key idea is to represent an unseen target domain as a combination of latent domains automatically discovered from training data, enabling the model to adaptively transfer knowledge across domains. To realize this, we perform latent domain clustering on image features and fuse domain-specific text features based on the similarity between the input image and each latent domain. Experiments on four benchmarks show that this strategy yields consistent gains over VLM-based baselines and provides new insights into improving robustness under domain shift.
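
The two steps named in the abstract (clustering image features into latent domains, then fusing domain-specific text features by image-to-domain similarity) can be illustrated with a minimal sketch. This is not the authors' implementation: the use of plain k-means, cosine similarity, a softmax with a temperature of 0.1, and every function name below are assumptions made for illustration, and the feature vectors stand in for outputs of a VLM image or text encoder.

```python
import math
import random


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def kmeans(features, k, iters=20, seed=0):
    """Plain k-means (an assumed clustering choice); centroids act as latent domains."""
    rng = random.Random(seed)
    centroids = [list(f) for f in rng.sample(features, k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for f in features:
            nearest = min(range(k), key=lambda j: math.dist(f, centroids[j]))
            buckets[nearest].append(f)
        for j, bucket in enumerate(buckets):
            if bucket:  # keep the previous centroid if a cluster is empty
                centroids[j] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return centroids


def softmax(scores, temperature=0.1):
    """Numerically stable softmax over similarity scores."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_text_features(image_feat, centroids, domain_text_feats, temperature=0.1):
    """Weight each latent domain's text features by image-to-domain similarity.

    domain_text_feats[j][c] is the (hypothetical) text feature of class c
    under latent domain j. Returns the domain weights and one fused text
    feature per class.
    """
    weights = softmax([cosine(image_feat, c) for c in centroids], temperature)
    n_classes = len(domain_text_feats[0])
    dim = len(domain_text_feats[0][0])
    fused = [
        [sum(w * domain_text_feats[j][c][d] for j, w in enumerate(weights))
         for d in range(dim)]
        for c in range(n_classes)
    ]
    return weights, fused


def classify(image_feat, fused):
    """Pick the class whose fused text feature is most similar to the image."""
    return max(range(len(fused)), key=lambda c: cosine(image_feat, fused[c]))
```

In this sketch an unseen image is never assigned to a single domain; its class text features are a similarity-weighted mixture across all latent domains, which mirrors the abstract's idea of representing a target domain as a combination of discovered domains.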

A new study on latent domain prompt learning for vision-language models (VLMs) addresses domain generalization (DG) in settings where domain labels are unavailable or ambiguous. Instead of relying on explicit labels, the method clusters image features into latent domains and fuses domain-specific text features according to how similar an input image is to each latent domain. The authors report consistent gains over VLM-based baselines on four benchmarks.

Latent Domain Prompt Learning for Vision-Language Models

arXiv:2601.06204v2 Announce Type: replace 
Abstract: Intelligent anomaly detection in dynamic visual environments requires reconciling real-time performance with semantic interpretability. Conventional approaches address only fragments of this challenge. Reconstruction-based models capture low-level deviations without contextual reasoning, object detectors provide speed but limited semantics, and large vision-language systems deliver interpretability at prohibitive computational cost. This work introduces a cascading multi-agent framework that unifies these complementary paradigms into a coherent and interpretable architecture. Early modules perform reconstruction-gated filtering and object-level assessment, while higher-level reasoning agents are selectively invoked to interpret semantically ambiguous events. The system employs adaptive escalation thresholds and a publish-subscribe communication backbone, enabling asynchronous coordination and scalable deployment across heterogeneous hardware. Extensive evaluation on large-scale monitoring data demonstrates that the proposed cascade achieves a threefold reduction in latency compared to direct vision-language inference, while maintaining high perceptual fidelity (PSNR = 38.3 dB, SSIM = 0.965) and consistent semantic labeling. The framework advances beyond conventional detection pipelines by combining early-exit efficiency, adaptive multi-agent reasoning, and explainable anomaly attribution, establishing a reproducible and energy-efficient foundation for scalable intelligent visual monitoring.
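
The escalation logic described in the abstract can be sketched as a three-stage early-exit cascade. This is a minimal illustration under stated assumptions, not the paper's system: the `Cascade` class, its field names, the default thresholds, the ambiguity band, and the moving-average rule used to adapt the reconstruction threshold are all hypothetical, and the publish-subscribe backbone is omitted.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass
class Cascade:
    # Stage callables are placeholders for real models (all hypothetical).
    recon_error: Callable[[Any], float]    # stage 1: reconstruction error of a frame
    object_score: Callable[[Any], float]   # stage 2: object-level anomaly score in [0, 1]
    vlm_describe: Callable[[Any], str]     # stage 3: expensive vision-language reasoning
    recon_threshold: float = 0.1           # frames below this exit early as normal
    ambiguity_band: Tuple[float, float] = (0.3, 0.7)  # scores inside escalate to the VLM
    adapt_rate: float = 0.05               # step size of the assumed threshold update
    margin: float = 2.0                    # headroom above recent normal errors

    def process(self, frame: Any) -> Tuple[str, str]:
        """Return (label, stage_that_decided) for one frame."""
        err = self.recon_error(frame)
        if err < self.recon_threshold:
            # Early exit. Assumed adaptation rule: move the threshold toward
            # margin * err with an exponential moving average.
            target = self.margin * err
            self.recon_threshold = (
                (1 - self.adapt_rate) * self.recon_threshold
                + self.adapt_rate * target
            )
            return ("normal", "reconstruction")

        score = self.object_score(frame)
        low, high = self.ambiguity_band
        if score < low:
            return ("normal", "detector")
        if score > high:
            return ("anomaly", "detector")

        # Only semantically ambiguous frames reach the costly reasoning stage.
        return (self.vlm_describe(frame), "vlm")
```

Because most frames exit at the first or second stage, the vision-language model runs only on the ambiguous remainder, which is the mechanism the abstract credits for the latency reduction relative to direct vision-language inference.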

A new framework for cascading multi-agent anomaly detection in surveillance systems combines vision-language models with embedding-based classification to balance real-time performance and semantic interpretability. Early stages apply reconstruction-gated filtering and object-level assessment, and higher-level reasoning agents are invoked only for semantically ambiguous events. The authors report a threefold reduction in latency compared to direct vision-language inference.

Cascading multi-agent anomaly detection in surveillance systems via vision-language models and embedding-based classification

arXiv:2508.13680v3 Announce Type: replace-cross 
Abstract: We introduce VMMU, a Vietnamese Multitask Multimodal Understanding and Reasoning Benchmark designed to evaluate how vision-language models (VLMs) interpret and reason over visual and textual information beyond English. VMMU consists of 2.5k multimodal questions across 7 tasks, covering a diverse range of problem contexts, including STEM problem solving, data interpretation, rule-governed visual reasoning, and abstract visual reasoning. All questions require genuine multimodal integration, rather than reliance on text-only cues or OCR-based shortcuts. We evaluate a diverse set of state-of-the-art proprietary and open-source VLMs on VMMU. Despite strong Vietnamese OCR performance, proprietary models achieve only 66% mean accuracy. Further analysis shows that the primary source of failure is not OCR, but instead multimodal grounding and reasoning over text and visual evidence. Code and data are available at https://vmmu.github.io.

VMMU, a Vietnamese Multitask Multimodal Understanding and Reasoning Benchmark, assesses how well vision-language models (VLMs) interpret and reason over visual and textual information in Vietnamese. The benchmark includes 2.5k multimodal questions across seven tasks, all requiring genuine multimodal integration rather than text-only cues or OCR-based shortcuts. Proprietary models reach only 66% mean accuracy, and the authors attribute most failures to multimodal grounding and reasoning rather than OCR.

