arXiv:2511.11206v1 Announce Type: new 
Abstract: Visual Language Models (VLMs) have achieved remarkable progress, yet their reliability under small, meaning-preserving input changes remains poorly understood. We present the first large-scale, systematic study of VLM robustness to benign visual and textual perturbations: pixel-level shifts, light geometric transformations, padded rescaling, paraphrasing, and multilingual rewrites that do not alter the underlying semantics of an image-question pair. Across a broad set of models and datasets, we find that modern VLMs are highly sensitive to such minor perturbations: a substantial fraction of samples change their predicted answer under at least one visual or textual modification. We characterize how this instability varies across perturbation types, question categories, and models, revealing that even state-of-the-art systems (e.g., GPT-4o, Gemini 2.0 Flash) frequently fail under shifts as small as a few pixels or harmless rephrasings. We further show that sample-level stability serves as a strong indicator of correctness: stable samples are consistently far more likely to be answered correctly. Leveraging this, we demonstrate that the stability patterns of small, accessible open-source models can be used to predict the correctness of much larger closed-source models with high precision. Our findings expose a fundamental fragility in current VLMs and highlight the need for robustness evaluations that go beyond adversarial perturbations, focusing instead on invariances that models should reliably uphold.

حققت نماذج اللغة البصرية (VLM) تقدمًا ملحوظًا، إلا أن موثوقيتها في مواجهة تغييرات الإدخال الطفيفة التي لا تغير المعنى لا تزال غير مفهومة جيدًا. تكشف دراسة شاملة أن نماذج VLM الحديثة، بما في ذلك نماذج مثل GPT-4o و Gemini 2.0 Flash، حساسة للغاية للت perturbations البصرية والنصية الطفيفة. تشمل هذه الاضطرابات التحولات على مستوى البكسل، والتحولات الهندسية الخفيفة، وإعادة الصياغة التي لا تغير المعنى الأساسي. تشير النتائج إلى أن جزءًا كبيرًا من العينات يغير إجاباتها المتوقعة بسبب هذه التعديلات الطفيفة.

Los Modelos de Lenguaje Visual (VLM) han logrado avances significativos, pero su fiabilidad ante pequeños cambios de entrada que no alteran el significado se comprende poco. Un estudio exhaustivo revela que los VLM modernos, incluidos modelos como GPT-4o y Gemini 2.0 Flash, son altamente sensibles a pequeñas perturbaciones visuales y textuales. Estas perturbaciones incluyen desplazamientos a nivel de píxeles, transformaciones geométricas y paráfrasis que mantienen la semántica original. Los hallazgos indican que una parte notable de las muestras cambia sus respuestas predichas debido a estas m…

Les modèles de langage visuel (VLM) ont réalisé des progrès significatifs, mais leur fiabilité face à de petits changements d'entrée non altérants reste mal comprise. Une étude complète révèle que les VLM modernes, y compris des modèles comme GPT-4o et Gemini 2.0 Flash, présentent une grande sensibilité aux petites perturbations visuelles et textuelles. Ces perturbations incluent des déplacements au niveau des pixels, des transformations géométriques et des reformulations qui conservent la sémantique d'origine. Les résultats montrent qu'une part notable des échantillons modifie ses réponses pr…

Visual Language Models (VLMs) have shown significant advancements, yet their reliability in response to minor, non-altering input changes is not well understood. A comprehensive study reveals that modern VLMs, including models like GPT-4o and Gemini 2.0 Flash, exhibit high sensitivity to small visual and textual perturbations. These perturbations include pixel-level shifts, geometric transformations, and paraphrasing that maintain the original semantics. The findings indicate that a notable portion of the samples alters their predicted answers due to these minor modifications.

Questioning the Stability of Visual Question Answering

arXiv:2406.09838v4 Announce Type: replace 
Abstract: Meteorological heatmaps play a vital role in deciphering extreme weather phenomena, yet their inherent complexities marked by irregular contours, unstructured patterns, and complex color variations present unique analytical hurdles for state-of-the-art Vision-Language Models (VLMs). Current state-of-the-art models like GPT-4o, Qwen-VL, and LLaVA 1.6 struggle with tasks such as precise color identification and spatial localization, resulting in inaccurate or incomplete interpretations. To address these challenges, we introduce Sparse Position and Outline Tracking (SPOT), a novel algorithm specifically designed to process irregularly shaped colored regions in visual data. SPOT identifies and localizes these regions by extracting their spatial coordinates, enabling structured representations of irregular shapes. Building on SPOT, we construct ClimateIQA, a novel meteorological visual question answering (VQA) dataset, comprising 26,280 high-resolution heatmaps and 762,120 instruction samples for wind gust, total precipitation, wind chill index and heat index analysis. ClimateIQA enhances VLM training by incorporating spatial cues, geographic metadata, and reanalysis data, improving model accuracy in interpreting and describing extreme weather features. Furthermore, we develop Climate-Zoo, a suite of fine-tuned VLMs based on SPOT-empowered ClimateIQA, which significantly outperforms existing models in meteorological heatmap tasks.

تم تقديم مجموعة بيانات جديدة تُدعى ClimateIQA لتعزيز قدرات نماذج اللغة البصرية (VLM) في تحليل الشذوذات الجوية. تشمل هذه المجموعة 26,280 صورة عالية الجودة، وتهدف إلى معالجة التحديات التي تواجه النماذج الحالية مثل GPT-4o وQwen-VL في تفسير خرائط الحرارة الجوية المعقدة التي تتميز بأشكال غير منتظمة وتباينات لونية.

Se ha introducido un nuevo conjunto de datos llamado ClimateIQA para mejorar las capacidades de los Modelos de Lenguaje Visual (VLM) en el análisis de anomalías meteorológicas. Este conjunto de datos, que incluye 26,280 imágenes de alta calidad, busca abordar los desafíos que enfrentan modelos existentes como GPT-4o y Qwen-VL al interpretar mapas de calor meteorológicos complejos caracterizados por formas irregulares y variaciones de color.

Un nouveau jeu de données nommé ClimateIQA a été introduit pour améliorer les capacités des modèles de langage visuel (VLM) dans l'analyse des anomalies météorologiques. Ce jeu de données, qui comprend 26 280 images de haute qualité, vise à relever les défis rencontrés par les modèles existants comme GPT-4o et Qwen-VL dans l'interprétation des cartes thermiques météorologiques complexes, caractérisées par des formes irrégulières et des variations de couleur.

A new dataset named ClimateIQA has been introduced to enhance the capabilities of Vision-Language Models (VLMs) in analyzing meteorological anomalies. This dataset, which includes 26,280 high-quality images, aims to address the challenges faced by existing models like GPT-4o and Qwen-VL in interpreting complex meteorological heatmaps characterized by irregular shapes and color variations.

ClimateIQA: A New Dataset and Benchmark to Advance Vision-Language Models in Meteorology Anomalies Analysis

arXiv:2503.18712v2 Announce Type: replace 
Abstract: Understanding human behavior requires measuring behavioral actions. Due to its complexity, behavior is best mapped onto a rich, semantic structure such as language. Emerging multimodal large language models (MLLMs) are promising candidates, but their fine-grained action understanding ability has not been fully examined. In this work, we reformulate EPIC-KITCHENS-100, one of the largest and most challenging egocentric action recognition datasets, into a MLLM benchmark (EPIC-KITCHENS-100-MQA). We demonstrate that when we sample difficult answers based on specialist models as distractors, leading MLLMs struggle to recognize the correct actions. How can we increase the performance of MLLMs? We curated a supervised finetuning dataset that includes `hard' action recognition, temporal detection, captioning, and free-form question answering to improve models' diverse action understanding capabilities. We introduce a new model called LLaVAction that adds an action token to boost models' attention on visual tokens and a two-stage pipeline to obtain structured actions. LLaVAction greatly improves the MLLMs' ability of action understanding, achieving strong improvements on both MLLM benchmarks (21 points in accuracy over GPT-4o on EPIC-KITCHENS-100-MQA) and established action recognition benchmarks, suggesting that our methods prepare MLLMs to be a promising path forward for complex action tasks. Code, data, the benchmark, and models are available at https://github.com/AdaptiveMotorControlLab/LLaVAction.

تركز الدراسة المعنونة 'LLaVAction' على تقييم وتدريب نماذج اللغة متعددة الوسائط (MLLM) لفهم الأفعال، من خلال إعادة صياغة مجموعة بيانات EPIC-KITCHENS-100 لتكون معيارًا لنماذج MLLM. تكشف الدراسة أن النماذج الرائدة في هذا المجال تواجه صعوبة في التعرف على الأفعال الصحيحة عند مواجهة مشتتات صعبة، مما يبرز فجوة في قدراتها على فهم الأفعال بشكل دقيق.

La investigación titulada 'LLaVAction' se centra en la evaluación y el entrenamiento de modelos de lenguaje multimodal (MLLM) para la comprensión de acciones, reformulando el conjunto de datos EPIC-KITCHENS-100 en un referente para los MLLM. El estudio revela que los MLLM líderes tienen dificultades para reconocer las acciones correctas cuando se enfrentan a distractores difíciles, destacando una brecha en sus capacidades de comprensión detallada de acciones.

La recherche intitulée 'LLaVAction' se concentre sur l'évaluation et la formation de modèles de langage multimodaux (MLLM) pour la compréhension des actions, en reformulant le jeu de données EPIC-KITCHENS-100 en une référence pour les MLLM. L'étude révèle que les MLLM leaders ont du mal à reconnaître les actions correctes lorsqu'ils sont confrontés à des distracteurs difficiles, mettant en évidence un écart dans leurs capacités de compréhension fine des actions.

The research titled 'LLaVAction' focuses on evaluating and training multi-modal large language models (MLLMs) for action understanding, reformulating the EPIC-KITCHENS-100 dataset into a benchmark for MLLMs. The study reveals that leading MLLMs struggle with recognizing correct actions when faced with difficult distractors, highlighting a gap in their fine-grained action understanding capabilities.

LLaVAction: evaluating and training multi-modal large language models for action understanding

arXiv:2505.20665v2 Announce Type: replace 
Abstract: Effective autonomous driving hinges on robust reasoning across perception, prediction, planning, and behavior. However, conventional end-to-end models fail to generalize in complex scenarios due to the lack of structured reasoning. While recent vision-language models (VLMs) have been applied to driving tasks, they typically rely on isolated modules and static supervision, limiting their ability to support multi-stage decision-making. We present AutoDriveRL, a unified training framework that formulates autonomous driving as a structured reasoning process over four core tasks. Each task is independently modeled as a vision-language QA problem and optimized using task-specific reward models, enabling fine-grained reinforcement signals at different reasoning stages. Within this framework, we train DriveRX, a cross-task reasoning VLM designed for multi-stage decision-making. DriveRX achieves strong performance on the public benchmark, outperforming GPT-4o in behavior reasoning and demonstrating robustness under complex or corrupted driving conditions. DriveRX serves as a high-level semantic reasoning backbone, producing structured stage-wise reasoning chains that enhance decision consistency. These outputs also provide high-quality supervisory signals for annotation and downstream planning/control models. We release the AutoDriveRL framework and DriveRX to support future research.

تم تقديم DriveRX كنموذج للسببية بين الرؤية واللغة يهدف إلى تحسين القيادة الذاتية عبر المهام، من خلال معالجة قيود النماذج التقليدية من النهاية إلى النهاية، التي تواجه صعوبة في السيناريوهات المعقدة بسبب نقص التفكير المنظم. هذا النموذج هو جزء من إطار عمل أوسع يسمى AutoDriveRL، الذي يقوم بتحسين أربع مهام أساسية من خلال نهج تدريب موحد.

DriveRX se ha presentado como un modelo de razonamiento visión-lenguaje destinado a mejorar la conducción autónoma entre tareas, abordando las limitaciones de los modelos tradicionales de extremo a extremo, que luchan en escenarios complejos debido a la falta de razonamiento estructurado. Este modelo forma parte de un marco más amplio llamado AutoDriveRL, que optimiza cuatro tareas centrales a través de un enfoque de entrenamiento unificado.

DriveRX a été présenté comme un modèle de raisonnement vision-langage visant à améliorer la conduite autonome inter-tâches en s'attaquant aux limites des modèles traditionnels de bout en bout, qui peinent dans des scénarios complexes en raison d'un manque de raisonnement structuré. Ce modèle fait partie d'un cadre plus large appelé AutoDriveRL, qui optimise quatre tâches clés grâce à une approche de formation unifiée.

DriveRX has been introduced as a vision-language reasoning model aimed at enhancing cross-task autonomous driving by addressing the limitations of traditional end-to-end models, which struggle with complex scenarios due to a lack of structured reasoning. This model is part of a broader framework called AutoDriveRL, which optimizes four core tasks through a unified training approach.

Questioning the Stability of Visual Question Answering

Was this article worth reading? Share it

LucidQuery AI

Airparser

The Visualizer

OpenL Translator

ChatOne

4o Image Gen

Ready to build your own newsroom?