Do You See Me: A Multidimensional Benchmark for Evaluating Visual Perception in Multimodal LLMs

arXiv, cs.CV · Thursday, December 11, 2025 at 5:00:00 AM
  • A new benchmark titled 'Do You See Me' has been introduced to evaluate the visual perception capabilities of Multimodal Large Language Models (MLLMs). It reveals that leading models struggle with visual interpretation even when they arrive at correct answers on reasoning tasks. The benchmark includes 1,758 images and 2,612 questions across various complexity levels and highlights a significant gap between human accuracy and MLLM results.
  • The benchmark matters for advancing MLLMs because it systematically targets the visual perception errors that hinder their reasoning. It aims to give a clearer picture of these models' limitations, which is essential for improving their design and their usefulness in real-world applications.
  • The introduction of this benchmark reflects ongoing challenges in the field of artificial intelligence, particularly regarding the integration of visual and textual understanding. As MLLMs continue to evolve, addressing issues such as catastrophic forgetting, hallucinations, and diagram comprehension will be vital for enhancing their overall performance and reliability in multimodal tasks.
— via World Pulse Now AI Editorial System

Continue Reading
Do you ask AI deep questions at night? 37.5 million Copilot conversations show you're not alone
Positive · Artificial Intelligence
A Microsoft study reveals that 37.5 million conversations with its AI Copilot demonstrate a significant integration of AI into daily life, spanning work-related discussions during the day and personal inquiries at night. This highlights the growing reliance on AI for various aspects of human interaction.
How the Next Big Thing in Carbon Removal Sank Without a Trace
Negative · Artificial Intelligence
Running Tide, once touted as a leader in carbon removal with backing from major companies like Microsoft, Stripe, and Shopify, has faced significant setbacks, culminating in the controversial decision to dump thousands of tons of wood chips into the ocean. This move raises questions about the effectiveness and sustainability of its carbon removal strategies.
IF-Bench: Benchmarking and Enhancing MLLMs for Infrared Images with Generative Visual Prompting
Positive · Artificial Intelligence
The introduction of IF-Bench marks a significant advancement in the evaluation of multimodal large language models (MLLMs) specifically for infrared images, utilizing a dataset of 499 images and 680 visual question-answer pairs to assess understanding across ten dimensions. This benchmark aims to fill the gap in current research regarding MLLMs' capabilities in interpreting infrared imagery.
Video-QTR: Query-Driven Temporal Reasoning Framework for Lightweight Video Understanding
Positive · Artificial Intelligence
The introduction of Video-QTR, a Query-Driven Temporal Reasoning framework, aims to enhance lightweight video understanding by optimizing the processing of visual content through query-guided reasoning rather than exhaustive frame encoding. This approach addresses the inefficiencies associated with traditional methods that lead to high memory consumption and limited scalability in long-video comprehension.
LongT2IBench: A Benchmark for Evaluating Long Text-to-Image Generation with Graph-structured Annotations
Positive · Artificial Intelligence
LongT2IBench has been introduced as a new benchmark aimed at evaluating long Text-to-Image (T2I) generation, addressing the limitations of existing models that primarily focus on short prompts. This benchmark includes 14,000 long text-image pairs with graph-structured human annotations, enhancing the interpretability of image-text alignment in complex scenarios.
A group of state AGs sent a letter to Meta, Microsoft, Google, Apple, and others warning their chatbots' "delusional outputs" could be violating state laws (Courtney Rozen/Reuters)
Negative · Artificial Intelligence
A coalition of state attorneys general has issued a warning to major tech companies, including Meta, Microsoft, Google, and Apple, regarding the potential legal implications of their chatbots producing what they describe as 'delusional outputs.' This letter emphasizes concerns that such outputs may violate state laws, highlighting the need for accountability in AI technologies.
State attorneys general warn Microsoft, OpenAI, Google, and other AI giants to fix ‘delusional’ outputs
Negative · Artificial Intelligence
State attorneys general have issued a warning to major AI companies, including Microsoft, OpenAI, and Google, demanding the implementation of new safeguards to prevent harmful psychological impacts from their AI outputs, which have been described as 'delusional.'
Microsoft faces reality check on AI ambitions as Copilot and Foundry struggle to meet goals
Negative · Artificial Intelligence
Microsoft is facing significant challenges in its efforts to integrate artificial intelligence into its core product strategy, as evidenced by the underperformance of its Azure sales units and the Foundry marketplace for AI models and tools, which have not met growth expectations. This situation has prompted a reevaluation of its AI ambitions.