Do You See Me: A Multidimensional Benchmark for Evaluating Visual Perception in Multimodal LLMs
- A new benchmark titled 'Do You See Me' has been introduced to evaluate the visual perception capabilities of Multimodal Large Language Models (MLLMs). It reveals that leading models frequently misinterpret what they see even when their final reasoning answers are correct. The benchmark comprises 1,758 images and 2,612 questions spanning several complexity levels and exposes a substantial accuracy gap between humans and MLLMs; a minimal scoring sketch follows this summary.
- This development is significant for advancing MLLMs because the benchmark systematically isolates the visual perception errors that undermine their reasoning. Pinpointing where models fail to see correctly gives a clearer understanding of their limitations, which is essential for improving their design and reliability in real-world applications.
- The introduction of this benchmark reflects ongoing challenges in the field of artificial intelligence, particularly regarding the integration of visual and textual understanding. As MLLMs continue to evolve, addressing issues such as catastrophic forgetting, hallucinations, and diagram comprehension will be vital for enhancing their overall performance and reliability in multimodal tasks.
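For a concrete picture of how a perception benchmark of this kind might be consumed, the sketch below shows one plausible scoring loop: iterate over image-question pairs, query a model, and report accuracy overall and per complexity level. The `PerceptionItem` fields, the exact-match scoring rule, and the toy records are illustrative assumptions, not the benchmark's actual schema or evaluation protocol.

```python
# Minimal sketch of scoring a visual-perception benchmark such as "Do You See Me".
# All field names, the example records, and the answer-matching rule are
# hypothetical; they only illustrate the shape of an evaluation loop.

from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class PerceptionItem:
    image_path: str   # path to a benchmark image (hypothetical field name)
    question: str     # visual-perception question about that image
    answer: str       # gold answer string
    difficulty: str   # coarse complexity level, e.g. "easy" or "hard"


def evaluate(model: Callable[[str, str], str],
             items: Iterable[PerceptionItem]) -> dict:
    """Compute overall and per-difficulty accuracy for a model callable
    that maps (image_path, question) to a predicted answer string."""
    totals: dict = {}
    correct: dict = {}
    for item in items:
        pred = model(item.image_path, item.question).strip().lower()
        hit = pred == item.answer.strip().lower()  # naive exact match (assumption)
        totals[item.difficulty] = totals.get(item.difficulty, 0) + 1
        correct[item.difficulty] = correct.get(item.difficulty, 0) + int(hit)
    overall = sum(correct.values()) / max(sum(totals.values()), 1)
    per_level = {lvl: correct[lvl] / totals[lvl] for lvl in totals}
    return {"overall": overall, "per_difficulty": per_level}


if __name__ == "__main__":
    # Toy stand-in data and a dummy "model" that always answers "square",
    # purely to demonstrate the loop and the accuracy report format.
    demo_items = [
        PerceptionItem("img_001.png", "What shape is highlighted?", "square", "easy"),
        PerceptionItem("img_002.png", "How many circles overlap?", "three", "hard"),
    ]
    dummy_model = lambda image_path, question: "square"
    print(evaluate(dummy_model, demo_items))
```

In practice, the human baseline reported by the benchmark would be compared against the `overall` and per-difficulty figures produced by a loop like this one.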
— via World Pulse Now AI Editorial System


