6 Fingers, 1 Kidney: Natural Adversarial Medical Images Reveal Critical Weaknesses of Vision-Language Models

arXiv — cs.CV · Friday, December 5, 2025 at 5:00:00 AM
  • A new benchmark called AdversarialAnatomyBench has been introduced to evaluate vision-language models (VLMs) against naturally occurring rare anatomical variants, revealing significant performance drops in state-of-the-art models like GPT-5 and Gemini 2.5 Pro when faced with atypical anatomy: accuracy fell from 74% on typical anatomy to just 29% on atypical cases.
  • This development highlights critical weaknesses in VLMs, which are increasingly used in clinical settings. The findings suggest that existing models may not be adequately prepared to handle the complexities of rare anatomical presentations, potentially impacting diagnostic accuracy and patient care.
  • The introduction of AdversarialAnatomyBench reflects a growing recognition of the need for more robust evaluation frameworks in AI, particularly in healthcare. As benchmarks like this emerge, they underscore the importance of addressing biases in AI models and ensuring that advancements in technology translate effectively into clinical practice, especially in diverse medical scenarios.
— via World Pulse Now AI Editorial System


Continue Reading
Physicist Steve Hsu publishes research built around a core idea generated by GPT-5
Neutral · Artificial Intelligence
Physicist Steve Hsu has published a research paper based on an idea generated by GPT-5, highlighting the potential of AI in scientific inquiry while cautioning about its reliability, likening it to a 'brilliant but unreliable genius.'
AI denial is becoming an enterprise risk: Why dismissing “slop” obscures real capability gains
Negative · Artificial Intelligence
The recent release of GPT-5 by OpenAI has sparked a negative shift in public sentiment toward AI, with many users criticizing the model for its perceived flaws rather than recognizing its capabilities. This backlash has led to claims that AI progress is stagnating, with some commentators dismissing the technology as 'AI slop'.
OpenAI is training models to 'confess' when they lie - what it means for future AI
Neutral · Artificial Intelligence
OpenAI has developed a version of GPT-5 that can admit to its own errors, a significant step in addressing concerns about AI honesty and transparency. This new capability, referred to as 'confessions', aims to enhance the reliability of AI systems by encouraging them to self-report misbehavior. However, experts caution that this is not a comprehensive solution to the broader safety issues surrounding AI technology.
ThaiOCRBench: A Task-Diverse Benchmark for Vision-Language Understanding in Thai
Positive · Artificial Intelligence
ThaiOCRBench has been introduced as the first comprehensive benchmark for evaluating vision-language models (VLMs) specifically for Thai text-rich visual understanding tasks, featuring a diverse dataset of 2,808 samples across 13 categories. This initiative addresses the underrepresentation of Thai in existing multimodal modeling benchmarks, which primarily focus on high-resource languages.
ViRectify: A Challenging Benchmark for Video Reasoning Correction with Multimodal Large Language Models
Positive · Artificial Intelligence
The introduction of ViRectify presents a new benchmark aimed at evaluating the error correction capabilities of multimodal large language models (MLLMs) in complex video reasoning tasks. This benchmark addresses the existing gap in systematic evaluation, providing a dataset of over 30,000 instances across various domains such as dynamic perception and scientific reasoning.
Object Counting with GPT-4o and GPT-5: A Comparative Study
Positive · Artificial Intelligence
A comparative study has evaluated the object counting capabilities of two multimodal large language models, GPT-4o and GPT-5, focusing on their performance in zero-shot scenarios using only textual prompts. The evaluation was carried out on the FSC-147 and CARPK datasets, revealing that both models achieved results comparable to state-of-the-art methods, with some instances exceeding them.
A Definition of AGI
Neutral · Artificial Intelligence
A recent paper has introduced a quantifiable framework for defining Artificial General Intelligence (AGI), proposing that AGI should match the cognitive versatility of a well-educated adult. This framework is based on the Cattell-Horn-Carroll theory and evaluates AI systems across ten cognitive domains, revealing significant gaps in current AI models, particularly in long-term memory storage.
Anthropic vs. OpenAI red teaming methods reveal different security priorities for enterprise AI
Neutral · Artificial Intelligence
Anthropic and OpenAI have recently showcased their respective AI models, Claude Opus 4.5 and GPT-5, highlighting their distinct approaches to security validation through system cards and red-team exercises. Anthropic's extensive 153-page system card contrasts with OpenAI's 60-page version, revealing differing methodologies in assessing AI robustness and security metrics.