6 Fingers, 1 Kidney: Natural Adversarial Medical Images Reveal Critical Weaknesses of Vision-Language Models

arXiv — cs.CV · Friday, December 5, 2025 at 5:00:00 AM
  • A new benchmark called AdversarialAnatomyBench has been introduced to evaluate vision-language models (VLMs) against naturally occurring rare anatomical variants, revealing significant performance drops in state-of-the-art models like GPT-5 and Gemini 2.5 Pro when faced with atypical anatomy: accuracy fell from 74% on typical anatomy to just 29% on atypical cases.
  • This development highlights critical weaknesses in VLMs, which are increasingly used in clinical settings. The findings suggest that existing models may not be adequately prepared to handle the complexities of rare anatomical presentations, potentially impacting diagnostic accuracy and patient care.
  • The introduction of AdversarialAnatomyBench reflects a growing recognition of the need for more robust evaluation frameworks in AI, particularly in healthcare. As benchmarks like this emerge, they underscore the importance of addressing biases in AI models and ensuring that advancements in technology translate effectively into clinical practice, especially in diverse medical scenarios.
— via World Pulse Now AI Editorial System


Continue Reading
Physicist Steve Hsu publishes research built around a core idea generated by GPT-5
Neutral · Artificial Intelligence
Physicist Steve Hsu has published a research paper based on an idea generated by GPT-5, highlighting the potential of AI in scientific inquiry while cautioning about its reliability, likening it to a 'brilliant but unreliable genius.'
AI denial is becoming an enterprise risk: Why dismissing “slop” obscures real capability gains
Negative · Artificial Intelligence
The recent release of GPT-5 by OpenAI has sparked a negative shift in public sentiment toward AI, with many users criticizing the model for its perceived flaws rather than recognizing its capabilities. This backlash has led to claims that AI progress is stagnating, with some commentators dismissing the technology as 'AI slop'.
OpenAI is training models to 'confess' when they lie - what it means for future AI
Neutral · Artificial Intelligence
OpenAI has developed a version of GPT-5 that can admit to its own errors, a significant step in addressing concerns about AI honesty and transparency. This new capability, referred to as 'confessions', aims to enhance the reliability of AI systems by encouraging them to self-report misbehavior. However, experts caution that this is not a comprehensive solution to the broader safety issues surrounding AI technology.
ThaiOCRBench: A Task-Diverse Benchmark for Vision-Language Understanding in Thai
Positive · Artificial Intelligence
ThaiOCRBench has been introduced as the first comprehensive benchmark for evaluating vision-language models (VLMs) specifically for Thai text-rich visual understanding tasks, featuring a diverse dataset of 2,808 samples across 13 categories. This initiative addresses the underrepresentation of Thai in existing multimodal modeling benchmarks, which primarily focus on high-resource languages.
ViRectify: A Challenging Benchmark for Video Reasoning Correction with Multimodal Large Language Models
Positive · Artificial Intelligence
The introduction of ViRectify presents a new benchmark aimed at evaluating the error correction capabilities of multimodal large language models (MLLMs) in complex video reasoning tasks. This benchmark addresses the existing gap in systematic evaluation, providing a dataset of over 30,000 instances across various domains such as dynamic perception and scientific reasoning.
Object Counting with GPT-4o and GPT-5: A Comparative Study
Positive · Artificial Intelligence
A comparative study has evaluated the object counting capabilities of two multimodal large language models, GPT-4o and GPT-5, focusing on their performance in zero-shot scenarios using only textual prompts. The evaluation was carried out on the FSC-147 and CARPK datasets, revealing that both models achieved results comparable to state-of-the-art methods, with some instances exceeding them.
A Definition of AGI
Neutral · Artificial Intelligence
A recent paper has introduced a quantifiable framework for defining Artificial General Intelligence (AGI), proposing that AGI should match the cognitive versatility of a well-educated adult. This framework is based on the Cattell-Horn-Carroll theory and evaluates AI systems across ten cognitive domains, revealing significant gaps in current AI models, particularly in long-term memory storage.
Anthropic vs. OpenAI red teaming methods reveal different security priorities for enterprise AI
Neutral · Artificial Intelligence
Anthropic and OpenAI have recently showcased their respective AI models, Claude Opus 4.5 and GPT-5, highlighting their distinct approaches to security validation through system cards and red-team exercises. Anthropic's extensive 153-page system card contrasts with OpenAI's 60-page version, revealing differing methodologies in assessing AI robustness and security metrics.