KidVis: Do Multimodal Large Language Models Possess the Visual Perceptual Capabilities of a 6-Year-Old?
Neutral · Artificial Intelligence
- A new benchmark called KidVis has been introduced to evaluate the visual perceptual capabilities of Multimodal Large Language Models (MLLMs), comparing their performance with that of 6- to 7-year-old children across six atomic visual capabilities. The results reveal a substantial performance gap: children score an average of 95.32, compared with 67.33 for GPT-5 (an illustrative sketch of this kind of score averaging follows the summary below).
- This development is crucial as it highlights the limitations of current MLLMs in replicating fundamental human visual perception, raising questions about their applicability in tasks requiring intuitive understanding.
- The findings underscore ongoing challenges in artificial intelligence, particularly in closing the gap between human-like visual perception and current machine learning capabilities, even as new frameworks and benchmarks continue to emerge to strengthen MLLMs' visual reasoning and contextual understanding.
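
As a rough illustration of the kind of aggregation behind the headline numbers, the sketch below macro-averages per-capability scores into a single figure. This is a minimal sketch under stated assumptions: the capability names and values are placeholders, and the summary does not specify how KidVis actually scores or weights the six capabilities.

```python
# Illustrative sketch only: the article reports averaged scores (e.g., 95.32
# for children vs. 67.33 for GPT-5) over six atomic visual capabilities.
# Capability names and per-capability values below are placeholders; the
# summary does not describe KidVis's actual scoring protocol.
from statistics import mean

per_capability_scores = {
    "capability_1": 70.0,  # placeholder value
    "capability_2": 65.0,  # placeholder value
    "capability_3": 72.5,  # placeholder value
    "capability_4": 60.0,  # placeholder value
    "capability_5": 68.0,  # placeholder value
    "capability_6": 68.5,  # placeholder value
}

def overall_score(scores: dict[str, float]) -> float:
    """Unweighted mean over the six capability scores."""
    return mean(scores.values())

print(f"Overall score: {overall_score(per_capability_scores):.2f}")
```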
— via World Pulse Now AI Editorial System

