mmJEE-Eval: A Bilingual Multimodal Benchmark for Evaluating Scientific Reasoning in Vision-Language Models
The introduction of mmJEE-Eval marks a significant step forward in evaluating scientific reasoning in vision-language models (VLMs). This bilingual benchmark comprises 1,460 questions from India's JEE Advanced exams spanning Physics, Chemistry, and Mathematics, and aims to distinguish genuine reasoning capability from mere pattern matching. Although current VLMs score highly on existing benchmarks, the performance of 17 state-of-the-art models on mmJEE-Eval shows they still struggle with deeper reasoning tasks.

Frontier models such as GPT-5 and Gemini 2.5 Pro/Flash reach 77-84% accuracy on the held-out 2025 questions, while open-source alternatives manage only 37-45% despite extensive parameter scaling. The results expose a critical gap in AI's ability to handle complex reasoning: even the most advanced models falter as cognitive load increases, with GPT-5 correcting just 5.2% of its errors when faced with challenging questions. The release of mmJEE-Eval, …
— via World Pulse Now AI Editorial System

