mmJEE-Eval: A Bilingual Multimodal Benchmark for Evaluating Scientific Reasoning in Vision-Language Models

arXiv — cs.CL · Thursday, November 13, 2025, 5:00 AM
mmJEE-Eval is a new bilingual multimodal benchmark for evaluating scientific reasoning in vision-language models (VLMs). It comprises 1,460 questions from India's JEE Advanced exams across Physics, Chemistry, and Mathematics, and is designed to separate genuine reasoning from pattern-matching. Although current VLMs score highly on existing benchmarks, an evaluation of 17 state-of-the-art models shows they struggle when deeper reasoning is required. Frontier models such as GPT-5 and Gemini 2.5 Pro/Flash reach 77-84% accuracy on the held-out 2025 questions, while open-source alternatives manage only 37-45% despite extensive parameter scaling. The results expose a persistent gap in complex reasoning: even the strongest models degrade under higher cognitive load, with GPT-5 correcting just 5.2% of its errors on challenging questions. The release of mmJEE-Eval, …
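To make the reported metrics concrete, the sketch below shows one plausible way to compute first-pass accuracy and an error-correction rate (the share of initially wrong answers fixed on a retry, like the 5.2% figure cited for GPT-5). The data and function names are illustrative assumptions, not taken from the mmJEE-Eval paper or its evaluation code.

```python
def accuracy(predictions, answers):
    """Fraction of questions answered correctly on the first attempt."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

def error_correction_rate(first_pass, second_pass, answers):
    """Among questions missed on the first pass, the fraction fixed on retry."""
    missed = [i for i, (p, a) in enumerate(zip(first_pass, answers)) if p != a]
    if not missed:
        return 0.0
    fixed = sum(second_pass[i] == answers[i] for i in missed)
    return fixed / len(missed)

# Toy example with 5 multiple-choice questions (option labels A-D).
answers     = ["A", "C", "B", "D", "A"]
first_pass  = ["A", "B", "B", "C", "D"]   # 2 of 5 correct
second_pass = ["A", "C", "B", "C", "D"]   # fixes 1 of the 3 initial errors

print(accuracy(first_pass, answers))                            # 0.4
print(error_correction_rate(first_pass, second_pass, answers))  # 0.333...
```

On this toy data the model answers 40% correctly and repairs one third of its errors on the second pass; the benchmark's headline numbers are the same two quantities computed over its 1,460 questions.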
— via World Pulse Now AI Editorial System


