mmJEE-Eval: A Bilingual Multimodal Benchmark for Evaluating Scientific Reasoning in Vision-Language Models

arXiv cs.CL | Thursday, November 13, 2025 at 5:00:00 AM
mmJEE-Eval is a bilingual benchmark for evaluating scientific reasoning in vision-language models (VLMs). It comprises 1,460 questions from India's JEE Advanced exams across Physics, Chemistry, and Mathematics, and aims to separate genuine reasoning from pattern-matching. Current VLMs score highly on existing benchmarks, yet an evaluation of 17 state-of-the-art models shows that they struggle with deeper reasoning tasks. Models such as GPT-5 and Gemini 2.5 Pro/Flash reach 77-84% accuracy on the held-out 2025 questions, while open-source alternatives reach only 37-45% despite extensive parameter scaling. Even the most advanced models falter under increased cognitive load: GPT-5 corrects just 5.2% of errors when faced with challenging questions. The release of mmJEE-Eval, …
— via World Pulse Now AI Editorial System

Recommended Readings
Sector HQ Weekly Digest - November 17, 2025
Neutral | Artificial Intelligence
The Sector HQ Weekly Digest for November 17, 2025, highlights the latest developments in the AI industry, focusing on the performance of top companies. OpenAI leads with a score of 442,385.7 and 343 events, followed by Anthropic and Amazon. The report also notes significant movements, with Sony jumping 277 positions in the rankings, reflecting the dynamic nature of the AI sector.
Google will allow experienced users to install apps from third-party sources on Android
Positive | Artificial Intelligence
Google has announced a partial reversal of its policy against third-party app stores, allowing experienced users to install Android apps from alternative sources. This change comes after the company had previously maintained a strict stance against such practices. The decision is seen as a significant shift in Google's approach to app distribution on its Android platform.
UI2Code^N: A Visual Language Model for Test-Time Scalable Interactive UI-to-Code Generation
Positive | Artificial Intelligence
UI programming is a complex aspect of software development. Recent advancements in visual language models (VLMs) show promise for automatic UI coding, yet existing methods face limitations in multimodal capabilities and iterative feedback. The UI2Code^N model addresses these issues through an interactive UI-to-code approach, enhancing performance by integrating UI generation, editing, and polishing. This model is trained using staged pretraining, fine-tuning, and reinforcement learning, aiming to improve multimodal coding significantly.
Do AI Voices Learn Social Nuances? A Case of Politeness and Speech Rate
Positive | Artificial Intelligence
A recent study published on arXiv investigates whether advanced text-to-speech systems can learn social nuances, specifically the human tendency to slow speech for politeness. Researchers tested 22 synthetic voices from AI Studio and OpenAI under polite and casual conditions, finding that the polite prompts resulted in significantly slower speech across both platforms. This suggests that AI can internalize and replicate subtle psychological cues in human communication.
MicroVQA++: High-Quality Microscopy Reasoning Dataset with Weakly Supervised Graphs for Multimodal Large Language Model
Positive | Artificial Intelligence
MicroVQA++ is a newly introduced high-quality microscopy reasoning dataset designed for multimodal large language models (MLLMs). It is built from the BIOMEDICA archive through a three-stage process: expert-validated figure-caption pairs, a novel heterogeneous graph for filtering inconsistent samples, and human-checked multiple-choice questions. The dataset aims to enhance scientific reasoning in biomedical imaging, which is currently limited by the lack of large-scale training data.
Forecasters at the US National Hurricane Center are increasingly leaning on Google's new DeepMind prediction model, though questions about its methods remain (Eric Holthaus/The Guardian)
Neutral | Artificial Intelligence
Forecasters at the US National Hurricane Center are increasingly using Google's new DeepMind prediction model, which is designed to provide faster and more accurate hurricane forecasts. Despite its advantages, questions about the model's methods and reliability persist. The model is noted for being less expensive and less time-consuming to run, which could help save lives and property during hurricane events.
Building RSSRenaissance: AI-Powered Summaries for Smarter Reading
Positive | Artificial Intelligence
RSSRenaissance is a tool that aims to help users stay informed without being overwhelmed by excessive articles. The platform fetches RSS feeds from sources such as TechCrunch and The Verge, processes them using a PostgreSQL database, and employs AI to generate instant summaries. This allows users to quickly grasp key points from the content.
AI Agents: From Zero to Hero in 5-Days With Kaggle and Google
Positive | Artificial Intelligence
The article discusses a five-day journey of learning about AI agents using Google's Agent Development Kit (ADK) and Kaggle. The author, who is involved in developing AI workflows at their company, found the course particularly engaging due to its white papers, which provide in-depth insights into various topics. The experience promises to enhance their understanding of AI agents and their applications in complex workflows.