The 2025 Planning Performance of Frontier Large Language Models

arXiv — cs.LG · Thursday, November 13, 2025
The evaluation of frontier Large Language Models (LLMs) in 2025 reveals notable advances in their planning capabilities, particularly for models such as DeepSeek R1, Gemini 2.5 Pro, and GPT-5. Using PDDL domain and task descriptions, the study found that GPT-5 solves tasks at a rate competitive with the established planner LAMA. However, on obfuscated tasks designed to test pure reasoning, all models declined in performance, albeit less severely than earlier generations did. This indicates that while planning ability has improved, reasoning over unfamiliar representations remains a challenge. The results underscore the ongoing evolution of LLMs, suggesting they are becoming increasingly capable in complex planning scenarios and are narrowing the gap with traditional planners. As the field advances, these findings are important for understanding both the potential and the limitations of LLMs in real-world problem solving.
— via World Pulse Now AI Editorial System
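The "obfuscated tasks" mentioned above typically replace semantically meaningful PDDL names with neutral tokens, so a model cannot lean on world knowledge about blocks or grippers and must plan from the logical structure alone. A minimal sketch of that idea, assuming an invented toy domain and an invented renaming map (neither comes from the paper):

```python
import re

# Hypothetical toy PDDL domain snippet, for illustration only.
DOMAIN = """(define (domain blocks)
  (:predicates (on ?x ?y) (clear ?x) (holding ?x))
  (:action pick-up
    :parameters (?x)
    :precondition (clear ?x)
    :effect (holding ?x)))"""

# Invented mapping from meaningful names to neutral tokens.
RENAME = {
    "blocks": "d1",
    "on": "p1",
    "clear": "p2",
    "holding": "p3",
    "pick-up": "a1",
}

def obfuscate(pddl: str, mapping: dict[str, str]) -> str:
    """Replace whole-word occurrences of each name with its neutral token."""
    # Sort longest-first so "pick-up" is renamed before the shorter "on"
    # could match inside another identifier; the lookarounds keep "on"
    # from matching inside words like "precondition".
    for name in sorted(mapping, key=len, reverse=True):
        pddl = re.sub(
            rf"(?<![\w-]){re.escape(name)}(?![\w-])", mapping[name], pddl
        )
    return pddl

print(obfuscate(DOMAIN, RENAME))
```

The structure of the planning problem is preserved exactly, so a classical planner's performance is unchanged, while an LLM that relied on the suggestive names now has to reason symbolically.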


Recommended Readings
UI2Code^N: A Visual Language Model for Test-Time Scalable Interactive UI-to-Code Generation
Positive · Artificial Intelligence
UI programming is a complex aspect of software development. Recent advancements in visual language models (VLMs) show promise for automatic UI coding, yet existing methods face limitations in multimodal capabilities and iterative feedback. The UI2Code^N model addresses these issues through an interactive UI-to-code approach, enhancing performance by integrating UI generation, editing, and polishing. This model is trained using staged pretraining, fine-tuning, and reinforcement learning, aiming to improve multimodal coding significantly.
Evaluating Large Language Models on Rare Disease Diagnosis: A Case Study using House M.D.
Neutral · Artificial Intelligence
Large language models (LLMs) have shown potential in various fields, but their effectiveness in diagnosing rare diseases from narrative medical cases remains largely unexamined. A new dataset of 176 symptom-diagnosis pairs drawn from the medical series House M.D. has been introduced for this purpose. Four advanced LLMs, including GPT-4o mini and Gemini 2.5 Pro, were evaluated, with accuracy ranging from 16.48% to 38.64% and newer models showing a 2.3-fold improvement on diagnostic reasoning tasks.
MicroVQA++: High-Quality Microscopy Reasoning Dataset with Weakly Supervised Graphs for Multimodal Large Language Model
Positive · Artificial Intelligence
MicroVQA++ is a newly introduced high-quality microscopy reasoning dataset designed for multimodal large language models (MLLMs). Derived from the BIOMEDICA archive, it is built through a three-stage process: expert-validated figure-caption pairs, a novel heterogeneous graph for filtering inconsistent samples, and human-checked multiple-choice questions. The dataset aims to enhance scientific reasoning in biomedical imaging, addressing current limitations caused by the lack of large-scale training data.