arXiv:2510.22340v2 Announce Type: replace-cross 
Abstract: Solid geometry problem solving demands spatial mathematical reasoning that integrates spatial intelligence and symbolic reasoning. However, most existing multimodal mathematical reasoning benchmarks focus primarily on 2D plane geometry, rely on static datasets prone to data contamination and memorization, and evaluate models solely by final answers, overlooking the reasoning process. To address these limitations, we introduce DynaSolidGeo, the first dynamic benchmark for evaluating genuine spatial reasoning in Vision-Language Models (VLMs). Constructed through a semi-automatic annotation pipeline, DynaSolidGeo contains 503 expert-curated seed questions that can, in principle, dynamically generate an unbounded number of diverse multimodal text-visual instances. Beyond answer accuracy, we incorporate process evaluation based on expert-annotated reasoning chains to measure logical validity and causal coherence. Experiments across representative open-source and closed-source VLMs reveal large performance gaps, severe degradation in dynamic settings, and poor performance on tasks requiring high-level spatial intelligence, such as mental rotation and visualization. The code and dataset are available at https://zgca-ai4edu.github.io/DynaSolidGeo/.

DynaSolidGeo is introduced as the first dynamic benchmark for assessing spatial mathematical reasoning in Vision-Language Models (VLMs). Unlike existing benchmarks, which focus mainly on 2D plane geometry and rely on static datasets, DynaSolidGeo uses a semi-automatic annotation pipeline to build 503 expert-curated seed questions from which diverse multimodal instances can be generated dynamically. It evaluates not only final-answer accuracy but also the reasoning process, using expert-annotated reasoning chains. The experiments reveal large performance gaps across VLMs, severe degradation in dynamic settings, and weak results on tasks that require high-level spatial intelligence, such as mental rotation and visualization.
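
To make the idea of a "seed question" concrete, here is a minimal sketch of how one parameterized question could yield many instances with known answers. It is not the DynaSolidGeo pipeline: the names `Instance`, `cube_diagonal_seed`, `generate`, and `is_correct` are hypothetical, the cube-diagonal problem is an invented example, and the sketch covers only the text side, whereas the benchmark also produces the accompanying visuals.

```python
import math
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Instance:
    """One generated question together with its ground-truth answer."""
    question: str
    answer: float


def cube_diagonal_seed(rng: random.Random) -> Instance:
    # Hypothetical seed question: the edge length is the only free parameter.
    edge = rng.randint(2, 20)
    question = (
        f"A cube has edge length {edge}. "
        "What is the length of its space diagonal?"
    )
    # Space diagonal of a cube with edge a is a * sqrt(3).
    return Instance(question=question, answer=edge * math.sqrt(3))


def generate(seed: int, n: int) -> List[Instance]:
    # A fixed seed makes the generated set reproducible; a new seed
    # gives a fresh set that a model cannot have memorized verbatim.
    rng = random.Random(seed)
    return [cube_diagonal_seed(rng) for _ in range(n)]


def is_correct(predicted: float, instance: Instance, tol: float = 1e-6) -> bool:
    # Final-answer check only; process evaluation is out of scope here.
    return math.isclose(predicted, instance.answer, rel_tol=tol)
```

Calling `generate(0, 5)` twice returns the same five instances, while changing the seed produces a different draw. Under this assumption, that is what would let a dynamic benchmark be re-sampled for each evaluation run.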

DynaSolidGeo: A Dynamic Benchmark for Genuine Spatial Mathematical Reasoning of VLMs in Solid Geometry
