arXiv:2511.02779v1 Announce Type: new 
Abstract: We propose MIRA, a new benchmark designed to evaluate models in scenarios where generating intermediate visual images is essential for successful reasoning. Unlike traditional CoT methods that rely solely on text, tasks in MIRA require models to generate and use intermediate images (such as sketches, structural diagrams, or path drawings) to guide their reasoning process. This setup closely mirrors how humans solve complex problems through "drawing to think". Accordingly, MIRA focuses on tasks that are intrinsically challenging and involve complex structures, spatial relationships, or reasoning steps that are difficult to express through language alone. To ensure that our evaluation data is of high quality, we include 546 multimodal problems, each annotated with intermediate visual images and final answers. We also propose a unified evaluation protocol for MIRA that spans three levels of evaluation input: direct input with image and question only, text-only CoT input with image and thinking prompts, and Visual-CoT input with both annotated image clues and textual thinking prompts. To probe the upper bound of model capacity on our benchmark, we also report pass@k and majority voting accuracies under different k settings. Experimental results show that existing multimodal large language models, including the strongest private models as well as strong open-weight models, perform poorly when relying solely on textual prompts. However, when intermediate visual cues are provided, model performance improves consistently, yielding an average relative gain of 33.7% across all models and tasks. We also probe the upper bound by expanding the search space and by designing textual prompts aligned with Visual-CoT, but both yield only limited improvements compared to our Visual-CoT setting. These results underscore the critical role of imagined visual information in enabling successful reasoning on MIRA.
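
The abstract names three quantities without defining them: pass@k, majority-voting accuracy, and relative gain. As a rough illustration only, the sketch below uses the commonly cited unbiased pass@k estimator and a plain plurality vote; the abstract does not say which formulas the authors use, so the function names (`pass_at_k`, `majority_vote`, `relative_gain`) and the tie-breaking rule are assumptions rather than the paper's protocol.

```python
from collections import Counter
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: total sampled attempts, c: attempts that were correct, k: budget.
    Returns the probability that at least one of k attempts, drawn without
    replacement from the n samples, is correct. (Assumed estimator; not
    confirmed by the abstract.)
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect attempts exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


def majority_vote(answers: list[str]) -> str:
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def relative_gain(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`, as a fraction."""
    return (improved - baseline) / baseline
```

Under these assumptions, a benchmark-level pass@k would be the mean of `pass_at_k` over all problems, and majority-voting accuracy would be the fraction of problems where `majority_vote` over the k sampled answers matches the reference answer.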

MIRA is a benchmark that evaluates visual reasoning in models by requiring them to generate intermediate images such as sketches and diagrams. This setup mirrors how humans draw while working through complex problems.

When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought
