PRISM-Bench: A Benchmark of Puzzle-Based Visual Tasks with CoT Error Detection

arXiv, cs.CV • Tuesday, December 2, 2025 at 5:00:00 AM

arXiv:2510.23594v4. Abstract: Multimodal large language models (MLLMs) have achieved remarkable progress on vision-language tasks, yet their reasoning processes are still sometimes unreliable. We introduce PRISM-Bench, a benchmark of puzzle-based visual challenges designed to evaluate not only whether models can solve problems, but also how their reasoning unfolds. Unlike prior evaluations that measure only final-answer accuracy, PRISM-Bench introduces a diagnostic task: given a visual puzzle and a step-by-step chain-of-thought (CoT) containing exactly one error, models must identify the first incorrect step. This setting enables fine-grained assessment of logical consistency, error detection, and visual reasoning. The puzzles in PRISM-Bench require multi-step symbolic, geometric, and analogical reasoning, and they resist shortcuts based on superficial pattern matching. Evaluations across state-of-the-art MLLMs reveal a persistent gap between fluent generation and faithful reasoning: models that produce plausible CoTs often fail to locate simple logical faults. By disentangling answer generation from reasoning verification, PRISM-Bench offers a sharper lens on multimodal reasoning competence and underscores the need for diagnostic evaluation protocols in the development of trustworthy MLLMs.
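The diagnostic task described in the abstract can be illustrated with a minimal scoring sketch. The abstract does not specify the benchmark's data format or metric, so everything below is an assumption made for illustration: the `ErrorDetectionItem` layout, the 1-based step indexing, and the exact-match accuracy in `first_error_accuracy` are hypothetical names and choices, not PRISM-Bench's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class ErrorDetectionItem:
    """Hypothetical item: a chain-of-thought containing exactly one error."""
    steps: list[str]        # the CoT, one string per reasoning step
    first_error_step: int   # assumed 1-based index of the first incorrect step


def first_error_accuracy(items, predictions):
    """Fraction of items whose predicted step index matches the labelled one.

    Exact-match scoring is an assumption; the source does not state the metric.
    """
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    if not items:
        return 0.0
    correct = sum(
        1
        for item, pred in zip(items, predictions)
        if pred == item.first_error_step
    )
    return correct / len(items)


# Toy data: placeholder step strings stand in for real reasoning steps.
items = [
    ErrorDetectionItem(steps=["s1", "s2", "s3"], first_error_step=2),
    ErrorDetectionItem(steps=["s1", "s2", "s3", "s4"], first_error_step=4),
]
predictions = [2, 3]  # hypothetical model outputs, one step index per item
accuracy = first_error_accuracy(items, predictions)
```

Scoring the step index separately from the puzzle's final answer mirrors the abstract's point about disentangling answer generation from reasoning verification: a model could solve a puzzle correctly and still fail to locate the faulty step in a given CoT.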

— via World Pulse Now AI Editorial System
