VisReason: A Large-Scale Dataset for Visual Chain-of-Thought Reasoning

arXiv — cs.LG, Tuesday, November 25, 2025 at 5:00:00 AM
  • A new dataset named VisReason has been introduced to enhance visual Chain-of-Thought (CoT) reasoning in multimodal large language models (MLLMs). It comprises 489,000 annotated examples across four domains and aims to support complex reasoning by providing multi-round, human-like rationales that walk MLLMs through visual reasoning steps (a sketch of what such a record might look like follows this summary). A subset called VisReason-Pro, with 165,000 examples, has been curated with expert-level annotations.
  • VisReason is significant because it addresses a key limitation of existing visual-CoT resources, which are often small or domain-specific. As a large-scale dataset, it is expected to improve the interpretability and performance of MLLMs, helping them understand and reason about visual information more reliably.
  • This initiative reflects a broader trend in AI research focused on enhancing reasoning capabilities in multimodal models. As frameworks like ReVeL and EvoLMM emerge, aiming to improve question-answering and reasoning without heavy reliance on human-annotated data, the introduction of VisReason aligns with ongoing efforts to create more robust and autonomous AI systems capable of complex visual reasoning.
— via World Pulse Now AI Editorial System
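
The summary above does not give VisReason's actual annotation schema, so purely as an illustration, a multi-round visual-CoT record could be organized along the following lines (all field names and the example content are hypothetical, not taken from the dataset):

```python
# Hypothetical sketch of a multi-round visual chain-of-thought record.
# Field names and contents are illustrative, not VisReason's actual schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ReasoningRound:
    observation: str   # what to attend to in the image at this step
    rationale: str     # the human-like reasoning for this round

@dataclass
class VisualCoTExample:
    image_path: str
    question: str
    rounds: List[ReasoningRound] = field(default_factory=list)
    final_answer: str = ""

example = VisualCoTExample(
    image_path="images/kitchen_042.jpg",
    question="How many mugs are on the counter?",
    rounds=[
        ReasoningRound("Locate the counter region on the right.",
                       "The counter spans the lower-right quadrant."),
        ReasoningRound("Count mug-shaped objects in that region.",
                       "Three mugs are visible: two white, one blue."),
    ],
    final_answer="3",
)

for i, r in enumerate(example.rounds, 1):
    print(f"Round {i}: {r.observation} -> {r.rationale}")
print("Answer:", example.final_answer)
```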

Continue Reading
Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens
Positive · Artificial Intelligence
A new framework called Chain-of-Visual-Thought (COVT) has been introduced to enhance Vision-Language Models (VLMs) by enabling them to reason with continuous visual tokens, which encapsulate rich perceptual cues. This approach aims to address the limitations of current VLMs in dense visual perception tasks, such as spatial reasoning and geometric awareness, by distilling knowledge from lightweight vision experts within a budget of approximately 20 tokens.
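
The abstract only outlines the idea; as a rough, non-authoritative sketch of how a small budget of continuous visual tokens could be produced from a lightweight vision expert, one might pool expert features with a handful of learnable queries and project them into the language model's embedding space. The module names, dimensions, and pooling scheme below are assumptions, not COVT's actual design:

```python
# Illustrative sketch only: compress vision-expert features into a small
# budget of continuous tokens a VLM can attend to alongside text tokens.
import torch
import torch.nn as nn

class ContinuousVisualTokenizer(nn.Module):
    def __init__(self, expert_dim=256, llm_dim=4096, num_tokens=20):
        super().__init__()
        # Learnable queries cross-attend to expert features (Perceiver-style pooling).
        self.queries = nn.Parameter(torch.randn(num_tokens, expert_dim))
        self.attn = nn.MultiheadAttention(expert_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(expert_dim, llm_dim)  # map into the LLM embedding space

    def forward(self, expert_feats):                # (B, N_patches, expert_dim)
        B = expert_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        pooled, _ = self.attn(q, expert_feats, expert_feats)
        return self.proj(pooled)                    # (B, num_tokens, llm_dim)

tokenizer = ContinuousVisualTokenizer()
feats = torch.randn(2, 196, 256)   # e.g. features from a depth or segmentation expert
visual_tokens = tokenizer(feats)
print(visual_tokens.shape)         # torch.Size([2, 20, 4096])
```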
Real-Time Personalized Content Adaptation through Matrix Factorization and Context-Aware Federated Learning
Positive · Artificial Intelligence
A new study introduces a federated learning framework for improving user interaction and content relevance on social media platforms, combining personalized LLM fine-tuning with context-aware models. The framework lets multiple client entities fine-tune a foundational GPT model on locally collected data while preserving data privacy through federated aggregation.
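
The summary does not specify the aggregation rule; a minimal FedAvg-style sketch, assuming plain dataset-size-weighted averaging of locally fine-tuned parameters, is shown below (all names are hypothetical):

```python
# Minimal FedAvg-style aggregation sketch (illustrative only; the paper's
# actual federated and privacy mechanisms may differ).
from typing import Dict, List
import torch

def federated_average(client_states: List[Dict[str, torch.Tensor]],
                      client_sizes: List[int]) -> Dict[str, torch.Tensor]:
    """Weight each client's fine-tuned parameters by its local dataset size."""
    total = float(sum(client_sizes))
    aggregated = {}
    for name in client_states[0]:
        aggregated[name] = sum(
            state[name] * (n / total)
            for state, n in zip(client_states, client_sizes)
        )
    return aggregated

# Example: three clients fine-tune the same (toy) parameter locally.
clients = [{"lm_head.weight": torch.full((2, 2), v)} for v in (1.0, 2.0, 3.0)]
global_state = federated_average(clients, client_sizes=[100, 300, 600])
print(global_state["lm_head.weight"])  # weighted toward the larger clients
```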
Beyond Multiple Choice: Verifiable OpenQA for Robust Vision-Language RFT
Positive · Artificial Intelligence
A new framework called ReVeL (Rewrite and Verify by LLM) has been proposed to enhance the multiple-choice question answering (MCQA) format used in evaluating multimodal language models. This framework transforms MCQA into open-form questions while ensuring answers remain verifiable, addressing issues of answer guessing and unreliable accuracy metrics during reinforcement fine-tuning (RFT).
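
As an illustration of the overall pipeline shape only, rewriting a multiple-choice item into an open-form question and then verifying a free-form answer against the gold option could look roughly like this (the prompts and the call_llm helper are placeholders, not ReVeL's actual interface):

```python
# Illustrative pipeline shape: rewrite a multiple-choice item into an
# open-form question, then verify a free-form answer against the gold option.
# The prompts and `call_llm` are hypothetical placeholders.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def rewrite_to_open_form(question: str, options: dict) -> str:
    prompt = (
        "Rewrite this multiple-choice question as an open-ended question "
        "whose answer can be checked without seeing the options.\n"
        f"Question: {question}\nOptions: {options}"
    )
    return call_llm(prompt)

def verify_answer(open_question: str, model_answer: str, gold_option: str) -> bool:
    prompt = (
        "Does the candidate answer express the same fact as the reference?\n"
        f"Question: {open_question}\nCandidate: {model_answer}\n"
        f"Reference: {gold_option}\nReply yes or no."
    )
    return call_llm(prompt).strip().lower().startswith("yes")
```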
Computational frame analysis revisited: On LLMs for studying news coverage
Neutral · Artificial Intelligence
A recent study has revisited the effectiveness of large language models (LLMs) like GPT and Claude in analyzing media frames, particularly in the context of news coverage surrounding the US Mpox epidemic of 2022. The research systematically evaluated these generative models against traditional methods, revealing that manual coders consistently outperformed LLMs in frame analysis tasks.
Multi-speaker Attention Alignment for Multimodal Social Interaction
Positive · Artificial Intelligence
A new method for enhancing social interaction understanding in videos has been proposed, focusing on the alignment of verbal and non-verbal cues in multi-speaker scenarios. This approach addresses the limitations observed in existing Multimodal Large Language Models (MLLMs), which struggle with cross-modal attention consistency in such contexts.
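
The summary leaves the alignment mechanism unspecified; one toy way to encourage cross-modal attention consistency would be a symmetric divergence between the attention each modality places on the speakers. This is purely illustrative, not the paper's method:

```python
# Toy sketch of a cross-modal attention-consistency objective: encourage the
# attention each modality places on the speakers to agree. Illustrative only.
import torch

def attention_alignment_loss(verbal_attn: torch.Tensor,
                             visual_attn: torch.Tensor) -> torch.Tensor:
    """Symmetric KL between two (batch, num_speakers) attention distributions."""
    p = verbal_attn.clamp_min(1e-8)
    q = visual_attn.clamp_min(1e-8)
    kl_pq = (p * (p / q).log()).sum(dim=-1)
    kl_qp = (q * (q / p).log()).sum(dim=-1)
    return 0.5 * (kl_pq + kl_qp).mean()

verbal = torch.softmax(torch.randn(4, 3), dim=-1)  # attention over 3 speakers from speech/text
visual = torch.softmax(torch.randn(4, 3), dim=-1)  # attention over the same speakers from video
print(attention_alignment_loss(verbal, visual))
```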
EventBench: Towards Comprehensive Benchmarking of Event-based MLLMs
Neutral · Artificial Intelligence
A new benchmark called EventBench has been introduced to evaluate the capabilities of multimodal large language models (MLLMs) in event-based vision. This benchmark features eight diverse task metrics and a large-scale event stream dataset, aiming to provide a comprehensive assessment of MLLMs' performance across various tasks, including understanding, recognition, and spatial reasoning.
EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards
Positive · Artificial Intelligence
EvoLMM, a self-evolving framework for large multimodal models, has been introduced to enhance reasoning capabilities without relying on human-annotated data. This framework consists of two cooperative agents: a Proposer that generates diverse questions and a Solver that answers them through a continuous self-rewarding process. This innovation aims to improve the autonomy and scalability of multimodal models.
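
The exact reward is not described in this summary; one plausible reading of "continuous self-rewarding" is an agreement-based signal over the Solver's sampled answers, sketched below with hypothetical stand-ins for the model calls:

```python
# Minimal sketch of a Proposer/Solver self-improvement loop. The reward here
# (agreement among sampled answers) is one possible reading of "continuous
# self-rewarding"; the paper's actual objective may differ. All helpers are
# hypothetical stand-ins for model calls and updates.
from collections import Counter
from typing import List

def propose_question(image) -> str: ...
def solve(image, question: str) -> str: ...
def update_solver(image, question: str, answers: List[str], reward: float): ...

def self_evolve_step(image, num_samples: int = 8) -> float:
    question = propose_question(image)
    answers = [solve(image, question) for _ in range(num_samples)]
    # Continuous reward in [0, 1]: fraction of samples agreeing with the majority answer.
    counts = Counter(answers)
    reward = counts.most_common(1)[0][1] / num_samples
    update_solver(image, question, answers, reward)
    return reward
```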
Seeing the Forest and the Trees: Query-Aware Tokenizer for Long-Video Multimodal Language Models
Positive · Artificial Intelligence
A new approach called Query-aware Token Selector (QTSplus) has been introduced to enhance long-video understanding in multimodal large language models (MLLMs). This module addresses the challenge of increasing vision token counts with video length, which leads to higher attention costs and latency. QTSplus dynamically selects the most relevant visual tokens based on text queries, improving efficiency in processing long videos.
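
As a rough illustration of query-aware token selection (not QTSplus's actual module), one could score each vision token against a pooled embedding of the text query and keep only the top-k; the dimensions and scoring rule below are assumptions:

```python
# Illustrative sketch of query-aware visual token selection: score each vision
# token against a pooled text-query embedding and keep only the top-k.
import torch
import torch.nn as nn

class QueryAwareTokenSelector(nn.Module):
    def __init__(self, vision_dim=1024, text_dim=1024, keep_k=256):
        super().__init__()
        self.keep_k = keep_k
        self.score = nn.Linear(vision_dim + text_dim, 1)

    def forward(self, vision_tokens, query_emb):
        # vision_tokens: (B, N, vision_dim); query_emb: (B, text_dim)
        B, N, _ = vision_tokens.shape
        q = query_emb.unsqueeze(1).expand(-1, N, -1)
        scores = self.score(torch.cat([vision_tokens, q], dim=-1)).squeeze(-1)  # (B, N)
        top = scores.topk(min(self.keep_k, N), dim=-1).indices
        idx = top.unsqueeze(-1).expand(-1, -1, vision_tokens.size(-1))
        return torch.gather(vision_tokens, 1, idx)  # (B, keep_k, vision_dim)

selector = QueryAwareTokenSelector(keep_k=256)
frames = torch.randn(1, 8192, 1024)   # vision tokens from a long video
query = torch.randn(1, 1024)          # pooled embedding of the text query
print(selector(frames, query).shape)  # torch.Size([1, 256, 1024])
```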