VisReason: A Large-Scale Dataset for Visual Chain-of-Thought Reasoning

arXiv — cs.LG, Tuesday, November 25, 2025 at 5:00:00 AM
  • A new dataset named VisReason has been introduced to enhance visual Chain-of-Thought (CoT) reasoning in multimodal large language models (MLLMs). It comprises 489,000 annotated examples across four domains and aims to support complex reasoning by providing multi-round, human-like rationales that walk MLLMs through visual reasoning steps (a sketch of what such a record might look like follows this summary). A subset called VisReason-Pro, with 165,000 examples, has been curated with expert-level annotations.
  • VisReason is significant because it addresses a key limitation of existing visual-CoT resources, which are often small or domain-specific. As a large-scale dataset, it is expected to improve the interpretability and performance of MLLMs, helping them understand and reason about visual information more reliably.
  • This initiative reflects a broader trend in AI research focused on enhancing reasoning capabilities in multimodal models. As frameworks like ReVeL and EvoLMM emerge, aiming to improve question-answering and reasoning without heavy reliance on human-annotated data, the introduction of VisReason aligns with ongoing efforts to create more robust and autonomous AI systems capable of complex visual reasoning.
— via World Pulse Now AI Editorial System
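
The summary above does not give VisReason's actual annotation schema, so purely as an illustration, a multi-round visual-CoT record could be organized along the following lines (all field names and the example content are hypothetical, not taken from the dataset):

```python
# Hypothetical sketch of a multi-round visual chain-of-thought record.
# Field names and contents are illustrative, not VisReason's actual schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ReasoningRound:
    observation: str   # what to attend to in the image at this step
    rationale: str     # the human-like reasoning for this round

@dataclass
class VisualCoTExample:
    image_path: str
    question: str
    rounds: List[ReasoningRound] = field(default_factory=list)
    final_answer: str = ""

example = VisualCoTExample(
    image_path="images/kitchen_042.jpg",
    question="How many mugs are on the counter?",
    rounds=[
        ReasoningRound("Locate the counter region on the right.",
                       "The counter spans the lower-right quadrant."),
        ReasoningRound("Count mug-shaped objects in that region.",
                       "Three mugs are visible: two white, one blue."),
    ],
    final_answer="3",
)

for i, r in enumerate(example.rounds, 1):
    print(f"Round {i}: {r.observation} -> {r.rationale}")
print("Answer:", example.final_answer)
```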

Continue Reading
Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens
Positive · Artificial Intelligence
A new framework called Chain-of-Visual-Thought (COVT) has been introduced to enhance Vision-Language Models (VLMs) by enabling them to reason with continuous visual tokens, which encapsulate rich perceptual cues. This approach aims to address the limitations of current VLMs in dense visual perception tasks, such as spatial reasoning and geometric awareness, by distilling knowledge from lightweight vision experts within a budget of approximately 20 tokens.
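
The abstract only outlines the idea; as a rough, non-authoritative sketch of how a small budget of continuous visual tokens could be produced from a lightweight vision expert, one might pool expert features with a handful of learnable queries and project them into the language model's embedding space. The module names, dimensions, and pooling scheme below are assumptions, not COVT's actual design:

```python
# Illustrative sketch only: compress vision-expert features into a small
# budget of continuous tokens a VLM can attend to alongside text tokens.
import torch
import torch.nn as nn

class ContinuousVisualTokenizer(nn.Module):
    def __init__(self, expert_dim=256, llm_dim=4096, num_tokens=20):
        super().__init__()
        # Learnable queries cross-attend to expert features (Perceiver-style pooling).
        self.queries = nn.Parameter(torch.randn(num_tokens, expert_dim))
        self.attn = nn.MultiheadAttention(expert_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(expert_dim, llm_dim)  # map into the LLM embedding space

    def forward(self, expert_feats):                # (B, N_patches, expert_dim)
        B = expert_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        pooled, _ = self.attn(q, expert_feats, expert_feats)
        return self.proj(pooled)                    # (B, num_tokens, llm_dim)

tokenizer = ContinuousVisualTokenizer()
feats = torch.randn(2, 196, 256)   # e.g. features from a depth or segmentation expert
visual_tokens = tokenizer(feats)
print(visual_tokens.shape)         # torch.Size([2, 20, 4096])
```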
Real-Time Personalized Content Adaptation through Matrix Factorization and Context-Aware Federated Learning
Positive · Artificial Intelligence
A new study introduces a federated learning framework for improving user interaction and content relevance on social media platforms, combining personalized LLM fine-tuning with context-aware models. The framework lets multiple client entities fine-tune a foundational GPT model on locally collected data while preserving data privacy through federated aggregation.
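
The summary does not specify the aggregation rule; a minimal FedAvg-style sketch, assuming plain dataset-size-weighted averaging of locally fine-tuned parameters, is shown below (all names are hypothetical):

```python
# Minimal FedAvg-style aggregation sketch (illustrative only; the paper's
# actual federated and privacy mechanisms may differ).
from typing import Dict, List
import torch

def federated_average(client_states: List[Dict[str, torch.Tensor]],
                      client_sizes: List[int]) -> Dict[str, torch.Tensor]:
    """Weight each client's fine-tuned parameters by its local dataset size."""
    total = float(sum(client_sizes))
    aggregated = {}
    for name in client_states[0]:
        aggregated[name] = sum(
            state[name] * (n / total)
            for state, n in zip(client_states, client_sizes)
        )
    return aggregated

# Example: three clients fine-tune the same (toy) parameter locally.
clients = [{"lm_head.weight": torch.full((2, 2), v)} for v in (1.0, 2.0, 3.0)]
global_state = federated_average(clients, client_sizes=[100, 300, 600])
print(global_state["lm_head.weight"])  # weighted toward the larger clients
```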
Beyond Multiple Choice: Verifiable OpenQA for Robust Vision-Language RFT
Positive · Artificial Intelligence
A new framework called ReVeL (Rewrite and Verify by LLM) has been proposed to enhance the multiple-choice question answering (MCQA) format used in evaluating multimodal language models. This framework transforms MCQA into open-form questions while ensuring answers remain verifiable, addressing issues of answer guessing and unreliable accuracy metrics during reinforcement fine-tuning (RFT).
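
As an illustration of the overall pipeline shape only, rewriting a multiple-choice item into an open-form question and then verifying a free-form answer against the gold option could look roughly like this (the prompts and the call_llm helper are placeholders, not ReVeL's actual interface):

```python
# Illustrative pipeline shape: rewrite a multiple-choice item into an
# open-form question, then verify a free-form answer against the gold option.
# The prompts and `call_llm` are hypothetical placeholders.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def rewrite_to_open_form(question: str, options: dict) -> str:
    prompt = (
        "Rewrite this multiple-choice question as an open-ended question "
        "whose answer can be checked without seeing the options.\n"
        f"Question: {question}\nOptions: {options}"
    )
    return call_llm(prompt)

def verify_answer(open_question: str, model_answer: str, gold_option: str) -> bool:
    prompt = (
        "Does the candidate answer express the same fact as the reference?\n"
        f"Question: {open_question}\nCandidate: {model_answer}\n"
        f"Reference: {gold_option}\nReply yes or no."
    )
    return call_llm(prompt).strip().lower().startswith("yes")
```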
Computational frame analysis revisited: On LLMs for studying news coverage
Neutral · Artificial Intelligence
A recent study has revisited the effectiveness of large language models (LLMs) like GPT and Claude in analyzing media frames, particularly in the context of news coverage surrounding the US Mpox epidemic of 2022. The research systematically evaluated these generative models against traditional methods, revealing that manual coders consistently outperformed LLMs in frame analysis tasks.
Multi-speaker Attention Alignment for Multimodal Social Interaction
Positive · Artificial Intelligence
A new method for enhancing social interaction understanding in videos has been proposed, focusing on the alignment of verbal and non-verbal cues in multi-speaker scenarios. This approach addresses the limitations observed in existing Multimodal Large Language Models (MLLMs), which struggle with cross-modal attention consistency in such contexts.
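
The summary leaves the alignment mechanism unspecified; one toy way to encourage cross-modal attention consistency would be a symmetric divergence between the attention each modality places on the speakers. This is purely illustrative, not the paper's method:

```python
# Toy sketch of a cross-modal attention-consistency objective: encourage the
# attention each modality places on the speakers to agree. Illustrative only.
import torch

def attention_alignment_loss(verbal_attn: torch.Tensor,
                             visual_attn: torch.Tensor) -> torch.Tensor:
    """Symmetric KL between two (batch, num_speakers) attention distributions."""
    p = verbal_attn.clamp_min(1e-8)
    q = visual_attn.clamp_min(1e-8)
    kl_pq = (p * (p / q).log()).sum(dim=-1)
    kl_qp = (q * (q / p).log()).sum(dim=-1)
    return 0.5 * (kl_pq + kl_qp).mean()

verbal = torch.softmax(torch.randn(4, 3), dim=-1)  # attention over 3 speakers from speech/text
visual = torch.softmax(torch.randn(4, 3), dim=-1)  # attention over the same speakers from video
print(attention_alignment_loss(verbal, visual))
```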
EventBench: Towards Comprehensive Benchmarking of Event-based MLLMs
Neutral · Artificial Intelligence
A new benchmark called EventBench has been introduced to evaluate the capabilities of multimodal large language models (MLLMs) in event-based vision. This benchmark features eight diverse task metrics and a large-scale event stream dataset, aiming to provide a comprehensive assessment of MLLMs' performance across various tasks, including understanding, recognition, and spatial reasoning.
EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards
Positive · Artificial Intelligence
EvoLMM, a self-evolving framework for large multimodal models, has been introduced to enhance reasoning capabilities without relying on human-annotated data. This framework consists of two cooperative agents: a Proposer that generates diverse questions and a Solver that answers them through a continuous self-rewarding process. This innovation aims to improve the autonomy and scalability of multimodal models.
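
The exact reward is not described in this summary; one plausible reading of "continuous self-rewarding" is an agreement-based signal over the Solver's sampled answers, sketched below with hypothetical stand-ins for the model calls:

```python
# Minimal sketch of a Proposer/Solver self-improvement loop. The reward here
# (agreement among sampled answers) is one possible reading of "continuous
# self-rewarding"; the paper's actual objective may differ. All helpers are
# hypothetical stand-ins for model calls and updates.
from collections import Counter
from typing import List

def propose_question(image) -> str: ...
def solve(image, question: str) -> str: ...
def update_solver(image, question: str, answers: List[str], reward: float): ...

def self_evolve_step(image, num_samples: int = 8) -> float:
    question = propose_question(image)
    answers = [solve(image, question) for _ in range(num_samples)]
    # Continuous reward in [0, 1]: fraction of samples agreeing with the majority answer.
    counts = Counter(answers)
    reward = counts.most_common(1)[0][1] / num_samples
    update_solver(image, question, answers, reward)
    return reward
```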
Seeing the Forest and the Trees: Query-Aware Tokenizer for Long-Video Multimodal Language Models
Positive · Artificial Intelligence
A new approach called Query-aware Token Selector (QTSplus) has been introduced to enhance long-video understanding in multimodal large language models (MLLMs). This module addresses the challenge of increasing vision token counts with video length, which leads to higher attention costs and latency. QTSplus dynamically selects the most relevant visual tokens based on text queries, improving efficiency in processing long videos.
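
As a rough illustration of query-aware token selection (not QTSplus's actual module), one could score each vision token against a pooled embedding of the text query and keep only the top-k; the dimensions and scoring rule below are assumptions:

```python
# Illustrative sketch of query-aware visual token selection: score each vision
# token against a pooled text-query embedding and keep only the top-k.
import torch
import torch.nn as nn

class QueryAwareTokenSelector(nn.Module):
    def __init__(self, vision_dim=1024, text_dim=1024, keep_k=256):
        super().__init__()
        self.keep_k = keep_k
        self.score = nn.Linear(vision_dim + text_dim, 1)

    def forward(self, vision_tokens, query_emb):
        # vision_tokens: (B, N, vision_dim); query_emb: (B, text_dim)
        B, N, _ = vision_tokens.shape
        q = query_emb.unsqueeze(1).expand(-1, N, -1)
        scores = self.score(torch.cat([vision_tokens, q], dim=-1)).squeeze(-1)  # (B, N)
        top = scores.topk(min(self.keep_k, N), dim=-1).indices
        idx = top.unsqueeze(-1).expand(-1, -1, vision_tokens.size(-1))
        return torch.gather(vision_tokens, 1, idx)  # (B, keep_k, vision_dim)

selector = QueryAwareTokenSelector(keep_k=256)
frames = torch.randn(1, 8192, 1024)   # vision tokens from a long video
query = torch.randn(1, 1024)          # pooled embedding of the text query
print(selector(frames, query).shape)  # torch.Size([1, 256, 1024])
```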