Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens

arXiv — cs.CV · Tuesday, November 25, 2025 at 5:00:00 AM
  • A new framework called Chain-of-Visual-Thought (COVT) has been introduced to enhance Vision-Language Models (VLMs) by enabling them to reason with continuous visual tokens that encapsulate rich perceptual cues. The approach targets the weaknesses of current VLMs on dense visual perception tasks, such as spatial reasoning and geometric awareness, by distilling knowledge from lightweight vision experts into a compact budget of approximately 20 tokens (a rough sketch of the idea follows the summary).
  • The development of COVT is significant as it allows VLMs to improve their reasoning capabilities not just through language but also through visual information, potentially leading to better performance in complex multimodal tasks. By capturing properties like 2D appearance and 3D geometry, COVT could enhance applications in various fields, including robotics, autonomous systems, and augmented reality.
  • This advancement reflects a broader trend in AI research toward bridging visual and linguistic understanding. Recent studies continue to highlight open challenges in dense visual perception, underscoring the need for improved methodologies in VLMs. Frameworks such as COVT aim to tackle these issues, signaling a growing recognition that visual reasoning should be integrated into AI systems.
— via World Pulse Now AI Editorial System
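For readers who want a concrete picture, the sketch below shows one plausible way continuous visual tokens could be wired into a VLM: learned queries cross-attend to features from lightweight vision experts and compress them into a small token budget that is interleaved with the text embeddings before language-model reasoning. Module names, shapes, and the exact wiring are illustrative assumptions, not COVT's published implementation.

```python
# Minimal sketch (PyTorch) of a small budget of continuous visual tokens
# distilled from lightweight vision experts and prepended to text embeddings.
# All names and shapes are illustrative assumptions, not the COVT codebase.
import torch
import torch.nn as nn

class ContinuousVisualThought(nn.Module):
    def __init__(self, expert_dim=256, lm_dim=4096, token_budget=20):
        super().__init__()
        # Learned queries that get compressed into ~20 continuous tokens.
        self.queries = nn.Parameter(torch.randn(token_budget, lm_dim))
        self.expert_proj = nn.Linear(expert_dim, lm_dim)
        self.cross_attn = nn.MultiheadAttention(lm_dim, num_heads=8, batch_first=True)

    def forward(self, expert_feats, text_embeds):
        # expert_feats: (B, N, expert_dim) features from lightweight vision experts
        #               (e.g., depth or segmentation backbones).
        # text_embeds:  (B, T, lm_dim) token embeddings of the textual prompt.
        kv = self.expert_proj(expert_feats)
        q = self.queries.unsqueeze(0).expand(expert_feats.size(0), -1, -1)
        visual_thought, _ = self.cross_attn(q, kv, kv)   # (B, token_budget, lm_dim)
        # The continuous "visual thought" tokens sit before the text so the
        # language model can attend to them while reasoning.
        return torch.cat([visual_thought, text_embeds], dim=1)
```

In this reading, the distilled tokens act as a compact visual scratchpad that carries 2D appearance and 3D geometry cues into the language model's context.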


Continue Reading
Multi-speaker Attention Alignment for Multimodal Social Interaction
Positive · Artificial Intelligence
A new method for enhancing social interaction understanding in videos has been proposed, focusing on the alignment of verbal and non-verbal cues in multi-speaker scenarios. This approach addresses the limitations observed in existing Multimodal Large Language Models (MLLMs), which struggle with cross-modal attention consistency in such contexts.
VCU-Bridge: Hierarchical Visual Connotation Understanding via Semantic Bridging
Positive · Artificial Intelligence
VCU-Bridge has been introduced as a framework aimed at enhancing hierarchical visual connotation understanding in multimodal large language models (MLLMs). This framework addresses the limitations of current models that often process visual information in isolation, lacking the ability to integrate low-level perception with high-level reasoning. The accompanying HVCU-Bench benchmark is designed to evaluate this new approach effectively.
IAG: Input-aware Backdoor Attack on VLM-based Visual Grounding
Neutral · Artificial Intelligence
A novel vulnerability in vision-language models (VLMs) has been identified through the introduction of IAG, a method that enables multi-target backdoor attacks on VLM-based visual grounding systems. This technique utilizes dynamically generated, input-aware triggers that are text-guided, allowing for imperceptible manipulation of visual inputs while maintaining normal performance on benign samples.
L2V-CoT: Cross-Modal Transfer of Chain-of-Thought Reasoning via Latent Intervention
Positive · Artificial Intelligence
Researchers have introduced L2V-CoT, a novel training-free approach that facilitates the transfer of Chain-of-Thought (CoT) reasoning from large language models (LLMs) to Vision-Language Models (VLMs) using Linear Artificial Tomography (LAT). This method addresses the challenges VLMs face in multi-step reasoning tasks due to limited multimodal reasoning data.
VisReason: A Large-Scale Dataset for Visual Chain-of-Thought Reasoning
Positive · Artificial Intelligence
A new dataset named VisReason has been introduced to enhance visual Chain-of-Thought (CoT) reasoning in multimodal large language models (MLLMs). Comprising 489,000 annotated examples across four domains, VisReason aims to facilitate complex reasoning by providing multi-round, human-like rationales that guide MLLMs through visual reasoning steps. Additionally, a subset called VisReason-Pro, featuring 165,000 examples, has been curated with expert-level annotations.
Bias Is a Subspace, Not a Coordinate: A Geometric Rethinking of Post-hoc Debiasing in Vision-Language Models
Positive · Artificial Intelligence
A recent study has proposed a new framework called Subspace Projection Debiasing (SPD) to address pervasive demographic biases in Vision-Language Models (VLMs). The framework challenges traditional post-hoc debiasing methods that make coordinate-wise adjustments, showing instead that biases are distributed across linear subspaces rather than isolated coordinates (a generic sketch of the subspace-projection idea follows below).
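To make the geometric intuition concrete, the snippet below estimates a low-rank bias subspace from paired embeddings and removes its projection, instead of editing individual coordinates. This is a generic illustration under assumed inputs, not the paper's SPD algorithm; the function names and SVD-based estimation are placeholders.

```python
# Generic subspace-removal sketch (NumPy); not the SPD method from the paper.
import numpy as np

def estimate_bias_subspace(attr_a_embeds, attr_b_embeds, k=4):
    # Estimate a rank-k bias subspace from differences between embeddings of
    # contrasting attribute groups (hypothetical paired inputs, shape (N, D)).
    diffs = attr_a_embeds - attr_b_embeds
    _, _, vh = np.linalg.svd(diffs, full_matrices=False)
    return vh[:k]                     # (k, D) orthonormal basis of the subspace

def project_out(embeds, basis):
    # Remove the component lying in the bias subspace: x - x B^T B.
    return embeds - embeds @ basis.T @ basis

# Hypothetical usage:
# basis = estimate_bias_subspace(group_a_embeds, group_b_embeds, k=4)
# debiased = project_out(image_text_embeds, basis)
```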
Beyond Multiple Choice: Verifiable OpenQA for Robust Vision-Language RFT
Positive · Artificial Intelligence
A new framework called ReVeL (Rewrite and Verify by LLM) has been proposed to enhance the multiple-choice question answering (MCQA) format used in evaluating multimodal language models. This framework transforms MCQA into open-form questions while ensuring answers remain verifiable, addressing issues of answer guessing and unreliable accuracy metrics during reinforcement fine-tuning (RFT).
Perceptual-Evidence Anchored Reinforced Learning for Multimodal Reasoning
Positive · Artificial Intelligence
The introduction of Perceptual-Evidence Anchored Reinforced Learning (PEARL) marks a significant advancement in multimodal reasoning, addressing the limitations of traditional Reinforcement Learning with Verifiable Rewards (RLVR) in Vision-Language Models (VLMs). PEARL enhances reasoning by anchoring it to verified visual evidence, thus mitigating issues like visual hallucinations and reward hacking.