Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens

arXiv — cs.CV · Tuesday, November 25, 2025 at 5:00:00 AM
  • A new framework called Chain-of-Visual-Thought (COVT) has been introduced to enhance Vision-Language Models (VLMs) by enabling them to reason with continuous visual tokens that encapsulate rich perceptual cues. The approach targets the weaknesses of current VLMs on dense visual perception tasks, such as spatial reasoning and geometric awareness, by distilling knowledge from lightweight vision experts into a compact budget of approximately 20 tokens (a rough sketch of the idea follows the summary).
  • The development of COVT is significant as it allows VLMs to improve their reasoning capabilities not just through language but also through visual information, potentially leading to better performance in complex multimodal tasks. By capturing properties like 2D appearance and 3D geometry, COVT could enhance applications in various fields, including robotics, autonomous systems, and augmented reality.
  • This advancement reflects a broader trend in AI research toward bridging visual and linguistic understanding. Recent studies continue to highlight open challenges in dense visual perception, underscoring the need for improved methodologies in VLMs. Frameworks such as COVT aim to tackle these issues, signaling a growing recognition that visual reasoning should be integrated into AI systems.
— via World Pulse Now AI Editorial System
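For readers who want a concrete picture, the sketch below shows one plausible way continuous visual tokens could be wired into a VLM: learned queries cross-attend to features from lightweight vision experts and compress them into a small token budget that is interleaved with the text embeddings before language-model reasoning. Module names, shapes, and the exact wiring are illustrative assumptions, not COVT's published implementation.

```python
# Minimal sketch (PyTorch) of a small budget of continuous visual tokens
# distilled from lightweight vision experts and prepended to text embeddings.
# All names and shapes are illustrative assumptions, not the COVT codebase.
import torch
import torch.nn as nn

class ContinuousVisualThought(nn.Module):
    def __init__(self, expert_dim=256, lm_dim=4096, token_budget=20):
        super().__init__()
        # Learned queries that get compressed into ~20 continuous tokens.
        self.queries = nn.Parameter(torch.randn(token_budget, lm_dim))
        self.expert_proj = nn.Linear(expert_dim, lm_dim)
        self.cross_attn = nn.MultiheadAttention(lm_dim, num_heads=8, batch_first=True)

    def forward(self, expert_feats, text_embeds):
        # expert_feats: (B, N, expert_dim) features from lightweight vision experts
        #               (e.g., depth or segmentation backbones).
        # text_embeds:  (B, T, lm_dim) token embeddings of the textual prompt.
        kv = self.expert_proj(expert_feats)
        q = self.queries.unsqueeze(0).expand(expert_feats.size(0), -1, -1)
        visual_thought, _ = self.cross_attn(q, kv, kv)   # (B, token_budget, lm_dim)
        # The continuous "visual thought" tokens sit before the text so the
        # language model can attend to them while reasoning.
        return torch.cat([visual_thought, text_embeds], dim=1)
```

In this reading, the distilled tokens act as a compact visual scratchpad that carries 2D appearance and 3D geometry cues into the language model's context.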


Continue Reading
Multi-speaker Attention Alignment for Multimodal Social Interaction
Positive · Artificial Intelligence
A new method for enhancing social interaction understanding in videos has been proposed, focusing on the alignment of verbal and non-verbal cues in multi-speaker scenarios. This approach addresses the limitations observed in existing Multimodal Large Language Models (MLLMs), which struggle with cross-modal attention consistency in such contexts.
VCU-Bridge: Hierarchical Visual Connotation Understanding via Semantic Bridging
Positive · Artificial Intelligence
VCU-Bridge has been introduced as a framework aimed at enhancing hierarchical visual connotation understanding in multimodal large language models (MLLMs). This framework addresses the limitations of current models that often process visual information in isolation, lacking the ability to integrate low-level perception with high-level reasoning. The accompanying HVCU-Bench benchmark is designed to evaluate this new approach effectively.
IAG: Input-aware Backdoor Attack on VLM-based Visual Grounding
Neutral · Artificial Intelligence
A novel vulnerability in vision-language models (VLMs) has been identified through the introduction of IAG, a method that enables multi-target backdoor attacks on VLM-based visual grounding systems. This technique utilizes dynamically generated, input-aware triggers that are text-guided, allowing for imperceptible manipulation of visual inputs while maintaining normal performance on benign samples.
L2V-CoT: Cross-Modal Transfer of Chain-of-Thought Reasoning via Latent Intervention
Positive · Artificial Intelligence
Researchers have introduced L2V-CoT, a novel training-free approach that facilitates the transfer of Chain-of-Thought (CoT) reasoning from large language models (LLMs) to Vision-Language Models (VLMs) using Linear Artificial Tomography (LAT). This method addresses the challenges VLMs face in multi-step reasoning tasks due to limited multimodal reasoning data.
VisReason: A Large-Scale Dataset for Visual Chain-of-Thought Reasoning
Positive · Artificial Intelligence
A new dataset named VisReason has been introduced to enhance visual Chain-of-Thought (CoT) reasoning in multimodal large language models (MLLMs). Comprising 489,000 annotated examples across four domains, VisReason aims to facilitate complex reasoning by providing multi-round, human-like rationales that guide MLLMs through visual reasoning steps. Additionally, a subset called VisReason-Pro, featuring 165,000 examples, has been curated with expert-level annotations.
Bias Is a Subspace, Not a Coordinate: A Geometric Rethinking of Post-hoc Debiasing in Vision-Language Models
Positive · Artificial Intelligence
A recent study has proposed a new framework called Subspace Projection Debiasing (SPD) to address pervasive demographic biases in Vision-Language Models (VLMs). The framework challenges traditional post-hoc debiasing methods that make coordinate-wise adjustments, showing instead that biases are distributed across linear subspaces rather than isolated coordinates (a generic sketch of the subspace-projection idea follows below).
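To make the geometric intuition concrete, the snippet below estimates a low-rank bias subspace from paired embeddings and removes its projection, instead of editing individual coordinates. This is a generic illustration under assumed inputs, not the paper's SPD algorithm; the function names and SVD-based estimation are placeholders.

```python
# Generic subspace-removal sketch (NumPy); not the SPD method from the paper.
import numpy as np

def estimate_bias_subspace(attr_a_embeds, attr_b_embeds, k=4):
    # Estimate a rank-k bias subspace from differences between embeddings of
    # contrasting attribute groups (hypothetical paired inputs, shape (N, D)).
    diffs = attr_a_embeds - attr_b_embeds
    _, _, vh = np.linalg.svd(diffs, full_matrices=False)
    return vh[:k]                     # (k, D) orthonormal basis of the subspace

def project_out(embeds, basis):
    # Remove the component lying in the bias subspace: x - x B^T B.
    return embeds - embeds @ basis.T @ basis

# Hypothetical usage:
# basis = estimate_bias_subspace(group_a_embeds, group_b_embeds, k=4)
# debiased = project_out(image_text_embeds, basis)
```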
Beyond Multiple Choice: Verifiable OpenQA for Robust Vision-Language RFT
Positive · Artificial Intelligence
A new framework called ReVeL (Rewrite and Verify by LLM) has been proposed to enhance the multiple-choice question answering (MCQA) format used in evaluating multimodal language models. This framework transforms MCQA into open-form questions while ensuring answers remain verifiable, addressing issues of answer guessing and unreliable accuracy metrics during reinforcement fine-tuning (RFT).
Perceptual-Evidence Anchored Reinforced Learning for Multimodal Reasoning
Positive · Artificial Intelligence
The introduction of Perceptual-Evidence Anchored Reinforced Learning (PEARL) marks a significant advancement in multimodal reasoning, addressing the limitations of traditional Reinforcement Learning with Verifiable Rewards (RLVR) in Vision-Language Models (VLMs). PEARL enhances reasoning by anchoring it to verified visual evidence, thus mitigating issues like visual hallucinations and reward hacking.