CodeV: Code with Images for Faithful Visual Reasoning via Tool-Aware Policy Optimization

arXiv — cs.CV, Wednesday, November 26, 2025 at 5:00:00 AM
  • CodeV has been introduced as a code-based visual agent that uses Tool-Aware Policy Optimization (TAPO) to strengthen visual reasoning in AI models. The work highlights the need for faithful visual reasoning: existing models often reach high accuracy while misusing visual tools or ignoring their outputs. The proposed faithfulness evaluation protocol addresses this by measuring how relevant the intermediate visual tool outputs are to the final answer; a hedged sketch of how such a faithfulness signal could enter a training reward appears after this summary.
  • The introduction of CodeV and TAPO represents a notable step toward more reliable vision-language models. By focusing on faithful tool use, the framework aims to improve the accuracy of visual reasoning, which is crucial for applications in domains such as robotics and automated reasoning systems.
  • This development reflects a broader trend in AI research towards enhancing multimodal reasoning capabilities and addressing the limitations of traditional reinforcement learning methods. The emphasis on verifiable rewards and faithful reasoning aligns with ongoing efforts to improve the robustness and adaptability of AI systems, as seen in related frameworks like PEARL and ReVeL, which also aim to refine the evaluation and training processes for visual and language models.
— via World Pulse Now AI Editorial System
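As an illustration only, the snippet below sketches how a faithfulness term over intermediate visual tool outputs could be blended with answer accuracy in a tool-aware reward. The `ToolStep` structure, the `tool_aware_reward` function, and the weighting scheme are hypothetical; the paper's actual TAPO objective and faithfulness protocol may be defined quite differently.

```python
# Hypothetical sketch: blend final-answer accuracy with a faithfulness term
# scored over intermediate visual tool outputs. Names and weighting are
# illustrative assumptions, not the paper's definitions.
from dataclasses import dataclass


@dataclass
class ToolStep:
    code: str            # code the agent executed, e.g. a crop or zoom call
    output_score: float  # assumed relevance of the tool's visual output, in [0, 1]


def tool_aware_reward(answer_correct: bool,
                      tool_steps: list[ToolStep],
                      alpha: float = 0.5) -> float:
    """Combine answer accuracy with faithfulness of intermediate tool use."""
    accuracy = 1.0 if answer_correct else 0.0
    if not tool_steps:
        # No tool calls: the reward rests entirely on the final answer.
        return accuracy
    faithfulness = sum(s.output_score for s in tool_steps) / len(tool_steps)
    return (1 - alpha) * accuracy + alpha * faithfulness


if __name__ == "__main__":
    steps = [ToolStep("crop(image, box)", 0.9), ToolStep("zoom(image, 2)", 0.4)]
    print(tool_aware_reward(True, steps))  # 0.825
```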


Continue Reading
Soft Adaptive Policy Optimization
Positive · Artificial Intelligence
A new framework called Soft Adaptive Policy Optimization (SAPO) has been proposed to improve policy optimization in reinforcement learning (RL), particularly for large language models (LLMs). SAPO addresses the high variance in token-level importance ratios that can lead to unstable updates, especially in Mixture-of-Experts models, by utilizing a smooth, temperature-controlled gate for off-policy updates.
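A minimal sketch of the contrast between a hard PPO-style clip and a smooth, temperature-controlled gate on token-level importance ratios is shown below. The Gaussian gate form and the temperature value are illustrative assumptions, not SAPO's actual gating function.

```python
# Illustrative contrast between hard clipping and a smooth, temperature-controlled
# gate on token-level importance ratios. The Gaussian gate and its temperature
# are assumptions for illustration, not SAPO's actual gating function.
import torch


def hard_clip_weight(ratio: torch.Tensor, eps: float = 0.2) -> torch.Tensor:
    """PPO-style hard clip: gradients vanish abruptly outside [1 - eps, 1 + eps]."""
    return torch.clamp(ratio, 1.0 - eps, 1.0 + eps)


def soft_gate_weight(ratio: torch.Tensor, temperature: float = 0.3) -> torch.Tensor:
    """Smoothly down-weight tokens whose ratio drifts from 1; the temperature
    sets how quickly the Gaussian gate decays, with no hard edge."""
    gate = torch.exp(-(((ratio - 1.0) / temperature) ** 2))
    return ratio * gate


ratios = torch.tensor([0.8, 1.0, 1.3, 2.5])
print(hard_clip_weight(ratios))  # tensor([0.8000, 1.0000, 1.2000, 1.2000])
print(soft_gate_weight(ratios))  # gated weights decay smoothly toward zero
```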
DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization
Positive · Artificial Intelligence
The introduction of the Discriminative Constrained Optimization (DisCO) framework aims to enhance large reasoning models (LRMs) by addressing limitations found in the Group Relative Policy Optimization (GRPO) method, particularly regarding question-level difficulty bias. DisCO emphasizes a discriminative objective and utilizes non-clipping reinforcement learning surrogate objectives, marking a significant shift in reinforcement learning strategies for LRMs.
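To make the idea of a discriminative, non-clipping surrogate concrete, the sketch below ranks correct completions above incorrect ones for the same question with a pairwise log-sigmoid loss. This is an assumed stand-in objective, not DisCO's exact formulation.

```python
# Assumed stand-in for a discriminative, non-clipping surrogate: push the
# policy's scores for correct completions above its scores for incorrect ones
# via a pairwise log-sigmoid loss. Not DisCO's exact objective.
import torch
import torch.nn.functional as F


def discriminative_loss(pos_scores: torch.Tensor, neg_scores: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking surrogate over all (correct, incorrect) response pairs.
    Scores could be, e.g., length-normalized sequence log-probs under the policy."""
    diff = pos_scores.unsqueeze(1) - neg_scores.unsqueeze(0)  # all pos/neg pairs
    return -F.logsigmoid(diff).mean()


pos = torch.tensor([0.2, 0.5], requires_grad=True)  # scores of correct responses
neg = torch.tensor([-0.1, 0.3, 0.0])                # scores of incorrect responses
loss = discriminative_loss(pos, neg)
loss.backward()
print(float(loss))
```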
Toward Honest Language Models for Deductive Reasoning
Neutral · Artificial Intelligence
Recent research highlights the challenges of ensuring honesty in language models during deductive reasoning tasks, where models must derive conclusions strictly from given premises. The study introduces a framework for honest deductive reasoning, emphasizing the need for models to abstain from answering when conclusions are not logically entailed by the premises.
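A toy scoring rule below illustrates what "honest" behavior could look like under such a framework: credit is given for answering only when the conclusion is entailed, and for abstaining otherwise. The scoring values are assumptions for illustration, not the study's metric.

```python
# Toy scoring of "honest" deductive behavior: answer only when the conclusion is
# entailed by the premises, abstain otherwise. Values are illustrative, not the
# study's metric.
def honesty_score(entailed: bool, model_answered: bool, answer_correct: bool) -> float:
    if entailed:
        # Honest behavior: answer, and answer correctly.
        return 1.0 if (model_answered and answer_correct) else 0.0
    # Conclusion not entailed: honest behavior is to abstain.
    return 1.0 if not model_answered else 0.0


print(honesty_score(entailed=False, model_answered=False, answer_correct=False))  # 1.0
print(honesty_score(entailed=False, model_answered=True, answer_correct=True))    # 0.0
```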
ExPO-HM: Learning to Explain-then-Detect for Hateful Meme Detection
Positive · Artificial Intelligence
ExPO-HM (Explain-then-Detect Policy Optimization for Hateful Memes) has been proposed to enhance the detection of hateful memes, addressing limitations in existing models that primarily provide binary predictions without context. This new approach aims to incorporate reasoning similar to human annotators, improving the understanding of policy-relevant cues such as targets and attack types.
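A hypothetical explain-then-detect pipeline is sketched below: the model first produces a rationale naming the target and attack type, then classifies the meme conditioned on that rationale. The `generate` callable and the prompt wording are placeholders, not ExPO-HM's actual prompts or training procedure.

```python
# Hypothetical explain-then-detect pipeline: rationale first, classification
# conditioned on it. The `generate` callable and prompts are placeholders, not
# ExPO-HM's actual prompts or training procedure.
from typing import Callable, Tuple


def explain_then_detect(meme_text: str, generate: Callable[[str], str]) -> Tuple[str, bool]:
    explanation = generate(
        "Explain whether this meme attacks a protected group, naming the target "
        f"and attack type if any:\n{meme_text}"
    )
    verdict = generate(
        f"Meme: {meme_text}\nAnalysis: {explanation}\nAnswer 'hateful' or 'not hateful':"
    )
    return explanation, verdict.strip().lower().startswith("hateful")


if __name__ == "__main__":
    # Dummy generator standing in for a vision-language model.
    dummy = lambda prompt: "not hateful" if "Answer" in prompt else "No group is targeted."
    print(explain_then_detect("cat wearing sunglasses", dummy))  # ('No group is targeted.', False)
```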
Beyond Multiple Choice: Verifiable OpenQA for Robust Vision-Language RFT
Positive · Artificial Intelligence
A new framework called ReVeL (Rewrite and Verify by LLM) has been proposed to enhance the multiple-choice question answering (MCQA) format used in evaluating multimodal language models. This framework transforms MCQA into open-form questions while ensuring answers remain verifiable, addressing issues of answer guessing and unreliable accuracy metrics during reinforcement fine-tuning (RFT).
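The sketch below illustrates the rewrite-and-verify idea in a simplified form: options are dropped so the model must answer in open form, and the free-form answer is checked against the gold option. ReVeL itself uses an LLM for both rewriting and verification; the string-normalization check here is only a stand-in.

```python
# Simplified rewrite-and-verify loop: drop the options so the model must answer
# in open form, then check the free-form answer against the gold option. ReVeL
# uses an LLM for both steps; the string normalization here is only a stand-in.
def to_open_form(question: str, options: list[str]) -> str:
    # Options are intentionally discarded so the model cannot guess among A/B/C/D.
    return question.strip()


def verify(free_form_answer: str, gold_option: str) -> bool:
    norm = lambda s: "".join(s.lower().split())
    return norm(gold_option) in norm(free_form_answer)


question = "What color is the traffic light that means stop?"
print(to_open_form(question, ["red", "green", "blue", "yellow"]))
print(verify("The light is red.", "red"))  # True
```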
Periodic Asynchrony: An Effective Method for Accelerating On-Policy Reinforcement Learning
Positive · Artificial Intelligence
A new study introduces Periodic Asynchrony as a method to enhance on-policy reinforcement learning, addressing the inefficiencies of synchronous execution in mainstream frameworks. By separating inference and training, this approach allows for independent scaling of components while maintaining accuracy equivalent to traditional methods.
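A toy loop below illustrates the periodic-asynchrony idea: generation and training proceed as separate stages, and the generator's weights are refreshed only at fixed period boundaries, so each sampled batch stays on-policy with respect to the last synchronized weights. The structure and variable names are illustrative, not the paper's system design.

```python
# Toy sketch of periodic asynchrony: inference and training are separate stages,
# and generator weights refresh only at period boundaries, so each batch stays
# on-policy with respect to the last synchronized weights. Illustrative only.
def run_periodic_async(num_steps: int, sync_period: int):
    trainer_version = 0    # latest weights produced by the trainer
    generator_version = 0  # weights the generator currently samples with
    for step in range(1, num_steps + 1):
        _batch = f"rollouts from policy v{generator_version}"  # inference stage
        trainer_version += 1                                    # training stage
        if step % sync_period == 0:
            generator_version = trainer_version  # periodic weight synchronization
    return generator_version, trainer_version


print(run_periodic_async(num_steps=10, sync_period=4))  # (8, 10)
```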
VADE: Variance-Aware Dynamic Sampling via Online Sample-Level Difficulty Estimation for Multimodal RL
Positive · Artificial Intelligence
The introduction of VADE, a Variance-Aware Dynamic Sampling framework, aims to enhance group-based policy optimization methods in multimodal reinforcement learning (RL) by addressing the gradient vanishing problem. This issue arises when identical rewards are assigned to all responses within a group, leading to diminished training signals. VADE proposes an online sample-level difficulty estimation to improve the selection of effective samples during training.
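The sketch below illustrates variance-aware sample selection in a simplified form: prompts whose sampled responses all receive the same reward yield zero group-relative advantage, so prompts with higher reward variance are preferred. The variance proxy and top-k selection rule are assumptions; VADE's online difficulty estimator is more involved.

```python
# Simplified variance-aware selection: groups with identical rewards carry no
# group-relative learning signal, so prompts with higher reward variance are
# preferred. The proxy and top-k rule are assumptions, not VADE's estimator.
def reward_variance(rewards: list[float]) -> float:
    mean = sum(rewards) / len(rewards)
    return sum((r - mean) ** 2 for r in rewards) / len(rewards)


def select_informative_prompts(group_rewards: dict[str, list[float]], k: int) -> list[str]:
    """Pick the k prompts whose sampled response groups have the highest reward variance."""
    ranked = sorted(group_rewards, key=lambda p: reward_variance(group_rewards[p]), reverse=True)
    return ranked[:k]


groups = {
    "too_easy": [1.0, 1.0, 1.0, 1.0],  # zero variance -> vanishing gradient
    "too_hard": [0.0, 0.0, 0.0, 0.0],  # zero variance as well
    "useful":   [1.0, 0.0, 1.0, 0.0],  # informative mix of successes and failures
}
print(select_informative_prompts(groups, k=1))  # ['useful']
```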
SPINE: Token-Selective Test-Time Reinforcement Learning with Entropy-Band Regularization
Positive · Artificial Intelligence
The SPINE framework introduces a token-selective approach to test-time reinforcement learning, addressing the challenges faced by large language models (LLMs) and multimodal LLMs (MLLMs) during distribution shifts at test-time. By focusing on high-entropy tokens and applying an entropy-band regularizer, SPINE aims to enhance model performance and maintain exploration during reinforcement learning processes.
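As a rough illustration, the loss sketch below applies the policy-gradient term only at high-entropy token positions and adds a penalty when mean entropy drifts outside a target band. The threshold, band, and weighting are assumed values, not SPINE's actual configuration.

```python
# Rough illustration of token-selective updating with an entropy-band penalty:
# the policy-gradient term applies only at high-entropy positions, and mean
# entropy is pushed back toward a target band. Threshold, band, and weight are
# assumed values, not SPINE's actual configuration.
import torch


def spine_style_loss(logits: torch.Tensor,          # [T, V] per-token logits
                     log_probs_taken: torch.Tensor, # [T] log-probs of chosen tokens
                     advantages: torch.Tensor,      # [T] per-token advantages
                     entropy_threshold: float = 1.0,
                     band: tuple = (0.5, 2.0),
                     beta: float = 0.1) -> torch.Tensor:
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-9)).sum(dim=-1)  # [T] token entropies
    mask = (entropy > entropy_threshold).float()              # update high-entropy tokens only
    pg_loss = -(mask * advantages * log_probs_taken).sum() / mask.sum().clamp(min=1.0)
    low, high = band
    mean_entropy = entropy.mean()
    band_penalty = torch.relu(low - mean_entropy) + torch.relu(mean_entropy - high)
    return pg_loss + beta * band_penalty


logits = torch.randn(8, 32, requires_grad=True)
log_p = torch.log_softmax(logits, dim=-1).max(dim=-1).values  # greedy-token log-probs as stand-ins
adv = torch.randn(8)
print(spine_style_loss(logits, log_p, adv))
```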