VADE: Variance-Aware Dynamic Sampling via Online Sample-Level Difficulty Estimation for Multimodal RL

arXiv — cs.LG · Tuesday, November 25, 2025 at 5:00:00 AM
  • VADE is a Variance-Aware Dynamic Sampling framework intended to strengthen group-based policy optimization methods in multimodal reinforcement learning (RL) by addressing the gradient-vanishing problem: when every response in a group receives an identical reward, the group-normalized advantages collapse to zero and the training signal vanishes. VADE introduces online sample-level difficulty estimation to select more informative samples during training (a minimal sketch of the issue follows the editorial summary below).
  • This development is significant because it targets the efficiency and effectiveness of training multimodal models, which are increasingly central to AI applications. By mitigating gradient vanishing, VADE could yield more robust and adaptable RL systems that perform better on complex tasks.
  • The advancement of VADE reflects a broader trend in AI research focusing on improving reinforcement learning methodologies. Similar approaches, such as Group Adaptive Policy Optimization (GAPO) and Bayesian Prior-Guided Optimization (BPGO), also aim to refine advantage estimation and reward modeling, indicating a growing recognition of the need for dynamic and adaptable frameworks in the evolving landscape of multimodal AI.
— via World Pulse Now AI Editorial System
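To make the gradient-vanishing point concrete: in GRPO-style group-based optimization, each response's advantage is its reward normalized by the group mean and standard deviation, so a group with identical rewards contributes zero advantage and therefore no gradient. The sketch below illustrates this, with `select_batch` as a hypothetical stand-in for variance-aware selection; the actual scoring rule VADE uses for online difficulty estimation is not described in this summary.

```python
import numpy as np

def group_advantages(rewards, eps=1e-6):
    """GRPO-style normalized advantages for one prompt's group of rollouts."""
    rewards = np.asarray(rewards, dtype=float)
    std = rewards.std()
    if std < eps:                      # identical rewards -> zero advantage,
        return np.zeros_like(rewards)  # i.e. no training signal for this sample
    return (rewards - rewards.mean()) / (std + eps)

def reward_variance(rewards):
    """Within-group reward variance: a proxy for how informative a sample is."""
    return float(np.var(rewards))

def select_batch(samples, k):
    """Hypothetical variance-aware selection: prefer prompts whose recent
    rollouts show high reward variance (neither always solved nor never solved)."""
    scored = sorted(samples, key=lambda s: reward_variance(s["rewards"]), reverse=True)
    return scored[:k]

# An all-correct group yields zero advantages, a mixed group yields a usable signal.
print(group_advantages([1.0, 1.0, 1.0, 1.0]))   # -> [0. 0. 0. 0.]
print(group_advantages([1.0, 0.0, 1.0, 0.0]))   # -> non-zero advantages
```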

Continue Reading
Toward Honest Language Models for Deductive Reasoning
Neutral · Artificial Intelligence
Recent research has focused on improving the honesty of language models in deductive reasoning, emphasizing their ability to provide answers only when logically entailed by the premises. The study introduces multi-step tasks and datasets to evaluate this capability, revealing that existing training methods struggle with these challenges.
ExPO-HM: Learning to Explain-then-Detect for Hateful Meme Detection
Positive · Artificial Intelligence
ExPO-HM (Explain-then-Detect Policy Optimization for Hateful Memes) has been proposed to enhance the detection of hateful memes, addressing limitations in existing models that primarily provide binary predictions without context. This new approach aims to incorporate reasoning similar to human annotators, improving the understanding of policy-relevant cues such as targets and attack types.
Beyond Multiple Choice: Verifiable OpenQA for Robust Vision-Language RFT
Positive · Artificial Intelligence
A new framework called ReVeL (Rewrite and Verify by LLM) has been proposed to enhance the multiple-choice question answering (MCQA) format used in evaluating multimodal language models. This framework transforms MCQA into open-form questions while ensuring answers remain verifiable, addressing issues of answer guessing and unreliable accuracy metrics during reinforcement fine-tuning (RFT).
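As a rough illustration of the "open-form but still verifiable" idea described above, the sketch below checks a free-form answer against a reference after simple normalization. This is an illustrative rule-based verifier under assumed string matching, not ReVeL's actual pipeline, which relies on LLM-based rewriting and verification.

```python
import re

def normalize(ans: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    ans = ans.lower().strip()
    ans = re.sub(r"[^\w\s.%-]", "", ans)
    ans = re.sub(r"\b(a|an|the)\b", " ", ans)
    return re.sub(r"\s+", " ", ans).strip()

def verify(prediction: str, reference: str) -> float:
    """Binary reward: 1.0 if the open-form answer matches the reference."""
    return 1.0 if normalize(prediction) == normalize(reference) else 0.0

# The model can no longer earn reward by guessing "B"; it must produce the answer itself.
print(verify("The Eiffel Tower", "eiffel tower"))  # -> 1.0
```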
Periodic Asynchrony: An Effective Method for Accelerating On-Policy Reinforcement Learning
Positive · Artificial Intelligence
A new study introduces Periodic Asynchrony as a method to enhance on-policy reinforcement learning, addressing the inefficiencies of synchronous execution in mainstream frameworks. By separating inference and training, this approach allows for independent scaling of components while maintaining accuracy equivalent to traditional methods.
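The separation of inference and training described above can be pictured as two loops that exchange data through queues and only synchronize weights every few training steps. The sketch below is an illustrative toy under that assumption; the helper functions and the sync period are invented for the example and do not come from the paper.

```python
import queue
import threading

SYNC_EVERY = 8                        # training steps between weight pushes (illustrative)
rollout_q = queue.Queue(maxsize=64)   # rollouts produced by the inference worker
weight_q = queue.Queue()              # periodic weight snapshots for inference

def generate_rollout(weights):
    """Stand-in for sampling a trajectory with the current policy."""
    return {"policy_version": weights["version"]}

def update_policy(weights, batch):
    """Stand-in for one on-policy gradient step."""
    return {"version": weights["version"] + 1}

def inference_loop(stop, weights):
    while not stop.is_set():
        while not weight_q.empty():          # adopt the newest snapshot, if any
            weights = weight_q.get()
        try:
            rollout_q.put(generate_rollout(weights), timeout=0.1)
        except queue.Full:
            pass

def training_loop(stop, weights, steps=100):
    for step in range(steps):
        batch = [rollout_q.get() for _ in range(4)]
        weights = update_policy(weights, batch)
        if step % SYNC_EVERY == 0:           # periodic, not per-step, weight sync
            weight_q.put(weights)
    stop.set()

if __name__ == "__main__":
    stop, w0 = threading.Event(), {"version": 0}
    threading.Thread(target=inference_loop, args=(stop, w0), daemon=True).start()
    training_loop(stop, w0)
```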
SPINE: Token-Selective Test-Time Reinforcement Learning with Entropy-Band Regularization
Positive · Artificial Intelligence
The SPINE framework introduces a token-selective approach to test-time reinforcement learning, addressing the challenges faced by large language models (LLMs) and multimodal LLMs (MLLMs) during distribution shifts at test-time. By focusing on high-entropy tokens and applying an entropy-band regularizer, SPINE aims to enhance model performance and maintain exploration during reinforcement learning processes.
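A minimal sketch of the token-selective, entropy-band idea described above: compute per-token entropy of the policy distribution, restrict the update to higher-entropy tokens, and penalize entropies that drift outside a target band. The thresholds and penalty weight below are illustrative assumptions, not SPINE's published settings.

```python
import torch
import torch.nn.functional as F

def token_entropy(logits):
    """Per-token entropy of the next-token distribution, shape [batch, seq]."""
    logp = F.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1)

def entropy_band(logits, low=0.5, high=2.0, weight=0.01):
    """Select high-entropy tokens for the update and penalize entropies
    outside the [low, high] band (illustrative thresholds)."""
    ent = token_entropy(logits)
    selected = ent > low                               # token-selective mask
    penalty = (F.relu(low - ent) + F.relu(ent - high)).mean()
    return selected, weight * penalty

logits = torch.randn(2, 16, 4096)                      # toy [batch, seq, vocab] logits
mask, reg = entropy_band(logits)
print(mask.float().mean().item(), reg.item())
```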
Perceptual-Evidence Anchored Reinforced Learning for Multimodal Reasoning
Positive · Artificial Intelligence
The introduction of Perceptual-Evidence Anchored Reinforced Learning (PEARL) marks a significant advancement in multimodal reasoning, addressing the limitations of traditional Reinforcement Learning with Verifiable Rewards (RLVR) in Vision-Language Models (VLMs). PEARL enhances reasoning by anchoring it to verified visual evidence, thus mitigating issues like visual hallucinations and reward hacking.
Learning What to Trust: Bayesian Prior-Guided Optimization for Visual Generation
Positive · Artificial Intelligence
The introduction of Bayesian Prior-Guided Optimization (BPGO) enhances Group Relative Policy Optimization (GRPO) by addressing the inherent ambiguity in visual generation tasks. BPGO incorporates a semantic prior anchor to model reward uncertainty, allowing for more effective optimization by emphasizing reliable feedback while down-weighting ambiguous signals.
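One way to picture the down-weighting described above is to scale each group-relative advantage by a confidence term derived from a prior over rewards, so that rewards far from the prior expectation contribute less to the update. The Gaussian-prior weighting below is an illustrative assumption, not BPGO's published formulation.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """Standard group-relative normalization of rewards."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def prior_confidence(rewards, prior_mean, prior_std):
    """Confidence under a Gaussian prior: rewards far from the prior
    expectation are treated as ambiguous and get weights near zero."""
    r = np.asarray(rewards, dtype=float)
    return np.exp(-0.5 * ((r - prior_mean) / prior_std) ** 2)

def weighted_advantages(rewards, prior_mean=0.6, prior_std=0.2):
    return prior_confidence(rewards, prior_mean, prior_std) * grpo_advantages(rewards)

# An outlier reward (e.g. a noisy judge score) is down-weighted instead of dominating the update.
print(weighted_advantages([0.7, 0.6, 0.1, 0.65]))
```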
Syn-GRPO: Self-Evolving Data Synthesis for MLLM Perception Reasoning
Positive · Artificial Intelligence
A new method called Syn-GRPO (Synthesis-GRPO) has been proposed to enhance the reinforcement learning capabilities of Multimodal Large Language Models (MLLMs) by synthesizing high-quality training data through an online data generator. This approach aims to address the existing challenges of low data quality that limit the exploration scope in MLLM training.