Soft Adaptive Policy Optimization

arXiv — cs.LG · Tuesday, December 2, 2025 at 5:00:00 AM
  • The introduction of Soft Adaptive Policy Optimization (SAPO) addresses a central challenge in reinforcement learning (RL) for large language models (LLMs): achieving stable and effective policy optimization. SAPO replaces hard clipping with a smooth, temperature-controlled gate that adapts off-policy updates while retaining valuable learning signals, improving both sequence-level coherence and token-level adaptability (a minimal sketch of this gating idea follows the summary).
  • This development is significant as it improves the performance of LLMs, which are increasingly relied upon for complex reasoning tasks. By mitigating the high variance in token-level importance ratios, SAPO aims to provide more stable updates, thereby enhancing the overall learning process in RL applications.
  • The emergence of SAPO reflects a broader trend in AI research focused on refining policy optimization methods, particularly in the context of multimodal LLMs and their applications. Similar frameworks, such as Group Relative Policy Optimization (GRPO) and its variants, highlight ongoing efforts to tackle issues like skewed reward distributions and the need for robust advantage estimation in real-world scenarios.
— via World Pulse Now AI Editorial System
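
To make the clipping-versus-gating distinction concrete, here is a minimal Python sketch. The summary does not give SAPO's actual gate, so the sigmoid-shaped gate, the `temperature` parameter, and the function names below are illustrative assumptions; the only point is that a smooth gate keeps a damped learning signal for tokens whose importance ratio drifts outside the trust region, where hard clipping flattens the gradient entirely.

```python
# Illustrative sketch only: SAPO's exact gate is not given in the summary.
# This assumes a sigmoid-shaped, temperature-controlled gate as a stand-in.
import numpy as np

def hard_clip_weight(ratio, eps=0.2):
    """PPO-style hard clipping: the ratio is clamped to [1 - eps, 1 + eps],
    so the gradient through the ratio vanishes for tokens outside that band."""
    return np.clip(ratio, 1.0 - eps, 1.0 + eps)

def soft_gate_weight(ratio, eps=0.2, temperature=0.05):
    """Hypothetical smooth gate (not the paper's exact form): tokens whose
    importance ratio drifts away from 1 are damped gradually rather than cut
    off, with `temperature` controlling how sharp the transition is."""
    upper = 1.0 / (1.0 + np.exp((ratio - (1.0 + eps)) / temperature))
    lower = 1.0 / (1.0 + np.exp(((1.0 - eps) - ratio) / temperature))
    return ratio * upper * lower

# Token-level importance ratios pi_new / pi_old for one sequence, with advantages.
ratios = np.array([0.7, 0.95, 1.0, 1.3, 2.5])
advantages = np.array([0.5, -0.2, 0.1, 0.8, 0.8])

print("hard-clipped terms:", hard_clip_weight(ratios) * advantages)
print("soft-gated terms:  ", soft_gate_weight(ratios) * advantages)
```

With a small temperature the soft gate behaves almost like hard clipping near the trust-region boundary, while a larger temperature lets far-off-policy tokens contribute a reduced, rather than frozen, update.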


Continue Reading
Surgical Refusal Ablation: Disentangling Safety from Intelligence via Concept-Guided Spectral Cleaning
Neutral · Artificial Intelligence
The introduction of Surgical Refusal Ablation (SRA) aims to enhance the safety of language models by refining their refusal capabilities, minimizing collateral damage and distribution drift caused by traditional methods. SRA achieves this by creating a registry of independent Concept Atoms and utilizing ridge-regularized spectral residualization to produce a clean refusal direction.
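
A back-of-the-envelope sketch of ridge-regularized residualization, assuming the refusal direction and the Concept Atoms live in the same activation space: the component of the refusal vector explained by the atom directions is removed via a ridge least-squares fit. The dimensions, atom construction, and regularization strength below are hypothetical; only the residualization algebra is meant to illustrate the cleaning step.

```python
# Hypothetical sketch of ridge-regularized residualization (not SRA's exact recipe).
import numpy as np

def ridge_residualize(direction, atoms, lam=1e-2):
    """Fit `direction` on the atom columns with ridge regularization and return
    the residual: the part of the refusal direction the atoms do not explain."""
    gram = atoms.T @ atoms + lam * np.eye(atoms.shape[1])
    coef = np.linalg.solve(gram, atoms.T @ direction)
    return direction - atoms @ coef

rng = np.random.default_rng(0)
atoms = rng.standard_normal((768, 16))   # hypothetical concept-atom directions
raw_refusal = rng.standard_normal(768)   # hypothetical raw refusal direction
clean_refusal = ridge_residualize(raw_refusal, atoms)
print(np.linalg.norm(atoms.T @ clean_refusal))  # small: residual nearly orthogonal to the atoms
```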
Your Group-Relative Advantage Is Biased
Neutral · Artificial Intelligence
A recent study has revealed that the group-relative advantage estimator used in Reinforcement Learning from Verifier Rewards (RLVR) is biased, systematically underestimating advantages for difficult prompts while overestimating them for easier ones. This imbalance can lead to ineffective exploration and exploitation strategies in training large language models.
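
For readers unfamiliar with the estimator in question, the sketch below shows the standard group-relative (GRPO-style) advantage computation on binary verifier rewards for a hard and an easy prompt. The cited paper's bias analysis is not reproduced here; the group size and reward values are made up for illustration.

```python
# Minimal sketch of the standard group-relative advantage estimator:
# rewards from G sampled responses to the same prompt are normalized
# by the group mean and standard deviation.
import numpy as np

def group_relative_advantage(rewards, eps=1e-6):
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

hard_prompt_rewards = [1, 0, 0, 0, 0, 0, 0, 0]   # 1 of 8 samples verified correct
easy_prompt_rewards = [1, 1, 1, 1, 1, 1, 1, 0]   # 7 of 8 samples verified correct

print("hard prompt advantages:", group_relative_advantage(hard_prompt_rewards))
print("easy prompt advantages:", group_relative_advantage(easy_prompt_rewards))
```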
PRPO: Aligning Process Reward with Outcome Reward in Policy Optimization
Positive · Artificial Intelligence
The introduction of Process Relative Policy Optimization (PRPO) aims to enhance policy optimization for large language models (LLMs) by aligning process rewards with outcome rewards, addressing the limitations of existing critic-free methods like GRPO. PRPO provides a more nuanced approach by segmenting reasoning sequences and normalizing feedback, which improves the accuracy of models such as Qwen2.5-Math-1.5B on tasks like MATH500.
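
As a rough illustration of the segment-then-normalize idea, the sketch below normalizes per-segment process rewards for one reasoning trace and blends them with the sequence-level outcome reward. PRPO's actual segmentation and weighting scheme is not described in the summary, so `blended_segment_scores`, the mixing weight, and the example values are assumptions.

```python
# Illustrative sketch only: not the PRPO objective, just the segment-level
# normalize-and-blend pattern the summary alludes to.
import numpy as np

def blended_segment_scores(process_rewards, outcome_reward, mix=0.5, eps=1e-6):
    """Normalize raw process rewards across segments, then blend each segment's
    normalized score with the trace-level outcome reward."""
    r = np.asarray(process_rewards, dtype=float)
    normalized = (r - r.mean()) / (r.std() + eps)
    return mix * normalized + (1.0 - mix) * outcome_reward

segment_rewards = [0.2, 0.9, 0.4, 0.7]  # hypothetical process-reward-model scores per segment
outcome = 1.0                            # binary verifier reward for the final answer
print(blended_segment_scores(segment_rewards, outcome))
```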
