DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization

arXiv — cs.LG · Wednesday, November 26, 2025 at 5:00:00 AM
  • The Discriminative Constrained Optimization (DisCO) framework aims to improve large reasoning models (LRMs) by addressing limitations of the Group Relative Policy Optimization (GRPO) method, particularly its question-level difficulty bias. DisCO centers on a discriminative objective and uses non-clipping reinforcement learning surrogate objectives, marking a notable shift in RL strategies for LRMs (a minimal sketch of such an objective follows the summary below).
  • This development is crucial as it seeks to improve the performance and adaptability of LRMs, which are increasingly utilized in various AI applications. By refining the optimization process, DisCO could lead to more accurate and efficient models, ultimately benefiting industries reliant on advanced reasoning capabilities.
  • The evolution of reinforcement learning techniques, such as DisCO, reflects a broader trend towards enhancing model training through innovative approaches. This includes the emergence of methods like Self-Paced GRPO and Group Turn Policy Optimization, which aim to tackle specific challenges in video generation and multi-turn reasoning, respectively. Such advancements indicate a growing recognition of the need for more sophisticated and context-aware AI systems.
— via World Pulse Now AI Editorial System
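
For illustration, here is a minimal sketch of what a discriminative, non-clipping surrogate objective of this kind could look like. The likelihood-ratio scoring function, the pairing of correct and incorrect answers, and all names below are assumptions made for this sketch; the constrained (e.g. trust-region) part of DisCO is omitted, and this is not the paper's exact formulation.

```python
import torch

def disco_style_loss(logp_new, logp_old, is_correct, tau=1.0):
    """Illustrative discriminative surrogate (not the paper's exact objective).

    Instead of a clipped, group-normalized advantage as in GRPO, score each
    sampled answer with a non-clipping log-likelihood-ratio score and push
    scores of correct answers above those of incorrect ones for the same question.

    logp_new:   (G,) summed log-probs of G sampled answers under the current policy
    logp_old:   (G,) summed log-probs under the sampling (old) policy
    is_correct: (G,) boolean mask of answers judged correct by the verifier
    """
    score = (logp_new - logp_old) / tau          # non-clipping score
    pos, neg = score[is_correct], score[~is_correct]
    if pos.numel() == 0 or neg.numel() == 0:     # skip all-correct / all-wrong groups
        return score.new_zeros(())
    # Discriminative objective: mean score of correct answers minus mean score
    # of incorrect ones, independent of how many of each the question produced,
    # which is one way a method could avoid question-level difficulty bias.
    return -(pos.mean() - neg.mean())
```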

Continue Reading
The Evolution of Thought: Tracking LLM Overthinking via Reasoning Dynamics Analysis
Neutral · Artificial Intelligence
A recent study titled 'The Evolution of Thought: Tracking LLM Overthinking via Reasoning Dynamics Analysis' examines how large language models (LLMs) behave under test-time scaling, revealing that explicit reasoning trajectories can improve results but may also lead to overthinking. The research introduces two analytical lenses, Reasoning Length Dynamics and Reasoning Semantic Dynamics, which help identify a Reasoning Completion Point (RCP) for optimizing computational efficiency.
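
As a rough illustration of the semantic-dynamics idea, the sketch below flags a hypothetical Reasoning Completion Point when consecutive reasoning chunks stop changing semantically. The chunking granularity, the embedding model, and the similarity threshold are all assumptions; the paper's actual criteria may differ.

```python
import numpy as np

def detect_rcp(chunk_embeddings, threshold=0.98, patience=3):
    """Flag a hypothetical Reasoning Completion Point (RCP).

    chunk_embeddings: list of unit-normalized vectors, one per reasoning chunk
    (e.g., per paragraph or step). If `patience` consecutive chunks are nearly
    identical to their predecessor (cosine similarity >= threshold), we assume
    the semantics have converged and later tokens are "overthinking".
    Returns the index of the chunk after which the next `patience` chunks
    added nothing semantically new, or None if no such point is found.
    """
    run = 0
    for i in range(1, len(chunk_embeddings)):
        sim = float(np.dot(chunk_embeddings[i - 1], chunk_embeddings[i]))
        run = run + 1 if sim >= threshold else 0
        if run >= patience:
            return i - patience
    return None
```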
Your Group-Relative Advantage Is Biased
Neutral · Artificial Intelligence
A recent study has revealed that the group-relative advantage estimator used in Reinforcement Learning from Verifier Rewards (RLVR) is biased, systematically underestimating advantages for difficult prompts while overestimating them for easier ones. This imbalance can lead to ineffective exploration and exploitation strategies in training large language models.
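
For context, GRPO-style RLVR methods typically estimate a group-relative advantage by standardizing each sampled answer's reward within its group. The minimal sketch below (assuming 0/1 verifier rewards and the usual mean/std normalization; the exact estimator analyzed in the paper may differ) shows that the advantage assigned to an answer depends only on the prompt's empirical pass rate within the group, which is the coupling the paper identifies as a source of bias.

```python
import numpy as np

def group_relative_advantage(rewards, eps=1e-6):
    """Group-relative advantage as commonly used in GRPO-style RLVR:
    standardize each sampled answer's reward within its own group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# With 0/1 verifier rewards, the value assigned to each answer is a function
# of the prompt's pass rate in the group, not of any absolute difficulty:
easy_prompt = [1, 1, 1, 1, 1, 1, 1, 0]   # 7/8 correct
hard_prompt = [1, 0, 0, 0, 0, 0, 0, 0]   # 1/8 correct
print(group_relative_advantage(easy_prompt))
print(group_relative_advantage(hard_prompt))
```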
Silence the Judge: Reinforcement Learning with Self-Verifier via Latent Geometric Clustering
Positive · Artificial Intelligence
A new framework called Latent-GRPO has been introduced to enhance the reasoning performance of Large Language Models (LLMs) by deriving intrinsic rewards from latent space geometry, addressing the limitations of traditional Group Relative Policy Optimization (GRPO) that relies on external verifiers.
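
As a loose illustration of deriving rewards from latent-space geometry, the sketch below clusters the hidden-state representations of a group of sampled answers and rewards membership in the dominant cluster, standing in for an external verifier. The clustering method, the reward rule, and all names are assumptions for this sketch, not the framework's actual design.

```python
import numpy as np
from sklearn.cluster import KMeans

def latent_cluster_rewards(hidden_states, n_clusters=2, seed=0):
    """Assign intrinsic rewards from the geometry of latent representations
    (illustrative only).

    hidden_states: (G, d) array, e.g. a final-layer hidden state for each of
    G sampled answers. Answers falling in the largest cluster are treated as
    mutually consistent and rewarded; outliers are not.
    """
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(hidden_states)
    majority = np.bincount(labels).argmax()
    return (labels == majority).astype(float)   # 1.0 for the dominant cluster, else 0.0
```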
PRPO: Aligning Process Reward with Outcome Reward in Policy Optimization
Positive · Artificial Intelligence
The introduction of Process Relative Policy Optimization (PRPO) aims to enhance policy optimization for large language models (LLMs) by aligning process rewards with outcome rewards, addressing the limitations of existing critic-free methods like GRPO. PRPO provides a more nuanced approach by segmenting reasoning sequences and normalizing feedback, which improves the accuracy of models such as Qwen2.5-Math-1.5B on tasks like MATH500.
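
The sketch below gives one hedged reading of "segmenting reasoning sequences and normalizing feedback": per-segment process scores are normalized across the group and then blended with the group-normalized outcome reward. The segmentation, weighting, and normalization scheme are assumptions made for illustration, not PRPO's actual formulation.

```python
import numpy as np

def segment_aligned_advantages(process_rewards, outcome_rewards, alpha=0.5, eps=1e-6):
    """Blend segment-level process feedback with the outcome reward (illustrative).

    process_rewards: (G, S) array of per-segment scores for G sampled solutions,
                     each split into S reasoning segments.
    outcome_rewards: (G,) final 0/1 correctness from the verifier.
    Returns a (G, S) array of advantages: segment scores normalized across the
    group at each segment position, mixed with the group-normalized outcome
    reward so process feedback stays aligned with final-answer correctness.
    """
    p = np.asarray(process_rewards, dtype=float)
    o = np.asarray(outcome_rewards, dtype=float)
    p_norm = (p - p.mean(axis=0)) / (p.std(axis=0) + eps)    # normalize per segment position
    o_norm = (o - o.mean()) / (o.std() + eps)                 # group-relative outcome advantage
    return alpha * p_norm + (1 - alpha) * o_norm[:, None]     # broadcast outcome over segments
```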
