On GRPO Collapse in Search-R1: The Lazy Likelihood-Displacement Death Spiral
Positive · Artificial Intelligence
- The recent study of Group Relative Policy Optimization (GRPO) in Search-R1 identifies a failure mode termed Lazy Likelihood Displacement (LLD) that causes training to collapse. LLD sets off a self-reinforcing cycle of declining response quality, marked by low-confidence outputs and inflated gradients (a minimal illustration of the gradient effect follows this list). The study demonstrates the collapse empirically across several models on search-integrated question-answering tasks.
- Understanding the implications of LLD matters for the design of reinforcement learning frameworks, particularly because GRPO is widely adopted for its fast convergence and critic-free (value-function-free) formulation. Identifying LLD as a core failure mechanism underscores the need for training methods that keep large language models (LLMs) reliable and effective on complex reasoning tasks.
- This development reflects ongoing challenges in the reinforcement learning landscape, particularly around the stability and performance of LLMs. Approaches such as Group Turn Policy Optimization and Distributional Value Modeling-based Policy Optimization are being explored to address related problems of training collapse and loss of response diversity. The evolution of these frameworks points to a broader push to make AI systems more robust at multi-step reasoning and tool integration.
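
The gradient dynamics mentioned in the first item can be sketched in a few lines. The snippet below is a minimal, illustrative Python sketch, not the paper's implementation or Search-R1 code; the function name `group_relative_advantages`, the group size, and the reward values are assumptions made here. It shows GRPO-style within-group reward standardization and why a token the model already assigns low probability receives a far larger gradient than a high-confidence token under the same advantage.

```python
# Illustrative sketch only: GRPO-style group-relative advantages and the
# outsized gradient a low-confidence token receives. Names and values are
# assumptions for this example, not Search-R1 or the paper's code.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Standardize scalar rewards within one group of sampled responses."""
    return (rewards - rewards.mean()) / (rewards.std(unbiased=False) + eps)

# One prompt, a group of G = 4 sampled responses with outcome rewards.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
adv = group_relative_advantages(rewards)  # roughly [+1, -1, -1, +1]

# Token-level policy-gradient surrogate: loss = -advantage * log pi(token).
# Its gradient w.r.t. a token probability p is -advantage / p, so a token
# with p = 0.05 gets a gradient about 18x larger than one with p = 0.90
# under the same positive advantage: low-confidence tokens dominate the update.
token_probs = torch.tensor([0.90, 0.05], requires_grad=True)
loss = -(adv[0] * torch.log(token_probs)).sum()
loss.backward()
print(token_probs.grad)  # approximately tensor([ -1.11, -20.00])
```

A feedback loop of this kind, in which lower confidence yields larger corrective gradients, is consistent with the self-reinforcing cycle of declining response quality described above, though the paper's precise mechanism should be read from the source.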
— via World Pulse Now AI Editorial System
