GTPO: Stabilizing Group Relative Policy Optimization via Gradient and Entropy Control

arXiv — cs.LG · Thursday, December 4, 2025 at 5:00:00 AM
  • The introduction of Group-relative Trajectory-based Policy Optimization (GTPO) aims to improve the stability and performance of Group Relative Policy Optimization (GRPO) for training Large Language Models (LLMs). GTPO targets two issues that have hindered effective alignment and training: conflicting gradient updates on valuable tokens and policy collapse. By amplifying positive feedback and filtering out high-entropy completions, GTPO seeks to improve convergence and reliability; an illustrative sketch of this filtering-and-amplification idea follows the summary below.
  • This development is significant as it represents a step forward in refining reinforcement learning techniques for LLMs, which are increasingly pivotal in various applications, including natural language processing and AI-driven tools. The enhancements proposed by GTPO could lead to more robust and reliable models, ultimately benefiting developers and users who rely on LLMs for complex tasks.
  • The challenges of training stability and effective policy optimization are common themes in the field of AI, particularly concerning LLMs. Various approaches, such as Distributional Value Modeling-based Policy Optimization (DVPO) and Group-Aware Policy Optimization (GAPO), have emerged to tackle similar issues. The ongoing exploration of methods like GTPO reflects a broader trend towards improving model performance and adaptability in dynamic environments, highlighting the importance of addressing training inefficiencies in AI development.
— via World Pulse Now AI Editorial System
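The following Python sketch illustrates the mechanism the summary describes: a GRPO-style group update in which completions with high mean token entropy are filtered out before group-relative advantages are computed, and positive advantages are amplified. This is a hedged illustration, not the paper's implementation; the function name group_advantages, the entropy threshold, and the pos_boost factor are all assumptions made for the example.

import numpy as np

def group_advantages(rewards, token_entropies, entropy_threshold=2.0, pos_boost=1.5):
    """Group-relative advantages with entropy filtering and positive amplification.

    rewards:         (G,) scalar reward for each completion in the group
    token_entropies: list of per-token policy-entropy arrays, one per completion
    """
    rewards = np.asarray(rewards, dtype=np.float64)

    # 1) Filter: drop completions whose mean token entropy is too high,
    #    treating them as unreliable, near-collapse samples.
    mean_ent = np.array([e.mean() for e in token_entropies])
    keep = mean_ent < entropy_threshold
    if keep.sum() < 2:  # need at least two survivors for a group baseline
        return np.zeros_like(rewards), keep

    # 2) Group-relative advantage: standardize rewards within the kept group.
    kept = rewards[keep]
    adv = np.zeros_like(rewards)
    adv[keep] = (kept - kept.mean()) / (kept.std() + 1e-8)

    # 3) Amplify positive feedback so valuable tokens are not cancelled out
    #    by conflicting negative updates from other completions.
    adv = np.where(adv > 0, pos_boost * adv, adv)
    return adv, keep

# Example: four completions; the last has high entropy and is filtered out.
rewards = [1.0, 0.2, 0.5, 0.9]
entropies = [np.full(8, 0.7), np.full(6, 0.9), np.full(7, 1.1), np.full(5, 3.2)]
adv, kept = group_advantages(rewards, entropies)
print(kept, adv.round(3))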


Continue Reading
Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs
Positive · Artificial Intelligence
A novel approach called TRIM-KV has been introduced to enhance token retention in memory-bounded key-value caches for large language models (LLMs). This method utilizes a lightweight retention gate to predict the importance of tokens at creation time, allowing for the eviction of less critical tokens when memory limits are reached. The approach aims to address the challenges posed by the quadratic cost of self-attention and the growing size of KV caches during long-horizon inference.
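As an illustration of the retention-gate idea described above, the sketch below keeps a KV cache within a fixed budget by scoring each token once at creation time and evicting the lowest-scoring entry when the budget is exceeded. It is not TRIM-KV itself: the class BoundedKVCache is hypothetical, and the "gate" here is a random linear scorer standing in for a learned one.

import numpy as np

class BoundedKVCache:
    def __init__(self, budget, d_model, seed=0):
        rng = np.random.default_rng(seed)
        self.budget = budget
        self.gate_w = rng.normal(size=d_model)   # stand-in for a learned retention gate
        self.keys, self.values, self.scores = [], [], []

    def append(self, k, v):
        # Score importance once, at creation time; no later re-scoring is needed.
        score = float(self.gate_w @ k)
        self.keys.append(k)
        self.values.append(v)
        self.scores.append(score)
        if len(self.keys) > self.budget:
            self._evict()

    def _evict(self):
        # Evict the token the gate considers least worth retaining.
        drop = int(np.argmin(self.scores))
        for buf in (self.keys, self.values, self.scores):
            buf.pop(drop)

# Usage: stream 100 tokens through a cache capped at 16 entries.
cache = BoundedKVCache(budget=16, d_model=8)
rng = np.random.default_rng(1)
for _ in range(100):
    cache.append(rng.normal(size=8), rng.normal(size=8))
print(len(cache.keys))   # 16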
Better World Models Can Lead to Better Post-Training Performance
Positive · Artificial Intelligence
A recent study investigates the impact of explicit world-modeling objectives on the internal representations and performance of Transformers, particularly in the context of a controlled Rubik's Cube task. The research compares standard next-token prediction with two world-modeling strategies, revealing that explicit modeling enhances representation quality and downstream performance after reinforcement learning post-training.
DVPO: Distributional Value Modeling-based Policy Optimization for LLM Post-Training
Positive · Artificial Intelligence
DVPO, or Distributional Value Modeling-based Policy Optimization, has been introduced as a new reinforcement learning framework aimed at enhancing the post-training phase of large language models (LLMs). This framework addresses the challenges posed by noisy supervision and aims to improve both robustness and generalization by utilizing conditional risk theory and token-level value distributions.
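The blurb above mentions token-level value distributions scored with conditional risk; the sketch below shows one common way such a risk measure can be computed, namely the conditional value-at-risk (CVaR) of a categorical return distribution. This is a generic illustration under stated assumptions, not DVPO's actual objective; the support, probabilities, and alpha are invented for the example.

import numpy as np

def cvar(support, probs, alpha=0.25):
    """Expected return over the worst alpha-fraction of probability mass."""
    order = np.argsort(support)                      # worst outcomes first
    s = np.asarray(support, dtype=float)[order]
    p = np.asarray(probs, dtype=float)[order]
    cum = np.cumsum(p)
    # Mass of each atom that falls inside the worst alpha tail.
    mass = np.minimum(p, np.maximum(0.0, alpha - (cum - p)))
    return float((mass * s).sum() / alpha)

support = np.linspace(-1.0, 1.0, 5)                  # possible token-level returns
probs = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
print(cvar(support, probs, alpha=0.25))              # risk-sensitive value estimate (-0.7)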
AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition
Positive · Artificial Intelligence
AdaptVision has been introduced as a new paradigm in Vision-Language Models (VLMs), focusing on adaptive visual token acquisition to enhance efficiency in visual question answering tasks. By employing a coarse-to-fine approach, the model selectively acquires visual information as needed, addressing the computational overhead associated with traditional methods that rely on fixed-ratio compression.
Kardia-R1: Unleashing LLMs to Reason toward Understanding and Empathy for Emotional Support via Rubric-as-Judge Reinforcement Learning
Positive · Artificial Intelligence
Kardia-R1 has introduced KardiaBench, a benchmark designed to enhance emotional reasoning in conversational agents by utilizing a dataset of 178,080 QA pairs from 671 real-world profiles, addressing the limitations of existing systems that lack personalized emotional understanding.
Think Before You Prune: Self-Reflective Structured Pruning for Reasoning Language Models
Positive · Artificial Intelligence
Recent research highlights the challenges of pruning reasoning language models (RLMs) like OpenAI's o1 and DeepSeek-R1, which are crucial for multi-step reasoning tasks. The study reveals that traditional pruning methods can severely impair the accuracy and coherence of these models, even at moderate levels of sparsity.
From Imitation to Discrimination: Toward A Generalized Curriculum Advantage Mechanism Enhancing Cross-Domain Reasoning Tasks
Positive · Artificial Intelligence
A new adaptive curriculum mechanism called CAPO (Curriculum Advantage Policy Optimization) has been proposed to enhance cross-domain reasoning tasks in reinforcement learning. This mechanism aims to improve reasoning capabilities by utilizing advantage signals, initially focusing on positive samples to establish a solid foundation before incorporating negative signals for better discrimination.
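To make the curriculum-advantage idea concrete, the sketch below applies a schedule that initially zeroes out negative advantages (so the policy first imitates successful samples) and then ramps their weight up to full strength for discrimination. The schedule shape, warmup length, and function name are assumptions for illustration, not CAPO's published formulation.

import numpy as np

def curriculum_advantage(adv, step, warmup_steps=1000, ramp_steps=4000):
    """Scale negative advantages by a weight that ramps from 0 to 1 over training."""
    if step < warmup_steps:
        neg_weight = 0.0                         # imitation phase: positives only
    else:
        neg_weight = min(1.0, (step - warmup_steps) / ramp_steps)
    adv = np.asarray(adv, dtype=np.float64)
    return np.where(adv >= 0, adv, neg_weight * adv)

# Early training masks out negatives; later they contribute at full strength.
adv = np.array([0.8, -0.5, 0.1, -1.2])
print(curriculum_advantage(adv, step=500))       # [ 0.8  0.   0.1  0. ]
print(curriculum_advantage(adv, step=6000))      # [ 0.8 -0.5  0.1 -1.2]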
SeeNav-Agent: Enhancing Vision-Language Navigation with Visual Prompt and Step-Level Policy Optimization
Positive · Artificial Intelligence
The introduction of SeeNav-Agent marks a significant advancement in Vision-Language Navigation (VLN) by addressing common errors in perception, reasoning, and planning that hinder navigation performance. This framework incorporates a dual-view Visual Prompt technique to enhance spatial understanding and a novel step-level Reinforcement Fine-Tuning method, Step Reward Group Policy Optimization (SRGPO), to improve navigation task rewards.