DVPO: Distributional Value Modeling-based Policy Optimization for LLM Post-Training

arXiv — cs.LG · Thursday, December 4, 2025 at 5:00:00 AM
  • DVPO (Distributional Value Modeling-based Policy Optimization) has been introduced as a new reinforcement learning framework for the post-training phase of large language models (LLMs). The framework addresses the challenges posed by noisy supervision and targets better robustness and generalization by combining conditional risk theory with token-level value distributions (a minimal illustrative sketch follows this summary).
  • DVPO is significant because existing reinforcement learning methods often produce overly conservative policies and inconsistent performance across real-world scenarios. By providing fine-grained, token-level supervision and risk-aware policy optimization, DVPO aims to make LLM post-training more effective in practical applications.
  • This development reflects a broader trend in the field of artificial intelligence, where researchers are increasingly focusing on improving the generalizability and stability of reinforcement learning algorithms. Techniques such as staggered environment resets and adaptive policy optimization are being explored to address similar challenges, indicating a collective effort to refine RL methodologies for better performance in diverse environments.
— via World Pulse Now AI Editorial System
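For readers who want a concrete picture of token-level distributional value modeling, the sketch below shows one generic way such a critic could look: a quantile-regression value head over per-token hidden states, summarized with a lower-tail CVaR as the risk measure. The class names, the quantile parameterization, the CVaR fraction, and the loss below are illustrative assumptions, not details taken from the DVPO paper.

```python
import torch
import torch.nn as nn

class DistributionalValueHead(nn.Module):
    """Illustrative token-level distributional value head (quantile regression).

    A generic sketch of distributional value modeling, not the DVPO authors'
    implementation: names, the quantile parameterization, and the CVaR-style
    risk measure below are assumptions made for illustration.
    """

    def __init__(self, hidden_size: int, n_quantiles: int = 32):
        super().__init__()
        self.proj = nn.Linear(hidden_size, n_quantiles)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # (batch, seq_len, hidden) -> per-token return quantiles
        return self.proj(hidden_states)  # (batch, seq_len, n_quantiles)


def cvar_value(quantiles: torch.Tensor, alpha: float = 0.25) -> torch.Tensor:
    """Risk-aware summary of each token's value distribution (lower-tail CVaR):
    average only the worst alpha-fraction of predicted quantiles."""
    sorted_q, _ = torch.sort(quantiles, dim=-1)           # ascending
    k = max(1, int(alpha * quantiles.shape[-1]))
    return sorted_q[..., :k].mean(dim=-1)                 # (batch, seq_len)


def quantile_huber_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Standard quantile-regression Huber loss for fitting the value head to
    per-token return targets (target has pred's shape minus the last dim)."""
    n = pred.shape[-1]
    taus = (torch.arange(n, device=pred.device, dtype=pred.dtype) + 0.5) / n
    td = target.unsqueeze(-1) - pred
    huber = torch.where(td.abs() <= 1.0, 0.5 * td ** 2, td.abs() - 0.5)
    return (torch.abs(taus - (td.detach() < 0).float()) * huber).mean()
```

In a PPO-style training loop, the conservative value returned by cvar_value could stand in for a scalar critic output when computing per-token advantages; that substitution is likewise illustrative rather than the paper's prescribed recipe.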

Continue Reading
Planning without Search: Refining Frontier LLMs with Offline Goal-Conditioned RL
Positive · Artificial Intelligence
A novel approach has been proposed to enhance the reasoning capabilities of large language models (LLMs) through offline goal-conditioned reinforcement learning (RL), addressing the limitations of current multi-turn RL training methods that are costly and inefficient. This method utilizes goal-conditioned value functions to predict task outcomes based on actions, enabling better planning and reasoning in complex tasks such as negotiation and persuasion.
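As a rough illustration of the goal-conditioned value idea, the minimal sketch below regresses a task-outcome prediction from a (state, goal) pair on logged data, with no environment rollouts. The architecture, dimensions, and MSE target are assumptions chosen for clarity, not details of the paper.

```python
import torch
import torch.nn as nn

class GoalConditionedValue(nn.Module):
    """Minimal goal-conditioned value function V(s, g) for offline training.

    A generic sketch of the idea described above, not the paper's architecture:
    the embedding sizes, MLP shape, and regression target are assumptions.
    """

    def __init__(self, state_dim: int, goal_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor, goal: torch.Tensor) -> torch.Tensor:
        # Predict the expected task outcome (e.g., probability the goal is
        # eventually reached) from the current state and goal representations.
        return self.net(torch.cat([state, goal], dim=-1)).squeeze(-1)


def offline_regression_step(model, optimizer, state, goal, outcome):
    """One supervised update on logged dialogue turns: no environment rollouts."""
    loss = nn.functional.mse_loss(model(state, goal), outcome)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```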
OneThinker: All-in-one Reasoning Model for Image and Video
Positive · Artificial Intelligence
OneThinker has been introduced as an all-in-one reasoning model that integrates image and video understanding across various visual tasks, including question answering and segmentation. This model aims to overcome the limitations of existing approaches that treat image and video reasoning as separate domains, thereby enhancing scalability and knowledge sharing.
Better World Models Can Lead to Better Post-Training Performance
Positive · Artificial Intelligence
A recent study investigates the impact of explicit world-modeling objectives on the internal representations and performance of Transformers, particularly in the context of a controlled Rubik's Cube task. The research compares standard next-token prediction with two world-modeling strategies, revealing that explicit modeling enhances representation quality and downstream performance after reinforcement learning post-training.
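To make the comparison concrete, the following sketch shows one way an explicit world-modeling objective could sit on top of next-token prediction: an auxiliary head predicts the next environment state (for example, the next cube configuration encoded as a class id) and its cross-entropy is added to the language-modeling loss. The auxiliary head, the class-id state encoding, and the 0.5 weight are assumptions, not the study's exact setup.

```python
import torch.nn.functional as F

def joint_lm_and_world_model_loss(
    lm_logits,        # (batch, seq, vocab): next-token logits
    token_targets,    # (batch, seq): target next-token ids
    state_logits,     # (batch, seq, n_states): predicted next cube state
    state_targets,    # (batch, seq): ground-truth next-state ids
    world_weight: float = 0.5,
):
    """Next-token prediction plus an explicit world-modeling term.

    The auxiliary state-prediction head and the 0.5 weight are illustrative
    assumptions, not the study's exact training objective.
    """
    lm_loss = F.cross_entropy(lm_logits.flatten(0, 1), token_targets.flatten())
    wm_loss = F.cross_entropy(state_logits.flatten(0, 1), state_targets.flatten())
    return lm_loss + world_weight * wm_loss
```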
Automatic Attack Discovery for Few-Shot Class-Incremental Learning via Large Language Models
Positive · Artificial Intelligence
A recent study has introduced a novel method called ACraft for automatic attack discovery in Few-Shot Class-Incremental Learning (FSCIL) using Large Language Models (LLMs). This research highlights the challenges posed by traditional attack methods like PGD and FGSM, which either fail to effectively target base classes or require extensive expert knowledge, thus necessitating a specialized approach for FSCIL.
Deep Reinforcement Learning for Dynamic Algorithm Configuration: A Case Study on Optimizing OneMax with the $(1+(\lambda,\lambda))$-GA
Positive · Artificial Intelligence
A comprehensive study has been conducted on the application of deep reinforcement learning (RL) algorithms for dynamic algorithm configuration (DAC), specifically focusing on optimizing the population size parameter of the $(1+(\lambda,\lambda))$-GA on OneMax instances. The research identifies significant challenges such as scalability degradation and learning instability, attributed to under-exploration and planning horizon coverage.
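For context on what is being configured, the sketch below implements one generation of a textbook $(1+(\lambda,\lambda))$-GA on OneMax with the population size λ supplied externally; a DAC agent would choose λ at each generation from a state such as the current fitness. The mutation rate p = λ/n, crossover bias c = 1/λ, and elitist selection follow the standard description of the algorithm rather than the paper's exact experimental setup.

```python
import numpy as np

def onemax(x: np.ndarray) -> int:
    """OneMax fitness: the number of one-bits."""
    return int(x.sum())

def ollga_generation(parent: np.ndarray, lam: int, rng: np.random.Generator) -> np.ndarray:
    """One generation of the (1+(lambda,lambda))-GA on OneMax, with the
    population size lambda supplied by an external controller (the DAC agent)."""
    n = parent.size
    p = min(1.0, lam / n)          # mutation rate
    c = 1.0 / lam                  # crossover bias

    # Mutation phase: sample a flip count once, then create lambda mutants.
    ell = rng.binomial(n, p)
    mutants = []
    for _ in range(lam):
        child = parent.copy()
        flips = rng.choice(n, size=ell, replace=False)
        child[flips] ^= 1
        mutants.append(child)
    best_mutant = max(mutants, key=onemax)

    # Crossover phase: lambda offspring taking each bit from the mutant w.p. c.
    crossed = [np.where(rng.random(n) < c, best_mutant, parent) for _ in range(lam)]
    best_child = max(crossed, key=onemax)

    # Elitist selection: keep the parent unless an offspring is at least as good.
    return best_child if onemax(best_child) >= onemax(parent) else parent
```

Starting from a random bit string (for instance, parent = rng.integers(0, 2, size=100)), repeatedly calling ollga_generation with a λ chosen by the learned policy traces out one DAC episode; in such a setup the per-generation fitness gain is a natural reward signal, though the paper's own reward definition may differ.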
AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition
Positive · Artificial Intelligence
AdaptVision has been introduced as a new paradigm in Vision-Language Models (VLMs), focusing on adaptive visual token acquisition to enhance efficiency in visual question answering tasks. By employing a coarse-to-fine approach, the model selectively acquires visual information as needed, addressing the computational overhead associated with traditional methods that rely on fixed-ratio compression.
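The coarse-to-fine idea can be pictured with a small control-flow sketch: answer from a cheap, low-resolution set of visual tokens first, and acquire higher-resolution tokens only when confidence is low. The helper methods, the confidence threshold, and the two-stage structure below are hypothetical, illustrating adaptive acquisition rather than AdaptVision's actual interface.

```python
import torch

def answer_with_adaptive_vision(model, image, question, conf_threshold: float = 0.7):
    """Coarse-to-fine visual token acquisition (control-flow sketch only).

    encode_coarse, encode_fine, and answer are hypothetical components of a
    VLM, not AdaptVision's real API; the point is the adaptive control flow.
    """
    # Stage 1: answer from a small set of coarse (downsampled) visual tokens.
    coarse_tokens = model.encode_coarse(image)
    answer, confidence = model.answer(question, coarse_tokens)
    if confidence >= conf_threshold:
        return answer                                  # cheap path, no extra tokens

    # Stage 2: acquire fine-grained tokens only when the model is unsure.
    fine_tokens = model.encode_fine(image)
    all_tokens = torch.cat([coarse_tokens, fine_tokens], dim=1)
    return model.answer(question, all_tokens)[0]
```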
GTPO: Stabilizing Group Relative Policy Optimization via Gradient and Entropy Control
Positive · Artificial Intelligence
The introduction of Group-relative Trajectory-based Policy Optimization (GTPO) aims to enhance the stability and performance of Group Relative Policy Optimization (GRPO) in training Large Language Models (LLMs). GTPO addresses critical issues such as conflicting gradient updates on valuable tokens and policy collapse, which have hindered effective model alignment and training processes. By amplifying positive feedback and filtering out high-entropy completions, GTPO seeks to improve convergence and reliability.
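The two mechanisms named above can be illustrated with a small GRPO-style advantage computation: completions whose mean token entropy exceeds a threshold are dropped from the group before normalization, and positive advantages are scaled up. The threshold, the scaling factor, and the normalization details are assumptions for illustration, not GTPO's exact formulation.

```python
import torch

def gtpo_style_advantages(
    rewards: torch.Tensor,          # (group_size,): scalar reward per completion
    token_entropies: torch.Tensor,  # (group_size, seq_len): per-token policy entropy
    entropy_threshold: float = 2.0,
    positive_scale: float = 1.5,
) -> torch.Tensor:
    """GRPO-style advantages with entropy filtering and amplified positive feedback.

    The threshold, scaling factor, and this normalization are illustrative
    assumptions, not GTPO's exact formulation.
    """
    # Drop completions whose average token entropy is above the threshold.
    keep = token_entropies.mean(dim=-1) < entropy_threshold

    advantages = torch.zeros_like(rewards)
    if keep.sum() >= 2:  # need at least two survivors for a group statistic
        r = rewards[keep]
        a = (r - r.mean()) / (r.std() + 1e-6)          # group-relative normalization
        a = torch.where(a > 0, positive_scale * a, a)  # amplify positive feedback
        advantages[keep] = a
    return advantages
```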
Kardia-R1: Unleashing LLMs to Reason toward Understanding and Empathy for Emotional Support via Rubric-as-Judge Reinforcement Learning
Positive · Artificial Intelligence
Kardia-R1 has introduced KardiaBench, a benchmark for emotional reasoning in conversational agents built from 178,080 QA pairs drawn from 671 real-world profiles, addressing the limitations of existing systems that lack personalized emotional understanding.