PrefPoE: Advantage-Guided Preference Fusion for Learning Where to Explore

arXiv — cs.LG•Wednesday, November 12, 2025 at 5:00:00 AM

The PrefPoE framework, introduced on November 12, 2025, targets exploration in reinforcement learning. Methods that rely on naive entropy maximization often suffer from high variance and inefficient policy updates. PrefPoE addresses this with a Preference-Product-of-Experts approach that guides exploration using action advantages, which is reported to stabilize policy updates and improve sample efficiency. The reported performance improvements across control tasks include 321% on HalfCheetah-v4, 69% on Ant-v4, and 276% on LunarLander-v2. Unlike standard PPO, which can suffer from entropy collapse, PrefPoE maintains adaptive exploration dynamics and so avoids premature convergence. This advancement highlights the importance of learning where to explore, which is …
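The summary does not say how PrefPoE forms its product of experts, so the following is only a generic illustration of the underlying idea: multiplying two one-dimensional Gaussian densities gives another Gaussian whose mean is a precision-weighted average of the two. The names `Gaussian`, `product_of_experts`, `policy`, and `preference` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gaussian:
    """A 1-D Gaussian expert described by its mean and variance."""
    mean: float
    var: float


def product_of_experts(a: Gaussian, b: Gaussian) -> Gaussian:
    """Fuse two Gaussian experts by multiplying their densities.

    The (renormalized) product of two Gaussians is a Gaussian whose
    precision is the sum of the input precisions.
    """
    precision = 1.0 / a.var + 1.0 / b.var
    var = 1.0 / precision
    mean = var * (a.mean / a.var + b.mean / b.var)
    return Gaussian(mean=mean, var=var)


# Assumed toy values: a policy expert centered at 0 and a preference
# expert centered at 1, both with unit variance.
policy = Gaussian(mean=0.0, var=1.0)
preference = Gaussian(mean=1.0, var=1.0)
fused = product_of_experts(policy, preference)
print(fused)
```

With these toy values the fused expert has mean 0.5 and variance 0.5: the action distribution shifts toward the preferred region and narrows, which is one way a preference signal could steer where exploration happens.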

— via World Pulse Now AI Editorial System

Recommended Readings

arXiv cs.LG • 4 hours ago

EvoLM: In Search of Lost Language Model Training Dynamics

Positive • Artificial Intelligence

EvoLM is a new model suite designed to analyze the training dynamics of language models (LMs) across various stages, including pre-training and fine-tuning. By training over 100 LMs with 1B and 4B parameters, EvoLM provides insights into the effectiveness of design choices and their impact on both language modeling and problem-solving capabilities. Key findings emphasize the diminishing returns of excessive pre-training and the importance of continued pre-training to mitigate forgetting during domain-specific tasks.

arXiv cs.LG • 2 days ago

Potent but Stealthy: Rethink Profile Pollution against Sequential Recommendation via Bi-level Constrained Reinforcement Paradigm

Positive • Artificial Intelligence

The paper examines the vulnerability of sequential recommenders to adversarial attacks. It focuses on the Profile Pollution Attack (PPA), which subtly contaminates user interactions to induce mispredictions. The authors propose a method called CREAT, which combines bi-level optimization with reinforcement learning to make such attacks both stealthier and more effective, overcoming limitations of previous methods.

arXiv cs.CL • 2 days ago

LDC: Learning to Generate Research Idea with Dynamic Control

Positive • Artificial Intelligence

Recent advancements in large language models (LLMs) highlight their potential in automating scientific research ideation. Current methods often produce ideas that do not meet expert standards of novelty, feasibility, and effectiveness. To address these issues, a new framework is proposed that combines Supervised Fine-Tuning (SFT) and controllable Reinforcement Learning (RL) to enhance the quality of generated research ideas through a two-stage approach.

arXiv cs.LG • 2 days ago

Behaviour Policy Optimization: Provably Lower Variance Return Estimates for Off-Policy Reinforcement Learning

Positive • Artificial Intelligence

The paper addresses high-variance return estimates in reinforcement learning algorithms. It highlights that a well-designed behavior policy can collect off-policy data that yields lower-variance return estimates, which suggests that on-policy data collection is not optimal for variance. The authors extend this insight to online reinforcement learning, where policy evaluation and improvement occur simultaneously.

arXiv cs.CL • 2 days ago

Thinker: Training LLMs in Hierarchical Thinking for Deep Search via Multi-Turn Interaction

Positive • Artificial Intelligence

The article presents Thinker, a hierarchical thinking model designed to enhance the reasoning capabilities of large language models (LLMs) through multi-turn interactions. Unlike previous methods that relied on end-to-end reinforcement learning without supervision, Thinker allows for a more structured reasoning process by breaking down complex problems into manageable sub-problems. Each sub-problem is represented in both natural language and logical functions, improving the coherence and rigor of the reasoning process.

arXiv cs.CL • 2 days ago

From Efficiency to Adaptivity: A Deeper Look at Adaptive Reasoning in Large Language Models

Neutral • Artificial Intelligence

Recent advancements in large language models (LLMs) have shifted attention to reasoning as a benchmark for evaluating intelligence. This article critiques the uniform reasoning strategies of current LLMs, which often generate lengthy reasoning for simple tasks while struggling with complex ones. It introduces adaptive reasoning, in which models adjust their reasoning effort to task difficulty and uncertainty, and outlines three key contributions to understanding this approach.

arXiv cs.LG • 2 days ago

Harnessing Bounded-Support Evolution Strategies for Policy Refinement

Positive • Artificial Intelligence

The article discusses the use of Triangular-Distribution Evolution Strategies (TD-ES) for refining robot policies through on-policy reinforcement learning (RL). It addresses challenges posed by noisy gradients and proposes a method that combines bounded triangular noise with a centered-rank finite-difference estimator. The two-stage process, involving PPO pretraining followed by TD-ES refinement, enhances success rates by 26.5% while reducing variance, making it a promising approach for improving robotic manipulation tasks.
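The summary names two ingredients, bounded triangular noise and a centered-rank finite-difference estimator, without giving details. The sketch below shows how those two pieces might fit together in a generic evolution-strategies update on a toy objective. It is not the TD-ES algorithm itself and omits the PPO pretraining stage; `centered_ranks`, `es_step`, `TARGET`, and all hyperparameter values are assumptions chosen for illustration.

```python
import random


def centered_ranks(values):
    """Map values to ranks scaled into [-0.5, 0.5] (lowest value -> -0.5)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for rank, idx in enumerate(order):
        ranks[idx] = rank / (len(values) - 1) - 0.5
    return ranks


def es_step(theta, objective, rng, pop_size=20, sigma=0.1, lr=0.05):
    """One ascent step using triangular noise bounded to [-1, 1]."""
    # Each perturbation is drawn from a triangular distribution with mode 0,
    # so no parameter is ever moved by more than sigma during evaluation.
    noises = [[rng.triangular(-1.0, 1.0, 0.0) for _ in theta]
              for _ in range(pop_size)]
    scores = [objective([t + sigma * e for t, e in zip(theta, eps)])
              for eps in noises]
    # Replacing raw scores with centered ranks makes the estimate
    # insensitive to the scale of the objective and to outliers.
    weights = centered_ranks(scores)
    grad = [sum(w * eps[j] for w, eps in zip(weights, noises)) / (pop_size * sigma)
            for j in range(len(theta))]
    return [t + lr * g for t, g in zip(theta, grad)]


# Toy objective (assumed): maximize the negative squared distance to TARGET.
TARGET = [1.0, -1.0]


def objective(params):
    return -sum((p - t) ** 2 for p, t in zip(params, TARGET))


rng = random.Random(0)
theta = [0.0, 0.0]
initial_score = objective(theta)
for _ in range(200):
    theta = es_step(theta, objective, rng)
final_score = objective(theta)
print(initial_score, final_score)
```

Because the weights are ranks rather than raw returns, the step size does not shrink near the optimum; a practical implementation would likely decay `lr` or `sigma` over time.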

arXiv cs.LG • 2 days ago

DiAReL: Reinforcement Learning with Disturbance Awareness for Robust Sim2Real Policy Transfer in Robot Control

Positive • Artificial Intelligence

The paper introduces a disturbance-augmented Markov decision process (DAMDP) to enhance reinforcement learning in robotic control. It addresses the sim2real transfer problem, in which models trained in simulation often fail to perform effectively in the real world because of discrepancies in system dynamics. The proposed approach aims to improve the robustness and stabilization of control responses in robotic systems.
