HAPO: Training Language Models to Reason Concisely via History-Aware Policy Optimization
Positive · Artificial Intelligence
The recent introduction of History-Aware Policy Optimization (HAPO) marks a notable advance in the training of large language models (LLMs). While scaling up response length can improve reasoning, it also produces verbose outputs and drives up inference costs. HAPO addresses this by leveraging historical information from a model's previous encounters with the same problems during training, allowing it to learn to produce progressively more concise solutions. The method employs a novel length reward that encourages the discovery of responses shorter than the best correct ones found so far, without overly penalizing incorrect responses that happen to be short. Evaluations on math benchmarks show that HAPO markedly shortens the responses of models such as DeepSeek-R1-Distill-Qwen-1.5B while largely preserving their accuracy. By fostering a more efficient approach to problem-solving, HAPO not only makes LLM reasoning cheaper to deploy but also contributes to the broader goal of optimizing AI performance in …
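The history-aware length reward described above can be made concrete with a small sketch. The Python snippet below is a minimal illustration, assuming a per-problem history state that records the shortest correct response length seen so far and a tanh-shaped length bonus; the class name `HistoryAwareLengthReward`, the `weight` parameter, and the exact reward shape are illustrative assumptions, not the formulation from the HAPO paper.

```python
import math
from collections import defaultdict


class HistoryAwareLengthReward:
    """Illustrative sketch of a history-aware length reward.

    For each training problem we track the shortest correct response
    produced so far. A new correct response is rewarded for beating that
    record and mildly penalized for exceeding it; incorrect responses
    receive no length-based signal, so the model is not pushed toward
    short-but-wrong answers. The functional form is an assumption for
    illustration only.
    """

    def __init__(self, weight: float = 1.0):
        self.weight = weight
        # math.inf means no correct response recorded for that problem yet.
        self.history: dict[str, float] = defaultdict(lambda: math.inf)

    def __call__(self, problem_id: str, response_len: int, is_correct: bool) -> float:
        best = self.history[problem_id]

        if not is_correct:
            return 0.0  # no length penalty or bonus for incorrect responses

        if math.isinf(best):
            length_reward = 0.0  # first correct solution: no history to compare against
        else:
            # Positive when shorter than the historical best, negative when longer,
            # squashed to (-1, 1) so long responses are not punished without bound.
            length_reward = math.tanh((best - response_len) / best)

        # Update the history state with the new shortest correct length.
        self.history[problem_id] = min(best, response_len)
        return self.weight * length_reward


# Usage sketch: combine with a correctness reward inside an RL training loop.
reward_fn = HistoryAwareLengthReward(weight=0.5)
r_len = reward_fn("problem_42", response_len=180, is_correct=True)
total_reward = 1.0 + r_len  # e.g., correctness reward of 1.0 plus length term
```

Returning zero for incorrect responses mirrors the article's point that short but wrong outputs should not be over-penalized (or rewarded) for their length alone; the length signal only activates once a correct baseline exists for a problem.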
— via World Pulse Now AI Editorial System