GRPO-RM: Fine-Tuning Representation Models via GRPO-Driven Reinforcement Learning

arXiv — cs.LG · Thursday, November 20, 2025 at 5:00:00 AM


Recommended Readings
Investigating Hallucination in Conversations for Low Resource Languages
Neutral · Artificial Intelligence
Large Language Models (LLMs) have shown exceptional ability in text generation but often produce factually incorrect statements, known as 'hallucinations'. This study investigates hallucinations in conversational data across three low-resource languages: Hindi, Farsi, and Mandarin. The analysis of various LLMs, including GPT-3.5 and GPT-4o, reveals that while Mandarin has few hallucinated responses, Hindi and Farsi exhibit significantly higher rates of inaccuracies.
Empowering Multi-Turn Tool-Integrated Reasoning with Group Turn Policy Optimization
Positive · Artificial Intelligence
The paper introduces Group Turn Policy Optimization (GTPO), a novel reinforcement learning algorithm aimed at enhancing the training of Large Language Models (LLMs) for multi-turn Tool-Integrated Reasoning (TIR). GTPO addresses limitations of existing methods like Group Relative Policy Optimization (GRPO) by implementing turn-level reward assignments, return-based advantage estimation, and self-supervised reward shaping, which collectively improve learning signals for complex interactions.
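To make the contrast concrete, the sketch below places GRPO's trajectory-level, group-normalized advantages next to a turn-level, return-based variant in the spirit of GTPO. The input format, discounting, and turn-alignment choices are illustrative assumptions, not the paper's exact scheme.

```python
import numpy as np

def grpo_advantages(rewards):
    """GRPO-style baseline: one scalar reward per sampled rollout,
    normalized across the group and shared by the whole trajectory."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def turn_level_advantages(turn_rewards_per_rollout, gamma=1.0):
    """Turn-level variant in the spirit of GTPO's return-based estimation:
    each turn gets a discounted return-to-go, then returns are normalized
    across the group at matching turn indices (alignment is an assumption)."""
    # Return-to-go for every turn of every rollout.
    returns = []
    for turns in turn_rewards_per_rollout:
        acc, g = 0.0, []
        for r in reversed(turns):
            acc = r + gamma * acc
            g.append(acc)
        returns.append(list(reversed(g)))
    # Normalize across rollouts, turn index by turn index.
    advantages = [[0.0] * len(g) for g in returns]
    for t in range(max(len(g) for g in returns)):
        vals = [g[t] for g in returns if len(g) > t]
        mu, sd = float(np.mean(vals)), float(np.std(vals)) + 1e-8
        for i, g in enumerate(returns):
            if len(g) > t:
                advantages[i][t] = (g[t] - mu) / sd
    return advantages

# Example: one group of three multi-turn tool-use rollouts.
print(grpo_advantages([1.0, 0.0, 1.0]))
print(turn_level_advantages([[0.2, 1.0], [0.0, 0.0, 1.0], [0.5, 0.0]]))
```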
Trade-offs in Large Reasoning Models: An Empirical Analysis of Deliberative and Adaptive Reasoning over Foundational Capabilities
Neutral · Artificial Intelligence
Recent Large Reasoning Models (LRMs) have shown impressive performance on specialized reasoning tasks. However, a systematic evaluation reveals that acquiring deliberative reasoning capabilities significantly erodes foundational capabilities, leading to declines in helpfulness and harmlessness along with increased inference costs. Adaptive reasoning methods can alleviate these drawbacks, highlighting the need for more versatile LRMs.
ConInstruct: Evaluating Large Language Models on Conflict Detection and Resolution in Instructions
Neutral · Artificial Intelligence
ConInstruct is a benchmark designed to evaluate Large Language Models (LLMs) on their ability to detect and resolve conflicts in user instructions. While many existing assessments focus on adherence to instructions, ConInstruct addresses the often-overlooked scenarios where conflicting constraints arise. Initial evaluations show that proprietary LLMs generally perform well in conflict detection, with DeepSeek-R1 and Claude-4.5-Sonnet achieving the highest F1-scores.
Preference Learning with Lie Detectors can Induce Honesty or Evasion
Neutral · Artificial Intelligence
As AI systems advance, deceptive behaviors pose challenges in evaluation and user trust. Recent research indicates that lie detectors can effectively identify deception, yet they are seldom integrated into training due to fears of contamination and manipulation. This study explores the impact of incorporating lie detectors in the labeling phase of large language model (LLM) training, using a new dataset called DolusChat. It identifies key factors influencing the honesty of learned policies, revealing that preference learning with lie detectors can lead to evasion strategies.
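As a rough illustration of what integrating a lie detector into the labeling phase could look like, here is a minimal sketch. The `base_preference` and `lie_detector` callables and the flip-or-drop rule are hypothetical stand-ins, not the study's actual DolusChat protocol; the paper's point is that such imperfect detectors can end up rewarding evasion instead of honesty.

```python
from typing import Callable, Optional, Tuple

def label_pair(prompt: str, resp_a: str, resp_b: str,
               base_preference: Callable[[str, str, str], int],
               lie_detector: Callable[[str, str], float],
               threshold: float = 0.5) -> Optional[Tuple[str, str]]:
    """Toy labeling rule: pick a winner with the ordinary preference signal,
    then consult a lie detector. If the winner looks deceptive, flip the pair
    (when the other response looks honest) or drop it entirely."""
    if base_preference(prompt, resp_a, resp_b) == 0:
        winner, loser = resp_a, resp_b
    else:
        winner, loser = resp_b, resp_a
    if lie_detector(prompt, winner) > threshold:
        if lie_detector(prompt, loser) <= threshold:
            return loser, winner       # flip the preference
        return None                    # discard the pair
    return winner, loser               # keep as (chosen, rejected)

# Hypothetical usage with stub scorers.
pair = label_pair("Did you read the file?",
                  "Yes, I read it carefully.", "No, I could not open it.",
                  base_preference=lambda p, a, b: 0,
                  lie_detector=lambda p, r: 0.9 if r.startswith("Yes") else 0.1)
print(pair)  # ('No, I could not open it.', 'Yes, I read it carefully.')
```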
Audio Question Answering with GRPO-Based Fine-Tuning and Calibrated Segment-Level Predictions
Positive · Artificial Intelligence
This report details a submission to Track 5 of the DCASE 2025 Challenge, focused on Audio Question Answering (AQA). The system uses the self-supervised (SSL) backbone BEATs to extract frame-level audio features, which a classification head turns into segment-level predictions of acoustic events from the AudioSet ontology. These predictions are calibrated before producing event-level predictions, which are then structured into a prompt for a fine-tuned version of Qwen2.5-7B-Instruct trained with the GRPO algorithm. The method achieved an accuracy of 62.6% on the development set, highlig…
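A minimal sketch of the prediction-to-prompt stage described above, assuming a simple temperature-scaling calibration and illustrative label names and prompt formatting; the submission's actual calibration procedure and prompt template are not given in this summary.

```python
import numpy as np

def calibrate(probs, temperature=1.5):
    """Toy temperature scaling of segment-level probabilities."""
    logits = np.log(probs + 1e-8) - np.log(1.0 - probs + 1e-8)
    return 1.0 / (1.0 + np.exp(-logits / temperature))

def events_to_prompt(segment_probs, labels, question, threshold=0.5, seg_len_s=1.0):
    """Threshold calibrated segment predictions into event mentions and build
    a text prompt for the GRPO-fine-tuned LLM. Formatting is illustrative."""
    lines = []
    for k, name in enumerate(labels):
        active = np.where(segment_probs[:, k] >= threshold)[0]
        if active.size:
            start, end = active[0] * seg_len_s, (active[-1] + 1) * seg_len_s
            lines.append(f"- {name}: roughly {start:.0f}-{end:.0f} s")
    events = "\n".join(lines) if lines else "- (no confident events)"
    return f"Detected acoustic events:\n{events}\n\nQuestion: {question}\nAnswer:"

# Example with made-up probabilities: 10 one-second segments, 2 classes.
probs = np.array([[0.9, 0.1]] * 4 + [[0.2, 0.8]] * 6)
print(events_to_prompt(calibrate(probs), ["Dog bark", "Speech"], "Is a dog barking?"))
```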
GRPO Privacy Is at Risk: A Membership Inference Attack Against Reinforcement Learning With Verifiable Rewards
Neutral · Artificial Intelligence
Membership inference attacks (MIAs) on large language models (LLMs) pose significant privacy risks during model training. Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) have transformed LLM training, especially for complex reasoning tasks. However, the on-policy nature of RLVR raises a distinct privacy concern: an attacker can try to determine whether a prompt was used during fine-tuning, with potential leakage arising not from memorization but from behavioral changes. The Divergence-in-Behavior Attack (DIBA) framework is proposed to address this risk.
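In that spirit, a crude membership signal might compare how the pre-RLVR and RLVR-trained models behave on a candidate prompt; the sketch below is an assumption-laden stand-in (hypothetical `ref_model`, `rlvr_model`, and `grader` callables), not DIBA's actual feature set or decision rule.

```python
def behavior_shift_score(prompt, ref_model, rlvr_model, grader, n_samples=8):
    """Crude membership signal: sample completions from the pre-RLVR reference
    and from the RLVR-trained model, score each with the verifiable-reward
    checker, and treat the accuracy gain on this prompt as (weak) evidence
    that it appeared in the fine-tuning set. `ref_model`/`rlvr_model` map a
    prompt to a sampled completion and `grader` returns 0 or 1; all three are
    hypothetical stand-ins."""
    ref_acc = sum(grader(ref_model(prompt)) for _ in range(n_samples)) / n_samples
    rlvr_acc = sum(grader(rlvr_model(prompt)) for _ in range(n_samples)) / n_samples
    return rlvr_acc - ref_acc  # larger shift -> more likely a member prompt
```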
Meta’s DreamGym framework trains AI agents in a simulated world to cut reinforcement learning costs
Positive · Artificial Intelligence
Researchers at Meta, the University of Chicago, and UC Berkeley have developed DreamGym, a new framework that reduces the costs and complexities of training AI agents using reinforcement learning (RL). This framework simulates an RL environment, allowing agents to learn progressively by adjusting task difficulty. Experiments indicate that DreamGym enhances RL training efficiency, achieving results comparable to established algorithms while significantly lowering data collection costs.
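The progressive-difficulty idea can be sketched as a task pool that promotes or demotes a difficulty level based on a rolling success rate. The class below is illustrative; the thresholds and task representation are assumptions, not Meta's DreamGym implementation.

```python
import random

class CurriculumTaskPool:
    """Illustrative progressive-difficulty task pool for a simulated RL
    environment: raise the difficulty level when the agent's rolling success
    rate is high, lower it when the agent struggles."""

    def __init__(self, max_level=10, window=20, promote_at=0.8, demote_at=0.3):
        self.level = 1
        self.max_level = max_level
        self.window = window
        self.promote_at = promote_at
        self.demote_at = demote_at
        self.recent = []  # successes/failures observed at the current level

    def sample_task(self):
        # A synthetic task specification parameterized only by difficulty.
        return {"level": self.level, "seed": random.randrange(10**6)}

    def report(self, success: bool):
        self.recent.append(success)
        if len(self.recent) < self.window:
            return
        rate = sum(self.recent) / len(self.recent)
        if rate >= self.promote_at and self.level < self.max_level:
            self.level += 1
            self.recent.clear()
        elif rate <= self.demote_at and self.level > 1:
            self.level -= 1
            self.recent.clear()
```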