GRPO-RM: Fine-Tuning Representation Models via GRPO-Driven Reinforcement Learning

arXiv — cs.LG · Thursday, November 20, 2025 at 5:00:00 AM


Recommended Readings
Investigating Hallucination in Conversations for Low Resource Languages
Neutral · Artificial Intelligence
Large Language Models (LLMs) have shown exceptional ability in text generation but often produce factually incorrect statements, known as 'hallucinations'. This study investigates hallucinations in conversational data across three low-resource languages: Hindi, Farsi, and Mandarin. The analysis of various LLMs, including GPT-3.5 and GPT-4o, reveals that while Mandarin has few hallucinated responses, Hindi and Farsi exhibit significantly higher rates of inaccuracies.
Empowering Multi-Turn Tool-Integrated Reasoning with Group Turn Policy Optimization
Positive · Artificial Intelligence
The paper introduces Group Turn Policy Optimization (GTPO), a novel reinforcement learning algorithm aimed at enhancing the training of Large Language Models (LLMs) for multi-turn Tool-Integrated Reasoning (TIR). GTPO addresses limitations of existing methods like Group Relative Policy Optimization (GRPO) by implementing turn-level reward assignments, return-based advantage estimation, and self-supervised reward shaping, which collectively improve learning signals for complex interactions.
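To make the contrast concrete, the sketch below places GRPO's trajectory-level, group-normalized advantages next to a turn-level, return-based variant in the spirit of GTPO. The input format, discounting, and turn-alignment choices are illustrative assumptions, not the paper's exact scheme.

```python
import numpy as np

def grpo_advantages(rewards):
    """GRPO-style baseline: one scalar reward per sampled rollout,
    normalized across the group and shared by the whole trajectory."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def turn_level_advantages(turn_rewards_per_rollout, gamma=1.0):
    """Turn-level variant in the spirit of GTPO's return-based estimation:
    each turn gets a discounted return-to-go, then returns are normalized
    across the group at matching turn indices (alignment is an assumption)."""
    # Return-to-go for every turn of every rollout.
    returns = []
    for turns in turn_rewards_per_rollout:
        acc, g = 0.0, []
        for r in reversed(turns):
            acc = r + gamma * acc
            g.append(acc)
        returns.append(list(reversed(g)))
    # Normalize across rollouts, turn index by turn index.
    advantages = [[0.0] * len(g) for g in returns]
    for t in range(max(len(g) for g in returns)):
        vals = [g[t] for g in returns if len(g) > t]
        mu, sd = float(np.mean(vals)), float(np.std(vals)) + 1e-8
        for i, g in enumerate(returns):
            if len(g) > t:
                advantages[i][t] = (g[t] - mu) / sd
    return advantages

# Example: one group of three multi-turn tool-use rollouts.
print(grpo_advantages([1.0, 0.0, 1.0]))
print(turn_level_advantages([[0.2, 1.0], [0.0, 0.0, 1.0], [0.5, 0.0]]))
```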
Trade-offs in Large Reasoning Models: An Empirical Analysis of Deliberative and Adaptive Reasoning over Foundational Capabilities
Neutral · Artificial Intelligence
Recent Large Reasoning Models (LRMs) have shown impressive performance on specialized reasoning tasks. However, a systematic evaluation reveals that acquiring deliberative reasoning capabilities significantly erodes foundational capabilities, leading to declines in helpfulness and harmlessness along with increased inference costs. Adaptive reasoning methods can alleviate these drawbacks, highlighting the need for more versatile LRMs.
ConInstruct: Evaluating Large Language Models on Conflict Detection and Resolution in Instructions
Neutral · Artificial Intelligence
ConInstruct is a benchmark designed to evaluate Large Language Models (LLMs) on their ability to detect and resolve conflicts in user instructions. While many existing assessments focus on adherence to instructions, ConInstruct addresses the often-overlooked scenarios where conflicting constraints arise. Initial evaluations show that proprietary LLMs generally perform well in conflict detection, with DeepSeek-R1 and Claude-4.5-Sonnet achieving the highest F1-scores.
Preference Learning with Lie Detectors can Induce Honesty or Evasion
Neutral · Artificial Intelligence
As AI systems advance, deceptive behaviors pose challenges in evaluation and user trust. Recent research indicates that lie detectors can effectively identify deception, yet they are seldom integrated into training due to fears of contamination and manipulation. This study explores the impact of incorporating lie detectors in the labeling phase of large language model (LLM) training, using a new dataset called DolusChat. It identifies key factors influencing the honesty of learned policies, revealing that preference learning with lie detectors can lead to evasion strategies.
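As a rough illustration of what integrating a lie detector into the labeling phase could look like, here is a minimal sketch. The `base_preference` and `lie_detector` callables and the flip-or-drop rule are hypothetical stand-ins, not the study's actual DolusChat protocol; the paper's point is that such imperfect detectors can end up rewarding evasion instead of honesty.

```python
from typing import Callable, Optional, Tuple

def label_pair(prompt: str, resp_a: str, resp_b: str,
               base_preference: Callable[[str, str, str], int],
               lie_detector: Callable[[str, str], float],
               threshold: float = 0.5) -> Optional[Tuple[str, str]]:
    """Toy labeling rule: pick a winner with the ordinary preference signal,
    then consult a lie detector. If the winner looks deceptive, flip the pair
    (when the other response looks honest) or drop it entirely."""
    if base_preference(prompt, resp_a, resp_b) == 0:
        winner, loser = resp_a, resp_b
    else:
        winner, loser = resp_b, resp_a
    if lie_detector(prompt, winner) > threshold:
        if lie_detector(prompt, loser) <= threshold:
            return loser, winner       # flip the preference
        return None                    # discard the pair
    return winner, loser               # keep as (chosen, rejected)

# Hypothetical usage with stub scorers.
pair = label_pair("Did you read the file?",
                  "Yes, I read it carefully.", "No, I could not open it.",
                  base_preference=lambda p, a, b: 0,
                  lie_detector=lambda p, r: 0.9 if r.startswith("Yes") else 0.1)
print(pair)  # ('No, I could not open it.', 'Yes, I read it carefully.')
```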
Audio Question Answering with GRPO-Based Fine-Tuning and Calibrated Segment-Level Predictions
Positive · Artificial Intelligence
This report details a submission to Track 5 of the DCASE 2025 Challenge, focused on Audio Question Answering (AQA). The system uses the self-supervised (SSL) backbone BEATs to extract frame-level audio features, which a classification head turns into segment-level predictions of acoustic events from the AudioSet ontology. These predictions are calibrated before producing event-level predictions, which are then structured into a prompt for a fine-tuned version of Qwen2.5-7B-Instruct trained with the GRPO algorithm. The method achieved an accuracy of 62.6% on the development set, highlig…
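A minimal sketch of the prediction-to-prompt stage described above, assuming a simple temperature-scaling calibration and illustrative label names and prompt formatting; the submission's actual calibration procedure and prompt template are not given in this summary.

```python
import numpy as np

def calibrate(probs, temperature=1.5):
    """Toy temperature scaling of segment-level probabilities."""
    logits = np.log(probs + 1e-8) - np.log(1.0 - probs + 1e-8)
    return 1.0 / (1.0 + np.exp(-logits / temperature))

def events_to_prompt(segment_probs, labels, question, threshold=0.5, seg_len_s=1.0):
    """Threshold calibrated segment predictions into event mentions and build
    a text prompt for the GRPO-fine-tuned LLM. Formatting is illustrative."""
    lines = []
    for k, name in enumerate(labels):
        active = np.where(segment_probs[:, k] >= threshold)[0]
        if active.size:
            start, end = active[0] * seg_len_s, (active[-1] + 1) * seg_len_s
            lines.append(f"- {name}: roughly {start:.0f}-{end:.0f} s")
    events = "\n".join(lines) if lines else "- (no confident events)"
    return f"Detected acoustic events:\n{events}\n\nQuestion: {question}\nAnswer:"

# Example with made-up probabilities: 10 one-second segments, 2 classes.
probs = np.array([[0.9, 0.1]] * 4 + [[0.2, 0.8]] * 6)
print(events_to_prompt(calibrate(probs), ["Dog bark", "Speech"], "Is a dog barking?"))
```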
GRPO Privacy Is at Risk: A Membership Inference Attack Against Reinforcement Learning With Verifiable Rewards
Neutral · Artificial Intelligence
Membership inference attacks (MIAs) on large language models (LLMs) pose significant privacy risks during model training. Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) have transformed LLM training, especially for complex reasoning tasks. However, the on-policy nature of RLVR raises a distinct privacy concern: an attacker can try to determine whether a prompt was used during fine-tuning, with potential leakage arising not from memorization but from behavioral changes. The Divergence-in-Behavior Attack (DIBA) framework is proposed to address this risk.
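In that spirit, a crude membership signal might compare how the pre-RLVR and RLVR-trained models behave on a candidate prompt; the sketch below is an assumption-laden stand-in (hypothetical `ref_model`, `rlvr_model`, and `grader` callables), not DIBA's actual feature set or decision rule.

```python
def behavior_shift_score(prompt, ref_model, rlvr_model, grader, n_samples=8):
    """Crude membership signal: sample completions from the pre-RLVR reference
    and from the RLVR-trained model, score each with the verifiable-reward
    checker, and treat the accuracy gain on this prompt as (weak) evidence
    that it appeared in the fine-tuning set. `ref_model`/`rlvr_model` map a
    prompt to a sampled completion and `grader` returns 0 or 1; all three are
    hypothetical stand-ins."""
    ref_acc = sum(grader(ref_model(prompt)) for _ in range(n_samples)) / n_samples
    rlvr_acc = sum(grader(rlvr_model(prompt)) for _ in range(n_samples)) / n_samples
    return rlvr_acc - ref_acc  # larger shift -> more likely a member prompt
```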
Meta’s DreamGym framework trains AI agents in a simulated world to cut reinforcement learning costs
Positive · Artificial Intelligence
Researchers at Meta, the University of Chicago, and UC Berkeley have developed DreamGym, a new framework that reduces the costs and complexities of training AI agents using reinforcement learning (RL). This framework simulates an RL environment, allowing agents to learn progressively by adjusting task difficulty. Experiments indicate that DreamGym enhances RL training efficiency, achieving results comparable to established algorithms while significantly lowering data collection costs.
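The progressive-difficulty idea can be sketched as a task pool that promotes or demotes a difficulty level based on a rolling success rate. The class below is illustrative; the thresholds and task representation are assumptions, not Meta's DreamGym implementation.

```python
import random

class CurriculumTaskPool:
    """Illustrative progressive-difficulty task pool for a simulated RL
    environment: raise the difficulty level when the agent's rolling success
    rate is high, lower it when the agent struggles."""

    def __init__(self, max_level=10, window=20, promote_at=0.8, demote_at=0.3):
        self.level = 1
        self.max_level = max_level
        self.window = window
        self.promote_at = promote_at
        self.demote_at = demote_at
        self.recent = []  # successes/failures observed at the current level

    def sample_task(self):
        # A synthetic task specification parameterized only by difficulty.
        return {"level": self.level, "seed": random.randrange(10**6)}

    def report(self, success: bool):
        self.recent.append(success)
        if len(self.recent) < self.window:
            return
        rate = sum(self.recent) / len(self.recent)
        if rate >= self.promote_at and self.level < self.max_level:
            self.level += 1
            self.recent.clear()
        elif rate <= self.demote_at and self.level > 1:
            self.level -= 1
            self.recent.clear()
```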