Growing with the Generator: Self-paced GRPO for Video Generation

arXiv — cs.CV · Tuesday, November 25, 2025 at 5:00:00 AM
  • Self-Paced Group Relative Policy Optimization (GRPO) advances reinforcement learning for video generation by letting reward feedback evolve alongside the generator. The method addresses the limitations of static reward models, improving the stability and effectiveness of training for high-quality video content (a minimal sketch of the underlying group-relative update follows this summary).
  • The development matters because it mitigates reward exploitation and distributional bias, two issues that have hindered reinforcement learning for video generation, and so promises better outcomes for AI-generated media.
  • The evolution of GRPO frameworks reflects a broader trend in AI research towards adaptive learning systems that prioritize dynamic feedback mechanisms. This shift is echoed in various studies exploring enhancements in large language models and visual generation, highlighting a collective effort to refine AI's ability to produce coherent and contextually relevant outputs.
— via World Pulse Now AI Editorial System
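
For context, the snippet below sketches the standard group-relative advantage computation that GRPO-style methods build on; the self-paced reward refresh is noted only in a comment, since the summary does not specify the paper's schedule, and the names and numbers are illustrative rather than the authors' code.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: z-score each sample's reward within its
    group (all samples generated for the same prompt)."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean(axis=-1, keepdims=True)) / (r.std(axis=-1, keepdims=True) + eps)

# Toy example: rewards for 2 prompts x 4 video samples each. In the
# self-paced setting described above, these scores would come from a
# reward model that evolves with the generator; the summary does not
# specify how that refresh works, so it is omitted here.
rewards = np.array([[0.2, 0.5, 0.4, 0.9],
                    [0.7, 0.1, 0.3, 0.6]])
print(group_relative_advantages(rewards))
```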


Continue Reading
AVATAR: Reinforcement Learning to See, Hear, and Reason Over Video
Positive · Artificial Intelligence
AVATAR is a new reinforcement learning framework that aims to enhance multimodal reasoning over long-horizon video by addressing key limitations of existing methods such as Group Relative Policy Optimization (GRPO). It improves sample efficiency and resolves issues such as vanishing advantages and uniform credit assignment through an off-policy training architecture.
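
As a rough illustration of the off-policy ingredient mentioned above, the snippet below reweights cached samples with clipped importance ratios; this is a generic sketch under that assumption, not AVATAR's published training recipe, and it does not reproduce the framework's specific fixes for advantage vanishing or credit assignment.

```python
import numpy as np

def offpolicy_group_update(rewards, logp_new, logp_old, clip=0.2, eps=1e-8):
    """Generic off-policy ingredient, for illustration only: compute
    group-relative advantages as usual, then reweight each stale sample
    by a clipped importance ratio pi_new / pi_old so the batch can be
    reused after the policy has moved on."""
    r = np.asarray(rewards, dtype=np.float64)
    adv = (r - r.mean()) / (r.std() + eps)
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    return np.clip(ratio, 1.0 - clip, 1.0 + clip) * adv

# Toy usage: four cached samples scored under an old policy are reused
# with the current policy's log-probabilities.
print(offpolicy_group_update(rewards=[0.2, 0.9, 0.4, 0.6],
                             logp_new=[-1.0, -0.7, -1.3, -0.9],
                             logp_old=[-1.1, -0.9, -1.2, -1.0]))
```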
Learning What to Trust: Bayesian Prior-Guided Optimization for Visual Generation
Positive · Artificial Intelligence
Bayesian Prior-Guided Optimization (BPGO) extends Group Relative Policy Optimization (GRPO) to address the inherent ambiguity of visual generation tasks. BPGO introduces a semantic prior anchor that models reward uncertainty, allowing optimization to emphasize reliable feedback while down-weighting ambiguous signals.
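
The summary suggests rewards are treated as uncertain observations anchored by a semantic prior. A minimal sketch of that idea is shown below, using a standard Gaussian precision-weighted fusion as a stand-in for BPGO's actual formulation; the variances and the prior are assumed inputs.

```python
import numpy as np

def prior_guided_rewards(rewards, prior_mean, reward_var, prior_var):
    """Precision-weighted fusion (standard Gaussian posterior mean):
    noisy per-sample rewards are pulled toward a semantic prior, with
    high-variance (ambiguous) rewards pulled the hardest."""
    r = np.asarray(rewards, dtype=np.float64)
    obs_prec = 1.0 / np.asarray(reward_var, dtype=np.float64)
    prior_prec = 1.0 / prior_var
    return (obs_prec * r + prior_prec * prior_mean) / (obs_prec + prior_prec)

# Ambiguous samples (large reward_var) are shrunk toward the prior and
# therefore contribute less spread to the group-relative advantage.
r = prior_guided_rewards([0.9, 0.2, 0.6], prior_mean=0.5,
                         reward_var=[0.01, 1.0, 0.05], prior_var=0.1)
adv = (r - r.mean()) / (r.std() + 1e-8)
print(r, adv)
```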
Training-Free Efficient Video Generation via Dynamic Token Carving
Positive · Artificial Intelligence
A new inference pipeline named Jenga has been introduced to enhance the efficiency of video generation using Video Diffusion Transformer (DiT) models. This approach addresses the computational challenges associated with self-attention and the multi-step nature of diffusion models by employing dynamic attention carving and progressive resolution generation.
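
A rough sketch of the "attention carving" idea, interpreted here as keeping only the most attended key tokens, appears below; the progressive-resolution ingredient is only mentioned in a comment, and Jenga's actual block-wise algorithm and schedule may differ, so treat this purely as an illustration.

```python
import numpy as np

def carve_attention_tokens(q, k, keep_ratio=0.25):
    """Keep only the key tokens that receive the most attention mass
    from the queries; a generic sparsification sketch, not Jenga's
    exact block-wise carving."""
    logits = q @ k.T / np.sqrt(q.shape[-1])            # (Nq, Nk)
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over keys
    importance = weights.mean(axis=0)                  # per-key score
    n_keep = max(1, int(keep_ratio * k.shape[0]))
    return np.argsort(importance)[::-1][:n_keep]       # indices to keep

# Toy usage: 64 query tokens attend over 1024 key tokens; roughly a
# quarter of the keys are retained for the expensive full attention.
# (Progressive resolution would additionally run early denoising steps
# on a coarser latent grid.)
q = np.random.randn(64, 32)
k = np.random.randn(1024, 32)
print(carve_attention_tokens(q, k).shape)  # (256,)
```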
The Alignment Paradox of Medical Large Language Models in Infertility Care: Decoupling Algorithmic Improvement from Clinical Decision-making Quality
Neutral · Artificial Intelligence
A recent study evaluated the alignment of large language models (LLMs) in infertility care, assessing four strategies: Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), Group Relative Policy Optimization (GRPO), and In-Context Learning (ICL). The findings revealed that GRPO achieved the highest algorithmic accuracy, while clinicians preferred SFT for its clearer reasoning and therapeutic feasibility.
EgoVITA: Learning to Plan and Verify for Egocentric Video Reasoning
Positive · Artificial Intelligence
EgoVITA has been introduced as a reinforcement learning framework designed to enhance the reasoning capabilities of multimodal large language models (MLLMs) by enabling them to plan and verify actions from both egocentric and exocentric perspectives. This dual-phase approach allows the model to predict future actions from a first-person viewpoint and subsequently verify these actions from a third-person perspective, addressing challenges in understanding dynamic visual contexts.
Synthetic Curriculum Reinforces Compositional Text-to-Image Generation
Positive · Artificial Intelligence
A novel compositional curriculum reinforcement learning framework named CompGen has been proposed to enhance text-to-image (T2I) generation, addressing the challenges of accurately rendering complex scenes with multiple objects and intricate relationships. This framework utilizes scene graphs to establish a difficulty criterion for compositional ability and employs an adaptive Markov Chain Monte Carlo graph sampling algorithm to optimize T2I models through reinforcement learning.
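
To make the curriculum idea concrete, the sketch below runs a minimal Metropolis-Hastings walk that concentrates samples around a target difficulty; it collapses the scene graph to two counts and uses an assumed difficulty criterion, so it is a hypothetical reading of the summary rather than CompGen's sampler.

```python
import numpy as np

def difficulty(n_objects, n_relations):
    # Assumed criterion: compositional load grows with the number of
    # objects and the relations among them.
    return n_objects + 2 * n_relations

def mcmc_curriculum(target, steps=200, seed=0):
    """Minimal Metropolis-Hastings walk over (objects, relations)
    counts, accepting moves so that sampled prompt specifications
    cluster around the curriculum's current target difficulty."""
    rng = np.random.default_rng(seed)
    state, samples = (2, 1), []
    log_p = lambda s: -0.5 * (difficulty(*s) - target) ** 2
    for _ in range(steps):
        prop = (state[0] + int(rng.integers(-1, 2)),
                state[1] + int(rng.integers(-1, 2)))
        if prop[0] >= 1 and prop[1] >= 0:               # stay in-domain
            if np.log(rng.random()) < log_p(prop) - log_p(state):
                state = prop
        samples.append(state)
    return samples

print(mcmc_curriculum(target=8)[-5:])  # scene sizes near difficulty 8
```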
Seeing What Matters: Visual Preference Policy Optimization for Visual Generation
Positive · Artificial Intelligence
A new approach called Visual Preference Policy Optimization (ViPO) has been introduced to enhance visual generative models by utilizing structured, pixel-level feedback instead of traditional scalar rewards. This method aims to improve the alignment of generated images and videos with human preferences by focusing on perceptually significant areas, thus addressing limitations in existing Group Relative Policy Optimization (GRPO) frameworks.
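
A minimal sketch of pixel-level, perceptually weighted feedback as the summary describes it is given below; the saliency map and the pooling rule are assumptions for illustration, not ViPO's published objective.

```python
import numpy as np

def perceptually_weighted_reward(reward_map, saliency_map, eps=1e-8):
    """Keep a per-pixel reward map instead of a single scalar and pool
    it with weights that emphasize perceptually significant regions.
    The saliency map is an assumed input (e.g. from any off-the-shelf
    saliency or perceptual-error model)."""
    w = saliency_map / (saliency_map.sum() + eps)   # normalized weights
    return float((w * reward_map).sum())            # weighted pooling

# Two images with the same mean pixel reward but errors in different
# places: the one whose errors fall on salient pixels scores lower.
saliency = np.zeros((4, 4)); saliency[1:3, 1:3] = 1.0  # salient center
good = np.ones((4, 4)); good[0, 0] = 0.0               # error off-center
bad = np.ones((4, 4)); bad[1, 1] = 0.0                 # error on-center
print(perceptually_weighted_reward(good, saliency),
      perceptually_weighted_reward(bad, saliency))     # 1.0 vs 0.75
```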
Neighbor GRPO: Contrastive ODE Policy Optimization Aligns Flow Models
Positive · Artificial Intelligence
The introduction of Neighbor Group Relative Policy Optimization (GRPO) presents a significant advancement in aligning flow models with human preferences by eliminating the need for Stochastic Differential Equations (SDEs). This novel algorithm generates diverse candidate trajectories through perturbation, enhancing the efficiency of the alignment process.
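
A small sketch of the perturbation idea follows, assuming neighbors are formed by jittering a shared starting latent before deterministic ODE sampling; the actual perturbation scheme in the paper may differ, and the reward here is a dummy stand-in.

```python
import numpy as np

def neighbor_candidates(x0, n_neighbors=8, sigma=0.1, seed=0):
    """Form a group of candidates by perturbing a shared starting
    latent, then (conceptually) running the deterministic ODE sampler
    on each neighbor instead of injecting SDE noise during sampling.
    The ODE solver itself is abstracted away here."""
    rng = np.random.default_rng(seed)
    return [x0 + sigma * rng.standard_normal(x0.shape)
            for _ in range(n_neighbors)]

def group_relative_advantages(rewards, eps=1e-8):
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Toy usage: perturbed latents stand in for full ODE trajectories, and
# a dummy reward (negative norm) stands in for a preference score.
x0 = np.zeros((16,))
candidates = neighbor_candidates(x0)
rewards = [-np.linalg.norm(c) for c in candidates]
print(group_relative_advantages(rewards))
```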