Beyond Reward Margin: Rethinking and Resolving Likelihood Displacement in Diffusion Models via Video Generation

arXiv — cs.CV · Tuesday, November 25, 2025 at 5:00:00 AM
  • A recent study highlights a limitation of Direct Preference Optimization (DPO) in diffusion models: likelihood displacement, where the likelihood of preferred samples decreases during training even as the reward margin between preferred and dispreferred samples grows (see the sketch below this summary). This phenomenon can lead to suboptimal performance in video generation tasks, which are increasingly relevant in AI applications.
  • Addressing likelihood displacement matters because it directly affects how well generative models align with human preferences. Resolving it could make DPO-based fine-tuning more effective for video generation, benefiting the many sectors that rely on AI-generated content.
  • The challenges associated with DPO reflect broader issues in reinforcement learning and multimodal learning frameworks, where traditional methods often struggle with cold starts and preference robustness. These ongoing debates highlight the need for innovative approaches to optimize AI models, ensuring they meet complex human demands across diverse applications.
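
For readers unfamiliar with the mechanism, the sketch below shows the standard DPO loss written in terms of sample log-likelihoods, plus a toy case of likelihood displacement: the preference margin improves even though the preferred sample's log-likelihood has fallen below its reference value. The function and tensor names are illustrative assumptions, and in diffusion models the log-likelihoods would in practice be approximated (e.g., via a denoising ELBO over sampled timesteps); this is a minimal sketch, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for a single preference pair (illustrative sketch).

    logp_w, logp_l         -- policy log-likelihoods of the preferred ("winner")
                              and dispreferred ("loser") samples
    ref_logp_w, ref_logp_l -- the same quantities under the frozen reference model

    The implicit rewards are beta * (logp - ref_logp); the loss only pushes their
    margin apart, so the preferred likelihood itself is free to drop as long as
    the dispreferred likelihood drops faster.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin)

# Toy case of likelihood displacement: both log-likelihoods fall relative to the
# reference, but the loser falls faster, so the margin (and the loss) still improves.
ref_w, ref_l = torch.tensor(-10.0), torch.tensor(-12.0)
logp_w, logp_l = torch.tensor(-11.0), torch.tensor(-16.0)
loss = dpo_loss(logp_w, logp_l, ref_w, ref_l)
print(loss.item())  # ~0.55, below the zero-margin value log(2) ~= 0.69,
                    # even though logp_w has dropped below ref_w
```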
— via World Pulse Now AI Editorial System


Continue Reading
Systematic Reward Gap Optimization for Mitigating VLM Hallucinations
Positive · Artificial Intelligence
A novel framework called Topic-level Preference Rewriting (TPR) has been introduced to systematically optimize reward gaps in Vision Language Models (VLMs), addressing the challenges of hallucinations during data curation. This method focuses on selectively replacing semantic topics within VLM responses to enhance the accuracy of generated outputs.
ExPO-HM: Learning to Explain-then-Detect for Hateful Meme Detection
Positive · Artificial Intelligence
ExPO-HM (Explain-then-Detect Policy Optimization for Hateful Memes) has been proposed to enhance the detection of hateful memes, addressing limitations in existing models that primarily provide binary predictions without context. This new approach aims to incorporate reasoning similar to human annotators, improving the understanding of policy-relevant cues such as targets and attack types.
BideDPO: Conditional Image Generation with Simultaneous Text and Condition Alignment
Positive · Artificial Intelligence
A new framework named BideDPO has been proposed to enhance conditional image generation by addressing conflicts between text prompts and conditioning images. This method utilizes a bidirectionally decoupled approach to optimize the alignment of text and conditions, aiming to reduce gradient entanglement that hampers performance in existing models.
Multi-Value Alignment for LLMs via Value Decorrelation and Extrapolation
Positive · Artificial Intelligence
A new framework called Multi-Value Alignment (MVA) has been proposed to address the challenges of aligning large language models (LLMs) with multiple human values, particularly when these values conflict. This framework aims to improve the stability and efficiency of multi-value optimization, overcoming limitations seen in existing methods like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO).
The Alignment Paradox of Medical Large Language Models in Infertility Care: Decoupling Algorithmic Improvement from Clinical Decision-making Quality
Neutral · Artificial Intelligence
A recent study evaluated the alignment of large language models (LLMs) in infertility care, assessing four strategies: Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), Group Relative Policy Optimization (GRPO), and In-Context Learning (ICL). The findings revealed that GRPO achieved the highest algorithmic accuracy, while clinicians preferred SFT for its clearer reasoning and therapeutic feasibility.