g-DPO: Scalable Preference Optimization for Protein Language Models

arXiv — cs.LG · Thursday, November 27, 2025, 5:00 AM
  • g-DPO is a scalable framework for Direct Preference Optimization (DPO) that targets a training bottleneck for protein language models: when preferences are constructed from assayed sequences, the number of candidate training pairs grows roughly quadratically with dataset size. By clustering sequences in sequence space and applying group-based approximations to the pairwise objective, g-DPO significantly reduces training time while maintaining performance across a range of protein engineering tasks (an illustrative sketch follows the summary below).
  • This advancement is crucial for researchers and developers in the field of protein engineering, as it allows for more efficient alignment of protein language models with experimental design goals, potentially accelerating the pace of biotechnological innovations.
  • The development of g-DPO reflects a broader trend in artificial intelligence where optimizing computational efficiency is essential. Similar frameworks, such as BideDPO and Multi-Value Alignment, also aim to enhance model performance while addressing inherent challenges, indicating a growing focus on refining optimization techniques across diverse AI applications.
— via World Pulse Now AI Editorial System
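
The article does not spell out g-DPO's exact construction, so the following is only a minimal sketch of one plausible reading of "sequence space clustering with group-based approximations": sequences are clustered in an embedding space, and preference pairs are formed between adjacent fitness ranks within each cluster rather than across all sequence pairs. The k-means clustering, the within-cluster pairing rule, the per-sequence log-probability inputs, and all function names are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch: group-based DPO over clustered protein sequences.
# The clustering/pairing scheme is an assumption, not the published g-DPO algorithm.
import numpy as np
from sklearn.cluster import KMeans


def dpo_loss(policy_logp_w, ref_logp_w, policy_logp_l, ref_logp_l, beta=0.1):
    """Standard DPO loss for a batch of (preferred, dispreferred) log-probabilities."""
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # numerically stable -log(sigmoid(margin))
    return np.mean(np.logaddexp(0.0, -margin))


def grouped_preference_pairs(embeddings, fitness, n_clusters=8, seed=0):
    """Cluster sequences in embedding space, then pair adjacent fitness ranks
    within each cluster: O(N) pairs instead of the O(N^2) all-pairs construction."""
    labels = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(embeddings)
    pairs = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        if len(idx) < 2:
            continue
        order = idx[np.argsort(-fitness[idx])]  # best-to-worst within the cluster
        pairs.extend(zip(order[:-1], order[1:]))  # pair each member with the next-ranked one
    return pairs


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 256
    emb = rng.normal(size=(n, 32))                           # stand-in sequence embeddings
    fit = rng.normal(size=n)                                 # stand-in assay fitness scores
    policy_logp = rng.normal(size=n)                         # per-sequence policy log-probs
    ref_logp = policy_logp + rng.normal(scale=0.1, size=n)   # reference-model log-probs

    pairs = grouped_preference_pairs(emb, fit)
    w = np.array([p[0] for p in pairs])
    l = np.array([p[1] for p in pairs])
    print(f"{len(pairs)} grouped pairs, DPO loss = "
          f"{dpo_loss(policy_logp[w], ref_logp[w], policy_logp[l], ref_logp[l]):.4f}")
```

The point of the grouping step is purely combinatorial: restricting pairs to within-cluster neighbors keeps the number of comparisons linear in the number of sequences, which is where the claimed training-time savings would come from under this reading.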


Continue Reading
Adaptive Margin RLHF via Preference over Preferences
Positive · Artificial Intelligence
A new approach in reinforcement learning from human feedback (RLHF) has been proposed, focusing on adaptive margin optimization through modeling preferences over preferences. This method aims to enhance generalization and robustness in classification tasks by addressing the limitations of existing margin-based optimization techniques, which often overlook the varying strengths of preferences.
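
The summary leaves the margin construction unspecified; as a rough illustration only, the sketch below scales a per-pair margin in a Bradley-Terry-style preference loss by an estimated preference-strength score, so that strongly held preferences demand a larger reward gap. The strength estimates, the scaling rule, and all names are assumptions, not the method proposed in that paper.

```python
# Hypothetical sketch of a margin-based preference loss with a per-pair adaptive margin.
import numpy as np


def adaptive_margin_loss(reward_chosen, reward_rejected, strength, base_margin=1.0, beta=1.0):
    """-log sigmoid(beta * (r_chosen - r_rejected - m_i)) with m_i = base_margin * strength_i.

    strength in [0, 1]: weakly held preferences demand a small reward gap,
    strongly held ones a large gap."""
    margin = base_margin * strength
    logits = beta * (reward_chosen - reward_rejected - margin)
    return np.mean(np.logaddexp(0.0, -logits))  # numerically stable -log(sigmoid(logits))


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    r_c = rng.normal(loc=0.5, size=128)    # reward scores for chosen responses
    r_r = rng.normal(loc=0.0, size=128)    # reward scores for rejected responses
    strength = rng.uniform(size=128)       # stand-in preference-strength estimates
    print(f"adaptive-margin loss = {adaptive_margin_loss(r_c, r_r, strength):.4f}")
```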