Contrastive Knowledge Transfer and Robust Optimization for Secure Alignment of Large Language Models

arXiv — cs.CL · Monday, November 3, 2025 at 5:00:00 AM
A new paper on arXiv proposes an approach to improving the safety and robustness of large language models. By combining contrastive knowledge distillation with noise-robust training, the authors report gains in alignment accuracy and semantic consistency. The work targets known weaknesses in current alignment pipelines, with the aim of producing safer and more reliable model behavior.
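The summary above names two ingredients, contrastive distillation and noise-robust training, without giving the paper's objective. As an illustrative sketch only (not the authors' actual loss; the function names, the noisy-teacher negative, and the `alpha` weighting are all assumptions), one common way to combine the two ideas is to pull a student distribution toward a clean teacher while pushing it away from a noise-perturbed teacher:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a 1-D logit vector."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

def kl(p, q, eps=1e-12):
    """KL divergence KL(p || q) between two probability vectors."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def contrastive_distill_loss(student_logits, teacher_logits,
                             noisy_teacher_logits,
                             temperature=2.0, alpha=0.5):
    """Hypothetical contrastive-distillation objective:
    minimize divergence from the clean teacher (pull term)
    while maximizing divergence from a noise-corrupted
    teacher (push term), weighted by alpha."""
    s = softmax(student_logits, temperature)
    t_pos = softmax(teacher_logits, temperature)        # positive target
    t_neg = softmax(noisy_teacher_logits, temperature)  # contrastive negative
    pull = kl(t_pos, s)   # match the clean teacher
    push = kl(t_neg, s)   # stay away from the noisy teacher
    return pull - alpha * push
```

Under this sketch, a student that matches the clean teacher scores lower (better) than one that matches the noisy teacher, which is the qualitative behavior a contrastive term is meant to enforce.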
— via World Pulse Now AI Editorial System


Continue Reading
Counterfactual World Models via Digital Twin-conditioned Video Diffusion
Positive · Artificial Intelligence
A new framework for counterfactual world models has been introduced that predicts temporal sequences under hypothetical modifications to observed scene properties. This extends traditional world models, which focus solely on factual observations, enabling a more nuanced understanding of environments through forward simulation.