Synthetic Eggs in Many Baskets: The Impact of Synthetic Data Diversity on LLM Fine-Tuning

arXiv — cs.CL · Tuesday, November 4, 2025, 5:00 AM
A recent study examines how drawing synthetic data from many diverse sources, rather than a single generator, affects the fine-tuning of large language models. The results show that source diversity helps mitigate distribution collapse and improves adversarial robustness. As synthetic data becomes more prevalent in AI development, understanding these effects is a step toward more reliable and effective language models.
— via World Pulse Now AI Editorial System
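
To make the core idea concrete, here is a minimal sketch in Python (not the study's actual method; the sources, samples, and metric are hypothetical) of pooling fine-tuning data from several synthetic generators, using a distinct-n-gram ratio as a crude proxy for the diversity the paper argues matters:

import random

def distinct_ngrams(texts, n=2):
    """Fraction of unique n-grams in a corpus: a crude diversity proxy."""
    grams = []
    for t in texts:
        toks = t.split()
        grams.extend(zip(*(toks[i:] for i in range(n))))
    return len(set(grams)) / max(len(grams), 1)

def mix_sources(sources, k_per_source, seed=0):
    """Interleave samples from several synthetic generators into one pool."""
    rng = random.Random(seed)
    pool = []
    for texts in sources:
        pool.extend(rng.sample(texts, min(k_per_source, len(texts))))
    rng.shuffle(pool)
    return pool

# Hypothetical synthetic sources, e.g. outputs of three different teacher models.
source_a = ["the cat sat on the mat"] * 50
source_b = ["a dog ran through the park"] * 50
source_c = ["models trained on mixed data generalize"] * 50

single = source_a[:90]
mixed = mix_sources([source_a, source_b, source_c], k_per_source=30)

print("distinct bigrams, single source:", distinct_ngrams(single))
print("distinct bigrams, mixed sources:", distinct_ngrams(mixed))

A collapsed pool repeats the same few patterns, so its distinct-bigram ratio stays low; the mixed pool scores higher, which is the kind of signal one would monitor when curating synthetic fine-tuning sets.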

Continue Reading
Predicting the Formation of Induction Heads
Neutral · Artificial Intelligence
A recent study has explored the formation of induction heads (IHs) in language models, showing that their development depends on training conditions such as batch size and context size as well as statistical properties of the training data. The research indicates that high bigram repetition frequency and reliability are critical for IH formation; when these are low, the categoriality and marginal distribution shape of the data must also be taken into account.
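
Since the summary hinges on bigram repetition frequency, here is a short illustrative sketch (an assumption about the intended statistic, not the paper's code) of measuring how often a bigram recurs within a single context, the pattern an induction head exploits ([A][B] ... [A] -> predict [B]):

def bigram_repetition_frequency(tokens):
    """Fraction of bigrams in a context that already occurred earlier in
    the same context: the pattern an induction head can exploit."""
    seen = set()
    repeats = 0
    total = 0
    for a, b in zip(tokens, tokens[1:]):
        total += 1
        if (a, b) in seen:
            repeats += 1
        seen.add((a, b))
    return repeats / max(total, 1)

ctx = "the quick fox saw the quick fox jump over the quick fox".split()
print(bigram_repetition_frequency(ctx))  # high repetition favors IH formation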
GCL-OT: Graph Contrastive Learning with Optimal Transport for Heterophilic Text-Attributed Graphs
Positive · Artificial Intelligence
GCL-OT, a novel graph contrastive learning framework, has been introduced to enhance the performance of text-attributed graphs, particularly those exhibiting heterophily. This method addresses limitations in existing approaches that rely on homophily assumptions, which can hinder the effective alignment of textual and structural data. The framework identifies various forms of heterophily, enabling more flexible and bidirectional alignment between graph structures and text embeddings.
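
As a rough illustration of the optimal-transport ingredient (not GCL-OT itself; the embeddings below are random placeholders), a Sinkhorn iteration yields a soft, bidirectional alignment between text embeddings and structural embeddings:

import numpy as np

def sinkhorn(cost, reg=0.1, iters=200):
    """Entropy-regularized optimal transport returning a soft transport plan."""
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)  # uniform marginals
    K = np.exp(-cost / reg)
    u = np.ones(n)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
text_emb = rng.normal(size=(8, 16))    # hypothetical text embeddings
struct_emb = rng.normal(size=(8, 16))  # hypothetical structural embeddings

# Cosine-distance cost matrix between the two views of the same nodes.
tn = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
sn = struct_emb / np.linalg.norm(struct_emb, axis=1, keepdims=True)
cost = 1.0 - tn @ sn.T

plan = sinkhorn(cost)
print(plan.shape, plan.sum())  # (8, 8), sums to ~1: a soft node-to-node alignment

Unlike a hard one-to-one matching, the soft plan can spread mass across several plausible partners, which is what makes OT-style alignment flexible under heterophily.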
Non-Parametric Probabilistic Robustness: A Conservative Metric with Optimized Perturbation Distributions
Positive · Artificial Intelligence
A new approach to probabilistic robustness in deep learning, termed non-parametric probabilistic robustness (NPPR), has been proposed, which learns optimized perturbation distributions directly from data rather than relying on fixed distributions. This method aims to enhance the evaluation of model robustness under distributional uncertainty, addressing a significant limitation in existing probabilistic robustness frameworks.
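
A minimal sketch of the underlying quantity (the toy model, input, and distributions are hypothetical; NPPR's actual contribution is learning the perturbation distribution from data, which is merely hand-picked here): probabilistic robustness as the Monte Carlo probability that a prediction survives sampled perturbations:

import numpy as np

def prob_robustness(model, x, y, sample_fn, n=1000, seed=0):
    """Monte Carlo estimate of P(model(x + delta) == y) for delta ~ sample_fn."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n):
        delta = sample_fn(rng, x.shape)
        if model(x + delta) == y:
            hits += 1
    return hits / n

# Hypothetical 1-D threshold classifier and input.
model = lambda x: int(x.sum() > 0.0)
x = np.array([0.3, -0.1])
y = model(x)

# Fixed isotropic Gaussian vs. a hand-picked reshaped distribution;
# NPPR's point is to *learn* such parameters from data.
fixed = lambda rng, shape: rng.normal(0.0, 0.3, size=shape)
reshaped = lambda rng, shape: rng.normal(0.0, [0.6, 0.1], size=shape)

print("fixed distribution:   ", prob_robustness(model, x, y, fixed))
print("reshaped distribution:", prob_robustness(model, x, y, reshaped))

A distribution optimized to concentrate mass where the model is fragile drives this estimate down, which is what makes the resulting metric conservative.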