Dense SAE Latents Are Features, Not Bugs
Artificial Intelligence
Recent research on sparse autoencoders (SAEs) suggests that dense latents, those that activate on a large fraction of input tokens and are often dismissed as training artifacts, may serve systematic functions in language models. The study examines the geometry and functionality of these dense latents, challenging the view that they are mere byproducts of the training process. Understanding their role matters for the interpretability and effectiveness of language models, with implications for applications across natural language processing.
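To make the notion of a "dense latent" concrete, the sketch below shows one common way such latents are identified: by measuring each latent's activation density, the fraction of tokens on which it fires. This is a minimal illustration assuming a matrix of nonnegative SAE activations; the function name, the toy data, and the 10% cutoff are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch: flagging "dense" SAE latents by activation density.
# Assumes latent_acts holds nonnegative post-ReLU SAE activations.
import numpy as np

def latent_densities(latent_acts: np.ndarray) -> np.ndarray:
    """latent_acts: (n_tokens, n_latents) nonnegative SAE activations.
    Returns the fraction of tokens on which each latent is nonzero."""
    return (latent_acts > 0).mean(axis=0)

# Toy example: 3 latents over 1000 tokens, one deliberately dense.
rng = np.random.default_rng(0)
fire_rates = [0.01, 0.05, 0.5]  # illustrative per-latent firing probabilities
acts = rng.random((1000, 3)) * (rng.random((1000, 3)) < fire_rates)

density = latent_densities(acts)
dense_mask = density > 0.1  # a common heuristic cutoff; the exact threshold varies
print(density)     # roughly [0.01, 0.05, 0.5]
print(dense_mask)  # [False, False, True]: only the last latent counts as dense
```

Under this heuristic, most SAE latents fire rarely by design, so latents with unusually high density stand out and are the candidates the paper argues carry genuine function rather than noise.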
— via World Pulse Now AI Editorial System
