Transformer Key-Value Memories Are Nearly as Interpretable as Sparse Autoencoders

arXiv — stat.ML — Tuesday, October 28, 2025 at 4:00:00 AM
Recent research finds that the key-value memories in transformer feed-forward layers are nearly as interpretable as the features recovered by sparse autoencoders, a notable result for interpretability work on large language models. This matters because understanding how these models learn and represent features can inform better model design and deployment across a range of tasks.
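The "key-value memory" view referenced here is the standard reading of a transformer feed-forward layer: each hidden unit pairs a key row (which detects an input pattern) with a value row (which writes an output direction). A minimal sketch of that view, with toy dimensions and random weights chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32  # toy dimensions, not from the paper

# Feed-forward weights: each of the d_ff "memories" is a (key, value)
# pair -- one key row in K and one value row in V.
K = rng.normal(size=(d_ff, d_model))  # keys: detect input patterns
V = rng.normal(size=(d_ff, d_model))  # values: write output directions

def ffn(x):
    """Transformer FFN viewed as a key-value memory:
    coefficients m = relu(x @ K.T), output = m @ V."""
    m = np.maximum(x @ K.T, 0.0)  # how strongly each key matches x
    return m @ V                  # weighted sum of value vectors

x = rng.normal(size=d_model)
m = np.maximum(x @ K.T, 0.0)
# Interpreting a memory means inspecting which inputs most activate a
# key and which output direction its value vector writes.
top = np.argsort(m)[::-1][:3]
print("most active memories:", top)
```

Interpretability questions then reduce to asking what each key responds to and what each value promotes, which is why the comparison to sparse-autoencoder features is natural.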
— via World Pulse Now AI Editorial System


Continue Reading
Intervene-All-Paths: Unified Mitigation of LVLM Hallucinations across Alignment Formats
Positive — Artificial Intelligence
A new study introduces the Intervene-All-Paths framework, aimed at mitigating hallucinations in Large Vision-Language Models (LVLMs) by addressing the interplay of various causal pathways. This research highlights that hallucinations stem from multiple sources, including image-to-input-text and text-to-text interactions, and proposes targeted interventions for different question-answer alignment formats.
Predicting the Formation of Induction Heads
Neutral — Artificial Intelligence
A recent study has explored the formation of induction heads (IHs) in language models, revealing that their development is influenced by training data properties such as batch size and context size. The research indicates that high bigram repetition frequency and reliability are critical for IH formation, while at low repetition levels the categoriality and marginal shape of the token distribution become the deciding factors.
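An induction head is commonly described as a match-and-copy circuit: having seen [A][B] earlier in the context, it predicts [B] when [A] recurs. A toy sketch of that rule, plus the bigram-repetition statistic the study ties to IH formation (both functions are illustrative, not the paper's code):

```python
def induction_predict(tokens):
    """Toy match-and-copy rule an induction head is thought to learn:
    find the most recent earlier occurrence of the current token and
    predict the token that followed it. Returns None if no match."""
    cur = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == cur:
            return tokens[i + 1]
    return None

def bigram_repetition_rate(tokens):
    """Fraction of bigrams that repeat earlier in the sequence --
    the data property the study links to induction-head formation."""
    bigrams = list(zip(tokens, tokens[1:]))
    seen, repeats = set(), 0
    for bg in bigrams:
        if bg in seen:
            repeats += 1
        seen.add(bg)
    return repeats / len(bigrams)

print(induction_predict(list("abca")))       # -> 'b' ('a' was followed by 'b')
print(bigram_repetition_rate(list("abcabc")))  # -> 0.4 (2 of 5 bigrams repeat)
```

Sequences with a high repetition rate give the model many opportunities to benefit from this copy rule, which is consistent with the study's finding that repetition frequency drives IH formation.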
GCL-OT: Graph Contrastive Learning with Optimal Transport for Heterophilic Text-Attributed Graphs
Positive — Artificial Intelligence
GCL-OT, a novel graph contrastive learning framework, has been introduced to enhance the performance of text-attributed graphs, particularly those exhibiting heterophily. This method addresses limitations in existing approaches that rely on homophily assumptions, which can hinder the effective alignment of textual and structural data. The framework identifies various forms of heterophily, enabling more flexible and bidirectional alignment between graph structures and text embeddings.
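The "flexible and bidirectional alignment" that optimal transport enables can be illustrated with a minimal entropic-OT (Sinkhorn) sketch: instead of forcing hard one-to-one matches between text and structure embeddings, a transport plan distributes mass softly in both directions. The setup below (uniform marginals, cosine cost, random toy embeddings) is an assumption for illustration, not GCL-OT's actual formulation:

```python
import numpy as np

def sinkhorn(cost, reg=0.1, n_iter=200):
    """Entropic optimal transport between two uniform distributions.
    Returns a near-doubly-stochastic transport plan that softly matches
    rows (e.g. text embeddings) to columns (e.g. structure embeddings)."""
    n, m = cost.shape
    a, b = np.ones(n) / n, np.ones(m) / m  # uniform marginals
    K = np.exp(-cost / reg)                # Gibbs kernel
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iter):                # alternating marginal scaling
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
text = rng.normal(size=(5, 16))    # toy text embeddings
struct = rng.normal(size=(5, 16))  # toy structural embeddings
# Cost = 1 - cosine similarity: similar pairs are cheap to transport.
tn = text / np.linalg.norm(text, axis=1, keepdims=True)
sn = struct / np.linalg.norm(struct, axis=1, keepdims=True)
plan = sinkhorn(1.0 - tn @ sn.T)
print(plan.round(3))  # soft alignment: each row spreads mass over columns
```

Because the plan's rows and columns both sum to fixed marginals, the alignment is bidirectional by construction, which is what lets it tolerate heterophily where a hard homophily-based matching would fail.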