arXiv:2510.23802v1 Announce Type: new 
Abstract: While sparse autoencoders (SAEs) successfully extract interpretable features from language models, applying them to audio generation faces unique challenges: audio's dense nature requires compression that obscures semantic meaning, and automatic feature characterization remains limited. We propose a framework for interpreting audio generative models by mapping their latent representations to human-interpretable acoustic concepts. We train SAEs on audio autoencoder latents, then learn linear mappings from SAE features to discretized acoustic properties (pitch, amplitude, and timbre). This enables both controllable manipulation and analysis of the AI music generation process, revealing how acoustic properties emerge during synthesis. We validate our approach on continuous (DiffRhythm-VAE) and discrete (EnCodec, WavTokenizer) audio latent spaces, and analyze DiffRhythm, a state-of-the-art text-to-music model, to demonstrate how pitch, timbre, and loudness evolve throughout generation. While our work is only done on audio modality, our framework can be extended to interpretable analysis of visual latent space generation models.

تقدم دراسة جديدة إطارًا يعزز فهمنا لنماذج توليد الصوت من خلال استخدام مشفرات متفرقة لربط بيانات الصوت المعقدة بمفاهيم يمكن تفسيرها من قبل البشر. هذا مهم لأنه يعالج التحديات المتعلقة باستخراج ميزات ذات معنى من الصوت، الذي غالبًا ما يكون كثيفًا وصعب التحليل. من خلال سد الفجوة بين توليد الصوت التقني والفهم البشري، يمكن أن تؤدي هذه الأبحاث إلى تقدم في تكنولوجيا الصوت وتطبيقات في مجالات متعددة.

Un nuevo estudio presenta un marco que mejora nuestra comprensión de los modelos generativos de audio al utilizar autoencoders dispersos para mapear datos de audio complejos a conceptos interpretables por humanos. Esto es significativo porque aborda los desafíos de extraer características significativas del audio, que a menudo es denso y difícil de analizar. Al cerrar la brecha entre la generación de audio técnica y la comprensión humana, esta investigación podría conducir a avances en la tecnología de audio y aplicaciones en varios campos.

Une nouvelle étude présente un cadre qui améliore notre compréhension des modèles génératifs audio en utilisant des autoencodeurs épars pour mapper des données audio complexes à des concepts interprétables par l'homme. Cela est important car cela répond aux défis d'extraction de caractéristiques significatives à partir de l'audio, qui est souvent dense et difficile à analyser. En comblant le fossé entre la génération audio technique et la compréhension humaine, cette recherche pourrait conduire à des avancées dans la technologie audio et des applications dans divers domaines.

A new study introduces a framework that enhances our understanding of audio generative models by using sparse autoencoders to map complex audio data to human-interpretable concepts. This is significant because it addresses the challenges of extracting meaningful features from audio, which is often dense and difficult to analyze. By bridging the gap between technical audio generation and human comprehension, this research could lead to advancements in audio technology and applications in various fields.

Learning Interpretable Features in Audio Latent Spaces via Sparse Autoencoders

arXiv:2511.16893v1 Announce Type: new 
Abstract: Arguably, specialized attention heads dubbed induction heads (IHs) underlie the remarkable in-context learning (ICL) capabilities of modern language models (LMs); yet, a precise characterization of their formation remains unclear. In this study, we investigate the relationship between statistical properties of training data (for both natural and synthetic data) and IH formation. We show that (1) a simple equation combining batch size and context size predicts the point at which IHs form; (2) surface bigram repetition frequency and reliability strongly affect the formation of IHs, and we find a precise Pareto frontier in terms of these two values; and (3) local dependency with high bigram repetition frequency and reliability is sufficient for IH formation, but when the frequency and reliability are low, categoriality and the shape of the marginal distribution matter.

دراسة حديثة استكشفت تشكيل رؤوس الاستقراء (IHs) في نماذج اللغة، كاشفة أن تطورها يتأثر بخصائص بيانات التدريب مثل حجم الدفعة وحجم السياق. تشير الأبحاث إلى أن تكرار ثنائي الجمل العالي والموثوقية أمران حاسمان لتشكيل IH، بينما تتطلب المستويات المنخفضة النظر في التصنيف وشكل التوزيع الهامشي.

Un estudio reciente ha explorado la formación de cabezales de inducción (IH) en modelos de lenguaje, revelando que su desarrollo está influenciado por propiedades de los datos de entrenamiento como el tamaño del lote y el tamaño del contexto. La investigación indica que la alta frecuencia de repetición de bigramas y la fiabilidad son críticas para la formación de IH, mientras que niveles bajos requieren considerar la categorización y la forma de la distribución marginal.

Une étude récente a exploré la formation des têtes d'induction (IH) dans les modèles de langage, révélant que leur développement est influencé par des propriétés des données d'entraînement telles que la taille des lots et la taille du contexte. La recherche indique que la fréquence de répétition des bigrammes et leur fiabilité sont essentielles pour la formation des IH, tandis que des niveaux faibles nécessitent de prendre en compte la catégorisation et la forme de la distribution marginale.

A recent study has explored the formation of induction heads (IHs) in language models, revealing that their development is influenced by training data properties such as batch size and context size. The research indicates that high bigram repetition frequency and reliability are critical for IH formation, while low levels necessitate consideration of categoriality and marginal distribution shape.

Predicting the Formation of Induction Heads

arXiv:2511.16778v1 Announce Type: new 
Abstract: Recently, structure-text contrastive learning has shown promising performance on text-attributed graphs by leveraging the complementary strengths of graph neural networks and language models. However, existing methods typically rely on homophily assumptions in similarity estimation and hard optimization objectives, which limit their applicability to heterophilic graphs. Although existing methods can mitigate heterophily through structural adjustments or neighbor aggregation, they usually treat textual embeddings as static targets, leading to suboptimal alignment. In this work, we identify the multi-granular heterophily in text-attributed graphs, including complete heterophily, partial heterophily, and latent homophily, which makes structure-text alignment particularly challenging due to mixed, noisy, and missing semantic correlations. To achieve flexible and bidirectional alignment, we propose GCL-OT, a novel graph contrastive learning framework with optimal transport, equipped with tailored mechanisms for each type of heterophily. Specifically, for partial heterophily, we design a RealSoftMax-based similarity estimator to emphasize key neighbor-word interactions while easing background noise. For complete heterophily, we introduce a prompt-based filter that adaptively excludes irrelevant noise during optimal transport alignment. Furthermore, we incorporate OT-guided soft supervision to uncover potential neighbors with similar semantics, enhancing the learning of latent homophily. Theoretical analysis shows that GCL-OT can improve the mutual information bound and Bayes error guarantees. Extensive experiments on nine benchmarks show that GCL-OT consistently outperforms state-of-the-art methods, verifying its effectiveness and robustness.

تم تقديم GCL-OT، وهو إطار جديد للتعلم التبايني الرسومي، لتحسين أداء الرسوم البيانية المنسوبة إلى النص، وخاصة تلك التي تظهر التباين. تتناول هذه الطريقة القيود المفروضة على الأساليب الحالية التي تعتمد على افتراضات التماثل، مما قد يعيق المحاذاة الفعالة بين البيانات النصية والهيكلية. يحدد الإطار أشكالًا مختلفة من التباين، مما يسمح بمحاذاة أكثر مرونة وثنائية الاتجاه بين هياكل الرسوم البيانية وتضمينات النص.

GCL-OT, un nuevo marco de aprendizaje contrastivo gráfico, ha sido introducido para mejorar el rendimiento de los grafos atribuidos a texto, especialmente aquellos que presentan heterofilia. Este método aborda las limitaciones de los enfoques existentes que dependen de suposiciones de homofilia, lo que puede obstaculizar la alineación efectiva de los datos textuales y estructurales. El marco identifica diversas formas de heterofilia, permitiendo una alineación más flexible y bidireccional entre las estructuras de grafos y las incrustaciones textuales.

GCL-OT, un nouveau cadre d'apprentissage contrastif graphique, a été introduit pour améliorer la performance des graphes attribués par du texte, en particulier ceux présentant de l'hétérophilie. Cette méthode aborde les limitations des approches existantes qui reposent sur des hypothèses d'homophilie, ce qui peut entraver l'alignement efficace des données textuelles et structurelles. Le cadre identifie diverses formes d'hétérophilie, permettant un alignement plus flexible et bidirectionnel entre les structures de graphes et les embeddings textuels.

GCL-OT, a novel graph contrastive learning framework, has been introduced to enhance the performance of text-attributed graphs, particularly those exhibiting heterophily. This method addresses limitations in existing approaches that rely on homophily assumptions, which can hinder the effective alignment of textual and structural data. The framework identifies various forms of heterophily, enabling more flexible and bidirectional alignment between graph structures and text embeddings.

Learning Interpretable Features in Audio Latent Spaces via Sparse Autoencoders

Was this article worth reading? Share it

PodSnap.AI

Synthesia

All Voice Lab