Enforcing Orderedness to Improve Feature Consistency
Positive · Artificial Intelligence
- The introduction of Ordered Sparse Autoencoders (OSAE) aims to enhance the interpretability of neural networks by imposing a strict ordering on latent features and deterministically utilizing every feature dimension, addressing the run-to-run inconsistency seen in traditional Sparse Autoencoders (SAEs). This development is supported by empirical results on activations from models such as Gemma2-2B and Pythia-70M, which demonstrate improved consistency over previous approaches like Matryoshka SAEs (see the sketch after this list).
- The significance of OSAE lies in its potential to resolve permutation non-identifiability in sparse dictionary learning: because a standard SAE's training objective is invariant to permuting its latent dimensions, independently trained runs can learn the same features in different positions, making results hard to compare. By fixing an ordering and ensuring that every feature dimension is utilized, OSAE could provide a more stable framework for feature extraction, which is crucial for advancing AI interpretability.
- This advancement in feature consistency reflects ongoing efforts in the AI community to improve model interpretability, particularly in the context of large language models and topic modeling. Related methods such as AlignSAE, which aligns features with a defined ontology, and work exploring SAEs as topic models highlight a broader trend toward enhancing the interpretability and reliability of AI systems, addressing both theoretical and practical challenges in the field.
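The bullets above describe the mechanism only at a high level. The sketch below shows one plausible way an ordering constraint can be imposed on SAE latents: a nested, Matryoshka-style reconstruction loss over prefixes of the latent vector, which ties each feature to a fixed index. The class name `OrderedSAE` and the `prefix_sizes` parameter are hypothetical illustrations under that assumption, not the OSAE paper's actual API or objective.

```python
# Minimal sketch of an ordering constraint on SAE latents (an assumed
# mechanism, not the OSAE paper's exact method): reconstruct the input from
# nested prefixes of the latent code, so earlier dimensions are forced to
# carry the most reconstruction-relevant information.
import torch
import torch.nn as nn

class OrderedSAE(nn.Module):  # hypothetical name for illustration
    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_latent)
        self.dec = nn.Linear(d_latent, d_model)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # Nonnegative latent codes, one per ordered feature dimension.
        return torch.relu(self.enc(x))

    def loss(self, x: torch.Tensor, prefix_sizes=(16, 64, 256)) -> torch.Tensor:
        z = self.encode(x)
        total = torch.zeros((), device=x.device)
        # Summing reconstruction losses over nested prefixes breaks the
        # permutation symmetry of a plain SAE: swapping two latent indices
        # now changes the loss, so feature positions become identifiable.
        for k in prefix_sizes:
            mask = torch.zeros_like(z)
            mask[:, :k] = 1.0
            x_hat = self.dec(z * mask)
            total = total + (x_hat - x).pow(2).mean()
        return total
```

For contrast, a plain SAE's loss is unchanged if the latent dimensions are permuted together with the matching decoder columns; that invariance is exactly the permutation non-identifiability the second bullet refers to, and any fix must make index position carry meaning.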
— via World Pulse Now AI Editorial System
