Unlocking the Address Book: Dissecting the Sparse Semantic Structure of LLM Key-Value Caches via Sparse Autoencoders

arXiv — cs.LG · Friday, December 12, 2025 at 5:00:00 AM
  • A new study introduces STA-Attention, a framework that uses Top-K Sparse Autoencoders to analyze the Key-Value (KV) cache in long-context Large Language Models (LLMs). The research reveals a Key-Value Asymmetry, in which Key vectors act as sparse routers while Value vectors carry dense content, motivating a proposed Dual-Budget Strategy for retaining semantic components (a minimal illustrative sketch follows the summary).
  • This development is significant as it addresses the memory bottleneck in LLMs, potentially enhancing their efficiency and interpretability. By decomposing the KV cache into semantic atoms, the framework aims to improve the performance of models like Yi-6B, Mistral-7B, and Qwen2.5-32B.
  • The findings resonate with ongoing discussions in the AI community regarding the convergence of deep neural networks into low-dimensional subspaces, as seen with models like Mistral-7B and LLaMA-8B. This research contributes to the understanding of how different architectures can optimize memory usage and semantic processing in AI applications.
— via World Pulse Now AI Editorial System
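To make the Key-Value Asymmetry and Dual-Budget Strategy concrete, below is a minimal sketch of a Top-K sparse autoencoder applied to cached Key and Value vectors, with a smaller activation budget for Keys than for Values. This is not the STA-Attention implementation; all dimensions, budgets, and names (TopKSAE, d_model, n_latents, k_key, k_value) are illustrative assumptions.

```python
# Illustrative sketch only: a Top-K sparse autoencoder over KV-cache vectors,
# with a "dual budget" that keeps fewer active latents for Keys (sparse routers)
# than for Values (dense content). Sizes and budgets are toy assumptions.
import torch
import torch.nn as nn


class TopKSAE(nn.Module):
    """Top-K sparse autoencoder: encode, keep the K largest latents, decode."""

    def __init__(self, d_model: int, n_latents: int, k: int):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, n_latents)
        self.decoder = nn.Linear(n_latents, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        z = torch.relu(self.encoder(x))              # non-negative latent activations
        topk = torch.topk(z, self.k, dim=-1)         # keep only the K strongest "semantic atoms"
        z_sparse = torch.zeros_like(z).scatter_(-1, topk.indices, topk.values)
        x_hat = self.decoder(z_sparse)               # reconstruct from the sparse code
        return x_hat, z_sparse


if __name__ == "__main__":
    d_model, n_latents = 128, 1024                   # toy sizes, not the paper's
    k_key, k_value = 8, 32                           # dual budget: Keys kept sparser than Values
    key_sae = TopKSAE(d_model, n_latents, k_key)
    value_sae = TopKSAE(d_model, n_latents, k_value)

    keys = torch.randn(16, d_model)                  # stand-ins for cached K vectors
    values = torch.randn(16, d_model)                # stand-ins for cached V vectors

    k_hat, _ = key_sae(keys)
    v_hat, _ = value_sae(values)
    print("key reconstruction error:", torch.mean((keys - k_hat) ** 2).item())
    print("value reconstruction error:", torch.mean((values - v_hat) ** 2).item())
```

The separate budgets reflect the reported asymmetry: if Keys mainly route attention, a few active latents may suffice to preserve them, while Values, which carry the content, would get a larger share of the retention budget.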


Continue Reading
Unforgotten Safety: Preserving Safety Alignment of Large Language Models with Continual Learning
Positive · Artificial Intelligence
A recent study highlights the importance of safety alignment in large language models (LLMs) as they are increasingly adapted for various tasks. The research identifies safety degradation during fine-tuning, attributing it to catastrophic forgetting, and proposes continual learning (CL) strategies to preserve safety. The evaluation of these strategies shows that they can effectively reduce attack success rates compared to standard fine-tuning methods.
Watermarks for Language Models via Probabilistic Automata
Neutral · Artificial Intelligence
A new watermarking scheme for language models has been introduced, utilizing probabilistic automata to achieve distortion-free embedding and robustness against edit-distance attacks. This method, tested on LLaMA-3B and Mistral-7B, offers significant improvements in generation diversity and computational efficiency compared to previous techniques.
