SWAN: Sparse Winnowed Attention for Reduced Inference Memory via Decompression-Free KV-Cache Compression
Positive | Artificial Intelligence
- A novel framework named SWAN has been introduced to address the memory challenges faced by Large Language Models (LLMs) during autoregressive inference, specifically the substantial memory consumed by the Key-Value (KV) cache. SWAN applies an orthogonal rotation matrix, computed offline, to the KV-cache and prunes the rotated representation, so the compressed cache can be used directly in attention computation without a decompression step (see the first sketch after this list).
- This development is significant because it offers a fine-tuning-free solution that keeps performance close to that of uncompressed models while cutting per-token KV-cache memory by roughly 50-60%. Because the compression level can be tuned at runtime (see the second sketch after this list), the method adapts to different memory budgets, making it a valuable tool for optimizing LLM inference across applications.
- The introduction of SWAN aligns with ongoing efforts in the AI community to improve the efficiency of LLMs through innovative compression techniques. This trend includes methods such as generative caching for structurally similar prompts and PocketLLM for model size reduction, highlighting a broader movement towards enhancing computational efficiency and reducing resource consumption in AI technologies.
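The following PyTorch sketch illustrates the general mechanism the summary describes, not the authors' released implementation: an orthogonal basis is computed offline from calibration activations, keys and values are rotated into that basis and truncated, and attention runs directly on the pruned cache by rotating the query instead of decompressing cached entries. The function names (`offline_rotation`, `compress`, `attention_on_pruned_cache`) and the SVD-based choice of basis are illustrative assumptions.

```python
import torch

def offline_rotation(calibration_acts: torch.Tensor) -> torch.Tensor:
    """Compute an orthogonal basis offline (here via SVD of calibration activations)."""
    # calibration_acts: (num_tokens, head_dim)
    _, _, vh = torch.linalg.svd(calibration_acts, full_matrices=False)
    return vh.T  # (head_dim, head_dim), columns ordered by decreasing energy

def compress(kv: torch.Tensor, rot: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Rotate and prune: only the leading r columns are ever stored in the cache."""
    r = max(1, int(kv.shape[-1] * keep_ratio))
    return (kv @ rot)[..., :r]  # (seq, r); smaller cache, no decompression needed later

def attention_on_pruned_cache(q, k_pruned, v_pruned, rot_k, rot_v):
    """Attention without decompressing the cache: the query is rotated instead."""
    r_k, r_v = k_pruned.shape[-1], v_pruned.shape[-1]
    q_rot = (q @ rot_k)[..., :r_k]                # rotate/prune the query once per step
    scores = q_rot @ k_pruned.transpose(-1, -2)   # approximates q @ k.T (rotation preserves dot products)
    scores = scores / (q.shape[-1] ** 0.5)
    attn = torch.softmax(scores, dim=-1)
    out_rot = attn @ v_pruned                     # output still in the rotated, pruned basis
    return out_rot @ rot_v[:, :r_v].T             # map back to the model basis (cheap, per query)
```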
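A short usage sketch of the runtime-tunable compression level, under the same assumptions: the keep ratio is an ordinary argument, so one offline rotation can serve different memory budgets (for example, keeping 40-50% of dimensions for the 50-60% savings quoted above) without any fine-tuning. The random tensors below stand in for real calibration data and activations.

```python
head_dim, seq_len = 128, 1024
# In practice the rotations would be fit on real calibration activations, not noise.
rot_k = offline_rotation(torch.randn(4096, head_dim))
rot_v = offline_rotation(torch.randn(4096, head_dim))
k, v = torch.randn(seq_len, head_dim), torch.randn(seq_len, head_dim)
q = torch.randn(1, head_dim)

for keep_ratio in (0.5, 0.4):                     # tunable per request or per layer at runtime
    k_c = compress(k, rot_k, keep_ratio)
    v_c = compress(v, rot_v, keep_ratio)
    out = attention_on_pruned_cache(q, k_c, v_c, rot_k, rot_v)
    print(keep_ratio, k_c.shape, out.shape)       # cache shrinks; output stays (1, head_dim)
```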
— via World Pulse Now AI Editorial System