Adaptive Soft Rolling KV Freeze with Entropy-Guided Recovery: Sublinear Memory Growth for Efficient LLM Inference
Positive | Artificial Intelligence
- The Adaptive Soft Rolling KV Freeze with Entropy-Guided Recovery (ASR-KF-EGR) framework has been introduced as a training-free approach to efficient large language model (LLM) generation, evaluated on the LLaMA-3 architecture. The method uses a reversible soft-freeze mechanism that suspends key-value (KV) cache updates for low-importance tokens, reducing the active KV cache size by 55-67% while maintaining generation quality (an illustrative freeze sketch follows the list below).
- This matters because memory use during inference is a key constraint on deploying large language models in practice. Because all tokens are preserved in off-GPU storage and restored on demand rather than evicted outright, ASR-KF-EGR avoids irreversible loss of context and requires no fine-tuning, making it straightforward to adopt (see the recovery sketch below).
- The introduction of ASR-KF-EGR aligns with ongoing efforts to enhance the safety and efficiency of LLMs, as seen in related advancements like Graph-Regularized Sparse Autoencoders and online structured pruning techniques. These innovations collectively address the challenges of adversarial vulnerabilities and memory management in LLMs, reflecting a broader trend towards optimizing AI systems for both performance and safety.
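The summary above does not reproduce the paper's exact scoring rule, so the following is a minimal, hypothetical PyTorch sketch of the soft-freeze idea only. The use of accumulated attention mass as an importance score, the `freeze_ratio` parameter, the per-head tensor shapes, and CPU offload standing in for "off-GPU storage" are all assumptions for illustration, not the authors' implementation.

```python
import torch

def soft_freeze_kv(keys, values, attn_weights, freeze_ratio=0.5):
    """Illustrative soft-freeze step; not the paper's exact algorithm.

    keys, values : (num_tokens, head_dim) cached K/V for one layer/head.
    attn_weights : (num_queries, num_tokens) recent attention weights,
                   used here as a proxy for token importance (assumption).
    freeze_ratio : assumed fraction of tokens to soft-freeze.
    """
    num_tokens = keys.shape[0]
    # Importance proxy: total attention mass each cached token has received.
    importance = attn_weights.sum(dim=0)                        # (num_tokens,)
    num_frozen = int(freeze_ratio * num_tokens)
    frozen_idx = torch.topk(importance, num_frozen, largest=False).indices
    active_mask = torch.ones(num_tokens, dtype=torch.bool, device=keys.device)
    active_mask[frozen_idx] = False

    # Soft freeze: move low-importance entries to CPU instead of deleting
    # them, so the operation stays fully reversible.
    frozen_k, frozen_v = keys[frozen_idx].cpu(), values[frozen_idx].cpu()
    active_k, active_v = keys[active_mask], values[active_mask]
    return active_k, active_v, frozen_k, frozen_v, frozen_idx
```

Recovery can be sketched in the same spirit: monitor the entropy of the next-token distribution and, when it crosses a threshold, restore the frozen entries from off-GPU storage. The threshold value and the trigger rule below are assumptions, not the paper's stated criterion.

```python
import torch
import torch.nn.functional as F

def entropy_guided_recover(logits, active_k, active_v, frozen_k, frozen_v,
                           entropy_threshold=3.0):
    """Illustrative recovery step; threshold and trigger rule are assumptions.

    logits : (vocab_size,) next-token logits from the current decode step.
    If the next-token distribution is high-entropy (the model looks
    uncertain), restore frozen K/V entries so the full context is available
    again; otherwise keep generating with the reduced cache.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum()  # entropy in nats

    if entropy.item() > entropy_threshold and frozen_k.numel() > 0:
        device = active_k.device
        # A full implementation would use the stored indices to splice the
        # restored entries back into their original positions.
        active_k = torch.cat([active_k, frozen_k.to(device)], dim=0)
        active_v = torch.cat([active_v, frozen_v.to(device)], dim=0)
        frozen_k, frozen_v = frozen_k[:0], frozen_v[:0]
    return active_k, active_v, frozen_k, frozen_v
```

Because frozen entries are moved rather than discarded, the freeze is reversible by construction, which is what separates this style of cache management from hard-eviction schemes that permanently drop tokens.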
— via World Pulse Now AI Editorial System
