Leveraging KV Similarity for Online Structured Pruning in LLMs

arXiv — cs.CL · Tuesday, December 9, 2025 at 5:00:00 AM
  • A new online structured pruning technique called Token Filtering has been introduced for large language models (LLMs), allowing pruning decisions to be made during inference without calibration data. The method measures token redundancy through joint key-value (KV) similarity, reducing inference cost while retaining essential information, and adds a variance-aware fusion strategy so that important tokens are preserved even at high pruning ratios (a minimal sketch of the idea follows this summary).
  • This development is significant as it addresses the instability often associated with traditional pruning methods that rely on offline calibration data. By enabling real-time pruning decisions, the Token Filtering technique enhances the efficiency of LLMs like LLaMA-2 and Mistral, potentially leading to faster and more reliable model performance in various applications.
  • The introduction of Token Filtering aligns with ongoing efforts to optimize LLMs, as seen in other advancements such as KQ-SVD for KV cache optimization and FastForward Pruning using reinforcement learning. These innovations reflect a broader trend in AI research focused on improving model efficiency and reducing computational costs, which is crucial as LLMs continue to grow in complexity and application.
— via World Pulse Now AI Editorial System
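
The abstract suggests a simple mental model: a token whose key and value vectors closely match a neighbor's adds little new information to attention, so it can be pruned while decoding. The sketch below illustrates that idea in isolation; the cosine scoring rule, the variance discount, and every function name here are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def token_filter(keys, values, keep_ratio=0.5):
    """Prune redundant tokens by joint key-value similarity.

    keys, values: (seq_len, head_dim) arrays from one attention head.
    Returns sorted indices of tokens to keep. Scoring rule and the
    variance discount are assumed forms for illustration.
    """
    # Joint KV representation, unit-normalized for cosine similarity.
    kv = np.concatenate([keys, values], axis=-1)
    kv = kv / (np.linalg.norm(kv, axis=-1, keepdims=True) + 1e-8)

    # Similarity of each token to its predecessor; the first token has
    # no predecessor and is always kept (score of -inf).
    sim = np.full(len(kv), -np.inf)
    sim[1:] = np.sum(kv[1:] * kv[:-1], axis=-1)

    # Variance-aware weighting (assumed form): tokens whose values vary
    # strongly across dimensions are treated as more informative, so
    # their redundancy score is discounted before ranking.
    var = values.var(axis=-1)
    score = sim - var / (var.max() + 1e-8)

    n_keep = max(1, int(len(kv) * keep_ratio))
    return np.sort(np.argsort(score)[:n_keep])  # least redundant kept
```

In a full decoder this scoring would run over the KV cache during generation, and the variance-aware fusion the abstract describes would merge pruned tokens into retained neighbors rather than simply dropping them, as is done here for brevity.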

Continue Reading
Mistral launches powerful Devstral 2 coding model including open source, laptop-friendly version
Positive · Artificial Intelligence
French AI startup Mistral has launched the Devstral 2 coding model, which includes a laptop-friendly version optimized for software engineering tasks. This release follows the introduction of the Mistral 3 LLM family, aimed at enhancing local hardware capabilities for developers.
Large Language Model-Based Generation of Discharge Summaries
Positive · Artificial Intelligence
Recent research has demonstrated the potential of Large Language Models (LLMs) in automating the generation of discharge summaries, which are critical documents in patient care. The study evaluated five models, including proprietary systems like GPT-4 and Gemini 1.5 Pro, and found that Gemini, particularly with one-shot prompting, produced summaries most similar to gold standards. This advancement could significantly reduce the workload of healthcare professionals and enhance the accuracy of patient information.
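
For readers unfamiliar with the term, one-shot prompting simply means placing a single worked example in the prompt before the new case so the model can imitate its format. The sketch below shows that generic structure; the field names and example text are placeholders, not the study's actual prompts.

```python
# Generic one-shot prompt of the kind the study describes: one worked
# (notes -> summary) example precedes the new case. All example text
# here is an illustrative placeholder.
EXAMPLE_NOTE = "Admitted with community-acquired pneumonia; treated with ..."
EXAMPLE_SUMMARY = "Diagnosis: community-acquired pneumonia. Course: ..."

def one_shot_prompt(new_note: str) -> str:
    return (
        "You are drafting a hospital discharge summary.\n\n"
        f"Example clinical notes:\n{EXAMPLE_NOTE}\n\n"
        f"Example discharge summary:\n{EXAMPLE_SUMMARY}\n\n"
        f"Clinical notes:\n{new_note}\n\n"
        "Discharge summary:"
    )
```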
GSAE: Graph-Regularized Sparse Autoencoders for Robust LLM Safety Steering
Positive · Artificial Intelligence
The introduction of Graph-Regularized Sparse Autoencoders (GSAEs) aims to enhance the safety of large language models (LLMs) by addressing their vulnerabilities to adversarial prompts and jailbreak attacks. GSAEs extend traditional sparse autoencoders by incorporating a Laplacian smoothness penalty, allowing for the recovery of distributed safety representations across multiple features rather than isolating them in a single latent dimension.
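
Concretely, the Laplacian penalty reads as a quadratic smoothness term added to the usual reconstruction-plus-sparsity loss of a sparse autoencoder. The sketch below shows that composite loss under stated assumptions: the feature graph, and hence its Laplacian, is precomputed, and the weighting coefficients are illustrative rather than the paper's settings.

```python
import torch

def gsae_loss(x, encoder, decoder, laplacian, l1_weight=1e-3, lap_weight=1e-2):
    """Sparse-autoencoder loss with a graph Laplacian smoothness penalty.

    `laplacian` is the (n_latents, n_latents) matrix L = D - A of an
    assumed precomputed feature graph. The term z^T L z penalizes codes
    that differ across connected features, so a safety concept is
    encouraged to spread over neighboring latents rather than one.
    """
    z = torch.relu(encoder(x))      # (batch, n_latents) sparse codes
    x_hat = decoder(z)

    recon = (x_hat - x).pow(2).sum(dim=-1).mean()   # reconstruction
    sparsity = z.abs().sum(dim=-1).mean()           # L1 sparsity
    smooth = torch.einsum("bi,ij,bj->b", z, laplacian, z).mean()

    return recon + l1_weight * sparsity + lap_weight * smooth
```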
Depth-Wise Activation Steering for Honest Language Models
Positive · Artificial Intelligence
A new method called Depth-Wise Activation Steering has been introduced to enhance the honesty of large language models (LLMs) like LLaMA, Qwen, and Mistral. This training-free approach utilizes a Gaussian schedule to improve the models' ability to report truthfully, addressing the issue of models asserting falsehoods despite having the correct information internally.
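
A Gaussian depth schedule of this kind is easy to picture: the steering vector is injected with a strength that peaks at some intermediate layer and decays toward the first and last layers. The sketch below shows one plausible form; the center and width values, and the names v_honesty and alpha, are assumptions rather than the paper's reported configuration.

```python
import math

def gaussian_layer_weights(n_layers, center_frac=0.5, width_frac=0.15):
    """Per-layer steering strengths following a Gaussian over depth.

    Strength peaks `center_frac` of the way through the network and
    fades toward both ends; both parameters are illustrative.
    """
    mu = center_frac * (n_layers - 1)
    sigma = max(width_frac * n_layers, 1e-6)
    return [math.exp(-0.5 * ((l - mu) / sigma) ** 2) for l in range(n_layers)]

# At inference time each layer's hidden state would be nudged as
#   h_l <- h_l + alpha * w[l] * v_honesty
# where v_honesty is a precomputed "honesty direction" and alpha a
# global strength (both assumed names for illustration).
w = gaussian_layer_weights(n_layers=32)
```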
LLMs are Biased Evaluators But Not Biased for Retrieval Augmented Generation
Neutral · Artificial Intelligence
Recent research indicates that large language models (LLMs) demonstrate biases in evaluation tasks, particularly favoring self-generated content. However, a study exploring retrieval-augmented generation (RAG) frameworks found no significant self-preference effect, suggesting that LLMs can evaluate factual content more impartially than previously thought.
KQ-SVD: Compressing the KV Cache with Provable Guarantees on Attention Fidelity
Positive · Artificial Intelligence
A new method called KQ-SVD has been introduced to enhance the efficiency of transformer-based large language models (LLMs) by optimizing the Key-Value (KV) cache. This method addresses the memory bottleneck caused by increasing sequence lengths and batch sizes, proving that traditional compression techniques are suboptimal for approximating the attention matrix. KQ-SVD offers a computationally efficient low-rank decomposition that maintains attention fidelity under compression.
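
The suboptimality claim has a concrete reading: a plain truncated SVD of the key matrix K minimizes error in K itself, not in the attention logits QK^T that the model actually consumes. One way to target the logits instead, sketched below as an illustrative derivation rather than the paper's exact algorithm, is to whiten K by the query second-moment matrix before truncating.

```python
import numpy as np

def kq_aware_compress(K, Q, rank):
    """Rank-`rank` compression of keys K aimed at attention fidelity.

    Whitening K by C = Q^T Q / n makes the truncated SVD minimize
    ||Q K^T - Q K_r^T||_F instead of ||K - K_r||_F. Illustrative
    construction; not necessarily the paper's exact procedure.
    """
    d = K.shape[1]
    C = Q.T @ Q / len(Q) + 1e-6 * np.eye(d)   # query second moment, regularized
    w, V = np.linalg.eigh(C)                  # symmetric eigendecomposition
    C_half = V @ np.diag(np.sqrt(w)) @ V.T    # C^{1/2}
    C_half_inv = V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    # Truncated SVD in the whitened space, then map back.
    U, s, Vt = np.linalg.svd(K @ C_half, full_matrices=False)
    K_r = (U[:, :rank] * s[:rank]) @ Vt[:rank] @ C_half_inv
    return K_r
```

In practice the memory savings would come from storing the two low-rank factors rather than the reconstructed K_r.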
Last Week in AI #328 - DeepSeek 3.2, Mistral 3, Trainium3, Runway Gen-4.5
Positive · Artificial Intelligence
DeepSeek has released new reasoning models, including updates from its V3 to V3.2 versions, while Mistral has launched the Mistral 3 family of open-source models designed for various platforms, marking significant advancements in AI technology. These developments highlight the competitive landscape in the AI sector, where companies are striving to enhance their offerings and capabilities.