Sharp Eyes and Memory for VideoLLMs: Information-Aware Visual Token Pruning for Efficient and Reliable VideoLLM Reasoning
Positive · Artificial Intelligence
SharpV addresses a core bottleneck of Video Large Language Models (VideoLLMs): quadratic computational complexity and inefficient key-value cache scaling. It offers a minimalist, efficient solution through adaptive pruning of visual tokens, adjusting dynamically to spatial-temporal information. This method not only reduces redundancy but occasionally surpasses the performance of dense models, suggesting a new paradigm for adaptive pruning. During pruning, SharpV discards degraded visual features, guided by their similarity to the original features, thereby improving the model's information flow. Experiments on various public benchmarks demonstrate SharpV's advantages, establishing it as the first two-stage pruning framework that operates without access to exposed attention scores. This development is significant for the future of VideoLLMs, promising…
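The similarity-guided step can be illustrated with a minimal sketch. This is not SharpV's actual implementation; it only assumes the idea described above, that tokens whose degraded features drift furthest from their originals are pruned. The function name, the cosine-similarity criterion, and the `keep_ratio` parameter are illustrative assumptions:

```python
import numpy as np

def prune_tokens_by_similarity(original, degraded, keep_ratio=0.5):
    """Hypothetical similarity-guided pruning: keep the visual tokens
    whose degraded features remain most similar to their originals.

    original, degraded: (num_tokens, dim) feature arrays.
    Returns the indices of the kept tokens, in positional order.
    """
    # Per-token cosine similarity between original and degraded features.
    o = original / np.linalg.norm(original, axis=1, keepdims=True)
    d = degraded / np.linalg.norm(degraded, axis=1, keepdims=True)
    sim = (o * d).sum(axis=1)

    # Retain the top keep_ratio fraction of tokens by similarity.
    k = max(1, int(len(sim) * keep_ratio))
    kept = np.argsort(-sim)[:k]
    return np.sort(kept)

# Toy example: tokens 0 and 1 survive degradation intact,
# tokens 2 and 3 are distorted, so they are pruned.
original = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
degraded = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
print(prune_tokens_by_similarity(original, degraded, keep_ratio=0.5))  # → [0 1]
```

The toy run keeps only the tokens whose features were preserved, which is the intuition behind discarding degraded visual features while leaving the useful information flow intact.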
— via World Pulse Now AI Editorial System