TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding
Positive | Artificial Intelligence
- TimeViper has been introduced as a hybrid vision-language model aimed at enhancing long video understanding by using a Mamba-Transformer backbone. This architecture combines the linear-time efficiency of state-space models with the expressiveness of attention mechanisms, allowing it to process extensive video content effectively.
- The development of TimeViper is significant because it addresses the computational challenges of long video processing, enabling the model to handle videos exceeding 10,000 frames. This advancement could improve applications in fields such as content analysis and automated video summarization.
- The introduction of TimeViper reflects ongoing trends in AI, particularly the exploration of state-space models as alternatives to traditional transformers. This shift highlights the importance of efficiency and effectiveness in handling complex data, as seen in recent studies that analyze the generalization capabilities and expressive power of these models.
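The hybrid design described above can be illustrated with a minimal sketch. This is not the TimeViper implementation; it is a toy example, with hypothetical names and scalar features, showing the general idea of stacking a linear-time state-space (Mamba-style) scan with a quadratic-cost self-attention layer:

```python
# Illustrative sketch only (assumed structure, not TimeViper's actual code):
# a state-space scan handles long sequences in O(T), then an attention layer
# adds expressive all-pairs mixing at O(T^2) cost.
import math

def ssm_scan(xs, a=0.9, b=0.5, c=1.0):
    """Linear recurrence h_t = a*h_{t-1} + b*x_t, output y_t = c*h_t."""
    h, ys = 0.0, []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys

def attention(xs):
    """Single-head scalar self-attention: softmax over pairwise products."""
    ys = []
    for q in xs:
        scores = [q * k for k in xs]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        ys.append(sum(wi * v for wi, v in zip(w, xs)) / z)
    return ys

def hybrid_block(xs):
    """SSM layer for cheap long-range mixing, then attention on the result."""
    return attention(ssm_scan(xs))

frames = [0.1 * t for t in range(8)]  # stand-in for per-frame features
out = hybrid_block(frames)
```

In a real model such as TimeViper, both layers operate on high-dimensional token embeddings and the state-space parameters are learned; the sketch only conveys why the state-space component keeps cost linear in the number of frames while attention contributes richer token interactions.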
— via World Pulse Now AI Editorial System