arXiv:2511.18831v1 Announce Type: new 
Abstract: The scalability of video understanding models is increasingly limited by the prohibitive storage and computational costs of large-scale video datasets. While data synthesis has improved data efficiency in the image domain, its extension to video remains challenging due to pervasive temporal redundancy and complex spatiotemporal dynamics. In this work, we uncover a critical insight: the primary source of inefficiency in video datasets is not inter-sample redundancy, but intra-sample frame-level redundancy. To leverage this insight, we introduce VideoCompressa, a novel framework for video data synthesis that reframes the problem as dynamic latent compression. Specifically, VideoCompressa jointly optimizes a differentiable keyframe selector (implemented as a lightweight ConvNet with Gumbel-Softmax sampling) to identify the most informative frames, and a pretrained, frozen Variational Autoencoder (VAE) to compress these frames into compact, semantically rich latent codes. These latent representations are then fed into a compression network, enabling end-to-end backpropagation. Crucially, the keyframe selector and synthetic latent codes are co-optimized to maximize retention of task-relevant information. Experiments show that our method achieves unprecedented data efficiency: on UCF101 with ConvNets, VideoCompressa surpasses full-data training by 2.34 percentage points using only 0.13% of the original data, with a speedup of over 5800x compared to traditional synthesis methods. Moreover, when fine-tuning Qwen2.5-7B-VL on HMDB51, VideoCompressa matches full-data performance using just 0.41% of the training data, outperforming the zero-shot baseline by 10.61%.
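
The Gumbel-Softmax sampling mentioned in the abstract can be illustrated with a minimal, forward-pass-only sketch. This is not the authors' implementation: the per-frame scores, the temperature, and the function names `gumbel_softmax` and `select_keyframe` are hypothetical, and in the paper the scores would come from the lightweight ConvNet selector rather than a hand-written list.

```python
import math
import random


def gumbel_softmax(logits, temperature=1.0, rng=random):
    """Return relaxed one-hot weights over frames (forward pass only)."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise: g = -log(-log(u)), with u ~ Uniform(0, 1).
        # Clamp u away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    # Numerically stable softmax over the perturbed, scaled scores.
    peak = max(noisy)
    exps = [math.exp(value - peak) for value in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def select_keyframe(weights):
    """Hard choice: index of the frame with the largest relaxed weight."""
    return max(range(len(weights)), key=weights.__getitem__)


# Hypothetical per-frame informativeness scores for a 4-frame clip.
frame_scores = [0.1, 2.0, 0.3, 1.5]
rng = random.Random(0)  # seeded only to make the sketch repeatable
weights = gumbel_softmax(frame_scores, temperature=0.5, rng=rng)
keyframe = select_keyframe(weights)
```

Lower temperatures push the weights toward a one-hot selection, while higher temperatures keep them smoother. In an automatic-differentiation framework, the same relaxation is what allows gradients to reach the frame scores, which is the property the abstract relies on for end-to-end backpropagation.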

VideoCompressa is a new framework that improves data efficiency in video understanding by targeting intra-sample frame-level redundancy. It combines a differentiable keyframe selector with a pretrained, frozen Variational Autoencoder to synthesize video data through dynamic latent compression.

VideoCompressa: Data-Efficient Video Understanding via Joint Temporal Compression and Spatial Reconstruction
