VL-JEPA: Joint Embedding Predictive Architecture for Vision-Language
Positive · Artificial Intelligence
- VL-JEPA is a vision-language model built on the Joint Embedding Predictive Architecture (JEPA): instead of generating output tokens autoregressively, it predicts continuous embeddings of the target text. It is reported to match or exceed traditional token-space models while using 50% fewer trainable parameters, underscoring its efficiency on vision-language tasks.
- The development of VL-JEPA is significant because it extends what vision-language models can do, enabling more effective learning and application across AI tasks. Its selective decoding feature invokes the text decoder only when an output is actually needed, cutting inference cost and making it a promising basis for future work.
- This innovation reflects a broader trend in AI toward architectures that prioritize performance while minimizing resource demands. The emphasis on cutting parameter counts without sacrificing output quality is echoed in other recent vision-language models, signaling a shift toward more sustainable AI practice.
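The contrast the summary draws can be made concrete with a minimal numerical sketch. This is not VL-JEPA's actual implementation; all names, shapes, and the choice of cosine distance are illustrative assumptions. It compares a JEPA-style objective, which regresses a predicted embedding onto a target embedding in continuous space, against an autoregressive token-space objective, which requires a vocabulary-sized softmax and a cross-entropy term per generated token.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Hypothetical dimensions and tensors; stand-ins for real encoder outputs.
d = 16
vision_emb = rng.normal(size=d)        # pooled vision-encoder output
W = rng.normal(size=(d, d)) * 0.1      # stand-in for the predictor network
target_text_emb = rng.normal(size=d)   # embedding of the target caption

# JEPA-style objective: one regression loss in embedding space
# (here, 1 - cosine similarity between prediction and target).
pred = l2_normalize(W @ vision_emb)
tgt = l2_normalize(target_text_emb)
embedding_loss = 1.0 - float(pred @ tgt)

# Token-space objective: cross-entropy over a vocabulary-sized softmax
# at every position of the generated sequence.
vocab, seq_len = 1000, 8
logits = rng.normal(size=(seq_len, vocab))
tokens = rng.integers(0, vocab, size=seq_len)
log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
token_loss = -float(log_probs[np.arange(seq_len), tokens].mean())

print(f"embedding-space loss: {embedding_loss:.3f}")
print(f"token-space loss:     {token_loss:.3f}")
```

The point of the sketch is structural: the embedding-space objective needs no vocabulary projection or per-token softmax, which is one place a parameter reduction of the kind the summary cites could come from.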
— via World Pulse Now AI Editorial System