arXiv:2511.18424v1 Announce Type: new 
Abstract: Image-to-point cross-modal learning has emerged to address the scarcity of large-scale 3D datasets in 3D representation learning. However, current methods that leverage 2D data often result in large, slow-to-train models, making them computationally expensive and difficult to deploy in resource-constrained environments. The architecture design of such models is therefore critical, determining their performance, memory footprint, and compute efficiency. The Joint-embedding Predictive Architecture (JEPA) has gained wide popularity in self-supervised learning for its simplicity and efficiency, but has been under-explored in cross-modal settings, partly due to the misconception that masking is intrinsic to JEPA. In this light, we propose CrossJEPA, a simple Cross-modal Joint Embedding Predictive Architecture that harnesses the knowledge of an image foundation model and trains a predictor to infer embeddings of specific rendered 2D views from corresponding 3D point clouds, thereby introducing a JEPA-style pretraining strategy beyond masking. By conditioning the predictor on cross-domain projection information, CrossJEPA purifies the supervision signal from semantics exclusive to the target domain. We further exploit the frozen teacher design with a one-time target embedding caching mechanism, yielding amortized efficiency. CrossJEPA achieves a new state-of-the-art in linear probing on the synthetic ModelNet40 (94.2%) and the real-world ScanObjectNN (88.3%) benchmarks, using only 14.1M pretraining parameters (8.5M in the point encoder), and about 6 pretraining hours on a standard single GPU. These results position CrossJEPA as a performant, memory-efficient, and fast-to-train framework for 3D representation learning via knowledge distillation. We analyze CrossJEPA intuitively, theoretically, and empirically, and extensively ablate our design choices. Code will be made available.
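The frozen-teacher caching and pose-conditioned prediction described in the abstract can be illustrated with a toy NumPy sketch. Everything below is an assumption made for illustration: the abstract does not specify the encoders, the predictor, the loss, or how projection information is encoded, so the names `frozen_teacher`, `build_target_cache`, and `predictor`, the dimensions, the concatenation-based conditioning, and the mean-squared-error loss are hypothetical stand-ins and not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

EMB_DIM = 16    # hypothetical embedding width
PIXEL_DIM = 32  # hypothetical size of a flattened rendered view
POSE_DIM = 3    # hypothetical view descriptor (e.g. azimuth, elevation, distance)

# Stand-in for the frozen image foundation model: a fixed random projection.
W_TEACHER = rng.standard_normal((PIXEL_DIM, EMB_DIM))

def frozen_teacher(view_pixels):
    """Map a flattened rendered view to a target embedding (never updated)."""
    return np.tanh(view_pixels @ W_TEACHER)

def build_target_cache(views):
    """Run the frozen teacher once per view and store the result for reuse."""
    return {key: frozen_teacher(pixels) for key, pixels in views.items()}

def predictor(point_embedding, pose, w):
    """Linear predictor conditioned on the view pose by concatenation."""
    return np.concatenate([point_embedding, pose]) @ w

def mean_loss(w, point_embedding, poses, cache):
    """Average squared error between predicted and cached target embeddings."""
    losses = [np.mean((predictor(point_embedding, poses[k], w) - t) ** 2)
              for k, t in cache.items()]
    return float(np.mean(losses))

# Toy data: one object rendered from four views.
views = {(0, v): rng.standard_normal(PIXEL_DIM) for v in range(4)}
poses = {(0, v): rng.standard_normal(POSE_DIM) for v in range(4)}
cache = build_target_cache(views)  # teacher is called only here, once

# Stand-in for the point encoder output; a real encoder would be trained too.
point_embedding = rng.standard_normal(EMB_DIM)

w = np.zeros((EMB_DIM + POSE_DIM, EMB_DIM))
initial_loss = mean_loss(w, point_embedding, poses, cache)

lr = 0.05
for _ in range(200):
    for key, target in cache.items():
        x = np.concatenate([point_embedding, poses[key]])
        err = x @ w - target
        # Gradient of mean((x @ w - target) ** 2) with respect to w.
        w -= lr * (2.0 / EMB_DIM) * np.outer(x, err)

final_loss = mean_loss(w, point_embedding, poses, cache)
```

Because the teacher is frozen, its outputs never change during training, so the cache is computed once and every later epoch only reads from it. This is the sense in which the cost of the teacher is amortized in this sketch.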

CrossJEPA is a new Cross-modal Joint Embedding Predictive Architecture for learning 3D representations from 2D images, addressing the limited availability of large-scale 3D datasets. It builds on the Joint-Embedding Predictive Architecture (JEPA) to improve model efficiency and reduce the computational cost of training large models.

CrossJEPA: Cross-Modal Joint-Embedding Predictive Architecture for Efficient 3D Representation Learning from 2D Images
