Implicit Modeling for Transferability Estimation of Vision Foundation Models

arXiv — cs.CV · Tuesday, October 28, 2025 at 4:00:00 AM
A new study on transferability estimation for vision foundation models introduces an implicit-modeling approach that identifies the most effective pre-trained model for a given task without the need for extensive fine-tuning. This matters because it streamlines model selection and deployment and reduces training cost, especially given the diverse architectures and training strategies of emerging foundation models.
— via World Pulse Now AI Editorial System
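The recipe such estimators share is simple: extract features from each candidate backbone with its weights frozen, then compute a cheap score on the target data that predicts how well the model would transfer after fine-tuning. The sketch below illustrates that generic recipe only, not the paper's implicit-modeling method: the model names are examples from the timm model zoo, the data is synthetic, and linear-probe accuracy stands in for more refined scores such as LogME.

```python
# A minimal sketch of transferability estimation with frozen features.
# Not the paper's method: it shows the generic pattern of ranking
# pre-trained backbones on a target task without fine-tuning.
import numpy as np
import torch
import timm
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


@torch.no_grad()
def extract_features(model_name: str, images: torch.Tensor) -> np.ndarray:
    """Run a frozen backbone over a batch of images and return pooled features."""
    model = timm.create_model(model_name, pretrained=True, num_classes=0).eval()
    return model(images).cpu().numpy()


def transferability_score(features: np.ndarray, labels: np.ndarray) -> float:
    """Cheap proxy score: held-out accuracy of a linear probe on frozen features."""
    tr_x, te_x, tr_y, te_y = train_test_split(
        features, labels, test_size=0.3, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(tr_x, tr_y)
    return probe.score(te_x, te_y)


# Rank candidate backbones for a target task (dummy data for illustration).
images = torch.randn(64, 3, 224, 224)       # stand-in for the target dataset
labels = np.random.randint(0, 5, size=64)   # stand-in task labels
for name in ["resnet50", "vit_base_patch16_224"]:
    score = transferability_score(extract_features(name, images), labels)
    print(f"{name}: {score:.3f}")
```

In practice the images and labels would come from the real target dataset, and the probe score would be replaced by the estimator under study; the frozen-feature pipeline around it stays the same.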

Continue Reading
ShelfGaussian: Shelf-Supervised Open-Vocabulary Gaussian-based 3D Scene Understanding
Positive · Artificial Intelligence
ShelfGaussian has been introduced as an open-vocabulary, multi-modal, Gaussian-based framework for 3D scene understanding that leverages off-the-shelf vision foundation models to improve performance and efficiency across scene understanding tasks. The framework addresses limitations of existing methods by enabling Gaussians to query features from multiple sensor modalities and by optimizing them at both the 2D and 3D levels.
LargeAD: Large-Scale Cross-Sensor Data Pretraining for Autonomous Driving
Positive · Artificial Intelligence
LargeAD has been introduced as a scalable framework for large-scale 3D pretraining in autonomous driving, utilizing vision foundation models (VFMs) to enhance semantic alignment between 2D images and LiDAR point clouds. The approach aims to improve understanding of complex 3D environments, which is crucial for advancing autonomous driving technology.
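As a rough illustration of what 2D-to-LiDAR semantic alignment involves, the sketch below pairs each LiDAR point's feature with the 2D feature sampled at its projected pixel location and pulls matched pairs together with a contrastive loss. This is a generic SLidR-style distillation recipe under assumed tensor shapes and names, not LargeAD's actual training code.

```python
# A hedged sketch of 2D-to-LiDAR feature alignment: project LiDAR points
# into the camera image, sample the frozen 2D VFM's feature map at those
# pixels, and train the 3D backbone's point features toward them with an
# InfoNCE loss. All shapes, names, and the loss choice are assumptions.
import torch
import torch.nn.functional as F


def align_points_to_pixels(point_feats, image_feats, uv, temperature=0.07):
    """InfoNCE loss pairing N point features with features sampled at
    their projected pixel coordinates.

    point_feats: (N, D) features from a 3D backbone (trainable)
    image_feats: (1, D, H, W) feature map from a frozen 2D VFM
    uv:          (N, 2) projected pixel coords, normalized to [-1, 1]
    """
    # Bilinearly sample the 2D feature map at each projected point.
    grid = uv.view(1, 1, -1, 2)                           # (1, 1, N, 2)
    pix = F.grid_sample(image_feats, grid, align_corners=False)
    pix = pix.squeeze(0).squeeze(1).t()                   # (N, D)

    p = F.normalize(point_feats, dim=1)
    q = F.normalize(pix, dim=1)
    logits = p @ q.t() / temperature                      # (N, N) similarities
    targets = torch.arange(p.size(0))                     # matched pairs on the diagonal
    return F.cross_entropy(logits, targets)


# Dummy usage: 128 points, 64-dim features, a 32x32 feature map.
loss = align_points_to_pixels(
    torch.randn(128, 64, requires_grad=True),   # point features to be trained
    torch.randn(1, 64, 32, 32),                 # frozen 2D feature map
    torch.rand(128, 2) * 2 - 1)                 # projected pixel coords
loss.backward()
```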
RULER-Bench: Probing Rule-based Reasoning Abilities of Next-level Video Generation Models for Vision Foundation Intelligence
Positive · Artificial Intelligence
Recent advancements in video generation have led to the introduction of RULER-Bench, a benchmark aimed at evaluating the rule-based reasoning capabilities of video generation models. This initiative addresses a significant gap in existing evaluations, which have primarily focused on visual perception and coherence, by incorporating cognitive rules into the assessment process.