Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Image Generation

arXiv — cs.CV · Tuesday, October 28, 2025
A new study proposes using pretrained vision foundation models as visual tokenizers for autoregressive image generation. Its region-adaptive quantization framework reduces redundancy in the pre-trained features, yielding a more compact and efficient image encoding. The result suggests a practical path toward more effective and streamlined image generators, with potential applications across artificial intelligence and digital media.
— via World Pulse Now AI Editorial System
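The summary does not detail the tokenizer itself, so the following is only a minimal sketch, assuming a simple "group, pool, then quantize" reading of region-adaptive quantization: redundant VFM patch features are clustered into regions, pooled, and matched against a codebook. Every name, shape, and hyperparameter here (num_regions, the k-means grouping, the codebook size) is an illustrative assumption, not the paper's actual method.

```python
# Hedged sketch of region-adaptive quantization over frozen VFM patch
# features. All hyperparameters below are illustrative assumptions.
import torch

def region_adaptive_quantize(feats, codebook, num_regions=64, iters=10):
    """feats: (N, D) patch features from a frozen VFM; codebook: (K, D)."""
    # 1) Group redundant patches into regions with a simple k-means
    #    (a stand-in for whatever region assignment the paper uses).
    centers = feats[torch.randperm(feats.size(0))[:num_regions]].clone()
    for _ in range(iters):
        assign = torch.cdist(feats, centers).argmin(dim=1)       # (N,)
        for r in range(num_regions):
            mask = assign == r
            if mask.any():
                centers[r] = feats[mask].mean(dim=0)
    # 2) Each region is already pooled into one feature (its center),
    #    so a single token covers many redundant patches.
    region_feats = centers                                        # (R, D)
    # 3) Nearest-neighbour quantization against the codebook.
    codes = torch.cdist(region_feats, codebook).argmin(dim=1)     # (R,)
    return codes, codebook[codes], assign

# Toy usage with random stand-ins for VFM features and a learned codebook.
feats = torch.randn(256, 768)       # e.g. a 16x16 patch grid, D = 768
codebook = torch.randn(8192, 768)
codes, quantized, assign = region_adaptive_quantize(feats, codebook)
print(codes.shape, quantized.shape)  # torch.Size([64]) torch.Size([64, 768])
```

Grouping before quantizing is what would let one code cover many visually redundant patches, shortening the token sequence the autoregressive generator has to model.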

Continue Reading
ShelfGaussian: Shelf-Supervised Open-Vocabulary Gaussian-based 3D Scene Understanding
Positive · Artificial Intelligence
ShelfGaussian is an open-vocabulary, multi-modal, Gaussian-based framework for 3D scene understanding that leverages off-the-shelf vision foundation models to improve both performance and efficiency. It addresses limitations of existing methods by letting its Gaussians query features from multiple sensor modalities and by optimizing those Gaussians at both the 2D and 3D levels.
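The summary only names the mechanism, so here is a minimal sketch, under assumed conventions, of how a Gaussian might "query" per-modality features: project its center into the camera to bilinearly sample a 2D VFM feature map, look up its nearest LiDAR point for a 3D feature, and fuse the two. The pinhole projection, the nearest-neighbour lookup, and all names are assumptions, not ShelfGaussian's actual code.

```python
# Minimal sketch of per-Gaussian multi-modal feature querying; the
# projection model and the fusion-by-concatenation are assumptions.
import torch
import torch.nn.functional as F

def query_gaussian_features(centers, K, img_feats, lidar_xyz, lidar_feats):
    """centers: (G, 3) Gaussian means in the camera frame; K: (3, 3)
    intrinsics; img_feats: (C, H, W) 2D VFM feature map;
    lidar_xyz: (P, 3) points; lidar_feats: (P, C3) point features."""
    C, H, W = img_feats.shape
    # Project Gaussian centers into the image plane (pinhole model).
    uvw = centers @ K.T                                # (G, 3)
    uv = uvw[:, :2] / uvw[:, 2:3].clamp(min=1e-6)
    # Normalize to [-1, 1] and bilinearly sample the 2D feature map.
    grid = torch.stack([uv[:, 0] / (W - 1), uv[:, 1] / (H - 1)], -1) * 2 - 1
    sampled = F.grid_sample(img_feats[None], grid[None, :, None, :],
                            align_corners=True)        # (1, C, G, 1)
    feat_2d = sampled[0, :, :, 0].T                    # (G, C)
    # Nearest-LiDAR-point lookup for the 3D modality.
    nn_idx = torch.cdist(centers, lidar_xyz).argmin(dim=1)
    feat_3d = lidar_feats[nn_idx]                      # (G, C3)
    return torch.cat([feat_2d, feat_3d], dim=-1)       # fused per-Gaussian

# Toy usage: 100 Gaussians in front of a camera, 5000 LiDAR points.
fused = query_gaussian_features(
    torch.rand(100, 3) + torch.tensor([0., 0., 2.]),
    torch.tensor([[100., 0., 32.], [0., 100., 32.], [0., 0., 1.]]),
    torch.randn(64, 64, 64),            # (C, H, W) image features
    torch.rand(5000, 3) * 4 - 2, torch.randn(5000, 32))
print(fused.shape)                      # torch.Size([100, 96])
```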
LargeAD: Large-Scale Cross-Sensor Data Pretraining for Autonomous Driving
Positive · Artificial Intelligence
LargeAD is a scalable framework for large-scale 3D pretraining in autonomous driving that uses vision foundation models (VFMs) to strengthen the semantic alignment between 2D images and LiDAR point clouds. Better cross-sensor alignment improves the understanding of complex 3D environments, which is crucial for advancing autonomous driving technologies.
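The blurb does not give LargeAD's training objective, so the following is a hedged sketch of the kind of 2D-to-3D feature distillation such semantic alignment typically implies: features from the 3D backbone are pulled toward frozen VFM features at the pixels the LiDAR points project to. The cosine-similarity loss and all names here are illustrative assumptions, not the paper's loss.

```python
# Hedged sketch of a 2D-to-3D feature-distillation objective of the kind
# cross-sensor semantic alignment suggests; the loss form is an assumption.
import torch
import torch.nn.functional as F

def alignment_loss(point_feats, pixel_feats):
    """point_feats: (N, D) from the 3D backbone, for LiDAR points that
    project into the image; pixel_feats: (N, D) frozen VFM features sampled
    at the matching pixels. Pulls each point feature toward its pixel."""
    point_feats = F.normalize(point_feats, dim=-1)
    pixel_feats = F.normalize(pixel_feats, dim=-1)
    # Negative cosine similarity, averaged over matched point-pixel pairs.
    return -(point_feats * pixel_feats).sum(dim=-1).mean()

loss = alignment_loss(torch.randn(1024, 256), torch.randn(1024, 256))
```

Normalizing both sides keeps the loss scale-free, so only the direction of each feature, i.e. its semantics, is aligned across the two sensors.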
RULER-Bench: Probing Rule-based Reasoning Abilities of Next-level Video Generation Models for Vision Foundation Intelligence
Positive · Artificial Intelligence
Recent advancements in video generation have led to the introduction of RULER-Bench, a benchmark aimed at evaluating the rule-based reasoning capabilities of video generation models. This initiative addresses a significant gap in existing evaluations, which have primarily focused on visual perception and coherence, by incorporating cognitive rules into the assessment process.
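No protocol details are given, but rule-based scoring can be pictured as evaluating predicates over generated frames. The sketch below, with a toy "object permanence" rule standing in for a real detector, is purely hypothetical and not RULER-Bench's actual benchmark.

```python
# Illustrative sketch of rule-based scoring of a generated video clip;
# the rule set and the pass criterion are hypothetical assumptions.
from typing import Callable, List
import numpy as np

Rule = Callable[[np.ndarray], bool]  # predicate over a (T, H, W, 3) clip

def object_count_constant(video: np.ndarray) -> bool:
    """Toy permanence rule: the per-frame count of bright pixels must not
    change across the clip (a crude stand-in for a real object detector)."""
    counts = [int((frame.mean(axis=-1) > 200).sum()) for frame in video]
    return len(set(counts)) == 1

def rule_score(video: np.ndarray, rules: List[Rule]) -> float:
    """Fraction of cognitive rules the generated clip satisfies."""
    return sum(rule(video) for rule in rules) / len(rules)

video = np.zeros((16, 64, 64, 3), dtype=np.uint8)    # dummy generated clip
print(rule_score(video, [object_count_constant]))     # 1.0
```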