arXiv:2510.25257v1 Announce Type: new 
Abstract: Real-time object detection has achieved substantial progress through meticulously designed architectures and optimization strategies. However, the pursuit of high-speed inference via lightweight network designs often leads to degraded feature representation, which hinders further performance improvements and practical on-device deployment. In this paper, we propose a cost-effective and highly adaptable distillation framework that harnesses the rapidly evolving capabilities of Vision Foundation Models (VFMs) to enhance lightweight object detectors. Given the significant architectural and learning objective disparities between VFMs and resource-constrained detectors, achieving stable and task-aligned semantic transfer is challenging. To address this, on one hand, we introduce a Deep Semantic Injector (DSI) module that facilitates the integration of high-level representations from VFMs into the deep layers of the detector. On the other hand, we devise a Gradient-guided Adaptive Modulation (GAM) strategy, which dynamically adjusts the intensity of semantic transfer based on gradient norm ratios. Without increasing deployment and inference overhead, our approach painlessly delivers striking and consistent performance gains across diverse DETR-based models, underscoring its practical utility for real-time detection. Our new model family, RT-DETRv4, achieves state-of-the-art results on COCO, attaining AP scores of 49.7/53.5/55.4/57.0 at corresponding speeds of 273/169/124/78 FPS.

The RT-DETRv4 paper introduces a distillation framework that uses Vision Foundation Models (VFMs) to improve lightweight real-time object detectors without adding deployment or inference overhead. It targets the degraded feature representations that lightweight designs trade for speed, which limit both accuracy and practical on-device deployment. The resulting models could benefit applications ranging from autonomous vehicles to smart surveillance systems.

RT-DETRv4: Painlessly Furthering Real-Time Object Detection with Vision Foundation Models
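
The RT-DETRv4 abstract says that GAM adjusts the intensity of semantic transfer based on gradient norm ratios, but it does not give the update rule. The following is only a minimal sketch of one way such a modulation could look, not the authors' method: the function names, `target_ratio`, and `rate` are all hypothetical, and gradients are represented as flat lists of floats to keep the example dependency-free.

```python
import math


def grad_norm(grads):
    """L2 norm of a flat list of gradient values."""
    return math.sqrt(sum(g * g for g in grads))


def modulate_weight(weight, det_grads, distill_grads,
                    target_ratio=0.1, rate=0.5, eps=1e-12):
    """Hypothetical rule: nudge the distillation weight so that the weighted
    distillation gradient norm moves toward target_ratio times the
    detection gradient norm."""
    det = grad_norm(det_grads)
    dis = weight * grad_norm(distill_grads)
    ratio = dis / (det + eps)
    # ratio above target shrinks the weight; ratio below target grows it.
    # rate < 1 damps the correction, rate = 1 applies it in full.
    scale = (target_ratio / (ratio + eps)) ** rate
    return weight * scale


# Distillation gradients are too strong here (ratio 0.2 vs. target 0.1),
# so the returned weight is smaller than the input weight.
new_weight = modulate_weight(1.0, [3.0, 4.0], [0.6, 0.8])
print(new_weight)
```

In a real training loop the two gradient lists would come from the detection loss and the distillation loss measured on the same shared parameters, and the returned weight would scale the distillation loss on the next step.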

arXiv:2512.03370v1 Announce Type: new 
Abstract: We introduce ShelfGaussian, an open-vocabulary multi-modal Gaussian-based 3D scene understanding framework supervised by off-the-shelf vision foundation models (VFMs). Gaussian-based methods have demonstrated superior performance and computational efficiency across a wide range of scene understanding tasks. However, existing methods either model objects as closed-set semantic Gaussians supervised by annotated 3D labels, neglecting their rendering ability, or learn open-set Gaussian representations via purely 2D self-supervision, leading to degraded geometry and limited to camera-only settings. To fully exploit the potential of Gaussians, we propose a Multi-Modal Gaussian Transformer that enables Gaussians to query features from diverse sensor modalities, and a Shelf-Supervised Learning Paradigm that efficiently optimizes Gaussians with VFM features jointly at 2D image and 3D scene levels. We evaluate ShelfGaussian on various perception and planning tasks. Experiments on Occ3D-nuScenes demonstrate its state-of-the-art zero-shot semantic occupancy prediction performance. ShelfGaussian is further evaluated on an unmanned ground vehicle (UGV) to assess its in-the-wild performance across diverse urban scenarios. Project website: https://lunarlab-gatech.github.io/ShelfGaussian/.

ShelfGaussian is an open-vocabulary, multi-modal, Gaussian-based framework for 3D scene understanding that is supervised by off-the-shelf vision foundation models (VFMs). It addresses limitations of existing methods by letting Gaussians query features from multiple sensor modalities and by optimizing them with VFM features at both the 2D image level and the 3D scene level.

ShelfGaussian: Shelf-Supervised Open-Vocabulary Gaussian-based 3D Scene Understanding

arXiv:2501.04005v3 Announce Type: replace-cross 
Abstract: Recent advancements in vision foundation models (VFMs) have revolutionized visual perception in 2D, yet their potential for 3D scene understanding, particularly in autonomous driving applications, remains underexplored. In this paper, we introduce LargeAD, a versatile and scalable framework designed for large-scale 3D pretraining across diverse real-world driving datasets. Our framework leverages VFMs to extract semantically rich superpixels from 2D images, which are aligned with LiDAR point clouds to generate high-quality contrastive samples. This alignment facilitates cross-modal representation learning, enhancing the semantic consistency between 2D and 3D data. We introduce several key innovations: (i) VFM-driven superpixel generation for detailed semantic representation, (ii) a VFM-assisted contrastive learning strategy to align multimodal features, (iii) superpoint temporal consistency to maintain stable representations across time, and (iv) multi-source data pretraining to generalize across various LiDAR configurations. Our approach achieves substantial gains over state-of-the-art methods in linear probing and fine-tuning for LiDAR-based segmentation and object detection. Extensive experiments on 11 large-scale multi-sensor datasets highlight our superior performance, demonstrating adaptability, efficiency, and robustness in real-world autonomous driving scenarios.

LargeAD is a scalable framework for large-scale 3D pretraining in autonomous driving. It uses vision foundation models (VFMs) to extract semantically rich superpixels from 2D images and aligns them with LiDAR point clouds to build contrastive samples for cross-modal representation learning. The approach aims to improve the understanding of complex 3D environments, which is crucial for advancing autonomous driving technologies.

LargeAD: Large-Scale Cross-Sensor Data Pretraining for Autonomous Driving
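
The LargeAD abstract describes aligning VFM-derived superpixels with LiDAR points to form contrastive samples, without stating the loss. As an illustration only, the sketch below uses a standard InfoNCE objective over matched feature pairs; this is an assumed stand-in, not necessarily the loss used in the paper. The function names, the temperature value, and the toy feature vectors are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are left unchanged)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]


def info_nce(superpoint_feats, superpixel_feats, temperature=0.07):
    """Mean InfoNCE loss. Row i of each list is assumed to be a matched
    (3D superpoint, 2D superpixel) pair; all other rows act as negatives."""
    p = [normalize(v) for v in superpoint_feats]
    q = [normalize(v) for v in superpixel_feats]
    total = 0.0
    for i, pi in enumerate(p):
        # Cosine similarity to every superpixel, scaled by the temperature.
        logits = [sum(a * b for a, b in zip(pi, qj)) / temperature for qj in q]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(p)


# Matched pairs give a much lower loss than mismatched pairs.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned, shuffled)
```

Minimizing this loss pulls each pooled 3D feature toward its paired 2D feature and pushes it away from the others, which is the general mechanism behind cross-modal contrastive pretraining.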

arXiv:2512.02622v1 Announce Type: new 
Abstract: Recent advances in video generation have enabled the synthesis of videos with strong temporal consistency and impressive visual quality, marking a crucial step toward vision foundation models. To evaluate these video generation models, existing benchmarks primarily focus on factors related to visual perception and understanding, like visual aesthetics, instruction adherence, and temporal coherence. However, the rule-based reasoning capabilities of video generation models remain largely unexplored. Although recent studies have carried out preliminary explorations into whether video models can serve as zero-shot learners, they still lack a fine-grained decomposition of reasoning capabilities and a comprehensive evaluation protocol. To address this gap, we introduce RULER-Bench, a benchmark designed to evaluate the reasoning ability of video generation models from the perspective of cognitive rules. Built upon two fundamental paradigms: text-to-video and image-to-video, RULER-Bench covers 40 representative tasks spanning six rule categories with 622 high-quality annotated instances. For the evaluation of each generated video, we construct a checklist covering four metrics and leverage GPT-o3 to assign scores to each question, achieving 85% alignment with human judgements. Extensive experiments show that the state-of-the-art model achieves only 48.87% on the rule coherence metric, highlighting significant room for improvement in the reasoning capability of next-level video models. We expect that the insight obtained from RULER-Bench will facilitate further development of reasoning-aware video generation, advancing video generation models toward vision foundation intelligence.

RULER-Bench is a benchmark for evaluating the rule-based reasoning capabilities of video generation models. It addresses a gap in existing benchmarks, which focus mainly on visual aesthetics, instruction adherence, and temporal coherence, by assessing generated videos against cognitive rules across 40 tasks in six rule categories.
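
The RULER-Bench abstract mentions a per-video checklist covering four metrics, scored by a model judge, and reports 85% alignment with human judgements. The sketch below shows one plausible way to aggregate checklist scores and to measure judge-human agreement. It is an assumption-laden illustration: the dictionary layout, the `visual_quality` metric name, and the definition of agreement as exact label match are hypothetical; only "rule coherence" is named in the abstract.

```python
from collections import defaultdict


def metric_scores(checklist):
    """Average the per-question scores (0.0 to 1.0) for each metric."""
    buckets = defaultdict(list)
    for item in checklist:
        buckets[item["metric"]].append(item["score"])
    return {metric: sum(s) / len(s) for metric, s in buckets.items()}


def agreement(judge_labels, human_labels):
    """Fraction of questions where the judge and the human give the same label."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


# Hypothetical checklist for one generated video.
checklist = [
    {"metric": "rule_coherence", "score": 1.0},
    {"metric": "rule_coherence", "score": 0.0},
    {"metric": "visual_quality", "score": 1.0},
]
print(metric_scores(checklist))
print(agreement([1, 0, 1, 1], [1, 0, 0, 1]))
```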
