Grounding DINO: Open Vocabulary Object Detection on Videos

PyImageSearch · Monday, December 8, 2025 at 1:50:00 PM
  • Grounding DINO is a framework for open-vocabulary object detection in videos: it grounds free-form language prompts in visual features, so the detector can localize object categories beyond a fixed, predefined label set, improving both the accuracy and flexibility of detection systems.
  • This development positions DINO and its associated technologies at the forefront of AI-driven object detection, with potential applications in fields such as autonomous driving, surveillance, and content analysis.
  • Grounding DINO fits the broader shift toward open-vocabulary frameworks in AI, which aim to overcome the limitations of traditional closed-set models and improve adaptability across diverse environments, fostering innovation in computer vision and related domains.
— via World Pulse Now AI Editorial System
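Grounding DINO itself fuses text and image features with cross-attention inside the detector; as a minimal illustration of the open-vocabulary idea it builds on, candidate-region features can be scored against embeddings of arbitrary text phrases instead of a fixed classifier head. The sketch below uses toy hand-made vectors (all features and phrase embeddings here are illustrative assumptions, not the model's real representations):

```python
import numpy as np

def cosine_scores(region_feats, text_embeds):
    """Cosine similarity of each candidate region to each text phrase.

    In an open-vocabulary detector, the phrase set can change per query,
    so there is no fixed output layer tied to predefined categories.
    """
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    return r @ t.T  # shape: (num_regions, num_phrases)

# Toy features: 2 candidate boxes scored against 3 free-text phrases.
regions = np.array([[0.9, 0.1, 0.00],
                    [0.0, 0.2, 0.95]])
phrases = np.array([[1.0, 0.0, 0.0],   # "dog"
                    [0.0, 1.0, 0.0],   # "skateboard"
                    [0.0, 0.0, 1.0]])  # "traffic light"

scores = cosine_scores(regions, phrases)
labels = scores.argmax(axis=1)  # best-matching phrase per box
```

Because the labels are just text, swapping in a new phrase list at query time retargets the same detector to new categories with no retraining, which is the property the article highlights.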


Continue Reading
Shape and Texture Recognition in Large Vision-Language Models
Neutral · Artificial Intelligence
The Large Shapes and Textures dataset (LAS&T) has been introduced to enhance the capabilities of Large Vision-Language Models (LVLMs) in recognizing and representing shapes and textures across various contexts. This dataset, created through unsupervised extraction from natural images, serves as a benchmark for evaluating the performance of leading models like CLIP and DINO in shape recognition tasks.
The Inductive Bottleneck: Data-Driven Emergence of Representational Sparsity in Vision Transformers
Neutral · Artificial Intelligence
Recent research has identified an 'Inductive Bottleneck' in Vision Transformers (ViTs), where these models exhibit a U-shaped entropy profile, compressing information in middle layers before expanding it for final classification. This phenomenon is linked to the semantic abstraction required by specific tasks and is not merely an architectural flaw but a data-dependent adaptation observed across various datasets such as UC Merced, Tiny ImageNet, and CIFAR-100.
One Layer Is Enough: Adapting Pretrained Visual Encoders for Image Generation
Positive · Artificial Intelligence
A new framework called Feature Auto-Encoder (FAE) has been introduced to adapt pre-trained visual representations for image generation, addressing challenges in aligning high-dimensional features with low-dimensional generative models. This approach aims to simplify the adaptation process, enhancing the efficiency and quality of generated images.