arXiv:2412.20206v3
Abstract: Visual Grounding, also known as Referring Expression Comprehension and Phrase Grounding, aims to ground the specific region(s) within the image(s) based on the given expression text. This task simulates the common referential relationships between visual and linguistic modalities, enabling machines to develop human-like multimodal comprehension capabilities. Consequently, it has extensive applications in various domains. Since 2021, visual grounding has advanced significantly, with new concepts such as grounded pre-training, grounding multimodal LLMs, generalized visual grounding, and giga-pixel grounding emerging and bringing numerous new challenges. In this survey, we first examine the developmental history of visual grounding and provide an overview of essential background knowledge. We systematically track and summarize the advancements, and then define and organize the various settings to standardize future research and ensure fair comparison. Additionally, we review numerous related datasets and applications, and highlight several advanced topics. Finally, we outline the challenges confronting visual grounding and propose directions for future research, which may serve as inspiration for subsequent researchers. By extracting common technical details, this survey covers the representative work in each subtopic over the past decade. To the best of our knowledge, this paper is the most comprehensive overview currently available in the field of visual grounding. The survey is intended for both beginners and experienced researchers, serving as a resource for understanding key concepts and tracking the latest research developments. We continue to track related work at https://github.com/linhuixiao/Awesome-Visual-Grounding.

A survey titled 'Towards Visual Grounding' explores advancements in visual grounding, a task that connects specific image regions to corresponding text expressions. Since 2021, significant progress has been made, introducing concepts like grounded pre-training and giga-pixel grounding. This research is crucial for enhancing machine comprehension of visual and linguistic modalities, with broad applications across various fields.

Towards Visual Grounding: A Survey
