arXiv:2511.12614v1 Announce Type: cross 
Abstract: We introduce a unified, end-to-end framework that seamlessly integrates object detection and pose estimation with a versatile onboarding process. Our pipeline begins with an onboarding stage that generates object representations from either traditional 3D CAD models or, in their absence, by rapidly reconstructing a high-fidelity neural representation (NeRF) from multi-view images. Given a test image, our system first employs the CNOS detector to localize target objects. For each detection, our novel pose estimation module, OPFormer, infers the precise 6D pose. The core of OPFormer is a transformer-based architecture that leverages a foundation model for robust feature extraction. It uniquely learns a comprehensive object representation by jointly encoding multiple template views and enriches these features with explicit 3D geometric priors using Normalized Object Coordinate Space (NOCS). A decoder then establishes robust 2D-3D correspondences to determine the final pose. Evaluated on the challenging BOP benchmarks, our integrated system demonstrates a strong balance between accuracy and efficiency, showcasing its practical applicability in both model-based and model-free scenarios.
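The last step in the abstract, going from 2D-3D correspondences to a pose, can be illustrated with a minimal sketch. The abstract does not say which solver OPFormer uses, so the code below swaps in a plain Direct Linear Transform (DLT) as a stand-in. The function name `pose_from_2d3d` and its arguments are hypothetical, and the sketch assumes known intrinsics `K`, at least six non-coplanar points, and outlier-free matches.

```python
import numpy as np

def pose_from_2d3d(K, pts3d, pts2d):
    """Recover (R, t) from n >= 6 exact 2D-3D correspondences with a DLT.

    Illustrative stand-in only: not the solver described in the paper.
    pts3d is (n, 3) in object coordinates, pts2d is (n, 2) in pixels.
    """
    n = pts3d.shape[0]
    # Remove the intrinsics so that x ~ [R|t] X holds in normalized coordinates.
    uv1 = np.hstack([pts2d, np.ones((n, 1))])
    xn = uv1 @ np.linalg.inv(K).T
    Xh = np.hstack([pts3d, np.ones((n, 1))])

    # Each correspondence contributes two linear equations in the 12 entries of P.
    A = np.zeros((2 * n, 12))
    A[0::2, 0:4] = Xh
    A[0::2, 8:12] = -xn[:, [0]] * Xh
    A[1::2, 4:8] = Xh
    A[1::2, 8:12] = -xn[:, [1]] * Xh

    # The null vector of A is P = s * [R|t] for an unknown scalar s.
    P = np.linalg.svd(A)[2][-1].reshape(3, 4)

    # Project the left 3x3 block onto the nearest orthogonal matrix and undo |s|.
    U, S, Vt = np.linalg.svd(P[:, :3])
    R = U @ Vt
    t = P[:, 3] / S.mean()
    if np.linalg.det(R) < 0:  # s was negative: flip the sign of both parts
        R, t = -R, -t
    return R, t
```

With noisy or partly wrong correspondences, a robust estimator (for example RANSAC around a PnP solver) would usually replace this noise-free formulation.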

The paper introduces OPFormer, a framework for object pose estimation that combines object detection and pose estimation in a single pipeline. An onboarding stage builds object representations from 3D CAD models or, when no CAD model is available, from a neural representation (NeRF) reconstructed from multi-view images. The CNOS detector localizes objects in the test image, and OPFormer then infers the 6D pose of each detection with a transformer-based architecture that incorporates geometric priors through Normalized Object Coordinate Space (NOCS).
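To make the NOCS prior concrete, here is a minimal sketch of how model vertices can be mapped into a normalized coordinate space. It assumes one common convention (center the tight bounding box at 0.5 and divide by its diagonal); the paper's exact normalization is not given in the abstract, and the helper names `to_nocs` and `from_nocs` are hypothetical.

```python
import numpy as np

def to_nocs(vertices):
    """Map (n, 3) model vertices into the unit cube.

    Assumed convention: the tight axis-aligned bounding box is centered
    at 0.5 and scaled so that its diagonal has length 1.
    Degenerate inputs (a single point, diag == 0) are not handled.
    """
    vertices = np.asarray(vertices, dtype=float)
    lo, hi = vertices.min(axis=0), vertices.max(axis=0)
    center = (lo + hi) / 2.0
    diag = np.linalg.norm(hi - lo)
    nocs = (vertices - center) / diag + 0.5
    return nocs, center, diag

def from_nocs(nocs, center, diag):
    """Invert to_nocs: recover metric object coordinates."""
    return (np.asarray(nocs, dtype=float) - 0.5) * diag + center
```

Because the mapping is invertible, a NOCS value predicted for an image pixel corresponds to a 3D point on the object, which is what makes it usable as one side of a 2D-3D correspondence.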

OPFormer: Object Pose Estimation leveraging foundation model with geometric encoding

arXiv:2511.16857v1 Announce Type: new 
Abstract: Vision Language Models (VLMs) have achieved impressive performance on spatial reasoning benchmarks, yet these evaluations mask critical weaknesses in understanding object interactions. Current benchmarks test high-level relationships ('left of', 'behind', etc.) but ignore the fine-grained spatial understanding needed for real-world applications: precise 3D localization, physical compatibility between objects, object affordances, and multi-step spatial planning. In this work, we present BOP-ASK, a novel large-scale dataset for object-interaction reasoning for both training and benchmarking. Our data generation pipeline leverages 6D object poses from the Benchmark for Object Pose Estimation (BOP) datasets, from which we derive fine-grained annotations such as grasp poses, referred object poses, path planning trajectories, relative spatial and depth relationships, and object-to-object relationships. BOP-ASK comprises over 150k images and 33M question-answer pairs spanning six tasks (four novel), providing a rich resource for training and evaluating VLMs. We evaluate proprietary and open-source VLMs, and conduct human evaluations on BOP-ASK-core, a contributed test benchmark. We also release BOP-ASK-lab, an out-of-distribution benchmark with images not sourced from BOP, enabling testing of generalization. Our experiments demonstrate that models trained on BOP-ASK outperform baselines and exhibit emergent capabilities such as precise object and grasp pose estimation, trajectory planning, and fine-grained object-centric spatial reasoning in cluttered environments. We will publicly release our datasets and dataset generation pipeline.
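As a rough illustration of how relative spatial and depth relationships could be derived from 6D poses, the sketch below turns two camera-frame object translations into question-answer pairs. This is not the BOP-ASK pipeline: the function `relation_qa`, the question templates, and the assumed camera convention (+x right, +z forward) are all hypothetical.

```python
import numpy as np

def relation_qa(name_a, t_a, name_b, t_b):
    """Build two question-answer pairs from camera-frame translations.

    Assumes +x points right and +z points away from the camera.
    Only ratios and orderings are used, so the length unit does not matter.
    """
    t_a = np.asarray(t_a, dtype=float)
    t_b = np.asarray(t_b, dtype=float)

    # Depth relation: the smaller z is closer to the camera.
    closer = name_a if t_a[2] < t_b[2] else name_b

    # Left/right relation in the image: compare perspective-projected x (x / z).
    left = name_a if t_a[0] / t_a[2] < t_b[0] / t_b[2] else name_b

    return [
        {"question": f"Which object is closer to the camera, the {name_a} or the {name_b}?",
         "answer": closer},
        {"question": f"Which object appears further left in the image, the {name_a} or the {name_b}?",
         "answer": left},
    ]
```

Richer annotations named in the abstract, such as grasp poses or planned trajectories, would need the full rotation and the object geometry rather than translations alone.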

BOP-ASK is a new dataset for training and benchmarking object-interaction reasoning in Vision Language Models (VLMs). It addresses a limitation of existing benchmarks, which test high-level spatial relationships but neglect the fine-grained spatial understanding needed for real-world applications. BOP-ASK comprises over 150,000 images and 33 million question-answer pairs, with annotations derived from the 6D object poses in the BOP datasets.
