CORE-3D: Context-aware Open-vocabulary Retrieval by Embeddings in 3D

arXiv — cs.CV · Tuesday, December 9, 2025, 5:00 AM
  • CORE-3D introduces a context-aware approach to 3D scene understanding that performs open-vocabulary retrieval over embeddings, yielding more accurate object-level masks in complex environments. The method pairs SemanticSAM with a refined CLIP encoding strategy to improve 3D semantic segmentation, addressing the fragmented masks and inaccurate semantic assignments produced by previous models (a rough sketch of the retrieval step follows this summary).
  • The development of CORE-3D is significant as it enhances the capabilities of embodied AI and robotics, facilitating more reliable perception for interaction and navigation in intricate 3D environments. By improving semantic mapping, it opens new avenues for applications in autonomous systems and robotics.
  • This advancement aligns with ongoing efforts in the AI field to enhance open-vocabulary capabilities across various applications, including 3D instance segmentation and object detection. The integration of context-aware models reflects a broader trend towards improving the robustness and accuracy of AI systems in understanding and interacting with complex environments.
— via World Pulse Now AI Editorial System
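As a rough illustration of the retrieval step described above (not the authors' code), the sketch below scores precomputed per-object CLIP embeddings against an open-vocabulary text query by cosine similarity; the embedding arrays, dimensions, and function names are hypothetical stand-ins.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Row-wise cosine similarity between query vector(s) a and matrix b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def retrieve_objects(text_embedding: np.ndarray,
                     mask_embeddings: np.ndarray,
                     top_k: int = 5) -> np.ndarray:
    """Return indices of the top_k 3D object masks whose (hypothetical,
    precomputed) CLIP embeddings best match the text query embedding."""
    scores = cosine_similarity(text_embedding[None, :], mask_embeddings)[0]
    return np.argsort(-scores)[:top_k]

# Toy usage: 100 object masks with 512-dim embeddings, one text query.
rng = np.random.default_rng(0)
mask_embs = rng.standard_normal((100, 512)).astype(np.float32)
query_emb = rng.standard_normal(512).astype(np.float32)  # stand-in for a CLIP text embedding
print(retrieve_objects(query_emb, mask_embs, top_k=3))
```

In the pipeline the summary describes, the per-object embeddings would come from SemanticSAM-derived masks encoded with the refined CLIP strategy rather than from random arrays.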

Continue Reading
VFM-VLM: Vision Foundation Model and Vision Language Model based Visual Comparison for 3D Pose Estimation
Positive · Artificial Intelligence
A recent study has conducted a visual comparison between Vision Foundation Models (VFMs) and Vision Language Models (VLMs) for 3D pose estimation, particularly in hand object grasping scenarios. The research highlights the strengths of CLIP in semantic understanding and DINOv2 in providing dense geometric features, demonstrating their complementary roles in enhancing 6D object pose estimation.
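The study reports a comparison rather than a single fusion recipe, but the complementarity claim is easy to picture. Below is a minimal sketch, with assumed feature shapes and a naive per-patch concatenation (not the authors' method), joining a CLIP-style semantic feature grid with a DINOv2-style dense geometric grid.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Normalize feature vectors along the channel axis."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

def fuse_features(clip_feats: np.ndarray, dino_feats: np.ndarray) -> np.ndarray:
    """Concatenate per-patch semantic (CLIP-like) and geometric (DINOv2-like)
    features after normalization, assuming both live on the same patch grid."""
    return np.concatenate([l2_normalize(clip_feats),
                           l2_normalize(dino_feats)], axis=-1)

# Toy grids: 16x16 patches, 512-dim semantic and 768-dim geometric features.
clip_feats = np.random.randn(16, 16, 512)
dino_feats = np.random.randn(16, 16, 768)
print(fuse_features(clip_feats, dino_feats).shape)  # (16, 16, 1280)
```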
Embodied Tree of Thoughts: Deliberate Manipulation Planning with Embodied World Model
Positive · Artificial Intelligence
The Embodied Tree of Thoughts (EToT) framework has been introduced as a significant advance in robot manipulation planning: it uses a physics-based interactive digital twin to predict future environmental states and to reason about actions before execution. The approach aims to overcome limitations of existing video-generation models, which often lack physical grounding and consistency under long-horizon constraints.
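Purely as a hedged sketch of the tree-of-thoughts planning pattern the summary describes, the code below expands a shallow tree of candidate action sequences with beam pruning, scoring rollouts in a stand-in world model; the `WorldModel` class and its `step` interface are hypothetical, not the EToT API.

```python
import random

class WorldModel:
    """Hypothetical stand-in for a physics-based digital twin: predicts the
    next state and a score for a candidate action."""
    def step(self, state, action):
        next_state = state + [action]
        score = random.random()  # placeholder for a learned or physics-based value
        return next_state, score

def tree_of_thoughts_plan(model, state, actions, depth=2, beam=2):
    """Expand a shallow tree of action sequences, keep the best `beam`
    branches at each level, and return the highest-scoring sequence."""
    frontier = [(state, [], 0.0)]
    for _ in range(depth):
        candidates = []
        for s, seq, total in frontier:
            for a in actions:
                ns, r = model.step(s, a)
                candidates.append((ns, seq + [a], total + r))
        candidates.sort(key=lambda c: -c[2])
        frontier = candidates[:beam]  # deliberate pruning, beam-search style
    return max(frontier, key=lambda c: c[2])[1]

print(tree_of_thoughts_plan(WorldModel(), [], ["grasp", "push", "rotate"]))
```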
OpenMonoGS-SLAM: Monocular Gaussian Splatting SLAM with Open-set Semantics
Positive · Artificial Intelligence
OpenMonoGS-SLAM has been introduced as a pioneering monocular SLAM framework that integrates 3D Gaussian Splatting with open-set semantic understanding, enhancing the capabilities of simultaneous localization and mapping in robotics and autonomous systems. This development leverages advanced Visual Foundation Models to improve tracking and mapping accuracy in diverse environments.
SATGround: A Spatially-Aware Approach for Visual Grounding in Remote Sensing
Positive · Artificial Intelligence
A novel approach called SATGround has been introduced to enhance visual grounding in remote sensing through a structured localization mechanism that fine-tunes a pretrained vision-language model (VLM) on diverse instruction-following tasks. This method significantly improves the model's ability to localize objects in complex satellite imagery, achieving a 24.8% relative improvement over previous methods in visual grounding benchmarks.
Beyond the Noise: Aligning Prompts with Latent Representations in Diffusion Models
Positive · Artificial Intelligence
A new study introduces NoisyCLIP, a method designed to enhance the alignment between text prompts and latent representations in diffusion models, addressing common issues of misalignment and hallucinations in generated images. This approach allows for early detection of misalignments during the denoising process, potentially improving the quality of outputs without waiting for complete generation.
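A hedged sketch of the early-detection idea: at an intermediate denoising step, estimate the clean sample with the standard DDPM parameterization, embed it, and compare it with the prompt embedding. The `encode_image` stub, the threshold, and the flagging logic are illustrative assumptions, not NoisyCLIP's actual procedure.

```python
import numpy as np

def predict_x0(x_t, eps_hat, alpha_bar_t):
    """Standard DDPM estimate of the clean sample from a noisy latent:
    x0_hat = (x_t - sqrt(1 - a_bar_t) * eps_hat) / sqrt(a_bar_t)."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_hat) / np.sqrt(alpha_bar_t)

def alignment_score(x0_hat, text_emb, encode_image):
    """Cosine similarity between a (hypothetical) image embedding of the
    predicted clean sample and the prompt embedding."""
    img_emb = encode_image(x0_hat)
    img_emb = img_emb / np.linalg.norm(img_emb)
    text_emb = text_emb / np.linalg.norm(text_emb)
    return float(img_emb @ text_emb)

def misaligned(x_t, eps_hat, alpha_bar_t, text_emb, encode_image, threshold=0.2):
    """Flag a misaligned generation at an intermediate step instead of
    waiting for the full denoising trajectory to finish."""
    return alignment_score(predict_x0(x_t, eps_hat, alpha_bar_t),
                           text_emb, encode_image) < threshold

# Toy usage: a random projection stands in for a CLIP image encoder.
rng = np.random.default_rng(0)
proj = rng.standard_normal((512, 64 * 64))
encode = lambda x: proj @ x.ravel()
print(misaligned(rng.standard_normal((64, 64)), rng.standard_normal((64, 64)),
                 0.5, rng.standard_normal(512), encode))
```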
Decoupling Template Bias in CLIP: Harnessing Empty Prompts for Enhanced Few-Shot Learning
Positive · Artificial Intelligence
A recent study has introduced a framework aimed at decoupling template bias in the Contrastive Language-Image Pre-Training (CLIP) model by utilizing empty prompts. This approach addresses the issue of template-sample similarity (TSS) bias, which can hinder the model's accuracy and robustness in classification tasks. The framework operates in two stages: reducing bias during pre-training and enforcing correct alignment during few-shot fine-tuning.
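One plausible reading of the empty-prompt correction, offered only as a sketch: measure how similar an image is to the bare template (e.g., "a photo of a" with no class name) and subtract that from each class score, so the template's own contribution cancels out. The exact correction in the paper may differ.

```python
import numpy as np

def debiased_scores(image_emb, class_embs, empty_emb):
    """Subtract the similarity an image has to an 'empty' template from its
    per-class similarities; one plausible reading of the debiasing step."""
    def norm(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)
    image_emb, class_embs, empty_emb = norm(image_emb), norm(class_embs), norm(empty_emb)
    raw = class_embs @ image_emb   # similarity to "a photo of a {class}"
    bias = empty_emb @ image_emb   # similarity to the classless template
    return raw - bias

# Toy usage: 10 classes, 512-dim embeddings standing in for CLIP outputs.
rng = np.random.default_rng(1)
scores = debiased_scores(rng.standard_normal(512),
                         rng.standard_normal((10, 512)),
                         rng.standard_normal(512))
print(int(np.argmax(scores)))
```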
Shape and Texture Recognition in Large Vision-Language Models
Neutral · Artificial Intelligence
The Large Shapes and Textures dataset (LAS&T) has been introduced to enhance the capabilities of Large Vision-Language Models (LVLMs) in recognizing and representing shapes and textures across various contexts. This dataset, created through unsupervised extraction from natural images, serves as a benchmark for evaluating the performance of leading models like CLIP and DINO in shape recognition tasks.
CAPE: A CLIP-Aware Pointing Ensemble of Complementary Heatmap Cues for Embodied Reference Understanding
Positive · Artificial Intelligence
The study introduces CAPE, a dual-model framework for Embodied Reference Understanding that predicts which object a person is referencing through pointing gestures and language. The model uses a Gaussian ray heatmap representation to focus attention on visual cues along the pointing direction, addressing existing methods' tendency to overlook critical disambiguation signals.
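A minimal sketch of a Gaussian ray heatmap in 2D, assuming the idea is to weight each pixel by its perpendicular distance to the pointing ray; the parameter names and the 2D simplification are illustrative assumptions rather than CAPE's exact representation.

```python
import numpy as np

def gaussian_ray_heatmap(h, w, origin, direction, sigma=10.0):
    """Heatmap over an h x w image where each pixel's weight falls off with
    its perpendicular distance to the ray from `origin` along `direction`."""
    d = np.asarray(direction, dtype=float)
    d /= np.linalg.norm(d)
    ys, xs = np.mgrid[0:h, 0:w]
    rel = np.stack([xs - origin[0], ys - origin[1]], axis=-1).astype(float)
    along = rel @ d                           # signed distance along the ray
    perp = np.linalg.norm(rel - along[..., None] * d, axis=-1)
    heat = np.exp(-0.5 * (perp / sigma) ** 2)
    heat[along < 0] = 0.0                     # keep only pixels ahead of the hand
    return heat

# Toy usage: ray from the image center toward the upper right.
hm = gaussian_ray_heatmap(128, 128, origin=(64, 64), direction=(1, -1))
print(hm.shape, float(hm.max()))
```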