PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions

arXiv — cs.CV · Tuesday, December 9, 2025 at 5:00:00 AM
  • PoSh is a new metric that uses scene graphs to guide LLMs-as-a-judge in evaluating detailed image descriptions generated by Vision-Language Models (VLMs). Traditional metrics like CIDEr and SPICE have struggled with longer texts, often failing to assess compositional understanding or pinpoint specific errors. PoSh aims to provide a more interpretable and reproducible scoring system, validated on the DOCENT dataset of expert-written reference descriptions for artwork (a minimal sketch of the idea appears after this summary).
  • This development is significant as it addresses the limitations of existing evaluation metrics for VLMs, particularly in the context of detailed image descriptions. By focusing on fine-grained errors and compositional understanding, PoSh offers a more nuanced approach that could lead to improvements in the performance of VLMs, ultimately benefiting applications in art analysis, content generation, and accessibility tools.
  • The challenges faced by VLMs in accurately interpreting and generating detailed descriptions reflect broader issues in the field of AI, including biases in training data and the need for more robust evaluation frameworks. As researchers explore various methodologies to enhance VLM capabilities, such as Latent Representation Probing and new benchmarks for spatial reasoning, the ongoing discourse emphasizes the importance of developing metrics that align closely with human judgment and understanding.
— via World Pulse Now AI Editorial System
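
The summary above does not give the paper's exact prompting or aggregation, but a minimal sketch of the general idea, decomposing a scene graph into checkable facts and asking an LLM judge about each one, may clarify it. The graph schema, the prompt wording, and the `query_llm` placeholder are illustrative assumptions, not the actual PoSh implementation.

```python
# Sketch: scene-graph-guided LLM-as-a-judge scoring (illustrative only).
from dataclasses import dataclass

@dataclass
class SceneGraph:
    objects: list[str]                     # e.g. ["woman", "violin"]
    attributes: dict[str, list[str]]       # object -> its attributes
    relations: list[tuple[str, str, str]]  # (subject, predicate, object)

def query_llm(prompt: str) -> str:
    """Placeholder for any chat-completion backend (hypothetical)."""
    raise NotImplementedError

def judge_fact(description: str, fact: str) -> float:
    """Ask the judge whether one scene-graph fact is conveyed correctly."""
    prompt = (
        "Reference fact from the image's scene graph:\n"
        f"  {fact}\n"
        "Candidate description:\n"
        f"  {description}\n"
        "Is the fact correctly conveyed? Answer 1 (yes), 0.5 (partial), or 0 (no)."
    )
    return float(query_llm(prompt).strip())

def scene_graph_guided_score(description: str, graph: SceneGraph) -> float:
    """Average per-fact judgments, so errors localize to graph nodes and edges."""
    facts = list(graph.objects)
    facts += [f"{o} is {a}" for o, attrs in graph.attributes.items() for a in attrs]
    facts += [f"{s} {p} {o}" for s, p, o in graph.relations]
    return sum(judge_fact(description, f) for f in facts) / len(facts)
```

Scoring per node and edge is what makes such a metric interpretable: a low overall score can be traced back to the specific object, attribute, or relation the judge flagged.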


Continue Reading
Are generative AI text annotations systematically biased?
Neutral · Artificial Intelligence
A recent study investigates bias in generative AI text annotations, replicating manual annotations from Boukes (2024) using various Generative Large Language Models (GLLMs) including Llama3.1, Llama3.3, GPT4o, and Qwen2.5. The findings indicate that while GLLMs achieve adequate F1 scores, they exhibit systematic bias, aligning more closely with each other than with manual annotations, which leads to different downstream results.
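
A minimal sketch of the agreement pattern the study reports, using hypothetical binary annotations: each model scores adequately against the manual labels, yet shared errors show up as higher pairwise agreement among the models than with the gold annotations. The labels below are fabricated for illustration, not the study's data.

```python
# Sketch: adequate F1 vs. manual labels, but higher model-to-model agreement.
from itertools import combinations
from sklearn.metrics import f1_score

manual = [1, 0, 1, 1, 0, 0, 1, 0]          # hypothetical gold annotations
model_labels = {                            # hypothetical model annotations
    "Llama3.1": [1, 0, 1, 0, 0, 1, 1, 0],
    "GPT4o":    [1, 0, 1, 0, 0, 1, 1, 1],
    "Qwen2.5":  [1, 0, 1, 0, 0, 1, 0, 1],
}

# Agreement with the manual annotations (adequate, but imperfect).
for name, labels in model_labels.items():
    print(name, "F1 vs manual:", round(f1_score(manual, labels), 2))

# Pairwise agreement among models: if this is consistently higher than
# agreement with the gold labels, the errors are shared, i.e. systematic.
for (a, la), (b, lb) in combinations(model_labels.items(), 2):
    agree = sum(x == y for x, y in zip(la, lb)) / len(la)
    print(f"{a} vs {b}: raw agreement {agree:.2f}")
```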
SATGround: A Spatially-Aware Approach for Visual Grounding in Remote Sensing
Positive · Artificial Intelligence
A novel approach called SATGround has been introduced to enhance visual grounding in remote sensing through a structured localization mechanism that fine-tunes a pretrained vision-language model (VLM) on diverse instruction-following tasks. This method significantly improves the model's ability to localize objects in complex satellite imagery, achieving a 24.8% relative improvement over previous methods in visual grounding benchmarks.
Embodied Tree of Thoughts: Deliberate Manipulation Planning with Embodied World Model
Positive · Artificial Intelligence
The Embodied Tree of Thoughts (EToT) framework has been introduced as a significant advancement in robot manipulation planning, using a physics-based interactive digital twin to predict future environmental states and reason about actions before execution. This approach aims to overcome limitations of existing video-generation models, which often lack physical grounding and consistency under long-horizon constraints.
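
As a rough illustration of deliberate planning over a world model, the beam-style search below simulates candidate actions in a stand-in digital twin and keeps only physically feasible branches. The `WorldModel` interface, scoring, and search shape are assumptions for illustration, not the EToT implementation.

```python
# Sketch: deliberate action planning against a simulated world model.
from typing import Any, Protocol

class WorldModel(Protocol):
    def simulate(self, state: Any, action: str) -> tuple[Any, bool]: ...
    def score(self, state: Any) -> float: ...  # estimated task progress

def plan(model: WorldModel, state: Any, actions: list[str],
         depth: int, beam: int = 3) -> list[str]:
    """Expand candidate actions in simulation; keep the best `beam` branches."""
    frontier = [(model.score(state), state, [])]
    for _ in range(depth):
        expanded = []
        for _, s, path in frontier:
            for a in actions:
                nxt, feasible = model.simulate(s, a)  # physics-checked rollout
                if feasible:                          # prune infeasible branches
                    expanded.append((model.score(nxt), nxt, path + [a]))
        if not expanded:
            break
        frontier = sorted(expanded, key=lambda t: t[0], reverse=True)[:beam]
    return frontier[0][2]  # best action sequence found before execution
```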
Transparent and Coherent Procedural Mistake Detection
Neutral · Artificial Intelligence
A new approach to procedural mistake detection (PMD) has been introduced, focusing on classifying task execution success through egocentric video analysis. This method emphasizes generating visual self-dialog rationales to enhance decision-making transparency, leveraging advanced vision-and-language models (VLMs) and establishing baseline metrics for coherence in generated rationales.
CORE-3D: Context-aware Open-vocabulary Retrieval by Embeddings in 3D
Positive · Artificial Intelligence
CORE-3D introduces a novel approach to 3D scene understanding by utilizing context-aware open-vocabulary retrieval through embeddings, enhancing the accuracy of object-level masks in complex environments. This method leverages SemanticSAM and a refined CLIP encoding strategy to improve 3D semantic segmentation, addressing limitations of previous models that produced fragmented masks and inaccurate semantic assignments.
Language-driven Fine-grained Retrieval
Neutral · Artificial Intelligence
A new framework named LaFG has been introduced for fine-grained image retrieval, which utilizes large language models (LLMs) and vision-language models (VLMs) to convert class names into detailed attribute-level descriptions. This approach aims to enhance the modeling of comparability among cross-category details, addressing limitations of existing methods that rely on sparse one-hot labels.
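
A minimal sketch of the label-to-attribute idea described above: an LLM expands a class name into attribute-level text, which is then embedded and matched against gallery embeddings. The prompt and the `query_llm`/`embed` placeholders are illustrative assumptions, not the LaFG framework's API.

```python
# Sketch: fine-grained retrieval via LLM-generated attribute descriptions.
import numpy as np

def query_llm(prompt: str) -> str:
    raise NotImplementedError  # any chat-completion backend (hypothetical)

def embed(text: str) -> np.ndarray:
    raise NotImplementedError  # any joint text/image embedder (hypothetical)

def attribute_description(class_name: str) -> str:
    """Expand a sparse class label into comparable fine-grained attributes."""
    return query_llm(
        f"List the visually distinguishing attributes of a '{class_name}' "
        "(shape, color, texture, parts) as one short sentence."
    )

def retrieve(query_class: str, gallery: dict[str, np.ndarray]) -> str:
    """Rank gallery embeddings against the attribute-level text embedding."""
    q = embed(attribute_description(query_class))
    q = q / np.linalg.norm(q)
    scores = {name: float(v / np.linalg.norm(v) @ q) for name, v in gallery.items()}
    return max(scores, key=scores.get)  # best-matching gallery item
```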
Towards Accurate UAV Image Perception: Guiding Vision-Language Models with Stronger Task Prompts
Positive · Artificial Intelligence
A new framework called AerialVP has been introduced to enhance image perception in UAVs by improving task prompts used in Vision-Language Models (VLMs). This framework addresses challenges such as target confusion and scale variations that arise from the complex nature of UAV imagery, which traditional VLMs struggle to interpret effectively.