CAPability: A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Thoroughness

arXiv — cs.CV · Thursday, November 27, 2025 at 5:00:00 AM
  • CAPability has been introduced as a comprehensive visual caption benchmark designed to evaluate both the correctness and the thoroughness of captions generated by multimodal large language models (MLLMs). The benchmark addresses the limitations of existing visual captioning assessments, which often rely on brief ground-truth sentences and traditional metrics that fail to evaluate detailed captions effectively.
  • The development of CAPability is significant as it provides a stable evaluation framework that includes nearly 11,000 human-annotated images and videos, allowing for a more nuanced assessment of generated captions through precision and hit metrics (see the sketch after this list). This advancement is crucial for improving the performance of MLLMs in visual understanding tasks.
  • This initiative reflects a broader trend in the AI field towards enhancing evaluation metrics for multimodal models, as seen in other recent benchmarks like CaptionQA and CounterVQA. These developments highlight the ongoing efforts to refine how AI systems interpret and generate content across various domains, emphasizing the importance of thorough evaluation criteria in advancing AI capabilities.
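The sketch below illustrates how precision- and hit-style metrics could be computed once a caption has been decomposed into judged atomic statements. The decomposition and judging steps, the field names, and the helper functions are assumptions for illustration, not CAPability's actual implementation.

```python
def precision(judged_statements):
    """Fraction of generated statements judged correct."""
    if not judged_statements:
        return 0.0
    correct = sum(1 for s in judged_statements if s["correct"])
    return correct / len(judged_statements)


def hit_rate(annotated_elements, judged_statements):
    """Fraction of annotated visual elements covered by a correct statement."""
    if not annotated_elements:
        return 0.0
    covered = {s["element"] for s in judged_statements if s["correct"]}
    return sum(1 for e in annotated_elements if e in covered) / len(annotated_elements)


statements = [  # judged atomic statements from one generated caption (hypothetical)
    {"element": "object_count", "correct": True},
    {"element": "camera_angle", "correct": False},
    {"element": "scene", "correct": True},
]
annotations = ["object_count", "scene"]  # elements the annotators marked

print(precision(statements))              # 2/3 of statements are correct
print(hit_rate(annotations, statements))  # both annotated elements are hit
```

Precision rewards saying only correct things, while hit rate rewards covering every annotated element, so the two together separate correctness from thoroughness.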
— via World Pulse Now AI Editorial System

Continue Reading
ReMatch: Boosting Representation through Matching for Multimodal Retrieval
Positive · Artificial Intelligence
ReMatch has been introduced as a new framework that utilizes the generative capabilities of Multimodal Large Language Models (MLLMs) for enhanced multimodal retrieval. This approach trains the MLLM end-to-end, employing a chat-style generative matching stage that assesses relevance from various inputs, including raw data and projected embeddings.
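As a rough illustration of chat-style generative matching, the sketch below asks a causal language model whether a candidate matches a query and reads off the probability of "yes". The stand-in model, prompt, and function name are hypothetical; ReMatch's actual end-to-end training and inputs (raw data, projected embeddings) are more elaborate.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")       # text-only stand-in for an MLLM
lm = AutoModelForCausalLM.from_pretrained("gpt2")

def match_score(query: str, candidate: str) -> float:
    """Relevance as the model's probability of answering 'yes'."""
    prompt = f"Query: {query}\nCandidate: {candidate}\nRelevant? Answer yes or no:"
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(ids).logits[0, -1]            # next-token logits
    yes_id, no_id = tok.encode(" yes")[0], tok.encode(" no")[0]
    p = torch.softmax(logits[[yes_id, no_id]], dim=-1)
    return p[0].item()                            # P("yes")
```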
Restora-Flow: Mask-Guided Image Restoration with Flow Matching
Positive · Artificial Intelligence
Restora-Flow has been introduced as a training-free method for image restoration that utilizes flow matching sampling guided by a degradation mask. This innovative approach aims to enhance the quality of image restoration tasks such as inpainting, super-resolution, and denoising while addressing the long processing times and over-smoothing issues faced by existing methods.
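A minimal sketch of what mask-guided flow-matching sampling can look like for inpainting, assuming a pretrained velocity field v(x, t) and a linear noise-to-data interpolation path. The projection step that re-imposes known pixels at each iteration is a common training-free guidance trick and may differ from Restora-Flow's exact mechanism.

```python
import torch

def restore(v, y, mask, steps=50):
    """y: degraded image (B, C, H, W); mask: 1 where pixels must be restored."""
    noise = torch.randn_like(y)
    x = noise.clone()                        # start from noise at t = 0
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((y.shape[0],), i * dt)
        x = x + v(x, t) * dt                 # Euler step along the learned flow
        # Re-impose known pixels on the noise-to-data interpolation path.
        t1 = (i + 1) * dt
        x_known = (1.0 - t1) * noise + t1 * y
        x = mask * x + (1.0 - mask) * x_known
    return x
```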
RobustMerge: Parameter-Efficient Model Merging for MLLMs with Direction Robustness
Positive · Artificial Intelligence
RobustMerge has been introduced as a parameter-efficient model merging method designed for multi-task learning in multimodal large language models (MLLMs), emphasizing direction robustness during the merging process. This approach addresses the challenge of merging expert models without data leakage, which has become increasingly important as model sizes and data complexity grow.
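For flavor, the sketch below shows a generic direction-aware task-vector merge (sign-consistent averaging, in the spirit of methods like TIES-Merging); it is not RobustMerge's algorithm, and all names are illustrative.

```python
import torch

def merge(base: dict, experts: list, scale: float = 1.0) -> dict:
    """base: name -> pretrained weight tensor; experts: list of such dicts."""
    merged = {}
    for name, w0 in base.items():
        deltas = torch.stack([e[name] - w0 for e in experts])  # task vectors
        # Keep only updates that agree with the elementwise majority sign,
        # so conflicting update directions do not cancel destructively.
        sign = torch.sign(deltas.sum(dim=0))
        keep = (torch.sign(deltas) == sign).float()
        avg = (deltas * keep).sum(dim=0) / keep.sum(dim=0).clamp(min=1.0)
        merged[name] = w0 + scale * avg
    return merged
```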
EmoFeedback$^2$: Reinforcement of Continuous Emotional Image Generation via LVLM-based Reward and Textual Feedback
Positive · Artificial Intelligence
The recent introduction of EmoFeedback$^2$ aims to enhance continuous emotional image generation (C-EICG) by utilizing a large vision-language model (LVLM) to provide reward and textual feedback, addressing the limitations of existing methods that struggle with emotional continuity and fidelity. This paradigm allows for better alignment of generated images with user emotional descriptions.
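A hedged sketch of the generate-judge-refine loop such a system implies, with every callable a placeholder: a generator conditioned on textual feedback and an LVLM judge returning a scalar reward plus a critique.

```python
def refine(generate, judge, prompt, rounds=3):
    """generate(prompt, feedback) -> image; judge(image, prompt) -> (reward, critique)."""
    feedback, best, best_reward = "", None, float("-inf")
    for _ in range(rounds):
        image = generate(prompt, feedback)       # condition on the prior critique
        reward, feedback = judge(image, prompt)  # scalar reward + textual feedback
        if reward > best_reward:
            best, best_reward = image, reward
    return best
```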
From Inpainting to Layer Decomposition: Repurposing Generative Inpainting Models for Image Layer Decomposition
Positive · Artificial Intelligence
A new study has introduced a diffusion-based inpainting model adapted for image layer decomposition, addressing the challenges of separating images into distinct layers for independent editing. This model employs lightweight finetuning and a multi-modal context fusion module to enhance detail preservation in the latent space, achieving superior results in object removal and occlusion recovery using a synthetic dataset.
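At its simplest, inpainting-based layer decomposition can be pictured as below: inpaint the object region to recover the background layer and keep the masked pixels as the foreground layer. `inpaint` stands in for any inpainting model; the paper's finetuned diffusion model and multi-modal context fusion module go well beyond this.

```python
def decompose(image, object_mask, inpaint):
    """object_mask is 1 inside the object; inpaint(image, mask) fills masked pixels."""
    background = inpaint(image, object_mask)  # recover the occluded background
    foreground = image * object_mask          # object layer, with the mask as alpha
    return foreground, background, object_mask
```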
CaptionQA: Is Your Caption as Useful as the Image Itself?
Positive · Artificial Intelligence
A new benchmark called CaptionQA has been introduced to evaluate the utility of model-generated captions in supporting downstream tasks across various domains, including Natural, Document, E-commerce, and Embodied AI. This benchmark consists of 33,027 annotated multiple-choice questions that require visual information to answer, aiming to assess whether captions can effectively replace images in multimodal systems.
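The evaluation idea can be sketched in a few lines: a text-only model answers each multiple-choice question from the caption alone, and its accuracy measures how well the caption substitutes for the image. `answer_fn` and the record fields are hypothetical placeholders.

```python
def caption_utility(answer_fn, caption, questions):
    """questions: dicts with 'question', 'choices', and gold 'answer' keys."""
    correct = sum(
        int(answer_fn(caption, q["question"], q["choices"]) == q["answer"])
        for q in questions
    )
    return correct / len(questions)  # accuracy from the caption alone
```

Comparing this score against the same model answering from the image itself gives the caption-versus-image gap the benchmark is after.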
Structure-Aware Prototype Guided Trusted Multi-View Classification
Positive · Artificial Intelligence
A novel framework for Trustworthy Multi-View Classification (TMVC) has been proposed, addressing the challenges of reliable decision-making in scenarios with heterogeneous and conflicting multi-source information. This framework introduces prototypes to represent neighbor structures of each view, simplifying the learning of intra-view relations and enhancing consistency across inter-view relationships.
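As a simplified stand-in for prototype-guided multi-view classification (not the proposed TMVC framework, which additionally models trust and inter-view consistency), the sketch below fits per-view class prototypes and classifies by summed prototype distance across views.

```python
import torch

def fit_prototypes(X, y, n_classes):
    """X: (N, D) features for one view; returns (n_classes, D) class means."""
    return torch.stack([X[y == c].mean(dim=0) for c in range(n_classes)])

def predict(views, prototypes):
    """views: list of (N, D_v) tensors; prototypes: matching (C, D_v) tensors."""
    scores = sum(-torch.cdist(X, P) for X, P in zip(views, prototypes))
    return scores.argmax(dim=1)  # class with the smallest summed distance
```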
LLaVA-UHD v3: Progressive Visual Compression for Efficient Native-Resolution Encoding in MLLMs
Positive · Artificial Intelligence
LLaVA-UHD v3 has been introduced as a new multi-modal large language model (MLLM) that utilizes Progressive Visual Compression (PVC) for efficient native-resolution encoding, enhancing visual understanding capabilities while addressing computational overhead. This model integrates refined patch embedding and windowed token compression to optimize performance in vision-language tasks.
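Windowed token compression in its simplest form can be sketched as average-pooling visual tokens over non-overlapping spatial windows, as below; LLaVA-UHD v3's PVC pipeline is considerably more refined, so treat this only as the general idea.

```python
import torch

def window_compress(tokens, h, w, win=2):
    """tokens: (B, h*w, C) visual tokens laid out on an h x w grid."""
    assert h % win == 0 and w % win == 0
    B, _, C = tokens.shape
    grid = tokens.view(B, h, w, C)
    grid = grid.view(B, h // win, win, w // win, win, C)
    return grid.mean(dim=(2, 4)).reshape(B, -1, C)  # (B, h*w // win**2, C)
```

A 2x2 window cuts the visual token count by 4x before the tokens enter the language model, which is where the efficiency gain comes from.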