AlignBench: Benchmarking Fine-Grained Image-Text Alignment with Synthetic Image-Caption Pairs

arXiv — cs.CV · Wednesday, December 3, 2025 at 5:00:00 AM
  • AlignBench has been introduced as a benchmark for evaluating fine-grained image-text alignment using synthetic image-caption pairs, addressing limitations of existing evaluations for models like CLIP, which rely on rule-based perturbations or short captions. The benchmark annotates each sentence of a caption for correctness, enabling a more detailed assessment of vision-language models (VLMs); a minimal sentence-scoring sketch follows this summary.
  • The development of AlignBench is significant as it provides a new standard for measuring the performance of VLMs, revealing critical insights into their alignment capabilities and highlighting issues such as over-scoring early sentences and self-preference in model outputs.
  • This initiative reflects ongoing challenges in the field of AI, particularly in enhancing the robustness and accuracy of VLMs. It aligns with broader efforts to improve image-captioning technologies and tackle issues like overfitting and alignment transfer, which are crucial for advancing applications in semantic segmentation and visual recognition.
— via World Pulse Now AI Editorial System
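As an illustration of per-sentence alignment scoring of the kind the benchmark targets, below is a minimal sketch that splits a caption into sentences and scores each one against the image with an off-the-shelf CLIP model. The model checkpoint, the naive sentence splitting, and the `score_sentences` helper are assumptions for illustration, not AlignBench's actual pipeline.

```python
# Minimal sketch of sentence-level image-text alignment scoring. All choices here
# (checkpoint, sentence splitting, cosine-similarity scoring) are illustrative
# assumptions, not the benchmark's real annotation or evaluation procedure.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def score_sentences(image: Image.Image, caption: str) -> list[tuple[str, float]]:
    """Split a long caption into sentences and score each one against the image."""
    sentences = [s.strip() for s in caption.split(".") if s.strip()]
    inputs = processor(text=sentences, images=image, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    # Cosine similarity between the image embedding and each sentence embedding.
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    sims = (txt @ img.T).squeeze(-1).tolist()
    return list(zip(sentences, sims))

# Example usage (hypothetical file and caption):
# pairs = score_sentences(Image.open("example.jpg"),
#                         "A red car is parked. The driver wears a hat.")
# for sentence, sim in pairs:
#     print(f"{sim:.3f}  {sentence}")
```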


Continue Reading
Reasoning Path and Latent State Analysis for Multi-view Visual Spatial Reasoning: A Cognitive Science Perspective
Neutral · Artificial Intelligence
Recent research has introduced ReMindView-Bench, a benchmark designed to evaluate how Vision-Language Models (VLMs) construct and maintain spatial mental models across multiple viewpoints. This initiative addresses the challenges VLMs face in achieving geometric coherence and cross-view consistency in spatial reasoning tasks, which are crucial for understanding 3D environments.
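To make the cross-view consistency idea concrete, here is a tiny sketch that asks a VLM the same spatial question over several viewpoints of one scene and measures how often the answers agree. The `ask_vlm` callable and the majority-vote metric are placeholders for illustration, not the benchmark's evaluation code.

```python
# Illustrative cross-view consistency check: the same spatial question is asked over
# several rendered viewpoints of one scene and the answers are compared. `ask_vlm`
# is a placeholder for whatever VLM interface is used; this is not ReMindView-Bench code.
from collections import Counter
from typing import Callable

def cross_view_consistency(ask_vlm: Callable[[str, str], str],
                           view_paths: list[str],
                           question: str) -> float:
    """Fraction of views whose answer matches the majority answer."""
    answers = [ask_vlm(path, question).strip().lower() for path in view_paths]
    majority, count = Counter(answers).most_common(1)[0]
    return count / len(answers)

# Example usage (hypothetical paths and question):
# score = cross_view_consistency(my_vlm, ["view_0.png", "view_1.png", "view_2.png"],
#                                "Is the chair to the left of the table?")
```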
AVA-VLA: Improving Vision-Language-Action models with Active Visual Attention
Positive · Artificial Intelligence
The AVA-VLA framework has been introduced to enhance Vision-Language-Action (VLA) models by integrating Active Visual Attention (AVA), allowing for dynamic modulation of visual processing based on historical context. This reformulation addresses limitations in existing models that process visual inputs independently, improving decision-making in dynamic environments.
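The core idea, dynamically re-weighting visual input based on historical context, can be illustrated with a small PyTorch sketch in which patch features are gated by a history-conditioned score. The module layout and dimensions are assumptions for illustration, not the AVA-VLA architecture.

```python
# Sketch of history-conditioned gating of visual tokens: patches that are irrelevant
# to the accumulated history get down-weighted. The layout is an illustrative
# assumption, not the AVA-VLA design.
import torch
import torch.nn as nn

class HistoryConditionedGate(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                  nn.Linear(dim, 1))

    def forward(self, visual_tokens: torch.Tensor, history: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (B, N, D) patch features; history: (B, D) summary of past steps.
        h = history.unsqueeze(1).expand(-1, visual_tokens.size(1), -1)
        weights = torch.sigmoid(self.gate(torch.cat([visual_tokens, h], dim=-1)))
        return visual_tokens * weights  # down-weight history-irrelevant patches

# gate = HistoryConditionedGate(dim=256)
# out = gate(torch.randn(2, 196, 256), torch.randn(2, 256))
```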
VACoT: Rethinking Visual Data Augmentation with VLMs
Positive · Artificial Intelligence
The introduction of the Visual Augmentation Chain-of-Thought (VACoT) framework marks a significant advancement in visual data augmentation for Vision Language Models (VLMs). This framework dynamically applies image augmentations during model inference, enhancing the robustness of VLMs, particularly in challenging scenarios such as Optical Character Recognition (OCR) tasks.
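As a rough illustration of applying augmentations at inference time, the sketch below runs a VLM over several augmented views of an image and aggregates the answers by majority vote. The augmentation set, the `query_vlm` callable, and the aggregation rule are assumptions; VACoT's actual chain-of-thought procedure is not reproduced here.

```python
# Sketch of inference-time augmentation with answer aggregation. The augmentations and
# majority vote are illustrative assumptions, not the VACoT framework itself.
from collections import Counter
from typing import Callable
from PIL import Image, ImageEnhance, ImageOps

def augmentations(image: Image.Image) -> list[Image.Image]:
    return [
        image,
        ImageOps.autocontrast(image),                 # recover faint text regions
        ImageEnhance.Sharpness(image).enhance(2.0),   # sharpen for OCR-like tasks
        image.rotate(2, expand=True),                 # small rotation for skewed inputs
    ]

def answer_with_augmentation(query_vlm: Callable[[Image.Image, str], str],
                             image: Image.Image, prompt: str) -> str:
    answers = [query_vlm(aug, prompt) for aug in augmentations(image)]
    return Counter(answers).most_common(1)[0][0]  # majority vote across augmented views
```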
SPARK: Sim-ready Part-level Articulated Reconstruction with VLM Knowledge
Positive · Artificial Intelligence
SPARK has been introduced as a framework for reconstructing articulated 3D objects from a single RGB image, utilizing Vision-Language Models (VLMs) to extract parameters and generate part-level reference images. This innovative approach integrates part-image guidance and structure graphs into a generative diffusion transformer, optimizing the creation of simulation-ready assets for robotics and AI applications.
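A structure graph of parts and joints can be represented with a few simple data classes, sketched below to make the notion concrete. The field names and joint types are illustrative assumptions, not SPARK's actual schema.

```python
# Toy data structures for a part-level structure graph: parts as nodes, joints as edges
# with a type and an articulation axis. Field names are illustrative assumptions only.
from dataclasses import dataclass, field

@dataclass
class Part:
    name: str
    bbox: tuple[float, float, float, float, float, float]  # axis-aligned 3D box (min/max corners)

@dataclass
class Joint:
    parent: str
    child: str
    joint_type: str                      # e.g. "revolute" or "prismatic"
    axis: tuple[float, float, float]     # articulation axis in the parent frame

@dataclass
class StructureGraph:
    parts: list[Part] = field(default_factory=list)
    joints: list[Joint] = field(default_factory=list)

# Example: a cabinet body with one revolute door (hypothetical values).
# graph = StructureGraph(
#     parts=[Part("body", (0, 0, 0, 1, 1, 2)), Part("door", (0, 0, 0, 0.05, 1, 2))],
#     joints=[Joint("body", "door", "revolute", (0, 0, 1))],
# )
```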
CT-GLIP: 3D Grounded Language-Image Pretraining with CT Scans and Radiology Reports for Full-Body Scenarios
Positive · Artificial Intelligence
CT-GLIP, a new 3D Grounded Language-Image Pretrained model, has been introduced to enhance the alignment of CT scans with radiology reports, addressing limitations in existing methods that rely on global embeddings. This model constructs fine-grained CT-report pairs to improve cross-modal contrastive learning, enabling better identification of organs and abnormalities in a zero-shot manner.
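The general recipe behind fine-grained cross-modal contrastive learning can be sketched as a symmetric InfoNCE-style loss over matched organ-level crop embeddings and report-sentence embeddings, as below. The batch construction and temperature are assumptions; this is not the paper's implementation.

```python
# Sketch of a symmetric InfoNCE loss over paired crop and sentence embeddings,
# the general contrastive recipe this line of work builds on. Not the paper's code.
import torch
import torch.nn.functional as F

def fine_grained_contrastive_loss(crop_emb: torch.Tensor,
                                  sent_emb: torch.Tensor,
                                  temperature: float = 0.07) -> torch.Tensor:
    # crop_emb, sent_emb: (N, D) aligned pairs (i-th crop matches i-th sentence).
    crop_emb = F.normalize(crop_emb, dim=-1)
    sent_emb = F.normalize(sent_emb, dim=-1)
    logits = crop_emb @ sent_emb.T / temperature        # (N, N) similarity matrix
    targets = torch.arange(crop_emb.size(0), device=crop_emb.device)
    # Symmetric cross-entropy: crops -> sentences and sentences -> crops.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

# loss = fine_grained_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```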
Mixture of Ranks with Degradation-Aware Routing for One-Step Real-World Image Super-Resolution
Positive · Artificial Intelligence
A new Mixture-of-Ranks (MoR) architecture has been proposed for one-step real-world image super-resolution (Real-ISR), integrating sparse Mixture-of-Experts (MoE) to enhance the adaptability of models in reconstructing high-resolution images from degraded samples. This approach utilizes a fine-grained expert partitioning strategy, treating each rank in Low-Rank Adaptation (LoRA) as an independent expert, thereby improving the model's ability to capture heterogeneous characteristics of real-world images.
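The idea of treating each LoRA rank as an independent expert behind a router can be sketched in a few lines of PyTorch, as below. The routing input and top-k gating are simplified assumptions; the paper's degradation-aware router presumably conditions on degradation cues, which are omitted here.

```python
# Sketch of "each LoRA rank as an expert" with a learned top-k router. Simplified
# illustration; the degradation-aware conditioning described in the paper is omitted.
import torch
import torch.nn as nn

class MixtureOfRanksLinear(nn.Module):
    def __init__(self, in_dim: int, out_dim: int, num_ranks: int = 8, top_k: int = 2):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim)          # frozen pretrained weight in practice
        self.A = nn.Parameter(torch.randn(num_ranks, in_dim) * 0.01)   # rank-1 "down" vectors
        self.B = nn.Parameter(torch.zeros(num_ranks, out_dim))         # rank-1 "up" vectors
        self.router = nn.Linear(in_dim, num_ranks)      # could instead take a degradation embedding
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, in_dim). Each rank r contributes (x @ A[r]) * B[r], gated by the router.
        scores = self.router(x)                                   # (B, num_ranks)
        topk = torch.topk(scores, self.top_k, dim=-1)
        gates = torch.zeros_like(scores).scatter(-1, topk.indices,
                                                 torch.softmax(topk.values, dim=-1))
        coeffs = x @ self.A.T                                     # (B, num_ranks)
        delta = (gates * coeffs) @ self.B                         # (B, out_dim)
        return self.base(x) + delta

# layer = MixtureOfRanksLinear(64, 64)
# y = layer(torch.randn(4, 64))
```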
Can Vision-Language Models Count? A Synthetic Benchmark and Analysis of Attention-Based Interventions
Neutral · Artificial Intelligence
Recent research has highlighted that Vision Language Models (VLMs) often exhibit biases learned during training, particularly when tasked with counting specific objects in images. A new synthetic benchmark dataset and evaluation framework have been developed to assess how counting performance varies with different image and prompt characteristics, revealing fluctuating attention allocation in open-source VLMs.
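A synthetic counting benchmark of this kind can be generated with a few lines of image drawing, sketched below: render a known number of simple shapes, then compare the model's answer to the ground-truth count. Shape choices, canvas size, and the prompt are illustrative assumptions, not the paper's generator.

```python
# Tiny sketch of a synthetic counting-image generator: draw a known number of shapes
# and keep the count as ground truth. Illustrative assumptions only.
import random
from PIL import Image, ImageDraw

def make_counting_image(n_objects: int, size: int = 256) -> Image.Image:
    img = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(img)
    for _ in range(n_objects):
        x, y, r = random.randint(20, size - 20), random.randint(20, size - 20), 12
        draw.ellipse([x - r, y - r, x + r, y + r], fill="red")  # circles may overlap; fine for a sketch
    return img

# img = make_counting_image(5)
# prompt = "How many red circles are in the image? Answer with a number."
# The VLM's reply would then be parsed and compared against the known count of 5.
```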
Vision Language Models are Biased
Negative · Artificial Intelligence
Recent research has revealed that vision language models (VLMs) exhibit significant biases, particularly in tasks involving counting and identification, with an average accuracy of only 17.05% across various domains. This study highlights the models' inability to recognize subtle changes, such as additional stripes on logos, indicating a flaw in their understanding of visual context.