Understanding Task Transfer in Vision-Language Models

arXiv — cs.CV · Tuesday, November 25, 2025 at 5:00:00 AM
  • A recent study of Vision-Language Models (VLMs) examines their performance on multimodal benchmarks, revealing persistent challenges in visual perception tasks such as depth estimation and object counting. The research introduces the Perfection Gap Factor (PGF) to quantify task transferability, demonstrating across 13 perception tasks how finetuning on one task can unpredictably impact performance on others (a rough sketch of this kind of transfer measurement appears after the summary).
  • This development is significant because it addresses the complexities of task-specific finetuning in VLMs, which has produced inconsistent results across different perception tasks. Understanding these dynamics can lead to improved model training and performance in practical applications.
  • The findings resonate with ongoing discussions about the limitations of VLMs, particularly their biases and vulnerabilities in handling diverse inputs. As researchers explore frameworks to enhance robustness and address biases, the insights from this study contribute to a broader understanding of how VLMs can evolve to meet the demands of complex visual tasks.
— via World Pulse Now AI Editorial System
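
The paper's exact PGF formula is not reproduced in the summary above. As a rough, hypothetical illustration of the kind of measurement involved, the sketch below tabulates how finetuning on one task shifts scores on every other task; the `finetune` and `evaluate` helpers and the three task names are placeholders, not the paper's code.

```python
# Hypothetical sketch: measuring cross-task transfer after finetuning.
# `finetune` and `evaluate` are placeholder functions, and the matrix below
# is a generic transfer table, not the paper's PGF itself.
import numpy as np

TASKS = ["depth_estimation", "object_counting", "segmentation"]  # stand-ins for the 13 tasks

def finetune(base_model, task):
    """Placeholder: return a copy of the model finetuned on one task."""
    raise NotImplementedError

def evaluate(model, task):
    """Placeholder: return a scalar score (e.g. accuracy) on one task."""
    raise NotImplementedError

def transfer_matrix(base_model):
    base_scores = np.array([evaluate(base_model, t) for t in TASKS])
    deltas = np.zeros((len(TASKS), len(TASKS)))
    for i, src in enumerate(TASKS):
        tuned = finetune(base_model, src)
        for j, tgt in enumerate(TASKS):
            # Positive entries mean finetuning on `src` helped `tgt`;
            # negative entries mean it hurt, i.e. negative transfer.
            deltas[i, j] = evaluate(tuned, tgt) - base_scores[j]
    return deltas
```

Rows of such a matrix make the unpredictable cross-task effects described above easy to inspect: improving one perception task can visibly degrade another.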

Continue Reading
MedBridge: Bridging Foundation Vision-Language Models to Medical Image Diagnosis in Chest X-Ray
Positive · Artificial Intelligence
MedBridge has been introduced as a lightweight multimodal adaptation framework designed to enhance the application of pre-trained vision-language models (VLMs) in medical image diagnosis, particularly for chest X-rays. This framework includes innovative components such as a Focal Sampling module and a Query-Encoder model to improve the accuracy of medical image analysis without extensive retraining.
Spotlight: Identifying and Localizing Video Generation Errors Using VLMs
Positive · Artificial Intelligence
A new task named Spotlight has been introduced to identify and localize video generation errors in text-to-video models (T2V), which can produce high-quality videos but still exhibit nuanced errors. The research generated 600 videos using diverse prompts and three advanced video generators, annotating over 1600 specific errors across various categories such as motion and physics.
MASS: Motion-Aware Spatial-Temporal Grounding for Physics Reasoning and Comprehension in Vision-Language Models
Positive · Artificial Intelligence
A new approach called MASS has been introduced to enhance Vision Language Models (VLMs) by addressing their limitations in physics-driven reasoning and comprehension of motion dynamics. This method translates physical-world context cues into interpretable representations, facilitating better understanding and generation of content in real and AI-generated videos. The MASS-Bench benchmark comprises 4,350 videos and 8,361 question-answering pairs focused on physics-related tasks.
BackdoorVLM: A Benchmark for Backdoor Attacks on Vision-Language Models
Neutral · Artificial Intelligence
The introduction of BackdoorVLM marks a significant advancement in the evaluation of backdoor attacks on vision-language models (VLMs), addressing a critical gap in the understanding of these threats within multimodal machine learning systems. This benchmark categorizes backdoor threats into five distinct types, including targeted refusal and perceptual hijack, providing a structured approach to analyze their impact on tasks like image captioning and visual question answering.
Can Vision-Language Models Count? A Synthetic Benchmark and Analysis of Attention-Based Interventions
Neutral · Artificial Intelligence
Recent research indicates that Vision Language Models (VLMs) often exhibit biases learned during training, particularly when tasked with specific queries about visual properties, such as counting objects in images. A new synthetic benchmark dataset and evaluation framework have been developed to assess how counting performance varies with different image and prompt characteristics.
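
The benchmark's construction is not detailed in the summary above; the snippet below is a minimal, hypothetical example of how synthetic counting items can be generated by pairing rendered shapes with a counting prompt and a ground-truth answer. The image size, shape type, and prompt wording are assumptions, not the paper's setup.

```python
# Minimal, hypothetical generator for synthetic counting items:
# each item pairs an image of N shapes with a counting prompt and answer.
# This illustrates the general idea only; overlapping shapes are not handled.
import random
from PIL import Image, ImageDraw

def make_counting_item(n_max=9, size=224):
    n = random.randint(1, n_max)
    img = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(img)
    for _ in range(n):
        x, y, r = random.randint(20, size - 20), random.randint(20, size - 20), 10
        draw.ellipse((x - r, y - r, x + r, y + r), fill="blue")
    return {
        "image": img,
        "prompt": "How many circles are in the image?",
        "answer": str(n),
    }

items = [make_counting_item() for _ in range(100)]
```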
VK-Det: Visual Knowledge Guided Prototype Learning for Open-Vocabulary Aerial Object Detection
Positive · Artificial Intelligence
VK-Det has been introduced as a new framework for open-vocabulary aerial object detection, utilizing vision-language models (VLMs) to identify objects beyond predefined categories without requiring additional supervision. This approach enhances fine-grained localization and adaptive distillation through innovative pseudo-labeling strategies that model inter-class decision boundaries.
L2V-CoT: Cross-Modal Transfer of Chain-of-Thought Reasoning via Latent Intervention
Positive · Artificial Intelligence
Researchers have introduced L2V-CoT, a novel training-free approach that facilitates the transfer of Chain-of-Thought (CoT) reasoning from large language models (LLMs) to Vision-Language Models (VLMs) using Linear Artificial Tomography (LAT). This method addresses the challenges VLMs face in multi-step reasoning tasks due to limited multimodal reasoning data.
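
L2V-CoT's Linear Artificial Tomography procedure is not described in the summary above; the sketch below shows a generic activation-steering intervention (adding a precomputed direction vector to one layer's hidden states), which is the broad family of latent interventions referred to. The layer choice, scaling factor, and how the direction is obtained are assumptions.

```python
# Generic latent-intervention sketch: add a precomputed "reasoning direction"
# to one transformer layer's hidden states via a forward hook.
# This illustrates activation steering in general, not L2V-CoT's exact method;
# `alpha`, the layer index, and the source of `direction` are assumptions.
import torch

def add_steering_hook(model_layer, direction, alpha=4.0):
    direction = direction / direction.norm()  # unit-length steering vector

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction.to(hidden.dtype).to(hidden.device)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    return model_layer.register_forward_hook(hook)

# Hypothetical usage: `direction` could be the mean difference between hidden
# states collected from chain-of-thought and direct answers in an LLM.
# handle = add_steering_hook(vlm.language_model.model.layers[20], direction)
# ... run generation ...
# handle.remove()
```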
Bias Is a Subspace, Not a Coordinate: A Geometric Rethinking of Post-hoc Debiasing in Vision-Language Models
Positive · Artificial Intelligence
A recent study has proposed a new framework called Subspace Projection Debiasing (SPD) to address the pervasive demographic biases in Vision-Language Models (VLMs). This framework challenges the traditional post-hoc debiasing methods that focus on coordinate-wise adjustments, revealing that biases are distributed across linear subspaces rather than isolated coordinates.
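
The paper's exact SPD construction is not reproduced here; the sketch below shows the generic geometric operation the summary describes: removing the component of an embedding that lies in a multi-dimensional bias subspace rather than editing individual coordinates. How the bias directions are estimated is an assumption.

```python
# Generic subspace-projection sketch: remove the component of embeddings that
# lies in a bias subspace spanned by several direction vectors.
# Illustrates the geometric idea in the summary, not the paper's exact SPD.
import numpy as np

def project_out_subspace(embeddings, bias_directions):
    """embeddings: (n, d); bias_directions: (k, d) spanning the bias subspace."""
    q, _ = np.linalg.qr(bias_directions.T)   # (d, k) orthonormal basis of the subspace
    bias_component = embeddings @ q @ q.T    # projection onto the bias subspace
    return embeddings - bias_component       # keep only the orthogonal complement

# Hypothetical usage: bias directions might be differences between mean
# embeddings of demographic groups; a coordinate-wise method would instead
# edit single dimensions, which the study argues misses the full subspace.
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 512))
dirs = rng.normal(size=(3, 512))
debiased = project_out_subspace(emb, dirs)
```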