Not All Attention Heads Are What You Need: Refining CLIP's Image Representation with Attention Ablation

arXiv — cs.LG · Tuesday, November 18, 2025 at 5:00:00 AM
  • The study explores the impact of attention heads in CLIP's image encoder, revealing that some heads can hinder representation quality. The proposed Attention Ablation Technique (AAT) mitigates this issue by adjusting attention weights, enhancing performance across various applications (a minimal head-ablation sketch follows this summary).
  • This development is significant as it offers a lightweight way to refine large pretrained vision-language models such as CLIP by adjusting attention weights inside the image encoder, improving the quality of the representations passed to downstream tasks.
  • The findings underscore a growing focus on model interpretability and robustness in AI, as researchers seek to enhance systems like CLIP against challenges such as paraphrasing, which can affect performance and reliability in diverse tasks.
— via World Pulse Now AI Editorial System
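To make the head-ablation idea concrete, the sketch below zeroes the contribution of selected attention heads inside one multi-head self-attention layer of a CLIP-style ViT. It is a minimal illustration of the general technique, not the paper's AAT implementation; the layer shapes, head indices, and the `attention_with_head_ablation` helper are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def attention_with_head_ablation(x, qkv_proj, out_proj, num_heads, ablated_heads):
    """Multi-head self-attention where selected heads are suppressed.

    x: (batch, tokens, dim) token embeddings from a CLIP-style ViT layer.
    ablated_heads: indices of heads whose output is zeroed (an illustrative
    stand-in for the paper's attention-weight adjustment).
    """
    B, N, D = x.shape
    head_dim = D // num_heads

    qkv = qkv_proj(x)                                   # (B, N, 3*D)
    q, k, v = qkv.chunk(3, dim=-1)
    # Reshape to (B, heads, N, head_dim).
    q, k, v = (t.view(B, N, num_heads, head_dim).transpose(1, 2) for t in (q, k, v))

    attn = F.softmax(q @ k.transpose(-2, -1) / head_dim ** 0.5, dim=-1)
    per_head = attn @ v                                 # (B, heads, N, head_dim)

    # Ablation: zero the contribution of the flagged heads.
    mask = torch.ones(num_heads, device=x.device)
    mask[list(ablated_heads)] = 0.0
    per_head = per_head * mask.view(1, num_heads, 1, 1)

    merged = per_head.transpose(1, 2).reshape(B, N, D)
    return out_proj(merged)

# Toy usage with random weights; in practice qkv_proj/out_proj would come
# from a pretrained CLIP image-encoder layer.
if __name__ == "__main__":
    D, H = 64, 8
    x = torch.randn(2, 10, D)
    qkv = torch.nn.Linear(D, 3 * D)
    out = torch.nn.Linear(D, D)
    y = attention_with_head_ablation(x, qkv, out, H, ablated_heads=[1, 5])
    print(y.shape)  # torch.Size([2, 10, 64])
```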


Recommended Readings
Continual Learning for Image Captioning through Improved Image-Text Alignment
Positive · Artificial Intelligence
Generating accurate and coherent image captions in a continual learning environment poses significant challenges, particularly due to catastrophic forgetting and the evolving nature of visual concepts. This study introduces a multi-loss framework for continual image captioning that leverages semantic guidance through prompt-based continual learning and contrastive alignment. The proposed method, built on a pretrained ViT-GPT-2 backbone, integrates various loss components to enhance image-text alignment without introducing additional parameters.
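As a minimal sketch of how such a multi-loss objective might be combined, the function below adds a token-level captioning loss to an InfoNCE-style image-text alignment term. The function name, loss weights, and tensor shapes are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def caption_alignment_loss(logits, target_ids, img_emb, txt_emb,
                           temperature=0.07, align_weight=0.5):
    """Toy combination of a captioning loss and a contrastive alignment loss.

    logits:     (batch, seq_len, vocab) decoder outputs (e.g. a GPT-2 head).
    target_ids: (batch, seq_len) ground-truth caption token ids.
    img_emb:    (batch, dim) pooled image features (e.g. ViT CLS token).
    txt_emb:    (batch, dim) pooled caption features.
    """
    # Standard token-level cross-entropy for caption generation.
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                         target_ids.reshape(-1))

    # InfoNCE-style alignment between matching image/caption pairs.
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    sim = img @ txt.t() / temperature              # (batch, batch)
    labels = torch.arange(sim.size(0), device=sim.device)
    align = (F.cross_entropy(sim, labels) + F.cross_entropy(sim.t(), labels)) / 2

    return ce + align_weight * align
```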
GenRecal: Generation after Recalibration from Large to Small Vision-Language Models
Positive · Artificial Intelligence
Recent advancements in vision-language models (VLMs) have utilized large language models (LLMs) to achieve performance comparable to proprietary systems like GPT-4V. However, deploying these models on resource-constrained devices poses challenges due to high computational requirements. To address this, a new framework called Generation after Recalibration (GenRecal) has been introduced, which distills knowledge from large VLMs into smaller, more efficient models by aligning feature representations across diverse architectures.
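A rough sketch of feature-alignment distillation in the spirit described above, assuming a small projection module that recalibrates student features into the teacher's space; the module and loss names are hypothetical, not GenRecal's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureRecalibrator(nn.Module):
    """Maps student features into the teacher's feature space so two
    differently sized architectures can be compared directly."""
    def __init__(self, student_dim, teacher_dim):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(student_dim, teacher_dim),
            nn.GELU(),
            nn.Linear(teacher_dim, teacher_dim),
        )

    def forward(self, student_feats):
        return self.proj(student_feats)

def feature_distillation_loss(student_feats, teacher_feats, recalibrator):
    """Match recalibrated student features to frozen teacher features
    (cosine distance used here for simplicity)."""
    aligned = recalibrator(student_feats)
    cos = F.cosine_similarity(aligned, teacher_feats.detach(), dim=-1)
    return (1.0 - cos).mean()

# Toy usage with random features standing in for real VLM activations.
recal = FeatureRecalibrator(student_dim=256, teacher_dim=1024)
loss = feature_distillation_loss(torch.randn(4, 256), torch.randn(4, 1024), recal)
print(loss.item())
```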
VLMs Guided Interpretable Decision Making for Autonomous Driving
Positive · Artificial Intelligence
Recent advancements in autonomous driving have investigated the application of vision-language models (VLMs) in visual question answering (VQA) frameworks for driving decision-making. However, these methods often rely on handcrafted prompts and exhibit inconsistent performance, which hampers their effectiveness in real-world scenarios. This study assesses state-of-the-art open-source VLMs on high-level decision-making tasks using ego-view visual inputs, revealing significant limitations in their ability to provide reliable, context-aware decisions.
QwenCLIP: Boosting Medical Vision-Language Pretraining via LLM Embeddings and Prompt Tuning
Positive · Artificial Intelligence
QwenCLIP is a new vision-language framework that enhances medical pretraining by integrating large language model (LLM) embeddings and learnable prompts. Traditional Contrastive Language-Image Pretraining (CLIP) struggles with long radiology reports due to its limited token capacity. By replacing CLIP's text encoder with an LLM-based module, QwenCLIP aims to improve cross-modal alignment and capture comprehensive medical semantics, addressing the limitations of existing domain-specific encoders like PubMedBERT and ClinicalBERT.
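A minimal sketch of the general recipe of swapping an LLM-based text module into a CLIP-style contrastive setup: pool the LLM's token states over a long report, project them into the joint space, and train with a symmetric contrastive loss. The pooling head, dimensions, and loss below are illustrative assumptions, not QwenCLIP's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LLMTextHead(nn.Module):
    """Pools LLM token embeddings (e.g. from a long radiology report) and
    projects them into the shared image-text embedding space."""
    def __init__(self, llm_dim, joint_dim):
        super().__init__()
        self.proj = nn.Linear(llm_dim, joint_dim)

    def forward(self, llm_hidden, attn_mask):
        # Mean-pool over valid tokens, then project and normalize.
        mask = attn_mask.unsqueeze(-1).float()
        pooled = (llm_hidden * mask).sum(1) / mask.sum(1).clamp(min=1)
        return F.normalize(self.proj(pooled), dim=-1)

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over matched image-report pairs.
    Both inputs are assumed to be L2-normalized."""
    logits = img_emb @ txt_emb.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
```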
Scene Graph-Guided Generative AI Framework for Synthesizing and Evaluating Industrial Hazard Scenarios
Positive · Artificial Intelligence
A new study introduces a scene graph-guided generative AI framework aimed at synthesizing realistic images of industrial hazard scenarios. This framework addresses the challenge of acquiring datasets for workplace hazards, which are difficult to capture in real-time. By analyzing historical Occupational Safety and Health Administration (OSHA) accident reports with GPT-4o, the study extracts structured hazard reasoning and creates object-level scene graphs. These graphs are utilized to guide a text-to-image diffusion model, generating accurate hazard scenes for evaluation.
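A toy sketch of the scene-graph-to-prompt step, assuming the graph arrives as object names plus (subject, predicate, object) triples; the prompt template and example entities are hypothetical, and the resulting string would simply condition a text-to-image diffusion model.

```python
def scene_graph_to_prompt(objects, relations):
    """Flatten an object-level scene graph into a text prompt for a
    text-to-image diffusion model.

    objects:   list of object names, e.g. ["worker", "forklift"].
    relations: list of (subject, predicate, object) triples.
    """
    rel_text = ", ".join(f"{s} {p} {o}" for s, p, o in relations)
    return (f"industrial workplace scene with {', '.join(objects)}; "
            f"{rel_text}; photorealistic, safety-inspection viewpoint")

prompt = scene_graph_to_prompt(
    ["worker", "forklift", "pallet rack"],
    [("forklift", "approaching", "worker"),
     ("worker", "standing near", "pallet rack")],
)
# The prompt would then condition a text-to-image diffusion pipeline
# (e.g. a Stable Diffusion model) to render the hazard scene.
print(prompt)
```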
Explaining Similarity in Vision-Language Encoders with Weighted Banzhaf Interactions
Positive · Artificial Intelligence
The paper titled 'Explaining Similarity in Vision-Language Encoders with Weighted Banzhaf Interactions' introduces a new method called FIxLIP for explaining the similarity outputs of vision-language models. This approach addresses limitations of existing saliency maps by utilizing the weighted Banzhaf interaction index from game theory, which enhances computational efficiency and flexibility. The study emphasizes the importance of understanding complex cross-modal interactions in language-image pre-training (LIP) models.
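For intuition, the snippet below computes a weighted Banzhaf interaction index for a pair of players by exact enumeration over a small cooperative game. The value function shown is a toy, and this is a generic illustration of the index itself rather than FIxLIP's estimator, which would treat image patches and text tokens as players and the model's similarity score as the game.

```python
from itertools import combinations

def weighted_banzhaf_interaction(value_fn, players, i, j, p=0.5):
    """Exact weighted Banzhaf interaction index for a pair of players (i, j).

    value_fn: callable mapping a frozenset of players to a real payoff
              (e.g. the image-text similarity when only those tokens and
              patches are kept and the rest are masked out).
    players:  iterable of all player ids (patches, tokens, ...).
    p:        weighting parameter; p = 0.5 recovers the classic Banzhaf index.

    Exponential in the number of players, so only usable for toy games.
    """
    rest = [x for x in players if x not in (i, j)]
    total = 0.0
    for r in range(len(rest) + 1):
        for subset in combinations(rest, r):
            T = frozenset(subset)
            weight = (p ** len(T)) * ((1 - p) ** (len(rest) - len(T)))
            total += weight * (value_fn(T | {i, j}) - value_fn(T | {i})
                               - value_fn(T | {j}) + value_fn(T))
    return total

# Toy game: the payoff is 1 only when players 0 and 1 are both present,
# so their interaction index is exactly 1.
v = lambda S: 1.0 if {0, 1} <= S else 0.0
print(weighted_banzhaf_interaction(v, [0, 1, 2, 3], 0, 1))  # 1.0
```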
Segmenting Collision Sound Sources in Egocentric Videos
Positive · Artificial Intelligence
The article presents a novel task called Collision Sound Source Segmentation (CS3), which aims to identify and segment the objects responsible for collision sounds in egocentric video footage. This task is challenging due to the nature of collision sounds arising from interactions between two objects, making it difficult to isolate the sound source visually. The proposed method utilizes weakly-supervised audio-conditioned segmentation techniques, leveraging foundation models like CLIP and SAM2, and incorporates egocentric cues to enhance object identification.
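A minimal sketch of the audio-conditioned selection step, assuming candidate object masks have already been embedded into the same space as an audio-derived query (e.g. via CLIP-style encoders); the tensors here are random placeholders, not outputs of the paper's CLIP/SAM2 pipeline.

```python
import torch
import torch.nn.functional as F

def rank_masks_by_audio(mask_embs, audio_emb):
    """Rank candidate object masks by similarity to an audio-derived query.

    mask_embs: (num_masks, dim) visual embeddings of candidate object
               regions (e.g. features of masked crops from a segmenter).
    audio_emb: (dim,) embedding of the collision sound projected into the
               same space.
    Returns mask indices sorted from most to least likely sound source.
    """
    sims = F.cosine_similarity(mask_embs, audio_emb.unsqueeze(0), dim=-1)
    return torch.argsort(sims, descending=True)

# Toy example with random embeddings standing in for real encoder outputs.
ranking = rank_masks_by_audio(torch.randn(5, 512), torch.randn(512))
print(ranking)
```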
GMAT: Grounded Multi-Agent Clinical Description Generation for Text Encoder in Vision-Language MIL for Whole Slide Image Classification
Positive · Artificial Intelligence
The article presents a new framework called GMAT, which enhances Multiple Instance Learning (MIL) for whole slide image (WSI) classification. By integrating vision-language models (VLMs), GMAT generates clinical descriptions that are more expressive and medically specific. This addresses a limitation of existing methods that rely on large language models (LLMs) for description generation, whose outputs often lack domain grounding and detailed medical specificity; the grounded descriptions align better with visual features.
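A rough sketch of text-guided, attention-based MIL pooling in the spirit of the setup described above, assuming patch features and class-description embeddings already share a feature space; the module name, shapes, and temperature are illustrative assumptions, not GMAT's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextGuidedMIL(nn.Module):
    """Attention-based MIL pooling over WSI patch features, scored against
    class-description embeddings produced by a text encoder."""
    def __init__(self, feat_dim, hidden_dim=256):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.Tanh(), nn.Linear(hidden_dim, 1)
        )

    def forward(self, patch_feats, class_text_embs, temperature=0.07):
        # patch_feats:     (num_patches, feat_dim) embeddings of WSI tiles.
        # class_text_embs: (num_classes, feat_dim) embeddings of generated
        #                  clinical descriptions, one (or pooled) per class.
        weights = torch.softmax(self.attn(patch_feats), dim=0)   # (P, 1)
        slide_emb = (weights * patch_feats).sum(dim=0)           # (feat_dim,)
        slide_emb = F.normalize(slide_emb, dim=-1)
        text = F.normalize(class_text_embs, dim=-1)
        return (text @ slide_emb) / temperature                  # class logits

# Toy usage with random features standing in for patch and text embeddings.
mil = TextGuidedMIL(feat_dim=512)
logits = mil(torch.randn(1000, 512), torch.randn(3, 512))
print(logits.shape)  # torch.Size([3])
```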