PRSM: A Measure to Evaluate CLIP's Robustness Against Paraphrases

arXiv — cs.CL · Monday, November 17, 2025 at 5:00:00 AM
  • The paper introduces the Paraphrase Ranking Stability Metric (PRSM) to evaluate the robustness of the CLIP model against paraphrased queries, highlighting its sensitivity to linguistic variation. The study uses the Social Counterfactuals dataset to empirically assess CLIP's stability under paraphrastic changes, revealing potential biases in its performance (an illustrative sketch of such a ranking-stability score follows below).
  • This development is significant because reliable deployment of AI systems in socially sensitive contexts requires that multimodal models like CLIP operate fairly and equitably across phrasings, thereby mitigating demographic biases.
— via World Pulse Now AI Editorial System
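
The digest above does not give PRSM's exact formula. As a rough illustration only, a paraphrase ranking-stability score could be computed from CLIP-style embeddings by comparing the retrieval rankings induced by an original query and its paraphrases; the Spearman-correlation choice and function names below are assumptions, not the paper's definition.

```python
# Hypothetical sketch of a paraphrase ranking-stability score (not the paper's exact PRSM).
# Assumes text/image embeddings were already produced by a CLIP-style model and L2-normalized.
import numpy as np
from scipy.stats import spearmanr

def paraphrase_rank_stability(query_emb: np.ndarray,
                              paraphrase_embs: np.ndarray,
                              image_embs: np.ndarray) -> float:
    """Average Spearman correlation between the original query's image ranking
    and the rankings induced by each paraphrase (1.0 = perfectly stable)."""
    base_sims = image_embs @ query_emb            # cosine similarities for the original query
    scores = []
    for p_emb in paraphrase_embs:
        para_sims = image_embs @ p_emb            # similarities for one paraphrase
        rho, _ = spearmanr(base_sims, para_sims)  # rank agreement of the two orderings
        scores.append(rho)
    return float(np.mean(scores))
```

A score near 1.0 would indicate that paraphrasing barely changes which images CLIP retrieves, while lower values would flag sensitivity to wording.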


Recommended Readings
Continual Learning for Image Captioning through Improved Image-Text Alignment
Positive · Artificial Intelligence
Generating accurate and coherent image captions in a continual learning environment poses significant challenges, particularly due to catastrophic forgetting and the evolving nature of visual concepts. This study introduces a multi-loss framework for continual image captioning that leverages semantic guidance through prompt-based continual learning and contrastive alignment. The proposed method, built on a pretrained ViT-GPT-2 backbone, integrates various loss components to enhance image-text alignment without introducing additional parameters.
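
The blurb describes a multi-loss objective combining caption generation with contrastive image-text alignment, without giving the exact terms or weights. A minimal sketch of how such a combined objective might look in PyTorch, with the InfoNCE-style alignment term and all weights as assumptions:

```python
# Illustrative combined captioning + contrastive-alignment loss; not the paper's exact formulation.
import torch
import torch.nn.functional as F

def combined_loss(caption_logits, caption_targets, image_embs, text_embs,
                  align_weight: float = 0.5, temperature: float = 0.07):
    # Token-level cross-entropy for caption generation (padding marked with -100).
    ce = F.cross_entropy(caption_logits.reshape(-1, caption_logits.size(-1)),
                         caption_targets.reshape(-1), ignore_index=-100)

    # CLIP-style symmetric contrastive alignment between image and text embeddings.
    image_embs = F.normalize(image_embs, dim=-1)
    text_embs = F.normalize(text_embs, dim=-1)
    logits = image_embs @ text_embs.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    align = 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

    return ce + align_weight * align
```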
QwenCLIP: Boosting Medical Vision-Language Pretraining via LLM Embeddings and Prompt Tuning
Positive · Artificial Intelligence
QwenCLIP is a new vision-language framework that enhances medical pretraining by integrating large language model (LLM) embeddings and learnable prompts. Traditional Contrastive Language-Image Pretraining (CLIP) struggles with long radiology reports due to its limited token capacity. By replacing CLIP's text encoder with an LLM-based module, QwenCLIP aims to improve cross-modal alignment and capture comprehensive medical semantics, addressing the limitations of existing domain-specific encoders like PubMedBERT and ClinicalBERT.
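
At a high level, replacing CLIP's text encoder with an LLM-based module amounts to pooling LLM hidden states for a long report and projecting them into the joint image-text space. The sketch below shows that general idea; the dimensions, pooling, and linear projection are illustrative assumptions, not QwenCLIP's actual architecture.

```python
# Conceptual sketch: project LLM report embeddings into a CLIP-style joint space.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LLMTextHead(nn.Module):
    def __init__(self, llm_dim: int = 4096, joint_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(llm_dim, joint_dim)  # assumed projection into the joint space

    def forward(self, llm_hidden: torch.Tensor, attn_mask: torch.Tensor) -> torch.Tensor:
        # Mean-pool the LLM's last hidden states over non-padded tokens,
        # then project and normalize for contrastive training against image embeddings.
        mask = attn_mask.unsqueeze(-1).float()
        pooled = (llm_hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-6)
        return F.normalize(self.proj(pooled), dim=-1)
```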
Scene Graph-Guided Generative AI Framework for Synthesizing and Evaluating Industrial Hazard Scenarios
Positive · Artificial Intelligence
A new study introduces a scene graph-guided generative AI framework aimed at synthesizing realistic images of industrial hazard scenarios. This framework addresses the challenge of acquiring datasets for workplace hazards, which are difficult to capture in real-time. By analyzing historical Occupational Safety and Health Administration (OSHA) accident reports with GPT-4o, the study extracts structured hazard reasoning and creates object-level scene graphs. These graphs are utilized to guide a text-to-image diffusion model, generating accurate hazard scenes for evaluation.
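
The pipeline turns structured hazard information into conditioning text for a diffusion model. A simplified sketch of flattening an object-level scene graph into a generation prompt; the graph schema and phrasing are assumptions, not the study's exact format.

```python
# Illustrative only: flatten an object-level scene graph into a text prompt
# that could condition a text-to-image diffusion model. The schema is assumed.
def scene_graph_to_prompt(scene_graph: dict) -> str:
    objects = ", ".join(scene_graph.get("objects", []))
    relations = "; ".join(f"{s} {r} {o}" for s, r, o in scene_graph.get("relations", []))
    hazard = scene_graph.get("hazard", "unspecified hazard")
    return (f"An industrial workplace scene depicting {hazard}. "
            f"Objects present: {objects}. Spatial relations: {relations}.")

# Example usage with a toy graph:
graph = {
    "hazard": "a fall from an unguarded platform",
    "objects": ["worker", "elevated platform", "guardrail"],
    "relations": [("worker", "standing on", "elevated platform"),
                  ("elevated platform", "missing", "guardrail")],
}
print(scene_graph_to_prompt(graph))
```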
Segmenting Collision Sound Sources in Egocentric Videos
Positive · Artificial Intelligence
The article presents a novel task called Collision Sound Source Segmentation (CS3), which aims to identify and segment the objects responsible for collision sounds in egocentric video footage. This task is challenging due to the nature of collision sounds arising from interactions between two objects, making it difficult to isolate the sound source visually. The proposed method utilizes weakly-supervised audio-conditioned segmentation techniques, leveraging foundation models like CLIP and SAM2, and incorporates egocentric cues to enhance object identification.
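
As a rough illustration of audio-conditioned segmentation in general (not the CS3 method), one can score each location of a visual feature map against an audio embedding and threshold the result into a coarse mask; the tensor shapes, cosine-similarity fusion, and threshold below are assumptions.

```python
# Conceptual sketch: compare an audio embedding against per-location visual features
# to obtain a coarse heatmap of the likely sound source. Shapes and threshold are assumed.
import torch
import torch.nn.functional as F

def audio_conditioned_heatmap(visual_feats: torch.Tensor,  # (C, H, W) visual feature map
                              audio_emb: torch.Tensor,     # (C,) audio clip embedding
                              threshold: float = 0.5) -> torch.Tensor:
    C, H, W = visual_feats.shape
    v = F.normalize(visual_feats.reshape(C, H * W), dim=0)    # unit features per location
    a = F.normalize(audio_emb, dim=0)
    sim = (a @ v).reshape(H, W)                               # cosine similarity per location
    heat = (sim - sim.min()) / (sim.max() - sim.min() + 1e-6) # normalize to [0, 1]
    return (heat > threshold).float()                         # coarse binary mask
```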
Explaining Similarity in Vision-Language Encoders with Weighted Banzhaf Interactions
Positive · Artificial Intelligence
The paper titled 'Explaining Similarity in Vision-Language Encoders with Weighted Banzhaf Interactions' introduces a new method called FIxLIP for explaining the similarity outputs of vision-language models. This approach addresses limitations of existing saliency maps by utilizing the weighted Banzhaf interaction index from game theory, which enhances computational efficiency and flexibility. The study emphasizes the importance of understanding complex cross-modal interactions in language-image pre-training (LIP) models.
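
For context, the weighted Banzhaf interaction index for a feature pair {i, j} is the expected discrete mixed derivative of the game value when every other feature joins the coalition independently with probability p. A Monte Carlo sketch of that definition follows; the value function `v` and sampling budget are placeholders, and this is not the FIxLIP implementation.

```python
# Monte Carlo sketch of the weighted Banzhaf interaction index for a feature pair.
# Each remaining feature joins the coalition independently with probability p.
# The value function v (e.g., a masked image-text similarity score) is a placeholder.
import random

def weighted_banzhaf_interaction(v, n_features: int, i: int, j: int,
                                 p: float = 0.5, n_samples: int = 1000) -> float:
    others = [k for k in range(n_features) if k not in (i, j)]
    total = 0.0
    for _ in range(n_samples):
        S = {k for k in others if random.random() < p}
        # Discrete mixed derivative of v with respect to features i and j.
        total += v(S | {i, j}) - v(S | {i}) - v(S | {j}) + v(S)
    return total / n_samples
```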
Not All Attention Heads Are What You Need: Refining CLIP's Image Representation with Attention Ablation
Positive · Artificial Intelligence
This paper investigates the role of attention heads in CLIP's image encoder. It finds that certain heads across layers can negatively impact representations. To address this, the authors propose the Attention Ablation Technique (AAT), which suppresses selected heads by manipulating their attention weights. AAT allows for the identification and ablation of harmful heads with minimal overhead, leading to improved downstream performance, including an 11.1% boost in recall on cross-modal retrieval benchmarks.
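
A minimal sketch of the head-suppression idea: zero out the output of selected attention heads before the layer's output projection. The per-head tensor layout and the chosen head indices are assumptions, not the paper's exact AAT procedure.

```python
# Illustrative sketch: suppress selected attention heads by zeroing their output slice.
# Assumes per-head outputs are concatenated along the channel dimension before the
# output projection; this is not the paper's exact Attention Ablation Technique.
import torch

def ablate_heads(head_outputs: torch.Tensor, num_heads: int, heads_to_ablate) -> torch.Tensor:
    # head_outputs: (batch, seq_len, embed_dim) with embed_dim = num_heads * head_dim
    b, s, d = head_outputs.shape
    head_dim = d // num_heads
    out = head_outputs.clone().reshape(b, s, num_heads, head_dim)
    for h in heads_to_ablate:
        out[:, :, h, :] = 0.0          # drop this head's contribution
    return out.reshape(b, s, d)
```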