PRSM: A Measure to Evaluate CLIP's Robustness Against Paraphrases

arXiv — cs.CL · Monday, November 17, 2025 at 5:00:00 AM
  • The paper introduces the Paraphrase Ranking Stability Metric (PRSM) to evaluate the robustness of the CLIP model against paraphrased queries, highlighting its sensitivity to linguistic variation. The study uses the Social Counterfactuals dataset to empirically assess CLIP's stability under paraphrastic changes, revealing potential biases in its performance (an illustrative sketch of such a ranking-stability score follows below).
  • This development is significant because reliable deployment of AI systems in socially sensitive contexts requires that multimodal models like CLIP operate fairly and equitably across phrasings, thereby mitigating demographic biases.
— via World Pulse Now AI Editorial System
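
The digest above does not give PRSM's exact formula. As a rough illustration only, a paraphrase ranking-stability score could be computed from CLIP-style embeddings by comparing the retrieval rankings induced by an original query and its paraphrases; the Spearman-correlation choice and function names below are assumptions, not the paper's definition.

```python
# Hypothetical sketch of a paraphrase ranking-stability score (not the paper's exact PRSM).
# Assumes text/image embeddings were already produced by a CLIP-style model and L2-normalized.
import numpy as np
from scipy.stats import spearmanr

def paraphrase_rank_stability(query_emb: np.ndarray,
                              paraphrase_embs: np.ndarray,
                              image_embs: np.ndarray) -> float:
    """Average Spearman correlation between the original query's image ranking
    and the rankings induced by each paraphrase (1.0 = perfectly stable)."""
    base_sims = image_embs @ query_emb            # cosine similarities for the original query
    scores = []
    for p_emb in paraphrase_embs:
        para_sims = image_embs @ p_emb            # similarities for one paraphrase
        rho, _ = spearmanr(base_sims, para_sims)  # rank agreement of the two orderings
        scores.append(rho)
    return float(np.mean(scores))
```

A score near 1.0 would indicate that paraphrasing barely changes which images CLIP retrieves, while lower values would flag sensitivity to wording.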


Recommended Readings
Continual Learning for Image Captioning through Improved Image-Text Alignment
Positive · Artificial Intelligence
Generating accurate and coherent image captions in a continual learning environment poses significant challenges, particularly due to catastrophic forgetting and the evolving nature of visual concepts. This study introduces a multi-loss framework for continual image captioning that leverages semantic guidance through prompt-based continual learning and contrastive alignment. The proposed method, built on a pretrained ViT-GPT-2 backbone, integrates various loss components to enhance image-text alignment without introducing additional parameters.
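
The blurb describes a multi-loss objective combining caption generation with contrastive image-text alignment, without giving the exact terms or weights. A minimal sketch of how such a combined objective might look in PyTorch, with the InfoNCE-style alignment term and all weights as assumptions:

```python
# Illustrative combined captioning + contrastive-alignment loss; not the paper's exact formulation.
import torch
import torch.nn.functional as F

def combined_loss(caption_logits, caption_targets, image_embs, text_embs,
                  align_weight: float = 0.5, temperature: float = 0.07):
    # Token-level cross-entropy for caption generation (padding marked with -100).
    ce = F.cross_entropy(caption_logits.reshape(-1, caption_logits.size(-1)),
                         caption_targets.reshape(-1), ignore_index=-100)

    # CLIP-style symmetric contrastive alignment between image and text embeddings.
    image_embs = F.normalize(image_embs, dim=-1)
    text_embs = F.normalize(text_embs, dim=-1)
    logits = image_embs @ text_embs.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    align = 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

    return ce + align_weight * align
```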
QwenCLIP: Boosting Medical Vision-Language Pretraining via LLM Embeddings and Prompt Tuning
Positive · Artificial Intelligence
QwenCLIP is a new vision-language framework that enhances medical pretraining by integrating large language model (LLM) embeddings and learnable prompts. Traditional Contrastive Language-Image Pretraining (CLIP) struggles with long radiology reports due to its limited token capacity. By replacing CLIP's text encoder with an LLM-based module, QwenCLIP aims to improve cross-modal alignment and capture comprehensive medical semantics, addressing the limitations of existing domain-specific encoders like PubMedBERT and ClinicalBERT.
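
At a high level, replacing CLIP's text encoder with an LLM-based module amounts to pooling LLM hidden states for a long report and projecting them into the joint image-text space. The sketch below shows that general idea; the dimensions, pooling, and linear projection are illustrative assumptions, not QwenCLIP's actual architecture.

```python
# Conceptual sketch: project LLM report embeddings into a CLIP-style joint space.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LLMTextHead(nn.Module):
    def __init__(self, llm_dim: int = 4096, joint_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(llm_dim, joint_dim)  # assumed projection into the joint space

    def forward(self, llm_hidden: torch.Tensor, attn_mask: torch.Tensor) -> torch.Tensor:
        # Mean-pool the LLM's last hidden states over non-padded tokens,
        # then project and normalize for contrastive training against image embeddings.
        mask = attn_mask.unsqueeze(-1).float()
        pooled = (llm_hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-6)
        return F.normalize(self.proj(pooled), dim=-1)
```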
Scene Graph-Guided Generative AI Framework for Synthesizing and Evaluating Industrial Hazard Scenarios
Positive · Artificial Intelligence
A new study introduces a scene graph-guided generative AI framework aimed at synthesizing realistic images of industrial hazard scenarios. This framework addresses the challenge of acquiring datasets for workplace hazards, which are difficult to capture in real-time. By analyzing historical Occupational Safety and Health Administration (OSHA) accident reports with GPT-4o, the study extracts structured hazard reasoning and creates object-level scene graphs. These graphs are utilized to guide a text-to-image diffusion model, generating accurate hazard scenes for evaluation.
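
The pipeline turns structured hazard information into conditioning text for a diffusion model. A simplified sketch of flattening an object-level scene graph into a generation prompt; the graph schema and phrasing are assumptions, not the study's exact format.

```python
# Illustrative only: flatten an object-level scene graph into a text prompt
# that could condition a text-to-image diffusion model. The schema is assumed.
def scene_graph_to_prompt(scene_graph: dict) -> str:
    objects = ", ".join(scene_graph.get("objects", []))
    relations = "; ".join(f"{s} {r} {o}" for s, r, o in scene_graph.get("relations", []))
    hazard = scene_graph.get("hazard", "unspecified hazard")
    return (f"An industrial workplace scene depicting {hazard}. "
            f"Objects present: {objects}. Spatial relations: {relations}.")

# Example usage with a toy graph:
graph = {
    "hazard": "a fall from an unguarded platform",
    "objects": ["worker", "elevated platform", "guardrail"],
    "relations": [("worker", "standing on", "elevated platform"),
                  ("elevated platform", "missing", "guardrail")],
}
print(scene_graph_to_prompt(graph))
```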
Segmenting Collision Sound Sources in Egocentric Videos
Positive · Artificial Intelligence
The article presents a novel task called Collision Sound Source Segmentation (CS3), which aims to identify and segment the objects responsible for collision sounds in egocentric video footage. This task is challenging due to the nature of collision sounds arising from interactions between two objects, making it difficult to isolate the sound source visually. The proposed method utilizes weakly-supervised audio-conditioned segmentation techniques, leveraging foundation models like CLIP and SAM2, and incorporates egocentric cues to enhance object identification.
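
As a rough illustration of audio-conditioned segmentation in general (not the CS3 method), one can score each location of a visual feature map against an audio embedding and threshold the result into a coarse mask; the tensor shapes, cosine-similarity fusion, and threshold below are assumptions.

```python
# Conceptual sketch: compare an audio embedding against per-location visual features
# to obtain a coarse heatmap of the likely sound source. Shapes and threshold are assumed.
import torch
import torch.nn.functional as F

def audio_conditioned_heatmap(visual_feats: torch.Tensor,  # (C, H, W) visual feature map
                              audio_emb: torch.Tensor,     # (C,) audio clip embedding
                              threshold: float = 0.5) -> torch.Tensor:
    C, H, W = visual_feats.shape
    v = F.normalize(visual_feats.reshape(C, H * W), dim=0)    # unit features per location
    a = F.normalize(audio_emb, dim=0)
    sim = (a @ v).reshape(H, W)                               # cosine similarity per location
    heat = (sim - sim.min()) / (sim.max() - sim.min() + 1e-6) # normalize to [0, 1]
    return (heat > threshold).float()                         # coarse binary mask
```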
Explaining Similarity in Vision-Language Encoders with Weighted Banzhaf Interactions
Positive · Artificial Intelligence
The paper titled 'Explaining Similarity in Vision-Language Encoders with Weighted Banzhaf Interactions' introduces a new method called FIxLIP for explaining the similarity outputs of vision-language models. This approach addresses limitations of existing saliency maps by utilizing the weighted Banzhaf interaction index from game theory, which enhances computational efficiency and flexibility. The study emphasizes the importance of understanding complex cross-modal interactions in language-image pre-training (LIP) models.
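
For context, the weighted Banzhaf interaction index for a feature pair {i, j} is the expected discrete mixed derivative of the game value when every other feature joins the coalition independently with probability p. A Monte Carlo sketch of that definition follows; the value function `v` and sampling budget are placeholders, and this is not the FIxLIP implementation.

```python
# Monte Carlo sketch of the weighted Banzhaf interaction index for a feature pair.
# Each remaining feature joins the coalition independently with probability p.
# The value function v (e.g., a masked image-text similarity score) is a placeholder.
import random

def weighted_banzhaf_interaction(v, n_features: int, i: int, j: int,
                                 p: float = 0.5, n_samples: int = 1000) -> float:
    others = [k for k in range(n_features) if k not in (i, j)]
    total = 0.0
    for _ in range(n_samples):
        S = {k for k in others if random.random() < p}
        # Discrete mixed derivative of v with respect to features i and j.
        total += v(S | {i, j}) - v(S | {i}) - v(S | {j}) + v(S)
    return total / n_samples
```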
Not All Attention Heads Are What You Need: Refining CLIP's Image Representation with Attention Ablation
Positive · Artificial Intelligence
This paper investigates the role of attention heads in CLIP's image encoder. It finds that certain heads across layers can negatively impact representations. To address this, the authors propose the Attention Ablation Technique (AAT), which suppresses selected heads by manipulating their attention weights. AAT allows for the identification and ablation of harmful heads with minimal overhead, leading to improved downstream performance, including an 11.1% boost in recall on cross-modal retrieval benchmarks.
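
A minimal sketch of the head-suppression idea: zero out the output of selected attention heads before the layer's output projection. The per-head tensor layout and the chosen head indices are assumptions, not the paper's exact AAT procedure.

```python
# Illustrative sketch: suppress selected attention heads by zeroing their output slice.
# Assumes per-head outputs are concatenated along the channel dimension before the
# output projection; this is not the paper's exact Attention Ablation Technique.
import torch

def ablate_heads(head_outputs: torch.Tensor, num_heads: int, heads_to_ablate) -> torch.Tensor:
    # head_outputs: (batch, seq_len, embed_dim) with embed_dim = num_heads * head_dim
    b, s, d = head_outputs.shape
    head_dim = d // num_heads
    out = head_outputs.clone().reshape(b, s, num_heads, head_dim)
    for h in heads_to_ablate:
        out[:, :, h, :] = 0.0          # drop this head's contribution
    return out.reshape(b, s, d)
```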