Do Vision-Language Models Understand Visual Persuasiveness?

arXiv — cs.CV•Monday, November 24, 2025 at 5:00:00 AM

Recent research examines whether Vision-Language Models (VLMs) understand visual persuasion, that is, how images influence human attitudes and decisions. The authors built a new dataset for binary persuasiveness judgment and introduced a taxonomy of Visual Persuasive Factors (VPFs) covering low-, mid-, and high-level visual cues. Their analysis indicates that VLMs tend to over-predict high persuasiveness and struggle with low- and mid-level features, while high-level semantic alignment is a strong predictor of human judgment.
Understanding visual persuasion matters for VLM applications in marketing, education, and social media, where visual content strongly shapes audience perception. The findings suggest that improving VLMs' ability to recognize and interpret persuasive visual elements could support more effective communication and user engagement.
This work sits alongside other efforts to improve VLMs, such as Agentic Video Intelligence and self-evolving models. As the field progresses, addressing the biases and limitations of current models will be essential for building AI systems that better capture human judgment and decision-making.

— via World Pulse Now AI Editorial System
