IAG: Input-aware Backdoor Attack on VLM-based Visual Grounding

arXiv — cs.CL · Tuesday, November 25, 2025 at 5:00:00 AM
  • A recent study introduced IAG, a novel multi-target backdoor attack on vision-language models (VLMs) used for visual grounding, revealing significant vulnerabilities in these systems. The attack employs dynamic, input-aware triggers that are text-guided and adapt to arbitrary target object descriptions, posing a serious security risk to VLM applications (a minimal sketch of such a trigger generator appears after this list).
  • The implications are critical: the work highlights the need for stronger security measures in VLM-based systems, which are increasingly deployed in applications spanning image recognition and natural language processing. The findings suggest that existing models are susceptible to sophisticated, adaptive attacks, warranting immediate attention from developers and researchers.
  • This development underscores a growing concern about the security of AI systems as multimodal models continue to advance. That techniques like IAG emerge alongside frameworks for improving spatial reasoning and scene understanding in VLMs reflects a broader trend in the AI community: balancing performance against security remains a pivotal challenge.
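To make the mechanism concrete, below is a minimal PyTorch sketch of a text-guided, input-aware trigger generator in the spirit the summary describes. All module names, dimensions, and the perturbation budget are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class TriggerGenerator(nn.Module):
    """Produces a per-image perturbation conditioned on a target
    object description, so the trigger adapts to each input."""
    def __init__(self, text_dim=512, hidden=64, eps=8 / 255):
        super().__init__()
        self.eps = eps  # L_inf budget keeping the trigger inconspicuous
        self.text_proj = nn.Linear(text_dim, hidden)
        self.net = nn.Sequential(
            nn.Conv2d(3 + hidden, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, image, text_emb):
        # image: (B, 3, H, W) in [0, 1]; text_emb: (B, text_dim),
        # e.g. an encoding of the attacker's target object description.
        b, _, h, w = image.shape
        cond = self.text_proj(text_emb).view(b, -1, 1, 1).expand(-1, -1, h, w)
        delta = self.net(torch.cat([image, cond], dim=1))  # bounded in [-1, 1]
        return (image + self.eps * delta).clamp(0, 1)  # poisoned image
```

Conditioning the perturbation on both the image and the target description is what would let a single generator realize many attack targets, which is the multi-target property the study emphasizes.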
— via World Pulse Now AI Editorial System


Continue Reading
TRANSPORTER: Transferring Visual Semantics from VLM Manifolds
Positive · Artificial Intelligence
The paper introduces TRANSPORTER, a model-independent approach designed to enhance video generation by transferring visual semantics from Vision Language Models (VLMs). This method addresses the challenge of understanding how VLMs derive their predictions, particularly in complex scenes with various objects and actions. TRANSPORTER generates videos that reflect changes in captions across diverse attributes and contexts.
Zero-Reference Joint Low-Light Enhancement and Deblurring via Visual Autoregressive Modeling with VLM-Derived Modulation
Positive · Artificial Intelligence
A new generative framework has been proposed for enhancing low-light images and reducing blur, utilizing visual autoregressive modeling guided by perceptual priors from vision-language models. This approach addresses significant challenges in restoring dark images, which often suffer from low visibility, contrast, noise, and blur.
DiffSeg30k: A Multi-Turn Diffusion Editing Benchmark for Localized AIGC Detection
Positive · Artificial Intelligence
The introduction of DiffSeg30k marks a significant advancement in the detection of AI-generated content (AIGC) by providing a dataset of 30,000 diffusion-edited images with pixel-level annotations. This enables fine-grained detection of localized edits, addressing a gap in existing benchmarks, which typically assess entire images without accounting for localized modifications.
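A brief sketch of the kind of evaluation such pixel-level annotations enable: scoring a localized-AIGC detector's predicted mask against the ground-truth edit mask with IoU. The function below is illustrative, not the benchmark's official metric code.

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """pred, gt: boolean HxW masks marking diffusion-edited pixels."""
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # neither mask flags an edit: count as a perfect match
    return float(np.logical_and(pred, gt).sum() / union)
```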
Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens
Positive · Artificial Intelligence
A new framework called Chain-of-Visual-Thought (COVT) has been introduced to enhance Vision-Language Models (VLMs) by enabling them to reason with continuous visual tokens, which encapsulate rich perceptual cues. This approach aims to address the limitations of current VLMs in dense visual perception tasks, such as spatial reasoning and geometric awareness, by distilling knowledge from lightweight vision experts within a budget of approximately 20 tokens.
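As a rough illustration of the token-budget idea, the sketch below pools a lightweight vision expert's features into about 20 continuous tokens projected into a language model's embedding space. Module names and dimensions are assumptions for the sketch, not COVT's actual code.

```python
import torch
import torch.nn as nn

class VisualTokenAdapter(nn.Module):
    """Pools a vision expert's feature map into a fixed budget of
    continuous tokens at the language model's embedding width."""
    def __init__(self, expert_dim=256, llm_dim=4096, num_tokens=20):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_tokens, expert_dim))
        self.attn = nn.MultiheadAttention(expert_dim, num_heads=4, batch_first=True)
        self.proj = nn.Linear(expert_dim, llm_dim)

    def forward(self, expert_feats):
        # expert_feats: (B, N_patches, expert_dim) from a lightweight
        # vision expert (e.g. a depth or segmentation backbone).
        b = expert_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        pooled, _ = self.attn(q, expert_feats, expert_feats)
        return self.proj(pooled)  # (B, num_tokens, llm_dim), ready to concat
```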
VLCE: A Knowledge-Enhanced Framework for Image Description in Disaster Assessment
Positive · Artificial Intelligence
The Vision Language Caption Enhancer (VLCE) has been introduced as a multimodal framework designed to improve image description in disaster assessments by integrating external semantic knowledge from ConceptNet and WordNet. This framework addresses the limitations of current Vision-Language Models (VLMs) that often fail to generate disaster-specific descriptions due to a lack of domain knowledge.
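To illustrate the knowledge-integration step, the snippet below pulls neighboring concepts from ConceptNet's public web API to enrich a caption with domain vocabulary. The endpoint is real, but the enrichment logic is a hedged sketch, not VLCE's pipeline.

```python
import requests

def related_concepts(term: str, limit: int = 5) -> list[str]:
    """Fetch labels of concepts linked to `term` in ConceptNet."""
    url = f"http://api.conceptnet.io/c/en/{term}?limit={limit}"
    edges = requests.get(url, timeout=10).json().get("edges", [])
    labels = []
    for edge in edges:
        for node in (edge.get("start", {}), edge.get("end", {})):
            label = node.get("label", "")
            if label and label.lower() != term.lower():
                labels.append(label)
    return labels[:limit]
```

For example, related_concepts("flood") might surface terms a caption enhancer could inject to steer output toward disaster-specific vocabulary.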
SpatialGeo: Boosting Spatial Reasoning in Multimodal LLMs via Geometry-Semantics Fusion
Positive · Artificial Intelligence
SpatialGeo has been introduced as a novel vision encoder that enhances the spatial reasoning capabilities of multimodal large language models (MLLMs) by integrating geometry and semantics features. This advancement addresses the limitations of existing MLLMs, particularly in interpreting spatial arrangements in three-dimensional space, which has been a significant challenge in the field.
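One plausible form such a fusion could take is a gated blend of semantic patch tokens with geometry tokens, sketched below. All module names and dimensions are assumptions, not SpatialGeo's architecture.

```python
import torch
import torch.nn as nn

class GeometrySemanticsFusion(nn.Module):
    """Gated fusion of semantic patch tokens with geometry tokens."""
    def __init__(self, sem_dim=1024, geo_dim=384, out_dim=4096):
        super().__init__()
        self.sem_proj = nn.Linear(sem_dim, out_dim)
        self.geo_proj = nn.Linear(geo_dim, out_dim)
        self.gate = nn.Sequential(nn.Linear(2 * out_dim, out_dim), nn.Sigmoid())

    def forward(self, sem_tokens, geo_tokens):
        # sem_tokens: (B, N, sem_dim) CLIP-style semantic features;
        # geo_tokens: (B, N, geo_dim) from a geometry-aware encoder,
        # aligned to the same patch grid.
        s, g = self.sem_proj(sem_tokens), self.geo_proj(geo_tokens)
        a = self.gate(torch.cat([s, g], dim=-1))  # per-token mixing weights
        return a * s + (1 - a) * g  # fused tokens for the MLLM
```

The per-token gate lets the encoder lean on geometry where semantics alone underdetermine spatial arrangement, which is the failure mode the paper targets.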