IAG: Input-aware Backdoor Attack on VLM-based Visual Grounding

arXiv — cs.CL · Tuesday, November 25, 2025 at 5:00:00 AM
  • A recent study introduced IAG, a novel multi-target backdoor attack on vision-language models (VLMs) used for visual grounding, revealing significant vulnerabilities in these systems. The attack employs dynamic, input-aware triggers that are text-guided and adapt to arbitrary target object descriptions, posing a serious security risk to VLM applications (a minimal sketch of such a trigger generator appears after this list).
  • The implications are critical: the work highlights the need for stronger security measures in VLM-based systems, which are increasingly deployed in applications spanning image recognition and natural language processing. The findings suggest that existing models are susceptible to sophisticated, adaptive attacks, warranting immediate attention from developers and researchers.
  • This development underscores a growing concern about the security of AI systems as multimodal models continue to advance. That techniques like IAG emerge alongside frameworks for improving spatial reasoning and scene understanding in VLMs reflects a broader trend in the AI community: balancing performance against security remains a pivotal challenge.
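To make the mechanism concrete, below is a minimal PyTorch sketch of a text-guided, input-aware trigger generator in the spirit the summary describes. All module names, dimensions, and the perturbation budget are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class TriggerGenerator(nn.Module):
    """Produces a per-image perturbation conditioned on a target
    object description, so the trigger adapts to each input."""
    def __init__(self, text_dim=512, hidden=64, eps=8 / 255):
        super().__init__()
        self.eps = eps  # L_inf budget keeping the trigger inconspicuous
        self.text_proj = nn.Linear(text_dim, hidden)
        self.net = nn.Sequential(
            nn.Conv2d(3 + hidden, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, image, text_emb):
        # image: (B, 3, H, W) in [0, 1]; text_emb: (B, text_dim),
        # e.g. an encoding of the attacker's target object description.
        b, _, h, w = image.shape
        cond = self.text_proj(text_emb).view(b, -1, 1, 1).expand(-1, -1, h, w)
        delta = self.net(torch.cat([image, cond], dim=1))  # bounded in [-1, 1]
        return (image + self.eps * delta).clamp(0, 1)  # poisoned image
```

Conditioning the perturbation on both the image and the target description is what would let a single generator realize many attack targets, which is the multi-target property the study emphasizes.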
— via World Pulse Now AI Editorial System


Continue Reading
TRANSPORTER: Transferring Visual Semantics from VLM Manifolds
Positive · Artificial Intelligence
The paper introduces TRANSPORTER, a model-independent approach designed to enhance video generation by transferring visual semantics from Vision Language Models (VLMs). This method addresses the challenge of understanding how VLMs derive their predictions, particularly in complex scenes with various objects and actions. TRANSPORTER generates videos that reflect changes in captions across diverse attributes and contexts.
Zero-Reference Joint Low-Light Enhancement and Deblurring via Visual Autoregressive Modeling with VLM-Derived Modulation
Positive · Artificial Intelligence
A new generative framework has been proposed for enhancing low-light images and reducing blur, utilizing visual autoregressive modeling guided by perceptual priors from vision-language models. This approach addresses significant challenges in restoring dark images, which often suffer from low visibility, contrast, noise, and blur.
DiffSeg30k: A Multi-Turn Diffusion Editing Benchmark for Localized AIGC Detection
Positive · Artificial Intelligence
The introduction of DiffSeg30k marks a significant advancement in the detection of AI-generated content (AIGC) by providing a dataset of 30,000 diffusion-edited images with pixel-level annotations. This enables fine-grained detection of localized edits, addressing a gap in existing benchmarks, which typically assess entire images without accounting for localized modifications.
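A brief sketch of the kind of evaluation such pixel-level annotations enable: scoring a localized-AIGC detector's predicted mask against the ground-truth edit mask with IoU. The function below is illustrative, not the benchmark's official metric code.

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """pred, gt: boolean HxW masks marking diffusion-edited pixels."""
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # neither mask flags an edit: count as a perfect match
    return float(np.logical_and(pred, gt).sum() / union)
```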
Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens
Positive · Artificial Intelligence
A new framework called Chain-of-Visual-Thought (COVT) has been introduced to enhance Vision-Language Models (VLMs) by enabling them to reason with continuous visual tokens, which encapsulate rich perceptual cues. This approach aims to address the limitations of current VLMs in dense visual perception tasks, such as spatial reasoning and geometric awareness, by distilling knowledge from lightweight vision experts within a budget of approximately 20 tokens.
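As a rough illustration of the token-budget idea, the sketch below pools a lightweight vision expert's features into about 20 continuous tokens projected into a language model's embedding space. Module names and dimensions are assumptions for the sketch, not COVT's actual code.

```python
import torch
import torch.nn as nn

class VisualTokenAdapter(nn.Module):
    """Pools a vision expert's feature map into a fixed budget of
    continuous tokens at the language model's embedding width."""
    def __init__(self, expert_dim=256, llm_dim=4096, num_tokens=20):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_tokens, expert_dim))
        self.attn = nn.MultiheadAttention(expert_dim, num_heads=4, batch_first=True)
        self.proj = nn.Linear(expert_dim, llm_dim)

    def forward(self, expert_feats):
        # expert_feats: (B, N_patches, expert_dim) from a lightweight
        # vision expert (e.g. a depth or segmentation backbone).
        b = expert_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        pooled, _ = self.attn(q, expert_feats, expert_feats)
        return self.proj(pooled)  # (B, num_tokens, llm_dim), ready to concat
```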
VLCE: A Knowledge-Enhanced Framework for Image Description in Disaster Assessment
Positive · Artificial Intelligence
The Vision Language Caption Enhancer (VLCE) has been introduced as a multimodal framework designed to improve image description in disaster assessments by integrating external semantic knowledge from ConceptNet and WordNet. This framework addresses the limitations of current Vision-Language Models (VLMs) that often fail to generate disaster-specific descriptions due to a lack of domain knowledge.
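To illustrate the knowledge-integration step, the snippet below pulls neighboring concepts from ConceptNet's public web API to enrich a caption with domain vocabulary. The endpoint is real, but the enrichment logic is a hedged sketch, not VLCE's pipeline.

```python
import requests

def related_concepts(term: str, limit: int = 5) -> list[str]:
    """Fetch labels of concepts linked to `term` in ConceptNet."""
    url = f"http://api.conceptnet.io/c/en/{term}?limit={limit}"
    edges = requests.get(url, timeout=10).json().get("edges", [])
    labels = []
    for edge in edges:
        for node in (edge.get("start", {}), edge.get("end", {})):
            label = node.get("label", "")
            if label and label.lower() != term.lower():
                labels.append(label)
    return labels[:limit]
```

For example, related_concepts("flood") might surface terms a caption enhancer could inject to steer output toward disaster-specific vocabulary.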
SpatialGeo: Boosting Spatial Reasoning in Multimodal LLMs via Geometry-Semantics Fusion
Positive · Artificial Intelligence
SpatialGeo has been introduced as a novel vision encoder that enhances the spatial reasoning capabilities of multimodal large language models (MLLMs) by integrating geometry and semantics features. This advancement addresses the limitations of existing MLLMs, particularly in interpreting spatial arrangements in three-dimensional space, which has been a significant challenge in the field.
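One plausible form such a fusion could take is a gated blend of semantic patch tokens with geometry tokens, sketched below. All module names and dimensions are assumptions, not SpatialGeo's architecture.

```python
import torch
import torch.nn as nn

class GeometrySemanticsFusion(nn.Module):
    """Gated fusion of semantic patch tokens with geometry tokens."""
    def __init__(self, sem_dim=1024, geo_dim=384, out_dim=4096):
        super().__init__()
        self.sem_proj = nn.Linear(sem_dim, out_dim)
        self.geo_proj = nn.Linear(geo_dim, out_dim)
        self.gate = nn.Sequential(nn.Linear(2 * out_dim, out_dim), nn.Sigmoid())

    def forward(self, sem_tokens, geo_tokens):
        # sem_tokens: (B, N, sem_dim) CLIP-style semantic features;
        # geo_tokens: (B, N, geo_dim) from a geometry-aware encoder,
        # aligned to the same patch grid.
        s, g = self.sem_proj(sem_tokens), self.geo_proj(geo_tokens)
        a = self.gate(torch.cat([s, g], dim=-1))  # per-token mixing weights
        return a * s + (1 - a) * g  # fused tokens for the MLLM
```

The per-token gate lets the encoder lean on geometry where semantics alone underdetermine spatial arrangement, which is the failure mode the paper targets.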