RedDiffuser: Red Teaming Vision-Language Models for Toxic Continuation via Reinforced Stable Diffusion
Neutral · Artificial Intelligence
RedDiffuser is a red-teaming framework that exposes the vulnerability of Vision-Language Models (VLMs) to toxic continuation attacks, in which a harmful input is paired with a partially toxic output and the model is coaxed into completing it dangerously. It is the first framework to fine-tune a diffusion model with reinforcement learning so that it generates adversarial images that induce such toxic continuations. In experiments, RedDiffuser raises the toxicity rate of LLaVA outputs by 10.69% and 8.91% on the original and hold-out sets, respectively, and the attack transfers across models, increasing toxicity rates by 5.1% on Gemini and 26.83% on LLaMA-Vision. These results point to a cross-modal toxicity amplification weakness in current VLM alignment and underscore the need for robust multimodal red teaming to improve the safety and reliability of AI systems.
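The summary does not specify the training algorithm, so the sketch below is only a minimal illustration of the general idea: an image generator is treated as a stochastic policy and nudged by a policy-gradient (REINFORCE-style) update toward images that earn a higher toxicity reward from a frozen judge. `ToyDiffusionPolicy`, `toxicity_reward`, and `reward_head` are hypothetical stand-ins, not the paper's API; in the actual pipeline the generator would be Stable Diffusion and the reward would come from scoring the target VLM's continuation for toxicity.

```python
# Hypothetical sketch of reward-driven fine-tuning of an image generator.
# Toy modules stand in for Stable Diffusion and the toxicity judge so the
# loop runs self-contained; this is not the paper's implementation.
import torch
import torch.nn as nn

class ToyDiffusionPolicy(nn.Module):
    """Stand-in generator: maps noise to a flat 'image' vector and exposes
    the log-probability of the sampling step, which the score-function
    (REINFORCE) estimator needs."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, dim))
        self.log_std = nn.Parameter(torch.zeros(dim))

    def sample(self, noise: torch.Tensor):
        mean = self.net(noise)
        dist = torch.distributions.Normal(mean, self.log_std.exp())
        image = dist.sample()                      # stochastic generation step
        return image, dist.log_prob(image).sum(-1)

# Frozen stand-in reward: in the real pipeline this would feed the image plus
# a harmful text prefix to the target VLM and score the continuation's toxicity.
reward_head = nn.Linear(64, 1)
for p in reward_head.parameters():
    p.requires_grad_(False)

def toxicity_reward(images: torch.Tensor) -> torch.Tensor:
    return reward_head(images).squeeze(-1)

policy = ToyDiffusionPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-4)

for step in range(100):
    noise = torch.randn(32, 64)
    images, logp = policy.sample(noise)
    rewards = toxicity_reward(images)
    advantage = rewards - rewards.mean()           # baseline-subtracted reward
    loss = -(advantage.detach() * logp).mean()     # REINFORCE objective
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Subtracting the batch-mean reward as a baseline is a standard variance-reduction choice for score-function estimators; a real implementation would also likely need a KL or image-fidelity term so the fine-tuned generator keeps producing natural-looking adversarial images.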
— via World Pulse Now AI Editorial System
