RedDiffuser: Red Teaming Vision-Language Models for Toxic Continuation via Reinforced Stable Diffusion

arXiv — cs.CV — Wednesday, November 12, 2025 at 5:00:00 AM
RedDiffuser is a red-teaming framework that exposes the vulnerability of Vision-Language Models (VLMs) to toxic continuation attacks, in which a harmful input is paired with a partial toxic output and the model completes it with dangerous content. It is the first framework to use reinforcement learning to fine-tune a diffusion model so that it generates adversarial images that induce such toxic continuations. Experiments show that RedDiffuser raises the toxicity rate of LLaVA outputs by 10.69% on the original set and 8.91% on a hold-out set, and the attack transfers to other models, increasing toxicity rates by 5.1% on Gemini and 26.83% on LLaMA-Vision. These findings point to a cross-modal toxicity amplification weakness in current VLM alignment and underscore the need for robust multimodal red teaming to improve the safety and reliability of AI systems.
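To make the core idea concrete, the sketch below shows a reward-weighted (REINFORCE-style) update of an image generator, the general mechanism behind fine-tuning a generative model with reinforcement learning. It is a minimal illustration only, not RedDiffuser's actual method: `ToyImageGenerator` and `toxicity_reward` are hypothetical stand-ins for the Stable Diffusion pipeline and the VLM-based toxicity scorer described in the paper.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins: the real system would fine-tune a Stable Diffusion
# model and score toxicity of the VLM's continuation; here both are toy
# components so the reward-weighted update is runnable end to end.

class ToyImageGenerator(nn.Module):
    """Tiny Gaussian 'generator' standing in for a fine-tunable diffusion model."""
    def __init__(self, latent_dim=16, image_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, image_dim)
        )
        self.log_std = nn.Parameter(torch.zeros(image_dim))

    def sample(self, z):
        mean = self.net(z)
        dist = torch.distributions.Normal(mean, self.log_std.exp())
        image = dist.rsample()
        # Log-probability of the sampled image under the current policy,
        # needed for the REINFORCE gradient below.
        return image, dist.log_prob(image).sum(dim=-1)


def toxicity_reward(images):
    """Placeholder reward. In the paper's setting, the reward would come from
    querying the target VLM with the image plus a partial toxic sentence and
    scoring how toxic the generated continuation is."""
    return images.mean(dim=-1).detach()


generator = ToyImageGenerator()
optimizer = torch.optim.Adam(generator.parameters(), lr=1e-3)

for step in range(200):
    z = torch.randn(32, 16)                  # batch of latent noise vectors
    images, log_probs = generator.sample(z)
    rewards = toxicity_reward(images)
    advantage = rewards - rewards.mean()     # simple baseline for variance reduction
    loss = -(advantage * log_probs).mean()   # raise probability of high-reward images
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The key design point illustrated here is that the generator is optimized against a black-box reward rather than a differentiable loss, which is what lets the attack steer image generation toward whatever the toxicity scorer rewards.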
