Benchmarking Gaslighting Negation Attacks Against Reasoning Models
NegativeArtificial Intelligence
- Recent research evaluated the vulnerability of leading reasoning models, including OpenAI's o4-mini, Claude-3.7-Sonnet, and Gemini-2.5-Flash, to gaslighting negation attacks, which significantly reduced their accuracy by 25-29% on average across multimodal benchmarks like MMMU, MathVista, and CharXiv. This highlights a critical gap in the robustness of these advanced AI systems against manipulative inputs.
- The findings underscore the challenges faced by top-tier AI models in maintaining accuracy under adversarial conditions, raising concerns about their reliability in real-world applications where user feedback can be misleading or deceptive.
- This situation reflects ongoing debates in the AI community regarding the biases inherent in large language models and their evaluation methods, as well as the need for improved benchmarks and diagnostic tools like GaslightingBench-R to better assess and enhance model resilience against adversarial prompts.
— via World Pulse Now AI Editorial System





