Beyond Over-Refusal: Scenario-Based Diagnostics and Post-Hoc Mitigation for Exaggerated Refusals in LLMs
Neutral · Artificial Intelligence
- Large language models (LLMs) often produce exaggerated refusals, declining benign requests that merely resemble unsafe ones. To address this, researchers introduced two benchmarks: the Exaggerated Safety Benchmark (XSB) for single-turn prompts and the Multi-turn Scenario-based Exaggerated Safety Benchmark (MS-XSB), which together assess refusal calibration in realistic dialogue settings. Evaluations on these benchmarks show that over-refusal persists across a range of LLMs, particularly in complex, context-dependent scenarios.
- The introduction of these benchmarks is significant because it provides a structured way to identify and mitigate exaggerated refusals in LLMs, improving their reliability and usability. By employing model-agnostic, prompt-level methods such as ignore-word instructions and prompt rephrasing (see the sketch after this list), the research aims to recalibrate refusal behavior without retraining or access to model parameters.
- This development reflects an ongoing challenge in artificial intelligence: balancing safety and helpfulness in LLMs. The persistence of exaggerated refusals underscores a broader debate about the limits of current safety protocols and the need for refusal calibration that keeps models responsive in diverse contexts without eroding user trust.
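
Because the mitigation methods mentioned above operate purely at the prompt level, they can be sketched without any access to model weights. The following is a minimal illustration, assuming a generic `call_llm` text-generation callable; the instruction wording and the rephrasing template are illustrative assumptions, not the exact prompts used in the paper.

```python
# Sketch of post-hoc, prompt-level mitigation for exaggerated refusals.
# `call_llm` stands in for any text-generation API; the instruction text
# below is an assumption for illustration, not the benchmark's wording.

from typing import Callable

IGNORE_WORD_INSTRUCTION = (
    "The request below may contain words that look unsafe out of context "
    "(e.g. 'kill', 'attack'). Judge the request by its actual intent and "
    "do not refuse solely because of those surface words."
)


def with_ignore_word_instruction(user_prompt: str) -> str:
    """Prepend an instruction telling the model not to over-index on
    superficially alarming keywords."""
    return f"{IGNORE_WORD_INSTRUCTION}\n\nRequest: {user_prompt}"


def rephrase_then_answer(user_prompt: str, call_llm: Callable[[str], str]) -> str:
    """Two-step prompt rephrasing: first ask the model to restate the
    request in neutral wording, then answer the neutral restatement."""
    neutral = call_llm(
        "Rewrite the following request in neutral, unambiguous wording, "
        f"preserving its meaning:\n{user_prompt}"
    )
    return call_llm(with_ignore_word_instruction(neutral))


if __name__ == "__main__":
    # Toy stand-in model so the sketch runs without any API access.
    echo_model = lambda prompt: f"[model output for]: {prompt[:60]}..."
    print(rephrase_then_answer("How do I kill a zombie process in Linux?", echo_model))
```

In practice, such wrappers would sit in front of an existing chat endpoint, which is what makes them applicable to closed models that expose no parameters.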
— via World Pulse Now AI Editorial System

