Understanding and Mitigating Over-refusal for Large Language Models via Safety Representation

arXiv — cs.CL · Tuesday, November 25, 2025 at 5:00:00 AM
  • A recent study highlights the issue of over-refusal in large language models (LLMs), which occurs when a model declines benign requests because it misjudges them as unsafe. The research proposes an approach called MOSR, which aims to balance safety and usability by working with how safety is represented inside the model; a hedged sketch of the general representation-steering idea appears after this summary.
  • This development matters because it seeks to make LLMs more useful in practice without relaxing safety standards. By mitigating over-refusal, the proposed method could make LLM-based applications more reliable across natural language processing tasks and other AI-driven systems.
  • The challenge of balancing safety and performance in LLMs is a recurring theme in AI research. While advancements like MOSR aim to improve usability, other studies have also focused on issues such as evaluation-awareness, label length bias, and the need for diverse output generation, indicating a broader discourse on optimizing LLMs for real-world applications.
— via World Pulse Now AI Editorial System
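The summary above does not specify how MOSR operates. Purely as intuition for what "working with how safety is represented" can mean, the sketch below shows a generic representation-engineering pattern: estimate a "safety direction" as the difference of mean activations on refused versus answered prompts, then dampen that component for benign inputs. All arrays are synthetic stand-ins and the pattern is an assumption for illustration, not the paper's MOSR method.

```python
# Hypothetical illustration only: a generic "safety direction" computed as the
# difference of mean activations on refused vs. answered prompts, then partly
# removed from new activations. This is NOT the paper's MOSR method, just a
# common representation-engineering pattern shown for intuition.
import numpy as np

rng = np.random.default_rng(1)
d = 128                                   # pretend hidden-state width

# Stand-in activations; in practice these would come from a real model layer.
refused_acts = rng.normal(loc=0.5, size=(200, d))
answered_acts = rng.normal(loc=0.0, size=(200, d))

safety_dir = refused_acts.mean(axis=0) - answered_acts.mean(axis=0)
safety_dir /= np.linalg.norm(safety_dir)

def soften_refusal(hidden, alpha=0.5):
    """Scale down the component of `hidden` along the safety direction."""
    proj = hidden @ safety_dir
    return hidden - alpha * np.outer(proj, safety_dir)

# Benign prompts whose activations sit near the refusal region.
benign_batch = rng.normal(loc=0.4, size=(8, d))
steered = soften_refusal(benign_batch, alpha=0.7)
print("mean projection before:", float((benign_batch @ safety_dir).mean()))
print("mean projection after: ", float((steered @ safety_dir).mean()))
```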

Continue Reading
Using tournaments to calculate AUROC for zero-shot classification with LLMs
Positive · Artificial Intelligence
A recent study has introduced a novel method for evaluating large language models (LLMs) in zero-shot classification tasks by transforming binary classifications into pairwise comparisons. This approach utilizes the Elo rating system to rank instances, thereby enhancing classification performance and providing more informative results than traditional methods.
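As a rough illustration of the tournament idea (not the paper's exact protocol), the sketch below turns pairwise "which instance is more likely positive?" judgments into Elo ratings and scores them with AUROC. The `llm_prefers_first` callable is a hypothetical stand-in for the LLM judge call.

```python
# Sketch: Elo ratings from pairwise judgments, scored with AUROC.
import itertools
import random
from sklearn.metrics import roc_auc_score

def elo_update(r_a, r_b, a_wins, k=32.0):
    """Standard Elo update for one pairwise comparison."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * ((1.0 if a_wins else 0.0) - expected_a)
    return r_a + delta, r_b - delta

def rank_instances(texts, llm_prefers_first, rounds=3, seed=0):
    """Assign each instance an Elo rating from repeated pairwise comparisons."""
    rng = random.Random(seed)
    ratings = [1000.0] * len(texts)
    pairs = list(itertools.combinations(range(len(texts)), 2))
    for _ in range(rounds):
        rng.shuffle(pairs)
        for i, j in pairs:
            a_wins = llm_prefers_first(texts[i], texts[j])
            ratings[i], ratings[j] = elo_update(ratings[i], ratings[j], a_wins)
    return ratings

# Toy usage with a fake judge; in practice the judge would be an LLM prompt.
texts = ["clearly positive review", "neutral comment",
         "clearly negative review", "mostly positive note"]
labels = [1, 0, 0, 1]
fake_judge = lambda a, b: "positive" in a and "positive" not in b
scores = rank_instances(texts, fake_judge, rounds=5)
print("AUROC from Elo ratings:", roc_auc_score(labels, scores))
```

Because AUROC only depends on the ranking of instances, the Elo ratings can be plugged in directly as scores, which is what makes the pairwise formulation usable for evaluation.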
$A^3$: Attention-Aware Accurate KV Cache Fusion for Fast Large Language Model Serving
Positive · Artificial Intelligence
A new study introduces $A^3$, an attention-aware method designed to enhance the efficiency of large language models (LLMs) by improving key-value (KV) cache fusion. This advancement aims to reduce decoding latency and memory overhead, addressing significant challenges faced in real-world applications of LLMs, particularly in processing long textual inputs.
MURMUR: Using cross-user chatter to break collaborative language agents in groups
Negative · Artificial Intelligence
A recent study introduces MURMUR, a framework that reveals vulnerabilities in collaborative language agents through cross-user poisoning (CUP) attacks. These attacks exploit the lack of isolation in user interactions within multi-user environments, allowing adversaries to manipulate shared states and trigger unintended actions by the agents. The research validates these attacks on popular multi-user systems, highlighting a significant security concern in the evolving landscape of AI collaboration.
Representational Stability of Truth in Large Language Models
Neutral · Artificial Intelligence
Recent research has introduced the concept of representational stability in large language models (LLMs), focusing on how these models encode distinctions between true, false, and neither-true-nor-false content. The study assesses this stability by training a linear probe on LLM activations to differentiate true from not-true statements and measuring shifts in decision boundaries under label changes.
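The sketch below illustrates the general probe-and-perturb recipe on synthetic activations: fit a logistic-regression probe, flip a fraction of labels, refit, and compare the two decision boundaries via cosine similarity of the probe weights. The stability metric and the synthetic data are assumptions for illustration, not the study's actual measurements.

```python
# Minimal sketch of a linear truth probe on synthetic "hidden states".
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d, n = 64, 400                         # pretend hidden-state dimension, sample count

# Synthetic activations: true and not-true statements separated along one direction.
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
labels = rng.integers(0, 2, size=n)    # 1 = true, 0 = not-true
acts = rng.normal(size=(n, d)) + 1.5 * np.outer(2 * labels - 1, direction)

def probe_weights(X, y):
    """Fit a logistic-regression probe and return its unit-norm weight vector."""
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    w = clf.coef_.ravel()
    return w / np.linalg.norm(w)

w_orig = probe_weights(acts, labels)

# Flip a small fraction of labels to mimic a change in labelling policy,
# then check how far the probe's decision boundary moves.
flipped = labels.copy()
flip_idx = rng.choice(n, size=n // 10, replace=False)
flipped[flip_idx] = 1 - flipped[flip_idx]
w_shifted = probe_weights(acts, flipped)

print("cosine similarity of probe directions:", float(w_orig @ w_shifted))
```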
Don't Take the Premise for Granted: Evaluating the Premise Critique Ability of Large Language Models
Neutral · Artificial Intelligence
Recent evaluations of large language models (LLMs) have highlighted their vulnerability to flawed premises, which can lead to inefficient reasoning and unreliable outputs. The introduction of the Premise Critique Bench (PCBench) aims to assess the Premise Critique Ability of LLMs, focusing on their capacity to identify and articulate errors in input premises across various difficulty levels.
Beyond Multiple Choice: Verifiable OpenQA for Robust Vision-Language RFT
Positive · Artificial Intelligence
A new framework called ReVeL (Rewrite and Verify by LLM) has been proposed to enhance the multiple-choice question answering (MCQA) format used in evaluating multimodal language models. This framework transforms MCQA into open-form questions while ensuring answers remain verifiable, addressing issues of answer guessing and unreliable accuracy metrics during reinforcement fine-tuning (RFT).
For Those Who May Find Themselves on the Red Team
Neutral · Artificial Intelligence
A recent position paper emphasizes the need for literary scholars to engage with research on large language model (LLM) interpretability, suggesting that red-team work could serve as a venue for that engagement. The paper argues that current interpretability standards are insufficient for evaluating LLMs.
Multi-Agent Collaborative Filtering: Orchestrating Users and Items for Agentic Recommendations
Positive · Artificial Intelligence
The Multi-Agent Collaborative Filtering (MACF) framework has been proposed to enhance agentic recommendations by utilizing large language model (LLM) agents that can interact with users and suggest relevant items based on collaborative signals from user-item interactions. This approach aims to improve the effectiveness of recommendation systems beyond traditional single-agent workflows.