PropensityBench: Evaluating Latent Safety Risks in Large Language Models via an Agentic Approach
- Recent advancements in Large Language Models (LLMs) have raised concerns that these models may acquire and misuse dangerous capabilities, motivating the introduction of PropensityBench, a benchmark framework designed to evaluate their latent safety risks. The framework assesses how likely models are to take harmful actions when equipped with simulated dangerous capabilities, across 5,874 scenarios.
- PropensityBench addresses a critical blind spot in current safety evaluations, which primarily measure a model's capabilities rather than its propensity to misuse them. By focusing on the likelihood of harmful actions, the framework aims to deepen the understanding of safety risks in LLMs and thereby support more effective risk-management strategies.
- The benchmark's introduction aligns with ongoing discussions about the ethical implications and vulnerabilities of LLMs, particularly in high-stakes applications. As researchers explore mitigation methods such as behavior editing and vulnerability detection, the need for comprehensive safety evaluations becomes increasingly apparent, reflecting a broader trend in AI research toward balancing LLM capabilities with safe and ethical deployment.
— via World Pulse Now AI Editorial System
