Analysing Moral Bias in Finetuned LLMs through Mechanistic Interpretability

arXiv — cs.CL · Monday, December 8, 2025 at 5:00:00 AM
  • A recent study investigated the emergence of moral bias, specifically the Knobe effect (the asymmetry by which harmful side effects are judged more intentional than helpful ones), in finetuned large language models (LLMs). The research showed that this bias is not only acquired during finetuning but is also localized in specific layers of the models. Using a Layer-Patching analysis, the researchers demonstrated that targeted, layer-level interventions can mitigate the bias without retraining the full model (a minimal sketch of this kind of layer patching follows the summary).
  • This is significant because it provides a way to interpret and address social biases in LLMs, which are increasingly deployed across applications. The ability to localize and mitigate biases could improve the reliability and ethical deployment of these models in real-world settings, helping them align more closely with human values.
  • The findings contribute to ongoing discussions about the ethical implications of LLMs, particularly in their role as evaluators and decision-makers. As LLMs are integrated into systems requiring human-like judgment, understanding and correcting biases becomes crucial. This research aligns with broader efforts to improve the interpretability and fairness of AI systems, addressing concerns about their impact on society.
— via World Pulse Now AI Editorial System
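
The paper's finetuned models, prompts, and intervention details are not reproduced in this summary, but the general activation-patching idea behind a layer-level analysis can be illustrated in a few lines. The sketch below is a minimal example, assuming a Hugging Face GPT-2 model as a stand-in and Knobe-style "harm"/"help" vignettes as illustrative prompts: it caches one layer's hidden states from a run on the first prompt and writes them into the same layer during a run on the contrasting prompt.

```python
# Minimal layer-patching sketch. Assumptions: a Hugging Face GPT-2 stand-in
# (the study's finetuned models are not specified here), layer 6 as an example,
# and Knobe-style "harm"/"help" vignettes as illustrative prompts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def capture_layer(prompt, layer_idx):
    """Run the model on `prompt` and cache the hidden states output by `layer_idx`."""
    cache = {}
    def hook(module, inputs, output):
        cache["h"] = output[0].detach()  # GPT2Block returns (hidden_states, ...)
    handle = model.transformer.h[layer_idx].register_forward_hook(hook)
    with torch.no_grad():
        model(**tok(prompt, return_tensors="pt"))
    handle.remove()
    return cache["h"]

def run_with_patch(prompt, layer_idx, patched_hidden):
    """Run on `prompt`, overwriting layer `layer_idx`'s hidden states with `patched_hidden`."""
    def hook(module, inputs, output):
        h = output[0].clone()
        n = min(h.shape[1], patched_hidden.shape[1])
        h[:, :n, :] = patched_hidden[:, :n, :]   # patch the overlapping token positions
        return (h,) + output[1:]
    handle = model.transformer.h[layer_idx].register_forward_hook(hook)
    with torch.no_grad():
        out = model(**tok(prompt, return_tensors="pt"))
    handle.remove()
    return out.logits

# Patch activations from a "harm" vignette into a "help" vignette at one layer,
# then compare the patched logits against an unpatched run to gauge that layer's role.
source_h = capture_layer("The chairman harmed the environment on purpose.", layer_idx=6)
patched_logits = run_with_patch("The chairman helped the environment on purpose.", 6, source_h)
```

Sweeping this procedure over layers (and in both patching directions) is what lets an effect be attributed to specific layers; the study's actual mitigation intervention is not reproduced here.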


Continue Reading
CORE: A Conceptual Reasoning Layer for Large Language Models
PositiveArtificial Intelligence
A new conceptual reasoning layer named CORE has been proposed to enhance the performance of large language models (LLMs) in multi-turn interactions. CORE aims to address the limitations of existing models, which struggle to maintain user intent and task state across conversations, leading to inconsistencies and prompt drift. By utilizing a compact semantic state and cognitive operators, CORE reduces the need for extensive token history, resulting in a significant decrease in cumulative prompt tokens.
The Vector Grounding Problem
NeutralArtificial Intelligence
Large language models (LLMs) face a modern variant of the symbol grounding problem, questioning whether their outputs can represent extra-linguistic reality without human interpretation. The research emphasizes the necessity of referential grounding, which connects internal states to the world through causal relations and historical selection.
Representational Stability of Truth in Large Language Models
NeutralArtificial Intelligence
Large language models (LLMs) are increasingly consulted for factual questions, yet their internal representations of truth remain poorly understood. A recent study introduces the concept of representational stability, assessing how robustly LLMs distinguish true, false, and ambiguous statements in controlled experiments that train linear probes on model activations.
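
As a rough illustration of that probing setup, the sketch below fits a logistic-regression probe; the synthetic random "activations" and binary labels are stand-ins for hidden states extracted from one layer of an LLM and for curated true/false statements, which are not available from this summary.

```python
# Linear-probe sketch. Assumptions: synthetic random "activations" stand in for
# hidden states extracted from one layer of an LLM, and binary labels stand in
# for curated true/false statements; numbers are illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 768))    # placeholder for per-statement hidden states
y = rng.integers(0, 2, size=600)   # placeholder labels: 1 = true, 0 = false

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("held-out probe accuracy:", probe.score(X_te, y_te))  # ~chance on random data
```

On real activations, how well such a probe separates the classes, and how that separation holds up across conditions, is the kind of behaviour representational stability is meant to capture.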
Dual Mechanisms of Value Expression: Intrinsic vs. Prompted Values in LLMs
NeutralArtificial Intelligence
Large language models (LLMs) exhibit two mechanisms of value expression: intrinsic, based on learned values, and prompted, based on explicit prompts. This study analyzes these mechanisms at a mechanistic level, revealing both shared and unique components in their operation.
LLMs Can't Handle Peer Pressure: Crumbling under Multi-Agent Social Interactions
NeutralArtificial Intelligence
Large language models (LLMs) are increasingly being integrated into multi-agent systems (MAS), where peer interactions significantly influence decision-making. A recent study introduces KAIROS, a benchmark designed to simulate collaborative quiz-style interactions among peer agents, allowing for a detailed analysis of how rapport and peer behaviors affect LLMs' decision-making processes.
RL-MTJail: Reinforcement Learning for Automated Black-Box Multi-Turn Jailbreaking of Large Language Models
NeutralArtificial Intelligence
A recent study titled 'RL-MTJail' explores the vulnerabilities of large language models (LLMs) to jailbreak attacks, focusing on black-box multi-turn jailbreaks. The research proposes a reinforcement learning framework to optimize the harmfulness of outputs through a series of prompt-output interactions, addressing the limitations of existing single-turn optimization methods.
LUNE: Efficient LLM Unlearning via LoRA Fine-Tuning with Negative Examples
PositiveArtificial Intelligence
A new framework called LUNE has been introduced, enabling efficient unlearning in large language models (LLMs) through LoRA fine-tuning with negative examples. This method allows for targeted suppression of specific knowledge without the need for extensive computational resources, addressing challenges related to privacy and bias mitigation.
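
LUNE's exact objective and training data are not detailed in this summary, so the following is only a hedged sketch of the general "LoRA adapters plus negative examples" recipe: attach low-rank adapters to a small causal LM and run gradient ascent on the language-modeling loss of content to be suppressed, leaving the base weights untouched. The GPT-2 stand-in, the target modules, and the negated-loss objective are illustrative assumptions.

```python
# Hedged sketch of the "LoRA adapters + negative examples" recipe. The negated
# language-modeling loss (gradient ascent on content to suppress), the GPT-2
# stand-in, and the target modules are illustrative assumptions, not necessarily
# LUNE's actual objective or setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
lora_cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["c_attn"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)           # only the LoRA adapters are trainable

negative_examples = ["Example passage the model should no longer reproduce."]
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)

model.train()
for text in negative_examples:
    batch = tok(text, return_tensors="pt")
    out = model(**batch, labels=batch["input_ids"])
    loss = -out.loss                              # ascend the LM loss on unwanted content
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

A practical version would pair this with a standard loss on a retain set so general capabilities are preserved; that term is omitted here for brevity.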
Beyond the Singular: Revealing the Value of Multiple Generations in Benchmark Evaluation
NeutralArtificial Intelligence
A recent study highlights the importance of incorporating multiple generations in the evaluation of large language models (LLMs) to enhance benchmark accuracy. The proposed hierarchical statistical model addresses the randomness inherent in LLMs, which traditional evaluation methods often overlook. This approach aims to provide a more reliable assessment of LLM capabilities by reducing variance in benchmark score estimates.
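
The paper's hierarchical statistical model is not spelled out in this summary; the toy simulation below only illustrates the underlying point, that scoring each item with several sampled generations shrinks the variance of the aggregate benchmark estimate compared with single-sample scoring. The Beta-distributed per-item success probabilities, item count, and simple per-item averaging are illustrative assumptions.

```python
# Toy simulation of the variance argument. The Beta-distributed per-item success
# probabilities, item count, and simple per-item averaging are illustrative
# assumptions, not the paper's hierarchical model.
import numpy as np

rng = np.random.default_rng(0)
n_items, k, reps = 200, 8, 1000
p_item = rng.beta(2, 2, size=n_items)   # latent per-item success probabilities

def benchmark_score(gens_per_item):
    """Mean accuracy over items, each scored by `gens_per_item` sampled generations."""
    per_item_acc = rng.binomial(gens_per_item, p_item) / gens_per_item
    return per_item_acc.mean()

sd_single = np.std([benchmark_score(1) for _ in range(reps)])
sd_multi = np.std([benchmark_score(k) for _ in range(reps)])
print(f"score SD with 1 generation per item: {sd_single:.4f}")
print(f"score SD with {k} generations per item: {sd_multi:.4f}")   # noticeably smaller
```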