Dual Mechanisms of Value Expression: Intrinsic vs. Prompted Values in LLMs

arXiv — cs.CL · Wednesday, December 10, 2025 at 5:00:00 AM
  • Large language models (LLMs) exhibit two mechanisms of value expression: intrinsic, driven by values learned during training, and prompted, driven by explicit instructions. This study analyzes both mechanisms at a mechanistic level, revealing shared as well as unique components in their operation (an illustrative sketch of such a comparison follows this list).
  • Understanding these mechanisms is crucial for applications in value alignment and persona steering, as it informs how LLMs can be effectively guided to express desired values in various contexts, enhancing their utility and safety.
  • The exploration of value expression in LLMs intersects with ongoing discussions about ethical implications, evaluation awareness, and the challenges of steering models towards specific human values, highlighting the complexity of aligning AI behavior with societal norms.
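
As one way to picture what shared and unique components could mean in practice, here is a minimal, entirely hypothetical sketch: it builds synthetic stand-ins for residual-stream activations, derives a "prompted" and an "intrinsic" steering direction as mean activation differences, and compares them with cosine similarity. The data, dimensions, and analysis are illustrative assumptions, not the paper's actual pipeline.

```python
# Hypothetical sketch: comparing a "prompted" steering direction with an
# "intrinsic" value direction via cosine similarity. Activations here are
# synthetic stand-ins for residual-stream states.
import numpy as np

rng = np.random.default_rng(0)
d_model = 256

shared = rng.normal(size=d_model)       # component both mechanisms might share
prompt_only = rng.normal(size=d_model)  # component unique to prompting

# Synthetic hidden states: baseline responses vs. responses generated under
# a value-steering system prompt (prompted expression)...
base_acts = rng.normal(size=(200, d_model))
prompted_acts = base_acts + 1.5 * shared + 1.0 * prompt_only

# ...and baseline vs. responses where the value shows up unprompted
# (intrinsic expression).
intrinsic_acts = base_acts + 1.5 * shared + rng.normal(scale=0.1, size=(200, d_model))

def direction(a, b):
    """Mean activation difference, normalised to unit length."""
    diff = a.mean(axis=0) - b.mean(axis=0)
    return diff / np.linalg.norm(diff)

prompted_dir = direction(prompted_acts, base_acts)
intrinsic_dir = direction(intrinsic_acts, base_acts)

# High cosine similarity suggests shared structure; the residual after
# projecting one direction out points at mechanism-specific parts.
cos = float(prompted_dir @ intrinsic_dir)
unique_prompted = prompted_dir - cos * intrinsic_dir
print(f"cosine(prompted, intrinsic) = {cos:.3f}")
print(f"norm of prompted-only residual = {np.linalg.norm(unique_prompted):.3f}")
```

If the two directions overlap strongly, that overlap would be one candidate for the kind of shared component the summary refers to; whatever survives the projection would correspond to mechanism-specific structure.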
— via World Pulse Now AI Editorial System


Continue Reading
Representational Stability of Truth in Large Language Models
Neutral · Artificial Intelligence
Large language models (LLMs) are increasingly utilized for factual inquiries, yet their internal representations of truth remain inadequately understood. A recent study introduces the concept of representational stability, assessing how robustly LLMs differentiate between true, false, and ambiguous statements through controlled experiments involving linear probes and model activations.
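
The summary mentions linear probes over model activations; the snippet below is an illustrative sketch in that spirit, with synthetic activations standing in for real hidden states and a deliberately naive stability check (label agreement under small perturbations). The paper's actual probing setup and stability metric may differ.

```python
# Illustrative linear probe over synthetic "activations" for true / false /
# ambiguous statements, plus a naive stability-under-noise check.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
d_model, n_per_class = 128, 300

# Stand-in activations: each class gets a mean offset along its own direction.
means = rng.normal(size=(3, d_model))
X = np.vstack([rng.normal(size=(n_per_class, d_model)) + means[c] for c in range(3)])
y = np.repeat([0, 1, 2], n_per_class)   # 0 = true, 1 = false, 2 = ambiguous

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))

# One naive notion of "representational stability": how often the probe's
# label survives a small perturbation of the activation.
noise = rng.normal(scale=0.5, size=X_te.shape)
stable = (probe.predict(X_te) == probe.predict(X_te + noise)).mean()
print("prediction stability under noise:", round(float(stable), 3))
```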
LLMs Can't Handle Peer Pressure: Crumbling under Multi-Agent Social Interactions
Neutral · Artificial Intelligence
Large language models (LLMs) are increasingly being integrated into multi-agent systems (MAS), where peer interactions significantly influence decision-making. A recent study introduces KAIROS, a benchmark designed to simulate collaborative quiz-style interactions among peer agents, allowing for a detailed analysis of how rapport and peer behaviors affect LLMs' decision-making processes.
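
To make the quiz-style setup concrete, here is a toy simulation of one social round: an agent answers, sees its peers' answers, and may defer to the majority, and we track how often it flips and what that does to accuracy. The agent policies and parameters are invented placeholders; the real benchmark drives actual LLMs through these interactions.

```python
# Toy KAIROS-style round: initial answer, exposure to peer answers, optional
# revision toward the majority. Policies below are stand-ins, not real models.
import random
from collections import Counter

random.seed(0)

def agent_initial_answer(correct, accuracy=0.8):
    """Stand-in for a model's first answer to a binary quiz question."""
    return correct if random.random() < accuracy else 1 - correct

def agent_revise(own, peer_answers, conformity=0.6):
    """Stand-in for a model that sometimes defers to the peer majority."""
    majority, _ = Counter(peer_answers).most_common(1)[0]
    if majority != own and random.random() < conformity:
        return majority
    return own

flips, correct_after = 0, 0
n_rounds, n_peers = 1000, 4
for _ in range(n_rounds):
    truth = random.randint(0, 1)
    own = agent_initial_answer(truth)
    # Peers are mostly wrong here, to simulate peer pressure toward errors.
    peers = [agent_initial_answer(truth, accuracy=0.3) for _ in range(n_peers)]
    revised = agent_revise(own, peers)
    flips += int(revised != own)
    correct_after += int(revised == truth)

print(f"flip rate under peer pressure: {flips / n_rounds:.2f}")
print(f"accuracy after social round:  {correct_after / n_rounds:.2f}")
```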
Why Chain of Thought Fails in Clinical Text Understanding
Neutral · Artificial Intelligence
A systematic study has revealed that chain-of-thought (CoT) prompting, which is often used to enhance reasoning in large language models (LLMs), fails to improve performance in clinical text understanding. The research assessed 95 advanced LLMs across 87 real-world clinical tasks, finding that 86.3% of models experienced performance degradation in CoT settings, particularly with electronic health records that are lengthy and fragmented.
Going All-In on LLM Accuracy: Fake Prediction Markets, Real Confidence Signals
Neutral · Artificial Intelligence
A recent pilot study explored the effectiveness of framing evaluation tasks for large language models (LLMs) as a betting game, utilizing a fictional currency called LLMCoin. The study involved generating 100 math and logic questions, with models predicting the accuracy of baseline responses under two conditions: a control scenario and an incentive-based scenario with wagers. Results indicated that the incentive condition yielded a modest increase in prediction accuracy.
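
A sketch of the wager bookkeeping such a setup might use is below. The even-odds settlement rule and starting balance are assumptions for illustration; only the fictional LLMCoin currency and the idea of staking on whether a baseline answer is correct come from the summary.

```python
# Minimal, assumed wager bookkeeping for a betting-game evaluation: a model
# predicts whether a baseline answer is correct and stakes LLMCoin on it.
def settle_wager(balance, stake, predicted_correct, actually_correct):
    """Even-odds settlement: win the stake if the prediction was right, lose it otherwise."""
    if stake > balance:
        raise ValueError("cannot stake more LLMCoin than the current balance")
    return balance + stake if predicted_correct == actually_correct else balance - stake

# Toy run over a handful of judgments: (question_id, prediction, ground truth, stake).
judgments = [
    ("q1", True,  True,  10),
    ("q2", True,  False, 25),
    ("q3", False, False,  5),
]

balance = 100  # starting LLMCoin balance (fictional currency from the study)
for qid, pred, truth, stake in judgments:
    balance = settle_wager(balance, stake, pred, truth)
    print(f"{qid}: balance -> {balance} LLMCoin")
```

The stake size doubles as a confidence signal: a well-calibrated model should wager more on predictions it is more certain about, which is what the incentive condition is designed to elicit.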
RL-MTJail: Reinforcement Learning for Automated Black-Box Multi-Turn Jailbreaking of Large Language Models
Neutral · Artificial Intelligence
A recent study titled 'RL-MTJail' explores the vulnerabilities of large language models (LLMs) to jailbreak attacks, focusing on black-box multi-turn jailbreaks. The research proposes a reinforcement learning framework to optimize the harmfulness of outputs through a series of prompt-output interactions, addressing the limitations of existing single-turn optimization methods.
LUNE: Efficient LLM Unlearning via LoRA Fine-Tuning with Negative Examples
Positive · Artificial Intelligence
A new framework called LUNE has been introduced, enabling efficient unlearning in large language models (LLMs) through LoRA fine-tuning with negative examples. This method allows for targeted suppression of specific knowledge without the need for extensive computational resources, addressing challenges related to privacy and bias mitigation.
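
The following is a hedged sketch of what LoRA fine-tuning with negative examples can look like: a low-rank adapter is trained to raise the loss on a "forget" set while a standard term protects a "retain" set. The toy model, synthetic data, and loss weighting are assumptions for illustration, not the LUNE recipe itself.

```python
# Hedged sketch of LoRA-style unlearning with negative examples: only the
# low-rank adapter trains; the objective ascends loss on the forget set and
# descends it on the retain set.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update (scale * B @ A)."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Toy "language model": frozen embedding -> LoRA-adapted projection to vocab logits.
vocab, dim = 50, 32
embed = nn.Embedding(vocab, dim)
embed.weight.requires_grad_(False)
head = LoRALinear(nn.Linear(dim, vocab))

def logits(tokens):
    return head(embed(tokens))

# Synthetic next-token pairs: a small forget set and a larger retain set.
forget_x = torch.randint(0, vocab, (64,));  forget_y = torch.randint(0, vocab, (64,))
retain_x = torch.randint(0, vocab, (256,)); retain_y = torch.randint(0, vocab, (256,))

opt = torch.optim.Adam([p for p in head.parameters() if p.requires_grad], lr=1e-2)
for step in range(200):
    opt.zero_grad()
    # Negative-example term: push the forget-set loss up...
    forget_loss = F.cross_entropy(logits(forget_x), forget_y)
    # ...while a standard term anchors behaviour on the retain set.
    retain_loss = F.cross_entropy(logits(retain_x), retain_y)
    loss = retain_loss - 0.5 * forget_loss   # the 0.5 weighting is an arbitrary choice here
    loss.backward()
    opt.step()

print("forget-set loss (higher = more forgotten):", float(forget_loss))
print("retain-set loss:                          ", float(retain_loss))
```

Because only the adapter parameters update, the base weights stay untouched, which is what keeps this kind of targeted suppression cheap compared with full fine-tuning.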
Beyond the Singular: Revealing the Value of Multiple Generations in Benchmark Evaluation
Neutral · Artificial Intelligence
A recent study highlights the importance of incorporating multiple generations in the evaluation of large language models (LLMs) to enhance benchmark accuracy. The proposed hierarchical statistical model addresses the randomness inherent in LLMs, which traditional evaluation methods often overlook. This approach aims to provide a more reliable assessment of LLM capabilities by reducing variance in benchmark score estimates.
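
The variance-reduction intuition behind multiple generations can be shown with a small simulation: each question has a latent pass probability (question-level variation), each generation is a Bernoulli draw (generation-level randomness), and averaging more generations per question tightens the benchmark-score estimate. The paper's hierarchical model is presumably richer; this only illustrates the basic effect.

```python
# Simulated benchmark scoring with k generations per question; the spread of
# the score estimate shrinks as k grows.
import numpy as np

rng = np.random.default_rng(7)
n_questions, n_trials = 200, 2000

# Question-level difficulty: latent per-question pass probabilities.
p = rng.beta(2.0, 2.0, size=n_questions)
true_score = p.mean()

def benchmark_estimate(k):
    """Score the benchmark using k generations per question, averaged per question."""
    draws = rng.binomial(k, p, size=(n_trials, n_questions)) / k
    return draws.mean(axis=1)   # one benchmark-score estimate per simulated run

for k in (1, 5, 20):
    est = benchmark_estimate(k)
    print(f"k={k:2d} generations: mean={est.mean():.3f}  std of estimate={est.std():.4f}")
print(f"true underlying score: {true_score:.3f}")
```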
Less Is More for Multi-Step Logical Reasoning of LLM Generalisation Under Rule Removal, Paraphrasing, and Compression
Neutral · Artificial Intelligence
Recent research has introduced a controlled evaluation framework to assess the generalization capabilities of large language models (LLMs) like BERT, Qwen2, and LLaMA under various logical perturbations, including rule deletion and contradictory evidence. The findings indicate that these models maintain high accuracy despite structural changes in reasoning tasks.
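
As a concrete picture of what a rule-deletion perturbation tests, the toy forward-chaining engine below derives conclusions from a set of facts and rules, then repeats the derivation with one rule removed; a model that genuinely tracks the rules should change its answers for exactly the conclusions that no longer follow. The facts and rules here are invented for illustration and are not taken from the study's datasets.

```python
# Toy rule-deletion perturbation: forward-chain over Horn-style rules, then
# delete one rule and see which conclusions should flip.
def forward_chain(facts, rules):
    """Derive all facts reachable from `facts` using rules of the form (premises, conclusion)."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and all(p in known for p in premises):
                known.add(conclusion)
                changed = True
    return known

facts = {"rainy", "has_umbrella"}
rules = [
    (("rainy",), "wet_ground"),
    (("wet_ground",), "slippery"),
    (("rainy", "has_umbrella"), "stays_dry"),
]

full = forward_chain(facts, rules)
perturbed = forward_chain(facts, rules[1:])   # delete the first rule

print("entailed with all rules:     ", sorted(full))
print("entailed after deletion:     ", sorted(perturbed))
print("conclusions that should flip:", sorted(full - perturbed))
```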