SALT: Steering Activations towards Leakage-free Thinking in Chain of Thought
SALT (Steering Activations towards Leakage-free Thinking) has been introduced to address a pressing privacy issue in Large Language Models (LLMs). While these models are increasingly used as personal assistants, they have been found to leak sensitive information through their internal reasoning traces, violating users' privacy expectations. SALT mitigates this leakage by injecting targeted steering vectors into the model's hidden states, reducing the exposure of sensitive details during chain-of-thought reasoning. Experimental results show notable reductions in privacy leakage as measured by CPL: an 18.2% decrease on QwQ-32B, 17.9% on Llama-3.1-8B, and 31.2% on DeepSeek, all while maintaining comparable task performance. Striking this balance between privacy and utility matters as LLMs continue to integrate into daily life, where safeguarding user data against inadvertent exposure is essential.
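To make the core idea concrete, here is a minimal sketch of activation steering in PyTorch: a fixed steering vector is added to one layer's hidden states via a forward hook, shifting the representation away from a "leakage" direction. Everything in this sketch (the toy layer, the random stand-in vector, the strength ALPHA) is a hypothetical illustration of the general technique, not SALT's actual implementation or released code.

```python
# Minimal sketch of activation steering via a forward hook.
# All names are hypothetical; this is NOT the SALT authors' code.
import torch
import torch.nn as nn

HIDDEN = 64   # toy hidden size
ALPHA = 4.0   # steering strength (a tunable hyperparameter)

# A steering vector would normally be derived from model activations;
# here a random unit vector stands in purely for illustration.
steer = torch.randn(HIDDEN)
steer = steer / steer.norm()

class ToyBlock(nn.Module):
    """Stand-in for one transformer layer."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(HIDDEN, HIDDEN)

    def forward(self, x):
        return torch.relu(self.proj(x))

block = ToyBlock()

def steering_hook(module, inputs, output):
    # Returning a tensor from a forward hook replaces the layer's output:
    # shift every token's hidden state away from the leakage direction.
    return output - ALPHA * steer

handle = block.register_forward_hook(steering_hook)
x = torch.randn(2, 10, HIDDEN)        # (batch, seq_len, hidden)
steered = block(x)                    # forward pass with steering applied
handle.remove()
plain = block(x)                      # forward pass without steering

# Average displacement along the steering direction (approx. -ALPHA here).
print("mean shift along steering dir:",
      ((steered - plain) @ steer).mean().item())
```

In practice, a steering vector of this kind is often derived from the model's own activations, for example as the difference between mean hidden states on leaky versus leakage-free reasoning traces, and is injected only at chosen layers; those specifics are assumptions here rather than details reported in the summary above.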
— via World Pulse Now AI Editorial System
