arXiv:2601.08489v1 Announce Type: new 
Abstract: Safety-aligned language models systematically refuse harmful requests. While activation steering can modulate refusal, ablating the raw "refusal vector" calculated from contrastive harmful and harmless prompts often causes collateral damage and distribution drift. We argue this degradation occurs because the raw vector is polysemantic, entangling the refusal signal with core capability circuits and linguistic style.
  We introduce Surgical Refusal Ablation (SRA) to distill these steering directions. SRA constructs a registry of independent Concept Atoms representing protected capabilities and stylistic confounds, then uses ridge-regularized spectral residualization to orthogonalize the refusal vector against these directions. This yields a clean refusal direction that targets refusal-relevant structure while minimizing disruption to the model's semantic geometry.
  Across five models (Qwen3-VL and Ministral series), SRA achieves deep refusal reduction (0-2%) with negligible perplexity impact on Wikitext-2 (mean delta PPL approx. 0.02) and minimal distribution drift. Notably, standard ablation on Qwen3-VL-4B induces severe drift (first-token KL = 2.088), whereas SRA maintains the original distribution (KL = 0.044) while achieving the same 0% refusal rate. Using teacher-forced perplexity on GSM8K and MBPP as a high-resolution capability proxy, we show SRA preserves math and code distributions. These results suggest that common "model damage" is often "Ghost Noise," defined as the spectral bleeding of the dirty refusal direction into capability subspaces.
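
The residualization step described in the abstract can be illustrated with a small NumPy sketch. It is an assumption-laden illustration, not the paper's implementation: the abstract does not specify how the "spectral" part is computed, so this uses plain ridge regression of the raw refusal vector onto the concept atoms and keeps the residual. The function names (`ridge_residualize`, `ablate`), the penalty `lam`, and the synthetic vectors are all hypothetical.

```python
import numpy as np


def ridge_residualize(refusal, atoms, lam=1e-2):
    """Remove from `refusal` the part that ridge regression explains with `atoms`.

    refusal: (d,) raw refusal direction.
    atoms:   (k, d) matrix, one concept direction per row (hypothetical registry).
    lam:     ridge penalty; as lam -> 0 this approaches exact orthogonal projection.
    """
    gram = atoms @ atoms.T  # (k, k) Gram matrix of the concept atoms
    coef = np.linalg.solve(gram + lam * np.eye(len(atoms)), atoms @ refusal)
    residual = refusal - atoms.T @ coef  # what the atoms cannot explain
    return residual / np.linalg.norm(residual)


def ablate(hidden, direction):
    """Project the unit vector `direction` out of every row of `hidden` (n, d)."""
    return hidden - np.outer(hidden @ direction, direction)


# Synthetic data: a "pure" signal contaminated by two of four concept atoms.
rng = np.random.default_rng(0)
d, k = 64, 4
atoms = rng.normal(size=(k, d))
atoms /= np.linalg.norm(atoms, axis=1, keepdims=True)
pure = rng.normal(size=d)
raw = pure + 2.0 * atoms[0] + 1.0 * atoms[2]

clean = ridge_residualize(raw, atoms, lam=1e-6)
raw_unit = raw / np.linalg.norm(raw)

# Largest absolute overlap with any concept atom, before and after cleaning.
overlap_raw = np.abs(atoms @ raw_unit).max()
overlap_clean = np.abs(atoms @ clean).max()

# Ablating the cleaned direction leaves no component along it.
hidden = rng.normal(size=(8, d))
ablated = ablate(hidden, clean)
```

In this toy setting a larger `lam` leaves more of the atom-aligned component in the cleaned direction, while a very small `lam` behaves like a hard orthogonal projection. How the paper chooses the penalty is not stated in the abstract.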

Surgical Refusal Ablation (SRA) aims to remove refusal behavior from safety-aligned language models while minimizing the collateral damage and distribution drift caused by ablating the raw refusal vector. It does so by building a registry of independent Concept Atoms and applying ridge-regularized spectral residualization to produce a clean refusal direction.

Surgical Refusal Ablation: Disentangling Safety from Intelligence via Concept-Guided Spectral Cleaning

arXiv:2601.08343v1 Announce Type: cross 
Abstract: Multi-agent LLM systems routinely generate multiple candidate responses that are aggregated by an LLM judge. To reduce the dominant prefill cost in such pipelines, recent work advocates KV cache reuse across partially shared contexts and reports substantial speedups for generation agents. In this work, we show that these efficiency gains do not transfer uniformly to judge-centric inference. Across GSM8K, MMLU, and HumanEval, we find that reuse strategies that are effective for execution agents can severely perturb judge behavior: end-task accuracy may appear stable, yet the judge's selection becomes highly inconsistent with dense prefill. We quantify this risk using Judge Consistency Rate (JCR) and provide diagnostics showing that reuse systematically weakens cross-candidate attention, especially for later candidate blocks. Our ablation further demonstrates that explicit cross-candidate interaction is crucial for preserving dense-prefill decisions. Overall, our results identify a previously overlooked failure mode of KV cache reuse and highlight judge-centric inference as a distinct regime that demands dedicated, risk-aware system design.
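
The gap between stable accuracy and inconsistent judge selections can be made concrete with a short sketch. The abstract names Judge Consistency Rate (JCR) but does not give its formula, so the definition below (the fraction of examples where the judge picks the same candidate under cache reuse as under dense prefill) is an assumption, as are the function names and the toy data.

```python
def judge_consistency_rate(dense_choices, reuse_choices):
    """Fraction of examples where the reuse-based judge picks the same
    candidate index as the dense-prefill judge (assumed definition of JCR)."""
    if len(dense_choices) != len(reuse_choices):
        raise ValueError("choice lists must have the same length")
    if not dense_choices:
        raise ValueError("choice lists must be non-empty")
    matches = sum(d == r for d, r in zip(dense_choices, reuse_choices))
    return matches / len(dense_choices)


def accuracy(choices, correct_sets):
    """Fraction of examples where the selected candidate is a correct one."""
    hits = sum(c in ok for c, ok in zip(choices, correct_sets))
    return hits / len(choices)


# Toy data: four examples, each with several candidates, some of them correct.
correct_sets = [{0, 1}, {1}, {0, 2}, {0, 2}]
dense_choices = [0, 1, 2, 0]  # judge selections with dense prefill
reuse_choices = [1, 1, 0, 2]  # judge selections with KV cache reuse

dense_acc = accuracy(dense_choices, correct_sets)
reuse_acc = accuracy(reuse_choices, correct_sets)
jcr = judge_consistency_rate(dense_choices, reuse_choices)
```

In this toy data both judges always select a correct candidate, so accuracy is identical, yet they agree on only one of four selections. This is the kind of divergence that an accuracy-only evaluation would miss.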

The paper shows that while KV cache reuse can improve efficiency in multi-agent large language model (LLM) systems, it can perturb LLM judges: their selections become inconsistent with those made under dense prefill, even when end-task accuracy appears stable.

