Beyond Tokens in Language Models: Interpreting Activations through Text Genre Chunks

arXiv — cs.LG · Friday, November 21, 2025 at 5:00:00 AM
  • A new predictive framework has been introduced to interpret activations in Large Language Models (LLMs) by analyzing text genres, achieving high accuracy with Mistral models (a minimal sketch of the idea follows this summary).
  • This development is significant because improved interpretability is crucial for the safe deployment and effective use of LLMs across applications.
— via World Pulse Now AI Editorial System
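
The sketch below illustrates the core idea in its simplest form: probe an LLM's hidden activations with a linear classifier that predicts text genre. The checkpoint name, layer index, pooling choice, and toy texts/labels are assumptions for illustration, not the paper's actual setup.

```python
# A minimal sketch of genre probing: predict a text's genre from an LLM's
# hidden activations. Model, layer, and the tiny dataset are illustrative.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "mistralai/Mistral-7B-v0.1"  # assumed checkpoint
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

texts = ["Once upon a time ...", "def fib(n): ...", "Dear Sir or Madam, ..."]
genres = ["fiction", "code", "letter"]  # hypothetical genre labels

feats = []
with torch.no_grad():
    for t in texts:
        out = model(**tok(t, return_tensors="pt"))
        # Mean-pool one mid-depth layer's activations into a single vector.
        h = out.hidden_states[16].mean(dim=1).squeeze(0)
        feats.append(h.float().numpy())

# A linear probe: if genre is linearly decodable from the activations,
# held-out accuracy of this classifier will be high.
probe = LogisticRegression(max_iter=1000).fit(feats, genres)
print(probe.predict(feats))
```

If a simple linear probe like this reaches high accuracy, it suggests genre is an explicitly represented, human-interpretable feature of the activation space rather than something entangled across many directions.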


Continue Reading
Balancing Safety and Helpfulness in Healthcare AI Assistants through Iterative Preference Alignment
Positive · Artificial Intelligence
A new framework for aligning healthcare AI assistants has been introduced, focusing on balancing safety and helpfulness through iterative preference alignment. This approach utilizes Kahneman-Tversky Optimization and Direct Preference Optimization to refine large language models (LLMs) against specific safety signals, resulting in significant improvements in harmful query detection metrics.
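
For context, the Direct Preference Optimization objective this summary refers to can be stated compactly; the sketch below is the standard DPO loss, not the paper's full framework, and the toy log-probabilities and beta value are illustrative. KTO would replace the logistic term with a Kahneman-Tversky-style value function over individual responses.

```python
# Standard DPO loss: push the policy to prefer the chosen (safe/helpful)
# response over the rejected one, relative to a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Log-probs are assumed to be summed over each response's tokens."""
    pi_logratio = policy_chosen_logp - policy_rejected_logp
    ref_logratio = ref_chosen_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (pi_logratio - ref_logratio)).mean()

# Toy tensors standing in for per-example sequence log-probabilities.
loss = dpo_loss(torch.tensor([-4.0]), torch.tensor([-6.0]),
                torch.tensor([-5.0]), torch.tensor([-5.5]))
print(loss.item())
```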
The Universal Weight Subspace Hypothesis
Positive · Artificial Intelligence
A recent study presents the Universal Weight Subspace Hypothesis, revealing that deep neural networks trained on various tasks converge to similar low-dimensional parametric subspaces. This research analyzed over 1,100 models, including Mistral-7B, Vision Transformers, and LLaMA-8B, demonstrating that these networks exploit shared spectral subspaces regardless of initialization or task.
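
A rough illustration of how such a claim can be tested: take the top singular directions of two weight matrices and measure how much their column spaces align. Random matrices stand in for real model weights here, and the dimensions and subspace rank are arbitrary choices.

```python
# Subspace-overlap check: compare the top-k left singular subspaces of
# two weight matrices via principal angles.
import numpy as np

def top_subspace(W, k):
    """Orthonormal basis for the top-k left singular subspace of W."""
    U, _, _ = np.linalg.svd(W, full_matrices=False)
    return U[:, :k]

def subspace_overlap(U1, U2):
    """Mean squared cosine of principal angles; 1.0 = identical subspaces."""
    s = np.linalg.svd(U1.T @ U2, compute_uv=False)
    return float(np.mean(s ** 2))

rng = np.random.default_rng(0)
W_a = rng.normal(size=(512, 512))
W_b = rng.normal(size=(512, 512))
print(subspace_overlap(top_subspace(W_a, 32), top_subspace(W_b, 32)))
```

For independent random matrices this overlap is near k/d (about 0.06 here); the hypothesis predicts substantially higher overlap between trained networks.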
RapidUn: Influence-Driven Parameter Reweighting for Efficient Large Language Model Unlearning
Positive · Artificial Intelligence
A new framework called RapidUn has been introduced to address the challenges of unlearning specific data influences in large language models (LLMs). This method utilizes an influence-driven approach to selectively update parameters, achieving significant efficiency improvements over traditional retraining methods, particularly on models like Mistral-7B and Llama-3-8B.
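
The summary suggests the efficiency gain comes from updating only the parameters most implicated in the data to be forgotten. The sketch below is one plausible reading of that idea, using a simple gradient-magnitude proxy for influence; RapidUn's actual influence estimator and update rule may differ.

```python
# Selective unlearning step: estimate per-parameter influence on the
# forget set, then update only the most influential weights.
import torch
import torch.nn as nn

def selective_unlearn_step(model, forget_batch, loss_fn,
                           top_frac=0.01, lr=1e-4):
    model.zero_grad()
    loss = loss_fn(model, forget_batch)
    loss.backward()
    # Influence proxy: gradient magnitude on the forget-set loss.
    # (For very large models, approximate this quantile on a sample.)
    grads = torch.cat([p.grad.abs().flatten() for p in model.parameters()
                       if p.grad is not None])
    threshold = torch.quantile(grads, 1.0 - top_frac)
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            mask = (p.grad.abs() >= threshold).to(p.dtype)
            # Gradient *ascent* on the forget-set loss, restricted to
            # high-influence weights, so the model unlearns those examples.
            p += lr * p.grad * mask
```

Restricting the update to a small fraction of parameters is what would make such a step far cheaper than retraining while leaving most of the model's behavior intact.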