Efficient Mathematical Reasoning Models via Dynamic Pruning and Knowledge Distillation

arXiv — cs.CL, Tuesday, November 25, 2025 at 5:00:00 AM
  • A recent study introduces a lightweight optimization method for large language models (LLMs) that combines dynamic attention-head pruning with knowledge distillation, aiming to preserve mathematical reasoning capability while reducing computational cost. The method evaluates the importance of attention heads in real time and prunes redundant ones, enabling efficient deployment on complex reasoning tasks such as solving mathematical equations (see the first sketch after this list).
  • This development is significant because the high computational and storage demands of LLMs have limited their practical applications. By transferring knowledge from a larger teacher so that smaller models retain reasoning ability (see the second sketch after this list), the approach could broaden the use of AI in educational and professional settings, particularly in mathematics and related fields.
  • The advancement reflects a growing trend in AI research focused on optimizing model efficiency without sacrificing performance. As various frameworks and methodologies emerge to tackle similar challenges, the emphasis on reducing resource consumption while enhancing reasoning capabilities is becoming increasingly critical in the development of AI technologies.
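A minimal sketch of what real-time head-importance scoring and pruning could look like is shown below. The summary does not specify the paper's importance criterion, so the norm-based score, the `keep_ratio` parameter, and the function names here are illustrative assumptions, not the authors' method.

```python
import torch

def head_importance(attn_output: torch.Tensor) -> torch.Tensor:
    """Score each attention head by the mean L2 norm of its output.

    attn_output: (batch, num_heads, seq_len, head_dim)
    Returns: (num_heads,) importance scores.
    NOTE: this norm-based criterion is an illustrative assumption,
    not necessarily the metric used in the paper.
    """
    return attn_output.norm(dim=-1).mean(dim=(0, 2))

def prune_heads(attn_output: torch.Tensor, keep_ratio: float = 0.75) -> torch.Tensor:
    """Zero out the least important heads for the current batch."""
    scores = head_importance(attn_output)
    num_keep = max(1, int(keep_ratio * scores.numel()))
    keep = torch.topk(scores, num_keep).indices
    mask = torch.zeros_like(scores)
    mask[keep] = 1.0
    # Broadcast the per-head mask over batch, sequence, and feature dims.
    return attn_output * mask.view(1, -1, 1, 1)

# Example: 8 heads, keep the top 6 for this forward pass.
x = torch.randn(2, 8, 16, 64)
pruned = prune_heads(x, keep_ratio=0.75)
```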
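The second sketch illustrates the knowledge-transfer side with a standard soft-label distillation loss (KL divergence to the teacher's softened distribution blended with cross-entropy). The temperature and mixing weight are common defaults, assumed here rather than taken from the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """Blend KL divergence to the teacher's softened outputs with the
    usual cross-entropy loss on the ground-truth labels.
    Temperature and alpha are illustrative defaults."""
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# Example: vocabulary of 1000 tokens, batch of 4 predictions.
student = torch.randn(4, 1000, requires_grad=True)
teacher = torch.randn(4, 1000)
labels = torch.randint(0, 1000, (4,))
loss = distillation_loss(student, teacher, labels)
loss.backward()
```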
— via World Pulse Now AI Editorial System

