BitStopper: An Efficient Transformer Attention Accelerator via Stage-fusion and Early Termination
Positive | Artificial Intelligence
- A new algorithm-architecture co-design named BitStopper has been introduced to improve the efficiency of attention-based large language models (LLMs) by reducing the compute and memory overhead of the self-attention mechanism. The approach combines a bit-serial-enabled stage-fusion mechanism with a lightweight token selection strategy that terminates unimportant tokens early, removing the need for a separate sparsity predictor (an illustrative sketch of this kind of early termination follows after this list).
- The development of BitStopper is significant because it addresses key limitations of dynamic-sparsity attention, namely its high memory traffic and computational cost, which could ease the deployment of LLMs across a range of AI applications.
- This advancement aligns with broader efforts in the AI community to build more efficient transformer models, as seen in similar initiatives such as ESACT, which likewise seeks to reduce computational burden through innovative design strategies. The emphasis on efficiency reflects a wider trend toward better resource utilization in machine learning.
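
The summary above does not describe BitStopper's internals, so the following is only a minimal Python sketch of one way bit-serial score accumulation with early termination could drive predictor-free token selection. The quantization scheme, bounding strategy, function name (`bit_serial_token_select`), and parameters such as `keep_ratio` are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def bit_serial_token_select(q, K, num_bits=8, keep_ratio=0.25):
    """Illustrative only: rank keys against one query by accumulating
    Q.K scores bit-plane by bit-plane (MSB first) over an unsigned
    quantization of K, and terminate tokens whose optimistic score
    bound can no longer reach the running top-k threshold."""
    # Unsigned quantization of K so each bit-plane is a 0/1 matrix.
    k_min = K.min()
    scale = max((K.max() - k_min) / (2 ** num_bits - 1), 1e-12)
    K_q = np.round((K - k_min) / scale).astype(np.int64)     # [n_tokens, d]

    n_tokens = K_q.shape[0]
    k_keep = max(1, int(keep_ratio * n_tokens))
    partial = np.zeros(n_tokens)                  # score accumulated so far
    alive = np.ones(n_tokens, dtype=bool)         # tokens not yet terminated

    # Bounds on the contribution of the still-unprocessed lower bits
    # (scalar, identical for every token).
    q_pos_sum = np.clip(q, 0, None).sum()
    q_neg_sum = np.clip(q, None, 0).sum()

    for b in range(num_bits - 1, -1, -1):         # MSB -> LSB
        bits = (K_q[alive] >> b) & 1              # current bit-plane of surviving keys
        partial[alive] += (bits @ q) * (2 ** b)

        if alive.sum() <= k_keep:
            break                                  # selection already decided: stop early

        rem = (1 << b) - 1                         # max value representable by remaining bits
        upper = partial + rem * q_pos_sum          # optimistic per-token bound
        lower = partial + rem * q_neg_sum          # pessimistic per-token bound

        # A token survives only if its optimistic score can still beat the
        # k-th best pessimistic score among the surviving tokens.
        threshold = np.sort(lower[alive])[::-1][k_keep - 1]
        alive &= upper >= threshold

    # Return the indices of the selected (surviving) tokens, best first.
    idx = np.flatnonzero(alive)
    return idx[np.argsort(partial[idx])[::-1][:k_keep]]
```

The point of the sketch is only that MSB-first accumulation yields monotonically tightening score bounds, which is one way scoring could stop early and tokens could be selected without a separate sparsity predictor; an actual accelerator would implement such a loop in hardware across many queries in parallel, and BitStopper's actual mechanism may differ.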
— via World Pulse Now AI Editorial System
