How Does Sequence Modeling Architecture Influence Base Capabilities of Pre-trained Language Models? Exploring Key Architecture Design Principles to Avoid Base Capabilities Degradation

arXiv — cs.CL, Monday, October 27, 2025 at 4:00:00 AM
A recent study examines how the choice of sequence modeling architecture affects the base capabilities of pre-trained language models built on designs such as the Transformer. While previous research has focused on improving the efficiency of attention mechanisms, this work emphasizes understanding how different architectures influence foundational performance. This matters because it could yield design principles that preserve or improve the effectiveness of language models across a range of applications.
— via World Pulse Now AI Editorial System


Continue Reading
Intervene-All-Paths: Unified Mitigation of LVLM Hallucinations across Alignment Formats
Positive · Artificial Intelligence
A new study introduces the Intervene-All-Paths framework, aimed at mitigating hallucinations in Large Vision-Language Models (LVLMs) by addressing the interplay of various causal pathways. This research highlights that hallucinations stem from multiple sources, including image-to-input-text and text-to-text interactions, and proposes targeted interventions for different question-answer alignment formats.
LinVideo: A Post-Training Framework towards O(n) Attention in Efficient Video Generation
Positive · Artificial Intelligence
LinVideo has been introduced as a post-training framework that enhances video generation efficiency by replacing certain self-attention modules with linear attention, addressing the quadratic computational costs associated with traditional video diffusion models. This method preserves the original model's performance while significantly reducing resource demands.
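The quadratic-versus-linear distinction is easiest to see in code. The sketch below is a generic illustration of kernelized linear attention, not LinVideo's actual modules or API; the function names and the feature map `phi` are assumptions for illustration. The softmax version materializes an n×n score matrix, while the linear version reorders the matrix products so cost grows linearly with sequence length.

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Standard attention: the (n, n) score matrix makes cost quadratic in sequence length.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])                # (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                      # (n, d)

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0) + 1e-6):
    # Kernelized (linear) attention: associativity lets us compute phi(K)^T V first,
    # a (d, d) matrix, so cost is linear in the sequence length n.
    Qp, Kp = phi(Q), phi(K)                                 # (n, d)
    kv = Kp.T @ V                                           # (d, d)
    norm = Qp @ Kp.sum(axis=0, keepdims=True).T             # (n, 1)
    return (Qp @ kv) / norm                                 # (n, d)

# Toy comparison: same inputs, O(n^2 * d) vs. O(n * d^2) work.
n, d = 1024, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
out_quadratic = softmax_attention(Q, K, V)
out_linear = linear_attention(Q, K, V)
```

The two outputs are not numerically identical, since the kernel feature map only approximates the softmax weighting; the appeal of post-training approaches like LinVideo is recovering the original model's quality after such a replacement.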