AVGGT: Rethinking Global Attention for Accelerating VGGT

arXiv — cs.CV · Wednesday, December 3, 2025
  • A recent study titled 'AVGGT: Rethinking Global Attention for Accelerating VGGT' investigates the global attention mechanisms in models such as VGGT and π3, analyzing how they contribute to multi-view 3D performance. The authors propose a two-step acceleration scheme that modifies early global layers and subsamples global attention, reducing computational cost while maintaining performance.
  • The findings are significant as they address the high computational demands associated with global self-attention in existing models, which can hinder real-time applications in 3D scene reconstruction. By optimizing these processes, the research could lead to more practical implementations of VGGT in various fields, including computer vision and augmented reality.
  • This development reflects a broader trend in artificial intelligence where researchers are increasingly focused on improving the efficiency of complex models. Innovations like Head-wise Temporal Token Merging and SwiftVGGT further emphasize the ongoing efforts to balance accuracy and computational efficiency in large-scale scene reconstruction, highlighting the industry's commitment to advancing AI technologies.
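As a rough illustration of the second step, the sketch below shows global attention over tokens pooled from multiple views, with a stride-based key/value subsampling that shrinks the attention matrix. This is a minimal NumPy assumption of how such subsampling could work, not the paper's actual method; all function names and the stride parameter are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def global_attention(q, k, v):
    # q, k, v: (tokens, dim). Full global attention attends every query
    # token to every key token across all views: O(N^2) score matrix.
    scale = 1.0 / np.sqrt(q.shape[-1])
    return softmax(q @ k.T * scale) @ v

def subsampled_global_attention(q, k, v, stride=4):
    # Hypothetical subsampling: keep only every `stride`-th key/value
    # token, so the score matrix shrinks from N x N to N x (N/stride).
    return global_attention(q, k[::stride], v[::stride])

rng = np.random.default_rng(0)
n, d = 64, 16  # e.g., tokens pooled from several views
q, k, v = rng.normal(size=(3, n, d))

full = global_attention(q, k, v)
fast = subsampled_global_attention(q, k, v, stride=4)
print(full.shape, fast.shape)  # both outputs are (64, 16); `fast` used 4x fewer keys
```

The output shape is unchanged, so a subsampled layer could drop into the same network position; the trade-off is that queries can no longer attend to the skipped tokens, which is why such schemes typically leave some layers with full attention.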
— via World Pulse Now AI Editorial System

Continue Reading
Emergent Outlier View Rejection in Visual Geometry Grounded Transformers
Positive · Artificial Intelligence
A recent study has revealed that feed-forward 3D reconstruction models, such as VGGT, can inherently distinguish noisy images, which traditionally hinder reliable 3D reconstruction from in-the-wild image collections. This discovery highlights a specific layer within the model that exhibits outlier-suppressing behavior, enabling effective noise filtering without explicit mechanisms for outlier rejection.