arXiv:2511.14751v1 Announce Type: new 
Abstract: We propose Confidence-Guided Token Merging (Co-Me), an acceleration mechanism for visual geometric transformers that requires no retraining or finetuning of the base model. Co-Me distills a lightweight confidence predictor to rank tokens by uncertainty and selectively merges low-confidence ones, reducing computation while maintaining spatial coverage. Compared to similarity-based merging or pruning, the confidence signal in Co-Me reliably indicates regions emphasized by the transformer, enabling substantial acceleration without degrading performance. Co-Me applies seamlessly to various multi-view and streaming visual geometric transformers, achieving speedups that scale with sequence length. When applied to VGGT and MapAnything, Co-Me achieves up to $11.3\times$ and $7.2\times$ speedup, respectively, making visual geometric transformers practical for real-time 3D perception and reconstruction.

The paper introduces Confidence-Guided Token Merging (Co-Me), a novel acceleration mechanism for visual geometric transformers that does not require retraining or fine-tuning of the base model. Co-Me utilizes a lightweight confidence predictor to rank tokens based on uncertainty, allowing for the selective merging of low-confidence tokens. This approach effectively reduces computational demands while preserving spatial coverage. Co-Me demonstrates significant speedups, achieving up to 11.3x and 7.2x acceleration when applied to VGGT and MapAnything, respectively, making real-time 3D perception…

Co-Me: Confidence-Guided Token Merging for Visual Geometric Transformers
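
The merging idea described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `merge_low_confidence` and `unmerge`, the fixed-size consecutive grouping, the `merge_ratio` parameter, and the use of a plain average as the merge operator are all assumptions made for illustration. The confidence scores are taken as given, since the abstract does not specify the predictor beyond it being lightweight and distilled.

```python
import numpy as np

def merge_low_confidence(tokens, confidence, group=4, merge_ratio=0.5):
    """Merge the least confident token groups into one token each.

    tokens:      (N, D) array of token features.
    confidence:  (N,) array, higher means more confident (assumed given).
    group:       hypothetical number of consecutive tokens per group.
    merge_ratio: hypothetical fraction of groups to merge.
    Returns the shortened token array and a map from each original
    token position to its row in the shortened array.
    """
    n, d = tokens.shape
    assert n % group == 0, "N must be divisible by the group size"
    n_groups = n // group
    grouped = tokens.reshape(n_groups, group, d)
    group_conf = confidence.reshape(n_groups, group).mean(axis=1)

    # Rank groups by mean confidence and mark the lowest ones for merging.
    n_merge = int(round(merge_ratio * n_groups))
    merge_ids = np.argsort(group_conf, kind="stable")[:n_merge]
    merge_mask = np.zeros(n_groups, dtype=bool)
    merge_mask[merge_ids] = True

    out = []
    index_map = np.empty(n, dtype=np.int64)
    for g in range(n_groups):
        start = g * group
        if merge_mask[g]:
            # Low-confidence group: one averaged token stands in for all.
            index_map[start:start + group] = len(out)
            out.append(grouped[g].mean(axis=0))
        else:
            # High-confidence group: every token is kept as-is.
            for j in range(group):
                index_map[start + j] = len(out)
                out.append(grouped[g, j])
    return np.stack(out), index_map

def unmerge(merged, index_map):
    """Restore the original sequence length by copying merged tokens back."""
    return merged[index_map]
```

Because every group still contributes at least one token, the shortened sequence keeps a representative for each region, which is one simple way to read "maintaining spatial coverage". Calling `unmerge` restores the original token count for a dense prediction head.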

arXiv:2512.04012v1 Announce Type: new 
Abstract: Reliable 3D reconstruction from in-the-wild image collections is often hindered by "noisy" images, that is, irrelevant inputs with little or no view overlap with the others. While traditional Structure-from-Motion pipelines handle such cases through geometric verification and outlier rejection, feed-forward 3D reconstruction models lack these explicit mechanisms, leading to degraded performance under in-the-wild conditions. In this paper, we discover that existing feed-forward reconstruction models, e.g., VGGT, despite lacking explicit outlier-rejection mechanisms or noise-aware training, can inherently distinguish distractor images. Through an in-depth analysis under varying proportions of synthetic distractors, we identify a specific layer that naturally exhibits outlier-suppressing behavior. Further probing reveals that this layer encodes discriminative internal representations that enable an effective noise-filtering capability, which we leverage to perform outlier-view rejection in feed-forward 3D reconstruction without any additional fine-tuning or supervision. Extensive experiments on both controlled and in-the-wild datasets demonstrate that this implicit filtering mechanism is consistent and generalizes well across diverse scenarios.

A recent study has revealed that feed-forward 3D reconstruction models, such as VGGT, can inherently distinguish noisy images, which traditionally hinder reliable 3D reconstruction from in-the-wild image collections. This discovery highlights a specific layer within the model that exhibits outlier-suppressing behavior, enabling effective noise filtering without explicit mechanisms for outlier rejection.

Emergent Outlier View Rejection in Visual Geometry Grounded Transformers
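
One way to picture feature-based view rejection is the NumPy sketch below. The abstract does not say which layer is used or how its representations are turned into a score, so everything here is an assumption: the function name `reject_outlier_views`, the input `view_features` (one pooled feature vector per view, taken from some intermediate layer), the mean cosine-similarity score, and the `threshold` value are all hypothetical.

```python
import numpy as np

def reject_outlier_views(view_features, threshold=0.5):
    """Flag views whose features disagree with the rest of the set.

    view_features: (V, D) array, one pooled feature vector per view
                   (assumed to come from an intermediate layer).
    threshold:     hypothetical cutoff on the mean cosine similarity.
    Returns a boolean keep-mask of shape (V,) and the per-view scores.
    """
    feats = np.asarray(view_features, dtype=np.float64)
    n_views = feats.shape[0]
    assert n_views >= 2, "need at least two views to compare"

    # Normalize so that dot products are cosine similarities.
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    unit = feats / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T

    # Score each view by its mean similarity to all *other* views.
    np.fill_diagonal(sim, 0.0)
    scores = sim.sum(axis=1) / (n_views - 1)

    keep = scores >= threshold
    return keep, scores
```

In this sketch, the views marked `False` would be dropped before the image set is passed to the reconstruction model. A fixed threshold is the simplest choice. A rule relative to the score distribution, such as a cutoff based on the median, is an equally plausible alternative.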

arXiv:2512.02541v1 Announce Type: new 
Abstract: Since DUSt3R, models such as VGGT and $\pi^3$ have shown strong multi-view 3D performance, but their heavy reliance on global self-attention results in high computational cost. Existing sparse-attention variants offer partial speedups, yet lack a systematic analysis of how global attention contributes to multi-view reasoning. In this paper, we first conduct an in-depth investigation of the global attention modules in VGGT and $\pi^3$ to better understand their roles. Our analysis reveals a clear division of roles in the alternating global-frame architecture: early global layers do not form meaningful correspondences, middle layers perform cross-view alignment, and last layers provide only minor refinements. Guided by these findings, we propose a training-free two-step acceleration scheme: (1) converting early global layers into frame attention, and (2) subsampling global attention by subsampling K/V over patch tokens with diagonal preservation and a mean-fill component. We instantiate this strategy on VGGT and $\pi^3$ and evaluate across standard pose and point-map benchmarks. Our method achieves up to $8$-$10\times$ speedup in inference time while matching or slightly improving the accuracy of the original models, and remains robust even in extremely dense multi-view settings where prior sparse-attention baselines fail.

A recent study titled 'AVGGT: Rethinking Global Attention for Accelerating VGGT' investigates the global attention mechanisms in models like VGGT and $\pi^3$, revealing their roles in multi-view 3D performance. The authors propose a two-step acceleration scheme to enhance efficiency by modifying early global layers and subsampling global attention. This approach aims to reduce computational costs while maintaining performance.
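
The second step of the scheme, subsampling K/V with diagonal preservation and a mean-fill component, can be sketched in NumPy for a single attention head. This is one interpretation of the abstract, not the authors' code. The assumptions are: a uniform stride picks the kept keys, "diagonal preservation" means each query always sees its own key/value, and "mean-fill" means one extra key/value pair equal to the average of the dropped ones. The name `subsampled_attention` and the `stride` parameter are hypothetical.

```python
import numpy as np

def subsampled_attention(q, k, v, stride=4):
    """Single-head attention over a strided subset of keys/values.

    q, k, v: (N, D) arrays. stride: hypothetical subsampling factor.
    Each query attends to (a) every stride-th key, (b) its own key,
    and (c) one mean key/value summarizing the dropped tokens.
    """
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    keep = np.arange(0, n, stride)
    is_kept = np.zeros(n, dtype=bool)
    is_kept[keep] = True
    dropped = np.flatnonzero(~is_kept)

    # (a) logits against the kept keys.
    kept_logits = (q @ k[keep].T) * scale                  # (N, M)

    # (b) diagonal term; masked where the own key is already kept.
    diag_logits = np.sum(q * k, axis=1, keepdims=True) * scale
    diag_logits[is_kept] = -np.inf

    # (c) mean-fill term built from the dropped keys/values.
    if dropped.size > 0:
        v_mean = v[dropped].mean(axis=0)
        mean_logits = (q @ k[dropped].mean(axis=0))[:, None] * scale
    else:
        v_mean = np.zeros(d)
        mean_logits = np.full((n, 1), -np.inf)

    # Joint softmax over all three groups of logits.
    logits = np.concatenate([kept_logits, diag_logits, mean_logits], axis=1)
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)

    m = keep.size
    return (weights[:, :m] @ v[keep]
            + weights[:, m:m + 1] * v
            + weights[:, m + 1:] * v_mean)
```

With `stride=1` nothing is dropped and the function reduces to ordinary softmax attention. With larger strides each query scores roughly `N / stride + 2` keys instead of `N`.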

