Vision Transformers with Self-Distilled Registers

arXiv — cs.CV · Wednesday, November 19, 2025, 5:00:00 AM
  • Vision Transformers (ViTs) are increasingly recognized for their effectiveness in visual processing, yet they face challenges with artifact tokens that compromise their performance. This study addresses these issues by introducing register tokens, specifically Post Hoc Registers (PH-Reg), which are distilled into the model after pretraining rather than learned by retraining from scratch.
  • The introduction of PH-Reg targets the artifact-token problem in existing ViT checkpoints, adding the register mechanism without the cost of full retraining (a minimal code sketch of the register-token idea follows this summary).
  • The ongoing evolution of ViTs reflects a broader trend in AI towards optimizing model architectures and training methodologies, as seen in recent studies exploring procedural pretraining and hierarchical knowledge organization, which aim to further enhance the capabilities and efficiency of these models.
— via World Pulse Now AI Editorial System
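
Register tokens are extra learnable tokens appended to the patch-token sequence: they attend alongside the patch tokens throughout the backbone and are discarded before the output head, giving the model dedicated slots for global computation instead of repurposing patch tokens as artifacts. The sketch below is a minimal illustration of that mechanism in PyTorch; the class and parameter names (SimpleViTWithRegisters, num_registers) are assumptions made here for illustration, and it does not reproduce the paper's PH-Reg self-distillation procedure for adding registers post hoc.

```python
import torch
import torch.nn as nn

class SimpleViTWithRegisters(nn.Module):
    """Toy ViT encoder showing where register tokens enter and leave."""

    def __init__(self, dim=192, depth=4, heads=3, num_patches=196, num_registers=4):
        super().__init__()
        self.patch_embed = nn.Linear(768, dim)          # stand-in for conv patchify
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        # Learnable register tokens, shared across all images in a batch.
        self.registers = nn.Parameter(torch.zeros(1, num_registers, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, 1000)
        self.num_registers = num_registers

    def forward(self, patches):
        # patches: (batch, num_patches, 768) flattened patch pixels.
        x = self.patch_embed(patches) + self.pos_embed
        regs = self.registers.expand(x.shape[0], -1, -1)
        # Registers are concatenated to the sequence and attend like normal tokens.
        x = torch.cat([x, regs], dim=1)
        x = self.blocks(x)
        # Registers are dropped before pooling, so outputs keep their original shape.
        x = x[:, : -self.num_registers, :]
        return self.head(x.mean(dim=1))

model = SimpleViTWithRegisters()
logits = model(torch.randn(2, 196, 768))
print(logits.shape)  # torch.Size([2, 1000])
```

In a post hoc setting the pretrained backbone already exists and the register tokens are added and trained afterward, but the concatenate-then-drop pattern shown here stays the same.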


Recommended Readings
Can You Learn to See Without Images? Procedural Warm-Up for Vision Transformers
Positive · Artificial Intelligence
This study explores a novel approach to enhance vision transformers (ViTs) by pretraining them on procedurally-generated data that lacks visual or semantic content. Utilizing simple algorithms, the research aims to instill generic biases in ViTs, allowing them to internalize abstract computational priors. The findings indicate that this warm-up phase, followed by standard image-based training, significantly boosts data efficiency, convergence speed, and overall performance, with notable improvements observed on ImageNet-1k.
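
The warm-up relies on inputs that have algorithmic structure but no visual or semantic content, so the transformer can absorb generic computational priors before it ever sees an image. The snippet below is a hypothetical example of such a procedural generator, not one of the paper's actual algorithms: it draws random axis-aligned rectangles and labels each image by the rectangle count, a toy counting task with no natural-image semantics; procedural_sample and procedural_batch are assumed names.

```python
import numpy as np

def procedural_sample(rng, size=224, max_rects=8):
    """Generate one procedurally-drawn image with no natural-image content.

    The label is the number of rectangles drawn, so the warm-up task
    (counting simple structures) carries no visual semantics.
    """
    img = np.zeros((size, size), dtype=np.float32)
    n = rng.integers(1, max_rects + 1)
    for _ in range(n):
        x0, y0 = rng.integers(0, size - 16, size=2)
        w, h = rng.integers(8, 64, size=2)
        img[y0:y0 + h, x0:x0 + w] += rng.uniform(0.2, 1.0)
    return np.clip(img, 0.0, 1.0), int(n)

def procedural_batch(rng, batch_size=32):
    """Assemble a batch of (image, label) pairs for one warm-up training step."""
    imgs, labels = zip(*(procedural_sample(rng) for _ in range(batch_size)))
    return np.stack(imgs), np.array(labels)

rng = np.random.default_rng(0)
images, labels = procedural_batch(rng)
print(images.shape, labels[:5])  # (32, 224, 224) and the first five counts
```

After a warm-up phase on batches like these, the same ViT would be trained normally on image data; the summary above reports that this ordering improves data efficiency, convergence speed, and ImageNet-1k performance.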
Stratified Knowledge-Density Super-Network for Scalable Vision Transformers
Positive · Artificial Intelligence
The article presents a novel approach to optimizing vision transformer (ViT) models by creating a stratified knowledge-density super-network. This method organizes knowledge hierarchically across weights, allowing for flexible extraction of sub-networks that maintain essential knowledge for various model sizes. The introduction of Weighted PCA for Attention Contraction (WPAC) enhances knowledge compactness while preserving the original network function, addressing the inefficiencies of training multiple ViT models under different resource constraints.
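
The summary does not spell out how Weighted PCA for Attention Contraction operates, so the block below sketches a generic, assumed variant of weighted PCA compression applied to a linear layer's weight matrix: rows are weighted by an importance score before PCA, so the retained principal components favour directions serving important outputs, and the layer can then be stored (or sliced into a sub-network) at reduced rank. The function name weighted_pca_compress and the row_importance weighting are placeholders for whatever saliency measure the method actually uses.

```python
import numpy as np

def weighted_pca_compress(W, row_importance, rank):
    """Compress weight matrix W (out_dim, in_dim) to a rank-`rank` factorization.

    Rows are weighted by `row_importance` before PCA so that directions serving
    important outputs dominate the retained principal components.
    """
    w = row_importance[:, None]                      # (out_dim, 1)
    mean = (w * W).sum(axis=0) / w.sum()             # importance-weighted mean row
    X = np.sqrt(w) * (W - mean)                      # weighted, centered rows
    # Principal directions of the weighted row distribution.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    P = Vt[:rank]                                    # (rank, in_dim) projection basis
    W_low = (W - mean) @ P.T                         # (out_dim, rank) coefficients
    W_approx = W_low @ P + mean                      # reconstruction at reduced rank
    return W_low, P, W_approx

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 384))
importance = rng.uniform(0.1, 1.0, size=256)
W_low, P, W_hat = weighted_pca_compress(W, importance, rank=64)
print(W_low.shape, P.shape, np.linalg.norm(W - W_hat) / np.linalg.norm(W))
```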
Likelihood-guided Regularization in Attention Based Models
Positive · Artificial Intelligence
The paper introduces a novel likelihood-guided variational Ising-based regularization framework for Vision Transformers (ViTs), aimed at enhancing model generalization while dynamically pruning redundant parameters. This approach utilizes Bayesian sparsification techniques to impose structured sparsity on model weights, allowing for adaptive architecture search during training. Unlike traditional dropout methods, this framework learns task-adaptive regularization, improving efficiency and interpretability in classification tasks involving structured and high-dimensional data.
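
The core mechanism is a learned, task-adaptive penalty that pushes structured groups of weights, such as whole attention heads, toward zero while the likelihood term keeps useful groups alive. The paper's variational Ising-based formulation is not detailed in this summary, so the sketch below substitutes a deliberately simpler stand-in: per-head sigmoid gates on a self-attention block with an expected-active-heads penalty added to the task loss. GatedHeadAttention, gate_logits, and the 1e-3 trade-off coefficient are assumed names and values, not the paper's.

```python
import torch
import torch.nn as nn

class GatedHeadAttention(nn.Module):
    """Self-attention with learnable per-head gates as a stand-in for
    structured sparsification of attention heads."""

    def __init__(self, dim=192, heads=3):
        super().__init__()
        self.heads, self.head_dim = heads, dim // heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # One logit per head; sigmoid(logit) acts as a soft keep-probability.
        self.gate_logits = nn.Parameter(torch.zeros(heads))

    def forward(self, x):
        b, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        shape = (b, n, self.heads, self.head_dim)
        # Reshape to (batch, heads, tokens, head_dim) for per-head attention.
        q, k, v = (t.view(shape).transpose(1, 2) for t in (q, k, v))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        out = attn @ v                                        # per-head outputs
        gates = torch.sigmoid(self.gate_logits).view(1, -1, 1, 1)
        out = out * gates                                     # soft-prune whole heads
        out = out.transpose(1, 2).reshape(b, n, d)
        return self.proj(out)

    def sparsity_penalty(self):
        # Expected number of active heads; adding this to the loss prunes heads.
        return torch.sigmoid(self.gate_logits).sum()

layer = GatedHeadAttention()
x = torch.randn(2, 196, 192)
task_loss = layer(x).pow(2).mean()            # placeholder for the likelihood term
loss = task_loss + 1e-3 * layer.sparsity_penalty()
loss.backward()
print(float(loss))
```

A Bayesian treatment would place a prior over the gates and optimize a variational objective rather than this plain penalty, but the structural effect, encouraging entire heads to switch off during training, is the kind of structured sparsity the summary describes.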