Mechanisms of Non-Monotonic Scaling in Vision Transformers
- A recent study of Vision Transformers (ViTs) reports non-monotonic scaling behavior: deeper models such as ViT-L perform worse on ImageNet than shallower counterparts like ViT-B and ViT-S. The research identifies a three-phase Cliff-Plateau-Climb pattern describing how representation quality evolves with depth, with the [CLS] token playing a diminishing role relative to the patch tokens (a probing sketch illustrating this kind of depth-wise analysis follows the list).
- This finding challenges established scaling assumptions in deep learning, suggesting that simply increasing model depth may not yield better performance. Instead, it highlights the need for a more nuanced approach to model architecture design in Vision Transformers.
- The implications of this research resonate with ongoing discussions about optimizing ViT architectures, where strategies such as dynamic granularity adjustment, structural reparameterization, and knowledge distillation are being explored to improve performance. These developments reflect a broader trend in AI research toward refining model efficiency and effectiveness rather than merely increasing complexity.
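
As a concrete illustration of the depth-wise analysis mentioned above, the sketch below extracts, for every transformer block of a ViT, both the [CLS] token and the mean-pooled patch tokens; one could then fit a per-layer linear probe on each to trace how representation quality changes with depth. The model name (`vit_base_patch16_224` from the timm library), the hook-based feature extraction, and the probing setup are illustrative assumptions, not details taken from the study.

```python
# Hedged sketch (not the paper's protocol): capture per-block [CLS] and
# mean-pooled patch features from a timm ViT, as inputs for later linear probes.
import torch
import timm

# Assumed checkpoint for illustration; any timm ViT with a [CLS] token works similarly.
model = timm.create_model("vit_base_patch16_224", pretrained=True)
model.eval()

features = {}  # block index -> (cls_feature, mean_patch_feature)

def make_hook(idx):
    def hook(module, inputs, output):
        # timm ViT blocks emit tokens of shape [batch, 1 + num_patches, dim],
        # with the [CLS] token at position 0 for this architecture.
        cls_tok = output[:, 0]                   # [batch, dim]
        patch_mean = output[:, 1:].mean(dim=1)   # [batch, dim]
        features[idx] = (cls_tok.detach(), patch_mean.detach())
    return hook

hooks = [blk.register_forward_hook(make_hook(i)) for i, blk in enumerate(model.blocks)]

# One forward pass; a random batch stands in for real ImageNet images here.
with torch.no_grad():
    model(torch.randn(8, 3, 224, 224))

for h in hooks:
    h.remove()

# Downstream, one would train a linear probe per depth on the [CLS] features and
# another on the patch-mean features, then compare probe accuracies across blocks
# to trace the Cliff-Plateau-Climb curve described in the article.
for idx, (cls_f, patch_f) in sorted(features.items()):
    print(f"block {idx:02d}: cls {tuple(cls_f.shape)}, patch-mean {tuple(patch_f.shape)}")
```
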
— via World Pulse Now AI Editorial System
