From Low-Rank Features to Encoding Mismatch: Rethinking Feature Distillation in Vision Transformers

arXiv — cs.CV · Thursday, November 20, 2025 at 5:00:00 AM
  • A recent study highlights the challenges of feature distillation in Vision Transformers (ViTs), reframing them in terms of an encoding mismatch between teacher and student rather than low-rank features alone (a minimal distillation-loss sketch follows this summary).
  • This finding is significant because it suggests a need to rethink the design of knowledge distillation (KD) methods specifically for ViTs, which are increasingly prevalent in visual processing tasks.
  • The ongoing research into optimizing ViTs, including novel architectures and regularization techniques, underscores a broader trend towards enhancing model efficiency and performance in deep learning.
— via World Pulse Now AI Editorial System
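
For readers unfamiliar with feature distillation, the sketch below shows the generic setup the paper revisits: a learned projector maps student features into the teacher's dimension, and an MSE loss aligns the two feature maps. This is a minimal illustration, not the paper's method; the class name, dimensions, and projector choice are assumptions.

```python
# Minimal feature-distillation sketch (generic, not the paper's method):
# a linear projector maps student tokens into the teacher's dimension,
# and an MSE loss aligns the two feature maps.
import torch
import torch.nn as nn

class FeatureDistillLoss(nn.Module):
    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        # Projector bridging the dimensionality gap; its adequacy is exactly
        # what "encoding mismatch" critiques call into question.
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, f_student: torch.Tensor, f_teacher: torch.Tensor) -> torch.Tensor:
        # f_student: (B, N, student_dim), f_teacher: (B, N, teacher_dim)
        return nn.functional.mse_loss(self.proj(f_student), f_teacher.detach())

# Usage with hypothetical token features:
loss_fn = FeatureDistillLoss(student_dim=384, teacher_dim=768)
f_s = torch.randn(2, 197, 384)   # e.g., ViT-S tokens
f_t = torch.randn(2, 197, 768)   # e.g., ViT-B tokens
loss = loss_fn(f_s, f_t)
```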


Recommended Readings
D4C: Data-free Quantization for Contrastive Language-Image Pre-training Models
Positive · Artificial Intelligence
Data-Free Quantization (DFQ) presents a solution for model compression without needing real data, which is beneficial in privacy-sensitive contexts. While DFQ has been effective for unimodal models, its application to Vision-Language Models like CLIP has not been thoroughly investigated. This study introduces D4C, a DFQ framework specifically designed for CLIP, addressing challenges such as semantic content and intra-image diversity in synthesized samples.
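
As a concrete illustration of the data-free idea, the sketch below uses classic BatchNorm-statistics matching (in the style of ZeroQ) to synthesize calibration inputs for a toy network; D4C's CLIP-specific objectives for semantic content and intra-image diversity are not reproduced here, and the toy model is an assumption.

```python
# Generic data-free calibration sketch: synthesize inputs whose batch
# statistics match the model's stored BatchNorm running statistics, then
# use them as calibration data for post-training quantization.
import torch
import torch.nn as nn

model = nn.Sequential(  # toy stand-in for the full network
    nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
)
model.eval()

x = torch.randn(8, 3, 32, 32, requires_grad=True)  # synthetic batch
opt = torch.optim.Adam([x], lr=0.1)
bn = model[1]

for _ in range(100):
    opt.zero_grad()
    h = model[0](x)
    mu, var = h.mean(dim=(0, 2, 3)), h.var(dim=(0, 2, 3))
    # Drive synthetic-batch statistics toward the BN running statistics.
    loss = (mu - bn.running_mean).pow(2).sum() + (var - bn.running_var).pow(2).sum()
    loss.backward()
    opt.step()
# x can now serve as calibration data in place of real images.
```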
Self Pre-training with Topology- and Spatiality-aware Masked Autoencoders for 3D Medical Image Segmentation
Positive · Artificial Intelligence
This paper introduces a novel approach to self pre-training using topology- and spatiality-aware Masked Autoencoders (MAEs) for 3D medical image segmentation. The proposed method enhances the ability of Vision Transformers (ViTs) to capture geometric shape and spatial information, which are crucial for accurate segmentation. A new topological loss is introduced to preserve geometric shape information, improving the performance of MAEs in medical imaging tasks.
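
The composite objective can be pictured as a standard MAE reconstruction term plus a weighted shape term. In the sketch below, `topo_surrogate` is a simple differentiable stand-in for the paper's topological loss, which is not reproduced; the tensor shapes and the weight `lam` are assumptions.

```python
# Sketch of the composite objective only (the paper's actual topological
# loss is not reproduced; `topo_surrogate` is an illustrative stand-in).
import torch

def mae_recon_loss(pred, target, mask):
    # Mean-squared error on masked voxels only, as in standard MAE.
    return ((pred - target).pow(2) * mask).sum() / mask.sum().clamp(min=1)

def topo_surrogate(pred):
    # Placeholder shape regularizer: penalize spatial roughness along each
    # axis of a 3D prediction (B, D, H, W). NOT the paper's topological loss.
    dz = (pred[:, 1:] - pred[:, :-1]).abs().mean()
    dy = (pred[:, :, 1:] - pred[:, :, :-1]).abs().mean()
    dx = (pred[:, :, :, 1:] - pred[:, :, :, :-1]).abs().mean()
    return dz + dy + dx

pred   = torch.rand(2, 16, 16, 16, requires_grad=True)
target = torch.rand(2, 16, 16, 16)
mask   = (torch.rand(2, 16, 16, 16) > 0.25).float()  # 75% masking ratio
lam = 0.1  # hypothetical weighting
loss = mae_recon_loss(pred, target, mask) + lam * topo_surrogate(pred)
```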
Application of Graph Based Vision Transformers Architectures for Accurate Temperature Prediction in Fiber Specklegram Sensors
Positive · Artificial Intelligence
This study explores the application of transformer-based architectures for predicting temperature variations using Fiber Specklegram Sensors (FSS). The research highlights the challenges posed by the nonlinear nature of specklegram data and demonstrates that Vision Transformers (ViTs) achieved a Mean Absolute Error (MAE) of 1.15, outperforming traditional models like CNNs. The findings underscore the potential of advanced transformer models in enhancing environmental monitoring capabilities.
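
Repurposing a ViT for scalar regression of this kind is straightforward; the sketch below shows one way, with a single-output head and an L1 training loss that matches the reported MAE metric. The model choice, input size, and data are hypothetical, not the paper's setup.

```python
# A ViT repurposed for scalar regression on specklegram images
# (model, input size, and data are stand-ins, not the paper's setup).
import timm
import torch
import torch.nn as nn

# num_classes=1 turns the classification head into a single regression output.
model = timm.create_model("vit_tiny_patch16_224", pretrained=False, num_classes=1)

images = torch.randn(4, 3, 224, 224)                      # stand-in specklegrams
temps  = torch.tensor([[20.5], [21.0], [22.3], [23.1]])   # target temperatures (°C)

pred = model(images)                 # (4, 1) predicted temperatures
loss = nn.L1Loss()(pred, temps)      # L1 training loss matches the MAE metric
```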
CascadedViT: Cascaded Chunk-FeedForward and Cascaded Group Attention Vision Transformer
Positive · Artificial Intelligence
The paper introduces CascadedViT (CViT), a lightweight vision transformer architecture designed to address the high computational and energy demands of traditional Vision Transformers (ViTs). It features a novel feedforward network called Cascaded-Chunk Feed Forward Network (CCFFN), which enhances parameter and FLOP efficiency by splitting input features. Experiments on ImageNet-1K demonstrate that the CViT-XL model achieves 75.5% Top-1 accuracy while reducing FLOPs by 15% and energy consumption by 3.3% compared to EfficientViT-M5, making it suitable for battery-constrained devices.
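
One plausible reading of "splitting input features" is sketched below: the channel dimension is divided into chunks, each handled by a small FFN, with each chunk's output cascaded into the next. The module name and cascading rule are assumptions, not the paper's exact CCFFN.

```python
# Illustrative chunked, cascaded feedforward block (an interpretation of
# CCFFN, not the paper's exact design): channels are split into chunks,
# and each chunk's output is added into the next chunk's input.
import torch
import torch.nn as nn

class ChunkedCascadeFFN(nn.Module):
    def __init__(self, dim: int, chunks: int = 4, expansion: int = 2):
        super().__init__()
        assert dim % chunks == 0
        c = dim // chunks
        self.chunks = chunks
        self.ffns = nn.ModuleList(
            nn.Sequential(nn.Linear(c, c * expansion), nn.GELU(), nn.Linear(c * expansion, c))
            for _ in range(chunks)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        parts = x.chunk(self.chunks, dim=-1)
        outs, carry = [], 0.0
        for part, ffn in zip(parts, self.ffns):
            carry = ffn(part + carry)  # cascade previous chunk's output
            outs.append(carry)
        return torch.cat(outs, dim=-1)

tokens = torch.randn(2, 196, 256)
y = ChunkedCascadeFFN(dim=256)(tokens)  # same shape as input
```

Processing narrow chunks with small FFNs is where the parameter and FLOP savings come from: each sub-network is quadratically cheaper than a full-width FFN.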
Vision Transformers with Self-Distilled Registers
Positive · Artificial Intelligence
Vision Transformers (ViTs) have become the leading architecture for visual processing tasks, showcasing remarkable scalability with larger training datasets and model sizes. However, recent findings have revealed the presence of artifact tokens in ViTs that conflict with local semantics, negatively impacting performance in tasks requiring precise localization and structural coherence. This paper introduces register tokens to mitigate this issue, proposing Post Hoc Registers (PH-Reg) as an efficient self-distillation method to integrate these tokens into existing ViTs without the need for retraining.
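
The register-token mechanism itself is easy to picture: learnable extra tokens are appended before the transformer blocks and discarded afterward, giving attention a place to deposit global information instead of corrupting patch tokens. The sketch below shows only this mechanism; PH-Reg's self-distillation procedure is not reproduced, and all names and sizes are assumptions.

```python
# Register-token mechanics in brief (PH-Reg's self-distillation is not
# reproduced): extra learnable tokens ride along through the blocks and
# are dropped before the output is used.
import torch
import torch.nn as nn

class BlocksWithRegisters(nn.Module):
    def __init__(self, dim: int = 192, n_registers: int = 4, depth: int = 2):
        super().__init__()
        self.registers = nn.Parameter(torch.zeros(1, n_registers, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.n_registers = n_registers

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        b = patch_tokens.size(0)
        regs = self.registers.expand(b, -1, -1)       # shared registers per batch
        x = torch.cat([patch_tokens, regs], dim=1)    # append registers
        x = self.blocks(x)
        return x[:, : -self.n_registers]              # drop registers, keep patches

tokens = torch.randn(2, 196, 192)
out = BlocksWithRegisters()(tokens)  # (2, 196, 192)
```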
UNSEEN: Enhancing Dataset Pruning from a Generalization Perspective
Positive · Artificial Intelligence
The paper titled 'UNSEEN: Enhancing Dataset Pruning from a Generalization Perspective' addresses the computational challenges posed by large datasets in deep learning. It proposes a novel approach to dataset pruning that focuses on generalization rather than fitting, scoring samples based on models not exposed to them during training. This method aims to create a more effective selection process by reducing the concentration of sample scores, ultimately improving the performance of deep learning models.
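
A minimal version of "scoring samples with models not exposed to them" is k-fold cross-scoring: each fold is scored by a model trained only on the other folds. The toy model, data, and loss-based score below are assumptions; the paper's actual scoring rule may differ.

```python
# Cross-scoring sketch: each fold is scored by a model that never saw it
# during training (toy linear model and data; illustrative only).
import torch
import torch.nn as nn

X, y = torch.randn(300, 20), torch.randint(0, 3, (300,))
folds = torch.arange(300) % 3  # 3-fold split
scores = torch.empty(300)

for k in range(3):
    train, held = folds != k, folds == k
    model = nn.Linear(20, 3)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(50):  # brief training on the folds this model may see
        opt.zero_grad()
        nn.functional.cross_entropy(model(X[train]), y[train]).backward()
        opt.step()
    with torch.no_grad():  # score only the unseen fold
        scores[held] = nn.functional.cross_entropy(
            model(X[held]), y[held], reduction="none")

keep = scores.argsort(descending=True)[:150]  # e.g., keep the hardest 50%
```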
Likelihood-guided Regularization in Attention Based Models
Positive · Artificial Intelligence
The paper introduces a novel likelihood-guided variational Ising-based regularization framework for Vision Transformers (ViTs), aimed at enhancing model generalization while dynamically pruning redundant parameters. This approach utilizes Bayesian sparsification techniques to impose structured sparsity on model weights, allowing for adaptive architecture search during training. Unlike traditional dropout methods, this framework learns task-adaptive regularization, improving efficiency and interpretability in classification tasks involving structured and high-dimensional data.
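
The generic pattern behind such sparsification can be sketched with learnable stochastic gates: each weight group gets a probability of staying active, sampled through a relaxed Bernoulli so gradients flow, plus a penalty on the expected number of active groups. The Ising prior and likelihood guidance that define the paper's method are not reproduced; everything below is an illustrative assumption.

```python
# Generic learnable-gate sparsification (the paper's Ising prior and
# likelihood guidance are NOT reproduced): relaxed Bernoulli gates on
# output units plus a penalty on the expected number of active gates.
import torch
import torch.nn as nn

class GatedLinear(nn.Module):
    def __init__(self, d_in: int, d_out: int, temp: float = 0.5):
        super().__init__()
        self.linear = nn.Linear(d_in, d_out)
        self.logits = nn.Parameter(torch.zeros(d_out))  # one gate per output unit
        self.temp = temp

    def forward(self, x):
        u = torch.rand_like(self.logits).clamp(1e-6, 1 - 1e-6)
        # Concrete / Gumbel-sigmoid relaxation of Bernoulli gates.
        g = torch.sigmoid((self.logits + u.log() - (1 - u).log()) / self.temp)
        return self.linear(x) * g

    def expected_active(self):
        return torch.sigmoid(self.logits).sum()  # differentiable sparsity penalty

layer = GatedLinear(64, 32)
x = torch.randn(8, 64)
task_loss = layer(x).pow(2).mean()                  # stand-in task objective
loss = task_loss + 1e-3 * layer.expected_active()   # sparsity-regularized loss
```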
Stratified Knowledge-Density Super-Network for Scalable Vision Transformers
Positive · Artificial Intelligence
The article presents a novel approach to optimizing vision transformer (ViT) models by creating a stratified knowledge-density super-network. This method organizes knowledge hierarchically across weights, allowing for flexible extraction of sub-networks that maintain essential knowledge for various model sizes. The introduction of Weighted PCA for Attention Contraction (WPAC) enhances knowledge compactness while preserving the original network function, addressing the inefficiencies of training multiple ViT models under different resource constraints.
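
Setting aside the weighting, the contraction step can be illustrated with plain truncated SVD: a projection matrix is replaced by a product of two narrow matrices spanning its top principal directions. The sketch below shows only this; WPAC's weighting and function-preserving construction are not reproduced.

```python
# Plain low-rank contraction of an attention projection via truncated SVD
# (WPAC's weighting and function-preserving construction are not shown).
import torch

d, r = 256, 64                      # original width, contracted rank
W = torch.randn(d, d)               # e.g., a query/key/value projection
U, S, Vh = torch.linalg.svd(W)
# Keep the top-r principal directions: W ≈ A @ B with far fewer parameters.
A = U[:, :r] * S[:r]                # (d, r)
B = Vh[:r]                          # (r, d)

x = torch.randn(10, d)
approx = x @ (A @ B).T              # contracted projection
exact  = x @ W.T
err = (approx - exact).norm() / exact.norm()  # relative reconstruction error
```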