Dynamic Temperature Scheduler for Knowledge Distillation

arXiv — cs.LG · Wednesday, November 19, 2025 at 5:00:00 AM
  • A new method, the Dynamic Temperature Scheduler (DTS), has been introduced to enhance Knowledge Distillation (KD) by adjusting the distillation temperature dynamically according to the loss gap between the teacher and student models. Early in training, when the gap is large, the temperature stays high and the teacher's probabilities remain soft; as the student closes the gap, the temperature drops and the targets sharpen, improving training efficiency (a minimal sketch of such a schedule appears below).
  • DTS is notable as the first temperature-scheduling method that adapts to the divergence between the teacher and student distributions, which can translate into better performance across applications, including vision tasks.
— via World Pulse Now AI Editorial System
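
The exact schedule is not given in this summary; the following is a minimal sketch of the idea, assuming the temperature is a saturating function of the current teacher-student loss gap (the bounds t_min and t_max and the mapping itself are illustrative, not the paper's):

```python
import torch
import torch.nn.functional as F

def dts_temperature(teacher_loss, student_loss, t_min=1.0, t_max=8.0):
    """Map the teacher-student loss gap (scalar floats) to a temperature in [t_min, t_max].

    Hypothetical schedule: a large gap (student far behind) yields a high
    temperature (softer targets); as the gap shrinks, T decays toward t_min.
    """
    gap = max(student_loss - teacher_loss, 0.0)
    scale = gap / (1.0 + gap)          # saturating map of the gap onto [0, 1); an assumption
    return t_min + (t_max - t_min) * scale

def kd_loss(student_logits, teacher_logits, temperature):
    """Standard KL-based distillation loss at the given temperature."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2
```

In a training loop, teacher_loss and student_loss would be the two models' current cross-entropy on the batch; as the student catches up, the gap shrinks, the temperature decays, and the targets sharpen, matching the behaviour described above.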


Recommended Readings
UNSEEN: Enhancing Dataset Pruning from a Generalization Perspective
Positive · Artificial Intelligence
The paper titled 'UNSEEN: Enhancing Dataset Pruning from a Generalization Perspective' addresses the computational challenges posed by large datasets in deep learning. It proposes a dataset-pruning approach that focuses on generalization rather than fitting: samples are scored by models that were not exposed to them during training. Because the scoring models have not fit the samples they score, the resulting scores are less concentrated, which makes the selection more discriminative and ultimately improves the performance of the pruned-data models.
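One way to score samples with models that never saw them, sketched here with a hypothetical k-fold scheme (the paper's actual scoring rule and selection criterion may differ):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

def out_of_sample_scores(features, labels, n_splits=5):
    """Score each sample with a model that was NOT trained on it.

    Illustrative proxy: the held-out model's confidence on the true label.
    Assumes integer labels 0..K-1 with every class present in each training fold.
    """
    scores = np.empty(len(labels), dtype=float)
    for train_idx, held_out_idx in KFold(n_splits=n_splits, shuffle=True,
                                         random_state=0).split(features):
        model = LogisticRegression(max_iter=1000).fit(features[train_idx],
                                                      labels[train_idx])
        probs = model.predict_proba(features[held_out_idx])
        scores[held_out_idx] = probs[np.arange(len(held_out_idx)),
                                     labels[held_out_idx]]
    return scores

def prune(features, labels, keep_ratio=0.5):
    """Keep the hardest samples under the out-of-sample score (one possible rule)."""
    scores = out_of_sample_scores(features, labels)
    return np.argsort(scores)[: int(keep_ratio * len(labels))]
```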
Squeezed Diffusion Models
Positive · Artificial Intelligence
Squeezed Diffusion Models (SDM) introduce a novel approach to diffusion models by scaling noise anisotropically along the principal component of the training distribution. This method, inspired by quantum squeezed states and the Heisenberg uncertainty principle, aims to enhance the signal-to-noise ratio, thereby improving the learning of important data features. Initial studies on datasets like CIFAR-10/100 and CelebA-64 indicate that mild antisqueezing can lead to significant improvements in model performance, with FID scores improving by up to 15%.
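A minimal sketch of anisotropic noising along the leading principal component; the squeeze factor and where it enters the diffusion process are assumptions here, not the paper's settings:

```python
import numpy as np

def principal_direction(data):
    """Leading principal component of the (flattened) training data."""
    centered = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]                                   # unit vector along PC1

def squeezed_noise(shape, pc1, squeeze=1.15, rng=None):
    """Isotropic Gaussian noise with its component along PC1 rescaled.

    squeeze < 1 shrinks noise along PC1 (squeezing); squeeze > 1 amplifies it
    (antisqueezing). The summary reports mild antisqueezing helps, hence the default.
    """
    rng = rng or np.random.default_rng(0)
    eps = rng.standard_normal(shape)               # (batch, dim)
    along = eps @ pc1                              # (batch,) projection onto PC1
    return eps + (squeeze - 1.0) * np.outer(along, pc1)
```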
Attention Via Convolutional Nearest Neighbors
Positive · Artificial Intelligence
The article introduces Convolutional Nearest Neighbors (ConvNN), a framework that unifies Convolutional Neural Networks (CNNs) and Transformers by viewing convolution and self-attention as neighbor selection and aggregation methods. ConvNN allows for a systematic exploration of the spectrum between these two architectures, serving as a drop-in replacement for convolutional and attention layers. The framework's effectiveness is validated through classification tasks on CIFAR-10 and CIFAR-100 datasets.
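The unifying view is "select neighbors, then aggregate." A toy 1-D illustration of that spectrum (not the paper's layer): spatial neighbors give a convolution-like operator, feature-space nearest neighbors an attention-like one, with a uniform mean standing in for learned aggregation weights:

```python
import numpy as np

def neighbor_aggregate(x, k=3, mode="spatial"):
    """Aggregate each token with its k nearest neighbors.

    x: (n_tokens, dim). mode='spatial' uses index distance (conv-like);
    mode='feature' uses feature-space distance (attention-like).
    """
    n = len(x)
    if mode == "spatial":
        dists = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])
    else:  # feature-space nearest neighbors
        dists = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    idx = np.argsort(dists, axis=1)[:, :k]   # k nearest neighbors per token (incl. self)
    return x[idx].mean(axis=1)               # (n_tokens, dim)
```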
Attention via Synaptic Plasticity is All You Need: A Biologically Inspired Spiking Neuromorphic Transformer
Positive · Artificial Intelligence
The article discusses a new approach to attention mechanisms in artificial intelligence inspired by biological synaptic plasticity. Instead of the dot-product similarity used by conventional Transformers, attention is computed through a plasticity-based mechanism, with the goal of improving energy efficiency in spiking neural networks (SNNs). The research highlights the limitations of current spiking attention models and proposes a biologically inspired spiking neuromorphic transformer that could reduce the carbon footprint associated with large language models (LLMs) such as GPT.
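The summary does not describe the plasticity rule itself; purely as an illustration of what attention without dot products could look like, the toy sketch below builds attention-like weights from a Hebbian spike-coincidence trace (entirely an assumption, not the paper's mechanism):

```python
import numpy as np

def hebbian_attention(pre_spikes, post_spikes, decay=0.9):
    """Toy plasticity-based attention: weights grow from spike co-activity
    rather than from dot-product similarity.

    pre_spikes: (timesteps, n_pre) binary spike trains; post_spikes: (timesteps, n_post).
    Returns an (n_post, n_pre) attention-like weight matrix.
    """
    n_pre, n_post = pre_spikes.shape[1], post_spikes.shape[1]
    w = np.zeros((n_post, n_pre))
    trace = np.zeros(n_pre)
    for t in range(len(pre_spikes)):
        trace = decay * trace + pre_spikes[t]     # eligibility trace of presynaptic spikes
        w += np.outer(post_spikes[t], trace)      # strengthen weights on coincidence
    return w / (w.sum(axis=1, keepdims=True) + 1e-8)   # normalize rows like attention
```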
MI-to-Mid Distilled Compression (M2M-DC): An Hybrid-Information-Guided-Block Pruning with Progressive Inner Slicing Approach to Model Compression
Positive · Artificial Intelligence
MI-to-Mid Distilled Compression (M2M-DC) is a novel compression framework that combines information-guided block pruning with progressive inner slicing and staged knowledge distillation. The method ranks residual blocks based on a mutual information signal, removing the least informative units. It alternates short knowledge distillation phases with channel slicing to maintain computational efficiency while preserving model accuracy. The approach has demonstrated promising results on CIFAR-100, achieving high accuracy with significantly reduced parameters.
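A sketch of the block-ranking step, using scikit-learn's mutual-information estimator on pooled block activations as a stand-in for the paper's MI signal (the actual estimator, pooling, and prune/slice/distill schedule are not specified in this summary):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def rank_blocks(block_activations, labels):
    """Rank residual blocks by a mutual-information proxy with the labels.

    block_activations: dict of block name -> (n_samples, n_features) pooled activations.
    Returns block names sorted from least to most informative.
    """
    scores = {}
    for name, acts in block_activations.items():
        mi_per_feature = mutual_info_classif(acts, labels, random_state=0)
        scores[name] = float(mi_per_feature.mean())
    return sorted(scores, key=scores.get)

def blocks_to_prune(block_activations, labels, n_remove=2):
    """Select the n_remove least informative blocks for removal."""
    return rank_blocks(block_activations, labels)[:n_remove]
```

The short knowledge-distillation phases and channel slicing described above would then alternate with these removals; that loop is schematic here.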
Likelihood-guided Regularization in Attention Based Models
Positive · Artificial Intelligence
The paper introduces a novel likelihood-guided variational Ising-based regularization framework for Vision Transformers (ViTs), aimed at enhancing model generalization while dynamically pruning redundant parameters. This approach utilizes Bayesian sparsification techniques to impose structured sparsity on model weights, allowing for adaptive architecture search during training. Unlike traditional dropout methods, this framework learns task-adaptive regularization, improving efficiency and interpretability in classification tasks involving structured and high-dimensional data.
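As a rough illustration of gate-based structured sparsity of this kind, the sketch below places learnable soft gates over attention heads and adds a sparsity term plus an Ising-style coupling between neighboring gates; the relaxation and the exact prior are assumptions, not the paper's formulation:

```python
import torch
import torch.nn as nn

class GatedHeads(nn.Module):
    """Learnable soft gates over attention heads (a stand-in for variational Ising gates)."""

    def __init__(self, n_heads):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(n_heads))

    def gates(self):
        return torch.sigmoid(self.logits)           # soft keep-probabilities per head

    def regularizer(self, sparsity_weight=1e-3, coupling=1e-3):
        g = self.gates()
        sparsity = g.sum()                           # pushes gates toward zero
        # Ising-style coupling: encourage neighboring heads to switch on/off together.
        ising = ((g[1:] - g[:-1]) ** 2).sum()
        return sparsity_weight * sparsity + coupling * ising

# Training would add gated_heads.regularizer() to the task (likelihood) loss,
# so the data term decides which heads are worth keeping.
```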
A Systematic Analysis of Out-of-Distribution Detection Under Representation and Training Paradigm Shifts
Neutral · Artificial Intelligence
The article presents a systematic comparison of out-of-distribution (OOD) detection methods across different representation paradigms, specifically CNNs and Vision Transformers (ViTs). The study evaluates these methods using metrics such as AURC and AUGRC on datasets including CIFAR-10, CIFAR-100, SuperCIFAR-100, and TinyImageNet. Findings indicate that the learned feature space significantly influences OOD detection efficacy, with probabilistic scores being more effective for CNNs, while geometry-aware scores excel in ViTs under stronger shifts.
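For concreteness, two representative detectors of the kinds being compared: a probabilistic score (maximum softmax probability) and a geometry-aware score (Mahalanobis distance to class means in feature space). These are standard examples; the paper's exact score set is not listed in this summary:

```python
import numpy as np

def msp_score(logits):
    """Probabilistic score: maximum softmax probability (higher = more in-distribution)."""
    z = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return probs.max(axis=1)

def mahalanobis_score(features, class_means, shared_cov_inv):
    """Geometry-aware score: negative Mahalanobis distance to the closest class mean."""
    dists = []
    for mu in class_means:                        # one mean vector per class
        d = features - mu
        dists.append(np.einsum("nd,dk,nk->n", d, shared_cov_inv, d))
    return -np.min(np.stack(dists, axis=1), axis=1)
```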
Nearest Neighbor Projection Removal Adversarial Training
Positive · Artificial Intelligence
Deep neural networks have shown remarkable capabilities in image classification but are susceptible to adversarial examples. Traditional adversarial training improves robustness but often overlooks inter-class feature overlap, which contributes to this vulnerability. This study introduces an adversarial training framework that reduces inter-class proximity by projecting out, for both adversarial and clean samples, the feature-space component toward nearby samples of other classes. The method enhances feature separability and theoretically lowers the Lipschitz constant of the network, improving generalization.
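A minimal sketch of the projection-removal idea as the title suggests it: subtract each feature's projection onto its nearest neighbor from a different class (an assumed reading; the paper's exact construction may differ):

```python
import numpy as np

def remove_nn_projection(features, labels):
    """Subtract each feature's projection onto its nearest other-class neighbor.

    features: (n, d) feature vectors; labels: (n,) class ids (at least two classes).
    Removes the component pointing at the closest sample of a different class,
    reducing inter-class proximity in feature space.
    """
    out = features.copy()
    dists = np.linalg.norm(features[:, None] - features[None, :], axis=-1)
    for i in range(len(features)):
        other = np.flatnonzero(labels != labels[i])
        j = other[np.argmin(dists[i, other])]        # nearest other-class sample
        direction = features[j] / (np.linalg.norm(features[j]) + 1e-8)
        out[i] = features[i] - (features[i] @ direction) * direction
    return out
```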