Convergence Bound and Critical Batch Size of Muon Optimizer

arXiv — cs.LG · Monday, November 17, 2025 at 5:00:00 AM
The paper titled 'Convergence Bound and Critical Batch Size of Muon Optimizer' presents a theoretical analysis of the Muon optimizer, which has shown strong empirical performance and has been proposed as a successor to AdamW. The study provides convergence proofs for Muon in four practical settings, covering its behavior with and without Nesterov momentum and weight decay. It shows that including weight decay yields tighter theoretical bounds, and it identifies the critical batch size that minimizes total training cost, with the findings validated through experiments on image classification and language modeling.
— via World Pulse Now AI Editorial System
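For context, the sketch below shows what a Muon-style update step looks like in NumPy, based on the publicly described algorithm: heavy-ball momentum on the matrix gradient, approximate orthogonalization of the momentum via Newton-Schulz iterations, and optional Nesterov momentum and decoupled weight decay. The coefficients, learning rate, and exact variant handling are illustrative assumptions, not the configuration analyzed in the paper.

```python
# Illustrative sketch of one Muon-style update step (NumPy). This follows the
# commonly published description of Muon and is NOT the authors' reference
# implementation; hyperparameters here are placeholders.
import numpy as np

def newton_schulz_orthogonalize(M, steps=5, eps=1e-7):
    """Approximate the orthogonal factor U V^T of M via Newton-Schulz iterations."""
    a, b, c = 3.4445, -4.7750, 2.0315   # coefficients used in public Muon code
    X = M / (np.linalg.norm(M) + eps)   # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(W, grad, momentum, lr=0.02, beta=0.95,
              nesterov=True, weight_decay=0.0):
    """One update of a 2D weight matrix W given its gradient (assumed variant)."""
    momentum = beta * momentum + grad            # heavy-ball momentum buffer
    update = grad + beta * momentum if nesterov else momentum
    O = newton_schulz_orthogonalize(update)      # orthogonalized update direction
    if weight_decay > 0.0:                       # decoupled (AdamW-style) weight decay
        W = W * (1.0 - lr * weight_decay)
    W = W - lr * O
    return W, momentum

# Toy usage: one step on a random 256x128 weight matrix.
rng = np.random.default_rng(0)
W = rng.standard_normal((256, 128)) * 0.02
grad = rng.standard_normal(W.shape)
mom = np.zeros_like(W)
W, mom = muon_step(W, grad, mom, lr=0.02, beta=0.95, weight_decay=0.01)
```

The four practical settings mentioned above presumably correspond to toggling the Nesterov and weight-decay options in a step of this form.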


Recommended Readings
Mitigating Negative Flips via Margin Preserving Training
Positive · Artificial Intelligence
Minimizing inconsistencies across successive versions of an AI system is crucial in image classification, particularly as the number of training classes increases. Negative flips occur when an updated model misclassifies previously correctly classified samples. This issue intensifies with the addition of new categories, which can reduce the margin of each class and introduce conflicting patterns. A novel approach is proposed to preserve the margins of the original model while improving performance, encouraging a larger relative margin between learned and new classes.
SemanticNN: Compressive and Error-Resilient Semantic Offloading for Extremely Weak Devices
Positive · Artificial Intelligence
The article presents SemanticNN, a novel semantic codec designed for extremely weak embedded devices in the Internet of Things (IoT). It addresses the challenges of deploying artificial intelligence (AI) on such devices, which often face severe resource limitations and unreliable network conditions. SemanticNN targets semantic-level correctness despite bit-level errors, using a Bit Error Rate (BER)-aware decoder and a Soft Quantization (SQ)-based encoder to enhance collaborative inference offloading.
Non-Euclidean SGD for Structured Optimization: Unified Analysis and Improved Rates
Neutral · Artificial Intelligence
Recent advancements in non-Euclidean Stochastic Gradient Descent (SGD) methods, such as SignSGD, Lion, and Muon, have garnered attention for their effectiveness in training deep neural networks. However, previous theoretical analyses failed to adequately explain their superior performance compared to traditional Euclidean SGD. This study presents a unified convergence analysis that demonstrates how non-Euclidean SGD can leverage sparsity and low-rank structures, and benefit from techniques like extrapolation and momentum variance reduction, potentially matching the convergence rates of other m…
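As a minimal illustration of what "non-Euclidean" means for the SignSGD family mentioned above, the sketch below contrasts a standard Euclidean SGD step with a sign-based step, which is steepest descent under the l-infinity norm rather than the l2 norm. The learning rates are placeholders, not values from the paper.

```python
# Minimal sketch: Euclidean SGD vs. a sign-based (non-Euclidean) update.
# Learning rates are illustrative assumptions only.
import numpy as np

def sgd_step(w, grad, lr=1e-2):
    return w - lr * grad               # Euclidean (l2) steepest descent

def signsgd_step(w, grad, lr=1e-3):
    return w - lr * np.sign(grad)      # non-Euclidean (l-infinity) steepest descent
```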
LampQ: Towards Accurate Layer-wise Mixed Precision Quantization for Vision Transformers
Positive · Artificial Intelligence
The paper titled 'LampQ: Towards Accurate Layer-wise Mixed Precision Quantization for Vision Transformers' presents a new method for quantizing pre-trained Vision Transformer models. The proposed Layer-wise Mixed Precision Quantization (LampQ) addresses limitations in existing quantization methods, such as coarse granularity and metric scale mismatches. By employing a type-aware Fisher-based metric, LampQ aims to enhance both the efficiency and accuracy of quantization in various tasks, including image classification and object detection.