Convergence Bound and Critical Batch Size of Muon Optimizer

arXiv — cs.LG · Monday, November 24, 2025 at 5:00:00 AM
  • A new theoretical analysis examines the Muon optimizer, which has shown strong empirical performance and is viewed as a potential successor to standard optimizers such as AdamW. The study provides convergence proofs across several settings, examines how Nesterov momentum and weight decay affect the guarantees, and identifies the critical batch size that minimizes total training cost, clarifying the relationship between hyperparameters and training efficiency (a hedged sketch of the update follows the summary).
  • These results matter because they position Muon as a credible alternative in the optimization landscape, particularly for training neural networks. By pairing theoretical guarantees with practical guidance, the work deepens the understanding of how optimizers can be tuned for better performance in machine learning tasks.
  • The analysis also fits a broader trend toward adaptive optimization techniques that improve training efficiency in deep learning. The ongoing exploration of alternatives to traditional methods such as AdamW, including the recent AdamHD optimizer, underscores how quickly optimization strategies in AI continue to evolve to address challenges in model training and performance.
— via World Pulse Now AI Editorial System
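
For readers unfamiliar with Muon, the sketch below shows a Muon-style step in NumPy: an SGD momentum buffer is orthogonalized with a Newton-Schulz iteration before being applied, with optional Nesterov momentum and decoupled weight decay, the two ingredients the analysis examines. The Newton-Schulz coefficients and the exact Nesterov and weight-decay handling follow common public implementations and are assumptions here, not the paper's definitions.

```python
# Minimal sketch of a Muon-style step, assuming the commonly published recipe:
# an SGD momentum buffer whose update is orthogonalized with a Newton-Schulz
# iteration before being applied. Coefficients and the Nesterov / weight-decay
# handling are assumptions taken from public implementations, not the paper.
import numpy as np

def newton_schulz_orthogonalize(M, steps=5, eps=1e-7):
    """Approximate the orthogonal factor U V^T of M = U S V^T without an SVD."""
    a, b, c = 3.4445, -4.7750, 2.0315          # quintic iteration coefficients
    X = M / (np.linalg.norm(M) + eps)          # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:                             # keep the Gram matrix small
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(W, grad, momentum, lr=0.02, beta=0.95,
              nesterov=True, weight_decay=0.0):
    """One hedged Muon-style update on a 2-D weight matrix W."""
    momentum = beta * momentum + grad                    # accumulate momentum
    update = grad + beta * momentum if nesterov else momentum
    O = newton_schulz_orthogonalize(update)              # orthogonalized direction
    if weight_decay > 0.0:
        W = W * (1.0 - lr * weight_decay)                # decoupled weight decay
    return W - lr * O, momentum
```

The critical batch size discussed above is, informally, the batch size beyond which enlarging the batch no longer reduces the number of optimization steps proportionally, so total training cost stops improving; the paper characterizes where that point lies for Muon and how it depends on the hyperparameters.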

Continue Reading
NOVAK: Unified adaptive optimizer for deep neural networks
Positive · Artificial Intelligence
NOVAK, a recently introduced unified adaptive optimizer for deep neural networks, combines several advanced techniques, including adaptive moment estimation and lookahead synchronization, with the aim of improving the performance and efficiency of neural network training (an illustrative sketch of these two ingredients follows).
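
NOVAK's exact update rules are not given here; as a hedged illustration of the two named ingredients only, the sketch below pairs an Adam-style adaptive moment estimation step with a Lookahead-style slow-weight synchronization. Function names and hyperparameters are assumptions, not NOVAK's specification.

```python
# Illustrative only: an Adam-style adaptive moment step wrapped in a
# Lookahead-style slow/fast weight synchronization, the two ingredients named
# above. This is NOT the NOVAK algorithm; names and hyperparameters are assumed.
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One adaptive moment estimation step on the fast weights."""
    m = b1 * m + (1 - b1) * g                 # first-moment estimate
    v = b2 * v + (1 - b2) * g * g             # second-moment estimate
    m_hat = m / (1 - b1 ** t)                 # bias correction
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def lookahead_sync(slow_w, fast_w, alpha=0.5):
    """Every few inner steps, pull the slow weights toward the fast weights."""
    slow_w = slow_w + alpha * (fast_w - slow_w)
    return slow_w, slow_w.copy()              # restart fast weights at slow point
```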
Controlled LLM Training on Spectral Sphere
Positive · Artificial Intelligence
A new optimization strategy called the Spectral Sphere Optimizer (SSO) has been introduced to enhance the training of large language models (LLMs) by enforcing strict spectral constraints on weights and updates, addressing limitations found in existing optimizers like Muon.
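
The SSO update itself is not described above beyond "strict spectral constraints"; the snippet below shows one generic way to impose such a constraint, clipping the singular values of an update so its spectral norm stays within a bound. It illustrates the constraint type, not the SSO algorithm.

```python
# Illustrative only: enforcing a spectral-norm constraint by clipping singular
# values of an update. A generic projection, not the Spectral Sphere Optimizer.
import numpy as np

def clip_spectral_norm(delta, max_sigma=1.0):
    """Project a 2-D update onto the set {X : ||X||_2 <= max_sigma}."""
    U, s, Vt = np.linalg.svd(delta, full_matrices=False)
    return U @ np.diag(np.minimum(s, max_sigma)) @ Vt
```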
How Memory in Optimization Algorithms Implicitly Modifies the Loss
Neutral · Artificial Intelligence
Recent research identifies a memoryless optimization algorithm that approximates memory-dependent algorithms in deep learning, highlighting how memory shapes optimization dynamics. The approach replaces past iterates with the current one and adds a correction term derived from the memory, which can be interpreted as a perturbation of the loss function (illustrated schematically below).
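
As a simplified illustration of the "perturbation of the loss" reading, and not the cited paper's actual construction, heavy-ball momentum (a memory-dependent method) behaves in the small-step regime like a memoryless gradient step on a rescaled objective:

```latex
% Memory-dependent heavy-ball update:
x_{t+1} = x_t - \eta \nabla f(x_t) + \beta\,(x_t - x_{t-1})
% Replacing the past iterate x_{t-1} with the current one and folding the
% resulting correction into the objective gives a memoryless step:
x_{t+1} \approx x_t - \eta \nabla \tilde{f}(x_t),
\qquad
\tilde{f}(x) = \tfrac{1}{1-\beta}\, f(x) + \text{(correction terms)}
```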
