Unifying Sign and Magnitude for Optimizing Deep Vision Networks via ThermoLion

arXiv — cs.LG · Wednesday, December 3, 2025, 5:00 AM
  • ThermoLion is a novel approach to optimizing deep vision networks that dynamically modulates the update bitrate, addressing limitations of existing optimizers such as AdamW and Lion, which either amplify noise or discard crucial gradient information. The framework aims to improve model training under high-dimensional stochastic noise.
  • This development is significant because it targets persistent challenges in deep learning optimization, particularly in non-convex landscapes, and could lead to more robust and efficient training of deep vision models, which are critical across many AI applications.
  • The ongoing evolution of optimization techniques reflects a broader trend toward improving model performance and efficiency. As researchers explore alternatives to traditional methods, such as the Muon optimizer and adaptive strategies like AdamHD, the field is shifting toward more nuanced approaches that balance precision and robustness when training complex models.
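The article describes ThermoLion as interpolating between sign-based updates (as in Lion, which keeps only the gradient's direction) and magnitude-based updates (as in AdamW). The paper's actual update rule is not given here, so the sketch below is purely illustrative: `alpha` is a hypothetical gate between the two regimes, not a parameter from the paper.

```python
import numpy as np

def blended_step(param, grad, momentum, lr=1e-3, beta=0.9, alpha=0.5):
    """Illustrative blend of sign-based (Lion-style) and magnitude-based
    updates. `alpha` is a hypothetical gate: 0 keeps only the 1-bit sign
    of the momentum, 1 uses its full magnitude. This is NOT ThermoLion's
    published rule, just a sketch of the sign/magnitude trade-off."""
    # Exponential moving average of the gradient.
    m = beta * momentum + (1.0 - beta) * grad
    # Interpolate between a low-bitrate sign update and a full-precision one.
    step = (1.0 - alpha) * np.sign(m) + alpha * m
    return param - lr * step, m
```

With `alpha=0` every coordinate moves by exactly `lr`, which suppresses noisy gradient magnitudes but throws away scale information; `alpha=1` preserves magnitudes but lets outlier gradients dominate, which is the trade-off the article summarizes.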
— via World Pulse Now AI Editorial System


Continue Reading
Correction of Decoupled Weight Decay
Neutral · Artificial Intelligence
A recent study challenges the conventional treatment of decoupled weight decay in optimization algorithms, questioning the long-held assumption that the decay term should be proportional to the learning rate. Based on steady-state orthogonality arguments, the research suggests that proportionality to the square of the learning rate may be more appropriate. The authors also report that removing the perpendicular component of updates has minimal impact on training dynamics.
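The distinction at issue can be made concrete with a minimal sketch. In AdamW-style decoupled weight decay, the decay term is conventionally scaled by the learning rate; the study summarized above argues for scaling by its square. The helper below is hypothetical (not from either paper) and exists only to show the two scalings side by side.

```python
def decoupled_decay_update(param, step, lr, wd, square_lr=False):
    """Apply one decoupled-weight-decay update to a scalar parameter.

    Conventional (AdamW-style): param -= lr * step + lr * wd * param
    Proposed alternative:       param -= lr * step + lr**2 * wd * param

    `step` is the optimizer's (already-computed) update direction.
    This function is an illustrative sketch, not code from the study.
    """
    decay_coeff = (lr ** 2 if square_lr else lr) * wd
    return param - lr * step - decay_coeff * param
```

For small learning rates the lr-squared variant shrinks weights far more gently, which is why the choice of scaling changes how weight decay interacts with learning-rate schedules.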