Motif 2 12.7B technical report

arXiv — cs.CL · Wednesday, November 12, 2025 at 5:00:00 AM
On November 12, 2025, Motif-2-12.7B was unveiled, a model aimed at improving the efficiency of large language models. Building on its predecessor, Motif-2.6B, it integrates Grouped Differential Attention (GDA), which improves representational efficiency by separating signal-carrying from noise-suppressing attention pathways. The model is pre-trained on 5.5 trillion tokens using a curriculum-driven data scheduler that adjusts the data composition over the course of training. Training relies on the MuonClip optimizer together with system-level techniques such as fused PolyNorm activations and the Parallel Muon algorithm, yielding higher throughput and better memory efficiency. After pre-training, a three-stage supervised fine-tuning pipeline strengthens instruction following and compositional understanding. Motif-2-12.7B demonstrates competitive performance across various benchmarks, showcasing …
— via World Pulse Now AI Editorial System
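
To make the GDA idea more concrete, below is a minimal PyTorch sketch of the differential-attention mechanism that GDA builds on: two softmax attention maps are computed, and the second, acting as a learned noise estimate, is subtracted from the first. The function name differential_attention, the mixing weight lam, and the equal head counts on both paths are illustrative assumptions; the report's grouped, asymmetric allocation of heads between signal and noise paths is only gestured at in the comments, not reproduced.

```python
import torch
import torch.nn.functional as F

def differential_attention(q1, k1, q2, k2, v, lam=0.5):
    """Difference of two softmax attention maps: the second map acts as a
    noise estimate that is subtracted from the first (signal) map.
    Shapes: q*, k*, v are (batch, heads, seq, d_head)."""
    d = q1.size(-1)
    a1 = F.softmax(q1 @ k1.transpose(-2, -1) / d**0.5, dim=-1)  # signal path
    a2 = F.softmax(q2 @ k2.transpose(-2, -1) / d**0.5, dim=-1)  # noise path
    return (a1 - lam * a2) @ v

# Toy usage. A grouped/asymmetric variant would give the signal path more
# heads than the noise path; here both paths use the same head count.
b, h, t, d = 2, 4, 16, 32
q1, k1, q2, k2, v = (torch.randn(b, h, t, d) for _ in range(5))
out = differential_attention(q1, k1, q2, k2, v)
print(out.shape)  # torch.Size([2, 4, 16, 32])
```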


Recommended Readings
Kimi's K2 Open-Source Language Model Supports Dynamic Resource Availability and New Optimizer
Positive · Artificial Intelligence
Kimi has launched K2, a Mixture-of-Experts large language model with 32 billion activated parameters out of 1.04 trillion total parameters, trained on 15.5 trillion tokens. This release introduces MuonClip, an optimizer that extends Muon with a QK-clip technique aimed at mitigating training instability. Pre-training reportedly proceeded with zero loss spikes, indicating improved training reliability.
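
For context on the QK-clip technique mentioned above, here is a rough PyTorch sketch of the general idea: after an optimizer step, per-head query/key projection weights are shrunk whenever the observed maximum attention logit exceeds a cap, pulling logits back below the threshold and avoiding the blow-ups behind loss spikes. The helper qk_clip_, the threshold tau, and the split exponent alpha are illustrative assumptions, not the exact MuonClip implementation.

```python
import torch

def qk_clip_(w_q: torch.Tensor, w_k: torch.Tensor, max_logit: float,
             tau: float = 100.0, alpha: float = 0.5) -> None:
    """Rescale query/key projection weights in place when the observed
    maximum attention logit for this head exceeds the cap tau."""
    if max_logit > tau:
        gamma = tau / max_logit           # shrink factor in (0, 1)
        w_q.mul_(gamma ** alpha)          # split the shrinkage between Q and K
        w_k.mul_(gamma ** (1.0 - alpha))

# Toy usage with hypothetical per-head projection matrices.
w_q = torch.randn(64, 512)
w_k = torch.randn(64, 512)
qk_clip_(w_q, w_k, max_logit=250.0)       # logits exceeded the cap, so weights shrink
```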