Training Foundation Models on a Full-Stack AMD Platform: Compute, Networking, and System Design
Positive | Artificial Intelligence
- A large-scale mixture-of-experts (MoE) pretraining study has been conducted entirely on AMD hardware, using MI300X GPUs connected over the Pollara interconnect. The study offers practical guidance on system and model design, including microbenchmarks of the core communication collectives and MI300X microbenchmarks for kernel sizing and memory bandwidth (a minimal collective-benchmark sketch follows this list).
- The result is significant for AMD: it demonstrates that MI300X GPUs can handle demanding large-scale AI training workloads, which could strengthen the company's position in the competitive AI hardware market and attract further interest from researchers and developers.
- These advances in training large language models on AMD platforms reflect a growing trend toward optimizing hardware and systems for AI workloads, underscoring the importance of careful system design and rigorous performance benchmarking in the evolving machine learning landscape.
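
The sketch below illustrates the kind of collective microbenchmark the study describes: timing an all-reduce across GPUs and reporting bus bandwidth. It is not the paper's benchmark code; the message sizes, iteration counts, and `torchrun` launch are illustrative assumptions. On ROCm builds of PyTorch the `"nccl"` backend name maps to RCCL, so the same script runs on MI300X nodes.

```python
"""Hypothetical all-reduce bandwidth microbenchmark (sketch, not the paper's code).
Launch example (assumed): torchrun --nproc_per_node=8 allreduce_bench.py
"""
import os
import time

import torch
import torch.distributed as dist


def benchmark_allreduce(num_bytes: int, iters: int = 20, warmup: int = 5) -> float:
    """Return measured bus bandwidth in GB/s for a float32 all-reduce of num_bytes."""
    device = torch.device("cuda", int(os.environ["LOCAL_RANK"]))
    x = torch.ones(num_bytes // 4, dtype=torch.float32, device=device)

    # Warm-up iterations so lazy initialization does not skew the timing.
    for _ in range(warmup):
        dist.all_reduce(x)
    torch.cuda.synchronize(device)

    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize(device)
    elapsed = (time.perf_counter() - start) / iters

    # Ring all-reduce bus-bandwidth convention: 2 * (n-1)/n * bytes / time.
    n = dist.get_world_size()
    return 2 * (n - 1) / n * num_bytes / elapsed / 1e9


if __name__ == "__main__":
    dist.init_process_group(backend="nccl")  # resolves to RCCL on ROCm builds
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    for size in (1 << 20, 16 << 20, 256 << 20):  # 1 MiB, 16 MiB, 256 MiB (illustrative)
        bw = benchmark_allreduce(size)
        if dist.get_rank() == 0:
            print(f"all_reduce {size / 2**20:.0f} MiB: {bw:.1f} GB/s bus bandwidth")
    dist.destroy_process_group()
```

Sweeping message sizes like this is how such studies separate latency-bound small messages from bandwidth-bound large ones; the same timing loop can be pointed at other collectives (all-gather, reduce-scatter) to profile the interconnect more fully.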
— via World Pulse Now AI Editorial System