Next-Scale Autoregressive Models for Text-to-Motion Generation
- What Happened
MoScale is a new autoregressive framework for text-to-motion generation that produces motion hierarchically, from coarse to fine temporal resolutions. The coarsest scale supplies the global semantics of the motion, and each finer scale progressively refines them, which keeps generation aligned with the temporal structure of the motion. The framework is reported to achieve state-of-the-art performance on text-to-motion tasks.
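To make the coarse-to-fine idea concrete, here is a minimal sketch of a generic multi-scale residual decomposition along the time axis. It is not MoScale's implementation: the doubling schedule in `scale_lengths`, the average-pool `downsample`, the nearest-neighbour `upsample`, and all function names are assumptions chosen for illustration, and the motion is reduced to a single scalar channel. In a next-scale autoregressive model, each scale's content would be predicted from the text and the coarser scales; the sketch only shows how a sequence splits into a coarse summary plus progressively finer corrections.

```python
from typing import List


def scale_lengths(num_frames: int, num_scales: int) -> List[int]:
    """Temporal lengths from coarsest to finest (assumed schedule: doubling per scale)."""
    return [max(1, num_frames // (2 ** s)) for s in range(num_scales - 1, -1, -1)]


def downsample(seq: List[float], target_len: int) -> List[float]:
    """Average-pool a sequence into target_len contiguous bins."""
    n = len(seq)
    out = []
    for i in range(target_len):
        start = i * n // target_len
        end = max(start + 1, (i + 1) * n // target_len)
        chunk = seq[start:end]
        out.append(sum(chunk) / len(chunk))
    return out


def upsample(seq: List[float], target_len: int) -> List[float]:
    """Nearest-neighbour upsampling of a sequence to target_len frames."""
    n = len(seq)
    return [seq[min(n - 1, i * n // target_len)] for i in range(target_len)]


def coarse_to_fine_residuals(motion: List[float], lengths: List[int]) -> List[List[float]]:
    """Split a motion channel into per-scale residuals, coarsest first."""
    approx = [0.0] * len(motion)
    residuals = []
    for length in lengths:
        # What the scales so far have not yet explained.
        error = [m - a for m, a in zip(motion, approx)]
        residual = downsample(error, length)
        residuals.append(residual)
        # Add this scale's contribution back at full temporal resolution.
        up = upsample(residual, len(motion))
        approx = [a + u for a, u in zip(approx, up)]
    return residuals


def reconstruct(residuals: List[List[float]], num_frames: int) -> List[float]:
    """Sum the upsampled residuals of all scales to recover the motion."""
    approx = [0.0] * num_frames
    for residual in residuals:
        up = upsample(residual, num_frames)
        approx = [a + u for a, u in zip(approx, up)]
    return approx


# Toy single-channel "motion" of 8 frames, decomposed into 4 temporal scales.
motion = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
lengths = scale_lengths(len(motion), 4)
residuals = coarse_to_fine_residuals(motion, lengths)
recovered = reconstruct(residuals, len(motion))
```

In this toy run the coarsest scale holds a single value (the mean of the whole sequence), and because the finest scale has one entry per frame, summing all upsampled residuals recovers the original sequence up to floating-point error.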
- Why It Matters
MoScale is a notable advance in AI-driven motion generation. It is more robust when training data is limited, scales effectively with model size, and supports diverse motion generation and editing tasks with high training efficiency.
