One Attention, One Scale: Phase-Aligned Rotary Positional Embeddings for Mixed-Resolution Diffusion Transformer

arXiv — cs.CV · Wednesday, November 26, 2025 at 5:00:00 AM
  • A new approach called Cross-Resolution Phase-Aligned Attention (CRPA) has been introduced to address a critical failure mode of rotary positional embeddings (RoPE) in Diffusion Transformers during mixed-resolution denoising. When RoPE positions are linearly interpolated across resolutions, the resulting phase aliasing destabilizes the attention mechanism, producing artifacts or outright collapse.
  • CRPA is significant because it is a training-free fix: it modifies only the RoPE index map, so that all attention heads compare phases on a single shared scale rather than across incompatible ones. The aim is to restore reliable attention in pretrained Diffusion Transformers, which otherwise prove fragile under mixed-resolution inputs.
  • This development reflects ongoing efforts to optimize Diffusion Transformers, a technology that has gained traction in various AI applications, including video generation and visual synthesis. The introduction of frameworks like Plan-X and methods such as Pluggable Pruning with Contiguous Layer Distillation highlights a broader trend in the AI field towards improving efficiency and output quality in complex models, addressing both computational costs and performance stability.
— via World Pulse Now AI Editorial System
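The failure the summary describes can be illustrated numerically. The sketch below is not the paper's implementation; it is a minimal 1-D illustration, assuming a standard RoPE frequency schedule, of why linearly interpolated positions produce phases that match no high-resolution phase exactly, while indexing the low-resolution stream on the same integer grid (the idea behind a shared scale) keeps every phase aligned.

```python
import numpy as np

def rope_phases(positions, dim=8, base=10000.0):
    # Standard RoPE: phase theta_j * p for each position p and frequency j.
    freqs = 1.0 / (base ** (np.arange(dim // 2) / (dim // 2)))
    return np.outer(positions, freqs)  # shape (len(positions), dim // 2)

# High-resolution stream: 8 tokens at integer positions 0..7.
hi = rope_phases(np.arange(8))

# Naive mixed-resolution handling: a 4-token low-res stream whose
# positions are linearly interpolated onto the same span, yielding
# fractional indices like 2.33 and 4.67.
interp = rope_phases(np.linspace(0, 7, 4))

# Phase-aligned alternative: index the low-res stream on the SAME
# integer grid with a fixed stride, so every low-res phase coincides
# exactly with some high-res phase.
aligned = rope_phases(np.arange(0, 8, 2))

# Interpolated positions fall between grid points, so cross-resolution
# attention compares incommensurate angles (the aliasing failure):
print(np.isin(np.round(interp, 6), np.round(hi, 6)).all())   # False
# Stride-aligned positions reuse high-res phases verbatim:
print(np.isin(np.round(aligned, 6), np.round(hi, 6)).all())  # True
```

The stride-based index map here is a stand-in for the paper's actual CRPA construction, which is not detailed in this summary; the point is only that reindexing, unlike interpolation, keeps all phases on one comparable scale.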


Continue Reading
Terminal Velocity Matching
Positive · Artificial Intelligence
A new approach called Terminal Velocity Matching (TVM) has been proposed, which generalizes flow matching to enhance one- and few-step generative modeling. TVM models the transition between diffusion timesteps and regularizes behavior at terminal time, and is shown to upper-bound the 2-Wasserstein distance between the data and model distributions under certain conditions.
Rectified SpaAttn: Revisiting Attention Sparsity for Efficient Video Generation
Positive · Artificial Intelligence
The recent paper titled 'Rectified SpaAttn: Revisiting Attention Sparsity for Efficient Video Generation' addresses the challenges posed by attention computation in video generation, particularly the latency introduced by the quadratic complexity of Diffusion Transformers. The authors propose a new method, Rectified SpaAttn, which aims to improve attention allocation by rectifying biases in the attention weights assigned to critical and non-critical tokens.
Plan-X: Instruct Video Generation via Semantic Planning
Positive · Artificial Intelligence
A new framework named Plan-X has been introduced to enhance video generation through high-level semantic planning, addressing the limitations of existing Diffusion Transformers in visual synthesis. The framework incorporates a Semantic Planner, which utilizes multimodal language processing to interpret user intent and generate structured spatio-temporal semantic tokens for video creation.