arXiv:2511.00503v1 Announce Type: new 
Abstract: We introduce Diff4Splat, a feed-forward method that synthesizes controllable and explicit 4D scenes from a single image. Our approach unifies the generative priors of video diffusion models with geometry and motion constraints learned from large-scale 4D datasets. Given a single input image, a camera trajectory, and an optional text prompt, Diff4Splat directly predicts a deformable 3D Gaussian field that encodes appearance, geometry, and motion, all in a single forward pass, without test-time optimization or post-hoc refinement. At the core of our framework lies a video latent transformer, which augments video diffusion models to jointly capture spatio-temporal dependencies and predict time-varying 3D Gaussian primitives. Training is guided by objectives on appearance fidelity, geometric accuracy, and motion consistency, enabling Diff4Splat to synthesize high-quality 4D scenes in 30 seconds. We demonstrate the effectiveness of Diff4Splat across video generation, novel view synthesis, and geometry extraction, where it matches or surpasses optimization-based methods for dynamic scene synthesis while being significantly more efficient.

Diff4Splat generates controllable 4D scenes from a single image. By combining video diffusion models with learned geometry and motion constraints, it opens up possibilities for creators and developers in fields such as gaming and virtual reality, and it streamlines scene creation, making it more accessible and efficient.

Diff4Splat: Controllable 4D Scene Generation with Latent Dynamic Reconstruction Models

arXiv:2510.08318v2 Announce Type: replace 
Abstract: Video diffusion models (DMs) have enabled high-quality video synthesis. However, their computation costs scale quadratically with sequence length because self-attention has quadratic complexity. While linear attention lowers the cost, fully replacing quadratic attention requires expensive pretraining due to the limited expressiveness of linear attention and the complexity of spatiotemporal modeling in video generation. In this paper, we present LinVideo, an efficient data-free post-training framework that replaces a target number of self-attention modules with linear attention while preserving the original model's performance. First, we observe a significant disparity in the replaceability of different layers. Instead of manual or heuristic choices, we frame layer selection as a binary classification problem and propose selective transfer, which automatically and progressively converts layers to linear attention with minimal performance impact. Additionally, to overcome the ineffectiveness and inefficiency of existing objectives for this transfer process, we introduce an anytime distribution matching (ADM) objective that aligns the distributions of samples across any timestep along the sampling trajectory. This objective is efficient and recovers model performance. Extensive experiments show that our method achieves a 1.25-2.00x speedup while preserving generation quality, and our 4-step distilled model further delivers a 15.92x latency reduction with minimal visual quality drop.

LinVideo has been introduced as a post-training framework that enhances video generation efficiency by replacing certain self-attention modules with linear attention, addressing the quadratic computational costs associated with traditional video diffusion models. This method preserves the original model's performance while significantly reducing resource demands.
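
To make the cost argument concrete, here is a minimal sketch of kernelized linear attention next to its quadratic reference form. It is not LinVideo's implementation: the abstract does not say which linear attention variant is used, so the feature map elu(x) + 1 and all function names below are assumptions chosen for illustration. The point is only that summing the key-value products once lets every query reuse them, so the work grows linearly with sequence length instead of quadratically.

```python
import math


def feature_map(x):
    # Assumed kernel feature map: elu(x) + 1, which keeps every feature positive.
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def quadratic_kernel_attention(Q, K, V):
    """Reference form: every query is compared with every key, O(n^2) in length n."""
    fQ = [feature_map(q) for q in Q]
    fK = [feature_map(k) for k in K]
    dv = len(V[0])
    out = []
    for q in fQ:
        weights = [dot(q, k) for k in fK]  # n similarity scores per query
        total = sum(weights)
        out.append([sum(w * v[c] for w, v in zip(weights, V)) / total
                    for c in range(dv)])
    return out


def linear_attention(Q, K, V):
    """Same result, but keys and values are summarized once, O(n) in length n."""
    fQ = [feature_map(q) for q in Q]
    fK = [feature_map(k) for k in K]
    d, dv = len(fK[0]), len(V[0])
    S = [[0.0] * dv for _ in range(d)]  # running sum of phi(k) v^T, shape d x dv
    z = [0.0] * d                       # running sum of phi(k), used to normalize
    for k, v in zip(fK, V):
        for a in range(d):
            z[a] += k[a]
            for c in range(dv):
                S[a][c] += k[a] * v[c]
    out = []
    for q in fQ:
        denom = dot(q, z)
        out.append([sum(q[a] * S[a][c] for a in range(d)) / denom
                    for c in range(dv)])
    return out


if __name__ == "__main__":
    # Toy inputs: 3 tokens, feature size 2, value size 3.
    Q = [[0.1, -0.2], [0.5, 0.3], [-1.0, 0.7]]
    K = [[0.4, 0.1], [-0.3, 0.9], [0.2, -0.6]]
    V = [[1.0, 0.0, 2.0], [0.5, 1.5, -1.0], [-2.0, 0.3, 0.7]]
    print(quadratic_kernel_attention(Q, K, V))
    print(linear_attention(Q, K, V))
```

Both functions compute the same kernel attention and agree up to floating-point rounding. Neither reproduces softmax attention, which is the expressiveness gap the abstract points to when it says that fully replacing quadratic attention requires expensive pretraining.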
