arXiv:2511.00062v1 Announce Type: cross 
Abstract: We introduce Cosmos-Predict2.5, the latest generation of the Cosmos World Foundation Models for Physical AI. Built on a flow-based architecture, Cosmos-Predict2.5 unifies Text2World, Image2World, and Video2World generation in a single model and leverages Cosmos-Reason1, a Physical AI vision-language model, to provide richer text grounding and finer control of world simulation. Trained on 200M curated video clips and refined with reinforcement learning-based post-training, Cosmos-Predict2.5 achieves substantial improvements over Cosmos-Predict1 in video quality and instruction alignment, with models released at 2B and 14B scales. These capabilities enable more reliable synthetic data generation, policy evaluation, and closed-loop simulation for robotics and autonomous systems. We further extend the family with Cosmos-Transfer2.5, a ControlNet-style framework for Sim2Real and Real2Real world translation. Despite being 3.5× smaller than Cosmos-Transfer1, it delivers higher fidelity and robust long-horizon video generation. Together, these advances establish Cosmos-Predict2.5 and Cosmos-Transfer2.5 as versatile tools for scaling embodied intelligence. To accelerate research and deployment in Physical AI, we release source code, pretrained checkpoints, and curated benchmarks under the NVIDIA Open Model License at https://github.com/nvidia-cosmos/cosmos-predict2.5 and https://github.com/nvidia-cosmos/cosmos-transfer2.5. We hope these open resources lower the barrier to adoption and foster innovation in building the next generation of embodied intelligence.

Cosmos-Predict2.5, the latest Cosmos world foundation model for Physical AI, unifies Text2World, Image2World, and Video2World generation in a single model. Built on a flow-based architecture and trained on 200 million curated video clips, it offers richer text grounding and finer control of world simulation than its predecessor, Cosmos-Predict1.

World Simulation with Video Foundation Models for Physical AI
