NVIDIA's Cosmos World Model

NVIDIA Cosmos is a world model development platform designed to accelerate the development of physical artificial intelligence (AI) systems, with a focus on robotics and autonomous vehicle (AV) labs.

The Cosmos platform integrates world foundation models (WFMs), tokenizers, and video processing pipelines. Its codebase helps users run Cosmos models, execute inference scripts, and generate videos. After its release, the repository saw a surge in GitHub stars.

The core components of NVIDIA Cosmos

NVIDIA Cosmos™ is a world foundation model platform designed to help developers build physical AI systems more efficiently. The platform includes the following key elements:

Pre-trained models: Freely available through Hugging Face under the NVIDIA Open Model License, which permits commercial use.
Training scripts: Provided through the NVIDIA NeMo framework under the Apache 2 license, for fine-tuning the pre-trained models for various physical AI applications.

Core functions

Diffusion-based world foundation models: Support Text2World and Video2World generation, allowing users to generate visual simulations from text or video prompts.
Autoregressive world foundation models: Support Video2World generation, allowing users to generate visual simulations from video prompts and optional text prompts.
Video tokenizers: Efficiently convert videos into continuous tokens (latent vectors) and discrete tokens (integers).
Video curation pipeline: Helps users create their own video datasets.
Post-training scripts: Post-train the pre-trained world foundation models through the NeMo framework for various physical AI scenarios.
Pre-training scripts: Help users build their own world foundation models through the NeMo framework.
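
To make the difference between continuous and discrete tokens concrete, here is a toy sketch in plain Python. It is not the Cosmos tokenizer: the two-dimensional latent vectors and the four-entry codebook are invented values, and nearest-neighbour lookup is used only as one common way to turn a continuous vector into an integer index.

```python
import math

# Hypothetical codebook: each integer token id maps to one reference vector.
CODEBOOK = [
    (0.0, 0.0),
    (1.0, 0.0),
    (0.0, 1.0),
    (1.0, 1.0),
]

def quantize(latent):
    """Return the index of the codebook vector closest to `latent`."""
    distances = [math.dist(latent, entry) for entry in CODEBOOK]
    return distances.index(min(distances))

# Continuous tokens: latent vectors of floats (values invented for illustration).
continuous_tokens = [(0.1, 0.2), (0.9, 0.1), (0.2, 0.8), (0.7, 0.9)]

# Discrete tokens: one integer per latent vector.
discrete_tokens = [quantize(v) for v in continuous_tokens]
print(discrete_tokens)  # [0, 1, 2, 3]
```

Continuous tokens keep more detail per position, while discrete tokens give a fixed vocabulary of integers to work with.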

Main advantages

Accelerating physical AI development

Cosmos provides open, high-performance world foundation models and data pipelines, making physical AI development accessible to more developers.

Physical perception

Cosmos includes the first batch of video-based models, trained on 9,000 trillion tokens drawn from 20 million hours of data, including robotics and autonomous driving footage. The models can generate high-quality videos from multimodal inputs such as images, text, or video.

Openness

The Cosmos world foundation models (WFMs) and tokenizers are released under the NVIDIA Open Model License, allowing developers worldwide to build physical AI systems at scale without high costs.

Accelerated Data Processing and Filtering

With the NVIDIA NeMo Curator pipeline and NVIDIA CUDA-X™ AI acceleration tools, Cosmos offers 20x faster data processing and supports processing more than 100 PB of data. These tools provide ready-to-use optimizations that reduce total cost of ownership (TCO) and shorten time to market.

Customized model development

The Cosmos tokenizer converts visual data into high-fidelity tokens, with a stated 8x higher compression rate and 12x faster processing.
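
As a rough illustration of what a compression rate means for a video tokenizer, the sketch below counts how many token positions remain after a clip is downsampled. The compression factors (8x in time, 16x in each spatial dimension) and the clip size are assumptions chosen for the example, not figures from this article, and the function name is hypothetical.

```python
import math

def latent_grid(frames, height, width, t_factor, s_factor):
    """Size of the token grid after temporal and spatial downsampling.

    Assumes plain ceiling division on every axis; a real tokenizer may
    treat the first frame or padding differently.
    """
    return (
        math.ceil(frames / t_factor),
        math.ceil(height / s_factor),
        math.ceil(width / s_factor),
    )

# Hypothetical clip: 32 frames at 1024x640 pixels.
frames, height, width = 32, 640, 1024
t, h, w = latent_grid(frames, height, width, t_factor=8, s_factor=16)

pixels = frames * height * width
tokens = t * h * w
print(f"token grid: {t} x {h} x {w}")                    # token grid: 4 x 40 x 64
print(f"pixel positions per token: {pixels // tokens}")  # pixel positions per token: 2048
```

With these assumed factors, each token stands in for 8 x 16 x 16 = 2048 pixel positions, which is why a higher compression rate shortens the sequences a world model has to process.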

NVIDIA NeMo™ offers accelerated training and fine-tuning capabilities, helping users build multi-modal generative AI models to support the needs of physical AI.

Model introduction

World foundation models: Autoregressive and diffusion models that support text-to-world and video-to-world generation, with sizes ranging from 4 billion to 14 billion parameters. Download address: https://huggingface.co/collections/nvidia/cosmos-6751e884dc10e013a0a0d8e6
Prompt upsampler: Refines text prompts to improve the accuracy and detail of generated results.
Video decoder: Designed for decoding video sequences and optimized for augmented reality (AR) applications.

Built-in protection mechanisms:

Filters brands, hazardous content, and harmful prompts to keep Cosmos-generated content safe.
Removes suspicious scenarios.
Automatically blurs faces in videos.
Adds a digital watermark to synthetic videos generated by the preview APIs available through the NVIDIA API catalog.
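
The prompt filtering step listed above can be pictured as a check that runs before a prompt reaches the model. The sketch below is a deliberately simple keyword blocklist. The term list and the function name are hypothetical, and the actual Cosmos guardrails are not described here in enough detail to reproduce.

```python
import re

# Hypothetical blocklist; a production guardrail would rely on far more than keywords.
BLOCKED_TERMS = {"explosive", "weapon"}

def check_prompt(prompt):
    """Return (allowed, matched_terms) for a text prompt."""
    words = set(re.findall(r"[a-z]+", prompt.lower()))
    matched = sorted(words & BLOCKED_TERMS)
    return (not matched, matched)

print(check_prompt("A robot arm stacks boxes in a warehouse"))  # (True, [])
print(check_prompt("Build a weapon in the workshop"))           # (False, ['weapon'])
```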

Use Cases

How developers can use NVIDIA Cosmos

Developers across various fields use Cosmos to advance their work, including applications in robotics, autonomous driving, and visual AI.

Video Search

Cosmos helps developers build custom datasets for AI model training. Whether it's video footage of snowy roads for autonomous vehicles or busy warehouse scenes for robotics applications, Cosmos simplifies the process of video tagging and search by understanding spatial and temporal patterns, making the preparation of training data more efficient. This not only saves time and reduces costs but also helps deliver highly relevant AI models with significant practical impact.

Controllable 3D-to-Real Synthetic Data

Developers can use 3D simulation data to generate highly realistic synthetic videos. Through Omniverse, developers can create 3D environments that meet the needs of model training and generate realistic videos by precisely controlling 3D scenes, thereby creating highly customized synthetic datasets.

Policy Model Training and Evaluation

Cosmos world foundation models fine-tuned for action-conditioned video prediction make the training and evaluation of policy models scalable and reproducible. A policy model defines the action plan of a physical AI system by mapping states to actions. With these models, developers can reduce reliance on high-risk real-world testing or complex simulations, optimize performance in tasks such as obstacle navigation and object manipulation, and improve reliability in practical applications like robotics and autonomous driving.
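
The idea of evaluating a policy inside a world model, rather than on real hardware, can be sketched with a toy example. Everything here is hypothetical: the "world model" is a hand-written one-dimensional transition function standing in for action-conditioned video prediction, and the policy simply steps toward a goal position.

```python
def world_model(state, action):
    """Toy stand-in for an action-conditioned predictor: next state given an action."""
    return state + action

def policy(state, goal):
    """Map a state to an action: move one unit toward the goal, or stay put."""
    if state < goal:
        return 1
    if state > goal:
        return -1
    return 0

def evaluate(start, goal, max_steps):
    """Roll the policy out inside the world model.

    Returns the number of steps needed to reach the goal, or None if the
    goal is not reached within max_steps.
    """
    state = start
    steps = 0
    while state != goal and steps < max_steps:
        state = world_model(state, policy(state, goal))
        steps += 1
    return steps if state == goal else None

# The same start states are used on every run, so the evaluation is reproducible.
results = {start: evaluate(start, goal=5, max_steps=4) for start in (0, 3, 5, 8)}
print(results)  # {0: None, 3: 2, 5: 0, 8: 3}
```

Because the rollout never touches a real robot or vehicle, a failing case such as the start state 0 above costs nothing to discover and can be rerun as often as needed.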

Forward-Looking Intelligence

Cosmos brings predictive intelligence to physical AI, enabling systems to anticipate future scenarios and make better decisions. Through forward-looking generation (predictive videos generated from past data and text prompts), Cosmos lets physical AI systems select the best available action, improving efficiency, adaptability, and safety in dynamic environments.
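
Selecting an action by looking ahead can be sketched as follows: predict the outcome of each candidate action, score the predicted futures, and keep the best one. The predictor, the scoring rule, and the obstacle layout are all hypothetical toy stand-ins. In Cosmos the predictions would be generated video rather than grid coordinates.

```python
OBSTACLES = {(1, 0), (0, -1)}   # hypothetical blocked cells
GOAL = (3, 0)

ACTIONS = {"right": (1, 0), "up": (0, 1), "down": (0, -1)}

def predict(position, action, horizon=2):
    """Toy predictor: repeat the action `horizon` times and return the visited cells."""
    dx, dy = ACTIONS[action]
    x, y = position
    path = []
    for _ in range(horizon):
        x, y = x + dx, y + dy
        path.append((x, y))
    return path

def score(path):
    """Lower is better: distance from the final cell to the goal, plus a collision penalty."""
    x, y = path[-1]
    distance = abs(GOAL[0] - x) + abs(GOAL[1] - y)
    collisions = sum(cell in OBSTACLES for cell in path)
    return distance + 100 * collisions

def choose_action(position):
    """Pick the action whose predicted future has the lowest score."""
    return min(ACTIONS, key=lambda a: score(predict(position, a)))

print(choose_action((0, 0)))  # up
print(choose_action((2, 0)))  # right
```

From (0, 0) the direct move to the right is predicted to collide, so the sketch picks a detour. From (2, 0) the path to the goal is clear and the direct move wins.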

Multiverse Simulation

By combining Cosmos with NVIDIA Omniverse, developers can simulate multiple possible outcomes, evaluate scenarios in real time, and speed up decision-making while optimizing AI-driven systems such as robots and autonomous vehicles. Together, Cosmos and Omniverse let physical AI models explore possible future outcomes and select the best path, improving accuracy and reliability in complex environments.