Introduction#

NVIDIA Cosmos is a developer-first platform for designing Physical AI systems. Cosmos is divided into the following components, each with its own GitHub repository:

  • Cosmos-Predict2: A key branch of the [Cosmos World Foundation Models](https://www.nvidia.com/en-us/ai/cosmos) (WFMs) ecosystem for Physical AI, specializing in future state prediction through advanced world modeling. It offers two powerful capabilities: text-to-image generation for creating high-quality images from text descriptions, and video-to-world generation for producing visual simulations from video inputs.

  • Cosmos-Predict1: A collection of general-purpose world foundation models (WFMs) for inference, along with scripts for post-training these models for specific Physical AI use cases.

  • Cosmos-Transfer1: A set of pre-trained, diffusion-based conditional world models designed for multimodal, controllable world generation. These models can create world simulations based on multiple spatial control inputs across various modalities such as segmentation, depth, and edge maps. Cosmos-Transfer1 offers the flexibility to weight different conditional inputs differently at varying spatial locations and temporal instances, enabling highly customizable world generation. This capability is particularly useful for various world-to-world transfer applications, including Sim2Real.

  • Cosmos-Reason1: An [open](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license), customizable, reasoning vision language model (VLM) for Physical AI and robotics. It enables robots and vision AI agents to reason like humans, using prior knowledge, physics understanding, and common sense to understand and act in the real world. This model understands space, time, and fundamental physics, and can serve as a planning model to reason about which steps an embodied agent might take next.

    Cosmos-Reason1 excels at navigating the long tail of diverse physical world scenarios with spatial-temporal understanding. The Cosmos-Reason1 model is post-trained on physical common sense and embodied reasoning data using supervised fine-tuning and reinforcement learning. It uses chain-of-thought reasoning to understand world dynamics without human annotations.

Cosmos-Predict2#

Cosmos-Predict2 includes the following:

  • Diffusion-based world foundation models for Text2Image and Video2World generation, where a user can generate high-quality images from text prompts or visual simulations from video prompts.
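At a high level, both generation modes follow the same conditional diffusion recipe: start from noise and iteratively denoise toward a sample consistent with the conditioning signal. The toy sketch below illustrates that loop; `embed_prompt` and `denoise_step` are hypothetical stand-ins for the learned text encoder and denoising network, not the actual Cosmos-Predict2 API.

```python
import numpy as np

# Toy conditional diffusion sampling loop (illustrative only -- not the
# Cosmos-Predict2 API). A real WFM denoises video latents with a learned
# network; here a stand-in "denoiser" nudges noise toward a conditioning
# vector derived from the prompt.
rng = np.random.default_rng(0)

def embed_prompt(prompt: str, dim: int = 8) -> np.ndarray:
    """Hypothetical text encoder: hash the prompt into a unit vector."""
    rng_p = np.random.default_rng(abs(hash(prompt)) % (2**32))
    v = rng_p.standard_normal(dim)
    return v / np.linalg.norm(v)

def denoise_step(x, cond, t, total):
    """Stand-in for the learned denoiser: move x toward the condition,
    with step size growing as the noise level t decreases."""
    alpha = 1.0 - t / total
    return x + 0.5 * alpha * (cond - x)

cond = embed_prompt("a robot arm stacking blocks")
x = rng.standard_normal(8)          # start from pure noise
steps = 50
for t in range(steps, 0, -1):       # reverse diffusion: high noise -> low noise
    x = denoise_step(x, cond, t, steps)

print(np.linalg.norm(x - cond))     # distance to the condition (near zero)
```

The point of the sketch is the control flow, not the math: the same loop serves Text2Image (text condition only) and Video2World (text plus video-frame conditioning).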

Cosmos-Predict1#

The architecture of Cosmos-Predict1 is shown in the following figure:

_images/predict1_diagram.png

Cosmos-Predict1 includes the following components:

  • Diffusion Models: Generate visual simulations using text or video prompts.

  • Autoregressive Models: Generate visual simulations using video prompts along with optional text prompts.

  • Tokenizers: Encode images or videos into continuous tokens (latent vectors) or discrete tokens (integers) efficiently and effectively.

  • Post-training Scripts: Help developers post-train the diffusion and autoregressive models for their particular Physical AI use cases.

  • Pre-training Scripts: Help developers train their WFMs from scratch.
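The difference between the two token types can be shown with a toy example. The patching, projection matrix, and codebook below are illustrative stand-ins (random arrays), not the actual Cosmos tokenizer architecture.

```python
import numpy as np

# Conceptual sketch of continuous vs. discrete tokenization (illustrative
# only -- not the actual Cosmos tokenizer). Continuous tokenizers map image
# patches to latent vectors; discrete tokenizers map them to integer
# indices into a learned codebook (here a random stand-in).
rng = np.random.default_rng(0)
frame = rng.random((8, 8))                    # toy 8x8 "image"

# Continuous tokens: one latent vector per non-overlapping 4x4 patch.
patches = frame.reshape(2, 4, 2, 4).transpose(0, 2, 1, 3).reshape(4, 16)
continuous_tokens = patches @ rng.standard_normal((16, 3))   # 4 latents, dim 3
print(continuous_tokens.shape)                               # (4, 3)

# Discrete tokens: nearest codebook entry per latent (vector quantization).
codebook = rng.standard_normal((32, 3))                      # 32 code vectors
dists = np.linalg.norm(continuous_tokens[:, None] - codebook[None], axis=-1)
discrete_tokens = dists.argmin(axis=1)                       # 4 integer ids
```

Diffusion models typically consume the continuous latents, while autoregressive models predict the discrete integer ids one at a time.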

Examples#

Cosmos-Predict1-7B-Text2World-Multiview#

This video shows the text input and corresponding multiview output generated using inference with the Cosmos-Predict1-7B-Text2World-Multiview diffusion model.

Cosmos-Predict1-5B-Video2World#

This video shows the text and image input and the corresponding video output generated using inference with the Cosmos-Predict1-5B-Video2World autoregressive model.

Getting Started Workflow#

Follow these steps to explore the capabilities of Cosmos-Predict1:

  1. Use the Model Matrix page to determine the best model for your use case. Note that only a subset of models currently support post-training.

  2. Review the Prerequisites page and follow the Installation guide.

  3. Follow the steps in the Diffusion Quickstart Guide or Autoregressive Quickstart Guide to get familiar with the inference process.

  4. If you want to post-train a model for a particular Physical AI use case, follow the steps in the Diffusion Post-Training Guide or Autoregressive Post-Training Guide.

  5. To learn more about the inference options available for each model, refer to the Diffusion Model Reference or Autoregressive Model Reference.

Cosmos-Transfer1#

The architecture of Cosmos-Transfer1 is shown in the following figure:

_images/transfer1_diagram.png

Cosmos-Transfer1 includes the following components:

  • ControlNet-based single modality conditional world generation: Generate a visual simulation based on one of the following modalities: segmentation video, depth video, edge video, blur video, LiDAR video, or HDMap video. Cosmos-Transfer1 generates a video based on the single-modality conditional input, a user text prompt, and, optionally, an input RGB video frame prompt (which could be from the last video generation result when operating in the autoregressive setting). We will use Cosmos-Transfer1-7B [Modality] to refer to the model operating in this setting. For example, Cosmos-Transfer1-7B [Depth] refers to a depth ControlNet model.

  • MultiControlNet-based multimodal conditional world generation: Generate a visual simulation based on any combination of segmentation video, depth video, edge video, and blur video (LiDAR video and HDMap video in the AV sample), with a spatiotemporal control map that controls the strength of each modality across space and time. Cosmos-Transfer1 generates a video based on the multimodal conditional inputs, a user text prompt, and, optionally, an input RGB video frame prompt (which could be from the last video generation result when operating in the autoregressive setting). This is the preferred mode of Cosmos-Transfer1. We will refer to it as Cosmos-Transfer1-7B.

  • 4KUpscaler: Upscales 720p-resolution video to 4K resolution.
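The spatiotemporal control map described above can be sketched as per-modality weight volumes that are normalized and used to blend the conditional inputs. The array shapes and weight choices below are illustrative assumptions, not the Cosmos-Transfer1 implementation.

```python
import numpy as np

# Sketch of a spatiotemporal control map (illustrative only): per-modality
# weights of shape (modality, frame, height, width) decide how strongly
# each conditional input steers generation at every frame and pixel.
rng = np.random.default_rng(0)
T, H, W = 4, 8, 8                                   # frames, height, width
modalities = ["segmentation", "depth", "edge"]

controls = rng.random((len(modalities), T, H, W))   # stand-in control features
weights = np.zeros((len(modalities), T, H, W))
weights[0, :, :, : W // 2] = 1.0                    # segmentation drives the left half
weights[1, :, :, W // 2 :] = 1.0                    # depth drives the right half
weights[2, T // 2 :] = 0.5                          # edges blended into later frames

# Normalize so modality contributions sum to 1 wherever any control is active.
norm = weights.sum(axis=0, keepdims=True)
weights = np.divide(weights, norm, out=np.zeros_like(weights), where=norm > 0)

blended = (weights * controls).sum(axis=0)          # combined conditioning signal
print(blended.shape)                                # (4, 8, 8)
```

Varying the weight volumes over space and time is what makes the generation highly customizable: a user can, for example, let depth dominate the foreground while segmentation governs the background.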

Examples#

Cosmos-Transfer1-7B#

Cosmos-Transfer1-7B [LiDAR|HDMap]#

Cosmos-Reason1#

Cosmos-Reason1 uses the following model:

Additional Resources#

The Cosmos-Reason1 model is based on the Qwen2.5-VL model architecture.

License and Contact#

Cosmos-Reason1 will download and install additional third-party open source software projects. Review the license terms of these open source projects before use.