Introduction

NVIDIA Cosmos is a developer-first platform for designing Physical AI systems. Cosmos is divided into the following components, each with its own GitHub repository:

  • Cosmos-Predict2.5: A flow-based model specialized for simulating and predicting the future state of the world as a video. This model unifies Text2World, Image2World, and Video2World into a single model and uses Cosmos-Reason1, a Physical AI reasoning vision language model (VLM), as the text encoder. Cosmos-Predict2.5 significantly improves upon Cosmos-Predict1 in both quality and prompt alignment.

  • Cosmos-Predict2: A key branch of the Cosmos World Foundation Models (WFMs) ecosystem for Physical AI, specializing in future state prediction through advanced world modeling. It offers two powerful capabilities: text-to-image generation for creating high-quality images from text descriptions, and video-to-world generation for producing visual simulations from video inputs.

  • Cosmos-Predict1: A collection of general-purpose world foundation models (WFMs) for inference, along with scripts for post-training these models for specific Physical AI use cases.

  • Cosmos-Transfer2.5: A multi-controlnet designed to accept structured input of multiple video modalities, including RGB, depth, segmentation, and more. Users can configure generation using JSON-based controlnet_specs and run inference with just a few commands. It supports single-video inference, automatic control map generation, and multi-GPU setups.

  • Cosmos-Transfer1: A set of pre-trained, diffusion-based conditional world models designed for multimodal, controllable world generation. These models can create world simulations based on multiple spatial control inputs across various modalities such as segmentation, depth, and edge maps. Cosmos-Transfer1 offers the flexibility to weight different conditional inputs differently at varying spatial locations and temporal instances, enabling highly customizable world generation. This capability is particularly useful for various world-to-world transfer applications, including Sim2Real.

  • Cosmos-Reason1: An open, customizable, reasoning vision language model (VLM) for Physical AI and robotics. It enables robots and vision AI agents to reason like humans, using prior knowledge, physics understanding, and common sense to understand and act in the real world. This model understands space, time, and fundamental physics, and can serve as a planning model to reason about what steps an embodied agent might take next.

    Cosmos-Reason1 excels at navigating the long tail of diverse physical world scenarios with spatial-temporal understanding. The Cosmos-Reason1 model is post-trained with physical common sense and embodied reasoning data, including supervised fine-tuning and reinforcement learning. It uses chain-of-thought reasoning capabilities to understand world dynamics without human annotations.
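To make the JSON-based controlnet_specs mentioned for Cosmos-Transfer2.5 above more concrete, the sketch below builds one in Python and writes it to disk. The field names (`control_weight`, `input_control`) and the `depth`/`seg` keys are illustrative assumptions, not the repository's actual schema; consult the Cosmos-Transfer2.5 repository for the real format.

```python
import json

# Hypothetical controlnet_spec: field names here are illustrative only
# and may not match the actual Cosmos-Transfer2.5 schema.
controlnet_spec = {
    "depth": {
        "control_weight": 0.6,               # relative influence of the depth branch
        "input_control": "inputs/depth.mp4",  # path to the depth control video
    },
    "seg": {
        "control_weight": 0.4,               # relative influence of the segmentation branch
        "input_control": "inputs/seg.mp4",    # path to the segmentation control video
    },
}

# Serialize to the JSON file that would be passed to the inference command.
with open("controlnet_spec.json", "w") as f:
    json.dump(controlnet_spec, f, indent=2)
```

Weighting each control branch separately is what lets a multi-controlnet blend several modalities in one generation pass, rather than conditioning on a single input.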