Introduction#

The Cosmos-Predict1 models are a collection of general-purpose world foundation models for Physical AI.

NVIDIA NIM allows IT and DevOps teams to self-host Cosmos in their own managed environments, while still providing developers with industry-standard APIs for building advanced AI-powered applications. Leveraging cutting-edge GPU acceleration and scalable deployment, NIM offers the fastest path to inference with unparalleled performance.

To discover other NIMs and APIs, visit the API catalog.

Architecture#

NIMs are packaged as container images on a model-family basis. Each NIM is its own Docker container with a model, such as nvidia/cosmos-predict1-7b-text2world or nvidia/cosmos-predict1-7b-video2world. Models are automatically downloaded from NGC, leveraging a local filesystem cache if available. Each NIM is built from a common base, so once a NIM has been downloaded, additional NIMs can be downloaded quickly.

When a NIM is first deployed, it inspects the local hardware configuration and the available model versions in the model registry; it then automatically chooses the best version of the model for the available hardware.

NIMs are distributed as NGC container images through the NVIDIA NGC Catalog. A security scan report is available for each container in the NGC catalog, which provides a security rating of that image, a breakdown of CVE severity by package, and links to detailed information on the CVEs.

The Cosmos NIM sets up a Triton Inference Server that handles model serving and inference. This architecture provides the following:

  • HTTP API: A RESTful interface for making inference requests, checking model health, and retrieving metadata

  • Resource Management: Efficient GPU memory utilization and compute scheduling

  • Monitoring: Built-in metrics and observability endpoints

Applications can interact with the NIM through either the HTTP API or gRPC interface, depending on their specific requirements. The API Reference provides detailed information on available endpoints and request formats.
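
As a minimal readiness-check sketch in Python, the following polls the standard NIM health endpoint before sending inference traffic. It assumes the NIM is listening on localhost:8000 and exposes /v1/health/ready; adjust the host, port, and path to match your deployment.

```python
# Minimal readiness-check sketch for a locally running Cosmos NIM.
# Assumes the container listens on localhost:8000 and exposes the
# standard NIM readiness endpoint; adjust to match your deployment.
import time

import requests

BASE_URL = "http://localhost:8000"  # assumed host and port


def wait_until_ready(timeout_s: float = 300.0) -> bool:
    """Poll the readiness endpoint until the model is loaded or time runs out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            resp = requests.get(f"{BASE_URL}/v1/health/ready", timeout=5)
            if resp.status_code == 200:
                return True
        except requests.ConnectionError:
            pass  # server not up yet; keep polling
        time.sleep(5)
    return False


if __name__ == "__main__":
    print("NIM ready:", wait_until_ready())
```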

Pipeline Architecture#

The Cosmos-Predict1-7B model family features two NVIDIA NIMs:

  • Cosmos-Predict1-7B-Text2World: Performs text-to-world generation.

  • Cosmos-Predict1-7B-Video2World: Performs both image-to-world and video-to-world generation (an image can be represented as a single-frame video).
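
The sketch below shows what a Text2World inference request might look like over the HTTP API. The endpoint path (/v1/infer), payload fields, and response handling are illustrative assumptions; refer to the API Reference for the exact schema of your NIM version.

```python
# Hypothetical text-to-world request sketch. The endpoint path and
# payload fields are illustrative assumptions, not a confirmed schema.
import requests

BASE_URL = "http://localhost:8000"  # assumed host and port

payload = {
    "prompt": "A robot arm picks up a red cube from a cluttered workbench.",
    # Additional generation parameters (frame count, seed, etc.) would go
    # here if your NIM version supports them.
}

resp = requests.post(f"{BASE_URL}/v1/infer", json=payload, timeout=600)
resp.raise_for_status()

# Assumption: the response body carries the generated video bytes; some
# deployments instead return JSON with a URL or a base64-encoded asset.
with open("generated_world.mp4", "wb") as f:
    f.write(resp.content)
```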

The Cosmos NIM pipeline consists of several key components that work together to ensure safe and high-quality video generation:

Text Processing Components#

  • Tokenizer: Converts input text prompts into tokens that can be processed by the model.

  • T5 Encoder: Processes tokenized text through a T5 transformer model to create rich semantic embeddings that capture the meaning and context of the prompt.

  • Prompt Upsampler: Enhances the input prompt by adding relevant details and context to improve generation quality.
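
To make the tokenize-then-encode step concrete, here is an illustrative Python sketch using the public Hugging Face T5 encoder, with the small t5-small checkpoint as a lightweight stand-in. It mirrors the role the tokenizer and T5 encoder play in the pipeline; the NIM's actual checkpoint and preprocessing may differ.

```python
# Illustrative tokenize-then-encode sketch with a public T5 encoder.
# "t5-small" is a stand-in; the NIM's actual T5 checkpoint differs.
import torch
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("t5-small")
encoder = T5EncoderModel.from_pretrained("t5-small")

prompt = "A robot arm picks up a red cube from a cluttered workbench."
tokens = tokenizer(prompt, return_tensors="pt")  # prompt -> token ids

with torch.no_grad():
    # One embedding vector per token; embeddings like these condition the
    # video model through cross-attention.
    embeddings = encoder(**tokens).last_hidden_state  # (1, seq_len, d_model)

print(embeddings.shape)
```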

Visual Processing Components#

  • Visual Captioning: Generates a text description of the provided image or video input.

Safety and Guardrail Components#

The guardrail system implements multiple layers of safety checks:

  • Text Blocklist: Filters out prohibited words and phrases before processing.

  • Text Guardrail: Analyzes the semantic content of prompts to prevent unsafe or inappropriate generations.

  • Video Guardrail: Examines generated video frames to detect and filter out inappropriate content.

  • Face Blur: Detects and blurs faces in generated videos to protect privacy.
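
As a concrete illustration of the first layer, here is a minimal Python sketch of a text blocklist check. The placeholder terms and simple whole-word matching are illustrative only; the NIM's actual blocklist and semantic guardrails are considerably more sophisticated.

```python
# Minimal sketch of a text blocklist applied before any model processing.
# The terms and matching rules are placeholders for illustration only.
import re

BLOCKLIST = {"placeholder_banned_term"}  # stand-in for the real word list


def passes_blocklist(prompt: str) -> bool:
    """Reject the prompt if any blocklisted term appears as a whole word.

    Phrase matching and normalization are omitted for brevity.
    """
    words = set(re.findall(r"[a-z']+", prompt.lower()))
    return words.isdisjoint(BLOCKLIST)


if __name__ == "__main__":
    print(passes_blocklist("A robot stacks boxes in a warehouse."))  # True
```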

Model Architecture#

Cosmos-Predict1-7B is a specialized Transformer model designed for video denoising within the latent space. The model architecture consists of the following components:

  • Transformer Backbone: A deep network of interleaved self-attention, cross-attention, and feedforward layers.

  • Cross-Attention Mechanism: Enables text conditioning throughout the denoising process by attending to the text embeddings.

  • Adaptive Layer Normalization: Applied before each layer to incorporate the denoising timestep information.

  • Temporal Modeling: Specialized attention mechanisms that handle the temporal dynamics of video generation.

  • Latent Space Processing: Operates in a compressed latent space for efficient video generation.
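
The schematic PyTorch sketch below shows how these components might fit together in a single backbone block: adaptive layer normalization conditioned on the denoising timestep, self-attention over latent video tokens, cross-attention to the text embeddings, and a feedforward layer. The dimensions, wiring, and module choices are illustrative and do not reproduce the released architecture.

```python
# Schematic sketch of one backbone block; shapes and wiring are illustrative.
import torch
import torch.nn as nn


class AdaLN(nn.Module):
    """LayerNorm whose scale/shift are predicted from a conditioning vector."""

    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * dim)

    def forward(self, x, cond):
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)


class Block(nn.Module):
    def __init__(self, dim=512, cond_dim=256, heads=8):
        super().__init__()
        self.norm1 = AdaLN(dim, cond_dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = AdaLN(dim, cond_dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm3 = AdaLN(dim, cond_dim)
        self.ff = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x, text, cond):
        h = self.norm1(x, cond)
        # Latent video tokens attend to each other (temporal/spatial mixing).
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x, cond)
        # Cross-attention conditions generation on the text embeddings.
        x = x + self.cross_attn(h, text, text, need_weights=False)[0]
        return x + self.ff(self.norm3(x, cond))


# Toy shapes: 128 latent video tokens, 16 text tokens, one timestep embedding.
x = torch.randn(1, 128, 512)
text = torch.randn(1, 16, 512)
cond = torch.randn(1, 256)
print(Block()(x, text, cond).shape)  # torch.Size([1, 128, 512])
```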

The pipeline processes inputs sequentially:

  1. Text/visual input is processed through the text/visual components.

  2. Safety checks are performed on the input.

  3. The core model generates the video frames.

  4. Post-processing safety checks and face blurring are applied.

  5. The final video, meeting all quality and safety requirements, is produced.
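
The following Python sketch traces these five steps end to end. Every helper function is a placeholder stub standing in for the component described above, not an actual NIM API.

```python
# End-to-end flow of the pipeline steps above, with placeholder stubs.
from typing import List


def upsample_prompt(p: str) -> str: return p + " (detailed scene)"       # step 1 stub
def encode_with_t5(p: str) -> List[float]: return [0.0] * 8              # step 1 stub
def input_is_safe(p: str) -> bool: return True                           # step 2 stub
def denoise_latents(emb: List[float]) -> List[bytes]: return [b"frame"]  # step 3 stub
def blur_faces(frames: List[bytes]) -> List[bytes]: return frames        # step 4 stub
def output_is_safe(frames: List[bytes]) -> bool: return True             # step 4 stub
def encode_video(frames: List[bytes]) -> bytes: return b"".join(frames)  # step 5 stub


def generate_world(prompt: str) -> bytes:
    enhanced = upsample_prompt(prompt)        # 1. text processing
    embeddings = encode_with_t5(enhanced)
    if not input_is_safe(enhanced):           # 2. input safety checks
        raise ValueError("Prompt rejected by text guardrails")
    frames = denoise_latents(embeddings)      # 3. core generation
    frames = blur_faces(frames)               # 4. post-processing safety
    if not output_is_safe(frames):
        raise ValueError("Video rejected by video guardrail")
    return encode_video(frames)               # 5. final video


print(len(generate_world("A forklift moves pallets in a warehouse.")))
```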

The NVIDIA Developer Program#

Want to learn more about NIMs? Join the NVIDIA Developer Program to get free access for self-hosting NVIDIA NIMs and microservices on up to 16 GPUs on any infrastructure: cloud, data center, or personal workstation.

Once you join the free NVIDIA Developer Program, you can access NIMs through the NVIDIA API Catalog at any time. For enterprise-grade security, support, and API stability, select the option to access NIM through our free 90-day NVIDIA AI Enterprise Trial with a business email address.

Refer to the NVIDIA NIM FAQ for additional information.