Introduction#

This document describes the NVIDIA NIM for Cosmos WFM (World Foundation Models), which currently includes the Cosmos-Predict1 and Cosmos-Transfer WFMs.

Note

Refer to the NVIDIA NIM for VLMs site for Cosmos-Reason1 and Cosmos-Reason2 NIM documentation.

NVIDIA NIM allows IT and DevOps teams to self-host Cosmos in their own managed environments, while still providing developers with industry-standard APIs for building advanced AI-powered applications. Leveraging cutting-edge GPU acceleration and scalable deployment, NIM offers the fastest path to inference with unparalleled performance.

To discover other NIMs and APIs, visit the API catalog.

Architecture#

NIMs are packaged as container images on a model-family basis. Each NIM is its own Docker container with a model, such as nvidia/cosmos-predict1-7b-text2world, nvidia/cosmos-predict1-7b-video2world, and nvidia/cosmos-transfer2.5-2b. Models are automatically downloaded from NGC, leveraging a local filesystem cache if available. Each NIM is built from a common base, so once a NIM has been downloaded, additional NIMs can be downloaded quickly.

When a NIM is first deployed, it inspects the local hardware configuration and the available model versions in the model registry; it then automatically chooses the best version of the model for the available hardware.

NIMs are distributed as NGC container images through the NVIDIA NGC Catalog. A security scan report is available for each container in the NGC catalog, which provides a security rating of that image, a breakdown of CVE severity by package, and links to detailed information on the CVEs.

The NIM for Cosmos WFM sets up a Triton Inference Server that handles the model serving and inference operations. This architecture provides the following:

  • HTTP API: A RESTful interface for making inference requests, checking model health, and retrieving metadata

  • Resource Management: Efficient GPU memory utilization and compute scheduling

  • Monitoring: Built-in metrics and observability endpoints

Applications can interact with the NIM through either the HTTP API or gRPC interface, depending on their specific requirements. The API Reference provides detailed information on available endpoints and request formats.
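
A minimal client sketch is shown below. The endpoint paths and payload fields are illustrative assumptions based on common NIM conventions; refer to the API Reference for the exact schema.

```python
# Minimal sketch of an HTTP client for a locally deployed Cosmos NIM.
# ASSUMPTIONS: the NIM listens on port 8000, exposes a standard NIM
# readiness endpoint, and accepts inference requests at /v1/infer;
# verify all of these against the API Reference.
import requests

BASE_URL = "http://localhost:8000"

# Check that the model server is ready to accept requests.
ready = requests.get(f"{BASE_URL}/v1/health/ready", timeout=10)
print("Ready:", ready.status_code == 200)

# Submit a text-to-world request (hypothetical payload fields).
response = requests.post(
    f"{BASE_URL}/v1/infer",
    json={"prompt": "A robot arm assembling a gearbox on a factory floor."},
    timeout=600,  # video generation can take minutes
)
response.raise_for_status()

# The response body is assumed here to be the generated video; the actual
# API may instead return JSON containing the video or an asset reference.
with open("output.mp4", "wb") as f:
    f.write(response.content)
```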

Pipeline Architecture#

The Cosmos-Predict1-7B model family features two NVIDIA NIMs:

  • Cosmos-Predict1-7B-Text2World: Performs text-to-world generation.

  • Cosmos-Predict1-7B-Video2World: Performs both image-to-world and video-to-world generation (an image can be represented as a single-frame video).

The Cosmos-Transfer model family features one NVIDIA NIM:

  • Cosmos-Transfer2.5-2B: A more efficient and lightweight model for video-to-video style transfer, with enhanced control capabilities and faster inference times.

The Cosmos NIM pipeline consists of several key components that work together to ensure safe and high-quality video generation:

Text Processing Components#

  • Tokenizer: Converts input text prompts into tokens that can be processed by the model.

  • T5 Encoder: Processes tokenized text through a T5 transformer model to create rich semantic embeddings that capture the meaning and context of the prompt.

  • Prompt Upsampler: Enhances the input prompt by adding relevant details and context to improve generation quality.
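
The tokenizer and T5 encoder stages can be pictured with the following sketch, which uses a small public T5 checkpoint purely for illustration; the NIM runs an equivalent step internally with a much larger T5 variant.

```python
# Conceptual sketch of the tokenizer + T5 encoder stage using Hugging Face
# transformers. "t5-small" stands in for the much larger encoder the NIM
# uses internally; the shapes and data flow are what matter here.
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("t5-small")
encoder = T5EncoderModel.from_pretrained("t5-small")

prompt = "A drone flyover of a coastal city at sunset."
tokens = tokenizer(prompt, return_tensors="pt")    # text -> token IDs
embeddings = encoder(**tokens).last_hidden_state   # token IDs -> semantic embeddings
print(embeddings.shape)                            # (batch, sequence_length, hidden_size)
```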

Visual Processing Components#

  • Visual Captioning: Generates a text description of the provided image or video input.

Safety and Guardrail Components#

The guardrail system implements multiple layers of safety checks to ensure responsible AI usage:

  • Text Blocklist: Filters out prohibited words and phrases before processing.

  • Text Guardrail: Analyzes the semantic content of prompts to prevent unsafe or inappropriate generations.

  • Video Guardrail: Examines generated video frames to detect and filter out inappropriate content.

  • Face Blur: Detects and blurs faces in generated videos to protect privacy.

These guardrails are designed to prevent the generation of harmful, violent, or otherwise inappropriate content.
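
As a rough picture of the first layer, a blocklist check amounts to rejecting prompts that contain prohibited terms before any processing occurs. The sketch below is illustrative only; the NIM's actual guardrails are more sophisticated than a word-level match.

```python
# Illustrative blocklist-style pre-check. The terms and matching strategy
# are placeholders; the NIM's Text Blocklist is more sophisticated.
import re

BLOCKLIST = {"banned_term_a", "banned_term_b"}  # placeholder terms

def passes_blocklist(prompt: str) -> bool:
    """Return False if the prompt contains any blocked term."""
    words = set(re.findall(r"[a-z0-9_']+", prompt.lower()))
    return words.isdisjoint(BLOCKLIST)

assert passes_blocklist("A truck driving through snow at night")
```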

Note

If your request is unexpectedly rejected by the guardrails, consider rephrasing your prompt or adjusting your input content.

Model Architecture#

Cosmos-Predict1-7B is a specialized Transformer model designed for video denoising within the latent space. The model architecture consists of the following components:

  • Transformer Backbone: A deep network of interleaved self-attention, cross-attention, and feedforward layers.

  • Cross-Attention Mechanism: Enables text conditioning throughout the denoising process by attending to text embeddings.

  • Adaptive Layer Normalization: Applied before each layer to incorporate the diffusion time step into the denoising process.

  • Temporal Modeling: Special attention mechanisms to handle the temporal aspects of video generation.

  • Latent Space Processing: Works in a compressed latent space for efficient video generation.
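
A minimal PyTorch sketch of one such block follows, combining self-attention over video latent tokens, cross-attention to T5 text embeddings, adaptive layer normalization driven by the diffusion time step, and a feedforward layer. All dimensions are illustrative, not the model's actual configuration.

```python
import torch
import torch.nn as nn

class DenoisingBlock(nn.Module):
    """Illustrative denoising transformer block: self-attention,
    cross-attention to text embeddings, adaptive layer norm (adaLN)
    on the time-step embedding, and a feedforward layer."""

    def __init__(self, dim=512, text_dim=512, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(
            dim, heads, kdim=text_dim, vdim=text_dim, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.norm3 = nn.LayerNorm(dim, elementwise_affine=False)
        # adaLN: the time-step embedding produces a per-block scale and shift.
        self.ada = nn.Linear(dim, 2 * dim)

    def forward(self, latents, text_emb, t_emb):
        # latents:  (batch, num_latent_tokens, dim) -- flattened video latents
        # text_emb: (batch, text_len, text_dim)     -- T5 text embeddings
        # t_emb:    (batch, dim)                    -- diffusion time-step embedding
        scale, shift = self.ada(t_emb).chunk(2, dim=-1)
        x = self.norm1(latents) * (1 + scale[:, None]) + shift[:, None]
        latents = latents + self.self_attn(x, x, x, need_weights=False)[0]
        x = self.norm2(latents)
        latents = latents + self.cross_attn(x, text_emb, text_emb, need_weights=False)[0]
        return latents + self.ff(self.norm3(latents))

block = DenoisingBlock()
out = block(torch.randn(1, 64, 512), torch.randn(1, 20, 512), torch.randn(1, 512))
print(out.shape)  # torch.Size([1, 64, 512])
```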

The pipeline processes inputs sequentially (a schematic sketch follows this list):

  1. Text/visual input is processed through the text/visual components.

  2. Safety checks are performed on the input.

  3. The core model generates the video frames.

  4. Post-processing safety checks and face blurring are applied.

  5. The final video, meeting all quality and safety requirements, is produced.
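
The following schematic makes the ordering concrete. Every stage function is a hypothetical stand-in for a NIM-internal component, stubbed out so the control flow itself runs.

```python
# Schematic of the five steps above; all stage functions are stubs.
def upsample_prompt(p): return p + " Cinematic, highly detailed."   # stub
def passes_text_guardrails(p): return "unsafe" not in p.lower()     # stub
def encode_text(p): return [0.0] * 8                                # stub embedding
def denoise_latents(emb): return ["frame0", "frame1"]               # stub generation
def filter_unsafe_frames(frames): return frames                     # stub check
def blur_faces(frames): return frames                               # stub post-process
def encode_video(frames): return b"mp4-bytes"                       # stub encoder

def generate_world(prompt: str) -> bytes:
    enriched = upsample_prompt(prompt)                  # 1. text processing
    if not passes_text_guardrails(enriched):            # 2. input safety checks
        raise ValueError("Prompt rejected by guardrails")
    frames = denoise_latents(encode_text(enriched))     # 3. core video generation
    frames = blur_faces(filter_unsafe_frames(frames))   # 4. output safety + face blur
    return encode_video(frames)                         # 5. final video

print(generate_world("A forklift moving pallets in a warehouse"))
```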

Model Architecture for Cosmos-Transfer2.5-2B#

Figure: Cosmos-Transfer2.5-2B architecture diagram

Cosmos-Transfer2.5-2B is a newer, more efficient model that builds upon the Cosmos-Transfer1 architecture with several key improvements:

  • Compact Architecture: Contains only 2B parameters, compared to 7B for Transfer1, offering significantly faster inference times while maintaining high-quality outputs.

  • Enhanced Control Modalities: Supports four primary control types:

    • edge: Edge detection control with adjustable threshold presets ("very_low", "low", "medium", "high", "very_high") for fine-tuned edge sensitivity

    • depth: Depth estimation control for preserving 3D spatial structure

    • vis: Visual/blur control with preset blur strength options for maintaining scene aesthetics

    • seg: Segmentation control for structural consistency and style matching

  • Flexible Resolution Support: Supports multiple processing resolutions (256p, 480p, 512p, 720p). The output video resolution is determined by the input video resolution.

  • Chunk-based Processing: Implements efficient memory management through configurable video frame chunking (num_video_frames_per_chunk).

  • Conditional Frame Control: Allows specification of conditioning frames (0, 1, or 2) from the input video for better temporal consistency.

  • Simplified API: Provides streamlined parameter names (e.g., guidance instead of guidance_scale, and num_steps for the number of diffusion steps).

The Cosmos-Transfer2.5-2B model shares the same text processing pipeline and safety guardrails as other Cosmos models, ensuring consistent quality and safe outputs across the family.

Note

Transfer2.5-2B requires that at least one control modality (edge, depth, vis, or seg) be specified in each inference request. This ensures the model has sufficient guidance for the video transformation process.
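
A hypothetical request sketch tying these parameters together is shown below. The endpoint path, field names, and input encoding are assumptions; consult the API Reference for the exact schema.

```python
# Hypothetical Cosmos-Transfer2.5-2B request illustrating the parameters
# described above. ASSUMPTIONS: the /v1/infer path, the payload field
# names, and base64 video encoding are illustrative, not the actual API.
import base64
import requests

with open("input.mp4", "rb") as f:
    video_b64 = base64.b64encode(f.read()).decode()

payload = {
    "prompt": "The same scene re-rendered as a snowy winter evening.",
    "input_video": video_b64,
    "controls": {"edge": {"preset": "medium"}},  # at least one of edge/depth/vis/seg
    "guidance": 7.0,                     # replaces guidance_scale
    "num_steps": 35,                     # number of diffusion steps
    "num_video_frames_per_chunk": 57,    # chunk-based memory management
    "num_conditional_frames": 1,         # 0, 1, or 2 conditioning frames
}

resp = requests.post("http://localhost:8000/v1/infer", json=payload, timeout=1800)
resp.raise_for_status()
```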

The NVIDIA Developer Program#

Want to learn more about NIMs? Join the NVIDIA Developer Program to get free access to self-hosting NVIDIA NIMs and microservices on up to 16 GPUs on any infrastructure: cloud, data center, or personal workstation.

Once you join the free NVIDIA Developer Program, you can access NIMs through the NVIDIA API Catalog at any time. For enterprise-grade security, support, and API stability, select the option to access NIM through our free 90-day NVIDIA AI Enterprise Trial with a business email address.

Refer to the NVIDIA NIM FAQ for additional information.