Transfer Model Reference#

This page details the options available when using the Cosmos-Transfer1 model.

Sample Commands#

This section contains sample commands for the Transfer1 model.

Note

Before running these commands, ensure you have followed the steps in the Set up Cosmos Transfer1 section of the Quickstart Guide.

Edge Detection ControlNet#

The following command runs inference with the Transfer1 model to generate a high-quality visual simulation from a low-resolution edge-detect source video.

export CUDA_VISIBLE_DEVICES=0
export CHECKPOINT_DIR="${CHECKPOINT_DIR:=./checkpoints}"
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_transfer1/diffusion/inference/transfer.py \
    --checkpoint_dir $CHECKPOINT_DIR \
    --video_save_folder outputs/example1_single_control_edge \
    --controlnet_specs assets/inference_cosmos_transfer1_single_control_edge.json \
    --offload_text_encoder_model

The --controlnet_specs argument specifies the path to a JSON file that contains the following transfer specification:

{
    "prompt": "The video is set in a modern, well-lit office environment with a sleek, minimalist design. ...",
    "input_video_path" : "assets/example1_input_video.mp4",
    "edge": {
        "control_weight": 1.0
    }
}

This is the source edge-detect video (640x480):

This is the video generated by the Cosmos-Transfer1-7B model (960x704):

Multi-GPU Inference#

The following command performs the same inference as above, but with 4 GPUs.

export CUDA_VISIBLE_DEVICES="${CUDA_VISIBLE_DEVICES:=0,1,2,3}"
export CHECKPOINT_DIR="${CHECKPOINT_DIR:=./checkpoints}"
export NUM_GPU="${NUM_GPU:=4}"
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) torchrun --nproc_per_node=$NUM_GPU --nnodes=1 --node_rank=0 cosmos_transfer1/diffusion/inference/transfer.py \
    --checkpoint_dir $CHECKPOINT_DIR \
    --video_save_folder outputs/example1_single_control_edge \
    --controlnet_specs assets/inference_cosmos_transfer1_single_control_edge.json \
    --offload_text_encoder_model \
    --num_gpus $NUM_GPU

Inference with Prompt Upsampling#

You can use the prompt upsampler to convert a short text prompt into a longer, more detailed one. The prompt upsampler is enabled using the --upsample_prompt argument.

export CUDA_VISIBLE_DEVICES=0
export CHECKPOINT_DIR="${CHECKPOINT_DIR:=./checkpoints}"
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_transfer1/diffusion/inference/transfer.py \
    --checkpoint_dir $CHECKPOINT_DIR \
    --video_save_folder outputs/example1_single_control_edge_upsampled_prompt \
    --controlnet_specs assets/inference_cosmos_transfer1_single_control_edge_short_prompt.json \
    --offload_text_encoder_model \
    --upsample_prompt \
    --offload_prompt_upsampler

This is the original short prompt:

Robotic arms hand over a coffee cup to a woman in a modern office.

This is the upsampled prompt:

The video opens with a close-up of a robotic arm holding a coffee cup with a lid, positioned next to a coffee machine. The arm is metallic with a black wrist, and the coffee cup is white with a brown lid. The background shows a modern office environment with a woman in a blue top and black pants standing in the distance. As the video progresses, the robotic arm moves the coffee cup towards the woman, who approaches to receive it. The woman has long hair and is wearing a blue top and black pants. The office has a contemporary design with glass partitions, potted plants, and other office furniture.

This is the video generated using inference with the upsampled prompt:

Batch Inference#

The --batch_input_path argument allows you to run inference on a batch of video inputs. This argument specifies the path to a JSONL file that contains one video/image input per line, along with an optional “prompt” field for a corresponding text prompt.

{"visual_input": "path/to/video1.mp4", "prompt": "The video is set in a modern, well-lit office environment with a sleek, minimalist design..."}
{"visual_input": "path/to/video2.mp4"}

Inference can be performed as follows:

export CUDA_VISIBLE_DEVICES=0
export CHECKPOINT_DIR="${CHECKPOINT_DIR:=./checkpoints}"
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_transfer1/diffusion/inference/transfer.py \
    --checkpoint_dir $CHECKPOINT_DIR \
    --video_save_folder outputs/example1_single_control_edge \
    --controlnet_specs assets/inference_cosmos_transfer1_single_control_edge.json \
    --offload_text_encoder_model \
    --batch_input_path path/to/batch_input.jsonl

Multimodal Control#

The following --controlnet_specs JSON activates vis, edge, depth, and seg controls and applies uniform spatial weights.

{
    "prompt": "The video is set in a modern, well-lit office environment with a sleek, minimalist design. ...",
    "input_video_path" : "assets/example1_input_video.mp4",
    "vis": {
        "control_weight": 0.25
    },
    "edge": {
        "control_weight": 0.25
    },
    "depth": {
        "input_control": "assets/example1_depth.mp4",
        "control_weight": 0.25
    },
    "seg": {
        "input_control": "assets/example1_seg.mp4",
        "control_weight": 0.25
    }
}

It can be passed to the transfer.py script as follows:

export CUDA_VISIBLE_DEVICES=0
export CHECKPOINT_DIR="${CHECKPOINT_DIR:=./checkpoints}"
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_transfer1/diffusion/inference/transfer.py \
    --checkpoint_dir $CHECKPOINT_DIR \
    --video_save_folder outputs/example2_uniform_weights \
    --controlnet_specs assets/inference_cosmos_transfer1_uniform_weights.json \
    --offload_text_encoder_model

The following video is generated using this configuration.

Multimodal Control with Spatiotemporal Control Map#

The following --controlnet_specs JSON activates vis, edge, depth, and seg controls and applies spatiotemporal weights.

{
    "prompt": "The video is set in a modern, well-lit office environment with a sleek, minimalist design...",
    "input_video_path" : "assets/example1_input_video.mp4",
    "vis": {
        "control_weight": 0.5,
        "control_weight_prompt": "robotic arms . gloves"
    },
    "edge": {
        "control_weight": 0.5,
        "control_weight_prompt": "robotic arms . gloves"
    },
    "depth": {
        "control_weight": 0.5
    },
    "seg": {
        "control_weight": 0.5
    }
}

The ControlNet specification differs from the Multimodal Control example above in the following ways:

  • An additional control_weight_prompt is specified for the “vis” and “edge” modalities. This triggers the GroundingDINO+SAM2 pipeline, which runs video segmentation on the input video using the control_weight_prompt (e.g. robotic arms . gloves) and extracts a binarized spatiotemporal mask in which positive pixels receive a control_weight of 0.5 and negative pixels receive 0.0.

  • The portion of the prompt describing the woman’s clothing is changed to a cream-colored and brown shirt. Since this area of the video is conditioned only by depth and seg, there is no conflict with the color information from the vis modality.

In effect, the seg and depth modalities are applied uniformly everywhere, while vis and edge are applied only within the spatiotemporal mask given by the union of the robotic arms and gloves detections. Within the mask, the modality weights are normalized to sum to one, so vis, edge, depth, and seg are applied evenly there; for example, the four weights of 0.5 sum to 2.0 and are rescaled to 0.25 each.

The ControlNet specification can be passed to the transfer.py script as follows:

export CUDA_VISIBLE_DEVICES=0
export CHECKPOINT_DIR="${CHECKPOINT_DIR:=./checkpoints}"
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_transfer1/diffusion/inference/transfer.py \
    --checkpoint_dir $CHECKPOINT_DIR \
    --video_save_folder outputs/example3_spatiotemporal_weights \
    --controlnet_specs assets/inference_cosmos_transfer1_spatiotemporal_weights_auto.json \
    --offload_text_encoder_model

The following video is generated using this configuration.

The spatiotemporal mask extracted by the robotic arms . gloves prompt is shown below.

Autonomous Vehicle Transfer#

The following example performs inference using two ControlNet branches, “hdmap” and “lidar”, to transform the corresponding video inputs into a high-quality video simulation for autonomous vehicle (AV) applications.

#!/bin/bash
export PROMPT="The video is captured from a camera mounted on a car. The camera is facing forward. The video showcases a scenic golden-hour drive through a suburban area, bathed in the warm, golden hues of the setting sun. The dashboard camera captures the play of light and shadow as the sun’s rays filter through the trees, casting elongated patterns onto the road. The streetlights remain off, as the golden glow of the late afternoon sun provides ample illumination. The two-lane road appears to shimmer under the soft light, while the concrete barrier on the left side of the road reflects subtle warm tones. The stone wall on the right, adorned with lush greenery, stands out vibrantly under the golden light, with the palm trees swaying gently in the evening breeze. Several parked vehicles, including white sedans and vans, are seen on the left side of the road, their surfaces reflecting the amber hues of the sunset. The trees, now highlighted in a golden halo, cast intricate shadows onto the pavement. Further ahead, houses with red-tiled roofs glow warmly in the fading light, standing out against the sky, which transitions from deep orange to soft pastel blue. As the vehicle continues, a white sedan is seen driving in the same lane, while a black sedan and a white van move further ahead. The road markings are crisp, and the entire setting radiates a peaceful, almost cinematic beauty. The golden light, combined with the quiet suburban landscape, creates an atmosphere of tranquility and warmth, making for a mesmerizing and soothing drive."
export CUDA_VISIBLE_DEVICES=0
export CHECKPOINT_DIR="${CHECKPOINT_DIR:=./checkpoints}"
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_transfer1/diffusion/inference/transfer.py \
    --checkpoint_dir $CHECKPOINT_DIR \
    --video_save_name output_video \
    --video_save_folder outputs/sample_av_multi_control \
    --prompt "$PROMPT" \
    --sigma_max 80 \
    --offload_text_encoder_model --is_av_sample \
    --controlnet_specs assets/sample_av_multi_control_spec.json

The assets/sample_av_multi_control_spec.json file contains the following ControlNet specification:

{
    "hdmap": {
        "control_weight": 0.3,
        "input_control": "assets/sample_av_multi_control_input_hdmap.mp4"
    },
    "lidar": {
        "control_weight": 0.7,
        "input_control": "assets/sample_av_multi_control_input_lidar.mp4"
    }
}

Note

In this example, the input prompt and some other parameters are provided through command-line arguments rather than through the ControlNet specification file. This lets you keep fixed parameters in the spec file and vary dynamic parameters on the command line.

This is the input_control for HDMap:

Multi-GPU Inference#

The following command performs the same inference as above, but with 4 GPUs.

export CUDA_VISIBLE_DEVICES="${CUDA_VISIBLE_DEVICES:=0,1,2,3}"
export CHECKPOINT_DIR="${CHECKPOINT_DIR:=./checkpoints}"
export NUM_GPU="${NUM_GPU:=4}"
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) torchrun --nproc_per_node=$NUM_GPU --nnodes=1 --node_rank=0 cosmos_transfer1/diffusion/inference/transfer.py \
    --checkpoint_dir $CHECKPOINT_DIR \
    --video_save_name output_video \
    --video_save_folder outputs/sample_av_multi_control \
    --prompt "$PROMPT" \
    --sigma_max 80 \
    --offload_text_encoder_model --is_av_sample \
    --controlnet_specs assets/sample_av_multi_control_spec.json \
    --num_gpus $NUM_GPU

Additional AV Toolkits#

Additional AV toolkits are available from this GitHub repo provided by NVIDIA. This repo includes the following:

  • 10 additional raw data samples (e.g. HDMap and LiDAR), along with scripts to preprocess and render them into model-compatible inputs.

  • Rendering scripts for converting other datasets, such as the Waymo Open Dataset, into inputs compatible with Cosmos-Transfer1.

4K Upscaling#

The following command performs 4K upscaling on a 1280x704 input video.

export CUDA_VISIBLE_DEVICES=0
export CHECKPOINT_DIR="${CHECKPOINT_DIR:=./checkpoints}"
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_transfer1/diffusion/inference/transfer.py \
    --checkpoint_dir $CHECKPOINT_DIR \
    --video_save_folder outputs/inference_upscaler \
    --controlnet_specs assets/inference_upscaler.json \
    --num_steps 10 \
    --offload_text_encoder_model

The assets/inference_upscaler.json file contains the following:

{
    "input_video_path" : "assets/inference_upscaler_input_video.mp4",
    "upscale": {
        "control_weight": 0.5
    }
}

This is the input video (1280x704), which was generated by the Cosmos-Predict1-7B-Text2World model:

This is the upscaled output video (3840x2112):

Multi-GPU Inference#

The following command performs the same 4K upscaling task as above, but with 4 GPUs.

export CUDA_VISIBLE_DEVICES="${CUDA_VISIBLE_DEVICES:=0,1,2,3}"
export CHECKPOINT_DIR="${CHECKPOINT_DIR:=./checkpoints}"
export NUM_GPU="${NUM_GPU:=4}"
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) torchrun --nproc_per_node=$NUM_GPU --nnodes=1 --node_rank=0 cosmos_transfer1/diffusion/inference/transfer.py \
    --checkpoint_dir $CHECKPOINT_DIR \
    --video_save_folder outputs/inference_upscaler \
    --controlnet_specs assets/inference_upscaler.json \
    --num_steps 10 \
    --offload_text_encoder_model \
    --num_gpus $NUM_GPU

Arguments#

  • --controlnet_specs: JSON file that configures Multi-ControlNet operations. Refer to the ControlNet Specification section below for more details. Default: JSON

  • --checkpoint_dir: Directory containing model weights. Default: “checkpoints”

  • --tokenizer_dir: Directory containing tokenizer weights. Default: “Cosmos-Tokenize1-CV8x8x8-720p”

  • --input_video_path: Path to the input video. Default: None

  • --video_save_name: Output video filename for single-video generation. Default: “output”

  • --video_save_folder: Output directory for batch video generation. Default: “outputs/”

  • --prompt: Text prompt for video generation. Default: “The video captures a stunning, photorealistic scene with remarkable attention to detail, giving it a lifelike appearance that is almost indistinguishable from reality. It appears to be from a high-budget 4K movie, showcasing ultra-high-definition quality with impeccable resolution.”

  • --negative_prompt: Negative prompt for improved quality. Default: “The video captures a video game with bad graphics and cartoonish frames. It represents a recording of old outdated games. The lighting looks very fake. The textures are very raw and basic. The geometries are very primitive. The images are very pixelated and of poor CG quality. Overall, the video is not realistic at all.”

  • --num_steps: Number of diffusion sampling steps. Default: 35

  • --guidance: CFG guidance scale. Default: 7.0

  • --sigma_max: Level of partial noise added to the input video, in the range [0, 80.0]. A value of 80.0 or higher ignores the input video entirely and provides the model with pure noise. Default: 70.0

  • --blur_strength: Strength of blurring when preparing the control input for the “vis” controlnet. Valid values are ‘very_low’, ‘low’, ‘medium’, ‘high’, and ‘very_high’. Default: ‘medium’

  • --canny_threshold: Threshold for Canny edge detection when preparing the control input for the “edge” controlnet. A lower threshold results in more edges being detected. Valid values are ‘very_low’, ‘low’, ‘medium’, ‘high’, and ‘very_high’. Default: ‘medium’

  • --fps: Output frames per second. Default: 24

  • --seed: Random seed. Default: 1

  • --offload_text_encoder_model: Offload the text encoder after inference; used for low-memory GPUs. Default: False

  • --offload_guardrail_models: Offload the guardrail models after inference; used for low-memory GPUs. Default: False

  • --upsample_prompt: Upsample the prompt using the prompt upsampler model. Default: False

  • --offload_prompt_upsampler: Offload the prompt upsampler model after inference; used for low-memory GPUs. Default: False
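
As an illustration of how several of these arguments can be combined in a single run, consider the following command. The output folder and the specific values shown are arbitrary examples, not recommended settings.

export CUDA_VISIBLE_DEVICES=0
export CHECKPOINT_DIR="${CHECKPOINT_DIR:=./checkpoints}"
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_transfer1/diffusion/inference/transfer.py \
    --checkpoint_dir $CHECKPOINT_DIR \
    --video_save_folder outputs/example_custom_settings \
    --controlnet_specs assets/inference_cosmos_transfer1_single_control_edge.json \
    --num_steps 35 \
    --guidance 7.0 \
    --sigma_max 70 \
    --fps 24 \
    --seed 42 \
    --offload_text_encoder_model \
    --offload_guardrail_models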

ControlNet Specification#

The --controlnet_specs argument specifies a JSON file that configures Multi-ControlNet operations. The JSON file can contain the following fields:

  • prompt: The global text prompt that all underlying networks will receive

  • input_video_path: The input video

  • sigma_max: The level of noise added to the input video before it is fed through the base model branch

  • vis: Activates the “vis” ControlNet branch.

  • edge: Activates the “edge” ControlNet branch.

  • depth: Activates the “depth” ControlNet branch.

  • seg: Activates the “seg” ControlNet branch.

  • control_weight: A number in the range [0, 1] that controls how strongly the ControlNet branch affects the output of the model. The larger the value (i.e., the closer to 1.0), the more closely the generated video adheres to the ControlNet input; however, this rigidity may come at the cost of quality. Lower values give the model more creative liberty at the cost of reduced adherence. A mid-range value near 0.5 usually yields optimal results.

  • The inputs to each ControlNet branch are automatically computed according to the branch:

    • vis: Applies bilateral blurring to the input video to compute the input_control for that branch.

    • edge: Uses Canny Edge Detection to compute the Canny edge input_control from the input video.

    • depth: Uses DepthAnything to compute the depth map as input_control from the input video.

    • seg: Uses Segment Anything Model 2 for generating the segmentation map as input_control from the input video.

Note the following about the ControlNet specification:

  • At each spatiotemporal site, if the sum of the control maps across different modalities is greater than one, normalization is applied to the modality weights so that the sum is 1.

  • For depth and seg, if the input_control is not provided, DepthAnything2 and GroundingDino+SAM2 will be run on the video specified by the input_video_path to generate the corresponding input_control. Refer to the assets/inference_cosmos_transfer1_uniform_weights_auto.json file as an example.

  • For seg, an input_control_prompt can be provided to customize the prompt sent to GroundingDino. Use . to separate objects in the input_control_prompt (e.g. robotic arms . woman . cup), as suggested in the GroundingDino README. If input_control_prompt is not provided, the global prompt is used by default. Refer to assets/inference_cosmos_transfer1_uniform_weights_auto.json as an example, and see the illustrative specification sketch below.
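
For reference, the following specification sketch combines several of these fields: a global prompt, a sigma_max override, a depth control whose input_control is computed automatically, and a seg control with a custom input_control_prompt. The values and prompts are illustrative placeholders, not the contents of the shipped asset files.

{
    "prompt": "The video is set in a modern, well-lit office environment with a sleek, minimalist design. ...",
    "input_video_path": "assets/example1_input_video.mp4",
    "sigma_max": 70,
    "depth": {
        "control_weight": 0.5
    },
    "seg": {
        "control_weight": 0.5,
        "input_control_prompt": "robotic arms . woman . cup"
    }
}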

Prompting Guidelines#

The input prompt is the most important parameter under your control when interacting with the model. Providing rich and descriptive prompts can positively impact the output quality of the model, whereas short and poorly detailed prompts can lead to subpar video generation. Here are some recommendations to keep in mind when crafting text prompts for the model:

  1. Describe a single, captivating scene: Focus on a single scene to prevent the model from generating videos with unnecessary shot changes.

  2. Limit camera control instructions: The model doesn’t handle prompts involving camera control well, as this feature is still under development.

Safety Features#

The Cosmos-Transfer1 models use a built-in safety guardrail system that cannot be disabled. Generating human faces is not allowed, and any generated faces will be blurred by the guardrail.

For more information, refer to the Cosmos Guardrail page.