Transfer Model Reference#
This page details the options available when using the Cosmos-Transfer1 model.
Sample Commands#
This section contains sample commands for the Transfer1 model.
Note
Before running these commands, ensure you have followed the steps in the Set up Cosmos Transfer1 section of the Quickstart Guide.
Edge Detection ControlNet#
The following command runs inference with the Transfer1 model to generate a high-quality visual simulation from a low-resolution edge-detect source video.
export CUDA_VISIBLE_DEVICES=0
export CHECKPOINT_DIR="${CHECKPOINT_DIR:=./checkpoints}"
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_transfer1/diffusion/inference/transfer.py \
--checkpoint_dir $CHECKPOINT_DIR \
--video_save_folder outputs/example1_single_control_edge \
--controlnet_specs assets/inference_cosmos_transfer1_single_control_edge.json \
--offload_text_encoder_model
The --controlnet_specs argument specifies the path to a JSON file that contains the following transfer specifications:
{
"prompt": "The video is set in a modern, well-lit office environment with a sleek, minimalist design. ...",
"input_video_path" : "assets/example1_input_video.mp4",
"edge": {
"control_weight": 1.0
}
}
This is the source edge-detect video (640x480):
This is the video generated by the Cosmos-Transfer1-7B model (960x704):
Multi-GPU Inference#
The following command performs the same inference as above, but with 4 GPUs.
export CUDA_VISIBLE_DEVICES="${CUDA_VISIBLE_DEVICES:=0,1,2,3}"
export CHECKPOINT_DIR="${CHECKPOINT_DIR:=./checkpoints}"
export NUM_GPU="${NUM_GPU:=4}"
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) torchrun --nproc_per_node=$NUM_GPU --nnodes=1 --node_rank=0 cosmos_transfer1/diffusion/inference/transfer.py \
--checkpoint_dir $CHECKPOINT_DIR \
--video_save_folder outputs/example1_single_control_edge \
--controlnet_specs assets/inference_cosmos_transfer1_single_control_edge.json \
--offload_text_encoder_model \
--num_gpus $NUM_GPU
Inference with Prompt Upsampling#
You can use the prompt upsampler to convert a short text prompt into a longer, more detailed one. The prompt upsampler is enabled using the --upsample_prompt argument.
export CUDA_VISIBLE_DEVICES=0
export CHECKPOINT_DIR="${CHECKPOINT_DIR:=./checkpoints}"
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_transfer1/diffusion/inference/transfer.py \
--checkpoint_dir $CHECKPOINT_DIR \
--video_save_folder outputs/example1_single_control_edge_upsampled_prompt \
--controlnet_specs assets/inference_cosmos_transfer1_single_control_edge_short_prompt.json \
--offload_text_encoder_model \
--upsample_prompt \
--offload_prompt_upsampler
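The spec file used here is not reproduced on this page. A minimal sketch of what assets/inference_cosmos_transfer1_single_control_edge_short_prompt.json might contain, assuming it pairs the short prompt shown below with the same input video and edge control as the first example, is:
{
"prompt": "Robotic arms hand over a coffee cup to a woman in a modern office.",
"input_video_path" : "assets/example1_input_video.mp4",
"edge": {
"control_weight": 1.0
}
}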
This is the original short prompt:
Robotic arms hand over a coffee cup to a woman in a modern office.
This is the upsampled prompt:
The video opens with a close-up of a robotic arm holding a coffee cup with a lid, positioned next to a coffee machine. The arm is metallic with a black wrist, and the coffee cup is white with a brown lid. The background shows a modern office environment with a woman in a blue top and black pants standing in the distance. As the video progresses, the robotic arm moves the coffee cup towards the woman, who approaches to receive it. The woman has long hair and is wearing a blue top and black pants. The office has a contemporary design with glass partitions, potted plants, and other office furniture.
This is the video generated using inference with the upsampled prompt:
Batch Inference#
The --batch_input_path argument allows you to run inference on a batch of video inputs. This argument specifies the path to a JSONL file, which contains one video/image input per line, along with an optional "prompt" field for a corresponding text prompt:
{"visual_input": "path/to/video1.mp4", "prompt": "The video is set in a modern, well-lit office environment with a sleek, minimalist design..."}
{"visual_input": "path/to/video2.mp4"}
Inference can be performed as follows:
export CUDA_VISIBLE_DEVICES=0
export CHECKPOINT_DIR="${CHECKPOINT_DIR:=./checkpoints}"
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_transfer1/diffusion/inference/transfer.py \
--checkpoint_dir $CHECKPOINT_DIR \
--video_save_folder outputs/example1_single_control_edge \
--controlnet_specs assets/inference_cosmos_transfer1_single_control_edge.json \
--offload_text_encoder_model \
--batch_input_path path/to/batch_input.jsonl
Multimodal Control#
The following --controlnet_specs JSON activates the vis, edge, depth, and seg controls and applies uniform spatial weights:
{
"prompt": "The video is set in a modern, well-lit office environment with a sleek, minimalist design. ...",
"input_video_path" : "assets/example1_input_video.mp4",
"vis": {
"control_weight": 0.25
},
"edge": {
"control_weight": 0.25
},
"depth": {
"input_control": "assets/example1_depth.mp4",
"control_weight": 0.25
},
"seg": {
"input_control": "assets/example1_seg.mp4",
"control_weight": 0.25
}
}
It can be passed to the transfer.py script as follows:
export CUDA_VISIBLE_DEVICES=0
export CHECKPOINT_DIR="${CHECKPOINT_DIR:=./checkpoints}"
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_transfer1/diffusion/inference/transfer.py \
--checkpoint_dir $CHECKPOINT_DIR \
--video_save_folder outputs/example2_uniform_weights \
--controlnet_specs assets/inference_cosmos_transfer1_uniform_weights.json \
--offload_text_encoder_model
The following video is generated using this configuration.
Multimodal Control with Spatiotemporal Control Map#
The following --controlnet_specs JSON activates the vis, edge, depth, and seg controls and applies spatiotemporal weights:
{
"prompt": "The video is set in a modern, well-lit office environment with a sleek, minimalist design...",
"input_video_path" : "assets/example1_input_video.mp4",
"vis": {
"control_weight": 0.5,
"control_weight_prompt": "robotic arms . gloves"
},
"edge": {
"control_weight": 0.5,
"control_weight_prompt": "robotic arms . gloves"
},
"depth": {
"control_weight": 0.5
},
"seg": {
"control_weight": 0.5
}
}
The ControlNet specification differs from the Multimodal Control example above in the following ways:
There are additional control_weight_prompt fields for the "vis" and "edge" modalities. This triggers the GroundingDINO+SAM2 pipeline to run video segmentation of the input video using the control_weight_prompt (e.g. robotic arms . gloves) for vis and edge, and to extract a binarized spatiotemporal mask in which the positive pixels will have a control_weight of 0.5 (and negative pixels will have 0.0).
The portion of the prompt describing the woman's clothing is changed to a cream-colored and brown shirt. Since this area of the video will be conditioned only by depth and seg, there will be no conflict with the color information from the vis modality.
In effect, the seg and depth modalities will be applied everywhere uniformly, while vis and edge will be applied exclusively within the spatiotemporal mask given by the union of the robotic arms and gloves mask detections. In those areas, the modality weights are normalized to sum to one (for example, four modalities at 0.5 each sum to 2.0, so each is scaled to 0.25), and thus vis, edge, seg, and depth will be applied evenly there.
The ControlNet specification can be passed to the transfer.py script as follows:
export CUDA_VISIBLE_DEVICES=0
export CHECKPOINT_DIR="${CHECKPOINT_DIR:=./checkpoints}"
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_transfer1/diffusion/inference/transfer.py \
--checkpoint_dir $CHECKPOINT_DIR \
--video_save_folder outputs/example3_spatiotemporal_weights \
--controlnet_specs assets/inference_cosmos_transfer1_spatiotemporal_weights_auto.json \
--offload_text_encoder_model
The following video is generated using this configuration.
The spatiotemporal mask extracted by the robotic arms . gloves prompt is shown below.
Autonomous Vehicle Transfer#
The following example performs inference using two ControlNet branches, "hdmap" and "lidar", to transform these video inputs into a high-quality video simulation for autonomous vehicle (AV) applications.
#!/bin/bash
export PROMPT="The video is captured from a camera mounted on a car. The camera is facing forward. The video showcases a scenic golden-hour drive through a suburban area, bathed in the warm, golden hues of the setting sun. The dashboard camera captures the play of light and shadow as the sun’s rays filter through the trees, casting elongated patterns onto the road. The streetlights remain off, as the golden glow of the late afternoon sun provides ample illumination. The two-lane road appears to shimmer under the soft light, while the concrete barrier on the left side of the road reflects subtle warm tones. The stone wall on the right, adorned with lush greenery, stands out vibrantly under the golden light, with the palm trees swaying gently in the evening breeze. Several parked vehicles, including white sedans and vans, are seen on the left side of the road, their surfaces reflecting the amber hues of the sunset. The trees, now highlighted in a golden halo, cast intricate shadows onto the pavement. Further ahead, houses with red-tiled roofs glow warmly in the fading light, standing out against the sky, which transitions from deep orange to soft pastel blue. As the vehicle continues, a white sedan is seen driving in the same lane, while a black sedan and a white van move further ahead. The road markings are crisp, and the entire setting radiates a peaceful, almost cinematic beauty. The golden light, combined with the quiet suburban landscape, creates an atmosphere of tranquility and warmth, making for a mesmerizing and soothing drive."
export CUDA_VISIBLE_DEVICES=0
export CHECKPOINT_DIR="${CHECKPOINT_DIR:=./checkpoints}"
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_transfer1/diffusion/inference/transfer.py \
--checkpoint_dir $CHECKPOINT_DIR \
--video_save_name output_video \
--video_save_folder outputs/sample_av_multi_control \
--prompt "$PROMPT" \
--sigma_max 80 \
--offload_text_encoder_model --is_av_sample \
--controlnet_specs assets/sample_av_multi_control_spec.json
The assets/sample_av_multi_control_spec.json file contains the following ControlNet specification:
{
"hdmap": {
"control_weight": 0.3,
"input_control": "assets/sample_av_multi_control_input_hdmap.mp4"
},
"lidar": {
"control_weight": 0.7,
"input_control": "assets/sample_av_multi_control_input_lidar.mp4"
}
}
Note
In this example, the input prompt and some other parameters are provided through the command line arguments, as opposed to through the ControlNet specification file. This allows you to abstract out fixed parameters in the spec file and alter dynamic parameters through the command line.
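For example, the same spec file can be reused to render a different scene from the same HDMap and LiDAR inputs by changing only the command-line arguments; the prompt and output names below are illustrative, not part of the released assets:
export PROMPT="The video is captured from a camera mounted on a car. The camera is facing forward. The scene is a rainy evening drive through the same suburban area, with reflections of headlights and streetlights shimmering on the wet asphalt."
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_transfer1/diffusion/inference/transfer.py \
--checkpoint_dir $CHECKPOINT_DIR \
--video_save_name output_video_rainy \
--video_save_folder outputs/sample_av_multi_control_rainy \
--prompt "$PROMPT" \
--sigma_max 80 \
--offload_text_encoder_model --is_av_sample \
--controlnet_specs assets/sample_av_multi_control_spec.json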
This is the input_control for HDMap:
Multi-GPU Inference#
The following command performs the same inference as above, but with 4 GPUs.
export CUDA_VISIBLE_DEVICES="${CUDA_VISIBLE_DEVICES:=0,1,2,3}"
export CHECKPOINT_DIR="${CHECKPOINT_DIR:=./checkpoints}"
export NUM_GPU="${NUM_GPU:=4}"
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) torchrun --nproc_per_node=$NUM_GPU --nnodes=1 --node_rank=0 cosmos_transfer1/diffusion/inference/transfer.py \
--checkpoint_dir $CHECKPOINT_DIR \
--video_save_name output_video \
--video_save_folder outputs/sample_av_multi_control \
--prompt "$PROMPT" \
--sigma_max 80 \
--offload_text_encoder_model --is_av_sample \
--controlnet_specs assets/sample_av_multi_control_spec.json \
--num_gpus $NUM_GPU
Additional AV Toolkits#
Additional AV toolkits are available from this GitHub repo provided by NVIDIA. This repo includes the following:
10 additional raw data samples (e.g. HDMap and LiDAR), along with scripts to preprocess and render them into model-compatible inputs.
Rendering scripts for converting other datasets, such as the Waymo Open Dataset, into inputs compatible with Cosmos-Transfer1.
4K Upscaling#
The following command performs 4K upscaling on a 1280x704 input video.
export CUDA_VISIBLE_DEVICES=0
export CHECKPOINT_DIR="${CHECKPOINT_DIR:=./checkpoints}"
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_transfer1/diffusion/inference/transfer.py \
--checkpoint_dir $CHECKPOINT_DIR \
--video_save_folder outputs/inference_upscaler \
--controlnet_specs assets/inference_upscaler.json \
--num_steps 10 \
--offload_text_encoder_model
The assets/inference_upscaler.json file contains the following:
{
"input_video_path" : "assets/inference_upscaler_input_video.mp4",
"upscale": {
"control_weight": 0.5
}
}
This is the input video (1280x704), which was generated by the Cosmos-Predict1-7B-Text2World model:
This is the upscaled output video (3840x2112):
Multi-GPU Inference#
The following command performs the same 4K upscaling task as above, but with 4 GPUs.
export CUDA_VISIBLE_DEVICES="${CUDA_VISIBLE_DEVICES:=0,1,2,3}"
export CHECKPOINT_DIR="${CHECKPOINT_DIR:=./checkpoints}"
export NUM_GPU="${NUM_GPU:=4}"
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) torchrun --nproc_per_node=$NUM_GPU --nnodes=1 --node_rank=0 cosmos_transfer1/diffusion/inference/transfer.py \
--checkpoint_dir $CHECKPOINT_DIR \
--video_save_folder outputs/inference_upscaler \
--controlnet_specs assets/inference_upscaler.json \
--num_steps 10 \
--offload_text_encoder_model \
--num_gpus $NUM_GPU
Arguments#
| Parameter | Description | Default |
|---|---|---|
| --controlnet_specs | JSON file that configures Multi-ControlNet operations. Refer to the ControlNet Specification section below for more details. | JSON |
| --checkpoint_dir | Directory containing model weights | "checkpoints" |
| --tokenizer_dir | Directory containing tokenizer weights | "Cosmos-Tokenize1-CV8x8x8-720p" |
| --input_video_path | Path to the input video | None |
| --video_save_name | Output video filename for single-video generation | "output" |
| --video_save_folder | Output directory for batch video generation | "outputs/" |
| --prompt | Text prompt for video generation | "The video captures a stunning, photorealistic scene with remarkable attention to detail, giving it a lifelike appearance that is almost indistinguishable from reality. It appears to be from a high-budget 4K movie, showcasing ultra-high-definition quality with impeccable resolution." |
| --negative_prompt | Negative prompt for improved quality | "The video captures a video game with bad graphics and cartoonish frames. It represents a recording of old outdated games. The lighting looks very fake. The textures are very raw and basic. The geometries are very primitive. The images are very pixelated and of poor CG quality. Overall, the video is not realistic at all." |
| --num_steps | Number of diffusion sampling steps | 35 |
| --guidance | CFG guidance scale | 7.0 |
| --sigma_max | Level of partial noise added to the input video in the range [0, 80.0]. Any value equal to or higher than 80.0 will result in not using the input video and providing the model with pure noise. | 70.0 |
| --blur_strength | Strength of blurring when preparing the control input for the "vis" controlnet. Valid values are 'very_low', 'low', 'medium', 'high', and 'very_high'. | 'medium' |
| --canny_threshold | Threshold for Canny edge detection when preparing the control input for the "edge" controlnet. A lower threshold will result in more edges being detected. Valid values are 'very_low', 'low', 'medium', 'high', and 'very_high'. | 'medium' |
| --fps | Output frames-per-second | 24 |
| --seed | Random seed | 1 |
| --offload_text_encoder_model | Offload the text encoder after inference; used for low-memory GPUs | False |
| --offload_guardrail_models | Offload the guardrail models after inference; used for low-memory GPUs | False |
| --upsample_prompt | Upsample the prompt using the prompt upsampler model | False |
| --offload_prompt_upsampler | Offload the prompt upsampler model after inference; used for low-memory GPUs | False |
ControlNet Specification#
The --controlnet_specs argument specifies a JSON file that configures Multi-ControlNet operations. The JSON file can contain the following fields:
prompt: The global text prompt that all underlying networks will receive.
input_video_path: The input video.
sigma_max: The level of noise that should be added to the input video before feeding it through the base model branch.
vis: Activates the "vis" ControlNet branch.
edge: Activates the "edge" ControlNet branch.
depth: Activates the "depth" ControlNet branch.
seg: Activates the "seg" ControlNet branch.
control_weight: A number within the range [0, 1] that controls how strongly the ControlNet branch should affect the output of the model. The larger the value (i.e. the closer to 1.0), the more strongly the generated video will adhere to the ControlNet input. However, this rigidity may come at a cost of quality. Lower values give the model more creative liberty at the cost of reduced adherence. Usually, a mid-range value near 0.5 will yield optimal results.
The inputs to each ControlNet branch are automatically computed according to the branch:
vis: Applies bilateral blurring on the input video to compute the input_control for that branch.
edge: Uses Canny edge detection to compute the edge input_control from the input video.
depth: Uses DepthAnything to compute the depth map as the input_control from the input video.
seg: Uses Segment Anything Model 2 to generate the segmentation map as the input_control from the input video.
Note the following about the ControlNet specification:
At each spatiotemporal site, if the sum of the control maps across different modalities is greater than one, normalization is applied to the modality weights so that the sum is 1.
For depth and seg, if the input_control is not provided, DepthAnything2 and GroundingDino+SAM2 will be run on the video specified by the input_video_path to generate the corresponding input_control. Refer to the assets/inference_cosmos_transfer1_uniform_weights_auto.json file as an example (a sketch of such a spec is shown after these notes).
For seg, the input_control_prompt can be provided to customize the prompt sent to GroundingDino. You can use a period (" . ") to separate objects in the input_control_prompt (e.g. robotic arms . woman . cup), as suggested in the GroundingDino README. If the input_control_prompt is not provided, the prompt will be used by default. Refer to assets/inference_cosmos_transfer1_uniform_weights_auto.json as an example.
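The contents of assets/inference_cosmos_transfer1_uniform_weights_auto.json are not reproduced on this page. A plausible sketch, assuming it mirrors the uniform-weights example above but omits the explicit depth and seg input_control videos (so they are computed automatically) and supplies a custom input_control_prompt for seg, is:
{
"prompt": "The video is set in a modern, well-lit office environment with a sleek, minimalist design. ...",
"input_video_path" : "assets/example1_input_video.mp4",
"vis": {
"control_weight": 0.25
},
"edge": {
"control_weight": 0.25
},
"depth": {
"control_weight": 0.25
},
"seg": {
"control_weight": 0.25,
"input_control_prompt": "robotic arms . woman . cup"
}
}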
Prompting Guidelines#
The input prompt is the most important parameter under your control when interacting with the model. Providing rich and descriptive prompts can positively impact the output quality of the model, whereas short and poorly detailed prompts can lead to subpar video generation. Here are some recommendations to keep in mind when crafting text prompts for the model:
Describe a single, captivating scene: Focus on a single scene to prevent the model from generating videos with unnecessary shot changes.
Limit camera control instructions: The model doesn’t handle prompts involving camera control well, as this feature is still under development.
Safety Features#
The Cosmos-Transfer1 models use a built-in safety guardrail system that cannot be disabled. Generating human faces is not allowed; any faces in the output will be blurred by the guardrail.
For more information, refer to the Cosmos Guardrail page.