Autoregressive Model Reference#
This page details the options available when using Cosmos Autoregressive models.
Model Types#
There are two model types available for autoregressive world generation:
* Base: World generation from image/video input
  * Models: Cosmos-Predict1-4B and Cosmos-Predict1-12B
  * Inference script: base.py (/cosmos_predict1/autoregressive/inference/base.py)
* Video2World: World generation from image/video input and text input (both inputs are required)
  * Models: Cosmos-Predict1-5B-Video2World and Cosmos-Predict1-13B-Video2World
  * Inference script: video2world.py (/cosmos_predict1/autoregressive/inference/video2world.py)
Note
Autoregressive models only support images/videos with a resolution of 1024x640. If the input is not in this resolution, it will be resized and cropped.
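If you prefer to control the resizing and cropping yourself, you can pre-process inputs to 1024x640 before running inference. The following is a minimal sketch using ffmpeg (the input and output paths are placeholders):

# Scale up while preserving aspect ratio, then center-crop to 1024x640 (paths are placeholders)
ffmpeg -i input.mp4 \
  -vf "scale=1024:640:force_original_aspect_ratio=increase,crop=1024:640" \
  -c:a copy output_1024x640.mp4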
Single and Batch Generation#
Autoregressive models support both single and batch video generation.
Single Video#
* Base models require the --input_image_or_video_path argument.
* Video2World models require both the --input_image_or_video_path argument and the --prompt argument, which provides a text prompt for the model.
Batch Video#
For batch video generation, both Base and Video2World require --batch_input_path, which specifies the path to a JSONL file.

* Base: The JSONL file should contain one visual input per line, and each line must contain a "visual_input" field:

  {"visual_input": "path/to/video1.mp4"}
  {"visual_input": "path/to/video2.mp4"}

* Video2World: Each line in the JSONL file must contain both "prompt" and "visual_input" fields:

  {"prompt": "prompt1", "visual_input": "path/to/video1.mp4"}
  {"prompt": "prompt2", "visual_input": "path/to/video2.mp4"}
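If you have a folder of input clips, a batch file in this format can be generated with a small shell loop. This is only a sketch; the clip folder, prompt, and output filename below are placeholders, and prompts must not contain double quotes for this simple approach to produce valid JSON:

# Sketch: build a Video2World batch file from all .mp4 clips in a folder,
# pairing each clip with the same example prompt (paths are placeholders).
PROMPT="A video recorded from a moving vehicle's perspective."
: > batch_video2world.jsonl
for clip in path/to/clips/*.mp4; do
  printf '{"prompt": "%s", "visual_input": "%s"}\n' "${PROMPT}" "${clip}" >> batch_video2world.jsonl
done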
Text Prompt#
The examples below use a text prompt assigned to the PROMPT environment variable. The following is an example text prompt:
PROMPT="A video recorded from a moving vehicle's perspective, capturing roads, buildings, landscapes, and changing weather and lighting conditions."
Sample Commands#
This section contains sample commands for each model type, including single generation, single generation with multi-GPU inference, and batch generation.
Base#
Downloading the Model Weights#
Use the following command to download the Cosmos-Predict1-4B and Cosmos-Predict1-12B model weights from Hugging Face:

CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python scripts/download_autoregressive_checkpoints.py --model_sizes 4B 12B
Single Generation#
This command runs inference with the 4B model to generate a single video using an input image or video, which is specified with the --input_image_or_video_path argument.
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict1/autoregressive/inference/base.py \
--checkpoint_dir checkpoints \
--ar_model_dir Cosmos-Predict1-4B \
--input_type video \
--input_image_or_video_path assets/autoregressive/input.mp4 \
--top_p 0.8 \
--temperature 1.0 \
--offload_diffusion_decoder \
--offload_tokenizer \
--video_save_name autoregressive-4b
Single Generation with Multi-GPU Inference#
This command runs parallelized inference using 8 GPUs to generate a single video.
NUM_GPUS=8
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) torchrun --nproc_per_node=${NUM_GPUS} cosmos_predict1/autoregressive/inference/base.py \
--num_gpus ${NUM_GPUS} \
--checkpoint_dir checkpoints \
--ar_model_dir Cosmos-Predict1-4B \
--input_type video \
--input_image_or_video_path assets/autoregressive/input.mp4 \
--top_p 0.8 \
--temperature 1.0 \
--offload_diffusion_decoder \
--offload_tokenizer \
--video_save_name autoregressive-4b-8gpu
Batch Generation#
The --batch_input_path argument allows you to generate multiple videos. This argument specifies the path to a JSONL file, which contains one image/video input per line in the following format:
{"visual_input": "path/to/video1.mp4"}
{"visual_input": "path/to/video2.mp4"}
Inference is performed as follows:
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict1/autoregressive/inference/base.py \
--checkpoint_dir checkpoints \
--ar_model_dir Cosmos-Predict1-4B \
--batch_input_path assets/diffusion/batch_inputs/text2world.jsonl \
--top_p 0.8 \
--temperature 1.0 \
--offload_diffusion_decoder \
--offload_tokenizer \
--video_save_folder autoregressive-4b-batch
Video2World#
Downloading the Model Weights#
Use the following command to download the Cosmos-Predict1-5B-Video2World and Cosmos-Predict1-13B-Video2World model weights from Hugging Face:
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python scripts/download_autoregressive_checkpoints.py --model_sizes 5B 13B
Single Generation#
This command runs inference with the 5B model to generate a single video using an input image or video and a text prompt (both inputs are required).
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict1/autoregressive/inference/video2world.py \
--checkpoint_dir checkpoints \
--ar_model_dir Cosmos-Predict1-5B-Video2World \
--input_type text_and_video \
--input_image_or_video_path assets/autoregressive/input.mp4 \
--prompt "${PROMPT}" \
--top_p 0.7 \
--temperature 1.0 \
--offload_diffusion_decoder \
--offload_tokenizer \
--video_save_name autoregressive-video2world-5b
Single Generation with Multi-GPU Inference#
This command runs parallelized inference using 8 GPUs to generate a single video.
NUM_GPUS=8
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) torchrun --nproc_per_node=${NUM_GPUS} cosmos_predict1/autoregressive/inference/video2world.py \
--num_gpus ${NUM_GPUS} \
--checkpoint_dir checkpoints \
--ar_model_dir Cosmos-Predict1-5B-Video2World \
--input_type text_and_video \
--input_image_or_video_path assets/autoregressive/input.mp4 \
--prompt "${PROMPT}" \
--top_p 0.7 \
--temperature 1.0 \
--offload_diffusion_decoder \
--offload_tokenizer \
--video_save_name autoregressive-video2world-5b-8gpu
Batch Generation#
The --batch_input_path argument allows you to generate multiple videos. This argument specifies the path to a JSONL file, which contains one prompt and image/video input per line in the following format:
{"prompt": "prompt1", "visual_input": "path/to/video1.mp4"}
{"prompt": "prompt2", "visual_input": "path/to/video2.mp4"}
Inference is performed as follows:
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict1/autoregressive/inference/video2world.py \
--checkpoint_dir checkpoints \
--ar_model_dir Cosmos-Predict1-5B-Video2World \
--batch_input_path assets/diffusion/batch_inputs/video2world.jsonl \
--top_p 0.7 \
--temperature 1.0 \
--offload_diffusion_decoder \
--offload_tokenizer \
--video_save_folder autoregressive-video2world-5b-batch
Arguments#
Common Parameters#
| Parameter | Description | Default |
|---|---|---|
| --checkpoint_dir | Directory containing model weights | "checkpoints" |
| --video_save_name | Output video filename for single video generation | "output" |
| --video_save_folder | Folder where output videos are stored (for batch generation) | "outputs/" |
| --input_image_or_video_path | Input image or video path. Required for single video generation. | None |
| --batch_input_path | Path to a JSONL file listing the input images/videos. Required for batch video generation. | None |
| --num_gpus | Number of GPUs to use for inference | 1 |
| --temperature | Temperature used while sampling. We recommend using the values in the provided sample commands. | 1.0 |
| --top_p | Top-p value for Top-p sampling. We recommend using the values in the provided sample commands. | 0.8 |
| --offload_guardrail_models | Offload guardrail models after inference; used for low-memory GPUs | False |
| --offload_diffusion_decoder | Offload diffusion decoder after inference; used for low-memory GPUs | False |
| --offload_tokenizer | Offload the tokenizer during inference; used for low-memory GPUs | False |
Base Specific Parameters#
| Parameter | Description | Default |
|---|---|---|
| --ar_model_dir | Directory containing the AR model weights | "Cosmos-Predict1-4B" |
| --input_type | Input type, either image or video | "video" |
Video2World Specific Parameters#
| Parameter | Description | Default |
|---|---|---|
| --ar_model_dir | Directory containing the AR model weights | "Cosmos-Predict1-5B-Video2World" |
| --input_type | Input type, either text_and_image or text_and_video | "text_and_video" |
| --prompt | Text prompt for single video generation. Required for single video generation. | None |
| --offload_text_encoder_model | Offload the text encoder after inference; used for low-memory GPUs | False |
Safety Features#
The Cosmos Predict1 models use a built-in safety guardrail system that cannot be disabled. Generating human faces is not allowed; any faces that appear in the output will be blurred by the guardrail.
For more information, refer to the Cosmos Guardrail page.