Autoregressive Model Reference#

This page details the options available when using Cosmos Autoregressive models.

Model Types#

There are two model types available for autoregressive world generation:

  • Base: World generation from image/video input

    • Models: Cosmos-Predict1-4B and Cosmos-Predict1-12B

    • Inference script: base.py (/cosmos_predict1/autoregressive/inference/base.py)

  • Video2World: World generation from image/video input and text input (both inputs are required)

    • Models: Cosmos-Predict1-5B-Video2World and Cosmos-Predict1-13B-Video2World

    • Inference script: video2world.py (/cosmos_predict1/autoregressive/inference/video2world.py)

Note

Autoregressive models only support images/videos with a resolution of 1024x640. If the input is not at this resolution, it is resized and cropped.
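
If you prefer to control the resizing yourself, you can pre-process inputs before inference. The following is a minimal sketch (not part of the Cosmos tooling) that resizes and center-crops an image to 1024x640 using Pillow; the file paths are placeholders.

# Minimal sketch: resize and center-crop an image to 1024x640 with Pillow.
# Illustrative only; the Cosmos inference scripts perform their own resize/crop.
from PIL import Image

TARGET_W, TARGET_H = 1024, 640

def resize_and_center_crop(src_path: str, dst_path: str) -> None:
    img = Image.open(src_path).convert("RGB")
    # Scale so the image covers the target box, then crop the center.
    scale = max(TARGET_W / img.width, TARGET_H / img.height)
    img = img.resize((round(img.width * scale), round(img.height * scale)), Image.LANCZOS)
    left = (img.width - TARGET_W) // 2
    top = (img.height - TARGET_H) // 2
    img.crop((left, top, left + TARGET_W, top + TARGET_H)).save(dst_path)

resize_and_center_crop("input.jpg", "input_1024x640.jpg")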

Single and Batch Generation#

Autoregressive models support both single and batch video generation.

Single Video#

  • Base models require the --input_image_or_video_path argument

  • Video2World models require both the --input_image_or_video_path argument and the --prompt argument, which provides a text prompt for the model.

Batch Video#

For batch video generation, both Base and Video2World require --batch_input_path, which specifies the path to a JSONL file.

  • Base: Each line in the JSONL file must contain a “visual_input” field, in the following format:

    {"visual_input": "path/to/video1.mp4"}
    {"visual_input": "path/to/video2.mp4"}
    
  • Video2World: Each line in the JSONL file must contain both “prompt” and “visual_input” fields, in the following format (a helper script for generating these files is sketched after this list):

    {"prompt": "prompt1", "visual_input": "path/to/video1.mp4"}
    {"prompt": "prompt2", "visual_input": "path/to/video2.mp4"}
    
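If you are assembling these batch files programmatically, a small helper like the following sketch may be convenient. The output filenames, prompts, and video paths are placeholders, not files shipped with the repository.

# Minimal sketch: write batch-input JSONL files for Base and Video2World.
# Paths, prompts, and output filenames are placeholders; adjust to your data.
import json

base_entries = [
    {"visual_input": "path/to/video1.mp4"},
    {"visual_input": "path/to/video2.mp4"},
]

video2world_entries = [
    {"prompt": "prompt1", "visual_input": "path/to/video1.mp4"},
    {"prompt": "prompt2", "visual_input": "path/to/video2.mp4"},
]

def write_jsonl(path, entries):
    # One JSON object per line, as expected by --batch_input_path.
    with open(path, "w") as f:
        for entry in entries:
            f.write(json.dumps(entry) + "\n")

write_jsonl("base_batch.jsonl", base_entries)
write_jsonl("video2world_batch.jsonl", video2world_entries)
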

Text Prompt#

The examples below use a text prompt assigned to the PROMPT environment variable. The following is an example text prompt:

PROMPT="A video recorded from a moving vehicle's perspective, capturing roads, buildings, landscapes, and changing weather and lighting conditions."

Sample Commands#

This section contains sample commands for each model type, including single generation, single generation with multi-GPU inference, and batch generation.

Base#

Downloading the Model Weights#

Use the following command to download the Cosmos-Predict1-4B and Cosmos-Predict1-12B model weights from Hugging Face:

CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python scripts/download_autoregressive_checkpoints.py --model_sizes 4B 12B

Single Generation#

This command runs inference with the 4B model to generate a single video using an input image or video, which is specified with the --input_image_or_video_path argument.

CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict1/autoregressive/inference/base.py \
    --checkpoint_dir checkpoints \
    --ar_model_dir Cosmos-Predict1-4B \
    --input_type video \
    --input_image_or_video_path assets/autoregressive/input.mp4 \
    --top_p 0.8 \
    --temperature 1.0 \
    --offload_diffusion_decoder \
    --offload_tokenizer \
    --video_save_name autoregressive-4b

Single Generation with Multi-GPU Inference#

This command runs parallelized inference using 8 GPUs to generate a single video.

NUM_GPUS=8
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) torchrun --nproc_per_node=${NUM_GPUS} cosmos_predict1/autoregressive/inference/base.py \
    --num_gpus ${NUM_GPUS} \
    --checkpoint_dir checkpoints \
    --ar_model_dir Cosmos-Predict1-4B \
    --input_type video \
    --input_image_or_video_path assets/autoregressive/input.mp4 \
    --top_p 0.8 \
    --temperature 1.0 \
    --offload_diffusion_decoder \
    --offload_tokenizer \
    --video_save_name autoregressive-4b-8gpu

Batch Generation#

The --batch_input_path argument allows you to generate multiple videos. This argument specifies the path to a JSONL file, which contains one image/video input per line in the following format:

{"visual_input": "path/to/video1.mp4"}
{"visual_input": "path/to/video2.mp4"}

Inference is performed as follows:

CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict1/autoregressive/inference/base.py \
    --checkpoint_dir checkpoints \
    --ar_model_dir Cosmos-Predict1-4B \
    --batch_input_path assets/diffusion/batch_inputs/text2world.jsonl \
    --top_p 0.8 \
    --temperature 1.0 \
    --offload_diffusion_decoder \
    --offload_tokenizer \
    --video_save_folder autoregressive-4b-batch

Video2World#

Downloading the Model Weights#

Use the following command to download the Cosmos-Predict1-5B-Video2World and Cosmos-Predict1-13B-Video2World model weights from Hugging Face:

CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python scripts/download_autoregressive_checkpoints.py --model_sizes 5B 13B

Single Generation#

This command runs inference with the 5B model to generate a single video using an input image or video and a text prompt (both inputs are required).

CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict1/autoregressive/inference/video2world.py \
    --checkpoint_dir checkpoints \
    --ar_model_dir Cosmos-Predict1-5B-Video2World \
    --input_type text_and_video \
    --input_image_or_video_path assets/autoregressive/input.mp4 \
    --prompt "${PROMPT}" \
    --top_p 0.7 \
    --temperature 1.0 \
    --offload_diffusion_decoder \
    --offload_tokenizer \
    --video_save_name autoregressive-video2world-5b

Single Generation with Multi-GPU Inference#

This command runs parallelized inference using 8 GPUs to generate a single video.

NUM_GPUS=8
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) torchrun --nproc_per_node=${NUM_GPUS} cosmos_predict1/autoregressive/inference/video2world.py \
    --num_gpus ${NUM_GPUS} \
    --checkpoint_dir checkpoints \
    --ar_model_dir Cosmos-Predict1-5B-Video2World \
    --input_type text_and_video \
    --input_image_or_video_path assets/autoregressive/input.mp4 \
    --prompt "${PROMPT}" \
    --top_p 0.7 \
    --temperature 1.0 \
    --offload_diffusion_decoder \
    --offload_tokenizer \
    --video_save_name autoregressive-video2world-5b-8gpu

Batch Generation#

The --batch_input_path argument allows you to generate multiple videos. This argument specifies the path to a JSONL file, which contains one prompt and image/video input per line in the following format:

{"prompt": "prompt1", "visual_input": "path/to/video1.mp4"}
{"prompt": "prompt2", "visual_input": "path/to/video2.mp4"}

Inference is performed as follows:

CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict1/autoregressive/inference/video2world.py \
    --checkpoint_dir checkpoints \
    --ar_model_dir Cosmos-Predict1-5B-Video2World \
    --batch_input_path assets/diffusion/batch_inputs/video2world.jsonl \
    --top_p 0.7 \
    --temperature 1.0 \
    --offload_diffusion_decoder \
    --offload_tokenizer \
    --video_save_folder autoregressive-video2world-5b-batch

Arguments#

Common Parameters#

| Parameter | Description | Default |
|---|---|---|
| --checkpoint_dir | Directory containing model weights | “checkpoints” |
| --video_save_name | Output video filename for single video generation | “output” |
| --video_save_folder | Folder where output videos are stored (for batch generation) | “outputs/” |
| --input_image_or_video_path | Input image or video path. Required for single video generation. | None |
| --batch_input_path | Path to a JSONL file listing input images/videos. Required for batch video generation. | None |
| --num_gpus | Number of GPUs to use for inference | 1 |
| --temperature | Temperature used while sampling. We recommend using the values in the provided sample commands. | 1.0 |
| --top_p | Top-p value for top-p sampling. We recommend using the values in the provided sample commands. | 0.8 |
| --offload_guardrail_models | Offload guardrail models after inference; used for low-memory GPUs | False |
| --offload_diffusion_decoder | Offload diffusion decoder after inference; used for low-memory GPUs | False |
| --offload_tokenizer | Offload the tokenizer during inference; used for low-memory GPUs | False |
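
To illustrate what --temperature and --top_p control, the following sketch applies temperature scaling and top-p (nucleus) filtering to a toy next-token distribution. It is a conceptual illustration only, not the sampling code used by the Cosmos inference scripts.

# Illustrative sketch of temperature scaling and top-p (nucleus) sampling
# on a toy logit vector; not the actual Cosmos sampling implementation.
import numpy as np

def sample_top_p(logits, temperature=1.0, top_p=0.8, rng=None):
    rng = rng if rng is not None else np.random.default_rng(0)
    # Temperature scaling: higher temperature flattens the distribution.
    probs = np.exp(np.asarray(logits) / temperature)
    probs /= probs.sum()
    # Keep the smallest set of tokens whose cumulative probability >= top_p.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1
    keep = order[:cutoff]
    kept_probs = probs[keep] / probs[keep].sum()
    # Return the index of the sampled token.
    return rng.choice(keep, p=kept_probs)

logits = [2.0, 1.0, 0.5, 0.1, -1.0]
print(sample_top_p(logits, temperature=1.0, top_p=0.8))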

Base Specific Parameters#

| Parameter | Description | Default |
|---|---|---|
| --ar_model_dir | Directory containing the AR model weights | “Cosmos-Predict1-4B” |
| --input_type | Input type, either video or image | “video” |

Video2World Specific Parameters#

| Parameter | Description | Default |
|---|---|---|
| --ar_model_dir | Directory containing the AR model weights | “Cosmos-Predict1-5B-Video2World” |
| --input_type | Input type, either text_and_video or text_and_image | “text_and_video” |
| --prompt | Text prompt for the model. Required for single video generation. | None |
| --offload_text_encoder_model | Offload text encoder after inference; used for low-memory GPUs | False |

Safety Features#

The Cosmos Predict1 models use a built-in safety guardrail system that cannot be disabled. Generating human faces is not allowed; any faces that appear in the output are blurred by the guardrail.

For more information, refer to the Cosmos Guardrail page.