Tokenizer Inference Guide#

This page walks you through setting up the Cosmos-Predict1 tokenizer and provides several examples of how to use it.

Set up the Tokenizer#

  1. Ensure you have the necessary hardware and software, as outlined on the Prerequisites page.

  2. Follow the Installation guide to download the Cosmos-Predict1 repo and set up the conda environment.

  3. Generate a Hugging Face access token. Set the access token permission to ‘Read’ (the default permission is ‘Fine-grained’).

  4. Log in to Hugging Face with the access token:

    huggingface-cli login
    
  5. Accept the terms for the LlamaGuard-7b model.

  6. Download the Cosmos tokenizer model weights from Hugging Face:

    python3 -m scripts.download_tokenizer_checkpoints --checkpoint_dir checkpoints
    

Tokenizer Overview#

The downloaded checkpoints will be in the following structure:

checkpoints/
├── Cosmos-Tokenize1-CV8x8x8-720p
├── Cosmos-Tokenize1-DV8x16x16-720p
├── Cosmos-Tokenize1-CI8x8-360p
├── Cosmos-Tokenize1-CI16x16-360p
├── Cosmos-Tokenize1-CV4x8x8-360p
├── Cosmos-Tokenize1-DI8x8-360p
├── Cosmos-Tokenize1-DI16x16-360p
└── Cosmos-Tokenize1-DV4x8x8-360p

Each checkpoints/<model-name> directory contains the encoder, decoder, and full autoencoder in TorchScript form (for JIT/TorchScript mode), as well as a native PyTorch checkpoint. For instance, these are the contents of the Cosmos-Tokenize1-CV8x8x8-720p directory:

checkpoints/
└── Cosmos-Tokenize1-CV8x8x8-720p/
    ├── encoder.jit
    ├── decoder.jit
    ├── autoencoder.jit
    └── model.pt
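Because the .jit files are standard TorchScript archives, they can also be loaded directly with torch.jit.load for a quick smoke test; a minimal sketch (the CLIs and helper classes shown below are the intended entry points):

import torch

# Quick smoke test: the .jit artifacts are standard TorchScript archives.
model_name = "Cosmos-Tokenize1-CV8x8x8-720p"
autoencoder = torch.jit.load(f"checkpoints/{model_name}/autoencoder.jit")
autoencoder = autoencoder.to("cuda").eval()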

Cosmos Tokenizer includes tokenizers for both images and videos. The video tokenizers are temporally causal, so they can also operate as image tokenizers. For each media type, Cosmos-Predict1 provides tokenizers that operate on both continuous and discrete latent spaces.
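For example, because the video tokenizers are causal, a single frame (T=1) is a valid input and yields a single latent frame. A minimal sketch using the CausalVideoTokenizer API demonstrated later on this page (the printed shape is our expectation given the model's 4x temporal and 8x8 spatial compression factors; verify against your checkpoints):

import torch
from cosmos_predict1.tokenizer.inference.video_lib import CausalVideoTokenizer

# A single frame (T=1) is a valid input to a temporally causal video tokenizer.
model_name = "Cosmos-Tokenize1-CV4x8x8-360p"
single_frame = torch.rand(1, 3, 1, 512, 512).to('cuda').to(torch.bfloat16)  # [B, C, T, H, W]
single_frame = single_frame * 2. - 1.  # Normalize to [-1..1]
encoder = CausalVideoTokenizer(checkpoint_enc=f'checkpoints/{model_name}/encoder.jit')
(latent,) = encoder.encode(single_frame)
print(latent.shape)  # expected: (1, 16, 1, 64, 64) -- one latent frame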

The following table summarizes the nomenclature used for naming the various tokenizers, categorized by media type (images and videos) and latent space type (continuous and discrete):

                  Continuous ( C )       Discrete ( D )
Images ( I )      Cosmos-Tokenize1-CI    Cosmos-Tokenize1-DI
Videos ( V )      Cosmos-Tokenize1-CV    Cosmos-Tokenize1-DV
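The checkpoint names append the compression factors and training resolution to these prefixes; for example, CV8x8x8-720p denotes a continuous video tokenizer with a temporal compression factor of 8 and spatial compression factors of 8x8, trained at 720p. The following illustrative sketch (the regular expression is ours, not part of the library) splits a name into these components:

import re

# Illustrative only: split a checkpoint name into its nomenclature components.
name = "Cosmos-Tokenize1-CV8x8x8-720p"
m = re.match(r"Cosmos-Tokenize1-([CD])([IV])(\d+x\d+(?:x\d+)?)-(\d+p)", name)
latent_space, media_type, factors, resolution = m.groups()
print(latent_space, media_type, factors, resolution)  # C V 8x8x8 720p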

There are two modes for running inference: JIT/TorchScript and PyTorch.

Examples#

This section provides example commands for encoding/decoding both videos and images with the two inference modes: JIT/TorchScript and PyTorch. The example inference commands use the images and videos provided in the cosmos_predict1/tokenizer/test_data/ directory.

Tip

You can use --help to view all available command line arguments for image_cli and video_cli:

python3 -m cosmos_predict1.tokenizer.inference.image_cli --help
python3 -m cosmos_predict1.tokenizer.inference.video_cli --help

JIT/TorchScript Inference

Autoencoding Images

The following command accepts an input image and outputs a reconstruction of the image obtained by decoding the encoded latents.

# Autoencoding images using `Cosmos-Tokenize1-CI8x8-360p`.
model_name="Cosmos-Tokenize1-CI8x8-360p"
python3 -m cosmos_predict1.tokenizer.inference.image_cli \
    --image_pattern 'cosmos_predict1/tokenizer/test_data/image.png' \
    --checkpoint_enc checkpoints/${model_name}/encoder.jit \
    --checkpoint_dec checkpoints/${model_name}/decoder.jit

If --output_dir is not specified, the reconstructed image will be written to cosmos_predict1/tokenizer/test_data/reconstructions/image.png.
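Images can also be autoencoded programmatically, analogous to the video examples later on this page. A minimal sketch, assuming cosmos_predict1.tokenizer.inference.image_lib exposes an ImageTokenizer class mirroring the CausalVideoTokenizer API (verify the class name in your checkout):

import torch
from cosmos_predict1.tokenizer.inference.image_lib import ImageTokenizer  # assumed image counterpart

model_name = "Cosmos-Tokenize1-CI8x8-360p"
input_tensor = torch.rand(1, 3, 512, 512).to('cuda').to(torch.bfloat16)  # [B, C, H, W]
input_tensor = input_tensor * 2. - 1.  # Normalize to [-1..1]
encoder = ImageTokenizer(checkpoint_enc=f'checkpoints/{model_name}/encoder.jit')
(latent,) = encoder.encode(input_tensor)
decoder = ImageTokenizer(checkpoint_dec=f'checkpoints/{model_name}/decoder.jit')
reconstructed_image = decoder.decode(latent)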

Autoencoding Videos

The following command accepts an input video and outputs a reconstruction of the video obtained by decoding the encoded latents.

# Autoencoding videos using `Cosmos-Tokenize1-DV4x8x8-360p`.
model_name="Cosmos-Tokenize1-DV4x8x8-360p"
python3 -m cosmos_predict1.tokenizer.inference.video_cli \
    --video_pattern 'cosmos_predict1/tokenizer/test_data/video.mp4' \
    --checkpoint_enc checkpoints/${model_name}/encoder.jit \
    --checkpoint_dec checkpoints/${model_name}/decoder.jit

If --output_dir is not specified, the reconstructed video will be written to cosmos_predict1/tokenizer/test_data/reconstructions/video.mp4.

PyTorch Inference

To run the tokenizers in native PyTorch, append --mode=torch to your command. In PyTorch mode, the model is constructed from the native network definition, which requires additional arguments to configure the model for instantiation.

For example, to instantiate a DI tokenizer with a spatial compression factor of 8, operating at 360p resolution, append the following command-line arguments:

  • --mode=torch

  • --tokenizer_type=DI8x8-360p

Note that the --checkpoint_enc, --checkpoint_dec, and --checkpoint arguments should still refer to JIT files. The necessary state_dict values will be extracted from the loaded JIT models.
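The mechanism behind this note is simple: a TorchScript archive still carries its parameters, so a regular state_dict can be extracted and loaded into a natively constructed module. A minimal sketch of the idea (native_encoder is a hypothetical stand-in for the model built from the network definition):

import torch

# Extract plain weights from a TorchScript archive.
jit_encoder = torch.jit.load("checkpoints/Cosmos-Tokenize1-DI8x8-360p/encoder.jit")
state_dict = jit_encoder.state_dict()  # ordinary mapping of parameter names to tensors
# native_encoder.load_state_dict(state_dict)  # hypothetical natively constructed module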

Autoencoding Images

The following command accepts an input image and outputs a reconstruction of the image obtained by decoding the encoded latents.

# Autoencoding images using `Cosmos-Tokenize1-DI8x8-360p`.
model_name="Cosmos-Tokenize1-DI8x8-360p"
python3 -m cosmos_predict1.tokenizer.inference.image_cli \
    --image_pattern 'cosmos_predict1/tokenizer/test_data/*.png' \
    --mode=torch \
    --tokenizer_type=DI8x8-360p \
    --checkpoint_enc checkpoints/${model_name}/encoder.jit \
    --checkpoint_dec checkpoints/${model_name}/decoder.jit

Autoencoding Videos

The following command-line arguments are used to instantiate a CV tokenizer with a temporal factor of 8 and a spatial compression factor of 8, operating at 720p or higher resolution:

  • --mode=torch

  • --tokenizer_type=CV8x8x8-720p

# Autoencoding videos using `Cosmos-Tokenize1-CV8x8x8-720p`.
model_name="Cosmos-Tokenize1-CV8x8x8-720p"
python3 -m cosmos_predict1.tokenizer.inference.video_cli \
    --video_pattern 'cosmos_predict1/tokenizer/test_data/*.mp4' \
    --mode=torch \
    --tokenizer_type=CV8x8x8-720p \
    --checkpoint_enc checkpoints/${model_name}/encoder.jit \
    --checkpoint_dec checkpoints/${model_name}/decoder.jit

Similarly, to instantiate Cosmos-Tokenize1-CV4x8x8-360p, append the following command-line arguments and point to the corresponding JIT-compiled checkpoints:

  • --mode=torch

  • --tokenizer_type=CV4x8x8-360p

# Autoencoding videos using `Cosmos-Tokenize1-CV4x8x8-360p`.
model_name="Cosmos-Tokenize1-CV4x8x8-360p"
python3 -m cosmos_predict1.tokenizer.inference.video_cli \
    --video_pattern 'cosmos_predict1/tokenizer/test_data/*.mp4' \
    --mode=torch \
    --tokenizer_type=CV4x8x8-360p \
    --checkpoint_enc checkpoints/${model_name}/encoder.jit \
    --checkpoint_dec checkpoints/${model_name}/decoder.jit

Encoding Videos

This section provides examples for the following use cases:

  • Encoding videos into a continuous latent space.

  • Encoding videos into discrete tokens.

Encoding Videos into Continuous Latent Space

import torch
from cosmos_predict1.tokenizer.inference.video_lib import CausalVideoTokenizer

model_name = "Cosmos-Tokenize1-CV4x8x8-360p"
input_tensor = torch.rand(1, 3, 9, 512, 512).to('cuda').to(torch.bfloat16)  # [B, C, T, H, W]
input_tensor = input_tensor * 2. - 1.  # Normalize to [-1..1]
encoder = CausalVideoTokenizer(checkpoint_enc=f'checkpoints/{model_name}/encoder.jit')
(latent,) = encoder.encode(input_tensor)
torch.testing.assert_close(latent.shape, (1, 16, 3, 64, 64))

# The input tensor can be reconstructed by the decoder as:
decoder = CausalVideoTokenizer(checkpoint_dec=f'checkpoints/{model_name}/decoder.jit')
reconstructed_tensor = decoder.decode(latent)
torch.testing.assert_close(reconstructed_tensor.shape, input_tensor.shape)

The latent will have the shape (1, 16, 3, 64, 64), where the first of the three latents represents the first frame, and C=16 is the number of channels of the latent.
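These dimensions follow from the compression factors: with a temporal factor of 4, the causal tokenizer maps T input frames to 1 + (T - 1) / 4 latent frames (the first frame is encoded on its own), and the 8x8 spatial factor divides the height and width by 8. A quick check for the example above:

# Sanity-check the latent shape for Cosmos-Tokenize1-CV4x8x8-360p.
T, H, W = 9, 512, 512  # input dimensions from the example above
latent_T = 1 + (T - 1) // 4  # causal temporal compression (factor 4)
latent_H, latent_W = H // 8, W // 8  # spatial compression (factor 8)
print(latent_T, latent_H, latent_W)  # 3 64 64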

Encoding Videos into Discrete Tokens

import torch
from cosmos_predict1.tokenizer.inference.video_lib import CausalVideoTokenizer

model_name = "Cosmos-Tokenize1-DV4x8x8-360p"
input_tensor = torch.rand(1, 3, 9, 512, 512).to('cuda').to(torch.bfloat16)  # [B, C, T, H, W]
input_tensor = input_tensor * 2. - 1.  # Normalize to [-1..1]
encoder = CausalVideoTokenizer(checkpoint_enc=f'checkpoints/{model_name}/encoder.jit')
(indices, codes) = encoder.encode(input_tensor)
torch.testing.assert_close(indices.shape, (1, 3, 64, 64))
torch.testing.assert_close(codes.shape, (1, 6, 3, 64, 64))

# The input tensor can be reconstructed by the decoder as:
decoder = CausalVideoTokenizer(checkpoint_dec=f'checkpoints/{model_name}/decoder.jit')
reconstructed_tensor = decoder.decode(indices)
torch.testing.assert_close(reconstructed_tensor.shape, input_tensor.shape)

The indices will have the shape (1, 3, 64, 64) and contain integral values in the range [1..64K], where the first of the three integral maps represents the first frame. The codes will contain the pre-quantization continuous latent with shape (1, 6, 3, 64, 64), where C=6 represents the number of FSQ levels.
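The ~64K index range is consistent with a 6-level FSQ quantizer: the vocabulary size is the product of the per-level cardinalities. A quick check, assuming levels of (8, 8, 8, 5, 5, 5) (an assumption on our part; consult the model configuration for the actual values):

import math

fsq_levels = [8, 8, 8, 5, 5, 5]  # assumed per-level cardinalities; check the model config
print(math.prod(fsq_levels))  # 64000 distinct tokens, i.e. the ~64K vocabulary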

Jupyter Notebook Demos

Refer to the following Google Colab notebooks for additional demos:

  • Image Tokenization

  • Video Tokenization