Tokenizer Post-Training Guide#

Cosmos-Predict1 provides tokenizers for both images and videos. For each media type, Cosmos-Predict1 offers tokenizers that operate in both continuous and discrete latent spaces. Post-training a Cosmos Tokenizer allows you to fine-tune it for your specific use case.

Tip

Refer to the Model Matrix to learn which tokenizers support post-training, as well as the number of GPUs required for post-training.

Set up the Tokenizer#

  1. Ensure you have the necessary hardware and software, as outlined on the Prerequisites page.

  2. Follow the Installation guide to download the Cosmos-Predict1 repo and set up the conda environment.

  3. Generate a Hugging Face access token. Set the access token permission to ‘Read’ (the default permission is ‘Fine-grained’).

  4. Log in to Hugging Face with the access token:

    huggingface-cli login
    
  5. Accept the terms for the LlamaGuard-7b model.

  6. Download the Cosmos tokenizer model weights from Hugging Face:

     CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python3 -m scripts.download_tokenizer_checkpoints --tokenizer_types CV8x8x8-720p DV8x16x16-720p CV4x8x8-360p DV4x8x8-360p --checkpoint_dir checkpoints
    

The downloaded checkpoints will be in the following structure:

checkpoints/
├── Cosmos-Tokenize1-CV8x8x8-720p
│   ├── config.json
│   ├── encoder.jit
│   ├── decoder.jit
│   ├── autoencoder.jit
│   └── model.pt
├── Cosmos-Tokenize1-DV8x16x16-720p
│   ├── config.json
│   ├── encoder.jit
│   ├── decoder.jit
│   ├── autoencoder.jit
│   └── model.pt
├── Cosmos-Tokenize1-CV4x8x8-360p
│   ├── config.json
│   ├── encoder.jit
│   ├── decoder.jit
│   ├── autoencoder.jit
│   └── model.pt
└── Cosmos-Tokenize1-DV4x8x8-360p
    ├── config.json
    ├── encoder.jit
    ├── decoder.jit
    ├── autoencoder.jit
    └── model.pt
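
To sanity-check the download, you can verify that each tokenizer directory contains the expected files. A minimal sketch in Python (the directory names match the --tokenizer_types passed above):

from pathlib import Path

# Tokenizer directories expected after running download_tokenizer_checkpoints.
TOKENIZERS = [
    "Cosmos-Tokenize1-CV8x8x8-720p",
    "Cosmos-Tokenize1-DV8x16x16-720p",
    "Cosmos-Tokenize1-CV4x8x8-360p",
    "Cosmos-Tokenize1-DV4x8x8-360p",
]
EXPECTED = ["config.json", "encoder.jit", "decoder.jit", "autoencoder.jit", "model.pt"]

for name in TOKENIZERS:
    root = Path("checkpoints") / name
    missing = [f for f in EXPECTED if not (root / f).exists()]
    status = "OK" if not missing else f"missing {missing}"
    print(f"{name}: {status}")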

Post-training a Tokenizer#

There are three steps to post-training a tokenizer: downloading a dataset, preprocessing the data, and post-training the tokenizer model.

Download a Dataset#

The first step is to download a dataset with videos.

You must provide a folder containing a collection of videos in MP4 format, preferably 720p. These videos should be diverse enough to capture different scenarios associated with your use case.
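
If you are assembling your own dataset, a quick way to confirm that clips are readable MP4s at roughly 720p is to probe them with OpenCV. A minimal sketch, assuming opencv-python is installed and using datasets/my_videos as a placeholder for your folder:

import glob

import cv2  # pip install opencv-python

# Report resolution and frame count for every MP4 in the (placeholder) folder.
for path in sorted(glob.glob("datasets/my_videos/*.mp4")):
    cap = cv2.VideoCapture(path)
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.release()
    print(f"{path}: {width}x{height}, {frames} frames")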

This example uses a subset of the HD-VILA-100M dataset for post-training.

  1. Download the metadata containing video URLs:

    mkdir -p datasets/hdvila
    cd datasets/hdvila
    wget https://huggingface.co/datasets/TempoFunk/hdvila-100M/resolve/main/hdvila-100M.jsonl
    
  2. Install the requirements for YouTube video downloads and video clipping:

    # Requirements for YouTube video downloads & video clipping
    pip install pytubefix ffmpeg
    
  3. Use the following command to download the original HD-VILA-100M videos and save the corresponding clips and metadata (a quick sanity check of the result is sketched after this list):

    python3 -m scripts.download_tokenizer_example_data --dataset_path datasets/hdvila --N_videos 128 --do_download --do_clip
    

    The downloaded files will be in the following structure:

    datasets/hdvila/
    ├── metas/
    │   └── *.json
    └── videos/
        └── *.mp4
    
  4. Register the glob pattern for the MP4 files in cosmos_predict1/tokenizer/training/datasets/dataset_provider.py, as shown below:

    _VIDEO_PATTERN_DICT = {
        "hdvila_video": "datasets/hdvila/videos/*mp4",
    }
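
As a quick sanity check of the download in step 3, you can count the clips and peek at one metadata record. A minimal sketch (the exact metadata keys depend on the download script, so the printout is illustrative):

import glob
import json

# Clip count should match --N_videos from step 3.
clips = sorted(glob.glob("datasets/hdvila/videos/*.mp4"))
metas = sorted(glob.glob("datasets/hdvila/metas/*.json"))
print(f"{len(clips)} clips, {len(metas)} metadata files")

# Peek at one metadata record; keys depend on the download script.
if metas:
    with open(metas[0]) as f:
        print(json.load(f))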
    

Note

Different resolution variants of hdvila_video can be obtained using the hdvila_video<resolution> naming convention. The following examples use hdvila_video360 and hdvila_video720 to refer to the same HD-VILA videos resized to 360p and 720p, respectively, at training time.
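
To make the convention concrete, a hypothetical helper that splits such a name into the registered base key and the target resolution could look like the following (an illustration of the naming scheme only, not the repo's actual lookup code):

import re

def parse_dataset_name(name: str):
    # Split e.g. "hdvila_video720" into ("hdvila_video", 720); a name with
    # no suffix maps to (name, None). Hypothetical helper for illustration.
    match = re.fullmatch(r"([a-z_]+?)(\d+)?", name)
    base, resolution = match.group(1), match.group(2)
    return base, int(resolution) if resolution else None

print(parse_dataset_name("hdvila_video720"))  # ('hdvila_video', 720)
print(parse_dataset_name("hdvila_video360"))  # ('hdvila_video', 360)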

Post-train the Video Tokenizers#

Run the following command to execute a post-training job for Cosmos-Tokenize1-CV8x8x8-720p using the data prepared above:

export OUTPUT_ROOT=checkpoints # default value
torchrun --nproc_per_node=8 -m cosmos_predict1.tokenizer.training.train \
    --config=cosmos_predict1/tokenizer/training/configs/config.py -- \
    experiment=Cosmos_Tokenize1_CV8x8x8_720p_HDVILA

The tokenizer will be post-trained using the subset of the HD-VILA-100M dataset described above.

Refer to the Cosmos_Tokenize1_CV8x8x8_720p_HDVILA configuration defined in cosmos_predict1/tokenizer/training/configs/experiments/cosmos_tokenize1.py to understand how the dataloader is determined.

Cosmos_Tokenize1_CV8x8x8_720p_HDVILA: LazyDict = LazyDict(
    dict(
        defaults=[
            "/experiment/video_basic",
            {"override /network": "continuous_factorized_video"},
            {"override /data_train": "hdvila_video720"}, # hdvila_video resized to 720p at the time of training
            {"override /data_val": "hdvila_video720"}, # hdvila_video resized to 720p at the time of training
            "_self_",
        ],
        dataloader_train=dict(
            dataset=dict(
                crop_height=256,
                num_video_frames=121,
            ),
            batch_size=1,
        ),
        dataloader_val=dict(
            dataset=dict(
                crop_height=256,
                num_video_frames=121,
            ),
            batch_size=1,
        ),
        model=dict(
            config=dict(
                network=dict(
                    channels_mult=[2, 4, 4],
                    patch_size=4,
                    legacy_mode=False,
                    temporal_compression=8,
                    spatial_compression=8,
                )
            )
        ),
        job=dict(
            project="posttraining",
            group="tokenizer",
            name="Cosmos-Tokenize1-CV8x8x8-720p-HDVILA",
        ),
        checkpoint=dict(
            load_path="checkpoints/Cosmos-Tokenize1-CV8x8x8-720p/model.pt",
            strict_resume=True,
            load_training_state=True,
            jit=dict(input_shape=[1, 3, 17, 512, 512]),
        ),
    )
)
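
As a quick check of the compression settings: assuming the usual causal convention for Cosmos tokenizers, where the first frame is encoded on its own and the remaining frames are compressed temporal_compression times, a clip of 1 + 8k frames yields 1 + k latent frames, and each spatial dimension shrinks by spatial_compression. A small sketch applying this to the shapes in the config above (the latent channel count is omitted, since it depends on the network):

def latent_grid(frames, height, width, t_comp=8, s_comp=8):
    # Causal tokenizer: first frame alone, remaining (frames - 1) compressed.
    assert (frames - 1) % t_comp == 0, "frames must be 1 + k * t_comp"
    return (frames - 1) // t_comp + 1, height // s_comp, width // s_comp

print(latent_grid(17, 512, 512))   # (3, 64, 64) -- the JIT export shape
print(latent_grid(121, 256, 256))  # (16, 32, 32) -- 121-frame clips, assuming a square 256 crop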

The checkpoints will be saved to ${OUTPUT_ROOT}/PROJECT/GROUP/NAME.

In the above example, PROJECT is defined as posttraining, GROUP as tokenizer, and NAME as Cosmos-Tokenize1-CV8x8x8-720p-HDVILA.
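
A small sketch of how those job fields combine into the output path (matching the directory shown below):

import os

# Job fields from the experiment config above.
output_root = os.environ.get("OUTPUT_ROOT", "checkpoints")
project, group, name = "posttraining", "tokenizer", "Cosmos-Tokenize1-CV8x8x8-720p-HDVILA"

print(os.path.join(output_root, project, group, name))
# -> checkpoints/posttraining/tokenizer/Cosmos-Tokenize1-CV8x8x8-720p-HDVILA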

During training, checkpoints will be saved in the following structure:

checkpoints/posttraining/tokenizer/Cosmos-Tokenize1-CV8x8x8-720p-HDVILA/checkpoints/
├── iter_{NUMBER}.pt
├── iter_{NUMBER}_enc.jit
├── iter_{NUMBER}_dec.jit
└── iter_{NUMBER}_ema.jit

Inference with the Post-trained Model Checkpoint#

Inference can be performed with the same interface as described in the Tokenizer Inference Guide.

# Autoencoding videos using post-trained `Cosmos-Tokenize1-CV8x8x8-720p-HDVILA`.
model_name="Cosmos-Tokenize1-CV8x8x8-720p-HDVILA"
python3 -m cosmos_predict1.tokenizer.inference.video_cli \
    --video_pattern 'cosmos_predict1/tokenizer/test_data/*.mp4' \
    --checkpoint_enc checkpoints/posttraining/tokenizer/${model_name}/checkpoints/iter_${NUMBER}_enc.jit \
    --checkpoint_dec checkpoints/posttraining/tokenizer/${model_name}/checkpoints/iter_${NUMBER}_dec.jit
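
The exported JIT modules can also be loaded directly in Python instead of going through the CLI. A minimal sketch, assuming a post-trained checkpoint exists at the path below and that real inputs are preprocessed (resized, normalized) as described in the Tokenizer Inference Guide; the dummy input uses bfloat16 to match the released models, so adjust the dtype if your export differs:

import torch

# Post-trained run from above; replace NUMBER with a saved iteration.
ckpt_dir = "checkpoints/posttraining/tokenizer/Cosmos-Tokenize1-CV8x8x8-720p-HDVILA/checkpoints"
encoder = torch.jit.load(f"{ckpt_dir}/iter_NUMBER_enc.jit").cuda().eval()
decoder = torch.jit.load(f"{ckpt_dir}/iter_NUMBER_dec.jit").cuda().eval()

# Dummy clip with the JIT export shape [B, C, T, H, W].
video = torch.randn(1, 3, 17, 512, 512, device="cuda", dtype=torch.bfloat16)

with torch.no_grad():
    latent = encoder(video)
    if isinstance(latent, (tuple, list)):  # some exports return a tuple
        latent = latent[0]
    reconstruction = decoder(latent)

print(reconstruction.shape)  # expected to match the input video shape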