Data Preparation

Note

It is the responsibility of each user to check the content of the dataset, review the applicable licenses, and determine if it is suitable for their intended use. Users should review any applicable links associated with the dataset before placing the data on their machine.

Prepare Pretraining Datasets

For all text2img multimodal models, we provide a generic pipeline, detailed below, to download and prepare the dataset. The pipeline is suitable for any multimodal dataset hosted on the HuggingFace data repository where the data is stored as one or more parquet files. The pipeline processes the dataset into the WebDataset format, consisting of tar files of equal sizes for efficient training.

The 5 sub-stages are as follows.

download_parquet: Parquet files consisting of text (captions) and image URLs are downloaded from a HuggingFace repository.

download_images: The images are downloaded from their respective URLs and, along with the captions, are packed into tar files following the WebDataset format.

reorganize_tar: (Optional) For a variety of reasons (such as an unstable network or removed images), some images may fail to download, resulting in uneven tar files that contain varying numbers of examples. If you are using a training sampler that does not support uneven tar files, you need to re-organize the contents of the tar files so that each one contains an equal number of image-text pairs.

precache_encodings: (Optional) If you are training a model with frozen encoders (e.g. Stable Diffusion), you have the option to precache (precompute) image and/or text encodings (embeddings) in this sub-stage. Precaching these encodings can significantly enhance training throughput.

generate_wdinfo: (Optional) The wdinfo.pkl file, which stores information on dataset shards, is generated.

Depending on your circumstances, you may not need to run every sub-stage. For example, for parquet datasets not hosted on HuggingFace or those whose format is not parquet, sub-stages 2-5 can be used to process locally downloaded datasets. For webdatasets already downloaded locally, sub-stages 4-5 can be used to precache the encodings to reduce training time. For models that encode image and text on-the-fly, only sub-stages 1-3 need to be run.
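
The effect of reorganize_tar can be illustrated with a minimal sketch. This is not the launcher's implementation: the function names iter_examples and repack_shards, the output file naming, and the choice to drop leftover examples that cannot fill a final shard are all assumptions made for illustration. The sketch treats files that share a basename (for example 00000.jpg and 00000.txt) as one example and rewrites them into tar files that each hold the same number of examples.

```python
import io
import os
import tarfile


def iter_examples(tar_paths):
    """Yield one example at a time as a list of (member_name, bytes) pairs.

    Files sharing a basename are grouped into one example. This assumes the
    files of an example are stored next to each other inside the tar.
    """
    for path in tar_paths:
        with tarfile.open(path, "r") as tar:
            current_key, current = None, []
            for member in tar:
                if not member.isfile():
                    continue
                key = os.path.splitext(member.name)[0]
                data = tar.extractfile(member).read()
                if key != current_key and current:
                    yield current
                    current = []
                current_key = key
                current.append((member.name, data))
            if current:
                yield current


def repack_shards(tar_paths, out_dir, examples_per_shard):
    """Write shards holding exactly examples_per_shard examples each.

    Examples are renumbered so names from different inputs cannot collide.
    Leftover examples are dropped (an assumption of this sketch).
    Returns the list of written shard paths.
    """
    os.makedirs(out_dir, exist_ok=True)
    written, buffer, count = [], [], 0
    for example in iter_examples(tar_paths):
        buffer.append(example)
        if len(buffer) < examples_per_shard:
            continue
        # Hypothetical naming scheme for the output shards.
        shard_path = os.path.join(out_dir, f"shard_{len(written):05d}.tar")
        with tarfile.open(shard_path, "w") as out:
            for files in buffer:
                for name, data in files:
                    ext = os.path.splitext(name)[1]
                    info = tarfile.TarInfo(name=f"{count:08d}{ext}")
                    info.size = len(data)
                    out.addfile(info, io.BytesIO(data))
                count += 1
        written.append(shard_path)
        buffer = []
    return written
```

For instance, calling repack_shards on two input tar files holding 3 and 2 image-text pairs with examples_per_shard=2 would write two shards of 2 pairs each and leave one pair out.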

Running the Data Preparation Pipeline

Follow Section 5.1.1 to set up the environment. To run the data preparation pipeline for multimodal, set the conf/config.yaml file to:

defaults:
- data_preparation: multimodal/download_multimodal

stages:
- data_preparation

In multimodal/download_multimodal.yaml, set the dataset_repo_id and the dataset_output_root path. Enable the desired sub-stages by setting the enable flag, and modify other parameters as needed.

Then run:

python3 main.py

Configuration for Precaching

General Format

Precaching refers to the offline computation of image and text encodings prior to training a model. This technique is suitable for any model that uses pretrained, frozen encoders during training. By using precached encodings, embeddings for image and text do not need to be recomputed in each epoch, thereby significantly improving training throughput (up to 60% higher).

Precached encodings are saved in the WebDataset format. Each tar file contains one pickle file per training example, which stores all the modality embeddings for that example. Optionally, the tar file may also include the original image or text files:

t0_r0_0.tar
|---- 00000.pickle
|---- 00000.jpg (optional)
|---- 00000.txt (optional)
|---- 00001.pickle
|---- 00001.jpg (optional)
|---- 00001.txt (optional)
...

Each pickle file stores one Python dictionary whose key-value pairs map each embedding name to the embedding, stored as a numpy array.
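
To make this layout concrete, the sketch below builds a tiny tar file in memory and reads the embedding dictionaries back using only the standard library. The helper names write_example_tar and read_embeddings are hypothetical, and plain Python lists stand in for the numpy arrays that real precached shards contain.

```python
import io
import pickle
import tarfile


def write_example_tar(examples):
    """Return the bytes of a tar whose members are '<index>.pickle' files.

    `examples` is a list of dicts mapping embedding name -> embedding.
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for idx, embeddings in enumerate(examples):
            payload = pickle.dumps(embeddings)
            info = tarfile.TarInfo(name=f"{idx:05d}.pickle")
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def read_embeddings(tar_bytes):
    """Map each example's basename to its unpickled embedding dictionary."""
    result = {}
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as tar:
        for member in tar:
            if member.isfile() and member.name.endswith(".pickle"):
                key = member.name[: -len(".pickle")]
                # Only unpickle data you trust: pickle can run arbitrary code.
                result[key] = pickle.load(tar.extractfile(member))
    return result


# Stand-in values; real precached shards would hold numpy arrays here.
tar_bytes = write_example_tar([
    {"clip-vit-large-patch14_text": [0.1, 0.2]},
    {"clip-vit-large-patch14_text": [0.3, 0.4]},
])
embeddings = read_embeddings(tar_bytes)
print(sorted(embeddings))
```

Members with other extensions (such as the optional .jpg and .txt files) are skipped by read_embeddings, so the same loop works whether or not the originals were kept in the shard.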

Precaching Config

Configuration for precaching can be extensive and intricate for some models. To keep it organized, we use a separate YAML file for these configurations. An example can be found here: multimodal/precache_sd.yaml.

encodings:
  - modality: image
    extension: jpg
    key: autoencoderkl_image
    precision: 16
    encoder_config:
      cls: nemo.collections.multimodal.models.stable_diffusion.ldm.autoencoder.AutoencoderKL
      ... (kwargs to initialize the encoder)
  - modality: text
    extension: txt
    key: clip-vit-large-patch14_text
    precision: 32
    store_pad_tokens: True
    encoder_config:
      cls: nemo.collections.multimodal.modules.stable_diffusion.encoders.modules.FrozenCLIPEmbedder
      ... (kwargs to initialize the encoder)

In this YAML file, the encodings field specifies a list of embeddings to be saved in the pickle file. Each entry can have the following attributes:

modality: either image or text

extension: file extension for this modality in the tar file (e.g. ‘jpg’, ‘txt’)

key: dictionary key for the encoding. It is recommended to follow the format {model_name}-{model_variant}_{modality}, if applicable. e.g. clip-vit-large-patch14_text

precision: precision of the stored tensors (32 or 16)

store_pad_tokens: Whether to store the PAD tokens. Not storing PAD tokens can significantly reduce disk usage, but the training script must account for this. Ignored for image modality.

encoder_config: This dictionary must contain cls, which points to the location of the encoder class. The remaining parameters are passed as kwargs to initialize the encoder class.

Note: the encoder class must implement an encode or __call__ function. If store_pad_tokens is True, this function must return the encoded tensor. Otherwise, this function must return a tuple of (encoded_tensor, text_mask).

Note that a modality is not limited to a single encoding. The encodings field is a list so that multiple encoders can each have their own entry. For example, it’s possible to have one image embedding from CLIP, one text embedding from CLIP, and a second text embedding from T5.
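
One plausible way to consume an entry of the encodings list is sketched below: check the documented attributes, resolve the dotted cls path with importlib, and pass the remaining encoder_config keys as keyword arguments. The launcher may do this differently (for example through a configuration framework). The names validate_entry and build_encoder are hypothetical, treating precision as optional with a default of 32 is an assumption, and fractions.Fraction is used only as a stand-in class so that the sketch runs without NeMo installed.

```python
import importlib

ALLOWED_MODALITIES = {"image", "text"}
ALLOWED_PRECISIONS = {16, 32}


def validate_entry(entry):
    """Check one entry of the encodings list; raise ValueError on problems."""
    if entry.get("modality") not in ALLOWED_MODALITIES:
        raise ValueError("modality must be 'image' or 'text'")
    if not entry.get("key"):
        raise ValueError("key must be a non-empty string")
    # Assumption: precision is optional and defaults to 32.
    if entry.get("precision", 32) not in ALLOWED_PRECISIONS:
        raise ValueError("precision must be 16 or 32")
    if "cls" not in entry.get("encoder_config", {}):
        raise ValueError("encoder_config must contain 'cls'")


def build_encoder(encoder_config):
    """Import the class named by 'cls' and initialize it with the other keys."""
    kwargs = dict(encoder_config)  # copy so the caller's config is untouched
    module_path, _, class_name = kwargs.pop("cls").rpartition(".")
    encoder_cls = getattr(importlib.import_module(module_path), class_name)
    return encoder_cls(**kwargs)


# Stand-in entry: fractions.Fraction replaces a real encoder class.
entry = {
    "modality": "text",
    "extension": "txt",
    "key": "demo_text",
    "precision": 32,
    "encoder_config": {
        "cls": "fractions.Fraction",
        "numerator": 3,
        "denominator": 4,
    },
}
validate_entry(entry)
encoder = build_encoder(entry["encoder_config"])
print(type(encoder).__name__)
```

Validating every entry before any encoder is built surfaces configuration mistakes early, before time is spent loading model weights.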

Resume Precaching (Advanced)

The precaching module can launch multiple tasks in parallel, as specified by precache_encodings.node_array_size, to reduce the time required for each task. In the event of a failed or interrupted run, we provide the option to resume precaching by specifying the exact task_id or range of task_id values to re-run. This option eliminates the need to rerun the entire precaching process, which can be lengthy.

Consider the following two scenarios as examples.

Interrupted runs: suppose 100 tasks (0-99) were launched, but tasks 50-99 did not complete before the cluster went down. To resume these tasks, set node_array_size to a range string in conf/data_preparation/multimodal/download_multimodal.yaml:
```
precache_encodings:
  node_array_size: 50-99
```

In addition, in nemo_launcher/collections/dataprep_scripts/multimodal_dataprep/conf/config.yaml, specify:

override_task_count: 100

Failed run: suppose 100 tasks (0-99) were launched, but task 67 experienced a node failure. To re-run task 67, specify the following in conf/data_preparation/multimodal/download_multimodal.yaml:
```
precache_encodings:
  node_array_size: 1
```

In addition, in nemo_launcher/collections/dataprep_scripts/multimodal_dataprep/conf/config.yaml, specify:

override_task_id: 67
override_task_count: 100
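
The relationship between these settings can be sketched as follows. The parsing of the range string, the handling of override_task_id, and the round-robin assignment of shards to tasks are assumptions made for illustration, not the launcher's actual logic. The sketch suggests why override_task_count is presumably kept at the original value of 100 when resuming: if each task's share of the shards is derived from the total task count, changing that count would assign different shards to the re-run tasks.

```python
def parse_task_ids(node_array_size):
    """Interpret node_array_size as a count N (tasks 0..N-1) or as an
    inclusive 'start-end' range string such as '50-99'."""
    text = str(node_array_size)
    if "-" in text:
        start, end = text.split("-", 1)
        return list(range(int(start), int(end) + 1))
    return list(range(int(text)))


def tasks_to_run(node_array_size, override_task_id=None):
    """Return the task ids a (re-)launch would cover in this sketch."""
    if override_task_id is not None:
        # Assumed behavior: the single launched task takes the overridden id.
        return [override_task_id]
    return parse_task_ids(node_array_size)


def shards_for_task(shard_names, task_id, task_count):
    """Hypothetical round-robin split: task i handles shards i, i+count, ..."""
    return shard_names[task_id::task_count]


# Hypothetical shard list of 1000 tar files.
shards = [f"shard_{i}.tar" for i in range(1000)]

# Interrupted run: resume tasks 50-99 of the original 100.
resumed = tasks_to_run("50-99")

# Failed run: launch one task and make it behave as task 67 of 100.
rerun = tasks_to_run(1, override_task_id=67)
rerun_shards = shards_for_task(shards, rerun[0], task_count=100)
print(len(resumed), rerun, rerun_shards[:2])
```

Under this assumed scheme, re-running task 67 with a task count of 100 touches the same shards as the original task 67 did, which is the property a resume mechanism needs.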