Data Preparation

Important

It is the responsibility of each user to check the content of the dataset, review the applicable licenses, and determine if it is suitable for their intended use. Users should review any applicable links associated with the dataset before placing the data on their machine.

Prepare Pretraining Datasets

The VideoNeVA model’s pretraining and fine-tuning data follows the same format as NeVA, with two minor differences:

Media Type - NeVA uses images, specified by the key "image". VideoNeVA, on the other hand, uses videos, specified by the key "video".
Prompt Content - In the "value" field of the "conversations" array, the prompt for NeVA includes the placeholder <image>. For VideoNeVA, the placeholder is <video>.
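As an illustration of these two differences, the following minimal sketch rewrites a NeVA-style record into the VideoNeVA layout. The function name neva_to_videoneva and the file names are hypothetical and are not part of NeMo; the sketch only swaps the media key and the prompt placeholder.

```python
import copy


def neva_to_videoneva(record, video_path):
    """Return a copy of a NeVA-style record rewritten in the VideoNeVA layout.

    Hypothetical helper for illustration only; it is not a NeMo API.
    """
    converted = copy.deepcopy(record)  # leave the input record untouched
    converted.pop("image", None)       # difference 1: drop the "image" key...
    converted["video"] = video_path    # ...and use the "video" key instead
    for turn in converted.get("conversations", []):
        # difference 2: the prompt placeholder changes from <image> to <video>
        turn["value"] = turn["value"].replace("<image>", "<video>")
    return converted


# Made-up sample record, used only to exercise the helper.
neva_record = {
    "id": 0,
    "image": "example.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nDescribe the scene."},
        {"from": "gpt", "value": "A street at night."},
    ],
}
video_record = neva_to_videoneva(neva_record, "example.mp4")
```

The helper copies the record before editing it, so the original NeVA data can be kept alongside the converted version.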

The following example shows the data format needed for VideoNeVA:

{
    "id": 0,
    "video": "{video_folder}/076101_076150/1043215450.mp4",
    "conversations": [
        {
            "from": "human",
            "value": "<video>\nWrite a terse but informative summary of the following video clip."
        },
        {
            "from": "gpt",
            "value": "Oaxaca de juarez, mexico - circa 1970: mexican tourists on the square of the cathedral of our lady of the assumption in the city of oaxaca. archival of mexico in oaxaca state in the 1970s."
        }
    ]
}
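Before starting a long training run, it can help to sanity-check records against the layout shown above. The sketch below is one possible check, written under the assumption that the example reflects the required fields; the function name check_videoneva_record is hypothetical, and the actual NeMo data loader may accept or require more than this.

```python
import json


def check_videoneva_record(record):
    """Return a list of problems found in one record (empty if none).

    The rules mirror the example record only; they are an assumption,
    not a specification of what the NeMo loader enforces.
    """
    problems = []
    if not isinstance(record.get("video"), str):
        problems.append("missing or non-string 'video' key")
    turns = record.get("conversations")
    if not isinstance(turns, list) or not turns:
        problems.append("'conversations' must be a non-empty list")
        return problems
    for i, turn in enumerate(turns):
        if not isinstance(turn, dict) or "from" not in turn or "value" not in turn:
            problems.append(f"turn {i} lacks 'from' or 'value'")
    has_placeholder = any(
        isinstance(t, dict)
        and t.get("from") == "human"
        and "<video>" in str(t.get("value", ""))
        for t in turns
    )
    if not has_placeholder:
        problems.append("no human turn contains the <video> placeholder")
    return problems


# Made-up sample in the same shape as the example record.
sample = json.loads(
    '{"id": 0, "video": "clips/example.mp4", "conversations": ['
    '{"from": "human", "value": "<video>\\nSummarize the clip."}, '
    '{"from": "gpt", "value": "A short summary."}]}'
)
print(check_videoneva_record(sample))
```

Returning a list of messages rather than raising on the first error makes it easier to report every problem in a record at once.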

Set Up LLaMA-2 Chat or Vicuna-v1.5 Checkpoints

The NeMo Framework supports both LLaMA-2 chat and Vicuna-v1.5 models. After downloading the appropriate Hugging Face checkpoint, extract it and save it to disk to prepare for pretraining.

Before initiating pretraining, convert the LLaMA-2 checkpoints to NeMo format.

When using the 7B chat model, run the following conversion command:

python /opt/NeMo/scripts/checkpoint_converters/convert_llama_hf_to_nemo.py \
  --input_name_or_path <PATH-TO-HF-CHECKPOINT> \
  --output_path ${data_dir}/neva/checkpoints/llama-2-7b-chat.nemo

When using other supported models, change the values of --input_name_or_path and --output_path accordingly.
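If the conversion is scripted for several models, the two arguments can be filled in programmatically. The following sketch only assembles the command shown above with different values; the function name build_convert_command and both example paths are placeholders, not real locations.

```python
import shlex

# Converter script path taken from the command above.
CONVERTER = "/opt/NeMo/scripts/checkpoint_converters/convert_llama_hf_to_nemo.py"


def build_convert_command(hf_checkpoint, output_path):
    """Assemble the conversion command as an argument list (hypothetical helper)."""
    return [
        "python", CONVERTER,
        "--input_name_or_path", hf_checkpoint,
        "--output_path", output_path,
    ]


# Placeholder paths; substitute the real checkpoint and output locations.
cmd = build_convert_command(
    "/path/to/hf-checkpoint",
    "/path/to/output/example-model.nemo",
)
print(shlex.join(cmd))  # shell-quoted form, suitable for copy and paste
```

The argument list can also be passed directly to subprocess.run(cmd, check=True) inside the NeMo container.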

Configure Tokenizer

You must incorporate special tokens into the tokenizer for NeVA training. After downloading the language models from Hugging Face, fetch the corresponding tokenizer model.

The following steps use the 7B chat model as a reference.

Download the tokenizer.model file to:

${data_dir}/neva/tokenizers/tokenizer.model

Use the following command to integrate the special tokens into the tokenizer within the NeMo container:

cd /opt/sentencepiece/src/; protoc --python_out=/opt/NeMo/scripts/tokenizers/ sentencepiece_model.proto
python /opt/NeMo/scripts/tokenizers/add_special_tokens_to_sentencepiece.py \
--input_file ${data_dir}/neva/tokenizers/tokenizer.model \
--output_file ${data_dir}/neva/tokenizers/tokenizer_neva.model \
--is_userdefined \
--tokens "<extra_id_0>" "<extra_id_1>" "<extra_id_2>" "<extra_id_3>" \
         "<extra_id_4>" "<extra_id_5>" "<extra_id_6>" "<extra_id_7>"
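After the command finishes, one way to confirm that the special tokens were added is to load the new tokenizer and look up each token. This is a minimal sketch, assuming the sentencepiece Python package is available in the environment; the function name missing_special_tokens is hypothetical.

```python
# The eight special tokens passed to --tokens in the command above.
EXPECTED_TOKENS = [f"<extra_id_{i}>" for i in range(8)]


def missing_special_tokens(model_path):
    """Return the expected tokens that the tokenizer does not contain.

    Hypothetical helper; assumes the sentencepiece package is installed.
    """
    import sentencepiece as spm  # imported lazily so the module loads without it

    sp = spm.SentencePieceProcessor(model_file=model_path)
    # Pieces that are not in the vocabulary map to the unknown-token id.
    return [t for t in EXPECTED_TOKENS if sp.piece_to_id(t) == sp.unk_id()]


print(EXPECTED_TOKENS)
```

Calling missing_special_tokens with the path to tokenizer_neva.model should return an empty list if all eight tokens were integrated.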