Data Preparation

Important

It is the responsibility of each user to check the content of the dataset, review the applicable licenses, and determine if it is suitable for their intended use. Users should review any applicable links associated with the dataset before placing the data on their machine.

The VideoNeVA model’s pre-training and fine-tuning process follows the same data format as NeVA, with two minor differences:

  • Media Type - NeVA utilizes images, specified by the key "image". VideoNeVA, on the other hand, employs videos, specified by the key "video".

  • Prompt Content - In the "value" field of the "conversations" array, the prompt for NeVA includes the placeholder <image>. For VideoNeVA, the placeholder is <video>.

The following example shows the data format needed for VideoNeVA:

{ "id": 0, "video": "{video_folder}/076101_076150/1043215450.mp4", "conversations": [ { "from": "human", "value": "<video>\nWrite a terse but informative summary of the following video clip." }, { "from": "gpt", "value": "Oaxaca de juarez, mexico - circa 1970: mexican tourists on the square of the cathedral of our lady of the assumption in the city of oaxaca. archival of mexico in oaxaca state in the 1970s." } ] }

Set Up LLaMA-2 Chat or Vicuna-v1.5 Checkpoints

The NeMo Framework offers support for both LLaMA-2 chat and Vicuna-v1.5 models. After downloading the appropriate Hugging Face checkpoint, you’ll need to extract and save it to your disk to prepare for pretraining.

Before initiating pretraining, convert the LLaMA-2 checkpoints to NeMo format.

  1. When using the 7B chat model, run the following conversion command:

    python /opt/NeMo/scripts/checkpoint_converters/convert_llama_hf_to_nemo.py \
        --input_name_or_path <PATH-TO-HF-CHECKPOINT> \
        --output_path ${data_dir}/neva/checkpoints/llama-2-7b-chat.nemo

  2. When using other supported models, adjust --input_name_or_path and --output_path accordingly, as in the sketch below.
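
As an illustrative sketch only, the same converter can be driven from Python to process several checkpoints in one pass; the model names and Hugging Face checkpoint paths below are placeholders.

    import os
    import subprocess

    data_dir = os.environ["data_dir"]  # same data_dir as in the shell examples

    # Placeholder mapping of output names to downloaded HF checkpoint paths.
    models = {
        "llama-2-13b-chat": "<PATH-TO-13B-HF-CHECKPOINT>",
        "llama-2-70b-chat": "<PATH-TO-70B-HF-CHECKPOINT>",
    }

    for name, hf_path in models.items():
        subprocess.run(
            [
                "python",
                "/opt/NeMo/scripts/checkpoint_converters/convert_llama_hf_to_nemo.py",
                "--input_name_or_path", hf_path,
                "--output_path", f"{data_dir}/neva/checkpoints/{name}.nemo",
            ],
            check=True,  # raise if a conversion fails
        )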

Configure Tokenizer

You must incorporate special tokens into the tokenizer for NeVA training. After downloading the language models from Hugging Face, fetch the corresponding tokenizer model.

The following steps use the 7B chat model as a reference:

  1. Download the tokenizer.model to:

    ${data_dir}/neva/tokenizers/tokenizer.model
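
    One way to fetch the file is with the huggingface_hub Python client, sketched below; the repo ID assumes the gated meta-llama/Llama-2-7b-chat-hf checkpoint, for which you need approved access and a configured Hugging Face token.

    import os
    from huggingface_hub import hf_hub_download

    data_dir = os.environ["data_dir"]  # same data_dir as in the shell examples

    # Places tokenizer.model under ${data_dir}/neva/tokenizers/.
    hf_hub_download(
        repo_id="meta-llama/Llama-2-7b-chat-hf",
        filename="tokenizer.model",
        local_dir=f"{data_dir}/neva/tokenizers",
    )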

  2. Use the following command to integrate the special tokens into the tokenizer within the NeMo container:

    cd /opt/sentencepiece/src/; protoc --python_out=/opt/NeMo/scripts/tokenizers/ sentencepiece_model.proto
    python /opt/NeMo/scripts/tokenizers/add_special_tokens_to_sentencepiece.py \
        --input_file ${data_dir}/neva/tokenizers/tokenizer.model \
        --output_file ${data_dir}/neva/tokenizers/tokenizer_neva.model \
        --is_userdefined \
        --tokens "<extra_id_0>" "<extra_id_1>" "<extra_id_2>" "<extra_id_3>" \
                 "<extra_id_4>" "<extra_id_5>" "<extra_id_6>" "<extra_id_7>"
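
To confirm the special tokens were added, a quick check with the sentencepiece Python package (one option among SentencePiece-compatible loaders) might look like this:

    import os
    import sentencepiece as spm

    data_dir = os.environ["data_dir"]  # same data_dir as in the shell examples
    sp = spm.SentencePieceProcessor(
        model_file=f"{data_dir}/neva/tokenizers/tokenizer_neva.model"
    )

    # Every added token should resolve to a real piece ID, not the unknown ID.
    for i in range(8):
        token = f"<extra_id_{i}>"
        piece_id = sp.piece_to_id(token)
        assert piece_id != sp.unk_id(), f"{token} missing from tokenizer"
        print(token, "->", piece_id)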
