Data Preparation

Note

It is the responsibility of each user to check the content of the dataset, review the applicable licenses, and determine if it is suitable for their intended use. Users should review any applicable links associated with the dataset before placing the data on their machine.

The NeVA model training involves two phases: pretraining and fine-tuning. Each phase requires a distinct dataset.

For pretraining, use the LAION/CC/SBU BLIP-Caption Concept-balanced 558K dataset. Obtain this dataset from LLaVA’s official GitHub repository. After downloading, extract the dataset to:

${data_dir}/neva/datasets/LLaVA-Pretrain-LCS-558K/blip_laion_cc_sbu_558k.json

The image data can be downloaded from HuggingFace. Extract these images to:

${data_dir}/neva/datasets/LLaVA-Pretrain-LCS-558K/images
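
If you want a quick sanity check before training, the sketch below loads the pretraining JSON and verifies that a sample of the referenced images is present. It assumes the LLaVA-style record layout (an "image" field holding a path relative to the images directory) and that ${data_dir} is exported as an environment variable; adjust as needed.

import json
import os

data_dir = os.environ["data_dir"]  # the same ${data_dir} used in the paths above
root = os.path.join(data_dir, "neva/datasets/LLaVA-Pretrain-LCS-558K")

with open(os.path.join(root, "blip_laion_cc_sbu_558k.json")) as f:
    records = json.load(f)
print("records:", len(records))

# Check a sample of records; each "image" path is assumed relative to images/
missing = [r["image"] for r in records[:1000]
           if not os.path.exists(os.path.join(root, "images", r["image"]))]
print("missing images in first 1000 records:", len(missing))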

For fine-tuning, use the LLaVA instruction-tuning mixture dataset. Obtain this dataset from LLaVA’s official GitHub Data Card. The prompts can be downloaded from HuggingFace. After downloading, extract the dataset to:

${data_dir}/neva/datasets/LLaVA-Instruct-mixture/llava_v1_5_mix665k.json

The image data can be downloaded by following the instructions in LLaVA’s official GitHub repository. Download and organize the images inside:

${data_dir}/neva/datasets/LLaVA-Instruct-mixture/images
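
The mixture pulls images from several source datasets, so a quick layout check can save a failed run. The subdirectory names in the sketch below follow LLaVA’s Data Card and are assumptions; adjust them to match how you organized the downloads.

import os

data_dir = os.environ["data_dir"]
image_root = os.path.join(data_dir, "neva/datasets/LLaVA-Instruct-mixture/images")

# Expected subfolders per LLaVA's Data Card (assumption; adjust to your layout)
expected = ["coco/train2017", "gqa/images", "ocr_vqa/images",
            "textvqa/train_images", "vg/VG_100K", "vg/VG_100K_2"]
for sub in expected:
    status = "ok" if os.path.isdir(os.path.join(image_root, sub)) else "MISSING"
    print(f"{status:8s} {sub}")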

Setting Up LLaMA-2 Chat or Vicuna-v1.5 Checkpoints

We offer support for both LLaMA-2 chat and Vicuna-v1.5 models. Once you’ve downloaded the appropriate HuggingFace checkpoint, you’ll need to extract and save it to your disk to prepare for pretraining.

Before initiating pretraining, convert the LLaMA-2 checkpoints to NeMo’s format:

  1. For the 7B chat model, execute the following conversion command:

    python /opt/NeMo/scripts/checkpoint_converters/convert_llama_hf_to_nemo.py \
        --input_name_or_path <PATH-TO-HF-CHECKPOINT> \
        --output_path ${data_dir}/neva/checkpoints/llama-2-7b-chat.nemo

    For other supported models, adjust the --input_name_or_path and --output_path values accordingly.
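
    Before running the conversion, it can help to confirm that the downloaded HuggingFace checkpoint directory looks complete. The sketch below is a minimal pre-flight check; the file names are typical of LLaMA-2 HF exports and are assumptions, since shard naming varies between releases.

    import os

    hf_dir = "<PATH-TO-HF-CHECKPOINT>"  # same path passed to --input_name_or_path
    for name in ["config.json", "tokenizer.model"]:
        status = "ok" if os.path.exists(os.path.join(hf_dir, name)) else "MISSING"
        print(f"{status:8s} {name}")
    shards = [f for f in os.listdir(hf_dir) if f.endswith((".bin", ".safetensors"))]
    print("weight shards found:", len(shards))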

Tokenizer Configuration

Special tokens must be incorporated into the tokenizer for NeVA training. After downloading language models from Huggingface, ensure you also fetch the corresponding tokenizer model. Using the 7B-chat model as a reference:

  1. Download the tokenizer.model to:

    ${data_dir}/neva/tokenizers/tokenizer.model

  2. Use the command below to integrate special tokens into the tokenizer within the NeMo container:

    cd /opt/sentencepiece/src/; protoc --python_out=/opt/NeMo/scripts/tokenizers/ sentencepiece_model.proto
    python /opt/NeMo/scripts/tokenizers/add_special_tokens_to_sentencepiece.py \
        --input_file ${data_dir}/neva/tokenizers/tokenizer.model \
        --output_file ${data_dir}/neva/tokenizers/tokenizer_neva.model \
        --is_userdefined \
        --tokens "<extra_id_0>" "<extra_id_1>" "<extra_id_2>" "<extra_id_3>" \
                 "<extra_id_4>" "<extra_id_5>" "<extra_id_6>" "<extra_id_7>"
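
    To confirm the special tokens were registered, a minimal check with the sentencepiece Python package (assumed to be available in the container) might look like this:

    import os
    import sentencepiece as spm

    data_dir = os.environ["data_dir"]
    sp = spm.SentencePieceProcessor(
        model_file=os.path.join(data_dir, "neva/tokenizers/tokenizer_neva.model"))
    for i in range(8):
        piece = f"<extra_id_{i}>"
        # piece_to_id returns the unk id for unknown pieces, so a distinct,
        # non-unk id indicates the token was added successfully
        print(piece, "->", sp.piece_to_id(piece))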

If you only want to use NeMo for inference or additional tuning with trained checkpoints from the LLaVA repository, we also provide a tool to convert LLaVA checkpoints to .nemo format.

  1. Download the checkpoint

    For example, download the original LLaVA checkpoint from Huggingface’s LLaVA model.

  2. Update the tokenizer

    The tokenizer file, named tokenizer.model, is located inside the downloaded HF checkpoint. For NeVA training, it’s essential to incorporate special tokens into the tokenizer. After downloading the 7B or 13B model from Huggingface, ensure you also obtain the corresponding tokenizer model. Use the following command within NeMo container to integrate special tokens:

    cd /opt/sentencepiece/src/; protoc --python_out=/opt/NeMo/scripts/tokenizers/ sentencepiece_model.proto
    python /opt/NeMo/scripts/tokenizers/add_special_tokens_to_sentencepiece.py \
        --input_file /path/to/tokenizer.model \
        --output_file /path/to/tokenizer_neva.model \
        --is_userdefined \
        --tokens "<extra_id_0>" "<extra_id_1>" "<extra_id_2>" "<extra_id_3>" \
                 "<extra_id_4>" "<extra_id_5>" "<extra_id_6>" "<extra_id_7>"

  3. Convert the checkpoint

    Clone the LLaVA source code inside the container from LLaVA’s GitHub repository, since we cannot redistribute the code directly. There is no need to install LLaVA; the container already provides the required environment. Simply add the cloned repository to the Python path:

    export PYTHONPATH=$PYTHONPATH:/path/to/LLaVA

    Now, convert the checkpoint:

    python /opt/NeMo/examples/multimodal/mllm/neva/convert_hf_llava_to_neva.py \
        --in-file /path/to/llava-v1.5-7b \
        --out-file /path/to/llava-v1.5-7b.nemo \
        --tokenizer-model /path/to/tokenizer_neva.model

    The resulting /path/to/llava-v1.5-7b.nemo will be your converted .nemo checkpoint.
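
    As a final sanity check, a .nemo file is packaged as a tar archive containing the model config and weights, so you can list its contents. This is a sketch, assuming the example output path above:

    import tarfile

    with tarfile.open("/path/to/llava-v1.5-7b.nemo") as archive:
        for member in archive.getmembers():
            print(member.name, member.size)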
