Multimodal Language Model Datasets

The NeMo multimodal language model supports the conversation data format, drawing inspiration from and designed based on LLaVA. Sample datasets can be explored at LLaVA’s data documentation.

Preparing the Training Dataset

The NeVA model training encompasses two phases: pretraining and finetuning. Each phase mandates a unique dataset.

For pretraining, utilize the LAION/CC/SBU BLIP-Caption Concept-balanced 558K dataset. Access this dataset via LLaVA’s GitHub. After procuring the dataset, extract it to:

Copy
Copied!

            
            /path/to/neva/datasets/LLaVA-Pretrain-LCS-558K/blip_laion_cc_sbu_558k.json

Acquire the image data from HuggingFace and extract to:

Copy
Copied!

            
            /path/to/neva/datasets/LLaVA-Pretrain-LCS-558K/images

For fine-tuning, deploy the LLaVA-Instruct-150K dataset. This is also available on LLaVA’s GitHub. You can download the prompts from HuggingFace:

Copy
Copied!

            
            /path/to/neva/datasets/LLaVA-Instruct-150K/

Image data for this phase can be obtained from the COCO Dataset. Once downloaded, extract the images to:

Copy
Copied!

            
            /path/to/neva/datasets/LLaVA-Instruct-150K/images

Additional Preparation for NeVA Model

The following instructions are specific to the NeVA model within the NeMo Multimodal Language Models.

Setting Up LLaMA-2 Chat Checkpoints

Support is available for both the 7B and 13B chat models. Both can be downloaded from LLaVA’s Model Zoo. After downloading the desired HuggingFace checkpoint, extract and store it on your local system to prep for pretraining.

To convert the LLaMA-2 checkpoints to NeMo’s format, follow these steps:

Adjust the default yaml file at megatron_llama_config.yaml. Ensure model.mcore_gpt and model.transformer_engine are set to False before the checkpoint conversion.
For the 7B chat model, use this conversion command:

Copy
Copied!

            
            python /opt/NeMo/scripts/nlp_language_modeling/convert_hf_llama_to_nemo.py \
  --in-file <PATH-TO-HF-CHECKPOINT> \
  --out-file /path/to/neva/checkpoints/llama-2-7b-chat.nemo

For the 13B model, adjust the paths in the –in-file and –out-file parameters accordingly.

Execute the subsequent command to divide the checkpoint for tensor model parallel sizes of 4 or 8. It’s advisable to use TP=4 for the 7B model and TP=8 for the 13B model to ensure both pretraining and finetuning operate without memory complications.

Copy
Copied!

            
            # Instructions for the 7B model partitioning provided here.
# Adjust parameters for the 13B model as needed.
python /opt/NeMo/examples/nlp/language_modeling/megatron_change_num_partitions.py \
  --model_file=/path/to/neva/checkpoints/llama-2-7b-chat.nemo  \
  --target_file=/path/to/neva/checkpoints/llama-2-7b-chat-tp4.nemo \
  --tensor_model_parallel_size=1 \
  --target_tensor_model_parallel_size=4 \
  --pipeline_model_parallel_size=1 \
  --target_pipeline_model_parallel_size=1 \
  --tp_conversion_only \
  --model_class="nemo.collections.nlp.models.language_modeling.megatron_gpt_model.MegatronGPTModel" \
  --tokenizer_model_path=<PATH-TO-HF-CHECKPOINT>/tokenizer.model

Tokenizer Configuration

For NeVA training, integrating special tokens into the tokenizer is vital. After obtaining the 7B/13B model from Huggingface, also procure the corresponding tokenizer model. Referring to the 7B-chat model:

Download the tokenizer.model to:

Copy
Copied!

            
            /path/to/neva/tokenizers/tokenizer.model

Executing the next script necessitates the NeMo dependency. It’s more convenient to run the script within the NeMo container.
Employ the command below to infuse special tokens into the tokenizer:

Copy
Copied!

            
            cd /opt/sentencepiece/src/; protoc --python_out=/opt/NeMo/scripts/tokenizers/ sentencepiece_model.proto
python /opt/NeMo/scripts/tokenizers/add_special_tokens_to_sentencepiece.py \
--input_file /path/to/neva/tokenizers/tokenizer.model \
--output_file /path/to/neva/tokenizers/tokenizer_neva.model \
--is_userdefined \
--tokens "<extra_id_0>" "<extra_id_1>" "<extra_id_2>" "<extra_id_3>" \
         "<extra_id_4>" "<extra_id_5>" "<extra_id_6>" "<extra_id_7>"