Multimodal Language Model Datasets

The NeMo multimodal language model supports the conversation data format, which is modeled after LLaVA. Sample datasets can be explored in LLaVA's data documentation.
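
In this conversation format, each training example is a JSON record that pairs an image with alternating human/assistant turns. A minimal sketch follows; the id, image path, and text are illustrative only, so consult LLaVA's data documentation for the authoritative schema:

```python
import json

# Illustrative LLaVA-style conversation record; the field values below are
# made up -- only the overall shape of the record matters.
sample = {
    "id": "000000001",
    "image": "images/000000001.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nDescribe the picture briefly."},
        {"from": "gpt", "value": "A dog chasing a ball across a park lawn."},
    ],
}

# A dataset file is a JSON list of such records.
print(json.dumps([sample])[:60])
```

The `<image>` placeholder in the first human turn marks where the image features are spliced into the prompt.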

The NeVA model is trained in two phases, pretraining and fine-tuning, each of which requires its own dataset.

For pretraining, use the LAION/CC/SBU BLIP-Caption Concept-balanced 558K dataset, available from LLaVA's GitHub. After downloading the dataset, extract it to:

/path/to/neva/datasets/LLaVA-Pretrain-LCS-558K/blip_laion_cc_sbu_558k.json

Download the image data from HuggingFace and extract it to:

/path/to/neva/datasets/LLaVA-Pretrain-LCS-558K/images
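
After extracting, it can be worth verifying that every image referenced in the JSON file actually exists on disk. The helper below is a hypothetical sanity check, not something shipped with NeMo:

```python
import os

def find_missing_images(records, image_dir):
    """Return image paths referenced by the records but absent under image_dir.

    Hypothetical convenience helper; not part of NeMo or LLaVA."""
    missing = []
    for rec in records:
        rel = rec.get("image")  # some records may be text-only
        if rel and not os.path.exists(os.path.join(image_dir, rel)):
            missing.append(rel)
    return missing

# Example with in-memory records; in practice, load blip_laion_cc_sbu_558k.json
# and point image_dir at .../LLaVA-Pretrain-LCS-558K/images.
records = [{"id": "0", "image": "definitely_not_there.jpg", "conversations": []}]
print(find_missing_images(records, "/path/to/neva/datasets/LLaVA-Pretrain-LCS-558K/images"))
```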

For fine-tuning, use the LLaVA-Instruct-150K dataset, also available on LLaVA's GitHub. Download the prompts from HuggingFace to:

/path/to/neva/datasets/LLaVA-Instruct-150K/

Image data for this phase can be obtained from the COCO Dataset. Once downloaded, extract the images to:

/path/to/neva/datasets/LLaVA-Instruct-150K/images

The following instructions are specific to the NeVA model within the NeMo multimodal language model collection.

Setting Up LLaMA-2 Chat Checkpoints

Support is available for both the 7B and 13B chat models; both can be downloaded from LLaVA's Model Zoo. After downloading the desired HuggingFace checkpoint, extract and store it on your local system in preparation for pretraining.

To convert the LLaMA-2 checkpoints to NeMo’s format, follow these steps:

  1. Adjust the default yaml file at megatron_llama_config.yaml. Ensure model.mcore_gpt and model.transformer_engine are set to False before the checkpoint conversion.

  2. For the 7B chat model, use this conversion command:

python /opt/NeMo/scripts/nlp_language_modeling/convert_hf_llama_to_nemo.py \
  --in-file <PATH-TO-HF-CHECKPOINT> \
  --out-file /path/to/neva/checkpoints/llama-2-7b-chat.nemo

For the 13B model, adjust the paths passed to the --in-file and --out-file parameters accordingly.

  3. Execute the following command to partition the checkpoint for tensor model parallelism. Use TP=4 for the 7B model and TP=8 for the 13B model so that both pretraining and fine-tuning run without memory issues.

# Instructions for the 7B model partitioning provided here.
# Adjust parameters for the 13B model as needed.
python /opt/NeMo/examples/nlp/language_modeling/megatron_change_num_partitions.py \
  --model_file=/path/to/neva/checkpoints/llama-2-7b-chat.nemo \
  --target_file=/path/to/neva/checkpoints/llama-2-7b-chat-tp4.nemo \
  --tensor_model_parallel_size=1 \
  --target_tensor_model_parallel_size=4 \
  --pipeline_model_parallel_size=1 \
  --target_pipeline_model_parallel_size=1 \
  --tp_conversion_only \
  --model_class="nemo.collections.nlp.models.language_modeling.megatron_gpt_model.MegatronGPTModel" \
  --tokenizer_model_path=<PATH-TO-HF-CHECKPOINT>/tokenizer.model
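
Since only the model size and the tensor-parallel degree change between the 7B and 13B cases, the command line can be templated. The helper below is a hypothetical convenience (not a NeMo utility), following the TP=4 (7B) / TP=8 (13B) guidance above; the --tokenizer_model_path flag is omitted because its value depends on your checkpoint location:

```python
# Hypothetical helper that fills in the partitioning command for either model
# size. Paths follow the placeholder layout used throughout this page.
def partition_command(size, ckpt_dir="/path/to/neva/checkpoints"):
    tp = {"7b": 4, "13b": 8}[size]  # recommended TP degree per model size
    return (
        "python /opt/NeMo/examples/nlp/language_modeling/megatron_change_num_partitions.py"
        f" --model_file={ckpt_dir}/llama-2-{size}-chat.nemo"
        f" --target_file={ckpt_dir}/llama-2-{size}-chat-tp{tp}.nemo"
        " --tensor_model_parallel_size=1"
        f" --target_tensor_model_parallel_size={tp}"
        " --pipeline_model_parallel_size=1"
        " --target_pipeline_model_parallel_size=1"
        " --tp_conversion_only"
        " --model_class=\"nemo.collections.nlp.models.language_modeling"
        ".megatron_gpt_model.MegatronGPTModel\""
    )

print(partition_command("13b"))
```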

Tokenizer Configuration

For NeVA training, special tokens must be added to the tokenizer. After obtaining the 7B/13B model from HuggingFace, also download the corresponding tokenizer model. The steps below use the 7B-chat model as an example:

  1. Download the tokenizer.model to:

/path/to/neva/tokenizers/tokenizer.model

  2. Running the next script requires the NeMo dependency, so it is most convenient to run it inside the NeMo container.

  3. Use the command below to add the special tokens to the tokenizer:

cd /opt; git clone https://github.com/google/sentencepiece.git && \
  cd sentencepiece && \
  mkdir build && \
  cd build && \
  cmake .. && \
  make && \
  make install && \
  ldconfig

cd /opt/sentencepiece/src/; protoc --python_out=/opt/NeMo/scripts/tokenizers/ sentencepiece_model.proto

python /opt/NeMo/scripts/tokenizers/add_special_tokens_to_sentencepiece.py \
  --input_file /path/to/neva/tokenizers/tokenizer.model \
  --output_file /path/to/neva/tokenizers/tokenizer_neva.model \
  --is_userdefined \
  --tokens "<extra_id_0>" "<extra_id_1>" "<extra_id_2>" "<extra_id_3>" \
           "<extra_id_4>" "<extra_id_5>" "<extra_id_6>" "<extra_id_7>"
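
The eight tokens passed above all follow the `<extra_id_N>` pattern, so the argument list can be generated rather than typed by hand. A small convenience sketch, not part of the NeMo script:

```python
# Generate the eight "<extra_id_N>" special tokens passed to
# add_special_tokens_to_sentencepiece.py above.
tokens = [f"<extra_id_{i}>" for i in range(8)]
print(" ".join(f'"{t}"' for t in tokens))
```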

© Copyright 2023-2024, NVIDIA. Last updated on May 17, 2024.