Important

NeMo 2.0 is an experimental feature and currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to the Migration Guide for information on getting started.

Multimodal Language Model Datasets

The NeMo multimodal language model supports the conversation data format, drawing inspiration from and designed based on LLaVA. Sample datasets can be explored at LLaVA’s data documentation.

Preparing the Training Dataset

The NeVA model training encompasses two phases: pretraining and finetuning. Each phase mandates a unique dataset.

For pretraining, utilize the LAION/CC/SBU BLIP-Caption Concept-balanced 558K dataset. Access this dataset via LLaVA’s GitHub. After procuring the dataset, extract it to:

/path/to/neva/datasets/LLaVA-Pretrain-LCS-558K/blip_laion_cc_sbu_558k.json

Acquire the image data from HuggingFace and extract to:

/path/to/neva/datasets/LLaVA-Pretrain-LCS-558K/images

For fine-tuning, deploy the LLaVA-Instruct-150K dataset. This is also available on LLaVA’s GitHub. You can download the prompts from HuggingFace:

/path/to/neva/datasets/LLaVA-Instruct-150K/

Image data for this phase can be obtained from the COCO Dataset. Once downloaded, extract the images to:

/path/to/neva/datasets/LLaVA-Instruct-150K/images

Additional Preparation for NeVA Model

The following instructions are specific to the NeVA model within the NeMo Multimodal Language Models.

Setting Up LLaMA-2 Chat Checkpoints

Support is available for both the 7B and 13B chat models. Both can be downloaded from LLaVA’s Model Zoo. After downloading the desired HuggingFace checkpoint, extract and store it on your local system to prep for pretraining.

To convert the LLaMA-2 checkpoints to NeMo’s format, follow these steps:

  1. Adjust the default yaml file at megatron_llama_config.yaml. Ensure model.mcore_gpt and model.transformer_engine are set to False before the checkpoint conversion.

  2. For the 7B chat model, use this conversion command:

python /opt/NeMo/scripts/nlp_language_modeling/convert_hf_llama_to_nemo.py \
  --in-file <PATH-TO-HF-CHECKPOINT> \
  --out-file /path/to/neva/checkpoints/llama-2-7b-chat.nemo

For the 13B model, adjust the paths in the –in-file and –out-file parameters accordingly.

  1. Execute the subsequent command to divide the checkpoint for tensor model parallel sizes of 4 or 8. It’s advisable to use TP=4 for the 7B model and TP=8 for the 13B model to ensure both pretraining and finetuning operate without memory complications.

# Instructions for the 7B model partitioning provided here.
# Adjust parameters for the 13B model as needed.
python /opt/NeMo/examples/nlp/language_modeling/megatron_change_num_partitions.py \
  --model_file=/path/to/neva/checkpoints/llama-2-7b-chat.nemo  \
  --target_file=/path/to/neva/checkpoints/llama-2-7b-chat-tp4.nemo \
  --tensor_model_parallel_size=1 \
  --target_tensor_model_parallel_size=4 \
  --pipeline_model_parallel_size=1 \
  --target_pipeline_model_parallel_size=1 \
  --tp_conversion_only \
  --model_class="nemo.collections.nlp.models.language_modeling.megatron_gpt_model.MegatronGPTModel" \
  --tokenizer_model_path=<PATH-TO-HF-CHECKPOINT>/tokenizer.model

Tokenizer Configuration

For NeVA training, integrating special tokens into the tokenizer is vital. After obtaining the 7B/13B model from Huggingface, also procure the corresponding tokenizer model. Referring to the 7B-chat model:

  1. Download the tokenizer.model to:

/path/to/neva/tokenizers/tokenizer.model
  1. Executing the next script necessitates the NeMo dependency. It’s more convenient to run the script within the NeMo container.

  2. Employ the command below to infuse special tokens into the tokenizer:

cd /opt; git clone https://github.com/google/sentencepiece.git && \
  cd sentencepiece && \
  mkdir build && \
  cd build && \
  cmake .. && \
  make && \
  make install && \
  ldconfig
cd /opt/sentencepiece/src/; protoc --python_out=/opt/NeMo/scripts/tokenizers/ sentencepiece_model.proto
python /opt/NeMo/scripts/tokenizers/add_special_tokens_to_sentencepiece.py \
--input_file /path/to/neva/tokenizers/tokenizer.model \
--output_file /path/to/neva/tokenizers/tokenizer_neva.model \
--is_userdefined \
--tokens "<extra_id_0>" "<extra_id_1>" "<extra_id_2>" "<extra_id_3>" \
         "<extra_id_4>" "<extra_id_5>" "<extra_id_6>" "<extra_id_7>"