Data Preparation

Note

It is the responsibility of each user to check the content of the dataset, review the applicable licenses, and determine if it is suitable for their intended use. Users should review any applicable links associated with the dataset before placing the data on their machine.

Prepare Pretraining and Fine-tuning Datasets

The NeVA model training involves two phases: pretraining and fine-tuning. Each phase requires a distinct dataset.

For pretraining, use the LAION/CC/SBU BLIP-Caption Concept-balanced 558K dataset. Obtain this dataset from LLaVA’s official GitHub repository. After downloading, extract the dataset to:

${data_dir}/neva/datasets/LLaVA-Pretrain-LCS-558K/blip_laion_cc_sbu_558k.json

The image data can be downloaded from HuggingFace. Extract these images to:

${data_dir}/neva/datasets/LLaVA-Pretrain-LCS-558K/images
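
A scripted download might look like the following sketch. It assumes the dataset is hosted in the liuhaotian/LLaVA-Pretrain repository on HuggingFace and that the huggingface_hub CLI is installed; verify the repository name and file list before running.

    # Fetch the caption JSON and zipped images (repository id is an assumption).
    huggingface-cli download liuhaotian/LLaVA-Pretrain \
      blip_laion_cc_sbu_558k.json images.zip \
      --repo-type dataset \
      --local-dir ${data_dir}/neva/datasets/LLaVA-Pretrain-LCS-558K
    # Unpack the images into the expected directory.
    unzip ${data_dir}/neva/datasets/LLaVA-Pretrain-LCS-558K/images.zip \
      -d ${data_dir}/neva/datasets/LLaVA-Pretrain-LCS-558K/images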

For fine-tuning, use the LLaVA mixture instruction-tuning dataset. Obtain this dataset from LLaVA’s official GitHub Data Card. The prompts can be downloaded from HuggingFace. After downloading, extract the dataset to:

${data_dir}/neva/datasets/LLaVA-Instruct-mixture/llava_v1_5_mix665k.json
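
The prompt JSON can be fetched the same way; the liuhaotian/LLaVA-Instruct-150K repository id below is an assumption based on LLaVA’s Data Card, so confirm it before use.

    # Fetch the mixture prompts (repository id is an assumption).
    huggingface-cli download liuhaotian/LLaVA-Instruct-150K \
      llava_v1_5_mix665k.json \
      --repo-type dataset \
      --local-dir ${data_dir}/neva/datasets/LLaVA-Instruct-mixture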

The image data can be downloaded by following the instructions in LLaVA’s official GitHub repository. Download and organize the images inside:

${data_dir}/neva/datasets/LLaVA-Instruct-mixture/images
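
The mixture draws images from several source datasets (COCO, GQA, OCR-VQA, TextVQA, and Visual Genome). Following LLaVA’s instructions, the organized layout should resemble the sketch below; treat the exact folder names as assumptions to verify against the Data Card.

    images
    ├── coco
    │   └── train2017
    ├── gqa
    │   └── images
    ├── ocr_vqa
    │   └── images
    ├── textvqa
    │   └── train_images
    └── vg
        ├── VG_100K
        └── VG_100K_2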

Setting Up LLaMA-2 Chat or Vicuna-v1.5 Checkpoints

We support both LLaMA-2 chat and Vicuna-v1.5 models. Once you’ve downloaded the appropriate HuggingFace checkpoint, extract and save it to your disk to prepare for pretraining.
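
For example, one way to fetch the 7B chat checkpoint is with the HuggingFace CLI (the meta-llama repositories are gated, so request access and authenticate first):

    # Log in with a HuggingFace token that has access to the gated LLaMA-2 repos.
    huggingface-cli login
    huggingface-cli download meta-llama/Llama-2-7b-chat-hf \
      --local-dir ${data_dir}/neva/checkpoints/llama-2-7b-chat-hf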

Before initiating pretraining, convert the LLaMA-2 checkpoints to NeMo’s format:

  1. For the 7B chat model, execute the following conversion command:

    python /opt/NeMo/scripts/checkpoint_converters/convert_llama_hf_to_nemo.py \
      --input_name_or_path <PATH-TO-HF-CHECKPOINT> \
      --output_path ${data_dir}/neva/checkpoints/llama-2-7b-chat.nemo
    

    For other supported models, adjust the --input_name_or_path and --output_path arguments accordingly, as in the sketch below.
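
    For instance, a Vicuna-v1.5 7B conversion might look like this (the checkpoint path and output name are illustrative):

    python /opt/NeMo/scripts/checkpoint_converters/convert_llama_hf_to_nemo.py \
      --input_name_or_path <PATH-TO-VICUNA-7B-V1.5-HF-CHECKPOINT> \
      --output_path ${data_dir}/neva/checkpoints/vicuna-7b-v1.5.nemo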

Tokenizer Configuration

Special tokens must be incorporated into the tokenizer for NeVA training. After downloading language models from HuggingFace, ensure you also fetch the corresponding tokenizer model. Using the 7B-chat model as a reference:

  1. Download the tokenizer.model to:

    ${data_dir}/neva/tokenizers/tokenizer.model
    
  2. Within the NeMo container, use the commands below to build the sentencepiece tooling and integrate the special tokens into the tokenizer:

    cd /opt; git clone https://github.com/google/sentencepiece.git && \
      cd sentencepiece && \
      mkdir build && \
      cd build && \
      cmake .. && \
      make && \
      make install && \
      ldconfig
    cd /opt/sentencepiece/src/; protoc --python_out=/opt/NeMo/scripts/tokenizers/ sentencepiece_model.proto
    python /opt/NeMo/scripts/tokenizers/add_special_tokens_to_sentencepiece.py \
    --input_file ${data_dir}/neva/tokenizers/tokenizer.model \
    --output_file ${data_dir}/neva/tokenizers/tokenizer_neva.model \
    --is_userdefined \
    --tokens "<extra_id_0>" "<extra_id_1>" "<extra_id_2>" "<extra_id_3>" \
             "<extra_id_4>" "<extra_id_5>" "<extra_id_6>" "<extra_id_7>"
    

Convert LLaVA Checkpoints from HF Format to .nemo Format

If you only want to use NeMo for inference or additional tuning with trained checkpoints from the LLaVA repository, we provide a tool to convert LLaVA checkpoints to .nemo format as well.

  1. Download the checkpoint

    For example, download the original LLaVA checkpoint (such as llava-v1.5-7b) from HuggingFace.
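
    One way to fetch it is with the HuggingFace CLI; the liuhaotian/llava-v1.5-7b repository id is an assumption, so substitute the variant you need.

    # Repository id is an assumption; adjust for the model variant you want.
    huggingface-cli download liuhaotian/llava-v1.5-7b \
      --local-dir /path/to/llava-v1.5-7b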

  2. Update the tokenizer

    The tokenizer file, named tokenizer.model, is located inside the downloaded HF checkpoint. For NeVA training, it’s essential to incorporate special tokens into the tokenizer. After downloading the 7B or 13B model from HuggingFace, ensure you also obtain the corresponding tokenizer model. Use the following command within the NeMo container to integrate the special tokens (it assumes sentencepiece was built and installed as described in the Tokenizer Configuration section above):

    cd /opt/sentencepiece/src/
    protoc --python_out=/opt/NeMo/scripts/tokenizers/ sentencepiece_model.proto
    python /opt/NeMo/scripts/tokenizers/add_special_tokens_to_sentencepiece.py \
    --input_file /path/to/tokenizer.model \
    --output_file /path/to/tokenizer_neva.model \
    --is_userdefined \
    --tokens "<extra_id_0>" "<extra_id_1>" "<extra_id_2>" "<extra_id_3>" \
             "<extra_id_4>" "<extra_id_5>" "<extra_id_6>" "<extra_id_7>"
    
  3. Convert the checkpoint

    Clone the LLaVA source code inside the container from LLaVA’s GitHub repository, since we cannot share the code directly. There is no need to install LLaVA; the container already has the required environment. Simply add the cloned repository to your PYTHONPATH:

    export PYTHONPATH=$PYTHONPATH:/path/to/LLaVA
    

    Now, convert the checkpoint:

    python /opt/NeMo/examples/multimodal/multimodal_llm/neva/convert_hf_llava_to_neva.py \
    --in-file /path/to/llava-v1.5-7b \
    --out-file /path/to/llava-v1.5-7b.nemo \
    --tokenizer-model /path/to/tokenizer_neva.model \
    --conv-template v1
    

    The resulting /path/to/llava-v1.5-7b.nemo will be your converted .nemo checkpoint.
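
    As a quick sanity check, note that a .nemo file is a tar archive, so listing its contents should show the model weights and configuration:

    # List the archive contents; expect a config file and weight files inside.
    tar -tf /path/to/llava-v1.5-7b.nemo | head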