Training Preparation

Important

It is the responsibility of each user to check the content of the dataset, review the applicable licenses, and determine if it is suitable for their intended use. Users should review any applicable links associated with the dataset before placing the data on their machine.

Prepare Pretraining and Fine-Tuning Datasets

The NeVA model training involves two phases: pretraining and fine-tuning. Each phase requires a distinct dataset.

  1. For pretraining, use the LAION/CC/SBU BLIP-Caption Concept-balanced 558K dataset. Obtain this dataset from LLaVA’s official GitHub repository. After downloading, extract the dataset to:

    ${data_dir}/neva/datasets/LLaVA-Pretrain-LCS-558K/blip_laion_cc_sbu_558k.json
    
  2. Download the image data from Hugging Face. Extract these images to:

    ${data_dir}/neva/datasets/LLaVA-Pretrain-LCS-558K/images
    

  3. For fine-tuning, use the LLaVA-mixture-instruction-tuning-data dataset. Obtain this dataset from LLaVA’s official GitHub Data Card. The prompts can be downloaded from Hugging Face. After downloading, extract the dataset to:

    ${data_dir}/neva/datasets/LLaVA-Instruct-mixture/llava_v1_5_mix665k.json
    
  4. Download the image data from LLaVA’s official GitHub repository. Once downloaded, organize the data as specified in the repository guidelines:

    ${data_dir}/neva/datasets/LLaVA-Instruct-mixture/images
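
Both annotation files follow LLaVA’s list-of-records JSON layout, where each record references an image path relative to the corresponding images directory. A minimal sanity-check sketch in Python (assuming the ${data_dir} layout above and the field names used in the published LLaVA files, such as "image"):

    import json
    import os

    # Assumes data_dir is exported as an environment variable, as in the shell commands above.
    data_dir = os.environ["data_dir"]
    root = f"{data_dir}/neva/datasets/LLaVA-Pretrain-LCS-558K"

    # Load the pretraining annotations and spot-check a few referenced images.
    with open(f"{root}/blip_laion_cc_sbu_558k.json") as f:
        records = json.load(f)
    print(f"{len(records)} pretraining records")

    for rec in records[:5]:
        path = os.path.join(root, "images", rec["image"])
        print(path, "exists" if os.path.exists(path) else "MISSING")

The same check applies to the fine-tuning annotations; swap in the LLaVA-Instruct-mixture paths.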
    

Prepare Foundation LLM Checkpoints

This section explains how to set up LLaMA-2 Chat, Vicuna-v1.5, LLaMA-3 Instruct, Mistral, and Mixtral checkpoints.

Set Up LLaMA-2 Chat and Vicuna-v1.5 Checkpoints

The NeMo Framework offers support for both LLaMA-2 chat and Vicuna-v1.5 models. Once you’ve downloaded the appropriate Hugging Face checkpoint, you’ll need to extract and save it to your disk to prepare for pretraining.

Before initiating pretraining, you need to convert the LLaMA-2 checkpoints to NeMo’s format.

  1. To convert the LLaMA-2 7B chat model, execute the following command:

    python /opt/NeMo/scripts/checkpoint_converters/convert_llama_hf_to_nemo.py \
      --input_name_or_path <PATH-TO-HF-CHECKPOINT> \
      --output_path ${data_dir}/neva/checkpoints/llama-2-7b-chat.nemo
    
  2. For other supported models, adjust the --input_name_or_path and --output_path accordingly.
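
If you are converting several checkpoints, the converter scripts share the same flags, so a small wrapper can drive them in one pass. A sketch, assuming hypothetical local directories for the extracted Hugging Face checkpoints:

    import os
    import subprocess

    # Assumes data_dir is exported as an environment variable, as in the commands above.
    data_dir = os.environ["data_dir"]

    # Hypothetical mapping of extracted HF checkpoint directories to output names;
    # adjust to wherever you downloaded each model.
    models = {
        "/workspace/hf/Llama-2-7b-chat-hf": "llama-2-7b-chat.nemo",
        "/workspace/hf/vicuna-7b-v1.5": "vicuna-7b-v1.5.nemo",
    }

    for hf_path, nemo_name in models.items():
        subprocess.run(
            [
                "python",
                "/opt/NeMo/scripts/checkpoint_converters/convert_llama_hf_to_nemo.py",
                "--input_name_or_path", hf_path,
                "--output_path", f"{data_dir}/neva/checkpoints/{nemo_name}",
            ],
            check=True,  # stop at the first failed conversion
        )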

Set Up LLaMA-3 Instruct Checkpoints

You can use the same script for LLaMA-2 to convert LLaMA-3 models. After downloading the appropriate Hugging Face checkpoint, extract and save it to your disk to prepare for pretraining.

  1. To convert the LLaMA-3 8B instruct model, run the following command:

    python /opt/NeMo/scripts/checkpoint_converters/convert_llama_hf_to_nemo.py \
      --input_name_or_path <PATH-TO-HF-CHECKPOINT> \
      --output_path ${data_dir}/neva/checkpoints/llama-3-8b-instruct.nemo
    

Set Up Mistral or Mixtral Checkpoints

The NeMo Framework offers support for both Mistral and Mixtral Instruct models. Once you’ve downloaded the appropriate Hugging Face checkpoint, extract and save it to your disk to prepare for pretraining.

  1. To convert the Mistral 7B Instruct model, run the following command:

    python /opt/NeMo/scripts/checkpoint_converters/convert_mistral_7b_hf_to_nemo.py \
      --input_name_or_path <PATH-TO-HF-CHECKPOINT> \
      --output_path ${data_dir}/neva/checkpoints/mistral-7b-instruct.nemo
    
  2. To convert the Mixtral 8x7B Instruct model, execute the following command:

    python /opt/NeMo/scripts/checkpoint_converters/convert_mixtral_hf_to_nemo.py \
      --input_name_or_path <PATH-TO-HF-CHECKPOINT> \
      --output_path ${data_dir}/neva/checkpoints/mixtral-8x7b-instruct.nemo
    

Prepare Tokenizer

Special tokens must be incorporated into the tokenizer for NeVA training with a LLaMA-2, Mistral-7B-Instruct-v0.1, Mixtral-8x7B-Instruct-v0.1, or Vicuna 1.5 foundation LLM. These special tokens serve as placeholders for media tokens and sometimes mark the start and end of a conversation. After downloading a language model from Hugging Face, you need to fetch the corresponding tokenizer model.

Note

For LLaMA-3 models, you can skip the following step.

The following procedure uses the 7B-chat model as a reference.

  1. Download the tokenizer.model to:

    ${data_dir}/neva/tokenizers/tokenizer.model
    
  2. To integrate special tokens into the tokenizer within the NeMo container, run the following command:

    cd /opt; git clone https://github.com/google/sentencepiece.git && \
      cd sentencepiece && \
      mkdir build && \
      cd build && \
      cmake .. && \
      make && \
      make install && \
      ldconfig
    cd /opt/sentencepiece/src/; protoc --python_out=/opt/NeMo/scripts/tokenizers/ sentencepiece_model.proto
    python /opt/NeMo/scripts/tokenizers/add_special_tokens_to_sentencepiece.py \
    --input_file ${data_dir}/neva/tokenizers/tokenizer.model \
    --output_file ${data_dir}/neva/tokenizers/tokenizer_neva.model \
    --is_userdefined \
    --tokens "<extra_id_0>" "<extra_id_1>" "<extra_id_2>" "<extra_id_3>" \
             "<extra_id_4>" "<extra_id_5>" "<extra_id_6>" "<extra_id_7>"
    

Convert LLaVA Checkpoints from HF Format to .nemo Format

For inference or additional tuning with trained checkpoints from the LLaVA repository, a tool is available that converts LLaVA 1.5 checkpoints into the .nemo format.

  1. Download the checkpoint.

    For example, download the original LLaVA 1.5 checkpoint from the Hugging Face LLaVA model.

  2. Update the tokenizer.

    The tokenizer file, named tokenizer.model, is located inside the downloaded HF checkpoint. For NeVA training, it’s essential to incorporate special tokens into the tokenizer. After downloading the 7B or 13B model from Hugging Face, you need to obtain the corresponding tokenizer model.

    To integrate the special tokens within the NeMo container, run the following command:

    cd /opt; git clone https://github.com/google/sentencepiece.git && \
      cd sentencepiece && \
      mkdir build && \
      cd build && \
      cmake .. && \
      make && \
      make install && \
      ldconfig
    cd /opt/sentencepiece/src/; protoc --python_out=/opt/NeMo/scripts/tokenizers/ sentencepiece_model.proto
    python /opt/NeMo/scripts/tokenizers/add_special_tokens_to_sentencepiece.py \
    --input_file /path/to/tokenizer.model \
    --output_file /path/to/tokenizer_neva.model \
    --is_userdefined \
    --tokens "<extra_id_0>" "<extra_id_1>" "<extra_id_2>" "<extra_id_3>" \
             "<extra_id_4>" "<extra_id_5>" "<extra_id_6>" "<extra_id_7>"
    
  3. Convert Checkpoints.

    New in NeMo Framework 24.07: Support has been added for direct checkpoint conversion to and from Hugging Face. This conversion works directly with Hugging Face checkpoints and no longer requires the original LLaVA repository.

    # Convert Hugging Face LLaVA to NeMo
    python3 /opt/NeMo/scripts/checkpoint_converters/convert_llava_hf_to_nemo.py \
        --input_name_or_path llava-hf/llava-1.5-7b-hf \
        --output_path /path/to/llava-7b.nemo \
        --tokenizer_path /path/to/tokenizer_neva.model
    
    # Convert NeMo LLaVA back to Hugging Face
    python3 /opt/NeMo/scripts/checkpoint_converters/convert_llava_nemo_to_hf.py \
        --input_name_or_path /path/to/llava-v1.5-7b.nemo \
        --hf_input_path llava-hf/llava-1.5-7b-hf \
        --hf_output_path=/path/to/hf_updated_checkpoint
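
    To verify an exported Hugging Face checkpoint, reload it with transformers. A sketch, assuming a transformers release with LLaVA support and the output path used above:

        import torch
        from transformers import LlavaForConditionalGeneration

        # Reload the exported checkpoint to confirm it is a loadable LLaVA model.
        model = LlavaForConditionalGeneration.from_pretrained(
            "/path/to/hf_updated_checkpoint", torch_dtype=torch.float16
        )
        n_params = sum(p.numel() for p in model.parameters())
        print(f"loaded {n_params / 1e9:.1f}B parameters")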
    
    Note

    The following conversion method will be deprecated soon.

    To access the LLaVA source code, clone it directly into the container from LLaVA’s GitHub repository. Direct sharing of the code is not permitted. Once cloned, no installation is necessary, as the container comes pre-configured with the required environment. Simply add the cloned code to the Python path:

    export PYTHONPATH=$PYTHONPATH:/path/to/LLaVA
    

    Now, convert the checkpoint:

    python /opt/NeMo/examples/multimodal/multimodal_llm/neva/convert_llava_to_neva.py \
    --in-file /path/to/llava-v1.5-7b \
    --out-file /path/to/llava-v1.5-7b.nemo \
    --tokenizer-model /path/to/tokenizer_neva.model \
    --conv-template v1
    

    The resulting /path/to/llava-v1.5-7b.nemo will be your converted .nemo checkpoint.
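
    A .nemo checkpoint is a plain tar archive bundling the weights and model config. For a quick structural check without loading the model, you can list its contents (a sketch, using the output path above):

        import tarfile

        # .nemo checkpoints are tar archives; list the first few bundled files.
        with tarfile.open("/path/to/llava-v1.5-7b.nemo") as tar:
            for member in tar.getmembers()[:10]:
                print(member.name)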