Data Preparation

Note

It is the responsibility of each user to check the content of the dataset, review the applicable licenses, and determine if it is suitable for their intended use. Users should review any applicable links associated with the dataset before placing the data on their machine.

Prepare Pretraining and Fine-tuning Datasets

The NeVA model training involves two phases: pretraining and fine-tuning. Each phase requires a distinct dataset.

For pretraining, use the LAION/CC/SBU BLIP-Caption Concept-balanced 558K dataset. Obtain this dataset from LLaVA’s official GitHub repository. After downloading, extract the dataset to:

${data_dir}/neva/datasets/LLaVA-Pretrain-LCS-558K/blip_laion_cc_sbu_558k.json

The image data can be downloaded from HuggingFace. Extract these images to:

${data_dir}/neva/datasets/LLaVA-Pretrain-LCS-558K/images
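
A scripted download might look like the following sketch. It assumes the dataset is hosted in the liuhaotian/LLaVA-Pretrain repository on HuggingFace and that the huggingface_hub CLI is installed; verify the repository name and file list before running.

    # Fetch the caption JSON and zipped images (repository id is an assumption).
    huggingface-cli download liuhaotian/LLaVA-Pretrain \
      blip_laion_cc_sbu_558k.json images.zip \
      --repo-type dataset \
      --local-dir ${data_dir}/neva/datasets/LLaVA-Pretrain-LCS-558K
    # Unpack the images into the expected directory.
    unzip ${data_dir}/neva/datasets/LLaVA-Pretrain-LCS-558K/images.zip \
      -d ${data_dir}/neva/datasets/LLaVA-Pretrain-LCS-558K/images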

For fine-tuning, use the LLaVA mixture instruction-tuning dataset. Obtain this dataset from LLaVA’s official GitHub Data Card. The prompts can be downloaded from HuggingFace. After downloading, extract the dataset to:

${data_dir}/neva/datasets/LLaVA-Instruct-mixture/llava_v1_5_mix665k.json
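
The prompt JSON can be fetched the same way; the liuhaotian/LLaVA-Instruct-150K repository id below is an assumption based on LLaVA’s Data Card, so confirm it before use.

    # Fetch the mixture prompts (repository id is an assumption).
    huggingface-cli download liuhaotian/LLaVA-Instruct-150K \
      llava_v1_5_mix665k.json \
      --repo-type dataset \
      --local-dir ${data_dir}/neva/datasets/LLaVA-Instruct-mixture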

The image data can be downloaded by following the instructions in LLaVA’s official GitHub repository. Download and organize the images inside:

${data_dir}/neva/datasets/LLaVA-Instruct-mixture/images
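
The mixture draws images from several source datasets (COCO, GQA, OCR-VQA, TextVQA, and Visual Genome). Following LLaVA’s instructions, the organized layout should resemble the sketch below; treat the exact folder names as assumptions to verify against the Data Card.

    images
    ├── coco
    │   └── train2017
    ├── gqa
    │   └── images
    ├── ocr_vqa
    │   └── images
    ├── textvqa
    │   └── train_images
    └── vg
        ├── VG_100K
        └── VG_100K_2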

Setting Up LLaMA-2 Chat or Vicuna-v1.5 Checkpoints

We support both LLaMA-2 chat and Vicuna-v1.5 models. Once you’ve downloaded the appropriate HuggingFace checkpoint, extract and save it to your disk to prepare for pretraining.
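
For example, one way to fetch the 7B chat checkpoint is with the HuggingFace CLI (the meta-llama repositories are gated, so request access and authenticate first):

    # Log in with a HuggingFace token that has access to the gated LLaMA-2 repos.
    huggingface-cli login
    huggingface-cli download meta-llama/Llama-2-7b-chat-hf \
      --local-dir ${data_dir}/neva/checkpoints/llama-2-7b-chat-hf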

Before initiating pretraining, convert the LLaMA-2 checkpoints to NeMo’s format:

  1. For the 7B chat model, execute the following conversion command:

    python /opt/NeMo/scripts/checkpoint_converters/convert_llama_hf_to_nemo.py \
      --input_name_or_path <PATH-TO-HF-CHECKPOINT> \
      --output_path ${data_dir}/neva/checkpoints/llama-2-7b-chat.nemo
    

    For other supported models, adjust the --input_name_or_path and --output_path arguments accordingly, as in the sketch below.
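
    For instance, a Vicuna-v1.5 7B conversion might look like this (the checkpoint path and output name are illustrative):

    python /opt/NeMo/scripts/checkpoint_converters/convert_llama_hf_to_nemo.py \
      --input_name_or_path <PATH-TO-VICUNA-7B-V1.5-HF-CHECKPOINT> \
      --output_path ${data_dir}/neva/checkpoints/vicuna-7b-v1.5.nemo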

Tokenizer Configuration

Special tokens must be incorporated into the tokenizer for NeVA training. After downloading language models from HuggingFace, ensure you also fetch the corresponding tokenizer model. Using the 7B-chat model as a reference:

  1. Download the tokenizer.model to:

    ${data_dir}/neva/tokenizers/tokenizer.model
    
  2. Within the NeMo container, use the commands below to build the sentencepiece tooling and integrate the special tokens into the tokenizer:

    cd /opt; git clone https://github.com/google/sentencepiece.git && \
      cd sentencepiece && \
      mkdir build && \
      cd build && \
      cmake .. && \
      make && \
      make install && \
      ldconfig
    cd /opt/sentencepiece/src/; protoc --python_out=/opt/NeMo/scripts/tokenizers/ sentencepiece_model.proto
    python /opt/NeMo/scripts/tokenizers/add_special_tokens_to_sentencepiece.py \
    --input_file ${data_dir}/neva/tokenizers/tokenizer.model \
    --output_file ${data_dir}/neva/tokenizers/tokenizer_neva.model \
    --is_userdefined \
    --tokens "<extra_id_0>" "<extra_id_1>" "<extra_id_2>" "<extra_id_3>" \
             "<extra_id_4>" "<extra_id_5>" "<extra_id_6>" "<extra_id_7>"
    

Convert LLaVA Checkpoints from HF Format to .nemo Format

If you only want to use NeMo for inference or additional tuning with trained checkpoints from the LLaVA repository, we provide a tool to convert LLaVA checkpoints to .nemo format as well.

  1. Download the checkpoint

    For example, download the original LLaVA checkpoint (such as llava-v1.5-7b) from HuggingFace.
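
    One way to fetch it is with the HuggingFace CLI; the liuhaotian/llava-v1.5-7b repository id is an assumption, so substitute the variant you need.

    # Repository id is an assumption; adjust for the model variant you want.
    huggingface-cli download liuhaotian/llava-v1.5-7b \
      --local-dir /path/to/llava-v1.5-7b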

  2. Update the tokenizer

    The tokenizer file, named tokenizer.model, is located inside the downloaded HF checkpoint. For NeVA training, it’s essential to incorporate special tokens into the tokenizer. After downloading the 7B or 13B model from HuggingFace, ensure you also obtain the corresponding tokenizer model. Use the following command within the NeMo container to integrate the special tokens (it assumes sentencepiece was built and installed as described in the Tokenizer Configuration section above):

    cd /opt/sentencepiece/src/
    protoc --python_out=/opt/NeMo/scripts/tokenizers/ sentencepiece_model.proto
    python /opt/NeMo/scripts/tokenizers/add_special_tokens_to_sentencepiece.py \
    --input_file /path/to/tokenizer.model \
    --output_file /path/to/tokenizer_neva.model \
    --is_userdefined \
    --tokens "<extra_id_0>" "<extra_id_1>" "<extra_id_2>" "<extra_id_3>" \
             "<extra_id_4>" "<extra_id_5>" "<extra_id_6>" "<extra_id_7>"
    
  3. Convert the checkpoint

    Clone the LLaVA source code inside the container from LLaVA’s GitHub repository, since we cannot share the code directly. There is no need to install LLaVA; the container already has the required environment. Simply add the cloned repository to your PYTHONPATH:

    export PYTHONPATH=$PYTHONPATH:/path/to/LLaVA
    

    Now, convert the checkpoint:

    python /opt/NeMo/examples/multimodal/multimodal_llm/neva/convert_hf_llava_to_neva.py \
    --in-file /path/to/llava-v1.5-7b \
    --out-file /path/to/llava-v1.5-7b.nemo \
    --tokenizer-model /path/to/tokenizer_neva.model \
    --conv-template v1
    

    The resulting /path/to/llava-v1.5-7b.nemo will be your converted .nemo checkpoint.
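
    As a quick sanity check, note that a .nemo file is a tar archive, so listing its contents should show the model weights and configuration:

    # List the archive contents; expect a config file and weight files inside.
    tar -tf /path/to/llava-v1.5-7b.nemo | head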