Data Preparation

Note

It is the responsibility of each user to check the content of the dataset, review the applicable licenses, and determine if it is suitable for their intended use. Users should review any applicable links associated with the dataset before placing the data on their machine.

The NeVA model training involves two phases: pretraining and fine-tuning. Each phase requires a distinct dataset.

For pretraining, use the LAION/CC/SBU BLIP-Caption Concept-balanced 558K dataset. Obtain this dataset from LLaVA’s official GitHub repository. After downloading, extract the dataset to:

${data_dir}/neva/datasets/LLaVA-Pretrain-LCS-558K/blip_laion_cc_sbu_558k.json

The image data can be downloaded from HuggingFace. Extract these images to:

${data_dir}/neva/datasets/LLaVA-Pretrain-LCS-558K/images

For fine-tuning, use the LLaVA-mixture-instruction-tuning-data dataset. Obtain this dataset from LLaVA’s official GitHub Data Card. The prompts can be downloaded from HuggingFace. After downloading, extract the dataset to:

${data_dir}/neva/datasets/LLaVA-Instruct-mixture/llava_v1_5_mix665k.json

The image data can be downloaded by following the instructions in LLaVA’s official GitHub repository. Download the images and organize them inside:

${data_dir}/neva/datasets/LLaVA-Instruct-mixture/images
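
Before moving on, it can help to confirm that the four locations listed above actually exist. The following is a minimal sketch, not part of NeMo: the helper name `missing_dataset_paths` and the `EXPECTED` list are illustrative, and the relative paths are copied from the text above.

```python
from pathlib import Path

# Relative locations taken from the paths listed above (under ${data_dir}/neva/datasets).
EXPECTED = [
    "LLaVA-Pretrain-LCS-558K/blip_laion_cc_sbu_558k.json",
    "LLaVA-Pretrain-LCS-558K/images",
    "LLaVA-Instruct-mixture/llava_v1_5_mix665k.json",
    "LLaVA-Instruct-mixture/images",
]


def missing_dataset_paths(data_dir):
    """Return the expected dataset paths that do not exist under data_dir."""
    root = Path(data_dir) / "neva" / "datasets"
    return [str(root / rel) for rel in EXPECTED if not (root / rel).exists()]


# Example usage (replace "/data" with your own data_dir):
# for path in missing_dataset_paths("/data"):
#     print("missing:", path)
```

An empty list means all four locations are present. The check does not validate the contents of the files or the image folders.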

Setting Up LLaMA-2 Chat or Vicuna-v1.5 Checkpoints

We offer support for both LLaMA-2 chat and Vicuna-v1.5 models. Once you’ve downloaded the appropriate HuggingFace checkpoint, you’ll need to extract and save it to your disk to prepare for pretraining.

Before initiating pretraining, convert the LLaMA-2 checkpoints to NeMo’s format:

  1. For the 7B chat model, execute the following conversion command:

    python /opt/NeMo/scripts/checkpoint_converters/convert_llama_hf_to_nemo.py \
      --input_name_or_path <PATH-TO-HF-CHECKPOINT> \
      --output_path ${data_dir}/neva/checkpoints/llama-2-7b-chat.nemo

    For other supported models, adjust the --input_name_or_path and --output_path values accordingly.
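
To illustrate how the two arguments change per model, here is a small sketch that assembles the conversion command for an arbitrary model name. The helper `build_convert_command` is hypothetical, the model name in the example is only a placeholder, and the output naming scheme simply mirrors the 7B example above. The script path and flags are the ones shown in this section.

```python
import shlex

# Converter script and flags as shown in the command above.
CONVERTER = "/opt/NeMo/scripts/checkpoint_converters/convert_llama_hf_to_nemo.py"


def build_convert_command(hf_checkpoint, data_dir, model_name):
    """Build the argument list for converting one HF checkpoint to .nemo.

    The output location follows the pattern of the 7B example; this is an
    assumption for illustration, and any writable path can be used instead.
    """
    output_path = f"{data_dir}/neva/checkpoints/{model_name}.nemo"
    return [
        "python", CONVERTER,
        "--input_name_or_path", hf_checkpoint,
        "--output_path", output_path,
    ]


# Placeholder values; substitute your own checkpoint path, data_dir, and model name.
cmd = build_convert_command("/path/to/hf-checkpoint", "/data", "llama-2-13b-chat")
print(shlex.join(cmd))
```

The printed line can be pasted into a shell inside the container, or the list can be passed to `subprocess.run(cmd, check=True)`.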

Tokenizer Configuration

Special tokens must be added to the tokenizer for NeVA training. After downloading a language model from HuggingFace, make sure you also fetch the corresponding tokenizer model. Using the 7B-chat model as a reference:

  1. Download the tokenizer.model to:

    ${data_dir}/neva/tokenizers/tokenizer.model

  2. Run the commands below inside the NeMo container to build sentencepiece and add the special tokens to the tokenizer:

    cd /opt; git clone https://github.com/google/sentencepiece.git && \
      cd sentencepiece && \
      mkdir build && \
      cd build && \
      cmake .. && \
      make && \
      make install && \
      ldconfig
    cd /opt/sentencepiece/src/; protoc --python_out=/opt/NeMo/scripts/tokenizers/ sentencepiece_model.proto
    python /opt/NeMo/scripts/tokenizers/add_special_tokens_to_sentencepiece.py \
      --input_file ${data_dir}/neva/tokenizers/tokenizer.model \
      --output_file ${data_dir}/neva/tokenizers/tokenizer_neva.model \
      --is_userdefined \
      --tokens "<extra_id_0>" "<extra_id_1>" "<extra_id_2>" "<extra_id_3>" \
               "<extra_id_4>" "<extra_id_5>" "<extra_id_6>" "<extra_id_7>"
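
Once tokenizer_neva.model has been written, one way to confirm that the eight special tokens were registered is to look them up in the new model. The sketch below is illustrative and not part of NeMo: the helper names are hypothetical, and it assumes the `sentencepiece` Python package is importable in the container. A token that resolves to the unknown-token ID is treated as missing.

```python
# The eight tokens passed to --tokens in the command above.
SPECIAL_TOKENS = [f"<extra_id_{i}>" for i in range(8)]


def missing_special_tokens(piece_to_id, unk_id, tokens=SPECIAL_TOKENS):
    """Return tokens that the lookup function maps to the unknown-token ID."""
    return [tok for tok in tokens if piece_to_id(tok) == unk_id]


def check_tokenizer(model_path):
    """Load a SentencePiece model and report which special tokens are absent.

    Assumes the sentencepiece Python package is installed.
    """
    import sentencepiece as spm

    sp = spm.SentencePieceProcessor(model_file=model_path)
    return missing_special_tokens(sp.piece_to_id, sp.unk_id())


# Example usage (path is a placeholder):
# print(check_tokenizer("/data/neva/tokenizers/tokenizer_neva.model"))
```

An empty list indicates that every special token has its own ID in the updated tokenizer.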

If you only want to use NeMo for inference or additional tuning with trained checkpoints from the LLaVA repository, we also provide a tool to convert LLaVA checkpoints to .nemo format.

  1. Download the checkpoint

    For example, download the original LLaVA checkpoint from the LLaVA model page on HuggingFace.

  2. Update the tokenizer

    The tokenizer file, named tokenizer.model, is located inside the downloaded HuggingFace checkpoint (7B or 13B). NeVA training requires special tokens to be added to this tokenizer. Run the following commands inside the NeMo container to add them. The commands expect the sentencepiece sources under /opt/sentencepiece, as set up in Tokenizer Configuration above:

    cd /opt/sentencepiece/src/
    protoc --python_out=/opt/NeMo/scripts/tokenizers/ sentencepiece_model.proto
    python /opt/NeMo/scripts/tokenizers/add_special_tokens_to_sentencepiece.py \
      --input_file /path/to/tokenizer.model \
      --output_file /path/to/tokenizer_neva.model \
      --is_userdefined \
      --tokens "<extra_id_0>" "<extra_id_1>" "<extra_id_2>" "<extra_id_3>" \
               "<extra_id_4>" "<extra_id_5>" "<extra_id_6>" "<extra_id_7>"

  3. Convert the checkpoint

    Clone the LLaVA source code from LLaVA’s GitHub repository inside the container, since we cannot share the code directly. There is no need to install LLaVA after cloning, because the container already has the required environment. Add the cloned repository to PYTHONPATH, replacing the path below with the location of your clone:

    export PYTHONPATH=$PYTHONPATH:/path/to/LLaVA

    Now, convert the checkpoint:

    python /opt/NeMo/examples/multimodal/mllm/neva/convert_hf_llava_to_neva.py \
      --in-file /path/to/llava-v1.5-7b \
      --out-file /path/to/llava-v1.5-7b.nemo \
      --tokenizer-model /path/to/tokenizer_neva.model

    The resulting /path/to/llava-v1.5-7b.nemo will be your converted .nemo checkpoint.
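
As a quick sanity check on the converted file, the archive can be opened and its members listed. This is a hedged sketch that assumes the .nemo checkpoint is a tar archive (plain or compressed), which is how NeMo checkpoints are commonly packaged. The helper name `list_nemo_contents` is illustrative, and the sketch makes no claim about which member names should appear.

```python
import tarfile


def list_nemo_contents(nemo_path):
    """Return the member names of a .nemo file, assuming it is a tar archive.

    Mode "r:*" lets tarfile detect compression automatically. A
    tarfile.ReadError here means the file is not a readable tar archive.
    """
    with tarfile.open(nemo_path, "r:*") as archive:
        return archive.getnames()


# Example usage (path is a placeholder):
# for name in list_nemo_contents("/path/to/llava-v1.5-7b.nemo"):
#     print(name)
```

If the listing succeeds and is non-empty, the file was at least written as a complete archive. It does not verify that the weights inside are correct.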

© Copyright 2023-2024, NVIDIA. Last updated on May 17, 2024.