Important
NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Please refer to the NeMo 2.0 overview for information on getting started.
LITA
LITA (Language Instructed Temporal-Localization Assistant) adds temporal localization information to multimodal Large Language Models (LLMs) so that these models can answer "When?" questions about a video. LITA extracts time tokens and slow-fast tokens from the output of the visual encoder and projection layer and feeds these tokens to the LLM. To implement this in NeMo, we need to add the time tokens to the tokenizer and preprocess the fine-tuning dataset to incorporate these time tokens.
Configure Tokenizer
Similar to NeVA, you must add special tokens as well as time tokens to the tokenizer for LITA training. Suppose we want to add 100 time tokens and the tokenizer is a SentencePiece model:
cd /opt/sentencepiece/src/; protoc --python_out=/opt/NeMo/scripts/tokenizers/ sentencepiece_model.proto
python /opt/NeMo/scripts/tokenizers/add_special_tokens_to_sentencepiece.py \
--input_file ${workspace}/tokenizers/tokenizer.model \
--output_file ${workspace}/neva/tokenizers/tokenizer_lita.model \
--is_userdefined \
--tokens "<extra_id_4>" "<extra_id_5>" "<extra_id_8>" "<extra_id_9>" \
"<t0>" "<t1>" "<t2>" "<t3>" "<t4>" "<t5>" "<t6>" "<t7>" "<t8>" "<t9>" "<t10>" \
"<t11>" "<t12>" "<t13>" "<t14>" "<t15>" "<t16>" "<t17>" "<t18>" "<t19>" "<t20>" \
"<t21>" "<t22>" "<t23>" "<t24>" "<t25>" "<t26>" "<t27>" "<t28>" "<t29>" "<t30>" \
"<t31>" "<t32>" "<t33>" "<t34>" "<t35>" "<t36>" "<t37>" "<t38>" "<t39>" "<t40>" \
"<t41>" "<t42>" "<t43>" "<t44>" "<t45>" "<t46>" "<t47>" "<t48>" "<t49>" "<t50>" \
"<t51>" "<t52>" "<t53>" "<t54>" "<t55>" "<t56>" "<t57>" "<t58>" "<t59>" "<t60>" \
"<t61>" "<t62>" "<t63>" "<t64>" "<t65>" "<t66>" "<t67>" "<t68>" "<t69>" "<t70>" \
"<t71>" "<t72>" "<t73>" "<t74>" "<t75>" "<t76>" "<t77>" "<t78>" "<t79>" "<t80>" \
"<t81>" "<t82>" "<t83>" "<t84>" "<t85>" "<t86>" "<t87>" "<t88>" "<t89>" "<t90>" \
"<t91>" "<t92>" "<t93>" "<t94>" "<t95>" "<t96>" "<t97>" "<t98>" "<t99>" \
"<extra_id_0>" "<extra_id_1>" "<extra_id_2>" "<extra_id_3>" "<extra_id_6>" "<extra_id_7>"
Preprocess Dataset
Assume your Dense Video Captioning dataset is in the following format:
{
    "video_name": {
        "duration": 125.0,
        "timestamps": [
            [0, 5.2],
            [3.5, 9.0]
        ],
        "sentences": [
            "Here is your caption 1",
            "Here is your caption 2"
        ],
        "events": [
            "Event 1",
            "Event 2"
        ]
    }
}
Each example has a timestamps list containing the start and end time of each caption or event in the video. We need to preprocess the dataset so that time tokens represent the start and end time of each caption or event, following the NeVA training data format. Here is how a start and end time are converted to time tokens in LITA:
import numpy as np

TIME_TOKEN_TEMPLATE = "<t{t}>"

def time_to_string(time, num_time_tokens):
    # `time` is expected to be normalized to [0, 1] (i.e. divided by the video duration).
    max_offset = float(num_time_tokens - 1)
    time = int(np.round(max_offset * time))
    return TIME_TOKEN_TEMPLATE.format(t=time)

# Example: convert the interval from 10 seconds to 15 seconds into time tokens.
num_time_tokens = 100
start = 10.0      # start at 10 seconds
end = 15.0        # end at 15 seconds
duration = 200.0  # total video duration is 200 seconds

start = start / duration
end = end / duration
start_time_token_str = time_to_string(start, num_time_tokens)  # "<t5>"
end_time_token_str = time_to_string(end, num_time_tokens)      # "<t7>"
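Building on time_to_string, the sketch below converts one dense video captioning entry into the NeVA-style conversation format. This is not the full preprocessing script: the question template, id scheme, and file naming are illustrative assumptions, and in practice you may generate the questions with templates or an external LLM API.

# Minimal sketch: convert one dense-captioning entry (see format above)
# into NeVA-style training samples. Uses time_to_string defined above.
def convert_example(video_name, example, num_time_tokens=100):
    duration = example["duration"]
    samples = []
    for (start, end), sentence in zip(example["timestamps"], example["sentences"]):
        start_tok = time_to_string(start / duration, num_time_tokens)
        end_tok = time_to_string(end / duration, num_time_tokens)
        samples.append({
            "id": f"{video_name}_{len(samples)}",      # illustrative id scheme
            "video": f"{video_name}.mp4",              # illustrative file naming
            "conversations": [
                {"from": "human",
                 "value": f"<video>\nWhen is \"{sentence}\" depicted in the video? "
                          "Provide a response using only start and end timestamps."},
                {"from": "gpt", "value": f"{start_tok} {end_tok}"},
            ],
            "durations": duration,
        })
    return samples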
After preprocessing the dataset, one data sample in the dataset list may look like this:
{
    "id": "-4RXOT_UfpM_2",
    "video": "-4RXOT_UfpM_2.mp4",
    "conversations": [
        {"from": "human", "value": "<video>\nWhen is \"Apply concealer on the eyelids and blend with sponge\" depicted in the video? Provide a response using only start and end timestamps."},
        {"from": "gpt", "value": "<t4> <t18>"}
    ],
    "durations": 119.01901901901903
}
You may use templates or an external LLM API to generate the questions for the events and captions in each video. <t4> and <t18> are the text representations of the start and end time tokens for the current event. The durations field is used to recover the actual time from the time tokens at inference.
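Concretely, a predicted time token can be mapped back to seconds by inverting the conversion above. The helper name below is illustrative, not part of the NeMo API:

import re

def string_to_time(token_str, num_time_tokens, duration):
    # Inverse of time_to_string: recover seconds from a time token and the video duration.
    t = int(re.match(r"<t(\d+)>", token_str).group(1))
    return t / float(num_time_tokens - 1) * duration

start_sec = string_to_time("<t4>", 100, 119.01901901901903)   # ~4.81 s
end_sec = string_to_time("<t18>", 100, 119.01901901901903)    # ~21.64 s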
Finetuning
Finetuning is similar to NeVA. We provide two LITA config files, lita_config.yaml and vita_config.yaml. You can use the following command to finetune the model:
video_folder=/root/path/of/videos/
data_path=/data/path/of/finetune_train_dataset.json
llm_model_path=/pretrained/nemo/llm/path/llm.nemo
tokenizer_model_path=/path/to/tokenizer_lita.model
EXP_MANAGER_DIR=/workspace/finetune_lita # check this directory for experiment details
num_gpus=8
torchrun --nproc_per_node=${num_gpus} /opt/NeMo/examples/multimodal/multimodal_llm/neva/neva_finetune.py \
--config-path=/opt/NeMo/examples/multimodal/multimodal_llm/neva/conf/ \
--config-name=lita_config.yaml \
++cluster_type=BCP \
trainer.num_nodes=1 \
trainer.precision=bf16 \
trainer.devices=${num_gpus} \
trainer.max_steps=262 \
trainer.limit_val_batches=5 \
model.megatron_amp_O2=false \
model.mm_cfg.llm.freeze=false \
model.mm_cfg.vision_encoder.freeze=true \
model.mm_cfg.vision_encoder.from_pretrained=/huggingface/pretrained/vision/model/path \
model.global_batch_size=128 \
model.micro_batch_size=1 \
model.tensor_model_parallel_size=4 \
model.pipeline_model_parallel_size=1 \
model.restore_from_path=${llm_model_path} \
model.tokenizer.model=${tokenizer_model_path} \
model.context_parallel_size=1 \
model.data.video_folder=${video_folder} \
model.data.data_path=${data_path} \
model.data.num_frames=128 \
model.mm_cfg.use_lita=true \
model.mm_cfg.lita.lita_video_arch=temporal_all_resolution \
model.mm_cfg.lita.visual_token_format=im_vid_start_end \
model.mm_cfg.lita.sample_frames=4 \
model.mcore_gpt=true \
model.transformer_engine=true \
model.optim.sched.warmup_steps=8 \
exp_manager.create_checkpoint_callback=True \
exp_manager.create_wandb_logger=False \
exp_manager.wandb_logger_kwargs.project=neva_lita \
exp_manager.wandb_logger_kwargs.name=neva_lita_finetuning \
exp_manager.exp_dir=${EXP_MANAGER_DIR}
Inference
For inference, the only difference from NeVA is that you need to override media_type to video, since LITA takes videos as input.
For more complete and advanced usage of LITA, please refer to the LITA Tutorial.